Linear Algebra Operators for GPU Implementation of Numerical Algorithms
Figure 1: We present implementations of techniques for solving sets of algebraic equations on graphics hardware. In this way, numerical
simulation and rendering of real-world phenomena, like 2D water surfaces in the shown example, can be achieved at interactive rates.
Figure 3: The representation of a 2D matrix as a set of diagonal vectors, and finally as a set of 2D textures is shown.
texture coordinates and N. Finally, the destination index ((i + j) mod N) is calculated and converted to 2D texture coordinates.

After the first diagonal has been processed as described, the current render target is simultaneously used as a texture and as a render target. Thus, fragments in consecutive passes have access to the intermediate result, and they can update this result in each iteration. After rendering the last diagonal, the result vector is already in place, and the current render target can be used to internally represent this vector.

A considerable speed-up is achieved by specifying multiple adjacent diagonals as multi-textures, and by processing these diagonals at once in every pass. The parameters i and N only have to be issued once to the shader program. A particular fragment has the same index j in all diagonals, so this index only has to be computed once. Starting with the first destination index, the index is successively incremented by one for consecutive diagonals. The number of diagonals that can be processed simultaneously depends on the number of available texture units.
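For illustration, the following CPU sketch (our addition; the diag layout and all names are assumptions, not part of the described system) mirrors the scheme: diagonal i pairs its j-th element with the vector element at index ((i + j) mod N), and the accumulation performed by the render target across passes becomes a running sum.

    #include <vector>

    // CPU reference for the diagonal-based matrix-vector product sketched
    // above. diag[i][j] is assumed to hold the j-th element of the i-th
    // (cyclically wrapped) diagonal of an N x N matrix; on the GPU each
    // diagonal is one texture and the loop body runs as a fragment program.
    std::vector<float> matVecDiagonals(
        const std::vector<std::vector<float>>& diag,
        const std::vector<float>& x, int N)
    {
        std::vector<float> result(N, 0.0f);              // the render target
        for (int i = 0; i < N; ++i)                      // one pass per diagonal
            for (int j = 0; j < N; ++j)                  // one fragment per element
                result[j] += diag[i][j] * x[(i + j) % N];// index (i + j) mod N
        return result;
    }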
Let us finally mention that, with regard to the described implementation of matrix-vector products, there is no particular need to organize matrices into sets of diagonal vectors. For instance, dense matrices might be represented as sets of column vectors, giving rise to an even more efficient matrix-vector multiplication. Every column just has to be multiplied with the respective vector element, resulting in a much smaller memory footprint, yet requiring a simple shader program to which only the index of the currently processed column is input.
float clVecReduce(CLenum cmb, clVec x, clVec y);
The enumerator cmb can be one of CL_ADD, CL_MULT, CL_MAX, CL_MIN, CL_ABS. If the second parameter y is not equal to NULL, the combiner operation is carried out on the product x times y rather than only on x.

The reduce operation combines the vector entries in multiple rendering passes by recursively combining the result of the previous pass. Starting with the initial 2D texture containing one vector and the quadrilateral lined up with the texture in screen space, in each step a quadrilateral scaled by a factor of 0.5 is rendered. In the shader program, each fragment that is covered by the shrunken quadrilateral combines the texel that is mapped to it and the three adjacent texels in positive (u,v) texture space direction. The distance between texels in texture space is issued as a constant to the shader program. The result is written into a new texture, which is a factor of two smaller in each dimension than the previous one. The entire process is illustrated in Figure 4. This technique is a standard approach to combining vector elements on parallel computer architectures, which in our scenario is used to keep the memory footprint for each fragment as low as possible. For a diagonal vector that is represented by a square texture with resolution $2^n$, n rendering passes have to be performed until the result value val is obtained in one single pixel. The respective pixel value is finally returned to the application program.

Figure 4: [Illustration of the multi-pass reduce: original texture, 1st pass, 2nd pass, ...]
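As a CPU illustration of this multi-pass scheme (our sketch; the names are ours, and on the GPU each pass is one rendered quadrilateral), the following code halves a $2^n \times 2^n$ array per pass, each output cell combining a 2x2 block, until a single value remains.

    #include <vector>
    #include <functional>

    // CPU sketch of the multi-pass reduce: each pass halves the square
    // resolution, combining each texel with its three neighbors in positive
    // (u,v) direction. 'combine' plays the role of the cmb enumerator
    // (e.g. addition for CL_ADD). Illustrative only.
    float reduce2D(std::vector<float> tex, int n, // resolution is 2^n x 2^n
                   const std::function<float(float, float)>& combine) {
        int res = 1 << n;
        for (int pass = 0; pass < n; ++pass) {        // n passes in total
            int half = res / 2;
            std::vector<float> next(half * half);
            for (int v = 0; v < half; ++v)
                for (int u = 0; u < half; ++u) {
                    float a = tex[(2 * v) * res + 2 * u];
                    float b = tex[(2 * v) * res + 2 * u + 1];
                    float c = tex[(2 * v + 1) * res + 2 * u];
                    float d = tex[(2 * v + 1) * res + 2 * u + 1];
                    next[v * half + u] = combine(combine(a, b), combine(c, d));
                }
            tex = std::move(next);                    // render-to-texture ping-pong
            res = half;
        }
        return tex[0];                                // single remaining pixel
    }

Called as reduce2D(data, n, std::plus<float>()), this corresponds to clVecReduce with CL_ADD and y equal to NULL.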
4 Sparse Matrices

So far, the operations we have encountered execute very efficiently on current commodity graphics hardware. On the other hand, they are not suitable to process matrices as they typically arise in numerical simulations. For instance, let us assume that the solution to the 2D wave equation

$$\frac{\partial^2 u}{\partial t^2} = c^2 \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right)$$

on a grid of resolution 512 x 512 has to be computed numerically (including boundary points). If first and higher order partial derivatives are approximated by finite differences, the partial differential equation writes as a set of finite difference equations, one for each grid point (i, j):
$$\frac{u_{ij}^{t+1} - 2u_{ij}^{t} + u_{ij}^{t-1}}{\Delta t^2} = c^2 \left( \frac{u_{i+1,j}^{t} + u_{i-1,j}^{t} + u_{i,j+1}^{t} + u_{i,j-1}^{t} - 4u_{ij}^{t}}{\Delta x\,\Delta y} \right)$$
Using the implicit Crank-Nicholson scheme, where the average of the right-hand side is taken, i.e. for all grid points we set $u_{i,j}^{t} = 0.5\,(u_{i,j}^{t+1} + u_{i,j}^{t})$, the difference equation contains more than one unknown and the system of algebraic equations has to be solved simultaneously.
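To spell out the step that turns this into a coupled system: substituting the averaged right-hand side and collecting all terms at time level $t+1$ on the left gives (our rearrangement, with $\kappa = c^2 \Delta t^2 / (2\,\Delta x\,\Delta y)$):

$$u^{t+1}_{ij} - \kappa\left(u^{t+1}_{i+1,j} + u^{t+1}_{i-1,j} + u^{t+1}_{i,j+1} + u^{t+1}_{i,j-1} - 4u^{t+1}_{ij}\right) = 2u^{t}_{ij} - u^{t-1}_{ij} + \kappa\left(u^{t}_{i+1,j} + u^{t}_{i-1,j} + u^{t}_{i,j+1} + u^{t}_{i,j-1} - 4u^{t}_{ij}\right)$$

Each equation thus couples the unknown $u^{t+1}_{ij}$ with its four direct neighbors, which is what gives A its banded structure.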
If initial and boundary conditions are specified, the set of equations can be written in matrix form as Ax = b, where A is a $512^2 \times 512^2$ matrix, and both b and the solution vector x are of length $512^2$. Here, x contains the unknowns $u_{ij}^{t+1}$ to be solved for. In the particular example, A is a banded matrix (a triangular matrix with fringes) with a maximal bandwidth of six. Obviously, storing the matrix as a full matrix is quite inefficient both in terms of memory requirements and numerical operations. In order to effectively represent and process general sparse N x N matrices, in which only O(N) entries are supposed to be active, an alternative representation on the GPU needs to be developed.
4.1 Banded Matrices

With regard to the internal representation of matrices as a set of diagonal vectors, we can effectively exploit the fact that a banded matrix has only a few non-zero off-diagonals. Zero off-diagonals are simply removed from the internal representation, and off-diagonals that do not have a counterpart on either side of the main diagonal are padded with zero entries.

In the above example, where the 2D wave equation has been discretized by means of finite differences, only six diagonals have to be stored internally. As a consequence, the product of this matrix times a vector costs no more than six vector-vector products.

In the general setting, however, where non-zero entries are positioned randomly in the matrix, the diagonal layout of vectors does not allow for the exploitation of the sparseness in a straightforward way.
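A CPU sketch of the resulting banded product (our illustration, reusing the hypothetical diagonal layout from above): only the stored non-zero diagonals are iterated, so the cost is one vector-vector product per stored diagonal.

    #include <vector>

    // Banded matrix-vector product: only the few non-zero (wrapped) diagonals
    // are stored. offsets[k] is the diagonal index i of the k-th stored
    // diagonal; removed zero diagonals never appear, and padded entries are
    // simply zero. Illustrative sketch, not the paper's data structures.
    std::vector<float> matVecBanded(const std::vector<std::vector<float>>& diag,
                                    const std::vector<int>& offsets,
                                    const std::vector<float>& x, int N) {
        std::vector<float> result(N, 0.0f);
        for (size_t k = 0; k < diag.size(); ++k)      // e.g. six diagonals
            for (int j = 0; j < N; ++j)               // one vector-vector product
                result[j] += diag[k][j] * x[(offsets[k] + j) % N];
        return result;
    }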
4.2 Sparse Random Matrices

To overcome this problem, we use vertices to render the matrix values at the correct position. For each non-zero entry in a column vector one vertex is generated. The coordinate of each vertex is chosen in such a way that it renders at exactly the same position as the respective vector element would if it was rendered via the 2D texture used to represent the vector. For each column we thus store as many vertices as there are non-zero entries. Matrix values are stored as colors associated with the respective vertices.

Vertices and corresponding colors are stored in a vertex array on the GPU. As long as the matrix is not going to be modified, the internal representation does not have to be changed. Note that in case of a banded matrix, where apart from start and end conditions for each NxN block in the matrix the same band is present in every row, it is sufficient to store one representative set of vertices for inner grid points. This set can then be rendered using the appropriate offset with respect to the current column. Most effectively, the offset is specified in a vertex shader program, by means of which each vertex computes the exact 2D position in screen space.

Figure 5: This image illustrates the computation of a sparse-matrix-vector product based on the internal representation of matrix columns as sets of vertices. [A sparse matrix is stored as n vertex arrays carrying color, position and texture coordinates; in each of the n iterations one array is rendered, and the vector is bound as a texture.]
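In CPU terms, the vertex-array layout corresponds to a column-wise list of (row, value) pairs, where the value plays the role of the vertex color and the row determines the render position; how the vector itself is addressed is described next. A minimal sketch (identifiers are ours):

    #include <vector>

    // One "vertex" per non-zero entry: 'row' encodes the render position,
    // 'value' is stored as the vertex color. Illustrative CPU mirror of the
    // vertex-array representation; not the paper's actual data structures.
    struct Entry { int row; float value; };

    // columns[j] lists the non-zero entries of column j.
    std::vector<float> matVecSparse(const std::vector<std::vector<Entry>>& columns,
                                    const std::vector<float>& x) {
        std::vector<float> result(x.size(), 0.0f);
        for (size_t j = 0; j < columns.size(); ++j)   // render column j's vertices
            for (const Entry& e : columns[j])         // one vertex per non-zero
                result[e.row] += e.value * x[j];      // color times vector entry
        return result;
    }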
For the multiplication of a matrix times a vector, the color of each vertex has to be multiplied with the color of the corresponding entry in the vector. The vector, however, is not static and can thus not be coded into the vertex array. As a matter of fact, we associate with each vertex a texture coordinate, which is used to access the vector via the 2D textures used to represent it. Fortunately, these texture coordinates can also be stored on the GPU, so that only the appropriate textures have to be bound during matrix-vector computations (see Figure 5).

It is a nice feature of the described scheme that the realization of matrix-vector operations on the GPU as proposed in Chapter 3 is not affected by the graphical primitives we use to internally represent and render matrices. The difference between sparse and full matrices just manifests in that we render every diagonal or column vector as a set of vertices instead of a set of 2D textures. In this way, a significant amount of texture memory, rasterization operations and texture fetch operations can be saved in techniques where sparse matrices are involved. For instance, to compute the product between the sparse matrix described above and a vector, only $512^2 \times 6$ textured vertices have to be rendered.

5 Examples

We will now exemplify the implementation of general techniques of numerical computing by means of the proposed basic operations for matrix-vector and vector-vector arithmetic.

5.1 Conjugate Gradient Method

The conjugate gradient (CG) method is an iterative matrix algorithm for solving large, sparse linear systems of equations Ax = b, where $A \in \mathbb{R}^{n \times n}$. Such systems arise in many practical applications, such as computational fluid dynamics or mechanical engineering. The method proceeds by generating vector sequences of iterates (i.e. successive approximations to the solution), residuals r corresponding to the iterates, and search directions used in updating the iterates and residuals. The CG algorithm remedies the shortcomings of steepest descent by forcing the search directions $p^{(i)}$ to be A-conjugate, that is $p^{(i)\,T} A\, p^{(j)} = 0$ for $i \neq j$, and the residuals to be orthogonal. Particularly in numerical simulation techniques, where large but sparse finite difference equations have to be solved, the CG algorithm is a powerful and widely used technique.

In the following, pseudo code for the unpreconditioned version of the CG algorithm is given. Lower and upper indices indicate the values of scalar and vector variables, respectively, in the specified iteration. For a good introduction to the CG method as well as to other solution methods for linear systems of equations we refer to [Press et al. 2002].
Unpreconditioned CG

 1   p(0) = r(0) = b − A x(0)   (for some initial guess x(0))
 2   for i ← 0 to #itr
 3       ρi = r(i)T r(i)
 4       q(i) = A p(i)
 5       αi = ρi / (p(i)T q(i))
 6       x(i+1) = x(i) + αi p(i)
 7       r(i+1) = r(i) − αi q(i)
 8       βi = r(i+1)T r(i+1) / ρi
 9       p(i+1) = r(i+1) + βi p(i)
10       convergence check
The CG method can effectively be stated in terms of the described building blocks for the GPU implementation of numerical techniques. (Note that using a preconditioner matrix, for instance the diagonal part of A stored in the first diagonal vector of our internal representation, only involves solving one more linear system in each iteration.)

Unpreconditioned GPU-based CG

 1   clMatVec(CL_SUB, A, x(0), b, r(0))   (initial guess x(0))
 2   clVecOp(CL_ADD, −1, 0, r(0), NULL, r(0))
 3   clVecOp(CL_ADD, 1, 0, r(0), NULL, p(0))
 4   for i ← 0 to #itr
 5       ρi = clVecReduce(CL_ADD, r(i), r(i))
 6       clMatVec(CL_ADD, A, p(i), NULL, q(i))
 7       αi = clVecReduce(CL_ADD, p(i), q(i))
 8       αi = ρi / αi
 9       clVecOp(CL_ADD, 1, αi, x(i), p(i), x(i+1))
10       clVecOp(CL_SUB, 1, αi, r(i), q(i), r(i+1))
11       βi = clVecReduce(CL_ADD, r(i+1), r(i+1))
12       βi = βi / ρi
13       clVecOp(CL_ADD, 1, βi, r(i+1), p(i), p(i+1))
14       convergence check

In the GPU implementation, the application program only needs to read single pixel values from the GPU, thus minimizing bus transfer. All necessary numerical computations can be directly performed on the GPU. Moreover, the final result is already in place and can be rendered as a 2D texture map.
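To make the calling conventions concrete, here is a CPU mock-up of the operators with the semantics we infer from the listing above: clVecOp(op, s1, s2, x, y, res) computes res = s1·x op s2·y elementwise, and clVecReduce with two vectors reduces their product. This is our reading of the listing, not a published specification, and the matrix product is abstracted behind a function type.

    #include <vector>
    #include <numeric>

    using Vec = std::vector<float>;
    using MatVecFn = Vec (*)(const Vec&);  // stand-in for clMatVec's matrix

    // res = s1*x (op) s2*y; with y == nullptr the second term is dropped.
    // Semantics inferred from the GPU-based CG listing, not a formal spec.
    enum Op { CL_ADD, CL_SUB };
    Vec clVecOpRef(Op op, float s1, float s2, const Vec& x, const Vec* y) {
        Vec res(x.size());
        for (size_t k = 0; k < x.size(); ++k) {
            float b = y ? s2 * (*y)[k] : 0.0f;
            res[k] = (op == CL_ADD) ? s1 * x[k] + b : s1 * x[k] - b;
        }
        return res;
    }

    // Sum-reduce of the elementwise product x*y (clVecReduce with CL_ADD).
    float clVecReduceRef(const Vec& x, const Vec& y) {
        return std::inner_product(x.begin(), x.end(), y.begin(), 0.0f);
    }

    // One CG iteration, mirroring lines 5-13 of the listing above
    // (A(p) stands in for the clMatVec call on line 6).
    void cgStep(MatVecFn A, Vec& x, Vec& r, Vec& p) {
        float rho = clVecReduceRef(r, r);                  // line 5
        Vec q = A(p);                                      // line 6: q = A p
        float alpha = rho / clVecReduceRef(p, q);          // lines 7-8
        x = clVecOpRef(CL_ADD, 1.0f, alpha, x, &p);        // line 9
        r = clVecOpRef(CL_SUB, 1.0f, alpha, r, &q);        // line 10
        float beta = clVecReduceRef(r, r) / rho;           // lines 11-12
        p = clVecOpRef(CL_ADD, 1.0f, beta, r, &p);         // line 13
    }

On the GPU, each of these calls corresponds to one or more rendering passes over the textures holding the vectors, and only the scalars ρ, α and β are read back.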
ture. A considerable speed-up was achieved by internally storing
5.2 Gauss-Seidel Solver vectors and matrices as RGBA textures. Sets of 4 consecutive en-
tries from the same vector are stored in one RGBA texel. Thus, up
Next, let us describe the GPU implementation of a Gauss-Seidel to four times as many entries can be processed simultaneously. We
solver. Denoting with L and U the strict lower and upper triangular should note here, that operations on vectors and matrices built upon
sub-matrices, and with D the main diagonal of the matrix A, we can this particular internal format perform in exactly the same way as
rewrite A as L + D + U. In one iteration, the Gauss-Seidel method outlined. Just at the very end of the computation need the vector
essentially solves for the following matrix-vector equation: elements stored in separate color components to be rearranged for
rendering purposes. We can easily realize this task by means of a
x(i) = Lx(i) + (D +U)x(i−1) simple shader program that for each pixel in the result image fetches
the respective color component.
where x(k) is the solution vector at the k-th iteration. On average, the multiplication of two vectors of length 5122 took
r(i) = (D +U)x(i−1) can be derived from the previous time step 0.2 ms. Performance dropped down to 0.72 ms and 2.8 ms for
by one matrix-vector product. To compute Lx(i) , however, updates vectors of length 10242 and 20482 , respectively. Multiplication of
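A CPU transcription of this in-place column sweep (our sketch, reusing the Entry layout from the sparse-matrix sketch in Section 4.2; columnsL holds the columns of the strict lower triangle L):

    #include <vector>

    struct Entry { int row; float value; };  // as in the sparse sketch above

    // In-place Gauss-Seidel sweep: x starts as r = (D + U) x_prev, then the
    // columns of L are applied in order, each one reading the already
    // updated j-th entry of x. Illustrative CPU mirror of the render passes.
    void gaussSeidelSweep(const std::vector<std::vector<Entry>>& columnsL,
                          const std::vector<float>& r, std::vector<float>& x) {
        x = r;                                        // copy r into the render target
        for (size_t j = 0; j < columnsL.size(); ++j)  // column-wise sweep
            for (const Entry& e : columnsL[j])        // render column j of L
                x[e.row] += e.value * x[j];           // uses the freshest x[j]
    }

Because L is strictly lower triangular, all contributions to x[j] arrive from columns with smaller index, so x[j] is final by the time column j reads it; this is exactly the in-place dependency the render-target/texture aliasing exploits.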
6 Discussion and Performance Evaluation

To verify the effectiveness of the proposed framework for the implementation of numerical simulation techniques we have implemented two meaningful examples on the graphics processor. All our experiments were run under Windows XP on a P4 2.8 GHz processor equipped with an ATI 9800 graphics card.

With regard to the realization of methods of numerical computing on graphics hardware, limited accuracy in both the internal texture stages and the shader stages is certainly the Achilles' heel of any approach. In many cases, numerical methods have to be performed in double precision to allow for accurate and stable computations. As a matter of fact, our current target architecture does not provide sufficient accuracy in general. Other graphics cards, on the other hand, like NVIDIA's GeForceFX, already provide full IEEE floating point precision in both the shader and texture stages. Thus, it will be of particular interest to evaluate this GPU as well as other near-future graphics architectures with regard to computation accuracy.

Let us now investigate the performance of our approach as well as the differences to CPU implementations of some of the described basic operations. In our experiments the resolution of vectors and matrices was chosen such as to avoid paging between texture memory and main memory and between main memory and disk. All our operations were run on vectors and matrices of size $512^2$ to $2048^2$. We have also not considered the constant time to initially load textures from main memory to texture memory. The reason is that we predominantly focus on iterative techniques, where a large number of iterations have to be performed until convergence. Supposedly, in these particular applications the time required to set up the hardware is insignificant compared to the time required to perform the computations. During all iterations the data resides on the GPU and has neither to be reloaded from main memory nor duplicated on the CPU. In other scenarios, e.g. if frequent updates of a matrix happen, this assumption may not be justifiable anymore. In this case, the time needed to transfer data between different units also has to be considered.

On vectors and full matrices the implementation of standard arithmetic operations, i.e. vector-vector arithmetic and matrix-vector multiplication, was about 12-15 times faster compared to an optimized software implementation on the same target architecture. A considerable speed-up was achieved by internally storing vectors and matrices as RGBA textures. Sets of 4 consecutive entries from the same vector are stored in one RGBA texel. Thus, up to four times as many entries can be processed simultaneously. We should note here that operations on vectors and matrices built upon this particular internal format perform in exactly the same way as outlined. Only at the very end of the computation do the vector elements stored in separate color components need to be rearranged for rendering purposes. We can easily realize this task by means of a simple shader program that for each pixel in the result image fetches the respective color component.
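As an illustration of this packing (our sketch; RGBA4 is a hypothetical struct name), four consecutive vector entries share one texel, so any elementwise operation touches only a quarter of the texels:

    #include <vector>

    struct RGBA4 { float r, g, b, a; };  // hypothetical: one texel, 4 entries

    // Pack a vector of length 4*m into m RGBA texels.
    std::vector<RGBA4> pack(const std::vector<float>& v) {
        std::vector<RGBA4> t(v.size() / 4);
        for (size_t i = 0; i < t.size(); ++i)
            t[i] = { v[4 * i], v[4 * i + 1], v[4 * i + 2], v[4 * i + 3] };
        return t;
    }

    // Elementwise multiply on the packed layout: one "fragment" now
    // processes four vector entries at once.
    std::vector<RGBA4> mul(const std::vector<RGBA4>& x,
                           const std::vector<RGBA4>& y) {
        std::vector<RGBA4> res(x.size());
        for (size_t i = 0; i < x.size(); ++i)
            res[i] = { x[i].r * y[i].r, x[i].g * y[i].g,
                       x[i].b * y[i].b, x[i].a * y[i].a };
        return res;
    }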
On average, the multiplication of two vectors of length $512^2$ took 0.2 ms. Performance dropped down to 0.72 ms and 2.8 ms for vectors of length $1024^2$ and $2048^2$, respectively. Multiplication of a $4096^2$ full matrix times a vector was carried out in roughly 0.23 seconds. In contrast, the multiplication of a sparse banded matrix of the same size, which was composed of 10 non-zero diagonals, took 0.72 ms.

Obviously, only one vector element can be stored in a single RGBA texel if numerical operations on vector-valued data have to be performed. On the other hand, in this case the performance of software implementations also drops due to enlarged memory footprints. Our current software implementation is highly optimized with regard to the exploitation of cache coherence on the CPU. In practical applications, a less effective internal representation might be used, so that we rather expect a relative improvement of the GPU-based solution. In this respect it is important to note that on the GPU, too, access to higher precision textures slows down performance by a factor of 1.5-2.

The least efficient operation compared to its software counterpart is the reduce operation. It is only about a factor of three faster even though we store four elements in one RGBA texel. For instance, reducing a $1024^2$ vector takes about 1.6 ms on the GPU. For a vector of length $2048^2$, this time is 5.4 ms. The relative loss in performance is due to the fact that the pixel shader program used for this kind of operation is a lot more complex than the one used for vector-vector multiplication. On the other hand, even a performance gain of a factor of three seems to be worth an implementation on the GPU.

In the following, we present two examples that demonstrate the efficient solution of finite difference equations on the GPU. In both examples, a 1024 x 1024 computational grid was employed, and matrices were represented as sets of diagonal vectors.

In the first example, a solution to the 2D wave equation was computed based on the implicit Crank-Nicholson scheme as described (see Figures 6 and 7 for results). Compared to explicit schemes, the implicit approach allows us to considerably increase the step size in time. To solve the system of equations we employed the GPU implementation of the conjugate gradient solver. The banded structure of the matrix was exploited by reducing the number of diagonal vectors to be rendered in one matrix-vector multiplication. The computation of one matrix-vector product ($1024^2 \times 1024^2$ sparse matrix times $1024^2$ vector) took roughly 4.54 ms. Overall, one iteration of the conjugate gradient solver finished in 15.4 ms. By performing only a limited number of iterations, five in the current example, interactive simulation at 13 fps could be achieved.
In our second example, we describe a GPU implementation of a numerical solution to the incompressible Navier-Stokes equations (NSE) in 2D:

$$\frac{\partial u}{\partial t} = \frac{1}{Re}\,\nabla^2 u - V \cdot \nabla u + f_x - \nabla p \qquad (1)$$

$$\frac{\partial v}{\partial t} = \frac{1}{Re}\,\nabla^2 v - V \cdot \nabla v + f_y - \nabla p \qquad (2)$$

Here, u and v correspond to the components of the velocity V in the x and y direction, respectively. Re is the Reynolds number, p is the pressure, and via $f_x$ and $f_y$ external forces can be specified.

We first discretize the partial derivatives of u and v, resulting in an explicit scheme to compute new velocities at time t + 1 from values at time t:

$$u^{t+1} = G^t + \Delta t\,\frac{\partial p^{t+1}}{\partial x} \qquad (3)$$

$$v^{t+1} = F^t + \Delta t\,\frac{\partial p^{t+1}}{\partial y} \qquad (4)$$

with

$$G^t = u^t + \Delta t \left( \frac{1}{Re}\,\nabla^2 u - V \cdot \nabla u + f_x \right) \qquad (5)$$

$$F^t = v^t + \Delta t \left( \frac{1}{Re}\,\nabla^2 v - V \cdot \nabla v + f_y \right) \qquad (6)$$
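A CPU sketch of evaluating equation 5 at one grid point (our illustration; the grid indexing, the nearest-neighbor backtrace and the interior-only assumption are simplifications of the semi-Lagrangian scheme of [Stam 1999], not the paper's shader code):

    #include <vector>
    #include <cmath>

    // Evaluate G^t (equation 5) at interior grid point (i, j) on an n x n
    // grid; u, v are velocity components stored row-major, h is the grid
    // spacing. Simplified sketch: no bounds handling, no bilinear sampling.
    float evalG(const std::vector<float>& u, const std::vector<float>& v,
                int i, int j, int n, float h, float dt, float Re, float fx) {
        auto at = [&](const std::vector<float>& f, int a, int b) {
            return f[b * n + a];
        };
        // Diffusion: 5-point central-difference laplacian of u.
        float lap = (at(u, i + 1, j) + at(u, i - 1, j) + at(u, i, j + 1) +
                     at(u, i, j - 1) - 4.0f * at(u, i, j)) / (h * h);
        // Advection: trace the velocity field backward in time and
        // re-sample u there (nearest grid point here, for brevity).
        int ib = i - int(std::lround(dt * at(u, i, j) / h));
        int jb = j - int(std::lround(dt * at(v, i, j) / h));
        float advected = at(u, ib, jb);
        // Backtraced value stands in for u^t - dt*(V . grad u); add the
        // diffusion and force terms to obtain G^t.
        return advected + dt * (lap / Re + fx);
    }

On the GPU, this per-grid-point evaluation is exactly what runs in the pixel shader over the 2D texture holding the velocity field.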
Given current values for u and v at every grid point, $G^t$ and $F^t$ can be directly evaluated. In the current implementation, we employ an explicit scheme to resolve for $G^t$ and $F^t$. The diffusion operator is discretized by means of central differences, and, as proposed in [Stam 1999], we solve for the advection part by tracing the velocity field backward in time. Note that these operations are carried out on a 2D grid represented by a 2D texture. The involved computations are performed at each grid point in a pixel shader program. Finally, in order to compute updated velocities at time t + 1, we have to solve for the pressure at time t + 1. From the continuity equation for incompressible media (div(V) = 0), we obtain the following Poisson equation for the pressure p:

$$\frac{\partial^2 p^{t+1}}{\partial x^2} + \frac{\partial^2 p^{t+1}}{\partial y^2} = \frac{1}{\Delta t} \left( \frac{\partial F^t}{\partial x} + \frac{\partial G^t}{\partial y} \right) \qquad (7)$$

The partial derivatives of F and G are computed at each grid point, represented as a 2D texture, by means of forward differences. Finally, the right hand side of equation 7 is input to the GPU implementation of the CG solver. Because vectors are internally represented as 2D textures, the data does not have to be converted and can directly be used to feed the CG solver. Equipped with appropriate boundary conditions, the CG solver iteratively computes a solution for p at time t + 1, which can be directly passed back to the explicit scheme to compute new velocity values by means of equations 3 and 4.
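The hand-off to the solver then amounts to filling one grid-sized vector with the forward-difference divergence; a sketch under the same simplifying assumptions (buildPressureRHS is our placeholder name):

    #include <vector>

    // Right-hand side of equation 7 via forward differences: at grid point
    // (i, j), (1/dt) * ((F[i+1,j] - F[i,j]) / h + (G[i,j+1] - G[i,j]) / h).
    // The result is laid out like the 2D textures used for vectors, so it
    // can be fed to the CG solver without conversion. Illustrative sketch;
    // boundary rows/columns are left to the boundary conditions.
    std::vector<float> buildPressureRHS(const std::vector<float>& F,
                                        const std::vector<float>& G,
                                        int n, float h, float dt) {
        std::vector<float> rhs(n * n, 0.0f);
        for (int j = 0; j + 1 < n; ++j)
            for (int i = 0; i + 1 < n; ++i) {
                float dFdx = (F[j * n + (i + 1)] - F[j * n + i]) / h;
                float dGdy = (G[(j + 1) * n + i] - G[j * n + i]) / h;
                rhs[j * n + i] = (dFdx + dGdy) / dt;
            }
        return rhs;
    }

The returned array plays the role of b in the CG solver from Section 5.1.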
accuracy is a predominant goal. On the other hand, with regard
to the fact that full floating point pipelines are already available,
the implementation of numerical techniques on commodity graph-
ics hardware is worth an elaborate investigation. Particularly in en-
tertainment and virtual scenarios, where precision issues might be
of lesser dominant concern, such implementations can be used ef-
fectively for interactive physics based simulation.
7 Conclusion

In this work, we have described a general framework for the implementation of numerical simulation techniques on graphics hardware. For this purpose, we have developed efficient internal layouts for vectors and matrices. By considering matrices as a set of diagonal or column vectors and by representing vectors as 2D texture maps, matrix-vector and vector-vector operations could be accelerated considerably compared to software-based approaches.

Our emphasis was on providing the building blocks for the design of general techniques of numerical computing. This is in contrast to existing approaches, where dedicated, mainly explicit solution methods have been proposed. In this respect, for the simulation of particular phenomena some of these approaches might be superior to ours in terms of performance. On the other hand, our framework offers the flexibility to implement arbitrary explicit or implicit schemes, and it can thus be used in applications where larger step sizes and stability are of particular interest. Furthermore, because our internal matrix layout can benefit from the sparsity of columns quite effectively, we do not expect our method to be significantly slower compared to customized explicit schemes.

In order to demonstrate the effectiveness and the efficiency of our approach, we have described a GPU implementation of the conjugate gradient method to numerically solve the 2D wave equation and the incompressible Navier-Stokes equations. In both examples, implicit schemes were employed to allow for stable computations, yet providing interactive rates. Despite precision issues, we could achieve considerably better performance compared to our software realization. On the other hand, to allow for a fair comparison we should consider timing statistics of SSE-optimized software solutions, which are supposed to perform about a factor of 2 to 3 faster.

The lack of a contiguous floating point pipeline on our target architecture still prohibits its use in numerical applications where accuracy is a predominant goal. On the other hand, with regard to the fact that full floating point pipelines are already available, the implementation of numerical techniques on commodity graphics hardware is worth an elaborate investigation. Particularly in entertainment and virtual scenarios, where precision issues might be of lesser concern, such implementations can be used effectively for interactive physics-based simulation.

In the future, we will implement matrix-matrix operations based on the described internal layout, and we will investigate methods to efficiently update vectors and matrices that are stored in texture memory. In this way, linear algebra operations like LU decomposition or singular value decomposition can be implemented. In the long term, we aim at providing the functionality that is available in the BLAS library, thus allowing general linear algebra packages to be built upon GPU implementations.

Figure 7: A GPU-based tool to interact with water surfaces in real-time is shown. By means of the mouse, the user can simulate external forces that disturb the water surface. On a $1024^2$ grid the application runs at 13 fps.
8 Acknowledgements
We would like to thank ATI for providing the 9800 graphics card,
and in particular Mark Segal for providing information about the
technical details of this card.
References

Foster, N., and Fedkiw, R. 2001. Practical animation of liquids. Computer Graphics (SIGGRAPH 01 Proceedings), 23–30.

Foster, N., and Metaxas, D. 1996. Realistic animation of liquids. Graphical Models and Image Processing 58, 5, 471–483.

Harris, M., Coombe, G., Scheuermann, T., and Lastra, A. 2002. Physically-based visual simulation on graphics hardware. In Proceedings Eurographics/SIGGRAPH Workshop on Graphics Hardware 2002.

Hart, J. 2001. Perlin noise pixel shaders. In Proceedings Eurographics/SIGGRAPH Workshop on Graphics Hardware 2001.

Heidrich, W., Westermann, R., Seidel, H.-P., and Ertl, T. 1999. Applications of pixel textures in visualization and realistic image synthesis. In ACM Symposium on Interactive 3D Graphics, 110–119.

Hillesland, K., Molinov, S., and Grzeszczuk, R. 2003. Nonlinear optimization framework for image-based modelling on programmable graphics hardware. Computer Graphics (SIGGRAPH 03 Proceedings).

Hopf, M., and Ertl, T. 1999. Accelerating 3D convolution using graphics hardware. In Proceedings IEEE Visualization '99, 471–474.

Hopf, M., and Ertl, T. 2000. Hardware accelerated wavelet transformations. In Proceedings EG/IEEE TCVG Symposium on Visualization VisSym '00, 93–103.

Jobard, B., Erlebacher, G., and Hussaini, Y. 2000. Lagrangian-Eulerian advection of noise and dye textures for unsteady flow visualization. In Proceedings IEEE Visualization '00, 110–118.

Kass, M., and Miller, G. 1990. Rapid, stable fluid dynamics for computer graphics. Computer Graphics (SIGGRAPH 90 Proceedings), 49–57.

Larsen, E. S., and McAllister, D. 2001. Fast matrix multiplies using graphics hardware. In Proceedings Supercomputing 2001.

Lindholm, E., Kilgard, M., and Moreton, H. 2001. A user-programmable vertex engine. Computer Graphics (SIGGRAPH 01 Proceedings).

Microsoft, 2002. DirectX9 SDK. [Link]

Montrym, J., and Moreton, H. 2002. GeForce4. In Proceedings Eurographics/SIGGRAPH Workshop on Graphics Hardware 2002.

NVIDIA, 2002. nVIDIA OpenGL game of life. [Link] gameoflife.

NVIDIA, 2003. Sample effects on the nVIDIA graphics cards. [Link]

Olano, M., and Lastra, A. 1998. A shading-language on graphics hardware. Computer Graphics (SIGGRAPH 98 Proceedings), 159–168.

Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. 2002. Numerical Recipes in C++: The Art of Scientific Computing. Cambridge University Press.

Purcell, T., Buck, I., Mark, W., and Hanrahan, P. 2002. Ray tracing on programmable graphics hardware. Computer Graphics (SIGGRAPH 02 Proceedings), 703–712.

Stam, J. 1999. Stable fluids. Computer Graphics (SIGGRAPH 99 Proceedings), 121–128.

Strzodka, R., and Rumpf, M. 2001. Nonlinear diffusion in graphics hardware. In Proceedings EG/IEEE TCVG Symposium on Visualization 2001, 75–84.

Weiskopf, D., Hopf, M., and Ertl, T. 2002. Hardware-accelerated Lagrangian-Eulerian texture advection for 2D flow visualization. In Proceedings Workshop on Vision, Modeling, and Visualization VMV '02.