GPU - Linear Algebra Operators For GPU Implementation of Num

Uploaded by

Terry Lynn

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

37 views9 pages

GPU - Linear Algebra Operators For GPU Implementation of Num

Uploaded by

Terry Lynn

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Linear Algebra Operators

for GPU Implementation of Numerical Algorithms

Jens Krüger and Rüdiger Westermann
Computer Graphics and Visualization Group, Technical University Munich∗

Figure 1: We present implementations of techniques for solving sets of algebraic equations on graphics hardware. In this way, numerical
simulation and rendering of real-world phenomena, like 2D water surfaces in the shown example, can be achieved at interactive rates.

Abstract matics. These techniques have a variety of applications in physics

based simulation and modelling, and they have been frequently
In this work, the emphasis is on the development of strategies to employed in computer graphics to provide realistic simulation of
realize techniques of numerical computing on the graphics chip. In real-world phenomena [Kaas and Miller 1990; Chen and da Vito-
particular, the focus is on the acceleration of techniques for solving ria Lobo 1995; Foster and Metaxas 1996; Stam 1999; Foster and
sets of algebraic equations as they occur in numerical simulation. Fedkiw 2001; Fedkiw et al. 2001]. Despite their use in numerical
We introduce a framework for the implementation of linear alge- simulation, these techniques have also been applied in a variety of
bra operators on programmable graphics processors (GPUs), thus computer graphics settings, e.g. the simulation of watercolor draw-
providing the building blocks for the design of more complex nu- ings [Curtis et al. 1997], the processing of polygonal meshes [Des-
merical algorithms. In particular, we propose a stream model for brun et al. 1999], or the animation of deformable models [Baraff
arithmetic operations on vectors and matrices that exploits the in- and Witkin 1998; Debunne et al. 2001], to mention just a few.
trinsic parallelism and efficient communication on modern GPUs. The numerical complexity of techniques for solving sets of alge-
Besides performance gains due to improved numerical computa- braic equations often imposes limitations on the numerical accuracy
tions, graphics algorithms benefit from this model in that the trans- or extremely high demands on memory and computing resources.
fer of computation results to the graphics processor for display is As a consequence thereof, parallelization of numerical solvers on
avoided. We demonstrate the effectiveness of our approach by im- multi-processor architectures has been an active research area for
plementing direct solvers for sparse matrices, and by applying these quite a long time.
solvers to multi-dimensional finite difference equations, i.e. the 2D An alternative direction of research is leading towards the imple-
wave equation and the incompressible Navier-Stokes equations. mentation of general techniques of numerical computing on com-
CR Categories: I.6.7 [Simulation and Modeling]: Simu- puter graphics hardware. Driven by the evolution of graphics pro-
lation Support Systems—; I.3.7 [Computer Graphics]: Three- cessors from fixed function pipelines towards fully programmable,
Dimensional Graphics and Realism— floating point pipelines, additional effort is spent on the develop-
ment of numerical algorithms amenable to the intrinsic parallelism
Keywords: Numerical Simulation, Graphics Hardware and efficient communication on modern GPUs. Recent examples
include GPU implementations of matrix multiplications [Thomp-
1 Introduction son et al. 2002], multi-grid simulation techniques [Bolz et al.
2003] and numerical solution techniques to least squares problems
The development of numerical techniques for solving partial differ- [Hillesland et al. 2003]. Particularly in computer graphics appli-
ential equations is one of the traditional subjects in applied mathe- cations, the goal of such implementations of numerical techniques
is twofold: to speed up computation processes, as they are often
∗ [Link], westermann@[Link] at the core of the graphics application, and to avoid the transfer of
computation results from the CPU to the GPU for display.
Based on early prototypes of programmable graphics architec-
tures [Olano and Lastra 1998; Lindholm et al. 2001], the design
of graphics hardware as a pipeline consisting of highly optimized
stages providing fixed functionality is more and more abandoned
on modern graphics chips, e.g. the NVIDIA NV30 [Montrym and
Moreton 2002] or the ATI R300 [Elder 2002]. Today, the user
has access to parallel geometry and fragment units, which can be
programmed by means of standard APIs. In particular, vertex and
pixel shader programs enable direct access to the functional units,
and they allow for the design of new classes of hardware supported
graphics algorithms. As a representative example for such algo- can be implemented by executing GPU implementations of these
rithms, let us refer to [Purcell et al. 2002], where ray-tracing on operations in a particular order. One of our goals is to replace soft-
programmable fragment processors was described. ware implementations of basic linear algebra operators as available
In our current work, we employ the Pixel Shader 2.0 API [Mi- in widespread linear algebra libraries, i.e. the BLAS (Basic Linear
crosoft 2002], a specific set of instructions and capabilities in Algebra Subprogram) library [Dongarra et al. 1988; Dongarra et al.
DirectX9-level hardware, which allows us to perform hardware 1990], by GPU implementations, thus enabling more general linear
supported per-fragment operations. Besides basic arithmetic oper- algebra packages to be implemented on top of these implementa-
ations, instructions to store intermediate results in registers and to tions, i.e. the LAPACK (Linear Algebra Package) [Anderson et al.
address and access multiple texture units are available. Our target 1999].
architecture is the ATI Radeon 9800, which supports 32-bit floating In the remainder of this paper, we will first introduce the internal
point textures as well as a set of hardware supported pixel shader. representation for vectors and matrices on the graphics processor,
Pixel shader provide 24-bit precision internal computations. To and we will describe the syntax and the semantic of the vector and
save rendering results and to communicate these results to consec- matrix routines our approach is built upon. Similar to the notation
utive passes, rendering can be directed to a 32-bit offscreen buffer. used in the BLAS library, we outline specific operations on vectors
This buffer can be directly bound to a texture map, i.e. the content and matrices. Although we use a different syntax than BLAS, and
of the buffer is immediately available as 2D texture map. we also do not provide the many different operators contained in the
Until today, the benefits of graphics hardware have mainly been BLAS function definition, it should become obvious that by means
demonstrated in rendering applications. Efforts have been focused of our approach the same functionality can be achieved.
on the creation of static and dynamic imagery including polygonal We will also address sparse matrix representation and opera-
models and scalar or vector valued volumetric objects. In a few tions on such matrices as they typically occur in numerical sim-
examples, strategies to realize numerical computations on graphics ulation techniques. In this way, we achieve a significant speed-
processors were described, usually implemented by means of low- up compared to software approaches. Next, GPU implementations
level graphics APIs that did not yet provide programmable vertex of two methods for solving sparse linear systems as the occur in
and pixel shader. many numerical simulation techniques are described: the Conju-
Hopf et al. [Hopf and Ertl 1999; Hopf and Ertl 2000] de- gate Gradient (CG) method and the Gauss-Seidel solver. Finally,
scribed implementations of morphological operations and wavelet we demonstrate the effectiveness of our approach by solving the
transforms on the graphics processor. Numerical computations 2D wave equation and the incompressible Navier-Stokes equations
were realized by means of blending functionality, and by exploit- on graphics hardware, and by directly visualizing the results on the
ing the functionality provided by the OpenGL imaging subset. Us- ATI 9800.
ing similar coding mechanisms, the simulation of cellular automata
and stochastic fractals was demonstrated in [nVidia 2002; Hart
2001]. Strzodka and Rumpf [Strzodka and Rumpf 2001a; Strzodka 2 Matrix Representation on GPUs
and Rumpf 2001b] proposed first concepts for the implementation
of numerical solvers for partial differential equations on graphics In this section, we describe the internal representation of matrices
hardware. Therefore, the intrinsic communication and computa- on graphics hardware. The proposed representation enables the ef-
tion patterns of numerical solution techniques for finite difference ficient computation of basic algebraic operations used in numerical
equations were mapped to OpenGL functionality. Non-standard simulation techniques. The general idea is to store matrices as tex-
OpenGL extensions were employed in [Heidrich et al. 1999; Jo- ture maps and to exploit pixel shader programs to implement arith-
bard et al. 2000; Weiskopf et al. 2001] to interactively visualize 2D metic operations. For the sake of simplicity only column matrices,
vector fields. At the core of these techniques, vector valued data i.e. vectors and square NxN matrices are discussed. General matri-
is encoded into RGB texture maps, thus allowing for the tracing of ces, however, can be organized in exactly the same way.
particles by successive texture fetch and blend operations.
Matrix
With the availability of programmable fragment units, the possi- (0,1) (1,1)
2D-Product
(0) (1)
bility to implement general numerical techniques on graphics pro-
cessors was given. A number of examples have been demonstrated
since then, ranging from physics based simulation of natural phe-
nomena to real-time shading and lighting effects [nVidia 2003; ATI
2003]. Weiskopf et al. [Weiskopf et al. 2002] extended their previ-
ous work towards the interactive simulation of particle transport in (0,0) (1,0)
Vector
flow fields. Recently, Harris et al. [Harris et al. 2002] described the (0) (1)

implementation of an explicit scheme for the solution of a coupled

map lattice model on commodity graphics cards. In both exam- Figure 2: This illustration demonstrates the computation of a
ples, numerical computations were entirely carried out in a frag- matrix-vector product using 1D and 2D textures to represent vectors
ment shader program. GPU implementation of matrix-multiplies and matrices, respectively. The 1D texture is continued periodically
based on a particular distribution strategy for 2D textures across a across the rendered quadrilateral.
cube-shaped lattice was described in [Larsen and McAllister 2001].
In contrast to previous approaches, which were specifically de- On the graphics processor, vectors might be represented as 1D
signed with regard to the solution of particular problems, our goal texture maps. This kind of representation, however, has certain
is to develop a generic framework that enables the implementation drawbacks: First, limitations on the maximum size of 1D textures
of general numerical techniques for the solution of difference equa- considerably reduce the number of vector elements that can be
tions on graphics hardware. Therefore, we provide the basic build- stored in one texture map. Second, rendering 1D textures is signifi-
ing block routines that are used by standard numerical solvers. Built cantly slower than rendering square 2D textures with the same num-
upon a flexible and efficient internal representation, these functional ber of texture elements. On current graphics cards, the use of 2D
units perform arithmetic operations on vectors and matrices. In the textures yields a performance gain of about a factor of 2. Third, if
same way as linear algebra libraries employ encapsulated basic vec- a 1D vector contains the result of a numerical simulation on a com-
tor and matrix operations, many techniques of numerical computing putational grid, the data finally has to be rearranged in a 2D texture
for rendering purposes. Fourth, this representation prohibits the ef- are handled by that element. Texture handles and the size of each
ficient computation of matrix-vector products. Although by means texture can be accessed via public functions.
of multi-textures both the matrix and the vector can be rendered si-
multaneously and combined on a per-fragment basis (see Figure 2),
the computed values are not in place and have to be summed along 3.1 Vector Arithmetic
the rows of the matrix to obtain the result vector.
To circumvent the mentioned drawbacks, we represent matrices Arithmetic operations on two vectors can be realized in a simple
as set of diagonal vectors and we store vectors in 2D texture maps pixel shader program. The application issues both operands as
(see Figure 3). To every diagonal starting at the i-th entry of the first multi-textures, which are accessed and combined in the shader pro-
row in the upper right part of the matrix, its counterpart starting gram. On current graphics cards supporting the Pixel Shader 2.0
at the (N-i)-th entry of the first column in the lower left part of instruction set, arithmetic operations like addition, subtraction and
the matrix is appended. In this way, no texture memory is wasted. multiplication are available. The product of a scalar times a vector
Each vector is converted into a square 2D texture by the application is realized by issuing the scalar as a constant value in the shader
program. Vectors are padded with zero entries if they do not entirely program.
fill the texture. This representation has several advantages: The function header for implementing standard arithmetic oper-
ations on vector elements looks like this:
• Much larger vectors can be stored in one single texture ele-
ment. void clVecOp (
CL enum op,
• Arithmetic operations on vectors perform significantly faster float α , float β ,
because square 2D textures are rendered. clVec x, clVec y,
clVec res
• Vectors that represent data on a 2D grid can directly be ren- );
dered to visualize the data.
The enumerator op can be one of CL ADD, CL MULT or
• Matrix-vector multiplication can be mapped effectively to
CL SUB. The scalars α and β are multiplied with each element
vector-vector multiplication.
of x and y, respectively. At the beginning of each routine a consis-
• The result of a matrix-vector multiplication is already in place tency check is performed to verify that both vectors have the same
and does not need to be rearranged. internal representation. Then, the respective shader program is ac-
tivated and vectors x and y are issued as multi-textures. Finally, a
Most notable, however, the particular layout of matrices as set of square quadrilateral is rendered, which is lined up in screen space
diagonal vectors allows for the efficient representation and process- with the 2D textures used to represent the active vectors. The result
ing of banded diagonal matrices, as they typically occur in numeri- is kept as 2D texture to be used in consecutive computations.
cal simulation techniques. In a pre-process the application program
inspects every diagonal, discarding those diagonals that do not carry
any information. If no counterpart exists for one part of a diagonal, 3.2 Matrix-Vector Product
it is filled with zero entries.
In the following, we consider the product of a matrix times a vector.
A nice feature of this representation is, that the transpose of a
A second vector might be specified to allow for the computation of
matrix is generated by simply ordering the diagonals in a different
Ax op y, where A is a matrix, x and y are vectors, and op is one
way. Off-diagonals numbered i, which start at the i-th entry of the
of CL ADD, CL MULT, CL SUB. To compute Ax op y we first
first row, now become off-diagonals numbered N-i. Entries located
render the result of Ax into the render target. Now, the result is
in the former upper right part of the matrix are swapped with those
bound as an additional texture, and in a final rendering pass it is
entries located in the lower left part. Swapping does not have to be
combined with the vector y and rendered to the destination target.
performed explicitly, but it can be realized by shifting indices used
to access these elements. Each index has to be shifted by the num- The header of the function performing matrix-vector operations
ber of entries in the vector that come from the lower left part of the looks like this (if one of A or x, or y is NULL, only the respective
matrix. This can easily be accomplished in the pixel shader pro- component not equal to NULL considered in the operation):
gram, where the indexing during matrix-vector operations is per-
void clMatVec (
formed (see below).
CL enum op,
clMat A,
3 Basic Operations clVec x, clVec y,
clVec res
Now, we describe the implementation of basic algebraic operations );
on vectors and matrices based on the proposed internal representa-
tion. In each operation, rendering is directed to a specific render Because matrices are represented as a set of diagonal vectors,
target that can be directly bound to a 2D texture for further use. To matrix-vector multiplication becomes a multiplication of every di-
update values in a target that is not the current target anymore, it agonal with the vector. Therefore, N rendering passes are per-
is made the current render target again. Then, its content can be formed, in each of which one diagonal and the vector are issued
modified by consecutive rendering passes. as separate textures in the shader program. Then, corresponding
Vector and matrix containers are defined as classes clVec and entries in both textures are multiplied. However, to the j-th element
clMat, respectively. Both containers assemble C++ arrays in that of a diagonal that starts at the ith entry of the first row of the matrix
the array is decomposed into one or multiple vectors. Vectors com- corresponds the ((i+j) mod N)-th entry of the vector. This entry first
posed of zero entries neither have to be stored nor processed. has to be computed in the shader program before it can be used to
Upon initialization, for each vector a texture is created and bound access the vector.
to a texture handle. Internally, each class element stores the reso- Values i and N are simply issued as constant values in the shader
lution of the respective vector or matrix and of all the textures that program. Index j, however, is directly derived from the fragments
i
Matrix N Vectors

N 2D-Textures
1 i N
N N ... ... ... ...

N-i

1 i N

N
Figure 3: The representation of a 2D matrix as a set of diagonal vectors, and finally as a set of 2D textures is shown.

texture coordinates and N. Finally, the destination index ((i+j) mod equal to NULL, the combiner operation is carried out on the prod-
N) is calculated and converted to 2D texture coordinates. uct x times y rather than only on x.
After the first diagonal has been processed as described, the cur- The reduce operation combines the vector entries in multiple ren-
rent render target is simultaneously used as a texture and as a rendering passes by recursively combining the result of the previous
der target. Thus, fragments in consecutive passes have access to pass. Starting with the initial 2D texture containing one vector
the intermediate result, and they can update this result in each iter- and the quadrilateral lined up with the texture in screen space, in
ation. After rendering the last diagonal, the result vector is already each step a quadrilateral scaled by a factor of 0.5 is rendered. In
in place, and the current render target can be used to internally rep- the shader program, each fragment that is covered by the shrunken
resent this vector. quadrilateral combines the texel that is mapped to it and the three
A considerable speed-up is achieved by specifying multiple adja- adjacent texel in positive (u,v) texture space direction. The distance
cent diagonals as multi-textures, and by processing these diagonals between texels in texture space is issued as a constant in the shader
at once in every pass. Parameters i and N only have to be issued program. The result is written into a new texture, which is now of a
once in the shader program. A particular fragment has the same in- factor of two smaller in each dimension than the previous one. The
dex j in all diagonals, and as a matter of fact it only has to be com- entire process is illustrated in Figure 4. This technique is a standard
puted once. Starting with the first destination index, this index is approach to combine vector elements on parallel computer architec-
successively incremented about one for consecutive diagonals. The tures, which in our scenario is used to keep the memory footprint
number of diagonals that can be processed simultaneously depends for each fragment as low as possible. For a diagonal vector that
on the number of available texture units. is represented by a square texture with resolution 2n , n rendering
Let us finally mention, that with regard to the described imple- passes have to be performed until the result value val is obtained in
mentation of matrix-vector products, there is no particular need to one single pixel. The respective pixel value is finally returned to the
organize matrices into sets of diagonal vectors. For instance, dense application program.
matrices might be represented as sets of column vectors, giving rise
to even more efficient matrix-vector multiplication. Every column original Texture 1st pass 2nd pass
just has to be multiplied with the respective vector element, result-
ing in a much smaller memory footprint, yet requiring a simple ...
...
shader program to which only the index of the currently processed
...

column is input.
...

3.3 Vector Reduce

Quite often it is necessary to combine all entries of one vector into
one single element, e.g. computing a vector norm, finding the max-
imum/minimum element etc. Therefore, we provide a special oper-
ation that outputs the result of such a reduce operation to the appli- Figure 4: Schematic illustration of the reduce operation.
cation program:

float clVecReduce (
CL enum cmb, 4 Sparse Matrices
clVec x, clVec y,
); So far, the operations we have encountered execute very efficiently
on current commodity graphics hardware. On the other hand, they
The enumerator cmb can be one of CL ADD, CL MULT, are not suitable to process matrices as they typically arise in numer-
CL MAX, CL MIN, CL ABS. If the second parameter y is not ical simulations. For instance, let us assume that the solution to the
2D wave equation offset with respect to the current column. Most effectively, the off-
set is specified in a vertex shader program, by means of which each
∂ 2u ∂ 2u ∂ 2u
µ ¶
vertex compute the exact 2D position in screen space.
= c2 +
∂ t2 ∂ x2 ∂ y2
Sparse Matrix
Iterations
n
on a grid of resolution 512 x 512 has to be computed numerically vertex array 1
(including boundary points). If first and higher order partial deriva- vertex array 2 color, position, texture coordinate

...
tives are approximated by finite differences, the partial difference vertex array n-1
n vertex array n
equation writes as a set of finite difference equations for each grid
point (ij):
texture

ut+1 t−1
Ã t
t ui+1 j + uti−1 j + uti j+1 + uti j−1 − 4uti j
!
i j − 2ui j + ui j 2
bind buffer as texture

= c Vector

∆t 2 ∆x∆y
Figure 5: This image illustrates the computation of a sparse-
Using the implicit Crank-Nicholson scheme, where the aver-
matrix-vector product based on the internal representation of ma-
age of the right-hand side is taken, i.e. for all grid points we set
trix columns as sets of vertices.
uti, j = 0.5(ut+1 t
i, j + ui, j ), the difference equation contains more than
one unknown and the system of algebraic equations has to be solved For the multiplication of a matrix times a vector, the color of
simultaneously. each vertex has to be multiplied with the color of the corresponding
If initial and boundary conditions are specified, the set of equa- entry in the vector. The vector, however, is not static and can thus
tions can be written in matrix from as Ax = b, where A is a 5122 not be coded into the vertex array. As a matter of fact, we associate
x 5122 matrix, and both b and the solution vector x are of length with each vertex a texture coordinate, which is used to access the
5122 . Here, x contains the unknowns ut+1 i j to be solved for. In the vector via the 2D textures used to represent it. Fortunately, these
particular example, A is a banded matrix (a triangular matrix with texture coordinates can also be stored on the GPU, so that only the
fringes) with a maximal bandwidth of six. Obviously, storing the appropriate textures have to be bound during matrix-vector compu-
matrix as a full matrix is quite inefficient both in terms of mem- tations (see Figure 5)
ory requirements and numerical operations. In order to effectively It is a nice feature of the described scheme, that the realization of
represent and process general sparse N x N matrices, in which only matrix-vector operations on the GPU as it was proposed in Chap-
O(N) entries are supposed to be active, an alternative representation ter 3 is not affected by the graphical primitives we use to internally
on the GPU needs to be developed. represent and render matrices. The difference between sparse and
full matrices just manifests in that we render every diagonal or col-
4.1 Banded Matrices umn vector as a set of vertices instead of set of 2D textures. In this
way, a significant amount of texture memory, rasterization opera-
With regard to the internal representation of matrices as a set of tions and texture fetch operations can be saved in techniques where
diagonal vectors, we can effectively exploit the existence of a sparse matrices are involved. For instance, to compute the prod-
banded matrix with only a few non-zero off-diagonals. Zero off- uct between the sparse matrix described above and a vector only
diagonals are simply removed from the internal representation, and 5122 x6 textured vertices have to be rendered.
off-diagonals that do not have a counterpart on either side of the
main diagonal are padded with zero entries.
In the above example, where the 2D wave equation has been 5 Examples
discretized by means of finite differences, only six diagonals have
to be stored internally. As a consequence, the product of this matrix We will now exemplify the implementation of general techniques
times a vector costs no more than six vector-vector products. of numerical computing by means of the proposed basis operations
In the general setting, however, were non-zero entries are posi- for matrix-vector and vector-vector arithmetic.
tioned randomly in the matrix, the diagonal layout of vectors does
not allow for the exploitation of the sparseness in a straight forward
way. 5.1 Conjugate Gradient Method
The conjugate gradient (CG) method is an iterative matrix algo-
4.2 Sparse Random Matrices rithm for solving large, sparse linear system of equations Ax = b,
where A ∈ Rnxn . Such systems arise in many practical applications,
To overcome this problem, we use vertices to render the matrix
such as computational fluid dynamics or mechanical engineering.
values at the correct position. For each non-zero entry in a column
The method proceeds by generating vector sequences of iterates
vector one vertex is generated. The coordinate of each vertex is
(i.e. successive approximations to the solution), residuals r corre-
chosen in such a way, that it renders at exactly the same position as
sponding to the iterates, and search directions used in updating the
the respective vector element if it was rendered via the 2D texture
iterates and residuals. The CG algorithm remedies the shortcom-
used to represent the vector. For each column we thus store as many
vertices as there are non-zero entries. Matrix values are stored as ings of steepest descent by forcing the search directions p(i) to be
T
colors associated with the respective vertices. A-conjugate, that is p(i) Ap(i) = 0, and the residuals to be orthog-
Vertices and corresponding colors are stored in a vertex array on onal. Particularly in numerical simulation techniques, where large
the GPU. As long as the matrix is not going to be modified, the but sparse finite difference equations have to be solved, the CG-
internal representation does not have to be changed. Note that in algorithm is a powerful and widely used technique.
case of a banded matrix, where apart from start and end conditions In the following, pseudo code for the unpreconditioned version
for each NxN block in the matrix the same band is present in every of the CG algorithm is given. Lower and upper subscripts indicate
row, it is sufficient to store one representative set of vertices for in- the values of scalar and vector variables, respectively, in the speci-
ner grid points. Then, this set can be rendered using the appropriate fied iteration. For a good introduction to the CG method as well as
to other solution methods for linear system equations let us refer to of x(i) have to be done in place. Based on the representation of
[Press et al. 2002]. matrices as set of column vectors, we sweep through the matrix in a
column-wise order, using the result vector x(i) as the current render
target as well as a currently bound texture. Initially, the content of
Unpreconditioned CG r(i) is copied into x(i) . When the j-th column of L is rendered, each
1 p(0) = r(0) = b − Ax(0) for some initial guess x(0) element is multiplied with the j-th element in x(i) , and the result is
2 for i ← 0 to #itr
T added to x(i) . We thus always multiply every column with the most
3 ρi = r(i) r(i) recently updated value of x(i) .
4 q(i) = Ap(i)
T
5 αi = ρi /p(i) q(i)
6 x(i+1) = x(i) + αi p(i)
6 Discussion and Performance Evalua-
7 r(i+1) = r(i) − αi q(i) tion
T
8 βi = r(i+1) r(i+1) /ρi To verify the effectiveness of the proposed framework for the im-
9 p(i+1) = r(i+1) + βi p(i) plementation of numerical simulation techniques we have imple-
10 convergence check mented two meaningful examples on the graphics processor. All
our experiments were run under WindowsXP on a P4 2.8 GHz pro-
cessor equipped with an ATI 9800 graphics card.
With regard to the realization of methods of numerical comput-
The CG method can effectively be stated in terms of the de- ing on graphics hardware, limited accuracy in both the internal tex-
scribed building blocks for GPU implementation of numerical tech- ture stages and the shader stages is certainly the Achilles´ heel of
niques (Note that using a preconditioner matrix, for instance the any approach. In many cases, numerical methods have to be per-
diagonal part of A stored in the first diagonal vector in our inter- formed in double precision to allow for accurate and stable compu-
nal representation, only involves solving one more linear system in tations. As a matter of fact, our current target architecture does not
each iteration): provide sufficient accuracy in general. Other graphics cards, on the
other hand, like NVIDIAs GeForceFX, already provide full IEEE
floating point precision in both the shader and texture stages. Thus,
Unpreconditioned GPU-based CG it will be of particular interest to evaluate this GPU in particular
1 clMatVec(CL SUB, A, x(0) , b, r(0) ) initial guess x(0) as well as other near-future graphics architectures with regard to
2 clVecOp(CL ADD, −1, 0, r(0) , NULL, r(0) ) computation accuracy.
Let us now investigate the performance of our approach as well
3 clVecOp(CL ADD, 1, 0, r(0) , NULL, p(0) )
as the differences to CPU implementations of some of the described
4 for i ← 0 to #itr
basic operations. In our experiments the resolution of vectors and
5 ρi = clVecReduce(CL ADD, r(i) , r(i) ) matrices was chosen such as to avoid paging between texture mem-
6 clMatVec(CL ADD, A, p(i) , NULL, q(i) ) ory and main memory and between main memory and disk. All our
7 αi = clVecReduce(CL ADD, p(i) , q(i) ) operations were run on vectors and matrices of size 5122 to 20482 .
8 α i = ρi / α i We have also not considered the constant time to initially load tex-
9 clVecOp(CL ADD, 1, αi , x(i) , p(i) , x(i+1) ) tures from main memory to texture memory. The reason is, that we
10 clVecOp(CL SUB, 1, αi , r(i) , q(i) , r(i+1) ) predominantly focus on iterative techniques, where a large number
11 βi = clVecReduce(CL ADD, r(i+1) , r(i+1) ) of iterations have to be performed until convergence. Supposedly,
12 βi = βi / ρi in these particular applications the time required to setup the hard-
ware is insignificant compared to the time required to perform the
13 clVecOp(CL ADD, 1, βi , r(i+1) , p(i) , p(i+1) )
computations. During all iterations the data resides on the GPU and
14 convergence check
it has neither to be reloaded from main memory nor duplicated on
the CPU. In other scenarios, e.g. if frequent updates of a matrix
happen, this assumption may not be justifiable anymore. In this
case, also the time needed to transfer data between different units
In the GPU implementation, the application program only needs has to be considered.
to read single pixel values from the GPU thus minimizing bus trans- On vectors and full matrices the implementation of standard
fer. All necessary numerical computations can be directly per- arithmetic operations, i.e. vector-vector arithmetic and matrix-
formed on the GPU. Moreover, the final result is already in place vector multiplication, was about 12-15 times faster compared to
and can be rendered as a 2D texture map. an optimized software implementation on the same target architec-
ture. A considerable speed-up was achieved by internally storing
5.2 Gauss-Seidel Solver vectors and matrices as RGBA textures. Sets of 4 consecutive en-
tries from the same vector are stored in one RGBA texel. Thus, up
Next, let us describe the GPU implementation of a Gauss-Seidel to four times as many entries can be processed simultaneously. We
solver. Denoting with L and U the strict lower and upper triangular should note here, that operations on vectors and matrices built upon
sub-matrices, and with D the main diagonal of the matrix A, we can this particular internal format perform in exactly the same way as
rewrite A as L + D + U. In one iteration, the Gauss-Seidel method outlined. Just at the very end of the computation need the vector
essentially solves for the following matrix-vector equation: elements stored in separate color components to be rearranged for
rendering purposes. We can easily realize this task by means of a
x(i) = Lx(i) + (D +U)x(i−1) simple shader program that for each pixel in the result image fetches
the respective color component.
where x(k) is the solution vector at the k-th iteration. On average, the multiplication of two vectors of length 5122 took
r(i) = (D +U)x(i−1) can be derived from the previous time step 0.2 ms. Performance dropped down to 0.72 ms and 2.8 ms for
by one matrix-vector product. To compute Lx(i) , however, updates vectors of length 10242 and 20482 , respectively. Multiplication of
µ ¶
a 40962 full matrix times a vector was carried out in roughly 0.23 1 2
Ft = vt + 4t ∇ v −V · ∇v + fy (6)
seconds. In contrast, the multiplication of a sparse banded matrix Re
of the same size, which was composed of 10 non-zero diagonals,
took 0.72 ms. Given current values for u and v at every grid points, Gt and F t
Obviously, only one vector element can be stored in a single can be directly evaluated. In the current implementation, we em-
RGBA texel if numerical operations on vector-valued data have ploy an explicit scheme to resolve for Gt and F t . The diffusion
to be performed. On the other hand, in this case also the perfor- operator is discretized by means of central differences, and, as pro-
mance of software implementations drops down due to enlarged posed in [Stam 1999], we solve for the advection part by tracing
memory footprints. Our current software implementation is highly the velocity field backward in time. Note that these operations are
optimized with regard to the exploitation of cache coherence on the carried out on a 2D grid represented by a 2D texture. The involved
CPU. In practical applications, a less effective internal representa- computations are performed at each grid point in a pixel shader pro-
tion might be used, so that we rather expect a relative improvement gram. Finally, in order to compute updated velocities at time t + 1,
of the GPU based solution. In this respect it is important to know we have to solve for the pressure at time t + 1. From the continu-
that also on the GPU access to higher precision textures slows down ity equation for incompressible media (div(V ) = 0), we obtain the
performance about a factor of 1.5-2. following Poisson equation for the pressure p:
The least efficient operation compared to its software counter-
∂ 2 pt+1 ∂ 2 pt+1 1 ∂ F t ∂ Gt
µ ¶
part is the reduce operation. It is only about a factor of three faster
+ = + (7)
even though we store four elements in one RGBA texture. For in- ∂ x2 ∂ y2 4t ∂ x ∂y
stance, reducing a 10242 vector takes about 1.6 ms on the GPU. For
a vector of length 20482 , this time is 5.4 ms. The relative loss in The partial derivatives of F and G are solved at each grid point,
performance is due to the fact, that the pixel shader program to be represented as a 2D texture, by means of forward differences. Fi-
used for this kind of operation is a lot more complex than one that nally, the right hand side of equation 7 is input to the GPU imple-
is used for vector-vector multiplication. On the other hand, even a mentation of the CG solver. Because vectors are internally repre-
performance gain of a factor of three seems to be worth an imple- sented as 2D matrices, the data does not have to be converted and
mentation on the GPU. can be directly used to feed the CG solver. Equipped with appro-
In the following, we present two examples that demonstrate the priate boundary conditions, the CG solver iteratively computes a
efficient solution of finite difference equations on the GPU. In both solution for p at time t + 1, which can be directly passed back to
examples, a 1024 x 1024 computational grid was employed, and the explicit scheme to compute new velocity values by means of
matrices were represented as set of diagonal vectors. equations 3 and 4.
In the first example, a solution to the 2D wave equation was com- Overall, by means of the GPU implementation of both the ex-
puted based on the implicit Crank-Nicholson scheme as described plicit and the implicit scheme we were able to interactively demon-
(see Figures 6 and 7 for results). Compared to explicit schemes, the strate the numerical solution of the NSE at 9 fps on a 10242 grid. In
implicit approach allows us to considerable increase the step size in each time step, we use the pressure distribution from the last time
time. To solve the system of equations we employed the GPU im- step as initial guess for the CG solver. In this way, only a few iter-
plementation of the conjugate gradient solver. The banded structure ations have to be performed, yet resulting in good accuracy. In the
of the matrix was exploited by reducing the number of diagonal vec- current implementation, four iterations were executed. Such a small
tors to be rendered in one matrix-vector multiplication. The com- number of iterations, on the other hand, yields inaccurate results
putation of one matrix-vector product (10242 x10242 -sparse-matrix once abrupt changes are applied by means of external forces. In
times 10242 -vector) took roughly 4.54 ms. Overall, one iteration of Figure 8, we show a snapshot of our interactive tool, which allows
the conjugate gradient solver was finished in 15.4 ms. By perform- one to interact with the velocity field, and to visualize the dynamics
ing only a limited number of iterations, five in the current example, of injected dye into this field.
interactive simulation at 13 fps could be achieved.
In our second example, we describe a GPU implementation of a
numerical solution to the incompressible Navier-Stokes equations 7 Conclusion
(NSE) in 2D:
In this work, we have described a general framework for the im-
∂u 1 2 plementation of numerical simulation techniques on graphics hard-
= ∇ u −V · ∇u + fx − ∇p (1)
∂t Re ware. For this purpose, we have developed efficient internal layouts
∂v 1 2 for vectors and matrices. By considering matrices as a set of diag-
= ∇ v −V · ∇v + fy − ∇p (2) onal or column vectors and by representing vectors as 2D texture
∂t Re
maps, matrix-vector and vector-vector operations could be acceler-
Here, u and v correspond to the components of the velocity V in ated considerably compared to software based approaches.
the x and y direction, respectively. Re is the Reynolds number, p Our emphasis was on providing the building blocks for the de-
the pressure, and via fx and fy external forces can be specified. sign of general techniques of numerical computing. This is in con-
We first discretize partial derivative of u and v, resulting in an trast to existing approaches, where dedicated, mainly explicit solu-
explicit scheme to compute new velocities at time t + 1 from values tion methods have been proposed. In this respect, for the simulation
at time t: of particular phenomena some of these approaches might be supe-
rior to ours in terms of performance. On the other hand, our frame-
∂ pt+1 work offers the flexibility to implement arbitrary explicit or implicit
ut+1 = Gt + 4t (3)
∂x schemes, and it can thus be used in applications where larger step
∂ pt+1 sizes and stability are of particular interest. Furthermore, because
vt+1 = F t + 4t (4) our internal matrix layout can benefit from the sparsity of columns
∂y quite effectively, we do not expect our method to be significantly
with slower compared to customized explicit schemes.
µ ¶ In order to demonstrate the effectiveness and the efficiency of our
1 2 approach, we have described a GPU implementation of the conju-
G t
= t
u + 4t ∇ u −V · ∇u + fx (5) gate gradient method to numerically solve the 2D wave equation
Re
and the incompressible Navier-Stokes equations. In both examples,
implicit schemes were employed to allow for stable computations,
yet providing interactive rates. Despite precision issues, we could
achieve considerably better performance compared to our software
realization. On the other hand, to allow for a fair comparison we
should consider timing statistics of SSE-optimized software solu-
tions, which are supposed to perform about a factor of 2 to 3 faster.
The lack of a contiguous floating point pipeline on our target
architecture still prohibits its use in numerical applications where
accuracy is a predominant goal. On the other hand, with regard
to the fact that full floating point pipelines are already available,
the implementation of numerical techniques on commodity graph-
ics hardware is worth an elaborate investigation. Particularly in en-
tertainment and virtual scenarios, where precision issues might be
of lesser dominant concern, such implementations can be used ef-
fectively for interactive physics based simulation.
In the future, we will implement matrix-matix operations based
on the described internal layout, and we will investigate meth-
ods to efficiently update vector and matrices that are stored in Figure 7: A GPU-based tool to interact with water surfaces in real-
texture memory. In this way, linear algebra operations like LU- time is shown. By means of the mouse, the user can simulate ex-
decomposition or Singular Value decomposition can be imple- ternal forces that disturb the water surface. On a 10242 grid the
mented. In the long term, we aim at providing the functionality applications runs at 13 fps.
that is available in the BLAS library, thus allowing general linear
algebra packages to be built upon GPU implementations.

8 Acknowledgements
We would like to thank ATI for providing the 9800 graphics card,
and in particular Mark Segal for providing information about the
technical details of this card.

Figure 8: An interactive tool for the visualization of the solution

to the 2D Navier-Stokes equations is demonstrated. The user can
modify the velocity field, and dye can be injected into the field. On
a 10242 grid the applications runs at 9 fps.

ATI, 2003. Sample effects on the ATI graphics cards.

[Link]
BARAFF , D., AND W ITKIN , A. 1998. Large steps in cloth simulation.
Computer Graphics SIGGRAPH 98 Proceedings, 43–54.
B OLZ , J., FARMER , I., G RINSPUN , E., AND S CHROEDER , P. 2003. Sparse
matrix solvers on the GPU: Conjugate gradients and multigrid. Computer
Graphics SIGGRAPH 03 Proceedings.
C HEN , J., AND DA V ITORIA L OBO , N. 1995. Towards interactive-rate
Figure 6: GPU-based interactive simulation of 2D water surfaces is simulation of fluids with moving obstacles using Navier-Stokes equa-
demonstrated. The implementation runs at 43 fps on a 5122 grid. tions. Graphical Models and Image Processing 57, 2.
C URTIS , C., A NDERSON , S., S EIMS , J., F LEISCHER , F., AND S ALESIN ,
D. 1997. Computer-generated watercolor. Computer Graphics SIG-
GRAPH 97 Proceedings.
References D EBUNNE , G., D ESBRUN , M., M.-P., C., AND BARR , A. 2001. Dy-
A NDERSON , E., BAI , Z., B ISCHOF, C., B LACKFORD , S., D EMMEL , J., namic real-time deformations using space and time adaptive sampling.
D ONGARRA , J., D U C ROZ , J., G REENBAUM , A., H AMMARLING , S., In Computer Graphics SIGGRAPH 01 Proceedings.
M C K ENNEY, A., AND S ORENSEN , D. 1999. LAPACK Users’ Guide, D ESBRUN , M., M EYER , M., S CHROEDER , P., AND BARR , A. 1999. Im-
third ed. Society for Industrial and Applied Mathematics, Philadelphia, plicit fairing of irregular meshes using diffusion and curvature flow. In
PA. Computer Graphics SIGGRAPH 99 Proceedings, 317–324.
D ONGARRA , J., D U C ROZ , J., H AMMARLING , S., AND H ANSON , R. S TRZODKA , R., AND RUMPF, M. 2001. Using graphics cards for quantized
1988. An extended set of FORTRAN basic linear algebra subprograms. FEM computations. In Proceedings VIIP 2001, 98–107.
ACM Transactions on Mathematical Software 14, 1–17.
T HOMPSON , C., H AHN , S., AND O SKIN , M. 2002. Using modern graph-
D ONGARRA , J., D U C ROZ , J., H AMMARLING , S., AND H ANSON , R. ics architectures for general-purpose computing: A framework and anal-
1990. A set of level 3 basic linear algebra subprograms,. ACM Transac- ysis. Proceedings of 35th International Symposium on Microarchitecture
tions on Mathematical Software 16, 1–17. (MICRO-35).
E LDER , G. 2002. Radeon 9700. In Proceedings Eurographics/SIGGRAPH W EISKOPF, D., H OPF, M., AND E RTL , T. 2001. Hardware-accelerated
Workshop on Graphics Hardware 2002. visualization of time-varying 2D and 3D vector fields by texture advec-
F EDKIW, R., S TAM , J., AND J ENSEN , H. 2001. Visual simulation of tion via programmable per-pixel operations. In Proceedings Workshop
smoke. Computer Graphics SIGGRAPH 01 Proceedings, 15–22. on Vision, Modeling, and Visualization VMV’01, 439–446.

F OSTER , N., AND F EDKIW, R. 2001. Practical animation of liquids. Com- W EISKOPF, D., H OPF, M., AND E RTL , T. 2002. Hardware-accelerated
puter Graphics SIGGRAPH 01 Proceedings, 23–30. Lagrangian-Eulerian texture advection for 2D flow visualization. In Pro-
ceedings Workshop on Vision, Modeling, and Visualization VMV ’02.
F OSTER , N., AND M ETAXAS , D. 1996. Realistic animation of liquids.
Graphical Models and Image Processing 58, 5, 471–483.
H ARRIS , M., C OOMBE , G., S CHEUERMANN , T., AND L ASTRA , A. 2002.
Physically-based visual simulation on graphics hardware. In Proceed-
ings Eurographics/SIGGRAPH Workshop on Graphics Hardware 2002.
H ART, J. 2001. Perlin noise pixel shaders. In Proceedings Eurograph-
ics/SIGGRAPH Workshop on Graphics Hardware 2001.
H EIDRICH , W., W ESTERMANN , R., S EIDEL , H.-P., AND E RTL , T. 1999.
Applications of pixel textures in visualization and realistic image synthe-
sis. In ACM Symposium on Interactive 3D Graphics, 110–119.
H ILLESLAND , K., M OLINOV, S., AND G RZESZCZUK , R. 2003. Nonlinear
Optimization Framework for Image-Based Modelling on Programmable
Graphics Hardware. Computer Graphics SIGGRAPH 03 Proceedings.
H OPF, M., AND E RTL , T. 1999. Accelerating 3D convolution using graph-
ics hardware. In Proceedings IEEE Visualization’99, 471–474.
H OPF, M., AND E RTL , T. 2000. Hardware accelerated wavelet transfor-
mations. In Proceedings EG/IEEE TCVG Symposium on Visualization
VisSym ’00, 93–103.
J OBARD , B., E RLEBACHER , G., AND H USSAINI , Y. 2000. Lagrangian-
Eulerian advection of noise and dye textures for unsteady flow visualiza-
tion. In Proceedings IEEE Visualization’00, 110–118.
K AAS , M., AND M ILLER , G. 1990. Rapid, stable fluid dynamics for
computer graphics. Computer Graphics SIGGRAPH 90 Proceedings,
49–57.
L ARSEN , E. S., AND M C A LLISTER , D. 2001. Fast matrix multiplies using
graphics hardware. In Proceedings Supercomputing 2001.
L INDHOLM , E., K ILGARD , M., AND M ORETON , H. 2001. A user-
programmable vertex engine. Computer Graphics SIGGRAPH 01 Pro-
ceedings.
M ICROSOFT, 2002. DirectX9 SDK. [Link]
M ONTRYM , J., AND M ORETON , H. 2002. GeForce4. In Proceedings
Eurographics/SIGGRAPH Workshop on Graphics Hardware 2002.
N V IDIA , 2002. nvidia OpenGL game of life.
[Link] gameoflife.
N V IDIA , 2003. Sample effects on the nVIDIA graphics cards.
[Link]
O LANO , M., AND L ASTRA , A. 1998. A shading-language on graphics
hardware. Computer Graphics SIGGRAPH 98 Proceedings, 159–168.
P RESS , W., T EUKOLSKY, S., V ETTERLING , W., AND F LANNERY, B.
2002. Numerical Recipes in C++ : The Art of Scientific Computing.
Cambridge University Press.
P URCELL , T., B UCK , I., M ARK , W., AND H ANRAHAN , P. 2002. Ray
tracing on programmable graphics hardware. Computer Graphics SIG-
GRAPH 98 Proceedings, 703–712.
S TAM , J. 1999. Stable fluids. Computer Graphics SIGGRAPH 99 Proceed-
ings, 121–128.
S TRZODKA , R., AND RUMPF, M. 2001. Nonlinear diffusion in graphics
hardware. In Proceedings EG/IEEE TCVG Symposium on Visualization
2001, 75–84.

Kruger, Westermann - Linear Algebra Operators For GPU Implementation of Numerical Algorithms
No ratings yet
Kruger, Westermann - Linear Algebra Operators For GPU Implementation of Numerical Algorithms
9 pages
The Mathematics of Computer Graphics
No ratings yet
The Mathematics of Computer Graphics
10 pages
Solving Pdes With Cuda
No ratings yet
Solving Pdes With Cuda
34 pages
Linear Solvers GPU
No ratings yet
Linear Solvers GPU
10 pages
Journal of Computational Physics: Sanghyun Ha, Junshin Park, Donghyun You
No ratings yet
Journal of Computational Physics: Sanghyun Ha, Junshin Park, Donghyun You
19 pages
Computer Graphics - T
No ratings yet
Computer Graphics - T
5 pages
GPU Acceleration in CFD Simulations
No ratings yet
GPU Acceleration in CFD Simulations
67 pages
Gribble 08 Ray
No ratings yet
Gribble 08 Ray
11 pages
U.P.B. Sci. Bull., Series, Vol.
No ratings yet
U.P.B. Sci. Bull., Series, Vol.
13 pages
Interactive Deformation and Visualization of Level Set Surfaces Using Graphics Hardware
No ratings yet
Interactive Deformation and Visualization of Level Set Surfaces Using Graphics Hardware
18 pages
Algorthims For Diffrential Graphics
No ratings yet
Algorthims For Diffrential Graphics
24 pages
Module1 Notes of CGV
No ratings yet
Module1 Notes of CGV
32 pages
Numerical Grid Generation Foundations An
No ratings yet
Numerical Grid Generation Foundations An
1 page
ECCOMAS Oslo Article
No ratings yet
ECCOMAS Oslo Article
12 pages
Fast Fluid Thermodynamics Simulation by Solving Heat Diffusion Equation
No ratings yet
Fast Fluid Thermodynamics Simulation by Solving Heat Diffusion Equation
11 pages
CIT 2211 - Lecture 1 - Introduction To Computer Graphics
No ratings yet
CIT 2211 - Lecture 1 - Introduction To Computer Graphics
28 pages
PDF of Computer Graphics and Image Processing Lab 1
No ratings yet
PDF of Computer Graphics and Image Processing Lab 1
203 pages
Cuda Opencl
No ratings yet
Cuda Opencl
17 pages
Overview of Computer Graphics Concepts
No ratings yet
Overview of Computer Graphics Concepts
24 pages
GPU Cluster
No ratings yet
GPU Cluster
12 pages
Christen 07
No ratings yet
Christen 07
8 pages
Thesis 1997 Abdullah
No ratings yet
Thesis 1997 Abdullah
259 pages
Performance Analysis of Different Iterative Solvers Parallelized On Gpu Architecture
No ratings yet
Performance Analysis of Different Iterative Solvers Parallelized On Gpu Architecture
8 pages
CGEMS ParametricCurves PDF
No ratings yet
CGEMS ParametricCurves PDF
9 pages
GTU Computer Graphics Syllabus
No ratings yet
GTU Computer Graphics Syllabus
3 pages
Digital Simulation Lab-1
No ratings yet
Digital Simulation Lab-1
73 pages
Mmte 004
No ratings yet
Mmte 004
125 pages
Kochurov 2015 - GPU Implementation of Jacobi Method and Gauss-Seidel For Large Data
No ratings yet
Kochurov 2015 - GPU Implementation of Jacobi Method and Gauss-Seidel For Large Data
13 pages
The Impact of GPU/Multicore in Signal Processing: A Quantitative Approach
No ratings yet
The Impact of GPU/Multicore in Signal Processing: A Quantitative Approach
11 pages
Notes PDF
No ratings yet
Notes PDF
9 pages
Articles CAF Symmetric FSM Published
No ratings yet
Articles CAF Symmetric FSM Published
9 pages
Numerical Grid Generation for PDEs
No ratings yet
Numerical Grid Generation for PDEs
332 pages
Parallel-Vector Equation Solvers For Finite Element Engineering Applications
No ratings yet
Parallel-Vector Equation Solvers For Finite Element Engineering Applications
15 pages
Sub Ecs 504 30sep14
No ratings yet
Sub Ecs 504 30sep14
92 pages
Reviews: Analogue Computing With Metamaterials
No ratings yet
Reviews: Analogue Computing With Metamaterials
19 pages
Esd Accession List ESTI Call NR 66967: Adom
No ratings yet
Esd Accession List ESTI Call NR 66967: Adom
114 pages
Group6 Simplegpu
No ratings yet
Group6 Simplegpu
6 pages
CUDA Optimization for Jacobi Method
No ratings yet
CUDA Optimization for Jacobi Method
4 pages
Overview of Computer Graphics Concepts
No ratings yet
Overview of Computer Graphics Concepts
4 pages
Dense Matrix Algebra On The GPU
No ratings yet
Dense Matrix Algebra On The GPU
22 pages
Hierarchical Parallel CFD on Hybrid Supercomputers
No ratings yet
Hierarchical Parallel CFD on Hybrid Supercomputers
10 pages
Numerical Methodology of Mesh-Based Simulation Technique
No ratings yet
Numerical Methodology of Mesh-Based Simulation Technique
5 pages
LectureNoteInCS1573 (VECPAR'98)
No ratings yet
LectureNoteInCS1573 (VECPAR'98)
11 pages
CHAPTER 3 Wave Equation Computations and Truly Parallel Processing - 1989 - Handbook of Geophysical Exploration Seismic Exploration
No ratings yet
CHAPTER 3 Wave Equation Computations and Truly Parallel Processing - 1989 - Handbook of Geophysical Exploration Seismic Exploration
20 pages
Phong 1975
No ratings yet
Phong 1975
7 pages
Computer Graphics and Visualization
No ratings yet
Computer Graphics and Visualization
16 pages
(Ebook - PDF) Modelling Physics Correctly For Computer Graphics
No ratings yet
(Ebook - PDF) Modelling Physics Correctly For Computer Graphics
23 pages
PDE For Modelica
No ratings yet
PDE For Modelica
108 pages
Design of Graphics Processing Framework On FPGA
No ratings yet
Design of Graphics Processing Framework On FPGA
5 pages
Mingpu: A Minimum Gpu Library For Computer Vision: Pavel Babenko and Mubarak Shah
No ratings yet
Mingpu: A Minimum Gpu Library For Computer Vision: Pavel Babenko and Mubarak Shah
30 pages
Thompson & Otros - Numerical Grid Generation. Foundations and Applications
No ratings yet
Thompson & Otros - Numerical Grid Generation. Foundations and Applications
336 pages
Matrix
No ratings yet
Matrix
13 pages
Computer Graphics & Multimedia Syllabus
No ratings yet
Computer Graphics & Multimedia Syllabus
115 pages
Parallel Computing in CFD: Milovan Perić
No ratings yet
Parallel Computing in CFD: Milovan Perić
25 pages
JETIR2204008
No ratings yet
JETIR2204008
10 pages
Save the Dead: A Mordheim Scenario
No ratings yet
Save the Dead: A Mordheim Scenario
4 pages
Stroll in The Garden, A
No ratings yet
Stroll in The Garden, A
2 pages
Skinned Instancing White Paper
No ratings yet
Skinned Instancing White Paper
16 pages
Deepshad
No ratings yet
Deepshad
8 pages
Instant Notes in Organic Chemistry
No ratings yet
Instant Notes in Organic Chemistry
368 pages
CSTR
No ratings yet
CSTR
30 pages
Ulan sa Aking Puso: Isang Tula ng Pag-ibig
No ratings yet
Ulan sa Aking Puso: Isang Tula ng Pag-ibig
2 pages
Ajv DPW 00 TMS53 in R03 2972 - 00
No ratings yet
Ajv DPW 00 TMS53 in R03 2972 - 00
2 pages
Chapter 3
100% (1)
Chapter 3
28 pages
Electromagnetic Tooth Clutch ECD
No ratings yet
Electromagnetic Tooth Clutch ECD
2 pages
Class: XII Subject: Physics Assignment 1 Chapter: Electrostatics
No ratings yet
Class: XII Subject: Physics Assignment 1 Chapter: Electrostatics
2 pages
Farmer Database Format NP & CAN User North Zone
No ratings yet
Farmer Database Format NP & CAN User North Zone
354 pages
De Cuong Anh 8 HK I 23-24 Oanh
No ratings yet
De Cuong Anh 8 HK I 23-24 Oanh
8 pages
Sikorskaya 2017 Intercultural Education Policies Across Europe 2006-2016
No ratings yet
Sikorskaya 2017 Intercultural Education Policies Across Europe 2006-2016
24 pages
Women's Body Hair Perceptions Study
No ratings yet
Women's Body Hair Perceptions Study
9 pages
Kaveri Seeds DCF Valuation Report
No ratings yet
Kaveri Seeds DCF Valuation Report
14 pages
Grade 9-lc2021
No ratings yet
Grade 9-lc2021
1 page
Code No.: Time Allowed: 3 Hours Maximum Marks: 70 (I) (Ii) Answer The Questions After Carefully Reading The Text
No ratings yet
Code No.: Time Allowed: 3 Hours Maximum Marks: 70 (I) (Ii) Answer The Questions After Carefully Reading The Text
8 pages
Cadet Heating Boiler Installation Manual
100% (1)
Cadet Heating Boiler Installation Manual
92 pages
The Depth of Influence of Loaded Areas
No ratings yet
The Depth of Influence of Loaded Areas
11 pages
Priceline
No ratings yet
Priceline
2 pages
Samsung Galaxy Product Life Cycle Analysis
No ratings yet
Samsung Galaxy Product Life Cycle Analysis
7 pages
Troubleshooting Model A Engine Overheating
No ratings yet
Troubleshooting Model A Engine Overheating
3 pages
Acr 24500
No ratings yet
Acr 24500
17 pages
Ship Automation & Control System - Shippipedia
No ratings yet
Ship Automation & Control System - Shippipedia
3 pages
CVT Transmission Wiring Diagram
No ratings yet
CVT Transmission Wiring Diagram
2 pages
Mygov 10000000001887797823
No ratings yet
Mygov 10000000001887797823
3 pages
Guided Discovery Matrix
No ratings yet
Guided Discovery Matrix
1 page
Digital Platforms Contribution To Improvement of Service Provision To Citizens in Nampula
No ratings yet
Digital Platforms Contribution To Improvement of Service Provision To Citizens in Nampula
13 pages
En 1426 2024
No ratings yet
En 1426 2024
18 pages
DBMS Relational Calculus
No ratings yet
DBMS Relational Calculus
9 pages
Beam Analysis for Engineers
No ratings yet
Beam Analysis for Engineers
7 pages
MS Visio (Lecture Slides)
No ratings yet
MS Visio (Lecture Slides)
22 pages
Life Skills for Adolescents: Gender & Relationships
No ratings yet
Life Skills for Adolescents: Gender & Relationships
55 pages

GPU - Linear Algebra Operators For GPU Implementation of Num

Uploaded by

GPU - Linear Algebra Operators For GPU Implementation of Num

Uploaded by

Linear Algebra Operators

for GPU Implementation of Numerical Algorithms

Abstract matics. These techniques have a variety of applications in physics

implementation of an explicit scheme for the solution of a coupled

3.3 Vector Reduce

Figure 8: An interactive tool for the visualization of the solution

ATI, 2003. Sample effects on the ATI graphics cards.

You might also like