Generalized Cannon's Algorithm For Parallel Matrix Multiplication
The original Cannon’s algorithm [6] assumes a
skewed distribution of the input matrices and consists
of repeated shift-and-multiply steps. The following
example gives a clear picture of the original Cannon’s
algorithm for the case when all matrices are square.
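For readers unfamiliar with the original algorithm, a small serial C sketch of the square-grid schedule is given below; it is an illustration rather than the paper's Example 1, and the 3 x 3 grid, the one-matrix-element-per-processor simplification, and the identity test matrix are assumptions made only for this sketch. Each processor first receives a skewed pair of elements, then repeats a multiply step followed by a westward shift of A and a northward shift of B.

    /* Illustrative serial sketch of the original square-grid Cannon schedule
       (not the paper's Example 1): one element of each matrix per processor. */
    #include <stdio.h>

    #define N 3   /* N x N processor grid and N x N matrices */

    int main(void) {
        double A[N][N], B[N][N], C[N][N] = {{0}};
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = i * N + j; B[i][j] = (i == j); }

        /* initial skew: row i of A rotated left by i, column j of B rotated up by j */
        double a[N][N], b[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = A[i][(j + i) % N];
                b[i][j] = B[(i + j) % N][j];
            }

        /* N shift-and-multiply steps: multiply the resident pair, then rotate
           A one step west and B one step north */
        for (int step = 0; step < N; step++) {
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    C[i][j] += a[i][j] * b[i][j];
            double ta[N][N], tb[N][N];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    ta[i][j] = a[i][(j + 1) % N];   /* shift west  */
                    tb[i][j] = b[(i + 1) % N][j];   /* shift north */
                }
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) { a[i][j] = ta[i][j]; b[i][j] = tb[i][j]; }
        }

        /* with B chosen as the identity, C should reproduce A */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                printf("%4.0f%c", C[i][j], j == N - 1 ? '\n' : ' ');
        return 0;
    }

Because each processor accumulates its result from N locally resident pairs, the square case needs exactly N shift steps, which is the behavior the generalization below seeks to preserve at a lower communication cost.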
2 Generalized Cannon’s Algorithm (GCA)
cessor array. For example, in Figure 2 (c), processors (0,0), (0,2) and (0,4) contain A_{0,0}. By allocating these three processors to one physical processor, just one copy of A_{0,0} is maintained in the physical processor array. In general, all virtual processors horizontally separated by b-z processors contain the same block of matrix A, while all virtual processors vertically separated by b-z processors contain the same block of matrix B. Hence, all processors separated by b-z processors either horizontally or vertically should be mapped into the same physical processor. In this way, only one copy of each input matrix is stored in the physical processor array.

Upon mapping the virtual processors onto physical ones, it may be possible to eliminate some virtual communication by mapping communicating virtual processors into the same physical one. Since the physical processor array is small, some data movement in the virtual processor array does not require actual data movement in the physical processor array. For example, suppose that the virtual processor array is 6 x 6 while the physical processor array is 3 x 3. Then, three left-shift steps in the virtual processor array correspond to no data movement in the physical processor array. By taking advantage of this basic idea, one can reduce the number of communications.
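To see why such shifts are free, one can assume a cyclic virtual-to-physical mapping in which virtual column j is owned by physical column j mod 3 (the paper's own mapping is defined with Figure 2; the cyclic form and the column-only view here are assumptions for illustration). The short C sketch below checks, for the 6 x 6 virtual / 3 x 3 physical configuration above, which left-shift distances move any block to a different physical owner.

    /* Sketch: which virtual left-shifts require physical communication,
       assuming the cyclic mapping j -> j mod 3 (illustrative, not the
       paper's definition of the mapping). */
    #include <stdio.h>

    #define VDIM 6   /* 6 x 6 virtual processor array  */
    #define PDIM 3   /* 3 x 3 physical processor array */

    int main(void) {
        for (int shift = 1; shift <= VDIM; shift++) {
            int moves = 0;   /* virtual columns whose physical owner changes */
            for (int j = 0; j < VDIM; j++) {
                int src = j % PDIM;
                int dst = (((j - shift) % VDIM + VDIM) % VDIM) % PDIM;
                if (src != dst) moves++;
            }
            printf("left-shift by %d: %s physical communication\n",
                   shift, moves ? "needs" : "no");
        }
        return 0;
    }

Only shifts by multiples of three keep every block on its original physical processor, which is exactly the case exploited in the text.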
Example 1 (continued) Figure 1 (c) shows the data storage of matrices A and B after three steps of data movement. Note that the data blocks stored in each processor are the same as the initial data distribution (in Figure 1 (a)). Hence, no communication is necessary if the computations with these blocks are performed right after the computations with the initial data storage. For matrix multiplication, all operations are commutative, and therefore it is possible to change the order of computations. Hence, by performing the computations as shown in Figure 1 (c) immediately after those in Figure 1 (a), one can eliminate one global communication step. Any pair of computation stages separated by three shifts can be computed without communication. Hence, significant communication overhead can be reduced by reordering computations in such a way that two computation stages separated by three shifts are always executed consecutively. ∎
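To make the savings concrete, the following C sketch counts how many stage transitions require a physical shift for a hypothetical schedule of six computation stages on the 3 x 3 array; a transition is taken to be free when the two stages' data distributions differ by a multiple of three shifts, and the wrap-around transition that restores the initial distribution is included in the count. The six-stage schedule and this counting convention are illustrative assumptions, not figures from the paper.

    /* Sketch: communication counts for two execution orders of six stages on a
       3 x 3 array, where shifting by a multiple of 3 restores the distribution. */
    #include <stdio.h>

    #define STAGES 6
    #define PDIM   3

    static int count_comms(const int order[STAGES]) {
        int comms = 0;
        for (int s = 0; s < STAGES; s++) {
            int from = order[s], to = order[(s + 1) % STAGES];
            int shifts = ((to - from) % STAGES + STAGES) % STAGES;
            if (shifts % PDIM != 0) comms++;   /* needs a physical shift */
        }
        return comms;
    }

    int main(void) {
        int natural[STAGES]   = {0, 1, 2, 3, 4, 5};   /* one shift after every stage    */
        int reordered[STAGES] = {0, 3, 1, 4, 2, 5};   /* pair stages three shifts apart */
        printf("natural order : %d communications\n", count_comms(natural));
        printf("reordered     : %d communications\n", count_comms(reordered));
        return 0;
    }

The natural order pays for every transition (six communications), whereas pairing stages that are three shifts apart needs only three.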
The above example gives an idea of how to reduce the number of communications by changing the execution order. For the 3 x 3 processor array, three shifts return data to the original processor. In general, given a P x Q processor array, lcm(P, Q) shifts return the data to the original processor, where lcm(P, Q) denotes the least common multiple of P and Q. Hence, computations on the initial data distribution and computations on the data distribution that results from lcm(P, Q) data movements can be done without communications. Therefore, to reduce communications, the execution order should be changed in such a way that all computations that need lcm(P, Q) data movements in the virtual processor array are performed consecutively.

The pseudocode of GCA for matrix multiplication is shown in the following program. For simplicity, this program assumes that P and Q evenly divide M, N, and K. In this code, C_loc[i][j], A_loc[i][j'], and B_loc[i'][j] denote the blocks of C, A, and B stored in local memory. Variables i' and j' need to be modified when data shifts cause a change of storage location in local memory.

    lcm = least_common_multiple(P, Q);
    for (t = 0; t < lcm; t++) {
        for (i = 0; i < M/P; i++) {
            for (j = 0; j < N/Q; j++) {
                for (l = 0; l < K/lcm; l++) {
                    j' = (j % k + l*lcm/Q) % (k/Q);
                    i' = (i % k + l*lcm/P) % (k/P);
                    C_loc[i][j] += A_loc[i][j'] * B_loc[i'][j];
                }
            }
        }
        shift-west(A_loc[i][j'], 0 <= i < M/P, 0 <= j' < k/lcm);
        shift-north(B_loc[i'][j], 0 <= i' < k/lcm, 0 <= j < N/Q);
    }

The number of communications required by Cannon's algorithm is b-z, while that required by the proposed scheme is lcm(P, Q). In general, b-z is much larger than P or Q. Hence, the proposed scheme can greatly reduce communication overhead.
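The index arithmetic above depends on how blocks are laid out in local memory (the quantities k, i', and j' are defined with the paper's figures). As a rough, self-contained check of the schedule the pseudocode expresses (lcm(P, Q) shift phases, with A moving west, B moving north, and K/lcm of the K dimension consumed per phase), the following serial C sketch uses one concrete Cannon-style skew. The 2 x 3 grid, the scalar "blocks", and this particular skew are assumptions for the sketch, not the paper's distribution.

    /* Sketch: an lcm-based schedule with a Cannon-style skew, simulated serially.
       In phase t, processor (p,q) holds the A blocks with r = (p+q+t) mod Q and
       the B blocks with r = (p+q+t) mod P (the skewed distribution shifted west
       and north t times); the blocks usable in that phase are those satisfying
       both congruences, i.e. r = (p+q+t) mod lcm(P,Q). */
    #include <stdio.h>
    #include <stdlib.h>

    #define P  2
    #define Q  3
    #define L  6    /* lcm(P, Q) */
    #define KB 12   /* blocks along the K dimension (scalar blocks here) */

    int main(void) {
        double A[P][KB], B[KB][Q], C[P][Q] = {{0}}, R[P][Q] = {{0}};
        for (int p = 0; p < P; p++)
            for (int r = 0; r < KB; r++) A[p][r] = rand() % 10;
        for (int r = 0; r < KB; r++)
            for (int q = 0; q < Q; q++) B[r][q] = rand() % 10;
        for (int p = 0; p < P; p++)                    /* reference product */
            for (int q = 0; q < Q; q++)
                for (int r = 0; r < KB; r++) R[p][q] += A[p][r] * B[r][q];

        for (int t = 0; t < L; t++)                    /* lcm(P,Q) phases */
            for (int p = 0; p < P; p++)
                for (int q = 0; q < Q; q++)
                    for (int r = 0; r < KB; r++)
                        if (r % L == (p + q + t) % L)  /* resident this phase */
                            C[p][q] += A[p][r] * B[r][q];

        for (int p = 0; p < P; p++)                    /* every product formed once */
            for (int q = 0; q < Q; q++)
                if (C[p][q] != R[p][q]) { printf("mismatch\n"); return 1; }
        printf("correct after %d phases of %d local steps each\n", L, KB / L);
        return 0;
    }

The check confirms that each partial product is formed exactly once across the lcm(P, Q) phases, so the shifts between phases are the only communication events.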
3 Partitioning

In this section, a new partitioning scheme is proposed to reduce the number of page faults. The basic idea is explained with Figure 3. Suppose that there are 2 x 2 processors. Matrix A is decomposed into 2 x 2 submatrices as shown in Figure 3 (a) and each block is stored in one processor. In Cannon's algorithm, each processor completes all computations with its local matrix, and then shifts the local matrix to the left processor. This computation and shift step repeats until the local matrix moves back to the original processor. If a matrix is too large to fit into main memory, page faults occur at each computation and shift step. In the new scheme, the submatrix in each processor is decomposed again into submatrices in such a way that each submatrix fits into main memory. In Figure 3 (b), each submatrix is partitioned into 2 x 2 submatrices, and assume that each of the submatrices
For SUMMA, each processor has a different memory allocation at a given computation step. Figure 5 shows the memory allocation of Processor (0,0). This figure is used to estimate the number of page faults in Processor (0,0); other processors have almost the same number of page faults. As shown in Figure 5 (a), SUMMA requires two work spaces, one for A and one for B. The size of the work space is nb x N for both matrices, where nb is to be chosen by a subroutine developer. To start the computation, matrices are allocated to main memory as shown in Figure 5 (b). For C, all elements are loaded. Hence, ((N - N') x N)/P_size page faults occur. For A, the first nb columns are loaded to the work space. Hence, (N - N')(N - N')/P_size page faults occur to access A and (nb x N)/P_size page faults occur to load data into the work space. For B, the first nb rows are loaded to the work space. Since these rows are initially stored in main memory, no page fault occurs for B. For the access of the work space for B, another (nb x N)/P_size page faults occur. For the remaining computation, all elements of A and C are visited once. The total number of elements in A, B, C and the two work spaces is (3N^2 + 2nbN). The number of elements outside of main memory is (3N^2 + 2nbN) - M_size. Therefore, approximately ((3N^2 + 2nbN) - M_size)/P_size page faults occur for the remainder of the computation. Hence, the total number of page faults is

    P_SUMMA(N) = ((N - N') x N)/P_size + (N - N')(N - N')/P_size
                 + 2(nb x N)/P_size + ((3N^2 + 2nbN) - M_size)/P_size.    (2)

Figure 6 compares the estimated page faults for GCA and SUMMA. In this estimation, it is assumed that the memory size is 3 megabytes, and the page size is 256 bytes. For SUMMA, nb is 20 as suggested in [5]. As shown in this analysis, GCA causes fewer page faults than SUMMA.
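As a quick numerical check of this estimate, the C sketch below evaluates Eq. (2). Only the 3-megabyte memory, the 256-byte page, and nb = 20 follow the text; the 8-byte element size, the sample matrix sizes, and the value substituted for N' (which is defined earlier in the paper) are placeholders.

    /* Sketch: evaluating the SUMMA page-fault estimate of Eq. (2). */
    #include <stdio.h>

    static double summa_page_faults(double n, double n_prime, double nb,
                                    double m_size, double p_size) {
        return ((n - n_prime) * n) / p_size
             + ((n - n_prime) * (n - n_prime)) / p_size
             + 2.0 * (nb * n) / p_size
             + ((3.0 * n * n + 2.0 * nb * n) - m_size) / p_size;
    }

    int main(void) {
        double m_size = 3.0 * 1024 * 1024 / 8;   /* 3 MB expressed in 8-byte elements (assumed) */
        double p_size = 256.0 / 8;               /* 256-byte pages, in elements (assumed)       */
        double nb = 20.0;                        /* panel width used for SUMMA in the text      */
        for (double n = 1000.0; n <= 4000.0; n += 1000.0) {
            double n_prime = n / 2.0;            /* placeholder value for N' */
            printf("N = %5.0f : about %.0f page faults\n",
                   n, summa_page_faults(n, n_prime, nb, m_size, p_size));
        }
        return 0;
    }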
5 Experiments

As a performance metric, MFLOPS are used, based on the assumption that 2MNK floating-point operations are necessary for the product of an M x K matrix by a K x N matrix.
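In code form, the metric is straightforward; the timing below is a hypothetical value used only to exercise the formula.

    /* Sketch: MFLOPS under the 2*M*N*K operation-count assumption stated above. */
    #include <stdio.h>

    static double mflops(double m, double n, double k, double seconds) {
        return (2.0 * m * n * k) / (seconds * 1.0e6);
    }

    int main(void) {
        /* hypothetical run: 2048 x 2048 matrices multiplied in 30 seconds */
        printf("%.1f MFLOPS\n", mflops(2048.0, 2048.0, 2048.0, 30.0));
        return 0;
    }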
Figure 7 illustrates the number of MFLOPS vs. the domain size. The Y-axis shows the number of MFLOPS and the X-axis shows the size of A and B. All matrices are square, and a processor array of size 8 x 8 is used. As shown in this figure, GCA performs better than SUMMA for all sizes of matrices. For small matrices, this is mainly due to differences in communication overhead, whereas for large matrices, it is a consequence of the difference in efficiency of memory usage. As shown in this figure, the number of MFLOPS decreases drastically when local blocks no longer fit into memory and must be paged to disk. Since SUMMA uses more memory than GCA, its performance dropoff occurs earlier than that of GCA.
Figure 8 compares the scalability of GCA and SUMMA. The X-axis shows the size of the processor array and the Y-axis shows MFLOPS. The processor grid is always square; therefore, 64 processors are arranged as an 8 x 8 array. Each processor has one block of size 256 x 256. Both algorithms maintain quite a good performance as the processor size increases. Again, though, GCA is more efficient than SUMMA.
Figure 9 illustrates the number of MFLOPS vs. the number of processors. As in the experiment of Figure 8, the processor grid is always square. However, in this experiment, the matrix sizes are fixed at 720 x 720. As the number of processors increases, the communication time becomes a larger factor in the program execution time, and this causes the number of MFLOPS to drop.
Figure 10 illustrates the number of MFLOPS for varying block sizes. For this figure, the Y-axis shows the number of MFLOPS and the X-axis shows the block size for each processor. The matrix sizes were fixed at 2048 x 2048, and the processor array was fixed at 8 x 8. This means that for a block size of 256, each processor has one 256 x 256 block; for a block size of 2, each processor has 128 (2 x 2) blocks. In order for SUMMA to have the best possible performance, the number of columns of blocks of B in each processor is the same, and the number of rows of blocks of A in each processor is the same. This allows each processor to have the same number of blocks of C (since the number of blocks of C is equal to the number of rows of blocks of A multiplied by the number of columns of blocks of B).

[Figures 7-10: MFLOPS achieved by GCA and SUMMA, plotted against the size of the matrix (Figure 7), the size of the processor array (Figures 8 and 9), and the block size (Figure 10).]

The result in Figure 10 shows that it is desirable to choose the largest blocks. However, if the block size is too large, the number of blocks is so small that it is impossible to evenly distribute these blocks across the processor array. Therefore, performance is degraded by poor load balancing. Hence, it is best to choose the largest block size among those which guarantee full utilization of processors.

Figures 11 and 12 illustrate the number of MFLOPS when an 8 x 4 processor array and an 8 x 7 processor array are used, respectively. For a given matrix A (M x K) and matrix B (K x N), the block size is chosen to be M/8 x K/8 and K/8 x N/4 for matrices A and B, respectively, in the experiment of Figure 11. On the other hand, the experiment of Figure 12 chooses the block size to be M/8 x K/56 and K/56 x N/7 for matrices A and B, respectively. These block sizes are the largest possible sizes which allow full processor utilization. In Figure 11, GCA achieves higher MFLOPS than SUMMA. However, in Figure 12, GCA achieves lower MFLOPS than SUMMA. This is because a small block size is used for full processor utilization.
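The block shapes in these two experiments are consistent with choosing A blocks of size (M/P) x (K/lcm(P, Q)) and B blocks of size (K/lcm(P, Q)) x (N/Q); the helper below simply reproduces that pattern. The rule is inferred from the two configurations reported above and the matrix sizes in main are illustrative, so this is a sketch of the selection, not a statement of the paper's general prescription.

    /* Sketch: block shapes matching the Figure 11 and Figure 12 settings. */
    #include <stdio.h>

    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }
    static int lcm(int a, int b) { return a / gcd(a, b) * b; }

    static void block_shapes(int m, int n, int k, int p, int q) {
        int l = lcm(p, q);
        printf("P x Q = %d x %d: A blocks %d x %d, B blocks %d x %d\n",
               p, q, m / p, k / l, k / l, n / q);
    }

    int main(void) {
        int m = 4480, n = 4480, k = 4480;   /* sizes chosen to divide evenly (illustrative) */
        block_shapes(m, n, k, 8, 4);        /* the Figure 11 configuration */
        block_shapes(m, n, k, 8, 7);        /* the Figure 12 configuration */
        return 0;
    }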