Generalized Cannon’s Algorithm for Parallel Matrix Multiplication*

Hyuk-Jae Lee
Department of Computer Science
Louisiana Tech University
Ruston, LA 71272, USA
hlee@engr.latech.edu

James P. Robertson and José A.B. Fortes
School of Electrical and Computer Engineering
Purdue University
W. Lafayette, IN 47907, USA
{robertso,fortes}@ecn.purdue.edu

Abstract

Cannon's algorithm is a memory-efficient matrix multiplication technique for parallel computers with toroidal mesh interconnections. This algorithm assumes that input matrices are block distributed, but it is not clear how it can deal with block-cyclic distributed matrices. This paper generalizes Cannon's algorithm for the case when input matrices are block-cyclic distributed across a two-dimensional processor array with an arbitrary number of processors and toroidal mesh interconnections. An efficient scheduling technique is proposed so that the number of communication steps is reduced to the least common multiple of P and Q for a given P x Q processor array. In addition, a partitioning and communication scheme is proposed to reduce the number of page faults for the case when matrices are too large to fit into main memory. Performance analysis shows that the proposed generalized Cannon's algorithm (GCA) requires fewer page faults than a previously proposed algorithm (SUMMA). Experimental results on the Intel Paragon show that GCA performs better than SUMMA when blocks of size larger than about (65 x 65) are used. However, GCA performance degrades if the block size is relatively small, while SUMMA maintains the same performance. It is also shown that GCA maintains higher performance for large matrices than SUMMA does.

1 Introduction

Matrix multiplication is one of the most important basic operations in scientific computations and many algorithms for parallel matrix multiplication have been proposed. An early development is Cannon's algorithm [6], which is memory-efficient and suitable for toroidal mesh interconnections. The original algorithm assumes that the input matrices are block-distributed, but it is not clear how to efficiently implement this algorithm for block-cyclic (block-scattered) distribution. This paper revisits Cannon's algorithm and generalizes it to the case when input matrices are block-cyclic distributed. An efficient schedule for reducing communication overhead is also proposed. The efficiency of the proposed algorithm is analyzed and comparatively evaluated through experiments conducted on an Intel Paragon.

Extensive research has focused on the development of parallel matrix multiplication subroutines [4]-[11]. Communication and memory overheads are two major obstacles for these parallel subroutines. To reduce communication overheads, various ideas have been explored. One important approach is to overlap communications with computations. Another technique tries to increase the granularity of computations between data exchanges so that the number of communications can be minimized. Both solutions benefit if the communication is done by using communication primitives that can be efficiently implemented by the target architecture. For example, if an architecture requires more overhead for broadcasts than for shifts, then it is better to use the latter instead of the former if possible (and vice versa). Memory overhead increases traffic between different levels of memory hierarchies and therefore is another important source of degradation.

Previously proposed parallel multiplication algorithms fall into two broadly defined classes. One class relies on using broadcast communication primitives (possibly using shift primitives as well). Algorithms included in this class are BMR (Broadcast-Multiply-Roll) [1, 2], PUMMA (Parallel Universal Matrix Multiplication Algorithm) [4], and SUMMA (Scalable Universal Matrix Multiplication Algorithm) [5]. The other class does not use broadcasts but relies exclusively on shifts. This class includes Cannon's algorithm and systolic matrix multiplication [2],[3],[6].

* This research was partially funded by the National Science Foundation under grants MIP-9500673 and CDA-9015696.
The original Cannon’s algorithm [6] assumes a
skewed distribution of the input matrices and consists
of repeated shift-and-multiply steps. The following
example gives a clear picture of the original Cannon’s
algorithm for the case when all matrices are square.

Example 1 Consider the matrix multiplication algorithm which computes C = A x B. Suppose that the
size of a given processor array is (6 x 6). In Cannon’s
algorithm, A and B are decomposed into 36 blocks in
a (6 x 6) arrangement, which are initially aligned and
multiplied by each other as shown in Fig. 1. The next
step is to shift A to the left and B up to neighbor pro-
cessors where block-wise multiplication can take place
again and its result can be added to the current value
of C_{i,j}. This shift and multiply step is repeated until all blocks in a row of A are multiplied by all blocks in a column of B. The data distributions at t = 0, 1, and 3 are shown in Fig. 1. □
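To make the schedule in Example 1 concrete, the following is a minimal serial sketch (not code from the paper) in which each block is reduced to a single scalar and the 6 x 6 grid of the example is simulated in one process. At step t, virtual processor (i, j) works on A block (i, (i+j+t) mod 6) and B block ((i+j+t) mod 6, j), which is exactly the data it would hold after the initial skew followed by t left/up shifts.

    /* Serial sketch of Cannon's shift-and-multiply schedule on a GRID x GRID
       array, with each "block" reduced to a single scalar for brevity. */
    #include <stdio.h>

    #define GRID 6

    int main(void) {
        double A[GRID][GRID], B[GRID][GRID], C[GRID][GRID] = {{0}};
        for (int i = 0; i < GRID; i++)
            for (int j = 0; j < GRID; j++) { A[i][j] = i + 1; B[i][j] = j + 1; }
        for (int t = 0; t < GRID; t++)               /* GRID shift-and-multiply steps */
            for (int i = 0; i < GRID; i++)
                for (int j = 0; j < GRID; j++) {
                    int k = (i + j + t) % GRID;      /* block seen at (i,j) at step t */
                    C[i][j] += A[i][k] * B[k][j];
                }
        /* Row 0 of A and column 0 of B are all ones, so C[0][0] should be 6. */
        printf("C[0][0] = %g\n", C[0][0]);
        return 0;
    }

In the parallel version each (i, j) iteration runs on its own processor, and the change of k is realized by the left and up shifts rather than by indexing.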
Two advantages of Cannon’s algorithm for square
matrices are readily apparent. First, all processors are
busy in all iterations. Second, the only communica-
tion necessary is the initial skew of the input matri-
ces and the shifting of blocks in each iteration. The
initial skew can be eliminated by allocating matri-
ces initially in a skewed manner. A data distribu-
tion independent computation and active library ap-
proach have been studied for development of a source-
to-source translator that automatically allocates data
arrays as required by given subroutines [13]. In machines where a toroidal mesh interconnection (virtual or real) is assumed for processors, shifts are highly
efficient and therefore Cannon’s algorithm performs
very well. The first question addressed by this pa-
per is whether Cannon’s algorithm can be general-
ized to work with matrices of arbitrary shape and size initially stored according to arbitrary block-cyclic distributions and arbitrary mesh ratio (size, shape, and blocks must still obey certain restrictions that guarantee matrix multiplication to be well-defined).
The second question is whether and when the gen-
eralized Cannon’s algorithm offers performance ad-
vantages over broadcast-based algorithms, namely,
SUMMA. Both questions are given positive answers
supported by analysis and experimental results.
The rest of this paper is organized as follows. Sec-
tion 2 develops the generalized Cannon’s algorithm
(GCA). An efficient method to reduce communication
frequency is also proposed. Section 3 develops a partitioning scheme to reduce memory faults. Section 4 evaluates the proposed technique against previous work. Experimental results obtained with an Intel Paragon are provided in Section 5. Section 6 presents conclusions of the paper.

Figure 1: Data distribution for Cannon’s algorithm

2 Generalized Cannon's Algorithm (GCA)

This section proposes a block-cyclic version of Cannon's algorithm and a scheduling scheme that reduces the number of communications. Throughout the section, matrix multiplication C = AB is considered where matrices A, B, and C are partitioned into (b_x x b_z), (b_z x b_y), and (b_x x b_y) blocks, respectively. In its simplest form, Cannon's algorithm requires that the size of the processor array be the same as the partition size (i.e., the number of blocks) of matrix C (= b_x x b_y). Let such an array be called the virtual processor array. However, in general, the available physical processor array is smaller than the virtual processor array. Hence, the virtual array needs to be partitioned to be allocated onto the available physical array. A single physical processor needs to store many data blocks and perform computations with these data blocks. Hence, the global indices of matrix blocks should be carefully converted into the local block indices in each processor.

Consider the case when the virtual processor array is not square. For example, suppose that matrices A, B, and C are decomposed into 6, 10, and 15 blocks, in a (3 x 2), (2 x 5), and (3 x 5) arrangement, respectively. Then, the virtual processor array is 3 x 5 (Figure 2 (a)). The size of the virtual processor array is larger than that of the partition of matrix A. In this case, the virtual processor array needs to be decomposed into subarrays of the same size as the partition of matrix A ((3 x 2) in this example). Figure 2 (b) shows the decomposition of the virtual processor array. Blocks of matrix A are skewed in the same way as in Cannon's algorithm, and mapped into each of these subarrays (Figure 2 (c)). Note that the third subarray is smaller than the matrix A partition, and therefore, more than one block is stored in a single processor. With this mapping, three copies of matrix A exist in the virtual processor array. For matrix B, the virtual processor array is decomposed into (2 x 5) subarrays and blocks of matrix B are mapped into each of these subarrays (Figure 2 (d)). With these distributions of matrices A and B, all blocks are well-aligned and therefore multiplications of A and B can take place on every processor. In the next step, matrix A shifts left and matrix B shifts up (Figure 2 (e)). Then, again, matrices A and B are well aligned, and all processors can compute the matrix product. In general, the virtual processor array needs to be decomposed into subarrays of size (b_x x b_z) for the distribution of matrix A. For matrix B, the virtual processor array needs to be decomposed into subarrays of size (b_z x b_y). Then, the multiply and shift steps are repeated b_z times.

Figure 2: Data distribution for GCA. (a) 3 x 5 virtual processor array; (b) Partitioning of the virtual array; (c) Distribution of A; (d) Distribution of B; (e) Distribution of A and B after the multiplication and shift step.
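For reference, the bookkeeping that converts a global block index into an owning processor and a local storage index can be written with the standard block-cyclic convention. This is an illustrative sketch only (the convention and the 6 x 6 example are assumptions, and the skewed placement of A and B described above is not included): block (I, J) of the virtual array is owned by physical processor (I mod P, J mod Q) and stored locally at index (I div P, J div Q).

    /* Standard block-cyclic owner/local-index computation (illustrative
       convention only; the skewing of A and B is not shown). */
    #include <stdio.h>

    int main(void) {
        int P = 3, Q = 3;            /* physical processor array                 */
        int bx = 6, by = 6;          /* 6 x 6 virtual (block) array, for example */
        for (int I = 0; I < bx; I++)
            for (int J = 0; J < by; J++)
                printf("block (%d,%d) -> processor (%d,%d), local index (%d,%d)\n",
                       I, J, I % P, J % Q, I / P, J / Q);
        return 0;
    }

Under this kind of mapping, several virtual processors share one physical processor, which is what makes the communication savings discussed next possible.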

Consider the allocation of the virtual processor array to a physical processor array. Recall that multiple copies of matrices A and B exist in the virtual processor array. For example, in Figure 2 (c), processors (0,0), (0,2) and (0,4) contain A_{0,0}. By allocating these three processors to one physical processor, just one copy of A_{0,0} is maintained in the physical processor array. In general, all virtual processors horizontally separated by b_z processors contain the same block of matrix A, while all virtual processors vertically separated by b_z processors contain the same block of matrix B. Hence, all processors separated by b_z processors either horizontally or vertically should be mapped into the same physical processor. In this way, only one copy of each input matrix is stored in the physical processor array.

Upon mapping the virtual processors onto physical ones, it may be possible to eliminate some virtual communication by mapping communicating virtual processors into the same physical one. Since the physical processor array is small, some data movement in the virtual processor array does not require actual data movement in the physical processor array. For example, suppose that the virtual processor array is 6 x 6 while the physical processor array is 3 x 3. Then, three left-shift steps in the virtual processor array correspond to no data movement in the physical processor array. By taking advantage of this basic idea, one can reduce the number of communications.

Example 1 (continued) Figure 1 (c) shows the data storage of matrices A and B after three steps of data movement. Note that the data blocks stored in each processor are the same as in the initial data distribution (in Figure 1 (a)). Hence, no communication is necessary if the computations with these blocks are performed right after the computations with the initial data storage. For matrix multiplication, all operations are commutative, and therefore it is possible to change the order of computations. Hence, by performing the computations as shown in Figure 1 (c) immediately after those in Figure 1 (a), one can eliminate one global communication step. Any pair of computation stages separated by three shifts can be computed without communication. Hence, significant communication overhead can be reduced by reordering computations in such a way that two computation stages separated by three shifts are always executed consecutively. □

The above example gives an idea of how to reduce the number of communications by changing the execution order. For the 3 x 3 processor array, three shifts return data to the original source processor. In general, given a P x Q processor array, lcm(P, Q) shifts return the data to the original processor, where lcm(P, Q) denotes the least common multiple of P and Q. Hence, computations on the initial data distribution and computations on the data distribution that results from lcm(P, Q) data movements can be done without communications. Therefore, to reduce communications, the execution order should be changed in such a way that all computations that need lcm(P, Q) data movements in the virtual processor array are performed consecutively.

The pseudocode of GCA for matrix multiplication is shown in the following program. For simplicity, this program assumes that P and Q divide M, N, and K. In this code, C_loc[i][j], A_loc[i][j'], and B_loc[i'][j] denote the blocks of C, A, and B stored in local memory. Variables i' and j' need to be modified when data shifts cause a change of storage location in local memory.

    lcm = least_common_multiple(P, Q);
    for (t = 0; t < lcm; t++) {
        for (i = 0; i < (M/P); i++) {
            for (j = 0; j < (N/Q); j++) {
                for (l = 0; l < (K/lcm); l++) {
                    j' = (j % K + l*lcm/Q) % (K/Q);
                    i' = (i % K + l*lcm/P) % (K/P);
                    C_loc[i][j] += A_loc[i][j'] * B_loc[i'][j];
                }
            }
        }
        shift_west (A_loc[i][j'], 0 <= i < M/P, 0 <= j' < K/lcm);
        shift_north(B_loc[i'][j], 0 <= i' < K/lcm, 0 <= j < N/Q);
    }

The number of communications required by Cannon's algorithm is b_z, while that required by the proposed scheme is lcm(P, Q). In general, b_z is much larger than P or Q. Hence, the proposed scheme can greatly reduce communication overhead.

3 Partitioning

In this section, a new partitioning scheme is proposed to reduce the number of page faults. The basic idea is explained with Figure 3. Suppose that there are 2 x 2 processors. Matrix A is decomposed into 2 x 2 submatrices as shown in Figure 3 (a) and each block is stored in one processor. In Cannon's algorithm, each processor completes all computations with its local matrix, and then shifts the local matrix to the left processor. This computation and shift step repeats until the local matrix moves back to the original processor. If a matrix is too large to fit into main memory, page faults occur at each computation and shift step.

Figure 3: Partitioning of matrix. (a) Initial partitioning of matrix A; (b) Partitioning of submatrices.

In the new scheme, the submatrix in each processor is decomposed again into submatrices in such a way that each submatrix fits into main memory. In Figure 3 (b), each submatrix is partitioned into 2 x 2 submatrices, and assume that each of the submatrices
fits into main memory. Initially, each processor has its own first local submatrix of A and performs computations on that submatrix. Once these computations are finished, each such submatrix is sent to the left processor so that all processors can perform further computations with the new submatrix received from the right processor without any disk read. This computation and shift step repeats until the submatrices move back to their original processors. Therefore, the first submatrices are used by all necessary processors before they are paged out of main memory. Once all computations with them are completed, the next submatrices are loaded into main memory and then used for computation and shift. This procedure continues until all submatrices have been loaded and used for computations. In this way, many disk reads can be eliminated.
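The schedule just described can be sketched for a single processor as follows. This is a toy sketch under assumed helper names (load_piece, multiply_accumulate, shift_west), not code from the paper: each memory-sized piece of the local A submatrix makes a complete trip around the processor ring, being multiplied and shifted at every stop, before the next piece is read from disk, so every piece is read from disk only once.

    /* Toy single-processor view of the compute-and-shift schedule of Section 3.
       The helpers below are stand-ins for disk I/O, local block multiplication,
       and the toroidal shift. */
    #include <stdio.h>

    #define NSUB 2   /* memory-sized pieces of the local A submatrix            */
    #define RING 2   /* processors a piece visits before returning home (2 x 2) */

    static int  load_piece(int p)          { printf("disk read: piece %d\n", p); return p; }
    static void multiply_accumulate(int p) { printf("  multiply with piece %d\n", p); }
    static int  shift_west(int p)          { printf("  shift piece %d to the left neighbor\n", p); return p; }

    int main(void) {
        for (int p = 0; p < NSUB; p++) {
            int piece = load_piece(p);              /* the only disk read for this piece  */
            for (int step = 0; step < RING; step++) {
                multiply_accumulate(piece);         /* use the piece while it is resident */
                piece = shift_west(piece);          /* pass it on; receive the next copy  */
            }
        }
        return 0;
    }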
Consider how to partition matrices and reduce page faults in a single processor. Suppose that matrices A, B, and C are initially allocated to main memory as shown in Figure 4 (a), where a shaded region represents the part of a matrix that is allocated to main memory. In this example, N' x N submatrices are allocated for all of A, B, and C. Figure 4 (b) illustrates the partitioning of the matrices. The upper-leftmost block (C_{0,0}, A_{0,0}, and B_{0,0}) is an N' x N' square submatrix. Other blocks are also shown in Figure 4 (b). Figure 4 (c) shows the parts of the matrices which are involved in the computation C_{0,0} = C_{0,0} + A_{0,0} x B_{0,0} (dark shaded region). As in (a), the light shaded regions still represent the part allocated to main memory. Once this part of the computation is completed, A_{0,0} shifts left and B_{0,0} moves up to the next processors, respectively. This computation and shift step repeats until A_{0,0} and B_{0,0} move back to the original processor. Then, the next computation C_{0,1} = C_{0,1} + A_{0,1} x B_{1,1} is performed. The dark shaded regions in Figure 4 (d) represent the submatrices involved in this computation. Figure 4 (e), (f), ..., (i), and (j) show the order of the remaining computations.

Figure 4: Partitioning and scheduling for page fault reduction

4 Page Fault Analysis of Matrix Multiplication Algorithms
In this section, the number of page faults is estimated for both GCA and SUMMA. Figure 4 is used to estimate the number of page faults made by GCA. For simplicity, assume that A, B, and C are N x N matrices. Let P_size be the size (in bytes) of a single page of main memory and M_size be the size (in bytes) of main memory. Let N' x N' be the size of C_{0,0} in Figure 4 (b). For simplicity, assume that each element of a matrix takes one byte of memory space. Initially, C_{0,0} and C_{0,1} are allocated to main memory. At the computations in Figure 4 (g) and (i), C_{1,0} and C_{1,1} are loaded into main memory, respectively. Therefore, a number of page faults occur in these steps. The total number of elements to be loaded in these steps is (N - N') x N. The number of page faults is ceil(((N - N') x N)/P_size). For simplicity, the ceiling operation is omitted in the remaining analysis. Thus, the number of page faults is ((N - N') x N)/P_size. Matrix A has exactly the same number of page faults as C does.

The estimation of page faults for B is more complex. The computations in Figure 4 (d) and (h) cause ((N - N') x N)/P_size page faults, respectively. The computations in Figure 4 (f) and (g) cause ((N - N') x N)/P_size page faults. The computation in Figure 4 (j) also causes (N - N')(N - N')/P_size page faults. Hence, the total number of page faults for B is 3((N - N') x N)/P_size + (N - N')(N - N')/P_size. Therefore, the total number of page faults is

    PF_GCA(N) = 5((N - N') x N)/P_size + (N - N')(N - N')/P_size.    (1)

For SUMMA, each processor has a different memory allocation at a given computation step. Figure 5 shows the memory allocation of processor (0,0). This figure is used to estimate the number of page faults in processor (0,0); other processors have almost the same number of page faults. As shown in Figure 5 (a), SUMMA requires two work spaces, for A and B respectively. The size of the work space is nb x N for both matrices, where nb is to be chosen by a subroutine developer. To start computation, matrices are allocated to main memory as shown in Figure 5 (b). For C, all elements are loaded. Hence, ((N - N') x N)/P_size page faults occur. For A, the first nb columns are loaded to the work space. Hence, (N - N')(N - N')/P_size page faults occur to access A and (nb x N)/P_size page faults occur to load data into the work space. For B, the first nb rows are loaded to the work space. Since these rows are initially stored in main memory, no page fault occurs for B. For the access of the work space for B, another (nb x N)/P_size page faults occur. For the remaining computation, all elements of A and C are visited once. The total number of elements in A, B, C and the two work spaces is (3N^2 + 2 nb N). The number of elements outside of main memory is (3N^2 + 2 nb N) - M_size. Therefore, approximately ((3N^2 + 2 nb N) - M_size)/P_size page faults occur for the remainder of the computation. Hence, the total number of page faults is

    PF_SUMMA(N) = ((N - N') x N)/P_size + (N - N')(N - N')/P_size + 2(nb x N)/P_size + ((3N^2 + 2 nb N) - M_size)/P_size.    (2)

Figure 5: Memory allocation for SUMMA

Figure 6 compares the estimated page faults for GCA and SUMMA. In this estimation, it is assumed that the memory size is 3 megabytes and the page size is 256 bytes. For SUMMA, nb is 20 as suggested in [5]. As shown in this analysis, GCA causes fewer page faults than SUMMA.

Figure 6: The estimated number of page faults for GCA and SUMMA
5 Experiments

As a performance measure, approximate MFLOPS are used, based on the 2MNK floating-point operations that are necessary for the product of an M x K matrix by a K x N matrix.

Figure 7 illustrates the number of MFLOPS vs. the domain size. The Y-axis shows the number of MFLOPS and the X-axis shows the size of A and B. All matrices are square, and a processor array of size 8 x 8 is used. As shown in this figure, GCA performs better than SUMMA for all sizes of matrices. For small matrices, this is mainly due to differences in communication overhead, whereas for large matrices it is a consequence of differences in efficiency of memory usage. As shown in this figure, the number of MFLOPS decreases drastically when local blocks no longer fit into memory and must be paged to disk. Since SUMMA uses more memory than GCA, its performance dropoff occurs earlier than that of GCA.

Figure 8 compares the scalability of GCA and SUMMA. The X-axis shows the size of the processor array and the Y-axis shows MFLOPS. The processor grid is always square; therefore, 64 processors are arranged as an 8 x 8 array. Each processor has one block of size 256 x 256. Both algorithms maintain quite a good performance as the processor size increases. Again, though, GCA is more efficient than SUMMA.

Figure 7: MFLOPS vs. size of matrix. Figure 8: MFLOPS vs. size of processor array.

Figure 9 illustrates the number of MFLOPS vs. the number of processors. As in the experiment of Figure 8, the processor grid is always square. However, in this experiment, the matrix sizes are fixed at 720 x 720. As the number of processors increases, the communication time becomes a larger factor in the program execution time, and this causes the number of MFLOPS to drop.

Figure 10 illustrates the number of MFLOPS for varying block sizes. For this figure, the Y-axis shows the number of MFLOPS and the X-axis shows the block size for each processor. The matrix sizes were fixed at 2048 x 2048. The processor array was fixed at 8 x 8. This means that for a block size of 256, each processor has one 256 x 256 block. For a block size of 2, each processor has 128 x 128 blocks of size (2 x 2). In order for SUMMA to have the best possible performance, the number of columns of blocks of B in each processor is the same, and the number of rows of blocks of A in each processor is the same. This allows each processor to have the same number of blocks of C (since the number of blocks of C is equal to the number of rows of blocks of A times the number of columns of blocks of B). In this manner, an entire row or column of blocks can be broadcast at one time. This is why SUMMA's times are constant for varying block sizes. However, the MFLOPS of GCA decreases drastically as the block size decreases. If the block size is smaller than about sixty-five, SUMMA performs better than GCA.

Figure 9: MFLOPS vs. size of processor array (720 x 720 matrices). Figure 10: MFLOPS vs. block size.

The result in Figure 10 shows that it is desirable to choose the largest blocks. However, if the block size is too large, the number of blocks is so small that it is impossible to evenly distribute these blocks across the processor array. Therefore, performance is degraded by poor load balancing. Hence, it is best to choose the largest block size among those which guarantee full utilization of processors.

Figures 11 and 12 illustrate the number of MFLOPS when an 8 x 4 processor array and an 8 x 7 processor array are used, respectively. For a given matrix A (M x K) and matrix B (K x N), the block size is chosen to be M/8 x K/8 and K/8 x N/4 for matrices A and B, respectively, in the experiment of Figure 11. On the other hand, the experiment of Figure 12 chooses the block size to be M/8 x K/56 and K/56 x N/7 for matrices A and B, respectively. These block sizes are the largest possible sizes which allow full processor utilization. In Figure 11, GCA achieves higher MFLOPS than SUMMA. However, in Figure 12, GCA achieves a lower number of MFLOPS than SUMMA. This is because a small block size is used for full processor utilization.

Figure 11 and Figure 12: MFLOPS vs. size of matrix.
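The block-size choices quoted above for Figures 11 and 12 are consistent with splitting the K dimension of A and B by lcm(P, Q) while splitting M by P and N by Q. The small helper below is my own reading of those choices, not code from the paper; it merely reproduces the quoted shapes for the two processor arrays.

    /* Reproduces the block shapes quoted above for the 8 x 4 and 8 x 7 arrays,
       assuming K is split by lcm(P, Q), M by P, and N by Q. */
    #include <stdio.h>

    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }
    static int lcm(int a, int b) { return a / gcd(a, b) * b; }

    static void block_shape(int P, int Q) {
        int L = lcm(P, Q);
        printf("%d x %d array: A blocks M/%d x K/%d, B blocks K/%d x N/%d\n",
               P, Q, P, L, L, Q);
    }

    int main(void) {
        block_shape(8, 4);   /* Figure 11: M/8 x K/8  and K/8  x N/4 */
        block_shape(8, 7);   /* Figure 12: M/8 x K/56 and K/56 x N/7 */
        return 0;
    }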
Figure 13 shows the number of MFLOPS with the assumption that the input matrices are distributed across the processor array as required by SUMMA. In this case, GCA requires initial data movements to redistribute data as required by Cannon's algorithm. Hence, the execution time for GCA is the initial data redistribution time plus the computation time. Figure 13 shows that the MFLOPS of GCA is not smaller than that of SUMMA even though the initial data redistribution is required for GCA. This is because the initial redistribution time is much smaller than the computation time.

Figure 13: MFLOPS vs. size of matrix.

6 Conclusions

The main contribution of this paper is the generalization of Cannon's algorithm to block-cyclic distributed input matrices. For parallel computers with toroidal mesh interconnections, Cannon's algorithm can be efficiently implemented. The generalized Cannon's algorithm (GCA) has the same advantages as Cannon's algorithm and therefore it is more efficient than other algorithms developed for block-cyclic distributed matrices. Efficiency of GCA degrades when the block size is small. Improvement of GCA for these cases is under investigation. When matrices are too large to fit into main memory, performance degrades due to page faults. A partitioning and scheduling scheme is proposed to reduce page faults. Analysis shows that the performance degradation of GCA with the proposed partitioning is much less than that of SUMMA. Future work includes implementation of the proposed partitioning scheme and validation of the analysis through experiments.

References

[1] G.C. Fox, S.W. Otto, and A.J.G. Hey, "Matrix algorithms on a hypercube I: matrix multiplication," Parallel Computing, vol. 4, pp. 17-31, 1987.

[2] S. Huss-Lederman, E.M. Jacobson, A. Tsao, and G. Zhang, "Matrix multiplication on the Intel Touchstone Delta," Concurrency: Practice and Experience, vol. 6, no. 7, pp. 571-594, Oct. 1994.

[3] K.K. Mathur and S.L. Johnsson, "Multiplication of matrices of arbitrary shape on a data parallel computer," Parallel Computing, vol. 20, pp. 919-951, July 1994.

[4] J. Choi, J.J. Dongarra, and D.W. Walker, "PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers," Concurrency: Practice and Experience, vol. 6, no. 7, pp. 543-570, Oct. 1994.

[5] R. van de Geijn and J. Watts, "SUMMA: Scalable universal matrix multiplication algorithm," University of Texas, Department of Computer Sciences, Tech. Rep. TR-95-13, April 1995. Also: LAPACK Working Note #96, May 1995.

[6] L.E. Cannon, "A cellular computer to implement the Kalman filter algorithm," Ph.D. dissertation, Montana State Univ., Bozeman, MT, 1969.

[7] R.C. Agarwal, F.G. Gustavson, and M. Zubair, "A high performance matrix multiplication algorithm on a distributed-memory parallel computer, using overlapped communication," IBM Journal of Research and Development, vol. 38, no. 6, pp. 673-681, 1994.

[8] S.L. Johnsson, "Communication efficient basic linear algebra computations on hypercube architectures," J. Parallel Distributed Computing, vol. 4, no. 2, pp. 132-172, April 1987.

[9] P. Bjørstad, F. Manne, T. Sørevik, and M. Vajtersic, "Efficient matrix multiplication on SIMD computers," SIAM J. Matrix Anal. Appl., vol. 13, no. 1, pp. 386-401, Jan. 1992.

[10] S. Huss-Lederman, E.M. Jacobson, and A. Tsao, "Comparison of scalable parallel matrix multiplication libraries," in Proc. of the Scalable Parallel Libraries Conference, Oct. 1993.

[11] A. Chtchelkanova, J. Gunnels, G. Morrow, J. Overfelt, and R.A. van de Geijn, "Parallel implementation of BLAS: General techniques for Level 3 BLAS," Univ. of Texas, Dept. of Computer Sciences, Tech. Rep. TR-95-40, Oct. 1995.

[12] J. Choi, J.J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R.C. Whaley, "A proposal for a set of parallel basic linear algebra subprograms," LAPACK Working Note #100, May 1995.

[13] H.-J. Lee and J.A.B. Fortes, "Toward data distribution independent parallel matrix multiplication," in Proc. Int. Parallel Processing Symposium, pp. 436-440, April 1995.
