Generalized Cannon's Algorithm For Parallel Matrix Multiplication
The original Cannon’s algorithm [6] assumes a
skewed distribution of the input matrices and consists
of repeated shift-and-multiply steps. The following
example gives a clear picture of the original Cannon’s
algorithm for the case when all matrices are square.
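For readers unfamiliar with the original algorithm, a small serial C sketch of the square-grid schedule is given below; it is an illustration rather than the paper's Example 1, and the 3 x 3 grid, the one-matrix-element-per-processor simplification, and the identity test matrix are assumptions made only for this sketch. Each processor first receives a skewed pair of elements, then repeats a multiply step followed by a westward shift of A and a northward shift of B.

    /* Illustrative serial sketch of the original square-grid Cannon schedule
       (not the paper's Example 1): one element of each matrix per processor. */
    #include <stdio.h>

    #define N 3   /* N x N processor grid and N x N matrices */

    int main(void) {
        double A[N][N], B[N][N], C[N][N] = {{0}};
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = i * N + j; B[i][j] = (i == j); }

        /* initial skew: row i of A rotated left by i, column j of B rotated up by j */
        double a[N][N], b[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = A[i][(j + i) % N];
                b[i][j] = B[(i + j) % N][j];
            }

        /* N shift-and-multiply steps: multiply the resident pair, then rotate
           A one step west and B one step north */
        for (int step = 0; step < N; step++) {
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    C[i][j] += a[i][j] * b[i][j];
            double ta[N][N], tb[N][N];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    ta[i][j] = a[i][(j + 1) % N];   /* shift west  */
                    tb[i][j] = b[(i + 1) % N][j];   /* shift north */
                }
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) { a[i][j] = ta[i][j]; b[i][j] = tb[i][j]; }
        }

        /* with B chosen as the identity, C should reproduce A */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                printf("%4.0f%c", C[i][j], j == N - 1 ? '\n' : ' ');
        return 0;
    }

Because each processor accumulates its result from N locally resident pairs, the square case needs exactly N shift steps, which is the behavior the generalization below seeks to preserve at a lower communication cost.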
2 Generalized Cannon’s Algorithm (GCA)
cessor array. For example, in Figure 2 (c), processors (0,0), (0,2) and (0,4) contain A_{0,0}. By allocating these three processors to one physical processor, just one copy of A_{0,0} is maintained in the physical processor array. In general, all virtual processors horizontally separated by b-z processors contain the same block of matrix A, while all virtual processors vertically separated by b-z processors contain the same block of matrix B. Hence, all processors separated by b-z processors either horizontally or vertically should be mapped into the same physical processor. In this way, only one copy of each input matrix is stored in the physical processor array.

Upon mapping the virtual processors onto physical ones, it may be possible to eliminate some virtual communication by mapping communicating virtual processors into the same physical one. Since the physical processor array is small, some data movement in the virtual processor array does not require actual data movement in the physical processor array. For example, suppose that the virtual processor array is 6 x 6 while the physical processor array is 3 x 3. Then, three left-shift steps in the virtual processor array correspond to no data movement in the physical processor array. By taking advantage of this basic idea, one can reduce the number of communications.
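To see why such shifts are free, one can assume a cyclic virtual-to-physical mapping in which virtual column j is owned by physical column j mod 3 (the paper's own mapping is defined with Figure 2; the cyclic form and the column-only view here are assumptions for illustration). The short C sketch below checks, for the 6 x 6 virtual / 3 x 3 physical configuration above, which left-shift distances move any block to a different physical owner.

    /* Sketch: which virtual left-shifts require physical communication,
       assuming the cyclic mapping j -> j mod 3 (illustrative, not the
       paper's definition of the mapping). */
    #include <stdio.h>

    #define VDIM 6   /* 6 x 6 virtual processor array  */
    #define PDIM 3   /* 3 x 3 physical processor array */

    int main(void) {
        for (int shift = 1; shift <= VDIM; shift++) {
            int moves = 0;   /* virtual columns whose physical owner changes */
            for (int j = 0; j < VDIM; j++) {
                int src = j % PDIM;
                int dst = (((j - shift) % VDIM + VDIM) % VDIM) % PDIM;
                if (src != dst) moves++;
            }
            printf("left-shift by %d: %s physical communication\n",
                   shift, moves ? "needs" : "no");
        }
        return 0;
    }

Only shifts by multiples of three keep every block on its original physical processor, which is exactly the case exploited in the text.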
Example 1 (continued) Figure 1 (c) shows the data storage of matrices A and B after three steps of data movement. Note that the data blocks stored in each processor are the same as the initial data distribution (in Figure 1 (a)). Hence, no communication is necessary if the computations with these blocks are performed right after the computations with the initial data storage. For matrix multiplication, all operations are commutative, and therefore it is possible to change the order of computations. Hence, by performing the computations as shown in Figure 1 (c) immediately after those in Figure 1 (a), one can eliminate one global communication step. Any pair of computation stages separated by three shifts can be computed without communication. Hence, significant communication overhead can be reduced by reordering computations in such a way that two computation stages separated by three shifts are always executed consecutively. ∎
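To make the savings concrete, the following C sketch counts how many stage transitions require a physical shift for a hypothetical schedule of six computation stages on the 3 x 3 array; a transition is taken to be free when the two stages' data distributions differ by a multiple of three shifts, and the wrap-around transition that restores the initial distribution is included in the count. The six-stage schedule and this counting convention are illustrative assumptions, not figures from the paper.

    /* Sketch: communication counts for two execution orders of six stages on a
       3 x 3 array, where shifting by a multiple of 3 restores the distribution. */
    #include <stdio.h>

    #define STAGES 6
    #define PDIM   3

    static int count_comms(const int order[STAGES]) {
        int comms = 0;
        for (int s = 0; s < STAGES; s++) {
            int from = order[s], to = order[(s + 1) % STAGES];
            int shifts = ((to - from) % STAGES + STAGES) % STAGES;
            if (shifts % PDIM != 0) comms++;   /* needs a physical shift */
        }
        return comms;
    }

    int main(void) {
        int natural[STAGES]   = {0, 1, 2, 3, 4, 5};   /* one shift after every stage    */
        int reordered[STAGES] = {0, 3, 1, 4, 2, 5};   /* pair stages three shifts apart */
        printf("natural order : %d communications\n", count_comms(natural));
        printf("reordered     : %d communications\n", count_comms(reordered));
        return 0;
    }

The natural order pays for every transition (six communications), whereas pairing stages that are three shifts apart needs only three.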
The above example gives an idea of how to reduce the number of communications by changing the execution order. For the 3 x 3 processor array, three shifts return data to the original processor. In general, given a P x Q processor array, lcm(P, Q) shifts return the data to the original processor, where lcm(P, Q) denotes the least common multiple of P and Q. Hence, computations on the initial data distribution and computations on the data distribution that results from lcm(P, Q) data movements can be done without communications. Therefore, to reduce communications, the execution order should be changed in such a way that all computations that need lcm(P, Q) data movements in the virtual processor array are performed consecutively.

The pseudocode of GCA for matrix multiplication is shown in the following program. For simplicity, this program assumes that P and Q evenly divide M, N, and K. In this code, C_loc[i][j], A_loc[i][j'], and B_loc[i'][j] denote the blocks of C, A, and B stored in local memory. Variables i' and j' need to be modified when data shifts cause a change of storage location in local memory.

    lcm = least_common_multiple(P, Q);
    for (t = 0; t < lcm; t++) {
        for (i = 0; i < M/P; i++) {
            for (j = 0; j < N/Q; j++) {
                for (l = 0; l < K/lcm; l++) {
                    j' = (j % k + l*lcm/Q) % (k/Q);
                    i' = (i % k + l*lcm/P) % (k/P);
                    C_loc[i][j] += A_loc[i][j'] * B_loc[i'][j];
                }
            }
        }
        shift-west(A_loc[i][j'], 0 <= i < M/P, 0 <= j' < k/lcm);
        shift-north(B_loc[i'][j], 0 <= i' < k/lcm, 0 <= j < N/Q);
    }

The number of communications required by Cannon's algorithm is b-z, while that required by the proposed scheme is lcm(P, Q). In general, b-z is much larger than P or Q. Hence, the proposed scheme can greatly reduce communication overhead.
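The index arithmetic above depends on how blocks are laid out in local memory (the quantities k, i', and j' are defined with the paper's figures). As a rough, self-contained check of the schedule the pseudocode expresses (lcm(P, Q) shift phases, with A moving west, B moving north, and K/lcm of the K dimension consumed per phase), the following serial C sketch uses one concrete Cannon-style skew. The 2 x 3 grid, the scalar "blocks", and this particular skew are assumptions for the sketch, not the paper's distribution.

    /* Sketch: an lcm-based schedule with a Cannon-style skew, simulated serially.
       In phase t, processor (p,q) holds the A blocks with r = (p+q+t) mod Q and
       the B blocks with r = (p+q+t) mod P (the skewed distribution shifted west
       and north t times); the blocks usable in that phase are those satisfying
       both congruences, i.e. r = (p+q+t) mod lcm(P,Q). */
    #include <stdio.h>
    #include <stdlib.h>

    #define P  2
    #define Q  3
    #define L  6    /* lcm(P, Q) */
    #define KB 12   /* blocks along the K dimension (scalar blocks here) */

    int main(void) {
        double A[P][KB], B[KB][Q], C[P][Q] = {{0}}, R[P][Q] = {{0}};
        for (int p = 0; p < P; p++)
            for (int r = 0; r < KB; r++) A[p][r] = rand() % 10;
        for (int r = 0; r < KB; r++)
            for (int q = 0; q < Q; q++) B[r][q] = rand() % 10;
        for (int p = 0; p < P; p++)                    /* reference product */
            for (int q = 0; q < Q; q++)
                for (int r = 0; r < KB; r++) R[p][q] += A[p][r] * B[r][q];

        for (int t = 0; t < L; t++)                    /* lcm(P,Q) phases */
            for (int p = 0; p < P; p++)
                for (int q = 0; q < Q; q++)
                    for (int r = 0; r < KB; r++)
                        if (r % L == (p + q + t) % L)  /* resident this phase */
                            C[p][q] += A[p][r] * B[r][q];

        for (int p = 0; p < P; p++)                    /* every product formed once */
            for (int q = 0; q < Q; q++)
                if (C[p][q] != R[p][q]) { printf("mismatch\n"); return 1; }
        printf("correct after %d phases of %d local steps each\n", L, KB / L);
        return 0;
    }

The check confirms that each partial product is formed exactly once across the lcm(P, Q) phases, so the shifts between phases are the only communication events.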
3 Partitioning

In this section, a new partitioning scheme is proposed to reduce the number of page faults. The basic idea is explained with Figure 3. Suppose that there are 2 x 2 processors. Matrix A is decomposed into 2 x 2 submatrices as shown in Figure 3 (a) and each block is stored in one processor. In Cannon's algorithm, each processor completes all computations with its local matrix, and then shifts the local matrix to the left processor. This computation and shift step repeats until the local matrix moves back to the original processor. If a matrix is too large to fit into main memory, page faults occur at each computation and shift step. In the new scheme, the submatrix in each processor is decomposed again into submatrices in such a way that each submatrix fits into main memory. In Figure 3 (b), each submatrix is partitioned into 2 x 2 submatrices, and assume that each of the submatrices
For SUMMA, each processor has a different memory allocation at a given computation step. Figure 5 shows the memory allocation of Processor (0,0). This figure is used to estimate the number of page faults in Processor (0,0); other processors have almost the same number of page faults. As shown in Figure 5 (a), SUMMA requires two work spaces, one for A and one for B. The size of the work space is nb x N for both matrices, where nb is to be chosen by a subroutine developer. To start the computation, matrices are allocated to main memory as shown in Figure 5 (b). For C, all elements are loaded. Hence, ((N - N') x N)/P_size page faults occur. For A, the first nb columns are loaded to the work space. Hence, (N - N')(N - N')/P_size page faults occur to access A and (nb x N)/P_size page faults occur to load data into the work space. For B, the first nb rows are loaded to the work space. Since these rows are initially stored in main memory, no page fault occurs for B. For the access of the work space for B, another (nb x N)/P_size page faults occur. For the remaining computation, all elements of A and C are visited once. The total number of elements in A, B, C and the two work spaces is (3N^2 + 2nbN). The number of elements outside of main memory is (3N^2 + 2nbN) - M_size. Therefore, approximately ((3N^2 + 2nbN) - M_size)/P_size page faults occur for the remainder of the computation. Hence, the total number of page faults is

    P_SUMMA(N) = ((N - N') x N)/P_size + (N - N')(N - N')/P_size
                 + 2(nb x N)/P_size + ((3N^2 + 2nbN) - M_size)/P_size.    (2)

Figure 6 compares the estimated page faults for GCA and SUMMA. In this estimation, it is assumed that the memory size is 3 megabytes, and the page size is 256 bytes. For SUMMA, nb is 20 as suggested in [5]. As shown in this analysis, GCA causes fewer page faults than SUMMA.
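As a quick numerical check of this estimate, the C sketch below evaluates Eq. (2). Only the 3-megabyte memory, the 256-byte page, and nb = 20 follow the text; the 8-byte element size, the sample matrix sizes, and the value substituted for N' (which is defined earlier in the paper) are placeholders.

    /* Sketch: evaluating the SUMMA page-fault estimate of Eq. (2). */
    #include <stdio.h>

    static double summa_page_faults(double n, double n_prime, double nb,
                                    double m_size, double p_size) {
        return ((n - n_prime) * n) / p_size
             + ((n - n_prime) * (n - n_prime)) / p_size
             + 2.0 * (nb * n) / p_size
             + ((3.0 * n * n + 2.0 * nb * n) - m_size) / p_size;
    }

    int main(void) {
        double m_size = 3.0 * 1024 * 1024 / 8;   /* 3 MB expressed in 8-byte elements (assumed) */
        double p_size = 256.0 / 8;               /* 256-byte pages, in elements (assumed)       */
        double nb = 20.0;                        /* panel width used for SUMMA in the text      */
        for (double n = 1000.0; n <= 4000.0; n += 1000.0) {
            double n_prime = n / 2.0;            /* placeholder value for N' */
            printf("N = %5.0f : about %.0f page faults\n",
                   n, summa_page_faults(n, n_prime, nb, m_size, p_size));
        }
        return 0;
    }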
5 Experiments

As a performance metric, MFLOPS are used, based on the assumption that 2MNK floating-point operations are necessary for the product of an M x K matrix by a K x N matrix.
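In code form, the metric is straightforward; the timing below is a hypothetical value used only to exercise the formula.

    /* Sketch: MFLOPS under the 2*M*N*K operation-count assumption stated above. */
    #include <stdio.h>

    static double mflops(double m, double n, double k, double seconds) {
        return (2.0 * m * n * k) / (seconds * 1.0e6);
    }

    int main(void) {
        /* hypothetical run: 2048 x 2048 matrices multiplied in 30 seconds */
        printf("%.1f MFLOPS\n", mflops(2048.0, 2048.0, 2048.0, 30.0));
        return 0;
    }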
Figure 7 illustrates the number of MFLOPS vs. the domain size. The Y-axis shows the number of MFLOPS and the X-axis shows the size of A and B. All matrices are square, and a processor array of size 8 x 8 is used. As shown in this figure, GCA performs better than SUMMA for all sizes of matrices. For small matrices, this is mainly due to differences in communication overhead, whereas for large matrices, it is a consequence of the difference in efficiency of memory usage. As shown in this figure, the number of MFLOPS decreases drastically when local blocks no longer fit into memory and must be paged to disk. Since SUMMA uses more memory than GCA, its performance dropoff occurs earlier than that of GCA.
Figure 8 compares the scalability of GCA and SUMMA. The X-axis shows the size of the processor array and the Y-axis shows MFLOPS. The processor grid is always square; therefore, 64 processors are arranged as an 8 x 8 array. Each processor has one block of size 256 x 256. Both algorithms maintain quite a good performance as the processor size increases. Again, though, GCA is more efficient than SUMMA.
Figure 9 illustrates the number of MFLOPS vs. the number of processors. As in the experiment of Figure 8, the processor grid is always square. However, in this experiment, the matrix sizes are fixed at 720 x 720. As the number of processors increases, the communication time becomes a larger factor in the program execution time, and this causes the number of MFLOPS to drop.
Figure 10 illustrates the number of MFLOPS for varying block sizes. For this figure, the Y-axis shows the number of MFLOPS and the X-axis shows the block size for each processor. The matrix sizes were fixed at 2048 x 2048, and the processor array was fixed at 8 x 8. This means that for a block size of 256, each processor has one 256 x 256 block; for a block size of 2, each processor has 128 (2 x 2) blocks. In order for SUMMA to have the best possible performance, the number of columns of blocks of B in each processor is the same, and the number of rows of blocks of A in each processor is the same. This allows each processor to have the same number of blocks of C (since the number of blocks of C is equal to the number of rows of blocks of A multiplied by the number of columns of blocks of B).

[Figures 7-10: MFLOPS achieved by GCA and SUMMA, plotted against the size of the matrix (Figure 7), the size of the processor array (Figures 8 and 9), and the block size (Figure 10).]

The result in Figure 10 shows that it is desirable to choose the largest blocks. However, if the block size is too large, the number of blocks is so small that it is impossible to evenly distribute these blocks across the processor array. Therefore, performance is degraded by poor load balancing. Hence, it is best to choose the largest block size among those which guarantee full utilization of processors.

Figures 11 and 12 illustrate the number of MFLOPS when an 8 x 4 processor array and an 8 x 7 processor array are used, respectively. For a given matrix A (M x K) and matrix B (K x N), the block size is chosen to be M/8 x K/8 and K/8 x N/4 for matrices A and B, respectively, in the experiment of Figure 11. On the other hand, the experiment of Figure 12 chooses the block size to be M/8 x K/56 and K/56 x N/7 for matrices A and B, respectively. These block sizes are the largest possible sizes which allow full processor utilization. In Figure 11, GCA achieves higher MFLOPS than SUMMA. However, in Figure 12, GCA achieves lower MFLOPS than SUMMA. This is because a small block size is used for full processor utilization.
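The block shapes in these two experiments are consistent with choosing A blocks of size (M/P) x (K/lcm(P, Q)) and B blocks of size (K/lcm(P, Q)) x (N/Q); the helper below simply reproduces that pattern. The rule is inferred from the two configurations reported above and the matrix sizes in main are illustrative, so this is a sketch of the selection, not a statement of the paper's general prescription.

    /* Sketch: block shapes matching the Figure 11 and Figure 12 settings. */
    #include <stdio.h>

    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }
    static int lcm(int a, int b) { return a / gcd(a, b) * b; }

    static void block_shapes(int m, int n, int k, int p, int q) {
        int l = lcm(p, q);
        printf("P x Q = %d x %d: A blocks %d x %d, B blocks %d x %d\n",
               p, q, m / p, k / l, k / l, n / q);
    }

    int main(void) {
        int m = 4480, n = 4480, k = 4480;   /* sizes chosen to divide evenly (illustrative) */
        block_shapes(m, n, k, 8, 4);        /* the Figure 11 configuration */
        block_shapes(m, n, k, 8, 7);        /* the Figure 12 configuration */
        return 0;
    }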