Parallel Algorithms
CRC PRESS
Boca Raton London New York Washington, D.C.
Preface
Parallel computing has undergone a stunning evolution, with high points (e.g.,
being able to solve many of the grand-challenge computational problems out-
lined in the 80’s) and low points (e.g., the demise of countless parallel com-
puter vendors). Today, parallel computing is omnipresent across a large spec-
trum of computing platforms. At the “microscopic” level, processor cores have
used multiple functional units in concurrent and pipelined fashion for years
and multiple-core chips are now commonplace with a trend toward rapidly
increasing numbers of cores per chip. At the “macroscopic” level, one can now
build clusters of hundreds to thousands of individual (multi-core) computers.
Such distributed-memory systems have become mainstream and affordable in
the form of commodity clusters. Furthermore, advances in network technol-
ogy and infrastructures have made it possible to aggregate parallel computing
platforms across wide-area networks in so-called “grids”.
Objective
The aim of this book is to provide a rigorous yet accessible treatment of par-
allel algorithms, including theoretical models of parallel computation, parallel
algorithm design for homogeneous and heterogeneous platforms, complexity
and performance analysis, and fundamental notions of scheduling. The fo-
cus is on algorithms for distributed-memory parallel architectures in which
computing elements communicate by exchanging messages. While such plat-
forms have become mainstream, the design of efficient and sound parallel
algorithms is still a challenging proposition. Fortunately, in spite of the “leaps
and bounds” evolution of parallel computing technology, there exists a core
of fundamental algorithmic principles. These principles are largely indepen-
dent from the details of the underlying platform architecture and provide
the basis for developing applications on current and future parallel platforms.
This book identifies and synthesizes fundamental ideas and generally applica-
ble algorithmic principles out of the mass of parallel algorithm expertise and
practical implementations developed over the last decades.
(ii) a set of exercises; and (iii) solution sketches for exercises marked with
a . This book should be ideally suited for teaching a course on parallel algo-
rithms, or as a complementary text for teaching a course on high-performance
computing. Importantly, although most of the content of the book is about
algorithm design and analysis, it is nevertheless a sound basis for teaching
applied parallel programming. Many of the included examples, case studies,
and exercises are natural starting points for hands-on homework assignments.
Although the content in the more theoretical chapters and that in the more
applied chapters are complementary, it is possible to cover only a subset of the
chapters. For a course focused on theoretical foundations of parallel algorithm
design, one may opt for covering only Chapters 1, 2, 7, and 8. For a course
focused on more practical algorithm design, one may opt for covering only
Chapters 3, 4, 5, and 6.
Acknowledgments
The content of this book, or at least preliminary versions of it, has been
used to teach courses at École Normale Supérieure de Lyon, École Polytech-
nique, the University of Tennessee, Knoxville, and the University of Hawai‘i
at Mānoa. We are grateful to the students for their feedback and suggestions.
We also wish to thank the following people who have contributed to some
of the content by their insightful suggestions, their own previously published
work, or their help reviewing draft chapters: Olivier Beaumont, Mahdi Bel-
caid, Anne Benoit, Rémi Bertin, Vincent Boudet, Benjamin Depardon, Larry
Carter, Alain Darte, Frédéric Desprez, Jack Dongarra, Jeanne Ferrante, Mat-
thieu Gallet, Philip Johnson, Jean-Yves L’Excellent, Loris Marchal, Fab-
rice Rastello, Arnold Rosenberg, Veronika Rehn-Sonigo, Mark Stillwell, Marc
Tchiboukdjian, Jean-Marc Vincent, Frédéric Vivien, and Joshua Wingstrom.
Henri Casanova
Arnaud Legrand
Yves Robert
Contents
Preface iii
I Models 1
1 PRAM Model 3
1.1 Pointer Jumping . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 List Ranking . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2 Prefix Computation . . . . . . . . . . . . . . . . . . . 7
1.1.3 Euler Tour . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Performance Evaluation of PRAM Algorithms . . . . . . . . 10
1.2.1 Cost, Work, Speedup and Efficiency . . . . . . . . . . 10
1.2.2 A Simple Simulation Result . . . . . . . . . . . . . . . 10
1.2.3 Brent’s Theorem . . . . . . . . . . . . . . . . . . . . . 12
1.3 Comparison of PRAM Models . . . . . . . . . . . . . . . . . 12
1.3.1 Model Separation . . . . . . . . . . . . . . . . . . . . . 13
1.3.2 Simulation Theorem . . . . . . . . . . . . . . . . . . . 14
1.4 Sorting Machine . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.1 Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.2 Sorting Trees . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.3 Complexity and Correctness . . . . . . . . . . . . . . . 20
1.5 Relevance of the PRAM Model . . . . . . . . . . . . . . . . . 24
1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . 26
Selection in a List . . . . . . . . . . . . . . . . . . . . . . . . 26
Splitting an Array . . . . . . . . . . . . . . . . . . . . . . . . 26
Looking for Roots in a Forest . . . . . . . . . . . . . . . . . . 26
First Non-Zero Element . . . . . . . . . . . . . . . . . . . . . 27
Mystery Function . . . . . . . . . . . . . . . . . . . . . . . . . 27
Connected Components . . . . . . . . . . . . . . . . . . . . . 28
1.7 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2 Sorting Networks 37
2.1 Odd-Even Merge Sort . . . . . . . . . . . . . . . . . . . . . . 37
2.1.1 Odd-Even Merging Network . . . . . . . . . . . . . . . 38
2.1.2 Sorting Network . . . . . . . . . . . . . . . . . . . . . 41
2.1.3 0–1 Principle . . . . . . . . . . . . . . . . . . . . . . . 42
3 Networking 57
3.1 Interconnection Networks . . . . . . . . . . . . . . . . . . . . 57
3.1.1 Topologies . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.1.2 A Few Static Topologies . . . . . . . . . . . . . . . . . 59
3.1.3 Dynamic Topologies . . . . . . . . . . . . . . . . . . . 61
3.2 Communication Model . . . . . . . . . . . . . . . . . . . . . 62
3.2.1 A Simple Performance Model . . . . . . . . . . . . . . 62
3.2.2 Point-to-Point Communication Protocols . . . . . . . 63
3.2.3 More Precise Models . . . . . . . . . . . . . . . . . . . 66
3.3 Case Study: the Unidirectional Ring . . . . . . . . . . . . . . 72
3.3.1 Broadcast . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3.2 Scatter . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.3.3 All-to-all . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.3.4 Pipelined Broadcast . . . . . . . . . . . . . . . . . . . 78
3.4 Case Study: the Hypercube . . . . . . . . . . . . . . . . . . . 79
3.4.1 Labeling Vertices . . . . . . . . . . . . . . . . . . . . . 79
3.4.2 Paths and Routing in a Hypercube . . . . . . . . . . . 80
3.4.3 Embedding Rings and Grids into Hypercubes . . . . . 82
3.4.4 Collective Communications in a Hypercube . . . . . . 83
3.5 Peer-to-Peer Computing . . . . . . . . . . . . . . . . . . . . 87
3.5.1 Distributed Hash Tables and Structured Overlay Net-
works . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.5.2 Chord . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.5.3 Plaxton Routing Algorithm . . . . . . . . . . . . . . . 91
3.5.4 Multi-casting in a Distributed Hash Table . . . . . . . 91
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Torus, Hypercubes and Binary Trees . . . . . . . . . . . . . . 93
Torus, Hypercubes and Binary Trees (again!) . . . . . . . . . 93
Cube-Connected Cycles . . . . . . . . . . . . . . . . . . . . . 93
Matrix Transposition . . . . . . . . . . . . . . . . . . . . . . . 94
Dynamically Switched Networks . . . . . . . . . . . . . . . . 94
De Bruijn network . . . . . . . . . . . . . . . . . . . . . . . . 96
Gossip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.7 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Bibliography 321
Index 331
Part I
Models
Chapter 1
PRAM Model
FIGURE 1.1: The PRAM model: processing units P1, P2, . . . , Pn accessing a shared central memory.
The PRAM model, which stands for Parallel RAM, comprises a shared central memory that can be accessed by the various Processing Units (see Figure 1.1), or PUs. All PUs execute synchronously the same algorithm and work on distinct (or not) memory areas. In this model neither the number of PUs nor the size of the memory is bounded, which clearly does not hold in practice.
ALGORITHM 1.1: List ranking by pointer jumping.
1 RANK COMPUTATION(L)
2   forall i in parallel do  { Initialization }
3     if next[i] = Nil then d[i] ← 0 else d[i] ← 1
4   while there exists a node i such that next[i] ≠ Nil do  { Main loop }
5     forall i in parallel do
6       if next[i] ≠ Nil then
7         d[i] ← d[i] + d[next[i]]
8         next[i] ← next[next[i]]
2 As in most computer science books, all logarithms are calculated in base 2 and we use
log n to denote log2 n.
FIGURE 1.2: Typical execution of the list ranking algorithm. Gray cells indicate active values, i.e., values that are in the process of being computed.
Consider for instance the following instruction (assuming that indices for array A start at 1 and that PU Pi is responsible for updating array element A[i]):
forall i in parallel do
if i > 1 then
A[i − 1 ] ← A[i ]
The problem here is that processor P2 may update A[2] before P1 can read
it. The same, if not as glaring, problem occurs in statements 7 and 8 of
Algorithm 1.1. To ensure that the loop executes correctly, it suffices to ensure
that all read operations happen before all write operations. Therefore, the
semantics of a forall i in parallel do loop is assumed to be as follows:
    forall i in parallel do A[i] ← B[i]
is equivalent to
    forall i in parallel do temp[i] ← B[i]
    forall i in parallel do A[i] ← temp[i]
Another problem with the algorithm as it is written in Algorithm 1.1 is that
the test in Statement 4 can be done in constant time only on a CRCW PRAM.
On a CRCW PRAM, the test could be implemented by having all PUs write
their boolean value of next[i] = Nil to a single memory cell, say done. Using
a CRCW with a fusion mode for concurrent writes based on a logical AND,
done is true only once the algorithm has completed. Unfortunately, there is
no such constant time solution on a CREW PRAM. On a CREW PRAM the
best approach is to perform pair-wise logical AND operations in a binary tree
pattern, leading to O(log n) steps. We will further discuss this distinction
between a CRCW and a CREW PRAM in Section 1.3.1.
In the particular case of our list ranking algorithm, we know that the algo-
rithm proceeds in dlog ne steps. Therefore, we can rewrite the main loop:
while there exists a node i such that next[i] ≠ Nil do
as:
for step=1 to dlog ne do
1 PREFIX COMPUTATION(L)
2   forall i in parallel do  { Initialization }
3     y[i] ← x[i]
4   while there exists a node i such that next[i] ≠ Nil do  { Main loop }
5     forall i in parallel do
6       if next[i] ≠ Nil then
7         y[next[i]] ← y[i] ⊗ y[next[i]]
8         next[i] ← next[next[i]]
Finally, to avoid concurrent reads in statement 7 of Algorithm 1.1, it suffices to replace
    d[i] ← d[i] + d[next[i]]
by
    temp[i] ← d[next[i]]
    d[i] ← d[i] + temp[i]
With this last simple transformation, we now obtain an O(log n) algorithm on an EREW PRAM, the most restrictive PRAM model.
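To make the pointer jumping technique concrete, here is a small sequential simulation (ours, in Python; it is not part of the book) of the EREW version: the list is given as a next array with None marking the last cell, and each of the ⌈log n⌉ synchronized rounds performs all reads into temporary arrays before any write, exactly as the forall semantics above requires.

from math import ceil, log2

def list_ranking(next_ptr):
    # Rank each cell of a linked list given as a 'next' array (None = end).
    # Simulates the EREW pointer-jumping algorithm: ceil(log2 n) rounds,
    # each doing all reads into temporaries before any write.
    n = len(next_ptr)
    nxt = list(next_ptr)
    d = [0 if nxt[i] is None else 1 for i in range(n)]        # initialization
    for _ in range(ceil(log2(n)) if n > 1 else 0):
        temp_d  = [d[nxt[i]]   if nxt[i] is not None else None for i in range(n)]
        temp_nx = [nxt[nxt[i]] if nxt[i] is not None else None for i in range(n)]
        for i in range(n):                                     # write phase
            if nxt[i] is not None:
                d[i] += temp_d[i]
                nxt[i] = temp_nx[i]
    return d

print(list_ranking([1, 2, 3, 4, None]))   # the list 0 -> 1 -> 2 -> 3 -> 4: [4, 3, 2, 1, 0]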
FIGURE 1.4: Euler tour. 1.4(a): creation of the Eulerian path and initialization; 1.4(b): results after the prefix computation.
The above creates a depth-first path through the tree. We denote by x [i ] the
value for which PU Pi is responsible. These values are initialized as follows:
x[i] = 1 if Pi is a PU of type A,
x[i] = 0 if Pi is a PU of type B,
x[i] = −1 if Pi is a PU of type C.
Figure 1.4(a) shows these values, written as “PU type = value” for simplicity.
Once the above list is established and the values of its elements are ini-
tialized, which can be done in constant time on an EREW PRAM, we can
perform a prefix computation for the addition operator. The reader can easily
check that this computation leads to the values shown in Figure 1.4(b). The
depth of a vertex is stored in the value stored by the C PU associated to that
vertex. In hindsight we now see the rationale for the initial values of the list
elements:
• the C PU’s contribution to the sum is equal to −1, accounting for going
up a level in the tree.
Intuitively, the cost corresponds to the area of a rectangle: Tpar (p, n) (the execution time) times p (the number of PUs). Therefore, the cost is minimal if, at each step, all PUs are used to perform useful computations, i.e., computations that are part of the sequential algorithm. In this case, the cost is equal to the work.
The speedup of a PRAM algorithm is the factor by which the program’s
execution time is lower than that of the sequential program.
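As a simple illustration (ours, with made-up numbers), the sketch below computes cost, speedup, and efficiency from a sequential time, a parallel time and a number of PUs.

def pram_metrics(t_seq, t_par, p):
    # Cost is the p x Tpar(p, n) rectangle; efficiency is speedup divided by p.
    cost = p * t_par
    speedup = t_seq / t_par
    efficiency = speedup / p
    return cost, speedup, efficiency

print(pram_metrics(100.0, 4.0, 32))   # (128.0, 25.0, 0.78125)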
performance model of the algorithm for a PRAM with more PUs. The main
idea is that a PRAM with fewer PUs can simulate a PRAM with more PUs
by executing fewer operations concurrently (with some loss of performance).
type. In this case efficiency is greater than 1 and one speaks of super-linear
speedup. One cause of super-linear speedup is that p processors typically
have p times as much cache and memory as a single processor. Therefore,
when using sufficiently many processors the entire data for the problem at
hand may suddenly fit entirely in memory (or in cache), thus avoiding costly
swapping (or cache misses). It is important to understand that, although
highly desirable in practice, achieving super-linear speedup does not mean
that the parallel algorithm is particularly efficient. From a strictly parallel
algorithms perspective, comparing a parallel execution of an algorithm that
achieves super-linear speedup to a sequential execution that runs out of mem-
ory or cache is not a fair comparison. One option would be to compute a
speedup relative to the execution time when using the smallest number of
processors that leads to a super-linear speedup.
1 COMPUTE MAXIMUM(A, n)
2 forall i ∈ {1, . . . , n} in parallel do
3 m[i ] ←True
4 forall (i, j) ∈ {1, . . . , n}², i ≠ j in parallel do
5 if A[i ] < A[j ] then m[i ] ← False
6 forall i ∈ {1, . . . , n} in parallel do
7 if m[i ] = True then max ← A[i ]
8 return max
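The sketch below (ours) simulates this constant-time CRCW maximum algorithm sequentially: the two nested loops stand for the n² virtual PUs of the comparison step, and the final concurrent write is unambiguous because every PU that writes holds the same, maximal, value.

def crcw_maximum(A):
    # Simulate the O(1)-time CRCW maximum algorithm with n^2 'virtual PUs'.
    n = len(A)
    m = [True] * n                       # m[i]: "A[i] is a candidate maximum"
    for i in range(n):                   # all n^2 comparisons are independent
        for j in range(n):               # and run in one step on the PRAM
            if i != j and A[i] < A[j]:
                m[i] = False             # concurrent writes of the same value
    result = None
    for i in range(n):                   # surviving PUs all write the same value
        if m[i]:
            result = A[i]
    return result

print(crcw_maximum([3, 9, 2, 9, 5]))     # 9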
can be done in O(1) units of time only if Ω(n) copies of e have been created,
which cannot be done in less than Ω(log n) time. More generally, broadcasting a piece of information to n PUs on an EREW PRAM requires Ω(log n) steps, the number of copies of the information at best doubling at each step.
THEOREM 1.2. Any CRCW algorithm with p PUs can be at most O(log p) times faster than the best EREW algorithm with p PUs for solving the same problem.
Proof. Let us assume that the CRCW PRAM uses a consistent mode (only
concurrent writes of the same values are authorized). We show a method to
simulate concurrent writes with only exclusive writes (concurrent reads can
be handled in a similar fashion).
(Figure: simulating one step of concurrent writes: each PU Pi writes an (address, value) pair into A[i], and A is then sorted by address.)
Let us consider a given step of the CRCW algorithm and simulate it with
O(log p) steps of an EREW algorithm. Both PRAMs have the same computing
power (i.e., the same number of processors). Therefore, we only need to focus
on memory accesses. Our EREW algorithm requires a temporary array A of
size p. When PU Pi of the CRCW algorithm writes a value xi to address
1.4.1 Merge
In all that follows we assume that we sort/merge arrays of integers. The
merge of two sorted sequences J and K is denoted J|K. We say an integer x
is between a and b if and only if a < x ≤ b.
DEFINITION 1.3 (Rank). The rank of an element x in a sequence J is
defined as the number of elements of J that are smaller than x:
rank (x, J) = card{j ∈ J, j < x} .
Likewise, the cross-rank of A in B is the function R[A, B] : A → ℕ, e ↦ rank(e, B).
This function can be represented as an array of size |A| whose i-th entry is
the rank of the i-th element of A in B.
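These definitions translate directly into code; the sketch below (ours, using a binary search instead of the PRAM machinery) computes rank(x, J) and the cross-rank array R[A, B] for sorted sequences.

from bisect import bisect_left

def rank(x, J):
    # Number of elements of the sorted sequence J that are smaller than x.
    return bisect_left(J, x)

def cross_rank(A, B):
    # Cross-rank R[A, B]: the i-th entry is the rank of the i-th element of A in B.
    return [rank(a, B) for a in A]

J = [2, 5, 6, 7]
K = [1, 3, 4, 8]
print(rank(5, K))         # 3 (the elements 1, 3 and 4 of K are smaller than 5)
print(cross_rank(J, K))   # [1, 3, 3, 3]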
FIGURE 1.6: Example binary tree for Cole's parallel merge sort (the leaves, at level 0, hold 8, 7, . . . , 1; the root, at level 3, holds the sorted sequence 1, 2, . . . , 8).
(Figure: the elements of L = L1 L2 . . . L8 partition K into blocks K(1), . . . , K(9), and J|K = (J(1)|K(1)).(J(2)|K(2)) . . . (J(9)|K(9)).)
Proof. Each of the three steps can be done in O(1) time using R[L, J], R[L, K],
R[J, L] and R[K, L]:
COLE MERGE()
1 Receive X(t + 1) from the left child and Y (t + 1) from the right child
2 Merge: val(t + 1) ← MERGE WITH HELP(X(t + 1), Y (t + 1), val(t))
3 Reduce: send Z(t + 1) = REDUCE(val(t + 1)) to its father.
{ REDUCE() keeps one value out of every four:
REDUCE({z1 , z2 , . . . , zn }) = {z4 , z8 , z12 , . . .}. }
• Step 2: Since J(i) and K(i) have at most three elements, parallel fusion
can clearly be done in O(1) time with |L| + 1 PUs.
• Step 3: Knowing R[L, J] and R[L, K], we can compute R[L, J|K]. In-
deed, for each element l ∈ L we have:
The PU Pi responsible for resi computes the rank r of li−1 in J|K and
stores the elements of resi (at most six elements) starting from position
r + 1. Step 3 can thus be performed with |L| + 1 PUs.
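The essential point is that, once cross-ranks are available, every element can compute its final position in J|K independently of all the others: an element x stored at index i of J lands at index i + rank(x, K), so one PU per element suffices. A small sequential sketch of this idea (ours, ignoring the good-sampler machinery):

from bisect import bisect_left, bisect_right

def merge_with_ranks(J, K):
    # Merge two sorted sequences; each element computes its own destination
    # from its rank in the other sequence (the O(1)-time PRAM idea).
    out = [None] * (len(J) + len(K))
    for i, x in enumerate(J):                # one virtual PU per element of J
        out[i + bisect_left(K, x)] = x       # rank(x, K): elements of K below x
    for i, y in enumerate(K):                # one virtual PU per element of K
        out[i + bisect_right(J, y)] = y      # ties resolved in favor of J
    return out

print(merge_with_ranks([2, 5, 6, 7], [1, 3, 4, 8]))   # [1, 2, 3, 4, 5, 6, 7, 8]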
Mode of Operation
At step t = 0, all nodes store an empty sequence except for the leaves, which
store one element of the array to be sorted. At step t + 1, each node runs an
algorithm that uses a simple REDUCE() operation, as shown in Algorithm 1.5.
We will prove in Section 1.4.3 that at any step t we have:
A node of the tree is said to be complete when it has received all its inputs,
i.e., a sorted sequence of 2k elements for a node at level k. As soon as this
sequence is non-empty, its size doubles at each step from 1 to 2k . If a node
is complete at step t then the REDUCE() operation is changed in the following
way:
• at step t + 2, it sends one element out of every two (the second one, the
fourth one, etc.) from val(t + 1) to its father;
• from step t + 4 on, it stops working and does not send anything to its
father any longer.
• the size of the input of node at level k in the tree doubles at each step
(from 1 to 2k ) until the node is complete;
• we can compute R[S(t + 1), S(t)] for any sequence of input or output of
a node.
FIGURE 1.8: Sorting an array of size 8 with Cole's parallel merge sort.
In this section, we describe in detail how the algorithm works with an input
of size 8 (see Figure 1.8). At step t = 0, leaves have sequences of size 20 and
are hence complete. At step t = 1 and t = 2 they send one element out of
four, then one out of two, i.e., no element at all. At step t = 3, they send
their unique value to their father and stop working.
Let us focus on the father [8, 7] of leaves 8 and 7. At step t = 3, it computes
val(3), i.e., MERGE WITH HELP({8}, {7}, ∅), and becomes complete (it is a level
1 node). At step t = 4, it does not send anything to its father. At step t = 5,
it sends one element out of every two, i.e., {8}. At step t = 6, it sends its two
elements then stops working.
Let us now focus on the root node of the subtree [8, 7, 6, 5]. At step t = 5,
it receives {8} and {6} and computes val(5) = MERGE WITH HELP({8}, {6}, ∅).
At step t = 6, it receives X(6) = {7, 8} and Y (6) = {5, 6} and computes
val(6) = MERGE WITH HELP({7, 8}, {5, 6}, val(5)). We can check that val(5) is
a GS of X(6) and Y (6). At step t = 6, the node is complete (it is a level 2
node). At step t = 7, it sends {8}, and at step t = 8, it sends {6, 8}. At step
t = 9, it sends its four elements and stops working.
Lastly, let us look at the root of the tree. At step t = 7, it receives {8} and
{4} and computes val(7) = MERGE WITH HELP({8}, {4}, ∅). At step t = 8, it re-
ceives {6, 8} and {2, 4} and computes val(8) = MERGE WITH HELP({6, 8}, {2, 4},
val(7)). At step t = 9, it receives {5, 6, 7, 8} and {1, 2, 3, 4} and computes
val(9) = MERGE WITH HELP({5, 6, 7, 8}, {1, 2, 3, 4}, val(8)). One can easily check
that val(t) is a GS of X(t + 1) and Y (t + 1) for t = 6, 7, 8. At the end of step
t = 9, the root is complete and all values are sorted.
LEMMA 1.2. The pipelined sorting tree algorithm runs in O(log n) time
with O(n) PUs.
Proof. We still use the notation X|Y to denote the fusion of X and Y. If X is a GS of X′, then X|W is clearly still a GS of X′ for any set W. However, if X is a GS of X′ and Y is a GS of Y′, then X|Y is not necessarily a GS of X′|Y′.
Indeed, let us consider for example X = [2, 7], X′ = [2, 5, 6, 7], Y = [1, 8], and Y′ = [1, 3, 4, 8]. Then, we have X|Y = [1, 2, 7, 8] and X′|Y′ = [1, 2, 3, 4, 5, 6, 7, 8], but there are 5 elements of X′|Y′ between 2 and 7 (which are yet consecutive elements of X|Y). This is the reason why we resort to the reduce operator.
Let us prove the following property: there are at most 2r + 2 elements of X′|Y′ between r consecutive elements of X|Y (we assume that −∞ and +∞ are in X and Y).
Coming back to the proof of the lemma, let us define Z = REDUCE(X|Y) and Z′ = REDUCE(X′|Y′). Let us consider k + 1 consecutive elements z1, z2, . . . , zk+1 of Z. Since the reduce operator keeps one element out of every four, we have z1 = e4h, z2 = e4(h+1), . . . , zk+1 = e4(h+k), where X|Y = {e1, e2, . . . , ep}. Thus, there are 4k + 1 elements of X|Y between z1, z2, . . . , zk+1. Using the previous property with r = 4k + 1, we know that there are at most 8k + 4 elements of X′|Y′ between these 4k + 1 elements. Since the reduce operator keeps one element out of every four, there are at most (8k + 4)/4 = 2k + 1 elements of Z′ between the k + 1 consecutive elements of Z, proving that Z′ is a GS of Z.
In steady-state, a node receives a sorted sequence X(t+1) from its left child
and a sorted sequence Y (t + 1) from its right child. It computes val(t + 1) =
MERGE WITH HELP(X(t+1), Y (t+1), val(t)) and sends Z(t+1)=REDUCE(val(t + 1))
to its father. Since Lemma 1.3 shows that Z(t) is a GS of Z(t + 1), we have
the following invariants:
3. Y (t) is a GS of Y (t + 1).
Note that the property still holds true for the last two communications (one
element out of every two is sent, then finally all the elements).
Lastly, we have to ensure that the requirements of Lemma 1.1 hold true: we
need to know the cross-ranks for MERGE WITH HELP() to run in O(1) time. At a
given step, X is a GS of X′ and Y is a GS of Y′, U = X|Y and Z = REDUCE(U). We can assume that cross-ranks R[X′, X] and R[Y′, Y] are known by induction hypothesis. To compute U′ = X′|Y′ with MERGE WITH HELP(X′, Y′, U), we need to know cross-ranks R[X′, U], R[Y′, U], R[U, X′], and R[U, Y′]. Finally we can compute Z′ = REDUCE(U′) and R[Z′, Z] to get our invariant. Of course,
we assume that for each sorted sequence S we know the cross-rank R[S, S]. In
other words, we know the index of each element in the sorted sequence. This
is computed as part of the internal representation of S.
Proof. Take b0 = −∞ and bk+1 = +∞. The rank of a is then computed with
the following loop:
forall i, 0 ≤ i ≤ k in parallel do
    if bi < a ≤ bi+1 then
        rank ← i
There are no write conflicts because only one processor stores the value of
rank .
Proof. We assume that whenever a sequence S is sorted, we also know R[S, S].
For any a ∈ S1 ⊂ S we have
We can now prove our invariant on cross-ranks for MERGE WITH HELP().
This partition is computed in O(1) time with O(|X|) PUs because we know R[X′, X]. We also partition U with X (which is the same as partitioning Y with X because U = X|Y):
For each i we use |U(i)| PUs. Altogether we thus require O(|U|) PUs. As X is a GS of X′, each X′(i) consists of at most three elements and thus the computation runs in O(1) time. We have therefore computed R[X′, U]. R[Y′, U] is of course computed in a similar fashion.
To compute R[U, X′] we need R[X, X′] and R[Y, X′]. Let us see how to compute R[X, X′]. Consider an element ai from X \ X′ and search for the minimal element a′ of X′(i + 1). The rank of ai in X′ is the same as the rank of a′ in X′. This rank is already computed as part of the internal representation of the sorted sequence X′. Thus, we can compute rank(ai, X′) in O(1) time with a single processor. To compute R[Y, X′] consider y ∈ Y. We compute rank(y, X) using Lemma 1.5, because U = X|Y is already computed. Then, we compute rank(y, X′) using rank(y, X) and R[X, X′]. This way, we compute R[U, X′] in O(1) time with O(|U|) PUs. We can compute R[U, Y′] in a similar way.
LEMMA 1.7. Using Lemma 1.6’s notation, let
algorithms have potential practical relevance and that the most promising ones are the cost-optimal algorithms with minimal execution time.
Bibliographical Notes
Some of the introduction material in this chapter is inspired by the books by
Cormen, Leiserson, and Rivest [44] and by Gengler, Ubéda and Desprez [60].
The presentation of Cole’s parallel merge sort algorithm is taken from the book
by Gibbons and Rytter [62]. The original article by Cole [43] also presents
an EREW version of the sorting machine with the same performance. For
additional information on the PRAM model, we refer the reader to the book
by Reif [101].
1.6 Exercises
We start with six classical exercises. The first three are based on
the pointer jumping technique, which the reader should master by now. The
fourth one comes back to model separation. The fifth one is a nice application
of the O(1) CRCW algorithm for finding the largest value of an array. We do
not spoil the surprise about the sixth one and let the reader discover it. We
end with a seventh exercise on connected components [62], which represents
a nice example of the idea contained within Brent’s theorem.
We have seen in Section 1.1 how to implement these two operators in time
O(log n) on an EREW PRAM. Consider the SPLIT function below:
1 SPLIT(A, Flags)
2 Idown ←PRESCAN(not(Flags))
3 Iup ←n - REVERSE(SCAN(REVERSE(Flags)))
4 forall i ∈ {1, . . . , n} in parallel do
5 if Flags(i) then Index [i] ←Iup[i]
6 else Index [i] ←Idown[i]
7 Result ←PERMUTE(A,Index )
8 return Result
This function uses two functions: REVERSE(A) and PERMUTE(A, Index ). The
former reverses array A, and the latter reorders array A according to a per-
mutation specified as an array of indices, Index . The slightly cumbersome
REVERSE(SCAN(REVERSE(Flags))) simply scans from the end of the array Flags,
considering its elements as integers.
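Below is a sequential sketch of SPLIT (ours): SCAN and PRESCAN are written as plain inclusive and exclusive prefix sums, whereas on a PRAM they would be the O(log n) parallel prefix computations of Section 1.1.

def split(A, flags):
    # Stable two-way partition: elements whose flag is false keep their order
    # at the bottom of the result, elements whose flag is true at the top.
    n = len(A)
    idown, count = [], 0                 # Idown = PRESCAN(not(Flags))
    for f in flags:
        idown.append(count)
        count += (0 if f else 1)
    iup, count = [0] * n, 0              # Iup = n - REVERSE(SCAN(REVERSE(Flags)))
    for i in reversed(range(n)):
        count += (1 if flags[i] else 0)
        iup[i] = n - count
    index = [iup[i] if flags[i] else idown[i] for i in range(n)]
    result = [None] * n                  # PERMUTE(A, Index)
    for i in range(n):
        result[index[i]] = A[i]
    return result

print(split([5, 7, 3, 1, 4, 2, 7, 2], [1, 1, 1, 1, 0, 0, 1, 0]))
# [4, 2, 2, 5, 7, 3, 1, 7], as in the worked example of the answers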
1. Given an array Flags of booleans, what does the SPLIT function return? What is its execution time?
(a) What is the result of the MYSTERY function when applied to A = [5, 7, 3,
1, 4, 2, 7, 2] and Number Of Bits = 3?
(c) Assuming the size of integers is O(log n) bits, what is the execution time
of MYSTERY with n PUs? What if only p PUs are used? What are the
values of p that lead to an optimal value of the algorithm’s work?
Removing any one edge of this circuit results in an in-tree. A star is a 0-tree-
loop.
The previous invariant can thus be read as: “the directed graph (V, {(i, C(i)) |
i ∈ V }) consists of stars.” We can freely identify pseudo-vertices and stars:
the center of a star is the label of the corresponding pseudo-vertex. Connected
components are computed by applying the following two functions in sequence
as many times as needed:
1 GATHER()
2   forall i ∈ S in parallel do
3     T(i) ← min{C(j) | {i, j} ∈ E, C(j) ≠ C(i)}
4     { if this set is empty, then C(i) is associated to T(i) }
5   forall i ∈ S in parallel do
6     T(i) ← min{T(j) | C(j) = i, T(j) ≠ i}
7     { if this set is empty, then T(i) is associated to T(i) }
8 JUMP()
9   forall i ∈ S in parallel do B(i) ← T(i)
10  repeat log n times
11    forall i ∈ S in parallel do T(i) ← T(T(i))
12  forall i ∈ S in parallel do C(i) ← min{B(T(i)), T(i)}
(Figure: the example graph, with vertices 1 to 9.)
Apply the GATHER function to this graph, then the JUMP function, then the
GATHER function, and so on. Follow the effect of these steps on the directed
graphs (V, {(i, T (i)) | i ∈ V }) and (V, {(i, C(i)) | i ∈ V }).
4. Prove that applying the GATHER and JUMP functions dlog ne times en-
ables pseudo-vertices induced by C to correspond exactly to the connected
components of the original graph.
5. What is the complexity of this algorithm? How many PUs are used?
1.7 Answers
Exercise 1.2 (Selection in a List)
We simply use a version of the pointer jumping technique to obtain Algo-
rithm 1.6. Each PU determines the location of the first blue object after it in
the list in O(log n) time.
This EREW algorithm requires a bit of explanation. It works like classical
pointer jumping but in parallel on many lists. The last object in each of these
lists is either blue or Nil. Each PU ends up with a pointer to the end of its
list, i.e., to the next blue element. All pointer jumps are made on independent
lists and thus do not interfere with each other. At the end of the algorithm
the list of blue elements starts either with the first element of the initial list
if it was blue, or with its first blue successor.
1 SELECT-BLUE()
2 forall i in parallel do
3 if next(i) = Nil Or color(next(i)) = blue then
4 done(i) ←True
5 blue(i) ←next(i)
6 while there is a node i such that done(i) = False do
7 forall i in parallel do
8 if done(i) = False then
9 done(i) ← done(next(i))
10 if done(i) = True then
11 blue(i) ← blue(next(i))
12 next(i) ← next(next(i))
1 FIND ROOT()
2 forall i in parallel do
3 if father (i) = Nil then
4 root(i) ← i
5 while there exists a node i such that father (i) ≠ Nil do
6 forall i in parallel do
7 if father (i) ≠ Nil then
8 if father (father (i)) = Nil then
9 root(i) ← root(father (i))
10 father (i) ← father (father (i)).
Note that several nodes may have the same father (especially in the end!) and thus will need to access the same data father (i) simultaneously.
The same analysis as the one for list ranking (see Section 1.1) can be applied. Each node in the forest finds its root in time at most O(log d), where d is
the maximal depth of the trees, and all nodes find their roots in parallel.
Therefore, the algorithm runs in time O(log d).
. Question 2. Let us consider the worst case for the EREW model, i.e.,
the case where the forest contains only one tree. Let us count the number of
PUs that may know the root at each step. In the EREW model, the number
of PUs that know an information can at most double at each step. In our
case, exactly one PU knows the root at the beginning and therefore at least
Ω(log n) steps are required to propagate this information to all PUs.
Using O(n) PUs SCAN and PRESCAN can be done in time O(log n). As other
operations only require constant time, SPLIT runs in time O(log n).
. Question 2.
(a) A = [ 5 7 3 1 4 2 7 2 ]
bit(0) = [ 1 1 1 1 0 0 1 0 ]
A ←SPLIT(A, bit(0)) = [ 4 2 2 5 7 3 1 7 ]
bit(1) = [ 0 1 1 0 1 1 0 1 ]
A ←SPLIT(A, bit(1)) = [ 4 5 1 2 2 7 3 7 ]
bit(2) = [ 1 1 0 0 0 1 0 1 ]
A ←SPLIT(A, bit(2)) = [ 1 2 2 3 4 5 7 7 ]
(b) Based on the example, it looks like MYSTERY sorts its input array. In fact,
it is a parallel implementation of the well-known radix-sort algorithm:
starting with the least-significant bit, the SPLIT function splits the array
in two parts depending on the value of this bit. Each call to SPLIT sorts
elements according to the current bit value while maintaining the order
obtained with previous bits. This is why the algorithm goes from the
least-significant bit to the most-significant bit.
(c) There are O(log n) iterations of the main loop. The execution time of the MYSTERY function is thus O(log² n) with O(n) PUs. When using only p PUs, the execution time of SPLIT becomes O(n/p + log p) and the execution time of the parallel radix-sort becomes O((n/p + log p) log n) = O((n/p) log n + log n log p). The work is optimal (i.e., equal to O(n log n)) for p such that p log p ≤ n, e.g., for p = n^q with 0 < q < 1.
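For reference, here is a compact sequential sketch (ours) of what MYSTERY computes: a radix sort that applies a stable SPLIT to successive bits, from the least-significant to the most-significant one.

def mystery(A, number_of_bits):
    # Parallel radix sort, written sequentially: one stable SPLIT per bit.
    for b in range(number_of_bits):
        flags = [(x >> b) & 1 for x in A]
        A = [x for x, f in zip(A, flags) if f == 0] + \
            [x for x, f in zip(A, flags) if f == 1]
    return A

print(mystery([5, 7, 3, 1, 4, 2, 7, 2], 3))   # [1, 2, 2, 3, 4, 5, 7, 7]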
(Figure: the example graph with the initial values C(i) = i.)
First, we can notice that the directed graph (V, {(i, T (i)) | i ∈ V }) consists of
1-tree-loops:
(Figure: the values of T(i) after the first GATHER.)
After pointer jumping in the JUMP function, all these 1-tree-loops have been
transformed into stars:
(Figure: the values of T(i) after pointer jumping, forming stars.)
These stars are merged in (V, {(i, C(i)) | i ∈ V }) in the last part of the JUMP
function:
(Figure: the values of C(i) after the stars have been merged.)
There are only two remaining pseudo-vertices and C is thus as follows before
applying GATHER again:
(Figure: the values of C(i) before GATHER is applied again.)
and finally
(Figure: the values of T(i) after the second GATHER.)
We end up with a directed graph (V, {(i, T (i)) | i ∈ V }) that consists solely
of 1-tree-loops:
(Figure: the resulting directed graph (V, {(i, T(i)) | i ∈ V}), consisting solely of 1-tree-loops.)
Each vertex is now aware of its pseudo-vertex, hence connected components have been computed.
. Question 4. We just need to prove that each step at least halves the number of pseudo-vertices. Let us focus on pseudo-vertices and on the graph induced by T on these vertices. In this graph, two pseudo-vertices i and j are connected if and only if there are two vertices k and l that are connected in the original graph and such that C(k) = i and C(l) = j. The function JUMP merges all these pseudo-vertices into a single pseudo-vertex. Therefore, the number of pseudo-vertices is at least halved at each step. As there are originally n pseudo-vertices, ⌈log n⌉ applications of GATHER and JUMP suffice.
ALGORITHM 1.8: the first loop of GATHER().
1 forall i ∈ S in parallel do
2   T(i) ← min{C(j) | {i, j} ∈ E, C(j) ≠ C(i)}
3   { if this set is empty, then C(i) is associated to T(i) }

ALGORITHM 1.9: the same loop written with elementary parallel operations.
1 forall i, j ∈ S in parallel do
2   if {i, j} ∈ E And C(i) ≠ C(j) then Temp(i, j) ← C(j)
3   else Temp(i, j) ← ∞
4 forall i ∈ S in parallel do
5   Temp(i, 1) ← min{Temp(i, j) | j ∈ S}
6 forall i ∈ S in parallel do
7   if Temp(i, 1) = ∞ then T(i) ← C(i)
8   else T(i) ← Temp(i, 1)
Algorithm 1.8 can be transformed into Algorithm 1.9. The first loop of
Algorithm 1.9 clearly runs in time O(1) with O(n2 ) PUs (more precisely with
O(|E|) PUs) on a CREW. The next two loops run in time O(log n) with O(n2 )
PUs using classical pointer jumping (n PUs are assigned to each row and each
PU computes its minimum independently of others).
Computing connected components can thus be done in time O(log² |V|) with O(|V| + |E|) PUs. However, the JUMP function wastes resources in the minima computation. With Brent's theorem we can lower the number of PUs down to O(n²/ log n) (actually to O(|E|/ log |V| + |V|) PUs) without changing the execution time.
Chapter 2
Sorting Networks
(Figure: a comparator takes two inputs a and b and outputs min(a, b) and max(a, b).)
In this chapter, we present two sorting networks. The first network imple-
ments a merge sort, just like Cole’s PRAM algorithm presented in Section 1.4.
The second network uses an odd-even transposition scheme that can easily be
mapped to a one-dimensional network of processors.
FIGURE 2.2: (a) The Merge1 network, merging two sorted sequences of length 2; (b) the Merge2 network, merging two sorted sequences of length 4.
Let us build a Mergem network that merges two sorted sequences of length
2m . For m = 0, we only need a single comparator. For m = 1, assum-
ing SORTED(ha1 , a2 i) and SORTED(hb1 , b2 i), we can use three comparators as
depicted in Figure 2.2(a). It is not hard to see that this Merge1 works as
expected. We know that a1 6 a2 and b1 6 b2 . The upper output is min(a1 , b1 )
and the lower output is max(a2 , b2 ). An additional comparator is added to
sort the two outputs from the middle, i.e., max(a1 , b1 ) and min(a2 , b2 ).
Determining that Merge2 (depicted in Figure 2.2(b)) works is more diffi-
cult. So instead let us prove the result in the general case, by induction. The
Mergem network is built with two copies of the Mergem−1 network followed by a column of 2^m − 1 comparators. The first copy of Mergem−1 merges
odd elements from the input sequences and the second copy merges even ele-
ments. Surprisingly, a simple column of comparators suffices to complete the
merging of the two input sequences.
Then, we have
Proof. Without loss of generality, we can assume that elements are distinct.
In the resulting sequence d1 is in first position, which is correct as it is the
smallest element of the whole sequence. Likewise, e2n ’s position is correct. In
the general case, di and ei−1 (for i > 2 and i 6 2n) are in position 2i − 2 or
2i − 1 in the resulting sequence. We show that their positions are correct by
showing that they both dominate 2i−3 elements of the complete sequence and
are dominated by 4n − 2i + 1 elements of the complete sequence. Therefore,
their position in the whole sequence is necessarily 2i − 2 or 2i − 1 and the final
comparison finally sets them in their correct positions.
We have 4 items to prove for 2 ≤ i ≤ 2n:
Figure 2.3 shows the Mergem network, with its two copies of the Mergem−1 network whose outputs are connected with 2^m − 1 comparators.
The processing time tm of the Mergem network is defined as the maximum
number of comparators an input must traverse. Therefore, we have t1 = 2
because some data have to go through 2 comparators (even though some only
go through one) and t2 = 3. Of course, many comparators can be active at
the same time (this book is about parallel computing after all).
LEMMA 2.1. The processing time tm and the number of comparators pm of
Mergem satisfy the following recursions:
t0 = 1, t1 = 2, tm = tm−1 + 1 (i.e., tm = m + 1)
p0 = 1, p1 = 3, pm = 2pm−1 + 2^m − 1 (i.e., pm = 2^m·m + 1)
Proof. These recursions follow directly from Proposition 2.1. The expression pm = 2^m·m + 1 can be proved via a simple induction.
When expressed as a function of the length n = 2^m of the input sequence, these quantities are respectively O(log n) and O(n log n). The efficiency of the network is rather low: by multiplying the number of comparators with the processing time, one obtains the total work (see Section 1.2): Wn = pn × tn = O(n (log n)²). This is far beyond the total work of a sequential merge: using
a single comparator and O(n) steps, the total work of the sequential merge is
O(n). Note however that the overall processing time, tm , is very short.
The rather poor efficiency of this network can easily be explained: each
comparator is used only once during the merge. The efficiency of the network
(Figure: recursive construction of the sorting network from Sort1, Merge1, Sort2, and Merge2 sub-networks.)
LEMMA 2.2. The processing time t′m and the number of comparators p′m of Sortm satisfy the following recursions:
t′m = t′m−1 + tm−1 and p′m = 2p′m−1 + pm−1 .
Proof. These recursions follow directly from the recursive design of the network. Since t′m = t′m−1 + tm−1 = t′m−1 + m, we get t′m = O(m²). For the second equation, p′m = 2p′m−1 + 2^(m−1)·(m − 1) + 1 = 2^(m−1)·(1 + 2 + 3 + · · · + (m − 1)) + (1 + 2 + 4 + · · · + 2^(m−1)) = 2^(m−1)·(m(m − 1)/2) + 2^m − 1. Therefore, p′m = 2^(m−1)·(m(m − 1)/2 + 2) − 1 = O(2^m·m²).
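To make the recursive construction concrete, here is a standard comparator generator for Batcher's odd-even merge sort (a Python sketch, ours, for input lengths that are powers of two). Counting the comparators it emits for n = 8 recovers p′3 = 19, and the 0–1 principle of Section 2.1.3 justifies checking it exhaustively on 0–1 inputs only.

def oddeven_merge(lo, hi, r):
    # Comparators that merge the two sorted halves of x[lo..hi] (hi inclusive).
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)         # odd sub-merge
        yield from oddeven_merge(lo + r, hi, step)     # even sub-merge
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)

def oddeven_merge_sort(lo, hi):
    # Comparators that sort x[lo..hi] (hi inclusive).
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)
        yield from oddeven_merge_sort(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)

def apply_network(x, comparators):
    x = list(x)
    for i, j in comparators:
        if x[i] > x[j]:
            x[i], x[j] = x[j], x[i]
    return x

net = list(oddeven_merge_sort(0, 7))                   # network for n = 8
print(len(net))                                        # 19 comparators
print(apply_network([5, 7, 3, 1, 4, 2, 7, 2], net))    # [1, 2, 2, 3, 4, 5, 7, 7]
assert all(apply_network([(v >> k) & 1 for k in range(8)], net)
           == sorted((v >> k) & 1 for k in range(8)) for v in range(256))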
It is easy to see that R does not sort the 0–1 sequence ⟨f(x1), . . . , f(xn)⟩ correctly. Indeed, f(R(xk)) = 1 is output at position k and f(R(xk+1)) = 0 is output at position k + 1.
The 0–1 principle can be used to prove the correctness of Mergem . Let us
provide a new, less cumbersome proof of Proposition 2.1.
New proof of Proposition 2.1. Using the 0–1 principle we can now restrict the
proof of Proposition 2.1 to 0–1 sequences.
Let us denote by ZEROS(x) the number of zeros in sequence x = ⟨x1, . . . , xm⟩. A sorted 0–1 sequence is always structured as x = 0^r 1^(m−r) where r = ZEROS(x). Let p = ZEROS(⟨a1, . . . , a2n⟩) and q = ZEROS(⟨b1, . . . , b2n⟩). We distinguish four different cases:
⟨d1, . . . , d2n⟩ and ⟨e1, . . . , e2n⟩ have the same number of 0's. The last column of comparators receives as input p′ + q′ − 1 pairs of 0's, followed by a (1, 0) pair (the first 1 of d and the last 0 of e) and 2n − p′ − q′ − 1 pairs of 1's. The sorted sequence is thus obtained thanks to the (p′ + q′)-th comparator.
ZEROS(⟨d1, . . . , d2n⟩) = p′ + q′ and ZEROS(⟨e1, . . . , e2n⟩) = p′ + q′ − 1.
ZEROS(⟨d1, . . . , d2n⟩) = p′ + q′ and ZEROS(⟨e1, . . . , e2n⟩) = p′ + q′ − 2.
Proof. We use the 0–1 principle. Let ⟨a1, . . . , an⟩ denote a 0–1 sequence. Let
us denote by k the number of 1’s in this sequence and let j0 denote the position
of the last 1 (i.e., the rightmost 1). In Figure 2.6, we show an example for
n = 7, k = 3 and j0 = 4. Let us first note that a 1 never moves to the left:
the only possible move is when it is compared with a 0 on its right, in which
case it moves to the right.
Let us follow the last 1’s moves. If j0 is even then it does not move in the
first step but it moves to the right in the second step. If j0 is odd, it moves
to the right in the first step. In both cases, it moves to the right from step 2
and for all following steps until it is at position n. Before step 2, the last 1 is
at least at position 2 and thus it always has enough time to arrive at position
n in n − 1 steps.
Let us now follow the moves of the next-to-last 1. At step 0 it is at position
j1 (j1 = 2 in Figure 2.6). As the last 1 moves to the right from step 2 on (at
least), the next-to-last 1 is never blocked by the last 1 when moving to the
right. Therefore, from step 3 on, the next-to-last 1 moves to the right until
it reaches position n − 1. More generally, the i-th 1 (counting from the right)
FIGURE 2.6: Illustration of the proof that the odd-even transposition network is a sorting network (here n = 7, with the 0–1 input sequence 1 1 0 1 0 0 0).
moves to the right starting (at least) at step i + 1 until it reaches position
n − i + 1: it is never blocked by another 1 on its right.
At last, the leftmost 1 (the k-th as there are k 1’s in total) moves to the
right at step k + 1 until it reaches its final position n − k + 1. The k − 1
remaining 1’s are on its right and the sequence is thus sorted.
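A direct sequential simulation of the odd-even transposition network (ours): n synchronous steps that alternate between the two rows of comparators, each step acting on disjoint pairs that a real machine would process in parallel.

def odd_even_transposition_sort(a):
    # n steps; step parity selects which neighboring pairs are compared.
    a = list(a)
    n = len(a)
    for step in range(n):
        for i in range(step % 2, n - 1, 2):     # disjoint pairs: one parallel step
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([1, 1, 0, 1, 0, 0, 0]))   # [0, 0, 0, 0, 1, 1, 1]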
Another proof of Proposition 2.3. Another proof was proposed by Knuth [75]
in exercises 36 and 37, Section 5.3.4. It relies on “primitive sorting networks.”
A primitive sorting network is a sorting network such that comparisons are
only made between neighbors (the odd-even transposition network is thus a
primitive network).
Formally, a sorting network α can be modeled by a sequence of comparators,
i.e., α = [xk , yk ] ◦ . . . ◦ [x2 , y2 ] ◦ [x1 , y1 ] where k is the number of comparators.
A primitive network is such that for all i: yi = xi + 1. The proof relies on two
lemmas:
LEMMA 2.3. Let α be an n-entry primitive network. α is a sorting network if and only if α sorts ⟨n, n − 1, . . . , 2, 1⟩.
Proof. Proving the implication is trivial. Let us prove the reciprocal by con-
tradiction.
Let x denote an input sequence such that α(x)i > α(x)j , with i < j. Let y = ⟨n, n − 1, . . . , 2, 1⟩. We prove that α(y)i > α(y)j , which establishes the
lemma. We prove this by induction on the number of comparators, k. More
precisely our induction hypothesis is:
Let us first prove Hi,j,x,y (0): a primitive network β of size 0 does not sort
anything so we necessarily have β(y)i = n + 1 − i > n + 1 − j = β(y)j , hence
the result.
Let us now assume that Hi,j,x,y (q − 1) is true and consider an arbitrary
primitive network γ of size q such that γ(x)i > γ(x)j . We have γ = [p, p+1]◦β,
where β is a primitive sorting network of size q − 1.
We need to distinguish between a few cases depending on p:
Bibliographical Notes
This chapter draws inspiration from the book by Gibbons and Rytter [62]
for the Batcher network and from the book by Akl [3] for the section on the
one-dimensional network. For additional information on sorting networks, we
refer the curious reader to the seminal book by Knuth [75].
2.3 Exercises
The first exercise is a straightforward warm-up. The second exercise deals
with bitonic sorting and is classical (see [62] or [44]). The third exercise
presents a more sophisticated example of sorting network. Many other (com-
plex) examples can be found in [81].
1 SNAKE MERGE(A)
2 SHUFFLE each row of the network (using odd-even
transpositions based on the elements’ indices).
3 { Note that this amounts to SHUFFLE the columns. }
4 Sort pairs of columns, i.e., snake-ordered n × 2 grids, using
2n steps of odd-even transpositions on the induced linear
subnetwork.
5 Apply 2n steps of odd-even transpositions on the induced
linear network of size n2 .
2.4 Answers
Exercise 2.1 (Particular Sequences)
⇒ Let us first assume that α sorts the sequence ⟨n, n − 1, . . . , 1⟩. Then (see the proof of Proposition 2.2), α also sorts the sequences ⟨fi(n), fi(n − 1), . . . , fi(1)⟩ for any i ∈ {1, . . . , n} where
Therefore, the network does not sort the sequences ⟨1^i 0^(n−i)⟩ correctly for all i ∈ {1, . . . , n}.
FIGURE 2.10: Using split networks to build a bitonic sorting network for
sequences of length n.
• If b is of the form 0^i 1^j 0^k, then since j ≥ n/2 the output sequence is of the form 0^i 1^(j−n/2) 0^k 1^(n/2).
We can thus easily check that in all cases the output sequence consists of two
equal-length bitonic subsequences with at least one of them clean. The proof
is similar when there are more 0’s than 1’s.
We can now use this property to build a bitonic sorting network from split
networks (see Figure 2.10). Let tm and pm denote the depth and the size of bitonic sorting networks for inputs of length n = 2^m. We can easily check that:
t1 = 1, tm = tm−1 + 1, which implies that tm = m,
p1 = 1, pm = tm·2^(m−1), which implies that pm = m·2^(m−1).
This bitonic sorting network thus contains O(n log n) comparators and its
depth is O(log n).
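A sketch of the construction (ours): the split network is a single column of comparators at distance n/2, and applying it recursively on both halves yields a sorter of depth log n for bitonic sequences of length n = 2^m.

def bitonic_split(x):
    # One column of comparators (i, i + n/2) applied to a bitonic sequence.
    n = len(x)
    x = list(x)
    for i in range(n // 2):
        if x[i] > x[i + n // 2]:
            x[i], x[i + n // 2] = x[i + n // 2], x[i]
    return x

def bitonic_sort(x):
    # Sort a bitonic sequence of length 2^m with m columns of comparators.
    if len(x) <= 1:
        return list(x)
    x = bitonic_split(x)
    half = len(x) // 2
    return bitonic_sort(x[:half]) + bitonic_sort(x[half:])

print(bitonic_sort([3, 5, 8, 9, 7, 4, 2, 1]))   # [1, 2, 3, 4, 5, 7, 8, 9]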
(Figure: the steps of SNAKE MERGE on a 4 × 4 grid: initial layout, recursive call on the 2 × 2 grids, column shuffling, column sorting, and 2n steps of odd-even transpositions.)
This index set should then be sorted using local comparisons. Let us now
consider the primitive network α made of p − 1 stages. Stage i performs the
following comparisons:
⟨p − i + 1, p − i + 2⟩, ⟨p − i + 3, p − i + 4⟩, . . . , ⟨p + i − 1, p + i⟩.
For example, for n = 8 and p = 4 the algorithm would sort ⟨1, 3, 5, 7, 2, 4, 6, 8⟩.
The three stages of α consists of the following comparators: (4, 5) in the first
stage, (3, 4), (5, 6) in the second stage and (2, 3), (4, 5), (6, 7) in the third stage.
Using a simple induction, one can prove that α makes it possible to sort
the sequence c and thus can be used to SHUFFLE the columns.
The time needed for the SHUFFLE operation is n/2 − 1. The time needed for the other steps is 2 × 2n and the execution time of the merging algorithm is thus smaller than (9/2)·n.
. Question 3. Let tm denote the time needed to sort on a 2^m × 2^m grid. Using the previous question, we have tm ≤ tm−1 + (9/2)·2^m. We also have t0 = 0 since a 1 × 1 grid is always snake-ordered. Summing up all these inequalities for tm, tm−1, tm−2, . . . , we obtain:
tm ≤ (9/2)·(2 + 4 + · · · + 2^m) = (9/2)·(2^(m+1) − 2) ≤ 9 · 2^m.
The time needed to sort a sequence of length n = 2^(2m) on a grid is thus O(√n).
. Question 4. We use the 0–1 principle to simplify the proof. The odd-
even transposition sort on a grid is correct if and only if the third step of the
algorithm can be done with only 2n odd-even transpositions on 0–1 sequences.
The two possible layouts of snake-ordered 0–1 sequences are depicted in
Figure 2.13. The index i of the last row that starts with a 0 is always odd.
There are two cases depending on whether the number of remaining 0’s does
not exceed a row or spills over the following row.
We only have to show that after the second step of the algorithm (sorting
on linear networks of size 2n) at most two rows of the grid are not clean, i.e.,
consisting only of 0’s or only of 1’s (see Figure 2.14).
(Figure 2.14: after the shuffle and the sorting of pairs of columns, at most two rows of the grid are not clean.)

Chapter 3
Networking
In this chapter, we present network design and operation principles that are
relevant to the study of parallel algorithms. We start with a brief description
of classical network topologies (Section 3.1). Next, we present common mes-
sage passing mechanisms along with a few performance models (Section 3.2).
Then, we focus on the routing problem for two classical topologies: the ring
(Section 3.3) and the hypercube (Section 3.4). More precisely, we discuss how
to implement point to point communications as well as global communications
on these topologies. Finally, we present some recent applications of some of
these techniques to Peer-to-Peer Networks (Section 3.5).
3.1.1 Topologies
The processors in a distributed memory parallel platform are connected
through an interconnection network. Nowadays all computers have specialized
coprocessors dedicated to communication that route messages and put data
in local memories (i.e., network cards). In what follows we say that a parallel
platform comprises nodes. Each node consists of a (computing) processor, a
memory, and a communication coprocessor. When there is no ambiguity we
sometimes use the term processor to denote the entire node.
Nodes can be interconnected in arbitrary ways. Since the 70s, many indus-
trial and academic projects have tried to determine the best way to intercon-
nect nodes efficiently and at low cost. There are two main approaches:
(Figure 3.1: a few classical static topologies, including (a) a clique, (b) a ring, (c) a 2-D grid, (d) a 2-D torus, and (e) a hypercube.)
Let us first note that the number of processors p of a static topology with
degree ∆ is bounded as follows:
p ≤ 1 + ∆ + ∆(∆ − 1) + ∆(∆ − 1)² + . . . + ∆(∆ − 1)^(D−1) = (∆(∆ − 1)^D − 2)/(∆ − 2).
This bound is known as Moore’s bound. It is derived by counting the neighbors
of a processor, the uncounted neighbors of neighbors, and so on until the
farthest processors are reached (at distance D).
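A two-line check of the bound (ours): it simply adds up the at most ∆(∆ − 1)^(k−1) processors at distance k, for k = 1, . . . , D.

def moore_bound(delta, diameter):
    # 1 + sum over k of delta * (delta - 1)^(k - 1) reachable processors
    return 1 + sum(delta * (delta - 1) ** (k - 1) for k in range(1, diameter + 1))

print(moore_bound(4, 4))   # 161: a dimension-4 hypercube (16 nodes) is far below it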
A short discussion of the topologies from Figure 3.1 follows along with a
summary of their main characteristics in Table 3.1.
• Clique (Figure 3.1(a)). This is the ideal network as the distance be-
tween any two processors is equal to 1. However, the number k of links
per processor is equal to p − 1 and thus the total number of links Nl is
equal to p(p − 1)/2. This type of network is only feasible with a limited
number of nodes at reasonable cost. Thanks to routing techniques, some
of which we discuss later in this chapter, any interconnection network
can always been thought of as a clique by the application programmer.
But in this case communication times are larger and there is a possibil-
ity of network contention when two messages travel through the same
communication link at the same time.
• Ring (Figure 3.1(b)). This is one of the simplest topologies and is in
fact used as a model for developing many useful parallel algorithms (see
Chapter 4).
• 2-dimensional grid (Figure 3.1(c)). The maximum degree of proces-
sors is equal to 4. The main drawback of a grid is its lack of symmetry.
Indeed, border processors and central processors have different charac-
teristics. The 2-D grid is well suited to, for instance, image processing
problems where computations are local and communications are typi-
cally performed between neighbors. Another interesting aspect of 2-D
grids is their scalability with regard to practical implementation: pro-
cessors only need to be added at the edges.
• 2-dimensional torus (Figure 3.1(d)). This one is easily derived from
the 2-D grid by connecting edge processors with each other. The di-
ameter is thus halved with the same degree. The connectivity can even
be increased with a 3-D torus like the one in the Cray T3D platform.
• Hypercube (Figure 3.1(e)). This topology has been used extensively
in the previous decades. Due to its recursive definition, one can design
simple yet efficient algorithms that operate dimension by dimension.
Another attractive aspect of this topology is its low diameter (logarith-
mic in the number of nodes). However, the total number of links grows
too quickly to be of practical interest for massively parallel machines.
A hypercube is characterized by its dimension d, where p = 2d .
TABLE 3.1: Main characteristics of classical topologies.
Topology     Num. of proc. p   Degree k     Diameter D     Num. of links Nl   Bisection width LB
Clique       p                 p − 1        1              p(p − 1)/2         (p/2)²
Ring         p                 2            ⌊p/2⌋          p                  2
2-D Grid     p                 2 → 4        2(√p − 1)      2p − 2√p           √p
2-D Torus    p                 4            2⌊√p/2⌋        2p                 2√p
Hypercube    p = 2^d           d = log(p)   d = log(p)     p log(p)/2         p/2
The time needed to send a message of size m from a processor Pi to a processor Pj is often modeled as
ci,j(m) = Li,j + m/Bi,j ,
where Li,j is the start-up time (also called latency) expressed in seconds and
Bi,j is the bandwidth expressed in bytes per second. For convenience, we also
define bi,j as the inverse of Bi,j . Li,j and Bi,j depend on many factors: the
length of the route, the communication protocol, the characteristics of each
network link, the software overhead, the number of links that can be used
in parallel, whether links are half-duplex or full-duplex, etc. This model was
proposed by Hockney [68] to model the performance of the Intel Paragon and
is probably the most commonly used.
Such a protocol results in a rather poor latency and bandwidth. The communication cost can be reduced by using pipelining. The message is split into r packets of size m/r. Packets are sent one after the other from Pi to Pj. The first packet reaches node j after ci,j(m/r) time units and the remaining r − 1 packets arrive immediately after, one after another, in (r − 1)(L + (m/r)·b) time units. The total communication time is thus equal to
(d(i, j) − 1 + r) · (L + (m/r)·b) .
One can then seek the value of r that minimizes the previous expression. This
is a minimization problem of the form (γ + α·r)(δ + β/r) with four constants α, β, γ, and δ. Removing all constant terms, this boils down to minimizing αδ·r + γβ/r, and hence the optimal value of r, ropt, is:
ropt = √(γβ/(αδ)) .
Indeed, the sum of two terms whose product is constant is minimized when
the two terms are equal. This is the famous theorem of the goat in a pen
whose perimeter, if its area is fixed, is minimal if the pen is a square (The
reader unimpressed by bucolic analogies can use the function’s derivative to
calculate ropt analytically.) The minimum value obtained with ropt is:
(γ + α·ropt)(δ + β/ropt) = ((γ/δ)·(δ + √(δβα/γ))) · (δ + √(αδβ/γ))
                         = (√(γ/δ)·(δ + √(δβα/γ)))²
                         = (√(γδ) + √(βα))² .
Back to the communication problem, with γ = d(i, j) − 1, α = 1, δ = L, and β = mb, the optimal pipelined time is
(√((d(i, j) − 1)·L) + √(mb))² = mb + O(√(mb)) = m/B + O(√(m/B)) .
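The sketch below (ours, with made-up parameter values) evaluates the pipelined time (d(i, j) − 1 + r)(L + (m/r)·b) for a given number of packets r and for the optimal r derived above.

from math import sqrt

def pipelined_time(d, L, b, m, r):
    # Time to send m data units over d hops, split into r packets.
    return (d - 1 + r) * (L + (m / r) * b)

def optimal_packets(d, L, b, m):
    # r_opt = sqrt(gamma*beta / (alpha*delta)) with gamma = d - 1, alpha = 1,
    # delta = L, beta = m*b, rounded to a positive integer.
    return max(1, round(sqrt((d - 1) * m * b / L)))

d, L, b, m = 4, 1e-5, 1e-9, 1e6             # made-up: 4 hops, 10 us latency, 1 GB/s
print(pipelined_time(d, L, b, m, 1))        # no pipelining
r = optimal_packets(d, L, b, m)             # here r = 17
print(pipelined_time(d, L, b, m, r))        # close to m/B = 1e-3 s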
(Figure: store-and-forward, circuit-switching, and wormhole protocols: the header and the data of a message traveling from P0 to P3 over time.)
Such protocols have been used extensively for the dedicated networks in parallel
platforms, with good buffer management, almost no message loss and thus no
need for flow-control mechanisms. By contrast, in large networks (e.g., on the
Internet) there is a strong need for congestion avoidance mechanisms. Proto-
cols like TCP split messages just like wormhole protocols but do not send all
packets at once. Instead, they wait until they have received some acknowl-
edgments from the destination node. To this end, the sender uses a sliding
window of pending packet transmissions whose size changes over time de-
pending on network congestion. The size of this window is typically bounded
by some operating system parameter Wmax and the achievable bandwidth is
thus bounded by Wmax /(Round Trip Time). Such flow-control mechanisms
involve a behavior called slow start: it takes time before the window gets full
and the maximum bandwidth is thus not reached instantly. One may thus
wonder whether the Hockney model is still valid in such networks. It turns
out that in most situations the model remains valid but Li,j and Bi,j cannot
be determined via simple measurements.
P0
o o o o
g g g L
P1 o
L L L
g g
P2 o o o
o
g L
P3 o
time
captures the bandwidth for long messages. One may however wonder whether
such a model would still hold for average-size messages. pLogP [74] is an
extension of LogP when L, o and g depends on the message size m. This
model also introduces a distinction between the sender and receiver overhead
(os and or ).
Affine models
One drawback of the LogP models is the use of floor functions to account for
explicit MTU. The use of these non linear functions causes many difficulties
for using the models in analytical/theoretical studies. As a result, many fully
linear models have also been proposed. We summarize them through the
general scenario depicted in Figure 3.6.
(nw)
ci,j (m)
(s)
bi,j · m
(s)
sending node Pi Li,j
(nw)
bi,j ·m
(nw)
link ei,j Li,j
(r)
bi,j · m
(r)
receiving node Pj Li,j
time
Banikazemi et al. [12] propose a model that is very close to the general model
presented above. They use affine functions to model the occupation time of
the processors and of the communication link. The only minor difference is
that they assume that the time intervals during which Pi is busy sending (of
(s) (r)
duration ci,j (m)) and during which Pj is busy receiving (of duration ci,j (m))
do not overlap. Consequently, they write
(s) (nw) (r)
ci,j (m) = ci,j (m) + ci,j (m) + ci,j (m) .
k-ports – A node may have k > 1 network cards and a possible extension
of the 1-port model is thus to consider that a node cannot be involved in more
than an emission and a reception on each network card. This model will be
used in Chapters 4 and 5.
We also have
∀r ∈ R : %r > 0 .
The network protocol will eventually reach an equilibrium that results in an
allocation % such that A% 6 B and % > 0. Most protocols actually optimize
some metric under these constraints. Recent works [88, 86] have successfully
characterized which equilibrium is reached for many protocols. For example
ATM networks recursively maximize a weighted minimum of the %r whereas
some versions of TCP optimize a weighted sum of the log(%r ) or even more
complex functions.
Such models are very interesting for performance evaluation purposes but
they almost always prove too complicated for algorithm design purposes. For
72 Chapter 3. Networking
this reason, in the rest of this book we use the Hockney model or even sim-
plified versions of it model (e.g., with no latency) under the multi-port or the
1-port model. These models represent a good trade-off between realism and
tractability.
...
Pp−1
P3
P0
P2
P1
Each processor has its own local memory. All processors execute the same
program, which operates on the data in their respective local memories. This
mode of operation is common and termed SPMD, Single Program Multiple
Data. To access data that is not in their local memories, processors must
communicate via explicit sending and receiving of messages. This mode of
operation is also common and is termed Message Passing. Each processor can
send a message to its successor on the ring by calling the following function:
SEND(addr , m),
where addr is the address (in the memory of the sender processor) of the
first data element to be sent and m is the message length expressed as a
number of data elements. For simplicity we assume that the data elements
of a message must be contiguous in memory, starting at base address addr .
Note that message passing implementations, for instance those of the MPI
3.3. Case Study: the Unidirectional Ring 73
RECEIVE(addr , m).
Both functions above and others allowing processors to communicate are often
termed “communication primitives.” A few important points must be noted:
• Calls to communication primitives must match: if processor Pi executes
a RECEIVE, then its predecessor (processor Pi−1 mod p ) must execute a
SEND. Otherwise, the program cannot terminate.
• Since each processor has a single outgoing communication link there is no
need for specifying the destination processor: it is always the successor of
the sending processor. Similarly, when executing a RECEIVE, the source
of the incoming message is always the predecessor processor. For a
more complex logical topology, one may have to specify the destination
processor in the SEND and the source processor in the RECEIVE.
• The address of the first data element in a message, which is passed to
SEND, is an address in the local memory of the sender processor. Simi-
larly, the address passed to RECEIVE is in the local memory of the receiver
processor. Therefore, even if the program uses the same variable for stor-
ing both addresses (remember that we are in an SPMD execution), the
value of this variable will most likely be different in both processors.
One can of course use two different variables, as shown below:
q ←MY NUM()
if q = 0 then SEND(addr1 , m)
if q = 1 then RECEIVE(addr2 , m)
So far we have not said anything regarding the semantics of the communication
primitives. There are three standard assumptions:
• A rather restrictive assumption consists in assuming that each SEND and
RECEIVE is blocking, i.e., the processor that calls one of these communica-
tion primitives can continue its execution only once the communication
has completed. This completely synchronous message passing mode,
which is also called “rendez-vous,” is typical of first generation parallel
computing platforms.
• A classical assumption is to keep the RECEIVE blocking but to assume that
the SEND is non-blocking. This allows a processor to initiate a send but to
continue execution while the data transfer takes place. Typically, this
is implemented via two functions: one to initiate the communication,
and the other to check whether the communication has completed. In
74 Chapter 3. Networking
3.3.1 Broadcast
For a given processor index k, we wish to write a program by which pro-
cessor Pk sends the same message, of length m, to all other processors: this
operation is called a broadcast. This is a fundamental collective communi-
cation primitive. For instance, the sender processor can be a “master” that
broadcasts general information (e.g., problem size, input data) to all other
“worker” processors.
At the beginning of the program the message is stored at address addr
in the memory of the sender, processor Pk . At the end of the program the
message will be stored at address addr in the memory of each processor. All
processors must call the following function:
3.3. Case Study: the Unidirectional Ring 75
The main idea is to have the message go around the ring, from processor
Pk to processor Pk+1 , then from processor Pk+1 to processor Pk+2 , etc. For
the remainder of the chapter, we will often implicitly assume that processor
indices are to be taken modulo p. There is no parallelism here since the SEND
and the RECEIVE executed by each processor are not independent. Therefore,
the whole program is written as shown in Algorithm 3.1. It is important to
note that the predecessor of the sender processor, i.e., processor Pk−1 , must
not send the message. This is not only because processor Pk already has the
message, since it is the sender processor, but also because it does not execute
a RECEIVE in the program as we have written it.
1 BROADCAST(k, addr , m)
2 q ←MY NUM()
3 p ←NUM PROCS()
4 if q = k then
5 SEND(addr , m)
6 else
7 if q = k − 1 mod p then
8 RECEIVE(addr , m)
9 else
10 RECEIVE(addr , m)
11 SEND(addr , m)
For the program to be correct, we assume that the RECEIVE is blocking, since
processors forward the message with a SEND immediately after the RECEIVE.
For this program, the semantics of the SEND does not matter. Since we have
a sequence of p − 1 communications, the time needed to broadcast a message
of length m is (p − 1)(L + mb).
Note that message passing implementations, for instance those of the MPI
standard [110], typically do not use a ring topology for implementing collec-
tive communication primitive such as a broadcast but rather use various tree
topologies. Logical tree topologies are more efficient on modern parallel plat-
forms for the purpose of collective communications. Nevertheless, we use a
ring topology in this chapter for two reasons. First, a linear topology such as
a ring makes for straightforward collective communication algorithms while
highlighting key general concepts such as communication pipelining. Second,
we will see in Chapter 4 that when designing algorithms for a logical ring
76 Chapter 3. Networking
3.3.2 Scatter
We now turn to the scatter operation by which processor Pk sends a different
message to each processor. To simplify we assume that all sent messages have
the same length, m. A scatter is useful, for instance, to distribute different
data to worker processors, such as matrix blocks or parts of an image.
At the beginning of the execution, processor Pk holds the message to be
sent to processor Pq at address addr [q]. To keep things uniform, we assume
that there is a message at address addr [k ] to be sent by processor Pk to itself.
(Such a convention is used in standard communication libraries as it turns
out to be convenient in practice.) At the end of the execution, each processor
holds its own message at address msg.
The simple but key idea to implement the scatter operation is to pipeline
message sending, starting with the message to be sent to the furthest proces-
sor, that is processor Pk−1 . While this message is on its way, other messages
to be sent to closer processors can be sent as well.
that the SEND is non-blocking while the RECEIVE is blocking. We will make
this common assumption for the vast majority of our algorithms. Finally,
note that instructions tempS ↔ tempR and msg ← tempR are mere pointer
updates, and not memory copies.
The time for the scatter is the same as for the broadcast, i.e., (p − 1)(L +
mb). Indeed, pipelining of the communications on the rings allows for several
communication links to be used simultaneously. Therefore, it is possible to
send p − 1 messages in the same time as one, provided that their destinations
are p − 1 consecutive processors along the networks.
3.3.3 All-to-all
1 BROADCAST(k, addr , m)
2 q ←MY NUM()
3 p ←NUM PROCS()
4 if q = k then
5 for i = 0 to r − 1 do
6 SEND(addr [i], m/r)
7 else if q = k − 1 mod p then
8 for i = 0 to r − 1 do
9 RECEIVE(addr [i ], m/r)
10 else
11 RECEIVE(addr [0], m/r)
12 for i = 0 to r − 2 do
13 SEND(addr [i], m/r) || RECEIVE(addr [i + 1], m/r)
14 SEND(addr [r − 1], m/r)
To determine the execution time, we just need to determine when the last
processor, Pk−1 , receives the last message piece. There must be p − 1 com-
munication steps for the first message piece to arrive at Pk−1 , which amounts
to time (p − 1)(L + m r b). Then, all remaining r − 1 message pieces arrive
subsequently, in time (r − 1)(L + m r b). Therefore, the overall execution time
is the sum of the two: m
(p + r − 2) L + b .
r
We seek the value of r that minimizes the previous expression. Using the “goat
3.4. Case Study: the Hypercube 79
q
in a pen” theorem, we obtain ropt = m(p−2)b
L and the optimal execution time
is:
p √ 2
(p − 2)L + mb . (3.2)
In this expression, p, L, and b are all fixed. Therefore, for long messages,
the expression tends to mb, which does not depend on p! This is the same
principle as the one used in IP networks, which split messages into many small
packets to improve throughput over multi-hop paths thanks to pipelining of
communications over multiple communication links.
Then, the 0-th, 1-st, . . . , j − 1-th in this order. Let us call γ0 the path we
have just built. Let us assume that γj0 and γj1 share a common vertex X.
The lengths of the subpaths from A to X in γj0 and γj1 are equal otherwise
we could build a shorter path from A to B. However, for any j and k, the
first adjusted bits in γj is Sj,k = {j, . . . , j + k − 1} if j + k − 1 < n and
Sj,k = {j, . . . , n − 1} ∪ {0, . . . , k − (n − j)} otherwise. Therefore, Sj0 ,k = Sj1 ,k
if and only if j0 = j1 . Lastly, it is impossible to find more independent paths.
Any path starts by adjusting a bit, say the j-th, and thus is in conflict with
path γj . Note that i independent paths only reach i(i − 1) + 2 vertices and
the bandwidth available from A is fully used only when i = n.
available (useful) link, it has no other choice but to wait for one of them to be
available again. This algorithm is harder to implement than one may think
at first glance. One needs to ensure the fairness of the routing protocol or at
least to avoid starvation: a communication should not be delayed indefinitely
on a loaded network. One also needs to rebuild a message upon reception
since packets may follow different routes and thus may arrive out of order.
Lastly, one needs to ensure that all messages are delivered to their destination,
which is easy with the static algorithm but could become rather nightmarish
if messages are allowed to travel backwards: if links 1 and 2 are busy at
processor A in our previous example, one could try to use link 0 even if it
extends the route.
The static algorithm was implemented by Intel on the iPSC2. The dynamic
algorithm was also implemented but on the Paragon’s 2-D grid but not on a
hypercube. It uses the same idea: a message sent from processor (x, y) to
processor (x0 , y 0 ) is labeled with x0 − x, y 0 − y assuming that x0 > x and
y 0 > y. A simple static routing procedure is obtained by forwarding the
message horizontally x0 −x times and then vertically y 0 −y times. It is however
possible to use any Manhattan path. The message can use the vertical link
whenever the horizontal one is busy and conversely. The only restriction is to
never go out the rectangle defined by the source and the destination.
where
(r)
We denote by gi the i-th element of the Gray code of dimension r.
The Gray code Gn makes it possible to define a ring of 2n nodes in the
n-cube. Let us assume that this is true for n − 1, with n > 2 (this is trivial
for n = 1). Consecutive elements in position 0 to 2n−1 − 1 differ by a single
bit (due to the recursion). Similarly, consecutive elements in position 2n−1 to
2n − 1 also differ by a single bit. Finally, since the last element of Gn−1 is
equal to the first element of Grev
n−1 , elements 2
n−1
− 1 and 2n−1 also differ by
a single bit. All elements are thus connected.
To embed a 2r × 2s 2-D torus in a n-cube, with r + s = n, we can simply use
the Cartesian product Gr ×Gs . Processor with coordinates (i, j) on the grid is
(r) (s)
mapped to processor f (i, j) = (gi , gj ) in the n-cube. Horizontal neighbors
(r) (s)
(i ± 1, j) are mapped to f (i ± 1, j) = (gi±1 , gj ), where index i ± 1 is taken
modulo 2r : these nodes are indeed neighbors of f (i, j) in the n-cube. We can
prove likewise that vertical neighborhood is preserved. This construction can
easily be generalized to a 2r × 2s × 2t 3-D torus, where r + s + t = n.
0000
1111
FIGURE 3.9: Broadcast using a spanning tree on a hypercube for n = 4.
Let us denote by BIT(A, b) the value of the b-th bit of processor A. The
broadcast of a message of length m by processor k can be done with Algo-
rithm 3.5.
As there are n steps, the execution time with the store and forward model
is n(L + mb). As at each step, a processor communicates with only one other
processor, making the algorithm valid for the 1-port model. We could consider
splitting the message in packets and pipelining them to improve the execution
time as seen in Sections 3.2.2 and 3.3.4. It would however only work in the n-
port model. Indeed, as soon as the root of the broadcast reaches steady-state
it needs to send its packets simultaneously to all its neighbors.
3.4. Case Study: the Hypercube 85
1 BROADCAST(k, addr , m)
2 q ← MY NUM()
3 n ← log(TOT PROC NUM())
{ Update pos to work as if P0 was the root of the broadcast }
4 pos ←q XOR k
{ Find the rightmost 1 }
5 first1 ←0
6 while ((BIT(pos, first1 ) = 0) And (first1 < n)) do
7 first1 ←first1 + 1
{ Core of the algorithm }
8 for phase= n − 1 to 0 do
9 if (phase=first1 ) then RECEIVE(phase, addr , m)
10 else if (phase<first1 ) then SEND(phase, addr , m)
4. After this XOR operation, node 0 is a leaf in each tree Ti . We can then
merge these trees in a single tree rooted at 0 (see Figure 3.10(b)).
This broadcast works only when assuming all links are bidirectional: 101
communicates with 100 in the leftmost subtree while 100 communicates with
101 in the rightmost subtree (see Figure 3.10(b)).
Using optimal pipelining on a single tree, this new algorithm improves the
execution time to:
√ p 2
mb + (n − 1)L .
86 Chapter 3. Networking
A0 A1 A2
000 000 000
000
This bound cannot be reached: only one bit has been initially sent so how
could the end of the message arrive right after the first bit at no cost? The
study of the complexity of collective communications has led to many research
articles published in the 80s: the difficulty comes from the non-linear term L
that precludes the use of flow theory.
this problem is not particularly difficult in, say, a dedicated parallel platform
with 128 peers interconnected via a hypercube topology.
Peer-to-peer systems and applications have caught the interest of many
computer science researchers in the last decade. Depending on how the peers
in the overlay network are linked to each other, one can classify a P2P net-
work as unstructured or structured. Unstructured networks are generally built
using randomized protocols: new peers randomly connect to some peers and
update their neighbor sets over time based on random walks. The graph of
the corresponding overlay network is thus generally modeled well by random
graphs. Data or peer localization is achieved via “flooding”, i.e., by broadcast-
ing the request to the whole networks with a bound on the maximum number
of hops through the network. Flooding is known to be very inefficient. Struc-
tured networks are built using a globally consistent and deterministic protocol.
This protocol ensures that the graph of the overlay network has a particular
structure that facilitates efficient routing. We present briefly a few important
ideas underlying structured networks, and refer the reader to [112] for a com-
prehensive review of developments in peer-to-peer networking over the last
decade.
Let N , ∆, and D respectively denote the size, the maximum degree, and
the diameter of the network. Using Moore’s bound (see Section 3.1.2), we
easily obtain:
log N
D=Ω .
log ∆
Several structures can thus be considered:
log N
• With degree O(log N ), we obtain a route length of at least Ω log log N .
In practice, many systems have D = Θ(log N ), which is not optimal but
convenient. Hypercubes have typically such a degree and such a diame-
ter. They also have a large bisection and have thus been a great source
of inspiration to design peer-to-peer protocols (e.g., Chord [113], Pas-
try [104], Tapestry [121]). Overlay networks with an optimal diameter
have later been proposed [71] but the routing protocol is very complex
and difficult to implement.
used and advertised slightly later and have thus not received as much
attention.
• With degree O(N α ) we obtain a diameter of Ω(1) but such a high degree
is far too large to be of practical interest.
3.5.2 Chord
In the chord system [113], the set of keys is {0, . . . , 2m − 1} and the peers
are organized in a virtual ring. If peers n1 and n2 are two consecutive peers on
this ring, then peer n2 is responsible for all keys that fall between n1 and n2 .
Finding a key in this network would obviously be very time consuming since its
diameter is O(N ). This is why there are “shortcuts.” Node n is connected to all
peers responsible for key n + 2i [2m ] for all i ∈ {0, . . . , m − 1} (see Figure 3.11).
The lookup is thus easily done in O(log N ) hops by always forwarding the
request to the peer with the closest smallest ID (the distance to the destination
is halved at each hop). Also, the routing table, meaning the data structure
that holds the neighbor set, remains relatively small (O(log N )).
N1 N1
Finger table
N8 N8
+1 N8 + 1 N14 lookup(54)
N56 N8 + 2 N14 N56
+2 N8 + 4 N14
K54
+4 N8 + 8 N21
N51 N51
+32 +8 N14 N8 +16 N32 N14
N8 +32 N42
N48 +16 N48
K10
N21 N21
K38
N42 K30 K24 N42
N38 N38
N32 N32
by the hash h(t) of the corresponding string. As previously, the peer nt whose
ID is closer to h(t) is responsible for topic t. Any participant willing to sub-
scribe to this topic can thus route a message to h(t) to inform nt about its
subscription. Let us denote by S(t) the set of subscribers to topic t. Merging
the set of routes from s ∈ S(t) to nt we obtain a tree rooted at nt that can be
used backward to multi-cast notification messages from nt to every subscriber
s ∈ S(t). In practice, when a peer n subscribes to topic t the routing ends as
soon as it encounters a peer belonging to the tree.
This approach has many interesting features. First, nt does not need to
know the whole set of subscribers. Indeed, routes from the subscribers to the
owner of the topic often merge very soon and thus the tree is well balanced
in practice. Peers responsible for hot topics are thus overloaded neither by
requests nor by notifications. Second, the upper part of the tree is made of
geographically diverse peers, whereas the leaves are generally close to each
other within a subtree. Therefore, the more communication-intensive part of
the execution consists of a large group of efficient and mostly independent
communications.
Bibliographical Notes
The famous PVM message passing library [59] provided primitives for both
synchronous and asynchronous communications. The current standard for
message passing is MPI [110], which offers a rich set of collective communi-
cation primitives. A large part of the material in this chapter is inspired by
Desprez’s thesis [51] and by the book by Gengler, Ubéda and Desprez [60].
The book by Culler and Singh [48] is a great reference to find other exam-
ples of collective communication algorithms on grids or hypercube and more
information on commutation networks and distributed memory architectures.
Most work regarding peer-to-peer architectures is very recent. Our section on
this topic takes a very particular point of view that is strongly related to the
preceding sections on interconnection networks and routing. [112] provides a
comprehensive overview of developments in peer-to-peer networking over the
last decade.
3.6. Exercises 93
3.6 Exercises
Exercise 3.1 : Torus, Hypercubes and Binary Trees
1. What is the difference between a 6-cube and a 4 × 4 × 4 3-D torus?
2. What is the number of paths (with minimal length) from (x1 , y1 ) to (x2 , y2 )
in a n × n torus?
st
st
st
ag
ag
ag
ag
e0
e1
e2
e3
row 000
row 001
row 010
row 011
node (011,3)
row 100
row 101
row 110
row 111
FIGURE 3.13: BU T (3): the butterfly network of order 3.
1. What kind of network is obtained when all the vertices of a given row are
grouped?
2. The butterfly network is built recursively. Give two ways to split a but-
terfly network of order r into two butterfly networks of order r − 1.
3. Prove that there is a unique path of length r between any vertex hw, 0i
from stage 0 and any vertex hw0 , ri from stage r. What is the diameter of
BU T (r)?
st
st
st
st
st
st
ag
ag
ag
ag
ag
ag
ag
e0
e1
e2
e3
e4
e5
e6
row 000
row 001
row 010 Configuration 1
row 011
row 100
Configuration 2
row 101
row 110
row 111
4. Prove (by induction on the order r of the network) that Beneš networks
can be configured to implement any permutation: given an arbitrary permu-
tation π of {0, . . . , 2r − 1}, there is a configuration of switches for establishing
simultaneous routes from any input i to the corresponding output π(i).
Figure 3.15 depicts a configuration for the Beneš network of order 2 imple-
menting permutation π = (0, 1, 2, 3) → (3, 1, 2, 0).
96 Chapter 3. Networking
stage 0 stage 2
0 0
1 1
2 2
3 3
FIGURE 3.15: Configuration of Beneš network of order 2 for permutation
π = (0, 1, 2, 3) → (3, 1, 2, 0).
1. What is the degree and the diameter of B(m)? Propose a simple and
optimal routing algorithm in the De Bruijn graph.
2. Assume now that the network is undirected (i.e., that αxβ is connected
with 0αx, 1αx, xβ0, and xβ1). What is the diameter of the undirected
B(m)? Assuming that it is possible to easily compute the longest common
sub-sequence between u and v ∈ {0, 1}m , propose a simple and optimal rout-
ing algorithm in the De Bruijn graph.
3.7 Answers
Exercise 3.1 (Torus, Hypercubes and Binary Trees)
. Question 1. There is no difference: both architectures are the same! One
can easily check that their diameter and degree are equal to 6 in both cases.
The embedding of the torus in the hypercube with Gray codes, as seen in
Section 3.4.3, is an isomorphism: each processor has exactly 6 neighbors in
both networks. One can also easily check that a 4-cube is strictly equivalent
to a 4 × 4 2-D torus.
To bound the diameter, we can try to bound the distance between to arbi-
trary vertices. As we can XOR the hypercube components, this amounts to
finding a path from vertex h0, 0i to a vertex hw, ii.
Like for the hypercube, we adjust the bits of w starting for example from the
rightmost ones. For each 1 in w, we go through the corresponding hypercube
link. To this end, we will however have to move forward on one or more
vertices of the ring to reach the desired hypercube link. We can thus reach
the ring labeled by w in at most 2m − 1 steps: at most m steps on the
hypercube links and at most m − 1 steps on the ring’s links. Then, we need
to reach position i on the ring. The diameter of a ring of size m is b m2 c. The
upper bound on the diameter of our CCC(m) is thus D 6 2m − 1 + b m 2 c.
A nice feature is that the degree of such a network is bounded (it is equal
to 3) while its diameter is logarithmic in the total number of processors (we
have m 6 log p).
98 Chapter 3. Networking
. Question 2. The previous upper bound was almost tight. We can easily
check in Figure 3.12 that the diameter of CCC(3) is equal to 6.
Let us now focus on the case m > 3. To go from h0, 0i to hw, ii, we
necessarily go through at least |w| hypercube links, where |w| is the number
of 1’s in w. Let us assume for now that a shortest path goes through exactly
w hypercube links. Let us consider the “boxed” cycle associated to 0 where
vertices corresponding to 1’s in w have been boxed. For example, if m = 8
and w = 11000110, we have:
0 → 1 → 2 →3
↑ ↓
7 ← 6 ← 5 ←4
We will go through a hypercube link as soon as we arrive on a boxed vertex.
Thus, we need to find a shortest path from 0 to i going through all boxed
vertices. The total length of the path from h0, 0i to hw, ii is then equal to |w|
plus the length of the path on the boxed ring. If i = 5, a shortest path on the
boxed cycle is:
0 → 1 → 2 → 1 → 0 → 7 → 6 → 5,
and its length is 7. By adding |w| = 4 we get the total routing length.
With such a representation, it is clear that going through additional hyper-
cube links is useless. One gets the worst case with w = 11 . . . 1 (all bits are
set to 1) and i = b m2 c: the shortest path from 0 to i that goes through all
links is of length m + b m
2 c − 2. Hence, the result. With our previous example
(m = 8), a shortest path from 0 to 4 going through all vertices is
0 → 7 → 6 → 5 → 6 → 7 → 0 → 1 → 2 → 3 → 4,
whose length is 10 = 8 + 4 − 2.
P0
P1
P2
P3
ring of size 3
ring of size 4
ring of size 5
ring of size 6
bX
2c
p
n2 n2 p2 n2
1 j p k j p k
k L+b 2 = +1 L+b 2 ∼ L+ b.
p 2 2 2 p 8 8
k=1
All these phases can thus be done in parallel, hence a transposition time equal
to:
j q k n2
n2 √
2 · b 2 + L ∼ b√ + L p .
2 q p
This is optimal (in a store and forward mode) since the execution time is equal
to the communication time of the farthest two processors.
but it is not at straightforward. The first network consists of all even rows
and the second one consists of all odd rows. Indeed, removing the last level
amounts to ignoring the last bit of the row indices.
1 1
2 Upper 2
3 half network 3
4 4
5 5
6 6
Lower
7 half network 7
8 8
Parallel Algorithms
103
Chapter 4
Algorithms on a Ring of Processors
for i = 0 to n − 1 do
yi ← 0
for j = 0 to n − 1 do
yi ← yi + Ai,j × xj
105
106 Chapter 4. Algorithms on a Ring of Processors
Each iteration of the outer loop (the i loop) computes the scalar product of
one row of A by vector x. Furthermore, all these scalar product computa-
tions are independent, in the sense that they can be performed in any order.
Consequently, a natural way to parallelize a matrix-vector multiplication is
to distribute the computation of these scalar products among the processors.
Let us assume that n is divisible by p, the number of processors, and let us
define r = n/p. Each processor will compute r scalar products to obtain r
components of the result vector y. To do this, each processor must store r
rows of matrix A. It is natural for a processor to store r contiguous such rows,
which is often termed a block of rows, or simply a block row. For instance, one
can allocate the first block row of matrix A to the first processor, the second
block row to the second processor, and so on. The components of vector y
(which are to be computed) are then distributed among the processors in a
similar manner as the rows of matrix A (e.g., the first r components to the
first processor, etc.). These kinds of data distributions are typically called
“1-D distributions,” because arrays are partitioned along a single dimension.
A 1-D data distribution is a natural choice when developing an algorithm on
a 1-D logical topology like a ring or processors.
If one assumes that vector x is fully duplicated across all processors, then
the computations of the scalar products are completely independent since no
input data is shared. But it is usual, in practice, to assume that vector x is
distributed among the processors in the same manner as matrix A and vector
y. This is for reasons of modularity and consistency. Parallel programs often
consist of sequences of parallel operations. Assuming that the input vector is
distributed in the same manner as the matrix and the output vector makes
it possible to perform a second matrix-vector multiplication z = By, where
the input vector is the output vector of the previous computation (assuming
that matrix B is distributed similarly to matrix A). Furthermore, parallel
algorithms are often easier to understand and modify when all data objects
are distributed in a consistent manner.
Each processor holds r rows of matrix A, stored in an array A of dimension
r × n, and r elements of vectors x and y, stored in two arrays x and y of
dimension r. More precisely, processor Pq holds rows qr to (q + 1)r − 1 of
matrix A, and components qr to (q + 1)r − 1 of vectors x and y. Thus, one
needs the following declarations in our parallel program:
var A: array[0..r − 1,0..n − 1] of real;
var x, y: array[0..r − 1] of real;
Using these declarations, element A[0, 0] on processor P0 corresponds to ele-
ment A0,0 of the matrix, but element A[0, 0] on processor P1 corresponds to
element Ar,0 of the matrix. Typically the indices of matrix elements, which
we denote via subscripts, are called the global indices, while the indices of
array elements, which we denote with square brackets, are called the local
indices. The mapping between global and local indices is one of the techni-
cal difficulties when writing parallel programs that operate over distributed
4.1. Matrix-Vector Multiplication 107
A00 A01 A02 A03 A04 A05 A06 A07 x0
P0
A10 A11 A12 A13 A14 A15 A16 A17 x1
A20 A21 A22 A23 A24 A25 A26 A27 x2
P1
A30 A31 A32 A33 A34 A35 A36 A37 x3
A40 A41 A42 A43 A44 A45 A46 A47 x4
P2
A50 A51 A52 A53 A54 A55 A56 A57 x5
A60 A61 A62 A63 A64 A65 A66 A67 x6
P3
A70 A71 A72 A73 A74 A75 A76 A77 x7
arrays. However, for our particular program in this section this mapping is
straightforward: global indices (i, j) correspond to local indices (i − b ri c, j) on
processor Pbi/rc .
Note that faster execution is not the only motivation for the parallelization
of a sequential algorithm on a distributed memory platform. Parallelization
also makes it possible to solve larger problems. For instance, in the case of
matrix-vector multiplication, the matrix is distributed over the local memo-
ries of p processors rather than being stored in a single local memory as in
the sequential case. Therefore, the parallel algorithm can solve a problem
(roughly) p times larger than its sequential counterpart.
The principle of the algorithm is depicted in Figure 4.1, which shows the
initial distribution of matrix A and of vector x, and in Figure 4.2, which shows
the p steps of the algorithm. At each step, the processors compute a partial
result, i.e., the product of a r ×r matrix by a vector of size r. In the beginning
(step 0), each processor Pq holds the q th block of vector x, and can calculate
the partial scalar product that corresponds to a diagonal block of matrix A.
We use a block notation by which x cq (resp. ybq ) is the q th block of size r of
vector x (resp. y), and by which A q,s is the r × r block at the intersection of
d
th th
the q block row and the s block column of matrix A. Using this notation,
each processor Pq can compute ybq = A q,q × x
cq during the first algorithm step.
d
While this computation is taking place, one can do a circular block shift of
vector x among the processors: processor Pq sends x cq to processor Pq+1 . Note
that we assume that processor indices are taken modulo p.
This is shown in Figure 4.2 with communicated blocks stored into buffer
tempR. As mentioned earlier in Section 3.3, we assume that sending, receiv-
ing, and computing can all occur concurrently at a processor as long as they
are independent of each other. By the beginning of step 1, processor Pq has
received x[ q−1 . (Note that, like for processor indices, we implicitly assume
that block indices are taken modulo p.) Processor Pq can therefore compute
ybq = ybq + A\q,q−1 × x
[ q−1 , and at the same time it can participate in the circular
block shift of vector x. At the end of the pth step, array y in processor Pq con-
tains block ybq , i.e., the desired r components of the result: yqr , . . . , yqr+r−1 .
108 Chapter 4. Algorithms on a Ring of Processors
A00 A01 • • • • • • x0 x6
P0 tempR ←
A10 A11 • • • • • • x1 x7
• • A22 A23 • • • • x2 x0
P1 tempR ←
• • A32 A33 • • • • x3 x1
• • • • A44 A45 • • x4 x2
P2 tempR ←
• • • • A54 A55 • • x5 x3
• • • • • • A66 A67 x6 x4
P3 tempR ←
• • • • • • A76 A77 x7 x5
(a) First step
• • • • • • A06 A07 x6 x4
P0 tempR ←
• • • • • • A16 A17 x7 x5
A20 A21 • • • • • • x0 x6
P1 tempR ←
A 30 A 31 • • • • • • x1 x7
• • A42 A43 • • • • x2 x0
P2 tempR ←
• • A52 A53 • • • • x3 x1
• • • • A64 A65 • • x4 x2
P3 tempR ←
• • • • A74 A75 • • x5 x3
(b) Second step
• • • • A04 A05 • • x4 x2
P0 tempR ←
• • • • A14 A15 • • x5 x3
• • • • • • A26 A27 x6 x4
P1 tempR ←
• • • • • • A36 A37 x7 x5
A40 A41 • • • • • • x0 x6
P2 tempR ←
A A
50 51 • • • • • • x1 x7
• • A62 A63 • • • • x2 x0
P3 tempR ←
• • A72 A73 • • • • x3 x1
(c) Third step
• • A02 A03 • • • • x2 x0
P0 tempR ←
• • A12 A13 • • • • x3 x1
• • • • A24 A25 • • x4 x2
P1 tempR ←
• • • • A34 A35 • • x5 x3
• • • • • • A46 A47 x6 x4
P2 tempR ←
• • • • • • A56 A57 x7 x5
A60 A61 • • • • • • x0 x6
P3 tempR ←
A70 A71 • • • • • • x1 x7
(d) Fourth (and last) step
This parallel algorithm is shown in Algorithm 4.1. Note our use of the ||
symbol to indicate actions that occur concurrently at a processor.
The reader will notice that the circular block shift of vector x in the last
step is not absolutely necessary. But it allows all processors to hold at the end
of the execution the block of x that they held initially, which may be desirable
for subsequent operations.
The computation of the execution time of this algorithm on p processors is
straightforward. There are p identical steps. Each step lasts as long as the
longest of the three activities performed during the step: compute, send, and
receive. Since the time to send and the time to receive are identical, T (p) can
be written as:
T (p) = p max{r2 w, L + rb} ,
where w is the computation time for one basic operation (in this case multi-
plying two vector components and adding the result to another vector com-
ponent), b is the inverse of the data transfer rate measured in number of basic
data units (in this case one vector component) per second, and L is the com-
munication start-up cost. Since r = n/p, for a given number of processors
p the computation component in the equation above is asymptotically larger
2
than the communication component as n becomes large ( np2 w L + np b).
This means that our algorithm achieves an asymptotic parallel efficiency of 1
as n becomes large. Note that if we had opted for a full duplication of vector
x across all processors, the disadvantages of which were discussed earlier, then
there would be no need for any communication at all. Therefore, the parallel
110 Chapter 4. Algorithms on a Ring of Processors
for i = 0 to n − 1 do
for j = 0 to n − 1 do
Ci,j ← 0
for k = 0 to n − 1 do
Ci,j ← Ci,j + Ai,k × Bk,j
15 tempS ↔tempR
processor Pq holds the desired blocks of the result matrix C. The algorithm is
shown in Algorithm 4.2. This algorithm is very similar to the one for matrix-
vector multiplication, essentially replacing the scalar products by sub-matrix
multiplications and replacing the circular shifting of vector components by
circular shifting of matrix rows. Consequently, the performance analysis is
also similar. The algorithm proceeds in p steps. Each step lasts as long as the
longest of the three activities performed during the step: compute, send, and
receive. Therefore, using the same notations as in the previous section, T (p)
can be written as:
T (p) = p. max{nr2 w, L + nrb} .
Here again, the asymptotic parallel efficiency when n is large is 1. It is in-
teresting to note that the matrix-matrix multiplication could be achieved by
executing the matrix-vector multiplication algorithm n times. With this more
naı̈ve algorithm the execution time would be simply the one obtained in the
previous section multiplied by n:
The only difference between T (p) and T 0 (p) is the term L, which has become
nL. While in the naı̈ve approach processors exchange vectors of size r, in
the algorithm developed in this section they exchange matrices of size r × n,
thereby reducing the overhead due to network start-up cost L. This change
112 Chapter 4. Algorithms on a Ring of Processors
does not reduce the asymptotic efficiency, but it can be significant in practice.
This notion of sending data in bulk is a well-known and general principle
used to reduce communication overhead in parallel algorithms, and we will
see it again and again in this chapter. Furthermore, with modern processors,
operating on blocks of data is often better for performance as it exploits a
processor’s memory hierarchy (registers, multiple levels of cache, main mem-
ory) to the best of its capabilities. As a result, operating on blocks of data
improves both computation and communication performance. Note that in
this algorithm the total number of data elements sent over the network is
p × p × n × r = p × n2 (recall that r = n/p). We will see how this total amount
of communication can be reduced.
NW N NE
W c E
SW S SE
W c
FIGURE 4.4: A simple stencil in which cell c is updated using the values
of its West and North neighbors.
Greedy Algorithm
A first idea to parallelize our stencil algorithm is to use a greedy approach
by which processors send the cells they compute to their neighbors as early as
possible. Such a parallelization reduces the start-up latencies (i.e., processors
114 Chapter 4. Algorithms on a Ring of Processors
j
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
i
n2
T (n, p) = p − 1 + (w + b + L) .
p
This algorithm was designed so that the time between a cell value compu-
4.3. A First Look at Stencil Applications 117
tation and its reception by the next processor is as short as possible. But
a glaring problem is that the algorithm performs many communications of
data items that may be small. In practice, the communication startup cost
L can be orders of magnitude larger than b if the size of a cell value is small.
In the case of stencil applications the cell value is often as small as a single
integer or a single floating-point number. Furthermore, w may not be large.
It may be comparable to or smaller than L. Indeed, cell value computations
can be very simple and involve only a few operations. This is the case for
many numerical methods that update cell values based on simple arithmetic
formulae. Therefore, for many applications, a large fraction of the execution
time T (n, p) could be due to the L terms in the previous equation. Spending a
large fraction of the execution time in communication overhead reduces paral-
lel efficiency significantly. This can be seen more plainly by simply computing
the parallel efficiency. The parallel efficiency of the algorithm is equal to the
sequential execution time, n2 w, divided by p × T (n, p). When n gets large,
the parallel efficiency tends to w/(w + b + L), which may be well below 1. In
the next section we present two techniques for addressing this problem.
P1 1 2 3 4 5 6 . . .
P2 2 3 4 5 6 7 . . .
P3 3 4 5 6 7 8 . . .
First row of each processor Second row
FIGURE 4.6: Steps of the modified algorithm with k > 1 and p = 4 pro-
cessors.
still in a cyclic fashion. At each step each processor now computes r × k cell
values. We assume that r × p divide n for simpler performance analysis. For
instance, with r = 3, n = 36 and p = 4, we have the following allocation of
rows to processors:
P0 P1 P2 P3
0, 1, 2 3, 4, 5 6, 7, 8 9, 10, 11
12, 13, 14 15, 16, 17 18, 19, 20 21, 22, 23
24, 25, 26 27, 28, 29 30, 31, 32 33, 34, 35
One interesting question is: are there optimal values of k and r? We can
answer this question via performance analysis of the algorithm. We assume
that n > kp so that no processor stays idle and leave the (less relevant in
practice) n < kp case as an exercise for the motivated reader. The algorithm
proceeds in a sequence of steps. At each step at least one processor is involved
in the following activities:
The analysis is very similar to that for our simple greedy algorithm. Here again
we assume that receiving a message is a blocking operation while sending a
message is not. Consequently, the time needed to perform one step of the
algorithm is krw + kb + L. The computation terminates when processor Pp−1
finishes computing the rightmost segment of its last block of rows of cells.
It takes p − 1 algorithm steps before processor Pp−1 can start doing any
computation. At this point, processor Pp−1 will compute one segment of a
block row at each step. There are n2 /(kr) such segments in the domain, and
each processor holds the same number of segments. Therefore, processor Pp−1
computes for n2 /(pkr) steps. Overall, the algorithm runs for p − 1 + n2 /(pkr)
steps and the total execution time T (n, p, r, k) is:
n2
T (n, p, r, k) = p−1+ (krw + kb + L) .
pkr
Let us compare the asymptotic parallel efficiency of this algorithm with that
of the algorithm in Section 4.3.2, whose parallel efficiency was w/(w + b + L).
Dividing the sequential execution time n2 w by p × T (n, p, r, k) we obtain an
asymptotic efficiency of w/(w + b/r + L/rk). In essence, increasing r and
k makes it possible to achieve significantly higher asymptotic efficiency by
reducing communication costs.
While low asymptotic parallel efficiency is important, one may wonder what
values of k and r should be used in practice. It turns out that, for n and p
fixed, one can easily compute the optimal value of k, kopt (r). Equating the
derivative of T (n, p, r, k) to zero and solving for k one obtains the value k 0 (r):
s
0 L
k (r) = n .
p(p − 1)r(rw + b)
Since k 6 n/p, we obtain kopt (r) = min(k 0 (r), n/p). Finally, for given
values of L, w and b, one can compute kopt (r) and inject it in the expression
for T (n, p, r, k). One can then determine the best value for r numerically.
120 Chapter 4. Algorithms on a Ring of Processors
4.4 LU Factorization
In this section, we develop a parallel algorithm that performs the classic LU
decomposition of a square matrix A using the Gaussian elimination method,
which in turn allows for the straightforward solution of linear systems of the
form Ax = b. Readers familiar with the algorithm will notice that we make
two simplifying assumptions so as not to make matters overly complicated.
First, we use the Gaussian elimination method without any pivoting. This as-
sumption is rather inconsequential at least in the case of partial pivoting: since
the columns of A will be distributed among the processors, partial pivoting
would not add any extra communication. Second, our algorithm eliminates
columns of A one after another. This is not realistic for modern comput-
ers as their memory hierarchies would not be exploited to the best of their
capability. Instead, the algorithm should really process several columns at
a time (i.e., column blocks), exactly as in our matrix-matrix multiplication
algorithm (Section 4.2). But from a purely algorithmic standpoint there is no
conceptual difference between computing a single column or a column block,
and the algorithm’s spirit is unchanged. Note however that when writing the
corresponding programs in practice the utilization of column blocks requires
many more lines of code. This is true even for sequential algorithms. For
instance, see the sequential version of the Gaussian elimination algorithm in
the publicly available LAPACK library [8].
Already computed
Already computed
current column
to be
updated
1 for k = 0 to n − 2 do
2 PREP(k) :
3 for i = k + 1 to n − 1 do
4 Aik ← −Aik /Akk
5 for j = k + 1 to n − 1 do
6 UPDATE(k, j) :
7 for i = k + 1 to n − 1 do
8 Aij ← Aij + Aik × Akj
9 BROADCAST(ALLOC(k), buffer , n − k − 1)
10 for j = k + 1 to n − 1 do
11 UPDATE(k, j) :
12 for i = k + 1 to n − 1 do
13 Aij ← Aij + buffer [i − k − 1] × Akj
which was possible with single calls to SEND. The LU decomposition algorithm
however needs to broadcast columns of the array, that is elements that are
not contiguous in memory! Therefore, we place these elements into a helper
array so that a single call to BROADCAST can send matrix elements in bulk. The
alternative would have been to place n individual calls to BROADCAST, one for
each element in the current column of matrix A. This alternative leads to high
overhead due to network latencies (typically much higher than the overhead of
an extra memory copy). Understanding memory layouts is always a good idea
(for instance to improve cache reuse) but is paramount when implementing
distributed memory algorithms to ensure that data is communicated in bulk
as much as possible.
One remaining issue is to map global indices to local indices for accessing
elements of matrix A. Defining r = n/p, each processor stores r columns of
A in its local memory. As the algorithm makes progress, some columns stop
being accessed: after step k, only columns with an index higher than k are
read and/or updated. The classical idea here is for each processor to use a
local index, l, which indicates the next column of the local array that should
be accessed. At the beginning of the execution all processors set l = 0. At
step k = 0, processor ALLOC(0) prepares column 0 by calling function PREP(0).
Its local index will then be incremented to l = 1. When this processor updates
its columns of matrix A, it only updates the r − l last ones. The value of l is
unchanged at other processors. At step k = 1, processor ALLOC(1) increments
its own value of l after calling PREP(1). Using array declarations and array
indices to replace the matrix element specifications in the previous algorithm,
4.4. LU Factorization 123
Given the above, we need to find an allocation that balances both the
memory consumption (all processors must hold the same number of columns
124 Chapter 4. Algorithms on a Ring of Processors
j−1
X 1 1
ops(j) = (n − k − 1) = − j 2 + n − j,
2 2
k=0
r−1
X r−1
X
p(0, P0 ) → u(0, 1, P1 ), p(1, P1 ) → u(1, j, P2 ), p(2, P2 ) → . . .
j=1 j=2
Note that to be precise we use w0 to denote the time for a basic column prepa-
ration operation (one division and one negation), and w to denote the time for
a basic column update operation (one multiplication and one addition). As n
3
grows, the overall execution time becomes asymptotically equivalent to n3p w.
Note that the total number of update operations to be performed is given by
i=n
X
ops(i) ,
i=1
We can improve the above sequence. The key insight is that, instead of
executing all the UPDATE(0, j) followed by PREP(1), and then sending column
k = 1, processor P1 can execute UPDATE(0, 1), then PREP(1), then send column
k = 1, and then execute all remaining UPDATE(0, j) for j = p + 1, 2p + 1, 3p + 1,
etc. From the perspective of the other processors, both sequences of opera-
tions are equivalent. But in the second sequence, column k = 1 is sent earlier,
4.4. LU Factorization 129
which ends up greatly reducing and sometimes removing the pipeline bubbles
mentioned earlier. The basic principle here is, again, to perform communi-
cations as early as possible. The pseudo-code of the look-ahead algorithm is
shown in Algorithm 4.9, omitting the code for the PREP() and UPDATE() inlined
functions. Consequently, we have added an extra argument to these functions
to specify the buffer array that is to be used. Note the use of two different
buffers in this version of the algorithm so that a processor can call PREP() in
step k of the algorithm and then place UPDATE() calls that are “left over” from
step k − 1.
see that the new algorithm is indeed more efficient than the simple pipelined
algorithm from the previous section because it removes most of the pipeline
bubbles. Remember that our depiction of the execution makes the unrealistic
assumption that column preparation, column update, and column commu-
nication all take the same time. Nevertheless, the ability of this look-ahead
algorithm to reduce idle time is well observed in practice.
21 begin { Phase 2 }
22 B [0,0] ←UPDATE(A[0, 0], Nil, A[0, 1], fromP [0], A[1, 0])
23 B [r − 1,0] ←UPDATE(A[r − 1, 0], Nil, A[r − 1, 1], A[r − 2, 0], fromS [0])
24 for j = 1 to n − 2 do
25 B [0,j] ←UPDATE(A[0, j], A[0, j − 1], A[0, j + 1], fromP [j], A[1, j])
26 B [r − 1,j] ←
27 UPDATE(A[r − 1, j], A[r − 1, j − 1], A[r − 1, j + 1], A[r − 2, j], fromS [j])
28 end
29 B [0,n − 1] ←
UPDATE(A[0, n − 1], A[0, n − 2], Nil, fromP [n − 1], A[1, n − 1])
30 B [r − 1,n − 1] ←
UPDATE(A[r − 1, n − 1], A[r − 1, n − 2], Nil, A[r − 2, n − 1], fromS [n − 1])
The second phase of the algorithm consists in computing two rows, and thus
takes time 2nw. The overall execution time of the algorithm, T (n, p), is:
When n becomes large, T (n, p) ∼ wn2 /p. Since the sequential execution time
is wn2 the parallel algorithm’s asymptotic efficiency is 1.
...
SEND(pred, ADDR(A[0, 0]), n) || RECV(succ, fromS , n)
SEND(succ, ADDR(A[r − 1, 0]), n) || RECV(pred, fromP , n)
...
With this new communication phase, the algorithm’s execution time be-
comes:
Bibliographical Notes
In terms of algorithms, a seminal reference is the book by Kumar et al. [77].
The content of this chapter’s section on matrix-vector and matrix-matrix mul-
tiplication belongs to popular parallel computing knowledge. The discussion
and performance modeling of stencil applications are inspired by articles by
Miguet and Robert [91, 92]. Finally, the parallel Gaussian elimination al-
gorithm used in our LU factorization is a classic (see [102] and other cited
references).
4.9. Exercises 139
4.9 Exercises
We start with two classical linear algebra sequential algorithms and their
parallelization on a ring of processors. The third exercise revisits the definition
of parallel speedup and introduces the important notion of scaled speedup.
The fourth exercise is a hands-on implementation of a matrix multiplication
algorithm on a parallel platform using the MPI message-passing standard.
0 . . . 0 a0i,k a0i,k+1 . . . a0i,n cos θ − sin θ 0 . . . 0 ai,k ai,k+1 . . . ai,n
←
0 . . . 0 0 a0j,k+1 . . . a0j,n sin θ cos θ 0 . . . 0 aj,k aj,k+1 . . . aj,n
The reader can easily compute the value of θ so that element aj,k is zeroed
out. The sequential algorithm can be written as follows:
1 GIVENS(A)
2 for k = 1 to n − 1 do
3 for i = n DownTo k + 1 do
4 ROT(i − 1, i, k)
Consider that one rotation executes in one unit of time, independently of the
value of k.
140 Chapter 4. Algorithms on a Ring of Processors
3. Run your program on a parallel platform, e.g., a cluster, and plot the
parallel speedup and efficiency for 2, 4, and 8 processors as functions of the
matrix size, n. Each data point should be obtained as an average over 10
runs.
4. If you have not done so, experiment with non-blocking MPI communi-
cation and see whether there is any impact on your program’s performance
when you overlap communication and computation.
142 Chapter 4. Algorithms on a Ring of Processors
4.10 Answers
Exercise 4.1 (Solving a Triangular System)
. Question 1. If we distribute columns of A among the processors, we must
parallelize operations within rows of A. (Parallelizing the operations within
a column would lead to sequential execution.) For each row, each processor
should contribute some computations for the columns it holds. Consider the
typical sequential algorithm:
1 for i = 0 to n − 1 do
2 s←0
3 for j = 0 to i − 1 do
4 s ← s − ai,j × xj
5 xi ← (bi − s)/ai,i
1 for i = 0 to n − 1 do
2 t←0
3 forall j ∈ MyCols, j < i do
4 t ← t + ai,j × xj
5 s = GATHER(ALLOC(i), t, 1)
6 if i ∈ MyCols then
7 xi ← (bi − s)/ai,i
In this algorithm, we use variable MyCols to denote the set of the indices of
the columns allocated to each processor. The GATHER operation is used so that
the sum of the partial scalar products, computed locally at each processor, is
computed and stored in the memory of processor ALLOC(i). The above pseudo-
code is written using only global indices, and we let the reader write it using
local array indices. This can be done using the same technique as for the LU
factorization algorithm in Section 4.4.
Answers 143
. Question 2. We only sketch the main idea of the solution, which is similar
to that for Question 1. If matrix rows are distributed to processors, we need to
operate on a matrix column at each step, so that each processor can contribute
by updating the fraction of the column corresponding to the processor’s local
rows. This implies swapping the two loops in the sequential version:
1 for j = 0 to n − 1 do
2 x[j] ← b[j]/a[j, j]
3 for i = j + 1 to n − 1 do
4 b[i] ← b[i] − a[i, j] × x[j]
As before, a cyclic allocation will nicely balance the work among the pro-
cessors. One allocates row i of matrix A and component bi to processor
ALLOC(i) = i mod p, which is responsible for the computation of xi . With
this allocation, we obtain the parallel algorithm:
1 for j = 0 to n − 1 do
2 if j ∈ MyRows then
3 x[j] ← b[j]/a[j, j]
4 BROADCAST(()alloc(j),x[j],1)
5 for i ∈ MyRows, i > j do
6 b[i] ← b[i] − a[i, j] × x[j]
1 2 3 4 5 6 7 8 → P1 → P2 → P 3 → P4 → P 5 → P6 → P 7 → P8
When a row arrives at a processor, it stays there for one step. When a
second row arrives at that processor, it is combined with the first row. The
first row is then sent to the next processor and the combined row takes its
place. We obtain the following execution, shown on an example for n = 8
(each table element shows which matrix row each processor contains at each
algorithm step):
144 Chapter 4. Algorithms on a Ring of Processors
Step P1 P2 P3 P4 P5 P6 P7 P8
t=1 8
t=2 ROT(7, 8, 1)
t=3 ROT(6, 7, 1) 8
t=4 ROT(5, 6, 1) ROT(7, 8, 2)
t=5 ROT(4, 5, 1) ROT(6, 7, 2) 8
t=6 ROT(3, 4, 1) ROT(5, 6, 2) ROT(7, 8, 3)
t=7 ROT(2, 3, 1) ROT(4, 5, 2) ROT(6, 7, 3) 8
t=8 ROT(1, 2, 1) ROT(3, 4, 2) ROT(5, 6, 3) ROT(7, 8, 4)
t=9 1 ROT(2, 3, 2) ROT(4, 5, 3) ROT(6, 7, 4) 8
t = 10 1 2 ROT(3, 4, 3) ROT(5, 6, 4) ROT(7, 8, 5)
t = 11 1 2 3 ROT(4, 5, 4) ROT(6, 7, 5) 8
t = 12 1 2 3 4 ROT(5, 6, 5) ROT(7, 8, 6)
t = 13 1 2 3 4 5 ROT(6, 7, 6) 8
t = 14 1 2 3 4 5 6 ROT(7, 8, 7)
t = 15 1 2 3 4 5 6 7 8
. Question 2. The idea is to “fold in half” the ring used in the previous
question so that rows travel first to the right and then to the left.
Tpar (1 − f )T1
Tp > Tseq + = f · T1 + ,
p p
T1 1 1
Sp = 6 1−f
6 .
Tp f+ p f
speed and I/O speed. With 1 processor, the maximum feasible problem size,
nmax (1), is defined by β(nmax (1))2 = M .
With p processors, one can compute a larger problem since one can ag-
√
gregate all the processors’ memories: nmax (p) = p · nmax (1). How do we
then compute the parallel speedup for a problem of size nmax (p), that is for a
problem too large to be executed on a single processor? The simple idea is to
scale the speedup: we compute A(p), the mean time to perform an arithmetic
operation with p processors for a problem of maximum size nmax (p). The new
definition of the speedup, proposed by Gustafson [66], is Sp = A(1)/A(p).
For our matrix computation, assuming perfect parallelization of the arith-
metic computations, we have:
A(1) = (n^α w + γ n² τi/o) / n^α    with βn² = M , and

A(p) = (n^α w/p + γ n² τi/o) / n^α    with βn² = pM .
If α = 2, A(1) = w + γτi/o and A(p) = w/p + γτi/o : we obtain the traditional parallel speedup, which is bounded by a value that does not depend on p. But for α ≥ 3, we have

A(1) = w + γτi/o / (M/β)^((α−2)/2) , and

A(p) = w/p + γτi/o / (pM/β)^((α−2)/2) .
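As an illustration, the sketch below evaluates the scaled speedup Sp = A(1)/A(p) for made-up platform parameters; all numeric values are purely illustrative.

```python
# Illustrative evaluation of Gustafson's scaled speedup for the memory-bound
# computation above (n^alpha arithmetic operations, gamma*n^2 I/O, beta*n^2 memory).
def scaled_speedup(p, alpha, w, gamma, tau_io, beta, M):
    n1 = (M / beta) ** 0.5            # largest problem that fits on one processor
    np_ = (p * M / beta) ** 0.5       # largest problem that fits on p processors
    A1 = (n1**alpha * w + gamma * n1**2 * tau_io) / n1**alpha
    Ap = (np_**alpha * w / p + gamma * np_**2 * tau_io) / np_**alpha
    return A1 / Ap

for p in (1, 4, 16, 64):
    # with alpha = 3 the scaled speedup keeps growing with p, unlike alpha = 2
    print(p, scaled_speedup(p, alpha=3, w=1.0, gamma=1.0, tau_io=10.0, beta=1.0, M=1e6))
```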
Let us conclude with some parallel computing humor: a single mover takes
infinite time to move a piano up a few floors, but two movers only take a few
minutes: infinite speedup!
Chapter 5
Algorithms on Grids of Processors
time as it can send (or receive) that amount of data. This assumption may or
may not be realistic for the underlying physical platform. It is straightforward
to modify the programs presented in this chapter and, more importantly, their
performance analyses, in case the full-duplex assumption does not hold.
The second issue is that of the number of communications in which a sin-
gle processor can be engaged simultaneously. With four bidirectional links,
conceivably a processor can be involved in one send and one receive on all its
network links, all concurrently. The assumption that such concurrent com-
munications are allowed at each processor with no decrease in communication
speed when compared to a single communication is termed the multi-port
model. In the case of our grid topology, we talk of a 4-port model. If instead
at most two concurrent communications are allowed, one of them being a send
and one of them being a receive, then one talks of a 1-port model. Going back
to our stencil algorithm on a bidirectional ring (Section 4.5.2), the reader will
see that we had implicitly used the 1-port model. In this chapter, we will show
performance analyses for both the 1-port and the 4-port model. Once again,
it is typically straightforward to adapt these analyses to other assumptions
regarding concurrency of communications.
As discussed at the end of Chapter 4, an important issue is impedance
matching between logical and physical topology: how do the grid and ring
logical topologies compare in terms of realism for a given physical platform?
It turns out that there are platforms whose physical topologies are or include
grids and/or rings. A famous example of a supercomputer using a grid is
the defunct Intel Paragon. A more recent example, at least at the time this
book is being written, is IBM’s Blue Gene/L supercomputer and its 3-D torus
topology, which contains grids and rings. When both a ring and a grid map
well to the physical platform, the grid is preferable for many algorithms. For
a given number of processors p, a torus topology uses 2p network links (and a grid topology 2(p − √p)), twice as many as a ring topology, which uses only
p network links. As a result, more communications can occur in parallel and
there are more opportunities for developing parallel algorithms with lower
communication costs. Interestingly, even on platforms whose physical topol-
ogy does not contain a grid (e.g., on a platform using a switched interconnect),
using a logical grid topology can allow for more concurrent communications.
In this chapter, we will see that the opportunity for concurrent communica-
tions is the key advantage of logical grid topologies for implementing popular
algorithms. But we will also see that even if the underlying platform offers no
possibility of concurrent communications, writing some algorithms assuming
a grid topology is inherently beneficial!
where dest has value north, south, west, or east. For a torus topology, which is the topology we will use for the majority of the algorithms in this chapter, the North, South, West, and East neighbors of processor Pi,j are P(i−1) mod q, j , P(i+1) mod q, j , Pi, (j−1) mod q , and Pi, (j+1) mod q , respectively. We often omit the modulo and assume that all processor indices are taken modulo q implicitly.
If the topology is a grid then some dest values are not allowed for some source
processors. Each SEND call has a matching RECV call:
Note that in the case of a torus, each processor row and processor column
is a ring embedded in the processor grid. Therefore, the two above functions
can use the pipelined implementation of the broadcast on a ring developed in
Section 3.3.4. If in addition links are bidirectional and one assumes a 4-port
model (or in fact just a 2-port model in this case), then the broadcast can
be done faster by sending data from the source processors in both directions
simultaneously. We will see later in this chapter that this does not change the
asymptotic performance of the broadcast. If the topology is not a torus but
links are bidirectional, then this broadcast can be implemented by sending
messages both ways from the source processor. If the topology is not a torus
and links are unidirectional, then these functions cannot be implemented. We
assume that a processor that calls these functions and is not in the relevant
processor row or column returns immediately. This assumption will simplify
the pseudo-code of our algorithms by removing the need for processor row
and column indices before calling BROADCASTROW() or BROADCASTCOL().
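The index arithmetic behind these primitives can be sketched as follows; the helper names below are ours, not the book's API.

```python
# Neighbors of processor P(i, j) on a q x q torus, and the members of its
# processor row and column (the groups involved in BROADCASTROW / BROADCASTCOL).
def neighbors(i, j, q):
    return {"north": ((i - 1) % q, j),
            "south": ((i + 1) % q, j),
            "west":  (i, (j - 1) % q),
            "east":  (i, (j + 1) % q)}

def row_members(i, q):
    return [(i, j) for j in range(q)]

def col_members(j, q):
    return [(i, j) for i in range(q)]

print(neighbors(0, 0, 4))   # wraps around: north is (3, 0), west is (0, 3)
```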
topology, i.e., a ring, we had used a natural 1-D data distribution. Here, our
2-D topology naturally induces a 2-D data distribution. We define m = n/q.
The standard approach is to assign an m × m block of each matrix to each processor according to the grid topology. More precisely, processor Pi,j , 0 ≤ i, j < q, holds matrix elements Ak,l , Bk,l , and Ck,l with i·m ≤ k < (i + 1)·m and j·m ≤ l < (j + 1)·m. We denote the three matrix blocks assigned to processor Pi,j by Âi,j , B̂i,j , and Ĉi,j , as depicted in Figure 5.2 for matrix A. All algorithms hereafter use this distribution scheme.
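The induced index mapping is simple enough to state as code; the helper names below are ours.

```python
# 2-D block distribution: global element (k, l) of an n x n matrix belongs to
# the m x m block held by processor P(k // m, l // m), with m = n / q.
def owner(k, l, n, q):
    m = n // q
    return (k // m, l // m)          # processor coordinates (i, j)

def local_indices(k, l, n, q):
    m = n // q
    return (k % m, l % m)            # position inside the local block

n, q = 8, 4
print(owner(5, 2, n, q), local_indices(5, 2, n, q))   # -> (2, 1) and (1, 0)
```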
It turns out that this so-called “outer-product algorithm” [1, 56, 77] leads
to a particularly simple and elegant parallelization on a torus of processors.
The algorithm proceeds in n steps, that is n iterations of the outer loop. At
each step k, Ci,j is updated using Ai,k and Bk,j . Recall that all three matrices
are partitioned in q² blocks of size m × m, as in the right side of Figure 5.2.
The algorithm above can be written in terms of matrix blocks and of matrix
multiplications, and it proceeds in q steps as follows:
for k = 0 to q − 1 do
  for i = 0 to q − 1 do
    for j = 0 to q − 1 do
      Ĉi,j ← Ĉi,j + Âi,k × B̂k,j
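To make the blocked triple loop above concrete, here is a short sequential NumPy simulation of the outer-product algorithm (no real communication is performed); all names are ours.

```python
# Sequential simulation of the outer-product algorithm on a q x q logical grid:
# at step k, block column k of A and block row k of B are "broadcast" and every
# block C_{i,j} is updated with the corresponding block product.
import numpy as np

def outer_product_mm(A, B, q):
    n = A.shape[0]
    m = n // q
    blk = lambda M, i, j: M[i*m:(i+1)*m, j*m:(j+1)*m]
    C = np.zeros_like(A)
    for k in range(q):
        for i in range(q):            # every processor row receives block (i, k) of A
            for j in range(q):        # every processor column receives block (k, j) of B
                blk(C, i, j)[:] += blk(A, i, k) @ blk(B, k, j)
    return C

n, q = 8, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(outer_product_mm(A, B, q), A @ B)
```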
The above analysis is for the 1-port model, with the horizontal and the ver-
tical broadcasts happening in sequence at each step. With the 4-port model,
both broadcasts can occur concurrently, and the execution time of the algo-
rithm is obtained by removing the factor 2 in front of each Tbcast in the above
equation. When n becomes large, Tbcast ∼ n2 b/p, and thus T (m, q) ∼ n3 w/p,
which shows that the algorithm achieves an asymptotic efficiency of 1. This
algorithm is used by the ScaLAPACK [36] library, albeit often using a block-
cyclic data distribution (see Section 5.4).
designers have striven to reduce communication costs for parallel matrix mul-
tiplication. We discuss below in what way a grid topology is advantageous
compared to a ring topology.
In practice for large values of n, up to a point, the algorithm’s execution time
can be dominated by communication time. This happens, for instance, when
the ratio w/b is low. The communication time (which is then approximately
equal to the execution time) on a grid, using the outer-product algorithm described in the previous section, would be 2n²b/√p, assuming a 1-port model.
The communication time of the matrix multiplication algorithm on a ring, developed and analyzed in Section 4.2, is p · (n²/p) · b = n²b. The time spent in communication when using a grid topology is thus a factor √p/2 smaller than
when using a ring topology. This is easily seen when examining the communi-
cation patterns of both algorithms. When using a ring topology, at each step
the communication time is equal to that needed for sending n2 /p elements
between neighbor processors, and there are p such steps. By contrast, for the
algorithm on a grid, the communication at each step involves the broadcast of twice n²/p matrix elements, and there are √p such steps. With the pipelined implementation of the broadcast, broadcasting n²/p matrix elements in a processor row or column can be done in approximately the same time as sending n²/p matrix elements from one processor to another on the ring (provided that n is not too small). Since there are two broadcasts, at each step the algorithm on the grid spends twice as much time communicating as on the ring. But it performs a factor √p fewer steps! So the algorithm on a grid spends a factor √p/2 less time communicating than the algorithm on a ring. With a 4-port model, this factor is √p.
The above advantage of the grid topology can be attributed to the presence
of more network links and to the fact that many of these links can be used
concurrently. In fact, for matrix multiplication, the 2-D data distribution
induced by a grid topology is inherently better than the 1-D data distribution
induced by a ring topology, regardless of the underlying physical topology!
To see this, let us just compute the total amount of data that needs to be
communicated in both versions of the algorithm.
The algorithm on a ring communicates p matrix stripes, each containing n²/p elements, at each step, for p steps, amounting to a total of p·n² matrix elements sent on the network. The algorithm on a grid proceeds in √p steps. At each step, 2 × √p blocks of n²/p elements are sent, each to √p − 1 processors, for a total of 2 · √p · (√p − 1) · n²/p ≤ 2n² elements. Since there are √p steps, the total number of matrix elements sent on the network is lower than 2√p·n², i.e., at least a factor √p/2 lower than in the case of the algorithm on a ring! We conclude that when using a 2-D data distribution one inherently sends less data than when using a 1-D distribution, by a factor that increases
with the number of processors. Although we do not show it here formally,
this result is general and holds for any (reasonable) matrix multiplication al-
gorithm. The implication of this result is that, for the purpose of matrix
multiplication, using a grid topology (and the induced 2-D data distribution)
is at least as good as using a ring topology, and possibly better. For instance,
when implementing a parallel matrix multiplication in a physical topology on
which all communications are serialized (e.g., on a bus architecture like a non-
switched Ethernet network), one should opt for a logical grid topology with a
2-D data distribution to reduce the amount of transferred data. Recall how-
ever that for n sufficiently large, the two logical topologies become equivalent
with execution time dominated by computation time.
FIGURE 5.4: Block data distribution of matrices A and B after the preskew-
ing phase of the Cannon algorithm (on a 4 × 4 processor grid).
We depict the first two steps of the algorithm in Figure 5.5, which shows
which block multiplications are performed by each processor; the multiplication symbol in the figure denotes a block-wise matrix multiplication. For instance, in the first step, processor P2,3 updates its block of matrix C, Ĉ2,3 , by adding to it the result of the Â2,1 × B̂1,3 product, while processor P1,0 adds Â1,1 × B̂1,0 to Ĉ1,0 . In the second step, blocks of A and B have been shifted horizontally and vertically, as seen in the figure. So during this step processor P2,3 adds Â2,2 × B̂2,3 to Ĉ2,3 , while processor P1,0 adds Â1,2 × B̂2,0 to Ĉ1,0 . Intuitively one can see that eventually each processor Pi,j will have computed all Âi,l × B̂l,j products, l = 0, . . . , q − 1, needed to obtain the final value of Ĉi,j .
FIGURE 5.5: The first two steps of the Cannon algorithm on a 4 × 4 grid
of processors, where at each step each processor multiplies one block of A and
one block of B and adds this product to a block of C.
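As a complement to the figure, here is a hedged NumPy sketch of Cannon's algorithm, simulated sequentially: the preskewing, the q multiply steps, and the unit shifts of A (left) and B (up). All names are illustrative; no real messages are exchanged.

```python
# Sequential simulation of Cannon's algorithm on a q x q torus.
import numpy as np

def cannon_mm(A, B, q):
    n = A.shape[0]; m = n // q
    # a[i][j] is the block currently held by processor P(i, j)
    a = [[A[i*m:(i+1)*m, ((j+i) % q)*m:((j+i) % q + 1)*m].copy() for j in range(q)]
         for i in range(q)]                          # preskew: shift row i of A left by i
    b = [[B[((i+j) % q)*m:((i+j) % q + 1)*m, j*m:(j+1)*m].copy() for j in range(q)]
         for i in range(q)]                          # preskew: shift column j of B up by j
    c = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                c[i][j] += a[i][j] @ b[i][j]         # local block multiplication
        a = [[a[i][(j+1) % q] for j in range(q)] for i in range(q)]   # shift A left by one
        b = [[b[(i+1) % q][j] for j in range(q)] for i in range(q)]   # shift B up by one
    return np.block(c)

n, q = 8, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(cannon_mm(A, B, q), A @ B)
```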
This algorithm developed by Fox in [56] was originally designed for Cal-
Tech’s hypercube platform but it uses a torus logical topology. The algorithm
performs broadcasts of blocks of matrix A and is also known as the broadcast-
multiply-roll algorithm. Unlike Cannon’s algorithm, the Fox algorithm does
not require any preskewing or postskewing of the matrices. The algorithm
proceeds in q steps and at each step it performs a vertical shift of blocks
of matrix B. At step k, 1 ≤ k ≤ q, the algorithm also performs horizontal broadcasts of all blocks of the k-th block diagonal of matrix A within processor rows. A block Âi,j is on the k-th block diagonal, k ≥ 1, if j = (i + k − 1) mod q. Therefore, at step k processor Pi,(i+k−1) mod q sends its block of matrix A to all processors Pi,j , j ≠ (i + k − 1) mod q. The Fox algorithm, written in high-level pseudo-code, is shown in Algorithm 5.3.
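Since the algorithm listing is not reproduced here, the following sequential NumPy sketch illustrates the broadcast-multiply-roll structure; it is a simulation under our own naming, not a message-passing implementation.

```python
# Sequential simulation of the Fox (broadcast-multiply-roll) algorithm.
import numpy as np

def fox_mm(A, B, q):
    n = A.shape[0]; m = n // q
    blk = lambda M, i, j: M[i*m:(i+1)*m, j*m:(j+1)*m]
    b = [[blk(B, i, j).copy() for j in range(q)] for i in range(q)]   # current B blocks
    c = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]
    for k in range(q):
        for i in range(q):
            d = (i + k) % q                   # column of the block on the k-th diagonal, row i
            a_bcast = blk(A, i, d)            # broadcast within processor row i
            for j in range(q):
                c[i][j] += a_bcast @ b[i][j]
        b = [[b[(i+1) % q][j] for j in range(q)] for i in range(q)]   # vertical shift (roll) of B
    return np.block(c)

n, q = 8, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(fox_mm(A, B, q), A @ B)
```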
Like for the Cannon algorithm we illustrate the first two steps of this al-
gorithm in Figure 5.6. The figure shows which matrix block multiplications
are performed by each processor at each step. During the first step relevant
blocks of A are blocks Âi,i , 0 ≤ i < q, that is blocks of the first block diagonal of matrix A. For instance, in the first step, processor P2,3 updates its block of matrix C, Ĉ2,3 , by adding to it the result of the Â2,2 × B̂2,3 product, while processor P1,0 adds Â1,1 × B̂1,0 to Ĉ1,0 . In the second step, relevant blocks of A are blocks Âi,(i+1) mod q , that is blocks on the second block diagonal. During this step, processor P2,3 adds Â2,3 × B̂3,3 to Ĉ2,3 , and processor P1,0 adds Â1,2 × B̂2,0 to Ĉ1,0 . Here again it is easy to see that eventually processor Pi,j will have computed all block products necessary to obtain the final value of Ĉi,j .
FIGURE 5.6: The first two steps of the Fox algorithm on a 4 × 4 grid of
processors, where at each step each processor multiplies one block of A and
one block of B and adds this product to a block of C.
Figure 5.7 shows the first two steps of the algorithm. The blocks of C that
are updated by the global sum operations are shown in boldface, along the
first diagonal for the first step, and the second diagonal for the second step.
In this sense the meaning of the “+=” sign in this figure is different from that in Figures 5.5 and 5.6. Indeed, only q blocks of matrix C are updated at each step, as opposed to q² blocks. But each block is updated only once during the execution of the algorithm. For instance, in the second step, processor P2,3 updates block Ĉ2,3 by adding to it the three products Â2,0 × B̂0,3 , Â2,1 × B̂1,3 , and Â2,2 × B̂2,3 , received from processors P2,0 , P2,1 , and P2,2 respectively, and the locally computed product Â2,3 × B̂3,3 .
d d
FIGURE 5.7: The first two steps of the Snyder algorithm on a 4 × 4 grid of
processors, where at each step each processor multiplies one block of A and
one block of B. All such products are added together and added to the blocks
of C shown in boldface, within each processor row.
Cannon Algorithm
Let us start with the 4-port model. The algorithm's execution time is the sum of two terms: T^{4p}_skew , the time for preskewing and postskewing the matrices, and T^{4p}_comp , the time to perform the computation (“4p” stands for 4-port).
For the preskewing and postskewing steps, one can limit the number of shifts of blocks of matrices A and B to ⌊q/2⌋. To understand this, consider a processor row and the shifts of blocks of matrix A that must be performed by processors in that row. It should be clear to the reader that performing q − 1 left shifts is equivalent to performing one right shift. More generally, performing x left shifts is equivalent to performing q − x right shifts. Therefore, in the worst case, a processor row only needs to perform ⌊q/2⌋ shifts, this maximum number of shifts being performed by the middle processor row(s). Therefore, the preskewing of matrix A takes time ⌊q/2⌋ (L + m²b). The time to preskew
Fox Algorithm
The Fox algorithm proceeds in q steps, with no preskewing or postskewing,
and thus the execution time is simply the time taken by a single step multiplied
by q. The computation time at each step is m3 w, just like for the Cannon
algorithm.
At each step, there are q concurrent broadcasts of blocks of matrix A, one
broadcast in each processor row. Using the pipelined broadcast presented in
Section 3.3.4 with the optimal packet size, the time for the broadcast, Tbcast , is:

Tbcast = ( √((q − 2)L) + √(m²b) )² .
Due to the fact that links are bidirectional, with the 4-port model the broad-
cast time above can be reduced by having the source processor simultaneously
send data in both directions on the ring. With this technique, the execution
time of the broadcast is obtained by replacing q in the above equation by
dq/2e. This is because the first packets sent in both directions go through
at most dq/2e hops. Note that the asymptotic performance of the broadcast
when m gets large is unchanged by this modification.
The shift of the blocks of matrix B can be done in time L + m²b. In the 4-port model, this shift can occur concurrently with the broadcasts of the blocks of matrix A at each step. As a result, the time to perform the shift is completely hidden (Tbcast ≥ L + m²b, for q > 1). Computation at each step
occurs concurrently with these communications. However, processors must all
wait for the first broadcast to complete before proceeding. Then, at each step
the computation for that step, the shift for that step and the broadcast for
the next step can occur concurrently. In the last step, only the computation
and the shift occur. We obtain the overall execution time in the 4-port model,
T^{4p} , as:

T^{4p} = ( √((⌈q/2⌉ − 2)L) + √(m²b) )²
       + (q − 1) max( m³w , ( √((⌈q/2⌉ − 2)L) + √(m²b) )² )
       + max( m³w , L + m²b ) .
With the 1-port model, horizontal and vertical communications cannot oc-
cur concurrently and during the broadcast the source can only send data in
one direction at a time. Therefore, the execution time is simply:
T^{1p} = ( √((q − 2)L) + √(m²b) )²
       + (q − 1) max( m³w , ( √((q − 2)L) + √(m²b) )² + L + m²b )
       + max( m³w , L + m²b ) .
Snyder Algorithm
The major differences between the Snyder algorithm and the previous ones
are the use of a pre- and post-transposition of matrix B and the use of global
sums to compute blocks of matrix C by accumulation. Let us discuss both
these operations.
If our logical topology is a torus this time can be cut roughly in half. Indeed,
with a torus, transposing the block initially held by processor Pq,1 takes only
two communication steps, using the loopback links. The factor 2(q − 1) above
then becomes 2⌊q/2⌋. The 1-port model precludes concurrent communications
of the blocks originally in the upper half of the matrix and of the blocks
originally in the lower half of the matrix. Therefore, the execution time is
multiplied by a factor 2.
Another approach could consist in shifting the blocks in each row so that
they are in their destination columns. As soon as a block has reached its
destination column it can be shifted vertically to reach its destination row
provided there is no link contention. While in a 4-port model this algorithm
has the same execution time as the algorithm above, in the 1-port model it
uses a smaller number of steps. Writing the algorithm to satisfy the “there
is no link contention” constraint requires care, and we leave it as an exercise
for the reader. Note that enforcing this constraint is not necessary to have a
correct algorithm, but it may be difficult to analyze its performance.
Finally, note that if wormhole routing is implemented on the underlying
platform, then a clever recursive transposition algorithm described in [45]
leads to a shorter execution time:

T^{1p}_transpose = ⌈log₂ q⌉ (L + m²b) .
Let us first consider the 4-port model when developing the performance
analysis of the rest of the algorithm. The execution consists of a sequence of q
steps, where each step consists in a product of matrix blocks, a shift of blocks
of matrix B along processor columns, and a global sum of blocks of matrix C
along processor rows (see Algorithm 5.4). The first global sum can only be
done after the matrix products have been performed. These matrix products
take time m3 w, and can be done concurrently with the shift of matrix B,
Note our use of w′ in this equation to denote the time needed to add two
matrix elements together, which is lower than w, the time needed to multiply
two matrix elements together and add them to a third matrix element.
Alternatively, one can design a pipelined algorithm on a unidirectional ring
in the same fashion as the pipelined broadcast developed in Section 3.3.4. Note
that it is possible to implement a faster broadcast using the fact that our rings
are bidirectional, as for the Fox algorithm. However, note that here we have
both communications and computations (for the summing of blocks of C), so
this technique is only useful if communication time is larger than computation
time. We leave the development of a bidirectional global sum of blocks of C
as an exercise for the reader. We split the m2 matrix elements to be sent and
added by each processor into r individual chunks. Let us consider a given
processor row i. Without loss of generality let us also assume that Pi,q−1
is the destination processor and that communication occurs in the direction
of increasing processor column indices. In the first step of the algorithm,
processor P0 sends a chunk of m2 /r matrix elements to P1 . In the second step,
processor P1 adds these matrix elements to the corresponding matrix elements
it owns (to perform the global sum), and receives the next chunk from P0 .
As in the case of the simple pipelined broadcast, one question here is to find
the optimal value for r. Fortunately, we can apply the same technique. For
instance, assuming that b ≥ w′ , the maximum in the above equation becomes simply L + m²b/r. The optimal execution time is obtained for r = ropt , where

ropt = √( (2q − 3) m²b / L ) .
Recall that the above can be obtained by directly applying the “goat in a pen”
theorem, or by simply computing the derivative of Γ(r) with respect to r (see
Sections 3.2.2 and 3.3.4). The optimal execution time for the global sum when
r = ropt , Γopt , is then:
Γopt = (2q − 2)L + m²b + 2m √((2q − 3) b L) .
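As a quick numeric check, the sketch below evaluates ropt and Γopt for illustrative parameter values; the cost function Γ(r) written in the code is our reconstruction (chosen so that its minimum matches the optimum quoted above), not necessarily the exact expression used earlier in the text.

```python
# Numeric illustration of the optimal chunking for the pipelined global sum.
# Assumed cost (our reconstruction, valid when b >= w'):
#   Gamma(r) = (2q - 2 + r) L + m^2 b + (2q - 3) m^2 b / r
from math import sqrt

q, m, L, b = 8, 64, 1e-4, 1e-8
Gamma = lambda r: (2*q - 2 + r) * L + m*m*b + (2*q - 3) * m*m * b / r
r_opt = sqrt((2*q - 3) * m*m * b / L)
Gamma_opt = (2*q - 2) * L + m*m*b + 2*m*sqrt((2*q - 3) * b * L)
print(r_opt, Gamma(r_opt), Gamma_opt)   # Gamma(r_opt) equals the quoted Gamma_opt
```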
The case b < w′ is more involved, and is left as an exercise for the interested reader. We finally have the overall execution time for the algorithm, T^{4p} , in the case b ≥ w′ , as:

T^{4p} = 2 T^{4p}_transpose + max( m³w , L + m²b )
       + (q − 1) max( m³w , (2q − 2)L + m²b + 2m √((2q − 3)bL) )
       + (2q − 2)L + m²b + 2m √((2q − 3)bL) .
With the 1-port model, the shifts of blocks of matrix B cannot occur con-
currently with the global sums of blocks of matrix C. We obtain the overall
execution time as:
T^{1p} = 2 T^{1p}_transpose + max( m³w , L + m²b )
       + (q − 1) [ max( m³w , L + m²b ) + (2q − 2)L + m²b + 2m √((2q − 3)bL) ]
       + L + m²b + (2q − 2)L + m²b + 2m √((2q − 3)bL) .
Conclusion
When n gets large, all these algorithms achieve an asymptotic parallel effi-
ciency of 1. By now it should be clear to the reader that this is not terribly
difficult to achieve for matrix multiplication, given that the computation is
O(n3 ) and the communication O(n2 ). The above performance analyses make
it possible to compare the three algorithms for particular values of n and q
and of the characteristics of the platform. More importantly, the main merit
of going through these admittedly lengthy performance analyses is to expose
the reader to several typical algorithms such as matrix transposition or global
sums, to put principles such as pipelining to use, and to better understand
the impact of 1-port and 4-port models on algorithm design and performance.
[Figure: 2-D block-cyclic data distribution on a 4 × 4 processor grid; matrix block (i, j) is assigned to processor P_{i mod 4, j mod 4}.]
Bibliographical Notes
In addition to the many references cited in this chapter, a valuable reference
for examples of algorithms on 2-D processor grids is, once again, the book
by Kumar et al. [77]. Of interest is also the book by Cosnard et al. [45].
Finally, both the ScaLAPACK [36] and PLAPACK [117] libraries contain many implementations of interesting algorithms on 2-D processor grids.
5.5 Exercises
In the first two exercises, we write the pseudo-code for two algorithms that
we have seen in this chapter, namely the matrix block transposition in Snyder’s
algorithm and a stencil application, on 2-D processor grids. The third exercise
deals with the parallelization of the well-known Gauss-Jordan method for
solving a linear system of equations. Finally, the fourth exercise is a hands-
on implementation of the outer-product matrix multiplication algorithm on a
parallel platform using the MPI message-passing standard.
1. Write the pseudo code for the above algorithm on a q × q 2-D processor
torus, where q divides n.
2. Give a performance model for your algorithm with the 4-port assumption
and the 1-port assumption.
1 GAUSSJORDAN(A, b, x)
2 for j = 0 to n − 1 do
3   for i = 0 to n − 1 do
4     if i ≠ j then
5       f ← Ai,j /Aj,j
6       for k = j to n − 1 do
7         Ai,k ← Ai,k − f × Aj,k
8       bi ← bi − f × bj
9 for i = 0 to n − 1 do
10  xi ← bi /Ai,i
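For reference, here is a compact sequential NumPy version of Gauss-Jordan elimination without pivoting, which can serve to check a parallel implementation; all names are ours.

```python
# Sequential Gauss-Jordan elimination (no pivoting).
import numpy as np

def gauss_jordan(A, b):
    A, b = A.astype(float).copy(), b.astype(float).copy()
    n = len(b)
    for j in range(n):                    # pivot column j
        for i in range(n):
            if i != j:
                f = A[i, j] / A[j, j]     # elimination factor for row i
                A[i, j:] -= f * A[j, j:]
                b[i] -= f * b[j]
    return b / np.diag(A)

A = np.random.rand(5, 5) + 5 * np.eye(5)  # diagonally dominant, so no pivoting needed
b = np.random.rand(5)
assert np.allclose(gauss_jordan(A, b), np.linalg.solve(A, b))
```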
A and matrix B in two arrays. Initialize these arrays with matrix elements
defined as ai,j = i and bi,j = i + j (these are global indices starting at 0). At
the end of the execution each processor holds a piece of matrix C stored in
a third array. Your program should check the validity of the results (indeed,
they can be computed analytically as ci,j = i·n·(n − 1)/2 + i·j·n, using global
indices). Hint: it may be a good idea to use multiple MPI communicators,
one for each processor row, and one for each processor column, as a convenient
way to implement the BROADCASTROW() and BROADCASTCOLUMN() functions.
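A possible starting point is sketched below with mpi4py (the exercise can of course be written in C, using MPI_Comm_split for the row and column communicators). The matrix size and all variable names are illustrative, and the sketch assumes the number of processes is a perfect square.

```python
# Possible mpi4py skeleton for the outer-product exercise.
# Run with e.g.: mpirun -np 16 python outer_product_mpi.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))                 # assume p is a perfect square
rank = comm.Get_rank()
i, j = divmod(rank, q)                   # my coordinates on the q x q grid
row_comm = comm.Split(color=i, key=j)    # one communicator per processor row
col_comm = comm.Split(color=j, key=i)    # one communicator per processor column

n = 16 * q                               # global matrix size (illustrative)
m = n // q
# local blocks, initialized with the global-index formulas of the exercise
gi, gj = np.meshgrid(np.arange(i*m, (i+1)*m), np.arange(j*m, (j+1)*m), indexing="ij")
A = gi.astype(float)                     # a_{k,l} = k
B = (gi + gj).astype(float)              # b_{k,l} = k + l
C = np.zeros((m, m))

for k in range(q):
    Ak = A.copy() if j == k else np.empty((m, m))
    Bk = B.copy() if i == k else np.empty((m, m))
    row_comm.Bcast(Ak, root=k)           # BROADCASTROW: block column k of A
    col_comm.Bcast(Bk, root=k)           # BROADCASTCOLUMN: block row k of B
    C += Ak @ Bk

expected = gi * n * (n - 1) / 2 + gi * gj * n   # c_{k,l} = k n(n-1)/2 + k l n
assert np.allclose(C, expected)
```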
2. Run your program on a parallel platform, e.g., a cluster, and plot the
parallel speedup and efficiency for 2, 4, and 8 processors as functions of the
matrix size, n. Each data point should be obtained as the average over 10
runs.
3. If you have not done so, experiment with non-blocking MPI communi-
cation and see whether there is any impact on your program’s performance
when you overlap communication and computation.
5.6 Answers
Exercise 5.1 (Matrix Transposition)
According to the description of the algorithm, a block stored on a processor
in the lower part of the processor grid travels to the right on that processor’s
processor row until the processor on the diagonal of the processor grid is
reached. Then, the block travels upwards in that processor’s column until it
reaches its destination. Blocks stored on processors in the upper part of the
processor grids travel similarly but first downward and then to the left.
Given the above algorithm, processor Pi,i (0 ≤ i ≤ q − 1) is involved in
forwarding a total of 2i blocks. Each processor Pi,j in the lower part of the
matrix sends min(i, j) blocks to its West neighbor and min(i + 1, j + 1) blocks
to its East neighbor, with corresponding block receptions by these neighbors.
Each processor Pi,j in the upper part of the matrix sends min(i, j) blocks
to its North neighbor and min(i + 1, j + 1) blocks to its South neighbor,
with corresponding block receptions by these neighbors. The figure below
shows an example for a 5 × 5 processor grid; for each processor it depicts how
many send/receive operations this processor performs throughout the overall
execution.
Assuming as usual that sends are non-blocking and that receives are block-
ing, we can now write the pseudo-code as shown in Algorithm 5.6, with
m = n/q. The algorithm is written so that the clockwise and the coun-
terclockwise communications happen in parallel. Note that this algorithm
can be further improved by allowing a message reception for step k1 + 1 (resp.
k2 + 1) to occur in parallel with a message sending for step k1 (resp. k2 ),
as for instance in Algorithm 4.1. This improvement leads to the performance
model given in Section 5.3.4.
FIGURE 5.9: Depiction of the block held by a processor, in the center, and
of the blocks held by its four neighbors. Each processor can update the inside
cells of its block (in white) independently of its neighbors. The edge cells (in
gray) must be exchanged between neighbors so that they can be updated by
all the processors.
. Question 2. Each processor can compute internal cells of its block while
exchanging edge cells with its neighbors. Let us consider a processor not on
the edge of the processor grid. Using the same notation as in Section 4.5, the
time for this processor to update its internal cells is:

( n/q − 1 )² w ,

and the time for this processor to exchange its edge cells with its neighbors
Chapter 6
Load Balancing on Heterogeneous Platforms
a dynamic strategy can lead to poor load balancing, as we will see later in
this chapter. In such cases, one must resort to static task allocation schemes,
which are the focus of this chapter.
1 ALLOCATION((t1 , . . . , tp ), M )
  { Initialization: compute ci values such that ci × ti ≈ Constant and c1 + c2 + . . . + cp ≤ M }
2 for i = 1 to p do
3   ci = ⌊ (1/ti ) / (Σ_{k=1}^{p} 1/tk ) × M ⌋

After the initialization step of the algorithm, for all i, ci ≤ (1/ti ) / (Σ_{k=1}^{p} 1/tk ) × M .
Since M = Σ_{k=1}^{p} ok ≤ oj tj × Σ_{k=1}^{p} 1/tk , we have ci ti ≤ M / (Σ_{k=1}^{p} 1/tk ) ≤ oj tj , and
ok′ . We have tk′ (ck′ + 1) ≤ tk′ ok′ ≤ tj oj , and the choice of k′ implies that tk (ck + 1) ≤ tk′ (ck′ + 1). Therefore, invariant (I) is satisfied after this step. Finally, the obtained allocation, (c1 , c2 , . . . , cp ), is optimal because maxi {ci × ti } ≤ oj tj = maxi (oi ti ).
Cycle times: t1 = 3, t2 = 5, t3 = 8

Num. of tasks   c1   c2   c3   tmean   Chosen proc.
0               0    0    0    -       P1
1               1    0    0    3       P2
2               1    1    0    2.5     P1
3               2    1    0    2       P3
4               2    1    1    2       P1
5               3    1    1    1.8     P2
6               3    2    1    1.67    P1
7               4    2    1    1.71    P1
8               5    2    1    1.87    P2
9               5    3    1    1.67    P3
10              5    3    2    1.6     -

[Gantt charts of the allocations “After Step 3” and “After Step 4” are not reproduced here.]
FIGURE 6.1: Steps of the incremental algorithm with three processors with
cycle times t1 = 3, t2 = 5, and t3 = 8.
Consider step 4 as an example, that is the step in which the fourth task
is allocated to a processor. The algorithm bases the allocation decision on
the allocation obtained so far, that is (c1 , c2 , c3 ) = (2, 1, 0). This allocation is
depicted on the Gantt chart labeled as “After step 3” on the right-hand side
of Figure 6.1. This diagram depicts the three tasks already allocated, two
to processor P1 and one to processor P2 , with time along the vertical axis.
1 INCREMENTALDISTRIBUTION(t1 , . . . , tp , M )
{ Initialization }
2 C = (c1 , . . . , cp ) = (0, . . . , 0)
3 A = {}
{ Iterative computation of allocation A }
4 for n = 1 to M do
5 i = argmin_{1 ≤ j ≤ p} (tj × (cj + 1))
6 A[n] = i
7 ci ← ci + 1
8 return A
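A direct transcription of the above pseudo-code in Python, reproducing the example of Figure 6.1 (names are ours):

```python
# Task n goes to the processor that would finish it earliest,
# i.e., the one minimizing t_j * (c_j + 1).
def incremental_distribution(t, M):
    p = len(t)
    c = [0] * p
    A = []
    for _ in range(M):
        i = min(range(p), key=lambda j: t[j] * (c[j] + 1))
        A.append(i + 1)          # store 1-based processor indices, as in the text
        c[i] += 1
    return A, c

A, c = incremental_distribution([3, 5, 8], 10)
print(A)   # [1, 2, 1, 3, 1, 2, 1, 1, 2, 3]
print(c)   # [5, 3, 2]
```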
FIGURE 6.2: Static cyclic allocation of rows to three heterogeneous pro-
cessors with cycle times t1 = 3, t2 = 5, and t3 = 8, for the stencil application.
and ci × ti = 120 for all i. The execution time is (120/79) · N , and it is the optimal value that can be achieved.
One can thus increase the granularity of the application (to trade off par-
allelism for lower communication overhead) by choosing π values that are
multiples of the smallest π value that achieves perfect load balancing. This
is exactly the same process as when using larger blocks of rows in the cyclic
allocation used in the homogeneous case. Note however that the smallest π
value that achieves perfect load balancing may be large. With p processors of cycle times ti , 1 ≤ i ≤ p, the smallest π value is π = L × Σ_{i=1}^{p} 1/ti , where L = lcm(t1 , . . . , tp ) is the least common multiple of the cycle times. In our previous example we had L = 120. But for instance, with cycle times t1 = 11, t2 = 23, and t3 = 31, we obtain the least common multiple L = 7,843 and the smallest π value is then π = 1,307. With such a large period, unless the total
number of rows N is orders of magnitude larger than π, the reduced paral-
lelism for the first and second step of the application execution would have a
prohibitive impact on performance. In this case, one may opt for a value of π
that does not lead to perfect load balancing but that allows processors P2 and
P3 to start computing earlier. For instance, using π = 88 leads to an execu-
tion in which the load imbalance between the processors at each step is lower
than 1% (i.e., the relative differences between all processor execution times
at each step are all below 1%). Indeed, we obtain c1 × t1 = 48 × 11 = 528,
c2 × t2 = 23 × 23 = 529 and c3 × t3 = 17 × 31 = 527.
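A small sketch of these computations (the perfectly balanced period, and the shares for a chosen, smaller period); the rounding rule used for the non-perfect period is ours.

```python
# Period computations for heterogeneous cyclic allocation.
from math import lcm   # lcm with several arguments requires Python >= 3.9

def perfect_period(t):
    L = lcm(*t)
    return L, [L // ti for ti in t]          # pi is the sum of the shares

def shares_for_period(t, pi):
    # give each processor a share roughly inversely proportional to its cycle time
    inv_sum = sum(1 / tk for tk in t)
    c = [round(pi * (1 / ti) / inv_sum) for ti in t]
    return c, [ci * ti for ci in c]          # per-processor work within the period

print(perfect_period([3, 5, 8]))             # (120, [40, 24, 15]) -> pi = 79
print(perfect_period([11, 23, 31]))          # (7843, [713, 341, 253]) -> pi = 1307
print(shares_for_period([11, 23, 31], 88))   # (48, 23, 17): works 528, 529, 527 as above
```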
a simple scaling; this processor then broadcasts these updated elements to all
other processors; all processors can then update all elements of columns k + 1
and higher on rows k + 1 and higher. Note that this algorithm is non-blocked,
in the sense that columns are processed one at a time. While this is inefficient
on processors with a memory hierarchy, it does not change the spirit of the
algorithm. Let us consider the execution of this algorithm on p processors, Pi , i = 1, . . . , p, with cycle times t1 , . . . , tp .
The performance analysis of the LU factorization algorithm in Section 4.4
shows that the bulk of the computation time is due to the column updates,
while the time for performing column preparation is asymptotically negligible.
Our goal here is then to determine an allocation of columns to the processors
that leads to a well-balanced execution of the column updates. Let us consider
the first step of the algorithm. Once the first column has been prepared, all
columns that need to be updated can be updated independently. Therefore, if
the matrix is of size n, n − 1 update tasks must be allocated to the processors.
A simple idea is to use Algorithm 6.1 to determine the column allocations.
While this would lead to good load balance during the first step of the algo-
rithm, unfortunately the number of columns to be updated decreases as the
algorithm makes progress. In the second step, only n − 2 columns need to be
updated, and thus the allocation of columns to processors need to be recom-
puted to lead to good load balancing. One then faces a conundrum. On the
one hand, as the algorithm makes progress the column allocation likely leads
to worsening load balance. On the other hand, redistributing the columns
among the processors at each step according to the optimal column allocation
at this step makes the algorithm much more complicated and would likely
cause large overhead. One way to strike a compromise would be to only
re-allocate columns periodically, every x iterations, where x is chosen appro-
priately given n and the characteristics of the underlying platform. However,
it is difficult to determine the best choice for x and one must resort to some
empirical method based on previous runs and benchmarks.
It turns out that there is an elegant solution to the above conundrum,
which does not require column re-allocations among the processors and which
achieves optimal load balance at each step of the algorithm. Let us denote
the (n − k − 1) update tasks at step k = 0, . . . , n − 1, of the algorithm as
uk+1 , . . . , un−1 . This is a slight abuse of notation as we do not use a subscript
to indicate the algorithm step, meaning that, for instance, un−1 denotes the
last update task at all steps. We seek an allocation of u1 , . . . , un−1 onto the
p processors, with the following constraint: for each i ∈ {1, . . . , n − 1}, the
number of tasks allocated to processor Pj among tasks ui , . . . , un−1 should be
(approximately) proportional to its cycle time tj .
The attentive reader will recognize that the above constraint corresponds
almost exactly to the task allocation computed by our incremental al-
gorithm from Section 6.1.2. The only difference is that we need to run the
algorithm in reverse. Indeed, the algorithm produces an optimal task alloca-
tion for all subsets [1, i] of [1, n − 1], while we need an optimal task allocation
for all subsets [i, n − 1] of [1, n − 1]. Let us go back to our example of 3 pro-
cessors with cycle times t1 = 3, t2 = 5 and t3 = 8 for π = 10 column update
tasks. To obtain our desired allocation for the LU factorization algorithm we
can just read the table on the left-hand side of Figure 6.1 from bottom to top,
thus allocating column 0 to P3 , column 1 to P2 , and so on. The entire pattern
is (P3 P2 P1 P1 P2 P1 P3 P1 P2 P1 ). For illustration purposes Figure 6.3 depicts two
allocations. On the left-hand side the figure shows the “reversed” allocation
computed by the above algorithm, which is optimally load-balanced at each
step of the LU factorization. On the right-hand side the figure shows the
“non-reversed” allocation, which is optimally load-balanced only for the first
step of the LU factorization. Above each allocation we plot the compute time
at each algorithm step, thus showing that using the reversed heterogeneous
allocation leads to better performance.
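The following sketch reproduces this comparison numerically, under the simple model that the compute time of step k is the maximum, over the processors, of (cycle time × number of columns of index ≥ k that they own); the model coding and all names are ours.

```python
# Reversed versus non-reversed allocation for the LU-like column updates.
def total_time(pattern, t):
    total = 0
    for step in range(len(pattern)):
        remaining = pattern[step:]                    # columns still to be updated
        loads = [t[j] * remaining.count(j + 1) for j in range(len(t))]
        total += max(loads)                           # makespan of this step
    return total

t = [3, 5, 8]
forward = [1, 2, 1, 3, 1, 2, 1, 1, 2, 3]              # incremental allocation, in order
reversed_alloc = forward[::-1]                        # (P3 P2 P1 P1 P2 P1 P3 P1 P2 P1)
print(total_time(reversed_alloc, t), total_time(forward, t))   # 99 116
```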
FIGURE 6.3: The reversed allocation (left, total time = 99) and the non-reversed allocation (right, total time = 116); the compute time at each step is plotted above each allocation.
Let us note that depending on the cycle time of the processors and on the
size of the matrices, it is not always possible to achieve perfect load balancing.
See the example shown in Figure 6.6, for a 2 × 2 processor grid with the
four processor cycle times indicated in the corresponding matrix blocks. We
arbitrarily normalize r1 and c1 to 1. We observe that to balance the load
between processor P1,1 and processor P1,2 , we need c2 = 1/2 (which may only
[Figure 6.6: a 2 × 2 processor grid example with t1,1 = 1 and t1,2 = 2; the figure marks one processor as underutilized.]
The load balancing problem can thus be stated as the following optimization problem:

min_{(Σi ri = 1 ; Σj cj = 1)}  max_{i,j} { ri × cj × ti,j } .
Given a solution to this problem, one computes the actual dimensions of the
rectangular blocks assigned to each processor by multiplying the ri and cj
values by n, the original matrix dimension. One then rounds all values to
integers so that the sums of the ri values and the sum of the cj values are
both equal to n.
The above formulation of the load balancing problem as an optimization
problem does not lend itself to an easy solution. But at any rate, the problem
that we need to solve is much more complex. Indeed, the above optimization
problem is for a given arrangement of the processors in a 2-D grid. However,
there are p! such arrangements! So we must compute the optimal solution
to the optimization problem for all possible arrangements, and then pick the
optimal solution for the optimal arrangement. As expected, this problem is
NP-complete, which is shown in the next section.
that the processors be arranged in a grid at all! At each step of the outer-
product algorithm, there is a horizontal broadcast of a column of A and a
vertical broadcast of a row of B. These broadcasts involve different numbers
of source and destination processors depending on where the column or row is
located. The figure shows an example vertical broadcast of a row of B, which
is partially held by four processors. For instance, the processor holding the
top-right block of the matrix is involved in receiving data from all four source
processors, while the processor holding the bottom-left block of the matrix is
involved in receiving data from only one source processor.
At each step each destination processor receives an amount of data propor-
tional to the half-perimeter of the rectangular block it holds, typically from
more than two source processors. This is in contrast with the data distribu-
tion seen in Figure 6.4, with which at each step each destination processors is
only engaged in two receives (one horizontal and one vertical).
rectangular blocks are fixed, one can adjust their shapes to lead to the lowest
amount of communication.
Depending on the network used by the underlying physical platform, dif-
ferent communication models are possible. For instance, all communications
could be sequential if the processors are interconnected by, say, a non-switched
Ethernet network. This could be the case if the participating processors are
heterogeneous workstations in some laboratory for instance. Probably more
typical for modern parallel platforms, such as heterogeneous commodity clus-
ters, the communications can happen concurrently due to the use of a switched
interconnect. If communications can happen concurrently, then one wishes to
minimize the maximum of the half-perimeters of the rectangular blocks. If
instead the communications happen sequentially, then one wishes to minimize
the sum of the half-perimeters.
Given the above, we can now see one of the key points highlighted in Chap-
ter 5: if the platform consists of q 2 homogeneous processors one can achieve
lower communication costs by using a 2-D distribution on a grid of proces-
sors than by using a 1-D distribution on a ring. On a ring, the sum of the
half-perimeters of the matrix blocks is 1 + q², while on a q × q grid it is q + q = 2q.
Considering only the geometrical interpretation of the problem, both optimization problems above are easily stated as follows: how can one partition a unit square into p rectangles with given areas s1 , s2 , . . . , sp , with Σ_{i=1}^{p} si = 1, in a way that minimizes:
FIGURE 6.9: Column partition of the unit square with C = 3 columns containing k1 = 5, k2 = 3, and k3 = 4 rectangles, respectively.
The algorithm to solve this problem uses dynamic programming and relies
on the two following ideas:
1. It renumbers variables s1 , . . . , sp so that s1 ≤ s2 ≤ . . . ≤ sp .
2. It iteratively constructs p functions fC for values of C going from 1 to p. For q ∈ {1, . . . , p}, fC (q) is defined as the sum of the half-perimeters in an optimal partition of a rectangle with height 1 and width Σ_{i=1}^{q} si into C columns and q rectangles with areas s1 , . . . , sq .
The key idea behind the algorithm is that it is straightforward to compute function fC recursively based on function fC−1 as follows:

fC (q) = min_{a ∈ [C−1, q−1]} ( 1 + (q − a) Σ_{a < i ≤ q} si + fC−1 (a) )        (6.1)
TABLE 6.1: Values of fC (q) and of a0 (separated by a “|”) for our 8-rectangle example. In each column, the value marked with a star is the optimal one.

C \ q   q=1        q=2        q=3        q=4        q=5        q=6        q=7        q=8
C = 1   1.05*| 0   1.2*| 0    1.54*| 0   2.12*| 0   2.9*| 0    4 | 0      5.90 | 0   9 | 0
C = 2              2.10 | 1   2.28 | 2   2.56 | 2   2.94 | 3   3.50*| 3   4.38*| 4   5.76 | 5
C = 3                         3.18 | 2   3.38 | 3   3.66 | 4   4 | 4      4.58 | 5   5.50*| 6
C = 4                                    4.28 | 3   4.48 | 4   4.78 | 5   5.20 | 6   5.88 | 7
C = 5                                               5.38 | 4   5.60 | 5   5.98 | 6   6.50 | 7
C = 6                                                          6.50 | 5   6.80 | 6   7.28 | 7
C = 7                                                                     7.70 | 6   8.10 | 7
C = 8                                                                                9 | 7
The recursion is initialized with f1 (q) = 1 + q × Σ_{i=1}^{q} si , which simply gives the sum of the half-perimeters of a rectangle with height 1 and width Σ_{i=1}^{q} si partitioned into 1 column and q rectangles with areas s1 , . . . , sq .
To better understand how the algorithm works, let us apply it on an exam-
ple. Consider p = 8 rectangles with areas (0.05; 0.05; 0.08; 0.1; 0.1; 0.12; 0.2; 0.3).
The algorithm recursively computes all fC (q) values for 1 ≤ q, C ≤ 8, as shown in Table 6.1. In each column of the table we mark the optimal value, i.e., the one with the smallest fC (q) value, with a star. Since we wish to partition the
unit square in 8 rectangles, we look at column q = 8 and find that the optimal
fC (q) value, 5.5, is achieved for C = 3, indicating that the optimal partition
consists of 3 columns. Furthermore, the optimal fC (q) value is achieved for
a0 = 6. Therefore, the last column of the optimal partition must contain
8 − 6 = 2 rectangles. We now look at column q = 6 of the table and find out
that the optimal fC (q) value is achieved for a0 = 3. Therefore, the next-to-
last column of the optimal partition must contain 6 − 3 = 3 of the remaining
rectangles. Similarly, we determine that in column q = 3 of the table the
optimal fC (q) value is achieved for a0 = 0. The first column of the optimal
partitioning consists of all remaining 3 − 0 = 3 rectangles, which makes sense
since we know that the optimal partitioning consists of 3 columns. The widths
of the three columns in the optimal partitioning are thus c1 = s1 + s2 + s3 ,
c2 = s4 + s5 + s6 , and c3 = s7 + s8 . This partitioning is shown in Figure 6.10.
FIGURE 6.10: The optimal partition of the unit square in our example, with three columns of widths c1 = 0.18, c2 = 0.32, and c3 = 0.5.
1 PARTITION(s1 , . . . , sp )
2 S=0
3 for q = 1 to p do
4   S = S + sq
5   f1 (q) = 1 + S × q
6   f1cut (q) = 0
7 for C = 2 to p do
8   for q = C to p do
9     fC (q) = min_{a ∈ [C−1, q−1]} ( 1 + (q − a) Σ_{a < i ≤ q} si + fC−1 (a) )
10    fCcut (q) = a0 { where a0 is the value of a that leads to the minimum in the previous expression }
11 return (f∗cut )
computed as follows:
1 RE-BUILD(f∗cut , Copt )
2 q=p
3 for C = Copt down to 1 do
4 kC = q − fCcut (q)
5 q = fCcut (q)
6 return (k1 , . . . , kCopt )
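The dynamic program is short enough to transcribe; the sketch below reproduces the 8-rectangle example of Table 6.1 (names are ours, and fCcut stores the split point a0 , as discussed above).

```python
# Column-partition dynamic program (PARTITION / RE-BUILD), returning the best
# number of columns, the corresponding cost, and the rectangles per column.
def partition(s):
    s = sorted(s)                             # s_1 <= s_2 <= ... <= s_p
    p = len(s)
    prefix = [0.0] * (p + 1)
    for i, v in enumerate(s):
        prefix[i + 1] = prefix[i] + v
    INF = float("inf")
    f = [[INF] * (p + 1) for _ in range(p + 1)]    # f[C][q]
    cut = [[0] * (p + 1) for _ in range(p + 1)]
    for q in range(1, p + 1):
        f[1][q] = 1 + prefix[q] * q
    for C in range(2, p + 1):
        for q in range(C, p + 1):
            for a in range(C - 1, q):
                cost = 1 + (q - a) * (prefix[q] - prefix[a]) + f[C - 1][a]
                if cost < f[C][q]:
                    f[C][q], cut[C][q] = cost, a
    C_opt = min(range(1, p + 1), key=lambda C: f[C][p])
    k, q = [], p
    for C in range(C_opt, 0, -1):             # RE-BUILD: column sizes, last column first
        k.append(q - cut[C][q])
        q = cut[C][q]
    return C_opt, f[C_opt][p], list(reversed(k))

print(partition([0.05, 0.05, 0.08, 0.1, 0.1, 0.12, 0.2, 0.3]))
# -> (3, ~5.5, [3, 3, 2]): three columns of widths 0.18, 0.32 and 0.5
```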
Performance Guarantee
In this section, we show that column partitioning leads to a good approxi-
mation of the optimal (free) partition. This is especially true when the ratio
between the largest rectangle area, max si , and the smallest area, min si , is
low.
Ĉ / LB ≤ √r (1 + 1/√p) = √(max si / min si) (1 + 1/√p) .

And therefore,

Ĉ* / (2 Σi √si) ≤ 1/(Σi √si) + √r/2 + p / (2 √r (Σi √si)²) .

Furthermore,

Σi si = 1  ⟹  p · max si ≥ 1  ⟹  min si ≥ 1/(p r) ,

which leads to:

Σi √si ≥ p √(min si) ≥ √(p/r) .

Finally, we obtain:

Ĉ* / (2 Σi √si) ≤ √(r/p) + √r/2 + √r/2 ≤ √r (1 + 1/√p) .

Since Ĉ corresponds to the best solution among all the possible column partitionings, we have Ĉ ≤ Ĉ*, which completes the proof.
Bibliographical Notes
The load balancing results for 1-D data distributions presented in the first
part of this chapter are well known, and we refer the reader to the referenced
work therein for further details. The section on load balancing for 2-D data
distributions comes for the most part from the Ph.D. thesis by Rastello [99] and
related articles, which contain many results. Generally speaking, the literature
is rife with works that study load balancing problems and we explore some of
them in the exercises accompanying this chapter.
6.4 Exercises
The first exercise is a straightforward demonstration that rather than facing
the difficulty of load balancing for 2-D data distributions, an easier option is
to transform a “grid algorithm” into a “ring algorithm.” The second exercise
studies load balancing for the LU factorization on a heterogeneous grid of
processors. Finally, the third exercise discusses optimal load balancing for a
stencil application on a heterogeneous ring of processors (inspired by the work
in [79], to which we refer the reader for more details).
1. Consider the case in which the matrix of processor cycle times is of rank
1. In this case, we know that perfect load balancing is possible for the whole
matrix. In other words, we can find ri and cj , i = 1, . . . , p, j = 1, . . . , q, such that ri ti,j cj = 1 for all i, j, and such that Σi ri = 1 and Σj cj = 1. Processor
Pi,j is thus allocated a ri × cj block of the matrix (which are really normalized
and must be multiplied by n and rounded to integer values). See Section 6.2.1
for all details. Show that if the matrix of processor cycle times is of rank 1,
then it is possible to distribute columns and rows to processors such that the
distribution is optimal at each step of the LU factorization.
1. Let us first consider the case of a ring with heterogeneous processors and homogeneous network links: the time for processor Pi to process a column is wi ; the time for a processor to exchange a column with both its successor and predecessor is c × D (where c is the time to exchange one unit of data with both its neighbors). Show that the optimal allocation uses either one or all processors. Give a sufficient condition for the optimal allocation to use all processors.
6.5 Answers
Exercise 6.1 (Matrix Product on a Heterogeneous Ring)
. Question 1. Just consider the ring as a rectangular 1×p grid and simply run
the outer-product algorithm on this grid. For a C = A × B product, the three
matrices are distributed across the processors by column blocks of r = n/p
columns. At step k = 1, . . . , p, the processor that holds the k-th column block
of A broadcasts it to all other processors. This is the only communication
since vertical broadcasts of B are now replaced by local memory accesses
within block rows.
r′i (k) × c′j (k) elements of sub-matrix A(k). We easily see that:

T (k) = max_{i,j} { r′i (k) × ti,j × c′j (k) } = max_{i,j} { (r′i (k)/ri ) × (c′j (k)/cj ) } ,

since ri ti,j cj = 1. One can then separate the maxima, which are independent, to obtain:

T (k) = max_i ( r′i (k)/ri ) × max_j ( c′j (k)/cj ) .
The objective is to minimize the above quantity for all k. It turns out that
the static load balancing algorithm from Section 6.1.1, whose allocation is
used in reverse, can be used to minimize both these maxima because it leads
to 1-D data distributions that are optimal at each step of the LU factoriza-
tion. (Minimizing the maximum of the, for instance, ri0 (k)/ri ratio for all i
is equivalent to ensuring that each processor has an amount of work that is
as commensurate as possible to its cycle time.) We conclude that using the
allocation produced by this algorithm in reverse and along both dimensions
produces a data distribution optimal at each step!
. Question 2. Consider a 3 × 3 processor grid with the cycle times depicted
in Figure 6.13. There is one fast processor and 8 slow processors. We must
show that the optimal allocation of matrix elements for a 3 × 3 matrix cannot
be constructed based on the optimal allocation of matrix elements for a 2 × 2
matrix.
For a 2 × 2 matrix there are only three possible kinds of allocations of
matrix elements to the processors: (i) the fast processor is assigned all matrix
elements; (ii) the fast processor is assigned only one matrix element; (iii) the
fast processor is assigned no matrix element. In case (i), the time to update
the matrix is 4 (the fast processor updates 4 elements sequentially). In the
other cases, since a slow processor is involved, the time to update the matrix
is at least 5. So the best distribution is to allocate all matrix elements to the
fast processor.
Chapter 7
Scheduling
7.1 Introduction
This chapter presents basic (but important) results on task graph schedul-
ing. We start with a motivating example before providing a quick overview
of models and complexity results.
for i = 1 to n do
Task Ti,i : xi ← bi /ai,i
for j = i + 1 to n do
Task Ti,j : bj ← bj − aj,i × xi
T1,1 <seq T1,2 <seq T1,3 <seq . . . <seq T1,n <seq T2,2 <seq T2,3 <seq . . . <seq Tn,n .
However, there are independent tasks that can be executed in parallel. Intu-
itively, independent tasks are tasks whose execution order can be interchanged
without modifying the result of the program execution. A necessary condi-
tion for tasks to be independent is that they do not update the same variable.
They can read the same value, but they cannot write into the same memory
location (otherwise there would be a race condition and the result would be
non-deterministic). For instance tasks T1,2 and T1,3 both read x1 but modify
distinct components of b, hence they are independent.
We can express this notion of independence more formally. Each task T has
an input set In(T ) (read values) and an output set Out(T ) (written values). In
our example, In(Ti,i ) = {bi , ai,i } and Out(Ti,i ) = {xi }. For j > i, In(Ti,j ) =
{bj , aj,i , xi } and Out(Ti,j ) = {bj }. Two tasks T and T ′ are not independent (we write T ⊥ T ′ ) if they share some written variable:

T ⊥ T ′  ⇔  In(T ) ∩ Out(T ′ ) ≠ ∅ , or Out(T ) ∩ In(T ′ ) ≠ ∅ , or Out(T ) ∩ Out(T ′ ) ≠ ∅ .
The dependence relation is then defined as ≺ = (<seq ∩ ⊥)+ , where + denotes the transitive closure. In other words, we take the transitive
closure of the intersection of ⊥ and <seq to derive the set of all constraints
that need to be satisfied to preserve the semantics of the original program. In
a sense, ≺ captures the intrinsic sequentiality of the original program. The
original total ordering <seq was unduly restrictive, only the partial ordering
≺ needs to be respected. Why do we need to take the transitive closure of
<seq ∩ ⊥ to get a correct definition of ≺? In our example, we have T2,4 ⊥T4,4
(which is not a predecessor relationship, as there is T3,4 in between) and
T4,4 ⊥T4,5 , hence a path of dependences from T2,4 to T4,5 , while we do not
have T2,4 ⊥T4,5 . We need to track dependence chains to define ≺ correctly.
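A tiny sketch of this dependence test (Bernstein's conditions), applied to a few tasks of the triangular solver; the variable names used in the sets are illustrative.

```python
# Two tasks are not independent if one writes something the other reads or writes.
def dependent(task_a, task_b):
    in_a, out_a = task_a
    in_b, out_b = task_b
    return bool(in_a & out_b or out_a & in_b or out_a & out_b)

# In/Out sets of the triangular-solver tasks, for n = 4
T = {}
for i in range(1, 5):
    T[(i, i)] = ({f"b{i}", f"a{i}{i}"}, {f"x{i}"})
    for j in range(i + 1, 5):
        T[(i, j)] = ({f"b{j}", f"a{j}{i}", f"x{i}"}, {f"b{j}"})

print(dependent(T[(1, 2)], T[(1, 3)]))   # False: both read x1 but write different b's
print(dependent(T[(1, 2)], T[(2, 2)]))   # True: T1,2 writes b2, which T2,2 reads
```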
We can draw a directed graph to represent the dependence constraints that
need to be enforced. The vertices of the graph denote the tasks, while the
edges express the dependence constraints. An edge e : T → T 0 in the graph
means that the execution of T 0 must begin only after the end of the execution
of T , whatever the number of available processors. We do not usually draw
transitivity edges on the graph, as they represent redundant information; if
T ≺ T 0 and T 0 ≺ T 00 and if there exists a dependence T ⊥T 00 , then it will be
7.1. Introduction 209
We end up with the graph shown in Figure 7.1. We will use this graph several
times in this chapter for illustrative purposes.
7.1.2 Overview
This chapter presents classic theorems and algorithms from scheduling the-
ory. The communication model used by this theory is rather unrealistic but
it makes it possible to obtain fundamental complexity results.
We start with the most simple (one might say crude) model where all
communication delays between processors are neglected. We introduce ba-
sic definitions in Section 7.2. When there is no restriction on the number of
available processors, optimal schedules can be found in polynomial time, as
shown in Section 7.3. Section 7.4 deals with a limited number of processors;
the scheduling problem becomes NP-complete, and so-called list scheduling
heuristics are the typical approach. An elegant and powerful theorem shows
that any list scheduling algorithm generates a schedule that is no longer than
twice the optimal schedule (Section 7.4.2). We continue on the theoretical
side: in Section 7.4.5, we discuss the scheduling of independent tasks, and we
derive arbitrarily good approximation algorithms, i.e., polynomial algorithms whose performance is guaranteed to be within a factor (1 + ε) of the optimal, for any arbitrary ε > 0.
Next we move to the classical scheduling model in which communication
costs are taken into account each time two dependent tasks are not assigned
to the same processor. We detail this model in Section 7.5. In this case,
even the problem with unlimited processors is NP-complete, as explained in
Section 7.6. We present heuristics for p identical processors in Section 7.7
and briefly discuss how to extend these heuristics to handle heterogeneous
processors in Section 7.8.
This condition expresses the fact that if two tasks T and T 0 are allocated to
the same processor, then their executions cannot overlap in time.
2 This is not a restriction; task weights can be rational numbers. However, because there is a finite number of tasks, the weights can always be scaled up to integers.
3 In fact we need no more processors than the total number of tasks.
[Figure: dependence graph with vertices T1,1, T2,2, T3,3, T4,4, T4,5, T4,6, T5,5, T5,6, T6,6; the edges were lost in extraction.]
Theorem 7.1 states that scheduling deals with directed acyclic graphs (or
DAGs):
Our last definition introduces the notions of speedup and efficiency for
schedules (see [77] for a detailed discussion of speedup and efficiency):
Seq is the optimal execution time MSopt (1) of a schedule with a single
processor. We have the following well-known result:
[Figure: Gantt chart of a schedule on processors P1–P4 (vertical axis) versus time (horizontal axis), distinguishing active and idle periods.]
Another way to state Theorem 7.2 is to say that the speedup with p proces-
sors is always bounded by p. No superlinear speedup with our model! Here is
an easy result to conclude this section: the more processors, the smaller (or
equal) the optimal makespan.
Seq = MSopt (1) ≥ . . . ≥ MSopt (p) ≥ MSopt (p + 1) ≥ . . . ≥ MSopt (∞) .
We are now ready to address the search for optimal schedules. Not sur-
prisingly, it turns out that the problem Pb(p) with limited processors is more
difficult than Pb(∞), whose solution is explained in the next section.
Proof. The proof has two parts. First we show that σf ree is indeed a schedule,
then we derive its optimality. Both are easy:
The free schedule σf ree is also known as the as soon as possible (ASAP)
schedule.
Hence, MSopt (∞) is simply the maximal weight of a path in the graph. Note
that σf ree is not the only optimal schedule. For example the as late as possible
(ALAP) schedule σlate is also optimal. We define σlate as follows:
To understand the definition, note that bl(v) is the maximal weight of a path
from v to exit nodes, hence the need to start the execution of v no later than
MS(σf ree , ∞) − bl(v) if all tasks must have completed within MS(σf ree , ∞)
time units.
Proof. From Theorem 7.4 we know that the optimal schedule σf ree can be
computed using top levels and that MSopt (∞) is the maximal weight of a
path in the graph. Because G is acyclic, these quantities can be computed by
a traversal of the graph, hence the complexity O(|V | + |E|).
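Since these quantities come from a simple graph traversal, they are easy to compute in practice. Below is a minimal sketch (ours, with our own data layout) that computes the ASAP start times (top levels) and MSopt(∞) in O(|V| + |E|):

from collections import deque

def asap_schedule(weights, succ):
    # weights: task -> weight; succ: task -> list of successors
    indeg = {t: 0 for t in weights}
    for t in weights:
        for s in succ.get(t, []):
            indeg[s] += 1
    top = {t: 0 for t in weights}                    # ASAP start times
    ready = deque(t for t, d in indeg.items() if d == 0)
    while ready:
        t = ready.popleft()
        for s in succ.get(t, []):
            top[s] = max(top[s], top[t] + weights[t])
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    makespan = max(top[t] + weights[t] for t in weights)   # maximal path weight
    return top, makespan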
Going back to the triangular system (Figure 7.1), because all tasks have weight 1, the weight of a path is equal to its length plus 1. The longest path is
T1,1 → T1,2 → T2,2 → . . . → Tn−1,n−1 → Tn−1,n → Tn,n ,
whose weight is 2n − 1. Note that we do not need as many processors as tasks to achieve execution within 2n − 1 time units. For example, we can use only n − 1 processors. Let 1 ≤ i ≤ n; at time 2i − 2, processor P1 starts the execution of task Ti,i , while at time 2i − 1, the first n − i processors P1 , P2 , . . ., Pn−i execute tasks Ti,j , i + 1 ≤ j ≤ n.
THEOREM 7.5.
• Indep-tasks(2) is NP-complete, but can be solved by a pseudo-polynomial algorithm.
Proof. First, Dec(p) (and hence all the other problems, which are restrictions
of it) belongs to NP: if we are given a schedule σ whose makespan is less
than or equal to K, we can check in polynomial time that both dependences
and resource constraints are satisfied. Indeed, we have to ensure that each
dependence constraint (each edge in E) is satisfied, which is straightforward.
Also, we need to check that no more than p tasks ever execute simultaneously.
We can sort the tasks by their starting time, and check the latter condition
by scanning the sorted array. This can easily be done in time polynomial in
the size of the problem instance.
For proving the NP-completeness of Indep-tasks(2), we show that 2-Partition can be polynomially reduced to Indep-tasks(2). Consider an arbitrary instance Inst1 of 2-Partition, with n integers {a1 , a2 , . . . , an }, and let S = Σ_{i=1}^{n} ai be even (otherwise we know there is no solution). We build an instance Inst2 of Indep-tasks(2) as follows. We let p = 2 (of course), G = (V, E, w) with V = {v1 , v2 , . . . , vn }, E = ∅, and w(vi ) = ai , 1 ≤ i ≤ n. We also let K = S/2. The construction of Inst2 is polynomial (and even linear) in the size of Inst1 . Moreover, Inst1 has a solution if and only if there exists a schedule that meets the bound K, hence if and only if Inst2 has a solution.
The pseudo-polynomial algorithm to solve Indep-tasks(2) is a simple dynamic programming algorithm. For 1 ≤ i ≤ n and 0 ≤ T ≤ S, let the boolean variable c(i, T ) be true if there exists a subset of {a1 , a2 , . . . , ai } whose sum is T. We need to determine the value of c(n, S/2). We use the induction
c(i, T ) = c(i − 1, T ) ∨ c(i − 1, T − ai ) ,
which basically states that either ai is involved in the target subset, or not. The initialization is c(1, a1 ) = 1, c(i, 0) = 1 for all i, and all other boundary values are set to 0. The complexity of the algorithm is O(nS), which is not polynomial in the problem size, whose typical binary encoding would be O(n + Σ_{i=1}^{n} log ai ). (However, if the ai 's are encoded in unary, we have polynomial complexity, which is the definition of a pseudo-polynomial algorithm.)
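For concreteness, here is a minimal sketch (ours) of this dynamic program; it answers whether the independent tasks can be scheduled on two processors within makespan K = S/2:

def indep_tasks_2_feasible(a, K):
    # Pseudo-polynomial feasibility test for Indep-tasks(2) with bound K.
    S = sum(a)
    if S - K > K:                        # one processor gets at least S - K work
        return False
    reachable = [False] * (K + 1)        # reachable[T]: some subset sums to T
    reachable[0] = True
    for ai in a:
        for T in range(K, ai - 1, -1):   # backwards so each task is used once
            reachable[T] = reachable[T] or reachable[T - ai]
    return any(reachable[T] and S - T <= K for T in range(K + 1))

# Example: weights {3, 1, 1, 2, 2, 1}, S = 10, K = 5 -> True
print(indep_tasks_2_feasible([3, 1, 1, 2, 2, 1], 5))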
The reduction for the strong NP-completeness of Indep-tasks(p) is straight-
forward. Consider an arbitrary instance Inst1 of 3-PARTITION, with 3n
integers {a1 , a2 , . . . , a3n } and bound B as stated above. The instance Inst2
of Indep-tasks(p) is built with 3n independent tasks of weight ai , p = n pro-
cessors and K = B. Clearly, Inst1 has a solution if and only if there exists a
schedule that meets the bound K, hence if and only if Inst2 has a solution.
[Figure: DAG used in the reduction, with tasks X1 , X2 , X3 , . . . , Xn and additional tasks Y1 , . . . , Yn and Z1 , . . . , Zn ; the edge structure was lost in extraction.]
• the first processor P1 executes all 2n tasks Xi and Yi . These tasks are
totally ordered along a dependence path of length K = 2nB.
decide which tasks are given priority in the (frequent) case where there are
more free tasks than available processors. But a key result due to Coffman [42]
is that any list algorithm can be shown to achieve at most twice the optimal
makespan. We express this more formally after giving some definitions.
It is important to point out that Theorem 7.6 holds for any list schedule,
regardless of the strategy to choose among free tasks when there are more free
tasks than available processors.
Proof. Let Ti1 be a task whose execution terminates at the end of the schedule, i.e., σ(Ti1 ) + w(Ti1 ) = MS(σ, p).
Let t1 be the largest time smaller than σ(Ti1 ) and such that there exists an
idle processor during the time interval [t1 , t1 +1[ (let t1 = 0 if such a time does
not exist). Why is this processor idle? Because σ is a list schedule, no task
is free at t1 , otherwise the idle processor would start executing a free task.
Therefore, there must be a task Ti2 that is an ancestor 4 of Ti1 and that is
being executed at time t1 ; otherwise Ti1 would have been started at time t1 by
the idle processor. Because of the definition of t1 we know that all processors
4 The ancestors of a task are its predecessors, the predecessors of its predecessors, and so
on.
220 Chapter 7. Scheduling
are active between the end of the execution of Ti2 and the beginning of the
execution of Ti1 .
We start the construction again from Ti2 so that we obtain a task Ti3 such
that all processors are active between the end of Ti3 and the beginning of
Ti2 . Iterating the process, we end up with r tasks Tir , Tir−1 , . . . , Ti1 that
belong to a dependence path Φ of G and such that all processors are active
except perhaps during their execution. In other words, the idleness of some
processors can only occur during the execution of these r tasks, during which
at least one processor is active (the one that executes the task). Hence,
Idle ≤ (p − 1) × Σ_{j=1}^{r} w(Tij ) ≤ (p − 1) × w(Φ).
Fundamentally, Theorem 7.6 says that any list schedule is within 50% of
the optimum. Therefore, list scheduling is guaranteed to achieve half the best
possible performance, regardless of the strategy to choose among free tasks.
Before presenting the most widely used strategy to perform this choice (in
order to obtain a practical scheduling algorithm), we make a short digression
to show that the bound (2p − 1)/p cannot be improved.
PROPOSITION 7.2. Let MSlist (p) be the shortest possible makespan produced by a list scheduling algorithm. The bound
MSlist (p) ≤ ((2p − 1)/p) MSopt (p)
is tight.
FIGURE 7.3: The DAG used to bound list scheduling performance; task weights are indicated as exponents inside parentheses. [From the extraction: task Tp has weight 1, tasks T1 , . . . , Tp−1 have weight K(p − 1), and task T2p+1 has weight K(p − 1); the edges of the DAG were lost.]
However, the DAG can be scheduled in only Kp+1 time units. The key is to
deliberately keep p−1 processors idle while executing task Tp at time 0 (which
is forbidden in a list schedule). Then, at time 1, each processor executes one of
the p tasks Tp+1 , Tp+2 , . . . , T2p . At time 1 + K one processor starts executing
T2p+1 while the other p − 1 processors execute tasks T1 , T2 , . . . , Tp−1 . This
defines a schedule with a makespan equal to 1 + K + K(p − 1) = Kp + 1, which
is optimal because it is equal to the weight of the path Tp → Tp+1 → T2p+1 .
Hence, we obtain the ratio
MS(σ, p) / MSopt (p) ≥ K(2p − 1) / (Kp + 1) = (2p − 1)/p − (2p − 1)/(p(Kp + 1)) = (2p − 1)/p − ε(K),
where ε(K) tends to 0 as K grows, which shows that the bound is tight.
1. Initialization:
(a) Compute the priority of all tasks, for some definition of priority.
(b) Place the free tasks in a priority queue, sorted by non-increasing
priority.
(c) Let t be the current time: t = 0.
2. While there remain tasks to schedule:
(a) Add new free tasks, if any, to the priority queue. If the execution of
a task terminates at time t, remove this task from the predecessor
list of all its successors. Add those tasks whose predecessor list has
become empty to the priority queue.
(b) If there are q available processors and r tasks in the priority queue,
remove the first min(q, r) tasks from the priority queue and sched-
ule them; for each such task T set σ(T ) = t.
(c) Increment t by one (recall that task weights are integers).
Let G = (V, E, w) be a DAG and assume there are p available processors.
Let σ be any list schedule of G. Our aim is to derive an implementation
whose complexity is O(|V | log |V | + |E|) for computing the schedule. Clearly,
the above scheme must be modified because time varies from t = 0 up to
t = MS(σ, p), implying that the complexity depends on task weights. Indeed,
MS(σ, p) may be of the order of Seq, the sum of all task weights, and we would
have a pseudo-polynomial algorithm instead of a true polynomial algorithm; a
binary encoding of the problem instance is of size log(Seq), not of size Seq. We
outline a possible solution written in pseudo-code in Algorithm 7.1. Rather
than using time we use events which correspond to times when tasks become
free or processors become idle.
A few words of explanation are in order for Algorithm 7.1. We use a heap Q
(see [44]) to store free tasks, for two reasons: we can access the task with highest priority in constant time, and we can insert a task in the heap, according to
its priority level, in time proportional to the logarithm of the heap size, which
is bounded by |V |. We use another heap P to handle active processors; a
processor executing a task v ∈ V is valued by the time at which the execution
of v terminates. Thereby we can compute the next event in constant time, and
we can insert a new active processor in the heap in O(log |P|) ≤ O(log |V |)
time. When we extract a processor from the processor heap, meaning that a
task v has terminated, we need to update the in-degree of each successor of v
in array A. On the fly, if the in-degree of a given successor v 0 becomes zero,
we insert v 0 in the priority heap Q. This way, we process each edge of G only
once, for a global cost O(|E|). Overall, each task causes two insertions: the
first is the insertion of the task itself in heap Q; the second is the insertion
of the processor that executes it in heap P. Because each operation costs
at most O(log |V |), we obtain the desired complexity O(|V | log |V | + |E|) for
computing the schedule.
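To make this event-driven scheme concrete, here is a minimal sketch (ours; task priorities are an arbitrary input) of a list scheduler using two heaps, one for free tasks and one for busy processors:

import heapq

def list_schedule(weights, succ, priority, p):
    # weights: task -> integer weight; succ: task -> successors;
    # priority: task -> priority (larger means scheduled first); p processors.
    indeg = {t: 0 for t in weights}
    for t in weights:
        for s in succ.get(t, []):
            indeg[s] += 1
    free = [(-priority[t], t) for t, d in indeg.items() if d == 0]
    heapq.heapify(free)                    # heap Q of free tasks
    running = []                           # heap P of (finish time, task)
    sigma, t, busy = {}, 0, 0

    def release(done):
        nonlocal busy
        busy -= 1
        for s in succ.get(done, []):
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(free, (-priority[s], s))

    while free or running:
        while free and busy < p:           # start as many free tasks as possible
            _, task = heapq.heappop(free)
            sigma[task] = t
            heapq.heappush(running, (t + weights[task], task))
            busy += 1
        t, done = heapq.heappop(running)   # next event: earliest completion
        release(done)
        while running and running[0][0] == t:
            release(heapq.heappop(running)[1])
    return sigma, t                        # start times and makespan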
TABLE 7.1: Weights and critical paths for DAG in Figure 7.4.
Tasks T1 T2 T3 T4 T5 T6 T7 T8
Weights 3 2 1 3 4 4 3 6
Critical Paths 3 6 7 3 4 4 3 6
FIGURE 7.5: Critical path schedule for the example DAG in Figure 7.4. [Gantt chart: P1 executes T1 then T6 ; P2 executes T2 , T5 , T4 ; P3 executes T3 , T8 , T7 ; time steps 0 to 10.]
FIGURE 7.6: Optimal schedule for the example DAG in Figure 7.4. [Gantt chart: P1 executes T1 then T8 ; P2 executes T2 , T5 , T4 ; P3 executes T3 , T6 , T7 ; the makespan is 9 time steps.]
Note that it is possible to schedule the DAG in only 9 time units, as shown
in Figure 7.6. The trick is to leave a processor idle at time t = 1 deliberately;
although it has the highest critical path, T8 can be delayed by two time
units. T5 and T6 are given preference to achieve a better load balance between
processors. How do we know that the schedule shown in Figure 7.6 is optimal?
Because Seq = 26, three processors require at least ⌈26/3⌉ = 9 time units. This small example illustrates the difficulty of scheduling with a limited
number of processors.
The rationale to sort the tasks is that the arrival of a big task in the end
may unbalance the whole execution. However, we need to know all the tasks
(for sorting) before starting the execution of the algorithm. For this reason
SORTED-GREEDY is called an off-line algorithm. By contrast, GREEDY
can be applied to an on-line problem in which new tasks dynamically arrive
(e.g., to a dual-processor computer).
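As an illustration, here is a minimal sketch (ours) of the two heuristics for two processors: GREEDY handles tasks in the order in which they arrive, SORTED-GREEDY first sorts them by non-increasing weight.

def greedy_2(weights):
    # GREEDY for Indep-tasks(2): give each task to the currently less loaded processor.
    loads = [0, 0]
    for a in weights:
        loads[0 if loads[0] <= loads[1] else 1] += a
    return max(loads)                         # makespan

def sorted_greedy_2(weights):
    # SORTED-GREEDY: same rule, after sorting tasks by non-increasing weight (off-line).
    return greedy_2(sorted(weights, reverse=True))

# Tight examples used in the proof of Theorem 7.7:
print(greedy_2([1, 1, 2]))                    # 3, while the optimal makespan is 2
print(sorted_greedy_2([2, 2, 2, 3, 3]))       # 7, while the optimal makespan is 6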
THEOREM 7.7. GREEDY is a 3/2-approximation and SORTED-GREEDY is a 7/6-approximation for Indep-tasks(2), and these approximation factors cannot be improved.
Proof. We first show that the bounds are tight (note that the list scheduling bound is 2 − 1/2 = 3/2). For GREEDY, take an instance with three tasks of weight 1, 1, and 2: GREEDY has a makespan of 3 while the optimal is 2. For SORTED-GREEDY, take five tasks, three of weight 2 and two of weight 3. SORTED-GREEDY has a makespan of 7, while the optimal is 6.
For the approximations, recall that MSopt ≥ Psum /2 and that MSopt ≥ ai for all i. Let us start with GREEDY. Let P1 and P2 be the two processors. Assume that P1 finishes execution last. Let M1 be the execution time on P1 (the sum of all task weights assigned to it) and M2 the execution time on P2 . Because P1 terminates last, M1 ≥ M2 . Of course M1 + M2 = Psum .
Let Tj be the last task that executes on P1 . Let M0 = M1 − aj be the
load of P1 before Tj (of weight aj ) is assigned to it. Why does the GREEDY
algorithm choose to assign Tj to P1 ? It can only be because at that time
226 Chapter 7. Scheduling
P2 has more load than (or the same load as) P1 : M0 is not larger than the
current load of P2 at that time, which itself is not larger than its final load
M2 (note that P2 may have been assigned more tasks after Tj was scheduled
on P1 ). Therefore, M0 ≤ M2 . To summarize, the makespan of the schedule computed by GREEDY is M1 , and
M1 = M0 + aj = (1/2)((M0 + M0 + aj ) + aj )
   ≤ (1/2)((M0 + M2 + aj ) + aj ) = (1/2)(Psum + aj )
   ≤ MSopt + aj /2 ≤ MSopt + MSopt /2 ,
hence proving the 3/2-approximation result for GREEDY.
For SORTED-GREEDY the same line of reasoning is used, but with a tighter bounding of aj than by MSopt . First, if aj ≤ (1/3)MSopt we obtain what we need, i.e., M1 ≤ (7/6)MSopt . But if aj > (1/3)MSopt , then necessarily j ≤ 4. Indeed, if Tj was the fifth task or higher, because task weights are sorted, there would be at least five tasks of weight greater than (1/3)MSopt ; in any schedule, including the optimal schedule, one processor would receive three of these tasks, a contradiction. Next we observe that the makespan achieved by SORTED-GREEDY is the same when scheduling all tasks as when scheduling only the first four tasks. But for any problem instance with n ≤ 4, SORTED-GREEDY is optimal, and M1 = MSopt .
Proof. Let Pmax = maxi ai and L = max(Psum /2, Pmax ). We already know that L ≤ MSopt . We call big jobs those tasks Ti whose weights are such that ai > εL and small jobs those such that ai ≤ εL. The number of big jobs is at most
Psum /(εL) ≤ 2L/(εL) = 2/ε = B .
Because ε is fixed, B is a constant, so there is a (possibly large but) constant number of big jobs. We temporarily forget about small jobs, consider only big jobs and search for the best schedule. There are 2^B possible schedules
(each big job can be assigned to either processor), which is a constant number again, so we try them all and keep the best one, say σ_opt^big. The resulting makespan MS_opt^big satisfies MS_opt^big ≤ MSopt because there are fewer jobs than in the original problem.
Now we extend σ_opt^big and schedule the small jobs after the big jobs, using GREEDY, and we obtain a schedule σ for the original problem. We claim that σ solves the problem, i.e., that MS(σ) ≤ (1 + ε)MSopt . If the makespans of σ_opt^big and σ are equal, then σ is optimal. Otherwise, σ terminates with a small job Tj , say on the first processor P1 . But the load of P1 before this last assignment could not exceed Psum /2, otherwise GREEDY would have assigned Tj on P2 (same proof as in Theorem 7.7). Hence,
MS(σ) ≤ Psum /2 + aj ≤ L + εL ≤ (1 + ε)MSopt ,
which proves the theorem.
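A minimal sketch (ours, with our own function names) of this approximation scheme: enumerate the at most 2^B assignments of big jobs, then complete each one greedily with the small jobs and keep the best.

from itertools import product

def approx_2_processors(a, eps):
    # (1 + eps)-approximation for Indep-tasks(2), following the big/small-job argument.
    Psum, Pmax = sum(a), max(a)
    L = max(Psum / 2, Pmax)                       # lower bound on MSopt
    big = [x for x in a if x > eps * L]
    small = [x for x in a if x <= eps * L]
    best = None
    for assignment in product((0, 1), repeat=len(big)):   # at most 2^(2/eps) choices
        loads = [0.0, 0.0]
        for x, proc in zip(big, assignment):
            loads[proc] += x
        for x in small:                            # finish with GREEDY on small jobs
            loads[0 if loads[0] <= loads[1] else 1] += x
        if best is None or max(loads) < best:
            best = max(loads)
    return best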
Theorem 7.8 is interesting, but the combinatorial search for the best as-
signment of big jobs can be prohibitively expensive when ε tends to 0. This
motivates our last result, which states that Indep-tasks(2) has an FPTAS.
(This will conclude our initiation to the fascinating world of approximation
schemes.)
THEOREM 7.9. ∀ε > 0, Indep-tasks(2) has a (1 + ε)-approximation whose complexity is polynomial in 1/ε.
Proof. We encode schedules as Vector Sets. The first component of each vector
represents the load of the first processor P1 , and the second component the
load of P2 . Here is the construction:
The idea is to add new vectors during each phase of the construction only
if they fall into empty boxes. In other words, boxes are small enough so that
we do not need several vectors per box. Two vectors [x1 , y1 ] and [x2 , y2 ] fall into the same box if and only if
x1 /∆ ≤ x2 ≤ ∆ x1   and   y1 /∆ ≤ y2 ≤ ∆ y1 .
The pruned construction is as follows:
We generate at most one vector per box. As a result the total number of vectors is bounded by nM². Because M has polynomial size in 1/ε and in log Psum , we have the required complexity. Let us prove that the pruned
construction leads to a (1 + ε)-approximation of the optimal schedule.
LEMMA 7.2. ∀[x, y] ∈ VS_k , ∃[x#, y#] ∈ VS#_k such that x# ≤ ∆^k x and y# ≤ ∆^k y.
⇒  x# ≤ ∆^k u + ∆ ak ≤ ∆^k (u + ak ) = ∆^k x   and   y# ≤ ∆ v# ≤ ∆^k y .
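Putting the vector-set construction and the box pruning together, a minimal sketch (our own reconstruction; the choice ∆ = 1 + ε/(2n) is one standard option and is not taken from the book) could be:

from math import floor, log

def fptas_2_processors(a, eps):
    # Each vector (x, y) records the two processor loads; only one vector per box is kept.
    n = len(a)
    delta = 1.0 + eps / (2 * n)                  # assumed box ratio

    def box(v):
        return -1 if v == 0 else floor(log(v, delta))

    vectors = {(0.0, 0.0)}
    for ak in a:
        candidates = set()
        for (x, y) in vectors:
            candidates.add((x + ak, y))          # put task k on processor 1
            candidates.add((x, y + ak))          # put task k on processor 2
        pruned = {}
        for v in candidates:
            pruned.setdefault((box(v[0]), box(v[1])), v)
        vectors = set(pruned.values())
    return min(max(x, y) for (x, y) in vectors)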
cost(T, T') = 0          if alloc(T ) = alloc(T'),
cost(T, T') = c(T, T')   otherwise,
where alloc(T ) denotes the processor that executes task T (see Section 7.2),
and c(T, T 0 ) is defined by the application specification. The above model
states that the time for communication between two tasks running on the
same processor is negligible. The model also assumes that the processors
are part of a fully connected clique. This so-called macro-dataflow model
makes two main assumptions: (i) communication can occur as soon as data is
available; and (ii) there is no contention for network links. Assumption (i) is
allocate one task per processor. Then, we can check that the makespan
of the ASAP schedule is equal to 14. To see this, it is important to
point out that once the allocation of tasks to processors is given, we
can compute the makespan easily: for each edge e : T → T 0 , add a
virtual node of weight c(T, T 0 ) if the edge links two different processors
(alloc(T ) ≠ alloc(T')), and do nothing otherwise. Then, consider the
new graph as a DAG (without communications) and traverse it to com-
pute the length of the longest path, as explained in Section 7.3. In our
case, because all tasks are allocated to different processors, we add a
virtual node on each edge. The longest path is T1 → T2 → T7 , whose
length is w(T1 ) + c(T1 , T2 ) + w(T2 ) + c(T2 , T7 ) + w(T7 ) = 14.
[Figure 7.8: the schedule discussed in the text — P1 executes T1 then T2 ; P2 executes T3 , T4 , T5 , T6 , T7 ; time steps 0 to 9.]
Note that dependence constraints are satisfied in Figure 7.8. For example,
T2 can start at time 1 on processor P1 because this processor executes T1 ,
hence there is no need to pay the communication cost c(T1 , T2 ). By contrast,
T3 is executed on processor P2 , hence we need to wait until time 2 to start it
even though P2 is idle: σ(T1 ) + w(T1 ) + c(T1 , T3 ) = 0 + 1 + 1 = 2.
How did we find the schedule shown in Figure 7.8? And how do we know it is optimal?
Proof. The decision problem Comm(∞) associated with Pb(∞) is the follow-
ing. Given a cDAG G = (V, E, w, c) and an execution bound K ∈ N∗ , does
there exist a schedule σ for G such that MS(σ, ∞) ≤ K? We want to show
that Comm(∞) is NP-complete. First, Comm(∞) belongs to NP. If we are
given a schedule σ whose makespan is less than or equal to K, we can check
in polynomial time that dependence constraints are satisfied. For each task
we know the beginning σ(T ) of its execution and the processor alloc(T ) that
executes it, hence we just have to check for constraints.
[Figure: the reduction instance and its schedule — task T0 has weight A and task Tn+1 has weight A; in the Gantt chart, P1 executes T0 followed by the tasks in T1 , while P2 executes the tasks in T2 followed by Tn+1 .]
LEMMA 7.4. Tasks T0 and Tn+1 are not executed by the same processor in
schedule σ.
Proof. Assume that the same processor P executes both T0 and Tn+1 . Then, P executes all n other tasks Ti , 1 ≤ i ≤ n. Otherwise, let Ti0 be a task executed by another processor. The makespan of σ is greater than or equal to the length of the path T0 → Ti0 → Tn+1 :
To understand this note that P1 takes at least w(T0 ) + w(T1 ) time units to
execute its tasks, and that a communication must occur before P2 can start
Tn+1 .
Similarly, MS(σ) ≥ 2A + C + w(T2 ), because P2 must wait at least A + C time units before starting execution. Because MS(σ) ≤ K = 2A + C + α, we have w(T1 ) ≤ α and w(T2 ) ≤ α. But w(T1 ) + w(T2 ) = 2α. Therefore,
w(T1 ) = w(T2 ) = α. Let I denote the set of indices of the tasks in T1 ; I is a
solution to Inst1 , our instance of 2-Partition.
Theorem 7.10 only shows that Pb(∞) is NP-complete in the weak sense. In fact, Pb(∞) is NP-complete in the strong sense: even the problem in which all task weights and communication costs have the same (unit) value, the so-called UET-UCT problem (Unit Execution Time–Unit Communication Time), is NP-hard [94, 95].
g(G) = ( min_{T ∈ V} w(T ) ) / ( max_{T,T' ∈ V} c(T, T') ) .
Minimize M∞ subject to
  (A) ∀(T, T') ∈ E, x_{T,T'} ∈ {0, 1}
  (B) ∀T ∈ V, s(T ) ≥ 0
  (1) ∀(T, T') ∈ E, s(T ) + w(T ) + x_{T,T'} c(T, T') ≤ s(T')
  (2) ∀T ∈ V s.t. SUCC(T ) ≠ ∅, Σ_{T' ∈ SUCC(T)} x_{T,T'} ≥ |SUCC(T )| − 1
  (3) ∀T ∈ V s.t. PRED(T ) ≠ ∅, Σ_{T' ∈ PRED(T)} x_{T',T} ≥ |PRED(T )| − 1
  (4) ∀T ∈ V, s(T ) + w(T ) ≤ M∞

• The relaxed linear program RLP(G) is obtained by replacing the integrality constraint (A) with: ∀(T, T') ∈ E, 0 ≤ x_{T,T'} ≤ 1.
• We let (x^rel_{T,T'}, s^rel(T ), M^rel_∞) denote the solution of the relaxed problem RLP(G) over the rational numbers.
Hanen and Munier define their schedule σ^hm directly from the solution of the relaxed linear program RLP(G). Let T ∈ V be any task. Constraint (2) ensures that there is at most one successor T' of T such that x^rel_{T,T'} < 1/2; and constraint (3) ensures that there is at most one predecessor T'' of T such that x^rel_{T'',T} < 1/2. Therefore, let x^hm_{T,T'} = 0 for any edge e = (T, T') ∈ E such that x^rel_{T,T'} < 1/2, and x^hm_{T,T'} = 1 otherwise. For any task T ∈ V, define σ^hm_T to be the top level of T, where bottom and top levels are computed according to the allocation function induced by the x^hm_{T,T'}. We add the communication cost c(T, T') in the weight of a path going from T to T' if and only if alloc(T ) ≠ alloc(T'), i.e., if and only if x^hm_{T,T'} = 1. As explained earlier, this defines a valid schedule for G.
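A minimal sketch (ours) of this rounding step — from the relaxed edge values x^rel to the 0/1 values x^hm and the induced start times:

from collections import deque

def hanen_munier_rounding(weights, comm, succ, x_rel):
    # weights: T -> w(T); comm: (T, T') -> c(T, T'); succ: T -> successors;
    # x_rel: (T, T') -> relaxed value in [0, 1].
    x_hm = {e: (0 if x_rel[e] < 0.5 else 1) for e in x_rel}
    indeg = {t: 0 for t in weights}
    for t in weights:
        for s in succ.get(t, []):
            indeg[s] += 1
    sigma = {t: 0 for t in weights}            # top levels
    ready = deque(t for t, d in indeg.items() if d == 0)
    while ready:
        t = ready.popleft()
        for s in succ.get(t, []):
            edge_cost = comm[(t, s)] if x_hm[(t, s)] == 1 else 0
            sigma[s] = max(sigma[s], sigma[t] + weights[t] + edge_cost)
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return x_hm, sigma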
THEOREM 7.11. Let G = (V, E, w, c) be a coarse-grain cDAG, with granularity g(G) ≥ 1. Let σ^hm be the schedule defined by Hanen and Munier. Then,
MS(σ^hm, ∞) ≤ ((2g(G) + 2) / (2g(G) + 1)) MSopt (∞) .
Proof. For any path in the graph going from a task T to one of its successors T', we have the communication cost x^hm_{T,T'} c(T, T') for Hanen and Munier's schedule, instead of x^rel_{T,T'} c(T, T') for the solution of RLP(G). Two cases occur:
• x^hm_{T,T'} = 0: then w(T ) + x^hm_{T,T'} c(T, T') ≤ w(T ) + x^rel_{T,T'} c(T, T').
• x^hm_{T,T'} = 1: then x^rel_{T,T'} ≥ 1/2. We have
(w(T ) + x^hm_{T,T'} c(T, T')) / (w(T ) + x^rel_{T,T'} c(T, T')) ≤ (w(T ) + c(T, T')) / (w(T ) + c(T, T')/2) = (1 + c(T, T')/w(T )) / (1 + c(T, T')/(2w(T )))
and
(1 + c(T, T')/w(T )) / (1 + c(T, T')/(2w(T ))) ≤ (1 + 1/g(G)) / (1 + 1/(2g(G))) = (2g(G) + 2) / (2g(G) + 1) .
FIGURE 7.11: Naïve critical path scheduling for the example DAG in Figure 7.7. [Gantt chart: P1 executes T1 , T3 , T4 , T6 , T7 ; P2 executes T2 ; P3 executes T5 ; the makespan is 14 time units.]
We obtain a makespan equal to 14 time units. Note that the naı̈ve critical
path (naı̈ve CP) scheduling with two processors leads to the same result: T5
would have been executed by P1 at time t = 4 rather than by P3 at time t = 5:
this is the only difference. In both cases, we obtain the same makespan, even
worse than the execution on a single processor! There must be room for
improvement.
FIGURE 7.12: Modified critical path scheduling for the example DAG in Figure 7.7. [Gantt chart: P1 executes T1 , T3 , T2 ; P2 executes T4 , T6 , T7 ; P3 executes T5 ; the makespan is 11 time steps.]
available, P1 and P2 :
[Figure: (a) a fork graph and (b) a join graph, each with tasks T1 , T2 , T3 .]
3. Tricky: Prove that the heuristic is optimal for certain classes of graphs:
forks, joins, fork-joins, trees, etc.
The first approach is the strongest: one is assured that the heuristic will perform within a certain factor of the optimal in the worst case. The second approach is quite useful (and used) in practice. And the third approach helps to tune the heuristics so as to be optimal for certain graph classes (and maybe to publish nice research papers!).
A small step in the first direction is the following (rather weak) counterpart of Theorem 7.6:
THEOREM 7.12. Let G = (V, E, w, c) be a cDAG of granularity g(G) (see Section 7.6.2), and let MSopt (p) be the makespan of an optimal schedule. Then, we can derive a schedule σ with p processors whose makespan verifies
MS(σ, p) ≤ (2 − 1/p)(1 + g(G)) MSopt (p) .
Proof. The proof is straightforward: neglect all communication costs and construct a list schedule σ; its makespan is such that
MS(σ, p) ≤ (2 − 1/p) MS*opt (p) ≤ (2 − 1/p) MSopt (p),
where MS*opt (p) is the optimal makespan without communication costs. Then, we stretch the schedule by a factor 1 + g(G), which allows us to pay for the communication costs incurred from the predecessors of each task Ti : we have an interval of length (1 + g(G))wi in which to communicate the data from the predecessors of Ti and to execute it. Therefore, we have derived a valid schedule whose makespan meets the desired bound. Note that this schedule is not necessarily a list schedule because we may have waited longer than needed to execute some tasks.
where data(i, j) is the data volume associated with eij and vqr is the communication time for a unit-size message from Pq to Pr (i.e., the inverse of the bandwidth). As in the homogeneous case, we let vqr = 0
if q = r, i.e., if both tasks are assigned the same processor. If one
wishes to generate synthetic scenarios to evaluate competing scheduling
heuristics, one then must generate two matrices: one of size n × n for
data and one of size p × p for vqr .
The last (but important) modification concerns the way in which tasks are assigned to processors: instead of assigning the current task to the processor that will start its execution first (given all decisions already taken), we should assign it to the processor that will complete its execution first (given all decisions already taken). Both choices are equivalent with homogeneous processors, but intuitively the latter is likely to be more efficient in the heterogeneous case.
Altogether, we have re-discovered the list heuristic called HEFT, for Het-
erogeneous Earliest Finish Time [116]. The complexity of the algorithm as we
have outlined it here is the same as that of MCP. More sophisticated versions
attempt to insert tasks in intervals of time during which processors are idle.
This technique is called insertion scheduling: instead of scheduling a new task
after those already assigned to a given processor, a good idea may be to try
and schedule it at some earlier time, provided that there exists an interval
long enough to accommodate the task and during which the processor was
idle (most likely waiting for some communication to complete).
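As a rough illustration of the earliest-finish-time rule (a simplified sketch, ours, that ignores insertion scheduling), a single task could be assigned as follows:

def heft_assign(task, procs, exec_time, data_in, proc_available):
    # exec_time[task][q]: execution time of task on processor q;
    # data_in: list of (pred_finish, pred_proc, comm_time) tuples, where
    #   comm_time[q] is the time to ship the predecessor's data to processor q;
    # proc_available[q]: time at which processor q becomes idle.
    best = None
    for q in procs:
        data_ready = 0
        for (pred_finish, pred_proc, comm_time) in data_in:
            arrival = pred_finish + (0 if pred_proc == q else comm_time[q])
            data_ready = max(data_ready, arrival)
        start = max(data_ready, proc_available[q])
        finish = start + exec_time[task][q]
        if best is None or finish < best[2]:
            best = (q, start, finish)
    return best          # chosen processor, start time, finish time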
Bibliographical Notes
All the material covered in this chapter is rather basic. For scheduling without communication costs, pioneering work includes the book by Coffman [42]. Chapter 9 of [83], the book by El-Rewini, Lewis, and Ali [54] and
the IEEE compilation of papers [108] provide additional material. On the
theoretical side, Appendix A5 of Garey and Johnson [58] provides a list of
NP-complete scheduling problems. Also, the book by Brucker [37] offers a
comprehensive overview of many complexity results.
The literature with communication costs is more recent. Theorem 7.10 is
due to Chrétienne [40]. Picouleau [94, 95] proves that Pb(∞) remains NP-
complete even when we assume all task weights and communication costs
to have the same (unit) value — this is the so-called UET-UCT problem
(Unit Execution Time-Unit Communication Time) — or even if communica-
tion costs are arbitrarily small (but nonzero). Several extensions to Theo-
rem 7.10 are discussed in the survey paper by Chrétienne and Picouleau [41].
Hanen and Munier’s heuristic can be extended to cope with limited processors;
see [67]. See also the book by Darte, Robert and Vivien [50] where many clus-
tering heuristics are surveyed. Finally, a recent book by Sinnen [109] provides
a thorough discussion on communication models. In particular, it describes
several extensions for modeling and accounting for communication contention.
7.9 Exercises
We start with two exercises on scheduling without any communication costs.
Exercise 7.1 studies some properties of the free schedule and its variants,
and Exercise 7.2, borrowed from [65], shows surprising anomalies of the list
scheduling approach. Communication costs are involved in Exercise 7.3 that
studies the complexity of scheduling a simple fork graph using various commu-
nication models. We conclude with a more difficult problem in Exercise 7.4,
namely scheduling in-trees using Hu’s algorithm. This exercise is borrowed
from [50]. The original reference is [90], which gives a simpler proof of Hu’s
algorithm [70].
where σf ree and σlate are the ASAP and ALAP schedules defined in Sec-
tion 7.3.
2. Give an example of a DAG G = (V, E, w) that has at least three different
optimal schedules with unlimited processors.
3. Consider the DAG in Figure 7.15. Assume that all tasks have unit weight.
What is the optimal execution time MSopt (∞)? How many processors are
needed for the ASAP scheduling? For the ALAP scheduling? Determine the
minimum number popt of processors needed to achieve execution in optimal
time MSopt (∞).
[Figure 7.15: DAG with entry tasks T1 , T2 , then T3 , T4 , T5 , then T6 , T7 , and exit task T8 ; the edges were lost in extraction.]
[Figure: DAG for Exercise 7.2, with tasks A through J and their integer weights; the graphical layout was lost in extraction.]
[Figure: fork graph for Exercise 7.3 — a root task T0 and children T1 , T2 , . . . , Ti , . . . , Tn , where the edge from T0 to Ti carries communication cost di .]
(a) Give the maximum level of a task v in St for k < t < MS(σ, p) and show
that MS(σ, p) = k + L0 + 2.
(b) Show that, for all integers t, 0 ≤ t ≤ k, |St | = p. (Use the fact that G is
an in-tree, i.e., each vertex has at most one successor.)
(c) Infer from the previous two questions that MS(σ, p) = MSopt (p). Show
that this optimality result still holds even if k = MS(σ, p) − 1 (in which
case L0 is not defined) or if k does not exist.
7.10 Answers
Exercise 7.1 (Free Schedule)
. Question 1. Consider any task v ∈ V . By definition, σf ree (v) is the length
of the longest path from an entry node up to v, so σf ree (v) ≤ σ(v) for any
schedule σ, be it optimal or not. Now read the definition of σlate carefully:
there is a path of length σlate (v) = MSopt (∞) − bl(v) from v to an exit node.
Hence, if a schedule σ starts executing v later than time σlate (v), its makespan
will be greater than MSopt (∞), implying that σ is not optimal.
. Question 2. Consider a DAG with four tasks T1 , T2 , T3 , T4 of unit weights.
The only dependences are
T1 → T2 → T3 .
We have MSopt (∞) = 3, σf ree (T1 ) = σlate (T1 ) = 0, σf ree (T2 ) = σlate (T2 ) = 1
and σf ree (T3 ) = σlate (T3 ) = 2. However, T4 is independent from the other
tasks. We have σf ree (T4 ) = 0 and σlate (T4 ) = 2. There is room for a third
optimal schedule σ that coincides with the other two on T1 , T2 and T3 , and
such that σ(T4 ) = 1.
. Question 3. The longest path is T2 → T4 → T6 → T8 , thus MSopt (∞) = 4.
The following table shows the starting times for σf ree and σlate :
Tasks T1 T2 T3 T4 T5 T6 T7 T8
σf ree 0 0 1 1 1 2 2 3
σlate 1 0 2 1 2 2 3 3
We see from the table that we need 3 processors for σf ree (at time 1) and
also 3 processors for σlate (at time 2). Any schedule whose makespan is
MSopt (∞) = 4 requires at least 2 processors, since there are 8 tasks. The
schedule σ below is valid (check that all dependences are satisfied) and achieves
a makespan of 4 with only 2 processors:
Tasks T1 T2 T3 T4 T5 T6 T7 T8
σ 0 0 1 1 2 2 3 3
We conclude that popt = 2.
. Question 4. The problem is obviously in NP. We use a reduction from 2-Partition. Consider an arbitrary instance Inst1 with n integers {a1 , a2 , . . . , an }. Let S = Σ_{i=1}^{n} ai . We assume that S is even and that ai ≤ S/2 for all i (otherwise we know there is no solution). We build an instance Inst2 of our problem as follows: we have a DAG of n + 1 independent tasks T1 to Tn+1 . We let w(Ti ) = ai for 1 ≤ i ≤ n, and w(Tn+1 ) = S/2. The size of Inst2 is linear in the size of Inst1 . We have MSopt (∞) = w(Tn+1 ) = S/2 and we ask whether popt ≤ K = 3. Obviously, there is a solution to Inst1 if and only if there is one to Inst2 .
The makespan is 33, while the sequential time is 66. There is no idle time in
this schedule obtained with critical path list scheduling, and thus it is optimal.
. Question 2. Here again, we can compute bottom levels and execution
times. We obtain:
Task A B C D E F G H I J
Bottom level 30 26 10 25 23 23 17 8 7 7
σ 0 0 3 1 7 13 19 5 6 13
Processor P1 P2 P2 P2 P1 P1 P1 P2 P2 P2
The makespan is 36, three time units more than with larger task weights.
It turns out that any list scheduling algorithm leads to a makespan of at least
36. This is painful to prove: there is no other way than trying all possibilities. At
time 0, either we schedule A and B, or we schedule A and C, or we schedule
B and C. In this way we explore a tree of possibilities and eventually prove
the result.
. Question 3. With three processors there is a single list schedule: we have
no freedom at all. We obtain:
Task A B C D E H I J F G
σ 0 0 0 2 8 3 5 5 13 20
Processor P1 P2 P3 P2 P1 P3 P2 P3 P2 P1
The makespan is 38, five time units more than with 2 processors.
[Figure: fork graph for the instance Inst2 — a root T0 and children Ti1 , Ti2 , . . . , Tik , . . . , Tin , with communication costs di1 , . . . , din on the edges and computation weights wi1 , . . . , win on the children.]
Obviously, the size of Inst2 is linear in the size of Inst1 . It is easy to check that Inst1 has a solution if and only if we can schedule G in time K = (1/2) Σ_{i=1}^{n} ai .
• the last three children Tn+1 , Tn+2 and Tn+3 have weight:
In this chapter, we discuss several scheduling topics that are more advanced
than those studied in Chapter 7. We strongly believe that the macro-dataflow
task graph scheduling model should be modified to better account for network resource consumption. It would be unrealistic to expect that a single model can realistically capture all kinds of architecture/software combinations. But bandwidths of network cards and of communication links are always limiting factors, exactly as CPU speeds limit computing resource consumption.
As stated in Section 3.2.3, a common approach is to use one-port mod-
els (either uni- or bidirectional) for single-threaded programs using single-
threaded communication libraries, and to use multi-port models (with band-
width bounds and/or overheads) for asynchronous multi-threaded programs.
But recall that the work in [106] casts doubts on the ability to achieve true
asynchronous communications. Let us point out that serialized communica-
tions in the one-port model have a dramatic impact on application execu-
tion time (makespan). For example, in the traditional macro-dataflow model,
scheduling a fork graph with an unlimited number of homogeneous processors
has polynomial complexity, while in the one-port model this problem becomes
NP-hard (see Exercise 7.3).
In this chapter, we address four important (but largely independent) topics:
We mostly use one-port models, either bidirectional (Sections 8.1 and 8.2)
or unidirectional (Section 8.3). For the sake of completeness, we also use the
bounded multi-port model in Section 8.2. All application models considered
in this chapter are structured, in that applications exhibit some intrinsic reg-
ularity. This is the main reason why we succeed in establishing deeper results
than for the scheduling of a single DAG. We refer the adventurous reader to
the bibliographical notes at the end of this chapter for more information and
for some pointers to recent papers representative of the ongoing research in
the field.
for a 3-D domain. Below is a simplified pseudo-code for this application im-
plemented in parallel in a master-worker fashion. This pseudo-code is written
using the same template as in Chapter 3 and using the SCATTER function (see
Section 3.3.2) to distribute pieces of an array among the processors:
p ← NUM PROCS()
myrank ← MY NUM()
{ The master reads input data into an array }
if myrank = MASTER then
    data ← n seismic events of size L read from an input file
{ Each processor receives one piece of the array }
SCATTER(myrank, data, rbuff, n × L/p)
{ Every processor computes its piece of the array }
COMPUTE(rbuff)
The master processor reads n data items and scatters them among the p
processors (in this case the master operates as a worker for the computation).
Each processor then computes results independently. This application can be
modeled sufficiently accurately as a divisible load only if the number of tasks,
n, is large when compared to the number of processors. For this particular
application n is expected to be large. For instance, during 1999 as many as n = 817,101 seismic events were recorded, each of which can be used to
validate the seismic model.
This example application is representative of a large class of applications
that consist of very large, some may say enormous, numbers of fine-grain com-
putations. A common and often reasonable assumption is that the execution
time of each computation is proportional to the size of the data to be pro-
cessed in that computation. Since the computations are independent there is
no need for either synchronizations or communications among the processors:
only the input messages from the master to the workers need to be taken into
account when scheduling the application.
[Figure: bus-structured master-worker platform — the master is connected to workers P1 , P2 , . . . , Pi , . . . , Pp by a bus of per-task communication cost c; worker Pi has computation cost wi .]
Our goal is to determine the ni values so that the overall execution time,
i.e., the makespan, is minimized. Figure 8.3 shows an example execution for
p = 3 workers. Let Ti denote the execution time of processor Pi (recall that
M = P0 ). Accounting for the serialization of communications on the bus and
the order in which the master “serves” the workers, we obtain the following
expression for Ti :
- P0 : T0 = n0 · w0
- P1 : T1 = n1 · c + n1 · w1
- P2 : T2 = (n1 · c + n2 · c) + n2 · w2
- Pi : Ti = Σ_{j=1}^{i} nj · c + ni · wi for i ≥ 1
To make the above formula homogeneous we define c0 = 0 and ci = c for i ≥ 1, so that
Ti = Σ_{j=0}^{i} nj · cj + ni · wi for i = 0, 1, . . . , p .
The above equation shows that an optimal solution for the distribution of
Wtotal tasks over p + 1 processors is obtained by distributing n0 tasks to pro-
cessor P0 , and then optimally distributing Wtotal − n0 tasks to processors
P1 , . . . , Pp . This observation would easily lead to a dynamic programming
algorithm for computing the optimal ni values. However, this solution would
only be partially satisfactory. First, it is not a closed-form solution. Second,
and more importantly, it does not solve the question of the ordering of the
workers. Indeed, we have arbitrarily decided that the master M would send
messages in the order P1 , P2 , . . . , Pp . But the workers have different com-
puting powers, so one should expect the ordering to matter. Unfortunately,
[Figure: Gantt charts showing the master M and workers P1 , P2 , P3 over time; graphical content lost in extraction.]
there are p! possible orderings, far too many to resort to an exhaustive search for the best ordering with reasonable complexity!
One way to address this challenge is to realize that we do not actually need a precise solution where the ni are integers. Let us think of our seismic model validation application, with 817,101 tasks and, say, 10 processors. With such a large number of tasks relative to the number of processors, one can simply search for rational ni values, at the price of some rounding in the end to derive a feasible work allocation. This simple relaxation of the problem is the quintessence of the divisible load approach and turns out to be surprisingly successful, as seen in the next section.
(recall that c0 = 0 and ci = c for i ≥ 1). We look for a data distribution α0 , . . . , αp that minimizes T.
LEMMA 8.1. In an optimal solution, all processors stop computing at the
same time.
FIGURE 8.4: Illustration of the proof of Lemma 8.1 — (a) i = 1, P1 terminates earlier than P2 ; (b) decrease αi+1 = α2 by ε, increase αi = α1 by ε; (c) communication time for other workers is unchanged. [The Gantt charts for M, P1 , P2 , P3 were lost in extraction.]
Note that the reasoning also works if i = 0, i.e., with the master P0 . If it
finishes before P1 , suppress ε from the load of P1 , and if it finishes after P1 ,
suppress ε from the load of P0 . In both cases, they finish simultaneously, and
strictly earlier than max(T0 , T1 ).
We are almost done with the proof. We start from an optimal solution
whose execution time is T = Topt . There exists at least one processor whose
end time is T . We apply the exchange procedure to any processor pair Pi
and Pi+1 such that min(Ti , Ti+1 ) < max(Ti , Ti+1 ) = T . Both termination
times of Pi and Pi+1 are smaller than T after the exchange. We continue
applying this procedure until there remains no such pair. In the end, we have
found a solution whose total execution time is smaller than T , a contradiction.
We conclude that all processors do have the same end time in any optimal
solution.
Equipped with Lemmas 8.1 and 8.2, we can characterize the best way of
assigning loads to the master P0 and to workers P1 , . . . , Pp :
• T = α0 w0 Wtotal ;
• T = α1 (c + w1 )Wtotal . Therefore, α1 = (w0 /(c + w1 )) α0 ;
• T = (α1 c + α2 (c + w2 ))Wtotal . Therefore, α2 = (w1 /(c + w2 )) α1 ;
• For all i ≥ 1 we derive αi = (wi−1 /(c + wi )) αi−1 ;
• We use the normalization equation Σ_{i=0}^{p} αi = 1 to derive
α0 ( 1 + w0 /(c + w1 ) + . . . + Π_{k=1}^{i} wk−1 /(c + wk ) + . . . ) = 1 .
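A minimal sketch (ours) of these closed-form formulas for the bus platform:

def bus_divisible_load(w, c):
    # w[0]: master's computation cost; w[1..p]: workers' costs; c: bus cost per unit load.
    p = len(w) - 1
    ratios = [1.0]                        # alpha_i / alpha_0
    for i in range(1, p + 1):
        ratios.append(ratios[-1] * w[i - 1] / (c + w[i]))
    alpha0 = 1.0 / sum(ratios)            # normalization: the alphas sum to 1
    return [alpha0 * r for r in ratios]

# Example: master and two workers, unit bus cost; the fractions sum to 1.
alphas = bus_divisible_load([1.0, 2.0, 3.0], 1.0)
print(alphas, sum(alphas))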
We still do not know in which order the master should communicate with
the workers. The intuition is that faster workers should be served first, so
that they can work longer. Is it correct? Let us assess the impact of the
communication order analytically. To do so, consider the load processed by
workers Pi and Pi+1 during a time t:
Processor Pi – We have αi (c + wi )Wtotal = t. Therefore, αi = (1/(c + wi )) · (t/Wtotal ).
We see that the formula is symmetric, and we conclude that the communica-
tion order has absolutely no impact on the solution, a surprising conclusion
indeed! We can perform a similar analysis for the master M = P0 and the
first worker P1 :
Processor P0 – We have α0 w0 Wtotal = t. Then, α0 = (1/w0 ) · (t/Wtotal ).
Processor P1 – We have α1 (c + w1 )Wtotal = t. Hence, α1 = (1/(c + w1 )) · (t/Wtotal ).
We see that the sum of loads is larger when w0 < w1 . Perhaps, in some
situations, the master is fixed a priori. But for some other applications it
may be chosen among all available processors. In the latter case we should
select the most powerful (or fastest) processor as the master. We conclude
this section by summarizing our results:
THEOREM 8.1. For divisible load applications on bus-structured networks,
the master should be selected (if possible) as the fastest processor. In an opti-
mal solution, all workers participate and terminate simultaneously. The com-
munication order from the master to the workers has no impact. Closed-form
formulas can be established for α0 , α1 , . . . , αp .
[Figure: heterogeneous star platform — the master M is connected to workers P1 , P2 , . . . , Pi , . . . , Pp ; the link to Pi has communication cost ci and Pi has computation cost wi .]
We see that the formula is symmetric for the total communication time:
the network is occupied the same amount of time whether Pi comes before
264 Chapter 8. Advanced scheduling
or after Pi+1 . However, the amount of work does depend upon the ordering:
αi + αi+1 is maximized when ci ≤ ci+1 , which suggests that we should serve
the faster communicating worker first. Because the ordering of Pi and Pi+1
has no impact on the other workers, we can infer a very important result:
LEMMA 8.3. In an optimal solution, participating workers must be served
by non-decreasing values of ci .
However, we do not yet know whether we should utilize all workers as in
the case of bus platforms. The exchange procedure we used in Section 8.1.3
does not work so easily here. This is because communication times change
when shifting fractions of load from one worker to another. Nevertheless, the
result still holds:
LEMMA 8.4. In an optimal solution, all workers participate in the compu-
tation.
Proof. Consider an optimal solution. Let us renumber the processors so that
the ordering of communications is P1 , P2 , . . . , Pp . Suppose that at least one
worker is kept fully idle. In this case, at least one of the αi 's, 1 ≤ i ≤ p, is
zero. Let us denote by k the largest index such that αk = 0.
Case k < p – We add Pk at the end of the initial solution, thus using the ordering P1 , . . . , Pk−1 , Pk+1 , . . . , Pp , Pk . By construction, αp ≠ 0. Therefore, the network is not used during at least the last αp wp time units. It would thus be possible to process at least αp wp /(ck + wk ) > 0 additional units of load with worker Pk , which contradicts the assumption that the original solution was optimal: Pk does more work than in the original solution in which it was kept idle, and all the other processors do the same amount of work.
Case k = p – We modify the initial solution by giving some work to Pp without increasing the execution time. Let k' be the largest index such that αk' ≠ 0. By construction, the communication medium is not used during at least the last αk' wk' > 0 time units. Thus, as previously, it would be possible to process at least αk' wk' /(cp + wp ) > 0 additional units of load with worker Pp , which leads to a similar contradiction.
Therefore, in an optimal solution all workers participate in the computation.
It is worth pointing out that the above property does not hold true if we
consider solutions in which the communication ordering is fixed a priori. For
instance, consider a platform comprising two workers: P1 (with c1 = 4 and
w1 = 1) and P2 (with c2 = 1 and w2 = 1). If the first chunk has to be sent
to P1 and the second chunk to P2 , the optimal number of units of load that
can be processed within 10 time units is 5, and P1 is kept fully idle in this
solution. On the other hand, if the communication ordering is not fixed, then
8.1. Divisible Load Scheduling 265
6 units of load can be performed within 10 time units (5 units of load are sent
to P2 , and then 1 to P1 ). In the optimal solution, both workers perform some
computation, and both workers finish computing at the same time. This is a general result: in an optimal solution, all workers finish computing at the same time (and, as the proof below shows, the optimal solution is in fact unique).
Proof. The reader may want to skip this proof as it requires more involved
mathematical arguments. Consider an optimal solution. All αi ’s have strictly
positive values (Lemma 8.4). Consider the following linear program:
Maximize Σ_i βi ,
subject to
  LB(i): ∀i, βi ≥ 0
  UB(i): ∀i, Σ_{k=1}^{i} βk ck + βi wi ≤ T
The αi ’s satisfy the set of constraints above, and from any set of βi ’s sat-
isfying the
P set of inequalities, we can build a valid schedule that processes
exactly βi units of load. Therefore,Pif we denote
P by (β1 , . . . , βp ) an optimal
solution of the linear program, then βi = αi .
It is known that one of the extremal solutions S1 of the linear program is a vertex of the convex polyhedron P induced by the inequalities [107, chapter 11]: this means that in the solution S1 , at least p of the 2p inequalities are tight, i.e., are equalities. Since we know that for any optimal solution all the βi 's are strictly positive (Lemma 8.4), this vertex is the solution of the following (full rank) linear system:
∀i, Σ_{k=1}^{i} βk ck + βi wi = T.
Thus, we conclude that there exists an optimal solution in which all workers
finish their work at the same time.
Let us denote by S2 = (α1 , . . . , αp ) another optimal solution, with S1 ≠ S2 . As already pointed out, S2 belongs to the polyhedron P. Now, consider the following function f :
f : R → R^p ,  x ↦ S1 + x(S2 − S1 ) .
By construction, we know that Σ_i βi = Σ_i αi . Thus, with the notation f (x) = (γ1 (x), . . . , γp (x)):
∀i, γi (x) = βi + x(αi − βi ),
and therefore
∀x, Σ_i γi (x) = Σ_i βi = Σ_i αi .
Therefore, all the points f (x) that belong to P are optimal solutions of the linear program.
Since P is a convex polyhedron and both S1 and S2 belong to P, f (x) ∈ P for all 0 ≤ x ≤ 1. Let us denote by x0 the largest value of x ≥ 1 such that f (x) still belongs to P: at least one constraint of the linear program is an equality at f (x0 ), and this constraint is not satisfied for x > x0 . Could this constraint be one of the UB(i) constraints? The answer is no, because otherwise this constraint would be an equality along the whole line (S2 f (x0 )), and would remain an equality for x > x0 . Hence, the constraint of interest is one of the LB(i)'s. In other terms, there exists an index i such that γi (x0 ) = 0. This is a contradiction since we have proved that the γi 's correspond to an optimal solution of the problem. Therefore, S1 = S2 , the optimal solution is unique, and in this solution, all workers finish computing simultaneously.
Minimize Tf ,
subject to
  (1) αi ≥ 0, 1 ≤ i ≤ p
  (2) Σ_{i=1}^{p} αi = Wtotal
  (3) Σ_{j=1}^{i} αj cj + αi wi ≤ Tf , 1 ≤ i ≤ p (i-th worker)
THEOREM 8.2. The optimal solution is given by the solution of the linear
program above.
Theorem 8.2 is a direct consequence of the previous two lemmas. Note that
inequalities (3) will be in fact equalities in the solution of the linear program,
so that we can easily derive a closed-form expression for the αi ’s (although
it is far less elegant than for bus platforms). It is important to point out
that this is linear programming with rational numbers, hence of polynomial
complexity.
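For illustration, here is a minimal sketch (ours) that feeds this linear program to an off-the-shelf solver, assuming SciPy is available; the variables are (α1 , . . . , αp , Tf ) and the workers are assumed already sorted by non-decreasing ci (Lemma 8.3):

import numpy as np
from scipy.optimize import linprog

def star_divisible_load(c, w, W_total):
    p = len(c)
    cost = np.zeros(p + 1)
    cost[p] = 1.0                                  # minimize Tf
    A_ub = np.zeros((p, p + 1))                    # constraint (3), one row per worker
    for i in range(p):
        A_ub[i, :i + 1] = c[:i + 1]
        A_ub[i, i] += w[i]
        A_ub[i, p] = -1.0
    b_ub = np.zeros(p)
    A_eq = np.zeros((1, p + 1))                    # constraint (2)
    A_eq[0, :p] = 1.0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[W_total],
                  bounds=[(0, None)] * (p + 1), method="highs")
    return res.x[:p], res.x[p]                     # loads alpha_i and makespan Tf

# Example call: star_divisible_load([1, 1, 5], [1, 1, 5], 1.0)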
As stated above, the variant where the master is capable of computing
while communicating to one of its children can be solved by adding a fictitious
worker Pp+1 with cp+1 = 0 and wp+1 = w0 . The previous analysis shows that
the master is kept busy all the time, as expected (otherwise more units of load
could be processed). However, if the master is not fixed a priori but rather
can be freely chosen among all processors, we should introduce new variables
cij to denote the communication time of one unit of load from Pi to Pj . But
we have no easy rule to decide which processor should be elected as master
(we had such a rule for bus platforms). Instead, we can still solve p + 1 linear
programs (with Pi as master in the i-th program) and retain the best solution.
Exercise 8.1 explains how to extend the linear programming approach to
general (multi-level) heterogeneous trees.
FIGURE 8.7: With return messages, all processors do not always participate in the computation. [Platform: c1 = 1, c2 = 1, c3 = 5 and w1 = 1, w2 = 1, w3 = 5. The LIFO schedule using all 3 processors achieves 61/135 ≈ 0.45 task/sec, while the FIFO schedule using only 2 processors achieves 1/2 = 0.5 task/sec and is the best schedule.]
FIGURE 8.8: With return messages, the optimal schedule may be neither LIFO nor FIFO. [Platform: c1 = 7, c2 = 8, c3 = 12 and w1 = 6, w2 = 5, w3 = 5. The optimal schedule achieves 38/499 ≈ 0.076 task/sec, the best FIFO schedule 47/632 ≈ 0.074 task/sec, and the best LIFO schedule 43/580 ≈ 0.074 task/sec.]
[Figure: one-round versus multi-round divisible load schedules — in the one-round schedule the master M sends load fractions α1 , α2 , . . . , αp to workers P1 , . . . , Pp , which finish at times T1 , . . . , Tp ; the multi-round schedule proceeds in rounds R0 , R1 , . . . , Rk and finishes at Tf .]
8.2.1 Motivation
An idea to circumvent the difficulty of makespan minimization is to lower
the ambition of the scheduling objective. Instead of aiming at the absolute
minimization of the execution time, why not consider asymptotic optimality?
Often, the motivation for deploying an application on a parallel platform is
that the number of tasks is very large. In this case, the optimal execution time
with the optimal schedule may be very large and a small deviation from it is
likely acceptable. To state this informally: if there is a nice (e.g., polynomial)
way to derive, say, a schedule whose length is two hours and three minutes, as
opposed to an optimal schedule that would run for only two hours, we would
be satisfied.
This approach has been pioneered by Bertsimas and Gamarnik [29]. Steady-
state scheduling allows one to relax the scheduling problem in many ways. The
costs of the initialization and clean-up phases are neglected. The initial integer
[Figure: example computing platform — data starts on a local computer and is distributed over the Internet to a cluster, a supercomputer, a partner site, and participating PCs and workstations; intermediate nodes can compute too. The platform is abstracted as a small graph with nodes A, B, C, D and associated weights.]
[Figure: step-by-step construction of the cyclic schedule on the example platform; each panel lists the compute, send, and receive activities of nodes A, B, C, D. The surviving panel captions read "(g) A serves B (still partly idle)" and "(h) Period of the cyclic schedule".]
Maximize ρ,
subject to
  ρ = Σ_{i=1}^{p} αi
  Σ_{i=1}^{p} αi ci ≤ 1
  ∀i, αi ≥ 0
  ∀i, αi wi ≤ 1
It turns out that the linear program is so simple that it can be solved analytically. Indeed it is a fractional knapsack problem [44] with value-to-cost ratio 1/ci . We should start with the “item” (worker) of largest ratio, i.e., the smallest ci , and assign it as many tasks as we can, i.e., min(1/ci , 1/wi ) tasks per time unit. Here is the detailed procedure:
3. Workers Pq+2 to Pp (if they exist) are discarded: they will not partici-
pate in the computation.
When q = p the result is expected: it basically says that workers can be fed
with tasks fast enough so that they are all kept computing steadily. However, if
q < p the result is surprising. Indeed, if communication bandwidth is limited,
TABLE 8.1: Achieved throughput for the bandwidth-centric strategy on the
example tree platform.
Tasks Communication Computation
6 tasks to P1 6c1 = 6 6w1 = 18
3 tasks to P2 3c2 = 6 3w2 = 18
2 tasks to P3 2c3 = 6 2w3 = 2
some workers will partially starve. In the optimal solution these partially
starved workers are those with slow communication rates, regardless of their
processing speeds. In other words, a slow processor with a fast communication
link is to be preferred to a fast processor with a slow communication link. This
optimal strategy is often called bandwidth-centric because it delegates work
to the fastest communicating workers, regardless of their computing speeds.
Of course, slow workers will not contribute much to the overall throughput.
[Figure 8.11: example platform — a master M and five workers with communication and computation costs (ci , wi ) = (1, 3), (2, 6), (3, 1), (10, 1), (20, 1).]
Consider the example shown in Figure 8.11. Workers are sorted by non-decreasing ci . We see that c1 /w1 + c2 /w2 = 2/3 < 1 and that c1 /w1 + c2 /w2 + c3 /w3 = 2/3 + 3 > 1, so that q = 2 and ε = 1/3 in the previous formula. Therefore, P1 and P2 will be fully active, contributing α1 + α2 = 1/w1 + 1/w2 = 1/3 + 1/6 tasks per time unit. P3 will only be partially active, contributing α3 = min(ε/cq+1 , 1/wq+1 ) = min(1/9, 1) = 1/9. P4 and P5 will be discarded. The optimal throughput is ρ = 1/3 + 1/6 + 1/9 = 11/18 ≈ 0.6. Table 8.1 shows that 11 tasks are computed every Tperiod = 18 time units.
It is important to point out that if we had used a purely greedy (demand-
driven) strategy, we would have reached a much lower throughput. Indeed,
the master would serve the workers in round-robin fashion, and we would
execute only 5 tasks every 36 time units, therefore achieving a throughput of only ρ = 5/36 ≈ 0.14. The conclusion is that even when resources are cheap
and abundant, resource selection is key to performance.
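To make this procedure concrete, here is a minimal Python sketch (the function name and output format are our own) that applies the bandwidth-centric allocation to the costs of the example platform of Figure 8.11; note that the values chosen for P4 and P5 do not affect the throughput since these workers are discarded anyway:

```python
from fractions import Fraction

def bandwidth_centric(c, w):
    """Analytical solution of the steady-state linear program on a star.
    c[i], w[i]: communication and computation costs of worker i.
    Returns the alpha_i (tasks per time unit) and the throughput rho."""
    # Workers are considered by non-decreasing communication cost c_i.
    order = sorted(range(len(c)), key=lambda i: c[i])
    alpha = [Fraction(0)] * len(c)
    used_bw = Fraction(0)              # fraction of the master's time spent sending
    for i in order:
        ci, wi = Fraction(c[i]), Fraction(w[i])
        full = ci / wi                 # master time needed to keep P_i fully busy
        if used_bw + full <= 1:        # worker can be kept computing steadily
            alpha[i] = 1 / wi
            used_bw += full
        else:                          # worker only gets the leftover bandwidth
            eps = 1 - used_bw
            alpha[i] = min(eps / ci, 1 / wi)
            used_bw = Fraction(1)      # the master link is now saturated
    return alpha, sum(alpha)

# Costs as read off Figure 8.11: c = (1, 2, 3, 10, 20), w = (3, 6, 1, 1, 1)
alpha, rho = bandwidth_centric([1, 2, 3, 10, 20], [3, 6, 1, 1, 1])
print(alpha, rho)   # expected throughput rho = 11/18
```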
Once one has obtained the solution to the linear program defined above,
say (α, %), one needs to reconstruct a (periodic) schedule. In other terms, one
needs to decide in which specific activities each computation and communica-
tion resource is involved during each period. More precisely, we need to define
Tperiod such that during a period (i) an integral number of tasks is processed
by each processor, and (ii) an integral number of messages goes through each
link.
We express all the rational numbers αi, 1 ≤ i ≤ q, as αi = ui/vi, where the ui and the vi are relatively prime integers. We also write α_{q+1} = min(ε/c_{q+1}, 1/w_{q+1}) = u_{q+1}/v_{q+1}. The period of the schedule is set to Tperiod = lcm(v1, . . . , vq, v_{q+1}), the least common multiple of the denominators. In the example in Figure 8.11, we had α1 = 1/3, α2 = 1/6 and α3 = 1/9, and lcm(3, 6, 9) = 18. This is how we
found the period used in Table 8.1.
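The period computation itself is a one-liner once the αi are available as exact rationals; here is a small sketch using Python's standard fractions and math modules:

```python
from fractions import Fraction
from math import lcm

alphas = [Fraction(1, 3), Fraction(1, 6), Fraction(1, 9)]   # alpha_1, ..., alpha_{q+1}
T_period = lcm(*(a.denominator for a in alphas))            # lcm(3, 6, 9) = 18
tasks_per_period = [a * T_period for a in alphas]           # 6, 3 and 2 tasks
```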
In steady-state, during each period of duration Tperiod :
• Master M sends αi Tperiod tasks to Pi, 1 ≤ i ≤ q + 1. These q + 1 messages are sent in any order. A total of ρTperiod tasks are sent during each period. All these communications share the network but Equation (8.2) ensures that link bandwidths are not exceeded.
• Worker Pi executes the αi Tperiod tasks that it has received during the
last period. Equation (8.1) ensures that Pi has enough time to execute
these tasks.
Obviously, the first and last period are different: no computation takes place
during the first period, and no communication during the last one. Note that
the steady-state regime is reached no later than the beginning of the second
period.
Altogether, we have a periodic schedule, which is described in compact
form: because it arises from the linear program, log(Tperiod) is polynomial in the problem size, but Tperiod itself is not. Hence, describing what
happens at every time-step during the period would be exponential in the
problem size. Instead, we have a more “compact” description of the schedule:
we only need the duration of the p time intervals during which the master
sends tasks to each worker (some possibly zero), and the duration of the p
time intervals during which each worker computes (again, some possibly zero).
We conclude this section by explaining how to modify the linear program
to use the bounded multi-port model instead of the one-port model. Here is
the one-port linear program again:
    Maximize ρ,
    subject to
        (i)   ρ = Σ_{i=1}^{p} αi
        (ii)  Σ_{i=1}^{p} αi ci ≤ 1
        (iii) ∀i, αi ≥ 0
        (iv)  ∀i, αi wi ≤ 1
In the bounded multi-port model, the one-port constraint (ii) is replaced by the per-link constraint

    ∀i, αi ci ≤ 1 ,

which states that the bandwidth of the link from M to Pi is not exceeded.
Let δ be the volume of data (in bytes) that needs to be sent for one task. We can rewrite the last equation as:

    ∀i, αi δ / Bi ≤ 1    (ii-a),

where Bi is the bandwidth of the link (thus ci = δ/Bi). We also have to enforce a global bound related to the bandwidth B of the master's network card:

    (Σ_{i=1}^{p} αi) δ / B ≤ 1    (ii-b).
Replacing equation (ii) by both equations (ii-a) and (ii-b) is all that is needed
to change to the bounded multi-port model!
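As an illustration of this variant, the sketch below feeds constraints (ii-a), (ii-b) and (iv) to an off-the-shelf LP solver (here scipy.optimize.linprog, which is our choice and not something prescribed by the text):

```python
import numpy as np
from scipy.optimize import linprog

def multiport_throughput(w, B_links, B_master, delta):
    """Maximize rho = sum(alpha_i) subject to alpha_i*delta/B_i <= 1  (ii-a),
    sum(alpha_i)*delta/B_master <= 1  (ii-b), and alpha_i*w_i <= 1  (iv)."""
    p = len(w)
    c = -np.ones(p)                            # linprog minimizes, so negate
    A_ub, b_ub = [], []
    for i in range(p):
        row = np.zeros(p); row[i] = delta / B_links[i]   # (ii-a)
        A_ub.append(row); b_ub.append(1.0)
        row = np.zeros(p); row[i] = w[i]                 # (iv)
        A_ub.append(row); b_ub.append(1.0)
    A_ub.append(np.full(p, delta / B_master)); b_ub.append(1.0)   # (ii-b)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=[(0, None)] * p)
    return res.x, -res.fun                     # the alpha_i and the throughput rho
```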
    α0 w0 ≤ 1 .

Consider now an internal node Pi. Let Pi0 denote its parent in the tree, and let Pi1, Pi2, . . . , Pik denote its children in the tree. As before, there are equations to constrain the computations and communications of Pi:

    αi wi ≤ 1 ,
    Σ_{j=1}^{k} sent(Pi → Pij) ci,ij ≤ 1 .
    αi wi ≤ 1 ,
We have built a linear program, and we can follow the same steps as for
star-shaped platforms to define the period and reconstruct the final schedule.
Essentially Proposition 8.1 says that no schedule can execute more tasks
than the optimal steady-state. There remains to bound the potential loss due
to the initialization and the clean-up phase. Consider the following algorithm
(assume T is large enough):
• Solve the linear program: compute the maximal throughput ρ, compute all the values for αi, sent(Pi → Pj), and rij to determine Tperiod. For
each processor Pi , determine per i , the total number of tasks that it
receives per period. Note that all these quantities are independent of T :
they depend only upon wi and cij , characteristics of the platform.
• Initialization: the master sends per i tasks to each processor Pi . This
requires I units of time, where I is a constant independent of T .
• Let J be the maximum time for each processor to consume per_i tasks (J = max_i {per_i · wi}). Again, J is a constant independent of T.
• Let r = ⌊(T − I − J)/Tperiod⌋.

Proof. Using Lemma 8.1, nbopt(T) ≤ ρT. From the description of the algorithm, we have nb(T) = ((r + 1)Tperiod) · ρ ≥ (T − I − J) · ρ. This proves the result because I, J, Tperiod and ρ are constants independent of T.
8.2.6 Summary
In addition to its simplicity, which makes it possible to tackle more complex
problems, steady-state scheduling has two main advantages over traditional
scheduling:
8.3 Workflow Scheduling

[FIGURE 8.12: the application pipeline. Stages S1, . . . , Sk, . . . , Sn; stage Sk receives an input of size bk−1, performs wk computations, and produces an output of size bk.]
8.3.1 Framework
We consider a pipeline with n stages Sk, 1 ≤ k ≤ n, as illustrated in
Figure 8.12. Tasks are fed into the pipeline and processed from stage to
stage, until they exit the pipeline after the last stage. The k-th stage Sk
receives an input from the previous stage, of size bk−1 , performs a number of
wk operations, and outputs data of size bk to the next stage. The first stage
S1 receives an initial input of size b0 , while the last stage Sn returns a final
result of size bn .
We target a platform with p processors Pu, 1 ≤ u ≤ p, that are fully interconnected (see Figure 8.13). There is a bidirectional link linku,v : Pu ↔ Pv with bandwidth Bu,v between each pair of processors Pu and Pv. For the sake of simplicity, we enforce the unidirectional variant of the one-port model: a given processor can be involved in a single communication at any time-step, either a send or a receive. Note that independent communications between distinct processor pairs can take place simultaneously (this was not possible in star-shaped platforms). However, remember that in the unidirectional variant of the one-port model a given processor cannot send and receive data in parallel (while this was allowed for tree-shaped platforms in Section 8.2.4).

[FIGURE 8.13: two processors Pu and Pv, with speeds Wu and Wv, linked with bandwidth Bu,v; Pu is also connected to Pin with bandwidth Bin,u and Pv to Pout with bandwidth Bv,out.]
In the most general case, we have fully heterogeneous platforms, with dif-
ferent processor speeds and link bandwidths. The speed of processor Pu is
denoted as Wu , and it takes X/Wu time units for Pu to execute X operations.
We also enforce a linear cost model for communications, hence it takes X/Bu,v
time units to send (resp. receive) a message of size X to (resp. from) Pv . We
classify below particular cases that are important, both from a theoretical and practical perspective:

• Fully Homogeneous platforms have identical processors (Wu = W) and identical links (Bu,v = B).

• Communication Homogeneous platforms have heterogeneous processors (Wu ≠ Wv) but homogeneous links (Bu,v = B).

• Fully Heterogeneous platforms may have different processor speeds and different link bandwidths.
Finally, we assume the existence of two special additional processors Pin and
Pout . The initial input data for each task resides on Pin , while all final results
must be returned to Pout .
The mapping problem consists in assigning application stages to processors.
If we restrict the search to one-to-one mappings, we require that each stage
Sk of the application pipeline be mapped onto a distinct processor Palloc(k)
(which is possible only if n ≤ p). The function alloc associates a processor
index to each stage index. For convenience, we create two fictitious stages S0
and Sn+1 , and we assign S0 to Pin and Sn+1 to Pout . What is the period of
Palloc(k) , i.e., the minimum delay between the processing of two consecutive
tasks? To answer this question, we need to know to which processors the
previous and next stages are assigned. Let t = alloc(k − 1), u = alloc(k) and
v = alloc(k + 1). Pu needs bk−1 /Bt,u time units to receive the input data from
Pt , wk /Wu time units to process it, and bk /Bu,v time units to send the result
to Pv , hence a cycle-time of bk−1 /Bt,u + wk /Wu + bk /Bu,v time units for Pu .
Because of the one-port communication model, these three steps are serialized
(see Figure 8.14 for an illustration). The period achieved with the mapping is
the maximum of the cycle-times of the processors, which corresponds to the
rate at which the pipeline can be activated.
In this simple instance, the optimization problem can be stated as follows:
determine a one-to-one allocation function alloc : [1, n] → [1, p] (augmented
with alloc(0) = in and alloc(n + 1) = out) such that
    Tperiod = max_{1≤k≤n} ( bk−1 / B_{alloc(k−1),alloc(k)} + wk / W_{alloc(k)} + bk / B_{alloc(k),alloc(k+1)} )    (8.3)
is minimized. We denote this optimization problem One-to-one Mapping.
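Evaluating Equation (8.3) for a candidate mapping is straightforward; the following sketch (data layout and names are ours) does exactly that:

```python
def period(alloc, b, w, W, B):
    """Period of a one-to-one mapping (Equation (8.3)).
    alloc[k] is the processor of stage k, with alloc[0] = 'in' and
    alloc[n+1] = 'out'; b[0..n] are the data sizes, w[1..n] the stage weights,
    W[u] the processor speeds and B[(u, v)] the link bandwidths."""
    n = len(w) - 1                      # stages are numbered 1..n, w[0] unused
    cycle_times = []
    for k in range(1, n + 1):
        t, u, v = alloc[k - 1], alloc[k], alloc[k + 1]
        cycle_times.append(b[k - 1] / B[(t, u)] + w[k] / W[u] + b[k] / B[(u, v)])
    return max(cycle_times)             # the pipeline is activated at this rate
```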
[Figure 8.14: a time-step diagram for processors P1, P2 and P3, illustrating how receive, compute and send operations are serialized under the one-port model and interleaved across consecutive periods.]
1  GREEDY ASSIGNMENT()
2  Work with the n fastest processors, numbered P1 to Pn
3      where W1 ≤ W2 ≤ . . . ≤ Wn
4  Mark all stages S1 to Sn as free
5  for u = 1 to n do
6      Pick any free stage Sk s.t. bk−1/B + wk/Wu + bk/B ≤ Tperiod
7      Assign Sk to Pu. Mark Sk as already assigned
8      If no stage is found return "failure"
9  Return "success"
the total computation time is O((pn + costGA ) log(pn)), where costGA is the
cost of the greedy assignment procedure.
We now describe the greedy assignment algorithm for a given Tperiod value. Recall that there are n stages to map onto p ≥ n processors in a one-to-one fashion. We target Communication Homogeneous platforms with heterogeneous processors (Wu ≠ Wv) but with homogeneous links (Bu,v = B). First, we retain only the fastest n processors, which we rename P1, P2, . . . , Pn such that W1 ≤ W2 ≤ . . . ≤ Wn. Then, we consider the processors in the order P1 to Pn, i.e., from the slowest to the fastest, and greedily assign to them any free, that is, not already assigned, stage that they can process within the period. Algorithm 8.1 details the procedure.
Before providing the proof, we proceed with a small example. Consider
an application with 3 stages with respective computation requirements 1, 1,
and 100; 3 processors of speed 1, 1, and 100; and no communication costs
whatsoever. Can we achieve Tperiod = 1? The algorithm starts by assigning
stages to the slowest processors, i.e., to the first two processors. Only the first
two stages can be assigned to these processors to fit into the period, so they
are chosen, and the last stage then fits when assigned to the fastest processor.
It cannot be assigned to one of the first two processors because the required
period would be exceeded.
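A direct Python transcription of Algorithm 8.1, together with a simple search over candidate periods, could look as follows; this is a sketch, and in particular the text's binary search over periods is replaced here by a plain scan of the candidate values for brevity:

```python
def greedy_assignment(T, b, w, W, B):
    """Algorithm 8.1: one-to-one mapping for a target period T on a
    Communication Homogeneous platform.  b[0..n]: data sizes, w[1..n]: stage
    weights, W: speeds of the n retained processors sorted non-decreasingly
    (slowest first), B: the common link bandwidth."""
    n = len(w) - 1
    free = set(range(1, n + 1))
    assignment = {}
    for u in range(n):                                  # slowest processor first
        ok = [k for k in free
              if b[k - 1] / B + w[k] / W[u] + b[k] / B <= T]
        if not ok:
            return None                                 # "failure"
        k = ok[0]                                       # any eligible stage works
        free.remove(k)
        assignment[k] = u
    return assignment                                   # "success"

def best_period(b, w, W, B):
    """Smallest feasible period: every achievable period is the cycle-time of
    some stage on some processor, so it suffices to test those values."""
    n = len(w) - 1
    candidates = sorted({b[k - 1] / B + w[k] / W[u] + b[k] / B
                         for k in range(1, n + 1) for u in range(n)})
    for T in candidates:                                # feasibility is monotone in T
        a = greedy_assignment(T, b, w, W, B)
        if a is not None:
            return T, a
    return None
```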
The proof that the greedy procedure returns a solution if and only if there
exists a solution of period Tperiod is done via a simple exchange argument.
Consider a valid one-to-one assignment of period Tperiod , denoted by A, and
assume that in this assignment Sk1 is assigned to P1 . Note first that the
greedy procedure will indeed find a stage to assign to P1 and cannot fail,
since Sk1 can be chosen. If the choice of the greedy procedure is actually
Sk1 , we proceed by induction with P2 . If the greedy procedure has selected
another stage Sk2 for P1 , we find which processor, say Pu , has been assigned
this stage in the valid assignment A. Then, we exchange the assignments of
P1 and Pu in A. Because Pu is faster than P1 , which could process Sk1 in
time in the assignment A, Pu can process Sk1 in time too. Because Sk2 has been mapped on P1 by the greedy procedure, P1 can process Sk2 in time. So the exchange is valid. We can consider the new assignment A, which is valid
and which assigns the same stage to P1 as the greedy procedure. The proof
proceeds by induction with P2 as before.
The complexity of the greedy assignment procedure is O(n²), because of the two loops over processors and stages. Altogether, since n ≤ p, the complexity of the whole algorithm is O(pn log(pn)), which is indeed polynomial in the problem size.
Formally, we define:
DEFINITION 8.1 (Hetero-1D-Partition-Dec). Given n elements a1, a2, . . . , an, p values W1, W2, . . . , Wp and a bound K, can we find a partition of [1..n] into p intervals I1, I2, . . . , Ip, with Ik = [dk, ek] and dk ≤ ek for 1 ≤ k ≤ p, d1 = 1, dk+1 = ek + 1 for 1 ≤ k ≤ p − 1 and ep = n, and a permutation σ of {1, 2, . . . , p}, such that

    max_{1≤k≤p}  ( Σ_{i∈Ik} ai ) / W_{σ(k)}  ≤  K ?
The sequence of elements (tasks) is

    A1, 1, 1, . . . , 1 (M times), C, D,   A2, 1, 1, . . . , 1 (M times), C, D,   . . . ,   Am, 1, 1, . . . , 1 (M times), C, D,

and the processor speeds are Wi = B + zi, Wm+i = C + M − yi, and W2m+i = D, for 1 ≤ i ≤ m.
• We map each task Ai and the following yσ1 (i) tasks of weight 1 onto
processor Pσ2 (i) .
8.3. Workflow Scheduling 289
• We map the following M − yσ1 (i) tasks of weight 1 and the next task, of
weight C, onto processor Pm+σ1 (i) .
• We map the next task, of weight D, onto the processor P2m+i .
We do have a valid partition of all the tasks into p = 3m intervals. For
1 ≤ i ≤ m, the load and speed of the processors are indeed equal:
• The load of Pσ2 (i) is Ai +yσ1 (i) = B+xi +yσ1 (i) and its speed is B+zσ2 (i) .
• The load of Pm+σ1 (i) is M − yσ1 (i) + C, which is equal to its speed.
• The load and speed of P2m+i are both equal to D.
The mapping does achieve the bound K = 1, hence a solution to I2 .
Suppose now that I2 has a solution, i.e., a mapping matching the bound
K = 1. We first observe that Wi < Wm+j < W2m+k = D for 1 ≤ i, j, k ≤ m. Indeed Wi = B + zi ≤ B + M = 3M, 5M ≤ Wm+j = C + M − yj ≤ 6M
and D = 7M . Hence, each of the m tasks of weight D must be assigned to a
processor of speed D, and it is the only task assigned to this processor. These
m singleton assignments divide the set of tasks into m intervals, namely the
set of tasks before the first task of weight D, and the m − 1 sets of tasks
lying between two consecutive tasks of weight D. The total weight of each
of these m intervals is Ai + M + C > B + M + C = 10M , while the largest
speed of the 2m remaining processors is 6M. Therefore, each interval must be assigned to at least 2 processors. However, there remain only 2m available processors, hence each interval is assigned exactly 2 processors.

Consider such an interval Ai 111...1 C with M tasks of weight 1, and let Pi1 and Pi2 be the two processors assigned to this interval. Tasks Ai and C are not assigned to the same processor (otherwise the whole interval would be assigned to a single processor). So Pi1 receives task Ai and hi tasks of weight 1 while Pi2 receives M − hi tasks of weight 1 and task C. The load of Pi2 is M − hi + C ≥ C = 5M while Wi ≤ 3M for 1 ≤ i ≤ m. Hence, Pi1 must be some Pi, 1 ≤ i ≤ m, while Pi2 must be some Pm+j, 1 ≤ j ≤ m. Because this holds true for each interval, this defines two permutations σ2(i) and σ1(i) such that Pi1 = Pσ2(i) and Pi2 = Pm+σ1(i). Because the bound K = 1 is achieved, we have:
• Ai + hi = B + xi + hi ≤ B + zσ2(i)

• M − hi + C ≤ C + M − yσ1(i)

Therefore, yσ1(i) ≤ hi and xi + hi ≤ zσ2(i), and Σ_{i=1}^{m} xi + Σ_{i=1}^{m} yi ≤ Σ_{i=1}^{m} xi + Σ_{i=1}^{m} hi ≤ Σ_{i=1}^{m} zi. By hypothesis, Σ_{i=1}^{m} xi + Σ_{i=1}^{m} yi = Σ_{i=1}^{m} zi, hence all inequalities are tight, and in particular Σ_{i=1}^{m} xi + Σ_{i=1}^{m} hi = Σ_{i=1}^{m} zi. We can deduce that Σ_{i=1}^{m} yi = Σ_{i=1}^{m} hi = Σ_{i=1}^{m} zi − Σ_{i=1}^{m} xi, and since yσ1(i) ≤ hi for all i, we have yσ1(i) = hi for all i. Similarly, we deduce that xi + hi = zσ2(i) for all i, and therefore xi + yσ1(i) = zσ2(i). Altogether, we have found a solution for I1, which concludes the proof.
FIGURE 8.15: The platform used in the reduction for Theorem 8.7. [The figure shows the input processor Pin linked to P1 with bandwidth 1, the processors P1, . . . , P4, the intermediate processors Pi,j whose links to Pi and Pj have bandwidth 2/d(ci, cj), and the output processor Pout linked to P4 with bandwidth 1.]
Suppose first that I1 has a solution. We map stage Si onto Pπ(i) for 1 ≤ i ≤ m, and stage S′i onto processor Pπ(i),π(i+1) for 1 ≤ i ≤ m − 1. The cycle-time of P1 is 1 + 2K + d(cπ(1), cπ(2))/2 ≤ 1 + 2K + K/2 ≤ 3K. Quite similarly, the cycle-time of Pm is smaller than 3K. For 2 ≤ i ≤ m − 1, the cycle-time of Pπ(i) is d(cπ(i−1), cπ(i))/2 + 2K + d(cπ(i), cπ(i+1))/2 ≤ 3K. Finally, for 1 ≤ i ≤ m − 1, the cycle-time of Pπ(i),π(i+1) is d(cπ(i), cπ(i+1))/2 + 2K + d(cπ(i), cπ(i+1))/2 ≤ 3K. The mapping does achieve a period that is no greater than 3K, hence a solution to I2.
Suppose now that I2 has a solution, i.e., a mapping of period lower than
3K. We first observe that each processor is assigned at most one stage by the
mapping, because executing two stages would require at least 2K + 2K units
of time, which would be too large to match the period. Next, we observe that
any slow link, of bandwidth 1/(5K), cannot be used in the solution: otherwise the period would exceed 5K.
The input processor Pin has a single fast link to P1, so necessarily P1 is assigned stage S1 (i.e., π(1) = 1). As observed above, P1 cannot execute any other stage. Because of fast links, stage S′1 must be assigned to some P1,j; we let j = π(2). Again, because of fast links and the one-to-one constraint, the only choice for stage S2 is Pπ(2). Necessarily j = π(2) ≠ π(1) = 1, otherwise P1 would execute two stages. We proceed similarly for stage S′2, assigned to some Pπ(2),k (let k = π(3)), and stage S3, assigned to Pπ(3). Owing to the one-to-one constraint, k ≠ 1 and k ≠ j, i.e., π : [1..3] → [1..m] is a one-to-one mapping. By induction, we build the full permutation π : [1..m] → [1..m].

TABLE 8.2: Summary of complexity results for the different instances of the workflow mapping problem.

                 Fully Hom.                  Comm. Hom.                  Hetero.
    One-to-one   polynomial (bin. search)    polynomial (bin. search)    NP-hard
    Interval     polynomial (dyn. prog.)     NP-hard                     NP-hard
Because the output processor Pout has a single fast link to Pm , necessarily Pm
is assigned stage Sm , hence π(m) = m.
We have built the desired permutation; it remains to show that, for 1 ≤ i ≤ m − 1, d(cπ(i), cπ(i+1)) ≤ K. The cycle-time of processor Pπ(i),π(i+1) is d(cπ(i), cπ(i+1))/2 + 2K + d(cπ(i), cπ(i+1))/2 ≤ 3K, hence d(cπ(i), cπ(i+1)) ≤ K. Altogether, we have found a solution for I1, which concludes the proof.
Table 8.2 summarizes all previous complexity results. We see that one level
of heterogeneity (in processor speed) is enough to make interval mapping NP-
hard, while two levels of heterogeneity (adding different link bandwidths) are
required to make one-to-one mapping NP-hard as well.
For an interval mapping with intervals Ij = [dj, ej], 1 ≤ j ≤ m, the response time writes:

    Trt = Σ_{1≤j≤m} ( b_{dj−1} / B_{alloc(dj−1),alloc(dj)} + ( Σ_{i=dj}^{ej} wi ) / W_{alloc(dj)} ) + bn / B_{alloc(n),alloc(n+1)} .    (8.5)
The response time for a one-to-one mapping obeys the same formula (with
the restriction that each interval has length 1).
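For illustration, here is a small sketch (data layout and names are ours) that evaluates Equation (8.5) for a given interval mapping:

```python
def response_time(intervals, alloc, b, w, W, B):
    """Equation (8.5): response time of an interval mapping.
    intervals: list of (d_j, e_j) covering stages 1..n in order,
    alloc[j]: processor of interval j; W[u]: speeds; B[(u, v)]: bandwidths,
    with the special keys 'in' and 'out' for Pin and Pout."""
    n = intervals[-1][1]
    prev = 'in'
    t = 0.0
    for j, (d, e) in enumerate(intervals):
        u = alloc[j]
        t += b[d - 1] / B[(prev, u)]          # receive the interval's input data
        t += sum(w[d:e + 1]) / W[u]           # process stages d..e
        prev = u
    t += b[n] / B[(prev, 'out')]              # return the final result to Pout
    return t
```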
Note that it may well be the case that different data sets have different
response times (because they are mapped onto different processor sets), hence
the response time is defined as the maximum response time over all data sets.
Intuitively, response time is small when all communications are zeroed out.
An obvious candidate mapping would be to map all stages on the fastest
processor, which will result in a large period. This is a general observation:
minimizing the response time is antagonistic to minimizing the period. The bibliographical notes at the end of the chapter point to works that tackle such bicriteria optimization problems.
Proof. We start with interval mappings, which are more natural to minimize
the response time as many communications will be zeroed out. On Fully Ho-
mogeneous platforms, the optimal solution is to map all stages on a single
(arbitrary) processor, because this zeroes out all communications except in-
put/output ones. On Communication Homogeneous platforms, the optimal
solution is to map all stages on the fastest processor, for the same reason.
All one-to-one mappings on Fully Homogeneous platforms have the same
response time. On Communication Homogeneous platforms, the optimal so-
lution is to assign the most computationally expensive stages to the fastest
processors, in a greedy manner (largest stage on fastest processor, second
largest on second fastest, and so on).
In Exercise 8.4 we see that the response time minimization problem is NP-
hard for one-to-one mappings on Fully Heterogeneous platforms. To the best
of our knowledge, the complexity is open for interval mappings on such plat-
forms, at least at the time this book is being written.
8.4 Hyperplane Scheduling (or Scheduling at Compile-Time)

The dream of automatically parallelizing arbitrary sequential programs has not (fully) come true yet, but in spite of many setbacks a lot of progress has been made in several directions. In this section, we focus solely
on the automatic parallelization of so-called uniform loop nests. We explain
simple but representative results, motivated by two seminal papers by Karp,
Miller and Winograd [72], and by Lamport [78] (see the bibliographical notes
at the end of the chapter for further information).
S1 : a←b+1
S2 : b←a−1
S3 : a←c−2
S4 : d←c
which reads the same variable c (but there is no write of c, hence the order
does not matter).
1 for i = 0 to N do
2 for j = 0 to N do
3 S1 (i, j) : a(i, j) = b(i, j − 6) + d(i − 1, j + 3)
4 S2 (i, j) : b(i + 1, j − 1) = c(i + 2, j + 5) + 1
5 S3 (i, j) : c(i + 3, j − 1) = a(i, j + 2)
6 S4 (i, j) : d(i, j − 1) = a(i, j − 1) − 1
    Dom = {(i, j) ∈ Z², 0 ≤ i ≤ N, 0 ≤ j ≤ N} .
Operation Si(I) is executed before operation Sj(J) in the sequential order, which we denote Si(I) <seq Sj(J), if and only if I <lex J, or I = J and Si <text Sj, where <lex is the lexicographic order of the iteration vectors, and <text is the textual order of statements in the program's source code. In the example, we have S3(2, 5) <seq S1(3, 1), S3(2, 5) <seq S2(2, 6) and S3(2, 5) <seq S4(2, 5).
There is a dependence between operation Si(I) and operation Sj(J) if:
• Si (I) is executed before Sj (J).
• Si (I) and Sj (J) refer to a memory location M , and at least one of these
references is a write.
• The memory location M is not written between iteration I and itera-
tion J.
We retrieve flow, anti and output dependences, depending on whether there
is a single write to M (flow if the read occurs after, anti if it comes before) or
two writes (output). The last condition is to ensure that the dependence is
indeed between successors in the dependence graph.
The dependence vector between iteration Si (I) and iteration Sj (J) is de-
fined as d(i,I),(j,J) = J − I. The loop nest is said to be uniform if the depen-
dence vectors d(i,I),(j,J) do not depend on either I or J, and we denote them
simply di,j . We point out that each dependence vector di,j is lexicographically
positive (its first nonzero component is greater than 0), due to the semantics
of the sequential execution: if there is a dependence from Si(I) to Sj(J), then I precedes J, i.e., I ≤lex J and di,j = J − I >lex 0. We can then
represent the dependences as an oriented graph with k nodes (the statements)
linked by edges corresponding to the dependence vectors.
In the example, Algorithm 8.2, we see that variable a(i, j) is produced
(written) by statement S1 (i, j) and consumed (read) by statement S4 (i, j +1),
hence a uniform dependence from S1 to S4 of vector d1,4 = (0, 1)ᵀ. How did we
find this? First, since S4 (i, j) reads a(i, j−1), then S4 (i, j+1) reads a(i, j), the
memory location written by S1 (i, j). We do have I = (i, j) <seq J = (i, j + 1).
No other operation writes into this location in between, since in this loop nest
every memory location is written only once. We have “manually” checked the
three conditions stated above.
How can we automate the process of finding dependences? For each candi-
date array and statement pair, we can write a system of equations, and try
to solve it. Let us do this for array a and the pair (S1 , S3 ). S1 writes into a
and S3 reads it, so we search if there exists a flow dependence from S1 (I) to
S3(J), for some I, J ∈ Dom with I <seq J. Letting I = (i, j) and J = (i′, j′) we obtain:

• i = i′ and j = j′ + 2: we write a(i, j) in S1(i, j), and read a(i′, j′ + 2) in S3(i′, j′);

• i < i′, or i = i′ and j < j′: this is the condition I <seq J.

We see that there is no solution, hence no flow dependence from S1 to S3. But we see (because we are used to looking at loop nests!) that there is an anti-dependence from S3 to S1. With the same notations, looking for an anti dependence from S3(J) to S1(I), for some I, J ∈ Dom with J <seq I, and letting J = (i′, j′), I = (i, j), we obtain:

• i′ = i and j′ + 2 = j: we read a(i′, j′ + 2) in S3(i′, j′), and write a(i, j) in S1(i, j);
[FIGURE 8.16: the dependence graph of the loop nest, with nodes S1, S2, S3, S4 and edges labeled by the uniform dependence vectors.]
The dependence graph is shown in Figure 8.16. It captures all the informa-
tion extracted from the dependence analysis. Note that this information is not
100% accurate. Each uniform dependence seems to occur for each operation
over the entire iteration domain, while it is not the case at the boundary. For
instance with the anti dependence from S3 to S1 , the condition j = j 0 + 2 can-
not be satisfied if j 0 = N − 1 or j 0 = N . However, the condition is satisfied on
most points of the domain, so it makes sense to approximate dependences as
we did in the graph. What is important is to always over-approximate, i.e., to record more dependences than may actually exist. Over-approximations
can reduce parallelism but will never lead to a violation of the semantics of
the original program, while under-approximations are simply. . . dangerous.
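The same dependence tests can be mechanized by brute force on a small domain. The sketch below enumerates flow dependences only (anti and output dependences would be handled analogously) and ignores the "no intermediate write" condition, which is a safe over-approximation as argued above:

```python
from itertools import product

N = 12   # a small domain is enough to exhibit the uniform dependences

def accesses(i, j):
    """Memory accesses of each statement at iteration (i, j):
    (statement name, cells written, cells read)."""
    return [
        ("S1", [("a", i, j)],         [("b", i, j - 6), ("d", i - 1, j + 3)]),
        ("S2", [("b", i + 1, j - 1)], [("c", i + 2, j + 5)]),
        ("S3", [("c", i + 3, j - 1)], [("a", i, j + 2)]),
        ("S4", [("d", i, j - 1)],     [("a", i, j - 1)]),
    ]

# flow dependences S_src(I) -> S_dst(J): a write followed (in <seq order) by a read
vectors = set()
for (i, j), (i2, j2) in product(product(range(N + 1), repeat=2), repeat=2):
    for si, (s1, writes1, _) in enumerate(accesses(i, j)):
        for sj, (s2, _, reads2) in enumerate(accesses(i2, j2)):
            before = (i, j) < (i2, j2) or ((i, j) == (i2, j2) and si < sj)
            if before and set(writes1) & set(reads2):
                vectors.add((s1, s2, (i2 - i, j2 - j)))
print(sorted(vectors))   # ('S1', 'S4', (0, 1)) appears, but no S1 -> S3 pair
```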
Here n is the nest depth, i.e., the dimension of index points, a is the number
of constraints that define the shape of the domain, and m is the number of
dependence vectors, so that the constraint matrix A is of dimension a × n and
the dependence matrix D is of dimension n × m.
Back to the example in Figure 8.16, we have Dom = {(i, j) ∈ Z², 0 ≤ i ≤ N, 0 ≤ j ≤ N}, which translates into

    A = ( −1   0
           1   0
           0  −1
           0   1 )    and    b = (0, N, 0, N)ᵀ = N × b₀ ,   where b₀ = (0, 1, 0, 1)ᵀ .
A linear schedule σπ maps each iteration point p ∈ Dom to the time-step σπ(p) = ⌊π · p⌋, in such a way that dependences are preserved. The vector π ∈ Q^{1×n} (π has rational coefficients) is called an admissible scheduling vector. Not any vector π can be chosen as an admissible scheduling vector, since dependences must be preserved.
The basic idea of linear schedules is to try to transform the original uniform
loop nests into an equivalent nest in which all the internal loops except the
first one are parallel, as in:
The external loop corresponds to an iteration time and E(time) is the set
of all the points computed at the step time. Such points must not be linked
by dependence vectors so that they can be executed simultaneously. With
a linear schedule σπ, we have E(time) = {p ∈ Dom, ⌊π · p⌋ = time}. The
scheduling vector π defines a family of affine hyperplanes H(t) orthogonal to
π such that the set of points executed at a time t is equal to the intersection
of the domain with the hyperplane H(t). The flow of computation goes from
H(t) to H(t + 1) by a translation; hence the name of the method that we
describe in this section: the computations progress like a wavefront parallel
to the family of hyperplanes H(t).
Admissible scheduling vectors π are easy to characterize for uniform loop
nests:
    πd ≥ 1 for each dependence vector d ∈ D.

Let us explain this: if p1, p2 ∈ Dom are two points such that p1 ≺ p2, i.e., p2 = p1 + di for some di ∈ D, we must have σπ(p1) < σπ(p2), i.e., ⌊πp1⌋ + 1 ≤ ⌊π(p1 + di)⌋. This inequality is satisfied if πdi ≥ 1. We retrieve Lamport's condition:

    πD ≥ 1 ,
which means that πdi ≥ 1 for all di ∈ D. Note that this condition is sufficient for π to be an admissible scheduling vector, and it is necessary unless the domain Dom is very small, a situation very unlikely to occur in practice.
How can we find an admissible scheduling vector? It turns out to be straightforward, owing to the fact that all dependence vectors are lexicographically positive. Assume that the columns of matrix D = (d1, . . . , dm), whose dimension is n × m, have been sorted lexicographically. Let k1 be the index of the first nonzero component of d1, the "smallest" vector of D. Since d1 is lexicographically positive, d1,k1 > 0. Take πk1 = 1 and πk = 0 for k1 < k ≤ n. Before defining the first components of π, remark that whatever their values, πd1 ≥ 1. Now, let k2 be the index of the first nonzero component of d2. Because d1 is lexicographically smaller than d2, k2 ≤ k1. Let πk = 0 for k2 < k < k1 and take for πk2 the smallest positive integer such that πd2 ≥ 1, modifying πk1 if necessary when k2 = k1. Continuing the process, we obtain an admissible vector.

Let us execute this procedure for the example in Figure 8.16 to find an admissible vector π = (π1, π2). After sorting the dependence matrix, we have

    D = ( 0   0   1   1   1
          1   2  −6  −4   5 ) .

We have n = 2, k1 = 2, and we take π2 = 1 so that πd1 ≥ 1, whatever the value of π1. We have k2 = 2, and πd2 ≥ 1, so we do not need to change the value of π2. Next k3 = 1 and the condition πd3 ≥ 1 writes π1 − 6π2 ≥ 1; so we take π1 = 7. We can keep the value π = (7, 1) because we already have πd4 ≥ 1 and πd5 ≥ 1.
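The greedy construction is only a few lines of code. The sketch below uses a slightly different but equivalent presentation (a monotone fix-up of the π component associated with the first nonzero entry of each column); on the example it returns the same vector π = (7, 1):

```python
from math import ceil

def admissible_vector(D):
    """Greedy construction of an admissible scheduling vector pi
    (pi . d >= 1 for every column d of D), assuming all columns are
    lexicographically positive and sorted in increasing lexicographic order."""
    n = len(D[0])
    pi = [0] * n
    for d in D:
        k = next(i for i, x in enumerate(d) if x != 0)   # first nonzero entry
        dot = sum(p * x for p, x in zip(pi, d))
        if dot < 1:
            pi[k] += ceil((1 - dot) / d[k])   # d[k] > 0 by lexico-positivity
        # increasing pi[k] cannot invalidate previously processed columns:
        # their first nonzero index is >= k, so their dot product only grows
    return pi

# columns of D for the example of Figure 8.16, sorted lexicographically
D = [(0, 1), (0, 2), (1, -6), (1, -4), (1, 5)]
print(admissible_vector(D))   # [7, 1], as found by hand above
```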
We know how to find an admissible scheduling vector. But how do we find the optimal one? We first start with the example in Figure 8.16, and then move to the general procedure. As discussed above, the scheduling vector π = (π1, π2) is admissible if and only if π2 ≥ 1 and π1 ≥ 1 + 6π2, which implies that both π1 and π2 are positive. The total execution time of σπ is given by the difference between the largest and the smallest value of ⌊π · p⌋ over the domain (plus one).
Consider the rational relaxation of the iteration domain, {p ∈ Qⁿ, Ap ≤ b}. To see why we may lose at most two time-steps when moving from Dom to this relaxed domain, we simply write the following:
the existence of the maximum implying the existence of the minimum. Thus,
the search for the optimal scheduling vector can be done by solving the fol-
lowing linear problem:
    πD ≥ 1
    X1 A = π
    X2 A = −π
    X1 ≥ 0
    X2 ≥ 0
    min (X1 + X2) b
We remark that this problem is linear in b. The search for the best scheduling vector over the family of domains {Ax ≤ N b} is reduced to the search on the domain {Ax ≤ b}, which can be done without knowing the parameter N, thus at compile-time.
For the example in Figure 8.16, we let π = (π1, π2), X1 = (a1, b1, c1, d1) and X2 = (a2, b2, c2, d2). We solve the following optimization problem:

    πD ≥ 1 :    π2 ≥ 1, 2π2 ≥ 1, π1 − 6π2 ≥ 1, π1 − 4π2 ≥ 1, π1 + 5π2 ≥ 1
    X1 A = π :  −a1 + b1 = π1, −c1 + d1 = π2
    X2 A = −π : −a2 + b2 = −π1, −c2 + d2 = −π2
    X1 ≥ 0 :    a1 ≥ 0, b1 ≥ 0, c1 ≥ 0, d1 ≥ 0
    X2 ≥ 0 :    a2 ≥ 0, b2 ≥ 0, c2 ≥ 0, d2 ≥ 0
    min N × (X1 + X2) b₀ :  N × (b1 + b2 + d1 + d2)

To solve this problem, we can use the simplex method provided by packages such as GLPK [63] or Maple [39]. We obtain the solution: π = (7, 1), X1 = (0, 7, 0, 1) and X2 = (7, 0, 1, 0). The total execution time is T*linear = N × (7 + 0 + 1 + 0) = 8N.
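For completeness, here is how this particular linear program could be handed to an off-the-shelf solver (scipy.optimize.linprog here; GLPK or Maple, cited above, work equally well). The variable ordering and the sign conventions are ours:

```python
import numpy as np
from scipy.optimize import linprog

# variables: x = (pi1, pi2, a1, b1, c1, d1, a2, b2, c2, d2)
D = np.array([[0, 0, 1, 1, 1],
              [1, 2, -6, -4, 5]])                 # columns = dependence vectors
n_var = 10
# pi . d >= 1 for each column d   <=>   -(pi . d) <= -1
A_ub = np.zeros((D.shape[1], n_var)); A_ub[:, 0:2] = -D.T
b_ub = -np.ones(D.shape[1])
# X1 A = pi and X2 A = -pi (A is the 4x2 constraint matrix of Dom)
A_eq = np.array([
    # pi1 pi2  a1  b1  c1  d1  a2  b2  c2  d2
    [-1,   0, -1,  1,  0,  0,  0,  0,  0,  0],    # -a1 + b1 = pi1
    [ 0,  -1,  0,  0, -1,  1,  0,  0,  0,  0],    # -c1 + d1 = pi2
    [ 1,   0,  0,  0,  0,  0, -1,  1,  0,  0],    # -a2 + b2 = -pi1
    [ 0,   1,  0,  0,  0,  0,  0,  0, -1,  1],    # -c2 + d2 = -pi2
])
b_eq = np.zeros(4)
cost = np.zeros(n_var); cost[[3, 5, 7, 9]] = 1    # minimize b1 + d1 + b2 + d2
bounds = [(None, None)] * 2 + [(0, None)] * 8     # pi free, X1 and X2 nonnegative
res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[:2], res.fun)   # expected: objective 8, i.e., total time 8N
```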
The last thing to do is to rewrite the loop nest to expose the parallelism.
We consider again the example of Figure 8.16. We need to transform the
original loop nest into
The trick is that the 2 × 2 matrix is unimodular, i.e., has an integral inverse:
    ( i )     ( 0   1 ) ( time )
    ( j )  =  ( 1  −7 ) ( proc ) .
Because 0 ≤ i, j ≤ N we can compute the new loop bounds for time and proc:

• time = 7i + j, hence timemin = 0 and timemax = 8N.

• i = proc, hence 0 ≤ proc ≤ N. Also, j = time − 7proc, hence 0 ≤ time − 7proc ≤ N. We derive ⌈(time − N)/7⌉ ≤ proc ≤ ⌊time/7⌋.
1  for time = 0 to 8N do
2      for proc = max(0, ⌈(time − N)/7⌉) to min(N, ⌊time/7⌋) do
3          a(proc, time − 7proc) ← b(proc, time − 7proc − 6) + d(proc − 1, time − 7proc + 3)
4          b(proc + 1, time − 7proc − 1) ← c(proc + 2, time − 7proc + 5) + 1
5          c(proc + 3, time − 7proc − 1) ← a(proc, time − 7proc + 2)
6          d(proc, time − 7proc − 1) ← a(proc, time − 7proc − 1) − 1
we can execute all iterations of the inner loop simultaneously on distinct pro-
cessors (hence the name ”proc” for the first component). Again, it is possible
to automate the process of rewriting the loop nest in the general case. This
relies on sophisticated mathematical tools such as Hermite normal forms and
Fourier-Motzkin elimination, and we point the reader to references [50, 5] for
more details.
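A quick sanity check of the transformation is to verify that the new loop bounds sweep exactly the original iteration domain; here is a throw-away sketch (the ceiling in the lower bound is the one derived above):

```python
from math import ceil, floor

N = 10
original = {(i, j) for i in range(N + 1) for j in range(N + 1)}
transformed = set()
for time in range(0, 8 * N + 1):
    lo = max(0, ceil((time - N) / 7))
    hi = min(N, floor(time / 7))
    for proc in range(lo, hi + 1):
        i, j = proc, time - 7 * proc        # the unimodular change of variables
        transformed.add((i, j))
print(transformed == original)              # True: same iteration points, reordered
```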
Bibliographical Notes
The divisible load model has been widely studied in the last several years,
after having been popularized by the landmark book by Bharadwaj, Ghose,
Mani and Robertazzi [32]. See also the introductory papers [33, 103]. Recent
results on star and tree networks are surveyed in [19].
Steady-state scheduling was pioneered by Bertsimas and Gamarnik [29].
See [13, 20, 21, 18] for recent applications of the technique.
Workflow scheduling is a hot topic. A few papers related to the material
covered in this chapter are [114, 30, 31, 111, 24]. Bicriteria optimization (response time with period constraints, or the converse) is discussed in [115, 22, 23, 118].
Finally, loop nest scheduling dates back to the seminal papers of Karp,
Miller and Winograd [72] and Lamport [78]. Many loop transformation al-
gorithms are described in the books by Banerjee [10], by Wolfe [120], and by
Allen and Kennedy [5]. An introduction to the topic of uniform recursions,
loop parallelization and software pipelining is provided in the book by Darte,
Robert and Vivien [50].
8.5 Exercises
We offer quite a comprehensive set of exercises. Exercise 8.1 is devoted
to divisible load scheduling on tree platforms. Exercise 8.2 deals with the
steady-state approach for multiple applications. The next two exercises are
on the topic of workflow scheduling: Exercise 8.3 tackles the classic chains-to-
chains problem, while Exercise 8.4 explores the complexity of response time
minimization. Finally, the last two exercises study dependences in loop nests.
Exercise 8.5 applies basic techniques from Section 8.4. Exercise 8.6 is more
involved and investigates the problem of removing dependence cycles.
[FIGURE 8.17: a tree-shaped platform with a master P0 and five workers P1, . . . , P5; worker Pi has cycle-time wi and its incoming link has communication cost ci.]
1. Consider the tree shown in Figure 8.17, with a master P0 and 5 workers
P1 to P5 . The cycle-time of worker Pi is wi . Let Wtotal be the total amount
of work, and let αi be the fraction executed by Pi for 1 ≤ i ≤ 5. Assume that c1 ≤ c2, so that the master P0 will serve P1 before P2 in the optimal solution
(same proof as in Lemma 8.3). Write the linear program that characterizes
the optimal makespan T for the tree.
[Figure: a single-level tree (a root with link cost c0 and cycle-time w0, and p children with link costs ci and cycle-times wi) is replaced by an equivalent single node with link cost c0.]
    Maximize W,
    subject to
        (1) αi ≥ 0,  0 ≤ i ≤ p
        (2) Σ_{i=0}^{p} αi = W
        (3) W c0 + α0 w0 = 1
        (4) W c0 + Σ_{j=1}^{i} αj cj + αi wi = 1,  1 ≤ i ≤ p
3. Explain how the previous reduction can be used to compute the optimal
solution for a general multi-level tree. Can you compare this approach to that
in Question 1?
• Application k has weight w(k) , and each task of type k involves b(k)
bytes and c(k) operations.
1. Let α(k) be the throughput achieved for application k, i.e., the number of
tasks of type k that are executed on the platform every time unit in steady-
state mode. Explain why maximizing

    min_k  α(k) / w(k)

is a more meaningful objective than maximizing

    Σ_k  α(k) / w(k) .

2. Write a linear program to maximize the former objective function min_k { α(k) / w(k) }.
    max_{1≤k≤p}  Σ_{i∈Ik} ai  =  max_{1≤k≤p}  Σ_{i=dk}^{ek} ai .
2. Give a binary search algorithm to solve the problem. What is its com-
plexity?
3. What are the admissible scheduling vectors for the previous loop nest?
Determine the one that minimizes the execution time. Use this vector to
re-write the loop nest so that the innermost loop is parallel.
2. In the general case, the node splitting technique proceeds as shown below:
    for i = 1 to N do                        for i = 1 to N do
        Sk : lhs(f(i)) ← rhs(. . . )             S′k : temp(f(i)) ← rhs(. . . )
        ...                                      Sk : lhs(f(i)) ← temp(f(i))
        ...                                      ...
        Si : . . . ← lhs(g(i))                   Si : . . . ← temp(g(i))
Assume that the access function lhs (standing for left-hand side) is one-
to-one. Show the effect of the node splitting technique on the six possible
dependence types: dependences incoming to Sk of type flow, anti and output,
and dependences outgoing from Sk of type flow, anti and output.
3. Let G be the RDG of a loop nest, and let G′ be the graph obtained after splitting all the nodes. Show that a cycle in G′ is either uniquely composed of flow dependences, or uniquely composed of output dependences. In addition, show that any cycle in G′ corresponds to an existing cycle in G.
8.6 Answers
Exercise 8.1 (Divisible Load Scheduling on Heterogeneous
Trees)
. Question 1. We give the linear program and comment on the equations
afterwards:
    Minimize T,
    subject to
        (i)  αi ≥ 0,  1 ≤ i ≤ 5
        (ii) Σ_{i=1}^{5} αi = Wtotal
        (0)  (α1 + α3)c1 + (α2 + α4 + α5)c2 ≤ T
        (1)  (α1 + α3)c1 + α1 w1 ≤ T
        (1') (α1 + α3)c1 + α3 c3 ≤ T
        (2)  (α2 + α4 + α5)c2 + α2 w2 ≤ T
        (2') (α2 + α4 + α5)c2 + α4 c4 ≤ T
        (3)  (α1 + α3)c1 + α3 c3 + α3 w3 ≤ T
        (4)  (α2 + α4 + α5)c2 + α4 c4 + α4 w4 ≤ T
        (4') (α2 + α4 + α5)c2 + α4 c4 + α5 c5 ≤ T
        (5)  (α2 + α4 + α5)c2 + α4 c4 + α5 c5 + α5 w5 ≤ T
Equation (0) expresses the total communication time for the master P0 . Equa-
tion (1) says that after the end of its incoming communication, internal node
P1 should be constantly computing. Equation (1’) says that after the end of
its incoming communication, P1 should be constantly sending data to
P3 while computing. Note that Equation (1’) is useless, due to Equation (3).
Equation (3) states that P3 keeps computing after receiving its own data,
which comes from the master together with the data for P1 (hence the term
(α1 + α3 )c1 ), and which is then forwarded by P1 (hence the term α3 c3 ). Fi-
nally, Equations (2), (4) and (5) are the counterpart for the right side of the
tree (equations (2’) and (4’) are useless as well).
. Question 2. Here, instead of minimizing the time Tf required to execute
load W , we aim at determining the maximum amount of work W that can be
processed in one time-unit. Obviously, after the end of the incoming commu-
nication, the parent P0 should be constantly computing (equation (3)). We
know that all children (i) participate in the computation and (ii) terminate
execution at the same time. Finally, the optimal ordering for the children is
given by Lemma 8.3. Equation (4) expresses the communication and com-
putation time for each child, using the optimal ordering. This completes the
proof. Note that because equations (3) and (4) are equalities, we could derive
a closed-form expression for w−1 = 1/W .
. Question 3. First we traverse the tree from bottom to top, replacing each
single-level tree by the equivalent node. We do this until there remains a single
star. We solve the problem for the star, using the results of Section 8.1.4.
Then, we traverse the tree from top to bottom, and undo each transformation
in the reverse order. Going back to a reduced node, we know for how long
it is supposed to be working. We know the optimal ordering of its children
and we know for which amount of time each of the children is supposed to
work. If one of these children is a leaf node, we have computed its load. If it
is instead a reduced node, we apply the transformation recursively.
Instead of this pair of tree traversals, we could write down the linear pro-
gram for a general tree, exactly as we did for the example in Figure 8.17.
Briefly, when it receives data a given node knows exactly what to do: com-
pute itself all the remaining time, and forward data to its children in de-
creasing bandwidth order. The problem is that the size of the linear program
would grow proportionally to the size of the tree. Consequently, the recursive
solution is preferred, at least for large platforms.
With the above equation we simply try all possible splits into [1, s] (with k − 1 intervals) and the single interval [s + 1, i]. We can speed up computations by pre-computing all values f(s, i) = Σ_{j=s+1}^{i} aj in time O(n). The complexity becomes O(n²p).
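The recurrence referred to above is not reproduced here (it was stated earlier in the exercise); assuming the standard formulation g(i, k) = min_{s<i} max(g(s, k − 1), f(s, i)), a direct implementation with precomputed prefix sums reads:

```python
def chains_to_chains(a, p):
    """Partition a[1..n] (a[0] unused) into p intervals, minimizing the
    maximum interval sum.  Assumed recurrence:
    g(i, k) = min over s < i of max(g(s, k-1), f(s, i)),
    where f(s, i) = a[s+1] + ... + a[i]."""
    n = len(a) - 1
    prefix = [0] * (n + 1)
    for i in range(1, n + 1):
        prefix[i] = prefix[i - 1] + a[i]
    f = lambda s, i: prefix[i] - prefix[s]            # sum of a[s+1..i], O(1)
    INF = float("inf")
    g = [[INF] * (p + 1) for _ in range(n + 1)]
    g[0][0] = 0
    for k in range(1, p + 1):
        for i in range(1, n + 1):
            g[i][k] = min(max(g[s][k - 1], f(s, i)) for s in range(i))
    return g[n][p]

print(chains_to_chains([0, 4, 2, 7, 3, 5], 2))   # best maximum interval sum: 13
```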
To re-write the loop nest, we do nothing, because we can use the identity
matrix as the unimodular transformation matrix. We merely mark the second
loop as parallel (to show that we have worked hard):
for i = 1 to N do
for j = 1 to N in parallel do
S1 : a(i+1, j−1) ← b(i, j+4)+c(i−2, j−3)+1
S2 : b(i − 1, j) ← a(i, j) − 1
S3 : c(i, j) ← a(i, j − 2) + b(i, j)
[Figure: the dependence graph of the loop after splitting S1: nodes S′1, S1 and S2, with a flow dependence (f) and an output dependence (o).]
The cycle has been broken. We can split (or “distribute”) the loop to exhibit
the parallelism:
for i = 1 to N do
S′1 : temp(i) ← b(i) + c(i)
a(1) ← temp(1)
for i = 1 to N do
S2 : a(i + 1) ← temp(i) + 2d(i)
. Question 2. We examine the six possibilities. We obtain the following
transformations:
[Three before/after diagrams showing the effect of node splitting on dependences incoming to Sk (flow fin, anti ain, output oin): in each case Sk is split into S′k : temp(f(i)) ← rhs(. . . ) followed by Sk : lhs(f(i)) ← temp(f(i)), the two being linked by a new flow dependence fnew.]
FIGURE 8.22: Flow dependence outgoing from Sk , before and after node
splitting.
FIGURE 8.23: Anti dependence outgoing from Sk , before and after node
splitting.
dependence incoming to or outgoing from the new node S′k. Next, there are only edges corresponding to output dependences that go out from Si. Therefore, the edge following e in the cycle C also corresponds to an output dependence. In conclusion, C is uniquely composed of output dependence edges. And all edges in C are edges that already existed in G.

• If e corresponds to an anti dependence, then e goes from a node S′k to a node Si. The only edges going out from Si are output dependence edges. Therefore, the edge following e in the cycle C is an output dependence edge. Reasoning as before, C is uniquely composed of output dependence edges, which contradicts our assumption regarding e. Therefore, no cycle in G′ may include an anti dependence edge.

• If e corresponds to a flow dependence, then either e has been created when splitting a node Sk (and then goes from S′k to Sk), or e corresponds to an existing edge going from Sk to Si in G (and then goes from S′k to S′i). We study both cases:

    – e : S′k → Sk (the new dependence fnew): all edges going out from Sk correspond to output dependences, so the edge following e in the cycle C is an output dependence edge.
FIGURE 8.24: Output dependence outgoing from Sk , before and after node
splitting.
[Figure: summary of the node splitting transformation: the original node Sk, with incoming dependences fin, ain, oin and outgoing dependences fout, aout, oout, is replaced by the pair S′k and Sk linked by the new flow dependence fnew.]
In summary, the splitting technique has not introduced any new dependence
cycle. It has made it possible to break all cycles, except those uniquely com-
posed of flow dependences, and those uniquely composed of output depen-
dences.
[Figure: the reduced dependence graph (RDG) of the loop nest: nodes S1, S2, S3, S4, with flow (f), anti (a) and output (o) dependence edges.]
The RDG is strongly connected. Splitting nodes S2 and S3 leads to the RDG
shown in Figure 8.27.
[FIGURE 8.27: the RDG after splitting S2 and S3: the new nodes S′2 and S′3 are inserted, and no cycle remains.]
There is no cycle in this new graph. We rewrite the loop nest using tempo-
rary variables atemp (for the split of S3 ) and btemp (for the split of S2 ):
    for i = 4 to N do
        S1 : a(i + 5) ← c(i − 3) + b(2i + 2)
        S′2 : btemp(2i) ← atemp(i − 1) + 1   if i ≥ 5
                          a(i − 1) + 1       if i = 4
        S2 : b(2i) ← btemp(2i)
        S′3 : atemp(i) ← c(i + 5) + 1
        S3 : a(i) ← atemp(i)
        S4 : c(i) ← btemp(2i − 4)   if i ≥ 6
                    b(2i − 4)       if i ≤ 5