Parallel Algorithms
CRC PRESS
Boca Raton London New York Washington, D.C.
Preface
Parallel computing has undergone a stunning evolution, with high points (e.g.,
being able to solve many of the grand-challenge computational problems out-
lined in the 80’s) and low points (e.g., the demise of countless parallel com-
puter vendors). Today, parallel computing is omnipresent across a large spec-
trum of computing platforms. At the “microscopic” level, processor cores have
used multiple functional units in concurrent and pipelined fashion for years
and multiple-core chips are now commonplace with a trend toward rapidly
increasing numbers of cores per chip. At the “macroscopic” level, one can now
build clusters of hundreds to thousands of individual (multi-core) computers.
Such distributed-memory systems have become mainstream and affordable in
the form of commodity clusters. Furthermore, advances in network technol-
ogy and infrastructures have made it possible to aggregate parallel computing
platforms across wide-area networks in so-called “grids”.
Objective
The aim of this book is to provide a rigorous yet accessible treatment of par-
allel algorithms, including theoretical models of parallel computation, parallel
algorithm design for homogeneous and heterogeneous platforms, complexity
and performance analysis, and fundamental notions of scheduling. The fo-
cus is on algorithms for distributed-memory parallel architectures in which
computing elements communicate by exchanging messages. While such plat-
forms have become mainstream, the design of efficient and sound parallel
algorithms is still a challenging proposition. Fortunately, in spite of the “leaps
and bounds” evolution of parallel computing technology, there exists a core
of fundamental algorithmic principles. These principles are largely indepen-
dent from the details of the underlying platform architecture and provide
the basis for developing applications on current and future parallel platforms.
This book identifies and synthesizes fundamental ideas and generally applica-
ble algorithmic principles out of the mass of parallel algorithm expertise and
practical implementations developed over the last decades.
(ii) a set of exercises; and (iii) solution sketches for exercises marked with
a . This book should be ideally suited for teaching a course on parallel algo-
rithms, or as a complementary text for teaching a course on high-performance
computing. Importantly, although most of the content of the book is about
algorithm design and analysis, it is nevertheless a sound basis for teaching
applied parallel programming. Many of the included examples, case studies,
and exercises are natural starting points for hands-on homework assignments.
Although the content in the more theoretical chapters and that in the more
applied chapters are complementary, it is possible to cover only a subset of the
chapters. For a course focused on theoretical foundations of parallel algorithm
design, one may opt for covering only Chapters 1, 2, 7, and 8. For a course
focused on more practical algorithm design, one may opt for covering only
Chapters 3, 4, 5, and 6.
Acknowledgments
The content of this book, or at least preliminary versions of it, has been
used to teach courses at École Normale Supérieure de Lyon, École Polytech-
nique, the University of Tennessee, Knoxville, and the University of Hawai‘i
at Mānoa. We are grateful to the students for their feedback and suggestions.
We also wish to thank the following people who have contributed to some
of the content by their insightful suggestions, their own previously published
work, or their help reviewing draft chapters: Olivier Beaumont, Mahdi Bel-
caid, Anne Benoit, Rémi Bertin, Vincent Boudet, Benjamin Depardon, Larry
Carter, Alain Darte, Frédéric Desprez, Jack Dongarra, Jeanne Ferrante, Mat-
thieu Gallet, Philip Johnson, Jean-Yves L’Excellent, Loris Marchal, Fab-
rice Rastello, Arnold Rosenberg, Veronika Rehn-Sonigo, Mark Stillwell, Marc
Tchiboukdjian, Jean-Marc Vincent, Frédéric Vivien, and Joshua Wingstrom.
Henri Casanova
Arnaud Legrand
Yves Robert
Contents
Preface iii
I Models 1
1 PRAM Model 3
1.1 Pointer Jumping . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 List Ranking . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2 Prefix Computation . . . . . . . . . . . . . . . . . . . 7
1.1.3 Euler Tour . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Performance Evaluation of PRAM Algorithms . . . . . . . . 10
1.2.1 Cost, Work, Speedup and Efficiency . . . . . . . . . . 10
1.2.2 A Simple Simulation Result . . . . . . . . . . . . . . . 10
1.2.3 Brent’s Theorem . . . . . . . . . . . . . . . . . . . . . 12
1.3 Comparison of PRAM Models . . . . . . . . . . . . . . . . . 12
1.3.1 Model Separation . . . . . . . . . . . . . . . . . . . . . 13
1.3.2 Simulation Theorem . . . . . . . . . . . . . . . . . . . 14
1.4 Sorting Machine . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.1 Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.2 Sorting Trees . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.3 Complexity and Correctness . . . . . . . . . . . . . . . 20
1.5 Relevance of the PRAM Model . . . . . . . . . . . . . . . . . 24
1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . 26
Selection in a List . . . . . . . . . . . . . . . . . . . . . . . . 26
Splitting an Array . . . . . . . . . . . . . . . . . . . . . . . . 26
Looking for Roots in a Forest . . . . . . . . . . . . . . . . . . 26
First Non-Zero Element . . . . . . . . . . . . . . . . . . . . . 27
Mystery Function . . . . . . . . . . . . . . . . . . . . . . . . . 27
Connected Components . . . . . . . . . . . . . . . . . . . . . 28
1.7 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2 Sorting Networks 37
2.1 Odd-Even Merge Sort . . . . . . . . . . . . . . . . . . . . . . 37
2.1.1 Odd-Even Merging Network . . . . . . . . . . . . . . . 38
2.1.2 Sorting Network . . . . . . . . . . . . . . . . . . . . . 41
2.1.3 0–1 Principle . . . . . . . . . . . . . . . . . . . . . . . 42
3 Networking 57
3.1 Interconnection Networks . . . . . . . . . . . . . . . . . . . . 57
3.1.1 Topologies . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.1.2 A Few Static Topologies . . . . . . . . . . . . . . . . . 59
3.1.3 Dynamic Topologies . . . . . . . . . . . . . . . . . . . 61
3.2 Communication Model . . . . . . . . . . . . . . . . . . . . . 62
3.2.1 A Simple Performance Model . . . . . . . . . . . . . . 62
3.2.2 Point-to-Point Communication Protocols . . . . . . . 63
3.2.3 More Precise Models . . . . . . . . . . . . . . . . . . . 66
3.3 Case Study: the Unidirectional Ring . . . . . . . . . . . . . . 72
3.3.1 Broadcast . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3.2 Scatter . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.3.3 All-to-all . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.3.4 Pipelined Broadcast . . . . . . . . . . . . . . . . . . . 78
3.4 Case Study: the Hypercube . . . . . . . . . . . . . . . . . . . 79
3.4.1 Labeling Vertices . . . . . . . . . . . . . . . . . . . . . 79
3.4.2 Paths and Routing in a Hypercube . . . . . . . . . . . 80
3.4.3 Embedding Rings and Grids into Hypercubes . . . . . 82
3.4.4 Collective Communications in a Hypercube . . . . . . 83
3.5 Peer-to-Peer Computing . . . . . . . . . . . . . . . . . . . . 87
3.5.1 Distributed Hash Tables and Structured Overlay Net-
works . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.5.2 Chord . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.5.3 Plaxton Routing Algorithm . . . . . . . . . . . . . . . 91
3.5.4 Multi-casting in a Distributed Hash Table . . . . . . . 91
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Torus, Hypercubes and Binary Trees . . . . . . . . . . . . . . 93
Torus, Hypercubes and Binary Trees (again!) . . . . . . . . . 93
Cube-Connected Cycles . . . . . . . . . . . . . . . . . . . . . 93
Matrix Transposition . . . . . . . . . . . . . . . . . . . . . . . 94
Dynamically Switched Networks . . . . . . . . . . . . . . . . 94
De Bruijn network . . . . . . . . . . . . . . . . . . . . . . . . 96
Gossip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.7 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Bibliography 321
Index 331
Part I
Models
Chapter 1
PRAM Model
FIGURE 1.1: The PRAM model: processing units P1, P2, . . . , Pn accessing a shared central memory.
The PRAM model, which stands for Parallel RAM, comprises a shared central memory that can be accessed by the various Processing Units (see Figure 1.1), or PUs. All PUs execute synchronously the same algorithm and work on distinct (or not) memory areas. In this model neither the number of PUs nor the size of the memory is bounded, which clearly does not hold in practice.
ALGORITHM 1.1: List ranking by pointer jumping.
1 RANK COMPUTATION(L)
2   forall i in parallel do  { Initialization }
3     if next[i] = Nil then d[i] ← 0 else d[i] ← 1
4   while there exists a node i such that next[i] ≠ Nil do  { Main loop }
5     forall i in parallel do
6       if next[i] ≠ Nil then
7         d[i] ← d[i] + d[next[i]]
8         next[i] ← next[next[i]]
2 As in most computer science books, all logarithms are calculated in base 2 and we use
log n to denote log2 n.
FIGURE 1.2: Typical execution of the list ranking algorithm. Gray cells indicate active values, i.e., values that are in the process of being computed.
Consider for instance the following instruction (assuming that indices for array A start at 1 and that PU Pi is responsible for updating array element A[i]):
forall i in parallel do
if i > 1 then
A[i − 1 ] ← A[i ]
The problem here is that processor P2 may update A[2] before P1 can read
it. The same, if not as glaring, problem occurs in statements 7 and 8 of
Algorithm 1.1. To ensure that the loop executes correctly, it suffices to ensure
that all read operations happen before all write operations. Therefore, the
semantics of a forall i in parallel do loop is assumed to be as follows:
    forall i in parallel do A[i] ← B[i]
is equivalent to
    forall i in parallel do temp[i] ← B[i]
    forall i in parallel do A[i] ← temp[i]
Another problem with the algorithm as it is written in Algorithm 1.1 is that
the test in Statement 4 can be done in constant time only on a CRCW PRAM.
On a CRCW PRAM, the test could be implemented by having all PUs write
their boolean value of next[i] = Nil to a single memory cell, say done. Using
a CRCW with a fusion mode for concurrent writes based on a logical AND,
done is true only once the algorithm has completed. Unfortunately, there is
no such constant time solution on a CREW PRAM. On a CREW PRAM the
best approach is to perform pair-wise logical AND operations in a binary tree
pattern, leading to O(log n) steps. We will further discuss this distinction
between a CRCW and a CREW PRAM in Section 1.3.1.
In the particular case of our list ranking algorithm, we know that the algo-
rithm proceeds in dlog ne steps. Therefore, we can rewrite the main loop:
while there exists a node i such that next[i] ≠ Nil do
as:
for step=1 to dlog ne do
1 PREFIX COMPUTATION(L)
2   forall i in parallel do  { Initialization }
3     y[i] ← x[i]
4   while there exists a node i such that next[i] ≠ Nil do  { Main loop }
5     forall i in parallel do
6       if next[i] ≠ Nil then
7         y[next[i]] ← y[i] ⊗ y[next[i]]
8         next[i] ← next[next[i]]
Finally, to avoid concurrent reads in statement 7 of Algorithm 1.1, it suffices to replace
    d[i] ← d[i] + d[next[i]]
by
    temp[i] ← d[next[i]]
    d[i] ← d[i] + temp[i]
With this last simple transformation, we now obtain an O(log n) algorithm on an EREW PRAM, the most restrictive PRAM model.
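To make the pointer jumping technique concrete, here is a small sequential simulation (ours, in Python; it is not part of the book) of the EREW version: the list is given as a next array with None marking the last cell, and each of the ⌈log n⌉ synchronized rounds performs all reads into temporary arrays before any write, exactly as the forall semantics above requires.

from math import ceil, log2

def list_ranking(next_ptr):
    # Rank each cell of a linked list given as a 'next' array (None = end).
    # Simulates the EREW pointer-jumping algorithm: ceil(log2 n) rounds,
    # each doing all reads into temporaries before any write.
    n = len(next_ptr)
    nxt = list(next_ptr)
    d = [0 if nxt[i] is None else 1 for i in range(n)]        # initialization
    for _ in range(ceil(log2(n)) if n > 1 else 0):
        temp_d  = [d[nxt[i]]   if nxt[i] is not None else None for i in range(n)]
        temp_nx = [nxt[nxt[i]] if nxt[i] is not None else None for i in range(n)]
        for i in range(n):                                     # write phase
            if nxt[i] is not None:
                d[i] += temp_d[i]
                nxt[i] = temp_nx[i]
    return d

print(list_ranking([1, 2, 3, 4, None]))   # the list 0 -> 1 -> 2 -> 3 -> 4: [4, 3, 2, 1, 0]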
FIGURE 1.4: Euler tour. 1.4(a): creation of the Eulerian path and initialization; 1.4(b): results after the prefix computation.
The above creates a depth-first path through the tree. We denote by x [i ] the
value for which PU Pi is responsible. These values are initialized as follows:
x[i] = 1 if Pi is a PU of type A,
x[i] = 0 if Pi is a PU of type B,
x[i] = −1 if Pi is a PU of type C.
Figure 1.4(a) shows these values, written as “PU type = value” for simplicity.
Once the above list is established and the values of its elements are ini-
tialized, which can be done in constant time on an EREW PRAM, we can
perform a prefix computation for the addition operator. The reader can easily
check that this computation leads to the values shown in Figure 1.4(b). The
depth of a vertex is stored in the value stored by the C PU associated to that
vertex. In hindsight we now see the rationale for the initial values of the list
elements:
• the C PU’s contribution to the sum is equal to −1, accounting for going
up a level in the tree.
Intuitively, the cost corresponds to the area of a rectangle: Tpar (p, n) (the execution time) times p (the number of PUs). Therefore, the cost is minimal if, at each step, all PUs are used to perform useful computations, i.e., computations that are part of the sequential algorithm. In this case, the cost is equal to the work.
The speedup of a PRAM algorithm is the factor by which the program’s
execution time is lower than that of the sequential program.
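As a simple illustration (ours, with made-up numbers), the sketch below computes cost, speedup, and efficiency from a sequential time, a parallel time and a number of PUs.

def pram_metrics(t_seq, t_par, p):
    # Cost is the p x Tpar(p, n) rectangle; efficiency is speedup divided by p.
    cost = p * t_par
    speedup = t_seq / t_par
    efficiency = speedup / p
    return cost, speedup, efficiency

print(pram_metrics(100.0, 4.0, 32))   # (128.0, 25.0, 0.78125)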
performance model of the algorithm for a PRAM with more PUs. The main
idea is that a PRAM with fewer PUs can simulate a PRAM with more PUs
by executing fewer operations concurrently (with some loss of performance).
type. In this case efficiency is greater than 1 and one speaks of super-linear
speedup. One cause of super-linear speedup is that p processors typically
have p times as much cache and memory as a single processor. Therefore,
when using sufficiently many processors the entire data for the problem at
hand may suddenly fit entirely in memory (or in cache), thus avoiding costly
swapping (or cache misses). It is important to understand that, although
highly desirable in practice, achieving super-linear speedup does not mean
that the parallel algorithm is particularly efficient. From a strictly parallel
algorithms perspective, comparing a parallel execution of an algorithm that
achieves super-linear speedup to a sequential execution that runs out of mem-
ory or cache is not a fair comparison. One option would be to compute a
speedup relative to the execution time when using the smallest number of
processors that leads to a super-linear speedup.
1 COMPUTE MAXIMUM(A, n)
2 forall i ∈ {1, . . . , n} in parallel do
3 m[i ] ←True
4 forall (i, j) ∈ {1, . . . , n}², i ≠ j in parallel do
5 if A[i ] < A[j ] then m[i ] ← False
6 forall i ∈ {1, . . . , n} in parallel do
7 if m[i ] = True then max ← A[i ]
8 return max
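The sketch below (ours) simulates this constant-time CRCW maximum algorithm sequentially: the two nested loops stand for the n² virtual PUs of the comparison step, and the final concurrent write is unambiguous because every PU that writes holds the same, maximal, value.

def crcw_maximum(A):
    # Simulate the O(1)-time CRCW maximum algorithm with n^2 'virtual PUs'.
    n = len(A)
    m = [True] * n                       # m[i]: "A[i] is a candidate maximum"
    for i in range(n):                   # all n^2 comparisons are independent
        for j in range(n):               # and run in one step on the PRAM
            if i != j and A[i] < A[j]:
                m[i] = False             # concurrent writes of the same value
    result = None
    for i in range(n):                   # surviving PUs all write the same value
        if m[i]:
            result = A[i]
    return result

print(crcw_maximum([3, 9, 2, 9, 5]))     # 9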
can be done in O(1) units of time only if Ω(n) copies of e have been created,
which cannot be done in less than Ω(log n) time. More generally, broadcasting a piece of information to n PUs on an EREW PRAM requires Ω(log n) steps, the number of copies of the information at best doubling at each step.
THEOREM 1.2. Any CRCW algorithm with p PUs can be at most O(log p) times faster than the best EREW algorithm with p PUs for solving the same problem.
Proof. Let us assume that the CRCW PRAM uses a consistent mode (only
concurrent writes of the same values are authorized). We show a method to
simulate concurrent writes with only exclusive writes (concurrent reads can
be handled in a similar fashion).
(Figure: simulating one step of concurrent writes: each PU Pi writes an (address, value) pair into A[i], and A is then sorted by address.)
Let us consider a given step of the CRCW algorithm and simulate it with
O(log p) steps of an EREW algorithm. Both PRAMs have the same computing
power (i.e., the same number of processors). Therefore, we only need to focus
on memory accesses. Our EREW algorithm requires a temporary array A of
size p. When PU Pi of the CRCW algorithm writes a value xi to address
1.4.1 Merge
In all that follows we assume that we sort/merge arrays of integers. The
merge of two sorted sequences J and K is denoted J|K. We say an integer x
is between a and b if and only if a < x ≤ b.
DEFINITION 1.3 (Rank). The rank of an element x in a sequence J is
defined as the number of elements of J that are smaller than x:
rank (x, J) = card{j ∈ J, j < x} .
Likewise, the cross-rank of A in B is the function R[A, B] : A → ℕ, e ↦ rank(e, B).
This function can be represented as an array of size |A| whose i-th entry is
the rank of the i-th element of A in B.
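These definitions translate directly into code; the sketch below (ours, using a binary search instead of the PRAM machinery) computes rank(x, J) and the cross-rank array R[A, B] for sorted sequences.

from bisect import bisect_left

def rank(x, J):
    # Number of elements of the sorted sequence J that are smaller than x.
    return bisect_left(J, x)

def cross_rank(A, B):
    # Cross-rank R[A, B]: the i-th entry is the rank of the i-th element of A in B.
    return [rank(a, B) for a in A]

J = [2, 5, 6, 7]
K = [1, 3, 4, 8]
print(rank(5, K))         # 3 (the elements 1, 3 and 4 of K are smaller than 5)
print(cross_rank(J, K))   # [1, 3, 3, 3]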
FIGURE 1.6: Example binary tree for Cole's parallel merge sort (the leaves, at level 0, hold 8, 7, . . . , 1; the root, at level 3, holds the sorted sequence 1, 2, . . . , 8).
(Figure: the elements of L = L1 L2 . . . L8 partition K into blocks K(1), . . . , K(9), and J|K = (J(1)|K(1)).(J(2)|K(2)) . . . (J(9)|K(9)).)
Proof. Each of the three steps can be done in O(1) time using R[L, J], R[L, K],
R[J, L] and R[K, L]:
COLE MERGE()
1 Receive X(t + 1) from the left child and Y (t + 1) from the right child
2 Merge: val(t + 1) ← MERGE WITH HELP(X(t + 1), Y (t + 1), val(t))
3 Reduce: send Z(t + 1) = REDUCE(val(t + 1)) to its father.
{ REDUCE() keeps one value out of every four:
REDUCE({z1 , z2 , . . . , zn }) = {z4 , z8 , z12 , . . .}. }
• Step 2: Since J(i) and K(i) have at most three elements, parallel fusion
can clearly be done in O(1) time with |L| + 1 PUs.
• Step 3: Knowing R[L, J] and R[L, K], we can compute R[L, J|K]. In-
deed, for each element l ∈ L we have:
The PU Pi responsible for resi computes the rank r of li−1 in J|K and
stores the elements of resi (at most six elements) starting from position
r + 1. Step 3 can thus be performed with |L| + 1 PUs.
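The essential point is that, once cross-ranks are available, every element can compute its final position in J|K independently of all the others: an element x stored at index i of J lands at index i + rank(x, K), so one PU per element suffices. A small sequential sketch of this idea (ours, ignoring the good-sampler machinery):

from bisect import bisect_left, bisect_right

def merge_with_ranks(J, K):
    # Merge two sorted sequences; each element computes its own destination
    # from its rank in the other sequence (the O(1)-time PRAM idea).
    out = [None] * (len(J) + len(K))
    for i, x in enumerate(J):                # one virtual PU per element of J
        out[i + bisect_left(K, x)] = x       # rank(x, K): elements of K below x
    for i, y in enumerate(K):                # one virtual PU per element of K
        out[i + bisect_right(J, y)] = y      # ties resolved in favor of J
    return out

print(merge_with_ranks([2, 5, 6, 7], [1, 3, 4, 8]))   # [1, 2, 3, 4, 5, 6, 7, 8]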
Mode of Operation
At step t = 0, all nodes store an empty sequence except for the leaves, which
store one element of the array to be sorted. At step t + 1, each node runs an
algorithm that uses a simple REDUCE() operation, as shown in Algorithm 1.5.
We will prove in Section 1.4.3 that at any step t we have:
A node of the tree is said to be complete when it has received all its inputs,
i.e., a sorted sequence of 2k elements for a node at level k. As soon as this
sequence is non-empty, its size doubles at each step from 1 to 2k . If a node
is complete at step t then the REDUCE() operation is changed in the following
way:
• at step t + 2, it sends one element out of every two (the second one, the
fourth one, etc.) from val(t + 1) to its father;
• from step t + 4 on, it stops working and does not send anything to its
father any longer.
• the size of the input of node at level k in the tree doubles at each step
(from 1 to 2k ) until the node is complete;
• we can compute R[S(t + 1), S(t)] for any sequence of input or output of
a node.
FIGURE 1.8: Sorting an array of size 8 with Cole's parallel merge sort.
In this section, we describe in detail how the algorithm works with an input
of size 8 (see Figure 1.8). At step t = 0, leaves have sequences of size 20 and
are hence complete. At step t = 1 and t = 2 they send one element out of
four, then one out of two, i.e., no element at all. At step t = 3, they send
their unique value to their father and stop working.
Let us focus on the father [8, 7] of leaves 8 and 7. At step t = 3, it computes
val(3), i.e., MERGE WITH HELP({8}, {7}, ∅), and becomes complete (it is a level
1 node). At step t = 4, it does not send anything to its father. At step t = 5,
it sends one element out of every two, i.e., {8}. At step t = 6, it sends its two
elements then stops working.
Let us now focus on the root node of the subtree [8, 7, 6, 5]. At step t = 5,
it receives {8} and {6} and computes val(5) = MERGE WITH HELP({8}, {6}, ∅).
At step t = 6, it receives X(6) = {7, 8} and Y (6) = {5, 6} and computes
val(6) = MERGE WITH HELP({7, 8}, {5, 6}, val(5)). We can check that val(5) is
a GS of X(6) and Y (6). At step t = 6, the node is complete (it is a level 2
node). At step t = 7, it sends {8}, and at step t = 8, it sends {6, 8}. At step
t = 9, it sends its four elements and stops working.
Lastly, let us look at the root of the tree. At step t = 7, it receives {8} and
{4} and computes val(7) = MERGE WITH HELP({8}, {4}, ∅). At step t = 8, it re-
ceives {6, 8} and {2, 4} and computes val(8) = MERGE WITH HELP({6, 8}, {2, 4},
val(7)). At step t = 9, it receives {5, 6, 7, 8} and {1, 2, 3, 4} and computes
val(9) = MERGE WITH HELP({5, 6, 7, 8}, {1, 2, 3, 4}, val(8)). One can easily check
that val(t) is a GS of X(t + 1) and Y (t + 1) for t = 6, 7, 8. At the end of step
t = 9, the root is complete and all values are sorted.
LEMMA 1.2. The pipelined sorting tree algorithm runs in O(log n) time
with O(n) PUs.
Proof. We still use the notation X|Y to denote the fusion of X and Y. If X is a GS of X′, then X|W is clearly still a GS of X′ for any set W. However, if X is a GS of X′ and Y is a GS of Y′, then X|Y is not necessarily a GS of X′|Y′.
Indeed, let us consider for example X = [2, 7], X′ = [2, 5, 6, 7], Y = [1, 8], and Y′ = [1, 3, 4, 8]. Then, we have X|Y = [1, 2, 7, 8] and X′|Y′ = [1, 2, 3, 4, 5, 6, 7, 8], but there are 5 elements of X′|Y′ between 2 and 7 (which are yet consecutive elements of X|Y). This is the reason why we resort to the reduce operator.
Let us prove the following property: there are at most 2r + 2 elements of X′|Y′ between r consecutive elements of X|Y (we assume that −∞ and +∞ are in X and Y).
Coming back to the proof of the lemma, let us define Z = REDUCE(X|Y) and Z′ = REDUCE(X′|Y′). Let us consider k + 1 consecutive elements z1, z2, . . . , zk+1 of Z. Since the reduce operator keeps one element out of every four, we have z1 = e4h, z2 = e4(h+1), . . . , zk+1 = e4(h+k), where X|Y = {e1, e2, . . . , ep}. Thus, there are 4k + 1 elements of X|Y between z1, z2, . . . , zk+1. Using the previous property with r = 4k + 1, we know that there are at most 8k + 4 elements of X′|Y′ between these 4k + 1 elements. Since the reduce operator keeps one element out of every four, there are at most (8k + 4)/4 = 2k + 1 elements of Z′ between the k + 1 consecutive elements of Z, proving that Z′ is a GS of Z.
In steady-state, a node receives a sorted sequence X(t+1) from its left child
and a sorted sequence Y (t + 1) from its right child. It computes val(t + 1) =
MERGE WITH HELP(X(t+1), Y (t+1), val(t)) and sends Z(t+1)=REDUCE(val(t + 1))
to its father. Since Lemma 1.3 shows that Z(t) is a GS of Z(t + 1), we have
the following invariants:
3. Y (t) is a GS of Y (t + 1).
Note that the property still holds true for the last two communications (one
element out of every two is sent, then finally all the elements).
Lastly, we have to ensure that the requirements of Lemma 1.1 hold true: we
need to know the cross-ranks for MERGE WITH HELP() to run in O(1) time. At a
given step, X is a GS of X′ and Y is a GS of Y′, U = X|Y and Z = REDUCE(U). We can assume that cross-ranks R[X′, X] and R[Y′, Y] are known by induction hypothesis. To compute U′ = X′|Y′ with MERGE WITH HELP(X′, Y′, U), we need to know cross-ranks R[X′, U], R[Y′, U], R[U, X′], and R[U, Y′]. Finally we can compute Z′ = REDUCE(U′) and R[Z′, Z] to get our invariant. Of course,
we assume that for each sorted sequence S we know the cross-rank R[S, S]. In
other words, we know the index of each element in the sorted sequence. This
is computed as part of the internal representation of S.
Proof. Take b0 = −∞ and bk+1 = +∞. The rank of a is then computed with
the following loop:
forall i, 0 ≤ i ≤ k in parallel do
    if bi < a ≤ bi+1 then
        rank ← i
There are no write conflicts because only one processor stores the value of
rank .
Proof. We assume that whenever a sequence S is sorted, we also know R[S, S].
For any a ∈ S1 ⊂ S we have
We can now prove our invariant on cross-ranks for MERGE WITH HELP().
This partition is computed in O(1) time with O(|X|) PUs because we know R[X′, X]. We also partition U with X (which is the same as partitioning Y with X because U = X|Y):
For each i we use |U(i)| PUs. Altogether we thus require O(|U|) PUs. As X is a GS of X′, each X′(i) consists of at most three elements and thus the computation runs in O(1) time. We have therefore computed R[X′, U]. R[Y′, U] is of course computed in a similar fashion.
To compute R[U, X′] we need R[X, X′] and R[Y, X′]. Let us see how to compute R[X, X′]. Consider an element ai from X \ X′ and search for the minimal element a′ of X′(i + 1). The rank of ai in X′ is the same as the rank of a′ in X′. This rank is already computed as part of the internal representation of the sorted sequence X′. Thus, we can compute rank(ai, X′) in O(1) time with a single processor. To compute R[Y, X′] consider y ∈ Y. We compute rank(y, X) using Lemma 1.5, because U = X|Y is already computed. Then, we compute rank(y, X′) using rank(y, X) and R[X, X′]. This way, we compute R[U, X′] in O(1) time with O(|U|) PUs. We can compute R[U, Y′] in a similar way.
LEMMA 1.7. Using Lemma 1.6’s notation, let
algorithms have potential practical relevance and that the most promising ones are the cost-optimal algorithms with minimal execution time.
Bibliographical Notes
Some of the introduction material in this chapter is inspired by the books by
Cormen, Leiserson, and Rivest [44] and by Gengler, Ubéda and Desprez [60].
The presentation of Cole’s parallel merge sort algorithm is taken from the book
by Gibbons and Rytter [62]. The original article by Cole [43] also presents
an EREW version of the sorting machine with the same performance. For
additional information on the PRAM model, we refer the reader to the book
by Reif [101].
1.6 Exercises
We start with six classical exercises. The first three are based on
the pointer jumping technique, which the reader should master by now. The
fourth one comes back to model separation. The fifth one is a nice application
of the O(1) CRCW algorithm for finding the largest value of an array. We do
not spoil the surprise about the sixth one and let the reader discover it. We
end with a seventh exercise on connected components [62], which represents
a nice example of the idea contained within Brent’s theorem.
We have seen in Section 1.1 how to implement these two operators in time
O(log n) on an EREW PRAM. Consider the SPLIT function below:
1 SPLIT(A, Flags)
2 Idown ←PRESCAN(not(Flags))
3 Iup ←n - REVERSE(SCAN(REVERSE(Flags)))
4 forall i ∈ {1, . . . , n} in parallel do
5 if Flags(i) then Index [i] ←Iup[i]
6 else Index [i] ←Idown[i]
7 Result ←PERMUTE(A,Index )
8 return Result
This function uses two functions: REVERSE(A) and PERMUTE(A, Index ). The
former reverses array A, and the latter reorders array A according to a per-
mutation specified as an array of indices, Index . The slightly cumbersome
REVERSE(SCAN(REVERSE(Flags))) simply scans from the end of the array Flags,
considering its elements as integers.
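Below is a sequential sketch of SPLIT (ours): SCAN and PRESCAN are written as plain inclusive and exclusive prefix sums, whereas on a PRAM they would be the O(log n) parallel prefix computations of Section 1.1.

def split(A, flags):
    # Stable two-way partition: elements whose flag is false keep their order
    # at the bottom of the result, elements whose flag is true at the top.
    n = len(A)
    idown, count = [], 0                 # Idown = PRESCAN(not(Flags))
    for f in flags:
        idown.append(count)
        count += (0 if f else 1)
    iup, count = [0] * n, 0              # Iup = n - REVERSE(SCAN(REVERSE(Flags)))
    for i in reversed(range(n)):
        count += (1 if flags[i] else 0)
        iup[i] = n - count
    index = [iup[i] if flags[i] else idown[i] for i in range(n)]
    result = [None] * n                  # PERMUTE(A, Index)
    for i in range(n):
        result[index[i]] = A[i]
    return result

print(split([5, 7, 3, 1, 4, 2, 7, 2], [1, 1, 1, 1, 0, 0, 1, 0]))
# [4, 2, 2, 5, 7, 3, 1, 7], as in the worked example of the answers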
1. Given an array Flags of booleans, what does the SPLIT function return? What is its execution time?
(a) What is the result of the MYSTERY function when applied to A = [5, 7, 3,
1, 4, 2, 7, 2] and Number Of Bits = 3?
(c) Assuming the size of integers is O(log n) bits, what is the execution time
of MYSTERY with n PUs? What if only p PUs are used? What are the
values of p that lead to an optimal value of the algorithm’s work?
Removing any one edge of this circuit results in an in-tree. A star is a 0-tree-
loop.
The previous invariant can thus be read as: “the directed graph (V, {(i, C(i)) |
i ∈ V }) consists of stars.” We can freely identify pseudo-vertices and stars:
the center of a star is the label of the corresponding pseudo-vertex. Connected
components are computed by applying the following two functions in sequence
as many times as needed:
1 GATHER()
2   forall i ∈ S in parallel do
3     T(i) ← min{C(j) | {i, j} ∈ E, C(j) ≠ C(i)}
4     { if this set is empty, then C(i) is associated to T(i) }
5   forall i ∈ S in parallel do
6     T(i) ← min{T(j) | C(j) = i, T(j) ≠ i}
7     { if this set is empty, then T(i) is associated to T(i) }
8 JUMP()
9   forall i ∈ S in parallel do B(i) ← T(i)
10  repeat log n times
11    forall i ∈ S in parallel do T(i) ← T(T(i))
12  forall i ∈ S in parallel do C(i) ← min{B(T(i)), T(i)}
(Figure: the example graph, with vertices 1 to 9.)
Apply the GATHER function to this graph, then the JUMP function, then the
GATHER function, and so on. Follow the effect of these steps on the directed
graphs (V, {(i, T (i)) | i ∈ V }) and (V, {(i, C(i)) | i ∈ V }).
4. Prove that applying the GATHER and JUMP functions dlog ne times en-
ables pseudo-vertices induced by C to correspond exactly to the connected
components of the original graph.
5. What is the complexity of this algorithm? How many PUs are used?
1.7 Answers
Exercise 1.2 (Selection in a List)
We simply use a version of the pointer jumping technique to obtain Algo-
rithm 1.6. Each PU determines the location of the first blue object after it in
the list in O(log n) time.
This EREW algorithm requires a bit of explanation. It works like classical
pointer jumping but in parallel on many lists. The last object in each of these
lists is either blue or Nil. Each PU ends up with a pointer to the end of its
list, i.e., to the next blue element. All pointer jumps are made on independent
lists and thus do not interfere with each other. At the end of the algorithm
the list of blue elements starts either with the first element of the initial list
if it was blue, or with its first blue successor.
1 SELECT-BLUE()
2 forall i in parallel do
3 if next(i) = Nil Or color(next(i)) = blue then
4 done(i) ←True
5 blue(i) ←next(i)
6 while there is a node i such that done(i) = False do
7 forall i in parallel do
8 if done(i) = False then
9 done(i) ← done(next(i))
10 if done(i) = True then
11 blue(i) ← blue(next(i))
12 next(i) ← next(next(i))
1 FIND ROOT()
2 forall i in parallel do
3 if father (i) = Nil then
4 root(i) ← i
5 while there exists a node i such that father (i) ≠ Nil do
6 forall i in parallel do
7 if father (i) ≠ Nil then
8 if father (father (i)) = Nil then
9 root(i) ← root(father (i))
10 father (i) ← father (father (i)).
Note that several nodes may have the same father (especially in the end!) and thus will need to access the same data father (i) simultaneously.
The same analysis as the one for list ranking (see Section 1.1) can be applied. Each node in the forest finds its root in time at most O(log d), where d is
the maximal depth of the trees, and all nodes find their roots in parallel.
Therefore, the algorithm runs in time O(log d).
. Question 2. Let us consider the worst case for the EREW model, i.e.,
the case where the forest contains only one tree. Let us count the number of
PUs that may know the root at each step. In the EREW model, the number
of PUs that know an information can at most double at each step. In our
case, exactly one PU knows the root at the beginning and therefore at least
Ω(log n) steps are required to propagate this information to all PUs.
Using O(n) PUs SCAN and PRESCAN can be done in time O(log n). As other
operations only require constant time, SPLIT runs in time O(log n).
. Question 2.
(a) A = [ 5 7 3 1 4 2 7 2 ]
bit(0) = [ 1 1 1 1 0 0 1 0 ]
A ←SPLIT(A, bit(0)) = [ 4 2 2 5 7 3 1 7 ]
bit(1) = [ 0 1 1 0 1 1 0 1 ]
A ←SPLIT(A, bit(1)) = [ 4 5 1 2 2 7 3 7 ]
bit(2) = [ 1 1 0 0 0 1 0 1 ]
A ←SPLIT(A, bit(2)) = [ 1 2 2 3 4 5 7 7 ]
(b) Based on the example, it looks like MYSTERY sorts its input array. In fact,
it is a parallel implementation of the well-known radix-sort algorithm:
starting with the least-significant bit, the SPLIT function splits the array
in two parts depending on the value of this bit. Each call to SPLIT sorts
elements according to the current bit value while maintaining the order
obtained with previous bits. This is why the algorithm goes from the
least-significant bit to the most-significant bit.
(c) There are O(log n) iterations of the main loop. The execution time of the MYSTERY function is thus O(log² n) with O(n) PUs. When using only p PUs, the execution time of SPLIT becomes O(n/p + log p) and the execution time of the parallel radix-sort becomes O((n/p + log p) log n) = O((n/p) log n + log n log p). The work is optimal (i.e., equal to O(n log n)) for p such that p log p ≤ n, e.g., for p = n^q with 0 < q < 1.
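For reference, here is a compact sequential sketch (ours) of what MYSTERY computes: a radix sort that applies a stable SPLIT to successive bits, from the least-significant to the most-significant one.

def mystery(A, number_of_bits):
    # Parallel radix sort, written sequentially: one stable SPLIT per bit.
    for b in range(number_of_bits):
        flags = [(x >> b) & 1 for x in A]
        A = [x for x, f in zip(A, flags) if f == 0] + \
            [x for x, f in zip(A, flags) if f == 1]
    return A

print(mystery([5, 7, 3, 1, 4, 2, 7, 2], 3))   # [1, 2, 2, 3, 4, 5, 7, 7]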
(Figure: the example graph with the initial values C(i) = i.)
First, we can notice that the directed graph (V, {(i, T (i)) | i ∈ V }) consists of
1-tree-loops:
(Figure: the values of T(i) after the first GATHER.)
After pointer jumping in the JUMP function, all these 1-tree-loops have been
transformed into stars:
(Figure: the values of T(i) after pointer jumping, forming stars.)
These stars are merged in (V, {(i, C(i)) | i ∈ V }) in the last part of the JUMP
function:
(Figure: the values of C(i) after the stars have been merged.)
There are only two remaining pseudo-vertices and C is thus as follows before
applying GATHER again:
(Figure: the values of C(i) before GATHER is applied again.)
and finally
(Figure: the values of T(i) after the second GATHER.)
We end up with a directed graph (V, {(i, T (i)) | i ∈ V }) that consists solely
of 1-tree-loops:
(Figure: the resulting directed graph (V, {(i, T(i)) | i ∈ V}), consisting solely of 1-tree-loops.)
Each vertex is now aware of its pseudo-vertex, hence connected components have been computed.
. Question 4. We just need to prove that each step at least halves the number of pseudo-vertices. Let us focus on pseudo-vertices and on the graph induced by T on these vertices. In this graph, two pseudo-vertices i and j are connected if and only if there are two vertices k and l that are connected in the original graph and such that C(k) = i and C(l) = j. The function JUMP merges all these pseudo-vertices into a single pseudo-vertex. Therefore, the number of pseudo-vertices is at least halved at each step. As there are originally n pseudo-vertices, ⌈log n⌉ applications of GATHER and JUMP suffice.
ALGORITHM 1.8: the first loop of GATHER().
1 forall i ∈ S in parallel do
2   T(i) ← min{C(j) | {i, j} ∈ E, C(j) ≠ C(i)}
3   { if this set is empty, then C(i) is associated to T(i) }

ALGORITHM 1.9: the same loop written with elementary parallel operations.
1 forall i, j ∈ S in parallel do
2   if {i, j} ∈ E And C(i) ≠ C(j) then Temp(i, j) ← C(j)
3   else Temp(i, j) ← ∞
4 forall i ∈ S in parallel do
5   Temp(i, 1) ← min{Temp(i, j) | j ∈ S}
6 forall i ∈ S in parallel do
7   if Temp(i, 1) = ∞ then T(i) ← C(i)
8   else T(i) ← Temp(i, 1)
Algorithm 1.8 can be transformed into Algorithm 1.9. The first loop of
Algorithm 1.9 clearly runs in time O(1) with O(n2 ) PUs (more precisely with
O(|E|) PUs) on a CREW. The next two loops run in time O(log n) with O(n2 )
PUs using classical pointer jumping (n PUs are assigned to each row and each
PU computes its minimum independently of others).
Computing connected components can thus be done in time O(log² |V|) with O(|V| + |E|) PUs. However, the JUMP function wastes resources in the minima computation. With Brent's theorem we can lower the number of PUs down to O(n²/ log n) (actually to O(|E|/ log |V| + |V|) PUs) without changing the execution time.
Chapter 2
Sorting Networks
(Figure: a comparator takes two inputs a and b and outputs min(a, b) and max(a, b).)
In this chapter, we present two sorting networks. The first network imple-
ments a merge sort, just like Cole’s PRAM algorithm presented in Section 1.4.
The second network uses an odd-even transposition scheme that can easily be
mapped to a one-dimensional network of processors.
FIGURE 2.2: (a) The Merge1 network, merging two sorted sequences of length 2; (b) the Merge2 network, merging two sorted sequences of length 4.
Let us build a Mergem network that merges two sorted sequences of length
2m . For m = 0, we only need a single comparator. For m = 1, assum-
ing SORTED(ha1 , a2 i) and SORTED(hb1 , b2 i), we can use three comparators as
depicted in Figure 2.2(a). It is not hard to see that this Merge1 works as
expected. We know that a1 6 a2 and b1 6 b2 . The upper output is min(a1 , b1 )
and the lower output is max(a2 , b2 ). An additional comparator is added to
sort the two outputs from the middle, i.e., max(a1 , b1 ) and min(a2 , b2 ).
Determining that Merge2 (depicted in Figure 2.2(b)) works is more diffi-
cult. So instead let us prove the result in the general case, by induction. The
Mergem network is built with two copies of the Mergem−1 network followed by a column of 2^m − 1 comparators. The first copy of Mergem−1 merges
odd elements from the input sequences and the second copy merges even ele-
ments. Surprisingly, a simple column of comparators suffices to complete the
merging of the two input sequences.
Then, we have
Proof. Without loss of generality, we can assume that elements are distinct.
In the resulting sequence d1 is in first position, which is correct as it is the
smallest element of the whole sequence. Likewise, e2n ’s position is correct. In
the general case, di and ei−1 (for i > 2 and i 6 2n) are in position 2i − 2 or
2i − 1 in the resulting sequence. We show that their positions are correct by
showing that they both dominate 2i−3 elements of the complete sequence and
are dominated by 4n − 2i + 1 elements of the complete sequence. Therefore,
their position in the whole sequence is necessarily 2i − 2 or 2i − 1 and the final
comparison finally sets them in their correct positions.
We have 4 items to prove for 2 ≤ i ≤ 2n:
Figure 2.3 shows the Mergem network, with its two copies of the Mergem−1 network whose outputs are connected with 2^m − 1 comparators.
The processing time tm of the Mergem network is defined as the maximum
number of comparators an input must traverse. Therefore, we have t1 = 2
because some data have to go through 2 comparators (even though some only
go through one) and t2 = 3. Of course, many comparators can be active at
the same time (this book is about parallel computing after all).
LEMMA 2.1. The processing time tm and the number of comparators pm of
Mergem satisfy the following recursions:
t0 = 1, t1 = 2, tm = tm−1 + 1 (i.e., tm = m + 1)
p0 = 1, p1 = 3, pm = 2pm−1 + 2^m − 1 (i.e., pm = 2^m·m + 1)
Proof. These recursions follow directly from Proposition 2.1. The expression pm = 2^m·m + 1 can be proved via a simple induction.
When expressed as a function of the length n = 2^m of the input sequence, these quantities are respectively O(log n) and O(n log n). The efficiency of the network is rather low: by multiplying the number of comparators with the processing time, one obtains the total work (see Section 1.2): Wn = pn × tn = O(n (log n)²). This is far beyond the total work of a sequential merge: using
a single comparator and O(n) steps, the total work of the sequential merge is
O(n). Note however that the overall processing time, tm , is very short.
The rather poor efficiency of this network can easily be explained: each
comparator is used only once during the merge. The efficiency of the network
(Figure: recursive construction of the sorting network from Sort1, Merge1, Sort2, and Merge2 sub-networks.)
LEMMA 2.2. The processing time t′m and the number of comparators p′m of Sortm satisfy the following recursions:
t′m = t′m−1 + tm−1 and p′m = 2p′m−1 + pm−1 .
Proof. These recursions follow directly from the recursive design of the network. Since t′m = t′m−1 + tm−1 = t′m−1 + m, we get t′m = O(m²). For the second equation, p′m = 2p′m−1 + 2^(m−1)·(m − 1) + 1 = 2^(m−1)·(1 + 2 + 3 + · · · + (m − 1)) + (1 + 2 + 4 + · · · + 2^(m−1)) = 2^(m−1)·(m(m − 1)/2) + 2^m − 1. Therefore, p′m = 2^(m−1)·(m(m − 1)/2 + 2) − 1 = O(2^m·m²).
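To make the recursive construction concrete, here is a standard comparator generator for Batcher's odd-even merge sort (a Python sketch, ours, for input lengths that are powers of two). Counting the comparators it emits for n = 8 recovers p′3 = 19, and the 0–1 principle of Section 2.1.3 justifies checking it exhaustively on 0–1 inputs only.

def oddeven_merge(lo, hi, r):
    # Comparators that merge the two sorted halves of x[lo..hi] (hi inclusive).
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)         # odd sub-merge
        yield from oddeven_merge(lo + r, hi, step)     # even sub-merge
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)

def oddeven_merge_sort(lo, hi):
    # Comparators that sort x[lo..hi] (hi inclusive).
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)
        yield from oddeven_merge_sort(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)

def apply_network(x, comparators):
    x = list(x)
    for i, j in comparators:
        if x[i] > x[j]:
            x[i], x[j] = x[j], x[i]
    return x

net = list(oddeven_merge_sort(0, 7))                   # network for n = 8
print(len(net))                                        # 19 comparators
print(apply_network([5, 7, 3, 1, 4, 2, 7, 2], net))    # [1, 2, 2, 3, 4, 5, 7, 7]
assert all(apply_network([(v >> k) & 1 for k in range(8)], net)
           == sorted((v >> k) & 1 for k in range(8)) for v in range(256))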
It is easy to see that R does not sort the 0–1 sequence ⟨f(x1), . . . , f(xn)⟩ correctly. Indeed, f(R(xk)) = 1 is output at position k and f(R(xk+1)) = 0 is output at position k + 1.
The 0–1 principle can be used to prove the correctness of Mergem . Let us
provide a new, less cumbersome proof of Proposition 2.1.
New proof of Proposition 2.1. Using the 0–1 principle we can now restrict the
proof of Proposition 2.1 to 0–1 sequences.
Let us denote by ZEROS(x) the number of zeros in sequence x = ⟨x1, . . . , xm⟩. A sorted 0–1 sequence is always structured as x = 0^r 1^(m−r) where r = ZEROS(x). Let p = ZEROS(⟨a1, . . . , a2n⟩) and q = ZEROS(⟨b1, . . . , b2n⟩). We distinguish four different cases:
⟨d1, . . . , d2n⟩ and ⟨e1, . . . , e2n⟩ have the same number of 0's. The last column of comparators receives as input p′ + q′ − 1 pairs of 0's, followed by a (1, 0) pair (the first 1 of d and the last 0 of e) and 2n − p′ − q′ − 1 pairs of 1's. The sorted sequence is thus obtained thanks to the (p′ + q′)-th comparator.
ZEROS(⟨d1, . . . , d2n⟩) = p′ + q′ and ZEROS(⟨e1, . . . , e2n⟩) = p′ + q′ − 1.
ZEROS(⟨d1, . . . , d2n⟩) = p′ + q′ and ZEROS(⟨e1, . . . , e2n⟩) = p′ + q′ − 2.
Proof. We use the 0–1 principle. Let ⟨a1, . . . , an⟩ denote a 0–1 sequence. Let
us denote by k the number of 1’s in this sequence and let j0 denote the position
of the last 1 (i.e., the rightmost 1). In Figure 2.6, we show an example for
n = 7, k = 3 and j0 = 4. Let us first note that a 1 never moves to the left:
the only possible move is when it is compared with a 0 on its right, in which
case it moves to the right.
Let us follow the last 1’s moves. If j0 is even then it does not move in the
first step but it moves to the right in the second step. If j0 is odd, it moves
to the right in the first step. In both cases, it moves to the right from step 2
and for all following steps until it is at position n. Before step 2, the last 1 is
at least at position 2 and thus it always has enough time to arrive at position
n in n − 1 steps.
Let us now follow the moves of the next-to-last 1. At step 0 it is at position
j1 (j1 = 2 in Figure 2.6). As the last 1 moves to the right from step 2 on (at
least), the next-to-last 1 is never blocked by the last 1 when moving to the
right. Therefore, from step 3 on, the next-to-last 1 moves to the right until
it reaches position n − 1. More generally, the i-th 1 (counting from the right)
FIGURE 2.6: Illustration of the proof that the odd-even transposition network is a sorting network (here n = 7, with the 0–1 input sequence 1 1 0 1 0 0 0).
moves to the right starting (at least) at step i + 1 until it reaches position
n − i + 1: it is never blocked by another 1 on its right.
At last, the leftmost 1 (the k-th as there are k 1’s in total) moves to the
right at step k + 1 until it reaches its final position n − k + 1. The k − 1
remaining 1’s are on its right and the sequence is thus sorted.
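A direct sequential simulation of the odd-even transposition network (ours): n synchronous steps that alternate between the two rows of comparators, each step acting on disjoint pairs that a real machine would process in parallel.

def odd_even_transposition_sort(a):
    # n steps; step parity selects which neighboring pairs are compared.
    a = list(a)
    n = len(a)
    for step in range(n):
        for i in range(step % 2, n - 1, 2):     # disjoint pairs: one parallel step
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([1, 1, 0, 1, 0, 0, 0]))   # [0, 0, 0, 0, 1, 1, 1]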
Another proof of Proposition 2.3. Another proof was proposed by Knuth [75]
in exercises 36 and 37, Section 5.3.4. It relies on “primitive sorting networks.”
A primitive sorting network is a sorting network such that comparisons are
only made between neighbors (the odd-even transposition network is thus a
primitive network).
Formally, a sorting network α can be modeled by a sequence of comparators,
i.e., α = [xk , yk ] ◦ . . . ◦ [x2 , y2 ] ◦ [x1 , y1 ] where k is the number of comparators.
A primitive network is such that for all i: yi = xi + 1. The proof relies on two
lemmas:
LEMMA 2.3. Let α be an n-entry primitive network. α is a sorting network if and only if α sorts ⟨n, n − 1, . . . , 2, 1⟩.
Proof. Proving the implication is trivial. Let us prove the reciprocal by con-
tradiction.
Let x denote an input sequence such that α(x)i > α(x)j , with i < j. Let y = ⟨n, n − 1, . . . , 2, 1⟩. We prove that α(y)i > α(y)j , which establishes the
lemma. We prove this by induction on the number of comparators, k. More
precisely our induction hypothesis is:
Let us first prove Hi,j,x,y (0): a primitive network β of size 0 does not sort
anything so we necessarily have β(y)i = n + 1 − i > n + 1 − j = β(y)j , hence
the result.
Let us now assume that Hi,j,x,y (q − 1) is true and consider an arbitrary
primitive network γ of size q such that γ(x)i > γ(x)j . We have γ = [p, p+1]◦β,
where β is a primitive sorting network of size q − 1.
We need to distinguish between a few cases depending on p:
Bibliographical Notes
This chapter draws inspiration from the book by Gibbons and Rytter [62]
for the Batcher network and from the book by Akl [3] for the section on the
one-dimensional network. For additional information on sorting networks, we
refer the curious reader to the seminal book by Knuth [75].
2.3 Exercises
The first exercise is a straightforward warm-up. The second exercise deals
with bitonic sorting and is classical (see [62] or [44]). The third exercise
presents a more sophisticated example of sorting network. Many other (com-
plex) examples can be found in [81].
1 SNAKE MERGE(A)
2 SHUFFLE each row of the network (using odd-even
transpositions based on the elements’ indices).
3 { Note that this amounts to SHUFFLE the columns. }
4 Sort pairs of columns, i.e., snake-ordered n × 2 grids, using
2n steps of odd-even transpositions on the induced linear
subnetwork.
5 Apply 2n steps of odd-even transpositions on the induced
linear network of size n2 .
2.4 Answers
Exercise 2.1 (Particular Sequences)
⇒ Let us first assume that α sorts the sequence ⟨n, n − 1, . . . , 1⟩. Then (see the proof of Proposition 2.2), α also sorts the sequences ⟨fi(n), fi(n − 1), . . . , fi(1)⟩ for any i ∈ {1, . . . , n} where
Therefore, the network does not sort the sequences ⟨1^i 0^(n−i)⟩ correctly for all i ∈ {1, . . . , n}.
FIGURE 2.10: Using split networks to build a bitonic sorting network for
sequences of length n.
• If b is of the form 0^i 1^j 0^k, then since j ≥ n/2 the output sequence is of the form 0^i 1^(j−n/2) 0^k 1^(n/2).
We can thus easily check that in all cases the output sequence consists of two
equal-length bitonic subsequences with at least one of them clean. The proof
is similar when there are more 0’s than 1’s.
We can now use this property to build a bitonic sorting network from split
networks (see Figure 2.10). Let tm and pm denote the depth and the size of bitonic sorting networks for inputs of length n = 2^m. We can easily check that:
t1 = 1, tm = tm−1 + 1, which implies that tm = m,
p1 = 1, pm = tm·2^(m−1), which implies that pm = m·2^(m−1).
This bitonic sorting network thus contains O(n log n) comparators and its
depth is O(log n).
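A sketch of the construction (ours): the split network is a single column of comparators at distance n/2, and applying it recursively on both halves yields a sorter of depth log n for bitonic sequences of length n = 2^m.

def bitonic_split(x):
    # One column of comparators (i, i + n/2) applied to a bitonic sequence.
    n = len(x)
    x = list(x)
    for i in range(n // 2):
        if x[i] > x[i + n // 2]:
            x[i], x[i + n // 2] = x[i + n // 2], x[i]
    return x

def bitonic_sort(x):
    # Sort a bitonic sequence of length 2^m with m columns of comparators.
    if len(x) <= 1:
        return list(x)
    x = bitonic_split(x)
    half = len(x) // 2
    return bitonic_sort(x[:half]) + bitonic_sort(x[half:])

print(bitonic_sort([3, 5, 8, 9, 7, 4, 2, 1]))   # [1, 2, 3, 4, 5, 7, 8, 9]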
(Figure: the steps of SNAKE MERGE on a 4 × 4 grid: initial layout, recursive call on the 2 × 2 grids, column shuffling, column sorting, and 2n steps of odd-even transpositions.)
This index set should then be sorted using local comparisons. Let us now
consider the primitive network α made of p − 1 stages. Stage i performs the
following comparisons:
⟨p − i + 1, p − i + 2⟩, ⟨p − i + 3, p − i + 4⟩, . . . , ⟨p + i − 1, p + i⟩.
For example, for n = 8 and p = 4 the algorithm would sort ⟨1, 3, 5, 7, 2, 4, 6, 8⟩.
The three stages of α consists of the following comparators: (4, 5) in the first
stage, (3, 4), (5, 6) in the second stage and (2, 3), (4, 5), (6, 7) in the third stage.
Using a simple induction, one can prove that α makes it possible to sort
the sequence c and thus can be used to SHUFFLE the columns.
The time needed for the SHUFFLE operation is n/2 − 1. The time needed for the other steps is 2 × 2n and the execution time of the merging algorithm is thus smaller than (9/2)·n.
. Question 3. Let tm denote the time needed to sort on a 2^m × 2^m grid. Using the previous question, we have tm ≤ tm−1 + (9/2)·2^m. We also have t0 = 0 since a 1 × 1 grid is always snake-ordered. Summing up all these inequalities for tm, tm−1, tm−2, . . . , we obtain:
tm ≤ (9/2)·(2 + 4 + · · · + 2^m) = (9/2)·(2^(m+1) − 2) ≤ 9 · 2^m.
The time needed to sort a sequence of length n = 2^(2m) on a grid is thus O(√n).
. Question 4. We use the 0–1 principle to simplify the proof. The odd-
even transposition sort on a grid is correct if and only if the third step of the
algorithm can be done with only 2n odd-even transpositions on 0–1 sequences.
The two possible layouts of snake-ordered 0–1 sequences are depicted in
Figure 2.13. The index i of the last row that starts with a 0 is always odd.
There are two cases depending on whether the number of remaining 0’s does
not exceed a row or spills over the following row.
We only have to show that after the second step of the algorithm (sorting
on linear networks of size 2n) at most two rows of the grid are not clean, i.e.,
consisting only of 0’s or only of 1’s (see Figure 2.14).
(Figure 2.14: after the shuffle and the sorting of pairs of columns, at most two rows of the grid are not clean.)

Chapter 3
Networking
In this chapter, we present network design and operation principles that are
relevant to the study of parallel algorithms. We start with a brief description
of classical network topologies (Section 3.1). Next, we present common mes-
sage passing mechanisms along with a few performance models (Section 3.2).
Then, we focus on the routing problem for two classical topologies: the ring
(Section 3.3) and the hypercube (Section 3.4). More precisely, we discuss how
to implement point to point communications as well as global communications
on these topologies. Finally, we present some recent applications of some of
these techniques to Peer-to-Peer Networks (Section 3.5).
3.1.1 Topologies
The processors in a distributed memory parallel platform are connected
through an interconnection network. Nowadays all computers have specialized
coprocessors dedicated to communication that route messages and put data
in local memories (i.e., network cards). In what follows we say that a parallel
platform comprises nodes. Each node consists of a (computing) processor, a
memory, and a communication coprocessor. When there is no ambiguity we
sometimes use the term processor to denote the entire node.
Nodes can be interconnected in arbitrary ways. Since the 70s, many indus-
trial and academic projects have tried to determine the best way to intercon-
nect nodes efficiently and at low cost. There are two main approaches:
(Figure 3.1: a few classical static topologies, including (a) a clique, (b) a ring, (c) a 2-D grid, (d) a 2-D torus, and (e) a hypercube.)
Let us first note that the number of processors p of a static topology with
degree ∆ is bounded as follows:
p ≤ 1 + ∆ + ∆(∆ − 1) + ∆(∆ − 1)² + . . . + ∆(∆ − 1)^(D−1) = (∆(∆ − 1)^D − 2)/(∆ − 2).
This bound is known as Moore’s bound. It is derived by counting the neighbors
of a processor, the uncounted neighbors of neighbors, and so on until the
farthest processors are reached (at distance D).
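A two-line check of the bound (ours): it simply adds up the at most ∆(∆ − 1)^(k−1) processors at distance k, for k = 1, . . . , D.

def moore_bound(delta, diameter):
    # 1 + sum over k of delta * (delta - 1)^(k - 1) reachable processors
    return 1 + sum(delta * (delta - 1) ** (k - 1) for k in range(1, diameter + 1))

print(moore_bound(4, 4))   # 161: a dimension-4 hypercube (16 nodes) is far below it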
A short discussion of the topologies from Figure 3.1 follows along with a
summary of their main characteristics in Table 3.1.
• Clique (Figure 3.1(a)). This is the ideal network as the distance be-
tween any two processors is equal to 1. However, the number k of links
per processor is equal to p − 1 and thus the total number of links Nl is
equal to p(p − 1)/2. This type of network is only feasible with a limited
number of nodes at reasonable cost. Thanks to routing techniques, some
of which we discuss later in this chapter, any interconnection network
can always been thought of as a clique by the application programmer.
But in this case communication times are larger and there is a possibil-
ity of network contention when two messages travel through the same
communication link at the same time.
• Ring (Figure 3.1(b)). This is one of the simplest topologies and is in
fact used as a model for developing many useful parallel algorithms (see
Chapter 4).
• 2-dimensional grid (Figure 3.1(c)). The maximum degree of proces-
sors is equal to 4. The main drawback of a grid is its lack of symmetry.
Indeed, border processors and central processors have different charac-
teristics. The 2-D grid is well suited to, for instance, image processing
problems where computations are local and communications are typi-
cally performed between neighbors. Another interesting aspect of 2-D
grids is their scalability with regard to practical implementation: pro-
cessors only need to be added at the edges.
• 2-dimensional torus (Figure 3.1(d)). This one is easily derived from
the 2-D grid by connecting edge processors with each other. The di-
ameter is thus halved with the same degree. The connectivity can even
be increased with a 3-D torus like the one in the Cray T3D platform.
• Hypercube (Figure 3.1(e)). This topology has been used extensively
in the previous decades. Due to its recursive definition, one can design
simple yet efficient algorithms that operate dimension by dimension.
Another attractive aspect of this topology is its low diameter (logarith-
mic in the number of nodes). However, the total number of links grows
too quickly to be of practical interest for massively parallel machines.
A hypercube is characterized by its dimension d, where p = 2d .
TABLE 3.1: Main characteristics of classical topologies.
Topology     Num. of proc. p   Degree k     Diameter D     Num. of links Nl   Bisection width LB
Clique       p                 p − 1        1              p(p − 1)/2         (p/2)²
Ring         p                 2            ⌊p/2⌋          p                  2
2-D Grid     p                 2 → 4        2(√p − 1)      2p − 2√p           √p
2-D Torus    p                 4            2⌊√p/2⌋        2p                 2√p
Hypercube    p = 2^d           d = log(p)   d = log(p)     p log(p)/2         p/2
The time needed to send a message of size m from a processor Pi to a processor Pj is often modeled as
ci,j(m) = Li,j + m/Bi,j ,
where Li,j is the start-up time (also called latency) expressed in seconds and
Bi,j is the bandwidth expressed in bytes per second. For convenience, we also
define bi,j as the inverse of Bi,j . Li,j and Bi,j depend on many factors: the
length of the route, the communication protocol, the characteristics of each
network link, the software overhead, the number of links that can be used
in parallel, whether links are half-duplex or full-duplex, etc. This model was
proposed by Hockney [68] to model the performance of the Intel Paragon and
is probably the most commonly used.
Such a protocol results in a rather poor latency and bandwidth. The communication cost can be reduced by using pipelining. The message is split into r packets of size m/r. Packets are sent one after the other from Pi to Pj. The first packet reaches node j after ci,j(m/r) time units and the remaining r − 1 packets arrive immediately after, one after another, in (r − 1)(L + (m/r)·b) time units. The total communication time is thus equal to
(d(i, j) − 1 + r) · (L + (m/r)·b) .
One can then seek the value of r that minimizes the previous expression. This
is a minimization problem of the form (γ + α·r)(δ + β/r) with four constants α, β, γ, and δ. Removing all constant terms, this boils down to minimizing αδ·r + γβ/r, and hence the optimal value of r, ropt, is:
ropt = √(γβ/(αδ)) .
Indeed, the sum of two terms whose product is constant is minimized when
the two terms are equal. This is the famous theorem of the goat in a pen
whose perimeter, if its area is fixed, is minimal if the pen is a square (The
reader unimpressed by bucolic analogies can use the function’s derivative to
calculate ropt analytically.) The minimum value obtained with ropt is:
(γ + α·ropt)(δ + β/ropt) = ((γ/δ)·(δ + √(δβα/γ))) · (δ + √(αδβ/γ))
                         = (√(γ/δ)·(δ + √(δβα/γ)))²
                         = (√(γδ) + √(βα))² .
Back to the communication problem, with γ = d(i, j) − 1, α = 1, δ = L, and β = mb, the optimal pipelined time is
(√((d(i, j) − 1)·L) + √(mb))² = mb + O(√(mb)) = m/B + O(√(m/B)) .
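The sketch below (ours, with made-up parameter values) evaluates the pipelined time (d(i, j) − 1 + r)(L + (m/r)·b) for a given number of packets r and for the optimal r derived above.

from math import sqrt

def pipelined_time(d, L, b, m, r):
    # Time to send m data units over d hops, split into r packets.
    return (d - 1 + r) * (L + (m / r) * b)

def optimal_packets(d, L, b, m):
    # r_opt = sqrt(gamma*beta / (alpha*delta)) with gamma = d - 1, alpha = 1,
    # delta = L, beta = m*b, rounded to a positive integer.
    return max(1, round(sqrt((d - 1) * m * b / L)))

d, L, b, m = 4, 1e-5, 1e-9, 1e6             # made-up: 4 hops, 10 us latency, 1 GB/s
print(pipelined_time(d, L, b, m, 1))        # no pipelining
r = optimal_packets(d, L, b, m)             # here r = 17
print(pipelined_time(d, L, b, m, r))        # close to m/B = 1e-3 s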
(Figure: store-and-forward, circuit-switching, and wormhole protocols: the header and the data of a message traveling from P0 to P3 over time.)
Such protocols have been used extensively for the dedicated networks in parallel
platforms, with good buffer management, almost no message loss and thus no
need for flow-control mechanisms. By contrast, in large networks (e.g., on the
Internet) there is a strong need for congestion avoidance mechanisms. Proto-
cols like TCP split messages just like wormhole protocols but do not send all
packets at once. Instead, they wait until they have received some acknowl-
edgments from the destination node. To this end, the sender uses a sliding
window of pending packet transmissions whose size changes over time de-
pending on network congestion. The size of this window is typically bounded
by some operating system parameter Wmax and the achievable bandwidth is
thus bounded by Wmax /(Round Trip Time). Such flow-control mechanisms
involve a behavior called slow start: it takes time before the window gets full
and the maximum bandwidth is thus not reached instantly. One may thus
wonder whether the Hockney model is still valid in such networks. It turns
out that in most situations the model remains valid but Li,j and Bi,j cannot
be determined via simple measurements.
P0
o o o o
g g g L
P1 o
L L L
g g
P2 o o o
o
g L
P3 o
time
captures the bandwidth for long messages. One may however wonder whether
such a model would still hold for average-size messages. pLogP [74] is an
extension of LogP when L, o and g depends on the message size m. This
model also introduces a distinction between the sender and receiver overhead
(os and or ).
Affine models
One drawback of the LogP models is the use of floor functions to account for
explicit MTU. The use of these non linear functions causes many difficulties
for using the models in analytical/theoretical studies. As a result, many fully
linear models have also been proposed. We summarize them through the
general scenario depicted in Figure 3.6.
(nw)
ci,j (m)
(s)
bi,j · m
(s)
sending node Pi Li,j
(nw)
bi,j ·m
(nw)
link ei,j Li,j
(r)
bi,j · m
(r)
receiving node Pj Li,j
time
Banikazemi et al. [12] propose a model that is very close to the general model
presented above. They use affine functions to model the occupation time of
the processors and of the communication link. The only minor difference is
that they assume that the time intervals during which Pi is busy sending (of
(s) (r)
duration ci,j (m)) and during which Pj is busy receiving (of duration ci,j (m))
do not overlap. Consequently, they write
(s) (nw) (r)
ci,j (m) = ci,j (m) + ci,j (m) + ci,j (m) .
k-ports – A node may have k > 1 network cards and a possible extension
of the 1-port model is thus to consider that a node cannot be involved in more
than an emission and a reception on each network card. This model will be
used in Chapters 4 and 5.
We also have
∀r ∈ R : %r > 0 .
The network protocol will eventually reach an equilibrium that results in an
allocation % such that A% 6 B and % > 0. Most protocols actually optimize
some metric under these constraints. Recent works [88, 86] have successfully
characterized which equilibrium is reached for many protocols. For example
ATM networks recursively maximize a weighted minimum of the %r whereas
some versions of TCP optimize a weighted sum of the log(%r ) or even more
complex functions.
Such models are very interesting for performance evaluation purposes but
they almost always prove too complicated for algorithm design purposes. For
72 Chapter 3. Networking
this reason, in the rest of this book we use the Hockney model or even sim-
plified versions of it model (e.g., with no latency) under the multi-port or the
1-port model. These models represent a good trade-off between realism and
tractability.
...
Pp−1
P3
P0
P2
P1
Each processor has its own local memory. All processors execute the same
program, which operates on the data in their respective local memories. This
mode of operation is common and termed SPMD, Single Program Multiple
Data. To access data that is not in their local memories, processors must
communicate via explicit sending and receiving of messages. This mode of
operation is also common and is termed Message Passing. Each processor can
send a message to its successor on the ring by calling the following function:
SEND(addr , m),
where addr is the address (in the memory of the sender processor) of the
first data element to be sent and m is the message length expressed as a
number of data elements. For simplicity we assume that the data elements
of a message must be contiguous in memory, starting at base address addr .
Note that message passing implementations, for instance those of the MPI
3.3. Case Study: the Unidirectional Ring 73
RECEIVE(addr , m).
Both functions above and others allowing processors to communicate are often
termed “communication primitives.” A few important points must be noted:
• Calls to communication primitives must match: if processor Pi executes
a RECEIVE, then its predecessor (processor Pi−1 mod p ) must execute a
SEND. Otherwise, the program cannot terminate.
• Since each processor has a single outgoing communication link there is no
need for specifying the destination processor: it is always the successor of
the sending processor. Similarly, when executing a RECEIVE, the source
of the incoming message is always the predecessor processor. For a
more complex logical topology, one may have to specify the destination
processor in the SEND and the source processor in the RECEIVE.
• The address of the first data element in a message, which is passed to
SEND, is an address in the local memory of the sender processor. Simi-
larly, the address passed to RECEIVE is in the local memory of the receiver
processor. Therefore, even if the program uses the same variable for stor-
ing both addresses (remember that we are in an SPMD execution), the
value of this variable will most likely be different in both processors.
One can of course use two different variables, as shown below:
q ←MY NUM()
if q = 0 then SEND(addr1 , m)
if q = 1 then RECEIVE(addr2 , m)
So far we have not said anything regarding the semantics of the communication
primitives. There are three standard assumptions:
• A rather restrictive assumption consists in assuming that each SEND and
RECEIVE is blocking, i.e., the processor that calls one of these communica-
tion primitives can continue its execution only once the communication
has completed. This completely synchronous message passing mode,
which is also called “rendez-vous,” is typical of first generation parallel
computing platforms.
• A classical assumption is to keep the RECEIVE blocking but to assume that
the SEND is non-blocking. This allows a processor to initiate a send but to
continue execution while the data transfer takes place. Typically, this
is implemented via two functions: one to initiate the communication,
and the other to check whether the communication has completed. In
74 Chapter 3. Networking
3.3.1 Broadcast
For a given processor index k, we wish to write a program by which pro-
cessor Pk sends the same message, of length m, to all other processors: this
operation is called a broadcast. This is a fundamental collective communi-
cation primitive. For instance, the sender processor can be a “master” that
broadcasts general information (e.g., problem size, input data) to all other
“worker” processors.
At the beginning of the program the message is stored at address addr
in the memory of the sender, processor Pk . At the end of the program the
message will be stored at address addr in the memory of each processor. All
processors must call the following function:
3.3. Case Study: the Unidirectional Ring 75
The main idea is to have the message go around the ring, from processor
Pk to processor Pk+1 , then from processor Pk+1 to processor Pk+2 , etc. For
the remainder of the chapter, we will often implicitly assume that processor
indices are to be taken modulo p. There is no parallelism here since the SEND
and the RECEIVE executed by each processor are not independent. Therefore,
the whole program is written as shown in Algorithm 3.1. It is important to
note that the predecessor of the sender processor, i.e., processor Pk−1 , must
not send the message. This is not only because processor Pk already has the
message, since it is the sender processor, but also because it does not execute
a RECEIVE in the program as we have written it.
1 BROADCAST(k, addr , m)
2 q ←MY NUM()
3 p ←NUM PROCS()
4 if q = k then
5 SEND(addr , m)
6 else
7 if q = k − 1 mod p then
8 RECEIVE(addr , m)
9 else
10 RECEIVE(addr , m)
11 SEND(addr , m)
For the program to be correct, we assume that the RECEIVE is blocking, since
processors forward the message with a SEND immediately after the RECEIVE.
For this program, the semantics of the SEND does not matter. Since we have
a sequence of p − 1 communications, the time needed to broadcast a message
of length m is (p − 1)(L + mb).
Note that message passing implementations, for instance those of the MPI
standard [110], typically do not use a ring topology for implementing collec-
tive communication primitive such as a broadcast but rather use various tree
topologies. Logical tree topologies are more efficient on modern parallel plat-
forms for the purpose of collective communications. Nevertheless, we use a
ring topology in this chapter for two reasons. First, a linear topology such as
a ring makes for straightforward collective communication algorithms while
highlighting key general concepts such as communication pipelining. Second,
we will see in Chapter 4 that when designing algorithms for a logical ring
76 Chapter 3. Networking
3.3.2 Scatter
We now turn to the scatter operation by which processor Pk sends a different
message to each processor. To simplify we assume that all sent messages have
the same length, m. A scatter is useful, for instance, to distribute different
data to worker processors, such as matrix blocks or parts of an image.
At the beginning of the execution, processor Pk holds the message to be
sent to processor Pq at address addr [q]. To keep things uniform, we assume
that there is a message at address addr [k ] to be sent by processor Pk to itself.
(Such a convention is used in standard communication libraries as it turns
out to be convenient in practice.) At the end of the execution, each processor
holds its own message at address msg.
The simple but key idea to implement the scatter operation is to pipeline
message sending, starting with the message to be sent to the furthest proces-
sor, that is processor Pk−1 . While this message is on its way, other messages
to be sent to closer processors can be sent as well.
that the SEND is non-blocking while the RECEIVE is blocking. We will make
this common assumption for the vast majority of our algorithms. Finally,
note that instructions tempS ↔ tempR and msg ← tempR are mere pointer
updates, and not memory copies.
The time for the scatter is the same as for the broadcast, i.e., (p − 1)(L +
mb). Indeed, pipelining of the communications on the rings allows for several
communication links to be used simultaneously. Therefore, it is possible to
send p − 1 messages in the same time as one, provided that their destinations
are p − 1 consecutive processors along the networks.
3.3.3 All-to-all
1 BROADCAST(k, addr , m)
2 q ←MY NUM()
3 p ←NUM PROCS()
4 if q = k then
5 for i = 0 to r − 1 do
6 SEND(addr [i], m/r)
7 else if q = k − 1 mod p then
8 for i = 0 to r − 1 do
9 RECEIVE(addr [i ], m/r)
10 else
11 RECEIVE(addr [0], m/r)
12 for i = 0 to r − 2 do
13 SEND(addr [i], m/r) || RECEIVE(addr [i + 1], m/r)
14 SEND(addr [r − 1], m/r)
To determine the execution time, we just need to determine when the last
processor, Pk−1 , receives the last message piece. There must be p − 1 com-
munication steps for the first message piece to arrive at Pk−1 , which amounts
to time (p − 1)(L + m r b). Then, all remaining r − 1 message pieces arrive
subsequently, in time (r − 1)(L + m r b). Therefore, the overall execution time
is the sum of the two: m
(p + r − 2) L + b .
r
We seek the value of r that minimizes the previous expression. Using the “goat
3.4. Case Study: the Hypercube 79
q
in a pen” theorem, we obtain ropt = m(p−2)b
L and the optimal execution time
is:
p √ 2
(p − 2)L + mb . (3.2)
In this expression, p, L, and b are all fixed. Therefore, for long messages,
the expression tends to mb, which does not depend on p! This is the same
principle as the one used in IP networks, which split messages into many small
packets to improve throughput over multi-hop paths thanks to pipelining of
communications over multiple communication links.
Then, the 0-th, 1-st, . . . , j − 1-th in this order. Let us call γ0 the path we
have just built. Let us assume that γj0 and γj1 share a common vertex X.
The lengths of the subpaths from A to X in γj0 and γj1 are equal otherwise
we could build a shorter path from A to B. However, for any j and k, the
first adjusted bits in γj is Sj,k = {j, . . . , j + k − 1} if j + k − 1 < n and
Sj,k = {j, . . . , n − 1} ∪ {0, . . . , k − (n − j)} otherwise. Therefore, Sj0 ,k = Sj1 ,k
if and only if j0 = j1 . Lastly, it is impossible to find more independent paths.
Any path starts by adjusting a bit, say the j-th, and thus is in conflict with
path γj . Note that i independent paths only reach i(i − 1) + 2 vertices and
the bandwidth available from A is fully used only when i = n.
available (useful) link, it has no other choice but to wait for one of them to be
available again. This algorithm is harder to implement than one may think
at first glance. One needs to ensure the fairness of the routing protocol or at
least to avoid starvation: a communication should not be delayed indefinitely
on a loaded network. One also needs to rebuild a message upon reception
since packets may follow different routes and thus may arrive out of order.
Lastly, one needs to ensure that all messages are delivered to their destination,
which is easy with the static algorithm but could become rather nightmarish
if messages are allowed to travel backwards: if links 1 and 2 are busy at
processor A in our previous example, one could try to use link 0 even if it
extends the route.
The static algorithm was implemented by Intel on the iPSC2. The dynamic
algorithm was also implemented but on the Paragon’s 2-D grid but not on a
hypercube. It uses the same idea: a message sent from processor (x, y) to
processor (x0 , y 0 ) is labeled with x0 − x, y 0 − y assuming that x0 > x and
y 0 > y. A simple static routing procedure is obtained by forwarding the
message horizontally x0 −x times and then vertically y 0 −y times. It is however
possible to use any Manhattan path. The message can use the vertical link
whenever the horizontal one is busy and conversely. The only restriction is to
never go out the rectangle defined by the source and the destination.
where
(r)
We denote by gi the i-th element of the Gray code of dimension r.
The Gray code Gn makes it possible to define a ring of 2n nodes in the
n-cube. Let us assume that this is true for n − 1, with n > 2 (this is trivial
for n = 1). Consecutive elements in position 0 to 2n−1 − 1 differ by a single
bit (due to the recursion). Similarly, consecutive elements in position 2n−1 to
2n − 1 also differ by a single bit. Finally, since the last element of Gn−1 is
equal to the first element of Grev
n−1 , elements 2
n−1
− 1 and 2n−1 also differ by
a single bit. All elements are thus connected.
To embed a 2r × 2s 2-D torus in a n-cube, with r + s = n, we can simply use
the Cartesian product Gr ×Gs . Processor with coordinates (i, j) on the grid is
(r) (s)
mapped to processor f (i, j) = (gi , gj ) in the n-cube. Horizontal neighbors
(r) (s)
(i ± 1, j) are mapped to f (i ± 1, j) = (gi±1 , gj ), where index i ± 1 is taken
modulo 2r : these nodes are indeed neighbors of f (i, j) in the n-cube. We can
prove likewise that vertical neighborhood is preserved. This construction can
easily be generalized to a 2r × 2s × 2t 3-D torus, where r + s + t = n.
0000
1111
FIGURE 3.9: Broadcast using a spanning tree on a hypercube for n = 4.
Let us denote by BIT(A, b) the value of the b-th bit of processor A. The
broadcast of a message of length m by processor k can be done with Algo-
rithm 3.5.
As there are n steps, the execution time with the store and forward model
is n(L + mb). As at each step, a processor communicates with only one other
processor, making the algorithm valid for the 1-port model. We could consider
splitting the message in packets and pipelining them to improve the execution
time as seen in Sections 3.2.2 and 3.3.4. It would however only work in the n-
port model. Indeed, as soon as the root of the broadcast reaches steady-state
it needs to send its packets simultaneously to all its neighbors.
3.4. Case Study: the Hypercube 85
1 BROADCAST(k, addr , m)
2 q ← MY NUM()
3 n ← log(TOT PROC NUM())
{ Update pos to work as if P0 was the root of the broadcast }
4 pos ←q XOR k
{ Find the rightmost 1 }
5 first1 ←0
6 while ((BIT(pos, first1 ) = 0) And (first1 < n)) do
7 first1 ←first1 + 1
{ Core of the algorithm }
8 for phase= n − 1 to 0 do
9 if (phase=first1 ) then RECEIVE(phase, addr , m)
10 else if (phase<first1 ) then SEND(phase, addr , m)
4. After this XOR operation, node 0 is a leaf in each tree Ti . We can then
merge these trees in a single tree rooted at 0 (see Figure 3.10(b)).
This broadcast works only when assuming all links are bidirectional: 101
communicates with 100 in the leftmost subtree while 100 communicates with
101 in the rightmost subtree (see Figure 3.10(b)).
Using optimal pipelining on a single tree, this new algorithm improves the
execution time to:
√ p 2
mb + (n − 1)L .
86 Chapter 3. Networking
A0 A1 A2
000 000 000
000
This bound cannot be reached: only one bit has been initially sent so how
could the end of the message arrive right after the first bit at no cost? The
study of the complexity of collective communications has led to many research
articles published in the 80s: the difficulty comes from the non-linear term L
that precludes the use of flow theory.
this problem is not particularly difficult in, say, a dedicated parallel platform
with 128 peers interconnected via a hypercube topology.
Peer-to-peer systems and applications have caught the interest of many
computer science researchers in the last decade. Depending on how the peers
in the overlay network are linked to each other, one can classify a P2P net-
work as unstructured or structured. Unstructured networks are generally built
using randomized protocols: new peers randomly connect to some peers and
update their neighbor sets over time based on random walks. The graph of
the corresponding overlay network is thus generally modeled well by random
graphs. Data or peer localization is achieved via “flooding”, i.e., by broadcast-
ing the request to the whole networks with a bound on the maximum number
of hops through the network. Flooding is known to be very inefficient. Struc-
tured networks are built using a globally consistent and deterministic protocol.
This protocol ensures that the graph of the overlay network has a particular
structure that facilitates efficient routing. We present briefly a few important
ideas underlying structured networks, and refer the reader to [112] for a com-
prehensive review of developments in peer-to-peer networking over the last
decade.
Let N , ∆, and D respectively denote the size, the maximum degree, and
the diameter of the network. Using Moore’s bound (see Section 3.1.2), we
easily obtain:
log N
D=Ω .
log ∆
Several structures can thus be considered:
log N
• With degree O(log N ), we obtain a route length of at least Ω log log N .
In practice, many systems have D = Θ(log N ), which is not optimal but
convenient. Hypercubes have typically such a degree and such a diame-
ter. They also have a large bisection and have thus been a great source
of inspiration to design peer-to-peer protocols (e.g., Chord [113], Pas-
try [104], Tapestry [121]). Overlay networks with an optimal diameter
have later been proposed [71] but the routing protocol is very complex
and difficult to implement.
used and advertised slightly later and have thus not received as much
attention.
• With degree O(N α ) we obtain a diameter of Ω(1) but such a high degree
is far too large to be of practical interest.
3.5.2 Chord
In the chord system [113], the set of keys is {0, . . . , 2m − 1} and the peers
are organized in a virtual ring. If peers n1 and n2 are two consecutive peers on
this ring, then peer n2 is responsible for all keys that fall between n1 and n2 .
Finding a key in this network would obviously be very time consuming since its
diameter is O(N ). This is why there are “shortcuts.” Node n is connected to all
peers responsible for key n + 2i [2m ] for all i ∈ {0, . . . , m − 1} (see Figure 3.11).
The lookup is thus easily done in O(log N ) hops by always forwarding the
request to the peer with the closest smallest ID (the distance to the destination
is halved at each hop). Also, the routing table, meaning the data structure
that holds the neighbor set, remains relatively small (O(log N )).
N1 N1
Finger table
N8 N8
+1 N8 + 1 N14 lookup(54)
N56 N8 + 2 N14 N56
+2 N8 + 4 N14
K54
+4 N8 + 8 N21
N51 N51
+32 +8 N14 N8 +16 N32 N14
N8 +32 N42
N48 +16 N48
K10
N21 N21
K38
N42 K30 K24 N42
N38 N38
N32 N32
by the hash h(t) of the corresponding string. As previously, the peer nt whose
ID is closer to h(t) is responsible for topic t. Any participant willing to sub-
scribe to this topic can thus route a message to h(t) to inform nt about its
subscription. Let us denote by S(t) the set of subscribers to topic t. Merging
the set of routes from s ∈ S(t) to nt we obtain a tree rooted at nt that can be
used backward to multi-cast notification messages from nt to every subscriber
s ∈ S(t). In practice, when a peer n subscribes to topic t the routing ends as
soon as it encounters a peer belonging to the tree.
This approach has many interesting features. First, nt does not need to
know the whole set of subscribers. Indeed, routes from the subscribers to the
owner of the topic often merge very soon and thus the tree is well balanced
in practice. Peers responsible for hot topics are thus overloaded neither by
requests nor by notifications. Second, the upper part of the tree is made of
geographically diverse peers, whereas the leaves are generally close to each
other within a subtree. Therefore, the more communication-intensive part of
the execution consists of a large group of efficient and mostly independent
communications.
Bibliographical Notes
The famous PVM message passing library [59] provided primitives for both
synchronous and asynchronous communications. The current standard for
message passing is MPI [110], which offers a rich set of collective communi-
cation primitives. A large part of the material in this chapter is inspired by
Desprez’s thesis [51] and by the book by Gengler, Ubéda and Desprez [60].
The book by Culler and Singh [48] is a great reference to find other exam-
ples of collective communication algorithms on grids or hypercube and more
information on commutation networks and distributed memory architectures.
Most work regarding peer-to-peer architectures is very recent. Our section on
this topic takes a very particular point of view that is strongly related to the
preceding sections on interconnection networks and routing. [112] provides a
comprehensive overview of developments in peer-to-peer networking over the
last decade.
3.6. Exercises 93
3.6 Exercises
Exercise 3.1 : Torus, Hypercubes and Binary Trees
1. What is the difference between a 6-cube and a 4 × 4 × 4 3-D torus?
2. What is the number of paths (with minimal length) from (x1 , y1 ) to (x2 , y2 )
in a n × n torus?
st
st
st
ag
ag
ag
ag
e0
e1
e2
e3
row 000
row 001
row 010
row 011
node (011,3)
row 100
row 101
row 110
row 111
FIGURE 3.13: BU T (3): the butterfly network of order 3.
1. What kind of network is obtained when all the vertices of a given row are
grouped?
2. The butterfly network is built recursively. Give two ways to split a but-
terfly network of order r into two butterfly networks of order r − 1.
3. Prove that there is a unique path of length r between any vertex hw, 0i
from stage 0 and any vertex hw0 , ri from stage r. What is the diameter of
BU T (r)?
st
st
st
st
st
st
ag
ag
ag
ag
ag
ag
ag
e0
e1
e2
e3
e4
e5
e6
row 000
row 001
row 010 Configuration 1
row 011
row 100
Configuration 2
row 101
row 110
row 111
4. Prove (by induction on the order r of the network) that Beneš networks
can be configured to implement any permutation: given an arbitrary permu-
tation π of {0, . . . , 2r − 1}, there is a configuration of switches for establishing
simultaneous routes from any input i to the corresponding output π(i).
Figure 3.15 depicts a configuration for the Beneš network of order 2 imple-
menting permutation π = (0, 1, 2, 3) → (3, 1, 2, 0).
96 Chapter 3. Networking
stage 0 stage 2
0 0
1 1
2 2
3 3
FIGURE 3.15: Configuration of Beneš network of order 2 for permutation
π = (0, 1, 2, 3) → (3, 1, 2, 0).
1. What is the degree and the diameter of B(m)? Propose a simple and
optimal routing algorithm in the De Bruijn graph.
2. Assume now that the network is undirected (i.e., that αxβ is connected
with 0αx, 1αx, xβ0, and xβ1). What is the diameter of the undirected
B(m)? Assuming that it is possible to easily compute the longest common
sub-sequence between u and v ∈ {0, 1}m , propose a simple and optimal rout-
ing algorithm in the De Bruijn graph.
3.7 Answers
Exercise 3.1 (Torus, Hypercubes and Binary Trees)
. Question 1. There is no difference: both architectures are the same! One
can easily check that their diameter and degree are equal to 6 in both cases.
The embedding of the torus in the hypercube with Gray codes, as seen in
Section 3.4.3, is an isomorphism: each processor has exactly 6 neighbors in
both networks. One can also easily check that a 4-cube is strictly equivalent
to a 4 × 4 2-D torus.
To bound the diameter, we can try to bound the distance between to arbi-
trary vertices. As we can XOR the hypercube components, this amounts to
finding a path from vertex h0, 0i to a vertex hw, ii.
Like for the hypercube, we adjust the bits of w starting for example from the
rightmost ones. For each 1 in w, we go through the corresponding hypercube
link. To this end, we will however have to move forward on one or more
vertices of the ring to reach the desired hypercube link. We can thus reach
the ring labeled by w in at most 2m − 1 steps: at most m steps on the
hypercube links and at most m − 1 steps on the ring’s links. Then, we need
to reach position i on the ring. The diameter of a ring of size m is b m2 c. The
upper bound on the diameter of our CCC(m) is thus D 6 2m − 1 + b m 2 c.
A nice feature is that the degree of such a network is bounded (it is equal
to 3) while its diameter is logarithmic in the total number of processors (we
have m 6 log p).
98 Chapter 3. Networking
. Question 2. The previous upper bound was almost tight. We can easily
check in Figure 3.12 that the diameter of CCC(3) is equal to 6.
Let us now focus on the case m > 3. To go from h0, 0i to hw, ii, we
necessarily go through at least |w| hypercube links, where |w| is the number
of 1’s in w. Let us assume for now that a shortest path goes through exactly
w hypercube links. Let us consider the “boxed” cycle associated to 0 where
vertices corresponding to 1’s in w have been boxed. For example, if m = 8
and w = 11000110, we have:
0 → 1 → 2 →3
↑ ↓
7 ← 6 ← 5 ←4
We will go through a hypercube link as soon as we arrive on a boxed vertex.
Thus, we need to find a shortest path from 0 to i going through all boxed
vertices. The total length of the path from h0, 0i to hw, ii is then equal to |w|
plus the length of the path on the boxed ring. If i = 5, a shortest path on the
boxed cycle is:
0 → 1 → 2 → 1 → 0 → 7 → 6 → 5,
and its length is 7. By adding |w| = 4 we get the total routing length.
With such a representation, it is clear that going through additional hyper-
cube links is useless. One gets the worst case with w = 11 . . . 1 (all bits are
set to 1) and i = b m2 c: the shortest path from 0 to i that goes through all
links is of length m + b m
2 c − 2. Hence, the result. With our previous example
(m = 8), a shortest path from 0 to 4 going through all vertices is
0 → 7 → 6 → 5 → 6 → 7 → 0 → 1 → 2 → 3 → 4,
whose length is 10 = 8 + 4 − 2.
P0
P1
P2
P3
ring of size 3
ring of size 4
ring of size 5
ring of size 6
bX
2c
p
n2 n2 p2 n2
1 j p k j p k
k L+b 2 = +1 L+b 2 ∼ L+ b.
p 2 2 2 p 8 8
k=1
All these phases can thus be done in parallel, hence a transposition time equal
to:
j q k n2
n2 √
2 · b 2 + L ∼ b√ + L p .
2 q p
This is optimal (in a store and forward mode) since the execution time is equal
to the communication time of the farthest two processors.
but it is not at straightforward. The first network consists of all even rows
and the second one consists of all odd rows. Indeed, removing the last level
amounts to ignoring the last bit of the row indices.
1 1
2 Upper 2
3 half network 3
4 4
5 5
6 6
Lower
7 half network 7
8 8
Parallel Algorithms
103
Chapter 4
Algorithms on a Ring of Processors
for i = 0 to n − 1 do
yi ← 0
for j = 0 to n − 1 do
yi ← yi + Ai,j × xj
105
106 Chapter 4. Algorithms on a Ring of Processors
Each iteration of the outer loop (the i loop) computes the scalar product of
one row of A by vector x. Furthermore, all these scalar product computa-
tions are independent, in the sense that they can be performed in any order.
Consequently, a natural way to parallelize a matrix-vector multiplication is
to distribute the computation of these scalar products among the processors.
Let us assume that n is divisible by p, the number of processors, and let us
define r = n/p. Each processor will compute r scalar products to obtain r
components of the result vector y. To do this, each processor must store r
rows of matrix A. It is natural for a processor to store r contiguous such rows,
which is often termed a block of rows, or simply a block row. For instance, one
can allocate the first block row of matrix A to the first processor, the second
block row to the second processor, and so on. The components of vector y
(which are to be computed) are then distributed among the processors in a
similar manner as the rows of matrix A (e.g., the first r components to the
first processor, etc.). These kinds of data distributions are typically called
“1-D distributions,” because arrays are partitioned along a single dimension.
A 1-D data distribution is a natural choice when developing an algorithm on
a 1-D logical topology like a ring or processors.
If one assumes that vector x is fully duplicated across all processors, then
the computations of the scalar products are completely independent since no
input data is shared. But it is usual, in practice, to assume that vector x is
distributed among the processors in the same manner as matrix A and vector
y. This is for reasons of modularity and consistency. Parallel programs often
consist of sequences of parallel operations. Assuming that the input vector is
distributed in the same manner as the matrix and the output vector makes
it possible to perform a second matrix-vector multiplication z = By, where
the input vector is the output vector of the previous computation (assuming
that matrix B is distributed similarly to matrix A). Furthermore, parallel
algorithms are often easier to understand and modify when all data objects
are distributed in a consistent manner.
Each processor holds r rows of matrix A, stored in an array A of dimension
r × n, and r elements of vectors x and y, stored in two arrays x and y of
dimension r. More precisely, processor Pq holds rows qr to (q + 1)r − 1 of
matrix A, and components qr to (q + 1)r − 1 of vectors x and y. Thus, one
needs the following declarations in our parallel program:
var A: array[0..r − 1,0..n − 1] of real;
var x, y: array[0..r − 1] of real;
Using these declarations, element A[0, 0] on processor P0 corresponds to ele-
ment A0,0 of the matrix, but element A[0, 0] on processor P1 corresponds to
element Ar,0 of the matrix. Typically the indices of matrix elements, which
we denote via subscripts, are called the global indices, while the indices of
array elements, which we denote with square brackets, are called the local
indices. The mapping between global and local indices is one of the techni-
cal difficulties when writing parallel programs that operate over distributed
4.1. Matrix-Vector Multiplication 107
A00 A01 A02 A03 A04 A05 A06 A07 x0
P0
A10 A11 A12 A13 A14 A15 A16 A17 x1
A20 A21 A22 A23 A24 A25 A26 A27 x2
P1
A30 A31 A32 A33 A34 A35 A36 A37 x3
A40 A41 A42 A43 A44 A45 A46 A47 x4
P2
A50 A51 A52 A53 A54 A55 A56 A57 x5
A60 A61 A62 A63 A64 A65 A66 A67 x6
P3
A70 A71 A72 A73 A74 A75 A76 A77 x7
arrays. However, for our particular program in this section this mapping is
straightforward: global indices (i, j) correspond to local indices (i − b ri c, j) on
processor Pbi/rc .
Note that faster execution is not the only motivation for the parallelization
of a sequential algorithm on a distributed memory platform. Parallelization
also makes it possible to solve larger problems. For instance, in the case of
matrix-vector multiplication, the matrix is distributed over the local memo-
ries of p processors rather than being stored in a single local memory as in
the sequential case. Therefore, the parallel algorithm can solve a problem
(roughly) p times larger than its sequential counterpart.
The principle of the algorithm is depicted in Figure 4.1, which shows the
initial distribution of matrix A and of vector x, and in Figure 4.2, which shows
the p steps of the algorithm. At each step, the processors compute a partial
result, i.e., the product of a r ×r matrix by a vector of size r. In the beginning
(step 0), each processor Pq holds the q th block of vector x, and can calculate
the partial scalar product that corresponds to a diagonal block of matrix A.
We use a block notation by which x cq (resp. ybq ) is the q th block of size r of
vector x (resp. y), and by which A q,s is the r × r block at the intersection of
d
th th
the q block row and the s block column of matrix A. Using this notation,
each processor Pq can compute ybq = A q,q × x
cq during the first algorithm step.
d
While this computation is taking place, one can do a circular block shift of
vector x among the processors: processor Pq sends x cq to processor Pq+1 . Note
that we assume that processor indices are taken modulo p.
This is shown in Figure 4.2 with communicated blocks stored into buffer
tempR. As mentioned earlier in Section 3.3, we assume that sending, receiv-
ing, and computing can all occur concurrently at a processor as long as they
are independent of each other. By the beginning of step 1, processor Pq has
received x[ q−1 . (Note that, like for processor indices, we implicitly assume
that block indices are taken modulo p.) Processor Pq can therefore compute
ybq = ybq + A\q,q−1 × x
[ q−1 , and at the same time it can participate in the circular
block shift of vector x. At the end of the pth step, array y in processor Pq con-
tains block ybq , i.e., the desired r components of the result: yqr , . . . , yqr+r−1 .
108 Chapter 4. Algorithms on a Ring of Processors
A00 A01 • • • • • • x0 x6
P0 tempR ←
A10 A11 • • • • • • x1 x7
• • A22 A23 • • • • x2 x0
P1 tempR ←
• • A32 A33 • • • • x3 x1
• • • • A44 A45 • • x4 x2
P2 tempR ←
• • • • A54 A55 • • x5 x3
• • • • • • A66 A67 x6 x4
P3 tempR ←
• • • • • • A76 A77 x7 x5
(a) First step
• • • • • • A06 A07 x6 x4
P0 tempR ←
• • • • • • A16 A17 x7 x5
A20 A21 • • • • • • x0 x6
P1 tempR ←
A 30 A 31 • • • • • • x1 x7
• • A42 A43 • • • • x2 x0
P2 tempR ←
• • A52 A53 • • • • x3 x1
• • • • A64 A65 • • x4 x2
P3 tempR ←
• • • • A74 A75 • • x5 x3
(b) Second step
• • • • A04 A05 • • x4 x2
P0 tempR ←
• • • • A14 A15 • • x5 x3
• • • • • • A26 A27 x6 x4
P1 tempR ←
• • • • • • A36 A37 x7 x5
A40 A41 • • • • • • x0 x6
P2 tempR ←
A A
50 51 • • • • • • x1 x7
• • A62 A63 • • • • x2 x0
P3 tempR ←
• • A72 A73 • • • • x3 x1
(c) Third step
• • A02 A03 • • • • x2 x0
P0 tempR ←
• • A12 A13 • • • • x3 x1
• • • • A24 A25 • • x4 x2
P1 tempR ←
• • • • A34 A35 • • x5 x3
• • • • • • A46 A47 x6 x4
P2 tempR ←
• • • • • • A56 A57 x7 x5
A60 A61 • • • • • • x0 x6
P3 tempR ←
A70 A71 • • • • • • x1 x7
(d) Fourth (and last) step
This parallel algorithm is shown in Algorithm 4.1. Note our use of the ||
symbol to indicate actions that occur concurrently at a processor.
The reader will notice that the circular block shift of vector x in the last
step is not absolutely necessary. But it allows all processors to hold at the end
of the execution the block of x that they held initially, which may be desirable
for subsequent operations.
The computation of the execution time of this algorithm on p processors is
straightforward. There are p identical steps. Each step lasts as long as the
longest of the three activities performed during the step: compute, send, and
receive. Since the time to send and the time to receive are identical, T (p) can
be written as:
T (p) = p max{r2 w, L + rb} ,
where w is the computation time for one basic operation (in this case multi-
plying two vector components and adding the result to another vector com-
ponent), b is the inverse of the data transfer rate measured in number of basic
data units (in this case one vector component) per second, and L is the com-
munication start-up cost. Since r = n/p, for a given number of processors
p the computation component in the equation above is asymptotically larger
2
than the communication component as n becomes large ( np2 w L + np b).
This means that our algorithm achieves an asymptotic parallel efficiency of 1
as n becomes large. Note that if we had opted for a full duplication of vector
x across all processors, the disadvantages of which were discussed earlier, then
there would be no need for any communication at all. Therefore, the parallel
110 Chapter 4. Algorithms on a Ring of Processors
for i = 0 to n − 1 do
for j = 0 to n − 1 do
Ci,j ← 0
for k = 0 to n − 1 do
Ci,j ← Ci,j + Ai,k × Bk,j
15 tempS ↔tempR
processor Pq holds the desired blocks of the result matrix C. The algorithm is
shown in Algorithm 4.2. This algorithm is very similar to the one for matrix-
vector multiplication, essentially replacing the scalar products by sub-matrix
multiplications and replacing the circular shifting of vector components by
circular shifting of matrix rows. Consequently, the performance analysis is
also similar. The algorithm proceeds in p steps. Each step lasts as long as the
longest of the three activities performed during the step: compute, send, and
receive. Therefore, using the same notations as in the previous section, T (p)
can be written as:
T (p) = p. max{nr2 w, L + nrb} .
Here again, the asymptotic parallel efficiency when n is large is 1. It is in-
teresting to note that the matrix-matrix multiplication could be achieved by
executing the matrix-vector multiplication algorithm n times. With this more
naı̈ve algorithm the execution time would be simply the one obtained in the
previous section multiplied by n:
The only difference between T (p) and T 0 (p) is the term L, which has become
nL. While in the naı̈ve approach processors exchange vectors of size r, in
the algorithm developed in this section they exchange matrices of size r × n,
thereby reducing the overhead due to network start-up cost L. This change
112 Chapter 4. Algorithms on a Ring of Processors
does not reduce the asymptotic efficiency, but it can be significant in practice.
This notion of sending data in bulk is a well-known and general principle
used to reduce communication overhead in parallel algorithms, and we will
see it again and again in this chapter. Furthermore, with modern processors,
operating on blocks of data is often better for performance as it exploits a
processor’s memory hierarchy (registers, multiple levels of cache, main mem-
ory) to the best of its capabilities. As a result, operating on blocks of data
improves both computation and communication performance. Note that in
this algorithm the total number of data elements sent over the network is
p × p × n × r = p × n2 (recall that r = n/p). We will see how this total amount
of communication can be reduced.
NW N NE
W c E
SW S SE
W c
FIGURE 4.4: A simple stencil in which cell c is updated using the values
of its West and North neighbors.
Greedy Algorithm
A first idea to parallelize our stencil algorithm is to use a greedy approach
by which processors send the cells they compute to their neighbors as early as
possible. Such a parallelization reduces the start-up latencies (i.e., processors
114 Chapter 4. Algorithms on a Ring of Processors
j
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
i
n2
T (n, p) = p − 1 + (w + b + L) .
p
This algorithm was designed so that the time between a cell value compu-
4.3. A First Look at Stencil Applications 117
tation and its reception by the next processor is as short as possible. But
a glaring problem is that the algorithm performs many communications of
data items that may be small. In practice, the communication startup cost
L can be orders of magnitude larger than b if the size of a cell value is small.
In the case of stencil applications the cell value is often as small as a single
integer or a single floating-point number. Furthermore, w may not be large.
It may be comparable to or smaller than L. Indeed, cell value computations
can be very simple and involve only a few operations. This is the case for
many numerical methods that update cell values based on simple arithmetic
formulae. Therefore, for many applications, a large fraction of the execution
time T (n, p) could be due to the L terms in the previous equation. Spending a
large fraction of the execution time in communication overhead reduces paral-
lel efficiency significantly. This can be seen more plainly by simply computing
the parallel efficiency. The parallel efficiency of the algorithm is equal to the
sequential execution time, n2 w, divided by p × T (n, p). When n gets large,
the parallel efficiency tends to w/(w + b + L), which may be well below 1. In
the next section we present two techniques for addressing this problem.
P1 1 2 3 4 5 6 . . .
P2 2 3 4 5 6 7 . . .
P3 3 4 5 6 7 8 . . .
First row of each processor Second row
FIGURE 4.6: Steps of the modified algorithm with k > 1 and p = 4 pro-
cessors.
still in a cyclic fashion. At each step each processor now computes r × k cell
values. We assume that r × p divide n for simpler performance analysis. For
instance, with r = 3, n = 36 and p = 4, we have the following allocation of
rows to processors:
P0 P1 P2 P3
0, 1, 2 3, 4, 5 6, 7, 8 9, 10, 11
12, 13, 14 15, 16, 17 18, 19, 20 21, 22, 23
24, 25, 26 27, 28, 29 30, 31, 32 33, 34, 35
One interesting question is: are there optimal values of k and r? We can
answer this question via performance analysis of the algorithm. We assume
that n > kp so that no processor stays idle and leave the (less relevant in
practice) n < kp case as an exercise for the motivated reader. The algorithm
proceeds in a sequence of steps. At each step at least one processor is involved
in the following activities:
The analysis is very similar to that for our simple greedy algorithm. Here again
we assume that receiving a message is a blocking operation while sending a
message is not. Consequently, the time needed to perform one step of the
algorithm is krw + kb + L. The computation terminates when processor Pp−1
finishes computing the rightmost segment of its last block of rows of cells.
It takes p − 1 algorithm steps before processor Pp−1 can start doing any
computation. At this point, processor Pp−1 will compute one segment of a
block row at each step. There are n2 /(kr) such segments in the domain, and
each processor holds the same number of segments. Therefore, processor Pp−1
computes for n2 /(pkr) steps. Overall, the algorithm runs for p − 1 + n2 /(pkr)
steps and the total execution time T (n, p, r, k) is:
n2
T (n, p, r, k) = p−1+ (krw + kb + L) .
pkr
Let us compare the asymptotic parallel efficiency of this algorithm with that
of the algorithm in Section 4.3.2, whose parallel efficiency was w/(w + b + L).
Dividing the sequential execution time n2 w by p × T (n, p, r, k) we obtain an
asymptotic efficiency of w/(w + b/r + L/rk). In essence, increasing r and
k makes it possible to achieve significantly higher asymptotic efficiency by
reducing communication costs.
While low asymptotic parallel efficiency is important, one may wonder what
values of k and r should be used in practice. It turns out that, for n and p
fixed, one can easily compute the optimal value of k, kopt (r). Equating the
derivative of T (n, p, r, k) to zero and solving for k one obtains the value k 0 (r):
s
0 L
k (r) = n .
p(p − 1)r(rw + b)
Since k 6 n/p, we obtain kopt (r) = min(k 0 (r), n/p). Finally, for given
values of L, w and b, one can compute kopt (r) and inject it in the expression
for T (n, p, r, k). One can then determine the best value for r numerically.
120 Chapter 4. Algorithms on a Ring of Processors
4.4 LU Factorization
In this section, we develop a parallel algorithm that performs the classic LU
decomposition of a square matrix A using the Gaussian elimination method,
which in turn allows for the straightforward solution of linear systems of the
form Ax = b. Readers familiar with the algorithm will notice that we make
two simplifying assumptions so as not to make matters overly complicated.
First, we use the Gaussian elimination method without any pivoting. This as-
sumption is rather inconsequential at least in the case of partial pivoting: since
the columns of A will be distributed among the processors, partial pivoting
would not add any extra communication. Second, our algorithm eliminates
columns of A one after another. This is not realistic for modern comput-
ers as their memory hierarchies would not be exploited to the best of their
capability. Instead, the algorithm should really process several columns at
a time (i.e., column blocks), exactly as in our matrix-matrix multiplication
algorithm (Section 4.2). But from a purely algorithmic standpoint there is no
conceptual difference between computing a single column or a column block,
and the algorithm’s spirit is unchanged. Note however that when writing the
corresponding programs in practice the utilization of column blocks requires
many more lines of code. This is true even for sequential algorithms. For
instance, see the sequential version of the Gaussian elimination algorithm in
the publicly available LAPACK library [8].
Already computed
Already computed
current column
to be
updated
1 for k = 0 to n − 2 do
2 PREP(k) :
3 for i = k + 1 to n − 1 do
4 Aik ← −Aik /Akk
5 for j = k + 1 to n − 1 do
6 UPDATE(k, j) :
7 for i = k + 1 to n − 1 do
8 Aij ← Aij + Aik × Akj
9 BROADCAST(ALLOC(k), buffer , n − k − 1)
10 for j = k + 1 to n − 1 do
11 UPDATE(k, j) :
12 for i = k + 1 to n − 1 do
13 Aij ← Aij + buffer [i − k − 1] × Akj
which was possible with single calls to SEND. The LU decomposition algorithm
however needs to broadcast columns of the array, that is elements that are
not contiguous in memory! Therefore, we place these elements into a helper
array so that a single call to BROADCAST can send matrix elements in bulk. The
alternative would have been to place n individual calls to BROADCAST, one for
each element in the current column of matrix A. This alternative leads to high
overhead due to network latencies (typically much higher than the overhead of
an extra memory copy). Understanding memory layouts is always a good idea
(for instance to improve cache reuse) but is paramount when implementing
distributed memory algorithms to ensure that data is communicated in bulk
as much as possible.
One remaining issue is to map global indices to local indices for accessing
elements of matrix A. Defining r = n/p, each processor stores r columns of
A in its local memory. As the algorithm makes progress, some columns stop
being accessed: after step k, only columns with an index higher than k are
read and/or updated. The classical idea here is for each processor to use a
local index, l, which indicates the next column of the local array that should
be accessed. At the beginning of the execution all processors set l = 0. At
step k = 0, processor ALLOC(0) prepares column 0 by calling function PREP(0).
Its local index will then be incremented to l = 1. When this processor updates
its columns of matrix A, it only updates the r − l last ones. The value of l is
unchanged at other processors. At step k = 1, processor ALLOC(1) increments
its own value of l after calling PREP(1). Using array declarations and array
indices to replace the matrix element specifications in the previous algorithm,
4.4. LU Factorization 123
Given the above, we need to find an allocation that balances both the
memory consumption (all processors must hold the same number of columns
124 Chapter 4. Algorithms on a Ring of Processors
j−1
X 1 1
ops(j) = (n − k − 1) = − j 2 + n − j,
2 2
k=0
r−1
X r−1
X
p(0, P0 ) → u(0, 1, P1 ), p(1, P1 ) → u(1, j, P2 ), p(2, P2 ) → . . .
j=1 j=2
Note that to be precise we use w0 to denote the time for a basic column prepa-
ration operation (one division and one negation), and w to denote the time for
a basic column update operation (one multiplication and one addition). As n
3
grows, the overall execution time becomes asymptotically equivalent to n3p w.
Note that the total number of update operations to be performed is given by
i=n
X
ops(i) ,
i=1
We can improve the above sequence. The key insight is that, instead of
executing all the UPDATE(0, j) followed by PREP(1), and then sending column
k = 1, processor P1 can execute UPDATE(0, 1), then PREP(1), then send column
k = 1, and then execute all remaining UPDATE(0, j) for j = p + 1, 2p + 1, 3p + 1,
etc. From the perspective of the other processors, both sequences of opera-
tions are equivalent. But in the second sequence, column k = 1 is sent earlier,
4.4. LU Factorization 129
which ends up greatly reducing and sometimes removing the pipeline bubbles
mentioned earlier. The basic principle here is, again, to perform communi-
cations as early as possible. The pseudo-code of the look-ahead algorithm is
shown in Algorithm 4.9, omitting the code for the PREP() and UPDATE() inlined
functions. Consequently, we have added an extra argument to these functions
to specify the buffer array that is to be used. Note the use of two different
buffers in this version of the algorithm so that a processor can call PREP() in
step k of the algorithm and then place UPDATE() calls that are “left over” from
step k − 1.
see that the new algorithm is indeed more efficient than the simple pipelined
algorithm from the previous section because it removes most of the pipeline
bubbles. Remember that our depiction of the execution makes the unrealistic
assumption that column preparation, column update, and column commu-
nication all take the same time. Nevertheless, the ability of this look-ahead
algorithm to reduce idle time is well observed in practice.
21 begin { Phase 2 }
22 B [0,0] ←UPDATE(A[0, 0], Nil, A[0, 1], fromP [0], A[1, 0])
23 B [r − 1,0] ←UPDATE(A[r − 1, 0], Nil, A[r − 1, 1], A[r − 2, 0], fromS [0])
24 for j = 1 to n − 2 do
25 B [0,j] ←UPDATE(A[0, j], A[0, j − 1], A[0, j + 1], fromP [j], A[1, j])
26 B [r − 1,j] ←
27 UPDATE(A[r − 1, j], A[r − 1, j − 1], A[r − 1, j + 1], A[r − 2, j], fromS [j])
28 end
29 B [0,n − 1] ←
UPDATE(A[0, n − 1], A[0, n − 2], Nil, fromP [n − 1], A[1, n − 1])
30 B [r − 1,n − 1] ←
UPDATE(A[r − 1, n − 1], A[r − 1, n − 2], Nil, A[r − 2, n − 1], fromS [n − 1])
The second phase of the algorithm consists in computing two rows, and thus
takes time 2nw. The overall execution time of the algorithm, T (n, p), is:
When n becomes large, T (n, p) ∼ wn2 /p. Since the sequential execution time
is wn2 the parallel algorithm’s asymptotic efficiency is 1.
...
SEND(pred, ADDR(A[0, 0]), n) || RECV(succ, fromS , n)
SEND(succ, ADDR(A[r − 1, 0]), n) || RECV(pred, fromP , n)
...
With this new communication phase, the algorithm’s execution time be-
comes:
Bibliographical Notes
In terms of algorithms, a seminal reference is the book by Kumar et al. [77].
The content of this chapter’s section on matrix-vector and matrix-matrix mul-
tiplication belongs to popular parallel computing knowledge. The discussion
and performance modeling of stencil applications are inspired by articles by
Miguet and Robert [91, 92]. Finally, the parallel Gaussian elimination al-
gorithm used in our LU factorization is a classic (see [102] and other cited
references).
4.9. Exercises 139
4.9 Exercises
We start with two classical linear algebra sequential algorithms and their
parallelization on a ring of processors. The third exercise revisits the definition
of parallel speedup and introduces the important notion of scaled speedup.
The fourth exercise is a hands-on implementation of a matrix multiplication
algorithm on a parallel platform using the MPI message-passing standard.
0 . . . 0 a0i,k a0i,k+1 . . . a0i,n cos θ − sin θ 0 . . . 0 ai,k ai,k+1 . . . ai,n
←
0 . . . 0 0 a0j,k+1 . . . a0j,n sin θ cos θ 0 . . . 0 aj,k aj,k+1 . . . aj,n
The reader can easily compute the value of θ so that element aj,k is zeroed
out. The sequential algorithm can be written as follows:
1 GIVENS(A)
2 for k = 1 to n − 1 do
3 for i = n DownTo k + 1 do
4 ROT(i − 1, i, k)
Consider that one rotation executes in one unit of time, independently of the
value of k.
140 Chapter 4. Algorithms on a Ring of Processors
3. Run your program on a parallel platform, e.g., a cluster, and plot the
parallel speedup and efficiency for 2, 4, and 8 processors as functions of the
matrix size, n. Each data point should be obtained as an average over 10
runs.
4. If you have not done so, experiment with non-blocking MPI communi-
cation and see whether there is any impact on your program’s performance
when you overlap communication and computation.
142 Chapter 4. Algorithms on a Ring of Processors
4.10 Answers
Exercise 4.1 (Solving a Triangular System)
. Question 1. If we distribute columns of A among the processors, we must
parallelize operations within rows of A. (Parallelizing the operations within
a column would lead to sequential execution.) For each row, each processor
should contribute some computations for the columns it holds. Consider the
typical sequential algorithm:
1 for i = 0 to n − 1 do
2 s←0
3 for j = 0 to i − 1 do
4 s ← s − ai,j × xj
5 xi ← (bi − s)/ai,i
1 for i = 0 to n − 1 do
2 t←0
3 forall j ∈ MyCols, j < i do
4 t ← t + ai,j × xj
5 s = GATHER(ALLOC(i), t, 1)
6 if i ∈ MyCols then
7 xi ← (bi − s)/ai,i
In this algorithm, we use variable MyCols to denote the set of the indices of
the columns allocated to each processor. The GATHER operation is used so that
the sum of the partial scalar products, computed locally at each processor, is
computed and stored in the memory of processor ALLOC(i). The above pseudo-
code is written using only global indices, and we let the reader write it using
local array indices. This can be done using the same technique as for the LU
factorization algorithm in Section 4.4.
Answers 143
. Question 2. We only sketch the main idea of the solution, which is similar
to that for Question 1. If matrix rows are distributed to processors, we need to
operate on a matrix column at each step, so that each processor can contribute
by updating the fraction of the column corresponding to the processor’s local
rows. This implies swapping the two loops in the sequential version:
1 for j = 0 to n − 1 do
2 x[j] ← b[j]/a[j, j]
3 for i = j + 1 to n − 1 do
4 b[i] ← b[i] − a[i, j] × x[j]
As before, a cyclic allocation will nicely balance the work among the pro-
cessors. One allocates row i of matrix A and component bi to processor
ALLOC(i) = i mod p, which is responsible for the computation of xi . With
this allocation, we obtain the parallel algorithm:
1 for j = 0 to n − 1 do
2 if j ∈ MyRows then
3 x[j] ← b[j]/a[j, j]
4 BROADCAST(()alloc(j),x[j],1)
5 for i ∈ MyRows, i > j do
6 b[i] ← b[i] − a[i, j] × x[j]
1 2 3 4 5 6 7 8 → P1 → P2 → P 3 → P4 → P 5 → P6 → P 7 → P8
When a row arrives at a processor, it stays there for one step. When a
second row arrives at that processor, it is combined with the first row. The
first row is then sent to the next processor and the combined row takes its
place. We obtain the following execution, shown on an example for n = 8
(each table element shows which matrix row each processor contains at each
algorithm step):
144 Chapter 4. Algorithms on a Ring of Processors
Step P1 P2 P3 P4 P5 P6 P7 P8
t=1 8
t=2 ROT(7, 8, 1)
t=3 ROT(6, 7, 1) 8
t=4 ROT(5, 6, 1) ROT(7, 8, 2)
t=5 ROT(4, 5, 1) ROT(6, 7, 2) 8
t=6 ROT(3, 4, 1) ROT(5, 6, 2) ROT(7, 8, 3)
t=7 ROT(2, 3, 1) ROT(4, 5, 2) ROT(6, 7, 3) 8
t=8 ROT(1, 2, 1) ROT(3, 4, 2) ROT(5, 6, 3) ROT(7, 8, 4)
t=9 1 ROT(2, 3, 2) ROT(4, 5, 3) ROT(6, 7, 4) 8
t = 10 1 2 ROT(3, 4, 3) ROT(5, 6, 4) ROT(7, 8, 5)
t = 11 1 2 3 ROT(4, 5, 4) ROT(6, 7, 5) 8
t = 12 1 2 3 4 ROT(5, 6, 5) ROT(7, 8, 6)
t = 13 1 2 3 4 5 ROT(6, 7, 6) 8
t = 14 1 2 3 4 5 6 ROT(7, 8, 7)
t = 15 1 2 3 4 5 6 7 8
. Question 2. The idea is to “fold in half” the ring used in the previous
question so that rows travel first to the right and then to the left.
Tpar (1 − f )T1
Tp > Tseq + = f · T1 + ,
p p
T1 1 1
Sp = 6 1−f
6 .
Tp f+ p f
speed and I/O speed. With 1 processor, the maximum feasible problem size,
nmax (1), is defined by β(nmax (1))2 = M .
With p processors, one can compute a larger problem since one can ag-
√
gregate all the processors’ memories: nmax (p) = p · nmax (1). How do we
then compute the parallel speedup for a problem of size nmax (p), that is for a
problem too large to be executed on a single processor? The simple idea is to
scale the speedup: we compute A(p), the mean time to perform an arithmetic
operation with p processors for a problem of maximum size nmax (p). The new
definition of the speedup, proposed by Gustafson [66], is Sp = A(1)/A(p).
For our matrix computation, assuming perfect parallelization of the arith-
metic computations, we have:
A(1) = (n^α w + γ n² τi/o) / n^α    with βn² = M , and

A(p) = (n^α w/p + γ n² τi/o) / n^α    with βn² = pM .
If α = 2, A(1) = w + γτi/o and A(p) = w/p + γτi/o : we obtain the traditional parallel speedup, which is bounded by a value that does not depend on p. But for α ≥ 3, we have

A(1) = w + γτi/o / (M/β)^((α−2)/2) , and

A(p) = w/p + γτi/o / (pM/β)^((α−2)/2) .
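As an illustration, the sketch below evaluates the scaled speedup Sp = A(1)/A(p) for made-up platform parameters; all numeric values are purely illustrative.

```python
# Illustrative evaluation of Gustafson's scaled speedup for the memory-bound
# computation above (n^alpha arithmetic operations, gamma*n^2 I/O, beta*n^2 memory).
def scaled_speedup(p, alpha, w, gamma, tau_io, beta, M):
    n1 = (M / beta) ** 0.5            # largest problem that fits on one processor
    np_ = (p * M / beta) ** 0.5       # largest problem that fits on p processors
    A1 = (n1**alpha * w + gamma * n1**2 * tau_io) / n1**alpha
    Ap = (np_**alpha * w / p + gamma * np_**2 * tau_io) / np_**alpha
    return A1 / Ap

for p in (1, 4, 16, 64):
    # with alpha = 3 the scaled speedup keeps growing with p, unlike alpha = 2
    print(p, scaled_speedup(p, alpha=3, w=1.0, gamma=1.0, tau_io=10.0, beta=1.0, M=1e6))
```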
Let us conclude with some parallel computing humor: a single mover takes
infinite time to move a piano up a few floors, but two movers only take a few
minutes: infinite speedup!
Chapter 5
Algorithms on Grids of Processors
time as it can send (or receive) that amount of data. This assumption may or
may not be realistic for the underlying physical platform. It is straightforward
to modify the programs presented in this chapter and, more importantly, their
performance analyses, in case the full-duplex assumption does not hold.
The second issue is that of the number of communications in which a sin-
gle processor can be engaged simultaneously. With four bidirectional links,
conceivably a processor can be involved in one send and one receive on all its
network links, all concurrently. The assumption that such concurrent com-
munications are allowed at each processor with no decrease in communication
speed when compared to a single communication is termed the multi-port
model. In the case of our grid topology, we talk of a 4-port model. If instead
at most two concurrent communications are allowed, one of them being a send
and one of them being a receive, then one talks of a 1-port model. Going back
to our stencil algorithm on a bidirectional ring (Section 4.5.2), the reader will
see that we had implicitly used the 1-port model. In this chapter, we will show
performance analyses for both the 1-port and the 4-port model. Once again,
it is typically straightforward to adapt these analyses to other assumptions
regarding concurrency of communications.
As discussed at the end of Chapter 4, an important issue is impedance
matching between logical and physical topology: how do the grid and ring
logical topologies compare in terms of realism for a given physical platform?
It turns out that there are platforms whose physical topologies are or include
grids and/or rings. A famous example of a supercomputer using a grid is
the defunct Intel Paragon. A more recent example, at least at the time this
book is being written, is IBM’s Blue Gene/L supercomputer and its 3-D torus
topology, which contains grids and rings. When both a ring and a grid map
well to the physical platform, the grid is preferable for many algorithms. For
a given number of processors p, a torus topology uses 2p network links (and a grid topology 2(p − √p)), twice as many as a ring topology, which uses only
p network links. As a result, more communications can occur in parallel and
there are more opportunities for developing parallel algorithms with lower
communication costs. Interestingly, even on platforms whose physical topol-
ogy does not contain a grid (e.g., on a platform using a switched interconnect),
using a logical grid topology can allow for more concurrent communications.
In this chapter, we will see that the opportunity for concurrent communica-
tions is the key advantage of logical grid topologies for implementing popular
algorithms. But we will also see that even if the underlying platform offers no
possibility of concurrent communications, writing some algorithms assuming
a grid topology is inherently beneficial!
where dest has value north, south, west, or east. For a torus topology, which is the topology we will use for the majority of the algorithms in this chapter, the North, South, West, and East neighbors of processor Pi,j are P(i−1) mod q, j , P(i+1) mod q, j , Pi, (j−1) mod q , and Pi, (j+1) mod q , respectively. We often omit the modulo and assume that all processor indices are taken modulo q implicitly.
If the topology is a grid then some dest values are not allowed for some source
processors. Each SEND call has a matching RECV call:
Note that in the case of a torus, each processor row and processor column
is a ring embedded in the processor grid. Therefore, the two above functions
can use the pipelined implementation of the broadcast on a ring developed in
Section 3.3.4. If in addition links are bidirectional and one assumes a 4-port
model (or in fact just a 2-port model in this case), then the broadcast can
be done faster by sending data from the source processors in both directions
simultaneously. We will see later in this chapter that this does not change the
asymptotic performance of the broadcast. If the topology is not a torus but
links are bidirectional, then this broadcast can be implemented by sending
messages both ways from the source processor. If the topology is not a torus
and links are unidirectional, then these functions cannot be implemented. We
assume that a processor that calls these functions and is not in the relevant
processor row or column returns immediately. This assumption will simplify
the pseudo-code of our algorithms by removing the need for processor row
and column indices before calling BROADCASTROW() or BROADCASTCOL().
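The index arithmetic behind these primitives can be sketched as follows; the helper names below are ours, not the book's API.

```python
# Neighbors of processor P(i, j) on a q x q torus, and the members of its
# processor row and column (the groups involved in BROADCASTROW / BROADCASTCOL).
def neighbors(i, j, q):
    return {"north": ((i - 1) % q, j),
            "south": ((i + 1) % q, j),
            "west":  (i, (j - 1) % q),
            "east":  (i, (j + 1) % q)}

def row_members(i, q):
    return [(i, j) for j in range(q)]

def col_members(j, q):
    return [(i, j) for i in range(q)]

print(neighbors(0, 0, 4))   # wraps around: north is (3, 0), west is (0, 3)
```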
topology, i.e., a ring, we had used a natural 1-D data distribution. Here, our
2-D topology naturally induces a 2-D data distribution. We define m = n/q.
The standard approach is to assign an m × m block of each matrix to each processor according to the grid topology. More precisely, processor Pi,j , 0 ≤ i, j < q, holds matrix elements Ak,l , Bk,l , and Ck,l with i·m ≤ k < (i + 1)·m and j·m ≤ l < (j + 1)·m. We denote the three matrix blocks assigned to processor Pi,j by Âi,j , B̂i,j , and Ĉi,j , as depicted in Figure 5.2 for matrix A. All algorithms hereafter use this distribution scheme.
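The induced index mapping is simple enough to state as code; the helper names below are ours.

```python
# 2-D block distribution: global element (k, l) of an n x n matrix belongs to
# the m x m block held by processor P(k // m, l // m), with m = n / q.
def owner(k, l, n, q):
    m = n // q
    return (k // m, l // m)          # processor coordinates (i, j)

def local_indices(k, l, n, q):
    m = n // q
    return (k % m, l % m)            # position inside the local block

n, q = 8, 4
print(owner(5, 2, n, q), local_indices(5, 2, n, q))   # -> (2, 1) and (1, 0)
```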
It turns out that this so-called “outer-product algorithm” [1, 56, 77] leads
to a particularly simple and elegant parallelization on a torus of processors.
The algorithm proceeds in n steps, that is n iterations of the outer loop. At
each step k, Ci,j is updated using Ai,k and Bk,j . Recall that all three matrices
are partitioned in q² blocks of size m × m, as in the right side of Figure 5.2.
The algorithm above can be written in terms of matrix blocks and of matrix
multiplications, and it proceeds in q steps as follows:
for k = 0 to q − 1 do
  for i = 0 to q − 1 do
    for j = 0 to q − 1 do
      Ĉi,j ← Ĉi,j + Âi,k × B̂k,j
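To make the blocked triple loop above concrete, here is a short sequential NumPy simulation of the outer-product algorithm (no real communication is performed); all names are ours.

```python
# Sequential simulation of the outer-product algorithm on a q x q logical grid:
# at step k, block column k of A and block row k of B are "broadcast" and every
# block C_{i,j} is updated with the corresponding block product.
import numpy as np

def outer_product_mm(A, B, q):
    n = A.shape[0]
    m = n // q
    blk = lambda M, i, j: M[i*m:(i+1)*m, j*m:(j+1)*m]
    C = np.zeros_like(A)
    for k in range(q):
        for i in range(q):            # every processor row receives block (i, k) of A
            for j in range(q):        # every processor column receives block (k, j) of B
                blk(C, i, j)[:] += blk(A, i, k) @ blk(B, k, j)
    return C

n, q = 8, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(outer_product_mm(A, B, q), A @ B)
```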
The above analysis is for the 1-port model, with the horizontal and the ver-
tical broadcasts happening in sequence at each step. With the 4-port model,
both broadcasts can occur concurrently, and the execution time of the algo-
rithm is obtained by removing the factor 2 in front of each Tbcast in the above
equation. When n becomes large, Tbcast ∼ n2 b/p, and thus T (m, q) ∼ n3 w/p,
which shows that the algorithm achieves an asymptotic efficiency of 1. This
algorithm is used by the ScaLAPACK [36] library, albeit often using a block-
cyclic data distribution (see Section 5.4).
designers have striven to reduce communication costs for parallel matrix mul-
tiplication. We discuss below in what way a grid topology is advantageous
compared to a ring topology.
In practice for large values of n, up to a point, the algorithm’s execution time
can be dominated by communication time. This happens, for instance, when
the ratio w/b is low. The communication time (which is then approximately
equal to the execution time) on a grid, using the outer-product algorithm described in the previous section, would be 2n²b/√p, assuming a 1-port model.
The communication time of the matrix multiplication algorithm on a ring, developed and analyzed in Section 4.2, is p · (n²/p) · b = n²b. The time spent in communication when using a grid topology is thus a factor √p/2 smaller than
when using a ring topology. This is easily seen when examining the communi-
cation patterns of both algorithms. When using a ring topology, at each step
the communication time is equal to that needed for sending n2 /p elements
between neighbor processors, and there are p such steps. By contrast, for the
algorithm on a grid, the communication at each step involves the broadcast of twice n²/p matrix elements, and there are √p such steps. With the pipelined implementation of the broadcast, broadcasting n²/p matrix elements in a processor row or column can be done in approximately the same time as sending n²/p matrix elements from one processor to another on the ring (provided that n is not too small). Since there are two broadcasts, at each step the algorithm on the grid spends twice as much time communicating as on the ring. But it performs a factor √p fewer steps! So the algorithm on a grid spends a factor √p/2 less time communicating than the algorithm on a ring. With a 4-port model, this factor is √p.
The above advantage of the grid topology can be attributed to the presence
of more network links and to the fact that many of these links can be used
concurrently. In fact, for matrix multiplication, the 2-D data distribution
induced by a grid topology is inherently better than the 1-D data distribution
induced by a ring topology, regardless of the underlying physical topology!
To see this, let us just compute the total amount of data that needs to be
communicated in both versions of the algorithm.
The algorithm on a ring communicates p matrix stripes, each containing n²/p elements, at each step, for p steps, amounting to a total of p·n² matrix elements sent on the network. The algorithm on a grid proceeds in √p steps. At each step, 2 × √p blocks of n²/p elements are sent, each to √p − 1 processors, for a total of 2 · √p · (√p − 1) · n²/p ≤ 2n² elements. Since there are √p steps, the total number of matrix elements sent on the network is lower than 2√p·n², i.e., at least a factor √p/2 lower than in the case of the algorithm on a ring! We conclude that when using a 2-D data distribution one inherently sends less data than when using a 1-D distribution, by a factor that increases
with the number of processors. Although we do not show it here formally,
this result is general and holds for any (reasonable) matrix multiplication al-
gorithm. The implication of this result is that, for the purpose of matrix
multiplication, using a grid topology (and the induced 2-D data distribution)
is at least as good as using a ring topology, and possibly better. For instance,
when implementing a parallel matrix multiplication in a physical topology on
which all communications are serialized (e.g., on a bus architecture like a non-
switched Ethernet network), one should opt for a logical grid topology with a
2-D data distribution to reduce the amount of transferred data. Recall how-
ever that for n sufficiently large, the two logical topologies become equivalent
with execution time dominated by computation time.
FIGURE 5.4: Block data distribution of matrices A and B after the preskew-
ing phase of the Cannon algorithm (on a 4 × 4 processor grid).
We depict the first two steps of the algorithm in Figure 5.5, which shows
which block multiplications are performed by each processor; the multiplication symbol in the figure denotes a block-wise matrix multiplication. For instance, in the first step, processor P2,3 updates its block of matrix C, Ĉ2,3 , by adding to it the result of the Â2,1 × B̂1,3 product, while processor P1,0 adds Â1,1 × B̂1,0 to Ĉ1,0 . In the second step, blocks of A and B have been shifted horizontally and vertically, as seen in the figure. So during this step processor P2,3 adds Â2,2 × B̂2,3 to Ĉ2,3 , while processor P1,0 adds Â1,2 × B̂2,0 to Ĉ1,0 . Intuitively one can see that eventually each processor Pi,j will have computed all Âi,l × B̂l,j products, l = 0, . . . , q − 1, needed to obtain the final value of Ĉi,j .
FIGURE 5.5: The first two steps of the Cannon algorithm on a 4 × 4 grid
of processors, where at each step each processor multiplies one block of A and
one block of B and adds this product to a block of C.
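As a complement to the figure, here is a hedged NumPy sketch of Cannon's algorithm, simulated sequentially: the preskewing, the q multiply steps, and the unit shifts of A (left) and B (up). All names are illustrative; no real messages are exchanged.

```python
# Sequential simulation of Cannon's algorithm on a q x q torus.
import numpy as np

def cannon_mm(A, B, q):
    n = A.shape[0]; m = n // q
    # a[i][j] is the block currently held by processor P(i, j)
    a = [[A[i*m:(i+1)*m, ((j+i) % q)*m:((j+i) % q + 1)*m].copy() for j in range(q)]
         for i in range(q)]                          # preskew: shift row i of A left by i
    b = [[B[((i+j) % q)*m:((i+j) % q + 1)*m, j*m:(j+1)*m].copy() for j in range(q)]
         for i in range(q)]                          # preskew: shift column j of B up by j
    c = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                c[i][j] += a[i][j] @ b[i][j]         # local block multiplication
        a = [[a[i][(j+1) % q] for j in range(q)] for i in range(q)]   # shift A left by one
        b = [[b[(i+1) % q][j] for j in range(q)] for i in range(q)]   # shift B up by one
    return np.block(c)

n, q = 8, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(cannon_mm(A, B, q), A @ B)
```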
This algorithm developed by Fox in [56] was originally designed for Cal-
Tech’s hypercube platform but it uses a torus logical topology. The algorithm
performs broadcasts of blocks of matrix A and is also known as the broadcast-
multiply-roll algorithm. Unlike Cannon’s algorithm, the Fox algorithm does
not require any preskewing or postskewing of the matrices. The algorithm
proceeds in q steps and at each step it performs a vertical shift of blocks
of matrix B. At step k, 1 ≤ k ≤ q, the algorithm also performs horizontal broadcasts of all blocks of the k-th block diagonal of matrix A within processor rows. A block Âi,j is on the k-th block diagonal, k ≥ 1, if j = (i + k − 1) mod q. Therefore, at step k processor Pi,(i+k−1) mod q sends its block of matrix A to all processors Pi,j , j ≠ (i + k − 1) mod q. The Fox algorithm, written in high-level pseudo-code, is shown in Algorithm 5.3.
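Since the algorithm listing is not reproduced here, the following sequential NumPy sketch illustrates the broadcast-multiply-roll structure; it is a simulation under our own naming, not a message-passing implementation.

```python
# Sequential simulation of the Fox (broadcast-multiply-roll) algorithm.
import numpy as np

def fox_mm(A, B, q):
    n = A.shape[0]; m = n // q
    blk = lambda M, i, j: M[i*m:(i+1)*m, j*m:(j+1)*m]
    b = [[blk(B, i, j).copy() for j in range(q)] for i in range(q)]   # current B blocks
    c = [[np.zeros((m, m)) for _ in range(q)] for _ in range(q)]
    for k in range(q):
        for i in range(q):
            d = (i + k) % q                   # column of the block on the k-th diagonal, row i
            a_bcast = blk(A, i, d)            # broadcast within processor row i
            for j in range(q):
                c[i][j] += a_bcast @ b[i][j]
        b = [[b[(i+1) % q][j] for j in range(q)] for i in range(q)]   # vertical shift (roll) of B
    return np.block(c)

n, q = 8, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(fox_mm(A, B, q), A @ B)
```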
Like for the Cannon algorithm we illustrate the first two steps of this al-
gorithm in Figure 5.6. The figure shows which matrix block multiplications
are performed by each processor at each step. During the first step relevant
blocks of A are blocks Âi,i , 0 ≤ i < q, that is blocks of the first block diagonal of matrix A. For instance, in the first step, processor P2,3 updates its block of matrix C, Ĉ2,3 , by adding to it the result of the Â2,2 × B̂2,3 product, while processor P1,0 adds Â1,1 × B̂1,0 to Ĉ1,0 . In the second step, relevant blocks of A are blocks Âi,(i+1) mod q , that is blocks on the second block diagonal. During this step, processor P2,3 adds Â2,3 × B̂3,3 to Ĉ2,3 , and processor P1,0 adds Â1,2 × B̂2,0 to Ĉ1,0 . Here again it is easy to see that eventually processor Pi,j will have computed all block products necessary to obtain the final value of Ĉi,j .
FIGURE 5.6: The first two steps of the Fox algorithm on a 4 × 4 grid of
processors, where at each step each processor multiplies one block of A and
one block of B and adds this product to a block of C.
Figure 5.7 shows the first two steps of the algorithm. The blocks of C that
are updated by the global sum operations are shown in boldface, along the
first diagonal for the first step, and the second diagonal for the second step.
In this sense the meaning of the “+=” sign in this figure is different from that in Figures 5.5 and 5.6. Indeed, only q blocks of matrix C are updated at each step, as opposed to q² blocks. But each block is updated only once during the execution of the algorithm. For instance, in the second step, processor P2,3 updates block Ĉ2,3 by adding to it the three products Â2,0 × B̂0,3 , Â2,1 × B̂1,3 , and Â2,2 × B̂2,3 , received from processors P2,0 , P2,1 , and P2,2 respectively, and the locally computed product Â2,3 × B̂3,3 .
d d
FIGURE 5.7: The first two steps of the Snyder algorithm on a 4 × 4 grid of
processors, where at each step each processor multiplies one block of A and
one block of B. All such products are added together and added to the blocks
of C shown in boldface, within each processor row.
Cannon Algorithm
Let us start with the 4-port model. The algorithm's execution time is the sum of two terms: T^{4p}_skew , the time for preskewing and postskewing the matrices, and T^{4p}_comp , the time to perform the computation (“4p” stands for 4-port).
For the preskewing and postskewing steps, one can limit the number of shifts of blocks of matrices A and B to ⌊q/2⌋. To understand this, consider a processor row and the shifts of blocks of matrix A that must be performed by processors in that row. It should be clear to the reader that performing q − 1 left shifts is equivalent to performing one right shift. More generally, performing x left shifts is equivalent to performing q − x right shifts. Therefore, in the worst case, a processor row only needs to perform ⌊q/2⌋ shifts, this maximum number of shifts being performed by the middle processor row(s). Therefore, the preskewing of matrix A takes time ⌊q/2⌋ (L + m²b). The time to preskew
Fox Algorithm
The Fox algorithm proceeds in q steps, with no preskewing or postskewing,
and thus the execution time is simply the time taken by a single step multiplied
by q. The computation time at each step is m3 w, just like for the Cannon
algorithm.
At each step, there are q concurrent broadcasts of blocks of matrix A, one
broadcast in each processor row. Using the pipelined broadcast presented in
Section 3.3.4 with the optimal packet size, the time for the broadcast, Tbcast , is:

Tbcast = ( √((q − 2)L) + √(m²b) )² .
Due to the fact that links are bidirectional, with the 4-port model the broad-
cast time above can be reduced by having the source processor simultaneously
send data in both directions on the ring. With this technique, the execution
time of the broadcast is obtained by replacing q in the above equation by
dq/2e. This is because the first packets sent in both directions go through
at most dq/2e hops. Note that the asymptotic performance of the broadcast
when m gets large is unchanged by this modification.
The shift of the blocks of matrix B can be done in time L + m²b. In the 4-port model, this shift can occur concurrently with the broadcasts of the blocks of matrix A at each step. As a result, the time to perform the shift is completely hidden (Tbcast ≥ L + m²b, for q > 1). Computation at each step
occurs concurrently with these communications. However, processors must all
wait for the first broadcast to complete before proceeding. Then, at each step
the computation for that step, the shift for that step and the broadcast for
the next step can occur concurrently. In the last step, only the computation
and the shift occur. We obtain the overall execution time in the 4-port model,
T^{4p} , as:

T^{4p} = ( √((⌈q/2⌉ − 2)L) + √(m²b) )²
       + (q − 1) max( m³w , ( √((⌈q/2⌉ − 2)L) + √(m²b) )² )
       + max( m³w , L + m²b ) .
With the 1-port model, horizontal and vertical communications cannot oc-
cur concurrently and during the broadcast the source can only send data in
one direction at a time. Therefore, the execution time is simply:
T^{1p} = ( √((q − 2)L) + √(m²b) )²
       + (q − 1) max( m³w , ( √((q − 2)L) + √(m²b) )² + L + m²b )
       + max( m³w , L + m²b ) .
Snyder Algorithm
The major differences between the Snyder algorithm and the previous ones
are the use of a pre- and post-transposition of matrix B and the use of global
sums to compute blocks of matrix C by accumulation. Let us discuss both
these operations.
If our logical topology is a torus this time can be cut roughly in half. Indeed,
with a torus, transposing the block initially held by processor Pq,1 takes only
two communication steps, using the loopback links. The factor 2(q − 1) above
then becomes 2⌊q/2⌋. The 1-port model precludes concurrent communications
of the blocks originally in the upper half of the matrix and of the blocks
originally in the lower half of the matrix. Therefore, the execution time is
multiplied by a factor 2.
Another approach could consist in shifting the blocks in each row so that
they are in their destination columns. As soon as a block has reached its
destination column it can be shifted vertically to reach its destination row
provided there is no link contention. While in a 4-port model this algorithm
has the same execution time as the algorithm above, in the 1-port model it
uses a smaller number of steps. Writing the algorithm to satisfy the “there
is no link contention” constraint requires care, and we leave it as an exercise
for the reader. Note that enforcing this constraint is not necessary to have a
correct algorithm, but it may be difficult to analyze its performance.
Finally, note that if wormhole routing is implemented on the underlying
platform, then a clever recursive transposition algorithm described in [45]
leads to a shorter execution time:

T^{1p}_transpose = ⌈log₂ q⌉ (L + m²b) .
Let us first consider the 4-port model when developing the performance
analysis of the rest of the algorithm. The execution consists of a sequence of q
steps, where each step consists in a product of matrix blocks, a shift of blocks
of matrix B along processor columns, and a global sum of blocks of matrix C
along processor rows (see Algorithm 5.4). The first global sum can only be
done after the matrix products have been performed. These matrix products
take time m3 w, and can be done concurrently with the shift of matrix B,
Note our use of w′ in this equation to denote the time needed to add two
matrix elements together, which is lower than w, the time needed to multiply
two matrix elements together and add them to a third matrix element.
Alternatively, one can design a pipelined algorithm on a unidirectional ring
in the same fashion as the pipelined broadcast developed in Section 3.3.4. Note
that it is possible to implement a faster broadcast using the fact that our rings
are bidirectional, as for the Fox algorithm. However, note that here we have
both communications and computations (for the summing of blocks of C), so
this technique is only useful if communication time is larger than computation
time. We leave the development of a bidirectional global sum of blocks of C
as an exercise for the reader. We split the m2 matrix elements to be sent and
added by each processor into r individual chunks. Let us consider a given
processor row i. Without loss of generality let us also assume that Pi,q−1
is the destination processor and that communication occurs in the direction
of increasing processor column indices. In the first step of the algorithm,
processor P0 sends a chunk of m2 /r matrix elements to P1 . In the second step,
processor P1 adds these matrix elements to the corresponding matrix elements
it owns (to perform the global sum), and receives the next chunk from P0 .
As in the case of the simple pipelined broadcast, one question here is to find
the optimal value for r. Fortunately, we can apply the same technique. For
instance, assuming that b ≥ w′ , the maximum in the above equation becomes simply L + m²b/r. The optimal execution time is obtained for r = ropt , where

ropt = √( (2q − 3) m²b / L ) .
Recall that the above can be obtained by directly applying the “goat in a pen”
theorem, or by simply computing the derivative of Γ(r) with respect to r (see
Sections 3.2.2 and 3.3.4). The optimal execution time for the global sum when
r = ropt , Γopt , is then:
Γopt = (2q − 2)L + m²b + 2m √((2q − 3) b L) .
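As a quick numeric check, the sketch below evaluates ropt and Γopt for illustrative parameter values; the cost function Γ(r) written in the code is our reconstruction (chosen so that its minimum matches the optimum quoted above), not necessarily the exact expression used earlier in the text.

```python
# Numeric illustration of the optimal chunking for the pipelined global sum.
# Assumed cost (our reconstruction, valid when b >= w'):
#   Gamma(r) = (2q - 2 + r) L + m^2 b + (2q - 3) m^2 b / r
from math import sqrt

q, m, L, b = 8, 64, 1e-4, 1e-8
Gamma = lambda r: (2*q - 2 + r) * L + m*m*b + (2*q - 3) * m*m * b / r
r_opt = sqrt((2*q - 3) * m*m * b / L)
Gamma_opt = (2*q - 2) * L + m*m*b + 2*m*sqrt((2*q - 3) * b * L)
print(r_opt, Gamma(r_opt), Gamma_opt)   # Gamma(r_opt) equals the quoted Gamma_opt
```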
The case b < w′ is more involved, and is left as an exercise for the interested reader. We finally have the overall execution time for the algorithm, T^{4p} , in the case b ≥ w′ , as:

T^{4p} = 2 T^{4p}_transpose + max( m³w , L + m²b )
       + (q − 1) max( m³w , (2q − 2)L + m²b + 2m √((2q − 3)bL) )
       + (2q − 2)L + m²b + 2m √((2q − 3)bL) .
With the 1-port model, the shifts of blocks of matrix B cannot occur con-
currently with the global sums of blocks of matrix C. We obtain the overall
execution time as:
T^{1p} = 2 T^{1p}_transpose + max( m³w , L + m²b )
       + (q − 1) [ max( m³w , L + m²b ) + (2q − 2)L + m²b + 2m √((2q − 3)bL) ]
       + L + m²b + (2q − 2)L + m²b + 2m √((2q − 3)bL) .
Conclusion
When n gets large, all these algorithms achieve an asymptotic parallel effi-
ciency of 1. By now it should be clear to the reader that this is not terribly
difficult to achieve for matrix multiplication, given that the computation is
O(n3 ) and the communication O(n2 ). The above performance analyses make
it possible to compare the three algorithms for particular values of n and q
and of the characteristics of the platform. More importantly, the main merit
of going through these admittedly lengthy performance analyses is to expose
the reader to several typical algorithms such as matrix transposition or global
sums, to put principles such as pipelining to use, and to better understand
the impact of 1-port and 4-port models on algorithm design and performance.
[Figure: 2-D block-cyclic data distribution on a 4 × 4 processor grid; matrix block (i, j) is assigned to processor P_{i mod 4, j mod 4}.]
Bibliographical Notes
In addition to the many references cited in this chapter, a valuable reference
for examples of algorithms on 2-D processor grids is, once again, the book
by Kumar et al. [77]. Of interest is also the book by Cosnard et al. [45].
Finally, both the ScaLAPACK [36] and PLAPACK [117] libraries contain many implementations of interesting algorithms on 2-D processor grids.
5.5 Exercises
In the first two exercises, we write the pseudo-code for two algorithms that
we have seen in this chapter, namely the matrix block transposition in Snyder’s
algorithm and a stencil application, on 2-D processor grids. The third exercise
deals with the parallelization of the well-known Gauss-Jordan method for
solving a linear system of equations. Finally, the fourth exercise is a hands-
on implementation of the outer-product matrix multiplication algorithm on a
parallel platform using the MPI message-passing standard.
1. Write the pseudo code for the above algorithm on a q × q 2-D processor
torus, where q divides n.
2. Give a performance model for your algorithm with the 4-port assumption
and the 1-port assumption.
1 GAUSSJORDAN(A, b, x)
2 for j = 0 to n − 1 do
3   for i = 0 to n − 1 do
4     if i ≠ j then
5       f ← Ai,j /Aj,j
6       for k = j to n − 1 do
7         Ai,k ← Ai,k − f × Aj,k
8       bi ← bi − f × bj
9 for i = 0 to n − 1 do
10  xi ← bi /Ai,i
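For reference, here is a compact sequential NumPy version of Gauss-Jordan elimination without pivoting, which can serve to check a parallel implementation; all names are ours.

```python
# Sequential Gauss-Jordan elimination (no pivoting).
import numpy as np

def gauss_jordan(A, b):
    A, b = A.astype(float).copy(), b.astype(float).copy()
    n = len(b)
    for j in range(n):                    # pivot column j
        for i in range(n):
            if i != j:
                f = A[i, j] / A[j, j]     # elimination factor for row i
                A[i, j:] -= f * A[j, j:]
                b[i] -= f * b[j]
    return b / np.diag(A)

A = np.random.rand(5, 5) + 5 * np.eye(5)  # diagonally dominant, so no pivoting needed
b = np.random.rand(5)
assert np.allclose(gauss_jordan(A, b), np.linalg.solve(A, b))
```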
A and matrix B in two arrays. Initialize these arrays with matrix elements
defined as ai,j = i and bi,j = i + j (these are global indices starting at 0). At
the end of the execution each processor holds a piece of matrix C stored in
a third array. Your program should check the validity of the results (indeed,
they can be computed analytically as ci,j = i·n·(n − 1)/2 + i·j·n, using global
indices). Hint: it may be a good idea to use multiple MPI communicators,
one for each processor row, and one for each processor column, as a convenient
way to implement the BROADCASTROW() and BROADCASTCOLUMN() functions.
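A possible starting point is sketched below with mpi4py (the exercise can of course be written in C, using MPI_Comm_split for the row and column communicators). The matrix size and all variable names are illustrative, and the sketch assumes the number of processes is a perfect square.

```python
# Possible mpi4py skeleton for the outer-product exercise.
# Run with e.g.: mpirun -np 16 python outer_product_mpi.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
q = int(round(p ** 0.5))                 # assume p is a perfect square
rank = comm.Get_rank()
i, j = divmod(rank, q)                   # my coordinates on the q x q grid
row_comm = comm.Split(color=i, key=j)    # one communicator per processor row
col_comm = comm.Split(color=j, key=i)    # one communicator per processor column

n = 16 * q                               # global matrix size (illustrative)
m = n // q
# local blocks, initialized with the global-index formulas of the exercise
gi, gj = np.meshgrid(np.arange(i*m, (i+1)*m), np.arange(j*m, (j+1)*m), indexing="ij")
A = gi.astype(float)                     # a_{k,l} = k
B = (gi + gj).astype(float)              # b_{k,l} = k + l
C = np.zeros((m, m))

for k in range(q):
    Ak = A.copy() if j == k else np.empty((m, m))
    Bk = B.copy() if i == k else np.empty((m, m))
    row_comm.Bcast(Ak, root=k)           # BROADCASTROW: block column k of A
    col_comm.Bcast(Bk, root=k)           # BROADCASTCOLUMN: block row k of B
    C += Ak @ Bk

expected = gi * n * (n - 1) / 2 + gi * gj * n   # c_{k,l} = k n(n-1)/2 + k l n
assert np.allclose(C, expected)
```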
2. Run your program on a parallel platform, e.g., a cluster, and plot the
parallel speedup and efficiency for 2, 4, and 8 processors as functions of the
matrix size, n. Each data point should be obtained as the average over 10
runs.
3. If you have not done so, experiment with non-blocking MPI communi-
cation and see whether there is any impact on your program’s performance
when you overlap communication and computation.
5.6 Answers
Exercise 5.1 (Matrix Transposition)
According to the description of the algorithm, a block stored on a processor
in the lower part of the processor grid travels to the right on that processor’s
processor row until the processor on the diagonal of the processor grid is
reached. Then, the block travels upwards in that processor’s column until it
reaches its destination. Blocks stored on processors in the upper part of the
processor grids travel similarly but first downward and then to the left.
Given the above algorithm, processor Pi,i (0 ≤ i ≤ q − 1) is involved in
forwarding a total of 2i blocks. Each processor Pi,j in the lower part of the
matrix sends min(i, j) blocks to its West neighbor and min(i + 1, j + 1) blocks
to its East neighbor, with corresponding block receptions by these neighbors.
Each processor Pi,j in the upper part of the matrix sends min(i, j) blocks
to its North neighbor and min(i + 1, j + 1) blocks to its South neighbor,
with corresponding block receptions by these neighbors. The figure below
shows an example for a 5 × 5 processor grid; for each processor it depicts how
many send/receive operations this processor performs throughout the overall
execution.
Assuming as usual that sends are non-blocking and that receives are block-
ing, we can now write the pseudo-code as shown in Algorithm 5.6, with
m = n/q. The algorithm is written so that the clockwise and the coun-
terclockwise communications happen in parallel. Note that this algorithm
can be further improved by allowing a message reception for step k1 + 1 (resp.
k2 + 1) to occur in parallel with a message sending for step k1 (resp. k2 ),
as for instance in Algorithm 4.1. This improvement leads to the performance
model given in Section 5.3.4.
FIGURE 5.9: Depiction of the block held by a processor, in the center, and
of the blocks held by its four neighbors. Each processor can update the inside
cells of its block (in white) independently of its neighbors. The edge cells (in
gray) must be exchanged between neighbors so that they can be updated by
all the processors.
. Question 2. Each processor can compute internal cells of its block while
exchanging edge cells with its neighbors. Let us consider a processor not on
the edge of the processor grid. Using the same notation as in Section 4.5, the
time for this processor to update its internal cells is:

( n/q − 1 )² w ,

and the time for this processor to exchange its edge cells with its neighbors
Chapter 6
Load Balancing on Heterogeneous Platforms
a dynamic strategy can lead to poor load balancing, as we will see later in
this chapter. In such cases, one must resort to static task allocation schemes,
which are the focus of this chapter.
1 ALLOCATION((t1 , . . . , tp ), M )
  { Initialization: compute ci values such that ci × ti ≈ Constant and c1 + c2 + . . . + cp ≤ M }
2 for i = 1 to p do
3   ci = ⌊ (1/ti ) / (Σ_{k=1}^{p} 1/tk ) × M ⌋

After the initialization step of the algorithm, for all i, ci ≤ (1/ti ) / (Σ_{k=1}^{p} 1/tk ) × M .
Since M = Σ_{k=1}^{p} ok ≤ oj tj × Σ_{k=1}^{p} 1/tk , we have ci ti ≤ M / (Σ_{k=1}^{p} 1/tk ) ≤ oj tj , and
ok′ . We have tk′ (ck′ + 1) ≤ tk′ ok′ ≤ tj oj , and the choice of k′ implies that tk (ck + 1) ≤ tk′ (ck′ + 1). Therefore, invariant (I) is satisfied after this step. Finally, the obtained allocation, (c1 , c2 , . . . , cp ), is optimal because maxi {ci × ti } ≤ oj tj = maxi (oi ti ).
Cycle times: t1 = 3, t2 = 5, t3 = 8

Num. of tasks   c1   c2   c3   tmean   Chosen proc.
0               0    0    0    -       P1
1               1    0    0    3       P2
2               1    1    0    2.5     P1
3               2    1    0    2       P3
4               2    1    1    2       P1
5               3    1    1    1.8     P2
6               3    2    1    1.67    P1
7               4    2    1    1.71    P1
8               5    2    1    1.87    P2
9               5    3    1    1.67    P3
10              5    3    2    1.6     -

[Gantt charts of the allocations “After Step 3” and “After Step 4” are not reproduced here.]
FIGURE 6.1: Steps of the incremental algorithm with three processors with
cycle times t1 = 3, t2 = 5, and t3 = 8.
Consider step 4 as an example, that is the step in which the fourth task
is allocated to a processor. The algorithm bases the allocation decision on
the allocation obtained so far, that is (c1 , c2 , c3 ) = (2, 1, 0). This allocation is
depicted on the Gantt chart labeled as “After step 3” on the right-hand side
of Figure 6.1. This diagram depicts the three tasks already allocated, two
to processor P1 and one to processor P2 , with time along the vertical axis.
1 INCREMENTALDISTRIBUTION(t1 , . . . , tp , M )
{ Initialization }
2 C = (c1 , . . . , cp ) = (0, . . . , 0)
3 A = {}
{ Iterative computation of allocation A }
4 for n = 1 to M do
5 i = argmin_{1 ≤ j ≤ p} (tj × (cj + 1))
6 A[n] = i
7 ci ← ci + 1
8 return A
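A direct transcription of the above pseudo-code in Python, reproducing the example of Figure 6.1 (names are ours):

```python
# Task n goes to the processor that would finish it earliest,
# i.e., the one minimizing t_j * (c_j + 1).
def incremental_distribution(t, M):
    p = len(t)
    c = [0] * p
    A = []
    for _ in range(M):
        i = min(range(p), key=lambda j: t[j] * (c[j] + 1))
        A.append(i + 1)          # store 1-based processor indices, as in the text
        c[i] += 1
    return A, c

A, c = incremental_distribution([3, 5, 8], 10)
print(A)   # [1, 2, 1, 3, 1, 2, 1, 1, 2, 3]
print(c)   # [5, 3, 2]
```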
FIGURE 6.2: Static cyclic allocation of rows to three heterogeneous pro-
cessors with cycle times t1 = 3, t2 = 5, and t3 = 8, for the stencil application.
and ci × ti = 120 for all i. The execution time is (120/79) · N , and it is the optimal value that can be achieved.
One can thus increase the granularity of the application (to trade off par-
allelism for lower communication overhead) by choosing π values that are
multiples of the smallest π value that achieves perfect load balancing. This
is exactly the same process as when using larger blocks of rows in the cyclic
allocation used in the homogeneous case. Note however that the smallest π
value that achieves perfect load balancing may be large. With p processors of cycle times ti , 1 ≤ i ≤ p, the smallest π value is π = L × Σ_{i=1}^{p} 1/ti , where L = lcm(t1 , . . . , tp ) is the least common multiple of the cycle times. In our previous example we had L = 120. But for instance, with cycle times t1 = 11, t2 = 23, and t3 = 31, we obtain the least common multiple L = 7,843 and the smallest π value is then π = 1,307. With such a large period, unless the total
number of rows N is orders of magnitude larger than π, the reduced paral-
lelism for the first and second step of the application execution would have a
prohibitive impact on performance. In this case, one may opt for a value of π
that does not lead to perfect load balancing but that allows processors P2 and
P3 to start computing earlier. For instance, using π = 88 leads to an execu-
tion in which the load imbalance between the processors at each step is lower
than 1% (i.e., the relative differences between all processor execution times
at each step are all below 1%). Indeed, we obtain c1 × t1 = 48 × 11 = 528,
c2 × t2 = 23 × 23 = 529 and c3 × t3 = 17 × 31 = 527.
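A small sketch of these computations (the perfectly balanced period, and the shares for a chosen, smaller period); the rounding rule used for the non-perfect period is ours.

```python
# Period computations for heterogeneous cyclic allocation.
from math import lcm   # lcm with several arguments requires Python >= 3.9

def perfect_period(t):
    L = lcm(*t)
    return L, [L // ti for ti in t]          # pi is the sum of the shares

def shares_for_period(t, pi):
    # give each processor a share roughly inversely proportional to its cycle time
    inv_sum = sum(1 / tk for tk in t)
    c = [round(pi * (1 / ti) / inv_sum) for ti in t]
    return c, [ci * ti for ci in c]          # per-processor work within the period

print(perfect_period([3, 5, 8]))             # (120, [40, 24, 15]) -> pi = 79
print(perfect_period([11, 23, 31]))          # (7843, [713, 341, 253]) -> pi = 1307
print(shares_for_period([11, 23, 31], 88))   # (48, 23, 17): works 528, 529, 527 as above
```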
a simple scaling; this processor then broadcasts these updated elements to all
other processors; all processors can then update all elements of columns k + 1
and higher on rows k + 1 and higher. Note that this algorithm is non-blocked,
in the sense that columns are processed one at a time. While this is inefficient
on processors with a memory hierarchy, it does not change the spirit of the
algorithm. Let us consider the execution of this algorithm on p processors, Pi , i = 1, . . . , p, with cycle times t1 , . . . , tp .
The performance analysis of the LU factorization algorithm in Section 4.4
shows that the bulk of the computation time is due to the column updates,
while the time for performing column preparation is asymptotically negligible.
Our goal here is then to determine an allocation of columns to the processors
that leads to a well-balanced execution of the column updates. Let us consider
the first step of the algorithm. Once the first column has been prepared, all
columns that need to be updated can be updated independently. Therefore, if
the matrix is of size n, n − 1 update tasks must be allocated to the processors.
A simple idea is to use Algorithm 6.1 to determine the column allocations.
While this would lead to good load balance during the first step of the algo-
rithm, unfortunately the number of columns to be updated decreases as the
algorithm makes progress. In the second step, only n − 2 columns need to be
updated, and thus the allocation of columns to processors need to be recom-
puted to lead to good load balancing. One then faces a conundrum. On the
one hand, as the algorithm makes progress the column allocation likely leads
to worsening load balance. On the other hand, redistributing the columns
among the processors at each step according to the optimal column allocation
at this step makes the algorithm much more complicated and would likely
cause large overhead. One way to strike a compromise would be to only
re-allocate columns periodically, every x iterations, where x is chosen appro-
priately given n and the characteristics of the underlying platform. However,
it is difficult to determine the best choice for x and one must resort to some
empirical method based on previous runs and benchmarks.
It turns out that there is an elegant solution to the above conundrum,
which does not require column re-allocations among the processors and which
achieves optimal load balance at each step of the algorithm. Let us denote
the (n − k − 1) update tasks at step k = 0, . . . , n − 1, of the algorithm as
uk+1 , . . . , un−1 . This is a slight abuse of notation as we do not use a subscript
to indicate the algorithm step, meaning that, for instance, un−1 denotes the
last update task at all steps. We seek an allocation of u1 , . . . , un−1 onto the
p processors, with the following constraint: for each i ∈ {1, . . . , n − 1}, the
number of tasks allocated to processor Pj among tasks ui , . . . , un−1 should be
(approximately) proportional to its cycle time tj .
The attentive reader will recognize that the above constraint corresponds
almost exactly to the task allocation computed by our incremental al-
gorithm from Section 6.1.2. The only difference is that we need to run the
algorithm in reverse. Indeed, the algorithm produces an optimal task alloca-
tion for all subsets [1, i] of [1, n − 1], while we need an optimal task allocation
for all subsets [i, n − 1] of [1, n − 1]. Let us go back to our example of 3 pro-
cessors with cycle times t1 = 3, t2 = 5 and t3 = 8 for π = 10 column update
tasks. To obtain our desired allocation for the LU factorization algorithm we
can just read the table on the left-hand side of Figure 6.1 from bottom to top,
thus allocating column 0 to P3 , column 1 to P2 , and so on. The entire pattern
is (P3 P2 P1 P1 P2 P1 P3 P1 P2 P1 ). For illustration purposes Figure 6.3 depicts two
allocations. On the left-hand side the figure shows the “reversed” allocation
computed by the above algorithm, which is optimally load-balanced at each
step of the LU factorization. On the right-hand side the figure shows the
“non-reversed” allocation, which is optimally load-balanced only for the first
step of the LU factorization. Above each allocation we plot the compute time
at each algorithm step, thus showing that using the reversed heterogeneous
allocation leads to better performance.
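The following sketch reproduces this comparison numerically, under the simple model that the compute time of step k is the maximum, over the processors, of (cycle time × number of columns of index ≥ k that they own); the model coding and all names are ours.

```python
# Reversed versus non-reversed allocation for the LU-like column updates.
def total_time(pattern, t):
    total = 0
    for step in range(len(pattern)):
        remaining = pattern[step:]                    # columns still to be updated
        loads = [t[j] * remaining.count(j + 1) for j in range(len(t))]
        total += max(loads)                           # makespan of this step
    return total

t = [3, 5, 8]
forward = [1, 2, 1, 3, 1, 2, 1, 1, 2, 3]              # incremental allocation, in order
reversed_alloc = forward[::-1]                        # (P3 P2 P1 P1 P2 P1 P3 P1 P2 P1)
print(total_time(reversed_alloc, t), total_time(forward, t))   # 99 116
```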
FIGURE 6.3: The reversed allocation (left, total time = 99) and the non-reversed allocation (right, total time = 116); the compute time at each step is plotted above each allocation.
Let us note that depending on the cycle time of the processors and on the
size of the matrices, it is not always possible to achieve perfect load balancing.
See the example shown in Figure 6.6, for a 2 × 2 processor grid with the
four processor cycle times indicated in the corresponding matrix blocks. We
arbitrarily normalize r1 and c1 to 1. We observe that to balance the load
between processor P1,1 and processor P1,2 , we need c2 = 1/2 (which may only
[Figure 6.6: a 2 × 2 processor grid example with t1,1 = 1 and t1,2 = 2; the figure marks one processor as underutilized.]
The load balancing problem can thus be stated as the following optimization problem:

min_{(Σi ri = 1 ; Σj cj = 1)}  max_{i,j} { ri × cj × ti,j } .
Given a solution to this problem, one computes the actual dimensions of the
rectangular blocks assigned to each processor by multiplying the ri and cj
values by n, the original matrix dimension. One then rounds all values to
integers so that the sums of the ri values and the sum of the cj values are
both equal to n.
The above formulation of the load balancing problem as an optimization
problem does not lend itself to an easy solution. But at any rate, the problem
that we need to solve is much more complex. Indeed, the above optimization
problem is for a given arrangement of the processors in a 2-D grid. However,
there are p! such arrangements! So we must compute the optimal solution
to the optimization problem for all possible arrangements, and then pick the
optimal solution for the optimal arrangement. As expected, this problem is
NP-complete, which is shown in the next section.
that the processors be arranged in a grid at all! At each step of the outer-
product algorithm, there is a horizontal broadcast of a column of A and a
vertical broadcast of a row of B. These broadcasts involve different numbers
of source and destination processors depending on where the column or row is
located. The figure shows an example vertical broadcast of a row of B, which
is partially held by four processors. For instance, the processor holding the
top-right block of the matrix is involved in receiving data from all four source
processors, while the processor holding the bottom-left block of the matrix is
involved in receiving data from only one source processor.
At each step each destination processor receives an amount of data propor-
tional to the half-perimeter of the rectangular block it holds, typically from
more than two source processors. This is in contrast with the data distribu-
tion seen in Figure 6.4, with which at each step each destination processors is
only engaged in two receives (one horizontal and one vertical).
rectangular blocks are fixed, one can adjust their shapes to lead to the lowest
amount of communication.
Depending on the network used by the underlying physical platform, dif-
ferent communication models are possible. For instance, all communications
could be sequential if the processors are interconnected by, say, a non-switched
Ethernet network. This could be the case if the participating processors are
heterogeneous workstations in some laboratory for instance. Probably more
typical for modern parallel platforms, such as heterogeneous commodity clus-
ters, the communications can happen concurrently due to the use of a switched
interconnect. If communications can happen concurrently, then one wishes to
minimize the maximum of the half-perimeters of the rectangular blocks. If
instead the communications happen sequentially, then one wishes to minimize
the sum of the half-perimeters.
Given the above, we can now see one of the key points highlighted in Chap-
ter 5: if the platform consists of q 2 homogeneous processors one can achieve
lower communication costs by using a 2-D distribution on a grid of proces-
sors than by using a 1-D distribution on a ring. On a ring, the sum of the
half-perimeters of the matrix blocks is 1 + q², while on a q × q grid it is q + q = 2q.
Considering only the geometrical interpretation of the problem, both optimization problems above are easily stated as follows: how can one partition a unit square into p rectangles with given areas s1 , s2 , . . . , sp , with Σ_{i=1}^{p} si = 1, in a way that minimizes:
FIGURE 6.9: Column partition of the unit square with C = 3 columns containing k1 = 5, k2 = 3, and k3 = 4 rectangles, respectively.
The algorithm to solve this problem uses dynamic programming and relies
on the two following ideas:
1. It renumbers variables s1 , . . . , sp so that s1 ≤ s2 ≤ . . . ≤ sp .
2. It iteratively constructs p functions fC for values of C going from 1 to p. For q ∈ {1, . . . , p}, fC (q) is defined as the sum of the half-perimeters in an optimal partition of a rectangle with height 1 and width Σ_{i=1}^{q} si into C columns and q rectangles with areas s1 , . . . , sq .
The key idea behind the algorithm is that it is straightforward to compute function fC recursively based on function fC−1 as follows:

fC (q) = min_{a ∈ [C−1, q−1]} ( 1 + (q − a) Σ_{a < i ≤ q} si + fC−1 (a) )        (6.1)
TABLE 6.1: Values of fC (q) and of a0 (separated by a “|”) for our 8-rectangle example. In each column, the value marked with a star is the optimal one.

C \ q   q=1        q=2        q=3        q=4        q=5        q=6        q=7        q=8
C = 1   1.05*| 0   1.2*| 0    1.54*| 0   2.12*| 0   2.9*| 0    4 | 0      5.90 | 0   9 | 0
C = 2              2.10 | 1   2.28 | 2   2.56 | 2   2.94 | 3   3.50*| 3   4.38*| 4   5.76 | 5
C = 3                         3.18 | 2   3.38 | 3   3.66 | 4   4 | 4      4.58 | 5   5.50*| 6
C = 4                                    4.28 | 3   4.48 | 4   4.78 | 5   5.20 | 6   5.88 | 7
C = 5                                               5.38 | 4   5.60 | 5   5.98 | 6   6.50 | 7
C = 6                                                          6.50 | 5   6.80 | 6   7.28 | 7
C = 7                                                                     7.70 | 6   8.10 | 7
C = 8                                                                                9 | 7
The recursion is initialized with f1 (q) = 1 + q × Σ_{i=1}^{q} si , which simply gives the sum of the half-perimeters of a rectangle with height 1 and width Σ_{i=1}^{q} si partitioned into 1 column and q rectangles with areas s1 , . . . , sq .
To better understand how the algorithm works, let us apply it on an exam-
ple. Consider p = 8 rectangles with areas (0.05; 0.05; 0.08; 0.1; 0.1; 0.12; 0.2; 0.3).
The algorithm recursively computes all fC (q) values for 1 ≤ q, C ≤ 8, as shown in Table 6.1. In each column of the table we mark the optimal value, i.e., the one with the smallest fC (q) value, with a star. Since we wish to partition the
unit square in 8 rectangles, we look at column q = 8 and find that the optimal
fC (q) value, 5.5, is achieved for C = 3, indicating that the optimal partition
consists of 3 columns. Furthermore, the optimal fC (q) value is achieved for
a0 = 6. Therefore, the last column of the optimal partition must contain
8 − 6 = 2 rectangles. We now look at column q = 6 of the table and find out
that the optimal fC (q) value is achieved for a0 = 3. Therefore, the next-to-
last column of the optimal partition must contain 6 − 3 = 3 of the remaining
rectangles. Similarly, we determine that in column q = 3 of the table the
optimal fC (q) value is achieved for a0 = 0. The first column of the optimal
partitioning consists of all remaining 3 − 0 = 3 rectangles, which makes sense
since we know that the optimal partitioning consists of 3 columns. The widths
of the three columns in the optimal partitioning are thus c1 = s1 + s2 + s3 ,
c2 = s4 + s5 + s6 , and c3 = s7 + s8 . This partitioning is shown in Figure 6.10.
FIGURE 6.10: The optimal partition of the unit square in our example, with three columns of widths c1 = 0.18, c2 = 0.32, and c3 = 0.5.
1 PARTITION(s1 , . . . , sp )
2 S=0
3 for q = 1 to p do
4   S = S + sq
5   f1 (q) = 1 + S × q
6   f1cut (q) = 0
7 for C = 2 to p do
8   for q = C to p do
9     fC (q) = min_{a ∈ [C−1, q−1]} ( 1 + (q − a) Σ_{a < i ≤ q} si + fC−1 (a) )
10    fCcut (q) = a0 { where a0 is the value of a that leads to the minimum in the previous expression }
11 return (f∗cut )
computed as follows:
1 RE-BUILD(f∗cut , Copt )
2 q=p
3 for C = Copt down to 1 do
4 kC = q − fCcut (q)
5 q = fCcut (q)
6 return (k1 , . . . , kCopt )
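The dynamic program is short enough to transcribe; the sketch below reproduces the 8-rectangle example of Table 6.1 (names are ours, and fCcut stores the split point a0 , as discussed above).

```python
# Column-partition dynamic program (PARTITION / RE-BUILD), returning the best
# number of columns, the corresponding cost, and the rectangles per column.
def partition(s):
    s = sorted(s)                             # s_1 <= s_2 <= ... <= s_p
    p = len(s)
    prefix = [0.0] * (p + 1)
    for i, v in enumerate(s):
        prefix[i + 1] = prefix[i] + v
    INF = float("inf")
    f = [[INF] * (p + 1) for _ in range(p + 1)]    # f[C][q]
    cut = [[0] * (p + 1) for _ in range(p + 1)]
    for q in range(1, p + 1):
        f[1][q] = 1 + prefix[q] * q
    for C in range(2, p + 1):
        for q in range(C, p + 1):
            for a in range(C - 1, q):
                cost = 1 + (q - a) * (prefix[q] - prefix[a]) + f[C - 1][a]
                if cost < f[C][q]:
                    f[C][q], cut[C][q] = cost, a
    C_opt = min(range(1, p + 1), key=lambda C: f[C][p])
    k, q = [], p
    for C in range(C_opt, 0, -1):             # RE-BUILD: column sizes, last column first
        k.append(q - cut[C][q])
        q = cut[C][q]
    return C_opt, f[C_opt][p], list(reversed(k))

print(partition([0.05, 0.05, 0.08, 0.1, 0.1, 0.12, 0.2, 0.3]))
# -> (3, ~5.5, [3, 3, 2]): three columns of widths 0.18, 0.32 and 0.5
```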
Performance Guarantee
In this section, we show that column partitioning leads to a good approxi-
mation of the optimal (free) partition. This is especially true when the ratio
between the largest rectangle area, max si , and the smallest area, min si , is
low.
Ĉ / LB ≤ √r (1 + 1/√p) = √(max si / min si) (1 + 1/√p) .

And therefore,

Ĉ* / (2 Σi √si) ≤ 1/(Σi √si) + √r/2 + p / (2 √r (Σi √si)²) .

Furthermore,

Σi si = 1  ⟹  p · max si ≥ 1  ⟹  min si ≥ 1/(p r) ,

which leads to:

Σi √si ≥ p √(min si) ≥ √(p/r) .

Finally, we obtain:

Ĉ* / (2 Σi √si) ≤ √(r/p) + √r/2 + √r/2 ≤ √r (1 + 1/√p) .

Since Ĉ corresponds to the best solution among all the possible column partitionings, we have Ĉ ≤ Ĉ*, which completes the proof.
Bibliographical Notes
The load balancing results for 1-D data distributions presented in the first
part of this chapter are well known, and we refer the reader to the referenced
work therein for further details. The section on load balancing for 2-D data
distributions comes for the most part from the Ph.D. thesis by Rastello [99] and
related articles, which contain many results. Generally speaking, the literature
is rife with works that study load balancing problems and we explore some of
them in the exercises accompanying this chapter.
6.4 Exercises
The first exercise is a straightforward demonstration that rather than facing
the difficulty of load balancing for 2-D data distributions, an easier option is
to transform a “grid algorithm” into a “ring algorithm.” The second exercise
studies load balancing for the LU factorization on a heterogeneous grid of
processors. Finally, the third exercise discusses optimal load balancing for a
stencil application on a heterogeneous ring of processors (inspired by the work
in [79], to which we refer the reader for more details).
1. Consider the case in which the matrix of processor cycle times is of rank
1. In this case, we know that perfect load balancing is possible for the whole
matrix. In other words, we can find ri and cj , i = 1, . . . , p, j = 1, . . . , q, such that ri ti,j cj = 1 for all i, j, and such that Σi ri = 1 and Σj cj = 1. Processor
Pi,j is thus allocated a ri × cj block of the matrix (which are really normalized
and must be multiplied by n and rounded to integer values). See Section 6.2.1
for all details. Show that if the matrix of processor cycle times is of rank 1,
then it is possible to distribute columns and rows to processors such that the
distribution is optimal at each step of the LU factorization.
1. Let us first consider the case of a ring with heterogeneous processors and homogeneous network links: the time for processor Pi to process a column is wi ; the time for a processor to exchange a column with both its successor and predecessor is c × D (where c is the time to exchange one unit of data with both its neighbors). Show that the optimal allocation uses either one or all processors. Give a sufficient condition for the optimal allocation to use all processors.
6.5 Answers
Exercise 6.1 (Matrix Product on a Heterogeneous Ring)
. Question 1. Just consider the ring as a rectangular 1×p grid and simply run
the outer-product algorithm on this grid. For a C = A × B product, the three
matrices are distributed across the processors by column blocks of r = n/p
columns. At step k = 1, . . . , p, the processor that holds the k-th column block
of A broadcasts it to all other processors. This is the only communication
since vertical broadcasts of B are now replaced by local memory accesses
within block rows.
r′i (k) × c′j (k) elements of sub-matrix A(k). We easily see that:

T (k) = max_{i,j} { r′i (k) × ti,j × c′j (k) } = max_{i,j} { (r′i (k)/ri ) × (c′j (k)/cj ) } ,

since ri ti,j cj = 1. One can then separate the maxima, which are independent, to obtain:

T (k) = max_i ( r′i (k)/ri ) × max_j ( c′j (k)/cj ) .
The objective is to minimize the above quantity for all k. It turns out that
the static load balancing algorithm from Section 6.1.1, whose allocation is
used in reverse, can be used to minimize both these maxima because it leads
to 1-D data distributions that are optimal at each step of the LU factoriza-
tion. (Minimizing the maximum of the, for instance, ri0 (k)/ri ratio for all i
is equivalent to ensuring that each processor has an amount of work that is
as commensurate as possible to its cycle time.) We conclude that using the
allocation produced by this algorithm in reverse and along both dimensions
produces a data distribution optimal at each step!
. Question 2. Consider a 3 × 3 processor grid with the cycle times depicted
in Figure 6.13. There is one fast processor and 8 slow processors. We must
show that the optimal allocation of matrix elements for a 3 × 3 matrix cannot
be constructed based on the optimal allocation of matrix elements for a 2 × 2
matrix.
For a 2 × 2 matrix there are only three possible kinds of allocations of
matrix elements to the processors: (i) the fast processor is assigned all matrix
elements; (ii) the fast processor is assigned only one matrix element; (iii) the
fast processor is assigned no matrix element. In case (i), the time to update
the matrix is 4 (the fast processor updates 4 elements sequentially). In the
other cases, since a slow processor is involved, the time to update the matrix
is at least 5. So the best distribution is to allocate all matrix elements to the
fast processor.
Chapter 7
Scheduling
7.1 Introduction
This chapter presents basic (but important) results on task graph schedul-
ing. We start with a motivating example before providing a quick overview
of models and complexity results.
for i = 1 to n do
Task Ti,i : xi ← bi /ai,i
for j = i + 1 to n do
Task Ti,j : bj ← bj − aj,i × xi
T1,1 <seq T1,2 <seq T1,3 <seq . . . <seq T1,n <seq T2,2 <seq T2,3 <seq . . . <seq Tn,n .
However, there are independent tasks that can be executed in parallel. Intu-
itively, independent tasks are tasks whose execution order can be interchanged
without modifying the result of the program execution. A necessary condi-
tion for tasks to be independent is that they do not update the same variable.
They can read the same value, but they cannot write into the same memory
location (otherwise there would be a race condition and the result would be
non-deterministic). For instance tasks T1,2 and T1,3 both read x1 but modify
distinct components of b, hence they are independent.
We can express this notion of independence more formally. Each task T has
an input set In(T ) (read values) and an output set Out(T ) (written values). In
our example, In(Ti,i ) = {bi , ai,i } and Out(Ti,i ) = {xi }. For j > i, In(Ti,j ) =
{bj , aj,i , xi } and Out(Ti,j ) = {bj }. Two tasks T and T ′ are not independent (we write T ⊥ T ′ ) if they share some written variable:

T ⊥ T ′  ⇔  In(T ) ∩ Out(T ′ ) ≠ ∅ , or Out(T ) ∩ In(T ′ ) ≠ ∅ , or Out(T ) ∩ Out(T ′ ) ≠ ∅ .
The dependence relation is then defined as ≺ = (<seq ∩ ⊥)+ , where + denotes the transitive closure. In other words, we take the transitive
closure of the intersection of ⊥ and <seq to derive the set of all constraints
that need to be satisfied to preserve the semantics of the original program. In
a sense, ≺ captures the intrinsic sequentiality of the original program. The
original total ordering <seq was unduly restrictive, only the partial ordering
≺ needs to be respected. Why do we need to take the transitive closure of
<seq ∩ ⊥ to get a correct definition of ≺? In our example, we have T2,4 ⊥T4,4
(which is not a predecessor relationship, as there is T3,4 in between) and
T4,4 ⊥T4,5 , hence a path of dependences from T2,4 to T4,5 , while we do not
have T2,4 ⊥T4,5 . We need to track dependence chains to define ≺ correctly.
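A tiny sketch of this dependence test (Bernstein's conditions), applied to a few tasks of the triangular solver; the variable names used in the sets are illustrative.

```python
# Two tasks are not independent if one writes something the other reads or writes.
def dependent(task_a, task_b):
    in_a, out_a = task_a
    in_b, out_b = task_b
    return bool(in_a & out_b or out_a & in_b or out_a & out_b)

# In/Out sets of the triangular-solver tasks, for n = 4
T = {}
for i in range(1, 5):
    T[(i, i)] = ({f"b{i}", f"a{i}{i}"}, {f"x{i}"})
    for j in range(i + 1, 5):
        T[(i, j)] = ({f"b{j}", f"a{j}{i}", f"x{i}"}, {f"b{j}"})

print(dependent(T[(1, 2)], T[(1, 3)]))   # False: both read x1 but write different b's
print(dependent(T[(1, 2)], T[(2, 2)]))   # True: T1,2 writes b2, which T2,2 reads
```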
We can draw a directed graph to represent the dependence constraints that
need to be enforced. The vertices of the graph denote the tasks, while the
edges express the dependence constraints. An edge e : T → T 0 in the graph
means that the execution of T 0 must begin only after the end of the execution
of T , whatever the number of available processors. We do not usually draw
transitivity edges on the graph, as they represent redundant information; if
T ≺ T 0 and T 0 ≺ T 00 and if there exists a dependence T ⊥T 00 , then it will be
7.1. Introduction 209
We end up with the graph shown in Figure 7.1. We will use this graph several
times in this chapter for illustrative purposes.
7.1.2 Overview
This chapter presents classic theorems and algorithms from scheduling the-
ory. The communication model used by this theory is rather unrealistic but
it makes it possible to obtain fundamental complexity results.
We start with the most simple (one might say crude) model where all
communication delays between processors are neglected. We introduce ba-
sic definitions in Section 7.2. When there is no restriction on the number of
available processors, optimal schedules can be found in polynomial time, as
shown in Section 7.3. Section 7.4 deals with a limited number of processors;
the scheduling problem becomes NP-complete, and so-called list scheduling
heuristics are the typical approach. An elegant and powerful theorem shows
that any list scheduling algorithm generates a schedule that is no longer than
twice the optimal schedule (Section 7.4.2). We continue on the theoretical
side: in Section 7.4.5, we discuss the scheduling of independent tasks, and we
derive arbitrarily good approximation algorithms, i.e., polynomial algorithms whose performance is guaranteed to be within a factor (1 + ε) of the optimal, for any arbitrary ε > 0.
Next we move to the classical scheduling model in which communication
costs are taken into account each time two dependent tasks are not assigned
to the same processor. We detail this model in Section 7.5. In this case,
even the problem with unlimited processors is NP-complete, as explained in
Section 7.6. We present heuristics for p identical processors in Section 7.7
and briefly discuss how to extend these heuristics to handle heterogeneous
processors in Section 7.8.
This condition expresses the fact that if two tasks T and T 0 are allocated to
the same processor, then their executions cannot overlap in time.
2 This is not a restriction; task weights can be rational numbers. However, because there is a finite number of tasks, the weights can always be scaled up to integers.
3 In fact we need no more processors than the total number of tasks.
[Figure: dependence graph with vertices T1,1, T2,2, T3,3, T4,4, T4,5, T4,6, T5,5, T5,6, T6,6; the edges were lost in extraction.]
Theorem 7.1 states that scheduling deals with directed acyclic graphs (or
DAGs):
Our last definition introduces the notions of speedup and efficiency for
schedules (see [77] for a detailed discussion of speedup and efficiency):
Seq is the optimal execution time MSopt (1) of a schedule with a single
processor. We have the following well-known result:
[Figure: Gantt chart of a schedule on processors P1–P4 (vertical axis) versus time (horizontal axis), distinguishing active and idle periods.]
Another way to state Theorem 7.2 is to say that the speedup with p proces-
sors is always bounded by p. No superlinear speedup with our model! Here is
an easy result to conclude this section: the more processors, the smaller (or
equal) the optimal makespan.
Seq = MSopt (1) ≥ . . . ≥ MSopt (p) ≥ MSopt (p + 1) ≥ . . . ≥ MSopt (∞) .
We are now ready to address the search for optimal schedules. Not sur-
prisingly, it turns out that the problem Pb(p) with limited processors is more
difficult than Pb(∞), whose solution is explained in the next section.
Proof. The proof has two parts. First we show that σf ree is indeed a schedule,
then we derive its optimality. Both are easy:
The free schedule σf ree is also known as the as soon as possible (ASAP)
schedule.
Hence, MSopt (∞) is simply the maximal weight of a path in the graph. Note
that σf ree is not the only optimal schedule. For example the as late as possible
(ALAP) schedule σlate is also optimal. We define σlate as follows:
To understand the definition, note that bl(v) is the maximal weight of a path
from v to exit nodes, hence the need to start the execution of v no later than
MS(σf ree , ∞) − bl(v) if all tasks must have completed within MS(σf ree , ∞)
time units.
Proof. From Theorem 7.4 we know that the optimal schedule σf ree can be
computed using top levels and that MSopt (∞) is the maximal weight of a
path in the graph. Because G is acyclic, these quantities can be computed by
a traversal of the graph, hence the complexity O(|V | + |E|).
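Since these quantities come from a simple graph traversal, they are easy to compute in practice. Below is a minimal sketch (ours, with our own data layout) that computes the ASAP start times (top levels) and MSopt(∞) in O(|V| + |E|):

from collections import deque

def asap_schedule(weights, succ):
    # weights: task -> weight; succ: task -> list of successors
    indeg = {t: 0 for t in weights}
    for t in weights:
        for s in succ.get(t, []):
            indeg[s] += 1
    top = {t: 0 for t in weights}                    # ASAP start times
    ready = deque(t for t, d in indeg.items() if d == 0)
    while ready:
        t = ready.popleft()
        for s in succ.get(t, []):
            top[s] = max(top[s], top[t] + weights[t])
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    makespan = max(top[t] + weights[t] for t in weights)   # maximal path weight
    return top, makespan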
Going back to the triangular system (Figure 7.1), because all tasks have weight 1, the weight of a path is equal to its length plus 1. The longest path is
T1,1 → T1,2 → T2,2 → . . . → Tn−1,n−1 → Tn−1,n → Tn,n ,
whose weight is 2n − 1. Note that we do not need as many processors as tasks to achieve execution within 2n − 1 time units. For example, we can use only n − 1 processors. Let 1 ≤ i ≤ n; at time 2i − 2, processor P1 starts the execution of task Ti,i , while at time 2i − 1, the first n − i processors P1 , P2 , . . ., Pn−i execute tasks Ti,j , i + 1 ≤ j ≤ n.
THEOREM 7.5.
• Indep-tasks(2) is NP-complete, but can be solved by a pseudo-polynomial algorithm.
Proof. First, Dec(p) (and hence all the other problems, which are restrictions
of it) belongs to NP: if we are given a schedule σ whose makespan is less
than or equal to K, we can check in polynomial time that both dependences
and resource constraints are satisfied. Indeed, we have to ensure that each
dependence constraint (each edge in E) is satisfied, which is straightforward.
Also, we need to check that no more than p tasks ever execute simultaneously.
We can sort the tasks by their starting time, and check the latter condition
by scanning the sorted array. This can easily be done in time polynomial in
the size of the problem instance.
For proving the NP-completeness of Indep-tasks(2), we show that 2-Partition can be polynomially reduced to Indep-tasks(2). Consider an arbitrary instance Inst1 of 2-Partition, with n integers {a1 , a2 , . . . , an }, and let S = Σ_{i=1}^{n} ai be even (otherwise we know there is no solution). We build an instance Inst2 of Indep-tasks(2) as follows. We let p = 2 (of course), G = (V, E, w) with V = {v1 , v2 , . . . , vn }, E = ∅, and w(vi ) = ai , 1 ≤ i ≤ n. We also let K = S/2. The construction of Inst2 is polynomial (and even linear) in the size of Inst1 . Moreover, Inst1 has a solution if and only if there exists a schedule that meets the bound K, hence if and only if Inst2 has a solution.
The pseudo-polynomial algorithm to solve Indep-tasks(2) is a simple dynamic programming algorithm. For 1 ≤ i ≤ n and 0 ≤ T ≤ S, let the boolean variable c(i, T ) be true if there exists a subset of {a1 , a2 , . . . , ai } whose sum is T. We need to determine the value of c(n, S/2). We use the induction
c(i, T ) = c(i − 1, T ) ∨ c(i − 1, T − ai ) ,
which basically states that either ai is involved in the target subset, or not. The initialization is c(1, a1 ) = 1, c(i, 0) = 1 for all i, and all other boundary values are set to 0. The complexity of the algorithm is O(nS), which is not polynomial in the problem size, whose typical binary encoding would be O(n + Σ_{i=1}^{n} log ai ). (However, if the ai 's are encoded in unary, we have polynomial complexity, which is the definition of a pseudo-polynomial algorithm.)
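For concreteness, here is a minimal sketch (ours) of this dynamic program; it answers whether the independent tasks can be scheduled on two processors within makespan K = S/2:

def indep_tasks_2_feasible(a, K):
    # Pseudo-polynomial feasibility test for Indep-tasks(2) with bound K.
    S = sum(a)
    if S - K > K:                        # one processor gets at least S - K work
        return False
    reachable = [False] * (K + 1)        # reachable[T]: some subset sums to T
    reachable[0] = True
    for ai in a:
        for T in range(K, ai - 1, -1):   # backwards so each task is used once
            reachable[T] = reachable[T] or reachable[T - ai]
    return any(reachable[T] and S - T <= K for T in range(K + 1))

# Example: weights {3, 1, 1, 2, 2, 1}, S = 10, K = 5 -> True
print(indep_tasks_2_feasible([3, 1, 1, 2, 2, 1], 5))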
The reduction for the strong NP-completeness of Indep-tasks(p) is straight-
forward. Consider an arbitrary instance Inst1 of 3-PARTITION, with 3n
integers {a1 , a2 , . . . , a3n } and bound B as stated above. The instance Inst2
of Indep-tasks(p) is built with 3n independent tasks of weight ai , p = n pro-
cessors and K = B. Clearly, Inst1 has a solution if and only if there exists a
schedule that meets the bound K, hence if and only if Inst2 has a solution.
[Figure: DAG used in the reduction, with tasks X1 , X2 , X3 , . . . , Xn and additional tasks Y1 , . . . , Yn and Z1 , . . . , Zn ; the edge structure was lost in extraction.]
• the first processor P1 executes all 2n tasks Xi and Yi . These tasks are
totally ordered along a dependence path of length K = 2nB.
decide which tasks are given priority in the (frequent) case where there are
more free tasks than available processors. But a key result due to Coffman [42]
is that any list algorithm can be shown to achieve at most twice the optimal
makespan. We express this more formally after giving some definitions.
It is important to point out that Theorem 7.6 holds for any list schedule,
regardless of the strategy to choose among free tasks when there are more free
tasks than available processors.
Proof. Let Ti1 be a task whose execution terminates at the end of the schedule, i.e., σ(Ti1 ) + w(Ti1 ) = MS(σ, p).
Let t1 be the largest time smaller than σ(Ti1 ) and such that there exists an
idle processor during the time interval [t1 , t1 +1[ (let t1 = 0 if such a time does
not exist). Why is this processor idle? Because σ is a list schedule, no task
is free at t1 , otherwise the idle processor would start executing a free task.
Therefore, there must be a task Ti2 that is an ancestor 4 of Ti1 and that is
being executed at time t1 ; otherwise Ti1 would have been started at time t1 by
the idle processor. Because of the definition of t1 we know that all processors
4 The ancestors of a task are its predecessors, the predecessors of its predecessors, and so
on.
220 Chapter 7. Scheduling
are active between the end of the execution of Ti2 and the beginning of the
execution of Ti1 .
We start the construction again from Ti2 so that we obtain a task Ti3 such
that all processors are active between the end of Ti3 and the beginning of
Ti2 . Iterating the process, we end up with r tasks Tir , Tir−1 , . . . , Ti1 that
belong to a dependence path Φ of G and such that all processors are active
except perhaps during their execution. In other words, the idleness of some
processors can only occur during the execution of these r tasks, during which
at least one processor is active (the one that executes the task). Hence,
Idle ≤ (p − 1) × Σ_{j=1}^{r} w(Tij ) ≤ (p − 1) × w(Φ).
Fundamentally, Theorem 7.6 says that any list schedule is within 50% of
the optimum. Therefore, list scheduling is guaranteed to achieve half the best
possible performance, regardless of the strategy to choose among free tasks.
Before presenting the most widely used strategy to perform this choice (in
order to obtain a practical scheduling algorithm), we make a short digression
to show that the bound (2p − 1)/p cannot be improved.
PROPOSITION 7.2. Let MSlist (p) be the shortest possible makespan produced by a list scheduling algorithm. The bound
MSlist (p) ≤ ((2p − 1)/p) MSopt (p)
is tight.
FIGURE 7.3: The DAG used to bound list scheduling performance; task weights are indicated as exponents inside parentheses. [From the extraction: task Tp has weight 1, tasks T1 , . . . , Tp−1 have weight K(p − 1), and task T2p+1 has weight K(p − 1); the edges of the DAG were lost.]
However, the DAG can be scheduled in only Kp+1 time units. The key is to
deliberately keep p−1 processors idle while executing task Tp at time 0 (which
is forbidden in a list schedule). Then, at time 1, each processor executes one of
the p tasks Tp+1 , Tp+2 , . . . , T2p . At time 1 + K one processor starts executing
T2p+1 while the other p − 1 processors execute tasks T1 , T2 , . . . , Tp−1 . This
defines a schedule with a makespan equal to 1 + K + K(p − 1) = Kp + 1, which
is optimal because it is equal to the weight of the path Tp → Tp+1 → T2p+1 .
Hence, we obtain the ratio
MS(σ, p) / MSopt (p) ≥ K(2p − 1) / (Kp + 1) = (2p − 1)/p − (2p − 1)/(p(Kp + 1)) = (2p − 1)/p − ε(K),
where ε(K) tends to 0 as K grows, which shows that the bound is tight.
1. Initialization:
(a) Compute the priority of all tasks, for some definition of priority.
(b) Place the free tasks in a priority queue, sorted by non-increasing
priority.
(c) Let t be the current time: t = 0.
2. While there remain tasks to schedule:
(a) Add new free tasks, if any, to the priority queue. If the execution of
a task terminates at time t, remove this task from the predecessor
list of all its successors. Add those tasks whose predecessor list has
become empty to the priority queue.
(b) If there are q available processors and r tasks in the priority queue,
remove the first min(q, r) tasks from the priority queue and sched-
ule them; for each such task T set σ(T ) = t.
(c) Increment t by one (recall that task weights are integers).
Let G = (V, E, w) be a DAG and assume there are p available processors.
Let σ be any list schedule of G. Our aim is to derive an implementation
whose complexity is O(|V | log |V | + |E|) for computing the schedule. Clearly,
the above scheme must be modified because time varies from t = 0 up to
t = MS(σ, p), implying that the complexity depends on task weights. Indeed,
MS(σ, p) may be of the order of Seq, the sum of all task weights, and we would
have a pseudo-polynomial algorithm instead of a true polynomial algorithm; a
binary encoding of the problem instance is of size log(Seq), not of size Seq. We
outline a possible solution written in pseudo-code in Algorithm 7.1. Rather
than using time we use events which correspond to times when tasks become
free or processors become idle.
A few words of explanation are in order for Algorithm 7.1. We use a heap Q
(see [44]) to store free tasks, for two reasons: we can access the task with highest priority in constant time, and we can insert a task in the heap, according to
its priority level, in time proportional to the logarithm of the heap size, which
is bounded by |V |. We use another heap P to handle active processors; a
processor executing a task v ∈ V is valued by the time at which the execution
of v terminates. Thereby we can compute the next event in constant time, and
we can insert a new active processor in the heap in O(log |P|) ≤ O(log |V |)
time. When we extract a processor from the processor heap, meaning that a
task v has terminated, we need to update the in-degree of each successor of v
in array A. On the fly, if the in-degree of a given successor v 0 becomes zero,
we insert v 0 in the priority heap Q. This way, we process each edge of G only
once, for a global cost O(|E|). Overall, each task causes two insertions: the
first is the insertion of the task itself in heap Q; the second is the insertion
of the processor that executes it in heap P. Because each operation costs
at most O(log |V |), we obtain the desired complexity O(|V | log |V | + |E|) for
computing the schedule.
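To make this event-driven scheme concrete, here is a minimal sketch (ours; task priorities are an arbitrary input) of a list scheduler using two heaps, one for free tasks and one for busy processors:

import heapq

def list_schedule(weights, succ, priority, p):
    # weights: task -> integer weight; succ: task -> successors;
    # priority: task -> priority (larger means scheduled first); p processors.
    indeg = {t: 0 for t in weights}
    for t in weights:
        for s in succ.get(t, []):
            indeg[s] += 1
    free = [(-priority[t], t) for t, d in indeg.items() if d == 0]
    heapq.heapify(free)                    # heap Q of free tasks
    running = []                           # heap P of (finish time, task)
    sigma, t, busy = {}, 0, 0

    def release(done):
        nonlocal busy
        busy -= 1
        for s in succ.get(done, []):
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(free, (-priority[s], s))

    while free or running:
        while free and busy < p:           # start as many free tasks as possible
            _, task = heapq.heappop(free)
            sigma[task] = t
            heapq.heappush(running, (t + weights[task], task))
            busy += 1
        t, done = heapq.heappop(running)   # next event: earliest completion
        release(done)
        while running and running[0][0] == t:
            release(heapq.heappop(running)[1])
    return sigma, t                        # start times and makespan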
TABLE 7.1: Weights and critical paths for DAG in Figure 7.4.
Tasks T1 T2 T3 T4 T5 T6 T7 T8
Weights 3 2 1 3 4 4 3 6
Critical Paths 3 6 7 3 4 4 3 6
FIGURE 7.5: Critical path schedule for the example DAG in Figure 7.4. [Gantt chart: P1 executes T1 then T6 ; P2 executes T2 , T5 , T4 ; P3 executes T3 , T8 , T7 ; time steps 0 to 10.]
FIGURE 7.6: Optimal schedule for the example DAG in Figure 7.4. [Gantt chart: P1 executes T1 then T8 ; P2 executes T2 , T5 , T4 ; P3 executes T3 , T6 , T7 ; the makespan is 9 time steps.]
Note that it is possible to schedule the DAG in only 9 time units, as shown
in Figure 7.6. The trick is to leave a processor idle at time t = 1 deliberately;
although it has the highest critical path, T8 can be delayed by two time
units. T5 and T6 are given preference to achieve a better load balance between
processors. How do we know that the schedule shown in Figure 7.6 is optimal?
Because Seq = 26, three processors require at least ⌈26/3⌉ = 9 time units. This small example illustrates the difficulty of scheduling with a limited
number of processors.
The rationale to sort the tasks is that the arrival of a big task in the end
may unbalance the whole execution. However, we need to know all the tasks
(for sorting) before starting the execution of the algorithm. For this reason
SORTED-GREEDY is called an off-line algorithm. By contrast, GREEDY
can be applied to an on-line problem in which new tasks dynamically arrive
(e.g., to a dual-processor computer).
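As an illustration, here is a minimal sketch (ours) of the two heuristics for two processors: GREEDY handles tasks in the order in which they arrive, SORTED-GREEDY first sorts them by non-increasing weight.

def greedy_2(weights):
    # GREEDY for Indep-tasks(2): give each task to the currently less loaded processor.
    loads = [0, 0]
    for a in weights:
        loads[0 if loads[0] <= loads[1] else 1] += a
    return max(loads)                         # makespan

def sorted_greedy_2(weights):
    # SORTED-GREEDY: same rule, after sorting tasks by non-increasing weight (off-line).
    return greedy_2(sorted(weights, reverse=True))

# Tight examples used in the proof of Theorem 7.7:
print(greedy_2([1, 1, 2]))                    # 3, while the optimal makespan is 2
print(sorted_greedy_2([2, 2, 2, 3, 3]))       # 7, while the optimal makespan is 6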
THEOREM 7.7. GREEDY is a 3/2-approximation and SORTED-GREEDY is a 7/6-approximation for Indep-tasks(2), and these approximation factors cannot be improved.
Proof. We first show that the bounds are tight (note that the list scheduling bound is 2 − 1/2 = 3/2). For GREEDY, take an instance with three tasks of weight 1, 1, and 2: GREEDY has a makespan of 3 while the optimal is 2. For SORTED-GREEDY, take five tasks, three of weight 2 and two of weight 3. SORTED-GREEDY has a makespan of 7, while the optimal is 6.
For the approximations, recall that MSopt ≥ Psum /2 and that MSopt ≥ ai for all i. Let us start with GREEDY. Let P1 and P2 be the two processors. Assume that P1 finishes execution last. Let M1 be the execution time on P1 (the sum of all task weights assigned to it) and M2 the execution time on P2 . Because P1 terminates last, M1 ≥ M2 . Of course M1 + M2 = Psum .
Let Tj be the last task that executes on P1 . Let M0 = M1 − aj be the
load of P1 before Tj (of weight aj ) is assigned to it. Why does the GREEDY
algorithm choose to assign Tj to P1 ? It can only be because at that time
226 Chapter 7. Scheduling
P2 has more load than (or the same load as) P1 : M0 is not larger than the
current load of P2 at that time, which itself is not larger than its final load
M2 (note that P2 may have been assigned more tasks after Tj was scheduled
on P1 ). Therefore, M0 ≤ M2 . To summarize, the makespan of the schedule computed by GREEDY is M1 , and
M1 = M0 + aj = (1/2)((M0 + M0 + aj ) + aj )
   ≤ (1/2)((M0 + M2 + aj ) + aj ) = (1/2)(Psum + aj )
   ≤ MSopt + aj /2 ≤ MSopt + MSopt /2 ,
hence proving the 3/2-approximation result for GREEDY.
For SORTED-GREEDY the same line of reasoning is used, but with a tighter bounding of aj than by MSopt . First, if aj ≤ (1/3)MSopt we obtain what we need, i.e., M1 ≤ (7/6)MSopt . But if aj > (1/3)MSopt , then necessarily j ≤ 4. Indeed, if Tj was the fifth task or higher, because task weights are sorted, there would be at least five tasks of weight greater than (1/3)MSopt ; in any schedule, including the optimal schedule, one processor would receive three of these tasks, a contradiction. Next we observe that the makespan achieved by SORTED-GREEDY is the same when scheduling all tasks as when scheduling only the first four tasks. But for any problem instance with n ≤ 4, SORTED-GREEDY is optimal, and M1 = MSopt .
Proof. Let Pmax = maxi ai and L = max(Psum /2, Pmax ). We already know that L ≤ MSopt . We call big jobs those tasks Ti whose weights are such that ai > εL and small jobs those such that ai ≤ εL. The number of big jobs is at most
Psum /(εL) ≤ 2L/(εL) = 2/ε = B .
Because ε is fixed, B is a constant, so there is a (possibly large but) constant number of big jobs. We temporarily forget about small jobs, consider only big jobs and search for the best schedule. There are 2^B possible schedules
(each big job can be assigned to either processor), which is a constant number again, so we try them all and keep the best one, say σ_opt^big. The resulting makespan MS_opt^big satisfies MS_opt^big ≤ MSopt because there are fewer jobs than in the original problem.
Now we extend σ_opt^big and schedule the small jobs after the big jobs, using GREEDY, and we obtain a schedule σ for the original problem. We claim that σ solves the problem, i.e., that MS(σ) ≤ (1 + ε)MSopt . If the makespans of σ_opt^big and σ are equal, then σ is optimal. Otherwise, σ terminates with a small job Tj , say on the first processor P1 . But the load of P1 before this last assignment could not exceed Psum /2, otherwise GREEDY would have assigned Tj on P2 (same proof as in Theorem 7.7). Hence,
MS(σ) ≤ Psum /2 + aj ≤ L + εL ≤ (1 + ε)MSopt ,
which proves the theorem.
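A minimal sketch (ours, with our own function names) of this approximation scheme: enumerate the at most 2^B assignments of big jobs, then complete each one greedily with the small jobs and keep the best.

from itertools import product

def approx_2_processors(a, eps):
    # (1 + eps)-approximation for Indep-tasks(2), following the big/small-job argument.
    Psum, Pmax = sum(a), max(a)
    L = max(Psum / 2, Pmax)                       # lower bound on MSopt
    big = [x for x in a if x > eps * L]
    small = [x for x in a if x <= eps * L]
    best = None
    for assignment in product((0, 1), repeat=len(big)):   # at most 2^(2/eps) choices
        loads = [0.0, 0.0]
        for x, proc in zip(big, assignment):
            loads[proc] += x
        for x in small:                            # finish with GREEDY on small jobs
            loads[0 if loads[0] <= loads[1] else 1] += x
        if best is None or max(loads) < best:
            best = max(loads)
    return best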
Theorem 7.8 is interesting, but the combinatorial search for the best as-
signment of big jobs can be prohibitively expensive when ε tends to 0. This
motivates our last result, which states that Indep-tasks(2) has an FPTAS.
(This will conclude our initiation to the fascinating world of approximation
schemes.)
THEOREM 7.9. ∀ε > 0, Indep-tasks(2) has a (1 + ε)-approximation whose complexity is polynomial in 1/ε.
Proof. We encode schedules as Vector Sets. The first component of each vector
represents the load of the first processor P1 , and the second component the
load of P2 . Here is the construction:
The idea is to add new vectors during each phase of the construction only
if they fall into empty boxes. In other words, boxes are small enough so that
we do not need several vectors per box. Two vectors [x1 , y1 ] and [x2 , y2 ] fall into the same box if and only if
x1 /∆ ≤ x2 ≤ ∆ x1   and   y1 /∆ ≤ y2 ≤ ∆ y1 .
The pruned construction is as follows:
We generate at most one vector per box. As a result the total number of vectors is bounded by nM². Because M has polynomial size in 1/ε and in log Psum , we have the required complexity. Let us prove that the pruned
construction leads to a (1 + ε)-approximation of the optimal schedule.
LEMMA 7.2. ∀[x, y] ∈ VS_k , ∃[x#, y#] ∈ VS#_k such that x# ≤ ∆^k x and y# ≤ ∆^k y.
⇒  x# ≤ ∆^k u + ∆ ak ≤ ∆^k (u + ak ) = ∆^k x   and   y# ≤ ∆ v# ≤ ∆^k y .
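Putting the vector-set construction and the box pruning together, a minimal sketch (our own reconstruction; the choice ∆ = 1 + ε/(2n) is one standard option and is not taken from the book) could be:

from math import floor, log

def fptas_2_processors(a, eps):
    # Each vector (x, y) records the two processor loads; only one vector per box is kept.
    n = len(a)
    delta = 1.0 + eps / (2 * n)                  # assumed box ratio

    def box(v):
        return -1 if v == 0 else floor(log(v, delta))

    vectors = {(0.0, 0.0)}
    for ak in a:
        candidates = set()
        for (x, y) in vectors:
            candidates.add((x + ak, y))          # put task k on processor 1
            candidates.add((x, y + ak))          # put task k on processor 2
        pruned = {}
        for v in candidates:
            pruned.setdefault((box(v[0]), box(v[1])), v)
        vectors = set(pruned.values())
    return min(max(x, y) for (x, y) in vectors)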
cost(T, T') = 0          if alloc(T ) = alloc(T'),
cost(T, T') = c(T, T')   otherwise,
where alloc(T ) denotes the processor that executes task T (see Section 7.2),
and c(T, T 0 ) is defined by the application specification. The above model
states that the time for communication between two tasks running on the
same processor is negligible. The model also assumes that the processors
are part of a fully connected clique. This so-called macro-dataflow model
makes two main assumptions: (i) communication can occur as soon as data is
available; and (ii) there is no contention for network links. Assumption (i) is
allocate one task per processor. Then, we can check that the makespan
of the ASAP schedule is equal to 14. To see this, it is important to
point out that once the allocation of tasks to processors is given, we
can compute the makespan easily: for each edge e : T → T 0 , add a
virtual node of weight c(T, T 0 ) if the edge links two different processors
(alloc(T ) ≠ alloc(T')), and do nothing otherwise. Then, consider the
new graph as a DAG (without communications) and traverse it to com-
pute the length of the longest path, as explained in Section 7.3. In our
case, because all tasks are allocated to different processors, we add a
virtual node on each edge. The longest path is T1 → T2 → T7 , whose
length is w(T1 ) + c(T1 , T2 ) + w(T2 ) + c(T2 , T7 ) + w(T7 ) = 14.
[Figure 7.8: the schedule discussed in the text — P1 executes T1 then T2 ; P2 executes T3 , T4 , T5 , T6 , T7 ; time steps 0 to 9.]
Note that dependence constraints are satisfied in Figure 7.8. For example,
T2 can start at time 1 on processor P1 because this processor executes T1 ,
hence there is no need to pay the communication cost c(T1 , T2 ). By contrast,
T3 is executed on processor P2 , hence we need to wait until time 2 to start it
even though P2 is idle: σ(T1 ) + w(T1 ) + c(T1 , T3 ) = 0 + 1 + 1 = 2.
How did we find the schedule shown in Figure 7.8? And how do we know it is optimal?
Proof. The decision problem Comm(∞) associated with Pb(∞) is the follow-
ing. Given a cDAG G = (V, E, w, c) and an execution bound K ∈ N∗ , does
there exist a schedule σ for G such that MS(σ, ∞) ≤ K? We want to show
that Comm(∞) is NP-complete. First, Comm(∞) belongs to NP. If we are
given a schedule σ whose makespan is less than or equal to K, we can check
in polynomial time that dependence constraints are satisfied. For each task
we know the beginning σ(T ) of its execution and the processor alloc(T ) that
executes it, hence we just have to check for constraints.
[Figure: the reduction instance and its schedule — task T0 has weight A and task Tn+1 has weight A; in the Gantt chart, P1 executes T0 followed by the tasks in T1 , while P2 executes the tasks in T2 followed by Tn+1 .]
LEMMA 7.4. Tasks T0 and Tn+1 are not executed by the same processor in
schedule σ.
Proof. Assume that the same processor P executes both T0 and Tn+1 . Then, P executes all n other tasks Ti , 1 ≤ i ≤ n. Otherwise, let Ti0 be a task executed by another processor. The makespan of σ is greater than or equal to the length of the path T0 → Ti0 → Tn+1 :
To understand this note that P1 takes at least w(T0 ) + w(T1 ) time units to
execute its tasks, and that a communication must occur before P2 can start
Tn+1 .
Similarly, MS(σ) ≥ 2A + C + w(T2 ), because P2 must wait at least A + C time units before starting execution. Because MS(σ) ≤ K = 2A + C + α, we have w(T1 ) ≤ α and w(T2 ) ≤ α. But w(T1 ) + w(T2 ) = 2α. Therefore,
w(T1 ) = w(T2 ) = α. Let I denote the set of indices of the tasks in T1 ; I is a
solution to Inst1 , our instance of 2-Partition.
Theorem 7.10 only shows that Pb(∞) is NP-complete in the weak sense. In fact, Pb(∞) is NP-complete in the strong sense: even the problem in which all task weights and communication costs have the same (unit) value, the so-called UET-UCT problem (Unit Execution Time–Unit Communication Time), is NP-hard [94, 95].
g(G) = ( min_{T ∈ V} w(T ) ) / ( max_{T,T' ∈ V} c(T, T') ) .
Minimize M∞ subject to
  (A) ∀(T, T') ∈ E, x_{T,T'} ∈ {0, 1}
  (B) ∀T ∈ V, s(T ) ≥ 0
  (1) ∀(T, T') ∈ E, s(T ) + w(T ) + x_{T,T'} c(T, T') ≤ s(T')
  (2) ∀T ∈ V s.t. SUCC(T ) ≠ ∅, Σ_{T' ∈ SUCC(T)} x_{T,T'} ≥ |SUCC(T )| − 1
  (3) ∀T ∈ V s.t. PRED(T ) ≠ ∅, Σ_{T' ∈ PRED(T)} x_{T',T} ≥ |PRED(T )| − 1
  (4) ∀T ∈ V, s(T ) + w(T ) ≤ M∞

• The relaxed linear program RLP(G) is obtained by replacing the integrality constraint (A) with: ∀(T, T') ∈ E, 0 ≤ x_{T,T'} ≤ 1.
• We let (x^rel_{T,T'}, s^rel(T ), M^rel_∞) denote the solution of the relaxed problem RLP(G) over the rational numbers.
Hanen and Munier define their schedule σ^hm directly from the solution of the relaxed linear program RLP(G). Let T ∈ V be any task. Constraint (2) ensures that there is at most one successor T' of T such that x^rel_{T,T'} < 1/2; and constraint (3) ensures that there is at most one predecessor T'' of T such that x^rel_{T'',T} < 1/2. Therefore, let x^hm_{T,T'} = 0 for any edge e = (T, T') ∈ E such that x^rel_{T,T'} < 1/2, and x^hm_{T,T'} = 1 otherwise. For any task T ∈ V, define σ^hm_T to be the top level of T, where bottom and top levels are computed according to the allocation function induced by the x^hm_{T,T'}. We add the communication cost c(T, T') in the weight of a path going from T to T' if and only if alloc(T ) ≠ alloc(T'), i.e., if and only if x^hm_{T,T'} = 1. As explained earlier, this defines a valid schedule for G.
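A minimal sketch (ours) of this rounding step — from the relaxed edge values x^rel to the 0/1 values x^hm and the induced start times:

from collections import deque

def hanen_munier_rounding(weights, comm, succ, x_rel):
    # weights: T -> w(T); comm: (T, T') -> c(T, T'); succ: T -> successors;
    # x_rel: (T, T') -> relaxed value in [0, 1].
    x_hm = {e: (0 if x_rel[e] < 0.5 else 1) for e in x_rel}
    indeg = {t: 0 for t in weights}
    for t in weights:
        for s in succ.get(t, []):
            indeg[s] += 1
    sigma = {t: 0 for t in weights}            # top levels
    ready = deque(t for t, d in indeg.items() if d == 0)
    while ready:
        t = ready.popleft()
        for s in succ.get(t, []):
            edge_cost = comm[(t, s)] if x_hm[(t, s)] == 1 else 0
            sigma[s] = max(sigma[s], sigma[t] + weights[t] + edge_cost)
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return x_hm, sigma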
THEOREM 7.11. Let G = (V, E, w, c) be a coarse-grain cDAG, with granularity g(G) ≥ 1. Let σ^hm be the schedule defined by Hanen and Munier. Then,
MS(σ^hm, ∞) ≤ ((2g(G) + 2) / (2g(G) + 1)) MSopt (∞) .
Proof. For any path in the graph going from a task T to one of its successors T', we have the communication cost x^hm_{T,T'} c(T, T') for Hanen and Munier's schedule, instead of x^rel_{T,T'} c(T, T') for the solution of RLP(G). Two cases occur:
• x^hm_{T,T'} = 0: then w(T ) + x^hm_{T,T'} c(T, T') ≤ w(T ) + x^rel_{T,T'} c(T, T').
• x^hm_{T,T'} = 1: then x^rel_{T,T'} ≥ 1/2. We have
(w(T ) + x^hm_{T,T'} c(T, T')) / (w(T ) + x^rel_{T,T'} c(T, T')) ≤ (w(T ) + c(T, T')) / (w(T ) + c(T, T')/2) = (1 + c(T, T')/w(T )) / (1 + c(T, T')/(2w(T )))
and
(1 + c(T, T')/w(T )) / (1 + c(T, T')/(2w(T ))) ≤ (1 + 1/g(G)) / (1 + 1/(2g(G))) = (2g(G) + 2) / (2g(G) + 1) .
FIGURE 7.11: Naïve critical path scheduling for the example DAG in Figure 7.7. [Gantt chart: P1 executes T1 , T3 , T4 , T6 , T7 ; P2 executes T2 ; P3 executes T5 ; the makespan is 14 time units.]
We obtain a makespan equal to 14 time units. Note that the naı̈ve critical
path (naı̈ve CP) scheduling with two processors leads to the same result: T5
would have been executed by P1 at time t = 4 rather than by P3 at time t = 5:
this is the only difference. In both cases, we obtain the same makespan, even
worse than the execution on a single processor! There must be room for
improvement.
FIGURE 7.12: Modified critical path scheduling for the example DAG in Figure 7.7. [Gantt chart: P1 executes T1 , T3 , T2 ; P2 executes T4 , T6 , T7 ; P3 executes T5 ; the makespan is 11 time steps.]
available, P1 and P2 :
[Figure: (a) a fork graph and (b) a join graph, each with tasks T1 , T2 , T3 .]
3. Tricky: Prove that the heuristic is optimal for certain classes of graphs:
forks, joins, fork-joins, trees, etc.
The first approach is the strongest: one is assured that the heuristic will perform within a certain factor of the optimal in the worst case. The second approach is quite useful (and used) in practice. And the third approach helps to tune the heuristics so as to be optimal for certain graph classes (and maybe to publish nice research papers!).
A small step in the first direction is the following (rather weak) counterpart of Theorem 7.6:
THEOREM 7.12. Let G = (V, E, w, c) be a cDAG of granularity g(G) (see Section 7.6.2), and let MSopt (p) be the makespan of an optimal schedule. Then, we can derive a schedule σ with p processors whose makespan verifies
MS(σ, p) ≤ (2 − 1/p)(1 + g(G)) MSopt (p) .
Proof. The proof is straightforward: neglect all communication costs and construct a list schedule σ; its makespan is such that
MS(σ, p) ≤ (2 − 1/p) MS*opt (p) ≤ (2 − 1/p) MSopt (p),
where MS*opt (p) is the optimal makespan without communication costs. Then, we stretch the schedule by a factor 1 + g(G), which allows us to pay for the communication costs incurred from the predecessors of each task Ti : we have an interval of length (1 + g(G))wi in which to communicate the data from the predecessors of Ti and to execute it. Therefore, we have derived a valid schedule whose makespan meets the desired bound. Note that this schedule is not necessarily a list schedule because we may have waited longer than needed to execute some tasks.
where data(i, j) is the data volume associated with eij and vqr is the communication time for a unit-size message from Pq to Pr (i.e., the inverse of the bandwidth). As in the homogeneous case, we let vqr = 0
if q = r, i.e., if both tasks are assigned the same processor. If one
wishes to generate synthetic scenarios to evaluate competing scheduling
heuristics, one then must generate two matrices: one of size n × n for
data and one of size p × p for vqr .
The last (but important) modification concerns the way in which tasks are assigned to processors: instead of assigning the current task to the processor that will start its execution first (given all decisions already taken), we should assign it to the processor that will complete its execution first (given all decisions already taken). Both choices are equivalent with homogeneous processors, but intuitively the latter is likely to be more efficient in the heterogeneous case.
Altogether, we have re-discovered the list heuristic called HEFT, for Het-
erogeneous Earliest Finish Time [116]. The complexity of the algorithm as we
have outlined it here is the same as that of MCP. More sophisticated versions
attempt to insert tasks in intervals of time during which processors are idle.
This technique is called insertion scheduling: instead of scheduling a new task
after those already assigned to a given processor, a good idea may be to try
and schedule it at some earlier time, provided that there exists an interval
long enough to accommodate the task and during which the processor was
idle (most likely waiting for some communication to complete).
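As a rough illustration of the earliest-finish-time rule (a simplified sketch, ours, that ignores insertion scheduling), a single task could be assigned as follows:

def heft_assign(task, procs, exec_time, data_in, proc_available):
    # exec_time[task][q]: execution time of task on processor q;
    # data_in: list of (pred_finish, pred_proc, comm_time) tuples, where
    #   comm_time[q] is the time to ship the predecessor's data to processor q;
    # proc_available[q]: time at which processor q becomes idle.
    best = None
    for q in procs:
        data_ready = 0
        for (pred_finish, pred_proc, comm_time) in data_in:
            arrival = pred_finish + (0 if pred_proc == q else comm_time[q])
            data_ready = max(data_ready, arrival)
        start = max(data_ready, proc_available[q])
        finish = start + exec_time[task][q]
        if best is None or finish < best[2]:
            best = (q, start, finish)
    return best          # chosen processor, start time, finish time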
Bibliographical Notes
All the material covered in this chapter is rather basic. For scheduling without communication costs, pioneering work includes the book by Coffman [42]. Chapter 9 of [83], the book by El-Rewini, Lewis, and Ali [54] and
the IEEE compilation of papers [108] provide additional material. On the
theoretical side, Appendix A5 of Garey and Johnson [58] provides a list of
NP-complete scheduling problems. Also, the book by Brucker [37] offers a
comprehensive overview of many complexity results.
The literature with communication costs is more recent. Theorem 7.10 is
due to Chrétienne [40]. Picouleau [94, 95] proves that Pb(∞) remains NP-
complete even when we assume all task weights and communication costs
to have the same (unit) value — this is the so-called UET-UCT problem
(Unit Execution Time-Unit Communication Time) — or even if communica-
tion costs are arbitrarily small (but nonzero). Several extensions to Theo-
rem 7.10 are discussed in the survey paper by Chrétienne and Picouleau [41].
Hanen and Munier’s heuristic can be extended to cope with limited processors;
see [67]. See also the book by Darte, Robert and Vivien [50] where many clus-
tering heuristics are surveyed. Finally, a recent book by Sinnen [109] provides
a thorough discussion on communication models. In particular, it describes
several extensions for modeling and accounting for communication contention.
7.9 Exercises
We start with two exercises on scheduling without any communication costs.
Exercise 7.1 studies some properties of the free schedule and its variants,
and Exercise 7.2, borrowed from [65], shows surprising anomalies of the list
scheduling approach. Communication costs are involved in Exercise 7.3 that
studies the complexity of scheduling a simple fork graph using various commu-
nication models. We conclude with a more difficult problem in Exercise 7.4,
namely scheduling in-trees using Hu’s algorithm. This exercise is borrowed
from [50]. The original reference is [90], which gives a simpler proof of Hu’s
algorithm [70].
where σf ree and σlate are the ASAP and ALAP schedules defined in Sec-
tion 7.3.
2. Give an example of a DAG G = (V, E, w) that has at least three different
optimal schedules with unlimited processors.
3. Consider the DAG in Figure 7.15. Assume that all tasks have unit weight.
What is the optimal execution time MSopt (∞)? How many processors are
needed for the ASAP scheduling? For the ALAP scheduling? Determine the
minimum number popt of processors needed to achieve execution in optimal
time MSopt (∞).
[Figure 7.15: DAG with entry tasks T1 , T2 , then T3 , T4 , T5 , then T6 , T7 , and exit task T8 ; the edges were lost in extraction.]
[Figure: DAG for Exercise 7.2, with tasks A through J and their integer weights; the graphical layout was lost in extraction.]
[Figure: fork graph for Exercise 7.3 — a root task T0 and children T1 , T2 , . . . , Ti , . . . , Tn , where the edge from T0 to Ti carries communication cost di .]
(a) Give the maximum level of a task v in St for k < t < MS(σ, p) and show
that MS(σ, p) = k + L0 + 2.
(b) Show that, for all integers t, 0 ≤ t ≤ k, |St | = p. (Use the fact that G is
an in-tree, i.e., each vertex has at most one successor.)
(c) Infer from the previous two questions that MS(σ, p) = MSopt (p). Show
that this optimality result still holds even if k = MS(σ, p) − 1 (in which
case L0 is not defined) or if k does not exist.
7.10 Answers
Exercise 7.1 (Free Schedule)
. Question 1. Consider any task v ∈ V . By definition, σf ree (v) is the length
of the longest path from an entry node up to v, so σf ree (v) ≤ σ(v) for any
schedule σ, be it optimal or not. Now read the definition of σlate carefully:
there is a path of length σlate (v) = MSopt (∞) − bl(v) from v to an exit node.
Hence, if a schedule σ starts executing v later than time σlate (v), its makespan
will be greater than MSopt (∞), implying that σ is not optimal.
. Question 2. Consider a DAG with four tasks T1 , T2 , T3 , T4 of unit weights.
The only dependences are
T1 → T2 → T3 .
We have MSopt (∞) = 3, σf ree (T1 ) = σlate (T1 ) = 0, σf ree (T2 ) = σlate (T2 ) = 1
and σf ree (T3 ) = σlate (T3 ) = 2. However, T4 is independent from the other
tasks. We have σf ree (T4 ) = 0 and σlate (T4 ) = 2. There is room for a third
optimal schedule σ that coincides with the other two on T1 , T2 and T3 , and
such that σ(T4 ) = 1.
. Question 3. The longest path is T2 → T4 → T6 → T8 , thus MSopt (∞) = 4.
The following table shows the starting times for σf ree and σlate :
Tasks T1 T2 T3 T4 T5 T6 T7 T8
σf ree 0 0 1 1 1 2 2 3
σlate 1 0 2 1 2 2 3 3
We see from the table that we need 3 processors for σf ree (at time 1) and
also 3 processors for σlate (at time 2). Any schedule whose makespan is
MSopt (∞) = 4 requires at least 2 processors, since there are 8 tasks. The
schedule σ below is valid (check that all dependences are satisfied) and achieves
a makespan of 4 with only 2 processors:
Tasks T1 T2 T3 T4 T5 T6 T7 T8
σ 0 0 1 1 2 2 3 3
We conclude that popt = 2.
. Question 4. The problem is obviously in NP. We use a reduction from 2-Partition. Consider an arbitrary instance Inst1 with n integers {a1 , a2 , . . . , an }. Let S = Σ_{i=1}^{n} ai . We assume that S is even and that ai ≤ S/2 for all i (otherwise we know there is no solution). We build an instance Inst2 of our problem as follows: we have a DAG of n + 1 independent tasks T1 to Tn+1 . We let w(Ti ) = ai for 1 ≤ i ≤ n, and w(Tn+1 ) = S/2. The size of Inst2 is linear in the size of Inst1 . We have MSopt (∞) = w(Tn+1 ) = S/2 and we ask whether popt ≤ K = 3. Obviously, there is a solution to Inst1 if and only if there is one to Inst2 .
The makespan is 33, while the sequential time is 66. There is no idle time in
this schedule obtained with critical path list scheduling, and thus it is optimal.
. Question 2. Here again, we can compute bottom levels and execution
times. We obtain:
Task A B C D E F G H I J
Bottom level 30 26 10 25 23 23 17 8 7 7
σ 0 0 3 1 7 13 19 5 6 13
Processor P1 P2 P2 P2 P1 P1 P1 P2 P2 P2
The makespan is 36, three time units more than with larger task weights.
It turns out that any list scheduling algorithm leads to a makespan of at least
36. This is painful to prove: there is no other way than trying all possibilities. At
time 0, either we schedule A and B, or we schedule A and C, or we schedule
B and C. In this way we explore a tree of possibilities and eventually prove
the result.
. Question 3. With three processors there is a single list schedule: we have
no freedom at all. We obtain:
Task A B C D E H I J F G
σ 0 0 0 2 8 3 5 5 13 20
Processor P1 P2 P3 P2 P1 P3 P2 P3 P2 P1
The makespan is 38, five time units more than with 2 processors.
[Figure: fork graph for the instance Inst2 — a root T0 and children Ti1 , Ti2 , . . . , Tik , . . . , Tin , with communication costs di1 , . . . , din on the edges and computation weights wi1 , . . . , win on the children.]
Obviously, the size of Inst2 is linear in the size of Inst1 . It is easy to check that Inst1 has a solution if and only if we can schedule G in time K = (1/2) Σ_{i=1}^{n} ai .
• the last three children Tn+1 , Tn+2 and Tn+3 have weight:
In this chapter, we discuss several scheduling topics that are more advanced
than those studied in Chapter 7. We strongly believe that the macro-dataflow
task graph scheduling model should be modified to better account for network resource consumption. It would be unrealistic to expect that a single model can realistically capture all kinds of architecture/software combinations. But bandwidths of network cards and of communication links are always limiting factors, exactly as CPU speeds limit computing resource consumption.
As stated in Section 3.2.3, a common approach is to use one-port mod-
els (either uni- or bidirectional) for single-threaded programs using single-
threaded communication libraries, and to use multi-port models (with band-
width bounds and/or overheads) for asynchronous multi-threaded programs.
But recall that the work in [106] casts doubts on the ability to achieve true
asynchronous communications. Let us point out that serialized communica-
tions in the one-port model have a dramatic impact on application execu-
tion time (makespan). For example, in the traditional macro-dataflow model,
scheduling a fork graph with an unlimited number of homogeneous processors
has polynomial complexity, while in the one-port model this problem becomes
NP-hard (see Exercise 7.3).
In this chapter, we address four important (but largely independent) topics:
We mostly use one-port models, either bidirectional (Sections 8.1 and 8.2)
or unidirectional (Section 8.3). For the sake of completeness, we also use the
bounded multi-port model in Section 8.2. All application models considered
in this chapter are structured, in that applications exhibit some intrinsic reg-
ularity. This is the main reason why we succeed in establishing deeper results
than for the scheduling of a single DAG. We refer the adventurous reader to
the bibliographical notes at the end of this chapter for more information and
for some pointers to recent papers representative of the ongoing research in
the field.
for a 3-D domain. Below is a simplified pseudo-code for this application im-
plemented in parallel in a master-worker fashion. This pseudo-code is written
using the same template as in Chapter 3 and using the SCATTER function (see
Section 3.3.2) to distribute pieces of an array among the processors:
p ← NUM PROCS()
myrank ← MY NUM()
{ The master reads input data into an array }
if myrank = MASTER then
    data ← n seismic events of size L read from an input file
{ Each processor receives one piece of the array }
SCATTER(myrank, data, rbuff, n × L/p)
{ Every processor computes its piece of the array }
COMPUTE(rbuff)
The master processor reads n data items and scatters them among the p
processors (in this case the master operates as a worker for the computation).
Each processor then computes results independently. This application can be
modeled sufficiently accurately as a divisible load only if the number of tasks,
n, is large when compared to the number of processors. For this particular
application n is expected to be large. For instance, during 1999 as many as n = 817,101 seismic events were recorded, each of which can be used to
validate the seismic model.
This example application is representative of a large class of applications
that consist of very large, some may say enormous, numbers of fine-grain com-
putations. A common and often reasonable assumption is that the execution
time of each computation is proportional to the size of the data to be pro-
cessed in that computation. Since the computations are independent there is
no need for either synchronizations or communications among the processors:
only the input messages from the master to the workers need to be taken into
account when scheduling the application.
[Figure: bus-structured master-worker platform — the master is connected to workers P1 , P2 , . . . , Pi , . . . , Pp by a bus of per-task communication cost c; worker Pi has computation cost wi .]
Our goal is to determine the ni values so that the overall execution time,
i.e., the makespan, is minimized. Figure 8.3 shows an example execution for
p = 3 workers. Let Ti denote the execution time of processor Pi (recall that
M = P0 ). Accounting for the serialization of communications on the bus and
the order in which the master “serves” the workers, we obtain the following
expression for Ti :
- P0 : T0 = n0 · w0
- P1 : T1 = n1 · c + n1 · w1
- P2 : T2 = (n1 · c + n2 · c) + n2 · w2
- Pi : Ti = Σ_{j=1}^{i} nj · c + ni · wi for i ≥ 1
To make the above formula homogeneous we define c0 = 0 and ci = c for i ≥ 1, so that
Ti = Σ_{j=0}^{i} nj · cj + ni · wi for i = 0, 1, . . . , p .
The above equation shows that an optimal solution for the distribution of
Wtotal tasks over p + 1 processors is obtained by distributing n0 tasks to pro-
cessor P0 , and then optimally distributing Wtotal − n0 tasks to processors
P1 , . . . , Pp . This observation would easily lead to a dynamic programming
algorithm for computing the optimal ni values. However, this solution would
only be partially satisfactory. First, it is not a closed-form solution. Second,
and more importantly, it does not solve the question of the ordering of the
workers. Indeed, we have arbitrarily decided that the master M would send
messages in the order P1 , P2 , . . . , Pp . But the workers have different com-
puting powers, so one should expect the ordering to matter. Unfortunately,
[Figure: Gantt charts showing the master M and workers P1 , P2 , P3 over time; graphical content lost in extraction.]
there are p! possible orderings, far too many to resort to an exhaustive search for the best ordering with reasonable complexity!
One way to address this challenge is to realize that we do not actually need a precise solution where the ni are integers. Let us think of our seismic model validation application, with 817,101 tasks and, say, 10 processors. With such a large number of tasks relative to the number of processors, one can simply search for rational ni values, at the price of some rounding in the end to derive a feasible work allocation. This simple relaxation of the problem is the quintessence of the divisible load approach and turns out to be surprisingly successful, as seen in the next section.
(recall that c0 = 0 and ci = c for i ≥ 1). We look for a data distribution α0 , . . . , αp that minimizes T.
LEMMA 8.1. In an optimal solution, all processors stop computing at the
same time.
FIGURE 8.4: Illustration of the proof of Lemma 8.1 — (a) i = 1, P1 terminates earlier than P2 ; (b) decrease αi+1 = α2 by ε, increase αi = α1 by ε; (c) communication time for other workers is unchanged. [The Gantt charts for M, P1 , P2 , P3 were lost in extraction.]
Note that the reasoning also works if i = 0, i.e., with the master P0 . If it
finishes before P1 , suppress ε from the load of P1 , and if it finishes after P1 ,
suppress ε from the load of P0 . In both cases, they finish simultaneously, and
strictly earlier than max(T0 , T1 ).
We are almost done with the proof. We start from an optimal solution
whose execution time is T = Topt . There exists at least one processor whose
end time is T . We apply the exchange procedure to any processor pair Pi
and Pi+1 such that min(Ti , Ti+1 ) < max(Ti , Ti+1 ) = T . Both termination
times of Pi and Pi+1 are smaller than T after the exchange. We continue
applying this procedure until there remains no such pair. In the end, we have
found a solution whose total execution time is smaller than T , a contradiction.
We conclude that all processors do have the same end time in any optimal
solution.
Equipped with Lemmas 8.1 and 8.2, we can characterize the best way of
assigning loads to the master P0 and to workers P1 , . . . , Pp :
• T = α0 w0 Wtotal ;
• T = α1 (c + w1 )Wtotal . Therefore, α1 = (w0 /(c + w1 )) α0 ;
• T = (α1 c + α2 (c + w2 ))Wtotal . Therefore, α2 = (w1 /(c + w2 )) α1 ;
• For all i ≥ 1 we derive αi = (wi−1 /(c + wi )) αi−1 ;
• We use the normalization equation Σ_{i=0}^{p} αi = 1 to derive
α0 ( 1 + w0 /(c + w1 ) + . . . + Π_{k=1}^{i} wk−1 /(c + wk ) + . . . ) = 1 .
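A minimal sketch (ours) of these closed-form formulas for the bus platform:

def bus_divisible_load(w, c):
    # w[0]: master's computation cost; w[1..p]: workers' costs; c: bus cost per unit load.
    p = len(w) - 1
    ratios = [1.0]                        # alpha_i / alpha_0
    for i in range(1, p + 1):
        ratios.append(ratios[-1] * w[i - 1] / (c + w[i]))
    alpha0 = 1.0 / sum(ratios)            # normalization: the alphas sum to 1
    return [alpha0 * r for r in ratios]

# Example: master and two workers, unit bus cost; the fractions sum to 1.
alphas = bus_divisible_load([1.0, 2.0, 3.0], 1.0)
print(alphas, sum(alphas))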
We still do not know in which order the master should communicate with
the workers. The intuition is that faster workers should be served first, so
that they can work longer. Is it correct? Let us assess the impact of the
communication order analytically. To do so, consider the load processed by
workers Pi and Pi+1 during a time t:
Processor Pi – We have αi (c + wi )Wtotal = t. Therefore, αi = (1/(c + wi )) · (t/Wtotal ).
We see that the formula is symmetric, and we conclude that the communica-
tion order has absolutely no impact on the solution, a surprising conclusion
indeed! We can perform a similar analysis for the master M = P0 and the
first worker P1 :
Processor P0 – We have α0 w0 Wtotal = t. Then, α0 = (1/w0 ) · (t/Wtotal ).
Processor P1 – We have α1 (c + w1 )Wtotal = t. Hence, α1 = (1/(c + w1 )) · (t/Wtotal ).
We see that the sum of loads is larger when w0 < w1 . Perhaps, in some
situations, the master is fixed a priori. But for some other applications it
may be chosen among all available processors. In the latter case we should
select the most powerful (or fastest) processor as the master. We conclude
this section by summarizing our results:
THEOREM 8.1. For divisible load applications on bus-structured networks,
the master should be selected (if possible) as the fastest processor. In an opti-
mal solution, all workers participate and terminate simultaneously. The com-
munication order from the master to the workers has no impact. Closed-form
formulas can be established for α0 , α1 , . . . , αp .
[Figure: heterogeneous star platform — the master M is connected to workers P1 , P2 , . . . , Pi , . . . , Pp ; the link to Pi has communication cost ci and Pi has computation cost wi .]
We see that the formula is symmetric for the total communication time:
the network is occupied the same amount of time whether Pi comes before
264 Chapter 8. Advanced scheduling
or after Pi+1 . However, the amount of work does depend upon the ordering:
αi + αi+1 is maximized when ci ≤ ci+1 , which suggests that we should serve
the faster communicating worker first. Because the ordering of Pi and Pi+1
has no impact on the other workers, we can infer a very important result:
LEMMA 8.3. In an optimal solution, participating workers must be served
by non-decreasing values of ci .
However, we do not yet know whether we should utilize all workers as in
the case of bus platforms. The exchange procedure we used in Section 8.1.3
does not work so easily here. This is because communication times change
when shifting fractions of load from one worker to another. Nevertheless, the
result still holds:
LEMMA 8.4. In an optimal solution, all workers participate in the compu-
tation.
Proof. Consider an optimal solution. Let us renumber the processors so that
the ordering of communications is P1 , P2 , . . . , Pp . Suppose that at least one
worker is kept fully idle. In this case, at least one of the αi 's, 1 ≤ i ≤ p, is
zero. Let us denote by k the largest index such that αk = 0.
Case k < p – We add Pk at the end of the initial solution, thus using the ordering P1 , . . . , Pk−1 , Pk+1 , . . . , Pp , Pk . By construction, αp ≠ 0. Therefore, the network is not used during at least the last αp wp time units. It would thus be possible to process at least αp wp /(ck + wk ) > 0 additional units of load with worker Pk , which contradicts the assumption that the original solution was optimal: Pk does more work than in the original solution in which it was kept idle, and all the other processors do the same amount of work.
Case k = p – We modify the initial solution by giving some work to Pp without increasing the execution time. Let k' be the largest index such that αk' ≠ 0. By construction, the communication medium is not used during at least the last αk' wk' > 0 time units. Thus, as previously, it would be possible to process at least αk' wk' /(cp + wp ) > 0 additional units of load with worker Pp , which leads to a similar contradiction.
Therefore, in an optimal solution all workers participate in the computation.
It is worth pointing out that the above property does not hold true if we
consider solutions in which the communication ordering is fixed a priori. For
instance, consider a platform comprising two workers: P1 (with c1 = 4 and
w1 = 1) and P2 (with c2 = 1 and w2 = 1). If the first chunk has to be sent
to P1 and the second chunk to P2 , the optimal number of units of load that
can be processed within 10 time units is 5, and P1 is kept fully idle in this
solution. On the other hand, if the communication ordering is not fixed, then
8.1. Divisible Load Scheduling 265
6 units of load can be performed within 10 time units (5 units of load are sent
to P2 , and then 1 to P1 ). In the optimal solution, both workers perform some
computation, and both workers finish computing at the same time. This is a general result: in an optimal solution, all workers finish computing at the same time (and, as the proof below shows, the optimal solution is in fact unique).
Proof. The reader may want to skip this proof as it requires more involved
mathematical arguments. Consider an optimal solution. All αi ’s have strictly
positive values (Lemma 8.4). Consider the following linear program:
Maximize Σ_i βi ,
subject to
  LB(i): ∀i, βi ≥ 0
  UB(i): ∀i, Σ_{k=1}^{i} βk ck + βi wi ≤ T
The αi ’s satisfy the set of constraints above, and from any set of βi ’s sat-
isfying the
P set of inequalities, we can build a valid schedule that processes
exactly βi units of load. Therefore,Pif we denote
P by (β1 , . . . , βp ) an optimal
solution of the linear program, then βi = αi .
It is known that one of the extremal solutions S1 of the linear program is a vertex of the convex polyhedron P induced by the inequalities [107, chapter 11]: this means that in the solution S1 , at least p of the 2p inequalities are tight, i.e., are equalities. Since we know that for any optimal solution all the βi 's are strictly positive (Lemma 8.4), this vertex is the solution of the following (full rank) linear system:
∀i, Σ_{k=1}^{i} βk ck + βi wi = T.
Thus, we conclude that there exists an optimal solution in which all workers
finish their work at the same time.
Let us denote by S2 = (α1 , . . . , αp ) another optimal solution, with S1 ≠ S2 . As already pointed out, S2 belongs to the polyhedron P. Now, consider the following function f :
f : R → R^p ,  x ↦ S1 + x(S2 − S1 ) .
By construction, we know that Σ_i βi = Σ_i αi . Thus, with the notation f (x) = (γ1 (x), . . . , γp (x)):
∀i, γi (x) = βi + x(αi − βi ),
and therefore
∀x, Σ_i γi (x) = Σ_i βi = Σ_i αi .
Therefore, all the points f (x) that belong to P are optimal solutions of the linear program.
Since P is a convex polyhedron and both S1 and S2 belong to P, f (x) ∈ P for all 0 ≤ x ≤ 1. Let us denote by x0 the largest value of x ≥ 1 such that f (x) still belongs to P: at least one constraint of the linear program is an equality at f (x0 ), and this constraint is not satisfied for x > x0 . Could this constraint be one of the UB(i) constraints? The answer is no, because otherwise this constraint would be an equality along the whole line (S2 f (x0 )), and would remain an equality for x > x0 . Hence, the constraint of interest is one of the LB(i)'s. In other terms, there exists an index i such that γi (x0 ) = 0. This is a contradiction since we have proved that the γi 's correspond to an optimal solution of the problem. Therefore, S1 = S2 , the optimal solution is unique, and in this solution, all workers finish computing simultaneously.
Minimize Tf ,
subject to
  (1) αi ≥ 0, 1 ≤ i ≤ p
  (2) Σ_{i=1}^{p} αi = Wtotal
  (3) Σ_{j=1}^{i} αj cj + αi wi ≤ Tf , 1 ≤ i ≤ p (i-th worker)
THEOREM 8.2. The optimal solution is given by the solution of the linear
program above.
Theorem 8.2 is a direct consequence of the previous two lemmas. Note that
inequalities (3) will be in fact equalities in the solution of the linear program,
so that we can easily derive a closed-form expression for the αi ’s (although
it is far less elegant than for bus platforms). It is important to point out
that this is linear programming with rational numbers, hence of polynomial
complexity.
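For illustration, here is a minimal sketch (ours) that feeds this linear program to an off-the-shelf solver, assuming SciPy is available; the variables are (α1 , . . . , αp , Tf ) and the workers are assumed already sorted by non-decreasing ci (Lemma 8.3):

import numpy as np
from scipy.optimize import linprog

def star_divisible_load(c, w, W_total):
    p = len(c)
    cost = np.zeros(p + 1)
    cost[p] = 1.0                                  # minimize Tf
    A_ub = np.zeros((p, p + 1))                    # constraint (3), one row per worker
    for i in range(p):
        A_ub[i, :i + 1] = c[:i + 1]
        A_ub[i, i] += w[i]
        A_ub[i, p] = -1.0
    b_ub = np.zeros(p)
    A_eq = np.zeros((1, p + 1))                    # constraint (2)
    A_eq[0, :p] = 1.0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[W_total],
                  bounds=[(0, None)] * (p + 1), method="highs")
    return res.x[:p], res.x[p]                     # loads alpha_i and makespan Tf

# Example call: star_divisible_load([1, 1, 5], [1, 1, 5], 1.0)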
As stated above, the variant where the master is capable of computing
while communicating to one of its children can be solved by adding a fictitious
worker Pp+1 with cp+1 = 0 and wp+1 = w0 . The previous analysis shows that
the master is kept busy all the time, as expected (otherwise more units of load
could be processed). However, if the master is not fixed a priori but rather
can be freely chosen among all processors, we should introduce new variables
cij to denote the communication time of one unit of load from Pi to Pj . But
we have no easy rule to decide which processor should be elected as master
(we had such a rule for bus platforms). Instead, we can still solve p + 1 linear
programs (with Pi as master in the i-th program) and retain the best solution.
Exercise 8.1 explains how to extend the linear programming approach to
general (multi-level) heterogeneous trees.
FIGURE 8.7: With return messages, all processors do not always participate in the computation. [Platform: c1 = 1, c2 = 1, c3 = 5 and w1 = 1, w2 = 1, w3 = 5. The LIFO schedule using all 3 processors achieves 61/135 ≈ 0.45 task/sec, while the FIFO schedule using only 2 processors achieves 1/2 = 0.5 task/sec and is the best schedule.]
FIGURE 8.8: With return messages, the optimal schedule may be neither LIFO nor FIFO. [Platform: c1 = 7, c2 = 8, c3 = 12 and w1 = 6, w2 = 5, w3 = 5. The optimal schedule achieves 38/499 ≈ 0.076 task/sec, the best FIFO schedule 47/632 ≈ 0.074 task/sec, and the best LIFO schedule 43/580 ≈ 0.074 task/sec.]
[Figure: one-round versus multi-round divisible load schedules — in the one-round schedule the master M sends load fractions α1 , α2 , . . . , αp to workers P1 , . . . , Pp , which finish at times T1 , . . . , Tp ; the multi-round schedule proceeds in rounds R0 , R1 , . . . , Rk and finishes at Tf .]
8.2.1 Motivation
An idea to circumvent the difficulty of makespan minimization is to lower
the ambition of the scheduling objective. Instead of aiming at the absolute
minimization of the execution time, why not consider asymptotic optimality?
Often, the motivation for deploying an application on a parallel platform is
that the number of tasks is very large. In this case, the optimal execution time
with the optimal schedule may be very large and a small deviation from it is
likely acceptable. To state this informally: if there is a nice (e.g., polynomial)
way to derive, say, a schedule whose length is two hours and three minutes, as
opposed to an optimal schedule that would run for only two hours, we would
be satisfied.
This approach has been pioneered by Bertsimas and Gamarnik [29]. Steady-
state scheduling allows one to relax the scheduling problem in many ways. The
costs of the initialization and clean-up phases are neglected. The initial integer
[Figure: example computing platform — data starts on a local computer and is distributed over the Internet to a cluster, a supercomputer, a partner site, and participating PCs and workstations; intermediate nodes can compute too. The platform is abstracted as a small graph with nodes A, B, C, D and associated weights.]
[Figure: step-by-step construction of the cyclic schedule on the example platform; each panel lists the compute, send, and receive activities of nodes A, B, C, D. The surviving panel captions read "(g) A serves B (still partly idle)" and "(h) Period of the cyclic schedule".]
Maximize ρ,
subject to
  ρ = Σ_{i=1}^{p} αi
  Σ_{i=1}^{p} αi ci ≤ 1
  ∀i, αi ≥ 0
  ∀i, αi wi ≤ 1
It turns out that the linear program is so simple that it can be solved analytically. Indeed it is a fractional knapsack problem [44] with value-to-cost ratio 1/ci . We should start with the “item” (worker) of largest ratio, i.e., the smallest ci , and assign it as many tasks as we can, i.e., min(1/ci , 1/wi ) tasks per time unit. Here is the detailed procedure:
3. Workers Pq+2 to Pp (if they exist) are discarded: they will not partici-
pate in the computation.
When q = p the result is expected: it basically says that workers can be fed
with tasks fast enough so that they are all kept computing steadily. However, if
q < p the result is surprising. Indeed, if communication bandwidth is limited,
TABLE 8.1: Achieved throughput for the bandwidth-centric strategy on the
example tree platform.
Tasks Communication Computation
6 tasks to P1 6c1 = 6 6w1 = 18
3 tasks to P2 3c2 = 6 3w2 = 18
2 tasks to P3 2c3 = 6 2w3 = 2
some workers will partially starve. In the optimal solution these partially
starved workers are those with slow communication rates, regardless of their
processing speeds. In other words, a slow processor with a fast communication
link is to be preferred to a fast processor with a slow communication link. This
optimal strategy is often called bandwidth-centric because it delegates work
to the fastest communicating workers, regardless of their computing speeds.
Of course, slow workers will not contribute much to the overall throughput.
[Figure 8.11: example platform — a master M and five workers with communication and computation costs (ci , wi ) = (1, 3), (2, 6), (3, 1), (10, 1), (20, 1).]
Consider the example shown in Figure 8.11. Workers are sorted by non-decreasing ci . We see that c1 /w1 + c2 /w2 = 2/3 < 1 and that c1 /w1 + c2 /w2 + c3 /w3 = 2/3 + 3 > 1, so that q = 2 and ε = 1/3 in the previous formula. Therefore, P1 and P2 will be fully active, contributing α1 + α2 = 1/w1 + 1/w2 = 1/3 + 1/6 tasks per time unit. P3 will only be partially active, contributing α3 = min(ε/cq+1 , 1/wq+1 ) = min(1/9, 1) = 1/9. P4 and P5 will be discarded. The optimal throughput is ρ = 1/3 + 1/6 + 1/9 = 11/18 ≈ 0.6. Table 8.1 shows that 11 tasks are computed every Tperiod = 18 time units.
It is important to point out that if we had used a purely greedy (demand-
driven) strategy, we would have reached a much lower throughput. Indeed,
the master would serve the workers in round-robin fashion, and we would
execute only 5 tasks every 36 time units, therefore achieving a throughput of only ρ = 5/36 ≈ 0.14. The conclusion is that even when resources are cheap
and abundant, resource selection is key to performance.
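To make this procedure concrete, here is a minimal Python sketch (the function name and output format are our own) that applies the bandwidth-centric allocation to the costs of the example platform of Figure 8.11; note that the values chosen for P4 and P5 do not affect the throughput since these workers are discarded anyway:

```python
from fractions import Fraction

def bandwidth_centric(c, w):
    """Analytical solution of the steady-state linear program on a star.
    c[i], w[i]: communication and computation costs of worker i.
    Returns the alpha_i (tasks per time unit) and the throughput rho."""
    # Workers are considered by non-decreasing communication cost c_i.
    order = sorted(range(len(c)), key=lambda i: c[i])
    alpha = [Fraction(0)] * len(c)
    used_bw = Fraction(0)              # fraction of the master's time spent sending
    for i in order:
        ci, wi = Fraction(c[i]), Fraction(w[i])
        full = ci / wi                 # master time needed to keep P_i fully busy
        if used_bw + full <= 1:        # worker can be kept computing steadily
            alpha[i] = 1 / wi
            used_bw += full
        else:                          # worker only gets the leftover bandwidth
            eps = 1 - used_bw
            alpha[i] = min(eps / ci, 1 / wi)
            used_bw = Fraction(1)      # the master link is now saturated
    return alpha, sum(alpha)

# Costs as read off Figure 8.11: c = (1, 2, 3, 10, 20), w = (3, 6, 1, 1, 1)
alpha, rho = bandwidth_centric([1, 2, 3, 10, 20], [3, 6, 1, 1, 1])
print(alpha, rho)   # expected throughput rho = 11/18
```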
Once one has obtained the solution to the linear program defined above,
say (α, %), one needs to reconstruct a (periodic) schedule. In other terms, one
needs to decide in which specific activities each computation and communica-
tion resource is involved during each period. More precisely, we need to define
Tperiod such that during a period (i) an integral number of tasks is processed
by each processor, and (ii) an integral number of messages goes through each
link.
We express all the rational numbers αi, 1 ≤ i ≤ q, as αi = ui/vi, where the ui and the vi are relatively prime integers. We also write α_{q+1} = min(ε/c_{q+1}, 1/w_{q+1}) = u_{q+1}/v_{q+1}. The period of the schedule is set to Tperiod = lcm(v1, . . . , vq, v_{q+1}), the least common multiple of the denominators. In the example in Figure 8.11, we had α1 = 1/3, α2 = 1/6 and α3 = 1/9, and lcm(3, 6, 9) = 18. This is how we
found the period used in Table 8.1.
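The period computation itself is a one-liner once the αi are available as exact rationals; here is a small sketch using Python's standard fractions and math modules:

```python
from fractions import Fraction
from math import lcm

alphas = [Fraction(1, 3), Fraction(1, 6), Fraction(1, 9)]   # alpha_1, ..., alpha_{q+1}
T_period = lcm(*(a.denominator for a in alphas))            # lcm(3, 6, 9) = 18
tasks_per_period = [a * T_period for a in alphas]           # 6, 3 and 2 tasks
```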
In steady-state, during each period of duration Tperiod :
• Master M sends αi Tperiod tasks to Pi, 1 ≤ i ≤ q + 1. These q + 1 messages are sent in any order. A total of ρTperiod tasks are sent during each period. All these communications share the network but Equation (8.2) ensures that link bandwidths are not exceeded.
• Worker Pi executes the αi Tperiod tasks that it has received during the
last period. Equation (8.1) ensures that Pi has enough time to execute
these tasks.
Obviously, the first and last period are different: no computation takes place
during the first period, and no communication during the last one. Note that
the steady-state regime is reached no later than the beginning of the second
period.
Altogether, we have a periodic schedule, which is described in compact
form: because it arises from the linear program, log(Tperiod) is polynomial in the problem size, but Tperiod itself is not. Hence, describing what
happens at every time-step during the period would be exponential in the
problem size. Instead, we have a more “compact” description of the schedule:
we only need the duration of the p time intervals during which the master
sends tasks to each worker (some possibly zero), and the duration of the p
time intervals during which each worker computes (again, some possibly zero).
We conclude this section by explaining how to modify the linear program
to use the bounded multi-port model instead of the one-port model. Here is
the one-port linear program again:
    Maximize ρ,
    subject to
        (i)   ρ = Σ_{i=1}^{p} αi
        (ii)  Σ_{i=1}^{p} αi ci ≤ 1
        (iii) ∀i, αi ≥ 0
        (iv)  ∀i, αi wi ≤ 1
In the bounded multi-port model, the one-port constraint (ii) is replaced by the per-link constraint

    ∀i, αi ci ≤ 1 ,

which states that the bandwidth of the link from M to Pi is not exceeded.
Let δ be the volume of data (in bytes) that needs to be sent for one task. We can rewrite the last equation as:

    ∀i, αi δ / Bi ≤ 1    (ii-a),

where Bi is the bandwidth of the link (thus ci = δ/Bi). We also have to enforce a global bound related to the bandwidth B of the master's network card:

    (Σ_{i=1}^{p} αi) δ / B ≤ 1    (ii-b).
Replacing equation (ii) by both equations (ii-a) and (ii-b) is all that is needed
to change to the bounded multi-port model!
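As an illustration of this variant, the sketch below feeds constraints (ii-a), (ii-b) and (iv) to an off-the-shelf LP solver (here scipy.optimize.linprog, which is our choice and not something prescribed by the text):

```python
import numpy as np
from scipy.optimize import linprog

def multiport_throughput(w, B_links, B_master, delta):
    """Maximize rho = sum(alpha_i) subject to alpha_i*delta/B_i <= 1  (ii-a),
    sum(alpha_i)*delta/B_master <= 1  (ii-b), and alpha_i*w_i <= 1  (iv)."""
    p = len(w)
    c = -np.ones(p)                            # linprog minimizes, so negate
    A_ub, b_ub = [], []
    for i in range(p):
        row = np.zeros(p); row[i] = delta / B_links[i]   # (ii-a)
        A_ub.append(row); b_ub.append(1.0)
        row = np.zeros(p); row[i] = w[i]                 # (iv)
        A_ub.append(row); b_ub.append(1.0)
    A_ub.append(np.full(p, delta / B_master)); b_ub.append(1.0)   # (ii-b)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=[(0, None)] * p)
    return res.x, -res.fun                     # the alpha_i and the throughput rho
```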
    α0 w0 ≤ 1 .

Consider now an internal node Pi. Let Pi0 denote its parent in the tree, and let Pi1, Pi2, . . . , Pik denote its children in the tree. As before, there are equations to constrain the computations and communications of Pi:

    αi wi ≤ 1 ,
    Σ_{j=1}^{k} sent(Pi → Pij) ci,ij ≤ 1 .
    αi wi ≤ 1 ,
We have built a linear program, and we can follow the same steps as for
star-shaped platforms to define the period and reconstruct the final schedule.
Essentially Proposition 8.1 says that no schedule can execute more tasks
than the optimal steady-state. There remains to bound the potential loss due
to the initialization and the clean-up phase. Consider the following algorithm
(assume T is large enough):
• Solve the linear program: compute the maximal throughput ρ, compute all the values for αi, sent(Pi → Pj), and rij to determine Tperiod. For
each processor Pi , determine per i , the total number of tasks that it
receives per period. Note that all these quantities are independent of T :
they depend only upon wi and cij , characteristics of the platform.
• Initialization: the master sends per i tasks to each processor Pi . This
requires I units of time, where I is a constant independent of T .
• Let J be the maximum time for each processor to consume per_i tasks (J = max_i {per_i · wi}). Again, J is a constant independent of T.
• Let r = ⌊(T − I − J)/Tperiod⌋.

Proof. Using Lemma 8.1, nbopt(T) ≤ ρT. From the description of the algorithm, we have nb(T) = ((r + 1)Tperiod) · ρ ≥ (T − I − J) · ρ. This proves the result because I, J, Tperiod and ρ are constants independent of T.
8.2.6 Summary
In addition to its simplicity, which makes it possible to tackle more complex
problems, steady-state scheduling has two main advantages over traditional
scheduling:
8.3 Workflow Scheduling

[FIGURE 8.12: the application pipeline. Stages S1, . . . , Sk, . . . , Sn; stage Sk receives an input of size bk−1, performs wk computations, and produces an output of size bk.]
8.3.1 Framework
We consider a pipeline with n stages Sk, 1 ≤ k ≤ n, as illustrated in
Figure 8.12. Tasks are fed into the pipeline and processed from stage to
stage, until they exit the pipeline after the last stage. The k-th stage Sk
receives an input from the previous stage, of size bk−1 , performs a number of
wk operations, and outputs data of size bk to the next stage. The first stage
S1 receives an initial input of size b0 , while the last stage Sn returns a final
result of size bn .
We target a platform with p processors Pu, 1 ≤ u ≤ p, that are fully interconnected (see Figure 8.13). There is a bidirectional link linku,v : Pu ↔ Pv with bandwidth Bu,v between each pair of processors Pu and Pv. For the sake of simplicity, we enforce the unidirectional variant of the one-port model: a given processor can be involved in a single communication at any time-step, either a send or a receive. Note that independent communications between distinct processor pairs can take place simultaneously (this was not possible in star-shaped platforms). However, remember that in the unidirectional variant of the one-port model a given processor cannot send and receive data in parallel (while this was allowed for tree-shaped platforms in Section 8.2.4).

[FIGURE 8.13: two processors Pu and Pv, with speeds Wu and Wv, linked with bandwidth Bu,v; Pu is also connected to Pin with bandwidth Bin,u and Pv to Pout with bandwidth Bv,out.]
In the most general case, we have fully heterogeneous platforms, with dif-
ferent processor speeds and link bandwidths. The speed of processor Pu is
denoted as Wu , and it takes X/Wu time units for Pu to execute X operations.
We also enforce a linear cost model for communications, hence it takes X/Bu,v
time units to send (resp. receive) a message of size X to (resp. from) Pv . We
classify below particular cases that are important, both from a theoretical and practical perspective:

• Fully Homogeneous platforms have identical processors (Wu = W) and identical links (Bu,v = B).

• Communication Homogeneous platforms have heterogeneous processors (Wu ≠ Wv) but homogeneous links (Bu,v = B).

• Fully Heterogeneous platforms may have different processor speeds and different link bandwidths.
Finally, we assume the existence of two special additional processors Pin and
Pout . The initial input data for each task resides on Pin , while all final results
must be returned to Pout .
The mapping problem consists in assigning application stages to processors.
If we restrict the search to one-to-one mappings, we require that each stage
Sk of the application pipeline be mapped onto a distinct processor Palloc(k)
(which is possible only if n ≤ p). The function alloc associates a processor
index to each stage index. For convenience, we create two fictitious stages S0
and Sn+1 , and we assign S0 to Pin and Sn+1 to Pout . What is the period of
Palloc(k) , i.e., the minimum delay between the processing of two consecutive
tasks? To answer this question, we need to know to which processors the
previous and next stages are assigned. Let t = alloc(k − 1), u = alloc(k) and
v = alloc(k + 1). Pu needs bk−1 /Bt,u time units to receive the input data from
Pt , wk /Wu time units to process it, and bk /Bu,v time units to send the result
to Pv , hence a cycle-time of bk−1 /Bt,u + wk /Wu + bk /Bu,v time units for Pu .
Because of the one-port communication model, these three steps are serialized
(see Figure 8.14 for an illustration). The period achieved with the mapping is
the maximum of the cycle-times of the processors, which corresponds to the
rate at which the pipeline can be activated.
In this simple instance, the optimization problem can be stated as follows:
determine a one-to-one allocation function alloc : [1, n] → [1, p] (augmented
with alloc(0) = in and alloc(n + 1) = out) such that
    Tperiod = max_{1≤k≤n} ( bk−1 / B_{alloc(k−1),alloc(k)} + wk / W_{alloc(k)} + bk / B_{alloc(k),alloc(k+1)} )    (8.3)
is minimized. We denote this optimization problem One-to-one Mapping.
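Evaluating Equation (8.3) for a candidate mapping is straightforward; the following sketch (data layout and names are ours) does exactly that:

```python
def period(alloc, b, w, W, B):
    """Period of a one-to-one mapping (Equation (8.3)).
    alloc[k] is the processor of stage k, with alloc[0] = 'in' and
    alloc[n+1] = 'out'; b[0..n] are the data sizes, w[1..n] the stage weights,
    W[u] the processor speeds and B[(u, v)] the link bandwidths."""
    n = len(w) - 1                      # stages are numbered 1..n, w[0] unused
    cycle_times = []
    for k in range(1, n + 1):
        t, u, v = alloc[k - 1], alloc[k], alloc[k + 1]
        cycle_times.append(b[k - 1] / B[(t, u)] + w[k] / W[u] + b[k] / B[(u, v)])
    return max(cycle_times)             # the pipeline is activated at this rate
```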
[Figure 8.14: a time-step diagram for processors P1, P2 and P3, illustrating how receive, compute and send operations are serialized under the one-port model and interleaved across consecutive periods.]
1  GREEDY ASSIGNMENT()
2  Work with the n fastest processors, numbered P1 to Pn
3      where W1 ≤ W2 ≤ . . . ≤ Wn
4  Mark all stages S1 to Sn as free
5  for u = 1 to n do
6      Pick any free stage Sk s.t. bk−1/B + wk/Wu + bk/B ≤ Tperiod
7      Assign Sk to Pu. Mark Sk as already assigned
8      If no stage is found return "failure"
9  Return "success"
the total computation time is O((pn + costGA ) log(pn)), where costGA is the
cost of the greedy assignment procedure.
We now describe the greedy assignment algorithm for a given Tperiod value. Recall that there are n stages to map onto p ≥ n processors in a one-to-one fashion. We target Communication Homogeneous platforms with heterogeneous processors (Wu ≠ Wv) but with homogeneous links (Bu,v = B). First, we retain only the fastest n processors, which we rename P1, P2, . . . , Pn such that W1 ≤ W2 ≤ . . . ≤ Wn. Then, we consider the processors in the order P1 to Pn, i.e., from the slowest to the fastest, and greedily assign to them any free, that is, not already assigned, stage that they can process within the period. Algorithm 8.1 details the procedure.
Before providing the proof, we proceed with a small example. Consider
an application with 3 stages with respective computation requirements 1, 1,
and 100; 3 processors of speed 1, 1, and 100; and no communication costs
whatsoever. Can we achieve Tperiod = 1? The algorithm starts by assigning
stages to the slowest processors, i.e., to the first two processors. Only the first
two stages can be assigned to these processors to fit into the period, so they
are chosen, and the last stage then fits when assigned to the fastest processor.
It cannot be assigned to one of the first two processors because the required
period would be exceeded.
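A direct Python transcription of Algorithm 8.1, together with a simple search over candidate periods, could look as follows; this is a sketch, and in particular the text's binary search over periods is replaced here by a plain scan of the candidate values for brevity:

```python
def greedy_assignment(T, b, w, W, B):
    """Algorithm 8.1: one-to-one mapping for a target period T on a
    Communication Homogeneous platform.  b[0..n]: data sizes, w[1..n]: stage
    weights, W: speeds of the n retained processors sorted non-decreasingly
    (slowest first), B: the common link bandwidth."""
    n = len(w) - 1
    free = set(range(1, n + 1))
    assignment = {}
    for u in range(n):                                  # slowest processor first
        ok = [k for k in free
              if b[k - 1] / B + w[k] / W[u] + b[k] / B <= T]
        if not ok:
            return None                                 # "failure"
        k = ok[0]                                       # any eligible stage works
        free.remove(k)
        assignment[k] = u
    return assignment                                   # "success"

def best_period(b, w, W, B):
    """Smallest feasible period: every achievable period is the cycle-time of
    some stage on some processor, so it suffices to test those values."""
    n = len(w) - 1
    candidates = sorted({b[k - 1] / B + w[k] / W[u] + b[k] / B
                         for k in range(1, n + 1) for u in range(n)})
    for T in candidates:                                # feasibility is monotone in T
        a = greedy_assignment(T, b, w, W, B)
        if a is not None:
            return T, a
    return None
```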
The proof that the greedy procedure returns a solution if and only if there
exists a solution of period Tperiod is done via a simple exchange argument.
Consider a valid one-to-one assignment of period Tperiod , denoted by A, and
assume that in this assignment Sk1 is assigned to P1 . Note first that the
greedy procedure will indeed find a stage to assign to P1 and cannot fail,
since Sk1 can be chosen. If the choice of the greedy procedure is actually
Sk1 , we proceed by induction with P2 . If the greedy procedure has selected
another stage Sk2 for P1 , we find which processor, say Pu , has been assigned
this stage in the valid assignment A. Then, we exchange the assignments of
P1 and Pu in A. Because Pu is faster than P1 , which could process Sk1 in
time in the assignment A, Pu can process Sk1 in time too. Because Sk2 has been mapped on P1 by the greedy procedure, P1 can process Sk2 in time. So the exchange is valid. We can consider the new assignment A, which is valid
and which assigns the same stage to P1 as the greedy procedure. The proof
proceeds by induction with P2 as before.
The complexity of the greedy assignment procedure is O(n²), because of the two loops over processors and stages. Altogether, since n ≤ p, the complexity of the whole algorithm is O(pn log(pn)), which is indeed polynomial in the problem size.
Formally, we define:
DEFINITION 8.1 (Hetero-1D-Partition-Dec). Given n elements a1, a2, . . . , an, p values W1, W2, . . . , Wp and a bound K, can we find a partition of [1..n] into p intervals I1, I2, . . . , Ip, with Ik = [dk, ek] and dk ≤ ek for 1 ≤ k ≤ p, d1 = 1, dk+1 = ek + 1 for 1 ≤ k ≤ p − 1 and ep = n, and a permutation σ of {1, 2, . . . , p}, such that

    max_{1≤k≤p}  ( Σ_{i∈Ik} ai ) / W_{σ(k)}  ≤  K ?
The sequence of elements (tasks) is

    A1, 1, 1, . . . , 1 (M times), C, D,   A2, 1, 1, . . . , 1 (M times), C, D,   . . . ,   Am, 1, 1, . . . , 1 (M times), C, D,

and the processor speeds are Wi = B + zi, Wm+i = C + M − yi, and W2m+i = D, for 1 ≤ i ≤ m.
• We map each task Ai and the following yσ1 (i) tasks of weight 1 onto
processor Pσ2 (i) .
8.3. Workflow Scheduling 289
• We map the following M − yσ1 (i) tasks of weight 1 and the next task, of
weight C, onto processor Pm+σ1 (i) .
• We map the next task, of weight D, onto the processor P2m+i .
We do have a valid partition of all the tasks into p = 3m intervals. For
1 ≤ i ≤ m, the load and speed of the processors are indeed equal:
• The load of Pσ2 (i) is Ai +yσ1 (i) = B+xi +yσ1 (i) and its speed is B+zσ2 (i) .
• The load of Pm+σ1 (i) is M − yσ1 (i) + C, which is equal to its speed.
• The load and speed of P2m+i are both equal to D.
The mapping does achieve the bound K = 1, hence a solution to I2 .
Suppose now that I2 has a solution, i.e., a mapping matching the bound
K = 1. We first observe that Wi < Wm+j < W2m+k = D for 1 ≤ i, j, k ≤ m. Indeed Wi = B + zi ≤ B + M = 3M, 5M ≤ Wm+j = C + M − yj ≤ 6M
and D = 7M . Hence, each of the m tasks of weight D must be assigned to a
processor of speed D, and it is the only task assigned to this processor. These
m singleton assignments divide the set of tasks into m intervals, namely the
set of tasks before the first task of weight D, and the m − 1 sets of tasks
lying between two consecutive tasks of weight D. The total weight of each
of these m intervals is Ai + M + C > B + M + C = 10M , while the largest
speed of the 2m remaining processors is 6M. Therefore, each interval must be assigned to at least 2 processors. However, there remain only 2m available processors, hence each interval is assigned exactly 2 processors.

Consider such an interval Ai 111...1 C with M tasks of weight 1, and let Pi1 and Pi2 be the two processors assigned to this interval. Tasks Ai and C are not assigned to the same processor (otherwise the whole interval would be assigned to a single processor). So Pi1 receives task Ai and hi tasks of weight 1 while Pi2 receives M − hi tasks of weight 1 and task C. The load of Pi2 is M − hi + C ≥ C = 5M while Wi ≤ 3M for 1 ≤ i ≤ m. Hence, Pi1 must be some Pi, 1 ≤ i ≤ m, while Pi2 must be some Pm+j, 1 ≤ j ≤ m. Because this holds true for each interval, this defines two permutations σ2(i) and σ1(i) such that Pi1 = Pσ2(i) and Pi2 = Pm+σ1(i). Because the bound K = 1 is achieved, we have:
• Ai + hi = B + xi + hi ≤ B + zσ2(i)

• M − hi + C ≤ C + M − yσ1(i)

Therefore, yσ1(i) ≤ hi and xi + hi ≤ zσ2(i), and Σ_{i=1}^{m} xi + Σ_{i=1}^{m} yi ≤ Σ_{i=1}^{m} xi + Σ_{i=1}^{m} hi ≤ Σ_{i=1}^{m} zi. By hypothesis, Σ_{i=1}^{m} xi + Σ_{i=1}^{m} yi = Σ_{i=1}^{m} zi, hence all inequalities are tight, and in particular Σ_{i=1}^{m} xi + Σ_{i=1}^{m} hi = Σ_{i=1}^{m} zi. We can deduce that Σ_{i=1}^{m} yi = Σ_{i=1}^{m} hi = Σ_{i=1}^{m} zi − Σ_{i=1}^{m} xi, and since yσ1(i) ≤ hi for all i, we have yσ1(i) = hi for all i. Similarly, we deduce that xi + hi = zσ2(i) for all i, and therefore xi + yσ1(i) = zσ2(i). Altogether, we have found a solution for I1, which concludes the proof.
FIGURE 8.15: The platform used in the reduction for Theorem 8.7. [The figure shows the input processor Pin linked to P1 with bandwidth 1, the processors P1, . . . , P4, the intermediate processors Pi,j whose links to Pi and Pj have bandwidth 2/d(ci, cj), and the output processor Pout linked to P4 with bandwidth 1.]
Suppose first that I1 has a solution. We map stage Si onto Pπ(i) for 1 ≤ i ≤ m, and stage S′i onto processor Pπ(i),π(i+1) for 1 ≤ i ≤ m − 1. The cycle-time of P1 is 1 + 2K + d(cπ(1), cπ(2))/2 ≤ 1 + 2K + K/2 ≤ 3K. Quite similarly, the cycle-time of Pm is smaller than 3K. For 2 ≤ i ≤ m − 1, the cycle-time of Pπ(i) is d(cπ(i−1), cπ(i))/2 + 2K + d(cπ(i), cπ(i+1))/2 ≤ 3K. Finally, for 1 ≤ i ≤ m − 1, the cycle-time of Pπ(i),π(i+1) is d(cπ(i), cπ(i+1))/2 + 2K + d(cπ(i), cπ(i+1))/2 ≤ 3K. The mapping does achieve a period that is no greater than 3K, hence a solution to I2.
Suppose now that I2 has a solution, i.e., a mapping of period lower than
3K. We first observe that each processor is assigned at most one stage by the
mapping, because executing two stages would require at least 2K + 2K units
of time, which would be too large to match the period. Next, we observe that
any slow link, of bandwidth 1/(5K), cannot be used in the solution: otherwise the period would exceed 5K.
The input processor Pin has a single fast link to P1, so necessarily P1 is assigned stage S1 (i.e., π(1) = 1). As observed above, P1 cannot execute any other stage. Because of fast links, stage S′1 must be assigned to some P1,j; we let j = π(2). Again, because of fast links and the one-to-one constraint, the only choice for stage S2 is Pπ(2). Necessarily j = π(2) ≠ π(1) = 1, otherwise P1 would execute two stages. We proceed similarly for stage S′2, assigned to some Pπ(2),k (let k = π(3)), and stage S3, assigned to Pπ(3). Owing to the one-to-one constraint, k ≠ 1 and k ≠ j, i.e., π : [1..3] → [1..m] is a one-to-one mapping. By induction, we build the full permutation π : [1..m] → [1..m].

TABLE 8.2: Summary of complexity results for the different instances of the workflow mapping problem.

                 Fully Hom.                  Comm. Hom.                  Hetero.
    One-to-one   polynomial (bin. search)    polynomial (bin. search)    NP-hard
    Interval     polynomial (dyn. prog.)     NP-hard                     NP-hard
Because the output processor Pout has a single fast link to Pm , necessarily Pm
is assigned stage Sm , hence π(m) = m.
We have built the desired permutation; it remains to show that, for 1 ≤ i ≤ m − 1, d(cπ(i), cπ(i+1)) ≤ K. The cycle-time of processor Pπ(i),π(i+1) is d(cπ(i), cπ(i+1))/2 + 2K + d(cπ(i), cπ(i+1))/2 ≤ 3K, hence d(cπ(i), cπ(i+1)) ≤ K. Altogether, we have found a solution for I1, which concludes the proof.
Table 8.2 summarizes all previous complexity results. We see that one level
of heterogeneity (in processor speed) is enough to make interval mapping NP-
hard, while two levels of heterogeneity (adding different link bandwidths) are
required to make one-to-one mapping NP-hard as well.
For an interval mapping with intervals Ij = [dj, ej], 1 ≤ j ≤ m, the response time writes:

    Trt = Σ_{1≤j≤m} ( b_{dj−1} / B_{alloc(dj−1),alloc(dj)} + ( Σ_{i=dj}^{ej} wi ) / W_{alloc(dj)} ) + bn / B_{alloc(n),alloc(n+1)} .    (8.5)
The response time for a one-to-one mapping obeys the same formula (with
the restriction that each interval has length 1).
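For illustration, here is a small sketch (data layout and names are ours) that evaluates Equation (8.5) for a given interval mapping:

```python
def response_time(intervals, alloc, b, w, W, B):
    """Equation (8.5): response time of an interval mapping.
    intervals: list of (d_j, e_j) covering stages 1..n in order,
    alloc[j]: processor of interval j; W[u]: speeds; B[(u, v)]: bandwidths,
    with the special keys 'in' and 'out' for Pin and Pout."""
    n = intervals[-1][1]
    prev = 'in'
    t = 0.0
    for j, (d, e) in enumerate(intervals):
        u = alloc[j]
        t += b[d - 1] / B[(prev, u)]          # receive the interval's input data
        t += sum(w[d:e + 1]) / W[u]           # process stages d..e
        prev = u
    t += b[n] / B[(prev, 'out')]              # return the final result to Pout
    return t
```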
Note that it may well be the case that different data sets have different
response times (because they are mapped onto different processor sets), hence
the response time is defined as the maximum response time over all data sets.
Intuitively, response time is small when all communications are zeroed out.
An obvious candidate mapping would be to map all stages on the fastest
processor, which will result in a large period. This is a general observation:
minimizing the response time is antagonistic to minimizing the period. The bibliographical notes at the end of the chapter point to works that tackle such bicriteria optimization problems.
Proof. We start with interval mappings, which are more natural to minimize
the response time as many communications will be zeroed out. On Fully Ho-
mogeneous platforms, the optimal solution is to map all stages on a single
(arbitrary) processor, because this zeroes out all communications except in-
put/output ones. On Communication Homogeneous platforms, the optimal
solution is to map all stages on the fastest processor, for the same reason.
All one-to-one mappings on Fully Homogeneous platforms have the same
response time. On Communication Homogeneous platforms, the optimal so-
lution is to assign the most computationally expensive stages to the fastest
processors, in a greedy manner (largest stage on fastest processor, second
largest on second fastest, and so on).
In Exercise 8.4 we see that the response time minimization problem is NP-
hard for one-to-one mappings on Fully Heterogeneous platforms. To the best
of our knowledge, the complexity is open for interval mappings on such plat-
forms, at least at the time this book is being written.
8.4 Hyperplane Scheduling (or Scheduling at Compile-Time)

The dream of automatically parallelizing arbitrary sequential programs has not (fully) come true yet, but in spite of many setbacks a lot of progress has been made in several directions. In this section, we focus solely
on the automatic parallelization of so-called uniform loop nests. We explain
simple but representative results, motivated by two seminal papers by Karp,
Miller and Winograd [72], and by Lamport [78] (see the bibliographical notes
at the end of the chapter for further information).
S1 : a←b+1
S2 : b←a−1
S3 : a←c−2
S4 : d←c
which reads the same variable c (but there is no write of c, hence the order
does not matter).
1 for i = 0 to N do
2 for j = 0 to N do
3 S1 (i, j) : a(i, j) = b(i, j − 6) + d(i − 1, j + 3)
4 S2 (i, j) : b(i + 1, j − 1) = c(i + 2, j + 5) + 1
5 S3 (i, j) : c(i + 3, j − 1) = a(i, j + 2)
6 S4 (i, j) : d(i, j − 1) = a(i, j − 1) − 1
    Dom = {(i, j) ∈ Z², 0 ≤ i ≤ N, 0 ≤ j ≤ N} .
Operation Si(I) is executed before operation Sj(J) in the sequential order, which we denote Si(I) <seq Sj(J), if and only if I <lex J, or I = J and Si <text Sj, where <lex is the lexicographic order of the iteration vectors, and <text is the textual order of statements in the program's source code. In the example, we have S3(2, 5) <seq S1(3, 1), S3(2, 5) <seq S2(2, 6) and S3(2, 5) <seq S4(2, 5).
There is a dependence between operation Si(I) and operation Sj(J) if:
• Si (I) is executed before Sj (J).
• Si (I) and Sj (J) refer to a memory location M , and at least one of these
references is a write.
• The memory location M is not written between iteration I and itera-
tion J.
We retrieve flow, anti and output dependences, depending on whether there
is a single write to M (flow if the read occurs after, anti if it comes before) or
two writes (output). The last condition is to ensure that the dependence is
indeed between successors in the dependence graph.
The dependence vector between iteration Si (I) and iteration Sj (J) is de-
fined as d(i,I),(j,J) = J − I. The loop nest is said to be uniform if the depen-
dence vectors d(i,I),(j,J) do not depend on either I or J, and we denote them
simply di,j . We point out that each dependence vector di,j is lexicographically
positive (its first nonzero component is greater than 0), due to the semantics
of the sequential execution: if there is a dependence from Si(I) to Sj(J), then I precedes J, i.e., I ≤lex J and di,j = J − I >lex 0. We can then
represent the dependences as an oriented graph with k nodes (the statements)
linked by edges corresponding to the dependence vectors.
In the example, Algorithm 8.2, we see that variable a(i, j) is produced
(written) by statement S1 (i, j) and consumed (read) by statement S4 (i, j +1),
hence a uniform dependence from S1 to S4 of vector d1,4 = (0, 1)ᵀ. How did we
find this? First, since S4 (i, j) reads a(i, j−1), then S4 (i, j+1) reads a(i, j), the
memory location written by S1 (i, j). We do have I = (i, j) <seq J = (i, j + 1).
No other operation writes into this location in between, since in this loop nest
every memory location is written only once. We have “manually” checked the
three conditions stated above.
How can we automate the process of finding dependences? For each candi-
date array and statement pair, we can write a system of equations, and try
to solve it. Let us do this for array a and the pair (S1 , S3 ). S1 writes into a
and S3 reads it, so we search if there exists a flow dependence from S1 (I) to
S3(J), for some I, J ∈ Dom with I <seq J. Letting I = (i, j) and J = (i′, j′) we obtain:

• i = i′ and j = j′ + 2: we write a(i, j) in S1(i, j), and read a(i′, j′ + 2) in S3(i′, j′);

• i < i′, or i = i′ and j < j′: this is the condition I <seq J.

We see that there is no solution, hence no flow dependence from S1 to S3. But we see (because we are used to looking at loop nests!) that there is an anti-dependence from S3 to S1. With the same notations, looking for an anti dependence from S3(J) to S1(I), for some I, J ∈ Dom with J <seq I, and letting J = (i′, j′), I = (i, j), we obtain:

• i′ = i and j′ + 2 = j: we read a(i′, j′ + 2) in S3(i′, j′), and write a(i, j) in S1(i, j);
[FIGURE 8.16: the dependence graph of the loop nest, with nodes S1, S2, S3, S4 and edges labeled by the uniform dependence vectors.]
The dependence graph is shown in Figure 8.16. It captures all the informa-
tion extracted from the dependence analysis. Note that this information is not
100% accurate. Each uniform dependence seems to occur for each operation
over the entire iteration domain, while it is not the case at the boundary. For
instance with the anti dependence from S3 to S1 , the condition j = j 0 + 2 can-
not be satisfied if j 0 = N − 1 or j 0 = N . However, the condition is satisfied on
most points of the domain, so it makes sense to approximate dependences as
we did in the graph. What is important is to always over-approximate, i.e., to record more dependences than may actually exist. Over-approximations
can reduce parallelism but will never lead to a violation of the semantics of
the original program, while under-approximations are simply. . . dangerous.
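The same dependence tests can be mechanized by brute force on a small domain. The sketch below enumerates flow dependences only (anti and output dependences would be handled analogously) and ignores the "no intermediate write" condition, which is a safe over-approximation as argued above:

```python
from itertools import product

N = 12   # a small domain is enough to exhibit the uniform dependences

def accesses(i, j):
    """Memory accesses of each statement at iteration (i, j):
    (statement name, cells written, cells read)."""
    return [
        ("S1", [("a", i, j)],         [("b", i, j - 6), ("d", i - 1, j + 3)]),
        ("S2", [("b", i + 1, j - 1)], [("c", i + 2, j + 5)]),
        ("S3", [("c", i + 3, j - 1)], [("a", i, j + 2)]),
        ("S4", [("d", i, j - 1)],     [("a", i, j - 1)]),
    ]

# flow dependences S_src(I) -> S_dst(J): a write followed (in <seq order) by a read
vectors = set()
for (i, j), (i2, j2) in product(product(range(N + 1), repeat=2), repeat=2):
    for si, (s1, writes1, _) in enumerate(accesses(i, j)):
        for sj, (s2, _, reads2) in enumerate(accesses(i2, j2)):
            before = (i, j) < (i2, j2) or ((i, j) == (i2, j2) and si < sj)
            if before and set(writes1) & set(reads2):
                vectors.add((s1, s2, (i2 - i, j2 - j)))
print(sorted(vectors))   # ('S1', 'S4', (0, 1)) appears, but no S1 -> S3 pair
```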
Here n is the nest depth, i.e., the dimension of index points, a is the number
of constraints that define the shape of the domain, and m is the number of
dependence vectors, so that the constraint matrix A is of dimension a × n and
the dependence matrix D is of dimension n × m.
Back to the example in Figure 8.16, we have Dom = {(i, j) ∈ Z², 0 ≤ i ≤ N, 0 ≤ j ≤ N}, which translates into

    A = ( −1   0
           1   0
           0  −1
           0   1 )    and    b = (0, N, 0, N)ᵀ = N × b₀ ,   where b₀ = (0, 1, 0, 1)ᵀ .
A linear schedule σπ maps each iteration point p ∈ Dom to the time-step σπ(p) = ⌊π · p⌋, in such a way that dependences are preserved. The vector π ∈ Q^{1×n} (π has rational coefficients) is called an admissible scheduling vector. Not any vector π can be chosen as an admissible scheduling vector, since dependences must be preserved.
The basic idea of linear schedules is to try to transform the original uniform
loop nests into an equivalent nest in which all the internal loops except the
first one are parallel, as in:
The external loop corresponds to an iteration time and E(time) is the set
of all the points computed at the step time. Such points must not be linked
by dependence vectors so that they can be executed simultaneously. With
a linear schedule σπ, we have E(time) = {p ∈ Dom, ⌊π · p⌋ = time}. The
scheduling vector π defines a family of affine hyperplanes H(t) orthogonal to
π such that the set of points executed at a time t is equal to the intersection
of the domain with the hyperplane H(t). The flow of computation goes from
H(t) to H(t + 1) by a translation; hence the name of the method that we
describe in this section: the computations progress like a wavefront parallel
to the family of hyperplanes H(t).
Admissible scheduling vectors π are easy to characterize for uniform loop
nests:
    πd ≥ 1 for each dependence vector d ∈ D.

Let us explain this: if p1, p2 ∈ Dom are two points such that p1 ≺ p2, i.e., p2 = p1 + di for some di ∈ D, we must have σπ(p1) < σπ(p2), i.e., ⌊πp1⌋ + 1 ≤ ⌊π(p1 + di)⌋. This inequality is satisfied if πdi ≥ 1. We retrieve Lamport's condition:

    πD ≥ 1 ,
which means that πdi ≥ 1 for all di ∈ D. Note that this condition is sufficient for π to be an admissible scheduling vector, and it is necessary unless the domain Dom is very small, a situation very unlikely to occur in practice.
How can we find an admissible scheduling vector? It turns out to be straightforward, owing to the fact that all dependence vectors are lexicographically positive. Assume that the columns of matrix D = (d1, . . . , dm), whose dimension is n × m, have been sorted lexicographically. Let k1 be the index of the first nonzero component of d1, the "smallest" vector of D. Since d1 is lexicographically positive, d1,k1 > 0. Take πk1 = 1 and πk = 0 for k1 < k ≤ n. Before defining the first components of π, remark that whatever their values, πd1 ≥ 1. Now, let k2 be the index of the first nonzero component of d2. Because d1 is lexicographically smaller than d2, k2 ≤ k1. Let πk = 0 for k2 < k < k1 and take for πk2 the smallest positive integer such that πd2 ≥ 1, modifying πk1 if necessary when k2 = k1. Continuing the process, we obtain an admissible vector.

Let us execute this procedure for the example in Figure 8.16 to find an admissible vector π = (π1, π2). After sorting the dependence matrix, we have

    D = ( 0   0   1   1   1
          1   2  −6  −4   5 ) .

We have n = 2, k1 = 2, and we take π2 = 1 so that πd1 ≥ 1, whatever the value of π1. We have k2 = 2, and πd2 ≥ 1, so we do not need to change the value of π2. Next k3 = 1 and the condition πd3 ≥ 1 writes π1 − 6π2 ≥ 1; so we take π1 = 7. We can keep the value π = (7, 1) because we already have πd4 ≥ 1 and πd5 ≥ 1.
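The greedy construction is only a few lines of code. The sketch below uses a slightly different but equivalent presentation (a monotone fix-up of the π component associated with the first nonzero entry of each column); on the example it returns the same vector π = (7, 1):

```python
from math import ceil

def admissible_vector(D):
    """Greedy construction of an admissible scheduling vector pi
    (pi . d >= 1 for every column d of D), assuming all columns are
    lexicographically positive and sorted in increasing lexicographic order."""
    n = len(D[0])
    pi = [0] * n
    for d in D:
        k = next(i for i, x in enumerate(d) if x != 0)   # first nonzero entry
        dot = sum(p * x for p, x in zip(pi, d))
        if dot < 1:
            pi[k] += ceil((1 - dot) / d[k])   # d[k] > 0 by lexico-positivity
        # increasing pi[k] cannot invalidate previously processed columns:
        # their first nonzero index is >= k, so their dot product only grows
    return pi

# columns of D for the example of Figure 8.16, sorted lexicographically
D = [(0, 1), (0, 2), (1, -6), (1, -4), (1, 5)]
print(admissible_vector(D))   # [7, 1], as found by hand above
```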
We know how to find an admissible scheduling vector. But how do we find the optimal one? We first start with the example in Figure 8.16, and then move to the general procedure. As discussed above, the scheduling vector π = (π1, π2) is admissible if and only if π2 ≥ 1 and π1 ≥ 1 + 6π2, which implies that both π1 and π2 are positive. The total execution time of σπ is given by the difference between the largest and the smallest value of ⌊π · p⌋ over the domain (plus one).
Consider the rational relaxation of the iteration domain, {p ∈ Qⁿ, Ap ≤ b}. To see why we may lose at most two time-steps when moving from Dom to this relaxed domain, we simply write the following:
the existence of the maximum implying the existence of the minimum. Thus,
the search for the optimal scheduling vector can be done by solving the fol-
lowing linear problem:
    πD ≥ 1
    X1 A = π
    X2 A = −π
    X1 ≥ 0
    X2 ≥ 0
    min (X1 + X2) b
We remark that this problem is linear in b. The search for the best scheduling vector over the family of domains {Ax ≤ N b} is reduced to the search on the domain {Ax ≤ b}, which can be done without knowing the parameter N, thus at compile-time.
For the example in Figure 8.16, we let π = (π1, π2), X1 = (a1, b1, c1, d1) and X2 = (a2, b2, c2, d2). We solve the following optimization problem:

    πD ≥ 1 :    π2 ≥ 1, 2π2 ≥ 1, π1 − 6π2 ≥ 1, π1 − 4π2 ≥ 1, π1 + 5π2 ≥ 1
    X1 A = π :  −a1 + b1 = π1, −c1 + d1 = π2
    X2 A = −π : −a2 + b2 = −π1, −c2 + d2 = −π2
    X1 ≥ 0 :    a1 ≥ 0, b1 ≥ 0, c1 ≥ 0, d1 ≥ 0
    X2 ≥ 0 :    a2 ≥ 0, b2 ≥ 0, c2 ≥ 0, d2 ≥ 0
    min N × (X1 + X2) b₀ :  N × (b1 + b2 + d1 + d2)

To solve this problem, we can use the simplex method provided by packages such as GLPK [63] or Maple [39]. We obtain the solution: π = (7, 1), X1 = (0, 7, 0, 1) and X2 = (7, 0, 1, 0). The total execution time is T*linear = N × (7 + 0 + 1 + 0) = 8N.
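For completeness, here is how this particular linear program could be handed to an off-the-shelf solver (scipy.optimize.linprog here; GLPK or Maple, cited above, work equally well). The variable ordering and the sign conventions are ours:

```python
import numpy as np
from scipy.optimize import linprog

# variables: x = (pi1, pi2, a1, b1, c1, d1, a2, b2, c2, d2)
D = np.array([[0, 0, 1, 1, 1],
              [1, 2, -6, -4, 5]])                 # columns = dependence vectors
n_var = 10
# pi . d >= 1 for each column d   <=>   -(pi . d) <= -1
A_ub = np.zeros((D.shape[1], n_var)); A_ub[:, 0:2] = -D.T
b_ub = -np.ones(D.shape[1])
# X1 A = pi and X2 A = -pi (A is the 4x2 constraint matrix of Dom)
A_eq = np.array([
    # pi1 pi2  a1  b1  c1  d1  a2  b2  c2  d2
    [-1,   0, -1,  1,  0,  0,  0,  0,  0,  0],    # -a1 + b1 = pi1
    [ 0,  -1,  0,  0, -1,  1,  0,  0,  0,  0],    # -c1 + d1 = pi2
    [ 1,   0,  0,  0,  0,  0, -1,  1,  0,  0],    # -a2 + b2 = -pi1
    [ 0,   1,  0,  0,  0,  0,  0,  0, -1,  1],    # -c2 + d2 = -pi2
])
b_eq = np.zeros(4)
cost = np.zeros(n_var); cost[[3, 5, 7, 9]] = 1    # minimize b1 + d1 + b2 + d2
bounds = [(None, None)] * 2 + [(0, None)] * 8     # pi free, X1 and X2 nonnegative
res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[:2], res.fun)   # expected: objective 8, i.e., total time 8N
```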
The last thing to do is to rewrite the loop nest to expose the parallelism.
We consider again the example of Figure 8.16. We need to transform the
original loop nest into
The trick is that the 2 × 2 matrix is unimodular, i.e., has an integral inverse:
    ( i )     ( 0   1 ) ( time )
    ( j )  =  ( 1  −7 ) ( proc ) .
Because 0 ≤ i, j ≤ N we can compute the new loop bounds for time and proc:

• time = 7i + j, hence timemin = 0 and timemax = 8N.

• i = proc, hence 0 ≤ proc ≤ N. Also, j = time − 7proc, hence 0 ≤ time − 7proc ≤ N. We derive ⌈(time − N)/7⌉ ≤ proc ≤ ⌊time/7⌋.
1  for time = 0 to 8N do
2      for proc = max(0, ⌈(time − N)/7⌉) to min(N, ⌊time/7⌋) do
3          a(proc, time − 7proc) ← b(proc, time − 7proc − 6) + d(proc − 1, time − 7proc + 3)
4          b(proc + 1, time − 7proc − 1) ← c(proc + 2, time − 7proc + 5) + 1
5          c(proc + 3, time − 7proc − 1) ← a(proc, time − 7proc + 2)
6          d(proc, time − 7proc − 1) ← a(proc, time − 7proc − 1) − 1
we can execute all iterations of the inner loop simultaneously on distinct pro-
cessors (hence the name ”proc” for the first component). Again, it is possible
to automate the process of rewriting the loop nest in the general case. This
relies on sophisticated mathematical tools such as Hermite normal forms and
Fourier-Motzkin elimination, and we point the reader to references [50, 5] for
more details.
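A quick sanity check of the transformation is to verify that the new loop bounds sweep exactly the original iteration domain; here is a throw-away sketch (the ceiling in the lower bound is the one derived above):

```python
from math import ceil, floor

N = 10
original = {(i, j) for i in range(N + 1) for j in range(N + 1)}
transformed = set()
for time in range(0, 8 * N + 1):
    lo = max(0, ceil((time - N) / 7))
    hi = min(N, floor(time / 7))
    for proc in range(lo, hi + 1):
        i, j = proc, time - 7 * proc        # the unimodular change of variables
        transformed.add((i, j))
print(transformed == original)              # True: same iteration points, reordered
```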
Bibliographical Notes
The divisible load model has been widely studied in the last several years,
after having been popularized by the landmark book by Bharadwaj, Ghose,
Mani and Robertazzi [32]. See also the introductory papers [33, 103]. Recent
results on star and tree networks are surveyed in [19].
Steady-state scheduling was pioneered by Bertsimas and Gamarnik [29].
See [13, 20, 21, 18] for recent applications of the technique.
Workflow scheduling is a hot topic. A few papers related to the material
covered in this chapter are [114, 30, 31, 111, 24]. Bicriteria optimization (response time with period constraints, or the converse) is discussed in [115, 22, 23, 118].
Finally, loop nest scheduling dates back to the seminal papers of Karp,
Miller and Winograd [72] and Lamport [78]. Many loop transformation al-
gorithms are described in the books by Banerjee [10], by Wolfe [120], and by
Allen and Kennedy [5]. An introduction to the topic of uniform recursions,
loop parallelization and software pipelining is provided in the book by Darte,
Robert and Vivien [50].
8.5 Exercises
We offer quite a comprehensive set of exercises. Exercise 8.1 is devoted
to divisible load scheduling on tree platforms. Exercise 8.2 deals with the
steady-state approach for multiple applications. The next two exercises are
on the topic of workflow scheduling: Exercise 8.3 tackles the classic chains-to-
chains problem, while Exercise 8.4 explores the complexity of response time
minimization. Finally, the last two exercises study dependences in loop nests.
Exercise 8.5 applies basic techniques from Section 8.4. Exercise 8.6 is more
involved and investigates the problem of removing dependence cycles.
[FIGURE 8.17: a tree-shaped platform with a master P0 and five workers P1, . . . , P5; worker Pi has cycle-time wi and its incoming link has communication cost ci.]
1. Consider the tree shown in Figure 8.17, with a master P0 and 5 workers
P1 to P5 . The cycle-time of worker Pi is wi . Let Wtotal be the total amount
of work, and let αi be the fraction executed by Pi for 1 ≤ i ≤ 5. Assume that c1 ≤ c2, so that the master P0 will serve P1 before P2 in the optimal solution
(same proof as in Lemma 8.3). Write the linear program that characterizes
the optimal makespan T for the tree.
[Figure: a single-level tree (a root with link cost c0 and cycle-time w0, and p children with link costs ci and cycle-times wi) is replaced by an equivalent single node with link cost c0.]
    Maximize W,
    subject to
        (1) αi ≥ 0,  0 ≤ i ≤ p
        (2) Σ_{i=0}^{p} αi = W
        (3) W c0 + α0 w0 = 1
        (4) W c0 + Σ_{j=1}^{i} αj cj + αi wi = 1,  1 ≤ i ≤ p
3. Explain how the previous reduction can be used to compute the optimal
solution for a general multi-level tree. Can you compare this approach to that
in Question 1?
• Application k has weight w(k) , and each task of type k involves b(k)
bytes and c(k) operations.
1. Let α(k) be the throughput achieved for application k, i.e., the number of
tasks of type k that are executed on the platform every time unit in steady-
state mode. Explain why maximizing

    min_k  α(k) / w(k)

is a more meaningful objective than maximizing

    Σ_k  α(k) / w(k) .

2. Write a linear program to maximize the former objective function min_k { α(k) / w(k) }.
    max_{1≤k≤p}  Σ_{i∈Ik} ai  =  max_{1≤k≤p}  Σ_{i=dk}^{ek} ai .
2. Give a binary search algorithm to solve the problem. What is its com-
plexity?
3. What are the admissible scheduling vectors for the previous loop nest?
Determine the one that minimizes the execution time. Use this vector to
re-write the loop nest so that the innermost loop is parallel.
2. In the general case, the node splitting technique proceeds as shown below:
    for i = 1 to N do                        for i = 1 to N do
        Sk : lhs(f(i)) ← rhs(. . . )             S′k : temp(f(i)) ← rhs(. . . )
        ...                                      Sk : lhs(f(i)) ← temp(f(i))
        ...                                      ...
        Si : . . . ← lhs(g(i))                   Si : . . . ← temp(g(i))
Assume that the access function lhs (standing for left-hand side) is one-
to-one. Show the effect of the node splitting technique on the six possible
dependence types: dependences incoming to Sk of type flow, anti and output,
and dependences outgoing from Sk of type flow, anti and output.
3. Let G be the RDG of a loop nest, and let G′ be the graph obtained after splitting all the nodes. Show that a cycle in G′ is either uniquely composed of flow dependences, or uniquely composed of output dependences. In addition, show that any cycle in G′ corresponds to an existing cycle in G.
8.6 Answers
Exercise 8.1 (Divisible Load Scheduling on Heterogeneous
Trees)
. Question 1. We give the linear program and comment on the equations
afterwards:
    Minimize T,
    subject to
        (i)  αi ≥ 0,  1 ≤ i ≤ 5
        (ii) Σ_{i=1}^{5} αi = Wtotal
        (0)  (α1 + α3)c1 + (α2 + α4 + α5)c2 ≤ T
        (1)  (α1 + α3)c1 + α1 w1 ≤ T
        (1') (α1 + α3)c1 + α3 c3 ≤ T
        (2)  (α2 + α4 + α5)c2 + α2 w2 ≤ T
        (2') (α2 + α4 + α5)c2 + α4 c4 ≤ T
        (3)  (α1 + α3)c1 + α3 c3 + α3 w3 ≤ T
        (4)  (α2 + α4 + α5)c2 + α4 c4 + α4 w4 ≤ T
        (4') (α2 + α4 + α5)c2 + α4 c4 + α5 c5 ≤ T
        (5)  (α2 + α4 + α5)c2 + α4 c4 + α5 c5 + α5 w5 ≤ T
Equation (0) expresses the total communication time for the master P0 . Equa-
tion (1) says that after the end of its incoming communication, internal node
P1 should be constantly computing. Equation (1’) says that after the end of
its incoming communication, P1 should be constantly sending data to
P3 while computing. Note that Equation (1’) is useless, due to Equation (3).
Equation (3) states that P3 keeps computing after receiving its own data,
which comes from the master together with the data for P1 (hence the term
(α1 + α3 )c1 ), and which is then forwarded by P1 (hence the term α3 c3 ). Fi-
nally, Equations (2), (4) and (5) are the counterpart for the right side of the
tree (equations (2’) and (4’) are useless as well).
. Question 2. Here, instead of minimizing the time Tf required to execute
load W , we aim at determining the maximum amount of work W that can be
processed in one time-unit. Obviously, after the end of the incoming commu-
nication, the parent P0 should be constantly computing (equation (3)). We
know that all children (i) participate in the computation and (ii) terminate
execution at the same time. Finally, the optimal ordering for the children is
given by Lemma 8.3. Equation (4) expresses the communication and com-
putation time for each child, using the optimal ordering. This completes the
proof. Note that because equations (3) and (4) are equalities, we could derive
a closed-form expression for w−1 = 1/W .
. Question 3. First we traverse the tree from bottom to top, replacing each
single-level tree by the equivalent node. We do this until there remains a single
star. We solve the problem for the star, using the results of Section 8.1.4.
Then, we traverse the tree from top to bottom, and undo each transformation
in the reverse order. Going back to a reduced node, we know for how long
it is supposed to be working. We know the optimal ordering of its children
and we know for which amount of time each of the children is supposed to
work. If one of these children is a leaf node, we have computed its load. If it
is instead a reduced node, we apply the transformation recursively.
Instead of this pair of tree traversals, we could write down the linear pro-
gram for a general tree, exactly as we did for the example in Figure 8.17.
Briefly, when it receives data a given node knows exactly what to do: com-
pute itself all the remaining time, and forward data to its children in de-
creasing bandwidth order. The problem is that the size of the linear program
would grow proportionally to the size of the tree. Consequently, the recursive
solution is preferred, at least for large platforms.
With the above equation we simply try all possible splits into [1, s] (with k − 1 intervals) and the single interval [s + 1, i]. We can speed up computations by pre-computing all values f(s, i) = Σ_{j=s+1}^{i} aj in time O(n). The complexity becomes O(n²p).
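The recurrence referred to above is not reproduced here (it was stated earlier in the exercise); assuming the standard formulation g(i, k) = min_{s<i} max(g(s, k − 1), f(s, i)), a direct implementation with precomputed prefix sums reads:

```python
def chains_to_chains(a, p):
    """Partition a[1..n] (a[0] unused) into p intervals, minimizing the
    maximum interval sum.  Assumed recurrence:
    g(i, k) = min over s < i of max(g(s, k-1), f(s, i)),
    where f(s, i) = a[s+1] + ... + a[i]."""
    n = len(a) - 1
    prefix = [0] * (n + 1)
    for i in range(1, n + 1):
        prefix[i] = prefix[i - 1] + a[i]
    f = lambda s, i: prefix[i] - prefix[s]            # sum of a[s+1..i], O(1)
    INF = float("inf")
    g = [[INF] * (p + 1) for _ in range(n + 1)]
    g[0][0] = 0
    for k in range(1, p + 1):
        for i in range(1, n + 1):
            g[i][k] = min(max(g[s][k - 1], f(s, i)) for s in range(i))
    return g[n][p]

print(chains_to_chains([0, 4, 2, 7, 3, 5], 2))   # best maximum interval sum: 13
```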
To re-write the loop nest, we do nothing, because we can use the identity
matrix as the unimodular transformation matrix. We merely mark the second
loop as parallel (to show that we have worked hard):
for i = 1 to N do
for j = 1 to N in parallel do
S1 : a(i+1, j−1) ← b(i, j+4)+c(i−2, j−3)+1
S2 : b(i − 1, j) ← a(i, j) − 1
S3 : c(i, j) ← a(i, j − 2) + b(i, j)
[Figure: the dependence graph of the loop after splitting S1: nodes S′1, S1 and S2, with a flow dependence (f) and an output dependence (o).]
The cycle has been broken. We can split (or “distribute”) the loop to exhibit
the parallelism:
for i = 1 to N do
S′1 : temp(i) ← b(i) + c(i)
a(1) ← temp(1)
for i = 1 to N do
S2 : a(i + 1) ← temp(i) + 2d(i)
. Question 2. We examine the six possibilities. We obtain the following
transformations:
[Three before/after diagrams showing the effect of node splitting on dependences incoming to Sk (flow fin, anti ain, output oin): in each case Sk is split into S′k : temp(f(i)) ← rhs(. . . ) followed by Sk : lhs(f(i)) ← temp(f(i)), the two being linked by a new flow dependence fnew.]
FIGURE 8.22: Flow dependence outgoing from Sk , before and after node
splitting.
FIGURE 8.23: Anti dependence outgoing from Sk , before and after node
splitting.
dependence incoming to or outgoing from the new node S′k. Next, there are only edges corresponding to output dependences that go out from Si. Therefore, the edge following e in the cycle C also corresponds to an output dependence. In conclusion, C is uniquely composed of output dependence edges. And all edges in C are edges that already existed in G.

• If e corresponds to an anti dependence, then e goes from a node S′k to a node Si. The only edges going out from Si are output dependence edges. Therefore, the edge following e in the cycle C is an output dependence edge. Reasoning as before, C is uniquely composed of output dependence edges, which contradicts our assumption regarding e. Therefore, no cycle in G′ may include an anti dependence edge.

• If e corresponds to a flow dependence, then either e has been created when splitting a node Sk (and then goes from S′k to Sk), or e corresponds to an existing edge going from Sk to Si in G (and then goes from S′k to S′i). We study both cases:

    – e : S′k → Sk (the new dependence fnew): all edges going out from Sk correspond to output dependences, so the edge following e in the cycle C is an output dependence edge.
FIGURE 8.24: Output dependence outgoing from Sk , before and after node
splitting.
[Figure: summary of the node splitting transformation: the original node Sk, with incoming dependences fin, ain, oin and outgoing dependences fout, aout, oout, is replaced by the pair S′k and Sk linked by the new flow dependence fnew.]
In summary, the splitting technique has not introduced any new dependence
cycle. It has made it possible to break all cycles, except those uniquely com-
posed of flow dependences, and those uniquely composed of output depen-
dences.
[Figure: the reduced dependence graph (RDG) of the loop nest: nodes S1, S2, S3, S4, with flow (f), anti (a) and output (o) dependence edges.]
The RDG is strongly connected. Splitting nodes S2 and S3 leads to the RDG
shown in Figure 8.27.
[FIGURE 8.27: the RDG after splitting S2 and S3: the new nodes S′2 and S′3 are inserted, and no cycle remains.]
There is no cycle in this new graph. We rewrite the loop nest using tempo-
rary variables atemp (for the split of S3 ) and btemp (for the split of S2 ):
    for i = 4 to N do
        S1 : a(i + 5) ← c(i − 3) + b(2i + 2)
        S′2 : btemp(2i) ← atemp(i − 1) + 1   if i ≥ 5
                          a(i − 1) + 1       if i = 4
        S2 : b(2i) ← btemp(2i)
        S′3 : atemp(i) ← c(i + 5) + 1
        S3 : a(i) ← atemp(i)
        S4 : c(i) ← btemp(2i − 4)   if i ≥ 6
                    b(2i − 4)       if i ≤ 5