
CSC4305: Parallel Programming

Lecture 3: Parallel Software - 1

Sana Abdullahi Mu’az & Ahmad Abba Datti


Bayero University, Kano
Part 1: Foundations

• Introduction

• Parallel Hardware

• Parallel Software - 1
Roadmap
• Low Level Programming Models
• Shared Memory Model (with Threads)
• Message Passing Model

• High Level Programming Model


• Single Program Multiple Data

• Issues in Parallel Programming


• Parallelizability
• Inhibitors
• Partitioning
• Communication and Synchronization
• Data Dependence
• Load Balancing
• Granularity
• Input/Output
Parallel software
Overview
• There are a few parallel programming models in common use:
• Shared Memory Models (with Threads)
• Message Passing Models
• GPU Programming Models

• Parallel programming models exist as an abstraction above hardware and memory architectures.
Shared Memory Model with Threads
• In the shared-memory programming model, tasks share a
common address space, which they read and write
(a)synchronously.
• Various mechanisms such as locks / semaphores may be used to
control access to the shared memory.
• An advantage of this model from the programmer's point of view
is that the notion of data "ownership" is lacking, so there is no
need to specify explicitly the communication of data between
tasks. Program development can often be simplified.
• An important disadvantage in terms of performance is that it
becomes more difficult to understand and manage data locality.
Threads

• Threads are commonly associated with shared memory architectures and operating systems.
Threads
• Perhaps the simplest analogy that can be used to describe threads is the concept of a single program that includes a number of subroutines:

• The main program a.out is scheduled to run by the native operating system. a.out loads and acquires all of the necessary system and user resources to run.

• a.out performs some serial work, and then creates a number of tasks
(threads) that can be scheduled and run by the operating system in parallel.

• Each thread has local data, but also, shares the entire resources of a.out.
This saves the overhead associated with replicating a program's resources
for each thread. Each thread also benefits from a global memory view
because it shares the memory space of a.out.
Threads
• A thread's work may best be described as a subroutine within
the main program. Any thread can execute any subroutine at the
same time as other threads.

• Threads communicate with each other through global memory (by updating address locations). This requires synchronization constructs to ensure that no two threads update the same global address at the same time.

• Threads can come and go, but a.out remains present to provide
the necessary shared resources until the application has
completed.
Implementations
• From a programming perspective, threads implementations commonly
comprise:
• A library of subroutines that are called from within parallel source code
• A set of compiler directives embedded in the source code

• In both cases, the programmer is responsible for determining all parallelism.


• Historically, hardware vendors have implemented their own proprietary
versions of threads. These implementations differed substantially from each
other making it difficult for programmers to develop portable threaded
applications.

• Unrelated standardization efforts have resulted in different thread implementations, such as POSIX Threads (Pthreads), OpenMP, and Java multithreading.
Implementations
• POSIX Threads
• Library based;
• C Language only
• Commonly referred to as Pthreads.
• Very explicit parallelism; requires significant programmer attention to
detail.

• OpenMP
• Portable / multi-platform, including Unix and Windows NT platforms
• Available in C/C++ and Fortran implementations

• Microsoft has its own implementation for threads, which is not related to the
UNIX POSIX standard or OpenMP.
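
• As a minimal sketch of the library-based (Pthreads) style, using standard POSIX calls (compile with -pthread on a POSIX system; the worker routine here is only illustrative):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Each thread runs this routine; the argument carries its id. */
void *do_work(void *arg)
{
    long id = (long) arg;
    printf("Hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    /* a.out does some serial work, then creates the threads ... */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, do_work, (void *) i);

    /* ... and waits for them all to finish before exiting. */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}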
Message Passing Model
• A set of tasks use their own local memory during computation. Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines.

• Tasks exchange data through communications by sending and receiving messages.

• Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.
Message Passing Model Implementations: MPI

• From a programming perspective, message passing implementations commonly comprise a library of subroutines that are embedded in source code. The programmer is responsible for determining all parallelism.

• Historically, a variety of message passing libraries have been available since the 1980s. These implementations differed substantially from each other making it difficult for programmers to develop portable applications.

• Examples include pyMPI, MPICH, Open MPI
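
• A hedged sketch of the cooperative send/receive pattern, using standard MPI calls (run with two processes, e.g. mpirun -np 2; error handling omitted):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* the send posted by task 0 ... */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ... must be matched by a receive on task 1 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d from task 0\n", value);
    }

    MPI_Finalize();
    return 0;
}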


Single Program Multiple Data (SPMD)

• SPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models.

• A single program is executed by all tasks simultaneously.

• At any moment in time, tasks can be executing the same or different instructions within the same program.
Single Program Multiple Data (SPMD)

• SPMD programs usually have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are designed to execute. That is, tasks do not necessarily have to execute the entire program - perhaps only a portion of it.

• All tasks may use different data

SPMD – single program multiple data

• An SPMD program consists of a single executable that can behave as if it were multiple different programs through the use of conditional branches.

if (I’m thread/process i)
    do this;
else
    do that;
Issues in Parallel Programming: Parallelizability

• Undoubtedly, the first step in developing parallel software is to understand the problem that you wish to solve in parallel. If you are starting with a serial program, this also necessitates understanding the existing code.

• Before spending time in an attempt to develop a parallel solution for a problem, determine what part of the problem can actually be parallelized. (0% to 100%)
Example of Parallelizable Problem

• Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum energy conformation.

• This problem can be solved in parallel. Each of the molecular conformations is independently determinable. The calculation of the minimum energy conformation is also a parallelizable problem.
Example of a Non-parallelizable Problem

Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula:
F(k + 2) = F(k + 1) + F(k)

• This is a non-parallelizable problem because the calculation of the Fibonacci sequence as shown entails dependent calculations rather than independent ones. The calculation of the k + 2 value uses those of both k + 1 and k. These three terms cannot be calculated independently and therefore cannot be calculated in parallel.
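
• The dependence is easy to see when the formula is written as a loop (a simple serial sketch, not a parallel program):

#include <stdio.h>

int main(void)
{
    long f[40];
    f[0] = 1;
    f[1] = 1;
    /* Iteration k cannot start until iterations k-1 and k-2 have
       produced their results: a loop-carried dependence, so the
       iterations cannot be divided among independent tasks. */
    for (int k = 0; k + 2 < 40; k++)
        f[k + 2] = f[k + 1] + f[k];
    printf("F(40) = %ld\n", f[39]);
    return 0;
}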
Parallel Inhibitors

• Identify inhibitors to parallelism.

• Are there areas that are disproportionately slow, or that cause parallelizable work to halt or be deferred? For example, I/O is usually something that slows a program down.

• One common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence above.
Partitioning

• One of the first steps in designing a parallel program is to break the problem into discrete "chunks" of work that can be distributed to multiple tasks. This is known as decomposition or partitioning.

• There are two basic ways to partition computational work among parallel tasks:
• Data parallelism (Domain decomposition)
• Task parallelism (Functional decomposition)
Domain Decomposition
• In this type of partitioning, the data associated with a problem is
decomposed. Each parallel task then works on a portion of the data.
Partitioning Data
• There are different ways to partition data
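• As one common sketch, a block partition gives each task a contiguous slice of an array (cyclic and block-cyclic partitions are the usual alternatives). The helper below is illustrative and assumes each task knows its own id and the total number of tasks:

/* Block decomposition of N elements over 'ntasks' tasks.
   Task 'id' works on indices [first, last).  The first N % ntasks
   tasks take one extra element so the remainder is spread evenly. */
void block_range(int N, int ntasks, int id, int *first, int *last)
{
    int base = N / ntasks;
    int rem  = N % ntasks;

    *first = id * base + (id < rem ? id : rem);
    *last  = *first + base + (id < rem ? 1 : 0);
}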
Functional Decomposition
• In this approach, the focus is on the computation that is to be performed rather
than on the data manipulated by the computation. The problem is decomposed
according to the work that must be done. Each task then performs a portion of
the overall work.
• Functional decomposition lends itself well to problems that can be split into
different tasks. For example
• Ecosystem Modeling
• Signal Processing
• Climate Modeling
Ecosystem Modeling
• Each program calculates the population of a given group, where
each group's growth depends on that of its neighbours.
• As time progresses, each process calculates its current state, then
exchanges information with the neighbour populations.
• All tasks then progress to calculate the state at the next time step.
Signal Processing
• An audio signal data set is passed through four distinct computational filters.
Each filter is a separate process.

• The first segment of data must pass through the first filter before progressing to
the second. When it does, the second segment of data passes through the first
filter. By the time the fourth segment of data is in the first filter, all four tasks are
busy.
Climate Modeling
• Each model component can be thought of as a separate task. Data is exchanged between components during computation: the atmosphere model generates wind velocity data that are used by the ocean model, the ocean model generates sea surface temperature data that are used by the atmosphere model, and so on.
• Combining these two types of problem decomposition is common and natural.
Who Needs Communications?

• You DON'T need communications


• Some types of problems can be decomposed and executed in parallel with virtually no
need for tasks to share data. For example, imagine an image processing operation
where every pixel in a black and white image needs to have its color reversed. The image
data can easily be distributed to multiple tasks that then act independently of each
other to do their portion of the work.
• These types of problems are often called embarrassingly parallel because they are so straightforward; very little inter-task communication is required (see the pixel-inversion sketch after this slide).

• You DO need communications


• Most parallel applications are not quite so simple, and do require tasks to share data with each other. For example, a 3-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data. Changes to neighboring data have a direct effect on that task's data.
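
• To make the embarrassingly parallel case concrete, a minimal sketch of the pixel-inversion operation (the function and argument names are illustrative):

/* Invert every pixel of an 8-bit grayscale image.  Each pixel is
   computed independently, so the rows (or any other partition of the
   pixels) can be handed to separate tasks with no communication
   between them. */
void invert_image(unsigned char *pixels, int width, int height)
{
    for (int i = 0; i < width * height; i++)
        pixels[i] = 255 - pixels[i];
}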
Factors to Consider (1)

• Cost of communications
• Inter-task communication almost always implies overhead.
• Machine cycles and resources that could be used for computation are
instead used to package and transmit data.
• Communications frequently require some type of synchronization between
tasks, which can result in tasks spending time "waiting" instead of doing
work.
• Competing communication traffic can saturate the available network
bandwidth, further aggravating performance problems.
Factors to Consider (2)
• Latency vs. Bandwidth
• latency is the time it takes to send a minimal (0 byte) message from point A
to point B. Commonly expressed in microseconds.
• bandwidth is the amount of data that can be communicated per unit of
time. Commonly expressed in megabytes/sec.

• Sending many small messages can cause latency to dominate communication overheads. Often it is more efficient to package small messages into a larger message, thus increasing the effective communications bandwidth.
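• As a rough illustration (assumed numbers, not measurements): with a latency of 1 microsecond and a bandwidth of 1 GB/s, sending 100 separate 100-byte messages costs about 100 x (1 + 0.1) = 110 microseconds, while sending the same data packed into one 10 KB message costs about 1 + 10 = 11 microseconds, roughly a tenfold gain in effective bandwidth.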
Factors to Consider (3)

• Visibility of communications
• With the Message Passing Model, communications are explicit and
generally quite visible and under the control of the programmer.

• With the Shared Memory Model, communications often occur transparently to the programmer. The programmer may not even be able to know exactly how inter-task communications are being accomplished.
Factors to Consider (4)
• Synchronous Communications
• Synchronous communications require some type of "handshaking" between tasks that are sharing
data. This can be explicitly structured in code by the programmer, or it may happen at a lower level
unknown to the programmer.

• Synchronous communications are often referred to as blocking communications since other work
must wait until the communications have completed.

• Asynchronous communications
• Asynchronous communications allow tasks to transfer data independently from one another. For example, task 1 can prepare and send a message to task 2, and then immediately begin doing other work. It does not matter when task 2 actually receives the data.

• Asynchronous communications are often referred to as non-blocking communications since other work can be done while the communications are taking place.

• Interleaving computation with communication is the single greatest benefit of using asynchronous communications.
Types of Synchronization
• Barrier
• Usually implies that all tasks are involved
• Each task performs its work until it reaches the barrier. It then stops, or "blocks".
• When the last task reaches the barrier, all tasks are synchronized.
• What happens from here varies. Often, a serial section of work must be done. In other cases, the
tasks are automatically released to continue their work.

• Lock / semaphore
• Can involve any number of tasks
• Typically used to serialize (protect) access to global data or a section of code. Only one task at a
time may use (own) the lock / semaphore / flag.
• The first task to acquire the lock "sets" it. This task can then safely (serially) access the protected
data or code.
• Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases
it.
• Can be blocking or non-blocking
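
• A minimal sketch of the lock pattern, using a Pthreads mutex to protect a shared counter (the routine and iteration counts are illustrative; compile with -pthread):

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                      /* shared (global) data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread acquires the lock, updates the protected data serially,
   then releases the lock so another thread can proceed. */
static void *add_to_counter(void *arg)
{
    (void) arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* blocks until the lock is free */
        counter++;                    /* safe, serialized update       */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, add_to_counter, NULL);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* always 200000 with the lock */
    return 0;
}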
Data Dependency

• A dependence exists between program statements when the order of statement execution affects the results of the program.

• A data dependence results from multiple use of the same location(s) in storage by different tasks.

• Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism.
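
• A hedged sketch of a loop-carried data dependence (the array and function names are illustrative):

/* a[j] uses a[j-1], which was computed in the previous iteration.
   If the iterations were divided between two tasks, the task owning
   the second half of the array could not start until it obtained the
   last element produced by the first task: the shared storage
   locations force an ordering (and a communication) between tasks. */
void scale_chain(double *a, int n)
{
    for (int j = 1; j < n; j++)
        a[j] = a[j - 1] * 2.0;
}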
Load Balance
• Load balancing refers to the practice of distributing work among tasks so that all
tasks are kept busy all of the time. It can be considered a minimization of task idle
time.

• Load balancing is important to parallel programs for performance reasons. For example, if all tasks are subject to a barrier synchronization point, the slowest task will determine the overall performance.
How to Achieve Load Balance
• Equally partition the work each task receives (Static Balancing)
• For operations where each task performs similar work (e.g.
arrays/matrices), evenly distribute the data set among the tasks.

• Use dynamic work assignment


• Certain classes of problems result in load imbalances even if data is evenly
distributed among tasks: sparse arrays and N-body simulations
• When the amount of work each task will perform is intentionally variable, or cannot be predicted, it may be helpful to use a task pool approach: as each task finishes its work, it queues to get a new piece of work (see the sketch after this list).
• It may become necessary to design an algorithm which detects and handles load imbalances as they occur dynamically within the code.
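
• One hedged sketch of dynamic work assignment, using an OpenMP dynamic schedule as the task pool (compile with -fopenmp; process_item stands in for the real, variable-cost work):

/* Each thread grabs the next unclaimed loop iteration as soon as it
   finishes its current one, so faster threads simply process more
   items: a simple form of dynamic load balancing. */
void process_all(int n_items, void (*process_item)(int))
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n_items; i++)
        process_item(i);
}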
Parallel Concepts: Granularity
• Computation / Communication Ratio:
• In parallel computing, granularity is a qualitative measure of
the ratio of computation to communication.
• Periods of computation are typically separated from periods
of communication by synchronization events.

• Fine grain parallelism


• Coarse grain parallelism
Fine-grain Parallelism
• Relatively small amounts of computational work are
done between communication events

• Low computation to communication ratio

• Facilitates load balancing

• Implies high communication overhead and less opportunity for performance enhancement

• If granularity is too fine, it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.
Coarse-grain Parallelism

• Relatively large amounts of computational work are done between communication/synchronization events

• High computation to communication ratio

• Implies more opportunity for performance increase

• Harder to load balance efficiently

Which is Best?

• The most efficient granularity is dependent on the algorithm and the hardware
environment in which it runs.

• In most cases the overhead associated with communications and synchronization is high relative to execution speed, so it is advantageous to have coarse granularity.

• Fine-grain parallelism can help reduce overheads due to load imbalance.

Input/Output Operations
• I/O operations are generally regarded as inhibitors to parallelism

• Parallel I/O systems are immature or not available for all platforms

• In an environment where all tasks see the same file system, write operations
will result in file overwriting

• Read operations will be affected by the fileserver's ability to handle multiple read requests at the same time

• I/O that must be conducted over the network (NFS, non-local) can cause
severe bottlenecks
Parallel I/O
• Some parallel file systems are available. For example:
• GPFS: General Parallel File System for AIX (IBM)

• Lustre: for Linux clusters

• PVFS/PVFS2: Parallel Virtual File System for Linux clusters

• PanFS: Panasas ActiveScale File System for Linux clusters

• HP SFS: HP StorageWorks Scalable File Share. Lustre based parallel file system
I/O Tips
• If you have access to a parallel file system, investigate using it. If you don't, keep
reading...

• Rule #1: Reduce overall I/O as much as possible

• Confine I/O to specific serial portions of the job, and then use parallel communications to distribute data to parallel tasks. For example, Task 1 could read an input file and then communicate the required data to other tasks. Likewise, Task 1 could perform a write operation after receiving the required data from all other tasks.

• For distributed memory systems with shared filespace, perform I/O in local, non-shared filespace.
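
• A hedged sketch of the "one task reads, then distributes" pattern using MPI (the file name and array size are illustrative; here rank 0 plays the role of Task 1):

#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char *argv[])
{
    double data[N];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* only one task touches the file system ... */
        FILE *fp = fopen("input.dat", "rb");
        if (fp != NULL) {
            fread(data, sizeof(double), N, fp);
            fclose(fp);
        }
    }

    /* ... then parallel communication distributes the data */
    MPI_Bcast(data, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* all tasks now compute on their copy of the data */

    MPI_Finalize();
    return 0;
}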
Recap
• Low Level Programming Models
• Shared Memory Model (with Threads)
• Message Passing Model

• High Level Programming Model


• Single Program Multiple Data

• Issues in Parallel Programming


• Parallelizability
• Inhibitors
• Partitioning
• Communication and Synchronization
• Data Dependence
• Load Balancing
• Granularity
• Input/Output
Next Lecture.....

• Performance of Parallel Programs
