Part 1 - Lecture 3 - Parallel Software-1
Roadmap
• Introduction
• Parallel Hardware
• Parallel Software - 1
• Low Level Programming Models
• Shared Memory Model (with Threads)
• Message Passing Model
Threads
• a.out performs some serial work, and then creates a number of tasks
(threads) that can be scheduled and run by the operating system in parallel.
• Each thread has local data, but also shares the entire resources of a.out.
This saves the overhead associated with replicating a program's resources
for each thread. Each thread also benefits from a global memory view
because it shares the memory space of a.out.
• A thread's work may best be described as a subroutine within
the main program. Any thread can execute any subroutine at the
same time as other threads.
• Threads can come and go, but a.out remains present to provide
the necessary shared resources until the application has
completed.
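As an illustration (not part of the original slides), the following is a minimal sketch of this model using POSIX Threads; the thread count N, the worker routine, and the shared counter are names chosen only for the example. Compile with e.g. cc -pthread.

/* Minimal POSIX Threads sketch: a.out spawns N threads that share
   the process's global data while keeping their own local data. */
#include <pthread.h>
#include <stdio.h>

#define N 4

int shared_counter = 0;                         /* shared: lives in a.out's memory */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    int id = *(int *)arg;                       /* local data: private to this thread */
    pthread_mutex_lock(&lock);
    shared_counter++;                           /* serialized access to shared data */
    pthread_mutex_unlock(&lock);
    printf("thread %d done\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[N];
    int ids[N];
    for (int i = 0; i < N; i++) {               /* a.out creates the threads */
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(threads[i], NULL);         /* a.out remains until all threads finish */
    printf("shared_counter = %d\n", shared_counter);
    return 0;
}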
Implementations
• From a programming perspective, threads implementations commonly
comprise:
• A library of subroutines that are called from within parallel source code
• A set of compiler directives embedded in the source code
• OpenMP
• Portable / multi-platform, including Unix and Windows NT platforms
• Available in C/C++ and Fortran implementations
• Microsoft has its own implementation for threads, which is not related to the
UNIX POSIX standard or OpenMP.
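As a hedged illustration (not from the slides), a minimal OpenMP program shows both ingredients listed above, an embedded compiler directive plus library subroutines; compile with e.g. gcc -fopenmp.

/* Minimal OpenMP sketch: the pragma is the embedded compiler directive,
   omp_get_thread_num/omp_get_num_threads are the library subroutines. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel                 /* directive: fork a team of threads */
    {
        int id = omp_get_thread_num();   /* library call: my thread id */
        printf("hello from thread %d of %d\n",
               id, omp_get_num_threads());
    }
    return 0;
}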
Message Passing Model
• A set of tasks that use their own local memory during
computation. Multiple tasks can reside on the same physical
machine as well as across an arbitrary number of machines.
Tasks exchange data by sending and receiving messages.
if (I'm thread/process i)
    do this;
else
    do that;
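A hedged sketch of this branching with MPI (assuming a standard MPI installation; the messages printed in each branch are placeholders for real work):

/* Each task runs the same a.out; the rank decides which branch it takes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which task/process am I? */

    if (rank == 0)
        printf("task %d: do this\n", rank); /* e.g. coordinate the work */
    else
        printf("task %d: do that\n", rank); /* e.g. compute one piece */

    MPI_Finalize();
    return 0;
}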
Issues in Parallel Programming: Parallelizability
• The first segment of data must pass through the first filter before progressing to
the second. When it does, the second segment of data passes through the first
filter. By the time the fourth segment of data is in the first filter, all four tasks are
busy.
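One possible realization of such a pipeline, sketched under the assumption that each MPI rank acts as one filter stage (the filter arithmetic and the segment count are made up for the example; run with e.g. 4 ranks to mirror the four filters above):

/* Segments flow rank 0 -> 1 -> ... -> last; once the pipeline fills,
   every rank is busy filtering a different segment at the same time. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int seg = 0; seg < 8; seg++) {          /* 8 data segments */
        double data;
        if (rank == 0)
            data = (double)seg;                  /* first stage produces the segment */
        else
            MPI_Recv(&data, 1, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        data = data * 0.5 + 1.0;                 /* placeholder "filter" */

        if (rank < size - 1)
            MPI_Send(&data, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("segment %d left the last filter: %f\n", seg, data);
    }
    MPI_Finalize();
    return 0;
}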
Climate Modeling
• Each model component can be thought of as a separate task. Arrows represent
exchanges of data between components during computation: the atmosphere
model generates wind velocity data that are used by the ocean model, the
ocean model generates sea surface temperature data that are used by the
atmosphere model, and so on.
• Combining these two types of problem decomposition, domain decomposition and functional decomposition, is common and natural.
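A hedged sketch of such a two-way exchange using MPI_Sendrecv, assuming exactly two ranks and illustrative field values (the numbers stand in for wind velocity and sea surface temperature):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* assume 2 ranks: 0 = atmosphere, 1 = ocean */

    double send_val = (rank == 0) ? 7.5 : 288.0;   /* my model's output field */
    double recv_val;
    int partner = 1 - rank;

    /* Each component sends its field and receives the other's in one call. */
    MPI_Sendrecv(&send_val, 1, MPI_DOUBLE, partner, 0,
                 &recv_val, 1, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %f from its partner\n", rank, recv_val);
    MPI_Finalize();
    return 0;
}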
Factors to Consider (1)
• Cost of communications
• Inter-task communication almost always implies overhead.
• Machine cycles and resources that could be used for computation are
instead used to package and transmit data.
• Communications frequently require some type of synchronization between
tasks, which can result in tasks spending time "waiting" instead of doing
work.
• Competing communication traffic can saturate the available network
bandwidth, further aggravating performance problems.
Factors to Consider (2)
• Latency vs. Bandwidth (a rough cost model combining the two is sketched after this list)
• Latency is the time it takes to send a minimal (0 byte) message from point A
to point B. Commonly expressed in microseconds.
• Bandwidth is the amount of data that can be communicated per unit of
time. Commonly expressed in megabytes/sec.
• Visibility of communications
• With the Message Passing Model, communications are explicit and
generally quite visible and under the control of the programmer.
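A rough cost model (an assumption for illustration, not a figure from the slides): transfer time ≈ latency + message size / bandwidth. With 50 microseconds of latency and 100 megabytes/sec of bandwidth, a 1 megabyte message costs about 50 µs + 10,000 µs ≈ 10 ms, so latency dominates for many small messages while bandwidth dominates for large ones; packing many small messages into one larger message is therefore often cheaper.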
• Synchronous communications
• Synchronous communications are often referred to as blocking communications since other work
must wait until the communications have completed.
• Asynchronous communications
• Asynchronous communications allow tasks to transfer data independently from one another. For
example, task 1 can prepare and send a message to task 2, and then immediately begin doing other
work. When task 2 actually receives the data doesn't matter.
• Interleaving computation with communication is the single greatest benefit of using asynchronous
communications.
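A hedged MPI sketch of this overlap using non-blocking calls (assumes exactly two ranks; the loop stands in for useful computation done while the transfer is in flight):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double buf = 3.14, other = 0.0, work = 0.0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)                           /* task 0 starts a send ... */
        MPI_Isend(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    else                                     /* ... task 1 starts a receive */
        MPI_Irecv(&other, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

    for (int i = 0; i < 1000000; i++)        /* immediately do other work */
        work += i * 1e-9;

    MPI_Wait(&req, MPI_STATUS_IGNORE);       /* only now must the transfer be complete */
    printf("rank %d: work=%f received=%f\n", rank, work, other);
    MPI_Finalize();
    return 0;
}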
Types of Synchronization
• Barrier
• Usually implies that all tasks are involved
• Each task performs its work until it reaches the barrier. It then stops, or "blocks".
• When the last task reaches the barrier, all tasks are synchronized.
• What happens from here varies. Often, a serial section of work must be done. In other cases, the
tasks are automatically released to continue their work.
• Lock / semaphore
• Can involve any number of tasks
• Typically used to serialize (protect) access to global data or a section of code. Only one task at a
time may use (own) the lock / semaphore / flag.
• The first task to acquire the lock "sets" it. This task can then safely (serially) access the protected
data or code.
• Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases
it.
• Can be blocking or non-blocking
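A hedged OpenMP sketch showing both kinds of synchronization: the critical section plays the role of the lock (one thread at a time), and the barrier holds every thread until the last one arrives (the per-thread work is illustrative):

#include <omp.h>
#include <stdio.h>

int main(void) {
    int total = 0;
    #pragma omp parallel
    {
        int my_part = omp_get_thread_num() + 1;   /* this task's "work" */

        #pragma omp critical      /* lock-like: serializes access to total */
        total += my_part;

        #pragma omp barrier       /* every thread blocks here until the last arrives */

        #pragma omp single        /* a serial section of work after the barrier */
        printf("all threads past the barrier; total = %d\n", total);
    }
    return 0;
}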
Data Dependency
• The most efficient granularity is dependent on the algorithm and the hardware
environment in which it runs.
Parallel I/O
• Parallel I/O systems are immature or not available for all platforms
• In an environment where all tasks see the same file system, write operations
will result in file overwriting
• I/O that must be conducted over the network (NFS, non-local) can cause
severe bottlenecks
• Some parallel file systems are available. For example:
• GPFS: General Parallel File System for AIX (IBM)
• Confine I/O to specific serial portions of the job, and then use parallel
communications to distribute data to parallel tasks. For example, Task 1 could
read an input file and then communicate the required data to the other tasks.
Likewise, Task 1 could perform the write operation after receiving the required
data from all other tasks (see the sketch after this list).
• For distributed memory systems with shared filespace, perform I/O in local, non-
shared filespace.
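A hedged sketch of the "Task 1 reads, then distributes" approach using MPI; the file name input.dat and the single integer payload are illustrative assumptions:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                         /* only one task touches the file */
        FILE *f = fopen("input.dat", "r");
        if (f) {
            if (fscanf(f, "%d", &value) != 1)
                value = 0;                   /* fall back if the read fails */
            fclose(f);
        }
    }

    /* One collective call distributes the data to all other tasks. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d has value %d\n", rank, value);
    MPI_Finalize();
    return 0;
}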
Recap
• Low Level Programming Models
• Shared Memory Model (with Threads)
• Message Passing Model