Introduction To Parallel Computing
13/09/2007
https://computing.llnl.gov/tutorials/parallel_comp/
Blaise Barney, Livermore Computing <blaiseb@llnl.gov>
Overview
Concepts and Terminology
Parallel Computer Memory Architectures
Parallel Programming Models
Designing Parallel Programs
Abstract
Serial computation
Traditionally, software has been written for serial computation:
Parallel computing
In the simplest sense, parallel computing is the simultaneous use
of multiple compute resources to solve a computational problem.
Environment
The compute resources can include:
Parallelism?
Parallel computing is an evolution of serial computing that
attempts to emulate what has always been the state of affairs in
the natural world: many complex, interrelated events
happening at the same time, yet within a sequence.
Some examples:
Planetary and galactic orbits
Weather and ocean patterns
Tectonic plate drift
Rush hour traffic in LA
Automobile assembly line
Daily operations within a business
Building a shopping mall
Ordering a hamburger at the drive-through.
What for?
Traditionally, parallel computing has been considered to be "the
high end of computing" and has been motivated by numerical
simulations of complex systems and "Grand Challenge Problems"
such as:
See http://top500.org
Basic design:
SISD
Examples of SIMD:
Processor Arrays: Connection Machine CM-2, Maspar MP-1, MP-2
Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2,
Hitachi S820
MISD
MIMD
Like everything else, parallel computing has its own "jargon". Some
of the more commonly used terms associated with parallel
computing are listed below. Most of these will be discussed in more
detail later.
Shared Memory
General Characteristics:
UMA/NUMA
Uniform Memory Access (UMA):
UMA/NUMA (2)
Non-Uniform Memory Access (NUMA):
Disadvantages:
Primary disadvantage is the lack of scalability between
memory and CPUs. Adding more CPUs can geometrically
increase traffic on the shared memory-CPU path, and for
cache coherent systems, geometrically increase traffic
associated with cache/memory management.
Programmer responsibility for synchronization constructs
that ensure "correct" access of global memory.
Distributed Memory
General Characteristics:
Overview
Shared Memory
Threads
Message Passing
Data Parallel
Hybrid
Implementations
On shared memory platforms, the native compilers translate
user program variables into actual memory addresses, which
are global.
No common distributed memory platform implementations
currently exist. However, as mentioned previously in the
Overview section, the KSR ALLCACHE approach provided a
shared memory view of data even though the physical memory
of the machine was distributed.
Threads Model
In the threads model of parallel programming, a single process can
have multiple, concurrent execution paths.
An analogy to describe threads is the concept of a single program
that includes a number of subroutines.
Implementations
From a programming perspective, threads implementations
commonly comprise:
POSIX Threads
Library based; requires parallel coding
Specified by the IEEE POSIX 1003.1c standard (1995).
C Language only
Commonly referred to as Pthreads.
Most hardware vendors now offer Pthreads in addition to their
proprietary threads implementations.
Very explicit parallelism; requires significant programmer
attention to detail (see the sketch below).
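As a minimal sketch of this explicit, library-based style (not part of the original tutorial; the thread count and the per-thread work are arbitrary choices for illustration), the following C program creates a few POSIX threads and waits for them to finish:

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4   /* arbitrary thread count for the example */

    /* Each thread runs this routine; the argument carries its id. */
    static void *do_work(void *arg)
    {
        long id = (long) arg;
        printf("Hello from thread %ld\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];

        /* Explicit parallelism: the programmer creates and joins every thread. */
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, do_work, (void *) t);

        for (long t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);

        return 0;
    }

Compile with a Pthreads-aware toolchain, e.g. gcc -pthread.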
OpenMP
Compiler directive based; can use serial code
Jointly defined and endorsed by a group of major computer
hardware and software vendors. The OpenMP Fortran API was
released October 28, 1997. The C/C++ API was released in
late 1998.
Portable / multi-platform, including Unix and Windows NT
platforms
Available in C/C++ and Fortran implementations
Can be very easy and simple to use - provides for "incremental
parallelism" (see the sketch below)
Implementations
Programming with the data parallel model is usually
accomplished by writing a program with data parallel
constructs.
Fortran 90 and 95 (F90, F95): ISO/ANSI standard
extensions to Fortran 77.
Contains everything that is in Fortran 77
New source code format; additions to character set
Additions to program structure and commands
Variable additions - methods and arguments
Pointers and dynamic memory allocation added
Array processing (arrays treated as objects) added
Recursive and new intrinsic functions added
Many other new features
Implementations are available for most common parallel
platforms.
Other Models
Hybrid
Two or more parallel programming models are combined. A common
example is the combination of the message passing model (MPI) with
the threads model or the shared memory model (OpenMP).
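A minimal sketch of that combination (assuming MPI and OpenMP are available; the output is illustrative only): each MPI process spawns a team of OpenMP threads that share the process's memory, while MPI handles communication between processes.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);                 /* message passing between processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Shared memory parallelism inside each process. */
        #pragma omp parallel
        printf("process %d, thread %d\n", rank, omp_get_thread_num());

        MPI_Finalize();
        return 0;
    }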
Automatic parallelization
If you are beginning with an existing serial code and have time or
budget constraints, then automatic parallelization may be the
answer. However, there are several important caveats that apply to
automatic parallelization:
Hotspots
Identify the program's hotspots:
Know where most of the real work is being done. The majority
of scientific and technical programs usually accomplish most of
their work in a few places.
Profilers and performance analysis tools can help here
Focus on parallelizing the hotspots and ignore those sections
of the program that account for little CPU usage.
Bottlenecks
Identify bottlenecks in the program
Inhibitors
Identify inhibitors to parallelism. One common class of inhibitor is
data dependence, as demonstrated by the calculation of a Fibonacci
sequence, where each term depends on the terms before it (see the
sketch below).
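A short sketch of that kind of dependence (illustrative only): each iteration of the loop reads values written by earlier iterations, so the iterations cannot safely run in parallel as written.

    #include <stdio.h>

    int main(void)
    {
        long long f[50];
        f[0] = 0;
        f[1] = 1;

        /* Loop-carried data dependence: iteration i needs the results of
           iterations i-1 and i-2, so the loop cannot be parallelized
           directly. */
        for (int i = 2; i < 50; i++)
            f[i] = f[i - 1] + f[i - 2];

        printf("f[49] = %lld\n", f[49]);
        return 0;
    }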
Think different
Investigate other algorithms if possible. This may be the single
most important consideration when designing a parallel application.
Partitioning
Domain Decomposition
In this type of partitioning, the data associated with a problem is
decomposed. Each parallel task then works on a portion of the
data.
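One simple way to express such a partitioning (an illustrative sketch with hypothetical names, not code from the tutorial) is a block distribution: each task computes the index range of the chunk it owns.

    #include <stdio.h>

    /* Block distribution of n elements over ntasks tasks: task 'rank'
       owns indices [*lo, *hi). Any remainder goes to the lower-numbered
       tasks, so chunk sizes differ by at most one element. */
    static void block_range(int n, int ntasks, int rank, int *lo, int *hi)
    {
        int base = n / ntasks;
        int rem  = n % ntasks;
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);
    }

    int main(void)
    {
        int n = 100, ntasks = 4;   /* arbitrary sizes for the example */
        for (int rank = 0; rank < ntasks; rank++) {
            int lo, hi;
            block_range(n, ntasks, rank, &lo, &hi);
            printf("task %d owns indices [%d, %d)\n", rank, lo, hi);
        }
        return 0;
    }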
Functional Decomposition
In this approach, the focus is on the computation that is to be
performed rather than on the data manipulated by the computation.
The problem is decomposed according to the work that must be
done. Each task then performs a portion of the overall work.
Communications
Who Needs Communications?
The need for communications between tasks depends upon your
problem:
You DO need communications
Most parallel applications are not quite so simple, and do require
tasks to share data with each other.
For example, a 2-D heat diffusion problem requires a task to know
the temperatures calculated by the tasks that have neighboring
data. Changes to neighboring data have a direct effect on that
task's data.
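To make the idea concrete, here is a simplified 1-D sketch using MPI (an assumption for illustration; the array size and data are arbitrary): each task exchanges its boundary values with its neighbors before it can update its own points.

    #include <mpi.h>

    #define NLOCAL 10   /* interior points owned by each task (arbitrary) */

    int main(int argc, char **argv)
    {
        int rank, size;
        double u[NLOCAL + 2];      /* u[0] and u[NLOCAL+1] are ghost cells */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 1; i <= NLOCAL; i++)
            u[i] = (double) rank;  /* placeholder initial temperatures */

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Exchange boundary values with neighbors: send my first interior
           point to the left while receiving my right ghost cell, then the
           mirror-image exchange. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }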
Factors to Consider
Cost of communications
Inter-task communication virtually always implies overhead.
Machine cycles and resources that could be used for
computation are instead used to package and transmit data.
Communications frequently require some type of
synchronization between tasks, which can result in tasks
spending time "waiting" instead of doing work.
Competing communication traffic can saturate the available
network bandwidth, further aggravating performance problems.
Synchronization
Types of Synchronization
Barrier
Lock / semaphore
Synchronous communication operations
Barrier
Usually implies that all tasks are involved
Each task performs its work until it reaches the barrier. It then
stops, or "blocks".
When the last task reaches the barrier, all tasks are
synchronized.
What happens from here varies. Often, a serial section of work
must be done. In other cases, the tasks are automatically
released to continue their work (see the sketch below).
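A minimal sketch of a barrier using POSIX threads (one possible implementation; the tutorial does not tie the concept to a particular API, and the thread count here is arbitrary):

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
        long id = (long) arg;
        printf("thread %ld: working\n", id);

        /* Each thread blocks here until all NUM_THREADS threads arrive. */
        pthread_barrier_wait(&barrier);

        printf("thread %ld: past the barrier\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];
        pthread_barrier_init(&barrier, NULL, NUM_THREADS);

        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, worker, (void *) t);
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);

        pthread_barrier_destroy(&barrier);
        return 0;
    }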
Lock / semaphore
Can involve any number of tasks
Typically used to serialize (protect) access to global data or a
section of code. Only one task at a time may use (own) the
lock / semaphore / flag.
The first task to acquire the lock "sets" it. This task can then
safely (serially) access the protected data or code.
Other tasks can attempt to acquire the lock but must wait
until the task that owns the lock releases it.
Can be blocking or non-blocking (see the sketch below)
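A minimal sketch of the lock idea, again using POSIX threads (one possible implementation; the loop counts are arbitrary): the mutex serializes updates to a shared counter.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;          /* global data protected by the lock */

    static void *increment(void *arg)
    {
        (void) arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* only one task owns the lock...  */
            counter++;                    /* ...so this update is serialized */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];

        for (int t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, increment, NULL);
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);

        printf("counter = %ld (expected %d)\n", counter, NUM_THREADS * 100000);
        return 0;
    }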
Data Dependencies
Definition: a dependence exists between program statements when
the order of statement execution affects the results of the program.
Load Balancing
Granularity
Coarse-grain Parallelism
Fine-grain Parallelism
Relatively small amounts of computational
work are done between communication
events
Low computation to communication ratio
Facilitates load balancing
Implies high communication overhead and
less opportunity for performance
enhancement
If granularity is too fine, it is possible that the
overhead required for communications and
synchronization between tasks takes longer
than the computation.
Which is Best?
I/O
Some options
Amdahl's Law
Amdahl's Law states that potential program speedup is defined
by the fraction of code (P) that can be parallelized (S being the
serial fraction, S = 1 - P):
speedup = 1 / (1 - P) = 1 / S
If none of the code can be parallelized, P = 0 and the speedup = 1
(no speedup). If all of the code is parallelized, P = 1 and the
speedup is infinite (in theory).
If 50% of the code can be parallelized, maximum speedup = 2,
meaning the code will run twice as fast.
Introducing the number of processors N performing the parallel
fraction of the work, the relationship becomes:
speedup = 1 / (P/N + S)
speedup
--------------------------------
N P = .50 P = .90 P = .99
----- ------- ------- -------
10 1.82 5.26 9.17
100 1.98 9.17 50.25
1000 1.99 9.91 90.99
10000 1.99 9.91 99.02
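As a check, the formula can be evaluated directly (a small sketch; the values of P and N mirror the table above):

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / (P/N + S), with S = 1 - P. */
    static double amdahl(double P, int N)
    {
        double S = 1.0 - P;
        return 1.0 / (P / (double) N + S);
    }

    int main(void)
    {
        const double fractions[] = { 0.50, 0.90, 0.99 };
        const int    procs[]     = { 10, 100, 1000, 10000 };

        printf("    N   P=0.50   P=0.90   P=0.99\n");
        for (int i = 0; i < 4; i++) {
            printf("%5d", procs[i]);
            for (int j = 0; j < 3; j++)
                printf("  %7.2f", amdahl(fractions[j], procs[i]));
            printf("\n");
        }
        return 0;
    }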
Complexity
In general, parallel applications are much more complex than
corresponding serial applications, perhaps an order of magnitude.
Not only do you have multiple instruction streams executing at the
same time, but you also have data flowing between them.
Design
Coding
Debugging
Tuning
Maintenance
Portability
Thanks to standardization in several APIs, such as MPI, POSIX
threads, HPF and OpenMP, portability issues with parallel
programs are not as serious as in years past. However...
All of the usual portability issues associated with serial programs
apply to parallel programs. For example, if you use vendor
"enhancements" to Fortran, C or C++, portability will be a
problem.
Even though standards exist for several APIs, implementations will
differ in a number of details, sometimes to the point of requiring
code modifications in order to effect portability.
Resource Requirements
The primary intent of parallel programming is to decrease execution
wall clock time; however, in order to accomplish this, more CPU
time is required. For example, a parallel code that runs in 1 hour
on 8 processors actually uses 8 hours of CPU time.
The amount of memory required can be greater for parallel codes
than serial codes, due to the need to replicate data and for
overheads associated with parallel support libraries and subsystems.
For short running parallel programs, there can actually be a
decrease in performance compared to a similar serial
implementation. The overhead costs associated with setting up
the parallel environment, task creation, communications and task
termination can comprise a significant portion of the total
execution time for short runs.
Scalability
The ability of a parallel program's performance to scale is a result
of a number of interrelated factors. Simply adding more machines
is rarely the answer.
The algorithm may have inherent limits to scalability. At some
point, adding more resources causes performance to
decrease. Most parallel solutions demonstrate this characteristic at
some point.