Parallel and Distributed Computing
Scalability
Shared Memory
General Characteristics
• Shared memory parallel computers vary widely, but generally have in common
the ability for all processors to access all memory as global address space.
• Multiple processors can operate independently but share the same memory
resources.
• Changes in a memory location effected by one processor are visible to all
other processors.
• Historically, shared memory machines have been classified as UMA (Uniform
Memory Access) and NUMA (Non-Uniform Memory Access), based upon memory
access times.
Shared Memory
Advantages
• Global address space provides a user-friendly programming perspective to
memory
• Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs
Disadvantages
• The primary disadvantage is the lack of scalability between memory and CPUs.
Adding more CPUs can geometrically increase traffic on the shared memory-CPU
path and, for cache coherent systems, geometrically increase traffic associated
with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure "correct"
access of global memory.
Distributed Memory
General Characteristics
• Like shared memory systems, distributed memory systems vary widely but
share a common characteristic. Distributed memory systems require a
communication network to connect inter-processor memory.
• Processors have their own local memory. Memory addresses in one processor
do not map to another processor, so there is no concept of global address
space across all processors.
• Because each processor has its own local memory, it operates independently.
Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the
task of the programmer to explicitly define how and when data is
communicated. Synchronization between tasks is likewise the programmer's
responsibility.
• The network "fabric" used for data transfer varies widely, though it can be as
simple as Ethernet.
Distributed Memory
Advantages
• Memory is scalable with the number of processors. Increase the number of
processors and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and
without the overhead incurred with trying to maintain global cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and
networking.
Disadvantages
• The programmer is responsible for many of the details associated with data
communication between processors.
• It may be difficult to map existing data structures, based on global memory, to
this memory organization.
• Non-uniform memory access times - data residing on a remote node takes
longer to access than node local data.
Parallel Programming Models
Although it might not seem apparent, parallel programming models are NOT specific
to a particular type of machine or memory architecture. In fact, any of these models
can (theoretically) be implemented on any underlying hardware. An example from the
past is discussed below.
Message Passing Interface (MPI) on SGI Origin 2000. The SGI Origin 2000
employed the CC-NUMA type of shared memory architecture, where every task
has direct access to global address space spread across all machines. However,
the ability to send and receive messages using MPI, as is commonly done over a
network of distributed memory machines, was implemented and commonly used.
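As an illustration of explicit message passing (independent of the underlying memory
hardware), here is a minimal MPI sketch in C; the array contents and the message tag
are arbitrary choices made for this example.

    /* Minimal MPI message-passing sketch: task 0 sends an array to task 1.
       Compile with an MPI wrapper (e.g. mpicc) and run with two tasks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double data[4] = {1.0, 2.0, 3.0, 4.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Explicit communication: the programmer decides how and when. */
            MPI_Send(data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(data, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("task 1 received %.1f ... %.1f\n", data[0], data[3]);
        }

        MPI_Finalize();
        return 0;
    }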
Threads Model
• Each thread has local data, but also shares the entire resources of a.out. This
saves the overhead associated with replicating a program's resources for each
thread ("light weight"). Each thread also benefits from a global memory view
because it shares the memory space of a.out.
• A thread's work may best be described as a subroutine within the main program.
Any thread can execute any subroutine at the same time as other threads.
• Threads communicate with each other through global memory (updating address
locations). This requires synchronization constructs to ensure that no more than
one thread updates the same global address at any time (a short sketch follows
this list).
• Threads can come and go, but a.out remains present to provide the necessary
shared resources until the application has completed.
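The fragment below is a minimal POSIX threads sketch of these ideas (not tied to any
particular vendor implementation): several threads share the process's global memory,
and a mutex serves as the synchronization construct. The thread count and loop bound
are arbitrary.

    /* POSIX threads sketch: all threads share the process's global memory.
       Compile with, e.g., cc -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                          /* shared global data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                /* synchronization construct */
            counter++;                                /* update a shared address */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);
        printf("counter = %ld\n", counter);           /* 4 * 100000 = 400000 */
        return 0;
    }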
Implementations
• From a programming perspective, threads implementations commonly
comprise:
• A library of subroutines that are called from within parallel source code
• A set of compiler directives embedded in either serial or parallel source code
• In both cases, the programmer is responsible for determining the parallelism
(although compilers can sometimes help).
• Threaded implementations are not new in computing. Historically, hardware
vendors have implemented their own proprietary versions of threads. These
implementations differed substantially from each other making it difficult for
programmers to develop portable threaded applications.
• Unrelated standardization efforts have resulted in two very different
implementations of threads: POSIX Threads and OpenMP.
POSIX Threads
OpenMP
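As a minimal illustration of the directive-based approach, the sketch below marks a
loop as parallel with an OpenMP pragma; the compiler and runtime create and manage
the threads. The loop computation itself is an arbitrary example.

    /* OpenMP sketch: compiler directives express the parallelism.
       Compile with OpenMP enabled, e.g. cc -fopenmp. */
    #include <stdio.h>

    int main(void) {
        const int n = 1000000;
        double sum = 0.0;

        /* The directive splits the loop among threads; the reduction clause
           handles synchronization on the shared variable sum. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        printf("sum = %f\n", sum);
        return 0;
    }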
Hybrid Model
Multiple Program Multiple Data (MPMD)
Like SPMD, MPMD is actually a "high level" programming model that can be built
upon any combination of the previously mentioned parallel programming models.
MULTIPLE PROGRAM: Tasks may execute different programs simultaneously.
The programs can be threads, message passing, data parallel or hybrid.
MULTIPLE DATA: All tasks may use different data
MPMD applications are not as common as SPMD applications, but may be better
suited for certain types of problems, particularly those that lend themselves better
to functional decomposition than domain decomposition (discussed later under
Partitioning).
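True MPMD jobs typically launch separate executables (some MPI launchers accept a
colon-separated list of programs). Purely as a single-source sketch of functional
decomposition, the fragment below has different tasks execute different routines;
the "produce"/"compute" roles are hypothetical.

    /* Functional-decomposition sketch in the spirit of MPMD: different tasks
       execute different routines (the roles here are made up for illustration). */
    #include <mpi.h>
    #include <stdio.h>

    static void produce(void)     { printf("task 0: preparing input data\n"); }
    static void compute(int rank) { printf("task %d: computing\n", rank); }

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            produce();          /* one task runs the "producer" logic */
        else
            compute(rank);      /* the other tasks run the "worker" logic */

        MPI_Finalize();
        return 0;
    }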
Communications
Who Needs Communications?
• The need for communications between tasks depends upon your problem:
• You DON'T need communications
• Some types of problems can be decomposed and executed in parallel with
virtually no need for tasks to share data. These types of problems are often
called embarrassingly parallel - little or no communications are required.
• For example, imagine an image processing operation where every pixel in a
black and white image needs to have its color reversed. The image data can
easily be distributed to multiple tasks that then act independently of each other
to do their portion of the work.
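A sketch of that pixel example, assuming the image is simply an array of 8-bit
grayscale values: because every pixel is independent, the iterations can be divided
among threads with no communication (OpenMP is used here only for brevity).

    /* Embarrassingly parallel sketch: invert every pixel of a grayscale image.
       Each pixel is independent, so no inter-task communication is needed. */
    #include <stddef.h>
    #include <stdint.h>

    void invert_image(uint8_t *pixels, size_t n) {
        /* With OpenMP enabled, the iterations are simply divided among threads. */
        #pragma omp parallel for
        for (long i = 0; i < (long)n; i++)
            pixels[i] = (uint8_t)(255 - pixels[i]);
    }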
Communications
Who Needs Communications?
• You DO need communications
• Most parallel applications are not quite so simple, and do require tasks to
share data with each other.
• For example, a 2-D heat diffusion problem requires a task to know the
temperatures calculated by the tasks that have neighboring data. Changes to
neighboring data have a direct effect on that task's data.
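As a sketch of the kind of neighbor communication this requires, simplified to one
dimension and with made-up names (u, exchange_halo), each task exchanges its boundary
values with its neighbors before updating:

    /* 1-D halo-exchange sketch for a heat-diffusion style update.
       u[0] and u[n+1] are "ghost" cells holding copies of neighbor values. */
    #include <mpi.h>

    void exchange_halo(double *u, int n, int rank, int nprocs) {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send first interior value left, receive right neighbor's into u[n+1]. */
        MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                     &u[n + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Send last interior value right, receive left neighbor's into u[0]. */
        MPI_Sendrecv(&u[n],     1, MPI_DOUBLE, right, 0,
                     &u[0],     1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }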
Factors to consider
Communication overhead
• Inter-task communication virtually always implies overhead.
• Machine cycles and resources that could be used for computation are instead
used to package and transmit data.
• Communications frequently require some type of synchronization between tasks,
which can result in tasks spending time "waiting" instead of doing work.
• Competing communication traffic can saturate the available network bandwidth,
further aggravating performance problems.
Latency vs Bandwidth
• Latency is the time it takes to send a minimal (0 byte) message from point A
to point B. Commonly expressed as microseconds.
• Bandwidth is the amount of data that can be communicated per unit of time.
Commonly expressed as megabytes/sec or gigabytes/sec.
• Sending many small messages can cause latency to dominate communication
overheads. Often it is more efficient to package small messages into a larger
message, thus increasing the effective communications bandwidth.
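For a rough, illustrative calculation (the numbers are made up): on a network with
1 microsecond latency and 10 GB/sec bandwidth, a 100-byte message costs about 1.01
microseconds, almost all of it latency, while a single 1 MB message costs about 101
microseconds, almost all of it transfer time. Sending that 1 MB as ten thousand
separate 100-byte messages would instead cost roughly 10,000 x 1.01 microseconds,
about 10 milliseconds, which is why aggregating small messages raises the effective
communications bandwidth.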
Visibility of Communications
• With the Message Passing Model, communications are explicit and generally
quite visible and under the control of the programmer.
• With the Data Parallel Model, communications often occur transparently to the
programmer, particularly on distributed memory architectures. The programmer may
not even know exactly how inter-task communications are being accomplished.
Synchronization
• Managing the sequence of work and the tasks performing it is a critical design
consideration for most parallel programs.
• Can be a significant factor in program performance (or lack of it)
• Often requires "serialization" of segments of the program.
Types of synchronization
Barrier
• Usually implies that all tasks are involved
• Each task performs its work until it reaches the barrier. It then stops, or
"blocks".
• When the last task reaches the barrier, all tasks are synchronized.
• What happens from here varies. Often, a serial section of work must be done.
In other cases, the tasks are automatically released to continue their work.
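A minimal sketch of a barrier, using MPI only as one concrete example: each task
prints, blocks at the barrier, and continues only once every task has arrived.

    /* Barrier sketch: no task passes MPI_Barrier until every task has reached it. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        printf("task %d: finished its own work\n", rank);   /* order is arbitrary */
        MPI_Barrier(MPI_COMM_WORLD);                         /* block until all arrive */
        printf("task %d: released, all tasks synchronized\n", rank);

        MPI_Finalize();
        return 0;
    }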
Lock / semaphore
Synchronous Communication Operations
Data Dependencies
Definition
Granularity
Coarse-grain Parallelism
Which is best?
• The most efficient granularity is dependent on the algorithm and the hardware
environment in which it runs.
• In most cases the overhead associated with communications and
synchronization is high relative to execution speed so it is advantageous to
have coarse granularity.
• Fine-grain parallelism can help reduce overheads due to load imbalance.
Alternative approach
Two views on realizing distributed systems
• Integrative view: connecting existing networked computer systems into a
larger system.
• Expansive view: an existing networked computer system is extended with
additional computers.
Two definitions
• A decentralized system is a networked computer system in which
processes and resources are necessarily spread across multiple
computers.
• A distributed system is a networked computer system in which processes
and resources are sufficiently spread across multiple computers.
Important
There are many, poorly founded, misconceptions regarding scalability, fault
tolerance, security, etc. We need to develop skills by which distributed systems
can be readily understood so as to judge such misconceptions.
Distribution transparency
What is transparency?
The phenomenon by which a distributed system attempts to hide the fact that
its processes and resources are physically distributed across multiple
computers, possibly separated by large distances.
Observation
Distribution transparency is handled through many different techniques in a
layer between applications and operating systems: a middleware layer
Distribution transparency
Types
• Access: hide differences in data representation and how an object is accessed
• Location: hide where an object is located
• Relocation: hide that an object may be moved to another location while in use
• Migration: hide that an object may move to another location
• Replication: hide that an object is replicated
• Concurrency: hide that an object may be shared by several independent users
• Failure: hide the failure and recovery of an object
Degree of transparency
Aiming at full distribution transparency may be too much
• There are communication latencies that cannot be hidden
• Completely hiding failures of networks and nodes is (theoretically and
practically) impossible
  • You cannot distinguish a slow computer from a failing one
  • You can never be sure that a server actually performed an operation
  before a crash
• Full transparency will cost performance, exposing the distribution of the
system
  • Keeping replicas exactly up-to-date with the master takes time
  • Immediately flushing write operations to disk for fault tolerance
Exposing distribution may be good
• Making use of location-based services (finding your nearby friends)
• When dealing with users in different time zones
• When it makes it easier for a user to understand what's going on (e.g., when
a server does not respond for a long time, report it as failing).
Conclusion
Distribution transparency is a nice goal, but achieving it is a different story, and
it should often not even be aimed at.