Parallel and Distributed Computing
Scalability
Shared Memory
General Characteristics
• Shared memory parallel computers vary widely, but generally have in common
the ability for all processors to access all memory as global address space.
• Multiple processors can operate independently but share the same memory
resources.
• Changes in a memory location effected by one processor are visible to all
other processors.
• Historically, shared memory machines have been classified as UMA (Uniform
Memory Access) and NUMA (Non-Uniform Memory Access), based upon memory
access times.
Shared Memory
Advantages
• Global address space provides a user-friendly programming perspective to
memory
• Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs
Disadvantages
• The primary disadvantage is the lack of scalability between memory and CPUs.
Adding more CPUs can geometrically increase traffic on the shared memory-CPU
path and, for cache coherent systems, geometrically increase traffic associated
with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure "correct"
access of global memory.
Distributed Memory
General Characteristics
• Like shared memory systems, distributed memory systems vary widely but
share a common characteristic. Distributed memory systems require a
communication network to connect inter-processor memory.
• Processors have their own local memory. Memory addresses in one processor
do not map to another processor, so there is no concept of global address
space across all processors.
• Because each processor has its own local memory, it operates independently.
Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the
task of the programmer to explicitly define how and when data is
communicated. Synchronization between tasks is likewise the programmer's
responsibility.
• The network "fabric" used for data transfer varies widely, though it can be as
simple as Ethernet.
Distributed Memory
Advantages
• Memory is scalable with the number of processors. Increase the number of
processors and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and
without the overhead incurred with trying to maintain global cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and
networking.
Disadvantages
• The programmer is responsible for many of the details associated with data
communication between processors.
• It may be difficult to map existing data structures, based on global memory, to
this memory organization.
• Non-uniform memory access times - data residing on a remote node takes
longer to access than node local data.
Parallel Programming Models
Although it might not seem apparent, parallel programming models are NOT specific
to a particular type of machine or memory architecture. In fact, any of these models
can (theoretically) be implemented on any underlying hardware. An example from the
past is discussed below.
Message Passing Interface (MPI) on SGI Origin 2000. The SGI Origin 2000
employed the CC-NUMA type of shared memory architecture, where every task
has direct access to global address space spread across all machines. However,
the ability to send and receive messages using MPI, as is commonly done over a
network of distributed memory machines, was implemented and commonly used.
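As an illustration of explicit message passing (independent of the underlying memory
hardware), here is a minimal MPI sketch in C; the array contents and the message tag
are arbitrary choices made for this example.

    /* Minimal MPI message-passing sketch: task 0 sends an array to task 1.
       Compile with an MPI wrapper (e.g. mpicc) and run with two tasks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double data[4] = {1.0, 2.0, 3.0, 4.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Explicit communication: the programmer decides how and when. */
            MPI_Send(data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(data, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("task 1 received %.1f ... %.1f\n", data[0], data[3]);
        }

        MPI_Finalize();
        return 0;
    }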
Threads Model
• Each thread has local data, but also shares the entire resources of a.out. This
saves the overhead associated with replicating a program's resources for each
thread ("light weight"). Each thread also benefits from a global memory view
because it shares the memory space of a.out.
• A thread's work may best be described as a subroutine within the main program.
Any thread can execute any subroutine at the same time as other threads.
• Threads communicate with each other through global memory (updating address
locations). This requires synchronization constructs to ensure that no more than
one thread updates the same global address at any time (a short sketch follows
this list).
• Threads can come and go, but a.out remains present to provide the necessary
shared resources until the application has completed.
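The fragment below is a minimal POSIX threads sketch of these ideas (not tied to any
particular vendor implementation): several threads share the process's global memory,
and a mutex serves as the synchronization construct. The thread count and loop bound
are arbitrary.

    /* POSIX threads sketch: all threads share the process's global memory.
       Compile with, e.g., cc -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                          /* shared global data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                /* synchronization construct */
            counter++;                                /* update a shared address */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);
        printf("counter = %ld\n", counter);           /* 4 * 100000 = 400000 */
        return 0;
    }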
Implementations
• From a programming perspective, threads implementations commonly
comprise:
• A library of subroutines that are called from within parallel source code
• A set of compiler directives embedded in either serial or parallel source code
• In both cases, the programmer is responsible for determining the parallelism
(although compilers can sometimes help).
• Threaded implementations are not new in computing. Historically, hardware
vendors have implemented their own proprietary versions of threads. These
implementations differed substantially from each other making it difficult for
programmers to develop portable threaded applications.
• Unrelated standardization efforts have resulted in two very different
implementations of threads: POSIX Threads and OpenMP.
POSIX Threads
OpenMP
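As a minimal illustration of the directive-based approach, the sketch below marks a
loop as parallel with an OpenMP pragma; the compiler and runtime create and manage
the threads. The loop computation itself is an arbitrary example.

    /* OpenMP sketch: compiler directives express the parallelism.
       Compile with OpenMP enabled, e.g. cc -fopenmp. */
    #include <stdio.h>

    int main(void) {
        const int n = 1000000;
        double sum = 0.0;

        /* The directive splits the loop among threads; the reduction clause
           handles synchronization on the shared variable sum. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        printf("sum = %f\n", sum);
        return 0;
    }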
Hybrid Model
Multiple Program Multiple Data (MPMD)
Like SPMD, MPMD is actually a "high level" programming model that can be built
upon any combination of the previously mentioned parallel programming models.
MULTIPLE PROGRAM: Tasks may execute different programs simultaneously.
The programs can be threads, message passing, data parallel or hybrid.
MULTIPLE DATA: All tasks may use different data
MPMD applications are not as common as SPMD applications, but may be better
suited for certain types of problems, particularly those that lend themselves better
to functional decomposition than domain decomposition (discussed later under
Partitioning).
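True MPMD jobs typically launch separate executables (some MPI launchers accept a
colon-separated list of programs). Purely as a single-source sketch of functional
decomposition, the fragment below has different tasks execute different routines;
the "produce"/"compute" roles are hypothetical.

    /* Functional-decomposition sketch in the spirit of MPMD: different tasks
       execute different routines (the roles here are made up for illustration). */
    #include <mpi.h>
    #include <stdio.h>

    static void produce(void)     { printf("task 0: preparing input data\n"); }
    static void compute(int rank) { printf("task %d: computing\n", rank); }

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            produce();          /* one task runs the "producer" logic */
        else
            compute(rank);      /* the other tasks run the "worker" logic */

        MPI_Finalize();
        return 0;
    }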
Communications
Who Needs Communications?
• The need for communications between tasks depends upon your problem:
• You DON'T need communications
• Some types of problems can be decomposed and executed in parallel with
virtually no need for tasks to share data. These types of problems are often
called embarrassingly parallel - little or no communications are required.
• For example, imagine an image processing operation where every pixel in a
black and white image needs to have its color reversed. The image data can
easily be distributed to multiple tasks that then act independently of each other
to do their portion of the work.
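A sketch of that pixel example, assuming the image is simply an array of 8-bit
grayscale values: because every pixel is independent, the iterations can be divided
among threads with no communication (OpenMP is used here only for brevity).

    /* Embarrassingly parallel sketch: invert every pixel of a grayscale image.
       Each pixel is independent, so no inter-task communication is needed. */
    #include <stddef.h>
    #include <stdint.h>

    void invert_image(uint8_t *pixels, size_t n) {
        /* With OpenMP enabled, the iterations are simply divided among threads. */
        #pragma omp parallel for
        for (long i = 0; i < (long)n; i++)
            pixels[i] = (uint8_t)(255 - pixels[i]);
    }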
Communications
Who Needs Communications?
• You DO need communications
• Most parallel applications are not quite so simple, and do require tasks to
share data with each other.
• For example, a 2-D heat diffusion problem requires a task to know the
temperatures calculated by the tasks that have neighboring data. Changes to
neighboring data have a direct effect on that task's data.
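As a sketch of the kind of neighbor communication this requires, simplified to one
dimension and with made-up names (u, exchange_halo), each task exchanges its boundary
values with its neighbors before updating:

    /* 1-D halo-exchange sketch for a heat-diffusion style update.
       u[0] and u[n+1] are "ghost" cells holding copies of neighbor values. */
    #include <mpi.h>

    void exchange_halo(double *u, int n, int rank, int nprocs) {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send first interior value left, receive right neighbor's into u[n+1]. */
        MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                     &u[n + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Send last interior value right, receive left neighbor's into u[0]. */
        MPI_Sendrecv(&u[n],     1, MPI_DOUBLE, right, 0,
                     &u[0],     1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }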
Factors to consider
Communication overhead
• Inter-task communication virtually always implies overhead.
• Machine cycles and resources that could be used for computation are instead
used to package and transmit data.
• Communications frequently require some type of synchronization between tasks,
which can result in tasks spending time "waiting" instead of doing work.
• Competing communication traffic can saturate the available network bandwidth,
further aggravating performance problems.
Latency vs Bandwidth
• Latency is the time it takes to send a minimal (0 byte) message from point A
to point B. Commonly expressed as microseconds.
• Bandwidth is the amount of data that can be communicated per unit of time.
Commonly expressed as megabytes/sec or gigabytes/sec.
• Sending many small messages can cause latency to dominate communication
overheads. Often it is more efficient to package small messages into a larger
message, thus increasing the effective communications bandwidth.
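For a rough, illustrative calculation (the numbers are made up): on a network with
1 microsecond latency and 10 GB/sec bandwidth, a 100-byte message costs about 1.01
microseconds, almost all of it latency, while a single 1 MB message costs about 101
microseconds, almost all of it transfer time. Sending that 1 MB as ten thousand
separate 100-byte messages would instead cost roughly 10,000 x 1.01 microseconds,
about 10 milliseconds, which is why aggregating small messages raises the effective
communications bandwidth.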
Visibility of Communications
• With the Message Passing Model, communications are explicit and generally
quite visible and under the control of the programmer.
• With the Data Parallel Model, communications often occur transparently to the
programmer, particularly on distributed memory architectures. The programmer may
not even know exactly how inter-task communications are being accomplished.
Synchronization
• Managing the sequence of work and the tasks performing it is a critical design
consideration for most parallel programs.
• Can be a significant factor in program performance (or lack of it)
• Often requires "serialization" of segments of the program.
Types of synchronization
Barrier
• Usually implies that all tasks are involved
• Each task performs its work until it reaches the barrier. It then stops, or
"blocks".
• When the last task reaches the barrier, all tasks are synchronized.
• What happens from here varies. Often, a serial section of work must be done.
In other cases, the tasks are automatically released to continue their work.
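A minimal sketch of a barrier, using MPI only as one concrete example: each task
prints, blocks at the barrier, and continues only once every task has arrived.

    /* Barrier sketch: no task passes MPI_Barrier until every task has reached it. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        printf("task %d: finished its own work\n", rank);   /* order is arbitrary */
        MPI_Barrier(MPI_COMM_WORLD);                         /* block until all arrive */
        printf("task %d: released, all tasks synchronized\n", rank);

        MPI_Finalize();
        return 0;
    }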
Lock / semaphore
Synchronous Communication Operations
Data Dependencies
Definition
Granularity
Coarse-grain Parallelism
Which is best?
• The most efficient granularity is dependent on the algorithm and the hardware
environment in which it runs.
• In most cases the overhead associated with communications and
synchronization is high relative to execution speed so it is advantageous to
have coarse granularity.
• Fine-grain parallelism can help reduce overheads due to load imbalance.
Alternative approach
Two views on realizing distributed systems
• Integrative view: connecting existing networked computer systems into a
larger system.
• Expansive view: an existing networked computer system is extended with
additional computers.
Two definitions
• A decentralized system is a networked computer system in which
processes and resources are necessarily spread across multiple
computers.
• A distributed system is a networked computer system in which processes
and resources are sufficiently spread across multiple computers.
Important
There are many, poorly founded, misconceptions regarding scalability, fault
tolerance, security, etc. We need to develop skills by which distributed systems
can be readily understood so as to judge such misconceptions.
Distribution transparency
What is transparency?
The phenomenon by which a distributed system attempts to hide the fact that
its processes and resources are physically distributed across multiple
computers, possibly separated by large distances.
Observation
Distribution transparency is handled through many different techniques in a
layer between applications and operating systems: a middleware layer
Distribution transparency
Types
• Access: hide differences in data representation and how an object is accessed
• Location: hide where an object is located
• Relocation: hide that an object may be moved to another location while in use
• Migration: hide that an object may move to another location
• Replication: hide that an object is replicated
• Concurrency: hide that an object may be shared by several independent users
• Failure: hide the failure and recovery of an object
Degree of transparency
Aiming at full distribution transparency may be too much
• There are communication latencies that cannot be hidden
• Completely hiding failures of networks and nodes is (theoretically and
practically) impossible
  • You cannot distinguish a slow computer from a failing one
  • You can never be sure that a server actually performed an operation
  before a crash
• Full transparency will cost performance, exposing the distribution of the
system
  • Keeping replicas exactly up-to-date with the master takes time
  • Immediately flushing write operations to disk for fault tolerance
Exposing distribution may be good
• Making use of location-based services (finding your nearby friends)
• When dealing with users in different time zones
• When it makes it easier for a user to understand what's going on (e.g., when
a server does not respond for a long time, report it as failing).
Conclusion
Distribution transparency is a nice goal, but achieving it is a different story, and
it should often not even be aimed at.