Parallel Computing
Course Contents
Slide 2
Slide 3
Slide 4
Slide 5
Slide 7
Slide 9
Why Use Parallel Computing?
Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Parallel clusters can be built from cheap, commodity components.
Solve larger problems: Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory. For example: "Grand Challenge" (en.wikipedia.org/wiki/Grand_Challenge) problems requiring PetaFLOPS and PetaBytes of computing resources, and web search engines/databases processing millions of transactions per second.
Provide concurrency: A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things simultaneously. For example, the Access Grid (www.accessgrid.org) provides a global collaboration network where people from around the world can meet and conduct work "virtually".
Slide 11
There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy. Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple. The matrix below defines the 4 possible classifications according to Flynn:
SISD: Single Instruction, Single Data
SIMD: Single Instruction, Multiple Data
MISD: Multiple Instruction, Single Data
MIMD: Multiple Instruction, Multiple Data
Slide 12
Single Instruction, Single Data (SISD): (uniprocessor)
A serial (non-parallel) computer.
Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
Single data: only one data stream is being used as input during any one clock cycle.
Deterministic execution.
This is the oldest and, even today, the most common type of computer.
Examples: older generation mainframes, minicomputers and workstations; most modern day PCs.
Slide 13
Single Instruction, Multiple Data (SIMD): (array processors)
A type of parallel computer.
Single instruction: all processing units execute the same instruction at any given clock cycle.
Multiple data: each processing unit can operate on a different data element.
Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.
Synchronous (lockstep) and deterministic execution.
Two varieties: Processor Arrays and Vector Pipelines.
Examples: Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV. Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10.
Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units.
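As a concrete illustration of this idea (my own sketch, not taken from the slides), the loop below applies the same multiply-and-add operation to every element of an array, which a vectorizing compiler can map onto SIMD instructions. The array size and values are arbitrary, and the OpenMP simd pragma is only a hint to the compiler; the data-parallel loop itself is the point.

    /* A single instruction stream (multiply-and-add) applied to many data
       elements; arbitrary array size and contents, chosen for illustration. */
    #include <stdio.h>

    #define N 1024

    int main(void) {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        #pragma omp simd                     /* ask the compiler to vectorize this loop */
        for (int i = 0; i < N; i++)
            c[i] = 2.0f * a[i] + b[i];       /* same operation on every element */

        printf("c[100] = %.1f\n", c[100]);   /* 2*100 + 200 = 400.0 */
        return 0;
    }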
Multiple Instruction, Single Data (MISD): (not practical)
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via independent instruction streams.
Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
Some conceivable uses might be: multiple frequency filters operating on a single signal stream; multiple cryptography algorithms attempting to crack a single coded message.
Slide 15
Multiple Instruction, Multiple Data (MIMD): (multiprocessor system)
Currently, the most common type of parallel computer. Most modern computers fall into this category.
Multiple instruction: every processor may be executing a different instruction stream.
Multiple data: every processor may be working with a different data stream.
Execution can be synchronous or asynchronous, deterministic or non-deterministic.
Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.
Note: many MIMD architectures also include SIMD execution subcomponents.
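A minimal MIMD-flavoured sketch (again my own illustration, with made-up work for each thread): two POSIX threads execute different instruction streams on different data at the same time. Compile with -pthread.

    /* Two threads, two different instruction streams, two different data sets. */
    #include <pthread.h>
    #include <stdio.h>

    static void *sum_squares(void *arg) {            /* instruction stream 1 */
        long n = *(long *)arg, s = 0;
        for (long i = 1; i <= n; i++) s += i * i;
        printf("sum of squares up to %ld = %ld\n", n, s);
        return NULL;
    }

    static void *count_multiples_of_7(void *arg) {   /* instruction stream 2 */
        long n = *(long *)arg, c = 0;
        for (long i = 1; i <= n; i++) if (i % 7 == 0) c++;
        printf("multiples of 7 up to %ld = %ld\n", n, c);
        return NULL;
    }

    int main(void) {
        long n1 = 1000, n2 = 5000;   /* different data for each thread (arbitrary) */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, sum_squares, &n1);
        pthread_create(&t2, NULL, count_multiples_of_7, &n2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }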
Slide 16
Shared Memory
General characteristics: Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space.
Multiple processors can operate independently but share the same memory resources.
Changes in a memory location effected by one processor are visible to all other processors.
Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.
Uniform Memory Access (UMA):
Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
Identical processors.
Equal access and access times to memory.
Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update; cache coherency is accomplished at the hardware level.
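The sketch below illustrates the shared address space idea with OpenMP (the choice of OpenMP is my assumption; the slides do not name an API): all threads read and write the same global array, and an update made by one thread is visible to the others. Array size and contents are arbitrary; compile with -fopenmp.

    /* All threads share one global address space: they fill and then sum
       the same array directly, with no explicit data transfers. */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static double data[N];   /* shared memory, visible to every thread */

    int main(void) {
        double sum = 0.0;

        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            data[i] = 0.5 * i;              /* each thread writes part of the array */

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += data[i];                 /* every thread reads the shared array */

        printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }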
Slide 18
Slide 19
Like shared memory systems, distributed memory systems vary widely but share a common characteristic. Distributed memory systems require a communication network to connect interprocessor memory. Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors. Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply. When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility. The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.
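A minimal message-passing sketch of this point, assuming MPI and two processes (the value and message tag are arbitrary): rank 0's variable lives only in rank 0's local memory, so the programmer must send it explicitly before rank 1 can use it. Run with, e.g., mpirun -np 2 ./a.out.

    /* Each rank owns only its local memory; data moves via explicit messages. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int value = 0;
        if (size >= 2) {
            if (rank == 0) {
                value = 42;   /* exists only in rank 0's address space */
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                /* rank 1 cannot read rank 0's memory; it must receive a message */
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 1 received %d\n", value);
            }
        }

        MPI_Finalize();
        return 0;
    }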
Slide 20
Advantages: Memory is scalable with number of processors. Increase the number of processors and the size of memory increases proportionately. Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain cache coherency. Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages: The programmer is responsible for many of the details associated with data communication between processors. It may be difficult to map existing data structures, based on global memory, to this memory organization. Non-uniform memory access times: data residing on a remote node takes longer to access than local data.
Slide 21
The largest and fastest computers in the world today employ both shared and distributed memory architectures. The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global. The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory - not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another. Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future. Advantages and Disadvantages: whatever is common to both shared and distributed memory architectures.
Slide 22
Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware. Two examples:
1. Shared memory model on a distributed memory machine: Kendall Square Research (KSR) ALLCACHE approach. Machine memory was physically distributed, but appeared to the user as a single shared memory (global address space). Generically, this approach is referred to as "virtual shared memory". Note: although KSR is no longer in business, there is no reason to suggest that a similar implementation will not be made available by another vendor in the future.
2. Message passing model on a shared memory machine: MPI on SGI Origin. The SGI Origin employed the CC-NUMA type of shared memory architecture, where every task has direct access to global memory. However, the ability to send and receive messages with MPI, as is commonly done over a network of distributed memory machines, was implemented and commonly used.
Solution 1
Assume that books are organized into shelves and that the shelves are grouped into bays. One simple way to assign the task to the workers is:
to divide the books equally among them. Each worker stacks the books one at a time.
This division of work may not be the most efficient way to accomplish the task, since each worker may have to move about the entire library to shelve his or her share of the books.
Solution 2
An alternative way to divide the work is to assign a fixed and disjoint set of bays to each worker. As before, each worker is assigned an equal number of books arbitrarily.
If the worker finds a book that belongs to a bay assigned to him or her,
he or she places that book in its assigned spot.
Otherwise,
he or she passes it on to the worker responsible for the bay it belongs to.
The second approach requires less effort from individual workers.
Grid environment
Move():
    move randomly until the robot sees a stick in its neighbourhood
Collect():
    Move(); pick up the stick; Move(); put it down; Collect();
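A rough single-robot simulation of this behaviour follows, written as a sketch rather than the course's reference version; the grid size, stick count, step count, and wrap-around edges are my own assumptions. The robot walks randomly, picks up a stick it steps on while empty-handed, and puts its stick down when it meets another one, so sticks slowly gather into piles.

    /* Stick clustering on a small toroidal grid; all parameters are
       illustrative assumptions, not values from the course. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N      20        /* grid is N x N, edges wrap around */
    #define STICKS 60        /* sticks scattered at random initially */
    #define STEPS  100000    /* number of random-walk steps */

    static int grid[N][N];   /* number of sticks on each cell */

    static int wrap(int x) { return (x % N + N) % N; }

    int main(void) {
        srand(42);
        for (int i = 0; i < STICKS; i++)
            grid[rand() % N][rand() % N]++;

        int rx = rand() % N, ry = rand() % N;   /* robot position */
        int carrying = 0;                       /* is the robot holding a stick? */

        for (int s = 0; s < STEPS; s++) {
            /* Move(): one random step */
            rx = wrap(rx + rand() % 3 - 1);
            ry = wrap(ry + rand() % 3 - 1);

            if (!carrying && grid[rx][ry] > 0) {
                grid[rx][ry]--;                 /* pick up a stick */
                carrying = 1;
            } else if (carrying && grid[rx][ry] > 0) {
                grid[rx][ry]++;                 /* put it down next to other sticks */
                carrying = 0;
            }
        }
        if (carrying) grid[rx][ry]++;           /* don't lose the last stick */

        for (int i = 0; i < N; i++) {           /* print the resulting piles */
            for (int j = 0; j < N; j++)
                putchar(grid[i][j] == 0 ? '.' : '0' + (grid[i][j] > 9 ? 9 : grid[i][j]));
            putchar('\n');
        }
        return 0;
    }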
Slide 28
Sorting in nature
Slide 29
Parallel Processing
(Several processing elements working to solve a single problem)
Downside: complexity of system and algorithm design
A parallel computer is of little use unless efficient parallel algorithms are available.
The issues in designing parallel algorithms are very different from those in designing their sequential counterparts. A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures.
Processor Trends
Moore's Law
performance doubles every 18 months
Theoretical:
challenging problems
Slide 33
Atmospheric simulation
3D grid, each element interacts with its neighbors; 1 x 1 x 1 mile elements, about 5 × 10^8 elements in total; a 10-day simulation requires approx. 100 days of sequential computation.
Oil exploration
large amounts of seismic data to be processed; months of sequential exploration
Slide 36
Computational biology
drug design
gene sequencing (Celera)
structure prediction (proteomics)
Slide 37
Fundamental Issues
Is the problem amenable to parallelization?
How to decompose the problem to exploit parallelism?
What machine architecture should be used?
What parallel resources are available?
What kind of speedup is desired?
Slide 38
Slide 39
Asymptotic Parallelism
Models
comparing/evaluating different architectures
Algorithm Design
utilizing a given architecture to solve a given problem
Computational Complexity
classifying problems according to their difficulty
Slide 41
Architecture
Single processor:
single instruction stream, single data stream (von Neumann model)
Multiple processors:
Flynn's taxonomy
Slide 42
Flynn's Taxonomy

                              Data Streams
                          Single        Multiple
Instruction    Single      SISD           SIMD
Streams        Multiple    MISD           MIMD
Slide 44
Parallel Architectures
Multiple processing elements
Memory: shared, distributed, or hybrid
Control: centralized or distributed
Slide 45
Distributed:
processing elements do not share memory or system clock
(Figure: sequential vs. parallel execution as a function of the number of processors.)
A parallel algorithm is optimal iff the product of the number of processors and the parallel running time (p × Tp) is of the same order as the best known sequential time.
Slide 47
Metrics
A measure of relative performance between a multiprocessor system and a single-processor system is the speed-up S(p), defined as follows:

S(p) = (execution time using a single processor system) / (execution time using a multiprocessor with p processors) = T1 / Tp

Efficiency: E(p) = S(p) / p

Cost: C(p) = p × Tp
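A tiny worked example of these metrics with assumed timings (T1 = 100, Tp = 16 on p = 8 processors; the numbers are illustrative, not taken from the course):

    /* Speedup, efficiency, and cost for assumed timings. */
    #include <stdio.h>

    int main(void) {
        double T1 = 100.0;   /* sequential execution time (assumed) */
        double Tp = 16.0;    /* parallel execution time on p processors (assumed) */
        int    p  = 8;

        double S = T1 / Tp;          /* speedup    S(p) = T1 / Tp  */
        double E = S / p;            /* efficiency E(p) = S(p) / p */
        double C = p * Tp;           /* cost       C(p) = p * Tp   */

        printf("speedup    = %.2f\n", S);   /* 6.25                 */
        printf("efficiency = %.2f\n", E);   /* 0.78                 */
        printf("cost       = %.1f\n", C);   /* 128.0 (vs. T1 = 100) */
        return 0;
    }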
Slide 48
Metrics
A parallel algorithm is cost-optimal if its cost p × Tp is of the same order as the best known sequential time T1.
If it is not, the parallel implementation may become slower than the sequential one.
Example: T1 = n^3 and Tp = n^2.5 when p = n^2, so the cost is Cp = p × Tp = n^4.5, which grows faster than T1 = n^3; the algorithm is not cost-optimal.
Slide 49
Amdahl's Law
f = fraction of the problem that's inherently sequential
Normalizing the sequential time to T1 = 1:
Tp = f + (1 - f) / p
Sp = T1 / Tp = 1 / (f + (1 - f) / p)
Slide 50
Slide 51
Amdahl's Law
Upper bound on speedup (p → ∞): the term (1 - f)/p converges to 0, so
S∞ = 1 / f
Example:
f = 2%:  S∞ = 1 / 0.02 = 50
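The short program below evaluates this formula for the slide's f = 2% at a few (arbitrarily chosen) processor counts, showing the speedup approaching the 1/f = 50 ceiling:

    /* Amdahl's law for f = 0.02 at increasing processor counts. */
    #include <stdio.h>

    int main(void) {
        double f = 0.02;                          /* sequential fraction from the slide */
        int procs[] = {1, 10, 100, 1000, 10000};  /* arbitrary processor counts */

        for (int i = 0; i < 5; i++) {
            int p = procs[i];
            double S = 1.0 / (f + (1.0 - f) / p); /* S(p) = 1 / (f + (1 - f)/p) */
            printf("p = %5d  ->  S = %6.2f\n", p, S);
        }
        printf("upper bound: 1/f = %.0f\n", 1.0 / f);   /* 50 */
        return 0;
    }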
Slide 52