
COMPUTER ORGANIZATION AND DESIGN, 5th Edition
The Hardware/Software Interface

Chapter 6
Parallel Processors from Client to Cloud
§6.1 Introduction
Introduction

• Multiprocessor
  - Goal: connecting multiple computers to get higher performance
  - Scalability, availability, power efficiency
• Multicore processors
  - Chips with multiple processors (cores)
• Task-level (process-level) parallelism
  - High throughput for independent jobs
• Parallel program (parallel software)
  - A single program run on multiple processors
  - Challenges: partitioning, coordination, communication overhead
Amdahl’s Law

• The sequential part can limit speedup
• Example: 100 processors, 90× speedup?
  - Tnew = Tparallelizable/100 + Tsequential
  - Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
  - Solving: Fparallelizable = 0.999
• The sequential part must be only 0.1% of the original execution time
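As a quick check of this arithmetic, the following minimal C sketch (ours, not from the slides; the helper name amdahl_speedup is an assumption) evaluates the Amdahl's Law speedup for a parallelizable fraction f and processor count p:

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - f) + f/p), where f is the
   parallelizable fraction and p the number of processors. */
static double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    /* f = 0.999 with 100 processors gives roughly the 90x of the example. */
    printf("f = 0.999, p = 100 -> %.1fx\n", amdahl_speedup(0.999, 100));
    /* Even a 1% sequential part cuts the speedup to about 50x. */
    printf("f = 0.990, p = 100 -> %.1fx\n", amdahl_speedup(0.990, 100));
    return 0;
}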



Scaling Example

• Workload:
  - sum of 10 scalars (sequential), plus
  - sum of a pair of 10×10 matrices (parallelizable)
• What is the speedup going from a single processor to 10 and to 100 processors?
• Single processor:
  - Time = (10 + 100) × tadd
• 10 processors (assumes the load is balanced across processors):
  - Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  - Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors:
  - Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  - Speedup = 110/11 = 10 (10% of potential)
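These numbers are easy to reproduce with a small C sketch (ours; scaling_speedup is an illustrative helper, not code from the text), which also covers the 100 × 100 case on the next slide:

#include <stdio.h>

/* Time in units of t_add: 10 sequential scalar adds plus n*n matrix-element
   adds divided across p processors (assumes a perfectly balanced load). */
static double scaling_speedup(int n, int p) {
    double t1 = 10.0 + (double)n * n;        /* single-processor time */
    double tp = 10.0 + (double)n * n / p;    /* p-processor time */
    return t1 / tp;
}

int main(void) {
    printf("10x10,   p=10  -> %.1fx\n", scaling_speedup(10, 10));    /* 5.5  */
    printf("10x10,   p=100 -> %.1fx\n", scaling_speedup(10, 100));   /* 10.0 */
    printf("100x100, p=10  -> %.1fx\n", scaling_speedup(100, 10));   /* 9.9  */
    printf("100x100, p=100 -> %.1fx\n", scaling_speedup(100, 100));  /* 91.0 */
    return 0;
}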



Scaling Example (cont)

• What if matrix size is 100 × 100?
• Single processor:
  - Time = (10 + 10000) × tadd
• 10 processors (again assumes load balanced across processors):
  - Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  - Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors:
  - Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  - Speedup = 10010/110 = 91 (91% of potential)



Strong vs Weak Scaling

• Strong scaling: problem size fixed
  - Measure the speedup achieved on a multiprocessor while keeping the problem size fixed
• Weak scaling: problem size proportional to the number of CPUs
  - Measure the speedup achieved on a multiprocessor while increasing the problem size proportionally to the increase in the number of processors
• Weak-scaling example (constant performance as the problem grows):
  - 10 processors, 10 × 10 matrix: Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  - 100 processors, 32 × 32 matrix: Time = 10 × tadd + 1024/100 × tadd ≈ 20 × tadd
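The same model shows why weak scaling holds performance roughly constant; this small self-contained C sketch (ours) prints nearly equal times for both configurations:

#include <stdio.h>

/* Execution time in units of t_add: 10 sequential scalar adds plus an
   n*n matrix summed across p processors. */
static double time_tadd(int n, int p) {
    return 10.0 + (double)n * n / p;
}

int main(void) {
    /* Weak scaling: the matrix grows with the processor count. */
    printf("p=10,  10x10 matrix -> %.2f t_add\n", time_tadd(10, 10));   /* 20.00 */
    printf("p=100, 32x32 matrix -> %.2f t_add\n", time_tadd(32, 100));  /* 20.24 */
    return 0;
}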

§6.3 SISD, MIMD, SIMD, SPMD, and Vector
Instruction and Data Streams

• An alternative classification of parallel hardware:

                                Data Streams
                         Single             Multiple
  Instruction   Single   SISD:              SIMD (vector processor):
  Streams                Intel Pentium 4    SSE instructions of x86
              Multiple   MISD:              MIMD:
                         no example yet     Intel Xeon e5345
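To make the SIMD row concrete, here is a minimal C sketch (ours, not from the slides) using the x86 SSE intrinsics these instructions expose; it assumes an SSE-capable x86 compiler:

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];

    __m128 va = _mm_loadu_ps(a);       /* load four floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);    /* one instruction, four additions */
    _mm_storeu_ps(c, vc);

    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
    return 0;
}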



Vector architecture

• Highly pipelined functional units
• Stream data between vector registers and the units
  - Data is collected from memory into vector registers
  - Results are stored from vector registers back to memory
• Example: vector extension to MIPS
  - 32 vector registers, each with 64 64-bit elements
  - Vector instructions:
    * lv, sv: load/store vector
    * addv.d: add vectors of double
    * addvs.d: add scalar to each element of a vector of double



Vector architecture

[Figure: a single add pipeline vs. an array of four parallel add pipelines (functional units)]

• Single add pipeline: completes one addition per cycle
• Four add pipelines: complete four additions per cycle



Example: DAXPY (Y = a × X + Y)

• Conventional MIPS code (64 elements of 8 bytes each, hence the 512-byte bound):

      l.d    $f0,a($sp)      ;load scalar a
      addiu  $t1,$s0,#512    ;upper bound of what to load
loop: l.d    $f2,0($s0)      ;load x(i)
      mul.d  $f2,$f2,$f0     ;a × x(i)
      l.d    $f4,0($s1)      ;load y(i)
      add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
      s.d    $f4,0($s1)      ;store into y(i)
      addiu  $s0,$s0,#8      ;increment index to x
      addiu  $s1,$s1,#8      ;increment index to y
      subu   $t0,$t1,$s0     ;compute remaining bound
      bne    $t0,$zero,loop  ;check if done

• Vector MIPS code:

      l.d     $f0,a($sp)     ;load scalar a
      lv      $v1,0($s0)     ;load vector x
      mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
      lv      $v3,0($s1)     ;load vector y
      addv.d  $v4,$v2,$v3    ;add y to product
      sv      $v4,0($s1)     ;store the result
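For reference, both sequences implement the loop below; this C sketch (ours) assumes 64-element arrays of doubles, matching the vector register length:

/* DAXPY: Y = a*X + Y over 64 doubles; this is the loop the vector code
   replaces, with no per-iteration fetch, decode, or branch. */
void daxpy(double a, const double x[64], double y[64]) {
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}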
Vector vs. Scalar architecture

• Reduced instruction fetch and decode
  - A single vector instruction is equivalent to executing an entire loop
• Avoids data-hazard checking
  - Hazards are checked only between vectors, not for every element within a vector (the computation of each element within a vector is independent)
• Avoids control hazards
  - An entire loop is replaced by a vector instruction, so the loop branch (a source of control hazards) disappears
• Efficient memory access
  - If the vector's elements are adjacent in memory, fetching them through interleaved memory banks is efficient



§6.4 Hardware Multithreading
Hardware Multithreading

• Hardware multithreading
  - Multiple hardware threads (replicated registers, PC, etc.)
  - Fast context switching between threads
• 1. Fine-grained multithreading
  - Switch threads after execution of each instruction
  - If one thread stalls (whether the stall is long or short), others are executed
  - Slows down the execution of individual threads, especially those without stalls
• 2. Coarse-grained multithreading
  - Only switch on a long stall (e.g., L2-cache miss)
  - Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)



Simultaneous Multithreading (SMT)

• 3. Simultaneous multithreading (SMT)
  - Used in multiple-issue, dynamically scheduled pipelined CPUs
• Motivation
  - Multiple-issue processors often have more functional units available than a single thread needs
• Schedule instructions from multiple threads
  - Instructions from independent threads execute whenever functional units are available
  - Hides the throughput loss from both short and long stalls


Multithreading Example

[Figure: issue-slot utilization over time under the three schemes]
• Coarse MT: coarse-grained multithreading
• Fine MT: fine-grained multithreading
• SMT: simultaneous multithreading
§6.5 Multicore and Other Shared Memory Multiprocessors
Shared Memory Multiprocessor

• SMP: Symmetric Multi-Processing
  - Hardware provides a single physical address space for all processors
  - Shared variables are synchronized using locks
• Memory access time
  - UMA (uniform) vs. NUMA (nonuniform)

[Figure: UMA: every processor (each with a cache) reaches all memories through one interconnection network. NUMA: each processor/cache has a local memory, and the interconnection network links these nodes.]
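On such a machine, lock-based synchronization of a shared variable looks like this minimal POSIX-threads sketch in C (ours, not from the slides):

#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* one thread in the critical section */
        shared_counter++;              /* shared variable in the single      */
        pthread_mutex_unlock(&lock);   /* physical address space             */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter);   /* 400000, race-free */
    return 0;
}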
Example: Sum Reduction

• Sum 100,000 numbers on a 100-processor UMA machine
  - Each processor has an ID: 0 ≤ Pn ≤ 99
  - Partition: 1000 numbers per processor
• Initial summation on each processor:

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

• Now the partial sums must be added together
  - Reduction: divide and conquer
  - Half the processors add pairs of partial sums, then a quarter, ...
  - Processors must synchronize between reduction steps



Example: Sum Reduction

    half = 100;  /* 100 processors */
    repeat
        synch();  /* wait for all partial sums to be ready */
        if (half%2 != 0 && Pn == 0)  /* when half is odd, processor 0 */
            sum[0] = sum[0] + sum[half-1];  /* picks up the leftover element */
        half = half/2;  /* dividing line on who sums */
        if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
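A runnable version of the whole reduction, with P = 4 threads standing in for the 100 processors and synch() realized as a POSIX barrier (pthread_barrier_t; an assumption, since not every platform provides it), might look like this sketch of ours:

#include <pthread.h>
#include <stdio.h>

#define P 4          /* threads standing in for processors */
#define N 100000

static double A[N], sum[P];
static pthread_barrier_t barrier;

static void *reduce(void *arg) {
    long Pn = (long)arg;

    /* Step 1: each thread sums its own N/P slice. */
    sum[Pn] = 0;
    for (long i = Pn * (N / P); i < (Pn + 1) * (N / P); i++)
        sum[Pn] += A[i];

    /* Step 2: tree reduction; the barrier plays the role of synch(). */
    long half = P;
    do {
        pthread_barrier_wait(&barrier);   /* wait for partial sums */
        if (half % 2 != 0 && Pn == 0)     /* odd half: thread 0 picks */
            sum[0] += sum[half - 1];      /* up the leftover element  */
        half = half / 2;
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    } while (half != 1);
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (long i = 0; i < N; i++) A[i] = 1.0;   /* expected total: 100000 */
    pthread_barrier_init(&barrier, NULL, P);
    for (long i = 0; i < P; i++) pthread_create(&t[i], NULL, reduce, (void *)i);
    for (long i = 0; i < P; i++) pthread_join(t[i], NULL);
    printf("total = %.0f\n", sum[0]);
    pthread_barrier_destroy(&barrier);
    return 0;
}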



§6.7 Clusters, WSC, and Other Message-Passing MPs
Loosely Coupled Clusters

• A network of independent computers
  - Each has private memory and its own OS
  - Connected through the I/O system
    * e.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
  - Web servers, databases, simulations, ...
• High availability, scalable, affordable
• Problems
  - Administration cost
  - Low interconnect bandwidth
    * cf. processor/memory bandwidth on an SMP



Message Passing

• Each processor has a private physical address space
• Hardware sends/receives messages between processors
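In practice this model is programmed with a message-passing library such as MPI; the following minimal C sketch (ours, not from the slides) has rank 0 send one double to rank 1, e.g. under mpirun -np 2:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID */

    double msg;
    if (rank == 0) {
        msg = 3.14;
        MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);  /* to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* from rank 0 */
        printf("rank 1 received %f\n", msg);
    }
    MPI_Finalize();
    return 0;
}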



Interconnection Networks

• Network topologies
  - Arrangements of processors, switches, and links

[Figure: common topologies: bus, ring, 2D mesh, N-cube (N = 3), fully connected]


§6.14 Concluding Remarks
Concluding Remarks

• Higher performance by using multiple processors
• Difficulties
  - Developing parallel software
  - Devising appropriate architectures
• SIMD and vector operations match multimedia applications and are easy to program
• Higher disk performance by using RAID
