Lec 3
Parallel Architectures
Lecture # 4
GPU System Context
GPU Computing?
Design target for CPUs:
• Make a single thread very fast
• Take control away from the programmer
Idea #2
Amortize cost/complexity of managing an
instruction stream across many ALUs.
→ SIMD
Saving Yet More Space
Gratuitous Amounts of Parallelism!
Branches
Memory
Memory latency: The time taken for a memory request to be completed.
This usually takes 100s of cycles.
Memory bandwidth: The rate at which the memory system can provide
data to a processor.
We've removed
caches
branch prediction
out-of-order execution
So what now?
Remaining Problem: Slow Memory
Problem
Memory still has very high latency. . .
. . . but we've removed most of the hardware that helps us deal
with that.
Hiding Memory Latency
Discussion!
Does multi-threading increase or decrease the time for an individual thread to finish its assigned task?
• Under-utilization
• Bandwidth to the CPU
Modern GPU Hardware
GPUs have
• many parallel execution units and
• higher transistor counts,
while CPUs have
• few execution units and
• higher clock speeds.
• GPUs have much deeper pipelines (several thousand stages vs. 10–20 for CPUs).
• GPUs have significantly faster and more advanced memory interfaces, as they need to move much more data than CPUs.
Let’s Take A Closer Look:
The Hardware
GPU Architecture: GeForce 8800 (2007)
➢ The SM performs all thread management, including thread creation, scheduling, and barrier synchronization.
Scalar vs Threaded
Scalar program
float A[4][8];
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 8; j++) {
        A[i][j]++;
    }
}
Multithreaded: (4x1) blocks – (8x1) threads
Multithreaded: (2x2) blocks – (4x2) threads