Lec 3
Parallel Architectures
Lecture # 4
GPU System Context
GPU Computing?
Design target for CPUs:
• Make a single thread very fast
• Take control away from the programmer
Idea #2
Amortize cost/complexity of managing an
instruction stream across many ALUs.
→ SIMD
Saving Yet More Space
Gratuitous Amounts of Parallelism!
Branches
Memory
Memory latency: The time taken for a memory request to be completed.
This usually takes 100s of cycles.
Memory bandwidth: The rate at which the memory system can provide
data to a processor.
We've removed
caches
branch prediction
out-of-order execution
So what now?
Remaining Problem: Slow Memory
Problem
Memory still has very high latency. . .
. . . but we've removed most of the hardware that helps us deal
with that.
Hiding Memory Latency
Discussion!
Does multi-threading increase or decrease the time for an individual thread to finish its assigned task?
• Under-utilization
• Bandwidth to the CPU
Modern GPU Hardware
GPUs have
• many parallel execution units and
• higher transistor counts,
while CPUs have
• few execution units and
• higher clock speeds.
• GPUs have much deeper pipelines (several thousand stages vs. 10–20 for CPUs).
• GPUs have significantly faster and more advanced memory interfaces, as they need to move much more data than CPUs.
Let’s Take A Closer Look:
The Hardware
GPU Architecture: GeForce 8800 (2007)
➢ The SM performs all thread management, including thread creation, scheduling, and barrier synchronization.
Scalar vs Threaded
Scalar program
float A[4][8];
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 8; j++) {
        A[i][j]++;
    }
}
Multithreaded: (4x1) blocks – (8x1) threads
Multithreaded: (2x2) blocks – (4x2) threads