GPU Introduction

(1) Massively parallel computing has evolved from supercomputers built on specialized vector processors to general-purpose processing on the GPUs integrated into graphics cards (GPGPU). Modern GPUs contain hundreds or thousands of arithmetic units that can accelerate many computing applications in parallel.
(2) GPUs use a streaming multiprocessor architecture with multiple cores that can maintain thousands of threads at once to hide memory latency. They rely on high computational density rather than large caches.
(3) NVIDIA's CUDA extends C/C++ for general-purpose GPU programming. It provides hardware-optimized memory and a programming model based on thread blocks and grids, well suited to data-parallel workloads.

The Architecture of the Graphics Processing Unit (GPU)

P. Bakowski


Evolution of parallel architectures

We can distinguish three generations of massively parallel architectures for scientific computing:
(1) Supercomputers with special processors for vector calculation (Single Instruction Multiple Data, SIMD).
The Cray-1 (1976) contained 200,000 integrated circuits and could perform 100 million floating-point operations per second (100 MFLOPS).
Price: $5 - $8.8 million
Number of units sold: 85

Evolution of parallel architectures

(2) Supercomputers built from standard microprocessors adapted for massive multiprocessing, operating as Multiple Instruction Multiple Data (MIMD) computers.
IBM Roadrunner: PowerXCell 8i CPUs and 6,480 dual-core AMD Opteron processors, running Linux
Power consumption: 2.35 MW
Footprint: 296 racks, 560 m²
Memory: 103.6 TiB
Performance: 1.042 petaflops
Price: USD $125M

Evolution of GPU architectures

(3) General-purpose processing on graphics processing units (GPGPU), a technology based on the circuits integrated into graphics cards.

GPU based processing

GPUs (Graphics Processing Units) contain hundreds or thousands of arithmetic units. This capacity can be used to accelerate a wide range of computing applications.
Example: NVIDIA GT200, 300, 400 and 500 series, with up to 48 CUDA cores per streaming multiprocessor.

CPUs and SSE extensions

Modern CPUs integrate dedicated SIMD units for graphics processing. These units implement the SSE2, SSE3 and SSE4 instructions and contain 4 arithmetic units that can operate in parallel on 4 fixed-point or floating-point values.

CPUs and GPUs

GPUs are based on multiple processing units, each with multiple processing cores (8/16/32 cores per processing unit), a register file and shared memory. A graphics card contains a global memory that can be used by all processors (including the CPU), a local memory for each processing unit, and special memories for constant values.

GPUs : streaming multi-processors

The streaming multiprocessors (SMs) integrated in GPUs are SIMD blocks with several arithmetic cores (8/16/32/48 cores per SM). Each core contains one floating-point (FP) unit and one integer (INT) unit.

CPUs and cache memories

CPUs use cache memories to reduce the access latency to main memory. CPU caches occupy a growing share of the processor die area and consume a lot of energy.

Cache memory : latency


CPUs and cache memories

GPUs use caches or shared memory to increase the effective bandwidth of global memory.

GPU memory : transfer data rate

Each GPU multiprocessor has its own memory controller. For example, the memory controller of the NVIDIA GT200 chip provides eight 64-bit communication channels.

GPU memory : transfer data rate

data_rate = (interface_width / 8) * memory_clock * 2
For the GTX 275:
number of bytes on the bus: 448 bits / 8 = 56 bytes
data rate with one transfer per clock cycle: 56 * 1224 MHz = 68,544 MB/s
two reads/writes per clock cycle (double data rate, DDR): 68,544 MB/s * 2 = 137,088 MB/s = 137.1 GB/s
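The arithmetic above can be checked with a few lines of host code. This is only a sketch: the function name peak_bandwidth_mb_s is an illustrative choice, not part of CUDA, and the result is a theoretical peak rather than a measured transfer rate.

```cuda
#include <cstdio>

// Theoretical peak bandwidth in MB/s, following the formula in the text:
// (bus width in bytes) * (memory clock in MHz) * 2 transfers per clock cycle.
// The function name is hypothetical, chosen for this example only.
double peak_bandwidth_mb_s(int interface_width_bits, double memory_clock_mhz)
{
    double bytes_per_transfer = interface_width_bits / 8.0;
    return bytes_per_transfer * memory_clock_mhz * 2.0;
}

int main()
{
    // GTX 275 figures taken from the text: 448-bit bus, 1224 MHz memory clock.
    double mb_s = peak_bandwidth_mb_s(448, 1224.0);
    printf("%.0f MB/s = %.1f GB/s\n", mb_s, mb_s / 1000.0);
    // prints "137088 MB/s = 137.1 GB/s"
    return 0;
}
```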


CPU/GPU : execution threads

The GPU hides memory latency through the simultaneous execution of thousands of threads; if one thread waits on a memory access, another one may be executed in the meantime.

CPU/GPU : execution threads

A CPU may execute 1-2 threads per core; each GPU multiprocessor may maintain up to 1024 threads. The cost of a thread context switch on a CPU core is tens or hundreds of memory cycles, whereas a GPU may switch between several threads per clock cycle.
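The thread capacity quoted above depends on the GPU generation, so it is safer to query it than to hard-code it. A minimal sketch using the CUDA runtime call cudaGetDeviceProperties follows; it assumes device 0 is the GPU of interest, and the helper total_resident_threads is a hypothetical name introduced for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on threads resident on the whole device.
int total_resident_threads(int sm_count, int max_threads_per_sm)
{
    return sm_count * max_threads_per_sm;
}

int main()
{
    cudaDeviceProp prop;
    // Assumption: device 0 is the GPU we want to inspect.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device: %s\n", prop.name);
    printf("multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max resident threads on the device: %d\n",
           total_resident_threads(prop.multiProcessorCount,
                                  prop.maxThreadsPerMultiProcessor));
    return 0;
}
```

The printed values differ from one GPU to another, which is why no expected output is shown.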


SIMD versus SIMT

SIMD: CPUs use their vector processing units for SIMD processing (a single instruction is executed on multiple data elements) within a single execution thread.

SIMT: GPUs use the SIMT operating mode, in which a single instruction is executed by multiple threads. SIMT processing does not require the data to be transformed into vectors, and it allows arbitrary branches in the threads.
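To make the SIMT point concrete, here is a small sketch in which every thread handles one scalar element and takes its own branch. The kernel name branch_per_thread and the arithmetic in step are illustrative assumptions, not taken from the slides. Threads of the same warp that take different branches are executed one path after the other, which is legal but costs time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative per-element operation with a data-dependent branch.
__host__ __device__ int step(int v)
{
    if (v % 2 == 0)
        return v / 2;       // path taken by threads holding an even value
    else
        return 3 * v + 1;   // path taken by threads holding an odd value
}

// One thread per element: no packing of the data into vectors is needed.
__global__ void branch_per_thread(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = step(in[i]);
}

int main()
{
    const int n = 8;
    int h_in[n] = {1, 2, 3, 4, 5, 6, 7, 8};
    int h_out[n];
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    branch_per_thread<<<1, 32>>>(d_in, d_out, n);   // one warp, 8 active elements

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("%d -> %d\n", h_in[i], h_out[i]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```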


GPUs and high density computing

GPUs give excellent results when the same sequence of operations is applied to a large number of data elements. The best results are obtained when the number of arithmetic operations greatly exceeds the number of memory accesses. A high density of calculation does not require the large cache memories that CPUs need.

[Figure: ratio of calculations to memory accesses, from low to high]

GPUs : performance


GPU based calculus

In several cases the performance of GPU-based processing is 5-30 times greater than that of CPU-based processing. The biggest difference, a performance gain of up to 100 times, is seen for code that is not adapted to SSE instructions but is well suited to the GPU.

GPU based calculus

Some examples of synthetic code accelerated by GPUs, compared to the same code vectorized for SSE:
processing for fluorescence microscopy: 12x
modeling of molecular dynamics: 8-16x
modeling of electrostatic fields: 40-120x and 7x

GPU based calculus: speed-up

The comparison of the speed-up relative to SSE


From GeForce8 to Tesla

8-16 CUDA cores

How many CUDA cores?

Tesla system S1070

NVIDIA and CUDA

CUDA technology is a software architecture based on NVIDIA hardware. The CUDA language is an extension of the C programming language. It gives access to GPU instructions and to the video memory for parallel calculations. CUDA makes it possible to implement algorithms that run on GeForce 8 cards and on all more recent GPU chips (GeForce 9, GeForce 200, GeForce 300, GeForce 400, GeForce 500), as well as on Quadro and Tesla.

NVIDIA and CUDA

The CUDA Toolkit contains:
compiler: nvcc
libraries: FFT and BLAS
profiler
debugger: gdb for GPU
runtime driver for CUDA, included in the NVIDIA drivers
programming guide
SDK for CUDA developers: source code (examples) and documentation

CUDA : compilation phases

CUDA C code is compiled with nvcc, which is a driver script that invokes other programs: cudacc, g++, cl, etc.

CUDA : compilation phases

nvcc generates:
the CPU code, which is compiled together with the other parts of the application written in pure C, and
the PTX object code for the GPU

CUDA : compilation phases

Executable files containing CUDA code require the CUDA runtime library (cudart) and the base CUDA library.

CUDA : advantages
The main advantage of CUDA for GPGPU computing results from the new GPU architecture, designed for the efficient implementation of non-graphics calculations, and from the use of the C programming language. There is no need to convert algorithms into the pipelined format required for graphics calculations. GPGPU code does not use the graphics API and the corresponding drivers.

CUDA : advantages
CUDA provides:
access to 16 KB of memory per SM, shared by the threads of the SM
efficient transfer of data between system memory and video memory (global GPU memory)
memory with a linear addressing scheme and random access to any memory location
hardware-implemented operations on floating-point numbers, integers and bits

CUDA : limitations
Limitations:
no recursive functions (no stack)
processing in blocks of at least 32 threads (warp)
CUDA is a proprietary architecture of NVIDIA

CUDA : programming model

The CUDA programming model is based on groups of threads. Threads are grouped into blocks, and blocks into grids of one or two dimensions. The threads of a block cooperate via shared memory and synchronization points. A kernel program is executed by a grid of blocks of threads. Only one grid of blocks of threads is executed at a time. Each block may have one, two or three dimensions and contain up to 512 threads.
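As an illustration of the grid/block hierarchy, the sketch below launches a two-dimensional grid of two-dimensional blocks over a small matrix. The kernel name fill_2d, the helper blocks_for and the matrix size are assumptions made for this example. A 16 x 16 block holds 256 threads, which stays within the 512-thread limit mentioned above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements (ceiling division).
unsigned blocks_for(unsigned n, unsigned block)
{
    return (n + block - 1) / block;
}

// Each thread derives its 2D coordinates from its block and thread indices.
__global__ void fill_2d(int *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)            // threads outside the matrix do nothing
        data[y * width + x] = y * width + x;
}

int main()
{
    const int width = 100, height = 60, n = width * height;
    static int h_data[n];
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));

    dim3 block(16, 16);                                          // 256 threads per block
    dim3 grid(blocks_for(width, block.x), blocks_for(height, block.y));  // 7 x 4 blocks
    fill_2d<<<grid, block>>>(d_data, width, height);

    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("grid %u x %u, last element = %d\n", grid.x, grid.y, h_data[n - 1]);
    cudaFree(d_data);
    return (h_data[n - 1] == n - 1) ? 0 : 1;
}
```

The bounds check in the kernel is needed because the grid is rounded up and therefore covers slightly more than the 100 x 60 matrix.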


CUDA : programming model

The blocks of threads are executed in groups of 32 threads called warps. A warp is the smallest unit of work that is processed by the streaming multiprocessors. CUDA works with blocks of threads containing from 32 to 512 threads.
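The mapping from thread index to warp can be made visible with a short sketch. It relies on the built-in device variable warpSize; the kernel name warp_layout, the helper warps_per_block and the 96-thread block size are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of warps needed for a block (ceiling division).
int warps_per_block(int threads, int warp_size)
{
    return (threads + warp_size - 1) / warp_size;
}

// Each thread records which warp of its block it belongs to and its lane in that warp.
__global__ void warp_layout(int *warp_id, int *lane_id)
{
    int t = threadIdx.x;
    warp_id[t] = t / warpSize;
    lane_id[t] = t % warpSize;
}

int main()
{
    const int threads = 96;                  // three warps of 32 threads
    int h_warp[threads], h_lane[threads];
    int *d_warp, *d_lane;
    cudaMalloc(&d_warp, threads * sizeof(int));
    cudaMalloc(&d_lane, threads * sizeof(int));

    warp_layout<<<1, threads>>>(d_warp, d_lane);

    cudaMemcpy(h_warp, d_warp, threads * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_lane, d_lane, threads * sizeof(int), cudaMemcpyDeviceToHost);
    const int samples[4] = {0, 31, 32, 95};
    for (int k = 0; k < 4; ++k)
        printf("thread %d: warp %d, lane %d\n",
               samples[k], h_warp[samples[k]], h_lane[samples[k]]);
    printf("warps in the block: %d\n", warps_per_block(threads, 32));
    cudaFree(d_warp);
    cudaFree(d_lane);
    return 0;
}
```

A block whose size is not a multiple of 32 still occupies a whole number of warps, with the last warp only partly used, which is one reason block sizes are usually chosen as multiples of 32.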


CUDA : memory model

Local and Global Memory are not cached. Local and Global Memory are implemented in separate circuits. The access time to Local and Global Memory is much longer than the register access time.

CUDA : memory model

There are 1024 register entries per SM. Access to these registers is very fast. Each register may store one 32-bit integer or floating-point number.

CUDA : memory model

Global Memory: from 256 MB to 2 GB (up to 4 GB in Tesla). The data bandwidth may exceed 100 GB/s, but the latency is high (several hundred clock cycles). There is no cache memory for Global Memory. Global Memory is used for global data and instructions.

CUDA : memory model

Shared Memory: 16 KB of shared memory for all threads in a block of threads. Shared Memory is as fast as the registers.

CUDA : memory model

Constant Memory: 64 KB, read-only for all SMs. Constant Memory is a high-latency memory with an access time of several hundred clock cycles. That is why Constant Memory data are cached in blocks of 8 KB for each SM.
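A typical use of Constant Memory is a small table of coefficients that every thread reads. The sketch below is an assumption-laden illustration: the names coeffs, apply_poly and poly_ref and the polynomial itself are invented for the example. It uses the __constant__ qualifier and the runtime call cudaMemcpyToSymbol to fill the table from the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Small read-only table placed in Constant Memory (illustrative name and size).
__constant__ float coeffs[4];

// Every thread reads the same coefficients, so the per-SM constant cache serves them.
__global__ void apply_poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = ((coeffs[3] * v + coeffs[2]) * v + coeffs[1]) * v + coeffs[0];
    }
}

// Host reference for the same polynomial, used only to check the result.
float poly_ref(const float *c, float v)
{
    return ((c[3] * v + c[2]) * v + c[1]) * v + c[0];
}

int main()
{
    const int n = 4;
    const float h_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};   // 1 + 2x + 3x^2 + 4x^3
    float h_x[n] = {0.0f, 1.0f, 2.0f, 3.0f}, h_y[n];
    float *d_x, *d_y;

    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));  // host -> Constant Memory
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    apply_poly<<<1, 32>>>(d_x, d_y, n);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("p(%.0f) = %.0f (reference %.0f)\n",
               h_x[i], h_y[i], poly_ref(h_coeffs, h_x[i]));
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```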


CUDA : memory model

Texture Memory is accessible (read-only) to all SMs. Texture data are used directly by the GPU; they may be interpolated linearly without additional operations. Texture Memory has a long access latency and is cached.

CUDA : memory model

Typical use of CUDA memories:
divide the task into several sub-tasks
decompose the input data into blocks that correspond to the shared memory size
each block of data will be processed by a block of threads
load the data blocks from the Global Memory to Shared Memory
process the data in the Shared Memory
copy the results from the Shared Memory to Global Memory
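The steps listed above can be sketched with a block-wise sum. Each block of threads loads one tile of the input from Global Memory into Shared Memory, reduces it there, and writes a single partial result back to Global Memory. The kernel name block_sum, the helper tiles_for and the tile size of 256 are assumptions made for this example, not the author's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256   // elements per block; 256 floats = 1 KB of Shared Memory

// Hypothetical helper: number of tiles (blocks) needed to cover n elements.
int tiles_for(int n, int tile)
{
    return (n + tile - 1) / tile;
}

__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float tile[TILE];
    int tid = threadIdx.x;
    int i = blockIdx.x * TILE + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;     // load: Global Memory -> Shared Memory
    __syncthreads();                        // wait until the whole tile is loaded

    // process: tree reduction inside Shared Memory
    for (int stride = TILE / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = tile[0];      // store: Shared Memory -> Global Memory
}

int main()
{
    const int n = 1000;
    const int blocks = tiles_for(n, TILE);  // 4 blocks, the last one partly filled
    static float h_in[n];
    float h_partial[4];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_partial;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, TILE>>>(d_in, d_partial, n);

    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += h_partial[b];
    printf("sum = %.0f\n", total);          // prints "sum = 1000"
    cudaFree(d_in);
    cudaFree(d_partial);
    return 0;
}
```

The two __syncthreads() calls are the synchronization points of the programming model: no thread may read a tile entry before the thread responsible for it has written it.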

CUDA : program example

[Code listing: main() function at the CPU side, shown over three slides]

CUDA : program example

Kernel function at the GPU side, launched with 10 threads: there is no loop but several threads, each thread with its own index threadIdx.x.
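The slide listings themselves are not reproduced in this text. As a stand-in, here is a minimal sketch of a complete program of the same shape: a main() function on the CPU side and a kernel executed by 10 threads, each using threadIdx.x in place of a loop counter. The kernel name add_ten and the choice of an element-wise addition are assumptions, not the author's original example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 10   // one thread per element, as in the 10-thread example

// GPU side: no loop; thread number threadIdx.x handles element threadIdx.x.
__global__ void add_ten(const int *a, const int *b, int *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

// CPU side: allocate, copy in, launch, copy out, check.
int main()
{
    int h_a[N], h_b[N], h_c[N];
    for (int i = 0; i < N; ++i) { h_a[i] = i; h_b[i] = 10 * i; }

    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, N * sizeof(int));
    cudaMalloc(&d_b, N * sizeof(int));
    cudaMalloc(&d_c, N * sizeof(int));
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);

    add_ten<<<1, N>>>(d_a, d_b, d_c);        // 1 block of 10 threads
    cudaError_t err = cudaGetLastError();    // reports launch errors, if any
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    int ok = 1;
    for (int i = 0; i < N; ++i) {
        printf("%d + %d = %d\n", h_a[i], h_b[i], h_c[i]);
        if (h_c[i] != 11 * i) ok = 0;        // i + 10*i == 11*i
    }
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return ok ? 0 : 1;
}
```

A file like this would typically be built with nvcc, for example nvcc -o add_ten add_ten.cu, where the file name is again only an assumption.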

CUDA and graphic APIs

CUDA programs may exploit the graphics functions provided by graphics APIs (DirectX, OpenGL). These functions provide the image processing operations needed for rasterization, shading and rendering of images on the screen. The proposed module does not deal with these primitives. However, some OpenGL operations may be used in practical classes to display images directly from GPU memory.

Summary
Evolution of multiprocessing
CPUs and GPUs
SIMD and SIMT processing modes
Performance of GPUs
NVIDIA and CUDA
CUDA processing model
CUDA memory model
A simple example
