CUDA New Features and Beyond: Ampere Programming for Developers
THE NVIDIA AMPERE GA102 GPU ARCHITECTURE
Adding new RTX desktop GPUs alongside the datacenter-class A100
THE NVIDIA AMPERE GPU ARCHITECTURE
Asynchronous barriers
L2 cache management
For complete details of CUDA on the NVIDIA Ampere GPU architecture, see the GTC May 2020 CUDA 11.0 talk: https://developer.nvidia.com/gtc/2020/video/s21760
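For the L2 cache management feature listed above, the sketch below shows the residency-control pattern from the CUDA 11 runtime API: an access-policy window attached to a stream. The pointer, window size, and hit ratio are illustrative values, not from the talk.

#include <cuda_runtime.h>

// Minimal sketch: mark [ptr, ptr + num_bytes) as a persisting L2 window for
// work launched into `stream`. The values below are placeholders.
void set_persisting_l2_window(cudaStream_t stream, void* ptr, size_t num_bytes) {
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = ptr;
    attr.accessPolicyWindow.num_bytes = num_bytes;                    // window size
    attr.accessPolicyWindow.hitRatio  = 0.6f;                         // fraction of the window treated as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // keep hits resident in L2
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // stream everything else through L2
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}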
CUDA ON WINDOWS SUBSYSTEM FOR LINUX
Preview Available as Part of Microsoft Windows Insider Program
GPU-ACCELERATED DATA SCIENCE ON WSL
See the CUDA-on-WSL blog for full details: https://developer.nvidia.com/blog/announcing-cuda-on-windows-subsystem-for-Linux-2/
[Figure: TensorFlow container running inside WSL 2]
ANATOMY OF A CUDA APPLICATION
CUDA Application
Install CUDA 11.1: https://developer.nvidia.com/cuda-downloads
CUDA 11.1 Toolkit: CUDA C++ compilers, standard library, parallel algorithms, developer tools, libraries (image & video, math libraries)
R455 (base 11.1) driver: R455 user-mode libraries and CUDA driver; R455 kernel-mode display driver
The application is built with the 11.1 toolchain and runs against the 11.1 libraries
ENTERPRISE & CLOUD DEPLOYMENT, TODAY
Deployed in an application container that bundles the 11.1 libraries and the 11.1 toolchain
BACKWARD COMPATIBILITY BEFORE CUDA 11.1
An application built with an older toolkit always runs on a newer driver
ENHANCED COMPATIBILITY SINCE CUDA 11.1
Newer CUDA 11.x will now run on older drivers with the same major version; running on newer drivers always works, as before
CUDA ENHANCED COMPATIBILITY
NEW MULTI-INSTANCE GPU (MIG)
Divide a Single A100 GPU Into Multiple Instances, Each With Isolated Paths Through the Entire Memory System
Up to 7 GPU instances in a single A100: full software stack enabled on each instance, with dedicated SMs, memory, L2 cache & bandwidth
Simultaneous workload execution with guaranteed quality of service: all MIG instances run in parallel with predictable throughput & latency, and fault & error isolation
Diverse deployment environments: supported with bare metal, Docker, Kubernetes Pod, virtualized environments
[Diagram: seven users (USER0 to USER6), each mapped to a GPU instance with its own SMs, control and data crossbars, L2 slice, and DRAM]
LOGICAL VS. PHYSICAL PARTITIONING
[Diagram: workloads A, B, and C sharing the GPU under logical vs. physical partitioning]
CUDA CONCURRENCY MECHANISMS
EXECUTION SCHEDULING & MANAGEMENT
Pre-emptive scheduling: processes share the GPU through time-slicing; scheduling is managed by the system; the time-slice is configurable via nvidia-smi
Concurrent scheduling: processes run on the GPU simultaneously; the user creates & manages scheduling streams; CUDA 11.0 adds a new stream priority level
[Diagram: processes A, B, and C time-sliced over time vs. running concurrently]
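As a minimal sketch of the concurrent-scheduling side (the stream names and the commented-out kernel launches are illustrative), streams with different priorities are created like this:

#include <cuda_runtime.h>

void create_priority_streams() {
    // Query the device's priority range; numerically lower values mean higher priority.
    int leastPriority, greatestPriority;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    cudaStream_t highPrio, lowPrio;
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPriority);
    cudaStreamCreateWithPriority(&lowPrio,  cudaStreamNonBlocking, leastPriority);

    // latency_critical_kernel<<<grid, block, 0, highPrio>>>(...);  // favored by the scheduler
    // background_kernel<<<grid, block, 0, lowPrio>>>(...);
}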
FINE-GRAINED SYNCHRONIZATION
Within a thread block, __syncthreads() provides a single block-wide barrier.
The NVIDIA Ampere GPU Architecture allows creation of arbitrary barriers: groups of threads can synchronize on their own barrier objects in addition to the block-wide __syncthreads() barrier.
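As a minimal sketch of such arbitrary barriers, assuming cuda::barrier from libcu++ (the split of the block into two halves is purely illustrative):

#include <cuda/barrier>
#include <cooperative_groups.h>

__global__ void half_block_barriers() {
    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();

    // Two independent barriers, each shared by half of the thread block.
    __shared__ cuda::barrier<cuda::thread_scope_block> bar[2];
    if (block.thread_rank() == 0) {
        init(&bar[0], block.size() / 2);
        init(&bar[1], block.size() / 2);
    }
    block.sync();

    int half = (block.thread_rank() < block.size() / 2) ? 0 : 1;

    // ...work private to this half of the block...

    bar[half].arrive_and_wait();   // synchronizes only the threads in this half
}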
BARRIERS ALLOW EXCHANGE OF INFORMATION
Threads synchronize, then a consumer step processes the results. The synchronization point can be split into separate arrive and wait operations.
ASYNCHRONOUS BARRIERS
Arrive and wait are separate operations: a thread arrives, performs independent work, and only waits when the consumer step needs to process the results. This enables pipelined processing.
SINGLE-STAGE vs. ASYNCHRONOUS BARRIERS
Single-stage barrier: arrive and wait happen together, so all threads block on the slowest arrival.
Asynchronous barrier: independent work can be done between arrive and wait, enabling pipelined processing.
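A minimal sketch of the split arrive/wait pattern with cuda::barrier; the independent work and the consumer step are placeholders:

#include <cuda/barrier>
#include <cooperative_groups.h>

__global__ void split_arrive_wait_example() {
    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();

    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (block.thread_rank() == 0) {
        init(&bar, block.size());            // one expected arrival per thread
    }
    block.sync();

    // Producer step: publish this thread's result, then arrive (non-blocking).
    auto token = bar.arrive();

    // ...independent work that does not depend on other threads...

    bar.wait(std::move(token));              // block only when the results are needed
    // Consumer step: contributions from all threads are now visible.
}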
COPYING DATA INTO SHARED MEMORY
Two-step copy to shared memory via registers: (1) a thread loads data from GPU memory (HBM, via the L1 cache) into registers, then (2) stores it from registers into shared memory.
Asynchronous direct copy to shared memory: direct transfer into shared memory, bypassing thread resources.
SIMPLE DATA MOVEMENT
[Diagram: data is loaded into shared memory and processed one image at a time]
DOUBLE-BUFFERED DATA MOVEMENT
[Diagram: two shared-memory buffers; while one image is being processed, the next image is loaded into the other buffer]
#pragma unroll 2
for (int e = 0; e < NUM_ELEMS; e++) {
    // Kick off load of next image into the alternate buffer
    pixel[(e+1) & 1] = image[image_offset(e+1)];
    // ...process pixel[e & 1], the image loaded on the previous iteration...
}
ASYNCHRONOUS DIRECT DATA MOVEMENT
[Diagram: data is copied asynchronously from GPU memory straight into shared memory; a barrier tracks completion of the copy before the results are processed]
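A minimal sketch of this pattern using cuda::memcpy_async completing on a block-scoped cuda::barrier; the tile size and the processing step are illustrative placeholders:

#include <cuda/barrier>
#include <cooperative_groups.h>

constexpr int TILE_ELEMS = 1024;   // illustrative tile size

__global__ void async_direct_copy(const float* global_in) {
    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();

    __shared__ float smem[TILE_ELEMS];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (block.thread_rank() == 0) {
        init(&bar, block.size());
    }
    block.sync();

    // Direct global -> shared copy; data does not pass through registers.
    // The barrier completes once all threads arrive and the copy has finished.
    cuda::memcpy_async(block, smem, global_in, sizeof(smem), bar);

    // ...independent work can overlap with the copy here...

    bar.arrive_and_wait();
    // ...process smem...
}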
ASYNCHRONOUS COPY PIPELINES
Prefetch multiple images in a continuous stream
[Diagram: a pipeline keeps several asynchronous copies (P1, P2, P3) in flight into shared memory ahead of processing]
Async copy using a pipeline vs. a barrier: a pipeline allows batching of multiple copy operations into a single transaction
ISO C++ == Language + Standard Library
CUDA C++ == Language + libcu++
libcu++ : THE CUDA C++ STANDARD LIBRARY
cuda::std::
Opt-in: does not interfere with or replace your host standard library
libcu++, the NVIDIA C++ Standard Library, is the C++ Standard Library for your entire system
Open source, licensed under the Apache License 2.0 with LLVM Exceptions
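A minimal sketch of the opt-in usage, assuming cuda::std::atomic from <cuda/std/atomic>; the kernel and the counter are illustrative, and the counter is assumed to be allocated and zero-initialized by the caller (for example in managed memory):

#include <cuda/std/atomic>

// The same atomic type is usable from host and device code.
__global__ void count_even(const int* data, int n, cuda::std::atomic<int>* counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && (data[i] % 2) == 0) {
        counter->fetch_add(1, cuda::std::memory_order_relaxed);
    }
}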
COOPERATIVE GROUPS
Cooperative Groups features work on all GPU architectures (incl. Kepler)
New platform support (Windows, and Linux + MPS)
cg::memcpy_async(tile32, dst, dstCount, src, srcCount);
COOPERATIVE GROUPS MAPS NATURALLY TO PIPELINES
Create groups with any power-of-2 number of threads, even spanning multiple warps
Sync and perform collective operations within these groups
[Diagram: shared memory split into pipeline stages, with the thread block divided into 4 groups of 64 threads]

constexpr int N = 4;  // 4-stage pipeline
__shared__ float smem[N][NUM_ELEMS];
__shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, N> ps;
auto group = cooperative_groups::this_thread_block();
auto pipe  = cuda::make_pipeline(group, &ps);
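Continuing the snippet above, a minimal sketch of the produce/consume loop; num_tiles and global_in are illustrative names not from the slide, and each stage is assumed to copy one tile of NUM_ELEMS floats:

// Each iteration acquires a pipeline stage, issues an async copy into that
// stage's shared-memory buffer, and consumes the oldest stage once it is ready.
for (int i = 0; i < num_tiles + (N - 1); ++i) {
    if (i < num_tiles) {
        pipe.producer_acquire();
        cuda::memcpy_async(group, smem[i % N], &global_in[i * NUM_ELEMS],
                           sizeof(float) * NUM_ELEMS, pipe);
        pipe.producer_commit();
    }
    if (i >= N - 1) {
        int j = i - (N - 1);
        pipe.consumer_wait();        // the copy for tile j is complete
        // ...process smem[j % N]...
        pipe.consumer_release();
    }
}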
MULTI-WARP COOPERATIVE GROUPS
FLOATING POINT FORMATS & PRECISION
NEW FLOATING POINT FORMATS: BF16 & TF32
Both match fp32's 8-bit exponent: covers the same range of values

bfloat16 (8-bit exponent, 7-bit mantissa, 16-bit storage size):
Available in CUDA C++ as the nv_bfloat16 numerical type
Full CUDA C++ numerical type – #include <cuda_bf16.h>
Can be used in both host & device code, and in templated functions*

TF32 (8-bit exponent, 10-bit mantissa, 32-bit storage size):
Tensor Core math mode for single-precision training
Not a numerical type – tensor core inputs are rounded to TF32
CUDA C++ programs use float (fp32) throughout
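A minimal sketch of using the bfloat16 type from CUDA C++; the scaling kernel itself is illustrative:

#include <cuda_bf16.h>

// Scale a vector stored as bfloat16. bf16 shares fp32's 8-bit exponent,
// so the float conversions below cover the same range; only mantissa bits are dropped.
__global__ void scale_bf16(__nv_bfloat16* x, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __bfloat162float(x[i]);      // bf16 -> fp32
        x[i] = __float2bfloat16(alpha * v);    // fp32 math, stored back as bf16
    }
}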
TENSOR FLOAT 32 - TENSOR CORE MODE
TF32 is the A100 Tensor Core input precision; all internal operations maintain full FP32 precision
[Diagram: FP32 inputs are rounded to TF32 for the multiply, then summed with an FP32 accumulator to produce an FP32 output]
TF32 MMA dimensions: m,n,k = 16x8x8
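To make the TF32 mode concrete, here is a hedged host-side sketch of an fp32 GEMM that opts into TF32 tensor-core math through cuBLAS; the matrix pointers and dimensions are assumed to be set up by the caller, and error checking is omitted:

#include <cublas_v2.h>

// fp32 GEMM routed through the A100 tensor cores with TF32 inputs.
void tf32_gemm(int m, int n, int k,
               const float* dA, int lda,
               const float* dB, int ldb,
               float* dC, int ldc) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);    // allow TF32 for fp32 math

    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k, &alpha,
                 dA, CUDA_R_32F, lda,
                 dB, CUDA_R_32F, ldb,
                 &beta,
                 dC, CUDA_R_32F, ldc,
                 CUBLAS_COMPUTE_32F_FAST_TF32,                 // TF32 tensor-core compute type
                 CUBLAS_GEMM_DEFAULT);
    cublasDestroy(handle);
}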
A100 INTRODUCES DOUBLE PRECISION TENSOR CORES
All A100 Tensor Core Internal Operations Maintain Full FP64 Precision
A100 GPU ACCELERATED MATH LIBRARIES IN CUDA 11.0
cuBLAS
Eliminating Alignment Requirements To Activate Tensor Cores for MMA
AlignN means alignment to 16-bit multiples of N. For example, align8 problems are aligned to 128 bits (16 bytes).
MATH LIBRARY DEVICE EXTENSIONS
Introducing cuFFTDx: Device Extension
https://developer.nvidia.com/CUDAMathLibraryEA
GPU PROGRAMMING IN 2020 AND BEYOND
Math Libraries | Standard Languages | Directives | CUDA

Standard languages (C++ parallel algorithms):
std::transform(par, x, x+n, y, y,
               [=](float x, float y) {
                   return y + a*x;
               });

Standard languages (Fortran):
do concurrent (i = 1:n)
   y(i) = y(i) + a*x(i)
enddo

Directives (OpenACC):
#pragma acc data copy(x,y)
{
   ...
}

CUDA C++:
__global__
void saxpy(int n, float a,
           float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] += a*x[i];
}

int main(void) {
    cudaMallocManaged(&x, ...);
    cudaMallocManaged(&y, ...);
    ...
    saxpy<<<(N+255)/256,256>>>(..., x, y);
    cudaDeviceSynchronize();
    ...
}
DOWNLOAD CUDA 11.1 TODAY
https://developer.nvidia.com/cuda-downloads