CUDA New Features and Beyond: Ampere Programming for Developers
THE NVIDIA AMPERE GA102 GPU ARCHITECTURE
Adding new RTX desktop GPUs alongside the datacenter-class A100
THE NVIDIA AMPERE GPU ARCHITECTURE
Asynchronous barriers
L2 cache management
For complete details of CUDA on the NVIDIA Ampere GPU architecture, see the GTC May 2020 CUDA 11.0 talk: https://developer.nvidia.com/gtc/2020/video/s21760
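For the L2 cache management feature listed above, the sketch below shows the residency-control pattern from the CUDA 11 runtime API: an access-policy window attached to a stream. The pointer, window size, and hit ratio are illustrative values, not from the talk.

#include <cuda_runtime.h>

// Minimal sketch: mark [ptr, ptr + num_bytes) as a persisting L2 window for
// work launched into `stream`. The values below are placeholders.
void set_persisting_l2_window(cudaStream_t stream, void* ptr, size_t num_bytes) {
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = ptr;
    attr.accessPolicyWindow.num_bytes = num_bytes;                    // window size
    attr.accessPolicyWindow.hitRatio  = 0.6f;                         // fraction of the window treated as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // keep hits resident in L2
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // stream everything else through L2
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}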
CUDA ON WINDOWS SUBSYSTEM FOR LINUX
Preview Available as Part of Microsoft Windows Insider Program
GPU-ACCELERATED DATA SCIENCE ON WSL
See the CUDA-on-WSL blog for full details: https://developer.nvidia.com/blog/announcing-cuda-on-windows-subsystem-for-Linux-2/
[Figure: TensorFlow container running inside WSL 2]
ANATOMY OF A CUDA APPLICATION
CUDA Application
Install CUDA 11.1: https://developer.nvidia.com/cuda-downloads
CUDA 11.1 Toolkit: CUDA C++ compilers, standard library, parallel algorithms, developer tools, libraries (image & video, math libraries)
R455 (base 11.1) driver: R455 user-mode libraries and CUDA driver; R455 kernel-mode display driver
The application is built with the 11.1 toolchain and runs against the 11.1 libraries
ENTERPRISE & CLOUD DEPLOYMENT, TODAY
Deployed in an application container that bundles the 11.1 libraries and the 11.1 toolchain
BACKWARD COMPATIBILITY BEFORE CUDA 11.1
An application built with an older toolkit always runs on a newer driver
ENHANCED COMPATIBILITY SINCE CUDA 11.1
Newer CUDA 11.x will now run on older drivers with the same major version; running on newer drivers always works, as before
CUDA ENHANCED COMPATIBILITY
NEW MULTI-INSTANCE GPU (MIG)
Divide a Single A100 GPU Into Multiple Instances, Each With Isolated Paths Through the Entire Memory System
Up to 7 GPU instances in a single A100: full software stack enabled on each instance, with dedicated SMs, memory, L2 cache & bandwidth
Simultaneous workload execution with guaranteed quality of service: all MIG instances run in parallel with predictable throughput & latency, and fault & error isolation
Diverse deployment environments: supported with bare metal, Docker, Kubernetes Pod, virtualized environments
[Diagram: seven users (USER0 to USER6), each mapped to a GPU instance with its own SMs, control and data crossbars, L2 slice, and DRAM]
LOGICAL VS. PHYSICAL PARTITIONING
[Diagram: workloads A, B, and C sharing the GPU under logical vs. physical partitioning]
CUDA CONCURRENCY MECHANISMS
EXECUTION SCHEDULING & MANAGEMENT
Pre-emptive scheduling: processes share the GPU through time-slicing; scheduling is managed by the system; the time-slice is configurable via nvidia-smi
Concurrent scheduling: processes run on the GPU simultaneously; the user creates & manages scheduling streams; CUDA 11.0 adds a new stream priority level
[Diagram: processes A, B, and C time-sliced over time vs. running concurrently]
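As a minimal sketch of the concurrent-scheduling side (the stream names and the commented-out kernel launches are illustrative), streams with different priorities are created like this:

#include <cuda_runtime.h>

void create_priority_streams() {
    // Query the device's priority range; numerically lower values mean higher priority.
    int leastPriority, greatestPriority;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    cudaStream_t highPrio, lowPrio;
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPriority);
    cudaStreamCreateWithPriority(&lowPrio,  cudaStreamNonBlocking, leastPriority);

    // latency_critical_kernel<<<grid, block, 0, highPrio>>>(...);  // favored by the scheduler
    // background_kernel<<<grid, block, 0, lowPrio>>>(...);
}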
FINE-GRAINED SYNCHRONIZATION
Within a thread block, __syncthreads() provides a single block-wide barrier.
The NVIDIA Ampere GPU Architecture allows creation of arbitrary barriers: groups of threads can synchronize on their own barrier objects in addition to the block-wide __syncthreads() barrier.
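As a minimal sketch of such arbitrary barriers, assuming cuda::barrier from libcu++ (the split of the block into two halves is purely illustrative):

#include <cuda/barrier>
#include <cooperative_groups.h>

__global__ void half_block_barriers() {
    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();

    // Two independent barriers, each shared by half of the thread block.
    __shared__ cuda::barrier<cuda::thread_scope_block> bar[2];
    if (block.thread_rank() == 0) {
        init(&bar[0], block.size() / 2);
        init(&bar[1], block.size() / 2);
    }
    block.sync();

    int half = (block.thread_rank() < block.size() / 2) ? 0 : 1;

    // ...work private to this half of the block...

    bar[half].arrive_and_wait();   // synchronizes only the threads in this half
}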
BARRIERS ALLOW EXCHANGE OF INFORMATION
Threads synchronize, then a consumer step processes the results. The synchronization point can be split into separate arrive and wait operations.
ASYNCHRONOUS BARRIERS
Arrive and wait are separate operations: a thread arrives, performs independent work, and only waits when the consumer step needs to process the results. This enables pipelined processing.
SINGLE-STAGE vs. ASYNCHRONOUS BARRIERS
Single-stage barrier: arrive and wait happen together, so all threads block on the slowest arrival.
Asynchronous barrier: independent work can be done between arrive and wait, enabling pipelined processing.
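A minimal sketch of the split arrive/wait pattern with cuda::barrier; the independent work and the consumer step are placeholders:

#include <cuda/barrier>
#include <cooperative_groups.h>

__global__ void split_arrive_wait_example() {
    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();

    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (block.thread_rank() == 0) {
        init(&bar, block.size());            // one expected arrival per thread
    }
    block.sync();

    // Producer step: publish this thread's result, then arrive (non-blocking).
    auto token = bar.arrive();

    // ...independent work that does not depend on other threads...

    bar.wait(std::move(token));              // block only when the results are needed
    // Consumer step: contributions from all threads are now visible.
}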
COPYING DATA INTO SHARED MEMORY
Two-step copy to shared memory via registers: (1) a thread loads data from GPU memory (HBM, via the L1 cache) into registers, then (2) stores it from registers into shared memory.
Asynchronous direct copy to shared memory: direct transfer into shared memory, bypassing thread resources.
SIMPLE DATA MOVEMENT
[Diagram: data is loaded into shared memory and processed one image at a time]
DOUBLE-BUFFERED DATA MOVEMENT
[Diagram: two shared-memory buffers; while one image is being processed, the next image is loaded into the other buffer]
#pragma unroll 2
for (int e = 0; e < NUM_ELEMS; e++) {
    // Kick off load of next image into the alternate buffer
    pixel[(e+1) & 1] = image[image_offset(e+1)];
    // ...process pixel[e & 1], the image loaded on the previous iteration...
}
ASYNCHRONOUS DIRECT DATA MOVEMENT
[Diagram: data is copied asynchronously from GPU memory straight into shared memory; a barrier tracks completion of the copy before the results are processed]
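A minimal sketch of this pattern using cuda::memcpy_async completing on a block-scoped cuda::barrier; the tile size and the processing step are illustrative placeholders:

#include <cuda/barrier>
#include <cooperative_groups.h>

constexpr int TILE_ELEMS = 1024;   // illustrative tile size

__global__ void async_direct_copy(const float* global_in) {
    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();

    __shared__ float smem[TILE_ELEMS];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (block.thread_rank() == 0) {
        init(&bar, block.size());
    }
    block.sync();

    // Direct global -> shared copy; data does not pass through registers.
    // The barrier completes once all threads arrive and the copy has finished.
    cuda::memcpy_async(block, smem, global_in, sizeof(smem), bar);

    // ...independent work can overlap with the copy here...

    bar.arrive_and_wait();
    // ...process smem...
}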
ASYNCHRONOUS COPY PIPELINES
Prefetch multiple images in a continuous stream
[Diagram: a pipeline keeps several asynchronous copies (P1, P2, P3) in flight into shared memory ahead of processing]
Async copy using a pipeline vs. a barrier: a pipeline allows batching of multiple copy operations into a single transaction
ISO C++ == Language + Standard Library
CUDA C++ == Language + libcu++
libcu++ : THE CUDA C++ STANDARD LIBRARY
cuda::std::
Opt-in: does not interfere with or replace your host standard library
libcu++, the NVIDIA C++ Standard Library, is the C++ Standard Library for your entire system
Open source, licensed under the Apache License 2.0 with LLVM Exceptions
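A minimal sketch of the opt-in usage, assuming cuda::std::atomic from <cuda/std/atomic>; the kernel and the counter are illustrative, and the counter is assumed to be allocated and zero-initialized by the caller (for example in managed memory):

#include <cuda/std/atomic>

// The same atomic type is usable from host and device code.
__global__ void count_even(const int* data, int n, cuda::std::atomic<int>* counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && (data[i] % 2) == 0) {
        counter->fetch_add(1, cuda::std::memory_order_relaxed);
    }
}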
COOPERATIVE GROUPS
Cooperative Groups features work on all GPU architectures (incl. Kepler)
New platform support (Windows, and Linux + MPS)
cg::memcpy_async(tile32, dst, dstCount, src, srcCount);
COOPERATIVE GROUPS MAPS NATURALLY TO PIPELINES
Create groups with any power-of-2 number of threads, even spanning multiple warps
Sync and perform collective operations within these groups
[Diagram: shared memory split into pipeline stages, with the thread block divided into 4 groups of 64 threads]

constexpr int N = 4;  // 4-stage pipeline
__shared__ float smem[N][NUM_ELEMS];
__shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, N> ps;
auto group = cooperative_groups::this_thread_block();
auto pipe  = cuda::make_pipeline(group, &ps);
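Continuing the snippet above, a minimal sketch of the produce/consume loop; num_tiles and global_in are illustrative names not from the slide, and each stage is assumed to copy one tile of NUM_ELEMS floats:

// Each iteration acquires a pipeline stage, issues an async copy into that
// stage's shared-memory buffer, and consumes the oldest stage once it is ready.
for (int i = 0; i < num_tiles + (N - 1); ++i) {
    if (i < num_tiles) {
        pipe.producer_acquire();
        cuda::memcpy_async(group, smem[i % N], &global_in[i * NUM_ELEMS],
                           sizeof(float) * NUM_ELEMS, pipe);
        pipe.producer_commit();
    }
    if (i >= N - 1) {
        int j = i - (N - 1);
        pipe.consumer_wait();        // the copy for tile j is complete
        // ...process smem[j % N]...
        pipe.consumer_release();
    }
}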
MULTI-WARP COOPERATIVE GROUPS
FLOATING POINT FORMATS & PRECISION
NEW FLOATING POINT FORMATS: BF16 & TF32
Both match fp32's 8-bit exponent: covers the same range of values

bfloat16 (8-bit exponent, 7-bit mantissa, 16-bit storage size):
Available in CUDA C++ as the nv_bfloat16 numerical type
Full CUDA C++ numerical type – #include <cuda_bf16.h>
Can be used in both host & device code, and in templated functions*

TF32 (8-bit exponent, 10-bit mantissa, 32-bit storage size):
Tensor Core math mode for single-precision training
Not a numerical type – tensor core inputs are rounded to TF32
CUDA C++ programs use float (fp32) throughout
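A minimal sketch of using the bfloat16 type from CUDA C++; the scaling kernel itself is illustrative:

#include <cuda_bf16.h>

// Scale a vector stored as bfloat16. bf16 shares fp32's 8-bit exponent,
// so the float conversions below cover the same range; only mantissa bits are dropped.
__global__ void scale_bf16(__nv_bfloat16* x, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __bfloat162float(x[i]);      // bf16 -> fp32
        x[i] = __float2bfloat16(alpha * v);    // fp32 math, stored back as bf16
    }
}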
TENSOR FLOAT 32 - TENSOR CORE MODE
TF32 is the A100 Tensor Core input precision; all internal operations maintain full FP32 precision
[Diagram: FP32 inputs are rounded to TF32 for the multiply, then summed with an FP32 accumulator to produce an FP32 output]
TF32 MMA dimensions: m,n,k = 16x8x8
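To make the TF32 mode concrete, here is a hedged host-side sketch of an fp32 GEMM that opts into TF32 tensor-core math through cuBLAS; the matrix pointers and dimensions are assumed to be set up by the caller, and error checking is omitted:

#include <cublas_v2.h>

// fp32 GEMM routed through the A100 tensor cores with TF32 inputs.
void tf32_gemm(int m, int n, int k,
               const float* dA, int lda,
               const float* dB, int ldb,
               float* dC, int ldc) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);    // allow TF32 for fp32 math

    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k, &alpha,
                 dA, CUDA_R_32F, lda,
                 dB, CUDA_R_32F, ldb,
                 &beta,
                 dC, CUDA_R_32F, ldc,
                 CUBLAS_COMPUTE_32F_FAST_TF32,                 // TF32 tensor-core compute type
                 CUBLAS_GEMM_DEFAULT);
    cublasDestroy(handle);
}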
A100 INTRODUCES DOUBLE PRECISION TENSOR CORES
All A100 Tensor Core Internal Operations Maintain Full FP64 Precision
A100 GPU ACCELERATED MATH LIBRARIES IN CUDA 11.0
cuBLAS
Eliminating Alignment Requirements To Activate Tensor Cores for MMA
AlignN means alignment to 16-bit multiples of N. For example, align8 problems are aligned to 128 bits (16 bytes).
MATH LIBRARY DEVICE EXTENSIONS
Introducing cuFFTDx: Device Extension
https://developer.nvidia.com/CUDAMathLibraryEA
GPU PROGRAMMING IN 2020 AND BEYOND
Math Libraries | Standard Languages | Directives | CUDA

Standard languages (C++ parallel algorithms):
std::transform(par, x, x+n, y, y,
               [=](float x, float y) {
                   return y + a*x;
               });

Standard languages (Fortran):
do concurrent (i = 1:n)
   y(i) = y(i) + a*x(i)
enddo

Directives (OpenACC):
#pragma acc data copy(x,y)
{
   ...
}

CUDA C++:
__global__
void saxpy(int n, float a,
           float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] += a*x[i];
}

int main(void) {
    cudaMallocManaged(&x, ...);
    cudaMallocManaged(&y, ...);
    ...
    saxpy<<<(N+255)/256,256>>>(..., x, y);
    cudaDeviceSynchronize();
    ...
}
DOWNLOAD CUDA 11.1 TODAY
https://developer.nvidia.com/cuda-downloads