Summary: 1) The NVIDIA Ampere architecture adds new RTX desktop GPUs and improves on previous generations with more SMs, higher memory bandwidth, and new Tensor Core precisions. 2) CUDA now runs on Windows Subsystem for Linux (WSL 2), letting developers run GPU-accelerated Linux applications from a Windows desktop. 3) CUDA applications rely on the CUDA toolkit, driver, and libraries working together; since CUDA 11.1, newer CUDA 11.x toolkits also run on older drivers with the same major version.


CUDA NEW FEATURES AND BEYOND:

AMPERE PROGRAMMING FOR DEVELOPERS


Stephen Jones, GTC Fall 2020

1
THE NVIDIA AMPERE GA102 GPU ARCHITECTURE
Adding new RTX desktop GPUs alongside the datacenter-class A100

2
THE NVIDIA AMPERE GA102 GPU ARCHITECTURE

                          Titan RTX       RTX 3090
SMs                       72              82
Tensor Core Precision     FP16            TF32, BF16, FP16, I8, I4, B1
Shared Memory per Block   64 kB           96 kB
L2 Cache Size             6144 kB         6144 kB
Memory Bandwidth          672 GB/sec      936 GB/sec
RT Cores                  72 (1st gen)    82 (2nd gen)

3
THE NVIDIA AMPERE GPU ARCHITECTURE

NVIDIA Ampere Architectural Features

Multi-Instance GPU (A100 only)

Asynchronous barriers

Asynchronous data movement

L2 cache management

Task graph acceleration

New Tensor Core precisions

For full details of CUDA on the NVIDIA Ampere GPU Architecture, see the GTC May 2020 CUDA 11.0 talk: https://developer.nvidia.com/gtc/2020/video/s21760
4
CUDA ON WINDOWS SUBSYSTEM FOR LINUX

Run a Linux kernel natively on top of Windows 10

Runs Linux at near full speed without emulation

Multi-OS development & testing from a single Windows desktop machine

No need for dual-boot systems - ideal for laptops

5
CUDA ON WINDOWS SUBSYSTEM FOR LINUX
Preview Available as Part of Microsoft Windows Insider Program

Linux GPU support is now available for WSL 2 users

Run GPU-accelerated Linux applications natively on your Windows desktop platform

Getting started is simple:

1. Enable WIP in your Windows system settings

2. Download the preview CUDA WSL driver: https://developer.nvidia.com/cuda/wsl

[Diagram: the WDDM model supporting the CUDA user mode driver running inside the Linux guest - libcuda and libdxcore in Linux user mode, /dev/dxg and drivers/gpu/dxgkrnl in the Linux kernel, connected over the VM Bus to the Windows dxgkrnl kernel mode driver and the WDDM paravirtualization driver (KMD).]

6
GPU-ACCELERATED DATA SCIENCE ON WSL

Get the latest version of Docker and run:

- AI Frameworks (PyTorch, TensorFlow)
- RAPIDS & ML Applications
- Jupyter Notebooks

GPU-enabled DirectX, CUDA 11.1 and the NVIDIA Container Toolkit are all available on WSL today

NVML and NCCL support coming soon

See the CUDA-on-WSL blog for full details: https://developer.nvidia.com/blog/announcing-cuda-on-windows-subsystem-for-Linux-2/

[Figure: TensorFlow container running inside WSL 2]

7
ANATOMY OF A CUDA APPLICATION

CUDA Application

8
ANATOMY OF A CUDA APPLICATION

CUDA Application

Install CUDA 11.1 from https://developer.nvidia.com/cuda-downloads

9
ANATOMY OF A CUDA APPLICATION

CUDA Application

CUDA 11.1 Toolkit

R455 (Base 11.1) Driver

10
ANATOMY OF A CUDA APPLICATION

CUDA Application

CUDA 11.1 Toolkit: CUDA C++ compilers, developer tools, standard library, parallel algorithm libraries, image & video libraries, math libraries

R455 (Base 11.1) Driver: R455 user mode libraries (CUDA driver), R455 kernel mode display driver

11
ANATOMY OF A CUDA APPLICATION

Application

11.1 Libraries + 11.1 Toolchain

R455 (Base 11.1) Driver Stack

12
ENTERPRISE & CLOUD DEPLOYMENT, TODAY

The application, its 11.1 libraries and the 11.1 toolchain are deployed in a container

The R455 (Base 11.1) driver stack is installed on the system

Versions must match

13
BACKWARD COMPATIBILITY BEFORE CUDA 11.1

- Older CUDA always runs on newer drivers

- Future CUDA does not run on older drivers

[Diagram: CUDA 11.0, CUDA 11.1 and future CUDA 11.x toolkits against older, R450, R455 and future drivers - older CUDA always runs on newer drivers, but the R450 (CUDA 11.0) driver does not run CUDA 11.1 applications.]

14
ENHANCED COMPATIBILITY SINCE CUDA 11.1

- Older CUDA always runs on newer drivers

- Newer CUDA 11.x will now run on older drivers with the same major version

[Diagram: CUDA 11.0, CUDA 11.1 and future CUDA 11.x toolkits against R450, R455 and future drivers.]

The R450 (CUDA 11.0) driver now runs CUDA 11.1 applications

Future CUDA 11.x will also run on an 11.0 system

CUDA 11.x will not run on pre-11.0 drivers

15
CUDA ENHANCED COMPATIBILITY

Build with any 11.x toolchain

Libraries from different CUDA versions (e.g. 11.1 cuBLAS, 11.0 NCCL, 11.x cuFFT) run on any base driver

Deploy on any 11.x system - the application runs on any CUDA 11.x base driver stack
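As a hedged illustration of what this means in practice (not from the deck), an application can query the runtime and driver versions at startup and only require that the major versions match; the helper name below is hypothetical and the check is a simplification of the full compatibility rules:

// Minimal sketch: compare the CUDA major version the application was built
// against with the major version supported by the installed driver.
#include <cuda_runtime.h>
#include <cstdio>

bool check_enhanced_compatibility()
{
    int runtimeVersion = 0, driverVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);   // e.g. 11010 for CUDA 11.1
    cudaDriverGetVersion(&driverVersion);     // e.g. 11000 for an R450 (11.0) driver

    printf("Runtime CUDA %d.%d, driver supports CUDA %d.%d\n",
           runtimeVersion / 1000, (runtimeVersion % 1000) / 10,
           driverVersion / 1000, (driverVersion % 1000) / 10);

    // With enhanced compatibility, only the major versions need to match
    return (runtimeVersion / 1000) == (driverVersion / 1000);
}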

16
NEW MULTI-INSTANCE GPU (MIG)
Divide a Single A100 GPU Into Multiple Instances, Each With Isolated Paths Through the Entire Memory System

Up To 7 GPU Instances In a Single A100
Full software stack enabled on each instance, with dedicated SM, memory, L2 cache & bandwidth

Simultaneous Workload Execution With Guaranteed Quality Of Service
All MIG instances run in parallel with predictable throughput & latency, fault & error isolation

Diverse Deployment Environments
Supported with bare metal, Docker, Kubernetes Pod, virtualized environments

[Diagram: seven users (USER0-USER6), each mapped to its own GPU instance with dedicated SMs, control pipe, data crossbar ports, L2 cache slice and DRAM.]

17
LOGICAL VS. PHYSICAL PARTITIONING

[Diagram: clients A, B and C (TensorFlow, PyTorch, Jarvis + TensorRT, TensorRT) sharing the GPU through the CUDA Multi-Process Service control and the GPU Multi-Process Service, compared with Multi-Instance GPU partitions.]

Multi-Process Service: dynamic contention for GPU resources; single tenant

Multi-Instance GPU: hierarchy of instances with guaranteed resource allocation; multiple tenants

18
CUDA CONCURRENCY MECHANISMS

                          Streams          MPS              MIG
Partition Type            Single process   Logical          Physical
Max Partitions            Unlimited        48               7
Fractional Provisioning   No               Yes              Yes
Memory Protection         No               Yes              Yes
Memory Bandwidth QoS      No               No               Yes
Fault Isolation           No               No               Yes
Cross-Partition Interop   Always           IPC              Limited IPC
Reconfigure               Dynamic          Process launch   When idle

19
EXECUTION SCHEDULING & MANAGEMENT

Pre-emptive scheduling: processes share the GPU through time-slicing; scheduling is managed by the system

Concurrent scheduling: processes run on the GPU simultaneously; the user creates & manages scheduling streams

[Diagram: processes A, B and C alternating on the GPU in time-slices vs. running on the GPU concurrently.]

Time-slice configurable via nvidia-smi:
$ nvidia-smi compute-policy --set-timeslice={default, short, medium, long}

CUDA 11.0 adds a new stream priority level:
cudaStreamCreateWithPriority(pStream, flags, priority);
cudaDeviceGetStreamPriorityRange(leastPriority, greatestPriority);
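A small sketch (not from the deck) of the stream-priority calls listed above; the kernel launch at the end is hypothetical:

// Minimal sketch: create a high-priority and a low-priority stream.
// Lower numerical values mean higher priority.
#include <cuda_runtime.h>

int leastPriority, greatestPriority;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

cudaStream_t highPrio, lowPrio;
cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPriority);
cudaStreamCreateWithPriority(&lowPrio,  cudaStreamNonBlocking, leastPriority);

// Pending work in highPrio is preferentially scheduled ahead of pending work in lowPrio
// kernel<<<grid, block, 0, highPrio>>>(...);   // hypothetical kernel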
20
FINE-GRAINED SYNCHRONIZATION

[Diagram: a thread block synchronizing all of its threads with __syncthreads().]

21
FINE-GRAINED SYNCHRONIZATION

[Diagram: a thread block synchronizing all of its threads with __syncthreads().]

22
FINE-GRAINED SYNCHRONIZATION
NVIDIA Ampere GPU Architecture Allows Creation Of Arbitrary Barriers

[Diagram: one thread block using __syncthreads(), next to a thread block using several independent barriers for subsets of its threads.]

23
FINE-GRAINED SYNCHRONIZATION
NVIDIA Ampere GPU Architecture Allows Creation Of Arbitrary Barriers

[Diagram: one thread block using __syncthreads(), next to a thread block using several independent barriers for subsets of its threads.]

24
FINE-GRAINED SYNCHRONIZATION
NVIDIA Ampere GPU Architecture Allows Creation Of Arbitrary Barriers

[Diagram: a thread block using __syncthreads(), next to a thread block synchronizing on a single user-created barrier.]

25
BARRIERS ALLOW EXCHANGE OF INFORMATION

Producer step: compute value & store result

Synchronize

Consumer step: process results

26
BARRIERS ALLOW EXCHANGE OF INFORMATION

Producer step: compute value & store result

Arrive

Wait

Consumer step: process results

27
ASYNCHRONOUS BARRIERS

Producer step: compute value & store result

Arrive

Wait

Consumer step: process results

28
ASYNCHRONOUS BARRIERS

Producer step: compute value & store result

Arrive

Wait - all threads have arrived, so "Wait" is already satisfied

Consumer step: process results

29
ASYNCHRONOUS BARRIERS

Producer step: compute value & store result

Arrive

Independent work - pipelined processing

Wait - all threads have arrived, so "Wait" is already satisfied

Consumer step: process results

30
SINGLE-STAGE vs. ASYNCHRONOUS BARRIERS

Single-stage barrier: Produce Data -> Arrive + Wait (all threads block on the slowest arrival) -> Consume Data

Asynchronous barrier: Produce Data -> Arrive -> Independent Work (pipelined processing) -> Wait -> Consume Data

Single-stage barriers combine back-to-back arrive & wait; asynchronous barriers enable pipelined processing
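A minimal device-side sketch of this split arrive/wait pattern using the cuda::barrier type from libcu++; produce(), do_independent_work() and consume() are placeholders, not from the deck:

// Requires #include <cuda/barrier>
__shared__ cuda::barrier<cuda::thread_scope_block> bar;
if (threadIdx.x == 0) init(&bar, blockDim.x);   // one expected arrival per thread
__syncthreads();

produce(data, threadIdx.x);                     // producer step: compute & store result

auto token = bar.arrive();                      // signal "my result is ready" - does not block

do_independent_work();                          // pipelined processing between arrive & wait

bar.wait(std::move(token));                     // block only if some thread has not arrived yet

consume(data);                                  // consumer step: process results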
31
COPYING DATA INTO SHARED MEMORY

[Diagram: an SM with threads, registers, shared memory and L1 cache above HBM GPU memory.]

Two-step copy to shared memory via registers:

1 Thread loads data from GPU memory into registers

32
COPYING DATA INTO SHARED MEMORY

[Diagram: an SM with threads, registers, shared memory and L1 cache above HBM GPU memory.]

Two-step copy to shared memory via registers:

1 Thread loads data from GPU memory into registers

2 Thread stores data into shared memory

33
ASYNC MEMCOPY: DIRECT TRANSFER INTO SHARED MEMORY

[Diagram: a pre-A100 SM copying through registers, next to an A100 SM copying asynchronously and directly into shared memory.]

Two-step copy to shared memory via registers:

1 Thread loads data from GPU memory into registers

2 Thread stores data into shared memory

Asynchronous direct copy to shared memory (A100):

1 Direct transfer into shared memory, bypassing thread resources

34
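In CUDA C++ this hardware path is exposed through cuda::memcpy_async (and cooperative_groups::memcpy_async). A minimal sketch, simpler than the double-buffered version later in the deck; ELEM_SIZE, image and compute() are placeholders and ELEM_SIZE is assumed to be at least the block size:

// Requires #include <cuda/barrier>
__shared__ float buf[ELEM_SIZE];
__shared__ cuda::barrier<cuda::thread_scope_block> bar;
if (threadIdx.x == 0) init(&bar, blockDim.x);
__syncthreads();

// Each thread asks for its element to be copied directly into shared memory,
// bypassing registers; completion is reported to the barrier.
cuda::memcpy_async(&buf[threadIdx.x],
                   &image[blockIdx.x * blockDim.x + threadIdx.x],
                   sizeof(float), bar);

bar.arrive_and_wait();   // wait until all copies (and all threads) have arrived
compute(buf);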
SIMPLE DATA MOVEMENT

1 Load image element into registers

35
SIMPLE DATA MOVEMENT

1 Load image element into registers

2 Store image element into shared memory

36
SIMPLE DATA MOVEMENT

1 Load image element into registers

2 Store image element into shared memory

3 Compute using shared memory data

37
SIMPLE DATA MOVEMENT

1 Load image element into registers

2 Store image element into shared memory

3 Compute using shared memory data

4 Repeat for next element

38
SIMPLE DATA MOVEMENT

__shared__ float smem[ELEM_SIZE];

for( e = 0; e < NUM_ELEMS; e++ ) {
    // Load an image element into shared mem
    pixel = image[image_offset(e)];            // Step 1
    smem[shared_offset(threadIdx.x)] = pixel;  // Step 2
    __syncthreads();

    // Compute using this element
    result = compute(smem);                    // Step 3

    __syncthreads();                           // Sync & repeat (Step 4)
}

39
DOUBLE-BUFFERED DATA MOVEMENT

P1 Prefetch initial image element into registers

40
DOUBLE-BUFFERED DATA MOVEMENT

P1 Prefetch initial image element into registers

1 Prefetch next element into more registers

41
DOUBLE-BUFFERED DATA MOVEMENT

P1 Prefetch initial image element into registers

1 Prefetch next element into more registers

2 Store current element into shared memory

42
DOUBLE-BUFFERED DATA MOVEMENT

P1 Prefetch initial image element into registers

1 Prefetch next element into more registers

2 Store current element into shared memory

3 Compute using shared memory data

43
DOUBLE-BUFFERED DATA MOVEMENT

P1 Prefetch initial image element into registers

1 Prefetch next element into more registers

2 Store current element into shared memory

3 Compute using shared memory data

4 Repeat for next element

44
DOUBLE-BUFFERED DATA MOVEMENT

__shared__ float smem[ELEM_SIZE];
float pixel[2];

// Prefetch first element
pixel[0] = image[image_offset(0)];

#pragma unroll 2
for( e = 0; e < NUM_ELEMS; e++ ) {
    // Kick off load of next image element
    pixel[(e+1)&1] = image[image_offset(e+1)];

    // Write prefetched data into shared mem
    smem[shared_offset()] = pixel[e&1];
    __syncthreads();

    // Compute current element while the next one loads
    result = compute(smem);
    __syncthreads();
}

45
ASYNCHRONOUS DIRECT DATA MOVEMENT

P1 Async copy initial element into shared memory

46
ASYNCHRONOUS DIRECT DATA MOVEMENT

P1 Async copy initial element into shared memory

1 Async copy next element into shared memory

47
ASYNCHRONOUS DIRECT DATA MOVEMENT

P1 Async copy initial element into shared memory

1 Async copy next element into shared memory

2 Threads synchronize with current async copy

3 Compute using shared memory data

The async copy notifies an asynchronous barrier when it is done - the copy arrives, and threads can wait for it.

48
ASYNCHRONOUS DIRECT DATA MOVEMENT

P1 Async copy initial element into shared memory

1 Async copy next element into shared memory

2 Threads synchronize with current async copy

3 Compute using shared memory data

4 Repeat for next element

The async copy notifies an asynchronous barrier when it is done - the copy arrives, and threads can wait for it.

49
ASYNCHRONOUS DIRECT DATA MOVEMENT

// Requires #include <cuda/barrier>
__shared__ float smem[2][ELEM_SIZE];
__shared__ cuda::barrier<cuda::thread_scope_block> barrier[2];
if (threadIdx.x == 0) {               // one thread initializes both barriers
    init(&barrier[0], blockDim.x);
    init(&barrier[1], blockDim.x);
}
__syncthreads();
int buf_id = 0;

// Kick off initial copy - it will arrive on barrier[0]
cuda::memcpy_async(&smem[buf_id][shared_offset()],
                   &image[image_offset(0)], size,
                   barrier[buf_id]);

for( e = 1; e < NUM_ELEMS; e++ ) {
    // Start by issuing copy of next chunk
    cuda::memcpy_async(&smem[!buf_id][shared_offset()],
                       &image[image_offset(e)], size,
                       barrier[!buf_id]);

    // Sync on current chunk then compute
    barrier[buf_id].arrive_and_wait();
    result = compute(smem[buf_id]);
    buf_id = !buf_id; // Flip buffers
}

50
ASYNCHRONOUS COPY PIPELINES
Prefetch multiple images in a continuous stream

P1 Async copy multiple elements into shared memory

51
ASYNCHRONOUS COPY PIPELINES
Prefetch multiple images in a continuous stream

P1 P2 Async copy multiple elements into shared memory

52
ASYNCHRONOUS COPY PIPELINES
Prefetch multiple images in a continuous stream

P1 P2 P3 Async copy multiple elements into shared memory

53
ASYNCHRONOUS COPY PIPELINES
Prefetch multiple images in a continuous stream

P1 P2 P3 Async copy multiple elements into shared memory

1 Async copy next element into shared memory

54
ASYNCHRONOUS COPY PIPELINES
Prefetch multiple images in a continuous stream

P1 P2 P3 Async copy multiple elements into shared memory

1 Async copy next element into shared memory

2 Threads synchronize with oldest pipelined copy

3 Compute using shared memory data

4 Repeat for next element

55
ASYNCHRONOUS COPY PIPELINES
Prefetch multiple images in a continuous stream

Async Copy using Pipeline vs. Barrier:

Allows batching of multiple copy operations into a single transaction

Many in-flight transactions, completing in FIFO order

Fastest possible synchronization performance

For full details see the developer blog: https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/
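A minimal sketch (assumed, not from the deck) of the single-thread form of the pipeline primitive; the block-scope, cooperative-groups form appears later in the deck. BLOCK_SIZE, global_in, do_other_work() and use() are placeholders:

// Requires #include <cuda/pipeline>
// Per-thread pipeline: batch a copy into shared memory and overlap it with other work.
__shared__ float staged[BLOCK_SIZE];

cuda::pipeline<cuda::thread_scope_thread> pipe = cuda::make_pipeline();

pipe.producer_acquire();
cuda::memcpy_async(&staged[threadIdx.x],
                   &global_in[blockIdx.x * blockDim.x + threadIdx.x],
                   sizeof(float), pipe);
pipe.producer_commit();          // all copies issued since acquire form one batch

do_other_work();                 // independent work while the copy is in flight

pipe.consumer_wait();            // wait for the oldest committed batch
use(staged[threadIdx.x]);
pipe.consumer_release();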


56
ISO C++ == Language + Standard Library

57
ISO C++ == Language + Standard Library
CUDA C++ == Language + libcu++

58
libcu++ : THE CUDA C++ STANDARD LIBRARY

ISO C++ == Language + Standard Library


CUDA C++ == Language + libcu++

Strictly conforming to ISO C++, plus conforming extensions

Opt-in, Heterogeneous, Incremental

59
cuda::std::

Opt-in: does not interfere with or replace your host standard library

Heterogeneous: copyable/movable objects can migrate between host & device; host & device can call all member functions; host & device can concurrently use synchronization primitives*

Incremental: a subset of the standard library today; each release adds more functionality

*Synchronization primitives must be in managed memory and be declared with cuda::std::thread_scope_system
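A small sketch (assumed, not from the deck) of using a libcu++ atomic from device code; the counter object is presumed to live in device or managed memory allocated and zero-initialized by the host:

// Minimal sketch: a device-wide atomic counter from libcu++.
#include <cuda/atomic>

__global__ void count_matches(const int *data, int n, int target,
                              cuda::atomic<int, cuda::thread_scope_device> *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == target)
        count->fetch_add(1, cuda::std::memory_order_relaxed);
}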


60
libcu++ NAMESPACE HIERARCHY

// ISO C++, __host__ only


#include <atomic>
std::atomic<int> x;

// CUDA C++, __host__ __device__


// Strictly conforming to the C++ Standard
#include <cuda/std/atomic>
cuda::std::atomic<int> x;

// CUDA C++, __host__ __device__


// Conforming extensions to the C++ Standard
#include <cuda/atomic>
cuda::atomic<int, cuda::thread_scope_block> x;

61
cuda::std::
libcu++, the NVIDIA C++ Standard Library,
is the C++ Standard Library for your entire system

libcu++ is NVIDIA’s variant of LLVM’s libc++

It is opt-in, heterogeneous and incremental

Available on GitHub at: https://nvidia.github.io/libcudacxx/

Open source, licensed under the Apache License 2.0 with LLVM Exceptions

62
COOPERATIVE GROUPS
Cooperative Groups Features Work On All GPU Architectures (incl. Kepler)

Cooperative Groups updates:

No longer requires separate compilation

30% faster grid synchronization

New platform support (Windows and Linux + MPS)

Can now capture cooperative launches in a CUDA graph

cg::reduce also accepts a C++ lambda as the reduction operation

auto tile32 = cg::tiled_partition<32>(this_thread_block());
cg::memcpy_async(tile32, dst, dstCount, src, srcCount);

cg::reduce(tile32, dst[threadRank], [](int lhs, int rhs) {
    return lhs + rhs;
});

[Diagram: input data in global memory copied per-tile into shared memory, then reduced per tile to a result.]
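A complete minimal kernel sketch (assumed, not from the deck) of the tile/reduce pattern above; the kernel name and buffers are hypothetical, and n is assumed to be a multiple of the block size:

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

__global__ void tile_sum(const int *in, int *out)
{
    cg::thread_block block = cg::this_thread_block();
    auto tile32 = cg::tiled_partition<32>(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = in[i];

    // Every thread in the 32-thread tile receives the tile-wide sum
    int sum = cg::reduce(tile32, v, cg::plus<int>());

    if (tile32.thread_rank() == 0)
        out[i / 32] = sum;   // one result per tile
}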
63
COOPERATIVE GROUPS MAPS NATURALLY TO PIPELINES

constexpr int N = 4; // 4-stage pipeline
__shared__ float smem[N][NUM_ELEMS];
__shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, N> ps;
auto group = cooperative_groups::this_thread_block();
auto pipe = cuda::make_pipeline(group, &ps);

for ( e = f = 0 ; e < NUM_ELEMS ; e++ ) {
    // Fetch-ahead into empty buffers
    for ( ; f < NUM_ELEMS && f < (e+N) ; f++ ) {
        // Begin pipeline push
        pipe.producer_acquire();
        cuda::memcpy_async(group, &smem[f % N],
                           &global1[f * group.size()],
                           sizeof(float) * group.size(), pipe);
        pipe.producer_commit();
        // End pipeline push
    }

    // Pop the oldest copy off the pipeline
    pipe.consumer_wait();    // Begin pipeline pop
    compute(smem[e % N]);
    pipe.consumer_release(); // End pipeline pop
}
64
MULTI-WARP COOPERATIVE GROUPS
Experimental in CUDA 11.1

Original group of 8 warps = 256 threads

2 groups of 128 threads

1 group of 128 threads + 2 groups of 64 threads

4 groups of 64 threads

Create groups with any power-of-2 number of threads, even spanning multiple warps
Sync and perform collective operations within these groups
65
MULTI-WARP COOPERATIVE GROUPS

Caller provides memory for barriers & collectives (in cg::experimental)

cg::experimental::block_tile_memory declares scratch memory for use in communication between tiles, based on the number of threads in your block.

cg::this_thread_block(...& scratch) takes in the scratch memory and associates it with the resulting group handle.

Scratch memory can be shared or global memory (shared is higher performance)

#define _CG_ABI_EXPERIMENTAL
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

__shared__ cg::experimental::block_tile_memory<sizeof(float), BlockSize> scratch;

cg::experimental::thread_block cta = cg::experimental::this_thread_block(scratch);
auto tile = cg::experimental::tiled_partition<128>(cta);

cg::reduce(tile, dst[threadRank], [](int lhs, int rhs) {
    return lhs + rhs;
});

66
FLOATING POINT FORMATS & PRECISION

            sign     exponent    mantissa
double      1-bit    11-bit      52-bit
float       1-bit    8-bit       23-bit
half        1-bit    5-bit       10-bit
bfloat16    1-bit    8-bit       7-bit
TF32        1-bit    8-bit       10-bit

The exponent sets the numerical range; the mantissa sets the numerical precision.

value = (-1)^sign x 2^exponent x (1 + mantissa)

67
NEW FLOATING POINT FORMATS: BF16 & TF32
Both Match fp32's 8-bit Exponent: Covers The Same Range of Values

bfloat16 (8-bit exponent, 7-bit mantissa, 16-bit storage size):
Available in CUDA C++ as the __nv_bfloat16 numerical type - #include <cuda_bf16.h>
Full CUDA C++ numerical type, usable in both host & device code and in templated functions*

TF32 (8-bit exponent, 10-bit mantissa, 32-bit storage size):
Tensor Core math mode for single-precision training
Not a numerical type - Tensor Core inputs are rounded to TF32
CUDA C++ programs use float (fp32) throughout

*(similar to CUDA's IEEE-FP16 "half" type)
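A minimal sketch (assumed, not from the deck, and assuming an Ampere-class GPU) of round-tripping through bfloat16 on the device; the kernel name and the scale-by-a example are illustrative:

#include <cuda_bf16.h>

__global__ void scale_via_bf16(const float *in, float *out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        __nv_bfloat16 x = __float2bfloat16(in[i]);  // round fp32 -> bf16 (7-bit mantissa)
        out[i] = a * __bfloat162float(x);           // widen back to fp32 and scale
    }
}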

68
TENSOR FLOAT 32 - TENSOR CORE MODE
A100 Tensor Core Input Precision
All Internal Operations Maintain Full FP32 Precision

FP32 inputs are converted to TF32; products are computed at full precision and summed with an FP32 accumulator to give FP32 output.

TF32 MMA Dimensions: m,n,k = 16x8x8
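Kernels can also reach this mode directly through the WMMA API, where the TF32 fragment shape is m16n16k8 (the 16x8x8 shape above is the underlying hardware MMA instruction). A hedged single-warp sketch, assuming an sm_80 device and 16x8 / 8x16 row-major inputs:

#include <mma.h>
using namespace nvcuda;

// Launch with one warp: computes a 16x16 tile C = A (16x8) * B (8x16) with TF32 Tensor Cores
__global__ void wmma_tf32_tile(const float *A, const float *B, float *C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 8);    // lda = 8  (row-major 16x8)
    wmma::load_matrix_sync(b_frag, B, 16);   // ldb = 16 (row-major 8x16)

    // Round the fp32 inputs to TF32 before the MMA
    for (int i = 0; i < a_frag.num_elements; i++) a_frag.x[i] = wmma::__float_to_tf32(a_frag.x[i]);
    for (int i = 0; i < b_frag.num_elements; i++) b_frag.x[i] = wmma::__float_to_tf32(b_frag.x[i]);

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);      // full FP32 accumulation
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}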

69
A100 INTRODUCES DOUBLE PRECISION TENSOR CORES
All A100 Tensor Core Internal Operations Maintain Full FP64 Precision

DMMA Dimensions: m,n,k = 8x8x4

70
A100 GPU ACCELERATED MATH LIBRARIES IN CUDA 11.0

cuBLAS: BF16, TF32 and FP64 Tensor Cores
cuSPARSE: increased memory BW, shared memory & L2
cuTENSOR: BF16, TF32 and FP64 Tensor Cores
cuSOLVER: BF16, TF32 and FP64 Tensor Cores
nvJPEG: hardware decoder
cuFFT: increased memory BW, shared memory & L2
CUDA Math API: BF16, TF32 and FP64
CUTLASS: BF16 & TF32 support
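A hedged sketch (not from the deck) of enabling TF32 Tensor Cores for an ordinary FP32 GEMM through cuBLAS; the handle and device pointers (dA, dB, dC) are assumed to be created and allocated elsewhere:

#include <cublas_v2.h>

void sgemm_tf32(cublasHandle_t handle, int m, int n, int k,
                const float *dA, const float *dB, float *dC)
{
    const float alpha = 1.0f, beta = 0.0f;

    // Opt in to TF32 Tensor Core math for FP32 routines (CUDA 11+)
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    // Plain FP32 GEMM call - inputs are rounded to TF32 inside the Tensor Cores
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);
}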

71
cuBLAS
Eliminating Alignment Requirements To Activate Tensor Cores for MMA

[Chart: GEMM performance, CUDA 10.2 - Align 8]

AlignN means alignment to 16-bit multiples of N. For example, align8 problems are aligned to 128 bits or 16 bytes.
72
cuBLAS
Eliminating Alignment Requirements To Activate Tensor Cores for MMA

[Chart: CUDA 10.2 - Align 8 vs. CUDA 10.2 - Align 1 / Align 2]

AlignN means alignment to 16-bit multiples of N. For example, align8 problems are aligned to 128 bits or 16 bytes.
73
cuBLAS
Eliminating Alignment Requirements To Activate Tensor Cores for MMA

[Chart: CUDA 10.2 - Align 8, CUDA 11.0 - Align 8, CUDA 10.2 - Align 1 / Align 2]

AlignN means alignment to 16-bit multiples of N. For example, align8 problems are aligned to 128 bits or 16 bytes.
74
cuBLAS
Eliminating Alignment Requirements To Activate Tensor Cores for MMA

[Chart: CUDA 10.2 - Align 8, CUDA 11.0 - Align 8, CUDA 11.0 - Align 2, CUDA 11.0 - Align 1, CUDA 10.2 - Align 1 / Align 2]

AlignN means alignment to 16-bit multiples of N. For example, align8 problems are aligned to 128 bits or 16 bytes.
75
MATH LIBRARY DEVICE EXTENSIONS
Introducing cuFFTDx: Device Extension

Available in Math Library EA Program


Device callable library

Retain and reuse on-chip data

Inline FFTs in user kernels

Combine multiple FFT operations

https://developer.nvidia.com/CUDAMathLibraryEA

76
GPU PROGRAMMING IN 2020 AND BEYOND
Math Libraries | Standard Languages | Directives | CUDA

Standard Languages - GPU accelerated ISO C++ and Fortran:

std::transform(par, x, x+n, y, y,
    [=](float x, float y) {
        return y + a*x;
    });

do concurrent (i = 1:n)
    y(i) = y(i) + a*x(i)
enddo

Directives - incremental performance optimization (OpenACC):

#pragma acc data copy(x,y)
{
    ...
    std::transform(par, x, x+n, y, y,
        [=](float x, float y) {
            return y + a*x;
        });
    ...
}

CUDA - maximize GPU performance with CUDA C++/Fortran:

__global__
void saxpy(int n, float a,
           float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] += a*x[i];
}

int main(void) {
    cudaMallocManaged(&x, ...);
    cudaMallocManaged(&y, ...);
    ...
    saxpy<<<(N+255)/256,256>>>(..., x, y);
    cudaDeviceSynchronize();
    ...
}

GPU Accelerated Math Libraries

77
DOWNLOAD CUDA 11.1 TODAY
https://developer.nvidia.com/cuda-downloads

78
