Advanced GPU Topics #1
Jeremy Appleyard, September 2015
In this talk
• Profiling GPU applications
• Optimizing Data Movement
• Using MPI with GPUs
Ask questions at any point!
Profiling GPU Applications
Profiling Tools
Many options!

From NVIDIA:
• nvprof
• NVIDIA Visual Profiler
• Standalone (nvvp)
• Integrated into Nsight Eclipse Edition (nsight)
• Nsight Visual Studio Edition

Third Party:
• TAU Performance System
• VampirTrace
• PAPI CUDA component
• HPC Toolkit
This talk
We will focus on nvprof and nvvp
nvprof => NVIDIA profiler (command line)
nvvp => NVIDIA Visual Profiler (GUI based)
nvprof
Simple usage
• nvprof ./<executable>
• Example: vector addition from yesterday's talk

attributes(global) subroutine vecAdd_GPU(c, a, b, n)
  INTEGER, value :: n
  REAL, device, intent(in) :: a(n), b(n)
  REAL, device, intent(out) :: c(n)
  INTEGER :: i

  i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  if (i <= n) then
    c(i) = a(i) + b(i)
  end if
end subroutine vecAdd_GPU
nvprof ./vecAdd
==34092== NVPROF is profiling process 34092, command: ./vecAdd
==34092== Profiling application: ./vecAdd
==34092== Profiling result:
Time(%) Time Calls Avg Min Max Name
69.07% 3.8547ms 2 1.9274ms 1.9093ms 1.9454ms [CUDA memcpy HtoD]
29.04% 1.6205ms 1 1.6205ms 1.6205ms 1.6205ms [CUDA memcpy DtoH]
1.19% 66.178us 1 66.178us 66.178us 66.178us kernel_vecadd_gpu_
0.71% 39.650us 2 19.825us 19.617us 20.033us __pgi_dev_cumemset_4
==34092== API calls:
Time(%) Time Calls Avg Min Max Name
95.29% 321.93ms 3 107.31ms 215.28us 321.48ms cudaMalloc
2.70% 9.1355ms 3 3.0452ms 2.1550ms 3.5234ms cudaMemcpy
1.47% 4.9710ms 498 9.9810us 177ns 498.93us cuDeviceGetAttribute
0.22% 758.16us 3 252.72us 234.62us 284.11us cudaFree
0.15% 519.49us 6 86.581us 82.801us 89.206us cuDeviceTotalMem
0.13% 449.63us 6 74.938us 70.892us 81.049us cuDeviceGetName
0.02% 69.908us 3 23.302us 11.485us 43.331us cudaLaunch
0.00% 12.300us 10 1.2300us 179ns 8.0460us cudaSetupArgument
0.00% 3.4430us 12 286ns 193ns 716ns cuDeviceGet
0.00% 2.8280us 2 1.4140us 285ns 2.5430us cuDeviceGetCount
0.00% 2.5750us 3 858ns 448ns 1.6120us cudaConfigureCall
nvprof ./vecAdd
==34092== NVPROF is profiling process 34092, command: ./vecAdd
==34092== Profiling application: ./vecAdd
==34092== Profiling result:
Time(%) Time Calls Avg Min Max Name
69.07% 3.8547ms 2 1.9274ms 1.9093ms 1.9454ms [CUDA memcpy HtoD]
29.04% 1.6205ms 1 1.6205ms 1.6205ms 1.6205ms [CUDA memcpy DtoH]
1.19% 66.178us 1 66.178us 66.178us 66.178us kernel_vecadd_gpu_
0.71% 39.650us 2 19.825us 19.617us 20.033us __pgi_dev_cumemset_4
• Top half of the profile is runtime measured from the GPU perspective
• What can we see?
• [CUDA memcpy HtoD] and [CUDA memcpy DtoH] are memory copies to and from the GPU
• The vector addition kernel itself is only 1.19% of GPU time!
• This is valuable information!
nvprof ./vecAdd
==34092== NVPROF is profiling process 34092, command: ./vecAdd
==34092== Profiling application: ./vecAdd
==34092== Profiling result:
Time(%) Time Calls Avg Min Max Name
69.07% 3.8547ms 2 1.9274ms 1.9093ms 1.9454ms [CUDA memcpy HtoD]
29.04% 1.6205ms 1 1.6205ms 1.6205ms 1.6205ms [CUDA memcpy DtoH]
1.19% 66.178us 1 66.178us 66.178us 66.178us kernel_vecadd_gpu_
0.71% 39.650us 2 19.825us 19.617us 20.033us __pgi_dev_cumemset_4
• (PGI) OpenACC kernels will be named after the subroutine name and line number
• e.g. vecadd_17_gpu
• Subroutine vecadd
• Line 17 in the file
• Compiled for the GPU
• Bottom half of the profile is runtime measured from the CPU perspective
• What can we see?
• The first allocation is expensive (CUDA initialisation)
==34092== API calls:
Time(%) Time Calls Avg Min Max Name
95.29% 321.93ms 3 107.31ms 215.28us 321.48ms cudaMalloc
2.70% 9.1355ms 3 3.0452ms 2.1550ms 3.5234ms cudaMemcpy
1.47% 4.9710ms 498 9.9810us 177ns 498.93us cuDeviceGetAttribute
0.22% 758.16us 3 252.72us 234.62us 284.11us cudaFree
0.15% 519.49us 6 86.581us 82.801us 89.206us cuDeviceTotalMem
0.13% 449.63us 6 74.938us 70.892us 81.049us cuDeviceGetName
0.02% 69.908us 3 23.302us 11.485us 43.331us cudaLaunch
0.00% 12.300us 10 1.2300us 179ns 8.0460us cudaSetupArgument
0.00% 3.4430us 12 286ns 193ns 716ns cuDeviceGet
0.00% 2.8280us 2 1.4140us 285ns 2.5430us cuDeviceGetCount
0.00% 2.5750us 3 858ns 448ns 1.6120us cudaConfigureCall
nvprof
More advanced options
• nvprof -h
• There are quite a few options!
• Some useful ones:
• -o: creates an output file which can be imported into nvvp
• -m and -e: collect metrics or events
• --analysis-metrics: collect all metrics for import into nvvp
• --query-metrics and --query-events: query which metrics/events are available
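For example, one common workflow (the output file names here are illustrative):

$ nvprof -o timeline.prof ./vecAdd
$ nvprof --analysis-metrics -o analysis.prof ./vecAdd

Both files can then be imported into nvvp.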
nvprof
Events and Metrics
Most are quite in-depth, but a few are useful for quick analysis.
In general, events are for expert use and are rarely useful.
(A few) useful metrics:
dram_read_throughput: Main GPU memory read throughput
dram_write_throughput: Main GPU memory write throughput
flop_count_sp: Number of single precision floating point operations
flop_count_dp : Number of double precision floating point operations
nvprof
Metrics Example
nvprof --metrics dram_read_throughput,dram_write_throughput,flop_count_sp,flop_count_dp ./vecAdd
==32928== NVPROF is profiling process 32928, command: ./vecAdd
==32928== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==32928== Profiling application: ./vecAdd
==32928== Profiling result:
==32928== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K40m (0)"
Kernel: kernel_vecadd_gpu_
1 dram_read_throughput Device Memory Read Throughput 136.55GB/s 136.55GB/s 136.55GB/s
1 dram_write_throughput Device Memory Write Throughput 70.062GB/s 70.062GB/s 70.062GB/s
1 flop_count_sp Floating Point Operations(Single Precisi 1000000 1000000 1000000
1 flop_count_dp Floating Point Operations(Double Precisi 0 0 0
nvprof
Metrics Example

dram_read_throughput   136.55GB/s
dram_write_throughput  70.062GB/s
flop_count_sp          1000000
flop_count_dp          0

The problem was addition of 1,000,000 element single precision vectors, so a single precision flop count of 1,000,000 is expected!
That corresponds to ~50 GFLOP/s: about 1% of peak FLOPs.
Measured dram throughput is ~206 GB/s (136.55 + 70.06 GB/s combined).
Peak for this machine is 288 GB/s, so the kernel achieves 72% of peak bandwidth.
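As a rough cross-check: the kernel reads 8 bytes and writes 4 bytes per element, so it moves about 12 MB in total; at ~206 GB/s that is ~58 us, in line with the measured kernel time of 66.178 us.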
Visual Profiler
nvvp
• Either import the output of nvprof -o ...
• Need to select metrics in advance
• Or run the application through the GUI
• Re-run on-the-fly to compute new metrics based on requirements
Vecadd - timeline
NEMO - timeline
NEMO - details
Back to vecadd
More details!
• If we collected our profiling information with --analysis-metrics, or if we're running the application through nvvp directly, more information is available. This is what we want!
• --analysis-metrics may take some time to run
• Run it separately after generating timeline info
• Matches our earlier calculations of ~72% peak bandwidth
Compute, Bandwidth or Latency Bound
Other analysis options are available
The tool will guide you
It becomes more useful as you gain experience
Profiling tips
The nvprof output is a very good place to start
The timeline is a good place to go next
Only dig deep into a kernel if it’s taking a significant amount of your time
Where possible, try to match profiler output with theory
For example, if I know I’m moving 1GB, and my kernel takes 10ms, I expect the
profiler to report 100GB/s.
Discrepancies are likely to mean your application isn’t doing what you thought it was
Optimising Data Movement
Heterogeneous Computing
[Diagram: CPU and GPU connected by the PCI Bus]
Data movement
What are the costs?
Two costs:
Bandwidth
Latency
Large copies will be bandwidth bound. Peak PCI-E bandwidth is 16 GB/s.
Small copies will be latency bound, so many small copies are expensive. A rough cost model follows.
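A minimal cost model (the latency figure below is illustrative, not measured):

transfer time ≈ latency + size / bandwidth

With ~10 us of fixed overhead and 16 GB/s of bandwidth, a 16 MB copy takes ~1010 us (almost entirely bandwidth), while a 4 KB copy takes ~10.25 us (almost entirely latency).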
Data movement
Several optimisation options
Several ways to optimise data movement:
1. Do less of it!
2. Fuse small copies into larger ones
3. Use pinned buffers
4. Overlap data movement with execution
Remember: profile before optimising!
Optimising Data Movement
1. Do less of it!
• Consider moving data movement outside of the main loop (see the sketch below)
• For example, an iterative solver would not want data movement within the iteration loop
• Often, during the process of making an application run on a GPU, these additional copies are unavoidable
• If doing intermediate performance analysis, bear in mind these copies may disappear in the final application
• Consider whether data can be passed as an argument (CUDA), or declared as private (OpenACC)
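A minimal CUDA C sketch of the idea; the solver_step kernel and the h_x/d_x buffers are hypothetical:

// Copy once before the loop, iterate entirely on the device,
// copy back once afterwards - no per-iteration transfers.
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
for (int it = 0; it < iters; ++it)
    solver_step<<<grid, block>>>(d_x, n);
cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);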
Optimising Data Movement
2. Fuse small copies into larger ones
• Small data transfers are much less efficient
• If the opportunity arises, merging copies can be beneficial (sketched below)
• Can be difficult to accomplish in some applications
• Additional cost due to packing/unpacking may outweigh the savings
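A sketch of the packing pattern in CUDA C (the three-array layout and names are hypothetical):

#include <string.h>
#include <cuda_runtime.h>

// Pack three small arrays into one staging buffer, then issue a single
// large transfer instead of three small ones.
void fused_upload(const float *a, const float *b, const float *c,
                  float *staging, float *d_staging, size_t n)
{
    memcpy(staging,         a, n * sizeof(float));
    memcpy(staging + n,     b, n * sizeof(float));
    memcpy(staging + 2 * n, c, n * sizeof(float));
    cudaMemcpy(d_staging, staging, 3 * n * sizeof(float),
               cudaMemcpyHostToDevice);
}

The host-side memcpy calls are exactly the packing cost the last bullet warns about.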
Optimising Data Movement
3. Use pinned buffers
• Page-locked (or pinned) allocations generally have higher bandwidth
• Can harm system performance if overused
• In CUDA:
• cudaHostAlloc() - allocate page-locked memory
• cudaHostRegister() - convert a normal allocation to page-locked (both shown below)
• Pinned attribute in CUDA Fortran
• In OpenACC:
• (Hopefully) done automatically!
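A short CUDA C sketch of both approaches (the buffer size is arbitrary):

#include <stdlib.h>
#include <cuda_runtime.h>

void pinned_examples(size_t bytes)
{
    // Option 1: allocate page-locked memory directly.
    float *h_buf;
    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault);
    cudaFreeHost(h_buf);

    // Option 2: pin an existing heap allocation in place.
    float *h_heap = (float *)malloc(bytes);
    cudaHostRegister(h_heap, bytes, cudaHostRegisterDefault);
    cudaHostUnregister(h_heap);
    free(h_heap);
}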
Optimising Data Movement
4. Overlap data movement with execution
• With pinned memory, PCI-E transfers can occur at the same time as kernel execution
• In CUDA:
• cudaMemcpyAsync (see the sketch below)
• In OpenACC:
• async clause on parallel or kernels directives
• This can greatly reduce the cost of memory copies - almost to nothing in the right case
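A minimal two-stream pipeline sketch in CUDA C; my_kernel, the buffers and the chunking are hypothetical, and h_in/h_out must be pinned:

// Work on the data in chunks; copies for one chunk can overlap the
// kernel for another because they are queued on different streams.
cudaStream_t s[2];
cudaStreamCreate(&s[0]);
cudaStreamCreate(&s[1]);
for (int i = 0; i < nchunks; ++i) {
    cudaStream_t st = s[i % 2];
    size_t off = (size_t)i * chunk;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, st);
    my_kernel<<<grid, block, 0, st>>>(d_in + off, d_out + off, (int)chunk);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, st);
}
cudaDeviceSynchronize();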
Optimising Data Movement
Summary
• There are good (and bad) ways of moving data between GPU and CPU
• For many applications this is not important
• Data resides on the GPU for the application’s lifetime
• Next-gen GPU Pascal will bring hardware improvements - NVLINK
NVLINK
Faster communications between processors
NVLINK
• High speed interconnect
• Alongside/replacing PCI-E
• 80-200 GB/s
• Improved energy efficiency
• Improved flexibility
NVLink Unleashes Multi-GPU Performance
[Chart: speedup versus a PCIe-based server (1.00x-2.25x) when next-gen GPUs connect via NVLink instead of PCIe; over 2x application performance speedup across ANSYS Fluent, Multi-GPU Sort, LQCD QUDA, AMBER and 3D FFT. GPUs interconnected with NVLink: 5x faster than PCIe Gen3 x16]
Using MPI with GPUs
MPI+CUDA
[Diagram: Node 0 through Node n-1; in each node a GPU with GDDR5 memory and a CPU with system memory sit on PCI-e, and a network card links the nodes together]
MPI+CUDA
//MPI rank 0
MPI_Send(s_buf_d, size, MPI_CHAR, n-1, tag, MPI_COMM_WORLD);
//MPI rank n-1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
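For context, a minimal complete program around those two calls might look like this (a sketch assuming a CUDA-aware MPI, so device pointers can be passed directly; the buffer size is arbitrary):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n, tag = 0, size = 1 << 20;
    char *buf_d;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    cudaMalloc((void **)&buf_d, size);  // device buffer handed straight to MPI

    if (rank == 0)
        MPI_Send(buf_d, size, MPI_CHAR, n - 1, tag, MPI_COMM_WORLD);
    else if (rank == n - 1)
        MPI_Recv(buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(buf_d);
    MPI_Finalize();
    return 0;
}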
Message Passing Interface
MPI
• Standard to exchange data between processes via messages
• Defines an API to exchange messages
• Point-to-point: e.g. MPI_Send, MPI_Recv
• Collectives, e.g. MPI_Reduce
• Multiple implementations (open source and commercial)
• Bindings for C/C++, Fortran, Python, …
• E.g. MPICH, Open MPI, MVAPICH, IBM Platform MPI, Cray MPT, …
MPI
Compiling and launching
$ mpicc -o myapp myapp.c
$ mpirun -np 4 ./myapp <args>
[Diagram: four myapp processes launched across the machine]
Launch MPI + CUDA/OpenACC programs
Launch one process per GPU
MVAPICH: MV2_USE_CUDA
$ MV2_USE_CUDA=1 mpirun -np ${np} ./myapp <args>
Open MPI: CUDA-aware features are enabled by default
Cray: MPICH_RDMA_ENABLED_CUDA
IBM Platform MPI: PMPI_GPU_AWARE
Unified Virtual Addressing
• One address space for all CPU and GPU memory
• Determine the physical memory location from a pointer value (see the sketch below)
• Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
• Support:
• 64-bit applications on Linux
• Windows using TCC mode
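A sketch of the pointer query (using the CUDA 7.x-era runtime API; later CUDA versions renamed the memoryType field):

#include <stdio.h>
#include <cuda_runtime.h>

// Ask the runtime where a UVA pointer lives - this is how a CUDA-aware
// MPI can accept host and device pointers through one interface.
void where_is(const void *ptr)
{
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) == cudaSuccess)
        printf("%s memory\n",
               attr.memoryType == cudaMemoryTypeDevice ? "device" : "host");
}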
Unified Virtual Addressing
[Diagram: without UVA, system memory and GPU memory are separate address spaces, each spanning 0x0000-0xFFFF; with UVA, CPU and GPU memory share a single address space across PCI-e]
MPI + CUDA

With UVA and CUDA-aware MPI:
!MPI rank 0
call MPI_Send(s_buf_d,size,…)
!MPI rank n-1
call MPI_Recv(r_buf_d,size,…)

No UVA and regular MPI:
!MPI rank 0
s_buf_h = s_buf_d
call MPI_Send(s_buf_h,size,…)
!MPI rank n-1
call MPI_Recv(r_buf_h,size,…)
r_buf_d = r_buf_h
MPI + OpenACC

With UVA and CUDA-aware MPI:
!$acc host_data use_device(s_buf, r_buf)
!MPI rank 0
call MPI_Send(s_buf,size,…)
!MPI rank n-1
call MPI_Recv(r_buf,size,…)
!$acc end host_data

No UVA and regular MPI:
!$acc update host(s_buf)
!MPI rank 0
call MPI_Send(s_buf,size,…)
!MPI rank n-1
call MPI_Recv(r_buf,size,…)
!$acc update device(r_buf)
NVIDIA GPUDirect™
Peer to Peer transfers
[Diagram: two GPUs, each with its own memory, share a PCI-e tree with the CPU, system memory, chipset and an InfiniBand card; peer-to-peer transfers move data directly between GPU1 and GPU2 memory over PCI-e, without staging through system memory]
NVIDIA GPUDirect™
Support for RDMA
[Diagram: the same topology, but here the InfiniBand card reads and writes GPU memory directly over PCI-e, so network transfers bypass system memory entirely]
Performance Results, two Nodes
OpenMPI 1.8.4, MLNX FDR IB (4X), Tesla K40 @ 875MHz
[Chart: bandwidth (MB/s, 0-7000) versus message size for regular MPI, GPU-aware MPI, and GPU-aware MPI with GPUDirect RDMA]
Latency (1 byte): regular MPI 19.79 us, GPU-aware MPI 17.97 us, GPU-aware MPI with GPUDirect RDMA 5.70 us
In this talk
• Profiling GPU applications
• Optimizing Data Movement
• Using MPI with GPUs
Any questions?
Bonus slides - Pascal!
GPU Roadmap
[Chart: SGEMM / W (normalized, 0-20) versus year, 2008-2016, across GPU generations]
• Tesla: CUDA
• Fermi: FP64
• Kepler: Dynamic Parallelism
• Maxwell: DX12
• Pascal: Unified Memory, 3D Memory, NVLink
Pascal
• Faster and larger global memory
• Faster communication between processors
• More powerful unified memory
• Mixed precision computing
Stacked Memory
High performance global memory
3D Stacked Memory
• 4x Higher Bandwidth (~1 TB/s)
• 3x Larger Capacity (up to 32GB)
• 4x More Energy Efficient per bit
NVLINK
Faster communications between processors
NVLINK
• High speed interconnect
• Alongside/replacing PCI-E
• 80-200 GB/s
• Improved energy efficiency
• Improved flexibility
NVLink: High-Speed GPU Interconnect
[Diagram: in 2014, Kepler GPUs attach to x86, ARM64 or POWER CPUs over PCIe; in 2016, Pascal GPUs connect to each other and to POWER CPUs with NVLink, and to x86 or ARM64 CPUs over PCIe]
Unified Memory: Simpler & Faster with NVLink
[Diagram: three developer views. Traditional: separate system memory and GPU memory. With Unified Memory: one unified memory. With Pascal & NVLink: unified memory linked by NVLink at 80 GB/s]
• Share data structures at CPU memory speeds, not PCIe speeds
• Oversubscribe GPU memory
Mixed precision computing
IEEE 16-bit float support
• Halving precision results in:
• Half the memory footprint
• Half the bandwidth required
• Double the computational throughput
• Obviously comes at an accuracy penalty
• Application dependent
• 16/32/64 bit all supported
• Supported in software now
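As an illustration of FP16 storage with FP32 arithmetic (a sketch using the CUDA 7.5-era half-precision intrinsics; the kernel and names are hypothetical):

#include <cuda_fp16.h>

// Scale a vector stored in FP16: half the footprint and bandwidth of FP32.
// Arithmetic is widened to FP32, so only the storage precision is reduced.
__global__ void scale_fp16(__half *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __half2float(x[i]);  // 16-bit load, widen to 32-bit
        x[i] = __float2half(v * s);    // compute in FP32, store as FP16
    }
}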