Advanced GPU Topics #1
Jeremy Appleyard, September 2015
In this talk
• Profiling GPU applications
• Optimizing Data Movement
• Using MPI with GPUs
Ask questions at any point!
Profiling GPU Applications
Profiling Tools
Many options!

From NVIDIA:
• nvprof
• NVIDIA Visual Profiler
• Standalone (nvvp)
• Integrated into Nsight Eclipse Edition (nsight)
• Nsight Visual Studio Edition

Third Party:
• TAU Performance System
• VampirTrace
• PAPI CUDA component
• HPC Toolkit
This talk
We will focus on nvprof and nvvp
nvprof => NVIDIA profiler (command line)
nvvp => NVIDIA Visual Profiler (GUI based)
nvprof
Simple usage
• nvprof ./<executable>
• Example: vector addition from yesterday's talk

attributes(global) subroutine vecAdd_GPU(c, a, b, n)
  INTEGER, value :: n
  REAL, device, intent(in) :: a(n), b(n)
  REAL, device, intent(out) :: c(n)
  INTEGER :: i

  i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  if (i <= n) then
    c(i) = a(i) + b(i)
  end if
end subroutine vecAdd_GPU
nvprof ./vecAdd
==34092== NVPROF is profiling process 34092, command: ./vecAdd
==34092== Profiling application: ./vecAdd
==34092== Profiling result:
Time(%) Time Calls Avg Min Max Name
69.07% 3.8547ms 2 1.9274ms 1.9093ms 1.9454ms [CUDA memcpy HtoD]
29.04% 1.6205ms 1 1.6205ms 1.6205ms 1.6205ms [CUDA memcpy DtoH]
1.19% 66.178us 1 66.178us 66.178us 66.178us kernel_vecadd_gpu_
0.71% 39.650us 2 19.825us 19.617us 20.033us __pgi_dev_cumemset_4
==34092== API calls:
Time(%) Time Calls Avg Min Max Name
95.29% 321.93ms 3 107.31ms 215.28us 321.48ms cudaMalloc
2.70% 9.1355ms 3 3.0452ms 2.1550ms 3.5234ms cudaMemcpy
1.47% 4.9710ms 498 9.9810us 177ns 498.93us cuDeviceGetAttribute
0.22% 758.16us 3 252.72us 234.62us 284.11us cudaFree
0.15% 519.49us 6 86.581us 82.801us 89.206us cuDeviceTotalMem
0.13% 449.63us 6 74.938us 70.892us 81.049us cuDeviceGetName
0.02% 69.908us 3 23.302us 11.485us 43.331us cudaLaunch
0.00% 12.300us 10 1.2300us 179ns 8.0460us cudaSetupArgument
0.00% 3.4430us 12 286ns 193ns 716ns cuDeviceGet
0.00% 2.8280us 2 1.4140us 285ns 2.5430us cuDeviceGetCount
0.00% 2.5750us 3 858ns 448ns 1.6120us cudaConfigureCall
nvprof ./vecAdd
==34092== NVPROF is profiling process 34092, command: ./vecAdd
==34092== Profiling application: ./vecAdd
==34092== Profiling result:
Time(%) Time Calls Avg Min Max Name
69.07% 3.8547ms 2 1.9274ms 1.9093ms 1.9454ms [CUDA memcpy HtoD]
29.04% 1.6205ms 1 1.6205ms 1.6205ms 1.6205ms [CUDA memcpy DtoH]
1.19% 66.178us 1 66.178us 66.178us 66.178us kernel_vecadd_gpu_
0.71% 39.650us 2 19.825us 19.617us 20.033us __pgi_dev_cumemset_4
• Top half of the profile is runtime measured from the GPU perspective
• What can we see?
• [CUDA memcpy HtoD] and [CUDA memcpy DtoH] are memory copies to and from the GPU
• The vector addition kernel itself is only 1.19% of GPU time!
• This is valuable information!
nvprof ./vecAdd
==34092== NVPROF is profiling process 34092, command: ./vecAdd
==34092== Profiling application: ./vecAdd
==34092== Profiling result:
Time(%) Time Calls Avg Min Max Name
69.07% 3.8547ms 2 1.9274ms 1.9093ms 1.9454ms [CUDA memcpy HtoD]
29.04% 1.6205ms 1 1.6205ms 1.6205ms 1.6205ms [CUDA memcpy DtoH]
1.19% 66.178us 1 66.178us 66.178us 66.178us kernel_vecadd_gpu_
0.71% 39.650us 2 19.825us 19.617us 20.033us __pgi_dev_cumemset_4
• (PGI) OpenACC kernels will be named after the subroutine name and line number
• e.g. vecadd_17_gpu
• Subroutine vecadd
• Line 17 in the file
• Compiled for the GPU
• Bottom half of the profile is runtime measured from the CPU perspective
• What can we see?
• The first allocation is expensive (CUDA initialisation)
==34092== API calls:
Time(%) Time Calls Avg Min Max Name
95.29% 321.93ms 3 107.31ms 215.28us 321.48ms cudaMalloc
2.70% 9.1355ms 3 3.0452ms 2.1550ms 3.5234ms cudaMemcpy
1.47% 4.9710ms 498 9.9810us 177ns 498.93us cuDeviceGetAttribute
0.22% 758.16us 3 252.72us 234.62us 284.11us cudaFree
0.15% 519.49us 6 86.581us 82.801us 89.206us cuDeviceTotalMem
0.13% 449.63us 6 74.938us 70.892us 81.049us cuDeviceGetName
0.02% 69.908us 3 23.302us 11.485us 43.331us cudaLaunch
0.00% 12.300us 10 1.2300us 179ns 8.0460us cudaSetupArgument
0.00% 3.4430us 12 286ns 193ns 716ns cuDeviceGet
0.00% 2.8280us 2 1.4140us 285ns 2.5430us cuDeviceGetCount
0.00% 2.5750us 3 858ns 448ns 1.6120us cudaConfigureCall
nvprof
More advanced options
• nvprof -h
• There are quite a few options!
• Some useful ones:
• -o: creates an output file which can be imported into nvvp
• -m and -e: collect metrics or events
• --analysis-metrics: collect all metrics for import into nvvp
• --query-metrics and --query-events: query which metrics/events are available
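For example, one common workflow (the output file names here are illustrative):

$ nvprof -o timeline.prof ./vecAdd
$ nvprof --analysis-metrics -o analysis.prof ./vecAdd

Both files can then be imported into nvvp.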
nvprof
Events and Metrics
Most are quite in-depth, but a few are useful for quick analysis.
In general, events are for expert use and are rarely useful.
(A few) useful metrics:
dram_read_throughput: Main GPU memory read throughput
dram_write_throughput: Main GPU memory write throughput
flop_count_sp: Number of single precision floating point operations
flop_count_dp : Number of double precision floating point operations
nvprof
Metrics Example
nvprof --metrics dram_read_throughput,dram_write_throughput,flop_count_sp,flop_count_dp ./vecAdd
==32928== NVPROF is profiling process 32928, command: ./vecAdd
==32928== Warning: Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==32928== Profiling application: ./vecAdd
==32928== Profiling result:
==32928== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K40m (0)"
Kernel: kernel_vecadd_gpu_
1 dram_read_throughput Device Memory Read Throughput 136.55GB/s 136.55GB/s 136.55GB/s
1 dram_write_throughput Device Memory Write Throughput 70.062GB/s 70.062GB/s 70.062GB/s
1 flop_count_sp Floating Point Operations(Single Precisi 1000000 1000000 1000000
1 flop_count_dp Floating Point Operations(Double Precisi 0 0 0
nvprof
Metrics Example

dram_read_throughput   136.55GB/s
dram_write_throughput  70.062GB/s
flop_count_sp          1000000
flop_count_dp          0

The problem was addition of 1,000,000 element single precision vectors, so a single precision flop count of 1,000,000 is expected!
That corresponds to ~50 GFLOP/s: about 1% of peak FLOPs.
Measured dram throughput is ~206 GB/s (136.55 + 70.06 GB/s combined).
Peak for this machine is 288 GB/s, so the kernel achieves 72% of peak bandwidth.
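As a rough cross-check: the kernel reads 8 bytes and writes 4 bytes per element, so it moves about 12 MB in total; at ~206 GB/s that is ~58 us, in line with the measured kernel time of 66.178 us.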
Visual Profiler
nvvp
• Either import the output of nvprof -o ...
• Need to select metrics in advance
• Or run the application through the GUI
• Re-run on-the-fly to compute new metrics based on requirements
Vecadd - timeline
NEMO - timeline
NEMO - details
Back to vecadd
More details!
• If we collected our profiling information with --analysis-metrics, or if we're running the application through nvvp directly, more information is available. This is what we want!
• --analysis-metrics may take some time to run
• Run it separately after generating timeline info
• Matches our earlier calculations of ~72% peak bandwidth
Compute, Bandwidth or Latency Bound
Other analysis options are available
The tool will guide you
It becomes more useful as you gain experience
Profiling tips
The nvprof output is a very good place to start
The timeline is a good place to go next
Only dig deep into a kernel if it’s taking a significant amount of your time
Where possible, try to match profiler output with theory
For example, if I know I’m moving 1GB, and my kernel takes 10ms, I expect the
profiler to report 100GB/s.
Discrepancies are likely to mean your application isn’t doing what you thought it was
Optimising Data Movement
Heterogeneous Computing
[Diagram: CPU and GPU connected by the PCI Bus]
Data movement
What are the costs?
Two costs:
Bandwidth
Latency
Large copies will be bandwidth bound. Peak PCI-E bandwidth is 16 GB/s.
Small copies will be latency bound, so many small copies are expensive. A rough cost model follows.
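A minimal cost model (the latency figure below is illustrative, not measured):

transfer time ≈ latency + size / bandwidth

With ~10 us of fixed overhead and 16 GB/s of bandwidth, a 16 MB copy takes ~1010 us (almost entirely bandwidth), while a 4 KB copy takes ~10.25 us (almost entirely latency).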
Data movement
Several optimisation options
Several ways to optimise data movement:
1. Do less of it!
2. Fuse small copies into larger ones
3. Use pinned buffers
4. Overlap data movement with execution
Remember: profile before optimising!
Optimising Data Movement
1. Do less of it!
• Consider moving data movement outside of the main loop (see the sketch below)
• For example, an iterative solver would not want data movement within the iteration loop
• Often, during the process of making an application run on a GPU, these additional copies are unavoidable
• If doing intermediate performance analysis, bear in mind these copies may disappear in the final application
• Consider whether data can be passed as an argument (CUDA), or declared as private (OpenACC)
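A minimal CUDA C sketch of the idea; the solver_step kernel and the h_x/d_x buffers are hypothetical:

// Copy once before the loop, iterate entirely on the device,
// copy back once afterwards - no per-iteration transfers.
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
for (int it = 0; it < iters; ++it)
    solver_step<<<grid, block>>>(d_x, n);
cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);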
Optimising Data Movement
2. Fuse small copies into larger ones
• Small data transfers are much less efficient
• If the opportunity arises, merging copies can be beneficial (sketched below)
• Can be difficult to accomplish in some applications
• Additional cost due to packing/unpacking may outweigh the savings
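A sketch of the packing pattern in CUDA C (the three-array layout and names are hypothetical):

#include <string.h>
#include <cuda_runtime.h>

// Pack three small arrays into one staging buffer, then issue a single
// large transfer instead of three small ones.
void fused_upload(const float *a, const float *b, const float *c,
                  float *staging, float *d_staging, size_t n)
{
    memcpy(staging,         a, n * sizeof(float));
    memcpy(staging + n,     b, n * sizeof(float));
    memcpy(staging + 2 * n, c, n * sizeof(float));
    cudaMemcpy(d_staging, staging, 3 * n * sizeof(float),
               cudaMemcpyHostToDevice);
}

The host-side memcpy calls are exactly the packing cost the last bullet warns about.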
Optimising Data Movement
3. Use pinned buffers
• Page-locked (or pinned) allocations generally have higher bandwidth
• Can harm system performance if overused
• In CUDA:
• cudaHostAlloc() - allocate page-locked memory
• cudaHostRegister() - convert a normal allocation to page-locked (both shown below)
• Pinned attribute in CUDA Fortran
• In OpenACC:
• (Hopefully) done automatically!
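A short CUDA C sketch of both approaches (the buffer size is arbitrary):

#include <stdlib.h>
#include <cuda_runtime.h>

void pinned_examples(size_t bytes)
{
    // Option 1: allocate page-locked memory directly.
    float *h_buf;
    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault);
    cudaFreeHost(h_buf);

    // Option 2: pin an existing heap allocation in place.
    float *h_heap = (float *)malloc(bytes);
    cudaHostRegister(h_heap, bytes, cudaHostRegisterDefault);
    cudaHostUnregister(h_heap);
    free(h_heap);
}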
Optimising Data Movement
4. Overlap data movement with execution
• With pinned memory, PCI-E transfers can occur at the same time as kernel execution
• In CUDA:
• cudaMemcpyAsync (see the sketch below)
• In OpenACC:
• async clause on parallel or kernels directives
• This can greatly reduce the cost of memory copies - almost to nothing in the right case
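A minimal two-stream pipeline sketch in CUDA C; my_kernel, the buffers and the chunking are hypothetical, and h_in/h_out must be pinned:

// Work on the data in chunks; copies for one chunk can overlap the
// kernel for another because they are queued on different streams.
cudaStream_t s[2];
cudaStreamCreate(&s[0]);
cudaStreamCreate(&s[1]);
for (int i = 0; i < nchunks; ++i) {
    cudaStream_t st = s[i % 2];
    size_t off = (size_t)i * chunk;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, st);
    my_kernel<<<grid, block, 0, st>>>(d_in + off, d_out + off, (int)chunk);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, st);
}
cudaDeviceSynchronize();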
Optimising Data Movement
Summary
• There are good (and bad) ways of moving data between GPU and CPU
• For many applications this is not important
• Data resides on the GPU for the application’s lifetime
• Next-gen GPU Pascal will bring hardware improvements - NVLINK
NVLINK
Faster communications between processors
NVLINK
• High speed interconnect
• Alongside/replacing PCI-E
• 80-200 GB/s
• Improved energy efficiency
• Improved flexibility
NVLink Unleashes Multi-GPU Performance
[Chart: speedup versus a PCIe-based server (1.00x-2.25x) when next-gen GPUs connect via NVLink instead of PCIe; over 2x application performance speedup across ANSYS Fluent, Multi-GPU Sort, LQCD QUDA, AMBER and 3D FFT. GPUs interconnected with NVLink: 5x faster than PCIe Gen3 x16]
Using MPI with GPUs
MPI+CUDA
[Diagram: Node 0 through Node n-1; in each node a GPU with GDDR5 memory and a CPU with system memory sit on PCI-e, and a network card links the nodes together]
MPI+CUDA
//MPI rank 0
MPI_Send(s_buf_d, size, MPI_CHAR, n-1, tag, MPI_COMM_WORLD);
//MPI rank n-1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
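For context, a minimal complete program around those two calls might look like this (a sketch assuming a CUDA-aware MPI, so device pointers can be passed directly; the buffer size is arbitrary):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n, tag = 0, size = 1 << 20;
    char *buf_d;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    cudaMalloc((void **)&buf_d, size);  // device buffer handed straight to MPI

    if (rank == 0)
        MPI_Send(buf_d, size, MPI_CHAR, n - 1, tag, MPI_COMM_WORLD);
    else if (rank == n - 1)
        MPI_Recv(buf_d, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(buf_d);
    MPI_Finalize();
    return 0;
}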
Message Passing Interface
MPI
• Standard to exchange data between processes via messages
• Defines an API to exchange messages
• Point-to-point: e.g. MPI_Send, MPI_Recv
• Collectives, e.g. MPI_Reduce
• Multiple implementations (open source and commercial)
• Bindings for C/C++, Fortran, Python, …
• E.g. MPICH, Open MPI, MVAPICH, IBM Platform MPI, Cray MPT, …
MPI
Compiling and launching
$ mpicc -o myapp myapp.c
$ mpirun -np 4 ./myapp <args>
[Diagram: four myapp processes launched across the machine]
Launch MPI + CUDA/OpenACC programs
Launch one process per GPU
MVAPICH: MV2_USE_CUDA
$ MV2_USE_CUDA=1 mpirun -np ${np} ./myapp <args>
Open MPI: CUDA-aware features are enabled by default
Cray: MPICH_RDMA_ENABLED_CUDA
IBM Platform MPI: PMPI_GPU_AWARE
Unified Virtual Addressing
• One address space for all CPU and GPU memory
• Determine the physical memory location from a pointer value (see the sketch below)
• Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
• Support:
• 64-bit applications on Linux
• Windows using TCC mode
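A sketch of the pointer query (using the CUDA 7.x-era runtime API; later CUDA versions renamed the memoryType field):

#include <stdio.h>
#include <cuda_runtime.h>

// Ask the runtime where a UVA pointer lives - this is how a CUDA-aware
// MPI can accept host and device pointers through one interface.
void where_is(const void *ptr)
{
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) == cudaSuccess)
        printf("%s memory\n",
               attr.memoryType == cudaMemoryTypeDevice ? "device" : "host");
}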
Unified Virtual Addressing
[Diagram: without UVA, system memory and GPU memory are separate address spaces, each spanning 0x0000-0xFFFF; with UVA, CPU and GPU memory share a single address space across PCI-e]
MPI + CUDA

With UVA and CUDA-aware MPI:
!MPI rank 0
call MPI_Send(s_buf_d,size,…)
!MPI rank n-1
call MPI_Recv(r_buf_d,size,…)

No UVA and regular MPI:
!MPI rank 0
s_buf_h = s_buf_d
call MPI_Send(s_buf_h,size,…)
!MPI rank n-1
call MPI_Recv(r_buf_h,size,…)
r_buf_d = r_buf_h
MPI + OpenACC

With UVA and CUDA-aware MPI:
!$acc host_data use_device(s_buf, r_buf)
!MPI rank 0
call MPI_Send(s_buf,size,…)
!MPI rank n-1
call MPI_Recv(r_buf,size,…)
!$acc end host_data

No UVA and regular MPI:
!$acc update host(s_buf)
!MPI rank 0
call MPI_Send(s_buf,size,…)
!MPI rank n-1
call MPI_Recv(r_buf,size,…)
!$acc update device(r_buf)
NVIDIA GPUDirect™
Peer to Peer transfers
[Diagram: two GPUs, each with its own memory, share a PCI-e tree with the CPU, system memory, chipset and an InfiniBand card; peer-to-peer transfers move data directly between GPU1 and GPU2 memory over PCI-e, without staging through system memory]
NVIDIA GPUDirect™
Support for RDMA
[Diagram: the same topology, but here the InfiniBand card reads and writes GPU memory directly over PCI-e, so network transfers bypass system memory entirely]
Performance Results, two Nodes
OpenMPI 1.8.4, MLNX FDR IB (4X), Tesla K40 @ 875MHz
[Chart: bandwidth (MB/s, 0-7000) versus message size for regular MPI, GPU-aware MPI, and GPU-aware MPI with GPUDirect RDMA]
Latency (1 byte): regular MPI 19.79 us, GPU-aware MPI 17.97 us, GPU-aware MPI with GPUDirect RDMA 5.70 us
In this talk
• Profiling GPU applications
• Optimizing Data Movement
• Using MPI with GPUs
Any questions?
Bonus slides - Pascal!
GPU Roadmap
[Chart: SGEMM / W (normalized, 0-20) versus year, 2008-2016, across GPU generations]
• Tesla: CUDA
• Fermi: FP64
• Kepler: Dynamic Parallelism
• Maxwell: DX12
• Pascal: Unified Memory, 3D Memory, NVLink
Pascal
• Faster and larger global memory
• Faster communication between processors
• More powerful unified memory
• Mixed precision computing
Stacked Memory
High performance global memory
3D Stacked Memory
• 4x Higher Bandwidth (~1 TB/s)
• 3x Larger Capacity (up to 32GB)
• 4x More Energy Efficient per bit
NVLINK
Faster communications between processors
NVLINK
• High speed interconnect
• Alongside/replacing PCI-E
• 80-200 GB/s
• Improved energy efficiency
• Improved flexibility
NVLink: High-Speed GPU Interconnect
[Diagram: in 2014, Kepler GPUs attach to x86, ARM64 or POWER CPUs over PCIe; in 2016, Pascal GPUs connect to each other and to POWER CPUs with NVLink, and to x86 or ARM64 CPUs over PCIe]
Unified Memory: Simpler & Faster with NVLink
[Diagram: three developer views. Traditional: separate system memory and GPU memory. With Unified Memory: one unified memory. With Pascal & NVLink: unified memory linked by NVLink at 80 GB/s]
• Share data structures at CPU memory speeds, not PCIe speeds
• Oversubscribe GPU memory
Mixed precision computing
IEEE 16-bit float support
• Halving precision results in:
• Half the memory footprint
• Half the bandwidth required
• Double the computational throughput
• Obviously comes at an accuracy penalty
• Application dependent
• 16/32/64 bit all supported
• Supported in software now
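As an illustration of FP16 storage with FP32 arithmetic (a sketch using the CUDA 7.5-era half-precision intrinsics; the kernel and names are hypothetical):

#include <cuda_fp16.h>

// Scale a vector stored in FP16: half the footprint and bandwidth of FP32.
// Arithmetic is widened to FP32, so only the storage precision is reduced.
__global__ void scale_fp16(__half *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __half2float(x[i]);  // 16-bit load, widen to 32-bit
        x[i] = __float2half(v * s);    // compute in FP32, store as FP16
    }
}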