Parallel Programming

Course Introduction

Phạm Trọng Nghĩa


ptnghia@fit.hcmus.edu.vn
Why do we need parallelism?
• Many applications demand more execution speed and resources
• The rate of single-instruction-stream performance scaling has decreased
  • Frequency scaling is limited by power
  • ILP scaling has tapped out
• Architects now build faster processors by adding more execution units that run in parallel
• Software must be written to be parallel to see performance gains
CPU vs GPU

CPU (multicore): has a few cores; each core is powerful and complex. Focuses on execution speed.
GPU (many-core): has very many cores; each core is weak and simple. Focuses on throughput.

Image source: http://www.nvidia.com/object/what-is-gpu-computing.html
CPU vs GPU

CPU: a few powerful, complex cores. Focuses on optimizing latency; latency = the amount of time needed to complete a task.
GPU: very many weak, simple cores. Focuses on optimizing throughput; throughput = the number of tasks completed per unit of time.

Example: the task is transporting people from location A to location B; the distance from A to B is 4500 km.

Car: 2 people, 200 km/h          Bus: 40 people, 50 km/h
Latency = ? h                    Latency = ? h
Throughput = ? people/h          Throughput = ? people/h
CPU vs GPU

CPU: a few powerful, complex cores. Focuses on optimizing latency.
GPU: very many weak, simple cores. Focuses on optimizing throughput.

Example (continued): transporting people 4500 km from A to B.

Car: 2 people, 200 km/h
Latency = 4500 km / 200 km/h = 22.5 h
Throughput = 2 people / 22.5 h ≈ 0.09 people/h

Bus: 40 people, 50 km/h
Latency = 4500 km / 50 km/h = 90 h
Throughput = 40 people / 90 h ≈ 0.44 people/h

So, is the car or the bus better?
CPU vs GPU

CPU: a 24-core Intel multicore server microprocessor; 0.33 TFLOPS for double precision and 0.66 TFLOPS for single precision.
GPU (NVIDIA Tesla A100): 108 SMs, 6912 CUDA cores, and 432 Tensor cores; 9.7 TFLOPS for 64-bit double precision, 156 TFLOPS for 32-bit single precision, and 312 TFLOPS for 16-bit half precision.

FLOPS = FLoating-point Operations Per Second; TFLOPS = TeraFLOPS.
CPU: Latency-oriented design
• Powerful ALUs
  • Reduce operation latency
  • Increased chip area and power
• Large caches
  • Convert long-latency memory accesses into short-latency cache accesses
• Sophisticated control
  • Branch prediction for reduced branch latency (predict upcoming instructions so execution does not stall waiting for them)
  • Data forwarding for reduced data latency
Goal: reduce the execution latency of each individual thread
GPU: Throughput-oriented design
• Small caches
  • To boost memory throughput
• Simple control
  • No branch prediction
  • No data forwarding
• Energy-efficient ALUs
  • Many ALUs, each with long latency but heavily pipelined for high throughput
  • Require a massive number of threads to tolerate latencies
    • Threading logic
    • Thread size
Goal: maximize the total execution throughput of all threads
CPU + GPU

CUDA (Compute Unified Device Architecture) C/C++ is an extension of C/C++ that lets us write a single program taking advantage of both the CPU and an NVIDIA GPU: the sequential parts run on the CPU, and the massively parallel parts run on the GPU.

Image source: John Cheng et al. Professional CUDA C Programming. 2014
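As a first taste, here is a minimal sketch of such a program (vector addition, the course's first example; names like addKernel and the block size of 256 are illustrative choices, not prescribed by the slides):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Massively parallel part: each GPU thread adds one pair of elements
__global__ void addKernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Sequential part: runs on the CPU
    float *a = (float *)malloc(bytes);
    float *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Copy the inputs to the GPU
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    addKernel<<<gridSize, blockSize>>>(dA, dB, dC, n);

    // Copy the result back and check one element
    cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);  // expect 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(a); free(b); free(c);
    return 0;
}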
CPU + GPU

• Core ("pit") portions: sequential code
  • These portions are very hard to parallelize
  • CPUs tend to do a very good job on them
  • They take up a large portion of the code, but only a small portion of the execution time
• "Peach flesh" portions:
  • Easy to parallelize
  • Parallel programming on heterogeneous computing systems can drastically improve the speed of these applications
Applications of parallel programming on GPU

Image source: http://www.nvidia.com/object/gpu-
Challenges in parallel programming

• Question: Is parallel programming easy or hard?
• Answer:
  • Easy: if you do not care about performance and just want it to run
  • Hard: when you want to optimize it and get higher performance
Challenges in parallel programming

• It is challenging to design parallel algorithms with the same level of algorithmic (computational) complexity as their sequential counterparts
  • Some parallel algorithms do more work than their sequential counterparts
  • Parallelizing often requires non-intuitive ways of thinking about the problem and may require redundant work during execution
• The execution speed of many applications is limited by memory access latency and/or throughput
  • This requires methods for improving memory access speed
Challenges in parallel programming

• The execution speed of parallel programs is often more sensitive to input data characteristics than that of their sequential counterparts
  • Unpredictable data sizes and uneven data distributions
• Threads are required to collaborate with each other
  • Using synchronization operations such as barriers or atomic operations (see the sketch below)
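As a minimal illustration (assuming CUDA, which this course uses; the kernel name and the 256-bin layout are illustrative), here is a per-block histogram sketch that uses both kinds of synchronization:

__global__ void blockHistogram(const unsigned char *data, int *bins, int n) {
    __shared__ int localBins[256];           // per-block scratch histogram

    // Cooperatively zero the shared bins
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        localBins[b] = 0;
    __syncthreads();                         // barrier: bins are zeroed before use

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&localBins[data[i]], 1);   // atomic: threads may hit the same bin
    __syncthreads();                         // barrier: all updates are done

    // Merge this block's histogram into the global one
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&bins[b], localBins[b]);
}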

Most of these challenges have been addressed by researchers.
3 Ways to Accelerate Applications

Libraries: Easy, High-Quality
• Ease of use: enables GPU acceleration without in-depth knowledge of GPU programming
• "Drop-in": many GPU-accelerated libraries follow standard APIs, enabling acceleration with minimal code changes (see the SAXPY sketch below)
• Quality: libraries offer high-quality implementations of functions encountered in a broad range of applications
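For instance, a "drop-in" SAXPY (y = a·x + y) using NVIDIA's cuBLAS library might look like the sketch below; the use of unified memory via cudaMallocManaged is one convenient choice, not the only one:

#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1024;
    float alpha = 2.0f;

    // Unified memory: the same pointers are usable on CPU and GPU
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 3.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);

    // y = alpha * x + y, computed on the GPU by the library
    cublasSaxpy(handle, n, &alpha, x, 1, y, 1);
    cudaDeviceSynchronize();      // wait for the GPU before reading y

    printf("y[0] = %f\n", y[0]);  // expect 5.0

    cublasDestroy(handle);
    cudaFree(x); cudaFree(y);
    return 0;
}

In Colab this would compile with: !nvcc saxpy.cu -o saxpy -lcublas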
NVIDIA GPU Accelerated Libraries

https://developer.nvidia.com/gpu-accelerated-libraries
Compiler Directives: Easy, Portable
• Ease of use: the compiler takes care of the details of parallelism management and data movement
• Portable: the code is generic, not specific to any type of hardware, and the directives are available in multiple languages
• Uncertain: the performance of the code can vary across compiler versions
Compiler Directives: OpenACC

https://ulhpc-tutorials.readthedocs.io/en/latest/gpu/openacc/basics/
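A minimal sketch of the directive style (the loop body is ordinary C/C++; the pragma is the only OpenACC-specific line, and an OpenACC-capable compiler such as nvc from the NVIDIA HPC SDK can offload the loop to the GPU, while a compiler that ignores pragmas still builds a correct sequential program):

// SAXPY with OpenACC: the compiler manages parallelization and data movement
void saxpy(int n, float a, float *x, float *y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}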
Programming Languages: Most Performance and Flexibility
• Performance: the programmer has the best control of parallelism and data movement
• Flexible: the computation does not need to fit into a limited set of library patterns or directive types
• Verbose: the programmer often needs to express more details
Programming Languages: Most Performance and Flexibility
• MATLAB, Mathematica, LabVIEW
• PyCUDA, Numba (Python)
• CUDA Fortran, OpenACC (Fortran)
• CUDA C, OpenACC (C)
• CUDA C++, Thrust (C++)
• Hybridizer (C#)
Course topics:
• Introduction to CUDA; example: vector addition, convolution, … (3 weeks)
• GPU parallel execution in CUDA; example: reduction, … (4 weeks)
• Types of GPU memories in CUDA; example: reduction, convolution, … (3 weeks)
• Examples: scan, histogram, sort (4 weeks)
• Optimizing a CUDA program; additional topics in parallel programming (1 week)

After successfully completing the course, the student will be able to:
• Parallelize common tasks to run on a GPU using CUDA
• Apply knowledge of GPU parallel execution in CUDA to speed up a CUDA program
• Apply knowledge of GPU memories in CUDA to speed up a CUDA program
• Apply the optimization process to optimize a CUDA program
• Apply teamwork skills to complete the final project
Course assessment
• Individual exercises throughout the course: 50% of the grade
• Group final project: 50% of the grade; 2 students per group
Course assessment
Remember: the main goal is to learn, truly learn.

You can discuss ideas with others and consult Internet sources, but your writing and code must be your own, based on your own understanding.

If you violate this rule, you will receive a score of 0 for the course.
Advice
• In this course, we will focus on parallel programming on the GPU (Graphics Processing Unit)
• Don't worry if you don't have a GPU ;-)
• We will use Google Colab for this course
Setup coding environment
• Where can you find a machine with a CUDA-enabled GPU?
  • Google Colab: it's free and ready to run CUDA programs ☺
  • Even if you have your own GPU, you should use Google Colab, because the teacher will use it to run and grade your programs
• Code, compile, and run:
  • Write and save code (a .cu file) on your local machine with your favorite editor (if your editor does not recognize .cu files and highlight their syntax, simply set the language/syntax to C/C++)
  • Open a notebook in Colab (you must sign in to your Gmail account), select "Runtime, Change runtime type", set "Hardware accelerator" to GPU, and upload the .cu file
  • In a Colab cell, compile: !nvcc file-name.cu -o run-file-name
    • If we don't specify run-file-name, it defaults to a.out
  • In a Colab cell, run: !./run-file-name
• Demo … (a minimal test program is sketched below)
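For a first test, something like the following minimal hello.cu (the file name and launch configuration are arbitrary) is enough to verify the setup:

#include <cstdio>
#include <cuda_runtime.h>

// Runs on the GPU: each thread prints its own index
__global__ void hello() {
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();         // launch 1 block of 4 threads
    cudaDeviceSynchronize();   // wait for the kernel (and its printf) to finish
    return 0;
}

In Colab: !nvcc hello.cu -o hello, then !./hello.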
RESOURCES
• Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2022.
• David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors. Morgan Kaufmann, 2016.
• John Cheng, Max Grossman, and Ty McKercher. Professional CUDA C Programming. John Wiley & Sons, 2014.
• Lê Hoài Bắc, Vũ Thanh Hưng, Trần Trung Kiên. Lập trình song song trên GPU. NXB KH & KT, 2015.
• NVIDIA. Intro to Parallel Programming. Udacity.
• NVIDIA. CUDA Toolkit Documentation.
Reference
• [1] Slides from the Illinois-NVIDIA GPU Teaching Kit.
• [2] Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2022.
THE END
