UNIT V Parallel Programming Patterns in CUDA (T2 Chapter 7) - P P With CUDA

UNIT V
Parallel Programming Patterns in CUDA
Chapter 7 (Learn CUDA Programming)
Ajeet K Jain
CSE, KMIT, Hyderabad
This unit covers parallel programming patterns that show how to
parallelize different algorithms and optimize them in CUDA. The
techniques covered can be applied to a variety of problems:
•Matrix multiplication optimization
•Image convolution
•Prefix sum
•Compact and split
•N-body operation
•QuickSort in CUDA using dynamic parallelism
•Radix sort
•Histogram calculation
Matrix multiplication optimization
Matrix multiplication is a group of dot product operations on two
matrices. A simple way to parallelize it is to have each CUDA
thread compute the dot product for one output element. However,
this approach is inefficient in terms of memory usage because the
data that's loaded from memory isn't reused. To confirm this
analysis, let's measure the performance limiter. The following
chart shows the GPU utilization for a Tesla V100 card using Nsight
Compute:
Based on our performance limiter analysis, this kernel can be
categorized as memory bound. Therefore, we should review the
memory utilization to relieve that bottleneck. The following
screenshot shows the memory workload analysis section:
From this analysis, we can see that the L2 cache hit rate is
low and that the max bandwidth is low. We can presume
that this is because the original matrix multiplication
operation does not reuse loaded data, as we mentioned
earlier. This can be resolved by using shared memory, that
is, by reusing the loaded data and reducing global memory
traffic. Now, let's review matrix multiplication and how we
can optimize it to use shared memory, which has only a small
memory space.
Matrix multiplication provides an optimization opportunity because
we can break the large matrix operation down into small
problems (tiles) that fit in the small memory space. In CUDA
programming, we place the small matrices in shared memory
and reduce global memory access. In our implementation, we
will match the tile with the CUDA thread block. The tile's
position is determined by its block index, which is done
with the tid_* variable.
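The tiling idea can be sketched on the host in C++. This is only an
illustration under stated assumptions, not the chapter's kernel: the
name tiled_matmul, the tile width TILE, and the local arrays sA and sB
(standing in for shared memory) are hypothetical, and the loops over
by, bx, ty, and tx emulate what the block and thread indices would
select on the GPU.

```cpp
#include <cassert>
#include <vector>

constexpr int TILE = 2;  // hypothetical tile width; would equal the thread-block width

// Computes C = A * B for square n x n row-major matrices, one tile at a time.
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B, int n) {
    assert(n > 0 && n % TILE == 0);  // the sketch assumes n is a multiple of TILE
    std::vector<float> C(static_cast<std::size_t>(n) * n, 0.0f);
    for (int by = 0; by < n; by += TILE) {            // tile row (block index y)
        for (int bx = 0; bx < n; bx += TILE) {        // tile column (block index x)
            for (int k0 = 0; k0 < n; k0 += TILE) {    // walk the tiles along the shared dimension
                float sA[TILE][TILE];                 // stand-in for a shared-memory tile of A
                float sB[TILE][TILE];                 // stand-in for a shared-memory tile of B
                for (int ty = 0; ty < TILE; ++ty)     // each (ty, tx) pair is one thread
                    for (int tx = 0; tx < TILE; ++tx) {
                        sA[ty][tx] = A[(by + ty) * n + (k0 + tx)];
                        sB[ty][tx] = B[(k0 + ty) * n + (bx + tx)];
                    }
                // On the GPU, a block-wide barrier would separate load and compute.
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx)
                        for (int k = 0; k < TILE; ++k)  // every loaded value is reused TILE times
                            C[(by + ty) * n + (bx + tx)] += sA[ty][k] * sB[k][tx];
            }
        }
    }
    return C;
}
```

Each element loaded into sA or sB is read TILE times before the next
tile is fetched, which is the data reuse that the untiled version
lacks.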
Convolution
The convolution operation (or filtering) is another common
operation in many applications, especially in image and signal
processing, as well as deep learning. Although this operation is
based on products of sequential data from the input and the
filter, we take a different approach from the one used for
matrix multiplication.
Convolution operation in CUDA
The convolution operation consists of source data and
a filter. The filter is also known as a kernel. By applying
the filter to the input data, we obtain the modified
result. A two-dimensional convolution is shown in the
following diagram:
We need to consider two concepts when we implement the
convolution operation: the kernel and the padding. The kernel is a
set of coefficients that we want to apply to the source
data. This is also known as a filter. The padding is extra virtual
space around the source data so that we can apply the kernel
to the edges. When the padding size is 0, we don't
allow the filter to move beyond the source space. However, in
general, the padding size is half the size of the filter.
To start simply, we can design the kernel function with the
following in mind:
•Each CUDA thread generates one filtered output.
•Each CUDA thread applies the filter's coefficients to the data.
•The filter shape is a box filter.
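The design points above can be sketched on the host in C++. The
function names filter_at and convolve2d are hypothetical, and the
sketch assumes zero padding of half the filter size and an odd filter
width. Each call to filter_at does the work that one CUDA thread would
do for one output element. The filter is applied without flipping,
which gives the same result for a symmetric box filter.

```cpp
#include <cassert>
#include <vector>

// Work of one thread: apply the fsize x fsize filter around pixel (x, y).
float filter_at(const std::vector<float>& in, int w, int h,
                const std::vector<float>& filt, int fsize, int x, int y) {
    const int pad = fsize / 2;  // padding is half the filter size
    float sum = 0.0f;
    for (int fy = 0; fy < fsize; ++fy) {
        for (int fx = 0; fx < fsize; ++fx) {
            const int ix = x + fx - pad;
            const int iy = y + fy - pad;
            // Samples that fall in the padding region contribute zero.
            if (ix >= 0 && ix < w && iy >= 0 && iy < h)
                sum += in[iy * w + ix] * filt[fy * fsize + fx];
        }
    }
    return sum;
}

// Emulates the launch: one filter_at call per output element.
std::vector<float> convolve2d(const std::vector<float>& in, int w, int h,
                              const std::vector<float>& filt, int fsize) {
    assert(fsize % 2 == 1);  // the sketch assumes an odd filter width
    std::vector<float> out(static_cast<std::size_t>(w) * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y * w + x] = filter_at(in, w, h, filt, fsize, x, y);
    return out;
}
```

A normalized box filter would use coefficients of 1 / (fsize * fsize),
so that each output is the average of its neighborhood.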
Prefix sum (scan)
Prefix sum (scan) produces an array of cumulative sums from a
given input array of numbers. For example, the input array
[1, 2, 3, 4] yields the prefix-sum sequence [1, 3, 6, 10].
In this approach, we can obtain the output using multiple CUDA cores.
However, this method does not reduce the total number of iterations
because the first input element has to be added to every output,
one by one. Also, the result becomes unpredictable when the array is
large enough that multiple thread blocks must be launched. This
is because the scheduled CUDA threads are not all launched at the
same time in the CUDA architecture, so threads could conflict by
reading elements that other threads have already overwritten. To
avoid this, we need a double buffer approach for the array, which
is another inefficiency. The following code shows its implementation:
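A minimal host-side C++ sketch of this scheme follows. It is an
illustration, not the chapter's listing: the name naive_scan is
hypothetical, and the sketch assumes that "double buffer" means a
separate output array. Each iteration of the outer loop stands for one
CUDA thread.

```cpp
#include <vector>

// Inclusive prefix sum where "thread" idx sums input[0..idx] on its own.
std::vector<float> naive_scan(const std::vector<float>& input) {
    const int n = static_cast<int>(input.size());
    // Second buffer: no thread ever overwrites data that another thread reads.
    std::vector<float> output(input.size());
    for (int idx = 0; idx < n; ++idx) {        // each idx is one CUDA thread
        float element = 0.0f;
        for (int offset = 0; offset <= idx; ++offset)
            element += input[idx - offset];    // walks back to the first element
        output[idx] = element;
    }
    return output;
}
```

The last thread performs n additions, so the total work is O(N^2)
even though the threads are independent of each other.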
The scan runs in two steps, both controlled by the stride. In the
first step, the stride increases and the partial sums are
accumulated accordingly. In the second step, the stride decreases
and the remaining partial sums are obtained. Each step has a
different operation pattern, but both can be derived from the
stride size.
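One way to realize these two stride-controlled steps is sketched below
in host-side C++. The name scan_two_phase and the index formulas are
assumptions made for illustration, and the sketch only handles
power-of-two lengths. The loop over t stands for the threads of a
single block working in place on one buffer.

```cpp
#include <cassert>
#include <vector>

// In-place inclusive prefix sum using an increasing-stride pass
// followed by a decreasing-stride pass.
void scan_two_phase(std::vector<int>& data) {
    const int n = static_cast<int>(data.size());
    assert(n > 0 && (n & (n - 1)) == 0);  // the sketch assumes a power-of-two length

    // Step 1: the stride doubles; each active thread folds a left
    // partial sum into its right neighbor.
    int stride = 1;
    for (; stride < n; stride *= 2) {
        for (int t = 0; t < n / 2; ++t) {        // t is the thread index
            const int a = stride * (2 * t + 1) - 1;
            const int b = stride * (2 * t + 2) - 1;
            if (b < n) data[b] += data[a];
        }
        // On the GPU, a block-wide barrier would follow each stride.
    }

    // Step 2: the stride halves; the remaining positions receive the
    // partial sums they still lack.
    for (stride /= 2; stride > 0; stride /= 2) {
        for (int t = 0; t < n / 2; ++t) {
            const int a = stride * (2 * t + 2) - 1;
            const int b = stride * (2 * t + 3) - 1;
            if (b < n) data[b] += data[a];
        }
    }
}
```

Within one stride, the indices that are read and the indices that are
written never coincide, so no second buffer is needed.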
Compact and split
Previously, we covered how to parallelize the sequential prefix
sum algorithm and discussed how it can be used for other
applications. Now, let's cover two of those
applications: compact and split. The compact operation
consolidates the values of an array that fulfill a given
condition. The split operation, on the other hand, distributes
the values to designated places. In general, these algorithms
work sequentially. However, we will see how the parallel
prefix-sum operation can be used to parallelize them.
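The role of the prefix sum in compact can be sketched in host-side
C++. The name compact and the three-step structure (predicate, scan,
scatter) are assumptions made for illustration. The scan is written
sequentially here, whereas on the GPU it would be the parallel prefix
sum, and the other two loops are independent per element.

```cpp
#include <cstddef>
#include <vector>

// Keeps the elements for which pred(x) is true, preserving their order.
template <typename Pred>
std::vector<int> compact(const std::vector<int>& input, Pred pred) {
    const std::size_t n = input.size();
    std::vector<int> flag(n), pos(n);

    // Step 1 (parallel per element): mark the elements to keep.
    for (std::size_t i = 0; i < n; ++i)
        flag[i] = pred(input[i]) ? 1 : 0;

    // Step 2: an exclusive prefix sum of the flags gives each kept
    // element its output position.
    int running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        pos[i] = running;
        running += flag[i];
    }

    // Step 3 (parallel per element): scatter the kept elements.
    std::vector<int> out(static_cast<std::size_t>(running));
    for (std::size_t i = 0; i < n; ++i)
        if (flag[i]) out[static_cast<std::size_t>(pos[i])] = input[i];
    return out;
}
```

A split can reuse the same positions: elements that fail the predicate
would be written after the kept ones instead of being dropped.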
N-body
An N-body simulation is a simulation of a dynamical system that evolves under
the influence of physical forces. The solution is approximated numerically as the
bodies continuously interact with each other. N-body simulation is used extensively
in physics and astronomy, for example, so that scientists can understand the
dynamics of particles in the Universe. N-body simulations are also used in many other
domains, including computational fluid dynamics, in order to understand turbulent
fluid flow.
A relatively easy method for solving an N-body simulation is the brute-force
(all-pairs) technique, which has O(N^2) complexity. This approach is embarrassingly
parallel in nature.
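The brute-force approach can be sketched in host-side C++. The sketch
assumes Newtonian gravity with the gravitational constant set to 1 and
an optional softening term. The type names Body and Vec3 and the
function all_pairs_accel are hypothetical. Each iteration of the outer
loop is independent of the others, which is what makes the method
embarrassingly parallel: on the GPU, one thread would handle one body.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Body { float x, y, z, mass; };
struct Vec3 { float x, y, z; };

// Acceleration on every body from every other body (O(N^2) pairs).
std::vector<Vec3> all_pairs_accel(const std::vector<Body>& bodies,
                                  float softening) {
    std::vector<Vec3> acc(bodies.size(), Vec3{0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i < bodies.size(); ++i) {      // one thread per body i
        for (std::size_t j = 0; j < bodies.size(); ++j) {  // inner loop over all bodies
            if (i == j) continue;                          // no self-interaction
            const float dx = bodies[j].x - bodies[i].x;
            const float dy = bodies[j].y - bodies[i].y;
            const float dz = bodies[j].z - bodies[i].z;
            // Softening (an assumption) avoids huge forces at tiny distances.
            const float dist2 = dx * dx + dy * dy + dz * dz + softening * softening;
            const float inv = 1.0f / std::sqrt(dist2);
            const float s = bodies[j].mass * inv * inv * inv;  // m_j / r^3, with G = 1
            acc[i].x += dx * s;
            acc[i].y += dy * s;
            acc[i].z += dz * s;
        }
    }
    return acc;
}
```

Thread i only writes acc[i], so no synchronization is needed between
bodies during the force computation.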
There are various algorithm-level optimizations that can reduce the compute
complexity. Instead of applying all-pairs to the whole simulation, it can be used to
determine forces in close-range interactions only. Even in this case, creating a kernel
for solving the forces on CUDA is very useful, as it will also improve the
performance of the far-field components. Accelerating one component offloads
work from the other components, so the entire application benefits from
accelerating one kernel.
Histogram calculation
In an embarrassingly parallel job, ideally, you would assign
computation to each thread working on independent data,
resulting in no data races. By now, you will have realized
that some patterns don't fit this category. One such pattern is
calculating a histogram. A histogram shows the
frequency of each data item, for example, the number of times we
used the word CUDA in each chapter, the number of times each
letter occurred in this chapter, and so on. A histogram takes
the following form:
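A letter-count histogram can be sketched in host-side C++. The name
letter_histogram and the block_size parameter are hypothetical, and
the per-block private copy is one possible way to limit contention,
assumed here for illustration. On the GPU, many threads may hit the
same bin at once, so both kinds of increment below would have to be
atomic (for example, with atomicAdd) to avoid a data race.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

constexpr int NUM_BINS = 26;  // one bin per letter a..z
using Histogram = std::array<unsigned, NUM_BINS>;

// Counts letters case-insensitively, one chunk ("thread block") at a time.
Histogram letter_histogram(const std::string& text, std::size_t block_size) {
    assert(block_size > 0);
    Histogram global{};  // zero-initialized result in "global memory"
    for (std::size_t start = 0; start < text.size(); start += block_size) {
        Histogram local{};  // private copy per block (shared memory on the GPU)
        const std::size_t end = std::min(start + block_size, text.size());
        for (std::size_t i = start; i < end; ++i) {  // each i is one thread
            const int c = std::tolower(static_cast<unsigned char>(text[i]));
            if (c >= 'a' && c <= 'z')
                ++local[static_cast<std::size_t>(c - 'a')];  // atomic on the GPU
        }
        for (int b = 0; b < NUM_BINS; ++b)
            global[static_cast<std::size_t>(b)] += local[static_cast<std::size_t>(b)];  // atomic merge on the GPU
    }
    return global;
}
```

Unlike the earlier patterns, the output location depends on the data,
so two threads cannot know in advance whether they will collide.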
Quicksort and CUDA dynamic parallelism
The Quicksort algorithm demands launching kernels recursively. So far, the
algorithms we have seen call the kernel once from the CPU. After the kernel has
finished executing, we return to the CPU thread and then relaunch it. Doing
this gives control back to the CPU, and may also result in data
transfer between the CPU and GPU, which is a costly operation. It used to be very
difficult to efficiently implement algorithms such as Quicksort, which
demand features such as recursion, on GPUs. With GPUs of compute capability 3.5
and CUDA 5.0 onwards, a new feature was introduced called dynamic parallelism.
Dynamic parallelism allows the threads within a kernel to launch new kernels
from the GPU without returning control to the CPU. The word dynamic
comes from the fact that the launches are decided dynamically, based on runtime
data. Multiple kernels can be launched by threads at once. The following diagram
simplifies this explanation:
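The recursive structure can be sketched in host-side C++. The function
name quicksort, the MAX_DEPTH cap, and the fallback to std::sort are
assumptions made for illustration. The two recursive calls mark the
places where, with dynamic parallelism, a GPU thread would launch
child kernels instead of returning to the CPU.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

constexpr int MAX_DEPTH = 16;  // hypothetical cap on the nesting depth

// Sorts data[left..right] (inclusive bounds) in ascending order.
void quicksort(std::vector<int>& data, int left, int right, int depth = 0) {
    if (left >= right) return;
    if (depth >= MAX_DEPTH) {
        // Past the cap, finish this range without further nesting.
        std::sort(data.begin() + left, data.begin() + right + 1);
        return;
    }
    const int pivot = data[left + (right - left) / 2];
    int i = left;
    int j = right;
    while (i <= j) {                     // partition around the pivot
        while (data[i] < pivot) ++i;
        while (data[j] > pivot) --j;
        if (i <= j) {
            std::swap(data[i], data[j]);
            ++i;
            --j;
        }
    }
    // With dynamic parallelism, each of these calls would be a child
    // kernel launched from the GPU, sized by the runtime partition result.
    quicksort(data, left, j, depth + 1);
    quicksort(data, i, right, depth + 1);
}
```

The sizes of the two sub-ranges are only known after partitioning,
which is why the launches have to be decided at runtime.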
Dynamic parallelism guidelines and constraints
Though dynamic parallelism gives us an opportunity to port algorithms
such as Quicksort to the GPU, there are some fundamental rules and guidelines
that need to be followed.
