CONCURRENT and PARALLEL PROGRAMMING

UNIT V-NOTES

UNIT V

OpenMP, OpenCL, Cilk++, Intel TBB, CUDA

5.1 OpenMP

OpenMP is an Application Program Interface (API) that may be used to explicitly direct multi-threaded,
shared memory parallelism.

It is comprised of three primary API components:

o Compiler Directives
o Runtime Library Routines
o Environment Variables

Goals of OpenMP:

• Standardization:
o Provide a standard among a variety of shared memory architectures/platforms
o Jointly defined and endorsed by a group of major computer hardware and
software vendors
• Lean and Mean:
o Establish a simple and limited set of directives for programming shared memory
machines.
o Significant parallelism can be implemented by using just 3 or 4 directives.
o This goal is becoming less meaningful with each new release, apparently.
• Ease of Use:
o Provide the capability to incrementally parallelize a serial program, unlike message-
passing libraries, which typically require an all-or-nothing approach
o Provide the capability to implement both coarse-grain and fine-grain parallelism
• Portability:
o The API is specified for C/C++ and Fortran
o Public forum for API and membership
o Most major platforms have been implemented including Unix/Linux platforms
and Windows
OpenMP Programming Model

Shared Memory Model:

• OpenMP is designed for multi-processor/core, shared memory machines. The underlying
architecture can be shared memory UMA or NUMA.
• Because OpenMP is designed for shared memory parallel programming, it is largely limited
to single-node parallelism. Typically, the number of processing elements (cores) on a
node determines how much parallelism can be implemented.

Motivation for Using OpenMP in HPC:

• By itself, OpenMP parallelism is limited to a single node.
• For High Performance Computing (HPC) applications, OpenMP is combined with MPI
for distributed memory parallelism. This is often referred to as Hybrid Parallel
Programming.
o OpenMP is used for computationally intensive work on each node
o MPI is used to accomplish communications and data sharing between nodes
• This allows parallelism to be implemented to the full scale of a cluster.

Thread Based Parallelism:

• OpenMP programs accomplish parallelism exclusively through the use of threads.
• A thread of execution is the smallest unit of processing that can be scheduled by an
• A thread of execution is the smallest unit of processing that can be scheduled by an
operating system. The idea of a subroutine that can be scheduled to run autonomously
might help explain what a thread is.
• Threads exist within the resources of a single process. Without the process, they cease to
exist.
• Typically, the number of threads matches the number of machine processors/cores.
However, the actual use of threads is up to the application.

Explicit Parallelism:

• OpenMP is an explicit (not automatic) programming model, offering the programmer full
control over parallelization.
• Parallelization can be as simple as taking a serial program and inserting compiler
directives....
• Or as complex as inserting subroutines to set multiple levels of parallelism, locks and
even nested locks.
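As a hedged illustration (not part of the original notes), the sketch below shows the simple end of that spectrum: a serial loop parallelized by inserting a single directive. The array names and size are arbitrary.

#include <omp.h>

#define N 1000000

static double a[N], b[N];

int main(void) {
    /* One directive is enough to divide the loop iterations among threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];
    return 0;
}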

Fork - Join Model:

• OpenMP uses the fork-join model of parallel execution:
• All OpenMP programs begin as a single process: the master thread. The master thread
executes sequentially until the first parallel region construct is encountered.
• FORK: the master thread then creates a team of parallel threads.
• The statements in the program that are enclosed by the parallel region construct are then
executed in parallel among the various team threads.
• JOIN: When the team threads complete the statements in the parallel region construct,
they synchronize and terminate, leaving only the master thread.
• The number of parallel regions and the threads that comprise them are arbitrary.
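A minimal fork-join sketch (illustrative, not from the original notes): the master thread executes serially, FORKs a team for the parallel region, and continues alone after the JOIN. The team size printed depends on the implementation and environment.

#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("master thread: serial part\n");

    /* FORK: a team of threads executes the parallel region. */
    #pragma omp parallel
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   /* JOIN: the team synchronizes and terminates; only the master continues. */

    printf("master thread: serial part again\n");
    return 0;
}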

Data Scoping:

• Because OpenMP is a shared memory programming model, most data within a parallel
region is shared by default.
• All threads in a parallel region can access this shared data simultaneously.
• OpenMP provides a way for the programmer to explicitly specify how data is "scoped" if
the default shared scoping is not desired.
• This topic is covered in more detail in the Data Scope Attribute Clauses section.
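A brief hedged sketch (not from the original notes) of explicit scoping: n stays shared among the team while each thread gets its own private copy of tmp. The variable names are arbitrary.

#include <omp.h>
#include <stdio.h>

int main(void) {
    int n = 8;        /* shared by default */
    int tmp = 0;      /* given a private copy per thread below */

    #pragma omp parallel for shared(n) private(tmp)
    for (int i = 0; i < n; i++) {
        tmp = i * i;  /* each thread writes only its own tmp */
        printf("thread %d computed %d\n", omp_get_thread_num(), tmp);
    }
    return 0;
}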

Nested Parallelism:

• The API provides for the placement of parallel regions inside other parallel regions.
• Implementations may or may not support this feature.

Dynamic Threads:

• The API provides for the runtime environment to dynamically alter the number of threads
used to execute parallel regions. Intended to promote more efficient use of resources, if
possible.
• Implementations may or may not support this feature.
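The sketch below (illustrative, not from the original notes) requests both of the preceding features, nested parallelism and dynamic threads, through the runtime library; an implementation is free to ignore either request.

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_dynamic(1);   /* let the runtime adjust the number of threads */
    omp_set_nested(1);    /* request support for nested parallel regions */

    #pragma omp parallel num_threads(2)
    {
        /* The inner region only forks again if the implementation supports nesting. */
        #pragma omp parallel num_threads(2)
        printf("outer thread %d, inner thread %d\n",
               omp_get_ancestor_thread_num(1), omp_get_thread_num());
    }
    return 0;
}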

I/O:

• OpenMP specifies nothing about parallel I/O. This is particularly important if multiple
threads attempt to write/read from the same file.
• If every thread conducts I/O to a different file, the issues are not as significant.
• It is entirely up to the programmer to ensure that I/O is conducted correctly within the
context of a multi-threaded program.
OpenMP API Overview

Three Components:

• The OpenMP 3.1 API is comprised of three distinct components:
o Compiler Directives (19)
o Runtime Library Routines (32)
o Environment Variables (9)

Later APIs include the same three components, but increase the number of directives,
runtime library routines and environment variables.

• The application developer decides how to employ these components. In the simplest case,
only a few of them are needed.
• Implementations differ in their support of all API components. For example, an
implementation may state that it supports nested parallelism, but the API makes it clear
that support may be limited to a single thread - the master thread. Not exactly what the
developer might expect?

Compiler Directives:

• Compiler directives appear as comments in your source code and are ignored by
compilers unless you tell them otherwise - usually by specifying the appropriate compiler
flag, as discussed in the Compiling section later.
• OpenMP compiler directives are used for various purposes:
o Spawning a parallel region
o Dividing blocks of code among threads
o Distributing loop iterations between threads
o Serializing sections of code
o Synchronization of work among threads
• Compiler directives have the following syntax:

sentinel directive-name [clause, ...]
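For C/C++ the sentinel is #pragma omp (in Fortran it is !$OMP). The hedged sketch below (not from the original notes) shows the three syntactic parts on one directive and also touches two of the purposes listed above, dividing blocks of code among threads (sections) and serializing a section of code (critical); the variable name total is arbitrary.

#include <omp.h>
#include <stdio.h>

int total = 0;

int main(void) {
    /* sentinel: #pragma omp   directive-name: parallel sections   clause: num_threads(2) */
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            #pragma omp critical   /* serializes this update among threads */
            total += 1;
        }
        #pragma omp section
        {
            #pragma omp critical
            total += 2;
        }
    }
    printf("total = %d\n", total); /* expected 3 */
    return 0;
}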

Run-time Library Routines:

• The OpenMP API includes an ever-growing number of run-time library routines.
• These routines are used for a variety of purposes:
o Setting and querying the number of threads
o Querying a thread's unique identifier (thread ID), a thread's ancestor's identifier,
the thread team size
o Setting and querying the dynamic threads feature
o Querying if in a parallel region, and at what level
o Setting and querying nested parallelism
o Setting, initializing and terminating locks and nested locks
o Querying wall clock time and resolution
• For C/C++, all of the run-time library routines are actual subroutines. For Fortran, some
are actually functions, and some are subroutines. For example:

Fortran:  INTEGER FUNCTION OMP_GET_NUM_THREADS()

C/C++:    #include <omp.h>
          int omp_get_num_threads(void)
• Note that for C/C++, you usually need to include the <omp.h> header file.
• Fortran routines are not case sensitive, but C/C++ routines are.
• The run-time library routines are briefly discussed as an overview in the Run-Time
Library Routines section, and in more detail in Appendix A.
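A small sketch (not from the original notes) exercising a few of the routines listed above: team-size control and queries, thread IDs, the dynamic-threads query, and the wall-clock timer.

#include <omp.h>
#include <stdio.h>

int main(void) {
    double t0 = omp_get_wtime();            /* wall clock time */

    omp_set_num_threads(4);                 /* request a team size */
    printf("dynamic threads enabled? %d\n", omp_get_dynamic());

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();     /* this thread's unique ID */
        if (tid == 0)
            printf("team size: %d\n", omp_get_num_threads());
    }

    printf("elapsed %g s (timer resolution %g s)\n",
           omp_get_wtime() - t0, omp_get_wtick());
    return 0;
}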

Environment Variables:

• OpenMP provides several environment variables for controlling the execution of
parallel code at run-time.
• These environment variables can be used to control such things as:
o Setting the number of threads
o Specifying how loop iterations are divided
o Binding threads to processors
o Enabling/disabling nested parallelism; setting the maximum levels of nested
parallelism
o Enabling/disabling dynamic threads
o Setting thread stack size
o Setting thread wait policy
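As a hedged illustration (not from the original notes), the program below reports the settings implied by a few of these variables; the shell lines in the comment show one typical way to set them before launching a compiled executable (a.out is just a placeholder name).

#include <omp.h>
#include <stdio.h>

/* Typical usage from the shell:
 *   export OMP_NUM_THREADS=4      (number of threads)
 *   export OMP_DYNAMIC=FALSE      (disable dynamic threads)
 *   export OMP_NESTED=TRUE        (enable nested parallelism)
 *   export OMP_STACKSIZE=16M      (per-thread stack size)
 *   ./a.out
 */
int main(void) {
    printf("max threads: %d\n", omp_get_max_threads());
    printf("dynamic:     %d\n", omp_get_dynamic());
    printf("nested:      %d\n", omp_get_nested());
    return 0;
}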

5.2 OpenCL

Open Computing Language (OpenCL) is a framework for writing programs that execute across
heterogeneous platforms, which consist, for example, of CPUs, GPUs, DSPs, and FPGAs.
OpenCL specifies a programming language (based on C99) for programming these devices and
application programming interfaces (APIs) to control the platform and execute programs on the
compute devices.
OpenCL provides a standard interface for parallel computing using task-based and data-based
parallelism.

OpenCL Models

i. Device Model: How the device looks inside.
ii. Execution Model: How work gets done on devices.
iii. Memory Model: How devices and the host see data.
iv. Host API: How the host controls the devices.

Device Model

Global Memory: Shared by the whole device (all compute units), but slow; it is persistent between
kernel calls.

Constant Memory: Faster than global memory; use it for read-only data such as filter parameters.

Local Memory: Private to each compute unit, and shared by all of its processing elements.

Private Memory: Faster still, but local to each processing element.

Constant, local, and private memory are scratch space, so you cannot save data there to be
used by other kernels.
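A hypothetical kernel sketch (not from the original notes) showing how these four regions appear in OpenCL C through address-space qualifiers; the kernel and argument names are arbitrary.

/* OpenCL C kernel source, compiled at run time by the host program. */
__kernel void scale(__global float* data,      /* global memory: visible to host and device  */
                    __constant float* coeff,   /* constant memory: read-only on the device   */
                    __local float* scratch)    /* local memory: shared within one work-group */
{
    int gid = get_global_id(0);
    float tmp = data[gid] * coeff[0];          /* tmp lives in private memory                */
    scratch[get_local_id(0)] = tmp;            /* stage the value in local memory            */
    data[gid] = scratch[get_local_id(0)];      /* same work-item reads it back: no barrier   */
}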

Execution Model

OpenCL applications run on the host, which submits work to the compute devices.

1. Work Item: Basic unit of work on a compute device
2. Kernel: The code that runs on a work item (basically a C function)
3. Program: Collection of kernels and other functions
4. Context: The environment where work-items execute (Devices, their memories and command
queues)
5. Command Queue: Queue used by the host to submit work (kernels, memory copies) to the
device.

The execution model defines how a kernel executes at each point of a problem (an N-dimensional
index space); equivalently, it can be seen as the decomposition of a task into work-items.
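A condensed host-side sketch (illustrative, not from the original notes; error checking omitted) that creates the objects named above - context, command queue, program, kernel, buffer - and submits a 1-D NDRange of work-items. It uses the OpenCL 1.x clCreateCommandQueue entry point; the kernel name "scale" and the data values are arbitrary.

#include <CL/cl.h>
#include <stdio.h>

/* Kernel source: one work-item processes one element of the index space. */
static const char* src =
    "__kernel void scale(__global float* d) {  \n"
    "    int gid = get_global_id(0);           \n"
    "    d[gid] = d[gid] * 2.0f;               \n"
    "}                                         \n";

int main(void) {
    float data[1024];
    for (int i = 0; i < 1024; i++) data[i] = (float)i;

    cl_platform_id platform;  cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    /* Context: devices, their memories and command queues. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    /* Command queue: how the host submits work (kernels, memory copies) to the device. */
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Program: collection of kernels; Kernel: the code run by each work-item. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);

    /* Global-memory buffer initialized from host memory. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(data), data, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);

    /* Submit 1024 work-items over a 1-D NDRange, then read the result back. */
    size_t global = 1024;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    printf("data[2] = %f\n", data[2]);   /* expected 4.0 */

    clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
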
OpenCL Memory Model

The OpenCL memory model defines the behavior and hierarchy of memory that can be used by
OpenCL applications. This hierarchical representation of memory is common across all OpenCL
implementations, but it is up to individual vendors to define how the OpenCL memory model
maps to specific hardware. This section defines the mapping used by SDAccel.

Host Memory

The host memory is defined as the region of system memory that is directly (and only) accessible
from the host processor. Any data needed by compute kernels must be transferred to and from
OpenCL device global memory using the OpenCL API.

Global Memory

The global memory is defined as the region of device memory that is accessible to both the
OpenCL host and device. Global memory permits read/write access to the host processor as well
as to all compute units in the device. The host is responsible for the allocation and de-allocation of
buffers in this memory space. There is a handshake between host and device over control of the
data stored in this memory. The host processor transfers data from the host memory space into
the global memory space. Then, once a kernel is launched to process the data, the host loses
access rights to the buffer in global memory. The device takes over and is capable of reading and
writing from the global memory until the kernel execution is complete. Upon completion of the
operations associated with a kernel, the device turns control of the global memory buffer back to
the host processor. Once it has regained control of a buffer, the host processor can read and write
data to the buffer, transfer data back to the host memory, and de-allocate the buffer.

Constant Global Memory

Constant global memory is defined as the region of system memory that is accessible with read
and write access for the OpenCL host and with read only access for the OpenCL device. As the
name implies, the typical use for this memory is to transfer constant data needed by kernel
computation from the host to the device.

Local Memory

Local memory is a region of memory that is local to a single compute unit. The host processor
has no visibility and no control on the operations that occur in this memory space. This memory
space allows read and write operations by all the processing elements within a compute unit. This
level of memory is typically used to store data that must be shared by multiple work-items.
Operations on local memory are un-ordered between work-items but synchronization and
consistency can be achieved using barrier and fence operations. In SDAccel, the structure of
local memory can be customized to meet the requirements of an algorithm or application.
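A hypothetical kernel (not from the original notes, and not SDAccel-specific) in which the work-items of a work-group share partial sums through local memory and order their accesses with barriers. It assumes the work-group size is a power of two and that the host sizes scratch via clSetKernelArg with a NULL pointer and the group size in bytes.

/* Each work-group reduces its slice of 'in' into one element of 'out'. */
__kernel void group_sum(__global const float* in,
                        __global float* out,
                        __local float* scratch)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];   /* stage one value per work-item          */
    barrier(CLK_LOCAL_MEM_FENCE);          /* make all writes visible to the group   */

    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);      /* re-synchronize after every step        */
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0]; /* one partial sum per work-group         */
}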

Private Memory

Private memory is the region of memory that is private to an individual work-item executing
within an OpenCL processing element. As with local memory, the host processor has no
visibility into this memory region. This memory space can be read from and written to by all
work-items, but variables defined in one work-item's private memory are not visible to another
work-item. In SDAccel, the structure of private memory can be customized to meet the
requirements of an algorithm or application.

For OpenCL devices implemented on an FPGA, the physical mapping of the OpenCL memory model is
the following:

• Host memory is any memory connected to the host processor only.
• Global and constant memories are any memory that is connected to the FPGA device.
These are usually memory chips (e.g. SDRAM) that are physically connected to the
FPGA device, but might also include distributed memories (e.g. BlockRAM) within the
FPGA fabric. The host processor has access to these memory banks through
infrastructure provided by the FPGA platform.
• Local memory is memory inside of the FPGA device. This memory is typically
implemented using registers or BlockRAMs in the FPGA fabric.
• Private memory is memory inside of the FPGA device. This memory is typically
implemented using registers or Block RAMs in the FPGA fabric.

5.3 Cilk++

Cilk™ Plus is the easiest, quickest way to harness the power of both multicore and vector
processing.

Intel Cilk Plus is an extension to the C and C++ languages to support data and task parallelism.

Primary features

a) High Performance

❖ An efficient work-stealing scheduler provides nearly optimal scheduling of parallel tasks
❖ Vector support unlocks the performance that's been hiding in your processors
❖ Powerful hyperobjects allow for lock-free programming

b) Easy to Learn

❖ Only 3 new keywords to implement task parallelism
❖ Serial semantics make understanding and debugging the parallel program easier
❖ Array Notations provide a natural way to express data parallelism

c) Easy to Use

❖ Automatic load balancing provides good behavior in multi-programmed environments
❖ Existing algorithms easily adapted for parallelism with minimal modification
❖ Supports both C and C++ programmers
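A small illustrative sketch (not from the original notes) that uses the three keywords (cilk_spawn, cilk_sync, cilk_for) plus one array-notation statement; it assumes a Cilk Plus-capable compiler such as an older Intel compiler or GCC built with -fcilkplus, and the array sizes are arbitrary.

#include <cilk/cilk.h>
#include <cstdio>

long fib(int n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);   // child may be stolen and run in parallel
    long y = fib(n - 2);              // parent continues with the other branch
    cilk_sync;                        // wait for the spawned child
    return x + y;
}

int main() {
    const int N = 1000;
    float a[N], b[N], c[N];

    cilk_for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }  // parallel loop

    c[0:N] = a[0:N] + b[0:N];         // array notation: data parallelism in one statement

    std::printf("fib(30) = %ld, c[5] = %f\n", fib(30), c[5]);
    return 0;
}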

5.4 Intel TBB

The Intel® Threading Building Blocks (Intel® TBB) library provides software developers with
a solution for enabling parallelism in C++ applications and libraries. The well-known advantage
of the Intel TBB library is that it makes parallel performance and scalability easily accessible to
software developers writing loop- and task-based applications. The library includes a number of
generic parallel algorithms, concurrent containers, support for dependency and data flow graphs,
thread local storage, a work-stealing task scheduler for task based programming, synchronization
primitives, a scalable memory allocator and more.
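As a hedged illustration (not from the original notes), the sketch below applies one of the library's generic algorithms, tbb::parallel_for, to a simple loop; the vector size and contents are arbitrary.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<float> v(1000000, 1.0f);

    // The work-stealing scheduler splits the blocked_range into chunks
    // and balances them across worker threads.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= 2.0f;
        });

    std::printf("v[42] = %f\n", v[42]);   // expected 2.0
    return 0;
}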

5.5 CUDA

CUDA is a parallel computing platform and an API model that was developed by Nvidia. Using
CUDA, one can utilize the power of Nvidia GPUs to perform general computing tasks, such as
multiplying matrices and performing other linear algebra operations, instead of just doing
graphical calculations. Using CUDA, developers can now harness the potential of the GPU for
general purpose computing (GPGPU).
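A minimal CUDA C++ sketch (not from the original notes) of the GPGPU idea described above: a vector-addition kernel launched across many GPU threads. Unified (managed) memory keeps the host code short; the older pattern of cudaMalloc plus cudaMemcpy is equally common.

#include <cuda_runtime.h>
#include <cstdio>

// Each CUDA thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // memory accessible from both CPU and GPU
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);   // launch the kernel on the GPU
    cudaDeviceSynchronize();                   // wait for the GPU to finish

    std::printf("c[0] = %f\n", c[0]);          // expected 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}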
Audience
Anyone who is unfamiliar with CUDA and wants to learn it at a beginner's level should read this
material, provided they complete the prerequisites. It can also be used by those who already
know CUDA and want to brush up on the concepts.
Prerequisites
The reader should be able to program in the C language and should have a machine with a
CUDA-capable card. Knowledge of computer architecture and microprocessors, though not
necessary, can come in extremely handy for understanding topics such as pipelining and memories.
