Concurrent and Parallel Programming - Unit V Notes
UNIT V: OpenMP, OpenCL, Cilk++, Intel TBB, CUDA
5.1 OpenMP
OpenMP is an Application Program Interface (API) that may be used to explicitly direct multi-threaded,
shared-memory parallelism. It is comprised of three primary API components:
o Compiler Directives
o Runtime Library Routines
o Environment Variables
Goals of OpenMP:
• Standardization:
o Provide a standard among a variety of shared memory architectures/platforms
o Jointly defined and endorsed by a group of major computer hardware and
software vendors
• Lean and Mean:
o Establish a simple and limited set of directives for programming shared memory
machines.
o Significant parallelism can be implemented by using just 3 or 4 directives.
o This goal is becoming less meaningful with each new release, apparently.
• Ease of Use:
o Provide the capability to incrementally parallelize a serial program, unlike message-passing
libraries, which typically require an all-or-nothing approach
o Provide the capability to implement both coarse-grain and fine-grain parallelism
• Portability:
o The API is specified for C/C++ and Fortran
o Public forum for API and membership
o Implementations exist on most major platforms, including Unix/Linux and Windows
OpenMP Programming Model
Explicit Parallelism:
• OpenMP is an explicit (not automatic) programming model, offering the programmer full
control over parallelization.
• Parallelization can be as simple as taking a serial program and inserting compiler
directives....
• Or as complex as inserting subroutines to set multiple levels of parallelism, locks and
even nested locks.
• All OpenMP programs begin as a single process: the master thread. The master thread
executes sequentially until the first parallel region construct is encountered.
• FORK: the master thread then creates a team of parallel threads.
• The statements in the program that are enclosed by the parallel region construct are then
executed in parallel among the various team threads.
• JOIN: When the team threads complete the statements in the parallel region construct,
they synchronize and terminate, leaving only the master thread.
• The number of parallel regions and the threads that comprise them are arbitrary.
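As an illustration of the fork-join model described above, here is a minimal C sketch (compiled with an OpenMP-aware compiler, for example gcc -fopenmp); the exact output order depends on the runtime:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        printf("master thread executes serially\n");

        /* FORK: a team of threads executes this block in parallel */
        #pragma omp parallel
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }   /* JOIN: threads synchronize here; only the master continues */

        printf("master thread continues serially\n");
        return 0;
    }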
Data Scoping:
• Because OpenMP is a shared memory programming model, most data within a parallel
region is shared by default.
• All threads in a parallel region can access this shared data simultaneously.
• OpenMP provides a way for the programmer to explicitly specify how data is "scoped" if
the default shared scoping is not desired.
• This topic is covered in more detail in the Data Scope Attribute Clauses section.
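A small hedged sketch of explicit data scoping: scale is shared, tmp is made private to each thread, and sum is combined across threads by a reduction clause (the loop index is private automatically):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 100;
        double sum = 0.0;
        double scale = 2.0;
        double tmp;

        /* scale is shared (read-only), tmp is private to each thread,
           and sum is combined across threads by the reduction clause. */
        #pragma omp parallel for shared(scale) private(tmp) reduction(+:sum)
        for (int i = 0; i < n; i++) {
            tmp = scale * i;      /* each thread writes its own copy of tmp */
            sum += tmp;
        }

        printf("sum = %f\n", sum);  /* 2 * (0 + 1 + ... + 99) = 9900 */
        return 0;
    }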
Nested Parallelism:
• The API provides for the placement of parallel regions inside other parallel regions.
• Implementations may or may not support this feature.
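A brief sketch of nested parallelism, assuming the implementation supports it; omp_set_nested and omp_set_max_active_levels are standard runtime routines for enabling and bounding nesting:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_nested(1);              /* enable nested parallelism (if supported) */
        omp_set_max_active_levels(2);   /* limit nesting to two active levels */

        #pragma omp parallel num_threads(2)
        {
            int outer = omp_get_thread_num();
            #pragma omp parallel num_threads(2)   /* inner parallel region */
            {
                printf("outer thread %d, inner thread %d\n",
                       outer, omp_get_thread_num());
            }
        }
        return 0;
    }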
Dynamic Threads:
• The API provides for the runtime environment to dynamically alter the number of threads
used to execute parallel regions, which is intended to promote more efficient use of
resources where possible.
• Implementations may or may not support this feature.
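A minimal sketch of the dynamic threads feature: with omp_set_dynamic(1) the runtime may give a parallel region fewer threads than the requested maximum.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_dynamic(1);   /* allow the runtime to adjust the team size */
        #pragma omp parallel
        {
            #pragma omp single
            printf("runtime chose %d threads (dynamic = %d)\n",
                   omp_get_num_threads(), omp_get_dynamic());
        }
        return 0;
    }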
I/O:
• OpenMP specifies nothing about parallel I/O. This is particularly important if multiple
threads attempt to write/read from the same file.
• If every thread conducts I/O to a different file, the issues are not as significant.
• It is entirely up to the programmer to ensure that I/O is conducted correctly within the
context of a multi-threaded program.
OpenMP API Overview
Three Components:
• The OpenMP API is comprised of three distinct components: compiler directives, runtime
library routines, and environment variables.
• Later API releases include the same three components, but increase the number of directives,
runtime library routines, and environment variables.
• The application developer decides how to employ these components. In the simplest case,
only a few of them are needed.
• Implementations differ in their support of all API components. For example, an
implementation may state that it supports nested parallelism, but the API makes it clear
that support may be limited to a single thread - the master thread - which may not be what
the developer expects.
Compiler Directives:
• Compiler directives appear as comments in your source code and are ignored by
compilers unless you tell them otherwise - usually by specifying the appropriate compiler
flag, as discussed in the Compiling section later.
• OpenMP compiler directives are used for various purposes:
o Spawning a parallel region
o Dividing blocks of code among threads
o Distributing loop iterations between threads
o Serializing sections of code
o Synchronization of work among threads
• Compiler directives have the following general syntax: sentinel directive-name [clause, ...]
o For example, in C/C++: #pragma omp parallel default(shared) private(beta,pi), where beta
and pi are illustrative program variables.
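As a hedged sketch of this syntax in use, the parallel for directive below distributes the loop iterations among the team's threads (the saxpy function is just an illustrative example):

    #include <omp.h>

    /* Work-sharing example: the parallel for directive distributes the
       loop iterations among the threads of the team. */
    void saxpy(int n, float a, float *x, float *y)
    {
        #pragma omp parallel for default(shared) schedule(static)
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }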
Run-Time Library Routines:
• The OpenMP API includes a number of run-time library routines, used for purposes such as
setting and querying the number of threads, querying a thread's unique identifier,
implementing locks, and wall-clock timing.
• Note that for C/C++, you usually need to include the <omp.h> header file.
• Fortran routines are not case sensitive, but C/C++ routines are.
• The run-time library routines are briefly discussed as an overview in the Run-Time
Library Routines section, and in more detail in Appendix A.
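A short sketch using a few common run-time library routines (thread count, processor count, and wall-clock timing):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_num_threads(4);                 /* request a team of 4 threads */
        printf("procs available : %d\n", omp_get_num_procs());
        printf("max threads     : %d\n", omp_get_max_threads());

        double t0 = omp_get_wtime();
        #pragma omp parallel
        {
            /* query routines are also callable from inside a parallel region */
            if (omp_get_thread_num() == 0)
                printf("team size       : %d\n", omp_get_num_threads());
        }
        printf("elapsed         : %f s\n", omp_get_wtime() - t0);
        return 0;
    }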
Environment Variables:
• OpenMP provides several environment variables for controlling the execution of parallel
code at run time, for example the number of threads (OMP_NUM_THREADS), the loop iteration
schedule (OMP_SCHEDULE), dynamic thread adjustment (OMP_DYNAMIC), and nested parallelism
(OMP_NESTED).
5.2 OpenCL
Open Computing Language (OpenCL) is a framework for writing programs that execute across
heterogeneous platforms, consisting of, for example, CPUs, GPUs, DSPs, and FPGAs.
OpenCL specifies a programming language (based on C99) for programming these devices and
application programming interfaces (APIs) to control the platform and execute programs on the
compute devices.
OpenCL provides a standard interface for parallel computing using task-based and data-based
parallelism.
OpenCL Models
Device Model
Global Memory: Shared by the whole device, but slow; it is persistent between kernel calls.
Constant Memory: Faster than global memory; use it for read-only data such as filter parameters.
Local Memory: Private to each compute unit, and shared by all of its processing elements.
Private Memory: Private to each individual processing element (work-item).
Constant, local, and private memory are scratch space, so data stored there cannot be saved
for use by other kernels.
Execution Model
OpenCL applications run on the host, which submits work to the compute devices.
The execution model defines how kernels execute at each point of a problem domain (an
N-dimensional index space); equivalently, it can be seen as the decomposition of a task into
work-items.
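As a hedged illustration of this execution model, the host sketch below builds a hypothetical "square" kernel from source, copies data into a global-memory buffer, and enqueues a one-dimensional NDRange of work-items, one per element (OpenCL C API, error checking omitted for brevity, link with -lOpenCL):

    #include <CL/cl.h>
    #include <stdio.h>

    static const char *src =
        "__kernel void square(__global float *data) {            \n"
        "    size_t gid = get_global_id(0);  /* one work-item per element */ \n"
        "    data[gid] = data[gid] * data[gid];                   \n"
        "}                                                        \n";

    int main(void)
    {
        enum { N = 1024 };
        float host_buf[N];
        for (int i = 0; i < N; i++) host_buf[i] = (float)i;

        cl_platform_id platform; cl_device_id device; cl_int err;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "square", &err);

        /* global memory buffer, visible to both host and device */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(host_buf), NULL, &err);
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof(host_buf), host_buf, 0, NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);

        /* launch N work-items over a one-dimensional index space */
        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(host_buf), host_buf, 0, NULL, NULL);
        printf("host_buf[10] = %f\n", host_buf[10]);   /* expect 100.0 */

        clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
        return 0;
    }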
OpenCL Memory Model
The OpenCL memory model defines the behavior and hierarchy of memory that can be used by
OpenCL applications. This hierarchical representation of memory is common across all OpenCL
implementations, but it is up to individual vendors to define how the OpenCL memory model
maps to specific hardware. This section defines the mapping used by SDAccel.
Host Memory
The host memory is defined as the region of system memory that is directly (and only) accessible
from the host processor. Any data needed by compute kernels must be transferred to and from
OpenCL device global memory using the OpenCL API.
Global Memory
The global memory is defined as the region of device memory that is accessible to both the
OpenCL host and device. Global memory permits read/write access to the host processor as well as
to all compute units in the device. The host is responsible for the allocation and de-allocation of
buffers in this memory space. There is a handshake between host and device over control of the
data stored in this memory. The host processor transfers data from the host memory space into
the global memory space. Then, once a kernel is launched to process the data, the host loses
access rights to the buffer in global memory. The device takes over and is capable of reading and
writing from the global memory until the kernel execution is complete. Upon completion of the
operations associated with a kernel, the device turns control of the global memory buffer back to
the host processor. Once it has regained control of a buffer, the host processor can read and write
data to the buffer, transfer data back to the host memory, and de-allocate the buffer.
Constant Global Memory
Constant global memory is defined as the region of system memory that is accessible with read
and write access for the OpenCL host and with read only access for the OpenCL device. As the
name implies, the typical use for this memory is to transfer constant data needed by kernel
computation from the host to the device.
Local Memory
Local memory is a region of memory that is local to a single compute unit. The host processor
has no visibility and no control on the operations that occur in this memory space. This memory
space allows read and write operations by all the processing elements within a compute unit. This
level of memory is typically used to store data that must be shared by multiple work-items.
Operations on local memory are un-ordered between work-items but synchronization and
consistency can be achieved using barrier and fence operations. In SDAccel, the structure of
local memory can be customized to meet the requirements of an algorithm or application.
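A hedged OpenCL C sketch of this use of local memory: each work-group stages its elements in a __local scratch buffer, synchronizes with barriers, and reduces them to one partial sum per group. The work-group size is assumed to be a power of two, and the host would size the scratch buffer with clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL):

    __kernel void wg_sum(__global const float *in,
                         __global float *partial,   /* one result per work-group */
                         __local float *scratch)    /* scratch local to the compute unit */
    {
        size_t lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);               /* make all stores visible to the group */

        for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }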
Private Memory
Private memory is the region of memory that is private to an individual work-item executing
within an OpenCL processing element. As with local memory, the host processor has no
visibility into this memory region. This memory space can be read from and written to by all
work-items, but variables defined in one work-item's private memory are not visible to another
work-item. In SDAccel, the structure of private memory can be customized to meet the
requirements of an algorithm or application.
For FPGA devices, the physical mapping of the OpenCL memory model onto the hardware is defined
by SDAccel.
5.3 Cilk++
Cilk™ Plus is the easiest, quickest way to harness the power of both multicore and vector
processing.
Intel Cilk Plus is an extension to the C and C++ languages to support data and task parallelism
(see the sketch after the feature list below).
Primary features:
• High performance
• Easy to learn
• Easy to use
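A minimal sketch of Cilk Plus task parallelism (cilk_spawn/cilk_sync) and data parallelism (cilk_for), assuming a Cilk Plus-capable compiler such as the Intel C/C++ compiler or GCC with -fcilkplus:

    #include <cilk/cilk.h>
    #include <stdio.h>

    /* Task parallelism: spawn one recursive call as an independent task. */
    long fib(int n)
    {
        if (n < 2) return n;
        long x = cilk_spawn fib(n - 1);   /* may run in parallel with the next line */
        long y = fib(n - 2);
        cilk_sync;                        /* wait for the spawned call to finish */
        return x + y;
    }

    int main(void)
    {
        int a[1000];
        /* Data parallelism: iterations of cilk_for may execute in parallel. */
        cilk_for (int i = 0; i < 1000; i++)
            a[i] = i * i;

        printf("fib(30) = %ld, a[10] = %d\n", fib(30), a[10]);
        return 0;
    }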
5.4 Intel TBB
The Intel® Threading Building Blocks (Intel® TBB) library provides software developers with
a solution for enabling parallelism in C++ applications and libraries. The well-known advantage
of the Intel TBB library is that it makes parallel performance and scalability easily accessible to
software developers writing loop- and task-based applications. The library includes a number of
generic parallel algorithms, concurrent containers, support for dependency and data flow graphs,
thread local storage, a work-stealing task scheduler for task based programming, synchronization
primitives, a scalable memory allocator and more.
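As a hedged illustration, here is one of the generic parallel algorithms mentioned above, tbb::parallel_for over a blocked_range; the work-stealing scheduler splits the range into chunks and runs them on worker threads:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <vector>
    #include <cstdio>

    int main()
    {
        std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
        const float a = 3.0f;

        // The scheduler splits the range into sub-ranges and assigns them
        // to worker threads; each sub-range runs the lambda body.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, x.size()),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    y[i] = a * x[i] + y[i];
            });

        std::printf("y[0] = %f\n", y[0]);   // expect 5.0
        return 0;
    }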
5.5 CUDA
CUDA is a parallel computing platform and an API model that was developed by Nvidia. Using
CUDA, one can utilize the power of Nvidia GPUs to perform general computing tasks, such as
multiplying matrices and performing other linear algebra operations, instead of just doing
graphical calculations. Using CUDA, developers can now harness the potential of the GPU for
general purpose computing (GPGPU).
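A minimal CUDA C++ sketch of the idea: a vector-addition kernel launched over enough thread blocks to cover the input (compiled with nvcc; error checking omitted for brevity):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread computes one element of the output vector.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
        for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements.
        int threads = 256, blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        std::printf("h_c[0] = %f\n", h_c[0]);   // expect 3.0

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        delete[] h_a; delete[] h_b; delete[] h_c;
        return 0;
    }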
Audience
Anyone who is unfamiliar with CUDA and wants to learn it at a beginner's level should read this
tutorial, provided they complete the prerequisites. It can also be used by those who already
know CUDA and want to brush up on the concepts.
Prerequisites
The reader should be able to program in the C language and should have access to a machine with
a CUDA-capable graphics card. Knowledge of computer architecture and microprocessors, though not
necessary, can be extremely helpful for understanding topics such as pipelining and memories.