Parallel Computing
Introduction to Parallel Computing

Lecturer: Dr. Peter Amoako-Yirenkyi
July 29, 2021

Prerequisites
1. The Linux kernel and shells
2. Navigating Linux files and directories
3. File ownership and permissions
4. Linux process management
5. Secure shell access and copy
After going through these lecture notes, you should be able to:
• know historical facts of parallel computing;
• explain the basic concepts of program, process, thread, concurrent execution, parallel execution and granularity;
• explain the need for parallel computing;
• describe the levels of parallel processing;
• describe parallel computer classification schemes;
• describe various applications of parallel computing.

Parallel computing² is a subject of interest in the computing community. The ever-growing size of databases and the increasing complexity of computations are putting great stress on single computers, and the computing community is looking for environments in which current computational capacity can be enhanced. This can be done by improving the performance of a single computer (uniprocessor system) and by parallel processing. In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem using multiple CPUs. A problem is normally broken into discrete parts that can be solved concurrently; each part is further broken down into a series of instructions, so that instructions from the different parts are executed simultaneously.

The most obvious solution is the introduction of multiple processors working in tandem to solve a given problem. The architecture used for this is advanced/parallel computer architecture, the algorithms are known as parallel algorithms, and the programming of these computers is known as parallel programming.
² The experiments with and first implementations of parallelism started in the 1950s at IBM; prior to this, only the conventional (uniprocessor) architecture (see Figure 1) was used. A serious approach toward designing parallel computers started with the development of ILLIAC IV in 1964. In 1969 the concept of pipelining was introduced in the CDC 7600, and in 1976 the CRAY-1, a pioneering effort in the development of vector registers, was developed by Seymour Cray.
Today, many applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. Examples include databases, data mining, oil exploration, web search engines, web-based business services, medical imaging and diagnosis, pharmaceutical design, financial and economic modelling, management of national and multi-national corporations, advanced graphics and virtual reality (particularly in the entertainment industry), networked video and multimedia technologies, and many other collaborative work environments.
The next generation of Cray, the Cray X-MP, was developed in 1982-84. It coupled multiple processors through a shared memory. In the 1980s Japan also started manufacturing high-performance supercomputers, with companies like NEC, Fujitsu and Hitachi as the main manufacturers.
Hardware improvements such as pipelining and superscalar execution did not scale well and required sophisticated compiler technology to exploit their performance. Techniques such as vector processing were therefore introduced and worked for certain kinds of problems. In addition, significant developments in networking technology paved the way for network-based, cost-effective parallel computing.
There are various parallel processing mechanisms that can be used to achieve parallelism in a uniprocessor system. These include:
• multiple functional units;
• parallelism and pipelining within the CPU;
• overlapped CPU and I/O operations;
• use of a hierarchical memory system;
• multiprogramming and time sharing.
Note that some of these parallel processing mechanisms are supported by the memory hierarchy, as in the simultaneous transfer of instructions/data between (CPU, cache) and (main memory, secondary memory).
Parallel Computers
Parallel computers are systems that emphasize parallel processing. Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. Computers may therefore be pipelined³ or non-pipelined.

• Pipelined computers perform overlapped computations to exploit temporal parallelism.

³ In pipeline computers, the instruction cycle of a digital computer involves four major steps (typically instruction fetch, decode, operand fetch and execution), which are overlapped for successive instructions.
Array Processor
An array processor is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), which can operate in parallel. An appropriate data-routing algorithm must be established among the PEs.
Multiprocessor System
This system achieves asynchronous parallelism through a set of interactive processors with shared resources
(memories, databases, etc.).
Data Parallelism
In data parallelism, the complete set of data is divided into multiple blocks and the same operations are applied to the blocks in parallel. Data parallelism is faster compared to temporal parallelism, and there is less inter-processor communication. In data parallelism, no synchronization is required between the counters (or processors), it is more tolerant of faults, and the work of one processor does not affect the others. Disadvantages of data parallelism include the following (a minimal sketch of the idea follows the list):
1. The task to be performed by each processor is decided in advance, i.e. the assignment of load is static.
2. It must be possible to break the input task into mutually exclusive tasks. Space is required for the counters, which calls for additional hardware and may be costly.
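As referenced above, here is a minimal sketch of data parallelism using mpi4py (assuming mpi4py and an MPI runtime are installed; the array contents and the run command below are illustrative):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# The root process holds the complete data set and divides it into blocks,
# one block per process (static assignment of load).
if rank == 0:
    data = np.arange(16.0)                 # illustrative data
    blocks = np.array_split(data, size)    # one block per processor
else:
    blocks = None

# Each process receives its own block.
block = comm.scatter(blocks, root=0)

# The same operation is applied to every block, in parallel.
local_sum = block.sum()

# Partial results are combined on the root process.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("Total:", total)

The script can be launched with, e.g., mpiexec -n 4 python data_parallel.py (the file name is illustrative); each rank applies the same operation to its own block of the data.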
PERFORMANCE EVALUATION
Some elementary concepts to consider are: Program, Process, Thread, Concurrent and Parallel Execution, Granularity, and Potential of Parallelism.
Process
Each process has a life cycle, which consists of creation, execution and termination phases. A process may create several new processes, which in turn may also create new processes.
Process scheduling involves three concepts: process state, state transition and scheduling policy.
Thread
A thread is a sequential flow of control within a process. A process can contain one or more threads. Threads have their own program counter and register values, but they share the memory space and other resources of the process.
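A minimal sketch of the fact that threads share the memory of their process, using the standard-library threading module (the worker function and counts are illustrative):

import threading

counter = 0                     # lives in the memory space of the process
lock = threading.Lock()         # protects the shared variable

def worker(n):
    global counter
    for _ in range(n):
        with lock:              # each thread has its own flow of control,
            counter += 1        # but all threads update the same memory

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                  # 40000: every thread wrote to the same variable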
Granularity
This refers to the amount of computation done in parallel relative to the size of the whole program. In
parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
performance attributes
The performance attributes are:
Cycle time (T): the unit of time for all the operations of a computer system. It is the inverse of the clock rate (1/f) and is expressed in seconds.
Cycles Per Instruction (CPI): different instructions take different numbers of cycles to execute; CPI is a measure of the number of cycles per instruction.
Instruction Count (IC): the number of instructions in a program. If we assume that all instructions take the same number of cycles, then the total execution time of a program is the product of the number of instructions in the program, the number of cycles required by one instruction, and the time of one cycle.
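As a small worked example of this relation (the numbers are illustrative, not measurements):

# Total execution time = instruction count (IC) x cycles per instruction (CPI) x cycle time (T)
IC = 2_000_000        # instructions in the program (illustrative)
CPI = 1.5             # average cycles per instruction (illustrative)
f = 2.0e9             # clock rate in Hz, so the cycle time is T = 1/f
T = 1.0 / f

execution_time = IC * CPI * T
print(f"Execution time: {execution_time * 1e3:.2f} ms")   # 1.50 ms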
Potential of Parallelism
Some problems can be parallelized easily. On the other hand, there are inherently sequential problems (such as the computation of the Fibonacci sequence) whose parallelization is nearly impossible. If processes do not share an address space and we can eliminate data dependency among instructions, we can achieve a higher level of parallelism.
Speed-up
The concept of speedup is used as a measure of the extent to which a sequential program can be sped up by parallelization (or other improvement):

    Speedup⁴ = (Execution time before improvement) / (Execution time after improvement)    (1)

Not all parts of a program can be improved to run faster. Essentially, the program to be analyzed consists of a part that will benefit from the improvement and another part that will not. This leads to the following expression for the execution time T of the program:

    T = (1 − a⁵) × T + a × T    (2)

Thus, the theoretical execution time of the whole task after the improvement, with improvement factor s⁶, is

    T(s) = (1 − a) × T + (a/s) × T    (3)

Substituting Eqs. (2) and (3) into Eq. (1) gives

    Speedup = 1 / (1 − a + a/s)    (4)

⁴ From Eq. (1), an increase in speedup results in a corresponding decrease in the execution time. From Eq. (2), the speedup only affects the part of the program that will benefit from the improvement.
⁵ Here a is the fraction of execution time that would benefit from the improvement.
⁶ In some instances, s is defined as the number of resources used to parallelize the part of the program that will benefit from the improvement.
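A short sketch of Eq. (4) as a function (the values of a and s below are illustrative):

def speedup(a, s):
    """Speedup from Eq. (4): a is the fraction of execution time that benefits
    from the improvement, s is the factor (or number of resources) by which
    that part is sped up."""
    return 1.0 / ((1.0 - a) + a / s)

# Even for very large s, the speedup is bounded by the serial fraction 1 - a:
print(speedup(0.9, 10))      # ~5.26
print(speedup(0.9, 1000))    # ~9.91, approaching the limit 1 / (1 - a) = 10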
Overhead

    Oh = N × T(s) − T    (9)

where N is the number of processors used.
PARALLEL COMPUTING
Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem: the problem is broken down into discrete parts, each consisting of a series of instructions, that can be solved concurrently.
Types Of Classification
The classification is based on:
1. The instruction and data streams.
2. The structure of computers.
3. How the memory is accessed.
4. The grain size.
FLYNN’S CLASSIFICATION
This classification was proposed by Michael Flynn in 1972. He introduced the concepts of instruction and data streams for categorizing computers.
The instruction cycle consists of a sequence of steps needed for the execution of an instruction in a program. The control unit fetches instructions one at a time. The fetched instruction is then decoded, the processor executes the decoded instruction, and the result of execution is temporarily stored in the Memory Buffer Register (MBR). Here,
1. the flow of instructions is called the instruction stream;
2. the flow of operands is called the data stream;
3. the flow of operands between processors and memory is bi-directional.
Based on the multiplicity of instruction streams and data streams observed by the CPU during program execution, Flynn's classification groups computers into four (4) categories:
1. Single Instruction and Single Data Stream (SISD).
2. Single Instruction and Multiple Data Stream (SIMD).
3. Multiple Instruction and Single Data Stream (MISD).
4. Multiple Instruction and Multiple Data Stream (MIMD).
SIMD architectures are well suited to structured, data-parallel computations (such as those in image processing). It is often necessary to selectively turn off operations on certain data items; for this reason, most SIMD programming architectures allow for an "activity mask", which determines whether a processor should participate in a computation or not.
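The idea of an activity mask can be illustrated with a NumPy analogy (array notation rather than an actual SIMD instruction set; the data and the predicate are illustrative):

import numpy as np

data = np.array([3.0, -1.0, 4.0, -2.0])
mask = data > 0                          # the "activity mask": which elements participate

# The same operation (squaring) is expressed over the whole array, but only
# the elements whose mask bit is set keep the result; the others are left as-is.
result = np.where(mask, data ** 2, data)
print(result)                            # [ 9. -1. 16. -2.]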
In contrast to SIMD processors, MIMD processors can execute different programs on different processors. A variant of this, called single program multiple data (SPMD), executes the same program on different processors. It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support. Examples of such platforms include current-generation Sun Ultra servers, SGI Origin servers, multiprocessor PCs and workstation clusters.
SIMD-MIMD Comparison
1. SIMD computers require less hardware than MIMD computers (single control unit)
2. However, since SIMD processors are specially designed, they tend to be expensive and have a long design
cycle. Not all applications are naturally suited to SIMD processors.
3. In contrast, platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf com-
ponents with relatively little effort in a short amount of time.
HANDLER’S CLASSIFICATION
In 1977, Handler proposed an elaborate notation for expressing the pipelining and parallelism of computers. Handler's classification addresses the computer at three distinct levels:
1. Processor control unit (PCU) — corresponds to the CPU.
2. Arithmetic logic unit (ALU) — corresponds to a functional unit or processing element.
3. Bit-level circuit (BLC) — corresponds to the logic circuit needed for one-bit operations in the ALU.
Structural Classification
A parallel computer (MIMD) can be characterized as a set of multiple processors and shared memory or memory modules communicating via an interconnection network.

Figure 10: Shared memory architecture.

When the processors communicate through global shared memory modules, the system is termed a shared memory computer or tightly coupled system. Shared memory multiprocessors have the following characteristics:
1. Every processor communicates with the others through the shared global memory.
2. For high-speed real-time processing, these systems are preferable, as their throughput is high compared to loosely coupled systems.
Tightly-coupled System
In a tightly coupled system organization, multiple processors share a global main memory, which may have many modules. The processors also have access to I/O devices. The inter-communication between processors, memory and other devices is implemented through various interconnection networks.

Figure 11: Tightly coupled system.
The following are types of Interconnection Networks associated with tightly coupled system:
1. Processor-Memory Interconnection Network (PMIN): This is a switch that connects various processors to
different memory modules.
2. Input-Output-Processor Interconnection Network (IOPIN): This interconnection network is used for com-
munication between processors and I/O channels.
3. Interrupt Signal Interconnection Network (ISIN): when a processor wants to send an interrupt to another processor, the interrupt first goes to the ISIN, through which it is passed to the destination processor. In this way, synchronization between processors is implemented by the ISIN.
To reduce the delay of accessing the shared memory through the interconnection network, every processor may use a cache memory for the frequent references made by that processor.

Models of Tightly-Coupled Systems

In the Uniform Memory Access (UMA) model, main memory is uniformly shared by all processors in the multiprocessor system and each processor has equal access time to the shared memory. This model is used for time-sharing applications in a multi-user environment. Uniform Memory Access is most commonly represented today by Symmetric Multiprocessor (SMP) machines: the processors are identical and have equal access, and equal access times, to memory. Such a machine is sometimes called CC-UMA (Cache Coherent UMA); cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
In some shared memory multiprocessor systems, a local memory is attached to every processor. The collection of all local memories forms the global memory being shared, and this global memory is distributed among the processors. In this case, access to a local memory is uniform for its corresponding processor, but if a reference is made to the local memory of some other (remote) processor, the access is not uniform: it depends on the location of the memory. Thus, not all memory words are accessed uniformly (this is the Non-Uniform Memory Access, NUMA, model).
Shared memory multiprocessor systems may also use cache memories with every processor to reduce the execution time of an instruction.
When every processor in a multiprocessor system has its own local memory and the processors communicate via messages transmitted between their local memories, the organization is called a distributed memory computer or loosely coupled system.
Each processor in a loosely coupled system has a large local memory (LM), which is not shared by any other processor. Such systems have multiple processors, each with its own local memory, and a set of I/O devices. Each set of processor, memory and I/O devices forms a computer system, so these systems are also called multi-computer systems. The computer systems are connected via a message-passing interconnection network, through which processes communicate by passing messages to one another. This system is also called a distributed multicomputer system.
The degree of parallelism achieved in executing a program depends on:
(a) the number and types of processors available, i.e. the architectural features of the host computer;
(b) the memory organization;
(c) the dependency of data, control and resources.
Data Dependency
Data dependency refers to the situation in which two or more instructions share the same data. The instructions in a program can be arranged based on their data-dependency relationships; the four cases are illustrated in the sketch after this list.
(a) Flow dependence: if instruction I2 follows I1 and the output of I1 becomes an input of I2, then I2 is said to be flow dependent on I1.
(b) Antidependence: this occurs when instruction I2 follows I1 such that the output of I2 overlaps with the input of I1 on the same data.
(c) Output dependence: when the outputs of the two instructions I1 and I2 overlap on the same data, the instructions are said to be output dependent.
(d) I/O dependence: when read and write operations by two instructions are invoked on the same file, it is a situation of I/O dependence.
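As referenced above, the four cases can be written as small instruction pairs (a sketch; the variable and file names are illustrative):

a, b, c, d = 1, 2, 3, 4   # illustrative operands

# Flow dependence: I2 reads the value that I1 writes.
x = a + b                 # I1
y = x * 2                 # I2 is flow dependent on I1

# Antidependence: I2 writes a variable that I1 reads.
y = x + 1                 # I1 reads x
x = 5                     # I2 writes x, so I2 is antidependent on I1

# Output dependence: both instructions write the same variable.
z = a + b                 # I1 writes z
z = c * d                 # I2 also writes z

# I/O dependence: two instructions read/write the same file.
with open("data.txt", "w") as fh:      # I1 writes the file
    fh.write("hello")
with open("data.txt") as fh:           # I2 reads the same file
    text = fh.read()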
Control Dependence
Instructions or segments in a program may contain control structures, and dependency among statements can arise in control structures as well. Since the order of execution in control structures is not known before run time, control-structure dependency among the instructions must be analyzed carefully.
Resource Dependence
The parallelism between instructions may also be affected by shared resources: if two instructions use the same shared resource, this is a resource dependency condition.
For instructions or blocks of instructions to execute in parallel, they should be independent of each other; as discussed above, instructions can be data dependent, control dependent or resource dependent on one another.
Instruction Level
This is the lowest level, and the degree of parallelism is highest here. Fine grain size is used at the instruction level, and it may vary according to the type of program; for example, for scientific applications the instruction-level grain size may be higher. Since a higher degree of parallelism can be achieved at this level, the overhead for the programmer is also greater.
Loop Level
This is another level of parallelism, in which iterative loop instructions can be parallelized. Fine grain size is used at this level as well. Simple loops in a program are easy to parallelize, whereas recursive loops are more difficult. This type of parallelism can be achieved through compilers.
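A minimal sketch of loop-level parallelism, here using the standard-library multiprocessing pool rather than a parallelizing compiler (the loop body and iteration range are illustrative):

from multiprocessing import Pool

def body(i):
    # The loop body: each iteration depends only on its own index i,
    # so the iterations of "for i in range(8): body(i)" are independent.
    return i * i

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(body, range(8))    # iterations distributed to worker processes
    print(results)                            # [0, 1, 4, 9, 16, 25, 36, 49]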
Program Level
This is the last level, consisting of independent programs executed in parallel. Coarse grain size is used at this level, containing tens of thousands of instructions. Time sharing is achieved at this level of parallelism.
Message Passing
Message passing is the process of explicitly sending and receiving messages using point-to-point communication. During the lifetime of a program you might want to share data among processes at some point; a standard way of doing this on distributed Multiple Instruction Multiple Data (MIMD) systems is with the Message Passing Interface (MPI).
Examples
Example 2. (Make sure you have mpi4py installed.) Different ranks perform different operations on the same data:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # rank of this process within the communicator

a = 6.0
b = 3.0
if rank == 0:
    print(a + b)
if rank == 1:
    print(a * b)
if rank == 2:
    print(max(a, b))
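Assuming a working MPI installation, the example can be launched with, e.g., mpiexec -n 3 python example2.py (the file name is illustrative), so that ranks 0, 1 and 2 each print one of the three results.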
P2P Communication
Message passing⁹ basically involves a sender and a receiver. The example begins with the necessary imports (a complete sketch is given after the parameter descriptions below):

import numpy
from mpi4py import MPI

⁹ Note that: (1) the Send and Recv functions are referred to as blocking functions; (2) typically, a process that calls Recv blocks and waits until a matching message arrives.
The Send call takes:
• Comm: the communicator we wish to query
• buf: the data we wish to send
• dest: the rank of the destination process
• tag: a tag on your message

The Recv call takes:
• Comm: the communicator we wish to query
• buf: the buffer into which the data is received
• source: the rank of the source process
• tag: a tag on your message
• status: status of the object
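Putting the parameters together, a minimal sketch of blocking point-to-point communication between two processes (the array size, tag value and the file name in the run command are illustrative):

import numpy
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    buf = numpy.arange(10, dtype='d')
    comm.Send(buf, dest=1, tag=11)        # blocks until buf is safe to reuse
elif rank == 1:
    buf = numpy.empty(10, dtype='d')
    status = MPI.Status()
    comm.Recv(buf, source=0, tag=11, status=status)   # blocks until the message arrives
    print("rank 1 received from rank", status.Get_source(), ":", buf)

Run with, e.g., mpiexec -n 2 python p2p.py.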