Parallel Computing
Introduction to Parallel Computing

Lecturer: Dr. Peter Amoako-Yirenkyi
July 29, 2021

Prerequisites
1. The Linux kernel and shells
2. Navigating Linux files and directories
3. File ownership and permissions
4. Linux process management
5. Secure shell access and copy
After going through these lecture notes, you should be able to:
• know historical facts of parallel computing;
• explain the basic concepts of program, process, thread, concurrent execution, parallel execution and granularity;
• explain the need for parallel computing;
• describe the levels of parallel processing;
• describe parallel computer classification schemes;
• describe various applications of parallel computing.

Parallel computing² is a subject of interest in the computing community. The ever-growing size of databases and the increasing complexity of computations are putting great stress on single computers, and the computing community is looking for environments in which current computational capacity can be enhanced. This can be done by improving the performance of a single computer (uniprocessor system) and by parallel processing. In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem using multiple CPUs. A problem is normally broken into discrete parts that can be solved concurrently; each part is further broken down into a series of instructions, so that instructions from the different parts are executed simultaneously.

The most obvious solution is the introduction of multiple processors working in tandem to solve a given problem. The architecture used for this is advanced/parallel computer architecture, the algorithms are known as parallel algorithms, and the programming of these computers is known as parallel programming.
² The experiments with and first implementations of parallelism started in the 1950s at IBM; prior to this, only the conventional (uniprocessor) architecture (see Figure 1) was used. A serious approach toward designing parallel computers started with the development of ILLIAC IV in 1964. In 1969 the concept of pipelining was introduced in the CDC 7600, and in 1976 the CRAY-1, a pioneering effort in the development of vector registers, was developed by Seymour Cray.
Today, many applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. Examples include databases, data mining, oil exploration, web search engines, web-based business services, medical imaging and diagnosis, pharmaceutical design, financial and economic modelling, management of national and multi-national corporations, advanced graphics and virtual reality (particularly in the entertainment industry), networked video and multimedia technologies, and many other collaborative work environments.
The next generation of Cray, the Cray X-MP, was developed in 1982-84. It coupled multiple processors through a shared memory. In the 1980s Japan also started manufacturing high-performance supercomputers, with companies like NEC, Fujitsu and Hitachi as the main manufacturers.
Hardware improvements such as pipelining and superscalar execution did not scale well and required sophisticated compiler technology to exploit their performance. Techniques such as vector processing were therefore introduced and worked for certain kinds of problems. In addition, significant developments in networking technology paved the way for network-based, cost-effective parallel computing.
There are various parallel processing mechanisms that can be used to achieve parallelism in a uniprocessor system. These include:
• multiple functional units;
• parallelism and pipelining within the CPU;
• overlapped CPU and I/O operations;
• use of a hierarchical memory system;
• multiprogramming and time sharing.
Note that some of these parallel processing mechanisms are supported by the memory hierarchy, as in the simultaneous transfer of instructions/data between (CPU, cache) and (main memory, secondary memory).
Parallel Computers
Parallel computers are systems that emphasize parallel processing. Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. Computers may therefore be pipelined³ or non-pipelined.

• Pipelined computers perform overlapped computations to exploit temporal parallelism.

³ In pipeline computers, the instruction cycle of a digital computer involves four major steps (typically instruction fetch, decode, operand fetch and execution), which are overlapped for successive instructions.
Array Processor
An array processor is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), which can operate in parallel. An appropriate data-routing algorithm must be established among the PEs.
Multiprocessor System
This system achieves asynchronous parallelism through a set of interactive processors with shared resources
(memories, databases, etc.).
Data Parallelism
In data parallelism, the complete set of data is divided into multiple blocks and the same operations are applied to the blocks in parallel. Data parallelism is faster compared to temporal parallelism, and there is less inter-processor communication. In data parallelism, no synchronization is required between the counters (or processors), it is more tolerant of faults, and the work of one processor does not affect the others. Disadvantages of data parallelism include the following (a minimal sketch of the idea follows the list):
1. The task to be performed by each processor is decided in advance, i.e. the assignment of load is static.
2. It must be possible to break the input task into mutually exclusive tasks. Space is required for the counters, which calls for additional hardware and may be costly.
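As referenced above, here is a minimal sketch of data parallelism using mpi4py (assuming mpi4py and an MPI runtime are installed; the array contents and the run command below are illustrative):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# The root process holds the complete data set and divides it into blocks,
# one block per process (static assignment of load).
if rank == 0:
    data = np.arange(16.0)                 # illustrative data
    blocks = np.array_split(data, size)    # one block per processor
else:
    blocks = None

# Each process receives its own block.
block = comm.scatter(blocks, root=0)

# The same operation is applied to every block, in parallel.
local_sum = block.sum()

# Partial results are combined on the root process.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("Total:", total)

The script can be launched with, e.g., mpiexec -n 4 python data_parallel.py (the file name is illustrative); each rank applies the same operation to its own block of the data.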
PERFORMANCE EVALUATION
Some elementary concepts to consider are: Program, Process, Thread, Concurrent and Parallel Execution, Granularity, and Potential of Parallelism.
Process
Each process has a life cycle, which consists of creation, execution and termination phases. A process may create several new processes, which in turn may also create new processes.
Process scheduling involves three concepts: process state, state transition and scheduling policy.
Thread
A thread is a sequential flow of control within a process. A process can contain one or more threads. Threads have their own program counter and register values, but they share the memory space and other resources of the process.
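A minimal sketch of the fact that threads share the memory of their process, using the standard-library threading module (the worker function and counts are illustrative):

import threading

counter = 0                     # lives in the memory space of the process
lock = threading.Lock()         # protects the shared variable

def worker(n):
    global counter
    for _ in range(n):
        with lock:              # each thread has its own flow of control,
            counter += 1        # but all threads update the same memory

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                  # 40000: every thread wrote to the same variable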
Granularity
This refers to the amount of computation done in parallel relative to the size of the whole program. In
parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
performance attributes
The performance attributes are:
Cycle time (T): the unit of time for all the operations of a computer system. It is the inverse of the clock rate (1/f) and is expressed in seconds.
Cycles Per Instruction (CPI): different instructions take different numbers of cycles to execute; CPI is a measure of the number of cycles per instruction.
Instruction Count (IC): the number of instructions in a program. If we assume that all instructions take the same number of cycles, then the total execution time of a program is the product of the number of instructions in the program, the number of cycles required by one instruction, and the time of one cycle.
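As a small worked example of this relation (the numbers are illustrative, not measurements):

# Total execution time = instruction count (IC) x cycles per instruction (CPI) x cycle time (T)
IC = 2_000_000        # instructions in the program (illustrative)
CPI = 1.5             # average cycles per instruction (illustrative)
f = 2.0e9             # clock rate in Hz, so the cycle time is T = 1/f
T = 1.0 / f

execution_time = IC * CPI * T
print(f"Execution time: {execution_time * 1e3:.2f} ms")   # 1.50 ms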
Potential of Parallelism
Some problems can be parallelized easily. On the other hand, there are inherently sequential problems (such as the computation of the Fibonacci sequence) whose parallelization is nearly impossible. If processes do not share an address space and we can eliminate data dependency among instructions, we can achieve a higher level of parallelism.
Speed-up
The concept of speedup is used as a measure of the extent to which a sequential program can be sped up by parallelization (or other improvement):

    Speedup⁴ = (Execution time before improvement) / (Execution time after improvement)    (1)

Not all parts of a program can be improved to run faster. Essentially, the program to be analyzed consists of a part that will benefit from the improvement and another part that will not. This leads to the following expression for the execution time T of the program:

    T = (1 − a⁵) × T + a × T    (2)

Thus, the theoretical execution time of the whole task after the improvement, with improvement factor s⁶, is

    T(s) = (1 − a) × T + (a/s) × T    (3)

Substituting Eqs. (2) and (3) into Eq. (1) gives

    Speedup = 1 / (1 − a + a/s)    (4)

⁴ From Eq. (1), an increase in speedup results in a corresponding decrease in the execution time. From Eq. (2), the speedup only affects the part of the program that will benefit from the improvement.
⁵ Here a is the fraction of execution time that would benefit from the improvement.
⁶ In some instances, s is defined as the number of resources used to parallelize the part of the program that will benefit from the improvement.
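A short sketch of Eq. (4) as a function (the values of a and s below are illustrative):

def speedup(a, s):
    """Speedup from Eq. (4): a is the fraction of execution time that benefits
    from the improvement, s is the factor (or number of resources) by which
    that part is sped up."""
    return 1.0 / ((1.0 - a) + a / s)

# Even for very large s, the speedup is bounded by the serial fraction 1 - a:
print(speedup(0.9, 10))      # ~5.26
print(speedup(0.9, 1000))    # ~9.91, approaching the limit 1 / (1 - a) = 10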
Overhead

    Oh = N × T(s) − T    (9)

where N is the number of processors used.
PARALLEL COMPUTING
Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem: the problem is broken down into discrete parts, each consisting of a series of instructions, that can be solved concurrently.
Types Of Classification
The classification is based on:
1. The instruction and data streams.
2. The structure of computers.
3. How the memory is accessed.
4. The grain size.
FLYNN’S CLASSIFICATION
This classification was proposed by Michael Flynn in 1972. He introduced the concepts of instruction and data streams for categorizing computers.
The instruction cycle consists of a sequence of steps needed for the execution of an instruction in a program. The control unit fetches instructions one at a time. The fetched instruction is then decoded, the processor executes the decoded instruction, and the result of execution is temporarily stored in the Memory Buffer Register (MBR). Here,
1. the flow of instructions is called the instruction stream;
2. the flow of operands is called the data stream;
3. the flow of operands between processors and memory is bi-directional.
Based on the multiplicity of instruction streams and data streams observed by the CPU during program execution, Flynn's classification groups computers into four (4) categories:
1. Single Instruction and Single Data Stream (SISD).
2. Single Instruction and Multiple Data Stream (SIMD).
3. Multiple Instruction and Single Data Stream (MISD).
4. Multiple Instruction and Multiple Data Stream (MIMD).
SIMD architectures are well suited to structured, data-parallel computations (such as those in image processing). It is often necessary to selectively turn off operations on certain data items; for this reason, most SIMD programming architectures allow for an "activity mask", which determines whether a processor should participate in a computation or not.
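The idea of an activity mask can be illustrated with a NumPy analogy (array notation rather than an actual SIMD instruction set; the data and the predicate are illustrative):

import numpy as np

data = np.array([3.0, -1.0, 4.0, -2.0])
mask = data > 0                          # the "activity mask": which elements participate

# The same operation (squaring) is expressed over the whole array, but only
# the elements whose mask bit is set keep the result; the others are left as-is.
result = np.where(mask, data ** 2, data)
print(result)                            # [ 9. -1. 16. -2.]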
In contrast to SIMD processors, MIMD processors can execute different programs on different processors. A variant of this, called single program multiple data (SPMD), executes the same program on different processors. It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support. Examples of such platforms include current-generation Sun Ultra servers, SGI Origin servers, multiprocessor PCs and workstation clusters.
SIMD-MIMD Comparison
1. SIMD computers require less hardware than MIMD computers (single control unit)
2. However, since SIMD processors are specially designed, they tend to be expensive and have a long design
cycle. Not all applications are naturally suited to SIMD processors.
3. In contrast, platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf com-
ponents with relatively little effort in a short amount of time.
HANDLER’S CLASSIFICATION
In 1977, Handler proposed an elaborate notation for expressing the pipelining and parallelism of computers. Handler's classification addresses the computer at three distinct levels:
1. Processor control unit (PCU) — corresponds to the CPU.
2. Arithmetic logic unit (ALU) — corresponds to a functional unit or processing element.
3. Bit-level circuit (BLC) — corresponds to the logic circuit needed for one-bit operations in the ALU.
Structural Classification
A parallel computer (MIMD) can be characterized as a set of multiple processors and shared memory or memory modules communicating via an interconnection network.

Figure 10: Shared memory architecture.

When the processors communicate through global shared memory modules, the system is termed a shared memory computer or tightly coupled system. Shared memory multiprocessors have the following characteristics:
1. Every processor communicates with the others through the shared global memory.
2. For high-speed real-time processing, these systems are preferable, as their throughput is high compared to loosely coupled systems.
Tightly-coupled System
In a tightly coupled system organization, multiple processors share a global main memory, which may have many modules. The processors also have access to I/O devices. The inter-communication between processors, memory and other devices is implemented through various interconnection networks.

Figure 11: Tightly coupled system.
The following are types of Interconnection Networks associated with tightly coupled system:
1. Processor-Memory Interconnection Network (PMIN): This is a switch that connects various processors to
different memory modules.
2. Input-Output-Processor Interconnection Network (IOPIN): This interconnection network is used for com-
munication between processors and I/O channels.
3. Interrupt Signal Interconnection Network (ISIN): when a processor wants to send an interrupt to another processor, the interrupt first goes to the ISIN, through which it is passed to the destination processor. In this way, synchronization between processors is implemented by the ISIN.
To reduce the delay of accessing the shared memory through the interconnection network, every processor may use a cache memory for the frequent references made by that processor.

Models of Tightly-Coupled Systems

In the Uniform Memory Access (UMA) model, main memory is uniformly shared by all processors in the multiprocessor system and each processor has equal access time to the shared memory. This model is used for time-sharing applications in a multi-user environment. Uniform Memory Access is most commonly represented today by Symmetric Multiprocessor (SMP) machines: the processors are identical and have equal access, and equal access times, to memory. Such a machine is sometimes called CC-UMA (Cache Coherent UMA); cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
In some shared memory multiprocessor systems, a local memory is attached to every processor. The collection of all local memories forms the global memory being shared, and this global memory is distributed among the processors. In this case, access to a local memory is uniform for its corresponding processor, but if a reference is made to the local memory of some other (remote) processor, the access is not uniform: it depends on the location of the memory. Thus, not all memory words are accessed uniformly (this is the Non-Uniform Memory Access, NUMA, model).
Shared memory multiprocessor systems may also use cache memories with every processor to reduce the execution time of an instruction.
When every processor in a multiprocessor system has its own local memory and the processors communicate via messages transmitted between their local memories, the organization is called a distributed memory computer or loosely coupled system.
Each processor in a loosely coupled system has a large local memory (LM), which is not shared by any other processor. Such systems have multiple processors, each with its own local memory, and a set of I/O devices. Each set of processor, memory and I/O devices forms a computer system, so these systems are also called multi-computer systems. The computer systems are connected via a message-passing interconnection network, through which processes communicate by passing messages to one another. This system is also called a distributed multicomputer system.
The degree of parallelism achieved in executing a program depends on:
(a) the number and types of processors available, i.e. the architectural features of the host computer;
(b) the memory organization;
(c) the dependency of data, control and resources.
Data Dependency
Data dependency refers to the situation in which two or more instructions share the same data. The instructions in a program can be arranged based on their data-dependency relationships; the four cases are illustrated in the sketch after this list.
(a) Flow dependence: if instruction I2 follows I1 and the output of I1 becomes an input of I2, then I2 is said to be flow dependent on I1.
(b) Antidependence: this occurs when instruction I2 follows I1 such that the output of I2 overlaps with the input of I1 on the same data.
(c) Output dependence: when the outputs of the two instructions I1 and I2 overlap on the same data, the instructions are said to be output dependent.
(d) I/O dependence: when read and write operations by two instructions are invoked on the same file, it is a situation of I/O dependence.
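As referenced above, the four cases can be written as small instruction pairs (a sketch; the variable and file names are illustrative):

a, b, c, d = 1, 2, 3, 4   # illustrative operands

# Flow dependence: I2 reads the value that I1 writes.
x = a + b                 # I1
y = x * 2                 # I2 is flow dependent on I1

# Antidependence: I2 writes a variable that I1 reads.
y = x + 1                 # I1 reads x
x = 5                     # I2 writes x, so I2 is antidependent on I1

# Output dependence: both instructions write the same variable.
z = a + b                 # I1 writes z
z = c * d                 # I2 also writes z

# I/O dependence: two instructions read/write the same file.
with open("data.txt", "w") as fh:      # I1 writes the file
    fh.write("hello")
with open("data.txt") as fh:           # I2 reads the same file
    text = fh.read()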
Control Dependence
Instructions or segments in a program may contain control structures, and dependency among statements can arise in control structures as well. Since the order of execution in control structures is not known before run time, control-structure dependency among the instructions must be analyzed carefully.
Resource Dependence
The parallelism between instructions may also be affected by shared resources: if two instructions use the same shared resource, this is a resource dependency condition.
For instructions or blocks of instructions to execute in parallel, they should be independent of each other; as discussed above, instructions can be data dependent, control dependent or resource dependent on one another.
Instruction Level
This is the lowest level, and the degree of parallelism is highest here. Fine grain size is used at the instruction level, and it may vary according to the type of program; for example, for scientific applications the instruction-level grain size may be higher. Since a higher degree of parallelism can be achieved at this level, the overhead for the programmer is also greater.
Loop Level
This is another level of parallelism, in which iterative loop instructions can be parallelized. Fine grain size is used at this level as well. Simple loops in a program are easy to parallelize, whereas recursive loops are more difficult. This type of parallelism can be achieved through compilers.
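A minimal sketch of loop-level parallelism, here using the standard-library multiprocessing pool rather than a parallelizing compiler (the loop body and iteration range are illustrative):

from multiprocessing import Pool

def body(i):
    # The loop body: each iteration depends only on its own index i,
    # so the iterations of "for i in range(8): body(i)" are independent.
    return i * i

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(body, range(8))    # iterations distributed to worker processes
    print(results)                            # [0, 1, 4, 9, 16, 25, 36, 49]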
Program Level
This is the last level, consisting of independent programs executed in parallel. Coarse grain size is used at this level, containing tens of thousands of instructions. Time sharing is achieved at this level of parallelism.
Message Passing
Message passing is the process of explicitly sending and receiving messages using point-to-point communication. During the lifetime of a program you might want to share data among processes at some point; a standard way of doing this on distributed Multiple Instruction Multiple Data (MIMD) systems is with the Message Passing Interface (MPI).
Examples
Example 2. (Make sure you have mpi4py installed.) Different ranks perform different operations on the same data:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # rank of this process within the communicator

a = 6.0
b = 3.0
if rank == 0:
    print(a + b)
if rank == 1:
    print(a * b)
if rank == 2:
    print(max(a, b))
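Assuming a working MPI installation, the example can be launched with, e.g., mpiexec -n 3 python example2.py (the file name is illustrative), so that ranks 0, 1 and 2 each print one of the three results.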
P2P Communication
Message passing⁹ basically involves a sender and a receiver. The example begins with the necessary imports (a complete sketch is given after the parameter descriptions below):

import numpy
from mpi4py import MPI

⁹ Note that: (1) the Send and Recv functions are referred to as blocking functions; (2) typically, a process that calls Recv blocks and waits until a matching message arrives.
The Send call takes:
• Comm: the communicator we wish to query
• buf: the data we wish to send
• dest: the rank of the destination process
• tag: a tag on your message

The Recv call takes:
• Comm: the communicator we wish to query
• buf: the buffer into which the data is received
• source: the rank of the source process
• tag: a tag on your message
• status: status of the object
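Putting the parameters together, a minimal sketch of blocking point-to-point communication between two processes (the array size, tag value and the file name in the run command are illustrative):

import numpy
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    buf = numpy.arange(10, dtype='d')
    comm.Send(buf, dest=1, tag=11)        # blocks until buf is safe to reuse
elif rank == 1:
    buf = numpy.empty(10, dtype='d')
    status = MPI.Status()
    comm.Recv(buf, source=0, tag=11, status=status)   # blocks until the message arrives
    print("rank 1 received from rank", status.Get_source(), ":", buf)

Run with, e.g., mpiexec -n 2 python p2p.py.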