Chapter 2 - Parallel Algorithm Design
2.1 Parallel
Programming Models
4
Parallel Programming Models
• Overview
• Shared Memory Model
• Threads Model
• Message Passing Model
• Data Parallel Model
• Other Models
5
Overview
• There are several parallel programming models in
common use:
• Shared Memory
• Threads
• Message Passing
• Data Parallel
• Hybrid
• Parallel programming models exist as an abstraction
above hardware and memory architectures.
6
Overview
• Although it might not seem apparent, these models are
NOT specific to a particular type of machine or memory
architecture. In fact, any of these models can
(theoretically) be implemented on any underlying
hardware.
• Which model to use is often a combination of what is
available and personal choice. There is no "best" model,
although there certainly are better implementations of
some models over others.
• The following sections describe each of the models
mentioned above, and also discuss some of their actual
implementations.
7
Shared Memory Model
• In the shared-memory programming model, tasks share a common
address space, which they read and write asynchronously.
• Various mechanisms such as locks / semaphores may be used to control
access to the shared memory.
• An advantage of this model from the programmer's point of view is that
the notion of data "ownership" is lacking, so there is no need to specify
explicitly the communication of data between tasks. Program
development can often be simplified.
• An important disadvantage in terms of performance is that it becomes
more difficult to understand and manage data locality.
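• As an illustration only (not from the original slides), here is a minimal C sketch of the shared-memory model, using a Pthreads mutex as the lock that serializes access to a shared counter; the counter, loop count, and thread count are arbitrary choices:
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

long counter = 0;                               /* variable in the shared address space */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);              /* lock controls access to shared memory */
        counter++;                              /* asynchronous read-modify-write */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);         /* 4 * 100000 = 400000 */
    return 0;
}
• Compile with cc -pthread. Without the lock the final count would be unpredictable, which illustrates why explicit synchronization is still needed even though no explicit communication of data is.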
8
Shared Memory Model:
Implementations
• On shared memory platforms, the native compilers
translate user program variables into actual memory
addresses, which are global.
• No common distributed memory platform
implementations currently exist.
9
Threads Model
• In the threads model of parallel programming, a single process can have multiple,
concurrent execution paths.
• Perhaps the simplest analogy for threads is a single program that includes a number of subroutines:
• The main program a.out is scheduled to run by the native operating system. a.out loads and
acquires all of the necessary system and user resources to run.
• a.out performs some serial work, and then creates a number of tasks (threads) that can be
scheduled and run by the operating system concurrently.
• Each thread has local data, but also shares all of the resources of a.out. This saves the overhead of replicating a program's resources for each thread. Each thread also benefits from a global memory view because it shares the memory space of a.out.
• A thread's work may best be described as a subroutine within the main program. Any thread can
execute any subroutine at the same time as other threads.
• Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that no two threads update the same global address at the same time.
• Threads can come and go, but a.out remains present to provide the necessary shared resources
until the application has completed.
• Threads are commonly associated with shared memory architectures and operating
systems.
10
Threads Model
11
Threads Model Implementations
• From a programming perspective, threads implementations commonly comprise:
• A library of subroutines that are called from within parallel source code
• A set of compiler directives embedded in either serial or parallel source code
• In both cases, the programmer is responsible for determining all parallelism.
• Threaded implementations are not new in computing. Historically, hardware vendors have implemented their own proprietary versions of threads. These implementations differed substantially from each other, making it difficult for programmers to develop portable threaded applications.
• Unrelated standardization efforts have resulted in two very different implementations of
threads: POSIX Threads and OpenMP.
• POSIX Threads
• Library based; requires parallel coding
• Specified by the IEEE POSIX 1003.1c standard (1995).
• C Language only
• Commonly referred to as Pthreads.
• Most hardware vendors now offer Pthreads in addition to their proprietary threads
implementations.
• Very explicit parallelism; requires significant programmer attention to detail.
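• A hedged sketch of the library-call style of threading: with Pthreads the programmer explicitly creates and joins the threads and hands each one its share of the work (the array, its size, and the thread count below are arbitrary choices):
#include <pthread.h>
#include <stdio.h>

#define N 1000
#define NTHREADS 4

double a[N];                                    /* shared by all threads of the process */

typedef struct { int lo, hi; } range_t;

void *scale(void *arg) {                        /* the subroutine each thread executes */
    range_t *r = (range_t *)arg;
    for (int i = r->lo; i < r->hi; i++)
        a[i] *= 2.0;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    range_t r[NTHREADS];
    for (int i = 0; i < N; i++) a[i] = i;

    int chunk = N / NTHREADS;
    for (int i = 0; i < NTHREADS; i++) {        /* the programmer determines all parallelism */
        r[i].lo = i * chunk;
        r[i].hi = (i == NTHREADS - 1) ? N : (i + 1) * chunk;
        pthread_create(&t[i], NULL, scale, &r[i]);
    }
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);

    printf("a[N-1] = %f\n", a[N - 1]);          /* 2 * 999 = 1998 */
    return 0;
}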
12
Threads Model: OpenMP
• OpenMP
• Compiler directive based; can use serial code
• Jointly defined and endorsed by a group of major computer hardware and
software vendors. The OpenMP Fortran API was released October 28, 1997.
The C/C++ API was released in late 1998.
• Portable / multi-platform, including Unix and Windows NT platforms
• Available in C/C++ and Fortran implementations
• Can be very easy and simple to use - provides for "incremental parallelism"
• Microsoft has its own implementation for threads, which is not related to
the UNIX POSIX standard or OpenMP.
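• A minimal sketch of the compiler-directive style: the loop below is ordinary serial C, and the single OpenMP pragma (ignored by a compiler without OpenMP support) is what provides the incremental parallelism mentioned above; the array and its size are arbitrary:
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N];

int main(void) {
    for (int i = 0; i < N; i++) a[i] = i;

    /* the only change to the otherwise serial code: one compiler directive */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] *= 2.0;

    printf("up to %d threads, a[N-1] = %f\n", omp_get_max_threads(), a[N - 1]);
    return 0;
}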
13
Message Passing Model
• The message passing model demonstrates the
following characteristics:
• A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines.
• Tasks exchange data through communications by sending
and receiving messages.
• Data transfer usually requires cooperative operations to
be performed by each process. For example, a send
operation must have a matching receive operation.
14
Message Passing Model
Implementations: MPI Standard
• From a programming perspective, message passing implementations commonly comprise a library of subroutines that are embedded in source code. The programmer is responsible for determining all parallelism.
• Part 1 of the Message Passing Interface (MPI) Standard was released in 1994.
Part 2 (MPI-2) was released in 1996.
• MPI is now the "de facto" industry standard for message passing, replacing
virtually all other message passing implementations used for production work.
Most, if not all of the popular parallel computing platforms offer at least one
implementation of MPI. A few offer a full implementation of MPI-2.
• For shared memory architectures, MPI implementations usually don't use a
network for task communications. Instead, they use shared memory (memory
copies) for performance reasons.
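• A hedged MPI sketch of the matching send/receive requirement described above: rank 0 sends a value that rank 1 explicitly receives (run with at least two processes; the tag and payload are arbitrary choices):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                            /* the send must be matched ...   */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                     /* ... by a corresponding receive */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}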
15
Data Parallel Model
• The data parallel model demonstrates the following characteristics:
• Most of the parallel work focuses on performing operations on a data set.
The data set is typically organized into a common structure, such as an array
or cube.
• A set of tasks works collectively on the same data structure; however, each task works on a different partition of it.
• Tasks perform the same operation on their partition of work, for example,
"add 4 to every array element".
• On shared memory architectures, all tasks may have access to the data
structure through global memory. On distributed memory architectures
the data structure is split up and resides as "chunks" in the local memory
of each task.
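• A sketch (not from the original slides) of the "add 4 to every array element" example in the distributed-memory flavor of the model, assuming MPI: each task holds only its own chunk of the data set and applies the same operation to it; the global size is an arbitrary choice that the number of tasks is assumed to divide:
#include <mpi.h>
#include <stdio.h>

#define GLOBAL_N 1024                           /* size of the full data set */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_n = GLOBAL_N / size;              /* assumes size divides GLOBAL_N */
    double chunk[GLOBAL_N];                     /* each task only touches its local_n entries */

    for (int i = 0; i < local_n; i++)
        chunk[i] = rank * local_n + i;          /* this task's partition of the data set */

    for (int i = 0; i < local_n; i++)           /* same operation applied to every partition */
        chunk[i] += 4.0;

    printf("task %d: first local element is now %g\n", rank, chunk[0]);
    MPI_Finalize();
    return 0;
}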
16
Data Parallel Model
17
Other Models
• Other parallel programming models besides those previously mentioned
certainly exist, and will continue to evolve along with the ever changing
world of computer hardware and software.
• Only three of the more common ones are mentioned here.
• Hybrid
• Single Program Multiple Data
• Multiple Program Multiple Data
18
Hybrid
• In this model, any two or more parallel programming models are
combined.
• Currently, a common example of a hybrid model is the combination of
the message passing model (MPI) with either the threads model (POSIX
threads) or the shared memory model (OpenMP). This hybrid model
lends itself well to the increasingly common hardware environment of
networked SMP machines.
• Another common example of a hybrid model is combining data parallel
with message passing. As mentioned in the data parallel model section
previously, data parallel implementations (F90, HPF) on distributed
memory architectures actually use message passing to transmit data
between tasks, transparently to the programmer.
19
Single Program Multiple Data
(SPMD)
• Single Program Multiple Data (SPMD):
• SPMD is actually a "high level" programming model that can be built
upon any combination of the previously mentioned parallel programming
models.
• A single program is executed by all tasks simultaneously.
• At any moment in time, tasks can be executing the same or different
instructions within the same program.
• SPMD programs usually have the necessary logic programmed into them
to allow different tasks to branch or conditionally execute only those
parts of the program they are designed to execute. That is, tasks do not
necessarily have to execute the entire program - perhaps only a portion of
it.
• All tasks may use different data
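• A minimal SPMD sketch, assuming MPI as the underlying model: every task runs the same executable, and the branch on the task id decides which portion of the program each task actually executes:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* same program, different task id */

    if (rank == 0) {
        printf("task 0: executing the coordination part of the program\n");
    } else {
        printf("task %d: executing the worker part of the program\n", rank);
    }

    MPI_Finalize();
    return 0;
}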
20
Multiple Program Multiple Data
(MPMD)
• Multiple Program Multiple Data (MPMD):
• Like SPMD, MPMD is actually a "high level"
programming model that can be built upon any
combination of the previously mentioned parallel
programming models.
• MPMD applications typically have multiple
executable object files (programs). While the
application is being run in parallel, each task can be
executing the same or different program as other
tasks.
• All tasks may use different data
21
Basic Parallel
Algorithms
22
Parallel Random Access
Machine
23
Random Access Machine Model of
Computation
• The Random Access Machine (RAM) model of sequential computation:
• Assume that the memory has M memory locations, where
M is a large (finite) number
• Accessing memory can be done in unit time.
• Instructions are executed one after another, with no
concurrent operations.
• The input size depends on the problem being studied and is
the number of items in the input
• The running time of an algorithm is the number of primitive
operations or steps performed.
24
Random Access Machine
25
PRAM (Parallel Random Access
Machine)
• PRAM is a natural generalization of the RAM sequential
model.
• Each of the p processors P0, P1, …, Pp-1 is identical to a RAM processor; the processors are often referred to as processing elements (PEs) or simply as processors.
• All processors can read or write to a shared global memory
in parallel (i.e., at the same time).
• The processors can also perform various arithmetic and
logical operations in parallel
• Running time can be measured in terms of the number of
parallel memory accesses an algorithm performs.
26
PRAM Properties
• There is an unbounded number of processors.
• All processors can access an unbounded shared memory.
• All processor’s execution steps are synchronized
• However, processors can run different programs.
• Each processor has a unique id, called the pid
• Processors can be instructed to do different things, based on their pid (e.g., if pid < 200, do this, else do that)
27
Parallel Random Access Machine
28
The PRAM Model
29
Types of PRAM
• Exclusive-Read Exclusive-Write
• Exclusive-Read Concurrent-Write
• Concurrent-Read Exclusive-Write
• Concurrent-Read Concurrent-Write
30
The PRAM Models
• PRAM models vary according to how they handle write conflicts.
• The models differ in how fast they can solve various problems.
• Exclusive Read, Exclusive Write (EREW)
• Only one processor is allowed to read from or write to the same memory cell during any one step
• Concurrent Read Exclusive Write (CREW)
• Concurrent Read Concurrent Write (CRCW)
• An algorithm that works correctly for EREW will also work
correctly for CREW and CRCW, but not vice versa
31
Types of PRAM
32
Assumptions
• There is no upper bound on the number of processors in the
PRAM model.
• Any memory location is uniformly accessible from any
processor.
• There is no limit on the amount of shared memory in the
system.
• Resource contention is absent.
• The algorithms designed for the PRAM model can be
implemented on real parallel machines, but will incur
communication cost.
• Since communication and resource costs vary on real machines, PRAM algorithms can be used to establish a lower bound on the running time of a problem.
33
More Details on the PRAM Model
• Both the memory size and the number of processors are unbounded.
• No direct communication between processors:
• they communicate via the shared memory.
• Every processor accesses any memory location in 1 cycle.
• Typically all processors execute the same algorithm in a synchronous fashion, although each processor can run a different program:
• READ phase
• COMPUTE phase
• WRITE phase
• Some subset of the processors can stay idle (e.g., even-numbered processors may not work while odd-numbered processors do, and conversely).
(Figure: processors P1, P2, P3, …, PN all connected to a shared memory.)
34
PRAM CW?
• What ends up being stored when multiple writes occur?
• Priority Conflict Resolution (PCR) CW: processors are assigned priorities, and the top-priority processor is the one whose write succeeds for each group write
• Equality Conflict Resolution (ECR) CW: if the values are equal, write that value
• Arbitrary Conflict Resolution (ACR) CW: if the values are not equal, write one of them, chosen arbitrarily
• Fail common CW: if the values are not equal, no write occurs
• Collision common CW: if the values are not equal, write a special "collision" value
• Fail-safe common CW: if the values are not equal, the algorithm aborts
• Random CW: non-deterministic choice of which value is written
• Combining CW: write the sum, average, max, min, etc. of the values
• etc.
• For a CRCW PRAM, one of the above concurrent-write rules is assumed. The rules can be ordered by strength, so that a model with a later (stronger) rule can simulate the earlier (weaker) ones.
35
2.2 Parallel Algorithm Complexity
36
Parallel algorithm evaluation
• The effectiveness of a parallel algorithm depends on 3 factors:
• Execution time.
• Number of processors involved.
• Architecture of the parallel machine.
• The number of processors and the machine architecture can be specified in advance.
• The execution time depends on the algorithm itself.
37
Complexity of Parallel Algorithm
• A parallel algorithm's computation time is measured from the moment the first processor starts its work until the last processor finishes.
• The computation time is measured by the number of operations performed during the computation (assuming every operation completes in the same unit of time).
38
Complexity of Parallel Algorithm (2)
• To evaluate complexity asymptotically, we use the following comparison notations:
• T(n) = O(f(n)) if there exist positive constants c and m such that T(n) ≤ c·f(n) for all n > m.
• T(n) = Ω(f(n)) if there exist positive constants c and m such that T(n) ≥ c·f(n) for all n > m.
• T(n) = Θ(f(n)) if there exist positive constants c1, c2 and m such that c1·f(n) ≤ T(n) ≤ c2·f(n) for all n > m.
• For example, T(n) = 3n + 5 = O(n), taking c = 4 and m = 5.
• Usually we are interested in the upper bound O(f(n)).
39
Pseudo Code
40
Syntax
FOR index = 1 TO N DO IN PARALLEL
... parallel task body
END PARALLEL
• Parallel loop:
• Like a serial loop, but with the keywords 'IN PARALLEL' added
• CPU i runs the loop body with the corresponding index value i
• N CPUs run the body in parallel
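• For comparison, the pseudocode loop above corresponds roughly to an OpenMP parallel for in C (a hedged mapping rather than an exact equivalence, since OpenMP schedules the iterations over a fixed pool of threads instead of one CPU per index value):
#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    /* FOR index = 1 TO N DO IN PARALLEL ... END PARALLEL */
    #pragma omp parallel for
    for (int index = 1; index <= N; index++)
        printf("iteration %d run by thread %d\n", index, omp_get_thread_num());
    return 0;
}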
41
Examples
42
Sum of 2 vectors
INPUT : two arrays A[1..n], B[1..n] in shared memory.
OUTPUT : array C[1..n] = A[1..n] + B[1..n] in shared memory.
BEGIN
FOR i = 1 TO n DO IN PARALLEL
X = A[i];
Y = B[i];
C[i] = X + Y;
END PARALLEL.
END;
• The i-th processor reads the values of A[i] and B[i] from shared memory into its local variables X, Y.
• C[i] is assigned the sum X + Y.
• Complexity: O(1).
• PRAM EREW.
43
Prob. BOOLEAN - AND
• Problem:
INPUT : A[1..n] OF BOOLEAN.
OUTPUT : RESULT = A[1] and A[2] and … and A[n].
• Serial Algorithm:
BEGIN
RESULT= TRUE;
FOR i = 1 TO n DO
RESULT= RESULT and A[i];
END FOR;
END.
44
Prob. BOOLEAN - AND
• Analysis:
• If all A[i] = TRUE, the result is TRUE.
• If there exists an A[i] = FALSE, the result is FALSE.
• Algorithm for PRAM ERCW :
• Using ECR for CW?
• Using PCR, ACR, ECR for CW?
• Algorithm for PRAM EREW?
45
Prob. BOOLEAN - AND
• Algorithm using ECR for CW (complexity O(1)):
BEGIN
RESULT = FALSE;
FOR i = 1 TO n DO IN PARALLEL
X = A[i];
RESULT = X;
END PARALLEL;
END.
• Algorithm using ECR, ACR, or PCR for CW (complexity O(1)):
BEGIN
RESULT = TRUE;
FOR i = 1 TO n DO IN PARALLEL
IF A[i] = FALSE THEN
RESULT = FALSE;
END IF;
END PARALLEL;
END.
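• On a real shared-memory machine the second algorithm above (the one that initializes RESULT to TRUE) can be approximated as follows (a sketch, assuming OpenMP; the atomic write stands in for the PRAM concurrent-write rule, and it is safe here because every writer stores the same value FALSE):
#include <stdio.h>
#include <stdbool.h>
#include <omp.h>

#define N 16

int main(void) {
    bool a[N], result = true;
    for (int i = 0; i < N; i++) a[i] = (i != 7);    /* one FALSE element */

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        if (!a[i]) {
            #pragma omp atomic write
            result = false;                         /* every writer stores the same value */
        }
    }

    printf("AND of all elements: %s\n", result ? "TRUE" : "FALSE");
    return 0;
}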
46
2.3 Basic Parallel Algorithms
47
2.3.1 Binary Tree Paradigm
• Also known as the balanced tree model.
• Characteristics:
• Each node represents an operation (e.g., an arithmetic operation).
• Nodes on the same level are executed in parallel.
• A node's inputs are the results of lower-level operations and are taken from shared memory.
• After execution, each node writes its result to the shared memory.
48
Example
• Sum-of-n-numbers problem (n = 2^k).
• Performed as a serial algorithm: O(n) with 1 processor.
• How can it be parallelized with multiple processors?
• The data are independent, so processor indices can be used to partition the data.
• Each addition combines only 2 numbers, so up to n/2 processors can execute the n/2 pairs in parallel.
• Where should each result be written so that the step can be repeated in the same way?
49
Example: Sum of 8 numbers
50
Example: Sum of 8 numbers
51
Sum of n-number problem
Idea: A[i] = A[2*i-1] + A[2*i] (i>0)
Maximum number of steps : log(n)
52
Sum of n-numbers
53
Sum of n-number problem
INPUT : A[1..n];
OUTPUT : SUM = ∑ A[i];
BEGIN
p = n/2;
WHILE p > 0 DO
FOR i = 1 TO p DO IN PARALLEL
A[i] = A[2i-1] + A[2i];
END PARALLEL;
p = p/2;
END WHILE;
END.
Performance evaluation:
Complexity: O(log(n)) with O(n) processors.
Machine used: PRAM EREW.
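• A hedged C/OpenMP rendering of the pseudocode above: the same in-place pairwise summation, with the parallel inner loop shrinking by half at each step; a temporary array imitates the PRAM's separate read and write phases so that no data race occurs on a real machine (the sample data and the size, a power of two, are arbitrary):
#include <stdio.h>
#include <omp.h>

#define N 8                                         /* n = 2^k */

int main(void) {
    /* 1-based arrays as in the pseudocode; index 0 is unused */
    double A[N + 1] = {0, 13, 5, 23, 7, 1, 18, 4, 6};
    double T[N + 1];

    for (int p = N / 2; p > 0; p /= 2) {            /* log2(n) sequential rounds */
        /* READ/COMPUTE phase: all pair sums formed in parallel */
        #pragma omp parallel for
        for (int i = 1; i <= p; i++)
            T[i] = A[2 * i - 1] + A[2 * i];
        /* WRITE phase: copy back, imitating the PRAM's synchronous write step */
        #pragma omp parallel for
        for (int i = 1; i <= p; i++)
            A[i] = T[i];
    }

    printf("sum = %g\n", A[1]);                     /* 77 for this sample data */
    return 0;
}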
54
More examples
• Problem Boolean-AND:
• Replace + operation with AND operation
• Done on ERCW : O(1).
• Done on EREW : O(log(n)).
• Scalar product of 2 vectors problem:
• There are 2 parallel steps:
• Multiply each pair in parallel with n processors.
• Sum the obtained results according to the balanced tree model.
55
Scalar product of 2 vectors problem
• Algorithm:
INPUT : A[1..n], B[1..n];
OUTPUT : RESULT= ∑(A[i]*B[i]);
BEGIN
FOR i = 1 TO n DO IN PARALLEL
C[i] = A[i] * B[i];
END PARALLEL;
FOR i = 1 TO log(n) DO
FOR j = 1 TO n/2^i DO IN PARALLEL
C[j] = C[j] + C[j + n/2^i];
END PARALLEL;
END FOR ;
END;
• Performance evaluation:
– Complexity: O(log n) with n processors
– Machine PRAM EREW.
56
2.3.2 Growing by Doubling
• As opposed to the previous technique:
• Balanced tree: the number of active processors is halved after each step.
• Doubling: the number of active processors is doubled after each step.
• Common scheme for both techniques:
• Specify the number of iterative steps (log n).
• Determine the number of processors and their specific indices at each iteration step.
• Define the work done by each processor in each step.
57
Example
• The "Broadcast" problem in PRAM.
• The problem is as follows:
• A PRAM EREW machine with n processors.
• P1 contains value x in its private memory.
• Write an algorithm that copies value x to all of the other processors.
58
“Broadcast” in PRAM
• The concept of information transmission in PRAM:
• Step 1: processor A writes value x to shared memory cell M.
• Step 2: processor B reads x from memory cell M.
• The simplest algorithm:
• Processor P1 writes value x to memory cell M.
• On a PRAM EREW, at any given moment only 1 processor may read a given memory cell.
• So the processors read the data in turn: O(n).
59
"Broadcast" in PRAM
• Parallel algorithm idea:
• Step 1: P1 writes value x to memory cell m1.
• Step 2: divided into 2 phases:
• P2 reads the data from m1, so P2 also has x.
• P2 writes x to memory cell m2.
• Step 3: divided into 2 phases:
• P3, P4 read data from m1, m2.
• P3, P4 write data to m3, m4.
• Step 4: divided into 2 phases:
• P5, …, P8 read data from m1, …, m4.
• P5, …, P8 write data to m5, …, m8.
• After each step the number of participating processors is doubled.
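• A hedged C/OpenMP sketch of the doubling idea, with a shared array m[] standing in for the PRAM memory cells: in each of the log(n) rounds the set of cells holding x doubles, and every round reads and writes disjoint cells, as an EREW PRAM requires (n is an arbitrary power of two):
#include <stdio.h>
#include <omp.h>

#define N 8                                         /* number of cells/processors, a power of 2 */

int main(void) {
    double m[N];                                    /* shared memory cells m[0..N-1] */
    int have = 1;                                   /* how many cells currently hold x */

    m[0] = 3.14;                                    /* step 1: P1 writes x into m1 */

    while (have < N) {                              /* log2(N) doubling rounds */
        /* cells [have .. 2*have) are written by reading the distinct
           cells [0 .. have), so all reads and writes are exclusive (EREW) */
        #pragma omp parallel for
        for (int i = 0; i < have; i++)
            m[have + i] = m[i];
        have *= 2;
    }

    printf("m[N-1] = %g\n", m[N - 1]);
    return 0;
}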
60
"Broadcast" in PRAM
• Performance evaluation:
– Complexity: O(log n) with O(n) processors.
– Machine PRAM EREW.
62
2.3.3 Pointer Jumping
• Used in many applications with dynamic structures (lists, trees, ...).
• Idea of the method:
• Consider 3 nodes in a list: A → B → C.
• Let R1, R2 be the 'job' function from A to B and from B to C, respectively.
• Then going from A to C can be expressed as:
R3 = R1 + R2
63
Example
• Assume that the input is a linked list of elements stored in an array in some order. The rank of each element in the list has to be computed.
64
The problem of determining the
rank
• Consider the example with 8 elements shown in the previous figure.
• Let LINK[i] be the index of the node that follows A[i]. For example, LINK[3] = 7 means that, in the linked list, A[7] comes right after A[3].
• LINK[i] = 0 means that A[i] is the last element of the linked list.
• The variable HEAD contains the index of the first element of the list.
• The rank of an array element is defined as its distance to the end of the list.
65
The problem of determining the rank
66
Serial Algorithm
INPUT : A[1..n], LINK[1..n], HEAD.
OUTPUT : RANK[1..n].
BEGIN
p = HEAD;
r = n;
RANK[p] = r;
REPEAT
p = LINK[p];
r = r – 1;
RANK[p] = r;
UNTIL LINK[p] = 0;
END.
• Performance evaluation:
• O(n) complexity with 1 processor
67
Parallel idea
• Initially we set NEXT[i] = LINK[i]; that is, each node sees only its immediate successor as its jumping destination.
• The rank of a node is determined by the distance it has jumped so far.
• In the following steps we set NEXT[i] = NEXT[NEXT[i]];
• and update the rank values of the elements accordingly.
• With each such step the jumping distance is doubled, so after log(n) steps NEXT[i] reaches the end of the list.
• When all NEXT[i] have reached the last element, the algorithm ends.
68
Parallel idea
69
Parallel idea
70
Parallel idea
71
Parallel idea
72
Parallel algorithm
INPUT : A[1..n], LINK[1..n], HEAD;
OUTPUT : RANK[1..n];
BEGIN
FOR i = 1 TO n DO IN PARALLEL
RANK[i] = 1;
NEXT[i] = LINK[i];
END PARALLEL;
FOR k = 1 TO log(n) DO
FOR i = 1 TO n DO IN PARALLEL
IF NEXT[i] <> 0 THEN
RANK[i] = RANK[i] + RANK[NEXT[i]];
NEXT[i] = NEXT[NEXT[i]];
END IF;
END PARALLEL;
END FOR;
END;
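• A hedged C/OpenMP version of the pointer-jumping algorithm above; as in the tree-sum sketch earlier, temporary arrays imitate the PRAM's synchronous read and write phases, and the sample list is an arbitrary choice:
#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    /* 1-based arrays; LINK[i] = 0 marks the end of the list.  */
    /* Sample list order: 4 -> 1 -> 6 -> 3 -> 7 -> 2 -> 8 -> 5 */
    int LINK[N + 1] = {0, 6, 8, 7, 1, 0, 3, 2, 5};
    int NEXT[N + 1], NEXT2[N + 1], RANK[N + 1], RANK2[N + 1];

    #pragma omp parallel for
    for (int i = 1; i <= N; i++) {
        RANK[i] = 1;
        NEXT[i] = LINK[i];
    }

    int steps = 0;                                  /* ceil(log2(N)) without math.h */
    for (int m = 1; m < N; m *= 2) steps++;

    for (int k = 1; k <= steps; k++) {
        /* read phase into temporaries, then write phase, so no element
           reads a value that another element is updating in the same round */
        #pragma omp parallel for
        for (int i = 1; i <= N; i++) {
            if (NEXT[i] != 0) {
                RANK2[i] = RANK[i] + RANK[NEXT[i]];
                NEXT2[i] = NEXT[NEXT[i]];
            } else {
                RANK2[i] = RANK[i];
                NEXT2[i] = 0;
            }
        }
        #pragma omp parallel for
        for (int i = 1; i <= N; i++) {
            RANK[i] = RANK2[i];
            NEXT[i] = NEXT2[i];
        }
    }

    for (int i = 1; i <= N; i++)
        printf("RANK[%d] = %d\n", i, RANK[i]);
    return 0;
}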
73
Parallel algorithm
• Performance evaluation:
• The algorithm is divided into 2 sequential parts:
• Part one: O(1) time units.
• Part two: O(log n) time units.
• Machine architecture:
• The statement RANK[i] = RANK[i] + RANK[NEXT[i]] can be split into 3 simple statements as follows:
• Step 1: X = RANK[i];
• Step 2: Y = RANK[NEXT[i]];
• Step 3: RANK[i] = X + Y.
• Because the processors execute these phases together, no memory cell is read and written simultaneously, so a PRAM EREW architecture with O(n) processors suffices.
74
2.3.4 Partitioning
• General idea:
• Divide a large problem into p small problems, where p is the number of processors allowed to execute simultaneously.
• Example: merging two sorted arrays (the merge step of merge sort)
• Let A[1..n] and B[1..n] be two sorted arrays. Merge these two arrays into an array C[1..2n] that is also sorted.
75
Serial algorithm
• Idea:
• Three indices i, j, k slide along the 3 arrays A, B, C respectively.
• Each value of array C is determined by taking the smaller of the two values A[i] and B[j].
• The indices are moved appropriately so that all elements of the 2 arrays A, B are passed through.
76
Serial algorithm
INPUT : A[1] ≤ A[2] ≤ … ≤ A[n] and B[1] ≤ B[2] ≤ … ≤ B[n].
OUTPUT : C[1..2n] = A[1..n] ∪ B[1..n] with C[1] ≤ C[2] ≤ … ≤ C[2n].
BEGIN
A[n+1] = ∞; B[n+1] = ∞ ;
i = 1; j = 1; k = 1;
WHILE k ≤ 2n DO
IF A[i] < B[j] THEN
C[k]= A[i];
i = i + 1;
ELSE
C[k]= B[j] ;
j = j + 1;
END IF;
k = k + 1;
END WHILE;
END.
• Complexity: O(n).
77
Parallel algorithm
• Using the partitioning technique:
• Divide array A into r = n / log(n) groups.
• Each group has k = log(n) elements. Assume k and r are integers.
• So we have the following groups:
• Group NA1: A[1], A[2], …, A[k];
• Group NA2: A[k+1], A[k+2], …, A[2k];
• …
• Group NAr: A[(r-1)k+1], A[(r-1)k+2], …, A[n];
78
Parallel algorithm
• Find r integers j[1], j[2], …, j[r] so that:
• j[1] is the largest index satisfying A[k] ≥ B[j[1]];
• j[2] is the largest index satisfying A[2k] ≥ B[j[2]];
• …
• j[r] is the largest index satisfying A[n] ≥ B[j[r]];
• Divide array B[1..n] into r+1 groups:
• Group NB1: B[1], B[2], …, B[j[1]];
• Group NB2: B[j[1]+1], B[j[1]+2], …, B[j[2]];
• …
• Group NBr: B[j[r-1]+1], B[j[r-1]+2], …, B[j[r]];
• Group NBr+1: B[j[r]+1], …, B[n];
79
Parallel algorithm
• Now we observe that:
• Every element of group NAi is no smaller than the elements of group NBi-1 and no larger than the elements of group NBi+1.
• So if we merge the elements of NAi and NBi separately, the resulting block can be placed directly into its own segment of array C, and C remains sorted.
80
Example of the resulting groups written into C:
Group 1 for C[1..9] = ( 1, 2, 3, 4, 5, 13, 15, 15, 18);
Group 2 for C[10..16] = (19, 19, 20, 21, 22, 23, 24);
Group 3 for C[17..22] = (27, 28, 29, 30, 31);
Group 4 for C[23..32] = (32, 37, 38, 41, 42, 43, 48, 49, 49);
81
Parallel algorithm
INPUT : A[1] ≤ A[2] ≤ … ≤ A[n] and B[1] ≤ B[2] ≤ … ≤ B[n].
OUTPUT : C[1..2n] = A[1..n] ∪ B[1..n] with C[1] ≤ C[2] ≤ … ≤ C[2n].
BEGIN
FOR i = 1 TO r DO IN PARALLEL
Pi :
READ(A[(i-1)k+1], …, A[ik]);
j[i] = MAX{ t : A[ik] ≥ B[t] };  (found by BINARY_SEARCH)
S_MERGE(A[(i-1)k+1], …, A[ik], B[j[i-1]+1], …, B[j[i]]);
END PARALLEL;
END.
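• A hedged C/OpenMP sketch of this partitioned merge; the sample data are an arbitrary choice (not the same as the example shown earlier), and the code follows the pseudocode above: each group finds its splitting index in B by binary search and then merges its own segments sequentially into the right slice of C:
#include <stdio.h>
#include <omp.h>

#define N 16                                        /* n, with k = log2(n) = 4 */
#define K 4                                         /* group size k            */
#define R (N / K)                                   /* number of groups r      */

/* largest t with B[t] <= key, or 0 if there is none; arrays are 1-based */
static int rank_in_B(const double *B, int n, double key) {
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (B[mid] <= key) lo = mid; else hi = mid - 1;
    }
    return lo;
}

/* ordinary sequential merge of A[a1..a2] and B[b1..b2] into C[c..] */
static void s_merge(const double *A, int a1, int a2,
                    const double *B, int b1, int b2, double *C, int c) {
    int i = a1, j = b1, k = c;
    while (i <= a2 && j <= b2) C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];
    while (i <= a2) C[k++] = A[i++];
    while (j <= b2) C[k++] = B[j++];
}

int main(void) {
    double A[N + 1] = {0, 1, 2, 3, 4, 13, 15, 15, 18,
                          27, 28, 29, 30, 32, 37, 38, 41};
    double B[N + 1] = {0, 5, 19, 19, 20, 21, 22, 23, 24,
                          31, 42, 43, 48, 49, 49, 50, 51};
    double C[2 * N + 1];
    int j[R + 1];
    j[0] = 0;

    /* each group finds its splitting index in B by binary search */
    #pragma omp parallel for
    for (int i = 1; i <= R; i++)
        j[i] = rank_in_B(B, N, A[i * K]);

    /* each group merges its A-segment with its B-segment independently;
       group i's output starts at position (i-1)k + j[i-1] + 1 of C */
    #pragma omp parallel for
    for (int i = 1; i <= R; i++)
        s_merge(A, (i - 1) * K + 1, i * K,
                B, j[i - 1] + 1, j[i],
                C, (i - 1) * K + j[i - 1] + 1);

    /* the leftover tail of B (group NBr+1) goes to the end of C */
    for (int t = j[R] + 1; t <= N; t++)
        C[N + t] = B[t];

    for (int i = 1; i <= 2 * N; i++) printf("%g ", C[i]);
    printf("\n");
    return 0;
}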
83
Thank you for your attention!
84