
ELE5211: Advanced Topics in CE

Module One – Parallel Computing


(Implicit Parallelism)

Tutor: Hassan A. Bashir


Intro: Motivating Parallelism
1- The Computational Power Argument
- From Transistor to FLOPs
Gordon Moore, 1965:
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000."

After 1975, Moore revised the rate of circuit complexity doubling to 18 months.
Intro: Motivating Parallelism
1- The Computational Power Argument
- From Transistor to FLOPs
Moore's Law
"Moore's Law states that the circuit complexity doubles every eighteen months."

Moore's empirical relationship has been amazingly resilient over the years, both for microprocessors and for DRAMs.

Consequence: The amount of computing power available at a given cost doubles every 18 months.
Moore’s Law

Intro: Motivating Parallelism
The Challenge – No. 1:

It is possible to fabricate devices with very large transistor counts.

How we use these transistors to achieve increasing rates of computation is a key architectural challenge.

Parallelism: A logical recourse is to rely on parallelism:
- Implicit
- Explicit
Intro: Motivating Parallelism
The Challenge – No. 2:

The Memory/Disk Speed Argument: The overall speed of computation is determined not just by the speed of the processor, but also by the ability of the memory system to feed data to it.

Bottleneck - Clock rates of high-end processors have increased at roughly 40% annually, but DRAM access times have only improved at about 10% annually.

Cache: The growing mismatch between processor speed and DRAM latency is typically bridged by a hierarchy of successively faster memory devices called caches.
Intro: Motivating Parallelism
The Challenge – No. 3:
The Data Communication Argument: As the networking infrastructure evolves, the vision of using the Internet as one large heterogeneous parallel/distributed computing environment has taken shape – the cloud.

Resource Constraints – In many applications there are constraints on the location of data and/or resources across the Internet.

Parallelism and Distributed Computing: Even if computational power is available, it is infeasible to collect the data at a central location.
Intro: Motivating Parallelism

Parallel Computing – Applications

Parallel computing has made a tremendous impact on a variety of areas, ranging from:
- Computational simulations for scientific and engineering applications, to
- Commercial applications in data mining and transaction processing.
Implicit Parallelism

Pipelining - This involves overlapping various stages in instruction execution: fetch, schedule, decode, operand fetch, execute, store, etc.

The Car Assembly Analogy
If the assembly of a car, taking 100 hours, can be broken into 10 pipelined stages of 10 hours each, a single assembly line can produce a car every 10 hours – a 10-fold speedup!

The speed of a single pipeline is ultimately limited by the largest atomic task in the pipeline.
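As a small check on this arithmetic, the sketch below (an illustration, not part of the original slides) computes the steady-state throughput of a pipeline from a list of hypothetical stage times; changing one stage to, say, 20 hours shows how the largest atomic task caps the speedup.

    #include <stdio.h>

    /* Steady-state throughput of a pipeline is set by its slowest (largest atomic) stage. */
    int main(void) {
        /* Hypothetical stage times in hours: a 100-hour job split into 10 equal stages. */
        double stage[10] = {10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
        int n = 10;

        double total = 0.0, slowest = 0.0;
        for (int i = 0; i < n; i++) {
            total += stage[i];
            if (stage[i] > slowest) slowest = stage[i];
        }

        /* Unpipelined: one car every `total` hours; pipelined: one every `slowest` hours. */
        printf("one car every %.0f hours, speedup = %.1f\n", slowest, total / slowest);
        return 0;
    }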
Implicit Parallelism
Superscalar Execution – Running multiple pipelines in which, during each clock cycle, multiple instructions are piped into the processor in parallel.
Such instructions are executed on multiple functional units.

Example
Consider a processor with two pipelines and the ability to simultaneously issue two instructions. Such processors are referred to as super-pipelined processors.

Note: The ability of a processor to execute multiple instructions in the same cycle is referred to as superscalar execution.
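To make the idea concrete, here is a small hypothetical fragment: the first two additions are independent of each other, so a dual-issue processor can place them on its two functional units in the same cycle.

    /* Two independent additions: a dual-issue superscalar processor can
       issue both onto separate functional units in the same cycle. */
    int independent_adds(int a, int b, int d, int e) {
        int c = a + b;   /* no dependence between these two instructions */
        int f = d + e;   /* can be co-issued with the previous add */
        return c + f;    /* this add must wait for both results */
    }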
Implicit Parallelism - Challenges
True Data Dependency
This is when the results of an instruction may be required by subsequent instructions.

This type of dependency must be resolved before simultaneous issue of instructions.

Two main implications:
- Resolution must be supported in hardware since it is done at runtime.
- The amount of instruction-level parallelism in a program is often limited and is a function of coding technique.
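In contrast to the fragment above, the hypothetical chain below has a true data dependency at every step: each addition reads the result of the previous one, so the instructions cannot be issued simultaneously.

    /* A true-dependence chain: each instruction needs the previous result,
       so the available instruction-level parallelism collapses to one. */
    int dependent_chain(int a, int b, int d) {
        int c = a + b;   /* must complete first */
        int e = c + d;   /* reads c: cannot issue alongside the previous add */
        int f = e + a;   /* reads e: serialized again */
        return f;
    }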
Implicit Parallelism - Challenges
Resource Dependency – When there are finite resources shared by various pipelines.

Example: Co-scheduling of two floating point operations on a dual-issue machine with a single floating point unit.
While there is no data dependency, the co-scheduling will not work since both operations need the floating point unit.

Branch or Procedural Dependency – This involves flow control through a program.
Example: Consider execution of a conditional branch instruction. Since the branch destination is known only at the point of execution, scheduling instructions a priori across branches may lead to errors.
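A minimal hypothetical illustration of branch (procedural) dependency: which assignment should execute is known only once the comparison is resolved at run time, so instructions cannot safely be scheduled across the branch in advance.

    /* Conditional branch: the instructions that follow depend on a value
       known only at run time, which limits static scheduling across the branch. */
    int branch_example(int x, int a, int b) {
        int r;
        if (x > 0)       /* branch outcome resolved only during execution */
            r = a * 2;   /* issued only if the branch is taken */
        else
            r = b * 3;   /* issued only if the branch is not taken */
        return r;
    }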
Implicit Parallelism - Challenges
Waste Cycles – The performance of superscalar architectures is limited by the available instruction-level parallelism.

Example: Consider a cycle in which the floating point unit is idle. These are essentially wasted cycles from the point of view of the execution unit.

Vertical Waste – This refers to a particular cycle in which no instructions are issued on the execution units.

Horizontal Waste – This is when only part of the execution units are used during a cycle.

Homework: What are Very Long Instruction Word (VLIW) processors?


[Figure: example instruction code, its superscalar execution schedule, and clock-cycle utilization]
Limitations of Memory Performance

The effective performance of a program on a computer relies not just on the speed of the processor but also on the ability of the memory system to feed data to the processor.

Memory System – Suppose, at the logical level, a memory system (possibly with multiple levels of caches) takes in a request for a memory word and returns a block of data of size b containing the required word after l nanoseconds.

Latency – Here, l is referred to as the latency of the memory; and

Bandwidth – The rate at which data can be pumped from the memory to the processor determines the bandwidth of the memory system.
Example 1 – Effect of Latency on Performance
• Consider a processor operating at 1 GHz (1 ns clock);
• If it is connected to a DRAM with a latency of 100 ns (no caches);
• Assume the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns.
• The peak processor rating is therefore 4 GFLOPS.
• Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data.
• Computing the dot-product of two vectors on such a platform involves one multiply-add operation on a single pair of vector elements, i.e. each floating point operation requires one data fetch.
Example 1 (continued) – Effect of Latency on Performance
• It can be seen that the peak speed of this operation is limited to one floating point operation every 100 ns,
• That is, a speed of 10 MFLOPS – a very small fraction of the peak processor rating!

• Hence, there is a need for effective memory system performance in order to achieve high computation rates.
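The arithmetic behind the 10 MFLOPS figure can be checked with a short sketch; the numbers below are the ones assumed in the example (100 ns latency, one word per access, one floating point operation per fetched word).

    #include <stdio.h>

    /* Example 1: effective rate of a latency-bound dot product.
       One word arrives every 100 ns, and each word feeds one FLOP,
       so the FLOP rate is simply 1 / latency. */
    int main(void) {
        double latency_ns = 100.0;        /* DRAM latency, no cache */
        double flops_per_fetch = 1.0;     /* one FLOP per data fetch */

        double mflops = flops_per_fetch / latency_ns * 1000.0;  /* FLOPs per ns -> MFLOPS */
        printf("effective rate = %.0f MFLOPS (peak = 4000 MFLOPS)\n", mflops);
        return 0;
    }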
Improving Effective Memory Performance
Cache – This is a smaller and faster memory placed between the processor and the DRAM. It acts as low-latency, high-bandwidth storage.

Hit ratio – This is the fraction of data references satisfied by the cache.

Memory bound – A computation is memory bound when its effective rate is limited not by the CPU but by the rate at which data can be pumped into the CPU.

The performance of memory bound programs is critically affected by the cache hit ratio.
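One standard way to quantify the effect of the hit ratio (a textbook rule of thumb, not stated on the slide) is the average memory access time, which weights the cache and DRAM latencies by the hit ratio; the 1 ns and 100 ns figures below are borrowed from the surrounding examples.

    #include <stdio.h>

    /* Average memory access time for hit ratio h:
       t_avg = h * t_cache + (1 - h) * t_dram
       The latencies are illustrative values in the spirit of the examples. */
    int main(void) {
        double t_cache = 1.0;    /* ns, assumed cache latency */
        double t_dram  = 100.0;  /* ns, assumed DRAM latency  */

        for (double h = 0.80; h <= 1.0001; h += 0.05) {
            double t_avg = h * t_cache + (1.0 - h) * t_dram;
            printf("hit ratio %.2f -> average access time %.1f ns\n", h, t_avg);
        }
        return 0;
    }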
Example 2 – Effect of Cache on Memory System Performance
• Suppose the processor in the previous example is equipped with a cache of size 32 KB with a latency of 1 ns.
• Assume we wish to multiply two matrices A and B of dimensions (32x32) each.
• Assuming an ideal cache placement strategy, fetching the two matrices into the cache corresponds to fetching 2K words, which takes approximately 200 microseconds.
• From algebra, multiplying two n x n matrices takes 2n³ operations.
• This corresponds to 64K operations, which can be performed in 16K cycles (16 microseconds) at four instructions per cycle.
• Total computation time = load/store time + compute time = 200 + 16 = 216 microseconds.
• This corresponds to a peak performance of 64K operations / 216 microseconds ≈ 303 MFLOPS.
• This is more than 30 times the previous system; but it is still less than 10% of the peak processor performance.
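The sketch below simply replays the arithmetic of Example 2, using the rounded times quoted on the slide (200 microseconds to load the matrices, 16 microseconds to compute).

    #include <stdio.h>

    /* Example 2: matrix multiply with both operands resident in cache.
       The two matrices are fetched once from DRAM; the 2*n^3 operations
       then run out of the cache at the full issue rate of the processor. */
    int main(void) {
        double n = 32.0;
        double ops = 2.0 * n * n * n;      /* 2n^3 = 64K operations */
        double load_time_us = 200.0;       /* ~2K words at 100 ns each (rounded) */
        double compute_time_us = 16.0;     /* 64K ops at 4 per 1 ns cycle (rounded) */

        double total_us = load_time_us + compute_time_us;         /* 216 microseconds */
        printf("effective rate = %.0f MFLOPS\n", ops / total_us); /* ops per us = MFLOPS */
        return 0;
    }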
Improving Effective Memory Performance

Memory Bandwidth – This refers to the rate at which data can be moved between the processor and memory. It is determined by the bandwidth of the memory bus as well as the memory units.

Cache line – Memory bandwidth can be increased by increasing the size of the memory blocks. Suppose a single memory request returns a contiguous block of four words; this single unit of four words is called a cache line.

Typically, computers fetch 2 to 8 words together into the cache.
Example 3 – Effect of Block Size on Memory System Performance
• Suppose the system in Example 1 has its block size increased to four words.
• Then the processor can fetch a four-word cache line every 100 cycles.
• Assuming that the vectors are laid out linearly in memory, four multiply-add operations (eight FLOPs) can be performed in 200 cycles. This is because a single memory access fetches four consecutive words in the vector.
• This corresponds to one FLOP every 25 ns, a peak speed of 40 MFLOPS.

Note: Increasing the block size from one to four words did not change the latency of the memory system; however, it increased the bandwidth four-fold and accelerated the dot-product algorithm.
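Again the figure can be replayed in a few lines; the numbers are those assumed in Example 3, namely a 1 ns cycle and eight FLOPs for every 200 cycles of memory access.

    #include <stdio.h>

    /* Example 3: effect of block size (cache line width) on the dot product.
       Latency is unchanged, but each line now feeds four multiply-adds. */
    int main(void) {
        double cycles = 200.0;     /* memory access time for the four multiply-adds */
        double flops  = 8.0;       /* four multiply-adds on the fetched words */
        double ns_per_cycle = 1.0; /* 1 GHz clock, as in Example 1 */

        double ns_per_flop = cycles * ns_per_cycle / flops;    /* 25 ns */
        printf("one FLOP every %.0f ns = %.0f MFLOPS\n",
               ns_per_flop, 1000.0 / ns_per_flop);             /* 40 MFLOPS */
        return 0;
    }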
Important Assumptions for Programmers

The above example illustrates how increased bandwidth results in higher peak computation rates.

Assumption 1: Spatial Locality
The data layouts were assumed to be such that consecutive data words in memory were used by successive instructions – a computation-centric view point.

Assumption 2:
The computation is ordered so that successive computations require contiguous data – a data layout-centric view point.
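A standard illustration of these assumptions (not from the slides, but consistent with them) is the order in which a C program walks a two-dimensional array: C stores rows contiguously, so the row-wise loop touches consecutive words while the column-wise loop does not.

    #define N 1024

    /* Row-wise traversal: consecutive iterations touch consecutive memory
       words, so each fetched cache line is fully used (good spatial locality). */
    double sum_row_major(double a[N][N]) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-wise traversal: successive accesses are N words apart, so most
       of each fetched cache line is wasted (poor spatial locality). */
    double sum_col_major(double a[N][N]) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }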
Hiding Memory Latency
Imagine browsing the web during peak traffic hours. The lack of response from your browser can be alleviated using one of three simple approaches.

Approach 1: Pre-fetching
Anticipate which pages we are going to browse ahead of time and issue requests for them ahead of time.

Approach 2: Multithreading
Open multiple browsers (or tabs) and access different pages in each browser – so that while waiting for one page, one can be reading another.

Approach 3: Spatial Locality
Access a whole bunch of pages in one go – amortizing the latency across the various accesses.
Multithreading for Latency Hiding
Thread: A thread is a single stream of control in the flow of a program.

- Consider multiplying an n x n matrix a by a vector b to get c.

for (i = 0; i < n; i++)
    c[i] = dot_product(get_row(a, i), b);

- This code computes each element of c as the dot product of the corresponding row of a with the vector b.

Note: In the above code, each dot product is independent of the others and therefore represents a concurrent unit of execution.
Multithreading for Latency Hiding
The previous dot_product loop can be re-written as:

for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);

- The only difference is that we explicitly specify each instance of the dot-product computation as a thread.

- Thus, the first instance of this function accesses a pair of vector elements and waits for them. In the meantime, the second instance of this function can access two other vector elements in the next cycle, and so on.

- After l units of time (the memory latency), the first function instance gets back its data and performs the required computation; in subsequent cycles the remaining instances receive their data in turn.
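The create_thread call above is pseudocode; a minimal sketch of the same idea using POSIX threads might look as follows. The dot_product worker, the task structure, and the row-major matrix layout are illustrative assumptions, and a real program would cap the number of threads rather than spawn one per row.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct {            /* arguments handed to one worker thread */
        const double *row;      /* one row of the matrix a */
        const double *b;        /* the shared vector b */
        double *out;            /* where to store the resulting element of c */
        int n;                  /* vector length */
    } task_t;

    static void *dot_product(void *arg) {
        task_t *t = arg;
        double s = 0.0;
        for (int j = 0; j < t->n; j++)
            s += t->row[j] * t->b[j];
        *t->out = s;
        return NULL;
    }

    /* Compute c = a * b with one thread per row (a is n x n, row-major). */
    void matvec_threaded(const double *a, const double *b, double *c, int n) {
        pthread_t *tid = malloc(n * sizeof *tid);
        task_t *task = malloc(n * sizeof *task);
        for (int i = 0; i < n; i++) {
            task[i] = (task_t){ a + (size_t)i * n, b, &c[i], n };
            pthread_create(&tid[i], NULL, dot_product, &task[i]);
        }
        for (int i = 0; i < n; i++)
            pthread_join(tid[i], NULL);
        free(tid);
        free(task);
    }

While one thread waits on memory, another can run, which is exactly the latency-hiding effect described on this slide.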
Multithreading
Multithreading is predicated upon three assumptions:

• The memory system is capable of servicing multiple outstanding requests, and
• The processor is capable of switching threads at every cycle.
• It also requires the program to have an explicit specification of concurrency in the form of threads.

Note: Multithreaded machines are capable of hiding latency provided there is enough concurrency (threads) to keep the processor from idling.

Tradeoff between Concurrency and Latency!


Prefetching
Suppose a data item is loaded and used by a processor in a small time window.

If the load results in a cache miss, then the execution stalls.

A simple solution is to advance the load operation so that even if there is a cache miss, the data is likely to have arrived by the time it is used.

However, if the data has been overwritten between load and use, a fresh load is issued.

Note that this is no worse than the situation in which the load had not been advanced.

Many compilers aggressively try to advance loads to mask memory system latency.
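As an illustration of programmer- or compiler-directed prefetching, the sketch below uses GCC/Clang's __builtin_prefetch to request data a few iterations ahead of its use; the prefetch distance of 16 is an arbitrary assumption that would need tuning for a real machine.

    /* Software prefetching: ask for x[i + AHEAD] while working on x[i], so the
       load has (hopefully) completed by the time that element is needed.
       __builtin_prefetch is a GCC/Clang extension; it is a hint and may be ignored. */
    #define AHEAD 16

    double sum_with_prefetch(const double *x, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + AHEAD < n)
                __builtin_prefetch(&x[i + AHEAD], 0, 1);  /* read access, low temporal locality */
            s += x[i];
        }
        return s;
    }

As the next slide notes, this hides latency but does not reduce the total data that must be moved, so the bandwidth demand is unchanged.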
Multithreading and Prefetching Tradeoffs

• While it might seem that multithreading and prefetching solve all the problems related to memory system performance, they are critically impacted by the memory bandwidth.

• The bandwidth requirements of a multithreaded system may increase very significantly because of the smaller cache residency of each thread.

Note: Multithreading and prefetching only address the latency problem and may often exacerbate the bandwidth problem.
Way out?

Explicit Parallel Computing!
