
ELE5211: Advanced Topics in CE

Module One – Parallel Computing


(Implicit Parallelism)

Tutor: Hassan A. Bashir


Intro: Motivating Parallelism
1- The Computational Power Argument
- From Transistor to FLOPs
Gordon Moore, 1965:
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000."

After 1975, Moore revised the rate of circuit complexity doubling to 18 months.
Intro: Motivating Parallelism
1- The Computational Power Argument
- From Transistor to FLOPs
Moore's Law
"Moore's Law states that the circuit complexity doubles every eighteen months."

Moore's empirical relationship has been amazingly resilient over the years, both for microprocessors and for DRAMs.

Consequence: The amount of computing power available at a given cost doubles every 18 months.
Moore’s Law

Intro: Motivating Parallelism
The Challenge – No. 1:

It is possible to fabricate devices with very large transistor counts.

How we use these transistors to achieve increasing rates of computation is a key architectural challenge.

Parallelism: A logical recourse is to rely on parallelism:
- Implicit
- Explicit
Intro: Motivating Parallelism
The Challenge – No. 2:

The Memory/Disk Speed Argument: The overall speed of computation is determined not just by the speed of the processor, but also by the ability of the memory system to feed data to it.

Bottleneck - Clock rates of high-end processors have increased at roughly 40% annually, but DRAM access times have only improved at about 10% annually.

Cache: The growing mismatch between processor speed and DRAM latency is typically bridged by a hierarchy of successively faster memory devices called caches.
Intro: Motivating Parallelism
The Challenge – No. 3:
The Data Communication Argument: As the networking infrastructure evolves, the vision of using the Internet as one large heterogeneous parallel/distributed computing environment has taken shape – the cloud.

Resource Constraints – In many applications there are constraints on the location of data and/or resources across the Internet.

Parallelism and Distributed Computing: Even if computational power is available, it is infeasible to collect the data at a central location.
Intro: Motivating Parallelism

Parallel Computing – Applications

Parallel computing has made a tremendous impact on a variety of areas, ranging from:
- Computational simulations for scientific and engineering applications, to
- Commercial applications in data mining and transaction processing.
Implicit Parallelism

Pipelining - This involves overlapping various stages in instruction execution: fetch, schedule, decode, operand fetch, execute, store, etc.

The Car Assembly Analogy
If the assembly of a car, taking 100 hours, can be broken into 10 pipelined stages of 10 hours each, a single assembly line can produce a car every 10 hours – a 10-fold speedup!

The speed of a single pipeline is ultimately limited by the largest atomic task in the pipeline.
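As a small check on this arithmetic, the sketch below (an illustration, not part of the original slides) computes the steady-state throughput of a pipeline from a list of hypothetical stage times; changing one stage to, say, 20 hours shows how the largest atomic task caps the speedup.

    #include <stdio.h>

    /* Steady-state throughput of a pipeline is set by its slowest (largest atomic) stage. */
    int main(void) {
        /* Hypothetical stage times in hours: a 100-hour job split into 10 equal stages. */
        double stage[10] = {10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
        int n = 10;

        double total = 0.0, slowest = 0.0;
        for (int i = 0; i < n; i++) {
            total += stage[i];
            if (stage[i] > slowest) slowest = stage[i];
        }

        /* Unpipelined: one car every `total` hours; pipelined: one every `slowest` hours. */
        printf("one car every %.0f hours, speedup = %.1f\n", slowest, total / slowest);
        return 0;
    }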
Implicit Parallelism
Superscalar Execution – Running multiple pipelines in which, during each clock cycle, multiple instructions are piped into the processor in parallel.
Such instructions are executed on multiple functional units.

Example
Consider a processor with two pipelines and the ability to simultaneously issue two instructions. Such processors are referred to as super-pipelined processors.

Note: The ability of a processor to execute multiple instructions in the same cycle is referred to as superscalar execution.
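To make the idea concrete, here is a small hypothetical fragment: the first two additions are independent of each other, so a dual-issue processor can place them on its two functional units in the same cycle.

    /* Two independent additions: a dual-issue superscalar processor can
       issue both onto separate functional units in the same cycle. */
    int independent_adds(int a, int b, int d, int e) {
        int c = a + b;   /* no dependence between these two instructions */
        int f = d + e;   /* can be co-issued with the previous add */
        return c + f;    /* this add must wait for both results */
    }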
Implicit Parallelism - Challenges
True Data Dependency
This is when the results of an instruction may be required by subsequent instructions.

This type of dependency must be resolved before simultaneous issue of instructions.

Two main implications:
- Resolution must be supported in hardware since it is done at runtime.
- The amount of instruction-level parallelism in a program is often limited and is a function of coding technique.
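In contrast to the fragment above, the hypothetical chain below has a true data dependency at every step: each addition reads the result of the previous one, so the instructions cannot be issued simultaneously.

    /* A true-dependence chain: each instruction needs the previous result,
       so the available instruction-level parallelism collapses to one. */
    int dependent_chain(int a, int b, int d) {
        int c = a + b;   /* must complete first */
        int e = c + d;   /* reads c: cannot issue alongside the previous add */
        int f = e + a;   /* reads e: serialized again */
        return f;
    }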
Implicit Parallelism - Challenges
Resource Dependency – When there are finite resources shared by various pipelines.

Example: Co-scheduling of two floating point operations on a dual-issue machine with a single floating point unit.
While there is no data dependency, the co-scheduling will not work since both operations need the floating point unit.

Branch or Procedural Dependency – This involves flow control through a program.
Example: Consider execution of a conditional branch instruction. Since the branch destination is known only at the point of execution, scheduling instructions a priori across branches may lead to errors.
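A minimal hypothetical illustration of branch (procedural) dependency: which assignment should execute is known only once the comparison is resolved at run time, so instructions cannot safely be scheduled across the branch in advance.

    /* Conditional branch: the instructions that follow depend on a value
       known only at run time, which limits static scheduling across the branch. */
    int branch_example(int x, int a, int b) {
        int r;
        if (x > 0)       /* branch outcome resolved only during execution */
            r = a * 2;   /* issued only if the branch is taken */
        else
            r = b * 3;   /* issued only if the branch is not taken */
        return r;
    }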
Implicit Parallelism - Challenges
Waste Cycles – The performance of superscalar architectures is limited by the available instruction-level parallelism.

Example: Consider a cycle in which the floating point unit is idle. These are essentially wasted cycles from the point of view of the execution unit.

Vertical Waste – This refers to a particular cycle in which no instructions are issued on the execution units.

Horizontal Waste – This is when only part of the execution units are used during a cycle.

Homework: What are Very Long Instruction Word (VLIW) processors?


[Figure: example instruction code, its superscalar execution schedule, and clock-cycle utilization]
Limitations of Memory Performance

The effective performance of a program on a computer relies not just on the speed of the processor but also on the ability of the memory system to feed data to the processor.

Memory System – Suppose, at the logical level, a memory system (possibly with multiple levels of caches) takes in a request for a memory word and returns a block of data of size b containing the required word after l nanoseconds.

Latency – Here, l is referred to as the latency of the memory; and

Bandwidth – The rate at which data can be pumped from the memory to the processor determines the bandwidth of the memory system.
Example 1 – Effect of Latency on Performance
• Consider a processor operating at 1 GHz (1 ns clock);
• If it is connected to a DRAM with a latency of 100 ns (no caches);
• Assume the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns.
• The peak processor rating is therefore 4 GFLOPS.
• Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data.
• Computing the dot-product of two vectors on such a platform involves one multiply-add operation on a single pair of vector elements, i.e. each floating point operation requires one data fetch.
Example 1 (continued) – Effect of Latency on Performance
• It can be seen that the peak speed of this operation is limited to one floating point operation every 100 ns,
• That is, a speed of 10 MFLOPS – a very small fraction of the peak processor rating!

• Hence, there is a need for effective memory system performance in order to achieve high computation rates.
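The arithmetic behind the 10 MFLOPS figure can be checked with a short sketch; the numbers below are the ones assumed in the example (100 ns latency, one word per access, one floating point operation per fetched word).

    #include <stdio.h>

    /* Example 1: effective rate of a latency-bound dot product.
       One word arrives every 100 ns, and each word feeds one FLOP,
       so the FLOP rate is simply 1 / latency. */
    int main(void) {
        double latency_ns = 100.0;        /* DRAM latency, no cache */
        double flops_per_fetch = 1.0;     /* one FLOP per data fetch */

        double mflops = flops_per_fetch / latency_ns * 1000.0;  /* FLOPs per ns -> MFLOPS */
        printf("effective rate = %.0f MFLOPS (peak = 4000 MFLOPS)\n", mflops);
        return 0;
    }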
Improving Effective Memory Performance
Cache – This is a smaller and faster memory placed between the processor and the DRAM. It acts as low-latency, high-bandwidth storage.

Hit ratio – This is the fraction of data references satisfied by the cache.

Memory bound – A computation is memory bound when its effective rate is limited not by the CPU but by the rate at which data can be pumped into the CPU.

The performance of memory bound programs is critically affected by the cache hit ratio.
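One standard way to quantify the effect of the hit ratio (a textbook rule of thumb, not stated on the slide) is the average memory access time, which weights the cache and DRAM latencies by the hit ratio; the 1 ns and 100 ns figures below are borrowed from the surrounding examples.

    #include <stdio.h>

    /* Average memory access time for hit ratio h:
       t_avg = h * t_cache + (1 - h) * t_dram
       The latencies are illustrative values in the spirit of the examples. */
    int main(void) {
        double t_cache = 1.0;    /* ns, assumed cache latency */
        double t_dram  = 100.0;  /* ns, assumed DRAM latency  */

        for (double h = 0.80; h <= 1.0001; h += 0.05) {
            double t_avg = h * t_cache + (1.0 - h) * t_dram;
            printf("hit ratio %.2f -> average access time %.1f ns\n", h, t_avg);
        }
        return 0;
    }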
Example 2 – Effect of Cache on Memory System Performance
• Suppose the processor in the previous example is equipped with a cache of size 32 KB with a latency of 1 ns.
• Assume we wish to multiply two matrices A and B of dimensions (32x32) each.
• Assuming an ideal cache placement strategy, fetching the two matrices into the cache corresponds to fetching 2K words, which takes approximately 200 microseconds.
• From algebra, multiplying two n x n matrices takes 2n³ operations.
• This corresponds to 64K operations, which can be performed in 16K cycles (16 microseconds) at four instructions per cycle.
• Total computation time = load/store time + compute time = 200 + 16 = 216 microseconds.
• This corresponds to a peak performance of 64K operations / 216 microseconds ≈ 303 MFLOPS.
• This is more than 30 times the previous system; but it is still less than 10% of the peak processor performance.
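The sketch below simply replays the arithmetic of Example 2, using the rounded times quoted on the slide (200 microseconds to load the matrices, 16 microseconds to compute).

    #include <stdio.h>

    /* Example 2: matrix multiply with both operands resident in cache.
       The two matrices are fetched once from DRAM; the 2*n^3 operations
       then run out of the cache at the full issue rate of the processor. */
    int main(void) {
        double n = 32.0;
        double ops = 2.0 * n * n * n;      /* 2n^3 = 64K operations */
        double load_time_us = 200.0;       /* ~2K words at 100 ns each (rounded) */
        double compute_time_us = 16.0;     /* 64K ops at 4 per 1 ns cycle (rounded) */

        double total_us = load_time_us + compute_time_us;         /* 216 microseconds */
        printf("effective rate = %.0f MFLOPS\n", ops / total_us); /* ops per us = MFLOPS */
        return 0;
    }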
Improving Effective Memory Performance

Memory Bandwidth – This refers to the rate at which data can be moved between the processor and memory. It is determined by the bandwidth of the memory bus as well as the memory units.

Cache line – Memory bandwidth can be increased by increasing the size of the memory blocks. Suppose a single memory request returns a contiguous block of four words; this single unit of four words is called a cache line.

Typically, computers fetch 2 to 8 words together into the cache.
Example 3 – Effect of Block Size on Memory System Performance
• Suppose the system in Example 1 has its block size increased to four words.
• Then the processor can fetch a four-word cache line every 100 cycles.
• Assuming that the vectors are laid out linearly in memory, four multiply-add operations (eight FLOPs) can be performed in 200 cycles. This is because a single memory access fetches four consecutive words in the vector.
• This corresponds to one FLOP every 25 ns, a peak speed of 40 MFLOPS.

Note: Increasing the block size from one to four words did not change the latency of the memory system; however, it increased the bandwidth four-fold and accelerated the dot-product algorithm.
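Again the figure can be replayed in a few lines; the numbers are those assumed in Example 3, namely a 1 ns cycle and eight FLOPs for every 200 cycles of memory access.

    #include <stdio.h>

    /* Example 3: effect of block size (cache line width) on the dot product.
       Latency is unchanged, but each line now feeds four multiply-adds. */
    int main(void) {
        double cycles = 200.0;     /* memory access time for the four multiply-adds */
        double flops  = 8.0;       /* four multiply-adds on the fetched words */
        double ns_per_cycle = 1.0; /* 1 GHz clock, as in Example 1 */

        double ns_per_flop = cycles * ns_per_cycle / flops;    /* 25 ns */
        printf("one FLOP every %.0f ns = %.0f MFLOPS\n",
               ns_per_flop, 1000.0 / ns_per_flop);             /* 40 MFLOPS */
        return 0;
    }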
Important Assumptions for Programmers

The above example illustrates how increased bandwidth results in higher peak computation rates.

Assumption 1: Spatial Locality
The data layouts were assumed to be such that consecutive data words in memory were used by successive instructions – a computation-centric view point.

Assumption 2:
The computation is ordered so that successive computations require contiguous data – a data layout-centric view point.
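A standard illustration of these assumptions (not from the slides, but consistent with them) is the order in which a C program walks a two-dimensional array: C stores rows contiguously, so the row-wise loop touches consecutive words while the column-wise loop does not.

    #define N 1024

    /* Row-wise traversal: consecutive iterations touch consecutive memory
       words, so each fetched cache line is fully used (good spatial locality). */
    double sum_row_major(double a[N][N]) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-wise traversal: successive accesses are N words apart, so most
       of each fetched cache line is wasted (poor spatial locality). */
    double sum_col_major(double a[N][N]) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }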
Hiding Memory Latency
Imagine browsing the web during peak traffic hours. The lack of response from your browser can be alleviated using one of three simple approaches.

Approach 1: Pre-fetching
Anticipate which pages we are going to browse ahead of time and issue requests for them ahead of time.

Approach 2: Multithreading
Open multiple browsers (or tabs) and access different pages in each browser – so that while waiting for one page, one can be reading another.

Approach 3: Spatial Locality
Access a whole bunch of pages in one go – amortizing the latency across the various accesses.
Multithreading for Latency Hiding
Thread: A thread is a single stream of control in the flow of a program.

- Consider multiplying an n x n matrix a by a vector b to get c.

for (i = 0; i < n; i++)
    c[i] = dot_product(get_row(a, i), b);

- This code computes each element of c as the dot product of the corresponding row of a with the vector b.

Note: In the above code, each dot product is independent of the others and therefore represents a concurrent unit of execution.
Multithreading for Latency Hiding
The previous dot_product loop can be re-written as:

for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);

- The only difference is that we explicitly specify each instance of the dot-product computation as a thread.

- Thus, the first instance of this function accesses a pair of vector elements and waits for them. In the meantime, the second instance of this function can access two other vector elements in the next cycle, and so on.

- After l units of time (the memory latency), the first function instance gets back its data and performs the required computation; in subsequent cycles the remaining instances receive their data in turn.
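The create_thread call above is pseudocode; a minimal sketch of the same idea using POSIX threads might look as follows. The dot_product worker, the task structure, and the row-major matrix layout are illustrative assumptions, and a real program would cap the number of threads rather than spawn one per row.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct {            /* arguments handed to one worker thread */
        const double *row;      /* one row of the matrix a */
        const double *b;        /* the shared vector b */
        double *out;            /* where to store the resulting element of c */
        int n;                  /* vector length */
    } task_t;

    static void *dot_product(void *arg) {
        task_t *t = arg;
        double s = 0.0;
        for (int j = 0; j < t->n; j++)
            s += t->row[j] * t->b[j];
        *t->out = s;
        return NULL;
    }

    /* Compute c = a * b with one thread per row (a is n x n, row-major). */
    void matvec_threaded(const double *a, const double *b, double *c, int n) {
        pthread_t *tid = malloc(n * sizeof *tid);
        task_t *task = malloc(n * sizeof *task);
        for (int i = 0; i < n; i++) {
            task[i] = (task_t){ a + (size_t)i * n, b, &c[i], n };
            pthread_create(&tid[i], NULL, dot_product, &task[i]);
        }
        for (int i = 0; i < n; i++)
            pthread_join(tid[i], NULL);
        free(tid);
        free(task);
    }

While one thread waits on memory, another can run, which is exactly the latency-hiding effect described on this slide.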
Multithreading
Multithreading is predicated upon three assumptions:

• The memory system is capable of servicing multiple outstanding requests, and
• The processor is capable of switching threads at every cycle.
• It also requires the program to have an explicit specification of concurrency in the form of threads.

Note: Multithreaded machines are capable of hiding latency provided there is enough concurrency (threads) to keep the processor from idling.

Tradeoff between Concurrency and Latency!


Prefetching
Suppose a data item is loaded and used by a processor in a small time window.

If the load results in a cache miss, then the execution stalls.

A simple solution is to advance the load operation so that even if there is a cache miss, the data is likely to have arrived by the time it is used.

However, if the data has been overwritten between load and use, a fresh load is issued.

Note that this is no worse than the situation in which the load had not been advanced.

Many compilers aggressively try to advance loads to mask memory system latency.
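As an illustration of programmer- or compiler-directed prefetching, the sketch below uses GCC/Clang's __builtin_prefetch to request data a few iterations ahead of its use; the prefetch distance of 16 is an arbitrary assumption that would need tuning for a real machine.

    /* Software prefetching: ask for x[i + AHEAD] while working on x[i], so the
       load has (hopefully) completed by the time that element is needed.
       __builtin_prefetch is a GCC/Clang extension; it is a hint and may be ignored. */
    #define AHEAD 16

    double sum_with_prefetch(const double *x, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + AHEAD < n)
                __builtin_prefetch(&x[i + AHEAD], 0, 1);  /* read access, low temporal locality */
            s += x[i];
        }
        return s;
    }

As the next slide notes, this hides latency but does not reduce the total data that must be moved, so the bandwidth demand is unchanged.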
Multithreading and Prefetching Tradeoffs

• While it might seem that multithreading and prefetching solve all the problems related to memory system performance, they are critically impacted by the memory bandwidth.

• The bandwidth requirements of a multithreaded system may increase very significantly because of the smaller cache residency of each thread.

Note: Multithreading and prefetching only address the latency problem and may often exacerbate the bandwidth problem.
Way out?

Explicit Parallel Computing!
