Module 1 - Parallel Computing
Intro: Motivating Parallelism
The Challenge – No. 1:
Intro: Motivating Parallelism
The Challenge – No. 2:
Implicit Parallelism
Example
Consider a processor with two pipelines and the ability to
simultaneously issue two instructions. Such processors are
sometimes referred to as super-pipelined processors; the ability
to issue multiple instructions in the same cycle is called
superscalar execution.
[Figure: schedule of instructions across clock cycles on the two pipelines]
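To make the dual-issue idea concrete, here is a minimal C sketch (the
variables and values are illustrative, not from the slides) of two
independent operations that a two-way issue processor could dispatch
in the same clock cycle:

#include <stdio.h>

int main(void) {
    /* Two independent additions: a dual-issue processor can send
     * one down each pipeline in the same clock cycle, because
     * there are no data dependences between them. */
    double a = 1.0, b = 2.0, c = 3.0, d = 4.0;
    double s1 = a + b;    /* pipeline 1 */
    double s2 = c + d;    /* pipeline 2: independent, can co-issue */
    double r  = s1 + s2;  /* depends on both results, so it must wait */
    printf("%f\n", r);
    return 0;
}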
Limitations of Memory Performance
Improving Effective Memory Performance
Cache – A cache is a smaller and faster memory placed between the
processor and the DRAM. It provides low-latency, high-bandwidth
storage.
Example 2 – Effect of Cache on Memory System Performance
• Suppose the processor in the previous example is equipped with a
cache of size 32 KB with a latency of 1 ns.
• Assume we wish to multiply two matrices A and B of dimensions
(32x32) each.
• Assuming an ideal cache placement strategy, fetching the two matrices
into the cache corresponds to fetching 2K words, which takes
approximately 200 microseconds.
• From algebra, multiplying two n × n matrices takes 2n³ operations.
• For n = 32, this corresponds to 64K operations, which can be performed
in 16K cycles (16 microseconds) at four instructions per cycle.
• Total computation time = load/store time + compute time
= 200 + 16 = 216 microseconds.
• This corresponds to a peak performance of 64K operations / 216 µs
≈ 303 MFLOPS.
• This is more than 30 times the previous system, but it is still less
than 10% of the peak processor performance.
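As a sanity check on the arithmetic, the sketch below recomputes these
numbers, assuming the parameters implied by the examples (1 GHz clock,
100 ns DRAM latency per word, four instructions per cycle); the 303
MFLOPS figure comes from the rounded 200 + 16 = 216 µs total:

#include <stdio.h>

int main(void) {
    double words      = 2.0 * 32 * 32;        /* 2K words to fetch        */
    double load_us    = words * 0.1;          /* 0.1 us per word, ~205 us */
    double flops      = 2.0 * 32 * 32 * 32;   /* 2n^3 = 64K operations    */
    double compute_us = (flops / 4.0) * 1e-3; /* 16K cycles at 1 ns each  */
    double total_us   = load_us + compute_us;

    /* The slide rounds to 200 + 16 = 216 us, giving ~303 MFLOPS. */
    printf("total %.1f us, %.0f MFLOPS\n", total_us, flops / total_us);
    return 0;
}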
Improving Effective Memory Performance
Example 3 – Effect of Block Size on Memory System Performance
• Suppose the system in Example 1 has its block size increased to
four words.
• Then the processor can fetch a four-word cache line every 100
cycles.
• Assuming that the vectors are laid out linearly in memory, four
multiply-add operations (eight FLOPs) can be performed in 200 cycles,
because each 100-cycle memory access fetches four consecutive words
of a vector (one access for each of the two vectors).
• This corresponds to one FLOP every 25 ns, a peak speed of 40
MFLOPS.
Note: Increasing the block size from one to four words did not
change the latency of the memory system; however, it increased the
bandwidth four-fold and accelerated the dot-product algorithm.
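A dot product of the kind this example has in mind might look like the
C sketch below (vector length illustrative); the comments restate the
slide's timing argument:

#include <stdio.h>

#define N 1024

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* With four-word cache lines, one 100-cycle memory access
     * brings in four consecutive elements of a vector.  Two
     * accesses (one per vector) = 200 cycles buy four
     * multiply-adds (8 FLOPs): one FLOP per 25 ns at 1 GHz,
     * i.e., the 40 MFLOPS computed above. */
    double dot = 0.0;
    for (int i = 0; i < N; i++)
        dot += x[i] * y[i];

    printf("dot = %f\n", dot);
    return 0;
}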
Important Assumptions for Programmers
Assumption 2:
The computation is ordered so that successive computations
require contiguous data – a data layout-centric viewpoint.
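In C this assumption translates into traversal order, since
two-dimensional arrays are stored row-major. The sketch below (array
size illustrative) contrasts a contiguous row-wise sweep with a strided
column-wise sweep that wastes most of each fetched cache line:

#include <stdio.h>

#define N 512

static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Row-major traversal: successive iterations touch contiguous
     * addresses, so each fetched cache line is fully used. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-wise traversal of the same data: consecutive accesses
     * are N * sizeof(double) bytes apart, so most of each fetched
     * cache line goes unused. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}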
Hiding Memory Latency
Imagine browsing the web during peak traffic hours. The lack of response
from your browser can be alleviated using the following simple approaches.
Approach 1: Pre-fetching
Anticipate which pages you are going to browse and issue requests for
them ahead of time (a hardware sketch of this idea follows this list).
Approach 2: Multithreading
Open multiple browsers (or tabs) and access a different page in each,
so that while waiting for one page you can be reading another.
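In hardware terms, Approach 1 means issuing loads before the data is
needed. A minimal sketch, assuming a GCC- or Clang-compatible compiler
(__builtin_prefetch is their builtin; the prefetch distance of 8
elements is an illustrative choice):

#include <stdio.h>

#define N 4096
#define DIST 8   /* illustrative prefetch distance, in elements */

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double dot = 0.0;
    for (int i = 0; i < N; i++) {
        if (i + DIST < N) {
            /* Request data DIST iterations ahead so the fetch
             * overlaps with the current computation. */
            __builtin_prefetch(&x[i + DIST], 0, 1);
            __builtin_prefetch(&y[i + DIST], 0, 1);
        }
        dot += x[i] * y[i];
    }
    printf("dot = %f\n", dot);
    return 0;
}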
Multithreading for Latency Hiding
Thread: A thread is a single stream of control in the flow of a
program.
With pre-fetching, a load can be advanced so that the data arrives
before it is needed. However, if the data has been overwritten between
the load and its use, a fresh load is issued.
Note that this is no worse than the situation in which the load had not
been advanced.
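As a minimal illustration of latency hiding with threads (not code
from the slides), the POSIX-threads sketch below splits a sum across
two threads; while one thread stalls on a memory access, the processor
can make progress on the other. Compile with -pthread:

#include <pthread.h>
#include <stdio.h>

#define N 1000000

static double x[N];

struct range { int lo, hi; double sum; };

/* Each thread sums its own half of the array; a stall in one
 * thread leaves the other free to run. */
static void *partial_sum(void *arg) {
    struct range *r = arg;
    double s = 0.0;
    for (int i = r->lo; i < r->hi; i++)
        s += x[i];
    r->sum = s;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) x[i] = 1.0;

    struct range r[2] = { {0, N / 2, 0.0}, {N / 2, N, 0.0} };
    pthread_t t[2];
    for (int k = 0; k < 2; k++)
        pthread_create(&t[k], NULL, partial_sum, &r[k]);
    for (int k = 0; k < 2; k++)
        pthread_join(t[k], NULL);

    printf("sum = %f\n", r[0].sum + r[1].sum);
    return 0;
}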
Way out?