Parallel Computing Landscape (CS 526)
Department of Computer Science, The University of Lahore
The cache coherence problem
• Since we have private caches: how do we keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared by all the cores.
The cache coherence problem
Suppose variable x initially contains 15213.
[Figure: a multi-core chip with Cores 1–4, each with one or more levels of private cache; main memory holds x=15213]
The cache coherence problem
Core 1 reads x.
[Figure: Core 1's cache now holds x=15213; main memory holds x=15213]
The cache coherence problem
Core 2 reads x.
[Figure: the caches of Core 1 and Core 2 each hold x=15213; main memory holds x=15213]
The cache coherence problem
Core 1 writes to x, setting it to 21660.
[Figure: Core 1's cache now holds x=21660, Core 2's cache still holds x=15213; main memory still holds x=15213]
The cache coherence problem
Core 3 attempts to read x… and gets a stale copy.
[Figure: Core 1's cache holds x=21660 while Core 2's cache still holds x=15213; main memory holds x=15213. The caches contain inconsistent data, leading to unpredictable behavior.]
This problem has many solutions, known as cache coherence protocols.
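A minimal sketch of this scenario in software, assuming a Linux/POSIX system with C11 atomics (the thread roles and values mirror the slides and are illustrative only): one thread plays Core 1 and writes x, another plays Core 3 and reads it. The hardware coherence protocol is what keeps the cached copies consistent; the atomic operations only stop the compiler and CPU from reordering or register-caching the accesses.

```c
/* Sketch of the slides' scenario; compile with: gcc -pthread coherence.c */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static _Atomic int x = 15213;      /* the shared variable from the example */

static void *writer(void *arg) {
    (void)arg;
    atomic_store(&x, 21660);       /* "Core 1 writes to x" */
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    /* "Core 3 reads x": with coherent caches (plus the atomic access),
     * this observes either 15213 or 21660, never a torn value. */
    printf("reader saw x = %d\n", atomic_load(&x));
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, reader, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```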
The cache coherence problem
To address the cache coherence problem, various cache coherence protocols have been developed. Two common cache coherence protocols are:
• MESI Protocol: MESI stands for Modified, Exclusive, Shared, and Invalid.
• MOESI Protocol: MOESI extends the MESI protocol with an "Owned" state.
The cache coherence problem
• MESI Protocol: a widely used cache coherence protocol that defines a state for each cache line (data block) in a cache, allowing caches to coordinate reads and writes to maintain data consistency.
• MOESI Protocol: the added Owned state lets a cache keep a modified (dirty) line while other caches hold shared copies; the owning cache supplies the data to readers directly and defers the write-back to main memory, which improves performance.
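As a purely conceptual aid (not any vendor's hardware implementation), the protocol states can simply be listed; the Owned state is the one MOESI adds on top of MESI.

```c
/* Conceptual sketch only: the states a cache line can be in under
 * MESI, plus MOESI's extra Owned state. Real protocols live in
 * hardware; this enum just names the states for discussion. */
enum cache_line_state {
    INVALID,    /* line holds no valid data                               */
    SHARED,     /* clean copy; other caches may also hold it              */
    EXCLUSIVE,  /* clean copy; no other cache holds it                    */
    MODIFIED,   /* dirty copy; memory is stale, this cache must write back */
    OWNED       /* MOESI only: dirty but shared; this cache supplies it   */
};
```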
Programming for Multi-core
• Programmers must use threads or processes
• Spread the workload across multiple cores
• Write parallel algorithms
• OS will map threads/processes to cores
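A minimal sketch of this idea, assuming POSIX threads (the function and variable names are illustrative): the array sum below is split across four worker threads, and the OS scheduler maps them onto the available cores.

```c
/* Split a loop over an array across NTHREADS worker threads. */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double data[N];
static double partial[NTHREADS];

struct range { int lo, hi, id; };

static void *sum_range(void *arg) {
    struct range *r = arg;
    double s = 0.0;
    for (int i = r->lo; i < r->hi; i++)
        s += data[i];
    partial[r->id] = s;           /* each thread writes only its own slot */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];

    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    for (int t = 0; t < NTHREADS; t++) {
        r[t] = (struct range){ t * N / NTHREADS, (t + 1) * N / NTHREADS, t };
        pthread_create(&tid[t], NULL, sum_range, &r[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("total = %f\n", total);
    return 0;
}
```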
Thread safety very important
• Pre-emptive context switching: a context switch can happen AT ANY TIME
• True concurrency, not just uniprocessor time-slicing
• Concurrency bugs are exposed much faster with multi-core
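A minimal sketch of why this matters, assuming POSIX threads: without the mutex, the increment below is a read-modify-write that two cores can interleave, losing updates almost immediately on a multi-core machine; the mutex turns it into a proper critical section.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* critical section: one thread at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, increment, NULL);
    pthread_create(&b, NULL, increment, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld (expect 2000000)\n", counter);
    return 0;
}
```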
Assigning threads to the cores
• Each thread/process has an affinity mask
• Affinity mask specifies what cores the thread is
allowed to run on
• Different threads can have different masks
• Affinities are inherited across fork()
Affinity masks are bit vectors
• Example: 4-way multi-core, without SMT (see the sketch below)

    core 3  core 2  core 1  core 0
      1       1       0       1

• The process/thread is allowed to run on cores 0, 2 and 3, but not on core 1
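A minimal sketch of setting exactly this 1101 mask, assuming Linux and the sched_setaffinity(2) interface; any child created with fork() afterwards inherits the mask.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);            /* core 0 */
    CPU_SET(2, &mask);            /* core 2 */
    CPU_SET(3, &mask);            /* core 3: core 1 is deliberately left out */

    /* pid 0 means "the calling thread/process". */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now restricted to cores 0, 2 and 3\n");
    return 0;
}
```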
Affinity masks when multi-core and SMT combined

      core 3         core 2         core 1         core 0
   thread 1   0   thread 1   0   thread 1   0   thread 1   0
      1       1      0       0      1       0      1       1

• Core 2 can't run the process
• Core 1 can only use one of its simultaneous threads
Default Affinities
• Default affinity mask is all 1s:
all threads can run on all processors
• Then, the OS scheduler decides what threads run
on what core
• OS scheduler detects skewed workloads,
migrating threads to less busy
processors
Process migration is costly
• Need to restart the execution pipeline
• Cached data is invalidated
• The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core
• This is called soft affinity
Hard Affinities
• The programmer can prescribe her own affinities
(hard affinities)
• Rule of thumb: use the default scheduler unless there is a good reason not to
When to set your own affinities
• Two (or more) threads share data structures in memory
  – map them to the same core so that they can share the cache
• Real-time threads. Example: a thread running a robot controller
  – must not be context-switched, or else the robot can go unstable
  – dedicate an entire core just to this thread (see the sketch below)
[Image source: Sensable.com]
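A minimal sketch of a hard affinity for such a dedicated thread, assuming Linux and the GNU extension pthread_setaffinity_np; pinning the controller thread to core 3 is illustrative (a real setup would also keep other threads off that core via their own masks).

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *controller_loop(void *arg) {
    (void)arg;
    /* ... time-critical control loop would run here ... */
    return NULL;
}

int main(void) {
    pthread_t ctrl;
    cpu_set_t only_core3;

    pthread_create(&ctrl, NULL, controller_loop, NULL);

    CPU_ZERO(&only_core3);
    CPU_SET(3, &only_core3);                   /* dedicate core 3 to this thread */
    int err = pthread_setaffinity_np(ctrl, sizeof(only_core3), &only_core3);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);

    pthread_join(ctrl, NULL);
    return 0;
}
```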
Flynn’s Taxonomy
• Michael Flynn (from Stanford)
– Made a characterization of computer systems
which became known as Flynn’s Taxonomy
[Figure: a computer driven by instruction streams and data streams]
Multiple Processor Organization
Flynn’s Taxonomy:
1. Single Instruction, Single Data stream – SISD
2. Single Instruction, Multiple Data stream – SIMD
3. Multiple Instruction, Single Data stream – MISD
4. Multiple Instruction, Multiple Data stream – MIMD
1. Single Instruction, Single Data Stream – SISD
• Single processor
• Single instruction stream
• Data stored in single memory
• Example: Uni-processor Systems
[Diagram: a single instruction stream and a single data stream feed one processing unit]
2. Single Instruction, Multiple Data Stream - SIMD
• Single machine instruction
• Controls simultaneous execution
• Large Number of processing elements
• Each processing element has associated memory
• Each instruction executed on different set of data
by different processors
• Examples: GPUs
[Diagram: one instruction stream broadcast to multiple processing elements, each with its own data stream]
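A minimal sketch of the SIMD idea at the source level: the loop below applies the same multiply-add to many data elements, and with optimization enabled (e.g. gcc -O3) compilers typically map it onto SIMD instructions such as SSE/AVX or NEON, so one instruction processes several elements at once.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]: the same operation over different data. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```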
3. Multiple Instruction, Single Data Stream - MISD
• Sequence of data
• Transmitted to set of processors
• Each processor executes different instruction
sequence, using same Data
• Few examples: Systolic array Processors
[Diagram: multiple instruction streams applied by multiple processing elements to a single data stream]
4. Multiple Instruction, Multiple Data Stream- MIMD
• Set of processors
• Simultaneously execute different instruction
sequences
• Different sets of data
• Examples: Multi-cores, SMPs, Clusters
[Diagram: multiple processing elements, each with its own instruction stream and its own data stream]
MIMD - Overview
• General purpose processors
• Each can process all instructions necessary
• Further classified by method of processor
communication:
1. Via Shared Memory
2. Message Passing (Distributed Memory)
Taxonomy of Processor Architectures
Tightly Coupled – SMP
• Processors share memory
• Communicate via that shared memory
• Symmetric Multiprocessor (SMP)
– Single shared memory
– Shared bus to access memory
– Memory access time to given area of memory is
approximately the same for each processor
Symmetric Multiprocessors (SMPs)
– Two or more similar processors
– Processors share same memory and I/O
– Processors are connected by a bus or other internal
connection
– Memory access time is approximately the same for
each processor
– All processors share access to I/O
SMP Advantages
• Performance
– If some work can be done in parallel
• Availability
– Failure of a single processor does not halt the system
• Incremental growth
– User can enhance performance by adding additional
processors
• Scaling
– Vendors can offer range of products based on number
of processors
Block Diagram of Tightly Coupled Multiprocessor (SMP)
Symmetric Multiprocessor Organization
Multithreading and Chip Multiprocessors
• Instruction stream divided into smaller
streams called “threads”
• Executed in parallel
Definitions of Threads and Processes
• Process:
  – An instance of a program running on a computer
  – A unit of resource ownership:
    • a virtual address space to hold the process image
  – Process switch: switching the processor between processes
• Thread: a dispatchable unit of work within a process
  – Includes processor context (which includes the PC register and stack pointer) and a data area for its stack
  – Interruptible: the processor can turn to another thread
• Thread switch
  – Switching the processor between threads within the same process
  – Typically less costly than a process switch
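A minimal sketch of the distinction, assuming POSIX: a thread created with pthread_create shares the process's address space, while a child created with fork gets its own copy of it. The variable name `shared` is illustrative.

```c
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared = 0;

static void *thread_body(void *arg) {
    (void)arg;
    shared = 42;                 /* visible to the creating thread */
    return NULL;
}

int main(void) {
    /* Thread: shares this process's memory. */
    pthread_t t;
    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    printf("after thread:  shared = %d\n", shared);   /* prints 42 */

    /* Process: the child gets a private copy of memory. */
    shared = 0;
    if (fork() == 0) {
        shared = 99;             /* modifies the child's copy only */
        _exit(0);
    }
    wait(NULL);
    printf("after process: shared = %d\n", shared);   /* still 0 */
    return 0;
}
```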
Implicit and Explicit Multithreading
• All commercial processors use explicit multithreading:
  – Concurrently execute instructions from different explicit threads
  – Interleave instructions from different threads on shared pipelines OR execute them in parallel on parallel pipelines
• Implicit multithreading: concurrent execution of multiple threads extracted from a single sequential program:
  – Implicit threads defined statically by compiler or