1. Symmetric and Distributed Shared Memory Architectures

CS2354 Advanced Computer Architecture
Unit III
Multiprocessors and Thread-Level Parallelism
By
N.R.Rejin Paul
Lecturer/VIT/CSE
Chapter 6. Multiprocessors and Thread-Level Parallelism
6.1 Introduction
6.2 Characteristics of Application Domains
6.3 Symmetric Shared-Memory Architectures
6.4 Performance of Symmetric Shared-Memory Multiprocessors
6.5 Distributed Shared-Memory Architectures
6.6 Performance of Distributed Shared-Memory Multiprocessors
6.7 Synchronization
6.8 Models of Memory Consistency: An Introduction
6.9 Multithreading: Exploiting Thread-Level Parallelism within a Processor
Taxonomy of Parallel Architectures

Flynn Categories
SISD (Single Instruction Single Data)
Uniprocessors
MISD (Multiple Instruction Single Data)
Multiple processors operate on a single data stream; no commercial examples
SIMD (Single Instruction Multiple Data)
The same instruction is executed by multiple processors using different data streams
Each processor has its own data memory (hence multiple data)
There is a single instruction memory and control processor
Simple programming model, low overhead, flexibility
(Phrase reused by Intel marketing for media instructions ~ vector)
Examples: vector architectures, Illiac-IV, CM-2
MIMD (Multiple Instruction Multiple Data)

Each processor fetches its own instructions and operates on its own data
MIMD is the current winner: the major design emphasis is on machines with <= 128 processors
Use off-the-shelf microprocessors: cost-performance advantages
Flexible: high performance for one application, running many tasks simultaneously
Examples: Sun Enterprise 5000, Cray T3D, SGI Origin

MIMD Class 1:
Centralized shared-memory multiprocessor
Processors share a single centralized memory; a bus interconnects processors and memory
Also known as uniform memory access (UMA), because the time to access memory is the same from every processor, or as a symmetric (shared-memory) multiprocessor (SMP)
A symmetric relationship to all processors
A uniform memory access time from any processor
Scalability problem: less attractive for large processor counts
MIMD Class 2:
Distributed-memory multiprocessor
memory modules associated with CPUs

Advantages:
cost-effective way to scale memory bandwidth
lower memory latency for local memory access
Drawbacks
longer communication latency for communicating data between processors
software model more complex
6.3 Symmetric Shared-Memory Architectures
Each processor has the same relationship to a single memory

usually supports caching both private data and shared data
Caching in shared-memory machines
private data: data used by a single processor
When a private item is cached, its location is migrated to the cache

Since no other processor uses the data, the program behavior is identical to that
in a uniprocessor
shared data: data used by multiple processors

When shared data are cached, the shared value may be replicated in multiple
caches
advantages: reduced access latency and reduced memory bandwidth demand
drawback: because each processor loads and stores through its own cache, and caches differ in when writes reach memory, values in different caches may not be consistent
this introduces a new problem: cache coherence
Coherent caches provide:

migration: a data item can be moved to a local cache and used there in a
transparent fashion
replication for shared data that are being simultaneously read
both are critical to performance in accessing shared data
Multiprocessor Cache Coherence Problem

Informally:
A memory system is coherent if any read returns the most recent write
Coherence defines what values can be returned by a read
Consistency determines when a written value will be returned by a read
The informal definition is too strict and too difficult to implement
Better:
Write propagation: a written value must become visible to other caches; any write must eventually be seen by a read
Write serialization: all writes are seen in the proper order by all caches
Two rules to ensure this:
If P writes x and then P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart
Writes to a single location are serialized: seen in one order
The latest write will be seen
Otherwise a processor could see writes in an illogical order
(could see an older value after a newer value)
Example Cache Coherence Problem

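[Figure: processors P1, P2, and P3, each with a cache ($), share a bus with memory and I/O devices. Copies of u: 5 appear in caches and in memory; at event 3 a processor writes u = 7, and later reads of u by the other processors are marked u = ?.]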
Processors see different values for u after event 3
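The problem in the figure can be reproduced with a minimal sketch. It assumes write-through caches with no coherence protocol at all, and it assumes an event order (two reads, then the write as event 3) that the figure residue does not fully spell out; the names `read`, `write`, `caches`, and `memory` are illustrative only.

```python
# Toy model: per-processor caches with NO coherence protocol (illustrative only).
memory = {"u": 5}
caches = {"P1": {}, "P2": {}, "P3": {}}

def read(proc, addr):
    """Return the cached value if present; otherwise fetch it from memory."""
    cache = caches[proc]
    if addr not in cache:
        cache[addr] = memory[addr]
    return cache[addr]

def write(proc, addr, value):
    """Write through to memory, but never notify the other caches."""
    caches[proc][addr] = value
    memory[addr] = value

# Assumed event order (hypothetical; only "event 3" is named in the slide).
read("P1", "u")              # event 1: P1 caches u = 5
read("P3", "u")              # event 2: P3 caches u = 5
write("P3", "u", 7)          # event 3: P3 writes u = 7
p1_sees = read("P1", "u")    # P1 hits its stale cached copy
p2_sees = read("P2", "u")    # P2 misses and fetches the new value from memory
print(p1_sees, p2_sees)
```

Under these assumptions P1 keeps reading its old copy while P2 sees the new value, which is the incoherence the slide describes.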
Defining Coherent Memory System

1. Preserve Program Order: A read by processor P to location X
that follows a write by P to X, with no writes of X by another
processor occurring between the write and the read by P, always
returns the value written by P
2. Coherent view of memory: Read by a processor to location X that
follows a write by another processor to X returns the written value
if the read and write are sufficiently separated in time and no
other writes to X occur between the two accesses
3. Write serialization: 2 writes to same location by any 2 processors
are seen in the same order by all processors
For example, if the values 1 and then 2 are written to a
location X by P1 and P2, processors can never read the value
of the location X as 2 and then later read it as 1
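The write-serialization example above can be expressed as a small check. This is only a sketch: it assumes the written values are distinct and that the global write order is known, and `respects_write_order` is a hypothetical helper name.

```python
def respects_write_order(write_order, observed_reads):
    """Check one processor's reads of X against a single global write order.

    write_order: values written to X in serialized order (assumed distinct).
    observed_reads: values this processor read from X, in program order.
    Returns False if a read returns an older value after a newer one.
    """
    position = {value: i for i, value in enumerate(write_order)}
    last_seen = -1
    for value in observed_reads:
        idx = position[value]
        if idx < last_seen:
            return False  # went backwards in the write order
        last_seen = idx
    return True

# Values 1 and then 2 are written to X.
ok = respects_write_order([1, 2], [1, 1, 2])   # reads move forward only
bad = respects_write_order([1, 2], [2, 1])     # reads 2 and later 1
print(ok, bad)
```

A processor may read the same value repeatedly or skip a value, but it may never step backwards in the serialized order.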
Basic Schemes for Enforcing Coherence

Program on multiple processors will normally have copies of the
same data in several caches
Rather than trying to avoid sharing in SW,
SMPs use a HW protocol to maintain coherent caches
Migration and Replication key to performance of shared data
Migration - data can be moved to a local cache and used there in
a transparent fashion
Reduces both latency to access shared data that is allocated
remotely and bandwidth demand on the shared memory
Replication for shared data being simultaneously read, since
caches make a copy of data in local cache
Reduces both latency of access and contention for reading
shared data
2 Classes of Cache Coherence Protocols

1. Snooping: every cache with a copy of data also has a copy of the
sharing status of the block, but no centralized state is kept
All caches are accessible via some broadcast medium (a bus or switch)
All cache controllers monitor or snoop on the medium to determine
whether or not they have a copy of a block that is requested on a bus or
switch access
2. Directory based: the sharing status of a block of physical memory
is kept in just one location, the directory
Snoopy Cache-Coherence Protocols

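[Figure: processors P1 ... Pn, each with a cache whose blocks hold state, address (tag), and data, connected over a shared bus to memory and I/O devices; each cache performs cache-memory transactions and snoops the bus.]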
The cache controller snoops all transactions on the shared medium (bus or switch)
A transaction is relevant if it is for a block the cache contains
The controller takes action to ensure coherence: invalidate, update, or supply the value
The action depends on the state of the block and the protocol
Either get exclusive access before a write (write invalidate) or update all copies on a write (write update)
Example: Write-thru Invalidate

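[Figure: P1, P2, and P3 with caches ($), memory, and I/O devices on a shared bus, using write-through caches; the write u = 7 at event 3 also updates memory to u = 7, while cached copies show u: 5.]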
Must invalidate before step 3

Write update uses more bandwidth on the broadcast medium
All recent microprocessors use write invalidate
Two Classes of Cache Coherence Protocols

Snooping Solution (Snoopy Bus)
Send all requests for data to all processors

Processors snoop to see if they have a copy and respond accordingly
Requires broadcast, since caching information is at processors
Works well with bus (natural broadcast medium)
Dominates for small scale machines (most of the market)
Directory-Based Schemes (Section 6.5)

Directory keeps track of what is being shared in a centralized place
Distributed memory => distributed directory for scalability
(avoids bottlenecks)
Send point-to-point requests to processors via network
Scales better than Snooping
Actually existed BEFORE Snooping-based schemes
Basic Snoopy Protocols

Write strategies
Write-through: memory is always up-to-date
Write-back: snoop in caches to find most recent copy
There are two ways to maintain coherence requirements using snooping protocols
Write Invalidate Protocol

Multiple readers, single writer
Write to shared data: an invalidate is sent to all caches which snoop and
invalidate any copies
Read miss: further read will miss in the cache and fetch a new copy of the data
Write Broadcast/Update Protocol

Write to shared data: broadcast on bus, processors snoop, and update any
copies
Read miss: memory/cache is always up-to-date
Write serialization: bus serializes requests!

Bus is single point of arbitration
Examples of Basic Snooping Protocols

Write Invalidate
Write Update
Assume neither cache initially holds X and the value of X in memory is 0
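The two protocols can be compared with a toy simulation under the same starting assumption (no cache holds X, memory holds 0). This is a sketch rather than the slide's own tables: it assumes write-through caches, two processors named A and B, and one logged bus transaction per miss or write; the class `Bus` and function `run` are hypothetical names.

```python
class Bus:
    """Toy snooping bus with write-through caches; X starts at 0 in memory."""

    def __init__(self, policy, procs=("A", "B")):
        assert policy in ("invalidate", "update")
        self.policy = policy
        self.memory = {"X": 0}
        self.caches = {p: {} for p in procs}
        self.log = []  # bus transactions, in the order the bus serialized them

    def read(self, proc, addr):
        cache = self.caches[proc]
        if addr not in cache:                    # read miss goes to the bus
            self.log.append(f"{proc} read miss {addr}")
            cache[addr] = self.memory[addr]
        return cache[addr]

    def write(self, proc, addr, value):
        self.caches[proc][addr] = value
        self.memory[addr] = value                # write-through (assumption)
        for other, cache in self.caches.items():
            if other == proc or addr not in cache:
                continue
            if self.policy == "invalidate":
                del cache[addr]                  # snooped copy is invalidated
            else:
                cache[addr] = value              # snooped copy is updated
        self.log.append(f"{proc} {self.policy} {addr}")


def run(policy):
    bus = Bus(policy)
    bus.read("A", "X")           # A misses and loads 0
    bus.read("B", "X")           # B misses and loads 0
    bus.write("A", "X", 1)       # A writes 1
    value = bus.read("B", "X")   # B reads X again
    return value, bus.log

inv_value, inv_log = run("invalidate")
upd_value, upd_log = run("update")
print(inv_log)
print(upd_log)
```

In this run both protocols return the new value to B, but under write invalidate B's second read is a miss on the bus, while under write update it hits in B's cache.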
An Example Snoopy Protocol

Invalidation protocol, write-back cache
Each cache block is in one state (track these):
Shared : block can be read
OR Exclusive : cache has the only copy, it is writable, and dirty
OR Invalid : block contains no data
an extra state bit (shared/exclusive) associated with a valid bit and a
dirty bit for each block
Each block of memory is in one state:
Clean in all caches and up-to-date in memory (Shared)
OR Dirty in exactly one cache (Exclusive)
OR Not in any caches
Each processor snoops every address placed on the bus
If a processor finds that it has a dirty copy of the requested cache block,
it provides that cache block in response to the read request
Cache Coherence Mechanism of the Example
Placing a write miss on the bus when a write hits in the shared state ensures an
exclusive copy (data not transferred)
Figure 6.11 State Transitions for Each Cache Block

Requests from CPU: the CPU may read/write hit/miss to the block, and may place a write/read miss on the bus
Requests from bus: the cache may receive a read/write miss from the bus
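The three-state invalidation protocol with write-back caches can be sketched as follows. This is a simplified model, not the full diagram from the figure: it assumes one value per address instead of multi-word blocks, ignores replacements, defaults uninitialized memory to 0, and uses the hypothetical class name `SnoopySystem`.

```python
INVALID, SHARED, EXCLUSIVE = "Invalid", "Shared", "Exclusive"

class SnoopySystem:
    """Toy write-back, write-invalidate snooping protocol (one value per block)."""

    def __init__(self, n_caches):
        self.memory = {}
        self.state = [dict() for _ in range(n_caches)]  # addr -> state
        self.data = [dict() for _ in range(n_caches)]   # addr -> cached value

    def state_of(self, cid, addr):
        return self.state[cid].get(addr, INVALID)

    def _snoop(self, requester, addr, is_write):
        """Every other cache reacts to a read or write miss seen on the bus."""
        for cid in range(len(self.state)):
            if cid == requester:
                continue
            st = self.state_of(cid, addr)
            if st == EXCLUSIVE:
                # The dirty copy is supplied and written back to memory.
                self.memory[addr] = self.data[cid][addr]
                self.state[cid][addr] = INVALID if is_write else SHARED
            elif st == SHARED and is_write:
                self.state[cid][addr] = INVALID

    def read(self, cid, addr):
        if self.state_of(cid, addr) == INVALID:       # place read miss on bus
            self._snoop(cid, addr, is_write=False)
            self.data[cid][addr] = self.memory.get(addr, 0)
            self.state[cid][addr] = SHARED
        return self.data[cid][addr]

    def write(self, cid, addr, value):
        if self.state_of(cid, addr) != EXCLUSIVE:     # place write miss on bus
            self._snoop(cid, addr, is_write=True)
            self.state[cid][addr] = EXCLUSIVE
        self.data[cid][addr] = value                  # dirty; memory not updated


smp = SnoopySystem(2)
smp.write(0, "X", 7)          # cache 0: Exclusive, memory not yet updated
first = smp.read(1, "X")      # cache 0 supplies 7 and drops to Shared
smp.write(1, "X", 9)          # write hit in Shared: cache 0 is invalidated
second = smp.read(0, "X")     # cache 1 supplies 9 and drops to Shared
print(first, second)
```

Note that a write hit in the Shared state goes through the same bus path as a write miss, which is how this sketch obtains the exclusive copy.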
Cache Coherence State Diagram
Figure 6.10 and Figure 6.12 (CPU in black and bus in gray from Figure 6.11)

6.5 Distributed Shared-Memory Architectures
Separate memory per processor
Local or remote access via memory controller
The physical address space is statically distributed
Coherence Problems
Simple approach: uncacheable
shared data are marked as uncacheable and only private data are kept in caches
very long latency to access memory for shared data
Alternative: directory for memory blocks

The directory for each memory tracks the state of every block in every cache:
which caches have copies of the memory block, dirty vs. clean, ...
Two additional complications
The interconnect cannot be used as a single point of arbitration like the bus
Because the interconnect is message oriented, many messages must have
explicit responses
Distributed Directory Multiprocessor
To prevent the directory from becoming the bottleneck, we distribute directory entries with memory, each directory keeping track of which processors have copies of its memory blocks
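Because the physical address space is statically distributed, the node holding a block's memory and directory entry can be computed from the address alone. The sketch below assumes a contiguous, equal-sized split across nodes, which is only one possible layout; `home_node` and its parameters are hypothetical names.

```python
def home_node(phys_addr, num_nodes, bytes_per_node):
    """Return the node whose memory (and directory) holds phys_addr.

    Assumes a contiguous static split: node i owns the address range
    [i * bytes_per_node, (i + 1) * bytes_per_node).
    """
    node = phys_addr // bytes_per_node
    if not 0 <= node < num_nodes:
        raise ValueError("address outside the machine's physical memory")
    return node

# Four nodes with 1024 bytes each (made-up sizes for illustration).
print(home_node(0, 4, 1024), home_node(1024, 4, 1024), home_node(4095, 4, 1024))
```

A request for an address is sent point-to-point to this node instead of being broadcast.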
Directory Protocols
Similar to Snoopy Protocol: Three states
Shared: 1 or more processors have the block cached, and the value in memory is
up-to-date (as well as in all the caches)
Uncached: no processor has a copy of the cache block (not valid in any cache)
Exclusive: Exactly one processor has a copy of the cache block, and it has
written the block, so the memory copy is out of date
The processor is called the owner of the block
In addition to tracking the state of each cache block, we must track the processors that have copies of the block when it is shared
(usually a bit vector for each memory block: 1 if processor has copy)
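The sharer bit vector can be sketched with plain integers, where bit i is 1 if processor i has a copy. The helper names `add_sharer`, `remove_sharer`, and `sharers` are illustrative, not part of any protocol specification.

```python
def add_sharer(vector, proc_id):
    """Set the bit for proc_id."""
    return vector | (1 << proc_id)

def remove_sharer(vector, proc_id):
    """Clear the bit for proc_id."""
    return vector & ~(1 << proc_id)

def sharers(vector):
    """List the processor ids whose bits are set, in increasing order."""
    ids = []
    proc_id = 0
    while vector:
        if vector & 1:
            ids.append(proc_id)
        vector >>= 1
        proc_id += 1
    return ids

vec = 0
vec = add_sharer(vec, 0)
vec = add_sharer(vec, 3)
print(bin(vec), sharers(vec))
```

The cost is one bit per processor for every memory block, which is why the vector is kept with the block's home memory.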
Keep it simple(r):
Writes to non-exclusive data => write miss
Processor blocks until access completes
Assume messages received and acted upon in order sent
Messages for Directory Protocols
local node: the node where a request originates

home node: the node where the memory location and directory entry of an address reside
remote node: the node that has a copy of a cache block (exclusive or shared)
State Transition Diagram for Individual Cache Block
Comparing to snooping protocols:
identical states
stimulus is almost identical
a write to a shared cache block is treated as a write miss (without fetching the block)
a cache block must be in the exclusive state when it is written
any shared block must be up to date in memory
write miss: data fetch and selective invalidate operations are sent by the directory controller (broadcast in snooping protocols)
State Transition Diagram for the Directory
Figure 6.29 Transition diagram for cache block
Three requests: read miss, write miss, and data write-back
Directory Operations: Requests and Actions

Message sent to directory causes two actions:
Update the directory
More messages to satisfy request
Block is in Uncached state: the copy in memory is the current value; the only possible requests for that block are:
Read miss: the requesting processor is sent the data from memory and is made the only sharing node; the state of the block is made Shared.
Write miss: the requesting processor is sent the value and becomes the sharing node.
The block is made Exclusive to indicate that the only valid copy is cached.
Sharers indicates the identity of the owner.
Block is Shared => the memory value is up-to-date:

Read miss: requesting processor is sent back the data from memory &
requesting processor is added to the sharing set.
Write miss: requesting processor is sent the value. All processors in the set
Sharers are sent invalidate messages, & Sharers is set to identity of requesting
processor. The state of the block is made Exclusive.
Directory Operations: Requests and Actions (cont.)

Block is Exclusive: current value of the block is held in the cache of
the processor identified by the set Sharers (the owner) => three
possible directory requests:
Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor.
The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
The state is Shared.
Data write-back: the owner processor is replacing the block and hence must write it back, making the memory copy up-to-date (the home directory essentially becomes the owner); the block is now Uncached, and the Sharers set is empty.
Write miss: block has a new owner. A message is sent to old owner causing the
cache to send the value of the block to the directory from which it is sent to the
requesting processor, which becomes the new owner. Sharers is set to identity
of new owner, and state of block is made Exclusive.
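The directory-side behavior described on these two slides can be sketched as a small state machine for one memory block. It is a simplification under stated assumptions: messages are returned as tuples instead of being sent over a network, the message names ("fetch", "invalidate", "fetch_invalidate") are illustrative, and the caller passes in `owner_value` to stand for the data the old owner sends back. `DirectoryEntry` is a hypothetical class name.

```python
UNCACHED, SHARED, EXCLUSIVE = "Uncached", "Shared", "Exclusive"

class DirectoryEntry:
    """Directory state for one memory block (toy model)."""

    def __init__(self, value=0):
        self.state = UNCACHED
        self.sharers = set()        # the set Sharers; holds the owner if Exclusive
        self.memory_value = value

    def read_miss(self, requester, owner_value=None):
        """Return (data sent to requester, messages sent to other nodes)."""
        messages = []
        if self.state == EXCLUSIVE:
            (owner,) = self.sharers
            messages.append(("fetch", owner))
            self.memory_value = owner_value   # owner's data is written to memory
        self.sharers.add(requester)           # old owner (if any) stays a sharer
        self.state = SHARED
        return self.memory_value, messages

    def write_miss(self, requester, owner_value=None):
        """Return (data sent to requester, messages sent to other nodes)."""
        messages = []
        data = self.memory_value
        if self.state == SHARED:
            messages = [("invalidate", p) for p in sorted(self.sharers)
                        if p != requester]
        elif self.state == EXCLUSIVE:
            (owner,) = self.sharers
            messages.append(("fetch_invalidate", owner))
            data = owner_value                # forwarded from the old owner
        self.sharers = {requester}            # requester is the new owner
        self.state = EXCLUSIVE
        return data, messages

    def data_write_back(self, owner, value):
        assert self.state == EXCLUSIVE and self.sharers == {owner}
        self.memory_value = value             # memory copy is up to date again
        self.sharers = set()
        self.state = UNCACHED


entry = DirectoryEntry(value=5)
entry.read_miss(0)                            # Uncached -> Shared, Sharers = {0}
entry.read_miss(1)                            # Sharers = {0, 1}
w_data, w_msgs = entry.write_miss(2)          # invalidate 0 and 1; 2 becomes owner
r_data, r_msgs = entry.read_miss(0, owner_value=9)  # owner 2 had written 9
print(w_msgs, r_msgs, entry.state, sorted(entry.sharers))
```

The last read miss leaves both the old owner and the requester in Sharers, matching the rule that the old owner still has a readable copy.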
Summary
Chapter 6. Multiprocessors and Thread-Level Parallelism
6.1 Introduction
6.2 Characteristics of Application Domains
6.4 Performance of Symmetric Shared-Memory
Multiprocessors
6.6 Performance of Distributed Shared-Memory
Multiprocessors
6.7 Synchronization
6.8 Models of Memory Consistency: An Introduction
6.9 Multithreading: Exploiting Thread-Level Parallelism
within a Processor