CS5204/EE5364 - Advanced Computer Architecture - Memory
With slides from: Profs. Zhai, Mowry, Falsafi, Hill, Hoe, Lipasti, Shen,
Smith, Sohi, Vijaykumar, Patterson, Culler
Memory Technology and Review
Typical Memory Hierarchy

Higher level (closer to the CPU): smaller, faster. Lower level: larger, slower.

Level         Capacity      Access time              Cost
Registers     100s bytes    < 10s ns                 -
Cache         ~ K/M bytes   10 - 100 ns              1 - 0.1 cents/bit
Main Memory   ~ M/G bytes   200 - 500 ns             0.0001 - 0.00001 cents/bit
Disk          ~ G/T bytes   10 ms (10,000,000 ns)    10^-5 - 10^-6 cents/bit
Tape          ~ infinite    sec - min                10^-8 cents/bit

Data moves between adjacent levels under different control and in different units:
• Registers ↔ Cache: Program/Compiler; instruction operands (1 - 8 bytes)
• Cache ↔ Main Memory: Cache Controller; blocks (8 - 128 bytes)
• Main Memory ↔ Disk: OS; pages (4K/1M bytes)
• Disk ↔ Tape: User/Operator; files (G/T bytes)
Memory Hierarchy
[Figures: example memory hierarchies for desktops or laptops and for servers]
Memory Hierarchy Design
• Memory hierarchy design becomes more crucial with recent
multi-core processors:
• Aggregate peak bandwidth grows with # of cores:
• Intel Core i7 can generate 2 references per core per clock cycle
• Example - 4 cores and 3.2 GHz clock
• (2 x 4 x 3.2 = 25.6) billion 8-byte data references/second +
(4 x 3.2 = 12.8) billion 16-byte instruction refs/second
• Total required bandwidth = 12.8 x 16 + 25.6 x 8 = 409.6 GByte/s
• Peak DRAM bandwidth (34.1 GB/s) is only about 8% of this
• Requires:
• Multi-port, multi-banked, pipelined caches
• Two levels of private cache per core (L1 & L2 Cache)
• Shared third-level cache on chip (L3 Cache)
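As a sanity check, the required-bandwidth arithmetic above can be reproduced in a few lines of C (the workload parameters are the slide's assumptions, not measured values):

    /* Back-of-envelope check of the bandwidth example above: 2 data refs
       per core per cycle of 8 bytes, 1 instruction fetch per core per
       cycle of 16 bytes, 4 cores at 3.2 GHz. */
    #include <stdio.h>

    int main(void) {
        double cores = 4, ghz = 3.2;
        double data_refs = 2 * cores * ghz;            /* 25.6 G data refs/s  */
        double inst_refs = cores * ghz;                /* 12.8 G instr refs/s */
        double bw = data_refs * 8 + inst_refs * 16;    /* GB/s                */
        printf("Required bandwidth: %.1f GB/s\n", bw); /* 409.6               */
        printf("34.1 GB/s DRAM covers %.1f%%\n", 34.1 / bw * 100); /* ~8.3%   */
        return 0;
    }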
Why Memory Hierarchy Works - Program Locality
• Programs access a relatively small portion of the address space at any
  instant of time. Two different types of program locality:
• Temporal Locality (Locality in Time):
  • If an item is referenced, it will tend to be referenced again soon.
  • Example:
        sum = 0;
        for (i = 0; i < 100; i++) {
            sum += a[i];
        }
    Data: sum
    Program: instructions in the loop
• Spatial Locality (Locality in Space):
  • If an item is referenced, items whose addresses are close by tend to
    be referenced soon (implicit prefetching)
  • Example:
    Data: array elements a[i], a[i+1], …
    Program: instructions in the loop
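To make spatial locality concrete, here is a small illustrative C example (not from the slides): traversing a 2-D array in row-major order touches consecutive addresses, while column-major order strides across whole rows and forfeits that benefit.

    #define N 1024
    double a[N][N];

    double sum_row_major(void) {        /* good spatial locality */
        double sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];         /* consecutive addresses */
        return sum;
    }

    double sum_col_major(void) {        /* poor spatial locality */
        double sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];         /* stride of N*8 bytes   */
        return sum;
    }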
Storage Organization and Working Set
• Rule-of-Thumb:
  • The larger the hardware structure, the slower the access time
• To exploit temporal and spatial locality:
  • Put things likely to be used in the near future close to the CPU
• Notion of working set:
  • Related to temporal locality
  • Memory footprint within a pre-set sliding time window (e.g., 1M
    instructions)
  • The better the temporal locality, the smaller the working set
• Cache memory is most effective if the working set fits in it
• Algorithm designers need to decide which cache level the working set
  should fit in (a measurement sketch follows below)
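As referenced above, a minimal sketch of measuring working-set size over a sliding window, assuming a trace of 32-bit addresses and 64-byte cache blocks (the trace format and the hash-free `seen` table are illustrative):

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_BITS 6          /* 64-byte blocks                        */
    #define TABLE_SIZE (1 << 20)  /* assume block ids fit 20 bits (sketch) */

    /* Number of distinct blocks touched in trace[start .. start+W-1],
       i.e., the working-set size for a window of W references. */
    int working_set_size(const uint32_t *trace, int start, int W) {
        static uint8_t seen[TABLE_SIZE];
        memset(seen, 0, sizeof seen);
        int distinct = 0;
        for (int i = start; i < start + W; i++) {
            uint32_t block = (trace[i] >> BLOCK_BITS) & (TABLE_SIZE - 1);
            if (!seen[block]) { seen[block] = 1; distinct++; }
        }
        return distinct;
    }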
Announcement 9/12/2024
• Draft ppt slides are available before each lecture on Canvas; an
updated version is also posted after the lecture.
• Reminder – Deadline for term project proposals is Thur
9/26/24
• There are GPUs and multicores in CSE labs, if you plan to
do GPU- or multicore-related term projects.
• Check the following link for more information.
https://cse.umn.edu/cseit/classrooms-labs
• Note: Csci8205/EE8367 “Parallel Machine Organizations”
will be offered in Spring 2025
Recap
• To summarize performance measurements, we can use arithmetic
means, weighted arithmetic means & geometric means
• Each has its pros and cons
[Figure: memory references plotted over time, illustrating Temporal Locality, Spatial Locality, and the Working Set]
Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
Review: Memory Systems and Organizations
How is a Single-Bit SRAM Stored?
• SRAM (Static RAM) – used in Cache Memory
• Feedback circuit (CMOS technology) – 6-transistor cell
How is a Single-Bit DRAM Stored?
• DRAM (Dynamic RAM) – used in Main Memory
• A capacitor to store charge (the bit information), with one access
  transistor per cell
[Figure: a 4 x 4 bit array; the 4-bit address 1001 splits into a 2-bit row address (10) and a 2-bit column address (01) to select one cell]
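For illustration, the figure's address decode in C (a sketch; the 4 x 4 array and the high-bits-row / low-bits-column split follow the figure above):

    unsigned row_of(unsigned addr4) { return (addr4 >> 2) & 0x3; } /* high 2 bits */
    unsigned col_of(unsigned addr4) { return addr4 & 0x3; }        /* low 2 bits  */
    /* Example: address 1001 (9) -> row 10 (2), column 01 (1). */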
Row Hammer Attack
RowHammer:
  mov (X), %eax   // read from address X in DRAM
  mov (Y), %ebx   // read from address Y in DRAM
  clflush (X)     // flush the cached copy of address X
  clflush (Y)     // flush the cached copy of address Y
  mfence          // wait for all mem ops to complete
  jmp RowHammer   // loop: repeatedly activate rows X and Y
Memory Optimizations
• Some optimizations:
• Multiple accesses to same row (i.e., block transfer)
• Synchronous DRAM (SDRAM)
• Added clock to DRAM interface
• Burst mode with critical-word first
• Double Data Rate (DDR) – use both rising and falling edge of a clock
• Multiple banks on each DRAM device
• Wider interfaces
• SIMM (single in-line memory module) vs. DIMM (dual in-line memory module)
Memory Optimizations
• DDR (Double-Data Rate DRAM)
• DDR2
• Lower power (2.5 V -> 1.8 V)
• Higher clock rates (266 MHz, 333 MHz, 400 MHz)
• DDR3
• 1.5 V
• 800 MHz
• DDR4
• 1-1.2 V
• 1.333 GHz
• DDR5
• 1.1 V
• 2 – 4 GHz
3-D High-Bandwidth Memory
Figure 2.2 With 1980 performance as a baseline, the gap in performance, measured as difference in the
time between processor memory requests (single core) and latency of a DRAM access, is plotted over time.
• Vertical axis is on a logarithmic scale to record the size of processor-DRAM performance (latency) gap.
• Memory baseline is 64 KiB DRAM in 1980, with a 1.07X per year performance improvement in latency
• The processor line assumes a 1.25X per year improvement until 1986, 1.52X
per year until 2000, 1.20X per year between 2000 and 2005, and only small
improvements between 2005 and 2015.
• Until 2010 memory access times in DRAM improved slowly; since 2010 the improvement in access time
has reduced, as compared with earlier periods, but with continued improvements in bandwidth.
• In mid-2017, AMD, Intel and Nvidia all announced chip sets using versions of HBM technology.
Non-volatile Memory (NVM) - Flash Memory
Cache Memory
• Cache:
• A smaller, faster storage device that acts as a staging area for
a subset of the data (the working set) in a larger, slower next-level
memory device.
• Fundamental idea of a memory hierarchy:
• Faster, smaller memory at level k serves as a cache for
larger, slower memory at level k+1.
• Why do memory hierarchies work?
• Programs tend to access level-k memory more often than
they access level (k+1) memory.
• Storage at level k+1 is slower, larger and cheaper per bit.
Caching: A Simple Example
• Assume: a DRAM main memory of 16 blocks (block addresses 0000 - 1111),
  an SRAM cache of 4 blocks, and 1 word per block (cache block ↔ cache line).
  Each cache entry holds a Valid bit, a Tag, and Data, selected by a 2-bit
  Index (00, 01, 10, 11).
• Q1: Where to place a cache block?
  • Use the 2 low-order address bits as the cache index:
    (block address) modulo (# of blocks in the cache)
  • Use the 2 high-order address bits as the tag
• Q2: Is the cache block there?
  • Compare the cache Tags
  • Check the Valid bit
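A minimal sketch of this direct-mapped lookup in C, using the example's parameters (4 blocks, 1 word per block; the struct layout is an assumption for illustration):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_BLOCKS 4

    typedef struct { bool valid; uint32_t tag; uint32_t data; } Line;
    static Line cache[NUM_BLOCKS];

    /* Q1: index = (block address) mod (# blocks).
       Q2: hit iff the entry is valid and the tags match. */
    bool lookup(uint32_t block_addr, uint32_t *data) {
        uint32_t index = block_addr % NUM_BLOCKS;   /* low-order bits  */
        uint32_t tag   = block_addr / NUM_BLOCKS;   /* high-order bits */
        if (cache[index].valid && cache[index].tag == tag) {
            *data = cache[index].data;              /* hit             */
            return true;
        }
        return false;                               /* miss: fetch and fill */
    }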
Set-Associative Cache (4-way)
• 2^8 = 256 sets, each with 4 ways (each way holds one block)
• 32-bit address: a 22-bit Tag (bits 31 - 10), an 8-bit Index (bits 9 - 2),
  and a 2-bit Byte offset (bits 1 - 0)
[Figure: four ways (Way 0 - Way 3), each an array of 256 entries (Index 0 - 255) holding a Valid bit, a Tag, and a 32-bit Data word; the Index selects one set and all four tags are compared in parallel]
Associative Block Replacement
• Ideally, replace the block that will be accessed the furthest in the future
• Belady's algorithm (OPT) – uses "future" access information to
implement the replacement policy – an optimal replacement algorithm
• Can't implement it – the information is available only after program execution
• Approximations (see the LRU sketch below):
• FIFO, Least/Most Recently/Frequently Used — LRU/MRU/LFU/MFU
• Attempt to optimize for temporal locality
• Not-Most-Recently-Used — NMRU
• Track the MRU block, randomly select from the others; a good compromise
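As referenced above, a sketch of LRU bookkeeping for one 4-way set (age counters are one common software model; real hardware often uses pseudo-LRU bits instead):

    #include <stdint.h>

    #define WAYS 4

    typedef struct { uint32_t tag; int valid; int age; } Way;

    /* On a hit to way h: age every valid way that was younger, then
       make h the most recently used (age 0). */
    void lru_touch(Way set[WAYS], int h) {
        for (int w = 0; w < WAYS; w++)
            if (set[w].valid && set[w].age < set[h].age) set[w].age++;
        set[h].age = 0;
    }

    /* On a miss: the victim is any invalid way, else the oldest way. */
    int lru_victim(const Way set[WAYS]) {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid) return w;
            if (set[w].age > set[victim].age) victim = w;
        }
        return victim;
    }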
Three sources of cache misses (the 3 C's):
• Compulsory miss – the first access to a block can never hit (cold start)
• Capacity miss – the cache cannot hold all the blocks the program needs
• Conflict miss – too many blocks map to the same set in a direct-mapped
  or set-associative cache
Write Policies
• Writes are more complicated than reads in cache design
• On reads, data is accessed in parallel with the tag compare (next slide)
• On writes, 2 steps are needed (i.e., make sure it is a hit before writing)
• Is turn-around time important for writes?
• Cache optimizations often defer writes in favor of reads
• On a write miss, should the block be brought into the cache?
• Yes: write-allocate; No: no-write-allocate
Write-back
• Many cache lines are only read and never written to
• Update memory only on cache block replacement
• Add a "dirty" bit to the status word
• Cleared when the block is brought in; set when the block is written
Write Buffer in CPU:
[Figure: the write buffer sits between the CPU and the Cache]
• To buffer CPU writes/stores to Cache
• Allows following CPU reads to proceed (reads can bypass writes)
• Stall only when write buffer becomes full
• What happens on dependent loads/stores? Should a dependent load get the
data from the write buffer?
• When is a write considered as committed/completed?
• After written into write buffer? (Fast)
• After written into cache? (Slower)
• After written into main memory? (Very slow)
• After the write becomes visible to all other cores? (memory consistency model)
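A minimal sketch of a write buffer with store-to-load forwarding for the dependent-load question above, assuming a simple FIFO design (entry count and fields are illustrative, not any specific CPU's):

    #include <stdint.h>
    #include <stdbool.h>

    #define WB_ENTRIES 8

    typedef struct { uint64_t addr; uint32_t data; bool valid; } WBEntry;
    typedef struct { WBEntry e[WB_ENTRIES]; int head, tail, count; } WriteBuffer;

    /* A store enters the buffer; returns false (stall) if it is full. */
    bool wb_store(WriteBuffer *wb, uint64_t addr, uint32_t data) {
        if (wb->count == WB_ENTRIES) return false;       /* stall the store */
        wb->e[wb->tail] = (WBEntry){ addr, data, true };
        wb->tail = (wb->tail + 1) % WB_ENTRIES;
        wb->count++;
        return true;
    }

    /* A dependent load checks the buffer, youngest entry first, before
       reading the cache; a match forwards the pending store's data. */
    bool wb_forward(const WriteBuffer *wb, uint64_t addr, uint32_t *data) {
        for (int i = 0; i < wb->count; i++) {
            int idx = (wb->tail - 1 - i + WB_ENTRIES) % WB_ENTRIES;
            if (wb->e[idx].valid && wb->e[idx].addr == addr) {
                *data = wb->e[idx].data;
                return true;
            }
        }
        return false;                                    /* read the cache */
    }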
Write-Back Buffers (for Replaced Cache Lines)
[Figure: a writeback buffer sits between the Cache and the next-level Cache/Memory, holding replaced dirty lines]
Cache Performance
• Miss-oriented approach to memory access:

  CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime

          = IC × (CPI_Execution + MemMisses/Inst × MissPenalty) × CycleTime
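As a worked example (illustrative numbers, not from the slides): with CPI_Execution = 1.0, 1.5 memory accesses per instruction, a 2% miss rate, and a 100-cycle miss penalty:

  CPUtime = IC × (1.0 + 1.5 × 0.02 × 100) × CycleTime = IC × 4.0 × CycleTime

so memory stalls inflate the effective CPI from 1.0 to 4.0.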
Typical Memory Hierarchy – Virtual Memory
(The hierarchy figure from earlier, repeated: Registers ↔ Cache ↔ Main Memory ↔ Disk ↔ Tape. Virtual memory manages the Main Memory ↔ Disk boundary, where the OS moves pages of 4K/1M bytes.)
Virtual Memory
[Figure: CPU and registers ↔ Cache ↔ Main Memory ↔ I/O Device]
Why is VM important?
• Cheaper - no longer have to buy lots of DRAM
• Removes the burden of memory management from programmers
• Before VM, a programmer needed to write two programs: the application
program and a separate memory-management program
• Enables multiprogramming, time-sharing, protection, etc.
Two Parts to Modern VM
• Part A: Protection
• Each process sees a large, contiguous memory segment
• Each process’s memory space is private, i.e. isolated and protected
from access by other processes
[Figure: each process sees a contiguous virtual address space, mapped onto a non-contiguous physical address space; some pages reside at disk addresses]
Virtual Address: bits 31 - 12 form the Virtual Page Number; bits 11 - 0 form the Page Offset
Page Table - Example
1) Break the VA into a virtual page number and a page offset
2) Copy the page offset to the physical address
3) Use the virtual page number as an index into the page table
3a) Check the valid bit
4) Copy the physical page number from the page table to the physical address
• What do we do if the valid bit is 0 (the page is not in main memory)?
• Hardware asks the OS to fetch the page from disk - a page fault
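These steps can be sketched as a single-level translation in C (page size follows the common 32-bit/4 KB setup; the PTE layout and fault handling are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_OFFSET_BITS 12
    #define VPN_BITS         (32 - PAGE_OFFSET_BITS)

    typedef struct { uint32_t ppn; bool valid; } PTE;

    static PTE page_table[1u << VPN_BITS];   /* one entry per virtual page */

    /* Translate VA -> PA; returns false on a page fault (valid bit clear). */
    bool translate(uint32_t va, uint32_t *pa) {
        uint32_t vpn    = va >> PAGE_OFFSET_BITS;             /* step 1: split VA  */
        uint32_t offset = va & ((1u << PAGE_OFFSET_BITS) - 1);
        PTE pte = page_table[vpn];                            /* step 3: index PT  */
        if (!pte.valid)                                       /* step 3a: valid?   */
            return false;                                     /* page fault -> OS  */
        *pa = (pte.ppn << PAGE_OFFSET_BITS) | offset;         /* steps 2 and 4     */
        return true;
    }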
Translation Look-aside Buffer (TLB)
• Essentially a small cache of recent address translations
• Avoids going to the page table in memory on every reference
[Figure: the virtual address splits into a VPN, looked up in the TLB, and a page offset; source: WikiChip]
Virtual to Physical Address Translation
[Figure: the Virtual Address goes to a TLB lookup (≤ 1 cycle); a hit yields the physical address directly, while a miss requires walking the page table]
Example: Two-Level Page Table
• 32-bit addresses, 4KB pages, 4-byte PTEs
• 2^32 = 2^10 × 2^10 × 2^12: a 10-bit level-1 index, a 10-bit level-2 index,
  and a 12-bit page offset
[Figure: level-1 PTE 7 is null; PTE 8 maps a gap of 6K unallocated VM pages through a mostly-null level-2 table; level-1 PTEs 9 - 1K are null; PTE 1023 maps 1023 unallocated pages and 1 allocated VM page for the stack (VP 9215)]
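A sketch of the corresponding two-level walk in C (structure layout is illustrative, not a particular OS's); it shows why null level-1 PTEs save space: a 4 MB region with no mappings needs no level-2 table at all.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint32_t ppn; int valid; } PTE;

    typedef struct {
        PTE *l2[1024];            /* level-1 table: pointers to level-2 tables */
    } PageDir;

    /* Returns the PTE for va, or NULL if the level-2 table is unallocated. */
    PTE *walk(PageDir *dir, uint32_t va) {
        uint32_t l1 = (va >> 22) & 0x3FF;   /* top 10 bits     */
        uint32_t l2 = (va >> 12) & 0x3FF;   /* next 10 bits    */
        if (dir->l2[l1] == NULL) return NULL;
        return &dir->l2[l1][l2];            /* low 12 bits are the page offset */
    }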