
CSCI 5204/EE5364

Advanced Computer Architecture


Memory Technology & A Review
Pen-Chung Yew
Department of Computer Science and Engineering
University of Minnesota

With slides from: Profs. Zhai, Mowry, Falsafi, Hill, Hoe, Lipasti, Shen,
Smith, Sohi, Vijaykumar, Patterson, Culler
Memory Technology and Review

• Foundation of memory hierarchy – Program Locality


• What does physical memory look like? – HW technologies
• What is cache? How does it work?
• What is virtual memory? How does it work?
• How does the memory system contribute to system
security vulnerability?

Typical Memory Hierarchy
(Higher levels are closer to the CPU: smaller and faster. Lower levels are larger and slower.)
• CPU Registers: 100s of bytes, <10s ns; managed by the program/compiler; transfer unit: instruction operands (1-8 bytes)
• Cache: ~KB-MB, 10-100 ns, 1-0.1 cents/bit; managed by the cache controller; transfer unit: blocks (8-128 bytes)
• Main Memory: ~MB-GB, 200-500 ns, 0.0001-0.00001 cents/bit; managed by the OS; transfer unit: pages (4 KB-1 MB)
• Disk: ~GB-TB, ~10 ms (10,000,000 ns), 10^-5 - 10^-6 cents/bit; managed by the user/operator; transfer unit: files (GB-TB)
• Tape: ~infinite capacity, seconds-minutes, 10^-8 cents/bit
Memory Hierarchy
[Figure: example memory hierarchies for personal mobile devices, desktops or laptops, and servers]
Memory Hierarchy Design
• Memory hierarchy design becomes more crucial with recent
multi-core processors:
• Aggregate peak bandwidth grows with # of cores:
• An Intel Core i7 can generate 2 data references per core per clock cycle
• Example - 4 cores and a 3.2 GHz clock
• (2 x 4 x 3.2 = 25.6) billion 8-byte data references/second +
(4 x 3.2 = 12.8) billion 16-byte instruction references/second
• Total required bandwidth = 12.8 x 16 + 25.6 x 8 = 409.6 GB/s (see the sketch after this list)
• DRAM bandwidth (34.1 GB/s) is only about 8% of this
• Requires:
• Multi-port, multi-banked, pipelined caches
• Two levels of private cache per core (L1 & L2 Cache)
• Shared third-level cache on chip (L3 Cache)
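For concreteness, the small C sketch below reproduces the bandwidth arithmetic above. It is not from the slides; the only inputs are the numbers quoted in the example.

#include <stdio.h>

int main(void) {
    /* Numbers quoted on the slide: 4 cores at 3.2 GHz, 2 data references
       per core per clock (8 bytes each), 1 instruction fetch per core per
       clock (16 bytes). */
    double cores = 4.0, ghz = 3.2;
    double data_refs = 2.0 * cores * ghz;              /* 25.6 billion/s */
    double inst_refs = 1.0 * cores * ghz;              /* 12.8 billion/s */
    double req_bw    = data_refs * 8 + inst_refs * 16; /* GB/s           */
    double dram_bw   = 34.1;                           /* GB/s           */

    printf("Required bandwidth : %.1f GB/s\n", req_bw);   /* 409.6 */
    printf("DRAM bandwidth     : %.1f GB/s (~%.0f%% of demand)\n",
           dram_bw, 100.0 * dram_bw / req_bw);            /* ~8%   */
    return 0;
}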
Why Memory Hierarchy Works - Program Locality
• Two Different Types of Program Locality:
Programs access a relatively small portion of the address space at any instant of time.

Example loop:
    sum = 0;
    for (i = 0; i < 100; i++){
        sum += a(i);
    }

• Temporal Locality (Locality in Time):
§ If an item is referenced, it will tend to be referenced again soon.
§ Example: Data: sum; Program: instructions in the loop
• Spatial Locality (Locality in Space):
§ If an item is referenced, items whose addresses are close by tend to be referenced soon (implicit prefetching)
§ Example: Data: array elements a(i), a(i+1), …; Program: instructions in the loop
Storage Organization and Working Set
• Rule-of-Thumb:
• The larger the hardware structure, the slower the access time
• To exploit temporal and spatial locality
• Put things likely to be used in the near future close to the CPU
• Notion of working set (see the sketch after this list)
• Related to temporal locality
• Memory footprint within a pre-set sliding time window (e.g., 1M instructions)
• The better the temporal locality, the smaller the working set
• Cache memory is most effective if the working set fits in it.
• Algorithms need to be structured so that their working sets fit in a chosen cache level.
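As a rough illustration of the working-set notion, this hedged C sketch counts the distinct memory blocks touched in the last WINDOW references of an address trace. The trace, block size, and window length are made-up parameters, not values from the slides.

#include <stdio.h>
#include <stdlib.h>

#define BLOCK_BITS 6          /* assume 64-byte blocks              */
#define WINDOW     1000       /* sliding window of 1000 references  */

/* Count distinct blocks among the last WINDOW addresses of a trace. */
static size_t working_set_size(const unsigned long *trace, size_t n) {
    size_t start = n > WINDOW ? n - WINDOW : 0;
    size_t count = 0;
    unsigned long *seen = malloc((n - start) * sizeof *seen);
    for (size_t i = start; i < n; i++) {
        unsigned long blk = trace[i] >> BLOCK_BITS;
        int dup = 0;
        for (size_t j = 0; j < count; j++)
            if (seen[j] == blk) { dup = 1; break; }
        if (!dup) seen[count++] = blk;
    }
    free(seen);
    return count;
}

int main(void) {
    /* Toy trace: a loop sweeping a small array has a small working set. */
    unsigned long trace[4000];
    for (size_t i = 0; i < 4000; i++)
        trace[i] = 0x1000 + (i % 512) * 4;   /* 512 words = 2 KB footprint */
    printf("working set (blocks): %zu\n", working_set_size(trace, 4000));
    return 0;
}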
Announcement 9/12/2024
• Draft ppt slides available before each lecture on Canvas, an
updated version also posted after the lecture.
• Reminder – Deadline for term project proposals is Thur
9/26/24
• There are GPUs and multicores in CSE labs, if you plan to
do GPU- or multicore-related term projects.
• Check the following link for more information.
https://cse.umn.edu/cseit/classrooms-labs
• Note: Csci8205/EE8367 “Parallel Machine Organizations”
will be offered in Spring 2025
Recap
• To summarize performance measurements, we can use arithmetic
means, weighted arithmetic means & geometric means
• Each has its pros and cons

• Cost depends on yields


• Power & energy
• Static power (leakage current) vs. dynamic power
• Use power gating to reduce static power, and dynamic
voltage/frequency scaling (DVFS) to reduce dynamic power
• Dependability – mean time to failure (MTTF), # of failures in time
(FIT), mean time to repair (MTTR), mean time between failure (MTBF)
• Way to estimate system MTTF using its components’ MTTFs

• Memory system – program locality (temporal & spatial locality), working set (next slide)
Address Profile of a Program
[Figure: memory address vs. time (one dot per memory access). Regions of the plot illustrate temporal locality, spatial locality, the working set, and bad locality behavior.]
Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
Review: Memory Systems and Organizations

• Foundation of memory hierarchy – Program Locality


• What does the physical memory look like?
• What is cache? How does it work?
• What is virtual memory? How does it work?

How is a Single-Bit SRAM Stored?
• SRAM (Static RAM) – used in cache memory
• Feedback circuit (CMOS technology) – 6-transistor cell
• Large cell, fast, more expensive and more power hungry (compared with DRAM)
• Content does not fade with time, as long as there is power
• Constitutes an increasing percentage of total leakage power
(the cache hierarchy is getting bigger in each new generation)
How is a Single-Bit DRAM Stored?
• DRAM (Dynamic RAM) – used in main memory
• One transistor & one capacitor – 1-transistor cell
[Cell structure: a row-select word line, a column-select bit line, and a capacitor that stores the charge (the bit information)]
• Content can fade with time even with power on
• Slower, smaller cell, less expensive and less power hungry (compared with SRAM)
A DRAM Cell
[Figure: DRAM cell with a row-select word line, a column-select bit line, and a capacitor that stores the charge (the bit information)]
• Two problems with the single-transistor DRAM cell:
1. Reads destroy the data - a write-back is needed after each read
2. Data disappears if not accessed for a long time –
needs periodic refreshing (read-then-write) every ~10 ms
Example with a 16-bit Memory – 2D Layout
[Figure: a 16-bit memory organized as a 4 x 4 bit array. The 4-bit address 1001 is split into a 2-bit row address and a 2-bit column address. The row address selects one row of the array, which is read into the sense amplifiers (the row buffer); the column address then selects the requested bit from the row buffer.]
DRAM Security Vulnerability – Row Hammer Attack
• Higher-density DRAM stores less electrical charge in each memory cell, resulting in a lower noise margin
• DRAM cells interact electrically, so they can leak charge and change the content of nearby rows, i.e., cause bit flips
• This electrical property can be triggered by rapidly activating the same memory row(s) repeatedly ("row hammering")

RowHammer:
mov (X), %eax; // read from address X in DRAM
mov (Y), %ebx; // read from address Y in DRAM
clflush (X); // flush cache copy for address X
clflush (Y); // flush cache copy for address Y
mfence; // wait for all mem ops to complete
jmp RowHammer
Row Hammer Attack
[Figure: the same 4 x 4 DRAM bit array with its row buffer. Rapid activation of one or both "attack rows" with certain bit patterns can change the contents of the adjacent "victim row".]
DRAM Security Vulnerability – Row Hammer Attack
• Mitigations for Row Hammer attacks:
• It cannot be mitigated by most existing error-correcting codes,
e.g., single-error-correction, double-error-detection (SECDED) codes,
because multiple bits can be flipped at a time.
• Counter-based row refreshing – refresh an adjacent row if its
neighboring rows have been activated more than a certain number of
times within a short period.
• Intel Ivy Bridge and Xeon used pseudo target-row refresh
(pTRR) with compliant DDR3 to defend against Row Hammer attacks
• Some DDR4 chips use TRR with a preset maximum activation
count (MAC) and maximum activation window (tMAW) to
refresh potential victim rows
• For more info: https://en.wikipedia.org/wiki/Row_hammer
Organization of DRAM Main Memory
[Figure 2.3: Internal organization of a DRAM, showing banks, rows, and row buffers.]
• Modern DRAMs are organized in banks, up to 16 for DDR4, to increase bandwidth.
• Each bank consists of a series of rows. Sending an ACT (Activate) command opens a
bank and a row, and loads the row into a row buffer.
• When a row is in the buffer, it can be transferred by successive column addresses at whatever
the width of the DRAM is (typically 4, 8, or 16 bits in DDR4), or by specifying a block transfer and
a starting address.
• The PRE (Precharge) command closes the bank and row, and readies them for a new access.
• Each command, as well as block transfers, is synchronized with a clock (in SDRAM).
• The row and column signals are sometimes called RAS (row-access strobe) and CAS
(column-access strobe), based on the original names of these signals.
Memory Technology
• Amdahl's Law (sometimes referred to as Amdahl's 2nd Law)
• Memory bandwidth should grow linearly with processor speed
• Unfortunately, memory capacity and speed have not kept pace
with processors

• Some optimizations:
• Multiple accesses to the same row (i.e., block transfer)
• Synchronous DRAM (SDRAM)
• Added a clock to the DRAM interface
• Burst mode with critical-word-first
• Double Data Rate (DDR) – use both the rising and falling edges of the clock
• Multiple banks on each DRAM device
• Wider interfaces
• SIMM (single in-line memory module) vs. DIMM (dual in-line memory module)
Memory Optimizations
• DDR (Double-Data Rate DRAM)
• DDR2
• Lower power (2.5 V -> 1.8 V)
• Higher clock rates (266 MHz, 333 MHz, 400 MHz)
• DDR3
• 1.5 V
• 800 MHz
• DDR4
• 1-1.2 V
• 1.333 GHz
• DDR5
• 1.1 V
• 2 – 4 GHz
3-D High-Bandwidth Memory

Figure 2.7 Two forms of die stacking.


• xPU can be CPU, GPU, TPU, etc.
• 2.5D form is available now.
• 3D stacking is under development and faces heat management
challenges due to xPU.
High-Bandwidth Memory (HBM)
HBM2 has 8 dies per stack, 2 GT/s (2 billion transfers per second), 1024-bit
wide access, 256 GB/s per package, 8 GB capacity per package

For more details:
https://en.wikipedia.org/wiki/High_Bandwidth_Memory#/media/File:High_Bandwidth_Memory_schematic.svg
DRAM Performance – Memory Wall
[Figure 2.2: With 1980 performance as a baseline, the gap between processor performance (growing ~60%/year) and DRAM latency, i.e., the processor-memory performance gap (growing ~50%/year), plotted over time.]
• The vertical axis is on a logarithmic scale to capture the size of the processor-DRAM performance (latency) gap.
• The memory baseline is 64 KiB DRAM in 1980, with a 1.07x per year improvement in latency.
• The processor line assumes a 1.25x improvement per year until 1986, 1.52x until 2000, 1.20x between 2000 and 2005,
and only small improvements between 2005 and 2015.
• Until 2010, memory access times in DRAM improved slowly; since 2010 the improvement in access time
has slowed compared with earlier periods, but bandwidth has continued to improve.
• In mid-2017, AMD, Intel and Nvidia all announced chip sets using versions of HBM technology.
Non-volatile Memory (NVM) - Flash Memory
• A type of EEPROM (Electrically Erasable Programmable Read-Only Memory)
• Two types: NAND (denser) and NOR (faster)
• NAND Flash is the most popular today
• NAND read operation:
• Reads are sequential and read an entire page (0.5 to 4 KB)
• 25 μs for the first byte, 40 MB/s for subsequent bytes (~2018)
• SDRAM: 40 ns for the first byte, 4.8 GB/s for subsequent bytes
• A 2 KB transfer: ~75 μs vs. ~500 ns for SDRAM, i.e., about 150x slower (see the sketch below)
• Still 300x to 500x faster than magnetic disk
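A quick back-of-the-envelope check of the 2 KB transfer comparison above, as a C sketch using only the first-byte latencies and streaming bandwidths quoted on the slide:

#include <stdio.h>

int main(void) {
    double bytes = 2048.0;
    /* NAND Flash: 25 us to the first byte, then ~40 MB/s  */
    double nand_s  = 25e-6 + bytes / 40e6;
    /* SDRAM: 40 ns to the first byte, then ~4.8 GB/s      */
    double sdram_s = 40e-9 + bytes / 4.8e9;

    printf("NAND : %.1f us\n", nand_s * 1e6);     /* ~76 us                      */
    printf("SDRAM: %.0f ns\n", sdram_s * 1e9);    /* ~470 ns                     */
    printf("ratio: ~%.0fx\n",  nand_s / sdram_s); /* roughly the ~150x quoted    */
    return 0;
}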
NAND Flash Memory
• NAND write operation
• Blocks must be erased before being written
• Nonvolatile; can use as little as zero power
• Limited number of write cycles (~100,000)
• Use wear-leveling techniques
• Cost effective - ~$2/GiB, compared to $20-40/GiB for
SDRAM and ~$0.09/GiB for magnetic disk (~2018)
NVM - Phase-Change Memory (PCM)
• Possibly a 10x improvement in write performance and a 2x
improvement in read performance over NAND Flash
• Other memory technologies - memristor memories
• How NVMs should be used in a memory hierarchy design is
still an active research issue:
• As a "disk cache" between main memory (DRAM) and a
solid-state disk (SSD), or
• Integrated with DRAMs to form the main memory.
Review: Memory Systems and Organizations

• Foundation of memory hierarchy – Program Locality


• What does the physical memory look like?
• What is cache? How does it work?
• Check Appendix B for a review of cache memory
fundamentals
• What is virtual memory? How does it work?

Cache Memory
• Cache:
• A smaller, faster storage device that acts as a staging area for
a subset of the data (the working set) in a larger, slower next-level
memory device.
• Fundamental idea of a memory hierarchy:
• The faster, smaller memory at level k serves as a cache for the
larger, slower memory at level k+1.
• Why do memory hierarchies work?
• Programs tend to access level-k memory more often than
they access level-(k+1) memory.
• Storage at level k+1 is slower, larger and cheaper per bit.
• Net effect: if it works well, a multi-level memory hierarchy can:
• cost as much as the cheap storage near the bottom,
• but serve data at the rate of the fast storage near the top.
Cache Memory - Terminology
• Hit: Data is in the cache when accessed
• Hit Rate: Fraction of memory accesses found in the cache
• Hit Time: Time to access the cache, which consists of
cache access time + time to determine hit/miss
• Miss: Data is not in the cache; it must be retrieved from the next lower level
• Miss Rate = 1 - (Hit Rate)
• Miss Penalty: Time to fetch a missed block from the next level
• Access time: time to access the next level = f(latency to next level)
• Transfer time: time to transfer a block = f(BW between levels)
• Average memory-access time (AMAT, in ns or clock cycles)
= Hit time + Miss rate x Miss penalty
Cache Inclusion Property
• Inclusive:
• A cache hierarchy is inclusive if, whenever a cache block is present in an
upper level, a copy also exists in all of its lower levels
• Exclusive:
• A cache hierarchy is exclusive if a cache block can exist in
only one of the cache levels
• Non-inclusive:
• A cache hierarchy is non-inclusive if a cache block may
exist in some, but not necessarily all, levels of the cache hierarchy
• The "exclusiveness" of cache memory has recently been exploited for
cache side-channel attacks on shared caches (mostly
the last-level cache).
Inclusive/Exclusive/Non-inclusive Caches
[Figure: data flow between L2 and the LLC, with numbered arrows:]
1. Data returned from an L2/L3 cache miss
2. Returned data stored into the LLC (arrow 2) and into L2 (arrow 5)
3. A clean replaced cache line is dropped (arrow 3)
4. A dirty replaced cache line needs a write-back (arrow 4)
5. Cache hit in the LLC (arrow 6)
6. LLC cache miss (arrow 7)
7. Back-invalidation due to inclusiveness (arrow 8)
8. Side-channel attacks are possible in an inclusive LLC
• In a non-inclusive or exclusive hierarchy, an eviction from the L3/LLC does not incur
back-invalidation (no back-invalidation from the LLC) -> removes that side channel
• But there are other attacks on non-inclusive and exclusive caches
Reference: Flexclusion: Balancing Cache Capacity and On-Chip
Bandwidth via Flexible Exclusion, J. Kim, et al., ISCA 2012
Review - The 1,2,3,4 of Caching

1. Where can a block be placed in a cache?


2. How is a block found if it is in a cache?
3. Which block should be replaced on a miss?
4. What happens on a write?

Caching: A Simple Example
• Assume a DRAM main memory with 16 blocks (block addresses 0000-1111) and an
SRAM cache with 4 blocks, 1 word per block (cache block ↔ cache line)
• Q1: Where to place a cache block?
• Cache index = (block address) modulo (# of blocks in the cache)
• Use the 2 low-order address bits as the cache index
• Use the 2 high-order address bits as the tag
• Q: Is the cache block there?
• Compare the cache tags
• Check the valid bit
(A small sketch of this mapping follows.)
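For the toy 16-block memory and 4-block direct-mapped cache above, the placement rule can be written directly in C. This is just a sketch of the mapping described on the slide; a real lookup would also compare the stored tag and check the valid bit.

#include <stdio.h>

#define CACHE_BLOCKS 4   /* 4-block cache -> 2-bit index */

int main(void) {
    for (unsigned block_addr = 0; block_addr < 16; block_addr++) {
        unsigned index = block_addr % CACHE_BLOCKS;   /* 2 low-order bits  */
        unsigned tag   = block_addr / CACHE_BLOCKS;   /* 2 high-order bits */
        printf("block %2u (0b%u%u%u%u) -> index %u, tag %u\n",
               block_addr,
               (block_addr >> 3) & 1, (block_addr >> 2) & 1,
               (block_addr >> 1) & 1, block_addr & 1,
               index, tag);
        /* A hit requires the valid bit to be set and the stored tag to match. */
    }
    return 0;
}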
Set-Associative Cache (4-way)
• 2^8 = 256 sets, each with 4 ways (each way holding one block)
• The 32-bit address is split into a 22-bit tag, an 8-bit set index, and a 2-bit byte offset
• All 4 ways of the selected set are checked in parallel (valid bit + tag compare);
a 4:1 multiplexer selects the data from the hitting way
• Conflict sets: cache blocks mapped to the same set, i.e., blocks with the same
index bits; conflict sets are exploited in Evict & Reload cache side-channel attacks
(A lookup sketch for this organization follows.)
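Below is a hedged C sketch of a lookup in the 4-way, 256-set organization of the figure (22-bit tag, 8-bit index, 2-bit byte offset). The structure and function names are invented for illustration; in hardware all four ways are compared in parallel rather than in a loop.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define SETS 256
#define WAYS 4

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[SETS][WAYS];

/* Returns true on a hit; on a hit, *data receives the cached word. */
static bool lookup(uint32_t addr, uint32_t *data) {
    uint32_t index = (addr >> 2) & 0xFF;     /* 8-bit set index   */
    uint32_t tag   = addr >> 10;             /* remaining 22 bits */
    for (int w = 0; w < WAYS; w++) {         /* check all 4 ways of the set */
        if (cache[index][w].valid && cache[index][w].tag == tag) {
            *data = cache[index][w].data;    /* 4:1 way select    */
            return true;
        }
    }
    return false;                            /* miss: fetch from the next level */
}

int main(void) {
    cache[0x12][2] = (struct line){ true, 0x00ABC, 42 };
    uint32_t addr = (0x00ABCu << 10) | (0x12u << 2);  /* reconstruct a matching address */
    uint32_t d = 0;
    bool hit = lookup(addr, &d);
    printf("hit=%d data=%u\n", hit, d);
    return 0;
}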
Announcement 9/17/2024
• Homework #1 will be issued either later today or
tomorrow on Canvas
• Due Thursday 10/10/24. No late submission allowed

• Draft ppt slides available before each lecture on Canvas,


an updated version also posted after the lecture.
• Reminder – Deadline for term project proposals is Thur
9/26/24
• There are GPUs and multicores in CSE labs, if you plan
to do GPU- or multicore-related term projects.
• Kartik will give a quick tutorial on the architecture
simulators and performance measurement tools/support
available on machines – for related term projects
Recap
• Memory technologies:
• Static RAMs (SRAM), dynamic RAMs (DRAM), High-
bandwidth memory (HBM), non-volatile memory (NVM),
• “Memory wall” – huge latency gap between CPU & memory
• Organization of DRAMs, and potential row-hammer attacks
and defenses.
• Cache memory hierarchy (L1, L2, LLC) – a review
• Direct-mapped cache vs. Set-associative cache.

• Inclusion property in cache memory hierarchy – inclusive,


exclusive and non-inclusive
The 1,2,3,4 of Caching
1. Where can a block be placed in a cache?
2. How is a block found in a cache?
3. Which block should be replaced on a miss?
4. What happens on a write?

Associative Block Replacement
• Ideally, replace the block that will be accessed furthest in the future
• Belady's algorithm (OPT) – uses "future" access information to
implement the replacement policy – an optimal replacement algorithm
• Can't be implemented – the information is available only after program execution

• Approximations (see the sketch after this list):
• FIFO, Least/Most recently/frequently used — LRU/MRU/LFU/MFU
• Attempt to optimize for temporal locality
• Tree pseudo-LRU algorithms (used in the 486, PowerPC G4)
• Not-Most-Recently-Used — NMRU
• Track the MRU block, randomly select among the others; a good compromise
• Random: nearly as good as LRU, simpler (usually pseudo-random)
• Set-dueling strategies – very effective, but need hardware support
• Choose multiple sample sets, each with a different replacement policy
• Choose the winner during the tracking period
• Anomalies can occur in the last-level cache (LLC) due to inclusive cache memory
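As one concrete software-level illustration, the hedged sketch below implements true LRU for a single 4-way set by stamping each way with a counter on every access and evicting the smallest stamp. Real hardware typically uses cheaper approximations such as tree pseudo-LRU; the names here are invented for illustration.

#include <stdint.h>
#include <stdio.h>

#define WAYS 4

struct way { int valid; uint32_t tag; uint64_t last_used; };

static uint64_t now;   /* global access counter used as a timestamp */

/* Access 'tag' in one set; returns the way used (hit, empty way, or LRU victim). */
static int access_set(struct way set[WAYS], uint32_t tag) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {       /* hit: refresh timestamp */
            set[w].last_used = ++now;
            return w;
        }
        if (!set[w].valid ||
            set[w].last_used < set[victim].last_used)
            victim = w;                                /* track empty or LRU way */
    }
    set[victim] = (struct way){ 1, tag, ++now };       /* miss: replace the LRU way */
    return victim;
}

int main(void) {
    struct way set[WAYS] = {0};
    uint32_t refs[] = { 1, 2, 3, 4, 1, 5 };            /* tag 5 evicts tag 2 (the LRU) */
    for (int i = 0; i < 6; i++)
        printf("tag %u -> way %d\n", refs[i], access_set(set, refs[i]));
    return 0;
}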


Reducing Misses
• Classifying Misses: 4 Cs
• Compulsory—First access to a block is a miss. Also called cold
start misses or first reference misses.

• Capacity—If cache cannot contain all the blocks needed, capacity


misses will occur for blocks being replaced and later retrieved.

• Conflict—If block-placement strategy is set associative or direct


mapped, conflict misses will occur for blocks being discarded and
later retrieved if too many blocks map to the same set. Also
called collision misses or interference misses.

• Coherence – In multi-core systems, blocks can be invalidated by


another cache due to its updates (i.e. writes)
Miss Rate vs. Cache Size for SPEC2000int
[Figure: miss rate as a function of cache size, broken down into capacity misses, conflict misses, and compulsory misses. Ref.: Jason Cantin and Mark Hill]
The 1,2,3,4 of Caching
1. Where can a block be placed in the upper level?
2. How is a block found if it is in the upper level?
3. Which block should be replaced on a miss?
4. What happens on a write?

Write Policies
• Writes are more complicated than reads in cache design
• On reads, data is accessed in parallel with the tag compare in the cache (next slide)
• On writes, 2 steps are needed (i.e., make sure it is a hit before writing)
• Is turn-around time important for writes?
• Cache optimizations often defer writes in favor of reads

• Choices of write policies
• On write hits, update the next-level cache/memory?
• Yes: write-through
(+ no coherence issue, + immediate observability,
- uses more cache bandwidth)
• No: write-back
• On write misses, allocate a cache block frame?
• Yes: write-allocate
• No: no-write-allocate
• When does the system consider a write "completed/committed"?
Review: Set-Associative Cache (4-way)
[Figure repeated from the earlier set-associative cache slide: 2^8 = 256 sets, each with 4 ways; 22-bit tag, 8-bit set index, 2-bit byte offset; a 4:1 multiplexer selects the hitting way. Conflict sets - blocks mapped to the same set (same index bits) - are exploited in Evict & Reload cache side-channel attacks.]
Write Policies (Cont.)
Write-through
• Update the next-level cache/memory on every write
• Keeps the next-level cache/memory up to date
• No impact on the cache miss rate

Write-back
• Many cache lines are only read and never written to
• Update memory only on cache block replacement
• Add a "dirty" bit to the status word
• Initially cleared when the block is brought in after replacement
• Set when the block frame is written to
• Write back only dirty blocks; "drop" clean blocks

Bandwidth used per reference to the next cache level = MissRate x f_dirty x B (block size in bytes)
Example: 0.05 x 1/2 x 4 = 0.1 bytes/reference
Announcement 9/18/2024
• Homework #1 has been issued on Canvas
• Due Thursday 10/10/24. No late submission allowed

• Reminder – Deadline for term project proposals is Thur


9/26/24.
Recap
• Kartik gave a short tutorial on simulators and performance
measurement tools - welcome to use them for your term
projects.
• Cache replacement policies (LRU, MRU, LFU, ….) and some
anomalies due to interaction between different cache levels
Write/Store Buffers in CPU
[Figure: CPU → Write Buffer → Cache]
Write buffer in the CPU:
• Buffers CPU writes/stores to the cache
• Allows subsequent CPU reads to proceed (reads can bypass writes)
• Stall only when the write buffer becomes full
• What happens on dependent loads/stores? Should a dependent load get its
data from the write buffer?
• When is a write considered committed/completed?
• After it is written into the write buffer? (fast)
• After it is written into the cache? (slower)
• After it is written into main memory? (very slow)
• After the write becomes visible to all other cores? (memory consistency model)
Write-Back Buffers (for Replaced Cache Lines)
[Figure: Cache → Writeback Buffer → next-level Cache/Memory]
• Sits between a write-back cache and the next-level cache/memory:
1. Move the replaced dirty block to the buffer
2. Fetch the new line
3. Move the replaced data to memory
• Usually only 1 or 2 write-back buffer entries are needed
Cache Performance
• Miss-oriented approach to memory access:

CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime
        = IC x (CPI_Execution + MemMisses/Inst x MissPenalty) x CycleTime

IC: Instruction Count (instructions executed)
MemAccess: number of accesses to memory, i.e., (#loads + #stores)
MemMisses: number of cache misses

Average Memory Access Time (AMAT)
  = HitTime + MissRate x MissPenalty
  = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst) +
    (HitTime_Data + MissRate_Data x MissPenalty_Data)
(the instruction and data terms weighted by their fractions of all memory accesses)
How Do Cache Misses Impact Performance?
Example (P → L1 → L2 → Memory):
• L1 access time is 1 cycle, miss rate 5%
• L2 access time is 10 cycles, miss rate 20%
• Memory access time is 300 cycles
• What is the "real/effective" penalty for an L1 cache miss,
i.e., what is the AMAT for the L2 cache?
10 cycles + 20% x 300 cycles = 70 cycles
• What is the AMAT for the L1 cache?
1 cycle + 5% x 70 cycles = 4.5 cycles
• If we reduce the L1 miss rate to 4% (i.e., a 1% reduction), we
get 1 + 4% x 70 = 3.8 cycles, i.e., (4.5 - 3.8)/4.5 = ~15%
improvement in average memory latency ==>
a 1% improvement in the L1 cache miss rate improves average
memory access time by ~15%, which is very significant!
(A small sketch of this calculation follows.)
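The same two-level AMAT calculation as a short C sketch, using only the numbers from the example above:

#include <stdio.h>

/* AMAT = HitTime + MissRate * MissPenalty, applied level by level. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    double mem      = 300.0;                       /* memory access, cycles   */
    double amat_l2  = amat(10.0, 0.20, mem);       /* 10 + 0.20*300 = 70      */
    double amat_l1  = amat(1.0, 0.05, amat_l2);    /* 1 + 0.05*70  = 4.5      */
    double amat_l1b = amat(1.0, 0.04, amat_l2);    /* 1 + 0.04*70  = 3.8      */

    printf("AMAT(L2) = %.1f cycles\n", amat_l2);
    printf("AMAT(L1) = %.1f cycles\n", amat_l1);
    printf("Improvement from 5%% -> 4%% L1 miss rate: %.0f%%\n",
           100.0 * (amat_l1 - amat_l1b) / amat_l1); /* ~15% */
    return 0;
}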
How to Improve Cache Performance?

Average Memory Access Time (AMAT) =


HitTime + MissRate x MissPenalty

• How to improve cache performance?


1. Reduce MissRate, or
2. Reduce the MissPenalty, or
3. Reduce HitTime to cache.
• We will look at various optimization
techniques based on above strategies later
Review: Memory Systems and Organizations

• Foundation of memory hierarchy – Program Locality


• What does the physical memory look like?
• What is cache? How does it work?
• Check Appendix B for a review of cache memory
fundamentals
• What is virtual memory? How does it work?

Typical Memory Hierarchy – Virtual Memory
[Same memory-hierarchy figure as before (registers, cache, main memory, disk, tape), now highlighting the Main Memory ↔ Disk boundary: the OS moves pages (4 KB-1 MB) between DRAM and disk.]
Virtual Memory
[Figure: CPU registers ↔ Cache ↔ Main Memory ↔ I/O Device; caching operates between the registers, cache and main memory, while virtual memory operates between main memory and the I/O (disk) level.]
• Programmer perspective
• Theoretically, each program/process has a uniform 2^64 memory
address space (for 64-bit architectures with 64-bit memory
addresses). On x86-64, 48 bits of the virtual address are used; some
processors support 57 bits.
• In hardware/reality
• There is only one, relatively smaller, physical memory shared by all
active programs/processes being executed
• In a multi-programming system, virtual memory (VM) provides
each process with the illusion of a large, private, uniform memory
Virtual Memory
What is virtual memory (VM)?
• A technique that allows execution of a program (a process) that
• can reside in non-contiguous physical memory locations
• only partially resides in main memory (i.e., its working set)
• Allows programmers and compilers to assume that
• memory has a fixed size and is contiguous
• the memory space is much larger than the real physical memory

Why is VM important?
• Cheaper - no longer have to buy lots of DRAM
• Removes the burden of memory management from programmers
• Before VM, a programmer needed to write two programs: the
application program itself, and a program to manage memory (overlays)
• Enables multiprogramming, time-sharing, protection, etc.
Two Parts to Modern VM
• Part A: Protection
• Each process sees a large, contiguous memory segment
• Each process’s memory space is private, i.e. isolated and protected
from access by other processes

• Part B: Demand Paging


• Have a capacity of secondary memory (swap space on disk) at the
speed of main memory (DRAM).
• A mechanism to share limited physical/main memory among processes

• Based on a common HW mechanism: address translation/mapping


• User process operates on “virtual” or “effective” addresses
• HW translates from virtual to physical address (i.e., main memory
address) on each reference
• Control which physical locations can be accessed by a process

• Allow dynamic relocation of physical backing store (DRAM vs. HD)


• VM HW and memory management policies controlled by OS
Virtual Memory with Paging
[Figure: virtual memory (VM) with virtual addresses on one side and physical memory (PM) with physical addresses on the other; address translation maps a contiguous virtual address space onto a non-contiguous set of physical pages, with some pages held at disk addresses.]
• The address space is partitioned into pages, similar to cache blocks/lines
• Pages are assigned/allocated/mapped from VM to PM by the OS at runtime
• Physical memory is fully associative (i.e., NO conflict misses)
Page Table: Support of Fully-Associative Main Memory in SW
• Partition the memory space into bigger chunks called pages
• Only one page table entry (PTE) per page
• For a 4 KB page, only one page table entry is needed for every 1K (4-byte) words
• Typical page sizes are 4 or 8 KB; larger page sizes can be used to
reduce the number of VA -> PA translation page table entries
• IBM Power 5 has 4 page sizes: 4 KB, 64 KB, 16 MB and 64 GB
• Address translation implies placement of physical pages in RAM

Virtual address: bits 31-12 are the virtual page number, bits 11-0 are the page offset.
The virtual page number is translated via the page table; the page offset remains the same (not translated).
Physical address: the physical page number (upper bits) concatenated with the unchanged page offset.
Page Table
[Figure: the virtual page number (bits 31-12) indexes the page table; each entry in the page table is called a Page Table Entry (PTE) and holds a valid bit and a physical page number. The page offset (bits 11-0) determines the page size (12 bits -> 2^12 bytes per page) and is not translated; the selected physical page number is concatenated with the page offset to form the physical address.]
Page Table - Example
1) Break the VA into a virtual page number and a page offset
2) Copy the page offset to the physical address
3) Use the virtual page number as an index into the page table
3a) Check the valid bit
4) Copy the physical page number from the page table into the physical address

Example: VA = 0x023076A4
• Virtual page number = 0x02307, page offset = 0x6A4
• Page table (valid PTEs hold a physical page number and a dirty bit):
  PTE 0x00000 -> 0x001
  PTE 0x00001 -> 0x005
  ...
  PTE 0x02306 -> 0x004
  PTE 0x02307 -> 0x0C2
  ...
  PTE 0xFFFFF -> on DISK
• Physical address = physical page number 0x0C2 concatenated with page offset 0x6A4
  => PA = 0x0C26A4
(A small translation sketch in C follows.)
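The four translation steps above can be expressed as a short C sketch. The page-table array is a toy stand-in (a real PTE would also carry valid, dirty, and permission bits), but the numbers match the example.

#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12                       /* 4 KB pages */

/* Toy page table: index = virtual page number, value = physical page number. */
static uint32_t page_table[0x100000];      /* 2^20 entries for 32-bit VAs */

static uint32_t translate(uint32_t va) {
    uint32_t vpn    = va >> PAGE_BITS;     /* 1) split off the VPN          */
    uint32_t offset = va & 0xFFF;          /* 2) page offset is unchanged   */
    uint32_t ppn    = page_table[vpn];     /* 3) index the page table       */
    return (ppn << PAGE_BITS) | offset;    /* 4) form the physical address  */
}

int main(void) {
    page_table[0x02307] = 0x0C2;           /* mapping from the example      */
    printf("PA = 0x%06X\n", translate(0x023076A4));   /* prints 0x0C26A4    */
    return 0;
}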


What Happens if Page is not in RAM?
• How do we know it’s not in main memory (RAM)?
• PTE’s valid bit is set to INVALID (i.e. the page is on disk)

• What do we do?
• Hardware asks OS to fetch the page from disk - a page fault

• To accommodate a page, OS must evict a page from RAM if it is full


• The page to be evicted is called the victim page
• If the victim page is dirty, write the page back to update disk
• Only data pages can be dirty

• OS then reads the requested page from disk


• OS changes the page table to reflect the new mapping
• Hardware restarts at the faulting virtual address

Translation Look-aside Buffer (TLB)
• Essentially a small cache of recent address translations
• Avoids going to the page table in memory on every reference
• Indexed by the lower bits of the VPN (virtual page number)
• Tag = the remaining (unused) bits of the VPN + process ID
• Data = the PPN (physical page number) and access permissions
• Status = valid, dirty
• The usual cache design choices (placement, replacement policy, multi-level, etc.) apply here too
• On a hit, the physical page number is concatenated with the page offset to form the physical address
• Question: what should be the relative sizes of the ITLB and the I-cache?
(A minimal lookup sketch follows.)
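A minimal, hedged sketch of a direct-mapped TLB lookup consistent with the description above; the entry count and field splits are invented for illustration, and the process-ID part of the tag is omitted for brevity.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define TLB_ENTRIES 64                     /* assume a 64-entry, direct-mapped TLB */
#define PAGE_BITS   12

struct tlb_entry { bool valid; uint32_t tag; uint32_t ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and fills *pa; on a miss, the page table
   (or a hardware walker) must be consulted and the TLB updated. */
static bool tlb_lookup(uint32_t va, uint32_t *pa) {
    uint32_t vpn   = va >> PAGE_BITS;
    uint32_t index = vpn % TLB_ENTRIES;    /* indexed by the low VPN bits         */
    uint32_t tag   = vpn / TLB_ENTRIES;    /* remaining VPN bits form the tag     */
    if (tlb[index].valid && tlb[index].tag == tag) {
        *pa = (tlb[index].ppn << PAGE_BITS) | (va & 0xFFF);
        return true;
    }
    return false;
}

int main(void) {
    uint32_t vpn = 0x02307, pa = 0;
    tlb[vpn % TLB_ENTRIES] =
        (struct tlb_entry){ true, vpn / TLB_ENTRIES, 0x0C2 };
    bool hit = tlb_lookup((vpn << PAGE_BITS) | 0x6A4, &pa);
    printf("hit=%d PA=0x%06X\n", hit, pa);   /* hit=1 PA=0x0C26A4 */
    return 0;
}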
Intel Xeon Phi TLB Parameters
[Table: Intel Xeon Phi TLB parameters; see WikiChip]
Virtual to Physical Address Translation
• Virtual address -> TLB lookup (≤ 1 cycle)
• TLB hit -> protection check (≤ 1 cycle)
• Permitted -> physical address, sent to the cache
• Denied -> protection fault (handled by the OS)
• TLB miss -> page table walk by HW or SW (100's of cycles)
• Walk succeeds -> update the TLB and retry
• Walk fails -> page fault -> OS table walk / page-in from disk (100,000's of cycles)
A Two-Level Page Table Hierarchy
(Slide from: Computer Systems: A Programmer's Perspective, by Bryant and O'Hallaron)
• 32-bit addresses, 4 KB pages, 4-byte PTEs: 2^32 = 2^10 x 2^10 x 2^12,
so a virtual address splits into VPN1 (10 bits) | VPN2 (10 bits) | VPO (12 bits)
• Each level-1 PTE covers a 4 MB chunk of virtual memory and points to a level-2
page table; a null level-1 PTE means that chunk is unallocated and needs no level-2 table
• Example virtual memory layout:
• Level-1 PTEs 0 and 1 -> 2K allocated VM pages (VP 0 ... VP 2047) for code and data
• Level-1 PTEs 2-7 are null -> a gap of 6K unallocated VM pages
• Level-1 PTE 8 -> a level-2 table with 1023 null (unallocated) PTEs and 1 allocated
VM page for the stack (VP 9215)
• Level-1 PTEs 9-1023 are null (unallocated pages)
(A sketch of the address split and two-level walk follows.)
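A hedged sketch of the address split and two-level walk described above (32-bit VAs, 4 KB pages, 1024-entry tables). The structures and mapping are invented for illustration only.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* 32-bit VA = VPN1 (10 bits) | VPN2 (10 bits) | VPO (12 bits) */
#define VPN1(va) (((va) >> 22) & 0x3FF)
#define VPN2(va) (((va) >> 12) & 0x3FF)
#define VPO(va)  ((va) & 0xFFF)

typedef struct { uint32_t ppn; int valid; } pte_t;

/* Level-1 table: each entry points to a level-2 table, or is NULL when the
   corresponding 4 MB region is unallocated (so no level-2 table exists). */
static pte_t *l1_table[1024];

static int translate(uint32_t va, uint32_t *pa) {
    pte_t *l2 = l1_table[VPN1(va)];
    if (!l2 || !l2[VPN2(va)].valid) return -1;        /* page fault */
    *pa = (l2[VPN2(va)].ppn << 12) | VPO(va);
    return 0;
}

int main(void) {
    /* Map virtual page 0 (VPN1 = 0, VPN2 = 0) to physical page 0x123. */
    pte_t *l2 = calloc(1024, sizeof *l2);
    l2[0] = (pte_t){ 0x123, 1 };
    l1_table[0] = l2;

    uint32_t pa;
    if (translate(0x00000ABC, &pa) == 0)
        printf("PA = 0x%06X\n", pa);                  /* prints 0x123ABC */
    free(l2);
    return 0;
}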


Summary
• Virtual memory
• Gives the illusion of a LARGE physical main memory (RAM),
even if you have LESS real RAM
• Memory is divided into chunks called pages. "Live" pages are in
physical RAM; pages that don't fit are on the disk.
• Hardware translates the virtual address (large address
space) into the physical address (real RAM address).
This allows a page to be placed anywhere (fully associative).
• Special translation mechanisms (as in a cache) and a
special lookup structure called a page table are used to do this
translation.
• Unlike a Level-1 cache, the page table is usually too big to
live entirely on the CPU chip itself; a special structure, the TLB,
caches recent translations to handle this.
