
Carnegie Mellon

The Memory Hierarchy


N. Navet - Computing Infrastructure 1 / Lecture 5

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 1


Carnegie Mellon

The memory hierarchy


 The Virtual Memory system: address translation, MMU, memory pages
 The stack & the heap
 How arguments and return values are passed between functions calling each other
 Caching in the memory hierarchy
 Temporal & spatial locality
 How CPU caches operate
 Direct-mapped caches
 E-way set associative caches
 Fully associative caches
 Cache writes
 Cache performance metrics
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 2

Virtual Memory,
the Stack and the Heap

“Virtual memory is an abstraction that provides each process with the illusion that it has exclusive use of the main memory. Each process has the same uniform view of memory, which is known as its virtual address space.”

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.


Virtual memory: a key abstraction provided by OSes, which requires some HW support

“Physical addressing”: the CPU uses physical addresses to access memory. Here a 4-byte word starting at address 4 is transferred.

Modern CPUs use virtual addressing

“Virtual addressing”: the CPU works with virtual addresses, which are converted on the fly into physical addresses (“address translation”) by the Memory Management Unit (MMU) using a lookup table managed by the OS.

Some benefits of virtual memory: 1) the MMU protects the address space of each process from corruption by other processes; 2) each process can have an identical virtual address space.
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
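
Conceptually, address translation boils down to the following toy sketch (a hypothetical single-level page table with made-up contents; a real MMU walks a multi-level table set up by the OS):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096   /* 4 kB pages, as under Linux */
#define PAGE_BITS 12     /* log2(PAGE_SIZE) */
#define NUM_PAGES 8      /* toy address space: 8 virtual pages */

/* Hypothetical page table: virtual page number -> physical page
   number, -1 = page not resident in RAM. */
static const int page_table[NUM_PAGES] = { 3, 7, -1, 0, -1, 5, -1, 1 };

uint64_t translate(uint64_t vaddr) {
    uint64_t vpn    = vaddr >> PAGE_BITS;       /* virtual page number  */
    uint64_t offset = vaddr & (PAGE_SIZE - 1);  /* offset within a page */
    int ppn = page_table[vpn];
    if (ppn < 0) {
        /* page fault: the OS would bring the page in from disk here */
        fprintf(stderr, "page fault at 0x%llx\n", (unsigned long long)vaddr);
        return (uint64_t)-1;
    }
    return ((uint64_t)ppn << PAGE_BITS) | offset;  /* physical address */
}

int main(void) {
    /* virtual address 0x1ABC = VP 1, offset 0xABC -> PP 7 -> 0x7ABC */
    printf("0x%llx\n", (unsigned long long)translate(0x1ABC));
    return 0;
}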
Linux Virtual Memory System: the virtual memory of a process

✓ Read-only code and read-only data (i.e., constants): begin at the same fixed address for each process
✓ Read-write data: global variables
✓ Heap: area for dynamic memory in the user’s program (size varies at run-time)
✓ Stack: memory used for function calls in the user’s program and local variables (size varies at run-time)

Each process only knows its own virtual address space – every address used by the processor is translated (by the MMU) to/from the virtual address space.

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.


How Linux organizes virtual memory – out of the scope of the lecture

The kernel maintains a distinct task structure (task_struct) for each process in the system.

The Text segment (aka the Instruction or Code segment) contains the executable program code and constant data.
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Carnegie Mellon

“Memory page”: smallest unit of memory allocated to a process

RAM is allocated to processes (by the kernel) in “pages”, not in bytes – e.g., the page size under Linux is 4 kB. Below, VP1 stands for Virtual Page #1.

The address translation HW reads the “page table” each time it converts a virtual address to a physical address. “Physical pages” are aka “page frames”; “virtual pages” are recorded in the “page table”.

The OS maintains a separate page table for each process in the system. The VPs of a process can be in physical memory (“cached”) or on disk (“uncached”). Shared pages make it possible to avoid duplicating code in memory, such as the functions of the C library that are used by every program on Linux. Unused pages can be moved to disk by the OS.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 7
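
As a quick check, a program can query the page size at run time via the POSIX sysconf() call (a minimal sketch; prints 4096 on most Linux x86-64 systems):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    long page_size = sysconf(_SC_PAGESIZE);  /* POSIX page-size query */
    printf("page size: %ld bytes\n", page_size);
    return 0;
}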
Focus on the stack

Stack: data structure that stores information about the active subroutines (= functions) of a program. E.g., when DrawSquare() calls DrawLine(): 1) it must pass arguments to DrawLine(); 2) the CPU must know where execution must resume when DrawLine() finishes.

✓ Each stack frame stores the information for a specific function call
✓ Data stored in each frame:
- Local variables of the function
- Arguments passed to the called function
- The return address (in the caller’s code)

The stack: 1) provides memory space for variables local to a function, 2) is a way to pass the arguments to the function, and 3) records where execution should resume upon exit of the function (to set the IPR). [Figure: an upward-growing stack is drawn here.]

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.


Focus on the Heap
◼ The heap is a memory area that stores dynamically allocated memory. In C programs, a block of memory is allocated and de-allocated by malloc() and free(), respectively.

◼ The main reason to use dynamic memory allocation is that we cannot know the sizes and number of certain data structures until the program executes (see the sketch below).

◼ Higher-level languages such as Python and Java rely on garbage collection (GC) to free allocated blocks … but GC might severely slow down user code at times.

◼ Dynamic memory leads to complex fragmentation issues: overall, there is enough available memory, but no single free block is large enough to handle the request.
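
A minimal sketch of the typical use case: the size n is only known at run time, so the block must come from the heap.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int n = (argc > 1) ? atoi(argv[1]) : 10;  /* size known only at run time */
    int *a = malloc(n * sizeof *a);           /* block allocated in the heap */
    if (a == NULL)                            /* allocation can fail */
        return 1;
    for (int i = 0; i < n; i++)
        a[i] = i * i;
    printf("a[%d] = %d\n", n - 1, a[n - 1]);
    free(a);                                  /* return the block to the allocator */
    return 0;
}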
Carnegie Mellon

Procedures (aka functions, sub-routines)

 Transferring control flow to the procedure that is called
▪ At beginning of procedure code
▪ Back to return point
 Passing data
▪ Procedure arguments: through stack (or registers, see later)
▪ Return value: through register
 Memory management
▪ Local variables allocated during procedure execution
▪ Deallocated upon return (stack)
▪ Dynamic memory (e.g. malloc(), new()) is allocated in the heap (automatic deallocation or not depends on language and program)

P(…) {
  •
  •
  y = Q(x);
  print(y)
  •
}

int Q(int i)
{
  int t = 3*i;
  int v[10];
  •
  •
  return v[t];
}
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 10
Carnegie Mellon

MEMORY ALLOCATION
ON LINUX X86-64

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 11


Carnegie Mellon

x86-64 Linux Memory Layout for a process

 Stack
▪ Runtime stack (8MB limit)
 Heap
▪ Dynamically allocated as needed
▪ Through malloc(), calloc(), new() calls
 Data
▪ Statically allocated data (their address is constant)
▪ E.g., global vars, static vars
 Text / shared Libraries
▪ Executable machine instructions + constants
▪ Read-only

[Figure, not drawn to scale: virtual memory addresses as the process sees them – not physical memory. From hex address 00007FFFFFFFFFFF down to 000000: Stack (8MB), Shared Libraries, Heap, Data, Text (starting at 400000).]

The virtual address space is 128TB large(!)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 12
Carnegie Mellon

Focus on the stack on x86-64 platforms

 Region of memory managed with stack discipline
 Grows toward lower addresses
 A dedicated register, %rsp, contains the lowest stack address
▪ the address of the “top” element

[Figure: the stack “bottom” sits at increasing addresses; the stack grows down; the stack pointer %rsp points to the stack “top”.]

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 13


Carnegie Mellon

Stack Frames

 Contents
▪ Return address
▪ Arguments of the function
▪ Local variables
▪ Saved registers
 Management
▪ Space allocated when entering the procedure
▪ Frame pushed by the call instruction
▪ De-allocated when the control flow returns to the calling function (ret instruction)

[Figure: the previous frame sits above the last frame, which lies between the frame pointer %rbp and the stack pointer %rsp, at the stack “top”.]

Why is it necessary to save registers upon a function call?

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 14


Carnegie Mellon

Memory Allocation Example

char big_array[1L<<24];  /*  16 MB */
char huge_array[1L<<31]; /*   2 GB */

int global = 0;

int useless() { return 0; }

int main ()
{
    void *p1, *p2, *p3, *p4;
    int local = 0;
    p1 = malloc(1L << 28); /* 256 MB */
    p2 = malloc(1L << 8);  /* 256 B  */
    p3 = malloc(1L << 32); /*   4 GB */
    p4 = malloc(1L << 8);  /* 256 B  */
    /* Some other statements ... */
}

Where does everything go? [Figure, not drawn to scale: Stack, Shared Libraries, Heap, Data, Text (= code + constants).]

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 15


Carnegie Mellon

x86-64 Example Addresses (address range ~2^47)

local        0x00007ffe4d3be87c
p1           0x00007f7262a1e010
p3           0x00007f7162a1d010
p4           0x000000008359d120
p2           0x000000008359d010
big_array    0x0000000080601060
huge_array   0x0000000000601060
main()       0x000000000040060c
useless()    0x0000000000400590

[Figure, not drawn to scale: the address space from 000000 up to 00007F…, with the Text, Data, Heap, Shared Libraries and Stack regions.]

Note: the memory blocks pointed to by p1, p2, p3 and p4 are in the heap (dynamic memory) but the 4 pointers themselves are in the stack (local variables).
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 16
Carnegie Mellon

Argument passing through registers (out of the scope of the lecture)

 For speed, processors with many registers use them – instead of the stack – to pass arguments to procedures. Intel x86-64 has twice the number of registers of IA32.

 Intel x86-64: arguments (up to the first six) are passed to procedures via registers, rather than on the stack. This eliminates the overhead of storing and retrieving values on the stack.

[Table: C code vs. location of the arguments.]

Variables local to a function may also be allocated to registers (instead of the stack) if space permits.

Args 7 and 8 are on the stack, where %rsp is a register holding the value of the stack pointer.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 18
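
For reference, under the System V AMD64 ABI used by Linux, the first six integer arguments travel in %rdi, %rsi, %rdx, %rcx, %r8 and %r9; a sketch with a hypothetical 8-argument function:

/* On Linux x86-64 (System V ABI), a..f arrive in %rdi, %rsi, %rdx,
   %rcx, %r8 and %r9; g and h are passed on the stack by the caller;
   the return value travels back in %rax. */
long sum8(long a, long b, long c, long d,
          long e, long f, long g, long h) {
    return a + b + c + d + e + f + g + h;
}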
Carnegie Mellon

Return value of a function (advanced)

 Where the return value of a function is written is a convention that depends on the architecture.

 E.g., on Intel x86-64, if the return value of the function is
▪ a 32-bit integer, it is passed between the two functions (caller and called function) in the EAX register (the low 32 bits of RAX),
▪ a 64-bit integer, it is passed in the RAX register (on 32-bit IA32, a 64-bit value is returned in the EDX:EAX register pair),
▪ a floating point value, it is returned in the XMM0 register.

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 19


Carnegie Mellon

CACHING IN THE MEMORY HIERARCHY

✓ The process of using a cache is known as caching
✓ We will focus on the caches inside the CPU – for most CPUs: L1, L2 and L3
✓ There are 3 types of CPU-internal caches: direct-mapped caches, set associative caches and fully associative caches
✓ The cache is organized so that it can find the requested word by simply inspecting the bits of the requested address
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 20
Carnegie Mellon

Caching
 Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
 Information in use is temporarily copied from the slower to the faster storage.
 The fastest storage (cache) is checked first to determine if the information is there
▪ If it is, the information is used directly from the cache (fast)
▪ If not, the data is copied to the cache and used there
 The cache is smaller than the storage being cached
▪ Thus cache size and replacement policy are important problems
 A cache miss is a failed attempt to read / write a piece of data in the cache (the data is not there), which results in a lower-level memory access with much longer latency. Two kinds of cache read misses: instruction read miss and data read miss.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 21
Carnegie Mellon

Typical Memory Hierarchy

The faster, smaller device at level k serves as a cache for the larger, slower device at level k+1. Devices towards the top are smaller, faster, volatile and costlier (per byte); devices towards the bottom are larger, slower and cheaper (per byte).

L0: Registers – CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM) – holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM) – holds cache lines retrieved from the L3 cache. Nb: a cache line holds a memory block.
L3: L3 cache (SRAM) – holds memory blocks retrieved from main memory. Nb: a memory block is a fixed-size packet of data that is transferred to cache.
L4: Main memory (DRAM) – holds disk blocks retrieved from local disks.
L5: Local secondary storage (local disks incl. SSD) – holds files retrieved from disks on remote servers.
L6: Remote secondary storage (e.g., servers, cloud)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 22
Carnegie Mellon

Random-Access Memory (RAM)

 Key features
▪ Volatile memory
▪ Memory can be read/written in any order
▪ The time required to read and write is independent of the location
 Two types of RAM:
▪ SRAM (Static RAM): faster and much more expensive than DRAM, used for caches
▪ DRAM (Dynamic RAM): used for main memory

       Access time   Cost   Applications
SRAM   1x            100x   Cache memories & registers
DRAM   10x           1x     Main memories

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 23


Carnegie Mellon

Caching principles continued

 When the CPU is instructed by a load instruction to read a word from address A of main memory, it sends the address A to the L1 cache. If L1 is holding a copy of the word, it sends the word immediately back to the CPU (“cache hit”). If not (“cache miss”), L2 is requested to provide the data, which is then loaded into L1. If L2 does not have a copy of the word, it requests it from the L3 cache and loads it, etc. When the word is found, it is propagated up the cache hierarchy.
 Nb: registers can serve as a cache, but how they are used is (statically) defined by the compiler, not (dynamically) controlled by the CPU.
 Why do memory caches work?
▪ Because of locality, programs tend to access data already at level k (it has been recently used) more often than they access the data at level k+1.

If the word a program loads is stored in a register, it can be accessed in 0 cycles during the execution of the instruction. If stored in a cache, 1 to 30 cycles. If stored in main memory, 50 to 200 CPU cycles!

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 24


Carnegie Mellon

Migration of data “A” from Disk to Register

[Figure: data “A” migrates from disk to main memory, then through the L1+L2+L3 caches (managed by HW) to a register, where it is then used as an operand of an instruction.]

Two properties / requirements enforced in HW:
 All processes are provided the most recent value, no matter the circumstances
 In particular, multiprocessor/multicore execution platforms ensure cache coherency such that all CPUs have the most recent value in their cache
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 25


Carnegie Mellon

Locality

As a programmer it is important to understand the concept of locality, as programs with good locality – thanks to caching – run faster than programs with poor locality.

 Principle of Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently

 Temporal locality:
▪ Recently referenced items are likely to be referenced again in the near future (e.g., using the same variables several times)

 Spatial locality:
▪ Items with nearby addresses tend to be referenced close together in time (think of iterating over an array)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 26


Carnegie Mellon

Locality Example #1 – good locality

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Good locality!

 Locality wrt data
▪ Reference array elements in succession (the next element is more likely to be already in cache) → spatial locality
▪ Reference variables sum and i at each iteration (both will stay in cache) → temporal locality
 Locality wrt instructions (i.e., code)
▪ Reference instructions in sequence → spatial locality
▪ Cycle through the loop repeatedly → temporal locality

Writing programs so as to maximize locality is key to achieving fast execution – achieving both spatial and temporal locality for all variables may not be feasible; e.g., wrt array a, the code above has good spatial locality but poor temporal locality, as elements are not re-used.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 27
Carnegie Mellon

Locality Example #2 – poor spatial locality

Nb: the int type is 4 bytes long in this example.

[Figure: a function traverses a 2-D array column by column; the memory layout is row-major, so consecutive accesses jump from the 1st row to the 2nd row, and so on.]

A stride-k reference pattern means visiting every kth element of a contiguous data structure – as the stride increases, the spatial locality decreases, and it is less likely that the accessed elements can be read from the cache.

How can the function be rewritten so as to improve spatial locality, i.e., so as to reduce the stride to 1? (See the sketch below.)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 28
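
The function the slide has in mind is presumably something like the classic column-wise sum below (a sketch with hypothetical dimensions M and N); swapping the loops reduces the stride to 1:

#define M 100
#define N 100

/* Stride-N pattern: a[0][0], a[1][0], a[2][0], ... C stores arrays
   row by row, so consecutive accesses are N*4 bytes apart: poor
   spatial locality. */
int sum_array_cols(int a[M][N]) {
    int sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

/* Stride-1 rewrite: elements are visited in the order they are laid
   out in memory, so most accesses hit in the cache. */
int sum_array_rows(int a[M][N]) {
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}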
Carnegie Mellon

Intel Core i7 Cache Hierarchy

[Figure: the processor package contains cores 0 … 3; each core has its registers, an L1 d-cache and an L1 i-cache, and an L2 unified cache; the L3 unified cache is shared by all cores and sits in front of main memory.]

L1 i-cache and d-cache: 32 KB, 8-way, access: 4 CPU cycles
L2 unified cache: 256 KB, 8-way, access: 10 CPU cycles
L3 unified cache: 8 MB, 16-way, access: 40-75 CPU cycles

Block size: 64 bytes for all caches. Caches are partitioned into lines. Transfer between two levels is always done at the granularity of a block, e.g. 64 bytes (which allows taking advantage of spatial locality).
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 29
Carnegie Mellon

Measurement of cache effects: read throughput (in MB/s) on an i7 CPU

If a program reads n bytes over a period of s seconds, the read throughput is n/s, typically expressed in MB/s. The higher, the better.

[Figure: measured read throughput as a function of the working set size, here with stride = 16 bytes; the irregularities are probably caused by other data or code blocks in the cache.]

Working set: the set of memory blocks accessed in a certain “phase” of a program, here the read loop. Here the working set size is the size of the array “data”.

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 30
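
A minimal sketch of the kind of loop being timed (hypothetical; the actual benchmark is more careful about warm-up and timing):

/* Read the array with the given stride and return a running sum so
   the compiler cannot optimize the loop away. Timing many calls and
   dividing the bytes read by the elapsed seconds gives the read
   throughput for this working-set size and stride. */
long read_loop(const long *data, long nelems, long stride) {
    long acc = 0;
    for (long i = 0; i < nelems; i += stride)
        acc += data[i];
    return acc;
}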


Carnegie Mellon

General Cache Concepts

At any point in time, the cache at level k contains copies of a subset of the blocks from level k+1.

[Figure, illustration with a read operation: level k is a smaller, faster memory that caches a subset of the blocks of the immediately lower level in the memory hierarchy – here blocks 4, 8, 9, 10 and 14. Level k+1 is a larger, slower, cheaper memory partitioned into “blocks” numbered 0 to 15. Data is copied back and forth (read/write) between two adjacent levels in block-sized transfer units.]

Unlike in the Intel i7 architecture, the block size may not be uniform across cache levels on some other architectures.

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 32


Carnegie Mellon

General Cache Concepts: Hit

Block 14 is needed: the CPU needs a data object that is in block 14.

Request: block 14
Cache:  [ 8 | 9 | 14 | 3 ]   → Block 14 is in the cache: Hit!
Memory: [ 0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15 ]

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 33


Carnegie Mellon

General Cache Concepts: Miss

Request: block 12 – Block 12 is needed.
Cache:  [ 8 | 9 | 14 | 3 ]   → Block 12 is not in the cache: Miss!

Block 12 is fetched from memory (Request: 12) and stored in the cache, here in place of block 14.
• Replacement policy: determines which block gets evicted (the “victim block”) and replaced by 12.
Memory: [ 0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15 ]

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 34


Carnegie Mellon

Examples of cache replacement policies

 Random replacement policy: choice of the victim block at random
 Least-Recently Used (LRU): choose the block that was last accessed the furthest in the past
 Pseudo-LRU: discards “one of the least recently used” items (based on an approximate measure of age involving less overhead)
 Most Recently Used (MRU): in contrast to LRU, the most recently used item is discarded first
 Hardware may limit what is possible, esp. for caches high in the hierarchy → restrictive placement policy
▪ Blocks at level k+1 are restricted to a small subset of the block positions at level k, possibly to a single location
▪ E.g. block i at level k+1 must be placed in block (i mod 4) at level k.

How to estimate the efficiency of a cache replacement policy for a given program?
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 35
Carnegie Mellon

3 types of Cache Misses

 Cold miss (or compulsory miss)
▪ Cold misses occur because the cache is empty (“cold” cache)
 Capacity miss
▪ Occurs when the set of “active” blocks (the working set) is larger than the cache (see the examples in slide 30)
 Conflict miss
▪ Due to a restrictive placement policy, blocks cannot be cached anywhere in the cache but only at specific locations (sometimes a single location)
▪ Conflict misses occur when multiple data objects all map to the same level k block.
▪ E.g. if block i at level k+1 must be placed in block (i mod 4) at level k, referencing blocks 0, 8, 4, 8, 4, 8, ... would miss every time.
▪ For later: conflict misses would not occur if the cache were fully associative with LRU replacement.

With the above policy (i.e., i mod 4), where would blocks 1, 5, 9, 12 at level k+1 be stored at level k? Use the figure on slide #34.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 36
Carnegie Mellon

General Cache Organization

 A cache is organized as
▪ A number of cache sets (the rows)
▪ Each set consists of one or several cache lines, each containing a cache block (64 bytes is the norm today for the block size, as DDR supports transporting blocks of 64 bytes efficiently)
▪ Each line holds a cache block and some bit fields needed to operate the cache

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 37


Carnegie Mellon

General Cache Organization

 Each cache line contains
▪ A tag: the high-order bits of the main-memory address of the block stored in the line
▪ A valid bit: says whether the line contains meaningful information
▪ The data actually being cached, in the data block

 Direct-mapped cache = one line per set
➔ A given memory address can be cached in only one cache line
 E-way set associative cache = E lines per set
➔ A set of E cache lines must be checked to see if the requested block is present

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 38


Carnegie Mellon

General Cache Organization (S, E, B)

S = 2^s sets; E = 2^e lines per set; B = 2^b bytes per cache block (the data). Nb: 2^x indicates that a quantity is a power of two.

[Figure: the cache drawn as S rows (sets) of E lines; each line holds a valid bit v, a tag, and the data bytes 0, 1, 2, …, B−1.]

Cache size (data only): C = S × E × B bytes

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 39
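
A quick sanity check with the Core i7 L1 d-cache figures from slide 29 (a worked example, assuming those parameters):

#include <stdio.h>

int main(void) {
    int B = 64;            /* bytes per block             */
    int E = 8;             /* lines per set (8-way)       */
    int C = 32 * 1024;     /* total data capacity: 32 KB  */
    int S = C / (E * B);   /* S = C / (E x B) = 64 sets   */
    printf("S = %d sets, check: C = %d bytes\n", S, S * E * B);
    return 0;
}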


Carnegie Mellon

Cache functioning in more detail

 Let’s assume a CPU with a set of registers, an L1 cache (no L2 and L3 caches) and main memory
 When the CPU executes an instruction that reads a memory word w:
▪ It requests w from the L1 cache
▪ If the L1 cache has a cached copy of w, it returns it to the CPU (“hit”)
▪ Otherwise, we have a cache miss – the L1 cache requests a copy of the block containing w from memory
▪ When the requested block arrives, the L1 cache stores the block in one of its cache lines, extracts word w from the stored block and returns it to the CPU

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 40


Carnegie Mellon

Cache Read – often not an entire block is read by the CPU but, e.g., a single variable

1. Locate the set
2. Check if any line in the set has a matching tag
3. Yes + line valid: hit
4. Locate the data starting at the offset

The memory address of the data object needed is subdivided into:
| tag (t bits) | set index (s bits) | block offset (b bits) |

The tag + set index uniquely identify each block in memory; the data object begins at the block offset.

[Figure: a cache with S = 2^s sets and E = 2^e lines per set; each line holds a valid bit v, a tag and B = 2^b bytes per cache block (the data).]
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 41
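
A sketch of how the three fields are peeled off an address, assuming s = 6 set-index bits and b = 6 offset bits (the values a 64-set, 64-byte-block cache would use):

#include <stdint.h>

#define B_BITS 6  /* b: block offset bits (64-byte blocks) */
#define S_BITS 6  /* s: set index bits (64 sets)           */

uint64_t block_offset(uint64_t addr) { return addr & ((1ULL << B_BITS) - 1); }
uint64_t set_index(uint64_t addr)    { return (addr >> B_BITS) & ((1ULL << S_BITS) - 1); }
uint64_t tag_of(uint64_t addr)       { return addr >> (B_BITS + S_BITS); }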
Carnegie Mellon

Example: Direct Mapped Cache (E = 1)

Direct mapped = one line per set. Assume: cache block size 8 bytes.

Address of the data object looked for in the cache: | t bits | 0…01 | 100 |
i.e., | tag (t bits) | set index (s bits) | block offset (b bits) |. The size of these fields depends on the HW architecture.

[Figure, “find set”: S = 2^s sets, each a single line “v | tag | bytes 0 1 2 3 4 5 6 7”; the set index bits 0…01 select one set.]
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 42
Carnegie Mellon

Example: Direct Mapped Cache (E = 1)

Direct mapped cache: one line (= 1 block) per set. Assumed here: cache block size 8 bytes.

Address of int: | t bits | 0…01 | 100 |

valid? + tag matches = hit

[Figure: within the selected line “v | tag | 0 1 2 3 4 5 6 7”, the set index has picked the set and the offset locates the int in the block.]

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 43


Carnegie Mellon

Example: Direct Mapped Cache (E = 1)

Direct mapped: one line per set. Assume: cache block size 8 bytes.

Address of an integer variable: | t bits | 0…01 | 100 |

valid? + match: assume yes = hit

[Figure: in the selected line “v | tag | 0 1 2 3 4 5 6 7”, the block offset locates the int, which is 4 bytes in this architecture.]

If the tag doesn’t match: the old line is evicted and replaced by a line containing the data object needed, loaded from the levels below in the cache hierarchy.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 44
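
Putting slides 41-44 together, a direct-mapped read lookup can be sketched as follows (a toy software model, not how the hardware is wired; the parameters match the 64-set, 64-byte-block example above):

#include <stdbool.h>
#include <stdint.h>

#define NSETS 64
#define BSIZE 64

struct line { bool valid; uint64_t tag; uint8_t data[BSIZE]; };
static struct line cache[NSETS];   /* E = 1: one line per set */

/* Returns true on a hit and stores the byte at addr in *out; on a
   miss the caller would fetch the block from the level below,
   overwrite cache[set] (evicting the old line) and retry. */
bool read_byte(uint64_t addr, uint8_t *out) {
    uint64_t set = (addr / BSIZE) % NSETS;   /* set index */
    uint64_t tag = addr / (BSIZE * NSETS);   /* tag bits  */
    struct line *ln = &cache[set];
    if (ln->valid && ln->tag == tag) {       /* valid? + tag matches = hit */
        *out = ln->data[addr % BSIZE];       /* block offset */
        return true;
    }
    return false;                            /* miss */
}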
Carnegie Mellon

Direct Mapped Cache: where are memory blocks cached?

The cache has 4 sets, one line per set, 2 bytes per memory block; memory addresses are 4 bits long, the word size is one byte, the RAM is 16 bytes large.

✓ Cache size = 4 × 2 bytes
✓ Memory size = 16 bytes (addresses 0 to 15)
✓ A memory address (4 bits) is broken down into 3 components: tag (0 or 1), 2 index bits (identifying the set) and offset (starting byte in the cache block, can be 0 or 1)

[Figure: the RAM’s 8 memory blocks next to the 4 cache sets. There are 8 memory blocks overall but only 4 blocks can fit into the cache, so pairs of blocks share the same location in the cache → will lead to conflict misses.]


Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 45
Carnegie Mellon

Direct-Mapped Cache Simulation (read)

t=1, s=2, b=1 (address bits: x | xx | x = tag | set | offset). Memory = 16 bytes (4-bit addresses), block size is 2 bytes, 4 sets, 1 line (= 1 block) per set; the set index is the middle pair of bits below.

Addresses read (one byte per read):
0 [0|00|0₂], miss (cold cache)
1 [0|00|1₂], hit
7 [0|11|1₂], miss
8 [1|00|0₂], miss → first line updated
0 [0|00|0₂], miss → first line updated

Nb: with direct-mapped caches, there is no choice in terms of replacement policy, as the address of the word determines the single place in the cache where it must be written.

Cache state after reading address 7 (table not updated afterwards):

        v   Tag   Block
Set 0:  1   0     M[0-1]
Set 1:  0
Set 2:  0
Set 3:  1   0     M[6-7]

M[0-1] = content of memory from address 0 to 1
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 46
Carnegie Mellon

E-way Set Associative Cache (Here: E = 2)

Set associative cache = each set holds more than one cache line.
E = 2: two lines per set. Assume: cache block size 8 bytes.

Address of short int: | t bits | 0…01 | 100 |

[Figure, “find set”: S sets, each with two lines “v | tag | 0 1 2 3 4 5 6 7”; the set index bits 0…01 select one set.]

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 47
Carnegie Mellon

E-way Set Associative Cache (Here: E = 2)

E = 2: two lines per set. Assume: cache block size 8 bytes.
Address of short int: | t bits | 0…01 | 100 |

Compare both lines of the set: valid? + tag match: yes = hit

[Figure: the two lines of the selected set; the block offset locates the data.]

“Line matching” is more time consuming in a set associative cache than in a direct-mapped cache, because the tags and valid bits of multiple lines must be checked in order to determine whether the requested data object is in the set.

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 48
Carnegie Mellon

E-way Set Associative Cache (Here: E = 2)

E = 2: two lines per set. Assume: cache block size 8 bytes.
Address of short int: | t bits | 0…01 | 100 |

Compare both lines of the set: valid? + tag match: yes = hit
The short int (2 bytes) is located at the block offset within the matching line.

In case of a cache miss:
• One line in the set is selected for eviction and replacement
• Replacement policies: random, least recently used (LRU), …
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 49
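
The only change compared to the direct-mapped lookup sketched earlier is a loop over the E lines of the selected set (a toy model again; E = 2 here):

#include <stdbool.h>
#include <stdint.h>

#define NSETS 64
#define BSIZE 64
#define E     2    /* 2-way set associative */

struct line { bool valid; uint64_t tag; uint8_t data[BSIZE]; };
static struct line cache[NSETS][E];

bool read_byte_2way(uint64_t addr, uint8_t *out) {
    uint64_t set = (addr / BSIZE) % NSETS;
    uint64_t tag = addr / (BSIZE * NSETS);
    for (int i = 0; i < E; i++) {            /* compare all E lines */
        struct line *ln = &cache[set][i];
        if (ln->valid && ln->tag == tag) {
            *out = ln->data[addr % BSIZE];
            return true;                     /* hit */
        }
    }
    /* miss: pick a victim line in this set (random, LRU, ...),
       fetch the block from the level below, and retry */
    return false;
}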
Carnegie Mellon

2-Way Set Associative Cache Simulation

t=2, s=1, b=1 (address bits: xx | x | x = tag | set | offset). Memory = 16 bytes (4-bit addresses), block size is 2 bytes, 2 sets, 2 lines per set (= 2 blocks per set); the tag is 2 bits large.

Address trace (reads, one byte per read):
0 [00|0|0₂], miss
1 [00|0|1₂], hit
7 [01|1|1₂], miss
8 [10|0|0₂], miss
0 [00|0|0₂], hit

Final cache state:

            v   Tag   Block
Set 0:      1   00    M[0-1]
            1   10    M[8-9]
Set 1:      1   01    M[6-7]
            0

✓ A fully associative cache consists of a single set that contains all the cache lines.
✓ It is limited to small caches, as it would be too slow with a large number of tags to check.
✓ The eviction policy can be global to the cache and thus very efficient.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 50
Carnegie Mellon

What about writes?

 Multiple copies of the same data exist:
▪ L1, L2, L3, main memory, disk
 What if the CPU writes at an address that is in the cache? (a “write-hit”)
▪ Write-through: write immediately to the next lower level; simple, but slow as it creates a lot of traffic on the internal buses
▪ Write-back: defer the write to the next lower level until the data is evicted from the cache (used by modern CPUs, as the penalty induced by write-through is very high for large caches)
 What if the CPU writes at an address that is not in the cache? (a “write-miss”)
▪ Write-allocate: load the block into the cache (another block must be evicted) and update the line in the cache → good if more writes to the location follow
▪ No-write-allocate: write straight to main memory (or the lower-level cache) without loading the data into the cache
 Typical policies: write-through + no-write-allocate, or write-back + write-allocate

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 51
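
A toy write path combining the two policies of the second typical pairing, write-back + write-allocate (a sketch reusing the small geometry of slide 46, with a dirty bit added per line):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* 4-bit addresses, 4 sets, 1 line per set, 2-byte blocks, 16-byte RAM. */
#define NSETS 4
#define BSIZE 2

static uint8_t mem[16];                      /* the "main memory" */
static struct {
    bool valid, dirty;
    uint8_t tag;
    uint8_t data[BSIZE];
} cache[NSETS];

void write_byte(uint8_t addr, uint8_t val) {
    uint8_t set = (addr >> 1) & 3;           /* s = 2 index bits */
    uint8_t tag = addr >> 3;                 /* t = 1 tag bit    */
    if (!cache[set].valid || cache[set].tag != tag) {        /* write-miss  */
        if (cache[set].valid && cache[set].dirty)            /* dirty victim */
            memcpy(&mem[(cache[set].tag << 3) | (set << 1)], /* write it     */
                   cache[set].data, BSIZE);                  /* back now     */
        memcpy(cache[set].data, &mem[addr & 0x0E], BSIZE);   /* write-allocate */
        cache[set].valid = true;
        cache[set].tag = tag;
    }
    cache[set].data[addr & 1] = val;  /* write-hit path: update cache only; */
    cache[set].dirty = true;          /* memory is updated upon eviction    */
}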
Carnegie Mellon

Main Cache Performance Metrics

 Miss Rate
▪ Fraction of memory accesses not found in the cache (misses / accesses) = 1 – hit rate
▪ Typical numbers (in percentages): 3-10% for L1; can be very small (e.g., < 1%) for L2, depending on size, etc.
These numbers show how efficient caching is!
 Hit Time
▪ Time to deliver a data object that is in the cache to the processor; includes the time to determine whether the line is in the cache
▪ Typical numbers: 4 clock cycles for L1, 10 clock cycles for L2
 Miss Penalty
▪ Additional time required because of a miss; typically 50-200 cycles for main memory

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 52


Carnegie Mellon

Let’s think about those numbers

 Huge difference between a hit and a miss
▪ The difference in access times could be 100x, considering just L1 and main memory

 Would you believe that 99% hits is twice as good as 97%?
▪ Consider: a cache hit time of 1 cycle and a miss penalty of 100 cycles
▪ Average access time:
  97% hits: 0.97 * 1 cycle + 0.03 * 100 cycles = 3.97 cycles
  99% hits: 0.99 * 1 cycle + 0.01 * 100 cycles = 1.99 cycles

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 53
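
The slide’s arithmetic as a checkable snippet:

#include <stdio.h>

int main(void) {
    double hit_time = 1.0, miss_penalty = 100.0;  /* cycles */
    double hit_rates[] = { 0.97, 0.99 };
    for (int i = 0; i < 2; i++) {
        double avg = hit_rates[i] * hit_time
                   + (1.0 - hit_rates[i]) * miss_penalty;
        printf("%.0f%% hits: %.2f cycles on average\n",
               100.0 * hit_rates[i], avg);
    }
    return 0;
}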
