
Carnegie Mellon

The Memory Hierarchy


N. Navet - Computing Infrastructure 1 / Lecture 5

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 1


Carnegie Mellon

The memory hierarchy


 The Virtual Memory system: address translation, MMU, memory pages
 The stack & the heap
 How arguments and return values are passed between functions calling each other
 Caching in the memory hierarchy
 Temporal & spatial locality
 How CPU caches operate
 Direct-mapped caches
 E-way set associative caches
 Fully associative caches
 Cache writes
 Cache performance metrics
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 2

Virtual Memory,
the Stack and the Heap

“Virtual memory is an abstraction that provides each process with the illusion that it has exclusive use of the main memory. Each process has the same uniform view of memory, which is known as its virtual address space.”

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.


Virtual memory: a key abstraction provided by OSes, which requires some HW support

“Physical addressing”: the CPU uses physical addresses to access memory. Here a 4-byte word starting at address 4 is transferred.

Modern CPUs use virtual addressing

“Virtual addressing”: the CPU works with virtual addresses, which are converted on the fly into physical addresses (“address translation”) by the Memory Management Unit (MMU) using a lookup table managed by the OS.

Some benefits of virtual memory: 1) the MMU protects the address space of each process from corruption by other processes; 2) each process can have an identical virtual address space.
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
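
Conceptually, address translation boils down to the following toy sketch (a hypothetical single-level page table with made-up contents; a real MMU walks a multi-level table set up by the OS):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096   /* 4 kB pages, as under Linux */
#define PAGE_BITS 12     /* log2(PAGE_SIZE) */
#define NUM_PAGES 8      /* toy address space: 8 virtual pages */

/* Hypothetical page table: virtual page number -> physical page
   number, -1 = page not resident in RAM. */
static const int page_table[NUM_PAGES] = { 3, 7, -1, 0, -1, 5, -1, 1 };

uint64_t translate(uint64_t vaddr) {
    uint64_t vpn    = vaddr >> PAGE_BITS;       /* virtual page number  */
    uint64_t offset = vaddr & (PAGE_SIZE - 1);  /* offset within a page */
    int ppn = page_table[vpn];
    if (ppn < 0) {
        /* page fault: the OS would bring the page in from disk here */
        fprintf(stderr, "page fault at 0x%llx\n", (unsigned long long)vaddr);
        return (uint64_t)-1;
    }
    return ((uint64_t)ppn << PAGE_BITS) | offset;  /* physical address */
}

int main(void) {
    /* virtual address 0x1ABC = VP 1, offset 0xABC -> PP 7 -> 0x7ABC */
    printf("0x%llx\n", (unsigned long long)translate(0x1ABC));
    return 0;
}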
Linux Virtual Memory System: the virtual memory of a process

✓ Read-only code and read-only data (i.e., constants): begin at the same fixed address for each process
✓ Read-write data: global variables
✓ Heap: area for dynamic memory in the user’s program (size varies at run-time)
✓ Stack: memory used for function calls in the user’s program and local variables (size varies at run-time)

Each process only knows its own virtual address space – every address used by the processor is translated (by the MMU) to/from the virtual address space.

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.


How Linux organizes virtual memory – out of the scope of the lecture

The kernel maintains a distinct task structure (task_struct) for each process in the system.

The Text segment (aka the Instruction or Code segment) contains the executable program code and constant data.
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Carnegie Mellon

“Memory page”: smallest unit of memory allocated to a process

RAM is allocated to processes (by the kernel) in “pages”, not in bytes – e.g., the page size under Linux is 4 kB. Below, VP1 stands for Virtual Page #1.

The address translation HW reads the “page table” each time it converts a virtual address to a physical address. “Physical pages” are aka “page frames”; “virtual pages” are recorded in the “page table”.

The OS maintains a separate page table for each process in the system. The VPs of a process can be in physical memory (“cached”) or on disk (“uncached”). Shared pages make it possible to avoid duplicating code in memory, such as the functions of the C library that are used by every program on Linux. Unused pages can be moved to disk by the OS.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 7
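
As a quick check, a program can query the page size at run time via the POSIX sysconf() call (a minimal sketch; prints 4096 on most Linux x86-64 systems):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    long page_size = sysconf(_SC_PAGESIZE);  /* POSIX page-size query */
    printf("page size: %ld bytes\n", page_size);
    return 0;
}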
Focus on the stack

Stack: data structure that stores information about the active subroutines (= functions) of a program. E.g., when DrawSquare() calls DrawLine(): 1) it must pass arguments to DrawLine(); 2) the CPU must know where execution must resume when DrawLine() finishes.

✓ Each stack frame stores the information for a specific function call
✓ Data stored in each frame:
- Local variables of the function
- Arguments passed to the called function
- The return address (in the caller’s code)

The stack: 1) provides memory space for variables local to a function, 2) is a way to pass the arguments to the function, and 3) records where execution should resume upon exit of the function (to set the IPR). [Figure: an upward-growing stack is drawn here.]

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.


Focus on the Heap
◼ The heap is a memory area that stores dynamically allocated memory. In C programs, a block of memory is allocated and de-allocated by malloc() and free(), respectively.

◼ The main reason to use dynamic memory allocation is that we cannot know the sizes and number of certain data structures until the program executes (see the sketch below).

◼ Higher-level languages such as Python and Java rely on garbage collection (GC) to free allocated blocks … but GC might severely slow down user code at times.

◼ Dynamic memory leads to complex fragmentation issues: overall, there is enough available memory, but no single free block is large enough to handle the request.
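
A minimal sketch of the typical use case: the size n is only known at run time, so the block must come from the heap.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int n = (argc > 1) ? atoi(argv[1]) : 10;  /* size known only at run time */
    int *a = malloc(n * sizeof *a);           /* block allocated in the heap */
    if (a == NULL)                            /* allocation can fail */
        return 1;
    for (int i = 0; i < n; i++)
        a[i] = i * i;
    printf("a[%d] = %d\n", n - 1, a[n - 1]);
    free(a);                                  /* return the block to the allocator */
    return 0;
}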
Carnegie Mellon

Procedures (aka functions, sub-routines)

 Transferring control flow to the procedure that is called
▪ At beginning of procedure code
▪ Back to return point
 Passing data
▪ Procedure arguments: through stack (or registers, see later)
▪ Return value: through register
 Memory management
▪ Local variables allocated during procedure execution
▪ Deallocated upon return (stack)
▪ Dynamic memory (e.g. malloc(), new()) is allocated in the heap (automatic deallocation or not depends on language and program)

P(…) {
  •
  •
  y = Q(x);
  print(y)
  •
}

int Q(int i)
{
  int t = 3*i;
  int v[10];
  •
  •
  return v[t];
}
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 10
Carnegie Mellon

MEMORY ALLOCATION
ON LINUX X86-64

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 11


Carnegie Mellon

x86-64 Linux Memory Layout for a process

 Stack
▪ Runtime stack (8MB limit)
 Heap
▪ Dynamically allocated as needed
▪ Through malloc(), calloc(), new() calls
 Data
▪ Statically allocated data (their address is constant)
▪ E.g., global vars, static vars
 Text / shared Libraries
▪ Executable machine instructions + constants
▪ Read-only

[Figure, not drawn to scale: virtual memory addresses as the process sees them – not physical memory. From hex address 00007FFFFFFFFFFF down to 000000: Stack (8MB), Shared Libraries, Heap, Data, Text (starting at 400000).]

The virtual address space is 128TB large(!)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 12
Carnegie Mellon

Focus on the stack on x86-64 platforms

 Region of memory managed with stack discipline
 Grows toward lower addresses
 A dedicated register, %rsp, contains the lowest stack address
▪ the address of the “top” element

[Figure: the stack “bottom” sits at increasing addresses; the stack grows down; the stack pointer %rsp points to the stack “top”.]

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 13


Carnegie Mellon

Stack Frames

 Contents
▪ Return address
▪ Arguments of the function
▪ Local variables
▪ Saved registers
 Management
▪ Space allocated when entering the procedure
▪ Frame pushed by the call instruction
▪ De-allocated when the control flow returns to the calling function (ret instruction)

[Figure: the previous frame sits above the last frame, which lies between the frame pointer %rbp and the stack pointer %rsp, at the stack “top”.]

Why is it necessary to save registers upon a function call?

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 14


Carnegie Mellon

Memory Allocation Example

char big_array[1L<<24];  /*  16 MB */
char huge_array[1L<<31]; /*   2 GB */

int global = 0;

int useless() { return 0; }

int main ()
{
    void *p1, *p2, *p3, *p4;
    int local = 0;
    p1 = malloc(1L << 28); /* 256 MB */
    p2 = malloc(1L << 8);  /* 256 B  */
    p3 = malloc(1L << 32); /*   4 GB */
    p4 = malloc(1L << 8);  /* 256 B  */
    /* Some other statements ... */
}

Where does everything go? [Figure, not drawn to scale: Stack, Shared Libraries, Heap, Data, Text (= code + constants).]

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 15


Carnegie Mellon

x86-64 Example Addresses (address range ~2^47)

local        0x00007ffe4d3be87c
p1           0x00007f7262a1e010
p3           0x00007f7162a1d010
p4           0x000000008359d120
p2           0x000000008359d010
big_array    0x0000000080601060
huge_array   0x0000000000601060
main()       0x000000000040060c
useless()    0x0000000000400590

[Figure, not drawn to scale: the address space from 000000 up to 00007F…, with the Text, Data, Heap, Shared Libraries and Stack regions.]

Note: the memory blocks pointed to by p1, p2, p3 and p4 are in the heap (dynamic memory) but the 4 pointers themselves are in the stack (local variables).
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 16
Carnegie Mellon

Argument passing through registers (out of the scope of the lecture)

 For speed, processors with many registers use them – instead of the stack – to pass arguments to procedures. Intel x86-64 has twice the number of registers of IA32.

 Intel x86-64: arguments (up to the first six) are passed to procedures via registers, rather than on the stack. This eliminates the overhead of storing and retrieving values on the stack.

[Table: C code vs. location of the arguments.]

Variables local to a function may also be allocated to registers (instead of the stack) if space permits.

Args 7 and 8 are on the stack, where %rsp is a register holding the value of the stack pointer.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 18
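
For reference, under the System V AMD64 ABI used by Linux, the first six integer arguments travel in %rdi, %rsi, %rdx, %rcx, %r8 and %r9; a sketch with a hypothetical 8-argument function:

/* On Linux x86-64 (System V ABI), a..f arrive in %rdi, %rsi, %rdx,
   %rcx, %r8 and %r9; g and h are passed on the stack by the caller;
   the return value travels back in %rax. */
long sum8(long a, long b, long c, long d,
          long e, long f, long g, long h) {
    return a + b + c + d + e + f + g + h;
}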
Carnegie Mellon

Return value of a function (advanced)

 Where the return value of a function is written is a convention that depends on the architecture.

 E.g., on Intel x86-64, if the return value of the function is
▪ a 32-bit integer, it is passed between the two functions (caller and called function) in the EAX register (the low 32 bits of RAX),
▪ a 64-bit integer, it is passed in the RAX register (on 32-bit IA32, a 64-bit value is returned in the EDX:EAX register pair),
▪ a floating point value, it is returned in the XMM0 register.

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 19


Carnegie Mellon

CACHING IN THE MEMORY HIERARCHY

✓ The process of using a cache is known as caching
✓ We will focus on the caches inside the CPU – for most CPUs: L1, L2 and L3
✓ There are 3 types of CPU-internal caches: direct-mapped caches, set associative caches and fully associative caches
✓ The cache is organized so that it can find the requested word by simply inspecting the bits of the requested address
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 20
Carnegie Mellon

Caching
 Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
 Information in use is temporarily copied from the slower to the faster storage.
 The fastest storage (cache) is checked first to determine if the information is there
▪ If it is, the information is used directly from the cache (fast)
▪ If not, the data is copied to the cache and used there
 The cache is smaller than the storage being cached
▪ Thus cache size and replacement policy are important problems
 A cache miss is a failed attempt to read / write a piece of data in the cache (the data is not there), which results in a lower-level memory access with much longer latency. Two kinds of cache read misses: instruction read miss and data read miss.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 21
Carnegie Mellon

Typical Memory Hierarchy

The faster, smaller device at level k serves as a cache for the larger, slower device at level k+1. Devices towards the top are smaller, faster, volatile and costlier (per byte); devices towards the bottom are larger, slower and cheaper (per byte).

L0: Registers – CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM) – holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM) – holds cache lines retrieved from the L3 cache. Nb: a cache line holds a memory block.
L3: L3 cache (SRAM) – holds memory blocks retrieved from main memory. Nb: a memory block is a fixed-size packet of data that is transferred to cache.
L4: Main memory (DRAM) – holds disk blocks retrieved from local disks.
L5: Local secondary storage (local disks incl. SSD) – holds files retrieved from disks on remote servers.
L6: Remote secondary storage (e.g., servers, cloud)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 22
Carnegie Mellon

Random-Access Memory (RAM)

 Key features
▪ Volatile memory
▪ Memory can be read/written in any order
▪ The time required to read and write is independent of the location
 Two types of RAM:
▪ SRAM (Static RAM): faster and much more expensive than DRAM, used for caches
▪ DRAM (Dynamic RAM): used for main memory

       Access time   Cost   Applications
SRAM   1x            100x   Cache memories & registers
DRAM   10x           1x     Main memories

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 23


Carnegie Mellon

Caching principles continued

 When the CPU is instructed by a load instruction to read a word from address A of main memory, it sends the address A to the L1 cache. If L1 is holding a copy of the word, it sends the word immediately back to the CPU (“cache hit”). If not (“cache miss”), L2 is requested to provide the data, which is then loaded into L1. If L2 does not have a copy of the word, it requests it from the L3 cache and loads it, etc. When the word is found, it is propagated up the cache hierarchy.
 Nb: registers can serve as a cache, but how they are used is (statically) defined by the compiler, not (dynamically) controlled by the CPU.
 Why do memory caches work?
▪ Because of locality, programs tend to access data already at level k (it has been recently used) more often than they access the data at level k+1.

If the word a program loads is stored in a register, it can be accessed in 0 cycles during the execution of the instruction. If stored in a cache, 1 to 30 cycles. If stored in main memory, 50 to 200 CPU cycles!

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 24


Carnegie Mellon

Migration of data “A” from Disk to Register

[Figure: data “A” migrates from disk to main memory, then through the L1+L2+L3 caches (managed by HW) to a register, where it is then used as an operand of an instruction.]

Two properties / requirements enforced in HW:
 All processes are provided the most recent value, no matter the circumstances
 In particular, multiprocessor/multicore execution platforms ensure cache coherency such that all CPUs have the most recent value in their cache
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 25


Carnegie Mellon

Locality

As a programmer it is important to understand the concept of locality, as programs with good locality – thanks to caching – run faster than programs with poor locality.

 Principle of Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently

 Temporal locality:
▪ Recently referenced items are likely to be referenced again in the near future (e.g., using the same variables several times)

 Spatial locality:
▪ Items with nearby addresses tend to be referenced close together in time (think of iterating over an array)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 26


Carnegie Mellon

Locality Example #1 – good locality

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Good locality!

 Locality wrt data
▪ Reference array elements in succession (the next element is more likely to be already in cache) → spatial locality
▪ Reference variables sum and i at each iteration (both will stay in cache) → temporal locality
 Locality wrt instructions (i.e., code)
▪ Reference instructions in sequence → spatial locality
▪ Cycle through the loop repeatedly → temporal locality

Writing programs so as to maximize locality is key to achieving fast execution – achieving both spatial and temporal locality for all variables may not be feasible; e.g., wrt array a, the code above has good spatial locality but poor temporal locality, as elements are not re-used.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 27
Carnegie Mellon

Locality Example #2 – poor spatial locality

Nb: the int type is 4 bytes long in this example.

[Figure: a function traverses a 2-D array column by column; the memory layout is row-major, so consecutive accesses jump from the 1st row to the 2nd row, and so on.]

A stride-k reference pattern means visiting every kth element of a contiguous data structure – as the stride increases, the spatial locality decreases, and it is less likely that the accessed elements can be read from the cache.

How can the function be rewritten so as to improve spatial locality, i.e., so as to reduce the stride to 1? (See the sketch below.)
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 28
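
The function the slide has in mind is presumably something like the classic column-wise sum below (a sketch with hypothetical dimensions M and N); swapping the loops reduces the stride to 1:

#define M 100
#define N 100

/* Stride-N pattern: a[0][0], a[1][0], a[2][0], ... C stores arrays
   row by row, so consecutive accesses are N*4 bytes apart: poor
   spatial locality. */
int sum_array_cols(int a[M][N]) {
    int sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

/* Stride-1 rewrite: elements are visited in the order they are laid
   out in memory, so most accesses hit in the cache. */
int sum_array_rows(int a[M][N]) {
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}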
Carnegie Mellon

Intel Core i7 Cache Hierarchy

[Figure: the processor package contains cores 0 … 3; each core has its registers, an L1 d-cache and an L1 i-cache, and an L2 unified cache; the L3 unified cache is shared by all cores and sits in front of main memory.]

L1 i-cache and d-cache: 32 KB, 8-way, access: 4 CPU cycles
L2 unified cache: 256 KB, 8-way, access: 10 CPU cycles
L3 unified cache: 8 MB, 16-way, access: 40-75 CPU cycles

Block size: 64 bytes for all caches. Caches are partitioned into lines. Transfer between two levels is always done at the granularity of a block, e.g. 64 bytes (which allows taking advantage of spatial locality).
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 29
Carnegie Mellon

Measurement of cache effects: read throughput (in MB/s) on an i7 CPU

If a program reads n bytes over a period of s seconds, the read throughput is n/s, typically expressed in MB/s. The higher, the better.

[Figure: measured read throughput as a function of the working set size, here with stride = 16 bytes; the irregularities are probably caused by other data or code blocks in the cache.]

Working set: the set of memory blocks accessed in a certain “phase” of a program, here the read loop. Here the working set size is the size of the array “data”.

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 30
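
A minimal sketch of the kind of loop being timed (hypothetical; the actual benchmark is more careful about warm-up and timing):

/* Read the array with the given stride and return a running sum so
   the compiler cannot optimize the loop away. Timing many calls and
   dividing the bytes read by the elapsed seconds gives the read
   throughput for this working-set size and stride. */
long read_loop(const long *data, long nelems, long stride) {
    long acc = 0;
    for (long i = 0; i < nelems; i += stride)
        acc += data[i];
    return acc;
}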


Carnegie Mellon

General Cache Concepts

At any point in time, the cache at level k contains copies of a subset of the blocks from level k+1.

[Figure, illustration with a read operation: level k is a smaller, faster memory that caches a subset of the blocks of the immediately lower level in the memory hierarchy – here blocks 4, 8, 9, 10 and 14. Level k+1 is a larger, slower, cheaper memory partitioned into “blocks” numbered 0 to 15. Data is copied back and forth (read/write) between two adjacent levels in block-sized transfer units.]

Unlike in the Intel i7 architecture, the block size may not be uniform across cache levels on some other architectures.

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 32


Carnegie Mellon

General Cache Concepts: Hit

Block 14 is needed: the CPU needs a data object that is in block 14.

Request: block 14
Cache:  [ 8 | 9 | 14 | 3 ]   → Block 14 is in the cache: Hit!
Memory: [ 0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15 ]

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 33


Carnegie Mellon

General Cache Concepts: Miss

Request: block 12 – Block 12 is needed.
Cache:  [ 8 | 9 | 14 | 3 ]   → Block 12 is not in the cache: Miss!

Block 12 is fetched from memory (Request: 12) and stored in the cache, here in place of block 14.
• Replacement policy: determines which block gets evicted (the “victim block”) and replaced by 12.
Memory: [ 0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15 ]

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 34


Carnegie Mellon

Examples of cache replacement policies

 Random replacement policy: choice of the victim block at random
 Least-Recently Used (LRU): choose the block that was last accessed the furthest in the past
 Pseudo-LRU: discards “one of the least recently used” items (based on an approximate measure of age involving less overhead)
 Most Recently Used (MRU): in contrast to LRU, the most recently used item is discarded first
 Hardware may limit what is possible, esp. for caches high in the hierarchy → restrictive placement policy
▪ Blocks at level k+1 are restricted to a small subset of the block positions at level k, possibly to a single location
▪ E.g. block i at level k+1 must be placed in block (i mod 4) at level k.

How to estimate the efficiency of a cache replacement policy for a given program?
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 35
Carnegie Mellon

3 types of Cache Misses

 Cold miss (or compulsory miss)
▪ Cold misses occur because the cache is empty (“cold” cache)
 Capacity miss
▪ Occurs when the set of “active” blocks (the working set) is larger than the cache (see the examples in slide 30)
 Conflict miss
▪ Due to a restrictive placement policy, blocks cannot be cached anywhere in the cache but only at specific locations (sometimes a single location)
▪ Conflict misses occur when multiple data objects all map to the same level k block.
▪ E.g. if block i at level k+1 must be placed in block (i mod 4) at level k, referencing blocks 0, 8, 4, 8, 4, 8, ... would miss every time.
▪ For later: conflict misses would not occur if the cache were fully associative with LRU replacement.

With the above policy (i.e., i mod 4), where would blocks 1, 5, 9, 12 at level k+1 be stored at level k? Use the figure on slide #34.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 36
Carnegie Mellon

General Cache Organization

 A cache is organized as
▪ A number of cache sets (the rows)
▪ Each set consists of one or several cache lines, each containing a cache block (64 bytes is the norm today for the block size, as DDR supports transporting blocks of 64 bytes efficiently)
▪ Each line holds a cache block and some bit fields needed to operate the cache

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 37


Carnegie Mellon

General Cache Organization

 Each cache line contains
▪ A tag: the high-order bits of the main-memory address of the block stored in the line
▪ A valid bit: says whether the line contains meaningful information
▪ The data actually being cached, in the data block

 Direct-mapped cache = one line per set
➔ A given memory address can be cached in only one cache line
 E-way set associative cache = E lines per set
➔ A set of E cache lines must be checked to see if the requested block is present

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 38


Carnegie Mellon

General Cache Organization (S, E, B)

S = 2^s sets; E = 2^e lines per set; B = 2^b bytes per cache block (the data). Nb: 2^x indicates that a quantity is a power of two.

[Figure: the cache drawn as S rows (sets) of E lines; each line holds a valid bit v, a tag, and the data bytes 0, 1, 2, …, B−1.]

Cache size (data only): C = S × E × B bytes

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 39
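
A quick sanity check with the Core i7 L1 d-cache figures from slide 29 (a worked example, assuming those parameters):

#include <stdio.h>

int main(void) {
    int B = 64;            /* bytes per block             */
    int E = 8;             /* lines per set (8-way)       */
    int C = 32 * 1024;     /* total data capacity: 32 KB  */
    int S = C / (E * B);   /* S = C / (E x B) = 64 sets   */
    printf("S = %d sets, check: C = %d bytes\n", S, S * E * B);
    return 0;
}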


Carnegie Mellon

Cache functioning in more detail

 Let’s assume a CPU with a set of registers, an L1 cache (no L2 and L3 caches) and main memory
 When the CPU executes an instruction that reads a memory word w:
▪ It requests w from the L1 cache
▪ If the L1 cache has a cached copy of w, it returns it to the CPU (“hit”)
▪ Otherwise, we have a cache miss – the L1 cache requests a copy of the block containing w from memory
▪ When the requested block arrives, the L1 cache stores the block in one of its cache lines, extracts word w from the stored block and returns it to the CPU

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 40


Carnegie Mellon

Cache Read – often not an entire block is read by the CPU but, e.g., a single variable

1. Locate the set
2. Check if any line in the set has a matching tag
3. Yes + line valid: hit
4. Locate the data starting at the offset

The memory address of the data object needed is subdivided into:
| tag (t bits) | set index (s bits) | block offset (b bits) |

The tag + set index uniquely identify each block in memory; the data object begins at the block offset.

[Figure: a cache with S = 2^s sets and E = 2^e lines per set; each line holds a valid bit v, a tag and B = 2^b bytes per cache block (the data).]
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 41
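
A sketch of how the three fields are peeled off an address, assuming s = 6 set-index bits and b = 6 offset bits (the values a 64-set, 64-byte-block cache would use):

#include <stdint.h>

#define B_BITS 6  /* b: block offset bits (64-byte blocks) */
#define S_BITS 6  /* s: set index bits (64 sets)           */

uint64_t block_offset(uint64_t addr) { return addr & ((1ULL << B_BITS) - 1); }
uint64_t set_index(uint64_t addr)    { return (addr >> B_BITS) & ((1ULL << S_BITS) - 1); }
uint64_t tag_of(uint64_t addr)       { return addr >> (B_BITS + S_BITS); }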
Carnegie Mellon

Example: Direct Mapped Cache (E = 1)

Direct mapped = one line per set. Assume: cache block size 8 bytes.

Address of the data object looked for in the cache: | t bits | 0…01 | 100 |
i.e., | tag (t bits) | set index (s bits) | block offset (b bits) |. The size of these fields depends on the HW architecture.

[Figure, “find set”: S = 2^s sets, each a single line “v | tag | bytes 0 1 2 3 4 5 6 7”; the set index bits 0…01 select one set.]
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 42
Carnegie Mellon

Example: Direct Mapped Cache (E = 1)

Direct mapped cache: one line (= 1 block) per set. Assumed here: cache block size 8 bytes.

Address of int: | t bits | 0…01 | 100 |

valid? + tag matches = hit

[Figure: within the selected line “v | tag | 0 1 2 3 4 5 6 7”, the set index has picked the set and the offset locates the int in the block.]

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 43


Carnegie Mellon

Example: Direct Mapped Cache (E = 1)

Direct mapped: one line per set. Assume: cache block size 8 bytes.

Address of an integer variable: | t bits | 0…01 | 100 |

valid? + match: assume yes = hit

[Figure: in the selected line “v | tag | 0 1 2 3 4 5 6 7”, the block offset locates the int, which is 4 bytes in this architecture.]

If the tag doesn’t match: the old line is evicted and replaced by a line containing the data object needed, loaded from the levels below in the cache hierarchy.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 44
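
Putting slides 41-44 together, a direct-mapped read lookup can be sketched as follows (a toy software model, not how the hardware is wired; the parameters match the 64-set, 64-byte-block example above):

#include <stdbool.h>
#include <stdint.h>

#define NSETS 64
#define BSIZE 64

struct line { bool valid; uint64_t tag; uint8_t data[BSIZE]; };
static struct line cache[NSETS];   /* E = 1: one line per set */

/* Returns true on a hit and stores the byte at addr in *out; on a
   miss the caller would fetch the block from the level below,
   overwrite cache[set] (evicting the old line) and retry. */
bool read_byte(uint64_t addr, uint8_t *out) {
    uint64_t set = (addr / BSIZE) % NSETS;   /* set index */
    uint64_t tag = addr / (BSIZE * NSETS);   /* tag bits  */
    struct line *ln = &cache[set];
    if (ln->valid && ln->tag == tag) {       /* valid? + tag matches = hit */
        *out = ln->data[addr % BSIZE];       /* block offset */
        return true;
    }
    return false;                            /* miss */
}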
Carnegie Mellon

Direct Mapped Cache: where are memory blocks cached?

The cache has 4 sets, one line per set, 2 bytes per memory block; memory addresses are 4 bits long, the word size is one byte, the RAM is 16 bytes large.

✓ Cache size = 4 × 2 bytes
✓ Memory size = 16 bytes (addresses 0 to 15)
✓ A memory address (4 bits) is broken down into 3 components: tag (0 or 1), 2 index bits (identifying the set) and offset (starting byte in the cache block, can be 0 or 1)

[Figure: the RAM’s 8 memory blocks next to the 4 cache sets. There are 8 memory blocks overall but only 4 blocks can fit into the cache, so pairs of blocks share the same location in the cache → will lead to conflict misses.]


Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 45
Carnegie Mellon

Direct-Mapped Cache Simulation (read)

t=1, s=2, b=1 (address bits: x | xx | x = tag | set | offset). Memory = 16 bytes (4-bit addresses), block size is 2 bytes, 4 sets, 1 line (= 1 block) per set; the set index is the middle pair of bits below.

Addresses read (one byte per read):
0 [0|00|0₂], miss (cold cache)
1 [0|00|1₂], hit
7 [0|11|1₂], miss
8 [1|00|0₂], miss → first line updated
0 [0|00|0₂], miss → first line updated

Nb: with direct-mapped caches, there is no choice in terms of replacement policy, as the address of the word determines the single place in the cache where it must be written.

Cache state after reading address 7 (table not updated afterwards):

        v   Tag   Block
Set 0:  1   0     M[0-1]
Set 1:  0
Set 2:  0
Set 3:  1   0     M[6-7]

M[0-1] = content of memory from address 0 to 1
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 46
Carnegie Mellon

E-way Set Associative Cache (Here: E = 2)

Set associative cache = each set holds more than one cache line.
E = 2: two lines per set. Assume: cache block size 8 bytes.

Address of short int: | t bits | 0…01 | 100 |

[Figure, “find set”: S sets, each with two lines “v | tag | 0 1 2 3 4 5 6 7”; the set index bits 0…01 select one set.]

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 47
Carnegie Mellon

E-way Set Associative Cache (Here: E = 2)

E = 2: two lines per set. Assume: cache block size 8 bytes.
Address of short int: | t bits | 0…01 | 100 |

Compare both lines of the set: valid? + tag match: yes = hit

[Figure: the two lines of the selected set; the block offset locates the data.]

“Line matching” is more time consuming in a set associative cache than in a direct-mapped cache, because the tags and valid bits of multiple lines must be checked in order to determine whether the requested data object is in the set.

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 48
Carnegie Mellon

E-way Set Associative Cache (Here: E = 2)

E = 2: two lines per set. Assume: cache block size 8 bytes.
Address of short int: | t bits | 0…01 | 100 |

Compare both lines of the set: valid? + tag match: yes = hit
The short int (2 bytes) is located at the block offset within the matching line.

In case of a cache miss:
• One line in the set is selected for eviction and replacement
• Replacement policies: random, least recently used (LRU), …
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 49
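
The only change compared to the direct-mapped lookup sketched earlier is a loop over the E lines of the selected set (a toy model again; E = 2 here):

#include <stdbool.h>
#include <stdint.h>

#define NSETS 64
#define BSIZE 64
#define E     2    /* 2-way set associative */

struct line { bool valid; uint64_t tag; uint8_t data[BSIZE]; };
static struct line cache[NSETS][E];

bool read_byte_2way(uint64_t addr, uint8_t *out) {
    uint64_t set = (addr / BSIZE) % NSETS;
    uint64_t tag = addr / (BSIZE * NSETS);
    for (int i = 0; i < E; i++) {            /* compare all E lines */
        struct line *ln = &cache[set][i];
        if (ln->valid && ln->tag == tag) {
            *out = ln->data[addr % BSIZE];
            return true;                     /* hit */
        }
    }
    /* miss: pick a victim line in this set (random, LRU, ...),
       fetch the block from the level below, and retry */
    return false;
}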
Carnegie Mellon

2-Way Set Associative Cache Simulation

t=2, s=1, b=1 (address bits: xx | x | x = tag | set | offset). Memory = 16 bytes (4-bit addresses), block size is 2 bytes, 2 sets, 2 lines per set (= 2 blocks per set); the tag is 2 bits large.

Address trace (reads, one byte per read):
0 [00|0|0₂], miss
1 [00|0|1₂], hit
7 [01|1|1₂], miss
8 [10|0|0₂], miss
0 [00|0|0₂], hit

Final cache state:

            v   Tag   Block
Set 0:      1   00    M[0-1]
            1   10    M[8-9]
Set 1:      1   01    M[6-7]
            0

✓ A fully associative cache consists of a single set that contains all the cache lines.
✓ It is limited to small caches, as it would be too slow with a large number of tags to check.
✓ The eviction policy can be global to the cache and thus very efficient.
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 50
Carnegie Mellon

What about writes?

 Multiple copies of the same data exist:
▪ L1, L2, L3, main memory, disk
 What if the CPU writes at an address that is in the cache? (a “write-hit”)
▪ Write-through: write immediately to the next lower level; simple, but slow as it creates a lot of traffic on the internal buses
▪ Write-back: defer the write to the next lower level until the data is evicted from the cache (used by modern CPUs, as the penalty induced by write-through is very high for large caches)
 What if the CPU writes at an address that is not in the cache? (a “write-miss”)
▪ Write-allocate: load the block into the cache (another block must be evicted) and update the line in the cache → good if more writes to the location follow
▪ No-write-allocate: write straight to main memory (or the lower-level cache) without loading the data into the cache
 Typical policies: write-through + no-write-allocate, or write-back + write-allocate

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition [edited NN] 51
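
A toy write path combining the two policies of the second typical pairing, write-back + write-allocate (a sketch reusing the small geometry of slide 46, with a dirty bit added per line):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* 4-bit addresses, 4 sets, 1 line per set, 2-byte blocks, 16-byte RAM. */
#define NSETS 4
#define BSIZE 2

static uint8_t mem[16];                      /* the "main memory" */
static struct {
    bool valid, dirty;
    uint8_t tag;
    uint8_t data[BSIZE];
} cache[NSETS];

void write_byte(uint8_t addr, uint8_t val) {
    uint8_t set = (addr >> 1) & 3;           /* s = 2 index bits */
    uint8_t tag = addr >> 3;                 /* t = 1 tag bit    */
    if (!cache[set].valid || cache[set].tag != tag) {        /* write-miss  */
        if (cache[set].valid && cache[set].dirty)            /* dirty victim */
            memcpy(&mem[(cache[set].tag << 3) | (set << 1)], /* write it     */
                   cache[set].data, BSIZE);                  /* back now     */
        memcpy(cache[set].data, &mem[addr & 0x0E], BSIZE);   /* write-allocate */
        cache[set].valid = true;
        cache[set].tag = tag;
    }
    cache[set].data[addr & 1] = val;  /* write-hit path: update cache only; */
    cache[set].dirty = true;          /* memory is updated upon eviction    */
}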
Carnegie Mellon

Main Cache Performance Metrics

 Miss Rate
▪ Fraction of memory accesses not found in the cache (misses / accesses) = 1 – hit rate
▪ Typical numbers (in percentages): 3-10% for L1; can be very small (e.g., < 1%) for L2, depending on size, etc.
These numbers show how efficient caching is!
 Hit Time
▪ Time to deliver a data object that is in the cache to the processor; includes the time to determine whether the line is in the cache
▪ Typical numbers: 4 clock cycles for L1, 10 clock cycles for L2
 Miss Penalty
▪ Additional time required because of a miss; typically 50-200 cycles for main memory

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 52


Carnegie Mellon

Let’s think about those numbers

 Huge difference between a hit and a miss
▪ The difference in access times could be 100x, considering just L1 and main memory

 Would you believe that 99% hits is twice as good as 97%?
▪ Consider: a cache hit time of 1 cycle and a miss penalty of 100 cycles
▪ Average access time:
  97% hits: 0.97 * 1 cycle + 0.03 * 100 cycles = 3.97 cycles
  99% hits: 0.99 * 1 cycle + 0.01 * 100 cycles = 1.99 cycles

Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 53
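
The slide’s arithmetic as a checkable snippet:

#include <stdio.h>

int main(void) {
    double hit_time = 1.0, miss_penalty = 100.0;  /* cycles */
    double hit_rates[] = { 0.97, 0.99 };
    for (int i = 0; i < 2; i++) {
        double avg = hit_rates[i] * hit_time
                   + (1.0 - hit_rates[i]) * miss_penalty;
        printf("%.0f%% hits: %.2f cycles on average\n",
               100.0 * hit_rates[i], avg);
    }
    return 0;
}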
