Lecture 8 Cont. Cache Memory
Computer Organization
and Architecture
8th Edition
Chapter 4
Cont. Cache Memory
Replacement Algorithms (1) Direct Mapping
• No choice
• Each block only maps to one line
• Replace that line
Replacement Algorithms (2) Associative & Set Associative
• Hardware implemented algorithm (speed)
• Least Recently Used (LRU)
• e.g. in 2-way set associative
– Which of the 2 blocks is LRU?
• First In First Out (FIFO)
– replace block that has been in cache longest
• Least Frequently Used (LFU)
– replace block which has had fewest hits
• Random
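For the 2-way set-associative case above, LRU needs only a single use bit per set pointing at the least recently used way. A minimal sketch (the class and method names are illustrative, not a real controller design):

```python
# Hypothetical sketch: LRU replacement in one set of a 2-way
# set-associative cache. With two ways, one bit per set suffices:
# it records which way was used least recently.

class TwoWaySet:
    def __init__(self):
        self.tags = [None, None]  # tag stored in each way
        self.lru = 0              # index of the least recently used way

    def access(self, tag):
        """Return True on a hit; on a miss, replace the LRU way."""
        if tag in self.tags:
            way = self.tags.index(tag)
            hit = True
        else:
            way = self.lru        # victim chosen by the LRU bit
            self.tags[way] = tag
            hit = False
        self.lru = 1 - way        # the other way is now least recent
        return hit

s = TwoWaySet()
print(s.access('A'))  # miss -> False
print(s.access('B'))  # miss -> False
print(s.access('A'))  # hit  -> True
print(s.access('C'))  # miss: B is LRU, so B is evicted
```

Note that FIFO and LFU would need a per-set counter instead of a single bit, which is one reason LRU (or an approximation of it) is the common hardware choice.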
Write Policy
• Must not overwrite a cache block unless main memory is
up to date
• Multiple CPUs may have individual caches
• I/O may address main memory directly
Write through
• All writes go to main memory as well as cache
• Multiple CPUs can monitor main memory traffic to keep
local (to CPU) cache up to date
• Lots of traffic
• Slows down writes
• Remember bogus write through caches!
Write back
• Updates initially made in cache only
• Update bit for cache slot is set when update occurs
• If block is to be replaced, write to main memory only if
update bit is set
• Other caches get out of sync
• I/O must access main memory through cache
• N.B. 15% of memory references are writes
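The traffic difference between the two policies can be sketched by counting memory writes for repeated writes to one cached block (the counts and loop below are an illustrative model, not a simulator):

```python
# Hypothetical sketch: memory-write traffic, write-through vs write-back,
# for repeated writes to a single cached block.

mem_writes_through = 0   # writes reaching main memory (write-through)
mem_writes_back = 0      # writes reaching main memory (write-back)
dirty = False            # the "update" (dirty) bit for the cache slot

writes_to_block = 10     # e.g. ten writes to the same block while cached

for _ in range(writes_to_block):
    mem_writes_through += 1   # write-through: every write goes to memory
    dirty = True              # write-back: just set the update bit

# Write-back pays a single memory write, at eviction, if the bit is set
if dirty:
    mem_writes_back += 1

print(mem_writes_through, mem_writes_back)  # 10 vs 1
```

This is why write-back cuts bus traffic, and also why other caches and I/O can see stale memory until the dirty block is written back.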
Multilevel Caches
• High logic density enables caches on chip
– Faster than bus access
– Frees bus for other transfers
• Common to use both on and off chip cache
– L1 on chip, L2 off chip in static RAM
– L2 access much faster than DRAM or ROM
– L2 often uses separate data path
– L2 may now be on chip
– Resulting in an additional L3 cache
• L3 accessed via the external bus, or now also on chip…
Measuring Cache Performance
• No cache: Often about 10 cycles per memory access
• Simple cache:
– tave = hC + (1-h)M
– C is often 1 clock cycle
– Assume M is 17 cycles (to load an entire cache line)
– Assume h is about 90%
– tave = 0.9(1) + 0.1(17) = 2.6 cycles/access
– What happens when h is 95%?
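The slide's arithmetic can be checked with a short helper, using the values given above (C = 1 cycle, M = 17 cycles):

```python
# Average access time for a single-level cache: t_ave = h*C + (1-h)*M
C = 1    # cache hit time (cycles)
M = 17   # miss penalty: time to load an entire cache line (cycles)

def t_ave(h):
    """Average cycles per memory access for hit ratio h."""
    return h * C + (1 - h) * M

print(round(t_ave(0.90), 2))  # 2.6 cycles/access, as on the slide
print(round(t_ave(0.95), 2))  # 1.8 cycles/access
```

So raising the hit ratio from 90% to 95% cuts the average access time from 2.6 to 1.8 cycles: small hit-ratio gains have a large effect because each miss costs 17 cycles.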
Multi-level cache performance
• tave = h1C1 + (1-h1) h2C2 + (1-h1) (1-h2) M
– h1 = hit rate in primary cache
– h2 = hit rate in secondary cache
– C1 = time to access primary cache
– C2 = time to access secondary cache
– M = miss penalty (time to load an entire cache line
from main memory)
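A sketch of the two-level formula; the example hit rates and access times plugged in below are assumptions for illustration, not values from this slide:

```python
# Two-level average access time:
#   t_ave = h1*C1 + (1-h1)*h2*C2 + (1-h1)*(1-h2)*M
def t_ave2(h1, h2, C1, C2, M):
    """Average cycles per access with an L1 and L2 cache."""
    return h1 * C1 + (1 - h1) * h2 * C2 + (1 - h1) * (1 - h2) * M

# Assumed example: 95% L1 hit rate, 90% L2 hit rate,
# 1-cycle L1, 25-cycle L2, 500-cycle main memory.
print(round(t_ave2(0.95, 0.90, 1, 25, 500), 3))
```

Each term is (probability the access is resolved at that level) × (time to get there), which is why the main-memory term is weighted by both miss rates.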
Processor Performance Without Cache
Performance with Level 1 Cache
Performance with L1 and L2 Caches
• Assume:
– L1 hit rate, h1 = 0.95
– L2 hit rate, h2 = 0.90 (this is very optimistic!)
– L2 access time = 5ns = 25 cycles
• CPI = 1 + # stall cycles
= 1 + 0.05 (25 + 0.10 x 500)
= 1 + 3.75 = 4.75
• Processor speed increase due to both caches
= 501/4.75 = 105.5
• Speed increase due to L2 cache
= 26/4.75 = 5.47
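The figures 501 and 26 come from the two preceding slides and imply a 500-cycle main-memory access (no cache: CPI = 1 + 500 = 501; L1 only: CPI = 1 + 0.05 × 500 = 26). A quick check of the arithmetic:

```python
# Reproducing the slide's CPI calculation. The 500-cycle main-memory
# access time is implied by the earlier no-cache (501) and L1-only (26)
# CPI figures.
l1_miss = 0.05   # L1 miss rate (h1 = 0.95)
l2_miss = 0.10   # L2 miss rate (h2 = 0.90)
l2_time = 25     # L2 access time in cycles (5 ns)
mem_time = 500   # main-memory access time in cycles

cpi = 1 + l1_miss * (l2_time + l2_miss * mem_time)
print(round(cpi, 2))                 # 4.75
print(round(501 / cpi, 1))           # speedup over no cache: 105.5
print(round(26 / cpi, 2))            # speedup over L1 alone: 5.47
```

Note the structure: every L1 miss pays the L2 access time, and the fraction of those that also miss in L2 pays the full memory time on top.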
Example
Hit Ratio (L1 & L2)
For 8-Kbyte and 16-Kbyte L1 caches
Unified v Split Caches
• One cache for data and instructions, or two: one for data and one for
instructions
• Advantages of unified cache
– Higher hit rate
• Balances load between instruction and data fetches
– Only one cache to design & implement
• Advantages of split cache
– Eliminates cache contention between instruction fetch/decode
unit and execution unit
• Important in pipelining
Pentium 4 Cache
• 80386 – no on chip cache
• 80486 – 8k using 16 byte lines and four way set associative organization
• Pentium (all versions) – two on chip L1 caches
– Data & instructions
• Pentium III – L3 cache added off chip
• Pentium 4
– L1 caches
• 8k bytes
• 64 byte lines
• four way set associative
– L2 cache
• Feeding both L1 caches
• 256k
• 128 byte lines
• 8 way set associative
– L3 cache on chip
Pentium 4 Design Reasoning
• Decodes instructions into RISC-like micro-ops before L1 cache
• Micro-ops are fixed length
– Superscalar pipelining and scheduling
• Pentium instructions long & complex
• Performance improved by separating decoding from scheduling & pipelining
– (More later – ch14)
• Data cache is write back
– Can be configured to write through
• L1 cache controlled by 2 bits in a control register
– CD = cache disable
– NW = not write through
– 2 instructions: one to invalidate (flush) the cache, one to write back then invalidate
• L2 and L3 8-way set-associative
– Line size 128 bytes
ARM Cache Features
Core | Cache Type | Cache Size (kB) | Cache Line Size (words) | Associativity | Location | Write Buffer Size (words)
Review Questions
❑What are the differences among sequential access, direct access, and random
access?
❑What is the general relationship among access time, memory cost, and capacity?
❑How does the principle of locality relate to the use of multiple memory levels?
❑What is the distinction between spatial locality and temporal locality?
❑In general, what are the strategies for exploiting spatial locality and temporal
locality?
Thank you