Lecture 8 (Cont.): Cache Memory

William Stallings

Computer Organization
and Architecture
8th Edition
Chapter 4
Cont. Cache Memory

Book: Computer Organization and Architecture, 8th Edition, by William Stallings


Original Slides by : Adrian J Pullin
Cont. Cache Memory
Lecture Outcomes
Understanding of:
• Replacement Algorithms
• Write Policy
• Cache Performance
• Locality of Reference
• Pentium 4 Cache Organization
• ARM Cache Organization
Replacement Algorithms (1): Direct Mapping

• No choice
• Each block only maps to one line
• Replace that line
Replacement Algorithms (2): Associative & Set Associative
• Hardware-implemented algorithms (for speed)
• Least Recently Used (LRU)
– e.g., in 2-way set associative: which of the 2 blocks is the LRU one?
• First In First Out (FIFO)
– replace the block that has been in the cache longest
• Least Frequently Used (LFU)
– replace the block that has had the fewest hits
• Random
A minimal simulation of these policies is sketched below.
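A minimal Python sketch of LRU within a single set (the function name and the access trace are illustrative, not from the slides; real caches implement this in hardware):

```python
from collections import OrderedDict

def simulate_lru(accesses, ways=2):
    """Count hits for one cache set under LRU replacement."""
    lines = OrderedDict()                  # insertion order tracks recency
    hits = 0
    for tag in accesses:
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)         # hit: mark most recently used
        else:
            if len(lines) >= ways:
                lines.popitem(last=False)  # miss: evict least recently used
            lines[tag] = True              # fill the freed line
    return hits

print(simulate_lru([1, 2, 1, 3, 2, 1]))    # -> 1 (only the 3rd access hits)
```

FIFO is the same loop with the move_to_end call removed (eviction order then equals arrival order); LFU would track a hit counter per line instead.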
Write Policy
• Must not overwrite a cache block unless main memory is
up to date
• Multiple CPUs may have individual caches
• I/O may address main memory directly
Write Through
• All writes go to main memory as well as to the cache
• Multiple CPUs can monitor main-memory traffic to keep their local (per-CPU) caches up to date
• Generates lots of memory traffic
• Slows down writes
• Remember bogus write-through caches!
Write Back
• Updates are initially made in the cache only
• An update (dirty) bit for the cache slot is set when an update occurs
• When a block is to be replaced, it is written to main memory only if the update bit is set
• Other caches can get out of sync
• I/O must access main memory through the cache
• N.B. about 15% of memory references are writes
The sketch below contrasts the memory traffic of the two policies.
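A minimal sketch (hypothetical single-line workload, not modeling any particular hardware) of how much main-memory write traffic each policy generates:

```python
def memory_writes(ops, policy):
    """Count main-memory writes for one cache line under a write policy."""
    dirty = False
    mem_writes = 0
    for op in ops:
        if op == "write":
            if policy == "write-through":
                mem_writes += 1    # every write also goes to memory
            else:                  # write-back: defer, set the update bit
                dirty = True
        elif op == "evict":
            if policy == "write-back" and dirty:
                mem_writes += 1    # write back only if the update bit is set
                dirty = False
    return mem_writes

ops = ["write"] * 4 + ["evict"]
print(memory_writes(ops, "write-through"))  # 4 -- lots of traffic
print(memory_writes(ops, "write-back"))     # 1 -- one deferred write
```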
Multilevel Caches
• High logic density enables caches on chip
– Faster than bus access
– Frees bus for other transfers
• Common to use both on and off chip cache
– L1 on chip, L2 off chip in static RAM
– L2 access much faster than DRAM or ROM
– L2 often uses separate data path
– L2 may now be on chip
– Resulting in L3 cache
• L3 accessed via the bus, or now also on chip
Measuring Cache Performance
• No cache: Often about 10 cycles per memory access
• Simple cache:
– tave = hC + (1 − h)M, where h = hit rate, C = cache access time, M = miss penalty
– C is often 1 clock cycle
– Assume M is 17 cycles (time to load an entire cache line)
– Assume h is about 90%
– tave = 0.9(1) + 0.1(17) = 2.6 cycles/access
– What happens when h is 95%? (see the sketch below)
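Evaluating the formula at both hit rates (a quick check of the slide's numbers):

```python
def t_ave(h, C=1, M=17):
    """Average access time: hit rate h, hit time C, miss penalty M (cycles)."""
    return h * C + (1 - h) * M

print(t_ave(0.90))  # 2.6 cycles/access, as above
print(t_ave(0.95))  # 1.8 cycles/access -- 5 more points of hit rate cut t_ave by ~30%
```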

Multi-Level Cache Performance
• tave = h1C1 + (1 − h1)h2C2 + (1 − h1)(1 − h2)M
– h1 = hit rate in the primary cache
– h2 = hit rate in the secondary cache (for accesses that miss in the primary)
– C1 = time to access the primary cache
– C2 = time to access the secondary cache
– M = miss penalty (time to load an entire cache line from main memory)
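The same model in code, with assumed example values (C1 = 1, C2 = 5, M = 17 cycles, and h1 = h2 = 0.9 are illustrative, not from the slide):

```python
def t_ave_2level(h1, h2, C1, C2, M):
    """Two-level average access time, per the formula above."""
    return h1 * C1 + (1 - h1) * h2 * C2 + (1 - h1) * (1 - h2) * M

print(t_ave_2level(0.90, 0.90, 1, 5, 17))  # 1.52 cycles/access
```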
Processor Performance Without Cache
• 5 GHz processor, cycle time = 0.2 ns
• Memory access time = 100 ns = 500 cycles
• Ignoring memory access, Clocks Per Instruction (CPI) = 1
• Assuming no memory data accesses (instruction fetch only):
CPI = 1 + # stall cycles
= 1 + 500 = 501

Performance with Level 1 Cache
• Assume hit rate h1 = 0.95
• 5 GHz processor, cycle time = 0.2 ns
• Memory access time = 100 ns = 500 cycles
• L1 access time = 0.2 ns = 1 processor cycle
• CPI = 1 + # stall cycles
= 1 + 0.05 × 500
= 26
• Processor speedup due to the cache
= 501/26 ≈ 19.3×

Performance with L1 and L2 Caches
• Assume:
– L1 hit rate, h1 = 0.95
– L2 hit rate, h2 = 0.90 (this is very optimistic!)
– L2 access time = 5 ns = 25 cycles
• CPI = 1 + # stall cycles
= 1 + 0.05 × (25 + 0.10 × 500)
= 1 + 3.75 = 4.75
• Processor speedup due to both caches
= 501/4.75 ≈ 105.5×
• Additional speedup due to the L2 cache
= 26/4.75 ≈ 5.5×
The sketch below reproduces these three CPI figures.
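A short check of the running example's numbers (constants follow the slides):

```python
MEM, L2_CYCLES = 500, 25     # 100 ns memory, 5 ns L2, at a 0.2 ns cycle time
H1, H2 = 0.95, 0.90          # L1 and L2 hit rates

cpi_no_cache = 1 + MEM                                     # 501
cpi_l1 = 1 + (1 - H1) * MEM                                # 26.0
cpi_l1_l2 = 1 + (1 - H1) * (L2_CYCLES + (1 - H2) * MEM)    # 4.75

print(cpi_no_cache / cpi_l1)     # ~19.3x speedup from L1 alone
print(cpi_no_cache / cpi_l1_l2)  # ~105.5x speedup from L1 + L2
print(cpi_l1 / cpi_l1_l2)        # ~5.5x additional speedup from L2
```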

[Figure: Total hit ratio (L1 and L2) for 8-KByte and 16-KByte L1 caches]
Unified vs. Split Caches
• One cache for data and instructions, or two: one for data and one for instructions
• Advantages of a unified cache:
– Higher hit rate: balances the load between instruction and data fetches
– Only one cache to design & implement
• Advantages of a split cache:
– Eliminates cache contention between the instruction fetch/decode unit and the execution unit
– Important in pipelining
Pentium 4 Cache
• 80386 – no on-chip cache
• 80486 – 8 KBytes, using 16-byte lines and a four-way set-associative organization
• Pentium (all versions) – two on-chip L1 caches
– Data & instructions
• Pentium III – L3 cache added off chip
• Pentium 4
– L1 caches
• 8 KBytes
• 64-byte lines
• Four-way set associative
– L2 cache
• Feeds both L1 caches
• 256 KBytes
• 128-byte lines
• Eight-way set associative
– L3 cache on chip
Pentium 4 Design Reasoning
• Decodes instructions into RISC-like micro-ops before the L1 cache
• Micro-ops are fixed length
– Enables superscalar pipelining and scheduling
• Pentium instructions are long & complex
• Performance is improved by separating decoding from scheduling & pipelining
– (More later – ch. 14)
• Data cache is write-back
– Can be configured to write-through
• L1 cache controlled by 2 bits in a control register
– CD = cache disable
– NW = not write-through
– 2 instructions to invalidate (flush) the cache, or write back and then invalidate (x86 INVD and WBINVD)
• L2 and L3 caches are 8-way set-associative
– Line size 128 bytes
ARM Cache Features

| Core | Cache Type | Cache Size (kB) | Cache Line Size (words) | Associativity | Location | Write Buffer Size (words) |
|------|------------|-----------------|--------------------------|---------------|----------|---------------------------|
| ARM720T | Unified | 8 | 4 | 4-way | Logical | 8 |
| ARM920T | Split | 16/16 D/I | 8 | 64-way | Logical | 16 |
| ARM926EJ-S | Split | 4-128/4-128 D/I | 8 | 4-way | Logical | 16 |
| ARM1022E | Split | 16/16 D/I | 8 | 64-way | Logical | 16 |
| ARM1026EJ-S | Split | 4-128/4-128 D/I | 8 | 4-way | Logical | 8 |
| Intel StrongARM | Split | 16/16 D/I | 4 | 32-way | Logical | 32 |
| Intel Xscale | Split | 32/32 D/I | 8 | 32-way | Logical | 32 |
| ARM1136-JF-S | Split | 4-64/4-64 D/I | 8 | 4-way | Physical | 32 |
ARM Cache Organization
• Small FIFO write buffer
– Enhances memory write performance
– Sits between the cache and main memory
– Small compared with the cache
– Data is put into the write buffer at processor clock speed
– The processor continues execution
– External memory writes proceed in parallel until the buffer is empty
– If the buffer is full, the processor stalls
– Data in the write buffer is not available until written to memory
• So keep the buffer small (a minimal sketch follows)
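A minimal Python sketch of such a FIFO write buffer (class and method names are hypothetical; the real ARM hardware is considerably more involved):

```python
from collections import deque

class WriteBuffer:
    """Small FIFO write buffer between cache and main memory."""
    def __init__(self, size=4):
        self.size = size
        self.queue = deque()

    def write(self, addr, data):
        if len(self.queue) == self.size:
            self.drain_one()                # buffer full: processor stalls
        self.queue.append((addr, data))     # buffered at processor clock speed

    def drain_one(self):
        # Models one slow external memory write completing.
        self.queue.popleft()

    def pending(self, addr):
        # Buffered data is not yet visible in main memory, so a read of the
        # same address must check the buffer (newest entry first).
        for a, d in reversed(self.queue):
            if a == addr:
                return d
        return None                         # not pending: read main memory
```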
[Figure: ARM cache and write buffer organization]
Review Questions

❑What are the differences among sequential access, direct access, and random
access?
❑What is the general relationship among access time, memory cost, and capacity?
❑How does the principle of locality relate to the use of multiple memory levels?
❑What is the distinction between spatial locality and temporal locality?
❑In general, what are the strategies for exploiting spatial locality and temporal
locality?
Thank you
