Memory Cache
Dheeraj Bhardwaj
Department of Computer Science and Engineering
Indian Institute of Technology, Delhi – 110 016
Performance Counters
[Slide figure: performance data available from hardware counters, e.g., branch mispredictions]
Cache and Its Importance in Performance
• Motivation:
– Time to run code = clock cycles running code + clock cycles waiting for memory
– For many years, CPUs have sped up an average of 50% per year over memory chip speedups.
What is a cache?
• Small, fast storage used to improve average access time
to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
• Registers “a cache” on variables – software managed
• First-level cache a cache on second-level cache
• Second-level cache a cache on memory
• Memory a cache on disk (virtual memory)
• TLB a cache on page table
• Branch-prediction a cache on prediction information?
[Diagram: memory hierarchy from Proc/Regs through L1 Cache, L2 Cache, and Memory down to Disk/Tape; levels get bigger going down and faster going up]
Cache Benefits
• Data cache was designed with two key concepts in mind
– Spatial Locality
• When an element is referenced, its neighbors will be referenced too
• Cache lines are fetched together
• Work on consecutive data elements in the same cache line
– Temporal Locality
• When an element is referenced, it might be referenced again soon
• Arrange code so that data in cache is reused often (see the sketch below)
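A minimal Fortran sketch of both ideas (the variable names here are illustrative, not from the slides):
      real :: x(1024), s
      integer :: i
      x = 1.0
      s = 0.0
      do i = 1, 1024    ! stride-1 walk: consecutive elements share a
         s = s + x(i)   !   cache line (spatial locality); s is touched
      end do            !   every iteration and stays cached/in a
                        !   register (temporal locality)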
Cache-Related Terms
Least Recently Used (LRU): Cache replacement strategy for set
associative caches. The cache block that is least recently used
is replaced with a new block.
Random Replace: Cache replacement strategy for set associative caches. A cache block is randomly replaced.
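A minimal sketch of LRU bookkeeping for a single 2-way set (an illustration in Fortran; the variable names and values are assumptions):
      integer :: tags(2), lru, want
      tags = (/ 7, 9 /)      ! tags currently cached in this set
      lru  = 1               ! way 1 was used least recently
      want = 12              ! tag of the incoming reference
      if (tags(1) == want) then
         lru = 2             ! hit in way 1: way 2 is now LRU
      else if (tags(2) == want) then
         lru = 1             ! hit in way 2: way 1 is now LRU
      else
         tags(lru) = want    ! miss: replace the least recently used
         lru = 3 - lru       !   block; the other way becomes LRU
      end if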
A Modern Memory Hierarchy
[Diagram: Processor (Datapath, Control, Registers) → On-chip Cache → Second-Level Cache (SRAM) → Main Memory (DRAM) → Secondary Storage (Disk) → Tertiary Storage (Tape/Disk)]
Uniprocessor Reality
• Modern processors use a variety of techniques for
performance
– caches
• small amount of fast memory where values are “cached” in hope
of reusing recently used or nearby data
• different memory ops can have very different costs
– parallelism
• superscalar processors have multiple “functional units” that can
run in parallel
• different orders, instruction mixes have different costs
– pipelining
• a form of parallelism, like an assembly line in a factory
• Why is this your problem?
– In theory, compilers understand all of this and can optimize your program; in practice they don't.
Traditional Four Questions for Memory
Hierarchy Designers
• Q1: Where can a block be placed in the upper level?
(Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level?
(Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, LRU
• Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)
Cache-Related Terms
• ICACHE : Instruction cache
• DCACHE (L1) : Data cache closest to registers
• SCACHE (L2) : Secondary data cache
– Data from SCACHE has to go through DCACHE to registers
– SCACHE is larger than DCACHE
– Not all processors have SCACHE
Unified Vs Split Caches
• Unified vs Separate I&D
[Diagram: split design, Proc with separate I-Cache-1 and D-Cache-1 backed by a Unified Cache-2; unified design, Proc with a Unified Cache-1 backed by a Unified Cache-2]
• Example:
– 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%
– 32KB unified: Aggregate miss rate=1.99%
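For comparison (assuming roughly 75% of references are instruction fetches; the slide does not give the mix), the split pair's weighted miss rate is about 0.75 × 0.64% + 0.25 × 6.47% ≈ 2.10%, slightly worse than the unified cache's 1.99%. Split caches are usually preferred anyway, since an instruction fetch and a data access can be served in the same cycle.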
Simplest Cache: Direct Mapped
[Diagram: a 16-location memory (addresses 0–F) mapped onto a 4-byte direct-mapped cache with indexes 0–3]
• Location 0 can be occupied by data from:
– Memory location 0, 4, 8, ... etc.
– In general: any memory location whose 2 LSBs of the address are 0s
– Address<1:0> => cache index
• Which one should we place in the cache?
• How can we tell which one is in the cache?
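A minimal sketch of that address split for the 4-entry cache above (the variable names are illustrative):
      integer :: addr, index, tag
      addr  = 13               ! memory location D
      index = iand(addr, 3)    ! Address<1:0> selects the line (here 1)
      tag   = ishft(addr, -2)  ! the remaining high bits record which of
                               !   the competing locations (1, 5, 9, D)
                               !   currently occupies that line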
Cache Basics
• Cache hit: a memory access that is found in the cache
– cheap
• Cache miss: a memory access that is not in the cache
– expensive, because we need to get the data from elsewhere
• Consider a tiny cache (for illustration only)
[Diagram: a tiny direct-mapped cache; each address splits into tag, line, and offset fields]
Fully Associative Cache
[Diagram: any memory block may be placed in any cache line; the address tag is compared against every line]
Tuning for Caches
1. Preserve locality.
2. Reduce cache thrashing.
3. Loop blocking when out of cache.
4. Software pipelining.
Registers
[Slide figure omitted]
Memory Banking
• This started in the 1960s with both 2- and 4-way interleaved memory banks. Each bank can produce one unit of memory per bank cycle, so multiple reads and writes can proceed in parallel.
– Memory chips must internally recover from an access before it is
reaccessed
• The bank cycle time is currently 4-8 times the CPU clock time and
getting worse every year.
• Very fast memory (e.g., SRAM) is unaffordable in large quantities.
• Interleaving is not perfect. Consider a 4-way interleaved memory and a stride-4 algorithm (see the sketch below): every access falls on the same bank, which is equivalent to a non-interleaved memory system.
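A sketch of the bank selection (the bank numbering and names are illustrative):
      integer :: i, bank
      do i = 0, 12, 4          ! a stride-4 access pattern
         bank = mod(i, 4)      ! 4-way interleaving: bank = address mod 4
         print *, i, bank      ! always bank 0, so every access waits
      end do                   !   out the full bank cycle time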
• Principle of Locality
– Programs access a relatively small portion of the address space at any instant of time.
Principles of Locality
[Slide figure omitted]
Cache Thrashing
[Slide figure omitted]
Processor Stall
[Slide figure omitted]
Indirect Addressing
      d = 0
      do i = 1, n
         j = ind(i)
         d = d + sqrt( x(j)*x(j) + y(j)*y(j) + z(j)*z(j) )
      end do
• Change the loop statement to
      d = d + sqrt( r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j) )
• Note that r(1,j)-r(3,j) are in contiguous memory and probably are in the same cache line (d is probably in a register and is irrelevant). The original form touches 3 different cache lines at every iteration of the loop and can cause cache thrashing.
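The rewritten loop in full, assuming the coordinates are repacked into an array r(3,n) with r(1,j) = x(j), r(2,j) = y(j), r(3,j) = z(j):
      d = 0
      do i = 1, n
         j = ind(i)
         ! one cache line now serves all three coordinates of point j
         d = d + sqrt( r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j) )
      end do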
Cache Thrashing by Memory Allocation
parameter ( m = 1024*1024 )
real a(m), b(m)
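The slide stops at the declarations. A loop of the kind this heading warns about (an assumed completion, not shown on the slide) would be:
      do i = 1, m
         a(i) = a(i) + b(i)   ! a(i) and b(i) lie exactly 4 MB apart;
      end do                  !   when the cache size divides that
                              !   distance, both map to the same
                              !   direct-mapped line and evict each
                              !   other on every iteration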
Cache Blocking
• We want blocks to fit into cache. On a parallel computer with p processors we have p × the cache, so the data may fit into cache on p processors but not on one. This can lead to superlinear speedup! Consider matrix-matrix multiply.
      do k = 1, n
         do j = 1, n
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
         end do
      end do
• An alternate form is ...
Cache Blocking
      ! blocked form (assumes nblk divides n)
      do kk = 1, n, nblk
         do jj = 1, n, nblk
            do ii = 1, n, nblk
               do k = kk, kk+nblk-1
                  do j = jj, jj+nblk-1
                     do i = ii, ii+nblk-1
                        c(i,j) = c(i,j) + a(i,k)*b(k,j)
                     end do
                  end do
               end do
            end do
         end do
      end do
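As a rough rule of thumb (not stated on the slide), nblk is chosen so that three nblk × nblk blocks, one each of a, b, and c, fit in the cache at the same time.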
Lessons
• Algorithm 2 (see the sketch below):
– for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j)
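Algorithm 1 does not survive on the slide; presumably it used the opposite loop order. A sketch of the contrast for Fortran's column-major storage:
      ! assumed Algorithm 1 (i-outer): the inner loop strides across a
      ! row, jumping n elements between touches, so each access may
      ! pull in a new cache line
      do i = 1, n
         do j = 1, n
            A(i,j) = B(i,j) + C(i,j)
         end do
      end do
      ! Algorithm 2 (j-outer, as on the slide): the inner loop walks
      ! down a column through consecutive memory, so one cache line
      ! serves several iterations
      do j = 1, n
         do i = 1, n
            A(i,j) = B(i,j) + C(i,j)
         end do
      end do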