Memory Hierarchy: Improving Performance
Professor Alvin R. Lebeck
Computer Science 220
Fall 2006
Admin
Homework #4 Due November 2nd
Work on Projects
Midterm
Max 98
Min 50
Mean 80
Read NUCA paper
Alvin R. Lebeck 2006
Review: ABCs of caches
Associativity
Block size
Capacity
Number of sets S = C/(BA)
1-way (Direct-mapped)
A = 1, S = C/B
N-way set-associative
Fully associative
S = 1, C = BA
Know how a specific piece of data is found
Index, tag, block offset
Write Policies
We know about write-through vs. write-back
Assume: a 16-bit write to memory location 0x00
causes a cache miss.
Do we change the cache tag and update data in the
block?
Yes: Write Allocate
No: Write No-Allocate
Do we fetch the other data in the block?
Yes: Fetch-on-Write (usually do write-allocate)
No: No-Fetch-on-Write
Write-around cache: write-through with no-write-allocate
Sub-block Cache (Sectored)
Sub-block:
Share one cache tag between all sub-blocks in a block
Each sub-block within a block has its own valid bit
Example: 1 KB Direct Mapped Cache, 32-B Block, 8-B Sub-block
Each cache entry will have: 32/8 = 4 valid bits
Miss: only the bytes in that sub-block are brought in.
Reduces cache fill bandwidth (miss penalty).
[Figure: each cache entry holds one cache tag, four sub-block valid bits (SB0-SB3), and four 8-B sub-blocks (data bytes B0-B31); the 32 entries cover bytes 0 through 1023.]
Review: Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level?
(Block placement)
Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the upper level?
(Block identification)
Tag/Block
Q3: Which block should be replaced on a miss?
(Block replacement)
Random, LRU
Q4: What happens on a write?
(Write strategy)
Write Back or Write Through (with Write Buffer)
CPS 220
Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty)
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
Cache Performance
CPUtime = IC x (CPIexecution + (Mem accesses per instruction x Miss rate x Miss penalty)) x Clock cycle time
Hits are included in CPIexecution
Misses per instruction = Memory accesses per instruction x Miss rate
CPUtime = IC x (CPIexecution + Misses per instruction x Miss penalty) x Clock cycle time
Example
Miss penalty 50 clocks
Miss rate 2%
Base CPI 2.0
1.33 references per instruction
Compute the CPUtime
CPUtime = IC x (2.0 + (1.33 x 0.02 x 50)) x Clock
CPUtime = IC x 3.33 x Clock
So CPI increased from 2.0 to 3.33 with a 2% miss rate
Example 2
Two caches: both 64KB, 32 byte blocks, miss penalty
70ns, 1.3 references per instruction, CPI 2.0 w/ perfect
cache
direct mapped
Cycle time 2ns
Miss rate 1.4%
2-way associative
Cycle time increases by 10%
Miss rate 1.0%
Which is better?
Compute average memory access time
Compute CPU time
Example 2 Continued
Ave Mem Acc Time =
Hit time + (miss rate x miss penalty)
1-way: 2.0 + (0.014 x 70) = 2.98ns
2-way: 2.2 + (0.010 x 70) = 2.90ns
CPUtime = IC x CPIexec x Cycle
CPIexec = CPIbase + ((memacc/inst) x Miss rate x miss penalty)
Note: miss penalty x cycle time = 70ns
1-way: IC x ((2.0 x 2.0) + (1.3x0.014x70)) = 5.27 x IC
2-way: IC x ((2.0 x 2.2) + (1.3x0.010x70)) = 5.31 x IC
Review: Cache Performance
CPUtime = IC x (CPIexecution + Mem accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
Hits are included in CPIexecution
Misses per instruction = Memory accesses per instruction x Miss rate
CPUtime = IC x (CPIexecution + Misses per instruction x Miss penalty) x Clock cycle time
Improving Cache Performance
Ave Mem Acc Time =
Hit time + (miss rate x miss penalty)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Reducing Misses
Classifying Misses: 3 Cs
Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses.
(Misses in Infinite Cache)
Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
(Misses in Size X Cache)
Conflict: If the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
(Misses in N-way Associative, Size X Cache)
3Cs Absolute Miss Rate
[Figure: total miss rate (0 to 0.14) vs. cache size (1 KB to 128 KB), decomposed into the 3Cs for 1-, 2-, 4-, and 8-way associativity; conflict misses shrink as associativity rises, capacity misses dominate, and compulsory misses are negligible.]
2:1 Cache Rule
[Figure: the same 3Cs miss-rate data, illustrating the 2:1 cache rule: a direct-mapped cache of size N misses about as often as a 2-way set-associative cache of size N/2.]
3Cs Relative Miss Rate
[Figure: the 3Cs as a percentage of all misses (0% to 100%) vs. cache size (1 KB to 128 KB) for 1- to 8-way associativity; conflict misses are a larger fraction of the total in small, low-associativity caches.]
How Can We Reduce Misses?
Change Block Size? Which of 3Cs affected?
Change Associativity? Which of 3Cs affected?
Change Program/Compiler? Which of 3Cs affected?
1. Reduce Misses via Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16 B to 256 B) for cache sizes 1 KB to 256 KB; larger blocks first reduce the miss rate, but in small caches very large blocks raise it again.]
2. Reduce Misses via Higher Associativity
2:1 Cache Rule:
Miss Rate DM cache size N ~= Miss Rate 2-way cache size N/2
Beware: Execution time is only final measure!
Will Clock Cycle time increase?
Hill [1988] suggested hit time grows by about 10% for an external cache and about 2% for an internal cache, for 2-way vs. 1-way
Example: Avg. Memory Access Time vs. Miss Rate
Example: assume clock cycle time is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the clock cycle time of the direct-mapped cache
Average memory access time by cache size and associativity:

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20
(From 8 KB up, A.M.A.T. is not improved by more associativity: the slower clock outweighs the lower miss rate)
3. Reducing Conflict Misses via Victim Cache
How to combine the fast hit time of Direct Mapped yet still avoid conflict misses?
Add a small buffer to hold data discarded from the cache
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
[Figure: the CPU probes the direct-mapped cache (tag + data) and, on a miss, the small fully associative victim cache (tag + data) before going to memory.]
4. Reducing Conflict Misses via Pseudo-Associativity
How to combine fast hit time of Direct Mapped and have
the lower conflict misses of 2-way SA cache?
Divide cache: on a miss, check other half of cache to see
if there, if so have a pseudo-hit (slow hit)
[Figure: access time line, showing hit time < pseudo hit time < miss penalty.]
Drawback: CPU pipeline design is hard if a hit can take either 1 or 2 cycles
Better for caches not tied directly to processor
5. Reducing Misses by HW Prefetching of Instruction & Data
E.g., Instruction Prefetching
Alpha 21064 fetches 2 blocks on a miss
Extra block placed in stream buffer
On miss check stream buffer
Works with data blocks too:
Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB 4-way set-associative caches
Pointers vs. arrays
Kedem: Markov predictor (address correlation)
Prefetching relies on extra memory bandwidth that
can be used without penalty
6. Reducing Misses by SW Prefetching Data
Data Prefetch
Register Prefetch: load data into a register (HP PA-RISC loads); binding
Cache Prefetch: load data into the cache (MIPS IV, PowerPC, SPARC v9); non-binding
Special prefetching instructions cannot cause faults;
a form of speculative execution
Issuing Prefetch Instructions takes time
Is cost of prefetch issues < savings in reduced misses?
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
7. Reducing Misses by Program/Compiler Optimizations
Instructions
Reorder procedures in memory so as to reduce misses
Profiling to look at conflicts
McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks
Data
Merging Arrays: improve spatial locality by single array of
compound elements vs. 2 arrays
Loop Interchange: change nesting of loops to access data in
order stored in memory
Loop Fusion: Combine 2 independent loops that have same
looping and some variables overlap
Blocking: Improve temporal locality by accessing blocks of data
repeatedly vs. going down whole columns or rows
Merging Arrays Example
/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
Reducing conflicts between val & key
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding
through memory every 100 words
What is miss rate before and after?
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
2 misses per access to a & c vs. one miss per access
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }
Two Inner Loops:
Read all NxN elements of z[ ]
Read N elements of 1 row of y[ ] repeatedly
Write N elements of 1 row of x[ ]
Capacity misses are a function of N and cache size:
if all 3 NxN matrices fit, no capacity misses; otherwise ...
Idea: compute on BxB submatrix that fits
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1) {
                r = 0;
                /* note: min(kk+B, N), not min(kk+B-1, N); with "<" the
                   latter would skip the last element of each tile */
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }
Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
B called Blocking Factor
A six-loop variant exists
Conflict Misses?
Reducing Conflict Misses by Blocking
[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a direct-mapped cache and a fully associative cache.]
Conflict misses in caches that are not fully associative vs. blocking size
Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a factor of 48, even though both fit in the cache
Summary of Program/Compiler Optimizations to Reduce Cache Misses
[Figure: performance improvement (about 1x to 2.5x) from merged arrays, loop interchange, loop fusion, and blocking on vpenta, gmty, btrix, mxm, cholesky (all nasa7), tomcatv, spice, and compress.]
Layout and Cache Behavior
Tile elements spread out in memory because of column-major mapping
Fixed mapping into cache
Self-interference in cache
[Figure: a tile's elements, scattered through memory, map onto conflicting cache locations.]
Making Tiles Contiguous
Recursive layout
Elements of a quadrant are contiguous
Elements of a tile are contiguous
No self-interference in cache
[Figure: memory layout of recursively tiled quadrants and its conflict-free cache mapping.]
Non-linear Layout Functions
[Figure: element orderings of a 4x4 tiled array under three non-linear layouts: 4-D blocked, Morton order, and Hilbert order.]
Different locality properties
Different inclusion properties
Different addressing costs
Performance Improvement
              Ultra 10        Ultra 60        Miata
CPU           UltraSPARC 2i   UltraSPARC 2    Alpha 21164
Clock rate    300 MHz         300 MHz         500 MHz
L1 cache      16KB/32B/1      16KB/32B/1      8KB/32B/1
L2 cache      512KB/64B/1     2MB/64B/1       96KB/64B/3
L3 cache      --              --              2MB/64B/1
RAM           320MB           512MB           512MB
TLB entries   64              64              64
Page size     8KB             8KB             8KB

[Table: normalized execution times (about 0.40 to 1.06) of BLKMXM, RECMXM, STRASSEN, CHOL, STDHAAR, and NONHAAR under the 4-D blocked (4D) and Morton order (MO) layouts on each machine.]
Comparison with TSS
[Figure: BMXM, comparison with TSS; execution time (0 to 8 seconds) vs. problem size (500 to 600 elements) for TSS/Ultra 10 and the 4-D blocked layout on the Ultra 10 with tile sizes t=17 and t=30.]
Summary
CPUtime = IC x (CPIexecution + (Memory accesses / Instruction) x Miss rate x Miss penalty) x Clock cycle time
3 Cs: Compulsory, Capacity, Conflict
How to eliminate them
Program Transformations
Change Algorithm
Change Data Layout
Implication: Think about caches if you want high
performance!