Computer Structure - Memory
BK TP.HCM
Chapter 5: Memory
Presentation Outline
- Random Access Memory and its Structure
- Memory Hierarchy and the need for Cache Memory
- The Basics of Caches
- Cache Performance and Memory Stall Cycles
Random Access
Access time is practically the same for any data location on a RAM chip.
[Figure: RAM chip with an m-bit Address input, a Data bus, and OE / WE control signals]
Memory Technology
Static RAM (SRAM): fast but expensive RAM
- Requires 6 transistors per bit
- Requires low power to retain the bit
- 6-transistor cell with no static current
Cell implementation:
- Cross-coupled inverters store the bit
- Two pass transistors connect the cell to the bit lines (bit and its complement)
Dynamic RAM (DRAM): slow, cheap, and dense memory
- Typical choice for main memory
Cell implementation:
- 1-transistor cell (pass transistor) with a trench capacitor that stores the bit
- The bit is stored as a charge on the capacitor
- Must be refreshed periodically; refreshing is done for all memory rows
A 24-pin dual in-line package for a 16 Mbit = 2^22 × 4 memory
- The 22-bit address is divided into an 11-bit row address and an 11-bit column address
- Row and column addresses are interleaved on the same address lines
[Figure: typical memory structure]
- Row decoder: decodes the row address to select the row to read/write
- Cell matrix: 2D array of tiny memory cells
- Sense/write amplifiers: sense and amplify data on a read; drive the bit lines with data-in on a write
- Column decoder: decodes the c-bit column address to select the column to read/write
DRAM Operation
- Latch and decode the row address to enable the addressed row
Block Transfer
- Fast transfer of blocks between memory and cache
- Fast transfer of pages between memory and disk
Trends in DRAM
Year   Chip size   Type    Row access   Column access   Cycle time (new request)
1980   64 Kbit     DRAM    170 ns       75 ns           250 ns
1983   256 Kbit    DRAM    150 ns       50 ns           220 ns
1986   1 Mbit      DRAM    120 ns       25 ns           190 ns
1989   4 Mbit      DRAM    100 ns       20 ns           165 ns
1992   16 Mbit     DRAM    80 ns        15 ns           120 ns
1996   64 Mbit     SDRAM   70 ns        12 ns           110 ns
1998   128 Mbit    SDRAM   70 ns        10 ns           100 ns
2000   256 Mbit    DDR1    65 ns        7 ns            90 ns
2002   512 Mbit    DDR1    60 ns        5 ns            80 ns
2004   1 Gbit      DDR2    55 ns        5 ns            70 ns
2006   2 Gbit      DDR2    50 ns        3 ns            60 ns
2010   4 Gbit      DDR3    35 ns        1 ns            37 ns
2012   8 Gbit      DDR3    30 ns        0.5 ns          31 ns
DDR4-3200:
- I/O bus clock: 1600 MHz
- Transfer rate: 3200 MT/s (two transfers per clock cycle)
- Module name: PC-25600
- Peak transfer rate: 25600 MB/s (3200 MT/s × 8 bytes per transfer)
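As a sanity check on these figures, here is a minimal C sketch of how the DDR numbers relate; the doubling factor and the 8-byte (64-bit) module width are standard DDR assumptions rather than values stated on this slide:

```c
#include <stdio.h>

int main(void) {
    int bus_clock_mhz = 1600;          /* DDR4-3200 I/O bus clock (MHz)    */
    int mt_per_s = 2 * bus_clock_mhz;  /* double data rate -> 3200 MT/s    */
    int bytes_per_transfer = 8;        /* 64-bit module data bus = 8 bytes */
    int peak_mb_per_s = mt_per_s * bytes_per_transfer; /* 25600 MB/s       */

    printf("DDR4-%d: %d MT/s, module PC-%d, peak %d MB/s\n",
           mt_per_s, mt_per_s, peak_mb_per_s, peak_mb_per_s);
    return 0;
}
```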
Refresh Cycle
- The refresh cycle is about tens of milliseconds
- Refreshing is done for the entire memory
[Figure: stored cell voltage decays over time between refresh cycles, falling toward the voltage for 0]
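A rough sketch of the overhead implied by a refresh interval of tens of milliseconds; the 64 ms period, the row count, and the per-row refresh time below are illustrative assumptions, not values from the slide:

```c
#include <stdio.h>

int main(void) {
    double refresh_period_ms = 64.0;  /* assumed refresh interval          */
    int    rows              = 8192;  /* assumed number of rows to refresh */
    double row_refresh_ns    = 50.0;  /* assumed time to refresh one row   */

    double busy_ns  = rows * row_refresh_ns;             /* time spent refreshing */
    double overhead = busy_ns / (refresh_period_ms * 1e6);
    printf("refresh overhead = %.2f%% of memory time\n", 100.0 * overhead);
    return 0;
}
```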
Memory chips typically have a narrow data bus. We can expand the data bus width by a factor of p:
- Use p RAM chips and feed the same address to all chips
- Use the same Output Enable and Write Enable control signals
[Figure: p RAM chips share the m-bit Address and the OE / WE control signals; their data outputs are concatenated to form the wider data bus]
Next . . .
- Random Access Memory and its Structure
- Memory Hierarchy and the need for Cache Memory
- The Basics of Caches
- Cache Performance and Memory Stall Cycles
The Need for Cache Memory
- Memory bandwidth limits the instruction execution rate
- Cache memory can help bridge the CPU-memory gap
- Cache memory is small in size but fast
[Figure: memory hierarchy from L1 Cache to L2 Cache, over the Memory Bus to Main Memory, and over the I/O Bus to Magnetic or Flash Disk; capacity gets bigger toward the bottom of the hierarchy]
Programs access a small portion of their address space at any time.
Temporal Locality (in time):
- If an item is accessed, it will probably be accessed again soon
- The same loop instructions are fetched each iteration
- The same procedure may be called and executed many times
Goal is to achieve:
- Fast speed of cache memory access
- Balanced cost of the memory system
[Figure: pipelined processor datapath with an I-Cache supplying instruction blocks and a D-Cache supplying data blocks]
- An I-Cache miss or a D-Cache miss causes the pipeline to stall
- The caches interface to the L2 Cache or Main Memory for block transfers
In computer architecture, almost everything is a cache!
- Registers: a software-managed cache on variables
- First-level cache: a cache on the second-level cache
- Second-level cache: a cache on memory
- Memory: a cache on the hard disk
  - Stores recent programs and their data
  - The hard disk can be viewed as an extension to main memory
Next . . .
- Random Access Memory and its Structure
- Memory Hierarchy and the need for Cache Memory
- The Basics of Caches
- Cache Performance and Memory Stall Cycles
Block: unit of data transfer between cache and memory
Direct Mapped Cache:
- A block can be placed in exactly one location in the cache
- In this example: cache index = least significant 3 bits of the memory address
[Figure: direct-mapped cache with 8 blocks (indices 000-111); each of the 32 main-memory blocks (addresses 00000-11111) maps to the cache block selected by its 3 least significant bits]
Direct-Mapped Cache
[Figure: the block address is divided into Tag, Index, and block Offset fields; the Index selects a cache entry holding a Valid bit (V), a Tag, and the Block Data]
[Figure: the stored tag is compared (=) against the address tag; a match on a valid entry signals a Hit and the addressed Data is read out]
Example
The 32-bit address is divided into:
- Tag = 20 bits, Index = 8 bits, block offset = 4 bits
- This corresponds to a direct-mapped cache with 2^8 = 256 blocks of 2^4 = 16 bytes each
Solution: (direct-mapped cache with 32 blocks of 16 bytes each: 5 index bits, 4 offset bits)

Address  Hex    Cache index  Result
1000     0x3E8  0x1E         Miss (first access)
1004     0x3EC  0x1E         Hit
1008     0x3F0  0x1F         Miss (first access)
2548     0x9F4  0x1F         Miss (different tag)
2552     0x9F8  0x1F         Hit
2556     0x9FC  0x1F         Hit
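The indices above can be reproduced with a short C sketch; the 4-bit offset and 5-bit index match the 32-block, 16-byte-block cache of this example:

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 4  /* 16-byte blocks  */
#define INDEX_BITS  5  /* 32 cache blocks */

int main(void) {
    uint32_t addrs[] = {1000, 1004, 1008, 2548, 2552, 2556};
    for (int i = 0; i < 6; i++) {
        uint32_t a     = addrs[i];
        uint32_t index = (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag   = a >> (OFFSET_BITS + INDEX_BITS);
        printf("addr %4u -> tag %u, cache index 0x%02X\n", a, tag, index);
    }
    return 0;
}
```

Note that 1008 and 2548 share index 0x1F but have different tags (1 versus 4), which is why the access to 2548 is a miss with a different tag.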
[Figure: m-way associative cache: the address tag is compared against the tags of all m entries in parallel; a mux selects the Data of the matching entry, and any match signals a Hit]
Set-Associative Cache
- A set is a group of blocks that can be indexed
- A block is first mapped onto a set: Set index = Block address mod Number of sets in cache (see the sketch below)
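A minimal C sketch of the set-index computation above; the block size and number of sets are illustrative parameters, not values from the slide:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t block_size = 16;   /* bytes per block (illustrative) */
    uint32_t num_sets   = 64;   /* number of sets (illustrative)  */
    uint32_t addr       = 2548; /* example byte address           */

    uint32_t block_addr = addr / block_size;
    uint32_t set_index  = block_addr % num_sets; /* block address mod number of sets */
    printf("address %u -> block address %u -> set %u\n", addr, block_addr, set_index);
    return 0;
}
```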
[Figure: m-way set-associative cache: the set index selects one set; the m tags in the set are compared in parallel, and a mux selects the Data of the matching way on a Hit]
Write Policy
Write Through:
- Writes update both the cache and the lower-level memory
- Cache control bit: only a Valid bit is needed
- Memory always has the latest data, which simplifies data coherency
- Cached data can always be discarded when a block is replaced
Write Back:
- Writes update the cache only; modified blocks are written back to memory when replaced
- Cache control bits: both Valid and Modified bits are required
A sketch of the two policies follows.
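A minimal C sketch of how a write hit is handled under each policy; the block layout and the memory_write stand-in are illustrative, not an interface from the slides:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    bool valid;
    bool modified;          /* dirty bit: needed only for write-back */
    unsigned tag;
    unsigned char data[16]; /* 16-byte block, illustrative           */
} CacheBlock;

/* Stand-in for the lower-level memory interface (illustration only). */
static void memory_write(unsigned addr, const unsigned char *src, int n) {
    (void)src;
    printf("memory[%u..%u] updated\n", addr, addr + n - 1);
}

static void write_hit(CacheBlock *b, unsigned addr, unsigned offset,
                      const unsigned char *src, int n, bool write_through) {
    memcpy(&b->data[offset], src, n);   /* both policies update the cache         */
    if (write_through)
        memory_write(addr, src, n);     /* write-through: memory always current   */
    else
        b->modified = true;             /* write-back: mark dirty, write on evict */
}

int main(void) {
    CacheBlock blk = { .valid = true };
    unsigned char word[4] = {1, 2, 3, 4};
    write_hit(&blk, 1000, 8, word, 4, true);   /* write-through */
    write_hit(&blk, 1000, 8, word, 4, false);  /* write-back    */
    printf("dirty = %d\n", blk.modified);
    return 0;
}
```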
No Write Allocate (on a write miss):
- Send the written data to lower-level memory
- The cache is not modified
Write Buffer
- Permits writes to proceed without stall cycles until the buffer is full
- The cache sends a miss signal to stall the processor
- Decide which cache block to allocate/replace
  - Only one choice when the cache is direct-mapped
  - Multiple choices for a set-associative or fully associative cache
- Restart the instruction that caused the cache miss
- Miss Penalty: the number of clock cycles needed to process a cache miss
Replacement Policy
- Which block should be replaced on a cache miss?
  - No selection alternatives for direct-mapped caches
  - m blocks per set to choose from for associative caches
- Random replacement:
  - Candidate blocks are randomly selected
  - One counter for all sets, counting from 0 to m-1, incremented on every cycle
  - On a cache miss, replace the block specified by the counter (sketched below)
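A minimal C sketch of random replacement via a single free-running counter, as described above; m = 4 ways is an illustrative choice:

```c
#include <stdio.h>

#define M 4                   /* blocks (ways) per set            */
static unsigned counter = 0;  /* one counter shared by all sets   */

/* Incremented on every cycle, wrapping from m-1 back to 0. */
static void clock_tick(void) { counter = (counter + 1) % M; }

/* On a cache miss, replace the way currently named by the counter. */
static unsigned victim_way(void) { return counter; }

int main(void) {
    for (int cycle = 0; cycle < 6; cycle++) {
        clock_tick();
        printf("cycle %d: victim way = %u\n", cycle, victim_way());
    }
    return 0;
}
```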
[Table fragment: for a 256 KB cache, hit rates stay between 92.1% and 92.5% across replacement policies and associativities, so the choice of replacement policy matters little for large caches]
Next . . .
- Random Access Memory and its Structure
- Memory Hierarchy and the need for Cache Memory
- The Basics of Caches
- Cache Performance and Memory Stall Cycles
Hit Rate and Miss Rate
- Hit Rate = Hits / (Hits + Misses)
- Miss Rate = Misses / (Hits + Misses)
- I-Cache Miss Rate = miss rate in the Instruction Cache
- D-Cache Miss Rate = miss rate in the Data Cache
Example:
- Out of 1000 instructions fetched, 150 missed in the I-Cache
- 25% are load-store instructions, of which 50 missed in the D-Cache
- What are the I-cache and D-cache miss rates?
Solution:
- I-Cache Miss Rate = 150 / 1000 = 15%
- D-Cache Miss Rate = 50 / (25% × 1000) = 50 / 250 = 20%
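The same computation as a short C sketch, using only the numbers given in the example:

```c
#include <stdio.h>

int main(void) {
    int    insts           = 1000;  /* instructions fetched          */
    int    icache_misses   = 150;   /* I-cache misses                */
    double load_store_frac = 0.25;  /* fraction of loads/stores      */
    int    dcache_misses   = 50;    /* D-cache misses                */

    double i_miss = (double)icache_misses / insts;                     /* 15% */
    double d_miss = (double)dcache_misses / (load_store_frac * insts); /* 20% */
    printf("I-cache miss rate = %.0f%%\n", 100.0 * i_miss);
    printf("D-cache miss rate = %.0f%%\n", 100.0 * d_miss);
    return 0;
}
```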
- CPI_MemoryStalls = CPI_PerfectCache + Memory stall cycles per instruction
- CPI_PerfectCache = CPI for an ideal cache (no cache misses)
- CPI_MemoryStalls = CPI in the presence of memory stalls
- Memory stall cycles increase the CPI
- The cache miss penalty is 100 clock cycles for both the I-cache and the D-cache
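The rest of this example did not survive extraction. As a worked sketch, the C fragment below combines the 100-cycle penalty with the miss rates from the earlier hit/miss-rate example; that pairing, and the base CPI of 1, are assumptions for illustration:

```c
#include <stdio.h>

int main(void) {
    double cpi_perfect  = 1.0;   /* assumed CPI with an ideal cache      */
    double miss_penalty = 100.0; /* clock cycles, from the slide         */
    double i_miss_rate  = 0.15;  /* from the earlier example (assumed)   */
    double d_miss_rate  = 0.20;  /* from the earlier example (assumed)   */
    double ls_fraction  = 0.25;  /* fraction of loads/stores (assumed)   */

    double stalls_per_inst = i_miss_rate * miss_penalty
                           + ls_fraction * d_miss_rate * miss_penalty;
    printf("Memory stall cycles per instruction = %.1f\n", stalls_per_inst);
    printf("CPI with memory stalls = %.1f\n", cpi_perfect + stalls_per_inst);
    return 0;
}
```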
Average Memory Access Time (AMAT)
- Time to access the cache, averaged over both hits and misses
Example: find the AMAT for a cache with:
- Cache access time (hit time) of 1 cycle = 2 ns
- Miss penalty of 20 clock cycles
- Miss rate of 0.05 per access
Solution:
- AMAT = 1 + 0.05 × 20 = 2 cycles = 4 ns
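The same calculation as a minimal C sketch, using only the numbers given above:

```c
#include <stdio.h>

int main(void) {
    double hit_time_cycles = 1.0;   /* cache access time              */
    double miss_rate       = 0.05;  /* misses per access              */
    double miss_penalty    = 20.0;  /* clock cycles                   */
    double cycle_ns        = 2.0;   /* 1 cycle = 2 ns in this example */

    double amat_cycles = hit_time_cycles + miss_rate * miss_penalty;
    printf("AMAT = %.0f cycles = %.0f ns\n", amat_cycles, amat_cycles * cycle_ns);
    return 0;
}
```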
Next . . .
- Random Access Memory and its Structure
- Memory Hierarchy and the need for Cache Memory
- The Basics of Caches
- Cache Performance and Memory Stall Cycles
Average Memory Access Time (AMAT)
- AMAT = Hit time + Miss rate × Miss penalty
- Used as a framework for optimizations
Reduce the hit time:
- Small and simple caches
- Hit time is critical: it affects the processor clock cycle
- A small cache reduces the indexing time and the hit time
  - Indexing the cache is a time-consuming portion of the hit time
  - Tag comparison also adds to the hit time
Conditions under which misses occur:
- Compulsory: the program starts with no blocks in the cache
  - Also called cold-start misses
  - These misses would occur even if the cache had infinite size
- Capacity: misses that occur because the cache cannot hold all the blocks the program needs
- Conflict: misses that occur when too many blocks compete for the same cache set or location
[Figure: miss rate (0% to 14%) versus cache size (16, 32, 64, 128 KB) for associativities from 1-way to 8-way; miss rate drops as cache size and associativity increase]
- Increasing the cache size reduces capacity misses
- It also reduces conflict misses: a larger cache spreads out references to more blocks
- Drawbacks: longer hit time and higher cost
- Larger caches are especially popular as second-level caches
- Higher associativity also improves miss rates: eight-way set-associative is as effective as fully associative
- The simplest way to reduce the miss rate is to increase the block size
- However, a larger block size increases conflict misses if the cache is small
[Figure: miss rate (5% to 25%) versus block size for cache sizes of 1K, 4K, 16K, and 64K; larger blocks reduce compulsory misses but increase conflict misses in small caches]
- 64-byte blocks are common in L1 caches; 128-byte blocks are common in L2 caches
Next . . .
- Random Access Memory and its Structure
- Memory Hierarchy and the need for Cache Memory
- The Basics of Caches
- Cache Performance and Memory Stall Cycles
Multilevel Caches
- The top-level cache should be kept small to keep pace with processor speed
- Adding another cache level:
  - Can reduce the memory gap
  - Can reduce memory bus loading
[Figure: per-core I-Cache and D-Cache backed by lower cache levels]
Example multilevel cache organization:
- 32 KB I-Cache and 32 KB D-Cache per core, 3-cycle latency
- 256 KB unified L2 cache per core, 8-cycle latency
- 32 MB unified shared L3 cache (embedded DRAM), 25-cycle latency to the local slice
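Under a hierarchy like this, the levels combine into one average access time by applying the AMAT formula level by level. The C sketch below uses the latencies above; the miss rates and the main-memory penalty are illustrative assumptions, not values from the slide:

```c
#include <stdio.h>

int main(void) {
    /* Latencies (cycles) from the slide above. */
    double l1_hit = 3.0, l2_hit = 8.0, l3_hit = 25.0;
    /* Assumed values for illustration only. */
    double mem_penalty = 200.0;
    double l1_miss = 0.05, l2_miss = 0.20, l3_miss = 0.10;

    /* AMAT = L1 hit + L1 miss rate * (L2 hit + L2 miss rate * (...)) */
    double amat = l1_hit
                + l1_miss * (l2_hit
                + l2_miss * (l3_hit
                + l3_miss * mem_penalty));
    printf("AMAT = %.2f cycles\n", amat);
    return 0;
}
```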
Multilevel Inclusion
- L1 cache data is always present in the L2 cache
- A miss in L1 that hits in L2 copies the block from L2 to L1
- A miss in both L1 and L2 brings a block into both L1 and L2
- A write in L1 causes data to be written in both L1 and L2
Multilevel Exclusion
- L1 cache data is never found in the L2 cache (the opposite of multilevel inclusion)