Module 5: Emerging NVM: Advanced Topics in Modern VLSI and Architecture
Non-volatile memory, nonvolatile memory, NVM or non-volatile storage, is computer memory that can retain the stored information even when not powered. -www.wikipedia.org
[Figure: timeline of non-volatile storage media (paper tape, HDD, floppy disk, CD-R); years marked: 1864, 1971, 1980, 1988]
1956: First HDD: RAMAC 305 (IBM). 5MB of data at $50,000. As big as two refrigerators. Uses fifty 24-inch platters.
1973: First modern "Winchester" HDD: Model 3340 (IBM).
1979: First 5.25-inch HDD for PC (Shugart Tech., now Seagate Tech.).
1982: First drive with more than 1GB of storage: 1.2GB H-8598, weighing 50kg (Hitachi).
1983: First 3.5-inch HDD: RO352 (10MB, Rodime).
1988: First 2.5-inch HDD: 220 (20MB, Prairie Tek).
1997: Giant magnetoresistive (GMR) heads (IBM).
2000: First 15,000-rpm HDD: Cheetah X15 (Seagate).
2002: 100Gbits per square inch (Seagate).
2006: First 2.5-inch model to use perpendicular magnetic recording, boosting capacity up to 160GB (Seagate).
2006: 1-inch 12GB HDD (Seagate).
2007: First real 1TB hard disk drive (Seagate).
NAND cost reduction becomes a challenge after 2011 because of NAND technology barriers
*Source: IBM
[Table: memory technology comparison; surviving rows: Leakage Current, Refresh Power; surviving entries: None, None, None, None]
R-RAM outperforms NAND in cost (< x1/4), density (> x2) and performance
STT-RAM is ideal for embedded solutions
Competition: IBM = PCM, SEC = PCM/NiO, Toshiba = 3D NAND, Seagate = MRAM, Everspin = MRAM
*Source: ITRS
MRAM Cells
Each MRAM cell consists of one transistor and one Magnetic Tunnel Junction (MTJ).
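[Figure: MTJ states. High resistance = 1, low resistance = 0; layers: free layer, reference layer]
[Figure: 3D view of the cell; labels: Bit Line, Source Line, Word Line (gate), Drain, Source, metal layers M1, M2, M3]
[Figure: resistance distribution; axes: Probability vs. Resistance (Ohm); markers: Rmax, Rref]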
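The read operation suggested by the resistance distribution can be sketched in a few lines. This is only an illustration, assuming the convention shown on the slide (high resistance reads as 1, low resistance as 0) and a reference resistance placed between the two states. The function name `sense_bit` and all resistance values are hypothetical and do not come from the slides.

```python
def sense_bit(r_cell_ohm: float, r_ref_ohm: float) -> int:
    """Return the stored bit by comparing the MTJ resistance with a reference.

    Assumed convention: high resistance (anti-parallel layers) = 1,
    low resistance (parallel layers) = 0.
    """
    return 1 if r_cell_ohm > r_ref_ohm else 0


# Hypothetical resistances in ohms, for illustration only.
R_LOW, R_HIGH = 2000.0, 4000.0
R_REF = (R_LOW + R_HIGH) / 2  # reference placed midway between the two states

bits = [sense_bit(r, R_REF) for r in (R_LOW, R_HIGH)]
```

The wider the gap between the two resistance distributions and the reference, the more tolerant the read is to cell-to-cell variation.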
Cache configurations
2MB (16 x 128KB) SRAM cache
8MB (16 x 512KB) MRAM cache
Pros: Low leakage power, high density.
Cons: Long write latency and large write energy.
Replace SRAM caches with MRAM? (HPCA 2009)
Direct Replacement
Replace SRAM with MRAM of the same area. The number of banks is kept the same. The capacity of the L2 cache increases by three times (from 2MB to 8MB).
IPC (SRAM vs. MRAM)
The last four benchmarks have high write intensities (see Observation 1).
Observation 1
Replacing SRAM L2 caches directly with MRAM can reduce the L2 miss rate. However, the long access latency of the MRAM cache hurts performance. When the write intensity is high, direct replacement even degrades performance.
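The trade-off in Observation 1, fewer misses against slower accesses, can be seen with a rough average-latency model. The sketch below is illustrative only. The formula is a simplification that charges every L2 access its full array latency, the miss rates and the 300-cycle miss penalty are made-up values, and the cycle counts (SRAM 8; MRAM 20 for reads, 60 for writes) are borrowed from the cache parameters on a later Methodology slide rather than from the direct-replacement study itself.

```python
def avg_l2_latency(read_lat, write_lat, write_frac, miss_rate, miss_penalty):
    """Average cycles per L2 access under a simplified, blocking model."""
    hit_latency = (1 - write_frac) * read_lat + write_frac * write_lat
    return hit_latency + miss_rate * miss_penalty


MISS_PENALTY = 300  # assumed main-memory penalty in cycles


def compare(write_frac, sram_miss=0.10, mram_miss=0.04):
    """Return (SRAM latency, MRAM latency) for a given write fraction.

    The miss rates are hypothetical; the larger MRAM cache is simply
    assumed to miss less often than the smaller SRAM cache.
    """
    sram = avg_l2_latency(8, 8, write_frac, sram_miss, MISS_PENALTY)
    mram = avg_l2_latency(20, 60, write_frac, mram_miss, MISS_PENALTY)
    return sram, mram


low_write = compare(write_frac=0.1)   # read-dominated workload
high_write = compare(write_frac=0.5)  # write-intensive workload
```

With these assumed numbers the MRAM cache comes out slightly ahead for the read-dominated case and clearly behind for the write-intensive case, which is the qualitative behavior the observation describes.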
Power Analysis
(Direct Replacement)
(Normalized to 2M-SRAM-SNUCA)
Observation 2
Replacing SRAM L2 caches directly with MRAM can greatly reduce the leakage power. When the write intensity is high, the dynamic power increases significantly because of the high write energy of the MRAM cache.
Question: How can we improve the performance and further reduce the power of the MRAM cache?
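Observation 2 can be made concrete with a simple power breakdown: total power is leakage plus access rate times energy per access. The sketch below is a hedged illustration. The energy and wattage figures are taken from the cache parameters on a later Methodology slide (SRAM: 0.388 nJ, 1.36 W; MRAM: 0.4/2.3 nJ, 0.15 W); reading the wattage as leakage power is an assumption, and the access rates are hypothetical.

```python
NJ = 1e-9  # joules per nanojoule


def cache_power(leak_w, e_read_nj, e_write_nj, reads_per_s, writes_per_s):
    """Return (leakage, dynamic, total) power in watts."""
    dynamic_w = (reads_per_s * e_read_nj + writes_per_s * e_write_nj) * NJ
    return leak_w, dynamic_w, leak_w + dynamic_w


def sram_power(reads_per_s, writes_per_s):
    # SRAM: same energy assumed for reads and writes.
    return cache_power(1.36, 0.388, 0.388, reads_per_s, writes_per_s)


def mram_power(reads_per_s, writes_per_s):
    # MRAM: cheap reads, expensive writes, low leakage.
    return cache_power(0.15, 0.4, 2.3, reads_per_s, writes_per_s)


# Hypothetical write-intensive workload (accesses per second).
READS, WRITES = 1e8, 2e8
sram_leak, sram_dyn, sram_total = sram_power(READS, WRITES)
mram_leak, mram_dyn, mram_total = mram_power(READS, WRITES)
```

Under these assumptions the MRAM cache still wins on leakage, while its dynamic power is several times that of SRAM, so any technique that removes writes from MRAM attacks the dominant remaining term.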
How can a read request preempt a write request (preemptive condition)?
[Figure labels: Cores, Read Op., Read Data]
The read-preemptive write buffer hides the long write latency of MRAM.
We propose an SRAM-MRAM hybrid cache to reduce the write intensity to MRAM.
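A toy model may help picture how a read-preemptive write buffer behaves. The slides only pose the question of the preemptive condition, so the rule used here is an assumption: a read may cancel the in-flight write if less than a chosen fraction of that write has completed, and the cancelled write restarts later. The class name, the threshold of 0.5, and the 60-cycle write latency (borrowed from a later Methodology slide) are all illustrative.

```python
from collections import deque


class ReadPreemptiveWriteBuffer:
    """Toy model of one MRAM bank fronted by a write buffer."""

    def __init__(self, write_latency=60, preempt_threshold=0.5):
        self.write_latency = write_latency
        self.preempt_threshold = preempt_threshold  # assumed preemptive condition
        self.queue = deque()   # buffered writes waiting for the MRAM array
        self.cycles_done = 0   # progress of the write at the head of the queue

    def write(self, addr, data):
        """The core sees the write complete as soon as it is buffered."""
        self.queue.append((addr, data))

    def tick(self, cycles=1):
        """Advance the in-flight write; retire it when it finishes."""
        while cycles > 0 and self.queue:
            step = min(cycles, self.write_latency - self.cycles_done)
            self.cycles_done += step
            cycles -= step
            if self.cycles_done == self.write_latency:
                self.queue.popleft()
                self.cycles_done = 0

    def read_wait(self, addr):
        """Cycles a read must wait before it can be served."""
        if any(a == addr for a, _ in self.queue):
            return 0  # newest data is still in the buffer; forward it
        if not self.queue:
            return 0  # array is idle
        if self.cycles_done / self.write_latency < self.preempt_threshold:
            self.cycles_done = 0  # cancel the write; it restarts from scratch
            return 0
        return self.write_latency - self.cycles_done  # let the write finish


buf = ReadPreemptiveWriteBuffer()
buf.write(0x40, "dirty line")
buf.tick(10)
early_wait = buf.read_wait(0x80)  # write barely started: read preempts it
buf.tick(45)
late_wait = buf.read_wait(0x80)   # write mostly done: read waits for it
```

The threshold trades read latency against wasted write energy: a cancelled write has to be redone, so preempting a nearly finished write costs more than it saves.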
[Figure labels: MRAM bank, TSV]
Reduce data migrations among MRAM cache banks. Migrate frequently written data to the SRAM cache banks.
IPC Comparison
After adopting T1 and T2, the performance degradation is eliminated. The average IPC increases by 15%.
Total Power Comparison
After adopting T1 and T2, the dynamic power is reduced. The average total power is further reduced by 17%.
Cache configurations
8MB (16 x 512KB) DRAM cache
8MB (16 x 512KB) MRAM cache
[Figure labels: SRAM 6T structure, MTJ]
Comparisons
Feature: SRAM / MRAM / PRAM
Density (ratio): Low (1) / High (4) / High (16)
Dynamic power: Low / Low for read, high for write / Medium for read, high for write
Leakage power: High / Low / Low
Speed: Fast / Fast for read, slow for write / Slow for read, very slow for write
Non-volatility: No / Yes / Yes
Scalability: Yes / Yes / Yes
Endurance: >10^15 / 10^16 / 10^12
PRAM assumes four bits per cell.
A hybrid cache could outperform its single-technology counterpart for power.
[Figure labels: reduce miss rate, high leakage power, increase hit latency, low dynamic power, cache power]
Read/Write
Reads and writes
Reads and writes have different performance/power implications
Varied read/write behaviors for different benchmarks
Emerging memories have different read/write features
RWHCA
Read-write aware Hybrid Cache Architecture (RWHCA) using emerging NVM:
Made of different memory technologies and distinguishes reads from writes
Increase effective cache size under similar area
Reduce leakage power consumption
Read/write exclusive regions in the same cache level
Write region has faster writes and low write power (SRAM)
Intra-cache data movement policies
Placing frequently written data in the write region
Reduce power, may improve performance
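The data movement idea, placing frequently written lines in the SRAM write region, can be sketched with a small model of one cache set. This is a minimal sketch under stated assumptions and is not the policy evaluated in the slides: the saturating write counter, the threshold of 2, the LRU victim choice, and the rule that new lines are installed in the read region are all hypothetical.

```python
from collections import OrderedDict


class RWHybridSet:
    """Toy model: a small write region (SRAM) plus a read region (MRAM)."""

    def __init__(self, write_capacity=2, write_threshold=2):
        self.write_region = OrderedDict()  # tag -> None, ordered by recency
        self.read_region = {}              # tag -> writes seen while in MRAM
        self.write_capacity = write_capacity
        self.write_threshold = write_threshold
        self.mram_writes = 0               # expensive writes actually performed

    def fill(self, tag):
        """Assumed: new lines are installed in the read region."""
        self.read_region[tag] = 0

    def write(self, tag):
        if tag in self.write_region:
            self.write_region.move_to_end(tag)  # cheap SRAM write
            return "sram"
        if tag not in self.read_region:
            self.fill(tag)
        self.read_region[tag] += 1
        if self.read_region[tag] >= self.write_threshold:
            self._promote(tag)  # line is write-hot: move it to SRAM
            return "moved-to-sram"
        self.mram_writes += 1
        return "mram"

    def _promote(self, tag):
        del self.read_region[tag]
        if len(self.write_region) >= self.write_capacity:
            victim, _ = self.write_region.popitem(last=False)  # LRU line
            self.read_region[victim] = 0  # swap it back into MRAM
            self.mram_writes += 1         # the swap costs one MRAM write
        self.write_region[tag] = None


cache_set = RWHybridSet()
outcomes = [cache_set.write("A") for _ in range(4)]
```

In this run only the first of four writes to line A reaches MRAM; the rest are absorbed by the write region. The swap on promotion is the extra dynamic cost such a policy pays for that saving.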
Methodology
[Figure: chiplet organization of the three design cases]
Baseline: Core w/ L1s, L2 (SRAM)
RWHCA: Core w/ L1s, L2 Write (SRAM), L2 Read (MRAM/PRAM)
3DRWHCA: Core w/ L1s, L2 Write (SRAM), L2 Read (MRAM), L3 (PRAM)
Methodology
Cache parameters: CACTI or modified versions
SRAM: 1MB, 8 cycles, 0.388 nJ, 1.36 W (45nm)
MRAM: 4MB, 20/60 cycles (read/write), 0.4/2.3 nJ (read/write), 0.15 W
PRAM: 16MB, 40/200 cycles (read/write), 0.8/1.5 nJ (read/write), 0.3 W
System configuration
Simulator: IBM Mambo
Processor: 8-way issue, out-of-order, 4GHz
L1: 32KB DL1, 32KB IL1, 128B, 4-way, 1 r/w port, 2 cycles
L2/L3: different for design cases
Workloads
30 workloads from SPECINT2006, SPECJBB, NAS, BioPerf, PARSEC, SPLASH2
Various cache size requirements
RWHCA-result
SRAM/MRAM RWHCA L2 performance
5% geometric mean performance improvement over baseline
3% improvement over previous DNUCA policy
DNUCA: move a line to a closer bank on each hit, no difference for reads and writes, other policies
Also achieves better performance than a 3-level SRAM cache (256KB L2 and 1MB L3, similar area)
[Figure: SRAM/MRAM RWHCA L2 performance; bar labels 1.66 and 1.94]
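For comparison, the DNUCA rule quoted above (move a line to a closer bank on each hit, with no distinction between reads and writes) can be sketched as follows. The list-based model, the choice to swap with the line currently in the closer bank, and the function name `dnuca_access` are assumptions made for illustration; miss handling is left out.

```python
def dnuca_access(banks, tag):
    """Look up `tag` in one bank set and promote it one bank closer on a hit.

    `banks[0]` is the bank closest to the core. Returns the index of the
    bank where the hit occurred, or None on a miss.
    """
    for i, line in enumerate(banks):
        if line == tag:
            if i > 0:
                # Assumed promotion: swap with the line in the next-closer bank.
                banks[i - 1], banks[i] = banks[i], banks[i - 1]
            return i
    return None


banks = ["A", "B", "C", "D"]  # one line per bank, closest first
first_hit = dnuca_access(banks, "C")   # hit in bank 2, line moves to bank 1
second_hit = dnuca_access(banks, "C")  # hit in bank 1, line moves to bank 0
```

Because every hit triggers a move regardless of access type, a policy like this can generate migrations that a read-write aware scheme would avoid.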
RWHCA-result
SRAM/MRAM RWHCA L2 power
55% power reduction over baseline
Dynamic power: normal + swap; less leakage power
Lower power than DNUCA and 3-level SRAM
RWHCA-result
SRAM/PRAM RWHCA L2 performance
20% performance degradation relative to baseline
PRAM is not suitable for L2 cache from the performance perspective due to its long write latency
Low endurance, not suitable for lower level cache
[Figure: SRAM/PRAM RWHCA L2 performance; bar labels 1.42 and 1.44]
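The endurance concern can be put into rough numbers. The estimate below is a back-of-the-envelope sketch, not a result from the slides. It assumes that the 10^12 endurance figure on the Comparisons slide belongs to PRAM, and it reuses the 4GHz clock, the 200-cycle PRAM write latency, the 16MB capacity, and the 128B block size from the Methodology slide. The worst case assumes a single line written back-to-back; the ideal case assumes perfect wear leveling across all lines.

```python
def seconds_to_wear_out(endurance_writes, writes_per_second):
    """Time until a cell reaches its endurance limit at a constant write rate."""
    return endurance_writes / writes_per_second


CLOCK_HZ = 4e9           # processor clock from the Methodology slide
PRAM_WRITE_CYCLES = 200  # PRAM write latency from the Methodology slide
ENDURANCE = 1e12         # assumed to be the PRAM endurance entry

# Worst case: one line is rewritten as fast as the array allows.
max_write_rate = CLOCK_HZ / PRAM_WRITE_CYCLES
worst_case_s = seconds_to_wear_out(ENDURANCE, max_write_rate)

# Ideal case: the same write stream spread evenly over every 128B line.
lines = 16 * 1024 * 1024 // 128
ideal_s = worst_case_s * lines

worst_case_hours = worst_case_s / 3600
ideal_years = ideal_s / (3600 * 24 * 365)
```

Under these assumptions the worst case is about 14 hours and the ideal case is on the order of two centuries. The gap between the two is why write-hot lines are the main endurance risk for a cache close to the core.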
Outline
Introduction and Motivation
Methodology
Read-write
3DRWHCA-configuration
SRAM/MRAM/PRAM 3DRWHCA
SRAM + MRAM L2
Total size: 4MB, 256KB SRAM
Write region: SRAM, read region: MRAM
SRAM r/w: 6 cycles, MRAM r: 20 cycles, w: 60 cycles
Bank number: 16, Associativity: 16
Block size: 128B, 1 r/w port
L3 PRAM
32MB (core + L1 has similar area with L2)
L3 bank number: 64, Associativity: 64
Block size: 128B, 1 r/w port
Power: scaled from RWHCA
3DRWHCA-result
3DRWHCA performance
16% geometric mean performance improvement over baseline
11% improvement over SRAM/MRAM RWHCA
[Figure: 3DRWHCA performance; bar labels 1.94, 2.2, 1.88, 1.71]
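The improvements above are reported as geometric means. For readers unfamiliar with the metric, the sketch below shows how a geometric mean of per-workload speedups is computed and how it differs from an arithmetic mean. The speedup values are hypothetical and are not the data behind the figure.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive ratios, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-workload speedups relative to a baseline of 1.0.
speedups = [0.95, 1.05, 1.10, 2.00]

gmean = geometric_mean(speedups)
amean = sum(speedups) / len(speedups)
```

The geometric mean is the usual choice for normalized ratios because a single workload with a very large speedup pulls it up less than it pulls up the arithmetic mean.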
3DRWHCA-result
3DRWHCA power
10% power reduction over baseline, even with a PRAM L3
Higher power than RWHCA
Lower power than 3-level SRAM
Conclusion
Emerging NVM is maturing
Will this bring a new impact on computer architecture and system design?