CS 203A Advanced Computer Architecture
Lecture 1
Instructor: L. N. Bhuyan
9/23/2004
Lec 1-2
Instructor Information
Laxmi Narayan Bhuyan Office: Engg.II Room 351 E-mail: [email protected] Tel: (951) 787-2244 Office Times: W, 3-4.30 pm
9/23/2004
Lec 1-2
Course Syllabus
Introduction, Performance, Instruction Set, Pipelining Appendix A Instruction level parallelism, Dynamic scheduling, Branch Prediction and Speculation Ch 2 Text Limits on ILP and Software Approaches Ch 3 Multiprocessors, Thread-Level ParallelismCh 4 Memory Hierarchy Ch 5 I/O Architectures Papers Text: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufman Publisher Fourth Editon Prerequisite: CS 161
9/23/2004 Lec 1-2 3
Course Details
Grading: Based on Curve Test1: 35 points Test 2: 35 points Project: 30 points
9/23/2004
Lec 1-2
What is *Computer Architecture*
Computer Architecture =
Instruction Set Architecture + Organization + Hardware +
9/23/2004
Lec 1-2
The Instruction Set: a Critical Interface
The actual programmer visible instruction set
software
instruction set
hardware
9/23/2004
Lec 1-2
Instruction-Set Processor Design
Architecture (ISA)
programmer/compiler view
functional appearance to its immediate user/system programmer Opcodes, addressing modes, architected registers, IEEE floating point
Implementation (architecture) processor designer/view
logical structure or organization that performs the architecture Pipelining, functional units, caches, physical registers
Realization
chip/system designer view physical structure that embodies the implementation Gates, cells, transistors, wires
Lec 1-2 7
(chip)
9/23/2004
Hardware
Trends in Technology (Section 1.4 and Figure 1.9):
Feature size (10 microns in 1971 to 0.18 microns in 2001, to 0.045 in 2010!!!)
Minimum size of a transistor or a wire in either the x or y dimension
Logic designs Packaging technology Clock rate Supply voltage Moores Law Number of transistors doubles every 1.5 years (due to smaller feature size and larger die size)
9/23/2004
Lec 1-2
Relationship Between the Three Aspects
Processors having identical ISA may be very different in organization. Processors with identical ISA and nearly identical organization are still not nearly identical.
e.g. Pentium II and Celeron are nearly identical but differ at clock rates and memory systems
Architecture covers all three aspects.
9/23/2004
Lec 1-2
Applications and Requirements
Scientific/numerical: weather prediction, molecular modeling
Need: large memory, floating-point arithmetic
Commercial: inventory, payroll, web serving, e-commerce
Need: integer arithmetic, high I/O
Embedded: automobile engines, microwave, PDAs
Need: low power, low cost, interrupt driven
Network computing: Web, Security, multimedia, games, entertainment
Need: high data bandwidth, application processing, graphics
9/23/2004
Lec 1-2
10
Network bandwidth outpaces Moores law
TCP requirements Rule of thumb: 1GHz for 1Gbps 1000
100 40
100 Network bandwidth
GHz and Gbps
10
10
Moores Law
0.1
Moores Law
1990 1995 2000 2003 2005 2006/7 2010
.01
Time
Classes of Computers
High performance (supercomputers) Balanced cost/performance
Supercomputers Cray T-90, SGI Altix Massively parallel computers Cray T3E
Workstations SPARCstations Servers SGI Origin, UltraSPARC High-end PCs Pentium quads
Low cost/power
Low-end PCs, laptops, PDAs mobile Pentiums
9/23/2004
Lec 1-2
13
Why Study Computer Architecture
Arent they fast enough already?
Are they? Fast enough to do everything we will EVER want?
AI, protein sequencing, graphics
Is speed the only goal?
Power: heat dissipation + battery life Cost Reliability Etc.
Answer #1: requirements are always changing
9/23/2004 Lec 1-2 14
Why Study Computer Architecture
Answer #2: technology playing field is always changing
Annual technology improvements (approx.)
Logic: density + 25%, speed +20% DRAM (memory): density +60%, speed: +4% Disk: density +25%, disk speed: +4%
Designs change even if requirements are fixed. But the requirements are not fixed.
9/23/2004
Lec 1-2
15
Example of Changing Designs
Having, or not having caches
1970: 10K transistors on a single chip, DRAM faster than logic having a cache is bad 1990: 1M transistors, logic is faster than DRAM having a cache is good 2000: 600M transistors -> multiple level caches and multiple CPUs -> Multicore CPUs Will software ever catch up?
9/23/2004
Lec 1-2
16
Performance Growth in Perspective
Same absolute increase in computing power
Big Bang 2001 2001 2003
1971 2001: performance improved 35,000X!!!
What if cars or planes improved at this rate?
9/23/2004
Lec 1-2
17
Measuring Performance
Latency (response time, execution time)
Minimize time to wait for a computation
Energy/Power consumption Throughput (tasks completed per unit time, bandwidth)
Maximize work done in a given interval = 1/latency when there is no overlap among tasks > 1/latency when there is
In real processors there is always overlap (pipelining)
Both are important (Architecture Latency is important,
Embedded system Power consumption is important, and Network Throughput is important)
9/23/2004
Lec 1-2
18
Performance Terminology
X is n times faster than Y means:
Execution timeY Execution timeX =n
X is m% faster than Y means: Execution timeY - Execution timeX Execution timeX
9/23/2004 Lec 1-2 19
X 100% = m
Compute Speedup Amdahls Law
Speedup is due to enhancement(E):
TimeBefore TimeAfter
Speedup (E) =
Execution time w/o E (Before) Execution time w E (After)
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected, what is the Execution timeafter and Speedup(E) ?
9/23/2004 Lec 1-2 20
Amdahls Law
Execution timeafter = ExTimebefore x [(1-F) + F
Speedup(E) =
ExTimebefore ExTimeafter
1 F [(1-F) + S
9/23/2004
Lec 1-2
21
Amdahls Law An Example
Q: Floating point instructions improved to run 2X; but only 10% of execution time are FP ops. What is the execution time and speedup after improvement? Ans:
F = 0.1, S = 2 ExTimeafter = ExTimebefore x [ (1-0.1) + 0.1/2 ] = 0.95 ExTimebefore
Speedup =
ExTimebefore ExTimeafter
1 0.95
= 1.053
Read examples in the book!
9/23/2004 Lec 1-2 22
CPU Performance
The Fundamental Law
seconds instructio ns cycles seconds CPU time program program instructio n cycle
Three components of CPU performance:
Instruction count CPI Clock cycle time
Program Compiler Inst. Set Architecture Arch Physical Design
Inst. Count CPI X X X X X X
Clock
X X X
23
9/23/2004
Lec 1-2
CPI - Cycles per Instruction
Let Fi be the frequency of type I instructions in a program. Then, Average CPI:
Total Cycle CPI Total Instruction Count IC i CPI i Fi where Fi Instruction Count i 1
CPU time Cycle time (CPI i IC i )
i 1 n
Example:
Instruction type Frequency Clock cycles
ALU Load 43% 21% 1 2
Store 12% 2
Branch 24% 2
average CPI = 0.43 + 0.42 + 0.24 + 0.48 = 1.57 cycles/instruction
9/23/2004 Lec 1-2 24
Example (RISC Vs. CISC)
Instruction mix of a RISC architecture.
Inst. Freq. C. C. ALU 50% 1 Load 20% 2 Store 10% 2 Branch 20% 2
Add a register-memory ALU instruction format?
One op. in register, one op. in memory
The new instruction will take 2 cc but will also increase the Branches to 3 cc.
Q: What fraction of loads must be eliminated for this to pay off?
9/23/2004 Lec 1-2 25
Solution
Instr. ALU Load Store Branch Reg/Mem
1.0 CPI=1.5
Fi
.5 .2 .1 .2
CPIi
1 2 2 2
CPIixFi
.5 .4 .2 .4
Ii
.5-X .2-X .1 .2 X
CPIi
1 2 2 3 2
CPIixIi
.5-X .4-2X .2 .6 2X
1-X
(1.7-X)/(1-X)
Exec Time = Instr. Cnt. x CPI x Cycle time
Instr. Cntold x CPIold x Cycle timeold >= Instr. Cntnew x CPInew x Cycle timenew 1.0 x 1.5 >= (1-X) x (1.7-X)/(1-X) X >= 0.2
ALL loads must be eliminated for this to be a win!
9/23/2004 Lec 1-2 26
Improve Memory System
All instructions require an instruction fetch, only a fraction require a data fetch/store.
Optimize instruction access over data access
Programs exhibit locality
Spatial Locality Temporal Locality
Access to small memories is faster
Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories. Registers
9/23/2004
Cache
Memory
Lec 1-2
Disk/Tape
27
Benchmarks
program as unit of work
There are millions of programs Not all are the same, most are very different Which ones to use?
Benchmarks
Standard programs for measuring or comparing performance Representative of programs people care about repeatable!!
9/23/2004 Lec 1-2 28
Choosing Programs to Evaluate Perf.
Toy benchmarks
e.g., quicksort, puzzle No one really runs. Scary fact: used to prove the value of RISC in early 80s Attempt to match average frequencies of operations and operands in real workloads. e.g., Whetstone, Dhrystone Often slightly more complex than kernels; But do not represent real programs Most frequently executed pieces of real programs e.g., livermore loops Good for focusing on individual features not big picture Tend to over-emphasize target feature
Synthetic benchmarks
Kernels
Real programs
e.g., gcc, spice, SPEC89, 92, 95, SPEC2000 (standard performance evaluation corporation), TPCC, TPCD
29
Networking Benchmarks: Netbench, Commbench,
Applications: IP Forwarding, TCP/IP, SSL, Apache, SpecWeb Commbench: www.ecs.umass.edu/ece/wolf/nsl/software/cb/index.html
Execution Driven Simulators: Simplescalar http://www.simplescalar.com/ NepSim http://www.cs.ucr.edu/~yluo/nepsim/
9/23/2004
Lec 1-2
30
MIPS and MFLOPS
MIPS: millions of instructions per second:
MIPS = Inst. count/ (CPU time * 10**6) = Clock rate/(CPI*106) easy to understand and to market inst. set dependent, cannot be used across machines. program dependent can vary inversely to performance! (why? read the book)
MFLOPS: million of FP ops per second.
9/23/2004
less compiler dependent than MIPS. not all FP ops are implemented in h/w on all machines. not all FP ops have same latencies. normalized MFLOPS: uses an equivalence table to even out the various latencies of FP ops.
Lec 1-2 31
Performance Contd.
SPEC CINT 2000, SPEC CFP2000, and TPCC figures are plotted in Fig. 1.19, 1.20 and 1.22 for various machines. EEMBC Performance of 5 different embedded processors (Table 1.24) are plotted in Fig. 1.25. Also performance/watt plotted in Fig. 1.27. Fig.1.30 lists the programs and changes in SPEC89, SPEC92, SPEC95 and SPEC2000 benchmarks.
9/23/2004
Lec 1-2
32