COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface
Lecture 2: CPU Fundamentals
Computer Architecture
Injae Yoo
VLSI Research Lab, PNU
TECHNOLOGY TRENDS
Technology Trends
Semiconductor technology continues to evolve
Increased capacity and performance
Reduced cost
Year   Technology                    Relative performance/cost
1951   Vacuum tube                   1
1965   Transistor                    35
1975   Integrated circuit (IC)       900
1995   Very large scale IC (VLSI)    2,400,000
2013   Ultra large scale IC          250,000,000,000
Technology Trends: Logic
Integrated circuit technology (Moore’s Law)
Transistor density: 35%/year
Die size: 10-20%/year
Integration overall: 40-55%/year
As transistor size ↓: performance (clock rate) ↑, power ↓
Technology Trends: Memory
DRAM capacity: 25-40%/year (slowing)
8 Gb (2014), 16 Gb (2019), …
Flash capacity: 50-60%/year
8-10X cheaper/bit than DRAM
Magnetic disk capacity: recently slowed to 5%/year
Density increases may no longer be possible; capacity may still grow by going from 7 to 9 platters
8-10X cheaper/bit than Flash
200-300X cheaper/bit than DRAM
Power Trends
In CMOS IC technology
$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$
Power: ×30; Voltage: 5V → 1V; Frequency: ×1000
Power Trends
Intel 80386 consumed ~2W (1986)
3.3 GHz Intel Core i7 consumes ~130W
Heat must be dissipated from 1.5 x 1.5 cm chip
Hits the air cooling limit
Reducing Power
Suppose a new CPU has
85% of capacitive load of old CPU
15% voltage and 15% frequency reduction
$\frac{P_{new}}{P_{old}} = \frac{C_{old} \times 0.85 \times (V_{old} \times 0.85)^2 \times F_{old} \times 0.85}{C_{old} \times V_{old}^2 \times F_{old}} = 0.85^4 = 0.52$
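The power ratio for this example can be checked numerically. A minimal sketch, assuming the CMOS dynamic power model Power = Capacitive load × Voltage² × Frequency; the function name `dynamic_power_ratio` is illustrative and not from the slides.

```python
def dynamic_power_ratio(cap_scale: float, volt_scale: float, freq_scale: float) -> float:
    """Return P_new / P_old under the model P = C * V^2 * F.

    Each argument is a new value divided by the corresponding old value.
    """
    return cap_scale * volt_scale ** 2 * freq_scale


# New CPU: 85% of the capacitive load, 15% lower voltage, 15% lower frequency.
ratio = dynamic_power_ratio(0.85, 0.85, 0.85)
print(f"P_new / P_old = {ratio:.2f}")
```

Voltage enters squared, so the same 15% reduction saves more power when applied to voltage than to capacitive load or frequency.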
The power wall
We can’t reduce voltage further
We can’t remove more heat
How else can we improve performance?
Uniprocessor Performance
Constrained by power, instruction-level parallelism,
memory latency
Multiprocessors
Multicore microprocessors
More than one processor per chip
Requires explicitly parallel programming
Compare with instruction level parallelism
Hardware executes multiple instructions at once
Hidden from the programmer
Parallel programming is hard to do
Programming for performance
Load balancing
Optimizing communication and synchronization
WHAT’S A COMPUTER?
(+ CPU, ISA)
The Computer Revolution
Progress in computer technology
Underpinned by domain-specific accelerators
Makes novel applications feasible
World Wide Web
Search Engines
Smartphones
Computers in automobiles
Computers are pervasive
Classes of Computers
Personal computers
General purpose, variety of software
Subject to cost/performance tradeoff
Server computers
Network based
High capacity, performance, reliability
Range from small servers to building sized
Supercomputers
Type of server
High-end scientific and engineering calculations
Highest capability but represent a small fraction of the overall market
Embedded computers
Hidden as components of systems
Stringent power/performance/cost constraints
The PostPC Era
Personal Mobile Device (PMD)
Battery operated
Connects to the Internet
Hundreds of dollars
Smart phones, tablets, electronic glasses
Apple and Samsung
Cloud computing
Warehouse Scale Computers (WSC)
Software as a Service (SaaS)
Portion of software run on a PMD and a portion run in the Cloud
Amazon, Microsoft, and Google
Modern Computer in a Nutshell
From the perspective of programs running on the CPU, it’s just a bunch of memory space.
So, we will call everything just “memory”
Storage (NAND Flash Memory)
• Where entire OS (Android) and apps are stored
• Where all data is stored
• While your phone is idle or off
Main Memory (DRAM)
• Where chunks of OS and apps are loaded
• Where chunks of data are loaded
• While your phone is running
• I$ (instruction cache, SRAM): Where programs (OS / app) being executed are loaded
• D$ (data cache, SRAM): Where data being processed / created are loaded
• CPU (Microprocessor): Where data are processed / created according to programs
• Register file: A scribble pad of CPU
Apple iPad Pro Teardown
“The processor”
(CPU+GPU+NPU)
Storage
Memory
I/O
Power
Inside the Processor
Apple A12 processor
• 2 + 4 ARM CPU cores
• On-chip cache memories
• GPU, NPU
• DRAM controllers
Inside the CPU
Datapath: performs operations on data
Control: sequences datapath, memory, ...
Cache memory
Small fast SRAM memory for immediate access to data
[Figure: CPU (microprocessor) with I$ (SRAM), D$ (SRAM), and register file]
What We Will Learn (Briefly) Today
How programs are translated into the machine language
And how the CPU executes them
The hardware/software interface: the instruction set architecture (ISA)
What determines program performance
And how it can be improved
How CPU hardware designers improve performance
Below Your Program
Application software
Written in high-level language (HLL) – C, Python, …
System software
Compiler: Translates HLL code to machine code
Operating System
Handling system input/output
Managing memory and storage
Scheduling tasks & sharing resources
Hardware
Processor, memory, I/O controllers
Levels of Program Code
High-level language
Level of abstraction closer to
problem (or algorithm) domain
Provides for productivity and portability
Assembly language
Textual representation of instructions
Hardware representation
Binary digits (bits)
Encoded instructions and data
What is a Program (or Software)?
Sequences of instructions to do a certain task
Example: Finding length of a text string
In a high-level programming language (like C):
In a low-level language (or assembly), produced by compiling the high-level code:
Each line is one microprocessor instruction
What is an Instruction?
A unit of microprocessor operations
Instruction set architecture (ISA):
Set of instructions, defining a microprocessor (CPU) architecture
x86 (Intel) vs ARM vs RISC-V
Usually 3 types of instructions
Load/Store
Arithmetic
Branch
[Figure: the entire set of basic RISC-V instructions, labeled by type: load, store, arithmetic, branch]
What is an Instruction?
3 types of instructions
Load/Store
Read/Write data from/to memory
(or external devices through memory-mapped IO)
Arithmetic
Process data inside CPU
Add, subtract, shift, …
Branch
Change program counter
(or change the program’s execution sequence)
For example, leaving a for loop after some iterations
[Figure: CPU (microprocessor) with I$ (SRAM) and D$ (SRAM), annotated with Branch, Load, Arithmetic, and Store]
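The three instruction types can be sketched with a toy interpreter. This is a minimal illustration, not a real ISA: the instruction names (`li`, `load`, `store`, `addi`, `beqz`, `jump`), the tuple encoding, and the register names are all hypothetical. The sample program finds the length of a zero-terminated text string held in memory.

```python
def run(program, memory):
    """Execute a list of hypothetical instructions; return the registers."""
    regs = {}
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        op, *args = program[pc]
        pc += 1  # default: continue with the next instruction
        if op == "li":        # arithmetic: set a register to a constant
            rd, imm = args
            regs[rd] = imm
        elif op == "addi":    # arithmetic: add a constant inside the CPU
            rd, rs, imm = args
            regs[rd] = regs[rs] + imm
        elif op == "load":    # load: read memory at the address in ra
            rd, ra = args
            regs[rd] = memory[regs[ra]]
        elif op == "store":   # store: write rs to memory at the address in ra
            rs, ra = args
            memory[regs[ra]] = regs[rs]
        elif op == "beqz":    # branch: change pc if the register is zero
            rs, target = args
            if regs[rs] == 0:
                pc = target
        elif op == "jump":    # branch: always change pc
            (target,) = args
            pc = target
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs


# Memory: the characters of "hello", a 0 terminator, and one result slot.
memory = [ord(ch) for ch in "hello"] + [0, 0]
program = [
    ("li", "p", 0),          # 0: p = address of the first character
    ("load", "c", "p"),      # 1: c = memory[p]
    ("beqz", "c", 5),        # 2: leave the loop at the terminator
    ("addi", "p", "p", 1),   # 3: p = p + 1
    ("jump", 1),             # 4: back to the top of the loop
    ("li", "out", 6),        # 5: out = address of the result slot
    ("store", "p", "out"),   # 6: memory[out] = p (the length, since p started at 0)
]
regs = run(program, memory)
print("length stored in memory:", memory[6])
```

Only the branch instructions touch `pc` directly; everything else falls through to the next instruction, which is what "change the program's execution sequence" means here.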
PERFORMANCE OF A CPU
Understanding “Performance”
Algorithm (or a program) to execute
Determines number of operations executed
Programming language, compiler, architecture
Determine number of machine instructions executed per operation
Processor and memory system
Determine how fast instructions are executed
Defining Performance
Which airplane has the best performance?
Response Time and Throughput
Response time
How long it takes to do a task
Throughput
Total work done per unit time
e.g., tasks/transactions/… per hour
How are response time and throughput affected by
Replacing the processor with a faster version?
Adding more processors?
We’ll focus on response time for now…
Relative Performance
Define Performance = 1 / Execution Time
“X is n times faster than Y”
$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$
Example: Time taken to run a program
10s on A, 15s on B
$\text{Execution Time}_B / \text{Execution Time}_A = 15\,\text{s} / 10\,\text{s} = 1.5$
So A is 1.5 times faster than B
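The same comparison as a short sketch, assuming Performance = 1 / Execution Time as defined on this slide; the function name `times_faster` is illustrative.

```python
def times_faster(time_x: float, time_y: float) -> float:
    """Return n such that X is n times faster than Y.

    n = Performance_X / Performance_Y = time_y / time_x.
    """
    if time_x <= 0 or time_y <= 0:
        raise ValueError("execution times must be positive")
    return time_y / time_x


# The program takes 10s on A and 15s on B.
n = times_faster(time_x=10.0, time_y=15.0)
print(f"A is {n} times faster than B")
```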
CPU Clocking
Operation of digital hardware governed by a constant-rate clock
[Figure: clock waveform marking the clock period; within each clock cycle, data transfer and computation are followed by a state update]
Clock period: duration of a clock cycle
e.g., 250ps = 0.25ns = $250 \times 10^{-12}$ s
Clock frequency (rate): cycles per second
e.g., 4.0GHz = 4000MHz = $4.0 \times 10^{9}$ Hz
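Clock period and clock frequency are reciprocals, so the two examples here describe the same clock. A small sketch of the conversion; the function names are illustrative.

```python
def clock_frequency(period_seconds: float) -> float:
    """Clock frequency in Hz for a given clock period in seconds."""
    return 1.0 / period_seconds


def clock_period(frequency_hertz: float) -> float:
    """Clock period in seconds for a given clock frequency in Hz."""
    return 1.0 / frequency_hertz


freq = clock_frequency(250e-12)   # 250 ps period
period = clock_period(4.0e9)      # 4.0 GHz clock
print(f"{freq / 1e9:.1f} GHz, {period * 1e12:.0f} ps")
```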
CPU Time (= Execution Time)
Performance improved by
Reducing number of clock cycles
Increasing clock rate (frequency)
Hardware designer must often trade off clock rate against
cycle count
$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$
CPU Time Example
Computer A: 2GHz clock, 10s CPU time
Designing Computer B
Aim for 6s CPU time
Can do faster clock, but causes 1.2 × clock cycles
How fast must Computer B clock be?
$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}}$
$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^{9}$
$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^{9}}{6\,\text{s}} = \frac{24 \times 10^{9}}{6\,\text{s}} = 4\,\text{GHz}$
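The calculation for Computer B can be reproduced step by step. A minimal sketch, assuming CPU Time = Clock Cycles / Clock Rate; the function and parameter names are illustrative.

```python
def required_clock_rate(cpu_time_a: float, clock_rate_a: float,
                        cycle_factor: float, target_time_b: float) -> float:
    """Clock rate (Hz) that B needs to finish in target_time_b seconds."""
    cycles_a = cpu_time_a * clock_rate_a   # Clock Cycles = CPU Time x Clock Rate
    cycles_b = cycle_factor * cycles_a     # B needs more cycles than A
    return cycles_b / target_time_b        # Clock Rate = Clock Cycles / CPU Time


# A: 2 GHz, 10 s. B: 1.2x the clock cycles, 6 s target.
rate_b = required_clock_rate(10.0, 2.0e9, 1.2, 6.0)
print(f"Clock Rate B = {rate_b / 1e9:.1f} GHz")
```

B must double the clock rate to cut CPU time by only 40%, because part of the gain is spent on the extra cycles.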
Instruction Count and CPI
Instruction Count (IC) for a program
Determined by program, ISA and compiler
Average cycles per instruction (CPI)
Determined by CPU hardware
If different instructions have different CPI
Average CPI affected by instruction mix
$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$
$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$
That is, CPU Time = IC × CPI × clock period (Tc)
CPI Example
Computer A: Cycle Time = 250ps, CPI = 2.0
Computer B: Cycle Time = 500ps, CPI = 1.2
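Assume the same ISA, so the same instruction count I
Which is faster, and by how much?
$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$ (A is faster…)
$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$
$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$ (…by this much)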
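A quick numeric check of this comparison, assuming CPU Time = Instruction Count × CPI × Cycle Time. The instruction count used below is an arbitrary placeholder; it cancels in the ratio.

```python
def cpu_time_ps(instruction_count: int, cpi: float, cycle_time_ps: float) -> float:
    """CPU time in picoseconds: IC x CPI x cycle time."""
    return instruction_count * cpi * cycle_time_ps


ic = 1_000_000  # hypothetical; the same program runs on both machines
time_a = cpu_time_ps(ic, cpi=2.0, cycle_time_ps=250.0)
time_b = cpu_time_ps(ic, cpi=1.2, cycle_time_ps=500.0)
ratio = time_b / time_a
print(f"A is {ratio:.1f} times faster than B")
```

A wins despite its higher CPI because its cycle time is half of B's.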
CPI in More Detail
If different instruction classes take different numbers of cycles
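$\text{Clock Cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{Instruction Count}_i)$
Weighted average CPI
$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$
where $\text{Instruction Count}_i / \text{Instruction Count}$ is the relative frequency of class $i$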
CPI Example
Alternative compiled code sequences using instructions
in classes A, B, C
Class               A   B   C
CPI for class       1   2   3
IC in sequence 1    2   1   2
IC in sequence 2    4   1   1
Sequence 1: IC = 5
Clock Cycles = 2×1 + 1×2 + 2×3 = 10
Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6
Clock Cycles = 4×1 + 1×2 + 1×3 = 9
Avg. CPI = 9/6 = 1.5
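The two code sequences can be compared with a few lines of code. A minimal sketch using the class CPIs and instruction counts from the table; the dictionary layout and function names are illustrative.

```python
CLASS_CPI = {"A": 1, "B": 2, "C": 3}


def clock_cycles(counts: dict) -> int:
    """Sum of CPI_i x Instruction Count_i over all instruction classes."""
    return sum(CLASS_CPI[cls] * n for cls, n in counts.items())


def average_cpi(counts: dict) -> float:
    """Weighted average CPI: total clock cycles / total instruction count."""
    return clock_cycles(counts) / sum(counts.values())


sequence_1 = {"A": 2, "B": 1, "C": 2}
sequence_2 = {"A": 4, "B": 1, "C": 1}
for name, seq in (("Sequence 1", sequence_1), ("Sequence 2", sequence_2)):
    print(name, "cycles =", clock_cycles(seq), "avg CPI =", average_cpi(seq))
```

Sequence 2 executes more instructions but needs fewer clock cycles, so instruction count alone does not predict which sequence is faster.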
Performance Summary
Performance depends on
Algorithm: affects IC, possibly CPI
Programming language: affects IC, CPI
Compiler: affects IC, CPI
Instruction set architecture: affects IC, CPI, Tc (clock cycle time)
Hardware design: affects Tc
(Algorithm, programming language, compiler: SW; hardware design: HW; ISA: both HW and SW)
$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$
SPEC CPU Benchmark
Programs used to measure performance
Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
Develops benchmarks for CPU, I/O, …
SPEC CPU2006
Elapsed time to execute a selection of programs
Negligible I/O, so focuses on CPU performance
Normalize relative to reference machine
Summarize as geometric mean of performance ratios
CINT2006 (integer) and CFP2006 (floating-point)
$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$
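The geometric mean of the execution time ratios can be computed as below. A minimal sketch: the ratios are made-up placeholder values, not SPEC results, and the function name is illustrative. Python's standard library also provides `statistics.geometric_mean` (Python 3.8+).

```python
import math


def geometric_mean(ratios):
    """n-th root of the product of n positive execution time ratios."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.prod(ratios) ** (1.0 / len(ratios))


# Hypothetical ratios (reference time / measured time) for three programs.
hypothetical_ratios = [1.0, 2.0, 4.0]
summary = geometric_mean(hypothetical_ratios)
print(f"geometric mean = {summary:.2f}")
```

With the geometric mean, the comparison between two machines does not depend on which reference machine the ratios were normalized to.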
SPECspeed 2017 Integer benchmarks on a
1.8 GHz Intel Xeon E5-2650L
Amdahl’s Law
Pitfall: improving an aspect of a computer and expecting a proportional improvement in overall performance
$T_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}$
Example: multiply accounts for 80s/100s
How much improvement in multiply performance to get 5× overall?
$20 = \frac{80}{n} + 20$ → Can’t be done!
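The multiply example can be explored numerically. A minimal sketch, assuming T_improved = T_affected / improvement factor + T_unaffected; the function names are illustrative, and `required_factor` returns `None` when no finite improvement factor reaches the target.

```python
def improved_time(t_affected: float, t_unaffected: float, factor: float) -> float:
    """Execution time after speeding up the affected part by `factor`."""
    return t_affected / factor + t_unaffected


def required_factor(t_affected: float, t_unaffected: float, target_time: float):
    """Improvement factor needed to reach target_time, or None if impossible."""
    remaining = target_time - t_unaffected
    if remaining <= 0:
        return None  # the unaffected part alone already uses up the target
    return t_affected / remaining


# Multiply accounts for 80s of 100s; 5x overall means a 20s target.
print(required_factor(80.0, 20.0, 100.0 / 5))   # no factor is enough
print(required_factor(80.0, 20.0, 100.0 / 2))   # 2x overall is reachable
print(improved_time(80.0, 20.0, 1e9))           # stays above 20s
```

However fast multiply becomes, the total stays above the 20s spent elsewhere, so the best achievable overall speedup is just under 5×.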
Corollary: make the common case fast!
Concluding Remarks
Cost/performance is improving
Due to underlying technology development
Hierarchical layers of abstraction
In both hardware and software
Instruction set architecture
The hardware/software interface
Execution time: the best performance measure
Power is a limiting factor
Use parallelism to improve performance