15-740/18-740 Computer Architecture Lecture 3: Performance: Carnegie Mellon University


15-740/18-740

Computer Architecture
Lecture 3: Performance

Prof. Onur Mutlu


Carnegie Mellon University
Last Time …
 Some microarchitecture ideas
 Part of microarchitecture vs. ISA
 Some ISA level tradeoffs
 Semantic gap
 Simple vs. complex instructions -- RISC vs. CISC
 Instruction length
 Uniform decode
 Number of registers

Review: ISA-level Tradeoffs: Number of Registers
 Affects:
 Number of bits used for encoding register address
 Number of values kept in fast storage (register file)
 (uarch) Size, access time, power consumption of register file

 Large number of registers:
+ Enables better register allocation (and optimizations) by compiler → fewer saves/restores
-- Larger instruction size
-- Larger register file size
-- (Superscalar processors) More complex dependency check logic
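
The encoding cost mentioned above can be made concrete with a small sketch. It assumes a hypothetical fixed format with three register operands per instruction; the function names are illustrative and not from the lecture.

```python
def register_field_bits(num_registers: int) -> int:
    """Bits needed to name one of `num_registers` architectural registers."""
    if num_registers < 1:
        raise ValueError("need at least one register")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floating point
    return (num_registers - 1).bit_length()


def register_bits_per_instruction(num_registers: int, operands: int = 3) -> int:
    """Encoding bits spent on register specifiers (assumed 3-operand format)."""
    return operands * register_field_bits(num_registers)


for n in (8, 16, 32, 64, 128):
    print(n, register_field_bits(n), register_bits_per_instruction(n))
```

Under these assumptions, going from 32 to 128 registers raises the register-specifier cost from 15 to 21 bits per instruction, which is the "larger instruction size" downside listed above.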

ISA-level Tradeoffs: Addressing Modes
 Addressing mode specifies how to obtain an operand of an instruction
 Register
 Immediate
 Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, …)

 More modes:
+ help better support programming constructs (arrays, pointer-based accesses)
-- make it harder for the architect to design
-- too many choices for the compiler?
 Having many ways to do the same thing complicates compiler design
 Read Wulf, “Compilers and Computer Architecture”
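
As a rough illustration of how several of the memory addressing modes listed above compute an operand address, here is a minimal sketch. The mode names, the `regs`/`mem` dictionaries, and the function signature are assumptions made for this example, not any real ISA's definition.

```python
def effective_address(mode, regs, mem, base=None, index=None, disp=0, scale=1):
    """Return the memory address an operand refers to (hypothetical toy model)."""
    if mode == "absolute":            # address is a constant in the instruction
        return disp
    if mode == "register_indirect":   # address is held in a register
        return regs[base]
    if mode == "displacement":        # register + constant offset
        return regs[base] + disp
    if mode == "indexed":             # base register + index register
        return regs[base] + regs[index]
    if mode == "scaled":              # base + index * element size
        return regs[base] + regs[index] * scale
    if mode == "memory_indirect":     # address is itself loaded from memory
        return mem[regs[base]]
    raise ValueError(f"unknown addressing mode: {mode}")


regs = {"r1": 100, "r2": 3}
mem = {100: 200}

print(effective_address("displacement", regs, mem, base="r1", disp=8))            # 108
print(effective_address("scaled", regs, mem, base="r1", index="r2", scale=4))     # 112
print(effective_address("memory_indirect", regs, mem, base="r1"))                 # 200
```

The scaled case corresponds to indexing an array of 4-byte elements (`a[i]` with the array base in `r1` and `i` in `r2`). Without that mode, a compiler would need separate shift and add instructions to form the same address.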
x86 vs. Alpha Instruction Formats
 x86: [figure: instruction format diagram]
 Alpha: [figure: instruction format diagram]

x86
[Figure: x86 addressing mode examples: register indirect, absolute, register + displacement, register]

x86
[Figure: x86 addressing mode examples: indexed (base + index), scaled (base + index*4)]

Other ISA-level Tradeoffs
 Load/store vs. Memory/Memory
 Condition codes vs. condition registers vs. compare&test
 Hardware interlocks vs. software-guaranteed interlocking
 VLIW vs. single instruction
 0, 1, 2, 3 address machines
 Precise vs. imprecise exceptions
 Virtual memory vs. not
 Aligned vs. unaligned access
 Supported data types
 Software vs. hardware managed page fault handling
 Granularity of atomicity
 Cache coherence (hardware vs. software)
 …
Programmer vs. (Micro)architect
 Many ISA features are designed to aid programmers
 But they complicate the hardware designer’s job

 Virtual memory
 vs. overlay programming
 Should the programmer be concerned about the size of code blocks?
 Unaligned memory access
 Compiler/programmer needs to align data
 Transactional memory?

Transactional Memory
THREAD 1 and THREAD 2 each execute the same code.

With locks:

enqueue (Q, v) {
    Node_t node = malloc(…);
    node->val = v;
    node->next = NULL;
    acquire(lock);
    if (Q->tail)
        Q->tail->next = node;
    else
        Q->head = node;
    Q->tail = node;
    release(lock);
}

With transactions:

begin-transaction
…
enqueue (Q, v); //no locks
…
end-transaction

Transactional Memory
 A transaction is executed atomically: ALL or NONE

 If there is a data conflict between two transactions, only one of them completes; the other is rolled back
 Both write to the same location
 One reads from the location another writes
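
The all-or-none and conflict rules above can be sketched with a toy software model. This is only an illustration under stated assumptions: writes are buffered until commit, conflicts are detected at commit time by comparing per-location version counters, and the two transactions are interleaved by hand in a single thread. The `Memory` and `Transaction` classes are hypothetical and say nothing about how a hardware TM would be built.

```python
class Memory:
    def __init__(self):
        self.data = {}      # location -> committed value
        self.version = {}   # location -> number of committed writes


class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.seen = {}      # location -> version observed by this transaction
        self.writes = {}    # buffered writes, invisible until commit

    def _observe(self, loc):
        self.seen.setdefault(loc, self.mem.version.get(loc, 0))

    def read(self, loc):
        if loc in self.writes:
            return self.writes[loc]
        self._observe(loc)
        return self.mem.data.get(loc)

    def write(self, loc, value):
        self._observe(loc)  # lets commit detect write-write conflicts too
        self.writes[loc] = value

    def commit(self):
        # Conflict: some location we touched was committed by another transaction.
        for loc, version in self.seen.items():
            if self.mem.version.get(loc, 0) != version:
                self.writes.clear()   # roll back: NONE of the writes take effect
                return False
        for loc, value in self.writes.items():   # ALL writes take effect
            self.mem.data[loc] = value
            self.mem.version[loc] = self.mem.version.get(loc, 0) + 1
        return True


mem = Memory()
mem.data["tail"] = "A"
t1, t2 = Transaction(mem), Transaction(mem)
t1.write("tail", "B")
t2.write("tail", "C")   # both transactions write the same location
ok1 = t1.commit()
ok2 = t2.commit()
print(ok1, ok2, mem.data["tail"])  # True False B
```

In this sketch the rolled-back transaction would simply be re-executed from the start. Which of two conflicting transactions wins is a policy decision; here it is whichever commits first.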

ISA-level Tradeoff: Supporting TM
 Still under research
 Pros:
 Could make programming with threads easier
 Could improve parallel program performance vs. locks. Why?

 Cons:
 What if it does not pan out?
 All future microarchitectures might have to support the new instructions (for backward compatibility reasons)
 Complexity?

 How does the architect decide whether or not to support TM in the ISA? (How to evaluate the whole stack)
ISA-level Tradeoffs: Instruction Pointer
 Do we need an instruction pointer in the ISA?
 Yes: Control-driven, sequential execution
 An instruction is executed when the IP points to it
 IP automatically changes sequentially (except control flow instructions)
 No: Data-driven, parallel execution
 An instruction is executed when all its operand values are available (data flow)

 Tradeoffs: MANY high-level ones
 Ease of programming (for average programmers)?
 Ease of compilation?
 Performance: Extraction of parallelism?
 Hardware complexity?
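
To make the data-driven alternative concrete, the following toy interpreter fires every instruction whose operands are available, regardless of program order. The instruction tuple format `(dest, op, src1, src2)` and the function name are assumptions for this sketch only.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def run_dataflow(instructions, inputs):
    """Execute (dest, op, src1, src2) tuples in data-driven order.

    Returns the final values and, per step, the destinations that fired together.
    """
    values = dict(inputs)
    pending = list(instructions)
    steps = []
    while pending:
        # An instruction is ready when all of its operand values exist.
        ready = [ins for ins in pending if ins[2] in values and ins[3] in values]
        if not ready:
            raise ValueError("some operands are never produced")
        # All ready instructions fire "in parallel" from the same snapshot.
        results = {dest: OPS[op](values[a], values[b]) for dest, op, a, b in ready}
        values.update(results)
        steps.append([ins[0] for ins in ready])
        pending = [ins for ins in pending if ins not in ready]
    return values, steps


# Deliberately listed out of order: there is no instruction pointer to follow.
program = [
    ("e", "*", "c", "d"),
    ("c", "+", "a", "b"),
    ("d", "-", "a", "b"),
]
values, steps = run_dataflow(program, {"a": 5, "b": 3})
print(values["e"], steps)  # 16 [['c', 'd'], ['e']]
```

The step list shows the parallelism falling out of the dependences: `c` and `d` fire together, and `e` waits for both. A control-driven machine would instead need the program listed in a valid sequential order.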

The Von-Neumann Model
[Figure: von Neumann model block diagram. MEMORY (Mem Addr Reg, Mem Data Reg); PROCESSING UNIT (ALU, TEMP); INPUT; OUTPUT; CONTROL UNIT (IP, Inst Register)]

The Von-Neumann Model
 Stored program computer (instructions in memory)
 One instruction at a time
 Sequential execution
 Unified memory
 The interpretation of a stored value depends on the control signals

 All major ISAs today use this model

 Underneath (at uarch level), the execution model is very different
 Multiple instructions at a time
 Out-of-order execution
 Separate instruction and data caches
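
The ISA-level model described above (stored program, one instruction at a time, unified memory) can be sketched as a tiny fetch-decode-execute loop. The accumulator machine and its encoding (`opcode * 100 + address`) are invented for this example. The point is that the same memory holds both instructions and data, and a word is treated as an instruction only because the IP points to it.

```python
HALT, LOAD, ADD, STORE = 0, 1, 2, 3


def run(memory):
    """Run a toy accumulator machine whose program and data share `memory`."""
    ip = 0    # instruction pointer
    acc = 0   # accumulator register
    while True:
        word = memory[ip]                  # fetch: the word is read as an instruction
        ip += 1                            # IP advances sequentially
        opcode, addr = divmod(word, 100)   # decode (assumed encoding)
        if opcode == HALT:
            return memory
        elif opcode == LOAD:
            acc = memory[addr]             # here a word is read as data
        elif opcode == ADD:
            acc += memory[addr]
        elif opcode == STORE:
            memory[addr] = acc
        else:
            raise ValueError(f"unknown opcode {opcode}")


# Cells 0-3 hold the program, cells 10-12 hold data.
memory = [110, 211, 312, 0] + [0] * 6 + [7, 35, 0]
run(memory)
print(memory[12])  # 42
```

Nothing in the stored value `110` marks it as an instruction: loaded through an address it would be the integer 110, fetched through the IP it means "LOAD 10".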
Fundamentals of Uarch Performance Tradeoffs

Instruction Supply:
- Zero-cycle latency (no cache miss)
- No branch mispredicts
- No fetch breaks

Data Path (Functional Units):
- Perfect data flow (reg/memory dependencies)
- Zero-cycle interconnect (operand communication)
- Enough functional units
- Zero latency compute?

Data Supply:
- Zero-cycle latency
- Infinite capacity
- Zero cost

We will examine all these throughout the course (especially data supply)
How to Evaluate Performance Tradeoffs

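\[
\text{Execution time} = \frac{\text{time}}{\text{program}}
= \frac{\#\ \text{instructions}}{\text{program}} \times \frac{\#\ \text{cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}
\]

What determines each term:
 # instructions / program: Algorithm, Program, ISA, Compiler
 # cycles / instruction: ISA, Microarchitecture
 time / cycle: Microarchitecture, Logic design, Circuit implementation, Technology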
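
The execution time equation above translates directly into a small calculation. The numbers below are made up for illustration (one billion instructions, a CPI of 2, a 0.5 ns clock period); the function name is hypothetical.

```python
def execution_time(instructions: float, cpi: float, cycle_time_s: float) -> float:
    """Execution time = (instructions/program) x (cycles/instruction) x (time/cycle)."""
    return instructions * cpi * cycle_time_s


# Illustrative values only, not measurements.
t = execution_time(instructions=1e9, cpi=2.0, cycle_time_s=0.5e-9)
print(f"{t:.3f} s")
```

With these assumed values the program takes about one second. Halving any one of the three factors halves the execution time, provided the other two stay fixed, which in practice they often do not.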

Improving Performance
 Reducing instructions/program

 Reducing cycles/instruction (CPI)

 Reducing time/cycle (clock period)

Improving Performance (Reducing Exec Time)
 Reducing instructions/program
 More efficient algorithms and programs
 Better ISA?

 Reducing cycles/instruction (CPI)
 Better microarchitecture design
 Execute multiple instructions at the same time
 Reduce latency of instructions (1-cycle vs. 100-cycle memory access)

 Reducing time/cycle (clock period)
 Technology scaling
 Pipelining

Improving Performance: Semantic Gap
 Reducing instructions/program
 Complex instructions: small code size (+)
 Simple instructions: large code size (--)

 Reducing cycles/instruction (CPI)
 Complex instructions: (can) take more cycles to execute (--)
 REP MOVS
 How about ADD with condition code setting?
 Simple instructions: (can) take fewer cycles to execute (+)

 Reducing time/cycle (clock period)
 Does instruction complexity affect this?
 It depends
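
Because complex instructions lower the instruction count but can raise CPI, the net effect has to be computed, not assumed. The sketch below compares two hypothetical designs; every number in it is invented for illustration and the clock period is held equal to isolate the first two factors.

```python
def execution_time(instructions: float, cpi: float, cycle_time_s: float) -> float:
    """Execution time = instructions x CPI x clock period."""
    return instructions * cpi * cycle_time_s


def speedup(time_old: float, time_new: float) -> float:
    """How many times faster the new design is than the old one."""
    return time_old / time_new


# Hypothetical designs (illustrative numbers only).
complex_isa = execution_time(instructions=0.8e9, cpi=3.0, cycle_time_s=1e-9)
simple_isa = execution_time(instructions=1.2e9, cpi=1.5, cycle_time_s=1e-9)

print(f"complex: {complex_isa:.2f} s, simple: {simple_isa:.2f} s")
print(f"speedup of simple over complex: {speedup(complex_isa, simple_isa):.2f}x")
```

With these assumed values the simple-instruction design wins despite executing 50% more instructions. Changing the assumed CPI of the complex design from 3.0 to 2.0 reverses the outcome, which is one way to read "it depends".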
