15-740/18-740 Computer Architecture Lecture 3: Performance: Carnegie Mellon University

15-740/18-740
Computer Architecture
Lecture 3: Performance
Prof. Onur Mutlu

Carnegie Mellon University
Last Time …
 Some microarchitecture ideas
 Part of microarchitecture vs. ISA
 Some ISA level tradeoffs
 Semantic gap
 Simple vs. complex instructions -- RISC vs. CISC
 Instruction length
 Uniform decode
 Number of registers
Review: ISA-level Tradeoffs: Number of Registers
 Affects:
 Number of bits used for encoding register address
 Number of values kept in fast storage (register file)
 (uarch) Size, access time, power consumption of register file
 Large number of registers:
+ Enables better register allocation (and optimizations) by the compiler → fewer saves/restores
-- Larger instruction size
-- Larger register file size
-- (Superscalar processors) More complex dependency check logic
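To make the instruction-size cost concrete, here is a minimal sketch that computes how many bits one register specifier needs, and how many bits the register fields of an instruction take in total. The function names and the fixed three-operand format used in the note below are assumptions for illustration, not taken from any particular ISA.

```c
#include <assert.h>

/* Bits needed to name one of num_regs registers: ceil(log2(num_regs)). */
unsigned reg_specifier_bits(unsigned num_regs) {
    assert(num_regs > 0 && num_regs <= (1u << 31));
    unsigned bits = 0;
    while ((1u << bits) < num_regs) {
        bits++;
    }
    return bits;
}

/* Total bits spent on register fields by an instruction that names
   `operands` registers (hypothetical fixed-format encoding). */
unsigned reg_field_bits(unsigned num_regs, unsigned operands) {
    return reg_specifier_bits(num_regs) * operands;
}
```

Under these assumptions, a three-operand instruction spends 15 bits on register fields with 32 registers and 21 bits with 128 registers, which is a large share of a 32-bit instruction word.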
ISA-level Tradeoffs: Addressing Modes
 Addressing mode specifies how to obtain an operand of an instruction
 Register
 Immediate
 Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, …)
 More modes:
+ help better support programming constructs (arrays, pointer-based accesses)
-- make it harder for the architect to design
-- too many choices for the compiler?
 Having many ways to do the same thing complicates compiler design
 Read Wulf, “Compilers and Computer Architecture”
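As a rough illustration of what the memory addressing modes compute, the sketch below derives an effective address (EA) for a few of them. The enum, struct, and function names are hypothetical; real ISAs encode these fields in the instruction and read base/index values from registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of memory addressing modes. */
typedef enum {
    MODE_ABSOLUTE,        /* EA = disp                 */
    MODE_REG_INDIRECT,    /* EA = base                 */
    MODE_DISPLACEMENT,    /* EA = base + disp          */
    MODE_INDEXED,         /* EA = base + index         */
    MODE_SCALED_INDEXED   /* EA = base + index * scale */
} addr_mode_t;

typedef struct {
    addr_mode_t mode;
    uint64_t base;   /* value held in the base register  */
    uint64_t index;  /* value held in the index register */
    uint64_t scale;  /* element size for scaled indexing */
    int64_t disp;    /* displacement or absolute address */
} operand_t;

/* Compute the effective address named by a memory operand. */
uint64_t effective_address(const operand_t *op) {
    assert(op != NULL);
    switch (op->mode) {
    case MODE_ABSOLUTE:       return (uint64_t)op->disp;
    case MODE_REG_INDIRECT:   return op->base;
    case MODE_DISPLACEMENT:   return op->base + (uint64_t)op->disp;
    case MODE_INDEXED:        return op->base + op->index;
    case MODE_SCALED_INDEXED: return op->base + op->index * op->scale;
    }
    return 0; /* not reached for valid modes */
}
```

Each extra mode adds a case the hardware must decode and the compiler must consider, which is the tradeoff the slide points at.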
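x86 vs. Alpha Instruction Formats
[Figure: x86 and Alpha instruction format diagrams; images not preserved]
x86
[Figure: x86 addressing mode examples, labeled: register indirect; absolute; register + displacement; register; indexed (base + index); scaled (base + index*4)]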
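The scaled mode (base + index*4) corresponds to indexing an array of 4-byte elements. The sketch below, with hypothetical helper names, computes an element address by hand in that form so it can be compared with what ordinary C indexing produces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address of element `index`, computed as base + index * 4
   (sizeof(int32_t) is 4), mirroring scaled addressing. */
const int32_t *scaled_element_address(const int32_t *base, size_t index) {
    assert(base != NULL);
    return (const int32_t *)((const char *)base + index * sizeof(int32_t));
}

/* Load the element at that address; equivalent to base[index]. */
int32_t load_scaled(const int32_t *base, size_t index) {
    return *scaled_element_address(base, index);
}
```

With a scaled mode in the ISA, a compiler can typically fold this address arithmetic into the load itself rather than emitting separate shift and add instructions.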
Other ISA-level Tradeoffs
 Load/store vs. Memory/Memory
 Condition codes vs. condition registers vs. compare&test
 Hardware interlocks vs. software-guaranteed interlocking
 VLIW vs. single instruction
 0, 1, 2, 3 address machines
 Precise vs. imprecise exceptions
 Virtual memory vs. not
 Aligned vs. unaligned access
 Supported data types
 Software vs. hardware managed page fault handling
 Granularity of atomicity
 Cache coherence (hardware vs. software)
 …
Programmer vs. (Micro)architect
 Many ISA features designed to aid programmers
 But, complicate the hardware designer’s job
 Virtual memory
 vs. overlay programming
 Should the programmer be concerned about the size of code blocks?
 Unaligned memory access
 Otherwise, the compiler/programmer needs to align data
 Transactional memory?
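To illustrate what "aligning data" involves when it falls to the compiler or programmer, here is a small sketch. The struct and helper names are made up for illustration; the `memcpy`-based load is a common portable idiom for reading from a possibly unaligned address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The compiler pads after `tag` so that `value` is naturally aligned. */
struct record {
    char tag;
    int32_t value;
};

/* Nonzero if addr is a multiple of align (align must be a power of two). */
int is_aligned(uintptr_t addr, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* Portable read of a 32-bit value from a possibly unaligned address.
   memcpy lets the compiler pick an access sequence the target allows. */
int32_t load_i32_unaligned(const void *p) {
    int32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On an ISA that supports unaligned access in hardware, the `memcpy` usually compiles to a single load; on one that does not, the compiler has to emit a multi-instruction sequence.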
Transactional Memory
THREAD 1 and THREAD 2 both execute the same lock-based code:

enqueue (Q, v) {
    Node_t node = malloc(…);
    node->val = v;
    node->next = NULL;
    acquire(lock);
    if (Q->tail)
        Q->tail->next = node;
    else
        Q->head = node;
    Q->tail = node;
    release(lock);
}

With transactional memory, each thread instead executes:

begin-transaction
…
enqueue (Q, v); //no locks
…
end-transaction
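For reference, the lock-based version can be written as compilable C along the following lines. This is only a sketch: the `node_t`/`queue_t` types, the `int` payload, the per-queue POSIX mutex, and the `queue_length` helper are assumptions that do not appear on the slide.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct node {
    int val;
    struct node *next;
} node_t;

typedef struct {
    node_t *head;
    node_t *tail;
    pthread_mutex_t lock;  /* protects head and tail */
} queue_t;

void queue_init(queue_t *q) {
    assert(q != NULL);
    q->head = NULL;
    q->tail = NULL;
    pthread_mutex_init(&q->lock, NULL);
}

/* Append v to the queue. Returns 0 on success, -1 if allocation fails. */
int enqueue(queue_t *q, int v) {
    node_t *node = malloc(sizeof *node);
    if (node == NULL) {
        return -1;
    }
    node->val = v;
    node->next = NULL;

    pthread_mutex_lock(&q->lock);   /* critical section starts */
    if (q->tail)
        q->tail->next = node;
    else
        q->head = node;
    q->tail = node;
    pthread_mutex_unlock(&q->lock); /* critical section ends */
    return 0;
}

/* Count the elements while holding the lock. */
size_t queue_length(queue_t *q) {
    size_t n = 0;
    pthread_mutex_lock(&q->lock);
    for (node_t *p = q->head; p != NULL; p = p->next) {
        n++;
    }
    pthread_mutex_unlock(&q->lock);
    return n;
}
```

The allocation and field initialization stay outside the critical section, as on the slide; only the updates to the shared `head` and `tail` pointers are serialized by the lock.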
Transactional Memory
 A transaction is executed atomically: ALL or NONE
 If there is a data conflict between two transactions, only one of them completes; the other is rolled back
 Both write to the same location
 One reads from the location another writes
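The two conflict conditions above can be sketched as a check over each transaction's read and write sets. The fixed-size arrays and all names here are hypothetical simplifications; real TM proposals track these sets in hardware or in a runtime, usually at cache-line or object granularity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SET 8  /* hypothetical bound on tracked addresses */

typedef struct {
    uintptr_t reads[MAX_SET];
    size_t num_reads;
    uintptr_t writes[MAX_SET];
    size_t num_writes;
} tx_t;

static bool contains(const uintptr_t *set, size_t n, uintptr_t addr) {
    for (size_t i = 0; i < n; i++) {
        if (set[i] == addr) {
            return true;
        }
    }
    return false;
}

/* Two transactions conflict if both write the same location, or if
   one reads a location that the other writes. */
bool tx_conflict(const tx_t *a, const tx_t *b) {
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < a->num_writes; i++) {
        if (contains(b->writes, b->num_writes, a->writes[i])) return true;
        if (contains(b->reads, b->num_reads, a->writes[i])) return true;
    }
    for (size_t i = 0; i < b->num_writes; i++) {
        if (contains(a->reads, a->num_reads, b->writes[i])) return true;
    }
    return false;
}
```

Two transactions that only read the same location do not conflict under this definition, so both can complete.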
ISA-level Tradeoff: Supporting TM
 Still under research
 Pros:
 Could make programming with threads easier
 Could improve parallel program performance vs. locks. Why?
 Cons:
 What if it does not pan out?
 All future microarchitectures might have to support the new instructions (for backward compatibility reasons)
 Complexity?
 How does the architect decide whether or not to support TM in the ISA? (How to evaluate the whole stack)
ISA-level Tradeoffs: Instruction Pointer
 Do we need an instruction pointer in the ISA?
 Yes: Control-driven, sequential execution
 An instruction is executed when the IP points to it
 IP automatically changes sequentially (except for control flow instructions)
 No: Data-driven, parallel execution
 An instruction is executed when all its operand values are available (data flow)
 Tradeoffs: MANY high-level ones
 Ease of programming (for average programmers)?
 Ease of compilation?
 Performance: Extraction of parallelism?
 Hardware complexity?
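The data-driven firing rule can be illustrated with a toy interpreter: a node executes as soon as both of its operands have arrived, regardless of where it sits in the program. Everything here (the node layout, the two-operand restriction, the example expression) is a simplifying assumption, and the interpreter itself is sequential code that only models the rule.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { OP_ADD, OP_SUB, OP_MUL } op_t;

typedef struct {
    op_t op;
    int operand[2];
    bool ready[2];  /* has each operand value arrived? */
    bool fired;
    int dest_node;  /* consumer node index, or -1 for the final result */
    int dest_slot;  /* operand slot of the consumer to fill */
} df_node_t;

static int apply(op_t op, int x, int y) {
    switch (op) {
    case OP_ADD: return x + y;
    case OP_SUB: return x - y;
    case OP_MUL: return x * y;
    }
    return 0;
}

/* Fire every node whose operands are all available until nothing
   more can fire. Returns the value produced by the result node. */
int run_dataflow(df_node_t *nodes, int n) {
    int result = 0;
    bool progress = true;
    while (progress) {
        progress = false;
        for (int i = 0; i < n; i++) {
            df_node_t *nd = &nodes[i];
            if (nd->fired || !nd->ready[0] || !nd->ready[1]) continue;
            int value = apply(nd->op, nd->operand[0], nd->operand[1]);
            nd->fired = true;
            progress = true;
            if (nd->dest_node < 0) {
                result = value;
            } else {
                assert(nd->dest_node < n);
                nodes[nd->dest_node].operand[nd->dest_slot] = value;
                nodes[nd->dest_node].ready[nd->dest_slot] = true;
            }
        }
    }
    return result;
}

/* Builds the graph for (a + b) * (c - d) and evaluates it. */
int eval_example(int a, int b, int c, int d) {
    df_node_t nodes[3] = {
        /* The multiply is listed first on purpose: position in
           memory does not decide when a node executes. */
        { OP_MUL, {0, 0}, {false, false}, false, -1, 0 },
        { OP_ADD, {a, b}, {true, true},   false,  0, 0 },
        { OP_SUB, {c, d}, {true, true},   false,  0, 1 },
    };
    return run_dataflow(nodes, 3);
}
```

The add and subtract nodes are both ready at the start, so a data-flow machine could execute them in parallel; no instruction pointer imposes an order between them.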
The Von-Neumann Model
[Figure: block diagram of the von Neumann model. MEMORY (Mem Addr Reg, Mem Data Reg); PROCESSING UNIT (ALU, TEMP); CONTROL UNIT (IP, Inst Register); INPUT; OUTPUT]
The Von-Neumann Model
 Stored program computer (instructions in memory)
 One instruction at a time
 Sequential execution
 Unified memory
 The interpretation of a stored value depends on the control signals
 All major ISAs today use this model
 Underneath (at uarch level), the execution model is very different
 Multiple instructions at a time
 Out-of-order execution
 Separate instruction and data caches
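The point that a stored value has no inherent meaning can be shown in software terms: the same 32 bits give different values depending on how the program chooses to interpret them. This sketch assumes `float` is a 32-bit IEEE 754 type, which holds on mainstream platforms; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Interpret the 32 stored bits as an unsigned integer. */
uint32_t as_uint(uint32_t bits) {
    return bits;
}

/* Interpret the same bits as a two's-complement signed integer. */
int32_t as_int(uint32_t bits) {
    int32_t v;
    memcpy(&v, &bits, sizeof v);
    return v;
}

/* Interpret the same bits as a single-precision float
   (assumes 32-bit IEEE 754 float). */
float as_float(uint32_t bits) {
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, the bit pattern 0x3F800000 is 1065353216 as an unsigned integer and 1.0 as a float; memory stores only the bits, and the instruction that consumes them decides which reading applies.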
Fundamentals of Uarch Performance Tradeoffs
Instruction Supply
- Zero-cycle latency (no cache miss)
- No branch mispredicts
- No fetch breaks
Data Path (Functional Units)
- Perfect data flow (reg/memory dependencies)
- Zero-cycle interconnect (operand communication)
- Enough functional units
- Zero latency compute?
Data Supply
- Zero-cycle latency
- Infinite capacity
- Zero cost
We will examine all these throughout the course (especially data supply)
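How to Evaluate Performance Tradeoffs
\[
\text{Execution time} = \frac{\text{time}}{\text{program}} = \frac{\#\,\text{instructions}}{\text{program}} \times \frac{\#\,\text{cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}
\]
Each term is determined by:
 # instructions / program: Algorithm, Program, ISA, Compiler
 # cycles / instruction: ISA, Microarchitecture
 time / cycle: Microarchitecture, Logic design, Circuit implementation, Technology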
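As a quick numeric check of the execution time equation, the sketch below multiplies the three terms. The function name and the example numbers in the note are illustrative assumptions, not measurements.

```c
#include <assert.h>

/* Execution time in seconds:
   (instructions / program) * (cycles / instruction) * (seconds / cycle). */
double execution_time(double instructions, double cpi, double clock_period_s) {
    assert(instructions >= 0.0 && cpi >= 0.0 && clock_period_s >= 0.0);
    return instructions * cpi * clock_period_s;
}
```

For instance, a program of 1e9 dynamic instructions with an average CPI of 2 on a 2 GHz clock (0.5 ns period) takes about 1 second; halving any one of the three terms halves the execution time, provided the other two stay fixed.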
Improving Performance
 Reducing instructions/program
 Reducing cycles/instruction (CPI)
 Reducing time/cycle (clock period)
Improving Performance (Reducing Exec Time)
 More efficient algorithms and programs
 Better ISA?
 Better microarchitecture design
 Execute multiple instructions at the same time
 Reduce latency of instructions (1-cycle vs. 100-cycle memory access)
 Technology scaling
 Pipelining
Improving Performance: Semantic Gap
 Complex instructions: small code size (+)
 Simple instructions: large code size (--)

 Complex instructions: (can) take more cycles to execute (--)
 REP MOVS
 How about ADD with condition code setting?
 Simple instructions: (can) take fewer cycles to execute (+)

 Does instruction complexity affect this?
 It depends
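One way to see why the answer is "it depends": fewer, more complex instructions reduce the instruction count but can raise CPI, and the product decides which design wins. The sketch below compares two designs; the struct, the function names, and every number in the note are hypothetical and chosen only to show the outcome going either way.

```c
#include <assert.h>

typedef struct {
    double instructions;     /* dynamic instruction count            */
    double cpi;              /* average cycles per instruction       */
    double clock_period_ns;  /* cycle time in nanoseconds            */
} design_t;

/* Execution time in nanoseconds for one design. */
double exec_time_ns(design_t d) {
    assert(d.instructions > 0.0 && d.cpi > 0.0 && d.clock_period_ns > 0.0);
    return d.instructions * d.cpi * d.clock_period_ns;
}

/* Speedup of design a over design b; greater than 1 means a is faster. */
double speedup(design_t a, design_t b) {
    return exec_time_ns(b) / exec_time_ns(a);
}
```

With made-up numbers: a complex-instruction design running 1.0e6 instructions at CPI 4 takes 4.0e6 cycles. A simple-instruction design needing 2.5e6 instructions wins at CPI 1.2 (3.0e6 cycles) but loses at CPI 2 (5.0e6 cycles), assuming the same clock period in all three cases.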