20 Advanced Processor Designs
We've only scratched the surface of CPU design. Today we'll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The Motorola PowerPC, used in Apple computers and many embedded systems, is a good example of state-of-the-art RISC technologies. The Intel Itanium is a more radical design intended for the higher-end systems market.
April 9, 2003
Pipelining is parallelism
We've already seen parallelism in detail! A pipelined processor executes many instructions at the same time. All modern processors use pipelining, because most programs will execute faster on a pipelined CPU without any programmer intervention. Today we'll discuss some more advanced techniques to help you get the most out of your pipeline.
[Figure: pipelined datapath with IM, Reg, DM and Reg stages]
How can we add floating-point operations, which are roughly three times slower than integer operations, to the processor?
[Figure: datapath with IM, Reg, DM and Reg stages; the execute stage is labeled 6 ns]
The longer floating-point delay would adversely affect the cycle time. In this example, the EX stage would need 6 ns, and the clock rate would be limited to about 166 MHz (1 / 6 ns).
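The arithmetic behind that limit can be sketched in a few lines. This is only an illustration: the function name is made up, and the 2 ns delay for the other stages is an assumption, inferred from the statement that floating-point operations are roughly three times slower than integer ones.

```python
def max_clock_mhz(stage_delays_ns):
    """Clock rate in MHz when the cycle time equals the slowest stage delay."""
    cycle_ns = max(stage_delays_ns.values())
    return 1000.0 / cycle_ns

# Assumed delays: every stage takes 2 ns in the integer-only pipeline.
integer_only = {"IF": 2, "ID": 2, "EX": 2, "MEM": 2, "WB": 2}
# Putting floating-point work into EX stretches that one stage to 6 ns.
with_fp = dict(integer_only, EX=6)

print(max_clock_mhz(integer_only))        # 500.0
print(round(max_clock_mhz(with_fp), 1))   # 166.7
```

Every instruction pays for the slow stage, including integer instructions that never use the floating-point hardware.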
Deeper pipelines
We could pipeline the floating-point unit too! We might split an addition into 3 stages for equalizing exponents, adding mantissas, and rounding.

IF ID FEX1 FEX2 FEX3 MEM WB
A floating-point operation would still take 6 ns total, but the stage length and cycle time can be reduced, and we can also overlap the execution of several floating-point instructions.

[Figure: pipeline diagram of three overlapping instructions, one of which passes through FEX1, FEX2 and FEX3]
The homeworks already mentioned that the Pentium 4 uses a 20-stage pipeline, breaking an instruction's execution into 20 steps. That's one of the reasons the P4 has such high clock rates.
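To see why a deeper pipeline helps even though each instruction passes through more stages, here is a rough sketch. It assumes an ideal pipeline with no stalls, and the 2 ns and 6 ns cycle times are illustrative values carried over from the floating-point example; the function name is hypothetical.

```python
def total_time_ns(n_instructions, n_stages, cycle_ns):
    """Time to run n instructions on an ideal pipeline with no stalls.

    The first instruction needs n_stages cycles; each later one
    completes one cycle after its predecessor.
    """
    return (n_stages + n_instructions - 1) * cycle_ns

# Five stages with a 6 ns cycle (floating point squeezed into one EX stage).
shallow = total_time_ns(100, n_stages=5, cycle_ns=6)
# Seven stages (FEX1, FEX2, FEX3 replace EX) with a 2 ns cycle.
deep = total_time_ns(100, n_stages=7, cycle_ns=2)

print(shallow)  # 624
print(deep)     # 212
```

Under these assumptions the deeper pipeline finishes 100 instructions about three times sooner, because the cycle time shrinks much more than the stage count grows.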
Superscalar architectures
What if we include both an integer ALU and a floating-point unit in the datapath?

[Figure: IF and ID stages feeding both an integer EX stage and floating-point stages FEX1, FEX2, FEX3, followed by MEM and WB]

This is almost like having two pipelines! A superscalar processor can start, or dispatch, more than one instruction on every clock cycle.
[Figure: instruction memory with a Read address input and two outputs, Instruction 1 and Instruction 2]
We could add more read and write ports to the register file, as we did for one of the homework questions.
[Figure: register file with four read ports (Read 1-4, Data 1-4) and two write ports (Write reg 1-2, Write data 1-2, RegWrite1-2)]
A common alternative is to provide two register files, one for the integer unit and one for the floating-point unit. Most programs do not mix integer and FP operands. Double and extended precision IEEE numbers are 64 or 80 bits long, so they wouldn't fit in normal 32-bit registers anyway. Explicit casting instructions can be used to move data between the register files when necessary. We saw this in the MIPS floating-point architecture too.
PowerPC introduction
The PowerPC project was started in 1991 by Apple, IBM and Motorola. The basis was IBM's existing POWER architecture, a RISC-based design used in older IBM RS/6000 workstations. The first actual PowerPC processor was introduced in 1993. The G4 is the most current version, which runs at up to 1.42 GHz in Apple's Power Macintosh G4 computers. PowerPCs are noted for very low power usage: 20-30 W for a 1 GHz G4, compared to 60 W or more for newer Pentiums and Athlons. Low power is especially good for laptops, which run on batteries. It also helps keep desktop machines cool without fans.
G4 superscalar architecture
The PowerPC G4 has a total of eleven pipelined execution units. There are four ALUs and one FPU for basic arithmetic operations. A separate load/store unit manages memory accesses. The AltiVec units support multimedia instructions (like MMX/SSE). The G4 is a four-way superscalar processor: the decoder can dispatch up to three instructions per cycle, as well as handle a branch instruction.
[Figure: G4 block diagram: Instr Fetch and Decode/Branch feeding the Integer ALUs, FPU, Load/Store unit and AltiVec units]
Stalls are especially undesirable in a superscalar CPU because there could be several functional units sitting around doing nothing. Below, when the integer ALU stalls, the FPU must also remain idle. The more functional units we have, the worse the problem.

[Figure: pipeline diagram in which a stall in the integer pipeline delays a following floating-point instruction (FEX1 FEX2 FEX3)]
Dynamic pipelining
One solution is to give each execution unit a reservation station, which holds an instruction until all of its operands become available, either from the register file or forwarding units. The instruction decoder dispatches each instruction to the reservation station of the appropriate execution unit. Stalls will only affect a single execution unit; the decoder can continue dispatching subsequent instructions to other execution units.

[Figure: Instr Fetch and Decode/Branch feeding reservation stations in front of the Integer ALUs, FPU, Load/Store unit and AltiVec units]
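A toy model can make the idea concrete. The sketch below is not the G4's actual mechanism. It assumes one dispatch per cycle, single-cycle execution, and results that become visible in the following cycle. All class, function and register names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    name: str
    unit: str      # execution unit the instruction needs
    srcs: tuple    # source registers that must be ready
    dest: str      # destination register

@dataclass
class ReservationStation:
    waiting: list = field(default_factory=list)

    def issue_ready(self, ready_regs):
        """Remove and return the first waiting instruction whose operands are ready."""
        for i, ins in enumerate(self.waiting):
            if all(r in ready_regs for r in ins.srcs):
                return self.waiting.pop(i)
        return None

def simulate(program, ready_regs, arrivals, cycles):
    """Return (cycle, name) pairs in the order instructions start executing.

    arrivals maps a cycle number to a register whose value shows up then
    (for example, data returning from a slow load).
    """
    stations, log = {}, []
    ready, queue = set(ready_regs), list(program)
    for cycle in range(cycles):
        if cycle in arrivals:
            ready.add(arrivals[cycle])
        if queue:  # the decoder never waits for operands
            ins = queue.pop(0)
            stations.setdefault(ins.unit, ReservationStation()).waiting.append(ins)
        produced = []
        for rs in stations.values():  # each unit starts at most one instruction
            ins = rs.issue_ready(ready)
            if ins:
                log.append((cycle, ins.name))
                produced.append(ins.dest)
        ready.update(produced)  # results are usable from the next cycle
    return log

program = [
    Instr("mul", "ALU", ("t0", "t0"), "t1"),   # must wait for t0
    Instr("fadd", "FPU", ("f1", "f2"), "f3"),
    Instr("lw2", "LS", ("sp",), "t2"),
]
log = simulate(program, {"f1", "f2", "sp"}, arrivals={3: "t0"}, cycles=5)
print(log)  # [(1, 'fadd'), (2, 'lw2'), (3, 'mul')]
```

The `mul` sits in the ALU's station until `t0` arrives in cycle 3, while the FPU and load/store unit keep working. Only the ALU is held up.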
Out-of-order execution
If stalls only affect a single hardware element, the decoder can continue dispatching subsequent instructions to other execution units.

[Figure: pipeline diagram in which a floating-point instruction (FEX1 FEX2 FEX3) proceeds while an earlier integer instruction is stalled]
But then a later instruction could finish executing before an earlier one!
    lw   $t0, 4($sp)
    mul  $t0, $t0, $t0
    add  $t0, $s2, $s3
This makes it hard to leave the correct value in a destination register, as with $t0 here, and to handle precise interrupts.
Reordering
To prevent problems from out-of-order execution, instructions should not save their results to registers or memory immediately. Instead, instructions are sent to a commit unit after they finish their EX stage, along with a flag indicating whether or not an exception occurred. The commit unit ensures that instruction results are stored in the correct order, so later instructions will be held until earlier instructions finish. Register writes, like on the last page, are handled correctly. The flag helps implement precise exceptions. All instructions before the erroneous one will be committed, while any instructions after it will be flushed. The commit unit also flushes instructions from mispredicted branches. The G4 can commit up to six instructions per cycle.
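The in-order commit rule can be sketched as a small buffer model. This is a simplified illustration, not the G4's commit hardware. The class, its methods, and the numeric values are all hypothetical, and the six-per-cycle commit limit is left out.

```python
from collections import deque

class CommitUnit:
    def __init__(self):
        self.buffer = deque()   # entries kept in program order
        self.registers = {}     # architectural register state

    def dispatch(self, tag, dest):
        """Reserve a slot in program order when an instruction is dispatched."""
        entry = {"tag": tag, "dest": dest, "value": None,
                 "done": False, "exception": False}
        self.buffer.append(entry)
        return entry

    def finish(self, entry, value=None, exception=False):
        """Record a result (or an exception flag) once execution completes."""
        entry.update(value=value, done=True, exception=exception)

    def commit(self):
        """Commit finished instructions from the head only.

        Returns (committed tags, faulting tag or None, flushed tags).
        """
        committed = []
        while self.buffer and self.buffer[0]["done"]:
            head = self.buffer.popleft()
            if head["exception"]:
                flushed = [e["tag"] for e in self.buffer]
                self.buffer.clear()   # everything after the fault is discarded
                return committed, head["tag"], flushed
            self.registers[head["dest"]] = head["value"]
            committed.append(head["tag"])
        return committed, None, []

# The lw / mul / add sequence, with add finishing first (values are made up).
cu = CommitUnit()
lw = cu.dispatch("lw", "$t0")
mul = cu.dispatch("mul", "$t0")
add = cu.dispatch("add", "$t0")
cu.finish(add, value=7)
early = cu.commit()      # ([], None, []): add must wait for lw and mul
cu.finish(lw, value=3)
cu.finish(mul, value=9)
late = cu.commit()       # (['lw', 'mul', 'add'], None, [])
print(cu.registers["$t0"])  # 7
```

Because results are written in program order, `$t0` ends up holding the `add` result even though `add` finished executing first.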
G4 block diagram
[Figure: G4 block diagram: Instr Fetch, Decode/Branch, reservation stations, execution units such as the FPU, and the Commit unit]
G4 summary
The G4 achieves parallelism via both pipelining and a superscalar design. There are eleven functional units, and up to four instructions can be dispatched per cycle. A stall will only affect one of the functional units. A commit buffer is needed to support out-of-order execution. The amount of work that can be parallelized, and the performance gains, will depend on the exact instruction sequence. For example, the G4 has only one load/store unit, so it can't dispatch two load instructions in the same cycle. As usual, stalls can occur due to data or control hazards. Problem 3 from today's homework demonstrates how rewriting code can improve performance.
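The effect of the single load/store unit on dispatch can be illustrated with a rough model. It assumes strictly in-order dispatch, ignores data dependencies and branches, and leaves out the AltiVec units. The unit counts and the three-instruction limit come from the slides; the names and the instruction mix are hypothetical.

```python
UNITS_AVAILABLE = {"alu": 4, "fpu": 1, "ldst": 1}
MAX_DISPATCH = 3   # non-branch instructions per cycle

def dispatch_groups(program):
    """Split (name, unit) pairs into per-cycle dispatch groups, in program order."""
    groups, current, used = [], [], {}
    for name, unit in program:
        full = len(current) == MAX_DISPATCH
        busy = used.get(unit, 0) == UNITS_AVAILABLE[unit]
        if full or busy:          # start a new cycle
            groups.append(current)
            current, used = [], {}
        current.append(name)
        used[unit] = used.get(unit, 0) + 1
    if current:
        groups.append(current)
    return groups

back_to_back = [("lw1", "ldst"), ("lw2", "ldst"), ("add1", "alu"),
                ("add2", "alu"), ("add3", "alu"), ("add4", "alu")]
spread_out = [("lw1", "ldst"), ("add1", "alu"), ("add2", "alu"),
              ("lw2", "ldst"), ("add3", "alu"), ("add4", "alu")]

print(len(dispatch_groups(back_to_back)))  # 3
print(len(dispatch_groups(spread_out)))    # 2
```

In this model the same six instructions need three dispatch cycles when the loads are adjacent and only two when they are spread apart, which is the kind of gain that rewriting code can give.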
Itanium introduction
Itanium began in 1993 as a joint project between Intel and HP. It was meant as a study of very different ISA and CPU designs. The first Itaniums, at 800 MHz, appeared about two years ago. Current performance is rumored to be poor, but it uses many new, aggressive techniques that may improve with better compilers and hardware designs.
Tradeoffs
Having a VLIW compiler find parallelism has some advantages. The compiler can take its time to analyze a program, whereas the CPU has just a few nanoseconds to execute each instruction. A compiler can also examine and optimize an entire program, but a processor can only work on a few instructions at a time. Some things can still be done better by the CPU, since compilers can't account for dynamic events. Compilers can't predict branch patterns accurately. They can't adjust for delays due to cache misses or page faults. These issues illustrate a general tradeoff between compilation speed and execution speed.
Branch predication
Having a lot of hardware lets you do some interesting things. The Itanium can do branch predication: instead of guessing whether or not branches are taken, it just executes both branches simultaneously!
    if (r1 == r2)
        r9 = r10 - r11;
    else
        r5 = r6 + r7;

    cmp.eq p1, p2 = r1, r2 ;;
    (p1) sub r9 = r10, r11
    (p2) add r5 = r6, r7
The Itanium executes both the sub and add instructions immediately after the comparison. When the comparison result is known later, either sub or add will be flushed. p1 is a predicate register which will be true if r1 == r2, and p2 is the complement of p1. (p1) is a guard, ensuring that the sub is committed only if p1 is true. Predication won't always work (e.g., if there are structural hazards), so regular branch prediction must still be done.
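The semantics of the guarded instructions can be mimicked in software. The sketch below only models the visible behavior (both results are computed, one is committed). It says nothing about how the Itanium implements this, and the function names are hypothetical.

```python
def predicated_execute(regs):
    """Model of the predicated sequence: compute both results, commit one."""
    regs = dict(regs)
    # cmp.eq p1, p2 = r1, r2
    p1 = regs["r1"] == regs["r2"]
    p2 = not p1
    # Both operations execute regardless of the comparison outcome.
    sub_result = regs["r10"] - regs["r11"]
    add_result = regs["r6"] + regs["r7"]
    # The guards decide which result is committed; the other is discarded.
    if p1:
        regs["r9"] = sub_result
    if p2:
        regs["r5"] = add_result
    return regs

def branching_execute(regs):
    """Reference behavior of the original if/else."""
    regs = dict(regs)
    if regs["r1"] == regs["r2"]:
        regs["r9"] = regs["r10"] - regs["r11"]
    else:
        regs["r5"] = regs["r6"] + regs["r7"]
    return regs

base = {"r1": 4, "r2": 4, "r5": 0, "r6": 1, "r7": 2, "r9": 0, "r10": 9, "r11": 3}
print(predicated_execute(base)["r9"])                # 6
print(predicated_execute(dict(base, r2=5))["r5"])    # 3
```

Both versions leave the registers in the same state. The predicated one does extra work but never has to guess which way the branch goes.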
Code motion
One way to avoid memory-based stalls is to move the load instruction to a position before the data is needed.
    add  $s0, $s1, $s2          lw   $t0, 4($sp)
    lw   $t0, 4($sp)            add  $s0, $s1, $s2
    sub  $a0, $a0, $t0          sub  $a0, $a0, $t0
These two code sequences are semantically equivalent. However, the sequence on the left would stall one cycle, whereas the one on the right performs an add in place of the stall. Good compilers can do this code motion. Something like this also appears on the homework due today. This is especially important in machines where loads might stall for more than one cycle.
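A compiler pass that looks for this situation might start from a check like the following. It is a minimal sketch that assumes the five-stage pipeline with forwarding, where an instruction stalls one cycle if it uses the result of the load immediately before it. The tuple encoding and function name are hypothetical.

```python
def count_load_use_stalls(program):
    """Count one stall for each instruction that reads the register
    written by the lw immediately before it.

    Each instruction is encoded as (op, dest, srcs).
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

original = [
    ("add", "$s0", ("$s1", "$s2")),
    ("lw",  "$t0", ("$sp",)),
    ("sub", "$a0", ("$a0", "$t0")),   # uses $t0 right after the load
]
reordered = [
    ("lw",  "$t0", ("$sp",)),
    ("add", "$s0", ("$s1", "$s2")),   # fills the load delay
    ("sub", "$a0", ("$a0", "$t0")),
]

print(count_load_use_stalls(original))   # 1
print(count_load_use_stalls(reordered))  # 0
```

On a machine where loads take several cycles, the check would have to look further than one instruction ahead.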
Ordinarily, moving a load above a branch is problematic! For instance, assume that 4($sp) contains an invalid address. If the branch is taken in the original code, the load will not be executed and no exception will occur. The reordered code tries to avoid a stall by executing the load first, but this would result in an exception!
Speculation
In the Itanium, any potential exception from a load is deferred. The original load instruction is actually replaced by a deferred exception check instruction chk, as shown below.
    add  $s0, $s1, $s2          lw   $t0, 4($sp)
    beq  $v0, $0, Label         add  $s0, $s1, $s2
    lw   $t0, 4($sp)            beq  $v0, $0, Label
    sub  $a0, $a0, $t0          chk
                                sub  $a0, $a0, $t0
If the load on the right causes an exception, the destination register $t0 will be flagged, but the exception is not raised until the chk instruction executes. This is most useful for architectures like the Itanium, where loads can stall for multiple cycles; the chk can be done in just one cycle instead.
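The deferred-exception idea can be modeled in software as follows. This is a loose sketch: memory is a plain dictionary, an invalid address is simply a missing key, and the flag on the destination register is modeled with a sentinel value. All names are hypothetical.

```python
class DeferredFault(Exception):
    """Raised by chk when the earlier speculative load had failed."""

POISON = object()   # stands in for the flag on the destination register

def speculative_load(memory, address):
    """Load without raising: a bad address just flags the result."""
    return memory.get(address, POISON)

def chk(value):
    """Deferred exception check: raise only if the load result is flagged."""
    if value is POISON:
        raise DeferredFault("deferred load exception")
    return value

def run(memory, address, v0, a0):
    t0 = speculative_load(memory, address)   # lw hoisted above the branch
    if v0 == 0:                              # beq $v0, $0, Label (taken)
        return a0                            # the load result is never used
    t0 = chk(t0)                             # chk: fault surfaces only here
    return a0 - t0                           # sub $a0, $a0, $t0

print(run({8: 5}, 8, v0=1, a0=12))   # 7
print(run({}, 8, v0=0, a0=12))       # 12, no exception despite the bad address
```

When the branch is taken, the bad load is harmless, just as in the original ordering. When it is not taken, the exception appears at the `chk`, where the original load used to be.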
Summary
Modern processors try to do as much work in parallel as possible. Superscalar processors can dispatch multiple instructions per cycle. Lots of registers are needed to feed lots of functional units. Memory bandwidths are large to support multiple data accesses. A lot of effort goes toward minimizing costly stalls due to dependencies. Out-of-order execution reduces the impact of stalls, but at the cost of extra reservation stations and commit hardware. Dedicated load/store units handle memory operations in parallel with other operations. Control and data speculation help keep the Itanium pipeline full. Compilers are important for good performance. They can greatly help in maximizing CPU usage and minimizing the need for stalls.