Chapter 5
Instruction Level Parallelism and
Superscalar Processors
1
Outline
Overview
Introduction
Comparison
Limitations to ILP
Instruction Issue Policies
Superscalar Execution Review
Superscalar Implementation
2
Overview
Pipelining exploits the potential parallelism among
instructions:
◦ Instruction Level Parallelism (ILP)
The parallelism among instructions
Refers to the degree to which, on average, the instructions of a
program can be executed in parallel
Exists when instructions in a sequence are independent and thus can be
executed in parallel by overlapping
3
Overview…
There are two methods for increasing the potential
amount of ILP:
◦ Increase the depth of the pipeline
Enables the overlap of more instructions
Amount of parallelism being exploited is higher
◦ Replicate the internal components of the computer
Enables the launching of multiple instructions in every pipeline
stage
This technique is called Multiple Issue
4
Overview…
Two ways to implement a multiple-issue processor
◦ Static multiple issue processors
Also called VLIW (Very Long Instruction Word) Processor
Many decisions are made by the compiler before execution
Focus of Chapter 6, next chapter
◦ Dynamic multiple issue processors
Also called Superscalar processors
Many decisions are made during execution by the processor
Focus of this chapter, Chapter 5
The major difference between the two:
The division of work between the compiler and the hardware
5
Introduction
Superscalar Architecture /Superscalar Processor
◦ In a superscalar architecture (SSA), several scalar
instructions can be initiated simultaneously and executed
independently
◦ Pipelining allows also several instructions to be executed at
the same time, but they have to be in different pipeline
stages at a given moment
◦ SSA includes all features of pipelining but, in addition, there
can be several instructions executing simultaneously in the
same pipeline stage
6
Introduction…
Superscalar Architecture /Superscalar Processor
◦ RISC machines lend themselves readily to superscalar techniques
But the approach can be used on either RISC or CISC architectures
◦ The superscalar approach is now the standard method for
implementing
High-performance microprocessors
7
Introduction…
General Superscalar Organization
◦ There are multiple functional units
Each of which is implemented as a pipeline
◦ In the diagram, the following operations can be executed at the same time:
Two integer operations
Two floating point operations and
One memory (load or store) operation
8
Introduction…
How does a superscalar processor work?
◦ A SS processor fetches multiple instructions at a time, and
attempts to find nearby instructions that are independent of
each other and therefore can be executed in parallel
◦ Based on the dependency analysis, the processor may issue
and execute instructions in an order that differs from that of
the original machine code
◦ The processor may eliminate some unnecessary dependencies
by the use of additional registers and renaming of register
references
9
Comparison
Ordinary Pipeline (Base Pipeline) Vs
Superscalar
◦ Ordinary Pipeline (Base Machine):
Issues one instruction per clock cycle
Performs one pipeline stage per clock cycle
Although several instructions are executing
concurrently
Only one instruction is in its execution stage at any
one time
In the figure the pipeline has 4 stages:
Instruction fetch, Operation decode, Operation
execution, Result write back
◦ Superscalar Implementation:
Two instructions are executed concurrently in each
pipeline stage (in the figure)
Superscalar of degree 2
Duplication of hardware is required
10
Superscalar Recap
Allows several instructions to be issued and completed per
clock cycle
Consists of a number of pipelines that are working in parallel
Depending on the number and kind of parallel functional units
available, a certain number of instructions can be executed in
parallel
11
Limitations to ILP
The situations which prevent instructions from being executed in
parallel by SSA are very similar to those which prevent
efficient execution on an ordinary pipeline:
◦ Resource conflicts
◦ Control (procedural) dependency
◦ Data dependencies
Their consequences on SSA are more severe than those on
simple pipelines, because the potential parallelism in
SSA is greater and thus a larger amount of performance will
be lost
12
Limitations to ILP…
Resource Conflicts
◦ Several instructions compete for the same hardware resource at
the same time
Example:
Memories, caches, buses, register file ports and functional units (e.g., ALU,
adder)
Two arithmetic instructions need the same floating-point unit for execution
Similar to structural hazards in pipeline
◦ Can be solved partly by introducing several hardware units for the
same functions --- duplication of resources
E.g., have two floating-point units
The hardware units can also be pipelined to support several operations
at the same time
13
Limitations to ILP…
Procedural Dependency
◦ The presence of branches in an instruction sequence complicates the
pipeline operation
Cannot execute instructions after a branch until the branch is
executed
The instruction following the branch is said to have a procedural
dependency on the branch instruction
Similar to control hazards in pipeline
◦ If instructions are of variable length, they cannot be fetched and issued
in parallel, since an instruction has to be decoded in order to identify
the following one
Another type of procedural dependency
Therefore, superscalar techniques are more efficiently applicable to RISCs,
with fixed instruction length and format
14
Limitations to ILP…
Data Conflicts
◦ Caused by data dependencies between instructions in the
program
Similar to data hazards in pipeline
◦ To address the problem and to increase the degree of parallel
execution, SSA provides great freedom in the order in which
instructions are issued and executed
◦ Therefore data dependencies have to be considered and dealt
with much more carefully
15
Limitations to ILP…
Data Conflicts…
◦ Due to data dependencies, only some of the instructions are candidates
for parallel execution
◦ In order to find instructions to be issued in parallel, the processor has to
select from a sufficiently large instruction sequence
There are usually a lot of data dependencies in a short instruction
sequence
◦ Window of execution is defined as the set of instructions that is
considered for execution at a certain moment
◦ The number of instructions in the window should be as large as possible.
However, this is limited by:
Capacity to fetch instructions at a high rate
The problem of branches
The cost of hardware needed to analyze data dependencies
16
Limitations to ILP…
Data Dependencies
◦ All instructions in the window of execution may begin
execution, subject to data dependence and resource
constraints
◦ Three types of data dependencies can be identified:
True data dependency
Output dependency
Anti-dependency
17
Limitations to ILP…
True Data Dependency
◦ Also called write-read dependency/flow dependency
◦ Exists when the output of one instruction is required as an input to a
subsequent instruction:
MUL R4,R3,R1 (R4 := R3 * R1)
...
ADD R2,R4,R5 (R2 := R4 + R5)
Can fetch and decode second instruction in parallel with first
Can NOT execute second instruction until first is finished
◦ They have to be detected and handled by hardware
The addition above cannot be executed before the result of the multiplication is
available
The simplest solution is to stall the adder until the multiplier has finished
To avoid leaving the adder idle, the hardware can find other instructions
which can be executed by the adder
18
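The RAW check described above can be sketched in a few lines. This is an illustrative toy model, not any real pipeline's issue logic; the instruction encoding as a `(dest, src1, src2)` tuple is an assumption made for the example:

```python
# Toy model: an instruction is (dest, src1, src2); registers are strings.
# Illustrative only -- real issue logic compares register tags in hardware.
def has_true_dependency(first, second):
    """True if `second` reads the register that `first` writes (RAW)."""
    dest, _, _ = first
    _, src1, src2 = second
    return dest in (src1, src2)

mul = ("R4", "R3", "R1")   # MUL R4,R3,R1
add = ("R2", "R4", "R5")   # ADD R2,R4,R5 -- reads R4, written by MUL
```

When the check reports a hit, the hardware must either stall the consumer or forward the producer's result once it becomes available.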
Limitations to ILP…
True Data Dependency…
◦ They are intrinsic features of the user’s program, and cannot
be eliminated by compiler or hardware techniques
◦ There are often a lot of true data dependencies in a small
region of a program
Increasing the window size can reduce the impact of these
dependencies
19
Limitations to ILP…
True Data Dependency…
◦ Another Example:
ADD r1, r2 (r1 := r1+r2;)
MOVE r3,r1 (r3 := r1;)
Can fetch and decode second instruction in parallel with first
Can NOT execute second instruction until first is finished
◦ Exercise: Consider the following code and draw a conclusion about the
relationship between data dependency and region of code
L2 move r3,r7
load r8,(r3)
add r3,r3,#4
load r9,(r3)
ble r8,r9,L2
20
Limitations to ILP…
Output Dependency
◦ Also called write-write dependency
◦ Occurs if two instructions are writing into the same location
If the second instruction writes before the first one, an
error occurs:
MUL R4,R3,R1 (R4 := R3 * R1)
...
ADD R4,R2,R5 (R4 := R2 + R5)
21
Limitations to ILP…
Output Dependency…
◦ Another Example:
I1: R3:= R3 + R5
I2: R4:= R3 + 1
I3: R3:= R5 + 1
I4: R7:= R3 + R4
◦ What is the relationship between
I1 & I2 ?
True data dependency
I3 & I4 ?
True data dependency
◦ What about I1 & I3 ?
No true data dependency
If I3 completes before I1
The wrong value of the contents of R3 will be fetched for the execution of I4
I3 must complete after I1 to produce the correct output
22
Limitations to ILP…
Anti-Dependency
◦ Also called read-write dependency
◦ Exists if an instruction uses a location as an operand while a following one is
writing into that location
◦ The constraint is similar to that of true data dependency, but reversed
The second instruction destroys a value that the first instruction uses
◦ Example
If the first one is still using the location when the second one writes into it, an error
occurs:
MUL R4,R3,R1 (R4 := R3 * R1)
...
ADD R3,R2,R5 (R3 := R2 + R5)
23
Limitations to ILP…
Anti-Dependency…
◦ Another Example:
I1: R3:= R3 + R5
I2: R4:= R3 + 1
I3: R3:= R5 + 1
I4: R7:= R3 + R4
I3 cannot complete before I2 starts as I2 needs a value in R3 and I3
changes R3
24
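All three kinds of dependency in the I1–I4 sequence can be detected mechanically. A hedged sketch, assuming a toy encoding of each instruction as `(dest, sources)` (not a real hardware detector):

```python
# Classify the dependency from an earlier instruction to a later one.
# Each instruction is (dest, sources); registers are plain strings.
def classify(earlier, later):
    kinds = []
    if earlier[0] in later[1]:
        kinds.append("true (RAW)")    # later reads what earlier writes
    if earlier[0] == later[0]:
        kinds.append("output (WAW)")  # both write the same register
    if later[0] in earlier[1]:
        kinds.append("anti (WAR)")    # later overwrites what earlier reads
    return kinds

# The slides' example sequence:
I1 = ("R3", ("R3", "R5"))   # R3 := R3 + R5
I2 = ("R4", ("R3",))        # R4 := R3 + 1
I3 = ("R3", ("R5",))        # R3 := R5 + 1
I4 = ("R7", ("R3", "R4"))   # R7 := R3 + R4
```

Note that one pair of instructions can carry more than one kind of dependency: I1 and I3 both write R3 (output dependency), and I1 also reads R3 that I3 overwrites (anti-dependency).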
Limitations to ILP…
Output and Anti-Dependencies
◦ Output dependencies and anti-dependencies are not intrinsic features of the
executed program
They are not real data dependencies but storage conflicts
They are due to the competition of several instructions for the same
register
◦ They are only the consequence of the manner in which the programmer or
the compiler is using registers (or memory locations)
◦ In the previous examples the conflicts are produced only because:
The output dependency:
R4 is used by both instructions to store the result (due to, for example,
optimization of register usage)
The anti-dependency:
R3 is used by the second instruction to store the result
25
Limitations to ILP…
Output and Anti-Dependencies …
◦ Output dependencies and anti-dependencies can usually be eliminated
by using additional registers
◦ This technique is called register renaming
Output dependency example:
MUL R4,R3,R1 (R4 := R3 * R1)
...
ADD R4,R2,R5 (R4 := R2 + R5)
Anti-dependency example:
MUL R4,R3,R1 (R4 := R3 * R1)
...
ADD R3,R2,R5 (R3 := R2 + R5)
26
Limitations to ILP…
Output and Anti-Dependencies …
◦ Register renaming another example:
I1: R3b:=R3a + R5a
I2: R4b:=R3b + 1
I3: R3c:=R5a + 1
I4: R7b:=R3c + R4b
◦ A name without a subscript refers to the logical register in the instruction
◦ A name with a subscript is the hardware register allocated to it
◦ Note R3a, R3b, R3c
◦ Creating R3c avoids:
Anti-dependency on the second instruction
Output dependency on the first instruction
Doesn't interfere with the correct value being accessed by I4
27
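The renaming on this slide can be sketched as follows. This is a minimal model assuming an unbounded pool of physical registers named P0, P1, …; real hardware uses a finite pool with a free list, and the naming scheme here is purely illustrative:

```python
from itertools import count

def rename(instructions, num_logical=8):
    """Give every write a fresh physical register; reads use the latest map."""
    fresh = count()
    # Initial mapping: each logical register starts in some physical register.
    mapping = {f"R{i}": f"P{next(fresh)}" for i in range(num_logical)}
    renamed = []
    for dest, sources in instructions:
        new_sources = tuple(mapping[s] for s in sources)  # read current names
        mapping[dest] = f"P{next(fresh)}"                 # fresh reg per write
        renamed.append((mapping[dest], new_sources))
    return renamed

# The slide's sequence I1..I4, encoded as (dest, sources) pairs:
seq = [("R3", ("R3", "R5")),   # I1: R3 := R3 + R5
       ("R4", ("R3",)),        # I2: R4 := R3 + 1
       ("R3", ("R5",)),        # I3: R3 := R5 + 1
       ("R7", ("R3", "R4"))]   # I4: R7 := R3 + R4
```

After renaming, I1 and I3 write different physical registers, so their output and anti-dependencies disappear, while I2 still reads I1's result and I4 still reads I3's result.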
Limitations to ILP…
Effect of Dependencies
28
Instruction Issue Policies
In SS processors
◦ Instructions can be executed in an order different from the strictly sequential
one, with the requirement that the results must be the same
To optimize utilization of the various pipeline elements
Three types of ordering are important:
◦ The order in which instructions are fetched
◦ The order in which instructions are executed
◦ The order in which instructions update the contents of register /memory
locations
Instruction Issue:
◦ Refers to the process of initiating instruction execution in the processor’s
functional units
◦ Occurs when an instruction moves from the decode stage of the pipeline to the first
execute stage of the pipeline
29
Instruction Issue Policies…
Instruction Issue Policy
◦ Refers to the protocol used to issue instructions
Superscalar instruction issue policies fall into the following
categories:
◦ In-Order Issue with In-Order Completion
IOI with IOC
◦ In-Order Issue with Out-of-Order Completion
IOI with OOC
◦ Out-of-Order Issue with Out-of-Order Completion
OOI with OOC
30
Instruction Issue Policies…
In-Order Issue with In-Order Completion
◦ The simplest instruction issue policy
◦ Instructions are issued in exact program order, and completed in the
same order (with parallel issue and completion, of course!)
An instruction cannot be issued before the previous one has been issued
In order issue
An instruction cannot be completed before the previous one has been
completed
In order completion
◦ To guarantee in-order completion:
Issuing an instruction will stall temporarily when
There is a conflict, or
A unit requires more than one cycle to execute
31
Instruction Issue Policies…
In-Order Issue with In-Order Completion…
◦ Example:
Assume a superscalar pipeline with the following capabilities:
Can issue and decode two instructions per cycle
Has three functional units
Two single-cycle integer units and
One two-cycle floating-point unit
Can complete and write back two results per cycle
Also, assume an instruction sequence with the characteristics given
below:
I1 – needs two execute cycles (floating-point)
I2 –
I3 –
I4 – needs the same functional unit as I3
I5 – needs data value produced by I4
I6 – needs the same functional unit as I5
32
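The stall pattern of this example can be reproduced with a toy scheduler. This is a deliberate simplification: it models only the two-wide in-order issue, the functional-unit conflicts, the RAW dependency, and the in-order-completion constraint; it ignores the two-results-per-cycle write-back limit, and the unit names are illustrative:

```python
ISSUE_WIDTH = 2  # two instructions decoded/issued per cycle

def schedule_in_order(program, units):
    """In-order issue, in-order completion: issue stalls at the first
    instruction that cannot start this cycle; completion times are forced
    to be non-decreasing in program order."""
    issue_time, done_time = {}, {}
    unit_free = {u: 1 for u in units}  # cycle each unit is next free
    cycle, i = 1, 0
    while i < len(program):
        issued = 0
        while issued < ISSUE_WIDTH and i < len(program):
            name, unit, lat, deps = program[i]
            ready = max([1] + [done_time[d] + 1 for d in deps])
            start = max(cycle, ready, unit_free[unit])
            if start > cycle:
                break                     # stall: strict in-order issue
            issue_time[name] = start
            finish = start + lat - 1
            if done_time:                 # in-order completion
                finish = max(finish, max(done_time.values()))
            done_time[name] = finish
            unit_free[unit] = start + lat
            i += 1
            issued += 1
        cycle += 1
    return issue_time, done_time

# The slide's instruction sequence, as (name, unit, latency, dependencies):
program = [
    ("I1", "fp",   2, []),       # two-cycle floating-point
    ("I2", "int1", 1, []),
    ("I3", "int2", 1, []),
    ("I4", "int2", 1, []),       # needs the same unit as I3
    ("I5", "int1", 1, ["I4"]),   # needs the value produced by I4
    ("I6", "int1", 1, []),       # needs the same unit as I5
]
```

Running it shows I1/I2 issuing together, I4 stalling behind I3's unit, I5 waiting on I4's result, and I6 stuck behind I5 even though it is independent.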
Instruction Issue Policies…
In-Order Issue with In-Order Completion…
◦ The processor detects and handles (by stalling) true data
dependencies and resource conflicts
◦ As instructions are issued and completed in their strict order
The exploited parallelism is very much dependent on the
way the program has been written or compiled
Example:
If I3 and I6 switch positions,
the pairs I4/I6 and I3/I5 can be executed in parallel (see the
following slides)
To exploit such a parallelism improvement, the compiler needs
to perform elaborate data-flow analysis
33
Instruction Issue Policies…
In-Order Issue with In-Order Completion…
◦ Example
I1 – needs two execute cycles (floating-point)
I2 –
I6 – needs the same functional unit as I5
I4 – needs the same functional unit as I3
I5 – needs data value produced by I4
I3 –
34
Instruction Issue Policies…
In-Order Issue with In-Order Completion…
◦ The basic idea of SSA is not to rely on compiler-based
techniques
◦ SSA allows the hardware alone to detect instructions which
can be executed in parallel and to do that accordingly
◦ IOI with IOC is not very efficient, but it simplifies the
hardware
35
Instruction Issue Policies…
In-Order Issue with Out-of-Order Completion
◦ With out-of-order completion, a later instruction may complete before a
previous one
◦ Requires
More complex instruction issue logic than in-order completion
Attention to the machine state when an interrupt occurs
◦ Used to improve the performance of instructions that require multiple
cycles
Example: Long-latency operations such as division
I1 – needs two cycles
I2 –
I3 –
I4 – conflicts with I3
I5 – depends on I4
I6 – conflicts with I5
36
Instruction Issue Policies…
Out-of-Order Issue with Out-of-Order Completion
◦ With in-order issue
The processor will only decode instructions up to the point of a
dependency or conflict
No additional instructions are decoded until the conflict is resolved
The processor cannot look ahead of the point of conflict to subsequent
instructions
That may be independent of those already in the pipeline and
That may be usefully introduced into the pipeline
37
Instruction Issue Policies…
Out-of-Order Issue with Out-of-Order Completion…
◦ Out-of-order issue takes a set of decoded instructions, issues
any instruction, in any order, as long as the program
execution is correct
Decouples the decode pipeline from the execution pipeline, by introducing an
instruction window
When a functional unit becomes available, an instruction can be executed
Since instructions have been decoded, the processor can look ahead
38
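An out-of-order issue sketch, using a toy encoding of each instruction as (name, unit, latency, dependencies). It is a simplification with illustrative unit names and no write-back limit; each cycle, any instruction in the window whose operands are ready and whose unit is free may issue, regardless of program order:

```python
def schedule_out_of_order(program, units, width=2):
    """Out-of-order issue, out-of-order completion, from a full window."""
    issue, done = {}, {}
    unit_free = {u: 1 for u in units}  # cycle each unit is next free
    window = list(program)             # the whole program fits the window
    cycle = 1
    while window:
        issued = 0
        for instr in list(window):
            if issued == width:
                break
            name, unit, lat, deps = instr
            operands_ready = all(d in done and done[d] < cycle for d in deps)
            if operands_ready and unit_free[unit] <= cycle:
                issue[name] = cycle
                done[name] = cycle + lat - 1   # completes whenever it is done
                unit_free[unit] = cycle + lat
                window.remove(instr)
                issued += 1
        cycle += 1
    return issue, done

# Same toy sequence as the in-order example:
program = [
    ("I1", "fp",   2, []),
    ("I2", "int1", 1, []),
    ("I3", "int2", 1, []),
    ("I4", "int2", 1, []),       # unit conflict with I3
    ("I5", "int1", 1, ["I4"]),   # true dependency on I4
    ("I6", "int1", 1, []),       # unit conflict with I5
]
```

Here the independent I6 issues ahead of the stalled I4, so the sequence finishes a cycle earlier than under strict in-order issue, while the true dependency I4 → I5 is still respected.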
Instruction Issue Policies…
Out-of-Order Issue with Out-of-Order Completion…
◦ Example
Instructions have a similar relationship as indicated in the previous slide
I1 – needs two execute cycles
I2 –
I3 –
I4 – conflicts with I3
I5 – depends on I4
I6 – conflict with I5
39
Superscalar Execution Review
The instruction fetch process, which includes branch prediction
◦ Used to form a dynamic stream of instructions
◦ This stream is examined for dependencies, and the processor may remove artificial
dependencies
The processor then dispatches the instructions into a window of execution
◦ In the window
Instructions are structured according to their true data dependencies
No longer form a sequential stream
The processor performs the execution stage of each instruction
◦ In an order determined by
The true data dependencies
The hardware resource availability
Finally, instructions are conceptually put back into sequential order and their
results are recorded
◦ Referred to as committing or retiring the instruction
40
Superscalar Execution Review…
Committing or retiring instructions is needed for the following reasons:
◦ Use of parallel, multiple pipelines
Instructions may complete in an order different from that shown in
the static program
◦ Use of branch prediction and speculative execution
Some instructions may be abandoned
Permanent storage and program-visible registers:
◦ Cannot be updated immediately when instructions complete
execution
◦ Results are held in some sort of temporary storage
Usable by dependent instructions
Made permanent when it is determined that the sequential model
would have executed the instruction
41
Superscalar Execution Review…
Figure: Superscalar Execution
42
Superscalar Implementation
Instruction fetch strategies that simultaneously fetch multiple instructions
◦ Often by predicting the outcomes and
◦ Fetching beyond conditional branch instructions
This requires the use of
Multiple pipeline fetch and decode stages
Branch prediction logic
Logic for determining true dependencies involving register values and
mechanism for communicating these values to where they are needed
during execution
Mechanism for initiating, or issuing, multiple instructions in parallel
Resources for parallel execution of multiple instructions, including
◦ Multiple pipelined functional units
◦ Memory hierarchies capable of simultaneously servicing multiple memory
references
Mechanism for committing the process state in correct order
43
Superscalar Execution Review…
Figure
◦ In the figure two floating point and two integer operations can be issued and
executed simultaneously
◦ Each unit is also pipelined and can execute several operations in different
pipeline stages
44
Superscalar Execution Review…
Figure…
◦ Another view of superscalar processor organization
45
Summary
The following techniques are the main features of superscalar processors:
◦ Several pipelined units which are working in parallel
◦ Out-of-order issue and out-of-order completion
◦ Register renaming
All of the above techniques are aimed to enhance performance
Experiments have shown:
◦ Adding additional functional units alone is not very efficient
◦ Out-of-order issue is extremely important; it allows the processor to look ahead
for independent instructions
◦ Register renaming can improve performance by more than 30%; in
this case performance is limited only by true dependencies
◦ It is important to provide a fetching/decoding capacity so that the
window of execution is sufficiently large
46
Reading Assignment
Superpipelining processor
◦ Comparison with
Ordinary pipeline
Superscalar
◦ Advantage of superpipelined
Superpipelined superscalar processors
◦ Characteristics
Instruction Level Parallelism Vs Machine Parallelism
CPI Vs IPC
Clock cycles Per Instruction
Instructions Per Clock cycle
47
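For the CPI vs IPC item: the two metrics are reciprocals of each other, so a machine that sustains 1.6 instructions per clock has a CPI of 0.625. A trivial sketch (the numbers are illustrative, not measured):

```python
def ipc(instructions, cycles):
    """Instructions per clock cycle -- higher is better for superscalars."""
    return instructions / cycles

def cpi(instructions, cycles):
    """Clock cycles per instruction -- the reciprocal of IPC."""
    return cycles / instructions
```

A superscalar of degree 2 has an ideal IPC of 2 (CPI of 0.5); dependencies and resource conflicts keep the sustained value below that.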