
Chapter 5
Instruction Level Parallelism and Superscalar Processors

1
Outline
 Overview
 Introduction
 Comparison
 Limitations to ILP
 Instruction Issue Policies
 Superscalar Execution Review
 Superscalar Implementation

2
Overview
 Pipelining exploits the potential parallelism among
instructions:
◦ Instruction Level Parallelism (ILP)
 The parallelism among instructions
 Refers to the degree to which, on average, the instructions of a
program can be executed in parallel
 Exists when instructions in a sequence are independent and thus can be
executed in parallel by overlapping

3
Overview…
 There are two methods for increasing the potential
amount of ILP:
◦ Increase the depth of the pipeline
 Enables the overlap of more instructions
 Amount of parallelism being exploited is higher
◦ Replicate the internal components of the computer
 Enables the launching of multiple instructions in every pipeline
stage
 This technique is called Multiple Issue

4
Overview…
 Two ways to implement multiple issue processor
◦ Static multiple issue processors
 Also called VLIW (Very Long Instruction Word) Processor
 Many decisions are made by the compiler before execution
 Focus of Chapter 6, next chapter
◦ Dynamic multiple issue processors
 Also called Superscalar processors
 Many decisions are made during execution by the processor
 Focus of this chapter, Chapter 5
 The major difference between the two:
 The division of work between the compiler and the hardware

5
Introduction
 Superscalar Architecture /Superscalar Processor
◦ In a superscalar architecture (SSA), several scalar
instructions can be initiated simultaneously and executed
independently
◦ Pipelining also allows several instructions to be executed at the
same time, but they have to be in different pipeline stages at a
given moment
◦ SSA includes all features of pipelining but, in addition, there
can be several instructions executing simultaneously in the
same pipeline stage

6
Introduction…
 Superscalar Architecture /Superscalar Processor
◦ A RISC machine lends itself readily to superscalar techniques
 But the approach can be used on either RISC or CISC architectures
◦ The superscalar approach is now the standard method for implementing
 High-performance microprocessors

7
Introduction…
 General Superscalar Organization
◦ There are multiple functional units
 Each of which is implemented as a pipeline
◦ In the diagram, the following operations can be executed at the same time:
 Two integer operations
 Two floating point operations and
 One memory (load or store) operation

8
Introduction…
 How does a superscalar processor work?
◦ A SS processor fetches multiple instructions at a time, and
attempts to find nearby instructions that are independent of
each other and can therefore be executed in parallel

◦ Based on the dependency analysis, the processor may issue
and execute instructions in an order that differs from that of
the original machine code

◦ The processor may eliminate some unnecessary dependencies
by the use of additional registers and renaming of register
references

9
Comparison
 Ordinary Pipeline (Base Pipeline) Vs
Superscalar
◦ Ordinary Pipeline (Base Machine):
 Issues one instruction per clock cycle
 Performs one pipeline stage per clock cycle
 Although several instructions are executing
concurrently
 Only one instruction is in its execution stage at any
one time
 In the figure the pipeline has four stages:
 Instruction fetch, Operation decode, Operation execution, Result write-back
◦ Superscalar Implementation:
 Two instructions are executed concurrently in each
pipeline stage (in the figure)
 Superscalar of degree 2
 Duplication of hardware is required

10
Superscalar Recap
 Allows several instructions to be issued and completed per
clock cycle
 Consists of a number of pipelines that are working in parallel
 Depending on the number and kind of parallel functional units
available, a certain number of instructions can be executed in
parallel

11
Limitations to ILP
 The situations which prevent instructions from being executed in
parallel by SSA are very similar to those which prevent
efficient execution on an ordinary pipeline:
◦ Resource conflicts
◦ Control (procedural) dependency
◦ Data dependencies

 Their consequences on SSA are more severe than those on
simple pipelines, because the potential parallelism in
SSA is greater and thus a larger amount of performance will
be lost

12
Limitations to ILP…
 Resource Conflicts
◦ Several instructions compete for the same hardware resource at
the same time
 Example:
 Memories, caches, buses, register file ports and functional units (ALU
adder)
 Two arithmetic instructions need the same floating-point unit for execution
 Similar to structural hazards in pipeline
◦ Can be solved partly by introducing several hardware units for the
same functions --- duplication of resources
 E.g., have two floating-point units
 The hardware units can also be pipelined to support several operations
at the same time

13
Limitations to ILP…
 Procedural Dependency
◦ The presence of branches in an instruction sequence complicates the
pipeline operation
 Cannot execute instructions after a branch until the branch is
executed
 The instruction following the branch is said to have a procedural
dependency on the branch instruction
 Similar to control hazards in pipeline
◦ If instructions are of variable length, they cannot be fetched and issued
in parallel, since an instruction has to be decoded in order to identify
the following one
 Another type of procedural dependency
 Therefore, superscalar techniques are more efficiently applicable to RISCs,
with fixed instruction length and format

14
Limitations to ILP…
 Data Conflicts
◦ Caused by data dependencies between instructions in the
program
 Similar to data hazards in pipeline
◦ To address the problem and to increase the degree of parallel
execution, SSA provides great freedom in the order in which
instructions are issued and executed
◦ Therefore data dependencies have to be considered and dealt
with much more carefully

15
Limitations to ILP…
 Data Conflicts…
◦ Due to data dependencies, only some of the instructions are potential
candidates for parallel execution
◦ In order to find instructions to be issued in parallel, the processor has to
select from a sufficiently large instruction sequence
 There are usually a lot of data dependencies in a short instruction
sequence
◦ Window of execution is defined as the set of instructions that is
considered for execution at a certain moment
◦ The number of instructions in the window should be as large as possible.
However, this is limited by:
 Capacity to fetch instructions at a high rate
 The problem of branches
 The cost of hardware needed to analyze data dependencies

16
Limitations to ILP…
 Data Dependencies
◦ All instructions in the window of execution may begin
execution, subject to data dependence and resource
constraints

◦ Three types of data dependencies can be identified:
 True data dependency
 Output dependency
 Anti-dependency

17
Limitations to ILP…
 True Data Dependency
◦ Also called write-read dependency/flow dependency
◦ Exists when the output of one instruction is required as an input to a
subsequent instruction:
 MUL R4,R3,R1 (R4 := R3 * R1)
...
ADD R2,R4,R5 (R2 := R4 + R5)
 Can fetch and decode second instruction in parallel with first
 Can NOT execute second instruction until first is finished
◦ They have to be detected and handled by hardware
 The addition above cannot be executed before the result of the multiplication is
available
 The simplest solution is to stall the adder until the multiplier has finished
 To avoid leaving the adder idle, the hardware can find other instructions
which the adder can execute

18
Limitations to ILP…
 True Data Dependency…
◦ They are intrinsic features of the user’s program, and cannot
be eliminated by compiler or hardware techniques
◦ There are often a lot of true data dependencies in a small
region of a program
 Increasing the window size can reduce the impact of these
dependencies

19
Limitations to ILP…
 True Data Dependency…
◦ Another Example:
 ADD r1, r2 (r1 := r1+r2;)
 MOVE r3,r1 (r3 := r1;)
 Can fetch and decode second instruction in parallel with first
 Can NOT execute second instruction until first is finished
◦ Exercise: Consider the following code and draw a conclusion about the
relationship between data dependencies and the size of a code region
L2 move r3,r7
load r8,(r3)
add r3,r3,#4
load r9,(r3)
ble r8,r9,L2

20
Limitations to ILP…
 Output Dependency
◦ Also called write-write dependency
◦ Occurs if two instructions are writing into the same location
 If the second instruction writes before the first one, an
error occurs:
 MUL R4,R3,R1 (R4 := R3 * R1)
...
ADD R4,R2,R5 (R4 := R2 + R5)

21
Limitations to ILP…
 Output Dependency…
◦ Another Example:
 I1: R3:= R3 + R5
 I2: R4:= R3 + 1
 I3: R3:= R5 + 1
 I4: R7:= R3 + R4
◦ What is the relationship between
 I1 & I2 ?
 True data dependency
 I3 & I4 ?
 True data dependency
◦ What about I1 & I3 ?
 No true data dependency
 If I3 completes before I1
 The wrong value of the contents of R3 will be fetched for the execution of I4
 I3 must complete after I1 to produce the correct output

22
Limitations to ILP…
 Anti-Dependency
◦ Also called read-write dependency
◦ Exists if an instruction uses a location as an operand while a following one is
writing into that location
◦ The constraint is similar to that of true data dependency, but reversed
 The second instruction destroys a value that the first instruction uses

◦ Example
 If the first one is still using the location when the second one writes into it, an error
occurs:
 MUL R4,R3,R1 (R4 := R3 * R1)
...
ADD R3,R2,R5 (R3 := R2 + R5)

23
Limitations to ILP…
 Anti-Dependency…
◦ Another Example:
 I1: R3:= R3 + R5
 I2: R4:= R3 + 1
 I3: R3:= R5 + 1
 I4: R7:= R3 + R4
 I3 cannot complete before I2 starts, as I2 needs a value in R3 and I3
changes R3 (a sketch classifying all three dependency types follows this slide)

24
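The three categories can be made concrete with a small Python sketch. This is an illustration added here, not detection hardware and not part of the original slides; the instructions follow the I1 to I4 example above, each written as a destination register plus its source registers:

def classify(first, second):
    # Each instruction is (destination register, [source registers]).
    d1, s1 = first
    d2, s2 = second
    deps = []
    if d1 in s2:
        deps.append("true (write-read / RAW)")     # second reads what first writes
    if d1 == d2:
        deps.append("output (write-write / WAW)")  # both write the same register
    if d2 in s1:
        deps.append("anti (read-write / WAR)")     # second overwrites what first reads
    return deps or ["independent"]

I1 = ("R3", ["R3", "R5"])    # I1: R3 := R3 + R5
I2 = ("R4", ["R3"])          # I2: R4 := R3 + 1
I3 = ("R3", ["R5"])          # I3: R3 := R5 + 1
I4 = ("R7", ["R3", "R4"])    # I4: R7 := R3 + R4

print(classify(I1, I2))      # ['true (write-read / RAW)']
print(classify(I2, I3))      # ['anti (read-write / WAR)']
print(classify(I1, I3))      # ['output (write-write / WAW)', 'anti (read-write / WAR)']

Note that I1 and I3 also carry an anti-dependency (I3 overwrites the R3 that I1 reads) in addition to the output dependency discussed on the earlier slide.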
Limitations to ILP…
 Output and Anti-Dependencies
◦ Output dependencies and anti-dependencies are not intrinsic features of the
executed program
 They are not real data dependencies but storage conflicts
 They are due to the competition of several instructions for the same
register
◦ They are only the consequence of the manner in which the programmer or
the compiler uses registers (or memory locations)
◦ In the previous examples the conflicts are produced only because:
 The output dependency:
 R4 is used by both instructions to store the result (due to, for example,
optimization of register usage)
 The anti-dependency:
 R3 is used by the second instruction to store the result

25
Limitations to ILP…
 Output and Anti-Dependencies …
◦ Output dependencies and anti-dependencies can usually be eliminated
by using additional registers
◦ This technique is called register renaming

 Recall the two conflicting sequences:

 MUL R4,R3,R1 (R4 := R3 * R1)
 ...
 ADD R4,R2,R5 (R4 := R2 + R5)

 MUL R4,R3,R1 (R4 := R3 * R1)
 ...
 ADD R3,R2,R5 (R3 := R2 + R5)

◦ Letting the second instruction write its result to a different (renamed)
register removes the conflict in each case

26
Limitations to ILP…
 Output and Anti-Dependencies …
◦ Another register renaming example:
 I1: R3b := R3a + R5a
 I2: R4b := R3b + 1
 I3: R3c := R5a + 1
 I4: R7b := R3c + R4b
◦ A name without a subscript refers to the logical register in the instruction
◦ A name with a subscript is the hardware register allocated to it
◦ Note R3a, R3b, R3c
◦ Creation of R3c avoids:
 The anti-dependency on the second instruction
 The output dependency on the first instruction
 And it doesn't interfere with the correct value being accessed by I4
(see the sketch after this slide)

27
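The renaming on this slide can be produced mechanically. Below is a minimal Python sketch of the idea; the suffix-per-write scheme is an illustrative assumption for this example, not how any particular processor names its physical registers:

import string

def rename(instrs):
    version = {}                          # logical register -> current version index
    def cur(reg):                         # current hardware name: R3 -> R3a, R3b, ...
        return reg + string.ascii_lowercase[version.get(reg, 0)]
    renamed = []
    for dst, srcs in instrs:
        new_srcs = [cur(s) for s in srcs]        # read through the current mapping
        version[dst] = version.get(dst, 0) + 1   # fresh register for every write
        renamed.append((cur(dst), new_srcs))     # WAW and WAR conflicts disappear
    return renamed

prog = [("R3", ["R3", "R5"]),    # I1: R3 := R3 + R5
        ("R4", ["R3"]),          # I2: R4 := R3 + 1
        ("R3", ["R5"]),          # I3: R3 := R5 + 1
        ("R7", ["R3", "R4"])]    # I4: R7 := R3 + R4

for label, (dst, srcs) in zip(["I1", "I2", "I3", "I4"], rename(prog)):
    print(label, dst, ":=", ", ".join(srcs))
# I1 R3b := R3a, R5a
# I2 R4b := R3b
# I3 R3c := R5a
# I4 R7b := R3c, R4b   -- matching the slide's renamed sequence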
Limitations to ILP…
 Effect of Dependencies

28
Instruction Issue Policies
 In SS processors
◦ Instructions can be executed in an order different from the strictly sequential
one, with the requirement that the results must be the same
 To optimize utilization of the various pipeline elements
 Three types of ordering are important:
◦ The order in which instructions are fetched
◦ The order in which instructions are executed
◦ The order in which instructions update the contents of register/memory
locations

 Instruction Issue:
◦ Refers to the process of initiating instruction execution in the processor’s
functional units
◦ Occurs when an instruction moves from the decode stage of the pipeline to the first
execute stage of the pipeline

29
Instruction Issue Policies…
 Instruction Issue Policy
◦ Refers to the protocol used to issue instructions

 Superscalar instruction issue policies fall into the following
categories:
◦ In-Order Issue with In-Order Completion
 IOI with IOC
◦ In-Order Issue with Out-of-Order Completion
 IOI with OOC
◦ Out-of-Order Issue with Out-of-Order Completion
 OOI with OOC

30
Instruction Issue Policies…
 In-Order Issue with In-Order Completion
◦ The simplest instruction issue policy
◦ Instructions are issued in exact program order, and completed in the
same order (with parallel issue and completion, of course!)
 An instruction cannot be issued before the previous one has been issued
 In order issue
 An instruction cannot be completed before the previous one has been
completed
 In order completion
◦ To guarantee in-order completion:
 Issuing an instruction will stall temporarily, when
 There is a conflict and
 A unit requires more than one cycle to execute

31
Instruction Issue Policies…
 In-Order Issue with In-Order Completion…
◦ Example:
 Assume a superscalar pipeline with the following capabilities:
 Can issue and decode two instructions per cycle
 Has three functional units
 Two single-cycle integer units,
 One two-cycle floating-point unit, and
 Can complete and write back two results per cycle
 Also, assume an instruction sequence with the characteristics given
below (a scheduling sketch for this example follows this slide):
 I1 – needs two execute cycles (floating-point)
 I2 –
 I3 –
 I4 – needs the same functional unit as I3
 I5 – needs data value produced by I4
 I6 – needs the same functional unit as I5

32
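Below is a minimal Python sketch of this schedule. The unit assignments and the single-cycle latencies for I2 and I3 are assumptions consistent with the slide's wording, and the separate decode and write-back stages are omitted for brevity; only the issue-side stalls (resource conflicts and true dependencies) are modeled:

INSTRS = [
    # (name, functional unit, execute cycles, instructions whose results it needs)
    ("I1", "fp",   2, []),        # needs two execute cycles (floating point)
    ("I2", "int0", 1, []),
    ("I3", "int1", 1, []),
    ("I4", "int1", 1, []),        # needs the same functional unit as I3
    ("I5", "int0", 1, ["I4"]),    # needs the data value produced by I4
    ("I6", "int0", 1, []),        # needs the same functional unit as I5
]

def schedule_in_order(instrs, width=2):
    # Each cycle, issue up to `width` instructions strictly in program order,
    # stopping at the first that cannot issue (unit busy or operand not ready).
    unit_free, ready = {}, {}     # unit -> cycle it frees; instr -> cycle result usable
    cycle, i, timeline = 1, 0, []
    while i < len(instrs):
        issued = 0
        while i < len(instrs) and issued < width:
            name, unit, lat, deps = instrs[i]
            if unit_free.get(unit, 0) > cycle:                         # resource conflict
                break
            if any(ready.get(d, float("inf")) > cycle for d in deps):  # true dependency
                break
            unit_free[unit] = ready[name] = cycle + lat
            timeline.append((name, cycle, cycle + lat - 1))
            issued += 1
            i += 1
        cycle += 1
    return timeline

for name, start, end in schedule_in_order(INSTRS):
    print(f"{name}: executes in cycles {start}..{end}")
# I1: 1..2   I2: 1..1   I3: 2..2   I4: 3..3   I5: 4..4   I6: 5..5

Swapping I3 and I6 in INSTRS pairs I6 with I4 and I3 with I5, illustrating the compiler-reordering point made on the next slide.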
Instruction Issue Policies…
 In-Order Issue with In-Order Completion…
◦ The processor detects and handles (by stalling) true data
dependencies and resource conflicts
◦ As instructions are issued and completed in their strict order
 The exploited parallelism is very much dependent on the
way the program has been written or compiled
 Example:
 If I3 and I6 switch positions,
 the pairs I4/I6 and I3/I5 can be executed in parallel (see the
following slides)
 To exploit such parallelism improvement, the compiler needs
to perform elaborate data-flow analysis

33
Instruction Issue Policies…
 In-Order Issue with In-Order Completion…
◦ Example: the same sequence with I3 and I6 swapped
 I1 – needs two execute cycles (floating-point)
 I2 –
 I6 – needs the same functional unit as I5
 I4 – needs the same functional unit as I3
 I5 – needs data value produced by I4
 I3 –

 To exploit such an improvement, the compiler needs to perform elaborate data-flow analysis

34
Instruction Issue Policies…
 In-Order Issue with In-Order Completion…
◦ The basic idea of SSA is not to rely on compiler-based
techniques
◦ SSA allows the hardware alone to detect instructions which
can be executed in parallel, and to execute them accordingly
◦ IOI with IOC is not very efficient, but it simplifies the
hardware

35
Instruction Issue Policies…
 In-Order Issue with Out-of-Order Completion
◦ With out-of-order completion, a later instruction may complete before a
previous one
◦ Requires
 More complex instruction issue logic than in-order completion
 Careful handling when an interrupt occurs
◦ Used to improve the performance of instructions that require multiple
cycles
 Example: Long-latency operations such as division
 I1 – needs two cycles
 I2 –
 I3 –
 I4 – conflicts with I3
 I5 – depends on I4
 I6 – conflicts with I5

36
Instruction Issue Policies…
 Out-of-Order Issue with Out-of-Order Completion
◦ With in-order issue
 The processor will only decode instructions up to the point of a
dependency or conflict
 No additional instructions are decoded until the conflict is resolved
 The processor cannot look ahead of the point of conflict to subsequent
instructions
 That may be independent of those already in the pipeline and
 That may be usefully introduced into the pipeline

37
Instruction Issue Policies…
 Out-of-Order Issue with Out-of-Order Completion…
◦ Out-of-order issue takes a set of decoded instructions, issues
any instruction, in any order, as long as the program
execution is correct
 Decouple the decode pipeline from the execution pipeline by introducing an
instruction window
 When a functional unit becomes available, an instruction can be executed
 Since instructions have been decoded, the processor can look ahead
(see the sketch after this slide)

38
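A minimal Python sketch of this policy, reusing INSTRS and the assumptions from the in-order sketch earlier (and assuming register renaming has already removed anti- and output dependencies, so only true dependencies and unit conflicts constrain issue):

def schedule_out_of_order(instrs, window=4, width=2):
    # Each cycle, scan the first `window` not-yet-issued instructions and issue
    # any whose operands are ready and whose unit is free, regardless of order.
    unit_free, ready = {}, {}
    pending, cycle, timeline = list(instrs), 1, []
    while pending:
        issued_names = []
        for name, unit, lat, deps in pending[:window]:   # the window of execution
            if len(issued_names) == width:
                break
            if unit_free.get(unit, 0) > cycle:
                continue                                 # unit busy: look past it
            if any(ready.get(d, float("inf")) > cycle for d in deps):
                continue                                 # operand not ready: look past it
            unit_free[unit] = ready[name] = cycle + lat
            timeline.append((name, cycle))
            issued_names.append(name)
        pending = [ins for ins in pending if ins[0] not in issued_names]
        cycle += 1
    return timeline

print(schedule_out_of_order(INSTRS))
# [('I1', 1), ('I2', 1), ('I3', 2), ('I6', 2), ('I4', 3), ('I5', 4)]

Here I6 issues ahead of I4 and I5 without any compiler reordering: the hardware looked past the stalled instructions in the window.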
Instruction Issue Policies…
 Out-of-Order Issue with Out-of-Order Completion…
◦ Example
 Instructions have a similar relationship as indicated in the previous slide
 I1 – needs two execute cycles
 I2 –
 I3 –
 I4 – conflicts with I3
 I5 – depends on I4
 I6 – conflicts with I5

39
Superscalar Execution Review
 The instruction fetch process, which includes branch prediction
◦ Used to form a dynamic stream of instructions
◦ This stream is examined for dependencies, and the processor may remove artificial
dependencies

 The processor then dispatches the instructions into a window of execution
◦ In the window
 Instructions are structured according to their true data dependencies
 They no longer form a sequential stream
 The processor performs the execution stage of each instruction
◦ In an order determined by
 The true data dependencies
 The hardware resource availability
 Finally, instructions are conceptually put back into sequential order and their
results are ordered
◦ Referred to as committing or retiring the instruction

40
Superscalar Execution Review…
 Committing or retiring instructions is needed for the following reasons:
◦ Use of parallel, multiple pipelines
 Instructions may complete in an order different from that shown in
the static program
◦ Use of branch prediction and speculative execution
 Some instructions may be abandoned
 Permanent storage and program-visible registers:
◦ Cannot be updated immediately when instructions complete
execution
◦ Results are held in some sort of temporary storage
(see the reorder-buffer sketch after this slide)
 Usable by dependent instructions
 Made permanent when it is determined that the sequential model
would have executed the instruction

41
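One common way to realize this temporary storage is a reorder buffer. The slides do not name a specific structure, so the following Python sketch is an assumption about the mechanism, not the deck's prescribed design:

from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()          # instructions kept in program (fetch) order

    def dispatch(self, name):
        entry = {"name": name, "done": False, "value": None}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        # Out-of-order completion: the result becomes usable by dependent
        # instructions, but it is not yet architecturally permanent.
        entry["done"], entry["value"] = True, value

    def retire(self):
        # Commit finished instructions strictly from the head, preserving the
        # sequential model; stop at the first unfinished instruction.
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["name"])
        return committed

rob = ReorderBuffer()
e1, e2, e3 = (rob.dispatch(n) for n in ("I1", "I2", "I3"))
rob.complete(e2, 42)      # I2 finishes out of order...
print(rob.retire())       # [] -- nothing retires until I1 is done
rob.complete(e1, 7)
print(rob.retire())       # ['I1', 'I2'] -- both commit, in program order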
Superscalar Execution Review…
 Figure: Superscalar Execution

42
Superscalar Implementation
 Instruction fetch strategies that simultaneously fetch multiple instructions
◦ Often by predicting the outcomes of conditional branch instructions and
fetching beyond them
 This requires the use of
 Multiple pipeline fetch and decode stages
 Branch prediction logic
 Logic for determining true dependencies involving register values, and
mechanisms for communicating these values to where they are needed
during execution
 Mechanisms for initiating, or issuing, multiple instructions in parallel
 Resources for parallel execution of multiple instructions, including
◦ Multiple pipelined functional units
◦ Memory hierarchies capable of simultaneously servicing multiple memory
references
 A mechanism for committing the process state in correct order

43
Superscalar Implementation…
 Figure
◦ In the figure two floating point and two integer operations can be issued and
executed simultaneously
◦ Each unit is also pipelined and can execute several operations in different
pipeline stages

44
Superscalar Implementation…
 Figure…
◦ Another view of superscalar processor organization

45
Summary
 The following techniques are main features for superscalar processors:
◦ Several pipelined units which are working in parallel
◦ Out-of-order issue and out-of-order completion
◦ Register renaming
 All of the above techniques are aimed to enhance performance
 Experiments have shown:
◦ Only adding additional functional units is not very efficient
◦ Out-of-order issue is extremely important, as it allows the processor to look
ahead for independent instructions
◦ Register renaming can improve performance by more than 30%; in
this case performance is limited only by true dependencies
◦ It is important to provide a fetching/decoding capacity so that the
window of execution is sufficiently large

46
Reading Assignment
 Superpipelining processor
◦ Comparison with
 Ordinary pipeline
 Superscalar
◦ Advantage of superpipelined
 Superpipelined superscalar processors
◦ Characteristics
 Instruction Level Parallelism Vs Machine Parallelism
 CPI Vs IPC (worked example below)
 Clock cycles Per Instruction (CPI)
 Instructions Per Clock cycle (IPC)
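As a worked illustration (numbers invented for the example): a degree-2 superscalar that completes 120 instructions in 75 clock cycles achieves IPC = 120/75 = 1.6, equivalently CPI = 1/1.6 = 0.625, whereas an ordinary scalar pipeline is bounded by IPC ≤ 1 (CPI ≥ 1).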

47
