
Week 14

This document describes out-of-order execution in processors. It discusses:
• Allowing instructions to complete out of order while still issuing them in program order. This enables optimizations like instruction I2 completing before I1.
• Out-of-order issue allows looking ahead past dependencies/conflicts to issue independent instructions earlier. This requires decoupling decode and execute stages with an instruction window buffer.
• Register renaming avoids false dependencies like write-after-write that could otherwise stall the pipeline with out-of-order execution. It allocates physical registers dynamically.
• Examples show how out-of-order execution and multiple functional units can improve performance, especially with a larger instruction window to find independent instructions.

Uploaded by

nausicaatetoo

In-Order Issue
Out-Of-Order Completion
• Issue instructions in the exact order that would be achieved by sequential
execution but allow instructions to run to completion out of order.

• Any number of instructions may be in the execution stage at any one time, up
to the maximum degree of machine parallelism across all functional units.

• Instruction issuing is stalled by a resource conflict, a data dependency, or a
procedural dependency.

• Instruction I2 is allowed to run to completion prior to I1.

• This allows I3 to be completed earlier, for a net saving of one cycle.
Out-Of-Order Issue
Out-Of-Order Completion
• With in-order issue, the processor will only decode
instructions up to the point of a dependency or conflict.

• No additional instructions are decoded until the conflict is resolved.

• As a result, the processor cannot look ahead of the point of conflict to
subsequent instructions that may be independent of those already in the
pipeline and that may be usefully introduced into the pipeline.

• To allow out-of-order issue, it is necessary to decouple the decode and
execute stages of the pipeline.

• This is done with a buffer referred to as an instruction window.
Out-Of-Order Issue
Out-Of-Order Completion
• With this organization, after a processor has finished
decoding an instruction, it is placed in the instruction
window.

• As long as this buffer is not full, the processor can continue to fetch
and decode new instructions.

• When a functional unit becomes available in the execute stage, an
instruction from the instruction window may be issued to the execute stage.

• Any instruction may be issued, provided that
— it needs the particular functional unit that is available
— no conflicts or dependencies block this instruction
Out-Of-Order Issue
Out-Of-Order Completion
• The result of this organization is that the processor has a
lookahead capability, allowing it to identify independent
instructions that can be brought into the execute stage.

• Instructions are issued from the instruction window with little regard
for their original program order.

• As before, the only constraint is that the program execution behaves
correctly.
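The issue rule described above can be sketched as a small cycle-by-cycle simulation. This is only an illustration under stated assumptions: the `Instr` record, the `simulate` function, the three-instruction program, the latencies, and the unit names are all hypothetical and do not come from the slides. The sketch also assumes each register is written at most once (as if registers were already renamed) and that a functional unit is busy for the whole latency of the instruction it executes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    unit: str          # functional unit the instruction needs (hypothetical names)
    dest: str          # register written (assumed written only once in the program)
    srcs: tuple = ()   # registers read
    latency: int = 1   # cycles spent in the execute stage


def simulate(program, window_size=4, decode_width=2):
    """Return {instruction name: cycle in which it was issued to execute}."""
    pending = list(program)  # not yet decoded, in program order
    window = []              # decoded instructions waiting to be issued
    busy_until = {}          # unit -> first cycle in which it is free again
    ready_at = {}            # register -> cycle in which its new value is available
    issued = {}
    cycle = 0
    while pending or window:
        # Issue: any windowed instruction may go if its unit is free
        # and no true (RAW) dependency blocks it.
        for ins in list(window):
            unit_free = busy_until.get(ins.unit, 0) <= cycle
            operands_ready = all(ready_at.get(r, 0) <= cycle for r in ins.srcs)
            if unit_free and operands_ready:
                window.remove(ins)
                issued[ins.name] = cycle
                busy_until[ins.unit] = cycle + ins.latency
                ready_at[ins.dest] = cycle + ins.latency
        # Decode: keep filling the window while it has room.
        decoded = 0
        while pending and len(window) < window_size and decoded < decode_width:
            ins = pending.pop(0)
            window.append(ins)
            ready_at[ins.dest] = float("inf")  # value pending until the producer issues
            decoded += 1
        cycle += 1
    return issued


# Hypothetical program: I5 depends on I4, I6 is independent of both.
program = [
    Instr("I4", unit="mem", dest="r4", latency=3),
    Instr("I5", unit="alu", dest="r5", srcs=("r4",)),
    Instr("I6", unit="alu", dest="r6", srcs=("r1",)),
]

print(simulate(program, window_size=4))  # lookahead available
print(simulate(program, window_size=1))  # effectively no lookahead
```

With a window of four entries, I6 is issued ahead of I5 in this sketch; with a window of one entry, I5 blocks the window and I6 has to wait behind it.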
Out-Of-Order Issue
Out-Of-Order Completion
• During each of the first three cycles, two instructions are fetched
into the decode stage.

• During each cycle, subject to the constraint of the buffer size, two
instructions move from the decode stage to the instruction
window.

• In this example, it is possible to issue instruction I6 ahead of I5
(recall that I5 depends on I4, but I6 does not).

• Thus, one cycle is saved in both the execute and write-back stages, and
the end-to-end savings, compared with in-order issue with out-of-order
completion, is one cycle.
Register Renaming
• Allowing out-of-order instruction issue and/or out-of-order instruction
completion can give rise to WAW dependencies and WAR dependencies.

• These dependencies differ from RAW data dependencies and resource
conflicts, which reflect the flow of data through a program and the
sequence of execution.

• WAW dependencies and WAR dependencies, on the other hand, arise because
the values in registers may no longer reflect the sequence of values
dictated by the program flow.

• These dependencies may result in a pipeline stall.

• With register renaming, registers are allocated dynamically by the hardware
— i.e. registers are not specifically named
Register Renaming
• The register reference without the subscript refers to the
logical register reference found in the instruction.

• The register reference with the subscript refers to a hardware register
allocated to hold a new value.

• When a new allocation is made for a particular logical register,
subsequent instruction references to that logical register as a source
operand are made to refer to the most recently allocated hardware register
(recent in terms of the program sequence of instructions).
Register Renaming
• In this example, the creation of register R3c in instruction I3
avoids the WAR dependency on the second instruction and the
WAW on the first instruction, and it does not interfere with the
correct value being accessed by I4.

• The result is that I3 can be issued immediately.

• Without renaming, I3 cannot be issued until the first instruction is
complete and the second instruction is issued.
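The renaming rule described above (sources refer to the most recent allocation, every write allocates a fresh hardware register) can be sketched as follows. The `rename` function and the four-instruction sequence are assumptions for illustration: the original instruction listing is not reproduced in this text, so the sequence was chosen only to be consistent with the description (I3 creates R3c, and I4 reads it). Subscript "a" is assumed to denote the value a register holds before the sequence starts.

```python
from string import ascii_lowercase


def rename(program):
    """Rename logical registers to subscripted hardware registers.

    program: list of (dest, [sources]) using logical register names.
    Returns a list of (renamed dest, [renamed sources]).
    Sketch only: supports at most 26 allocations per logical register.
    """
    version = {}  # logical register -> index of its most recent allocation

    def current(reg):
        # A register never written so far keeps its initial subscript "a".
        return f"{reg}{ascii_lowercase[version.setdefault(reg, 0)]}"

    renamed = []
    for dest, srcs in program:
        # Sources are looked up first, so they see the allocation made
        # by the most recent earlier write in program order.
        new_srcs = [current(s) for s in srcs]
        # Every write allocates a new hardware register for the destination.
        version[dest] = version.get(dest, 0) + 1
        renamed.append((current(dest), new_srcs))
    return renamed


# Hypothetical sequence: I1 writes R3, I2 reads R3, I3 writes R3 again, I4 reads R3.
program = [
    ("R3", ["R3", "R5"]),  # I1
    ("R4", ["R3"]),        # I2
    ("R3", ["R5"]),        # I3
    ("R7", ["R3", "R4"]),  # I4
]

for number, (dest, srcs) in enumerate(rename(program), start=1):
    print(f"I{number}: {dest} <- {', '.join(srcs)}")
```

In the renamed sequence, I3 writes R3c while I1 writes R3b and I2 reads R3b, so I3 no longer conflicts with either of them, and I4 still reads the value produced by I3.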
Register Renaming - Speedup
• The vertical axis corresponds to the mean speedup of the
superscalar machine over the scalar machine.
• The horizontal axis shows the results for four alternative
processor organizations.
• The base machine does not duplicate any of the functional units,
but it can issue instructions out of order.
Register Renaming - Speedup
• The second configuration duplicates the load/store functional unit
that accesses a data cache.
• The third configuration duplicates the ALU.
• The fourth configuration duplicates both load/store and ALU.
• In each graph, results are shown for instruction window sizes of
8, 16, and 32 instructions; the window size dictates the amount of
lookahead the processor can do.
Register Renaming - Speedup
• The difference between the two graphs is that, in the second,
register renaming is allowed.
• The first graph reflects a machine that is limited by all dependencies.
• The second graph corresponds to a machine that is limited only by
true dependencies.
Machine Parallelism
• The two graphs, combined, yield some important conclusions.
• The first is that it is probably not worthwhile to add functional
units without register renaming.
• There is some slight improvement in performance, but at the cost
of increased hardware complexity.
• With register renaming, which eliminates antidependencies and
output dependencies, noticeable gains are achieved by adding
more functional units.
• Note, however, that there is a significant difference in the amount
of gain achievable between using an instruction window of 8
versus a larger instruction window.
• This indicates that if the instruction window is too small, data
dependencies will prevent effective utilization of the extra
functional units; the processor must be able to look quite far
ahead to find independent instructions to utilize the hardware
more fully.
Key Elements of a
Superscalar Processor Organization
• Instruction fetch strategies that simultaneously fetch multiple
instructions, often by predicting the outcomes of, and fetching
beyond, conditional branch instructions.

• These functions require the use of multiple pipeline fetch and decode
stages, and branch prediction logic.

• Logic for determining true dependencies involving register values, and
mechanisms for communicating these values to where they are needed during
execution.

• Mechanisms for initiating, or issuing, multiple instructions in parallel.

• Resources for parallel execution of multiple instructions, including
multiple pipelined functional units and memory hierarchies capable of
simultaneously servicing multiple memory references.

• Mechanisms for committing the process state in correct order.


Example - 1
• Consider the following sequence of instructions, where the syntax
consists of an opcode followed by the destination register followed by
one or two source registers:

• Assume the use of a four-stage pipeline: fetch, decode/issue, execute,
write back.
• Assume that all pipeline stages take one clock cycle except for the
execute stage.
• For simple integer arithmetic and logical instructions, the execute
stage takes one cycle, but for a LOAD from memory, five cycles are
consumed in the execute stage.
Example - 1
• If we have a simple scalar pipeline but allow out-of-order execution,
we can construct the following table for the execution of the first
seven instructions:
Example - 1
• The entries under the four pipeline stages indicate the clock cycle at
which each instruction begins each phase.

• In this program, the second ADD instruction (instruction 3) depends on
the LOAD instruction (instruction 1) for one of its operands, r6.

• Because the LOAD instruction takes five clock cycles, and the issue
logic encounters the dependent ADD instruction after two clocks, the
issue logic must delay the ADD instruction for three clock cycles.

• With an out-of-order capability, the processor can stall instruction 3
at clock cycle 4, and then move on to issue the following three
independent instructions, which enter execution at clocks 6, 8, and 9.

• The LOAD finishes execution at clock 9, and so the dependent ADD can be
launched into execution on clock 10.
Example - 1
a) Complete the preceding table.

b) Redo the table assuming no out-of-order capability. What is the
savings using the capability?

c) Redo the table assuming a superscalar implementation that can handle
two instructions at a time at each stage.
Example - 1
a) Complete the preceding table.

[Pipeline timing table: rows I0 to I10, columns clock cycles 0 to 20; each instruction passes through F, D, E, WB. The cycle positions of the stages are not recoverable.]
Example - 1
b) Redo the table assuming no out-of-order capability. What is the
savings using the capability?

[Pipeline timing table: rows I0 to I10, columns clock cycles 0 to 27; each instruction passes through F, D, E, WB. The cycle positions of the stages are not recoverable.]
Example - 1
c) Redo the table assuming a superscalar implementation that can
handle two instructions at a time at each stage.

[Pipeline timing table: rows I0 to I10, columns clock cycles 0 to 18; I0 to I9 pass through F, D, E, WB, and I10 is shown through F, D, E. The cycle positions of the stages are not recoverable.]
Example - 2
• Identify the write-read, write-write, and read-write dependencies in
the following instruction sequence:
Example - 2
• True Data Dependency (Read After Write, RAW)
❖ I1-I4
❖ I1-I5
❖ I2-I4
❖ I2-I5
• Antidependency (Write After Read, WAR)
❖ I2-I3
❖ I2-I4
❖ I3-I4
❖ I4-I5
• Output Dependency (Write After Write, WAW)
❖ I1-I2
❖ I1-I5
❖ I2-I5
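A pairwise check for the three dependency types can be sketched as below. The instruction sequence in the sketch is an assumption: the original listing is not reproduced in this text, so a hypothetical five-instruction sequence was chosen that yields the pairs listed above. The `classify` function is likewise illustrative; it compares every earlier/later pair of instructions and does not check whether an intervening instruction overwrites the register.

```python
from itertools import combinations


def classify(program):
    """Classify pairwise register dependencies.

    program: list of (name, dest, set of sources) in program order.
    Returns {"RAW": [...], "WAR": [...], "WAW": [...]} with (earlier, later) pairs.
    """
    deps = {"RAW": [], "WAR": [], "WAW": []}
    for (n1, d1, s1), (n2, d2, s2) in combinations(program, 2):
        if d1 in s2:   # later instruction reads what the earlier one writes
            deps["RAW"].append((n1, n2))
        if d2 in s1:   # later instruction writes what the earlier one reads
            deps["WAR"].append((n1, n2))
        if d1 == d2:   # both instructions write the same register
            deps["WAW"].append((n1, n2))
    return deps


# Hypothetical sequence, assumed for illustration only:
#   I1: R1 = 100
#   I2: R1 = R2 + R4
#   I3: R2 = R4 - 25
#   I4: R4 = R1 + R3
#   I5: R1 = R1 + 30
program = [
    ("I1", "R1", set()),
    ("I2", "R1", {"R2", "R4"}),
    ("I3", "R2", {"R4"}),
    ("I4", "R4", {"R1", "R3"}),
    ("I5", "R1", {"R1"}),
]

for kind, pairs in classify(program).items():
    print(kind, ", ".join(f"{a}-{b}" for a, b in pairs))
```

Because the check is purely pairwise, it reports I1-I4 and I1-I5 as RAW even though I2 overwrites R1 in between; a stricter analysis would count only the nearest preceding writer of each source register.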
Review Questions
1. What is the difference between the superscalar and
superpipelined approaches?
2. Explain the following terms with an example:
i. True data dependency
ii. Procedural dependency
iii. Resource conflicts
iv. Output dependency
v. Antidependency
3. Explain three types of superscalar instruction issue policies
with an example.
4. What are the key elements of a superscalar processor
organization?
