Computer Architecture Notes
The computer architecture and language classes must have a common foundation
or paradigm called a Computational Model. The concept of a computational
model represents a higher level of abstraction than either the computer
architecture or the programming language alone, and covers both, as shown below –
Figure: The computational model as a level of abstraction above, and covering, both computer architecture and computer language.
The concept of computational model comprises the set of the following three
abstractions –
1. The basic items of computations
2. The problem description model
3. The execution model
Contrary to initial thoughts, the set of abstractions that should be chosen to specify
computational models is far from obvious. A smaller number of criteria would define
fewer but more basic computational models, while a larger number of criteria would
result in a relatively large number of distinct models.
Computational Model
For instance, in the von Neumann computational model the basic items of
computation are data. This data will typically be represented by named (that is,
identifiable) entities in order to be able to distinguish between a number of
different data items in the course of a computation. These named entities are
commonly called variables in programming languages and are implemented by
memory or register addresses in architectures.
Not all of the best-known computational models (listed later in this section) are data based, however. In contrast to the von Neumann model, there are a number of models in which the basic items of computation are other entities: for example, functions and their arguments in the applicative model, objects and messages in the object-based model, or predicates in the predicate-logic-based model.
The problem description model refers to both the style and method of problem
description, as shown –
Figure: Interpretation of the problem description model; its two components are the style (procedural or declarative) and the method of problem description.
In a procedural style the algorithm for solving the problem is stated; that is, a particular solution is declared in the form of an algorithm.
In a declarative style all the facts and relationships relevant to the given problem
have to be stated. There are two methods for expressing these relationships and
facts. The first uses functions, as in the applicative model of computation, while the
second states the relationships and facts in the form of predicates, as in the
predicate-logic-based computational model.
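To make the contrast concrete, the following small sketch (our own illustration, written in C for convenience; it does not appear in the original notes) states the same computation first in a procedural style, as an algorithm that repeatedly updates a named variable, and then in an applicative style, as a pure function of its arguments:

#include <stdio.h>

/* Procedural style: the solution is stated as an algorithm, a sequence
   of steps that repeatedly reassigns the named variable 'result'. */
long fact_procedural(int n) {
    long result = 1;
    for (int i = 2; i <= n; i++)
        result = result * i;       /* multiple assignment to 'result' */
    return result;
}

/* Applicative (declarative) style: the result is expressed purely as a
   function of its arguments; no variable is ever reassigned. */
long fact_applicative(int n) {
    return (n <= 1) ? 1 : n * fact_applicative(n - 1);
}

int main(void) {
    printf("%ld %ld\n", fact_procedural(5), fact_applicative(5));
    return 0;
}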
The other component of the problem description model is the problem
description method. It is interpreted differently for the procedural and the
declarative style. When using the procedural style the problem description model
states how a solution of the given problem has to be described. In contrast, while
using the declarative style, it specifies how the problem itself has to be described.
The third and last element of the computational model is the execution model. It consists of three components, as shown –
Figure: Components of the execution model.
The next component of the execution model specifies the execution semantics.
This can be interpreted as a rule that prescribes how a single execution step is to be
performed. This rule is, of course, associated with the chosen problem description
method and how the execution of the computation is interpreted. The different
kinds of execution semantics applied by the computational models are summarized
below –
Figure: Kinds of execution semantics applied by the computational models.
The last component of the model specifies the control of the execution
sequences. In the basic models, execution is either control driven or data
driven or demand driven as shown –
Figure: Control of the execution sequence – control driven, data driven or demand driven.
Figure: A computational model, with the programming language serving as its specification tool and the computer architecture as its implementation tool, which together support execution.
Figure: Typical evolution sequence of a computational model and the corresponding programming languages and architectures.
The best-known computational models are:
• Turing
• von Neumann
• dataflow
• applicative
• object based
• predicate logic based.
These models are called basic models because they may be declared using a
minimal set of abstractions. For each model we specify:
Key features of the von Neumann computational model:
• Problem description – input data, a sequence of instructions, output data
• Assumed execution –
   – Interpretation of performing the computation: the given sequence of instructions will be executed using the input data
   – Execution semantics: state transition semantics
   – Control of the execution sequence: control driven
The main attributes of the model follow from the possibility of multiple assignments
of data as well as from the control-driven execution.
Related languages
The languages related to the von Neumann computational model are called imperative
languages, emphasizing the possibility of multiple assignments, or
sometimes procedural languages, pinpointing the procedural nature of the
problem description that they support. Because the procedural style of problem
description is used in more than one computational model, in the following we use
the term 'imperative languages' to denote the language class corresponding
to the von Neumann computational model. Examples of imperative languages are
C, Pascal, ALGOL, Fortran, and so on.
1. Granularity
Conventional CISC architectures are medium grain.
2. Typing
In typed languages there exists a concept of data type and the language
system (compiler or interpreter) may check the consistency of the types
used in expressions, function invocations and so on. If the language is
strongly typed, a type mismatch will cause an error. In weakly typed languages,
a type mismatch is allowed under given circumstances, namely if the types involved
are compatible.
LISP and FP are examples of untyped languages. Pascal, Miranda and HOPE
are strongly typed languages, whereas the single assignment language Sisal is
weakly typed.
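As a rough illustration (our own example, using C, which is not one of the languages named above), the difference shows up in what the language system accepts: a mismatch between compatible types is tolerated and silently converted, while an assignment between incompatible types is rejected:

#include <stdio.h>

struct point { int x, y; };

int main(void) {
    double d = 3.7;
    int i = d;              /* weak checking: the mismatch is tolerated and
                               the value is silently converted to 3        */
    struct point p = {1, 2};
    /* int j = p; */        /* strong checking: incompatible types, so the
                               compiler rejects this assignment            */
    printf("%d %d\n", i, p.x);
    return 0;
}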
Level: low – conventional assembly languages; high – conventional high-level languages.
The term ‘computer architecture’ was coined in 1964 by the ‘chief architects’
of the IBM System/360 in a paper announcing the most successful family of
computers ever built. They interpreted computer architecture as
‘the structure of a computer that a machine language programmer must
understand to write a correct program for a machine’. Essentially, their
interpretation comprises the definition of registers and memory as well as of the
instruction set, instruction formats, addressing modes and the actual coding of the
instructions excluding implementation and realization. By implementation they
understood the actual hardware structure and by realization the logic technology,
packaging and interconnections.
Since the authors just quoted also made use of the multilevel description
approach, they introduced a 'two-dimensional' interpretation.
Here, one of the ‘dimensions’ is the level of abstraction, in much the same way as
introduced by Bell and Newell (1970). The other, orthogonal 'dimension' is
the scope of interest. Here, they differentiate at each level between a black-box-like
functional specification and the description of the implementation. Correspondingly,
they distinguish between ‘logical’ and ‘physical’ architecture (Sima 1977) and endo-
and exo- architecture (Dasgupta 1981), respectively.
As pointed out later in this section, the two dimensions considered by these
authors constitute a single, hierarchical description framework
with increasing levels of abstraction.
Level of consideration
The term ‘architecture’ can be used at each level of consideration with two
distinct scopes of interest. If we are interested in the functional specification of a
computer, we are dealing with its abstract architecture. Or, if we are interested
in its implementation, we are concerned with its concrete architecture. Evidently,
the abstract architecture reflects the black-box view whereas the concrete
architecture covers the description of the internal structure and operation.
The abstract architecture is also referred to as an exo-architecture, an
external or logical architecture, a black-box description, or in certain contexts as a
programming model and behavioural description.
Scope of interest
The abstract architecture can be considered from two points of view: that of the
programmer and that of the hardware designer. In the first case we are dealing with
the programming model, in the second with the hardware model.
While the programming model is a black-box specification from the
programmer’s point of view, the hardware model is viewed as a black-box
description for a hardware designer.
Concrete architecture
The concrete architecture can also be considered from two different points of
view: logic design or physical design.
The logic design is an abstraction of the physical design and precedes it. Its
specification requires:
• the declaration of the logical components used, such as registers,
execution units and so on.
• the specification of their interconnections, and
• the specification of the sequences of information transfers that are
initiated by each of the declared functions.
The physical design is based on concrete circuit elements. The specification
of a physical design covers:
• the declaration of the circuit elements used, which also includes the
specification of signals,
• the specification of their interconnections, and
• the declaration of the initiated signal sequences.
Usually, both the logic and physical design are described in terms of the next
lower level of components. For example, the concrete architecture of a processor
would be best described in terms of functional units such as register files, execution
units, buses and so on.
Figure: Computer architecture at different levels of abstraction (for example, the programming interface and interrupt interface of a processor, its data cache, register blocks and ALU).
Figure: Concrete architecture of a computer system – processors (P) and memory units (M) connected by a bus.
Description of architectures
Concept of process
Finally, a process in the waiting state can go into the ready-to-run state if the
event it is waiting for has occurred.
Concept of thread
Figure: Process and thread tree.
Although the adjectives ‘concurrent’ and ‘parallel’ are often used as synonyms,
it is often desirable to make a distinction between them.
In this situation the key problem is how the competing clients, let us say
processes or threads, should be scheduled for service (execution) by the single
server (processor).
The scheduling policy may be viewed as covering two aspects. The first deals
with whether servicing a client can be interrupted or not and, if so, on what
occasions (pre-emption rule). The other states how one of the competing clients is
selected for service (selection rule).
Scheduling policy
Problem solutions may contain two different kinds of available parallelism, called
functional parallelism and data parallelism. We term functional parallelism that
kind of parallelism which arises from the logic of a problem solution.
There is a further kind of available parallelism, called data parallelism. It
comes from using data structures that allow parallel operations on their elements,
such as vectors or matrices, in problem solutions.
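A minimal sketch of data parallelism (our own illustration, in C) is an element-wise vector operation: every iteration applies the same operation to a different element, so in principle all iterations could be executed at once by a data-parallel architecture:

#define N 1024

/* The N additions are mutually independent, so a data-parallel machine
   could perform them simultaneously; written as a sequential loop, the
   data parallelism appears as loop-level parallelism. */
void vector_add(const double a[N], const double b[N], double c[N]) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}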
From another point of view parallelism can be considered as being either regular
or irregular. Data parallelism is regular, whereas functional parallelism, with the
exception of loop-level parallelism, is usually irregular.
Available parallelism may be utilized in two ways:
1. Exploited by architectures
2. Exploited by means of operating systems
In order to round out the picture we will briefly outline below high-level concurrent
execution models such as multi-threading, multitasking, multiprogramming and
time sharing. Note that all these concurrent execution models refer to different
granularity of execution, also termed different levels. Typically, they are
implemented by means of the operating system on a single processor. It is
important to note, however, that all these notions can also be interpreted in a
broader sense, as notions designating the execution of multiple threads, processes
or users in a concurrent or parallel way. Nevertheless, the interpretations below are
focused on the narrower scope of concurrent execution.
Thread-level concurrent execution is termed multi-threading. In this case
multiple threads can be generated for each process, and these threads are
executed concurrently on a single processor under the control of the operating system.
Multi-threading is usually interpreted as concurrent execution at the thread
level. Multi-threading evidently presumes that a process has multiple threads, that
is, a process-thread model is used to represent and schedule units of work for the
processor.
Process-level concurrent execution is usually called multitasking. All
current widely used operating systems support this concept. Multitasking refers to
concurrent execution of processes. Multiple ready-to-run processes can be created
either by a single user if process spawning is feasible, or by multiple users
participating in multiprogramming or in time-sharing.
Figure: Concurrent execution models at different levels of granularity.
Data parallelism may be utilized in two different ways. One possibility is to exploit
data parallelism directly by dedicated architectures that permit parallel or pipelined
operations on data elements, called data-parallel architectures. The other possibility
is to convert data parallelism into functional parallelism by expressing parallel
executable operations on data elements in a sequential manner, by using the loop
constructs of an imperative language.
Proposed classification
Pipelining
In pipelining a number of functional units are employed in sequence
to perform a single computation. These functional units form an assembly
line or pipeline. Each functional unit represents a certain stage of the
computation and each computation goes through the whole pipeline. If there
is only a single computation to be performed, the pipeline cannot extract any
parallelism. However, when the same computation is to be executed several
times, these computations can be overlapped by the functional units.
Pipelining is a very powerful technique to speed up a long series of
similar computations and hence is used in many parallel architectures. It can
be used inside a processor and among the processors or nodes of a parallel
computer. A member of the oldest class of parallel computers, the vector
processor, is a good example of the pipelining technique inside a processor. The
elements of two vectors enter the vector unit at every time step. The vector
unit, consisting of several functional units, executes the same operation on
each pair of vector elements. Modern pipelined processors and superscalar
processors also use the pipelining technique to overlap various stages of
instruction execution. The processors used in multi-threaded architectures
are usually built on the pipelining principle, too.
Replication
A natural way of introducing parallelism to a computer is the
replication of functional units. Replicated functional units can execute the
same operation simultaneously on as many data elements as there are
replicated computational resources available. The classical example is the
array processor which employs a large number of identical processors
executing the same operation on different data elements. Wavefront arrays
and two-dimensional systolic arrays also combine replication with pipelining.
Relationships between languages and parallel
architectures
Although languages and parallel architectures could be considered as
independent layers of a computer system, in reality, for efficiency reasons the
parallel architecture has a strong influence on the language constructs applied to
exploit parallelism.
There is only one common control unit for all the processors in SIMD
machines and hence, each processor executes the same code. However, there are
situations when only a subset of the elements of an array is involved in an
operation. A new language construct, called the mask, is needed to define the
relevant subset of an array. The mask technique is a crucial part of data-parallel
languages.
The two main lines of synchronization tools lead to the rendezvous language
construct which combines the features of remote procedure calls and monitors. The
rendezvous concept could be implemented on both shared and distributed memory
machines.
Dependencies between instructions
In a program, instructions often depend on each other in such a way that a
particular instruction cannot be executed until a preceding instruction or even two
or three preceding instructions have been executed. For instance, such a
dependency exists if an instruction uses the results of the preceding instruction as a
source operand.
There are three possible kinds of dependencies between instructions. If
subsequent instructions are dependent on each other because of data, we term this
data dependency. If a conditional transfer instruction is met, the subsequent
execution path depends on the outcome of the condition. Thus, instructions which
are in either the sequential or the taken path of a conditional control transfer
instruction are said to be control dependent. Finally, if instructions require the same
resource for execution they are said to be resource dependent.
Data dependencies
Consider two instructions ik and il of the same program, where ik precedes il. If
ik and il have a common register or memory operand, they are data dependent on
each other, except when the common operand is used in both instructions as a
source operand. An obvious example is when il uses the result of ik as a source
operand. In sequential execution data dependencies do not cause any trouble, since
instructions are executed strictly in the specified sequence. But this is not true
when instructions are executed in parallel, such as in ILP-execution. Then, the
detection and fair handling of data dependencies becomes a major issue.
Data dependency
In this case i2 writes r2 while i1 uses r2 as a source. If, for any reason, i2
were to be executed ahead of i1 then the original content of r2 would be rewritten
earlier than it is read by instruction i1, which would lead to an erroneous result.
Accordingly, a WAR dependency exists between two instructions ik and il, if
il specifies as a destination one of the source addresses of ik (which can be either a
register or a memory address). Note that WAR dependencies are also termed
anti-dependencies. They are false dependencies (as are WAW dependencies) since they
can be eliminated through register renaming.
A WAR dependency resolves as soon as the instruction involved has fetched
the operand causing the dependency.
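The three kinds of data dependency can be illustrated with a short piece of straight-line code (our own example, written as C statements standing in for machine instructions):

void dependencies(int b, int c, int e, int f, int g, int h) {
    int a, d;
    a = b * c;      /* i1: defines a                                     */
    d = a + e;      /* i2: RAW (true data) dependency on i1 through a    */
    e = f + g;      /* i3: WAR (anti-) dependency on i2, which reads e   */
    a = h + 1;      /* i4: WAW (output) dependency on i1, which writes a */
    (void)a; (void)d; (void)e;
}

Renaming the destinations of i3 and i4 (writing into fresh variables) would remove the WAR and WAW dependencies, as noted above; the RAW dependency of i2 is a true dependency and remains.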
do I = 2, n
    X(I) = A*X(I-1) + B
enddo
There is a recurrence in the loop since in order to compute the value of X(I)
we need the value of X(I-1), the result of the previous iteration.
The graph has one node for each instruction considered. Directed arcs
indicate dependencies (precedence relationships) between instruction pairs. If we
restrict ourselves to straight-line code, the directed graph representing data
dependencies is acyclic. Consequently, data dependencies among instructions of
straight-line code can be represented by a directed acyclic graph (DAG).
Control dependencies
Consider the following code sequence:
mul r1, r2, r3;
jz zproc;
sub r4, r7, r1;
:
zproc: load r1, x;
:
In this example, the actual path of execution depends on the outcome of the
multiplication. This means that the instructions following a conditional branch are
dependent on it. In a similar way, all conditional control instructions, such as
conditional branches, calls, skips and so on, impose dependencies on the logically
subsequent instructions, which are referred to as control dependencies.
Conditional branches can seriously impede the performance of parallel
instruction execution. Therefore, it is important to examine how frequent
conditional branches actually are.
Nodes with only one outgoing arc represent either an operational instruction
or a sequence of conditional branch-free operational instructions (straight-line
code). The usual term for directed graphs expressing control dependencies is
Control Dependency Graph (CDG).
Resource dependencies
An instruction is resource-dependent on a previously issued instruction if it
requires a hardware resource which is still being used by a previously issued
instruction. If, for instance, only a single non-pipelined division unit is available, as
is usual in ILP-processors, then in a code sequence containing two successive divisions
the second division instruction is resource-dependent on the first one and cannot be
executed in parallel.
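The code sequence referred to above is not reproduced in the notes; a minimal sketch of the situation (our own, in C) is two back-to-back divisions that are data independent but compete for the single division unit:

/* Assuming a single, non-pipelined division unit, the second division
   cannot start until the first has released the divider, even though the
   two statements have no data dependency on each other. */
void resource_dependent(double a, double b, double c, double d,
                        double *x, double *y) {
    *x = a / b;     /* occupies the division unit                  */
    *y = c / d;     /* resource-dependent on the previous division */
}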
Resource dependencies are constraints caused by limited resources such as
EUs, buses or buffers. They can reduce the degree of parallelism that can be
achieved at different stages of execution, such as instruction decoding, issue,
renaming, execution and so on. In general, with the evolution of integrated circuit
technology, more hardware resources are becoming available in processors, and
related constraints will impede performance to a lesser extent than at present.
Instruction scheduling
When instructions are processed in parallel, it is often necessary to detect
and resolve dependencies between instructions. Dependency detection and resolution
are usually discussed separately for the different processor classes and processing
tasks involved. Nevertheless, the question of whether these tasks are
expected to be performed by the compiler or by the processor deserves special
attention.
The two basic approaches are termed static and dynamic. Static detection
and resolution is accomplished by the compiler, which avoids dependencies by
reordering the code; the output of the compiler is thus dependency-free code.
Note that VLIW processors always expect dependency-free code,
whereas pipelined and superscalar processors usually do not.
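As a hedged illustration of static resolution (our own example, in C), the compiler can remove a stall simply by moving an independent statement between a definition and its use, assuming the referenced variables do not alias:

/* Before scheduling: s2 has a RAW dependency on s1 and would stall a
   pipelined processor while waiting for the value of *a.               */
void unscheduled(int *a, int *d, int *f, int b, int c, int e, int g, int h) {
    *a = b + c;     /* s1                   */
    *d = *a + e;    /* s2: must wait for s1 */
    *f = g + h;     /* s3: independent      */
}

/* After static scheduling: the independent s3 is moved between s1 and s2,
   hiding the latency of s1 (valid only if a, d and f do not alias).     */
void scheduled(int *a, int *d, int *f, int b, int c, int e, int g, int h) {
    *a = b + c;     /* s1 */
    *f = g + h;     /* s3 */
    *d = *a + e;    /* s2 */
}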
In contrast, dynamic detection and resolution of dependencies is performed
by the processor. If dependencies have to be detected in connection with
instruction issue, the processor typically maintains two gliding windows. The issue
window contains all fetched instructions which are intended for issue in the next
cycle, while instructions which are still in execution and whose results have not yet
been produced are retained in an execution window. In each cycle, all the
instructions in the issue window are checked for data, control and resource
dependencies with respect to the instructions in execution. There is also a further
check for dependencies among the instructions in the issue window. As a result of
the dependency checks, zero, one or more independent instructions will be issued
to the EUs.
It must be emphasized that this is only one possible method of dependency
detection. In general, detection and resolution is part of the instruction issue policy
of the processor.
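A highly simplified sketch of such a window-based check (our own, in C; real processors use scoreboards or renaming hardware rather than a loop of comparisons) compares a candidate instruction of the issue window against every instruction still in the execution window:

#include <stdbool.h>

/* Hypothetical instruction record: one destination and two source
   registers; a value of -1 means the field is unused.                */
typedef struct { int dest, src1, src2; } instr_t;

/* True if 'cand' has a RAW, WAR or WAW dependency on 'older'.        */
static bool depends_on(instr_t cand, instr_t older) {
    bool raw = older.dest >= 0 &&
               (cand.src1 == older.dest || cand.src2 == older.dest);
    bool war = cand.dest >= 0 &&
               (cand.dest == older.src1 || cand.dest == older.src2);
    bool waw = cand.dest >= 0 && cand.dest == older.dest;
    return raw || war || waw;
}

/* An instruction of the issue window may be issued only if it is
   independent of all instructions still in the execution window
   (resource checks and checks within the issue window are omitted). */
bool can_issue(instr_t cand, const instr_t exec_window[], int n_exec) {
    for (int i = 0; i < n_exec; i++)
        if (depends_on(cand, exec_window[i]))
            return false;
    return true;
}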
Basic approaches to instruction scheduling.
• A pipeline stage associated with each subtask which performs the required
operations
• The same amount of time is available in each stage for performing the
required subtask
• All pipeline stages operate like an assembly line, that is, receiving their input
typically from the previous stage and delivering their output to the next stage
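Under these assumptions the benefit of overlapping is easy to quantify; the small sketch below (our own, in C) counts the cycles needed by an ideal pipeline in which a new computation enters every cycle:

#include <stdio.h>

/* For an ideal 'stages'-stage pipeline, the first result appears after
   'stages' cycles and one further result is delivered in every
   subsequent cycle.                                                   */
int pipeline_cycles(int stages, int computations) {
    return (computations == 0) ? 0 : stages + computations - 1;
}

int main(void) {
    /* 100 similar computations on a 5-stage pipeline take 104 cycles,
       instead of the 500 cycles needed without any overlap.           */
    printf("%d\n", pipeline_cycles(5, 100));
    return 0;
}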
Figure: Pipelined execution of a sequence of instructions over clock cycles tc1 to tc15.
However, several problems arise when a pipeline is made deeper. First, data
and control dependencies occur more frequently, which directly decreases
performance. Furthermore, as the number of pipeline stages increases, the
partitioning of the entire task becomes less balanced and clock skew grows. This
limits the possible increase of the clock frequency to a less than linear extent. As a
consequence, by increasing the depth of a pipeline, at first we expect an increase in
performance; however, after a maximum is reached, the performance falls.
The second aspect is the specification of the subtasks to be performed
in each of the pipeline stages. The specification of the subtasks can be done at
a number of levels of increasing detail as shown:
The next aspect of the basic layout of a pipeline is the layout of the stage
sequence, which concerns how the pipeline stages are used, as shown in the figure
below.
While processing an instruction, the pipeline stages are usually operated
successively one after the other. In some cases, however, a certain stage is
recycled, that is, used repeatedly, to accomplish the result. This often happens
while performing a multiplication or division. Recycling allows an effective use of
hardware resources, but impedes pipeline repetition rate.
Bypassing (data forwarding) is intended to reduce or eliminate pipeline
stalls due to RAW dependencies. There are two kinds of RAW dependencies:
define-use dependencies and load-use dependencies.
Define-use dependencies occur when an instruction uses the result of a
preceding operate instruction, such as an arithmetic or Boolean instruction, as a
source operand. Unless special arrangements are made, the result of the operate
instruction is written into the register file, or into the memory, and then it is fetched
from there as a source operand.
Dependency resolution
In recent processors different pipelines are declared for each of the major
instruction classes. There are usually separate pipelines for the processing of FX
and logical data, called the FX pipeline, for FP data, the FP pipeline, for loads and
stores, the L/S pipeline, and finally, for branches, the B pipeline. Obviously, for
the processing of additional data types further pipelines could be specified. We note
furthermore that recent processors often provide two or even more separate
pipelines for FX or FP data.
The detailed specification of a pipeline includes the statement of the pipeline
stages as well as the specification of the subtasks to be done in each of the stages.
Multiplicity of pipelines.
The last aspect concerns only multiple pipelines. In this case, instructions
may be completed out of order, since subsequent instructions may be executed by
different pipelines of different length. Consider, for instance, the case when a long
latency instruction, such as an FP instruction, is followed by short latency one, like
an FX or logical instruction, and both instructions are executed in separate
pipelines. Then, the simple FX or logical instruction is completed earlier than the
preceding one. However, out-of-order instruction completion would cause a
deviation from sequential execution, so an appropriate technique is needed to
preserve sequential consistency.
Basically, there are three methods to achieve this, as shown in the figure below.
Two of these are applied mostly when there are two pipelines, that is, an FX and an
FP pipeline, whereas the third method is used typically for more than two pipelines.
In the simplest case, we assume having a multifunctional pipeline for FX, L/S and
branch instructions as well as a distinct FP pipeline. Here, the FP pipeline is not
allowed to write back its results into the register file or memory autonomously.
Instead the writeback operation is performed under software control as a result of
an explicit instruction. For instance, this is how earlier FP coprocessors were
synchronized with the master FX pipeline.
Methods for preserving sequential consistency when multiple pipelines are writing
back results.
In the second method we again assume two pipelines operating in lockstep,
a master FX pipeline and an FP pipeline. Here, in-order completion of the instructions is
achieved by properly lengthening the shorter FX pipeline, by introducing unused
cycles (bubbles) into it, if necessary. A more relaxed variant of this method checks
for possible hazards and triggers a delay only if a hazard is likely.
The third method is reordering. It is usually applied if there are more than
two pipelines. With reordering, the pipelines are not allowed to write back their results
directly into the architectural registers or into memory. Instead a mechanism is
provided such that the results of the instructions are written back in program order,
despite being generated in an out-of-order fashion. Reordering takes place in a so-
called completion stage.
VLIW architecture and Superscalar processor
The basic structure of VLIW and superscalar processors consists of a number of EUs, each
capable of parallel operation on data fetched from a register file as shown below. In
both kinds of processors, multiple EUs execute instructions and write back results
into the register file simultaneously.
Branch problem
This means that in a pipeline each branch instruction gives rise to a number
of wasted cycles (called bubbles) unless appropriate branching techniques are
introduced.
Basic approaches to branch handling
There are three aspects which give rise to the basic approaches in branch
handling. These are:
Delayed branching
In the figure we move the add instruction of our program segment that
originally preceded the branch into the branch delay slot. With delayed branching
the processor executes the add instruction first, but the branch will only be effective
later. Thus, in this example, delayed branching preserves the original execution
sequence:
a. add
b. b
c. sub
Branch processing
Branch detection
Early microprocessors detect branches during the common instruction
decoding (master pipeline approach). The obvious drawback of this straightforward
approach is that it gives rise to long branch processing penalties.
In order to reduce taken penalties, up-to-date processors usually detect
branches earlier than during common instruction decoding (early branch detection).
There are three methods: in-parallel branch detection, look-ahead branch detection
and integrated instruction fetch and branch detection. In the following we introduce
these early branch detection schemes.
A number of processors detect and decode branches in parallel with the
‘common’ instruction decoding. We call this scheme in parallel branch detection.
A slightly more advanced branch detection method is to spot branches from
the instruction buffer in parallel with the common instruction decoding as before,
but also to look ahead into preceding buffer entries. We call this scheme look-
ahead branch detection.
The most advanced method of branch detection avoids explicit decoding such
as described above. Instead, branch detection is integrated into the instruction
fetch mechanism. This scheme is called integrated instruction fetch and
branch detection. For scalar processors its principle is as follows. The instruction
fetching mechanism is extended such that it is able to detect whether the next
instruction to be fetched is a branch or not. Each detected branch is guessed to be
taken and instead of, or in addition to, the next sequential instruction, the target
address of the branch or even the target instruction is also fetched in advance.
Evidently, in this scheme a 'common' decoding follows for conditional branches, and
if the guess proves wrong, a mechanism is initiated to correct it.
The branch prediction scheme used in a processor has a crucial impact on its
performance. Basically, a prediction may be either a fixed or a true prediction. In a
fixed prediction the same guess is always made, either ‘taken’ or ‘not-taken’.
There are two basic tasks to be performed: discard the results of the
speculative execution and resume execution of the alternative, that is, the correct
path. When there is more than one pending conditional branch, the matching
alternative path must be selected and pursued.
In recovery from a mispredicted taken path, as preparation for a possible
recovery the processor first has to save the address of the sequential continuation
before it starts execution of the guessed taken path.
The recovery procedure can be further shortened if already prefetched sequential
instructions are not discarded but saved for possible later use in the case of a
misprediction.
The other situation is when the sequential path has been incorrectly
predicted and executed. Here, in conjunction with the ‘not taken’ prediction, the
branch target address should be pre-calculated and saved to allow a recovery.
Evidently, this requires additional buffer space as well as additional cache access
bandwidth.
Taken conditional branches have a much higher frequency than not-taken
ones. Thus, the penalty for 'taken' conditional branches affects the overall branch
penalty much more strongly than the penalty for 'not taken' ones. For this reason, it
is crucial for processor performance to keep the taken penalty of unresolved
conditional branches as low as possible.
Branch target access scheme.
The branch penalty for ‘taken’ guesses depends heavily on how the branch
target path is accessed. Recent processors use one of four basic methods:
compute/fetch scheme, BTAC (Branch Target Address Cache) scheme, BTIC (Branch
Target Instruction Cache) scheme and the successor index in the I-cache scheme.
This scheme is the natural method of accessing branch targets. First, the
branch target address (BTA) is computed either by the pipeline or by a dedicated
adder. Then, the corresponding branch target instruction (BTI) is fetched. In recent
processors this means an access to the I-cache, whereas in early pipelined
processors without an I-cache, the memory is accessed instead.
This scheme employs an extra cache, called the branch target address cache
(BTAC), for speeding up access to branch targets. The BTAC contains pairs of
recently used branch addresses and branch target addresses and is accessed
associatively. When the actual instruction fetch address is a branch address, and
there is a corresponding entry in the BTAC, the branch target address is fetched
along with the branch instruction in the same cycle. This BTA is then used to access
the branch target instruction in the next cycle.
This scheme is only used occasionally, in cases when the taken penalty would
otherwise be intolerably high because of a long I-cache latency. The basic idea of the BTIC
scheme is to provide a small extra cache which delivers, for taken or predicted
taken branches, the branch target instruction or a specified number of BTIs, rather
than the BTA. Thus, otherwise unused pipeline cycles can be filled with target
instructions.
Multiway branching
Guarded execution
In guarded execution an instruction is extended with a condition (guard); the
guarded instruction takes effect only if its guard evaluates to true.
Basic block scheduling is the simplest but least effective code scheduling
technique. Here, only instructions within a basic block are eligible for reordering. As
a consequence, the achievable speed-up is limited by both true data and control
dependencies.
Most basic block schedulers for ILP-processors belong to the class of list
schedulers, such as those developed for the MIPS and SPARC processors.
List schedulers can be used in many contexts, such as in operational research
for scheduling tasks for assembly lines, in computing for scheduling tasks for
multiprocessors, for scheduling microcode for horizontally microprogrammed
machines or as in our case, for scheduling instructions for ILP-processors.
As far as List schedulers for basic block scheduling are concerned, the
scheduled items are instructions. As far as the number of scheduled instructions per
step is concerned, code schedulers for both pipelined and superscalar processors
deliver one scheduled instruction at a time.
Now let us turn to the main components of list schedulers. The selection rule
selects the set of eligible instructions for scheduling. Eligible instructions are
dependency-free, that is, they have no predecessors in the DDG and the hardware
resources they require are available.
The rule for choosing the ‘best schedule’ is used to look for the
instruction most likely to cause interlocks with the follow-on instructions. This rule is
typically a matter of heuristics.
Most schedulers implement this heuristic by searching for the critical
execution path in the DDG and choosing a particular node associated with the
critical path for scheduling. However, 'critical path' can be interpreted in a
number of different ways and can be selected on the basis of either priorities or
criteria.
In general, a priority-based scheduler calculates a priority value for each
eligible instruction according to the chosen heuristics. In the case of basic block list
schedulers for pipelined and superscalar processors, the ‘priority value’ of an
eligible node is usually understood as the length of the path measured from the
node under consideration to the end of the basic block. After calculating all priority
values, the best choice is made by choosing the node with the highest priority
value. Evidently, a tie-breaking rule is needed for the case when more than one
node has the same highest priority value.
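A minimal sketch of such a priority-based choice (our own, in C, assuming the DDG is given as an array of nodes with successor indices) takes the priority of a node to be the length of the longest path from that node to the end of the basic block:

#define MAX_SUCC 4

/* One DDG node: indices of its successor nodes in the same array.    */
typedef struct {
    int succ[MAX_SUCC];
    int nsucc;
} ddg_node_t;

/* Priority = length of the longest path from node i to the end of the
   basic block; the DDG of straight-line code is acyclic, so the
   recursion terminates.                                               */
int priority(const ddg_node_t nodes[], int i) {
    int best = 0;
    for (int s = 0; s < nodes[i].nsucc; s++) {
        int p = priority(nodes, nodes[i].succ[s]);
        if (p > best) best = p;
    }
    return best + 1;                 /* count the node itself */
}

/* Among the eligible (dependency-free) nodes, choose the one with the
   highest priority value; ties are broken by the lower index.         */
int best_choice(const ddg_node_t nodes[], const int eligible[], int n) {
    int best = eligible[0];
    for (int k = 1; k < n; k++)
        if (priority(nodes, eligible[k]) > priority(nodes, best))
            best = eligible[k];
    return best;
}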
In contrast, criteria-based schedulers apply a set of selected criteria in a
given order to the items to be scheduled. A 'best choice' is made by finding the first
item that meets the highest-ranking criterion. Here, of course, the scheduler
also has to provide an additional criterion or rule for tie-breaking.
Global scheduling
2. Loop scheduling
Loops are the most fundamental source of parallelism for ILP-processors.
Here, the regularity of the control structure is used to speed up computation.
Therefore, loop scheduling is a focal point of instruction schedulers which have been
developed for highly parallel ILP-processors, such as VLIWs.
There are two basic approaches for scheduling loops, loop unrolling and
software pipelining. The first is a straightforward, less effective approach compared
with the second. As far as their current use is concerned, loop unrolling is employed
mostly as a component of more sophisticated scheduling methods, such as software
pipelining or trace scheduling. Software pipelining is the standard approach for loop
scheduling.
Loop unrolling
The basic idea of loop unrolling is to repeat the loop body a number of
times and to omit superfluous inter-iteration code, such as decrementing the loop
count, testing for loop end and branching back conditionally between iterations.
Obviously, this will result in a reduced execution time. Loop unrolling can be
implemented quite easily when the number of iterations is already fixed at compile
time, which occurs often in ‘do’ and ‘for’ loops.
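A small sketch (our own, in C) shows a loop unrolled four times; the unrolled version executes the loop-end test and backward branch once per four elements instead of once per element, at the cost of longer code. The assumption that n is a multiple of four keeps the example short:

/* Original loop: one test and one backward branch per element.        */
void scale(double *x, double a, int n) {
    for (int i = 0; i < n; i++)
        x[i] = a * x[i];
}

/* The same loop unrolled four times (assuming n % 4 == 0): one test
   and one backward branch per four elements.                          */
void scale_unrolled(double *x, double a, int n) {
    for (int i = 0; i < n; i += 4) {
        x[i]     = a * x[i];
        x[i + 1] = a * x[i + 1];
        x[i + 2] = a * x[i + 2];
        x[i + 3] = a * x[i + 3];
    }
}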
Loop unrolling saves execution time, at the expense of code length, in much
the same way as code inlining or traditional macro expansion. Code inlining is one
of the standard compiler optimization techniques, applied to short, frequently
used subroutines. Code inlining means inserting the whole subroutine body each
time the subroutine is called ‘at the point of call’, instead of storing the subroutine
body separately and calling it from the main code if needed.
Software pipelining
The basic idea of software pipelining is that subsequent loop iterations are executed
as if they were stages of a hardware pipeline, as shown in the table below. Let us
consider cycles c+4, c+5 and c+6. These are the ones showing the real merit of
software pipelining. The crucial point is that in these cycles the available parallelism
between subsequent loop iterations is fully utilized.
Figure: Most parallel execution of the given loop on an ILP-processor with multiple pipelined execution units.
Figure: Structure of the software-pipelined code.
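The table itself is not reproduced here; the following sketch (our own, in C, with an assumed three-stage decomposition into load, multiply-add and store) shows the characteristic prologue/kernel/epilogue structure of software-pipelined code for a loop computing y[i] = a*x[i] + b. In the kernel, the store of iteration i-2, the multiply-add of iteration i-1 and the load of iteration i are all available for parallel issue, just as in a hardware pipeline:

void saxpb(double *y, const double *x, double a, double b, int n) {
    if (n < 2) {                         /* too short to pipeline */
        for (int i = 0; i < n; i++) y[i] = a * x[i] + b;
        return;
    }
    double t0, t1, r0;
    /* prologue: fill the software pipeline */
    t0 = x[0];                           /* load(0)   */
    r0 = a * t0 + b;                     /* mac(0)    */
    t1 = x[1];                           /* load(1)   */

    /* kernel: one result per iteration, three overlapped stages */
    for (int i = 2; i < n; i++) {
        y[i - 2] = r0;                   /* store(i-2) */
        r0 = a * t1 + b;                 /* mac(i-1)   */
        t1 = x[i];                       /* load(i)    */
    }

    /* epilogue: drain the software pipeline */
    y[n - 2] = r0;
    y[n - 1] = a * t1 + b;
}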
The simplest generalization of the single bus system towards a multiple bus
system is the 1-dimension multiple bus system. This approach leads to a typical
uniform memory access (UMA) machine where any processor can access any
memory unit through any of the buses. The employment of the 1-of-N arbiters is not
sufficient in such systems. Arbitration is a two-stage process in 1-dimension
multiple bus systems. First, the 1-of-n arbiters (one per memory unit) can resolve
the conflict when several processors require exclusive access to the same shared
memory unit. After the first stage at most m (out of n) processors can obtain access to one
of the memory units. However, when the number of buses (b) is less than the
number of memory units (m), a second stage of arbitration is needed where an
additional b-of-m arbiter is employed to allocate buses to those processors that
successfully obtained access to a memory unit.
A further generalization of the 1-dimension multiple buses is the introduction
of the second and third dimensions. In these systems, multiple buses compose a
grid interconnection network. Each processor node is connected to a row bus and to
a column bus. Processors along a row or column constitute a conventional single
bus multiprocessor. The memory can be distributed in several ways. The most
traditional approach is to attach memory units to each bus. The main problem of
these architectures is the maintenance of cache coherency.
A third way of introducing several buses into the multiprocessor is the
cluster architecture, which represents a NUMA machine concept. The main idea of
cluster architectures is that single bus multiprocessors, called clusters, are
connected by a higher-level bus. Each cluster has its own local memory. The access
time of a local cluster memory is much less than the access time of a remote cluster
memory. Keeping the code and stacks in the cluster memory can significantly
reduce the need to access remote cluster memory. However, it turned out that
without cache support this structure cannot avoid traffic jams on higher-level buses.
Another natural generalization of the single bus system is the hierarchical
bus system where single bus ‘supernodes’ are connected to a higher-level bus via
a higher-level cache or ‘supercache’. By recursively applying these construction
techniques, arbitrarily large networks can be built. The main advantage of this
approach is that each bus level can work as a single bus system. However, it raises
the problem of maintaining cache consistency in a hierarchical system.
Switching networks
Crossbar
The crossbar is the most powerful network type since it provides
simultaneous access among all the inputs and outputs of the network, provided that
all the requested outputs are different. This great flexibility and parallel capability
stem from the large number of individual switches which are associated with any
pair of inputs and outputs of the network as shown below. All the switches must
contain an arbiter logic to allocate the memory block in the case of conflicting
requests and a bus-bus connection unit to enable connection between the buses of
the winning processor and the memory buses. It means that both the wiring and the
logic complexity of the crossbar is dramatically increased compared with the single
bus interconnection.
Multistage networks
Multistage networks represent a compromise between the single bus and the
crossbar switch interconnections from the point of view of implementation
complexity, cost, connectivity and bandwidth. A multistage network consists of
alternating stages of links and switches. Many kinds of multistage networks have
been proposed and built so far. They can be categorized according to the number of
stages, the number of switches at a stage, the topology of links connecting
subsequent stages, the type of switches employed at the stages and the possible
operation modes.
The simplest multistage network is the omega network, which has log2N
stages with N/2 switches at each stage. The topology of links connecting
subsequent stages is a perfect shuffle, named after shuffling a deck of cards such
that the top half of the deck is perfectly interleaved with the bottom half. All the
switches have two input and two output links. Four different switch positions are
possible: upper broadcast, lower broadcast, straight through, and switch. Since the
switches can be independently configured, any single input can be connected to
any output. In inverse omega networks the switch inputs and outputs are connected in reverse
order. These networks are suitable for broadcasting information from a single input
to all the outputs, thanks to the upper broadcast and lower broadcast switch positions.
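The perfect shuffle connection itself is easy to compute: for N = 2^k inputs, the index of an input is rotated one bit position to the left, exactly like interleaving the two halves of a deck of cards. The helper below is our own sketch, not taken from the notes:

/* Perfect shuffle for N = 2^k lines: rotate the k-bit index one
   position to the left. For N = 8 (k = 3), input 3 (011) is connected
   to output 6 (110).                                                  */
unsigned perfect_shuffle(unsigned i, unsigned k) {
    unsigned msb = (i >> (k - 1)) & 1u;           /* top bit of index  */
    return ((i << 1) | msb) & ((1u << k) - 1u);   /* rotate left by 1  */
}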
Cache coherence
Cache memories are introduced into computers in order to bring data closer
to the processor and hence to reduce memory latency. Caches are widely accepted
and employed in uniprocessor systems. However, in multiprocessor machines where
several processors require a copy of the same memory block, the maintenance of
consistency among these copies raises the so-called cache coherence problem
which has three causes:
• sharing of writable data
• process migration
• I/O activity
Assume that data structure D is a shared writable structure and processes on
processors Pi and Pj read the value of D. As a result, D is loaded into caches Ci and
Cj and hence both caches contain a consistent value of D. If a process on processor
Pi updates D to D', cache Ci will contain D' while the other cache Cj still contains the
original D value; that is, the copies of D become inconsistent. A read from Cj will not
return the latest value of D.
Assume now that D is a private writable data structure owned by process A
running on processor Pi. If A writes D into D', Ci will contain D' while the main
memory still contains the original D value. If afterwards A migrates to processor Pj
(j<>i) and performs a read operation on the data structure, the original value of D
will be fetched from the main memory into Cj instead of the updated value D'.
Inconsistency from I/O activity can arise in the case of any writable data
structure if the I/O processor is working directly from the main memory. Obviously,
if the data structure D is written into D’ by any processor, the I/O system cannot see
this change of value for D since the main memory contains the stale value of D.
From the point of view of cache coherence, data structures can be divided
into three classes:
• Read-only data structures never cause any cache coherence
problem. They can be replicated and placed in any number of cache
memories without any problem.
• Shared writable data structures are the main source of cache
coherence problems.
• Private writable data structures pose cache coherence problems
only in the case of process migration.
There are several techniques to maintain cache coherence for the critical
case, that is, shared writable data structures. The applied methods can be divided
into two classes:
• Hardware-based protocols
• Software-based schemes.
Hardware-based protocols
Hardware-based protocols provide general solutions to the problems of cache
coherence without any restrictions on the cachability of data. The price of this
approach is that shared memory systems must be extended with sophisticated
hardware mechanisms to support cache coherence. Hardware-based protocols can
be classified according to their:
• memory update policy
• cache coherence policy
• interconnection scheme.
Two types of memory update policy are applied in multiprocessors. The
write-through policy maintains consistency between the main memory and
caches; that is, when a block is updated in one of the caches it is immediately
updated in memory, too. The write-back policy permits the memory to be
temporarily inconsistent with the most recently updated cached block. Memory is
updated eventually when the modified block in the cache is replaced or invalidated.
The application of the write-through policy leads to unnecessary traffic on the
interconnection network in the case of private data and infrequently used shared
data. On the other hand, it is more reliable than the write-back scheme since error
detection and recovery features are available only at the main memory. The write-
back policy avoids useless interconnection traffic; however, it requires more
complex cache controllers since read references to memory locations that have not
yet been updated should be redirected to the appropriate cache.
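A simplified sketch of the two memory update policies (our own, in C; the structures and function names are illustrative only) makes the contrast explicit: write-through updates memory on every write, while write-back only marks the block dirty and defers the memory update to replacement time:

#include <stdbool.h>
#include <string.h>

#define BLOCK_WORDS 8

typedef struct {
    int  data[BLOCK_WORDS];
    bool dirty;                        /* used by write-back only      */
} cache_block_t;

/* Write-through: the memory copy is updated on every write.          */
void write_through(cache_block_t *blk, int *mem_block, int word, int value) {
    blk->data[word] = value;
    mem_block[word] = value;           /* immediate memory update      */
}

/* Write-back: only the cached copy is updated; memory stays
   temporarily inconsistent until the block is replaced or invalidated. */
void write_back(cache_block_t *blk, int word, int value) {
    blk->data[word] = value;
    blk->dirty = true;
}

void replace_block(cache_block_t *blk, int *mem_block) {
    if (blk->dirty) {                  /* postponed memory update      */
        memcpy(mem_block, blk->data, sizeof blk->data);
        blk->dirty = false;
    }
}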
The write-through policy is a greedy policy because it updates the memory
copy immediately, while the write-back policy is a lazy one with postponed memory
update. Similarly, a greedy and a lazy cache coherence policy have been introduced
for updating the cache copies of a data structure:
• write-update policy (a greedy policy)
• write-invalidate policy (a lazy policy)
The key idea of the write-update policy is that whenever a processor
updates a cached data structure, it immediately updates all the other cached copies
as well. Whether the shared memory copy is updated depends on the memory
update policy.
In the case of the write-invalidate policy, the updated cache block is not
sent immediately to the other caches; instead a simple invalidate command is sent to all
other cached copies and to the original version in the shared memory so that they
become invalid. If later another processor wants to read the data structure, it is
provided by the updating processor.
Hardware-based protocols can be further classified into three basic classes
depending on the nature of the interconnection network applied in the shared
memory system. If the network efficiently supports broadcasting, the so-called
snoopy cache protocol can be advantageously exploited. This scheme is typically
used in single bus-based shared memory systems where consistency commands are
broadcast via the bus and each cache ‘snoops’ on the bus for incoming consistency
commands.
Large interconnection networks cannot support broadcasting efficiently and
therefore a mechanism is needed that can directly forward consistency commands
to those caches that contain a copy of the updated data structure. For this purpose
a directory must be maintained for each block of the shared memory to administer
the actual location of blocks in the possible caches. This approach is called the
directory scheme.
The third approach tries to avoid the application of the costly directory
scheme but still provide high scalability. It proposes multiple-bus networks with the
application of hierarchical cache coherence protocols that are generalized or
extended versions of the single bus-based snoopy cache protocol.
Design space of hardware-based cache coherence protocols.
The figure shows the state transition diagram of the example protocol, where P-Read and
P-Write are commands initiated by the associated processor, while Read-Blk, Write-Blk
and Update-Blk are consistency commands arriving from the bus, initiated
by other caches. The state transition diagram defines how the cache controller
should work when a request is given by the associated processor or by other caches
through the bus. For example, when a Read-Blk command arrives at a cache block
in a Dirty state, the cache controller should modify the state of the block to Shared.
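A minimal sketch of such a controller (our own, in C): only the transition named above, a Read-Blk command arriving at a block in the Dirty state moving it to Shared, is taken from the notes; the remaining transitions are assumptions in the spirit of a write-back, write-invalidate snoopy protocol:

typedef enum { INVALID, SHARED, DIRTY } block_state_t;
typedef enum { P_READ, P_WRITE, READ_BLK, WRITE_BLK, UPDATE_BLK } command_t;

block_state_t next_state(block_state_t s, command_t cmd) {
    switch (cmd) {
    case READ_BLK:                          /* another cache reads the block  */
        return (s == DIRTY) ? SHARED : s;   /* transition given in the notes  */
    case WRITE_BLK:                         /* another cache writes the block */
    case UPDATE_BLK:
        return INVALID;                     /* assumed: invalidate local copy */
    case P_READ:                            /* read by the local processor    */
        return (s == INVALID) ? SHARED : s; /* assumed                        */
    case P_WRITE:                           /* write by the local processor   */
        return DIRTY;                       /* assumed                        */
    }
    return s;
}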
Software-based protocols
Although the UMA architecture is not suitable for building scalable parallel
computers, it is excellent for constructing small-size single bus multiprocessors.
Two such machines are the Encore Multimax of Encore Computer Corporation
representing the technology of the late 1980s and the Power Challenge of Silicon
Graphics Computing Systems representing the technology of the 1990s.
Encore Multimax
The most advanced feature of the Encore Multimax, when it appeared on the
market, was the Nanobus which was one of the first commercial applications of a
pended bus. Unlike in many locked buses, the address and data buses are
separated in the Nanobus. The address bus initiates both memory read and memory
write transfers on the Nanobus. In the case of write transactions the data bus is
used together with the address bus, while in a read transaction the data bus can be
used by a memory unit to transfer the result of a previous read access.
Separate but co-operating arbiter logics are employed to allocate the address
and data bus among the 20 processors and 16 memory banks. A centralized arbiter
is used to realize a fair round-robin arbitration policy for the address bus. However,
the work of the centralized address bus arbiter can be influenced by distributed
access control mechanisms under certain conditions. If a processor or memory
controller cannot gain control over the address bus for a certain number of bus
cycles, they can use special bus selection lines to force the central arbiter to deny
access to the address bus for other bus masters.
Another advantageous feature of the Encore Multimax is the application of
pipelining both on the processor board and on the memory boards. Pipelining
enables the processor to start a new bus cycle before finishing the previous one,
and for the memory controller to receive a new memory access request before
completing the servicing of the previous one. Pipelining is implemented by applying
buffer registers on both the processor board and the memory board.
Power Challenge
The heart of the Power Challenge multiprocessor is the POWERpath-2 split-
transaction shared bus. The associative memory used for splitting read transactions
is constructed from eight so-called read resources, that is, up to eight reads can be
outstanding at any time. The POWERpath-2 bus was designed according to RISC
philosophy. The types and variations of bus transactions are small, and each
transaction requires exactly the same five bus cycles: arbitration, resolution,
address, decode, acknowledge. These five cycles are executed synchronously by
each bus controller.
Hector
Hector is a hierarchical NUMA machine consisting of stations connected by a
hierarchy of ring networks. Stations are symmetric multiprocessors where the
processing modules are connected by a single bus. Nodes comprise three main
units: a processor/cache unit, a memory unit and the station bus interface which
connects the otherwise separated processor and memory buses.
Cray T3D
Cray T3D is the most recent NUMA machine that was designed with the
intention of providing a highly scalable parallel supercomputer that can incorporate
both the shared memory and the message-passing programming paradigms. As in
other NUMA machines, the shared memory is distributed among the processing
elements in order to avoid the memory access bottleneck and there is no hardware
support for cache coherency. However, a special software package and
programming model, called the CRAFT, manages coherence and guarantees the
integrity of the data.
All the CC-NUMA machines share the common goal of building a scalable
shared memory multiprocessor. The main difference among them is in the way the
memory and cache coherence mechanisms are distributed among the processing
nodes. Another distinguishing design issue is the selection of the interconnection
network among the nodes. They demonstrate a progress from bus-based networks
towards a more general interconnection network and from the snoopy cache
coherency protocol towards a directory scheme.
The Wisconsin multicube architecture is the closest generalization of a single
bus-based multiprocessor. It completely relies on the snoopy cache protocol but in a
hierarchical way. The nodes of the Stanford Dash architecture are more complex;
they are realized as single bus-based multiprocessors called clusters. The Dash
architecture also combines the snoopy cache protocol and the directory scheme. A
snooping scheme ensures the consistency of caches inside the clusters, while the
directory scheme maintains consistency across clusters. In the Dash, the directory
protocol is independent of the type of interconnection network and hence, any of
the low-latency networks that were originally developed for multicomputers such as
the mesh can be employed. The Stanford FLASH architecture is a further
development of the Dash machine by the same research group. The main goal of
the FLASH design was the efficient integration of cache-coherent shared memory
with high-performance message passing. Since the cluster concept of the Dash is
replaced with one-processor nodes, FLASH applies only a directory scheme for
maintaining cache coherence.
Wisconsin multicube
The Wisconsin multicube architecture employs row and column buses
forming a two-dimensional grid architecture. The three-dimensional generalization
will result in a cube architecture. The main memory is distributed along the column
buses, and each data block of memory has a home column. All rows of processors
work similarly to single bus-based multiprocessors. Each processing element
contains a processor, a conventional cache memory to reduce memory latency and
a snoopy cache that monitors a row bus and a column bus in order to realize a
write-back, write-invalidate cache coherence protocol.
Stanford Dash
The nodes of the Dash architecture are single bus-based multiprocessors,
called clusters. Each cluster comprises four processors, an I/O interface, first- and
second-level caches, a partition of the main memory, and directory and
intercluster interface logic. Notice that the memory is distributed among the nodes
of the Dash, reducing the bandwidth demands on the global interconnect. The
second-level writeback caches are responsible for maintaining cache coherency
inside the cluster by applying a snoopy cache protocol. The directory memory
realizes a cache coherent directory scheme across the clusters. The interconnection
network can be any low-latency direct network developed for message-passing
architectures.
Stanford FLASH
The main design issue in the Stanford FLASH project is the efficient
combination of directory-based cache coherent shared memory architectures and
state-of-the-art message-passing architectures in order to reduce the high hardware
overhead of distributed shared memory machines and the software overhead of
multicomputers. The FLASH node comprises a high-performance commodity
microprocessor with its caches, a portion of the main memory, and the MAGIC chip.
The heart of the FLASH design is the MAGIC chip which integrates the
memory controller, network interface, programmable protocol processor and I/O
controller. The MAGIC chip contains an embedded processor to provide flexibility for
various cache coherence and message-passing protocols.
Convex Exemplar
Convex was the first computer manufacturer to commercialize a CC-NUMA
machine, called the SPP1000. Here SPP stands for Scalable Parallel Processor and,
indeed, the objective of the SPP Exemplar series is to create a family of high-
performance computers where the number of processors can easily range from 10
to 1000 and the peak performance would reach the TeraFLOPS range. The first member of
the Exemplar series, the SPP1000, can be upgraded to 128 processors.
The nodes of the SPP1000 are symmetric multiprocessors, called
hypernodes. Each hypernode comprises four functional blocks and an I/O
subsystem. Each functional block consists of two CPUs sharing a single CPU agent,
and a memory unit holding hypernode-private memory data, global memory data
and network cache data. The four memories in a hypernode are interleaved,
providing higher bandwidth and less contention than accessing a single memory.
Interleaving is a scheme whereby sequential memory references are spread across the
four memories on a round-robin basis.
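A generic sketch of four-way interleaving (our own illustration, not the documented SPP1000 address mapping): consecutive block addresses are assigned to the four memories in turn, so sequential references are spread over all of them:

/* Round-robin mapping of block addresses onto four memory units.   */
unsigned memory_unit(unsigned block_address) {
    return block_address % 4;          /* which of the four memories */
}

unsigned offset_in_unit(unsigned block_address) {
    return block_address / 4;          /* location inside that unit  */
}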