Computer
Architecture
Design and performance
Barry Wilkinson
Department of Computer Science
University of North Carolina, Charlotte
Prentice Hall
New York London Toronto Sydney Tokyo Singapore
First published 1991 by
Prentice Hall International (UK) Ltd
66 Wood Lane End, Hemel Hempstead
Hertfordshire HP2 4RG
A division of
Simon & Schuster International Group

© Prentice Hall International (UK) Ltd, 1991

All rights reserved. No part of this publication may be
reproduced, stored in a retrieval system, or transmitted,
in any form, or by any means, electronic, mechanical,
photocopying, recording or otherwise, without prior
permission, in writing, from the publisher.
For permission within the United States of America
contact Prentice Hall Inc., Englewood Cliffs, NJ 07632.

Typeset in 10/12pt Times with Courier
Printed in Great Britain at
the University Press, Cambridge
Library of Congress Cataloging-in-Publication Data

Wilkinson, Barry
Computer architecture : design and performance / by Barry Wilkinson
p. cm.
Includes bibliographical references and index.
ISBN 0-13-173899-2. ISBN 0-13-173007-7 (pbk)
1. Computer architecture. I. Title
QA76.9.A73W54 1991
004.2'2 dc20 90-2953

Wilkinson, Barry 1947-
Computer architecture: design and performance
1. High performance computer systems
I. Title
004.22

ISBN 0-13-173809-2
ISBN 0-13-173907-7 pbk
To my wife, Wendy
and my daughter, Johanna

Contents
Preface

Part I Computer design techniques

1 Computer systems
1.1 The stored program computer
1.1.1 Concept
1.1.2 Improvements in performance
1.2 Microprocessor systems
1.2.1 Development
1.2.2 Microprocessor architecture
1.3 Architectural developments
1.3.1 General
1.3.2 Processor functions
1.3.3 Memory hierarchy
1.3.4 Processor-memory interface
1.3.5 Multiple processor systems
1.3.6 Performance and cost

2 Memory management
2.1 Memory management schemes
2.2 Paging
2.2.1 General
2.2.2 Address translation
2.2.3 Translation look-aside buffers
2.2.4 Page size
2.2.5 Multilevel page mapping
2.3 Replacement algorithms
2.3.1 General
2.3.2 Random replacement algorithm
2.3.3 First-in first-out replacement algorithm
2.3.4 Clock replacement algorithm
2.3.5 Least recently used replacement algorithm
2.3.6 Working set replacement algorithm
2.3.7 Performance and cost
2.4 Segmentation
2.4.1 General
2.4.2 Paged segmentation
2.4.3 8086/286/386 segmentation
Problems
3 Cache memory systems
3.1 Cache memory
3.1.1 Operation
3.1.2 Hit ratio
3.2 Cache memory organizations
3.2.1 Direct mapping
3.2.2 Fully associative mapping
3.2.3 Set-associative mapping
3.2.4 Sector mapping
3.3 Fetch and write mechanisms
3.3.1 Fetch policy
3.3.2 Write operations
3.3.3 Write-through mechanism
3.3.4 Write-back mechanism
3.4 Replacement policy
3.4.1 Objectives and constraints
3.4.2 Random replacement algorithm
3.4.3 First-in first-out replacement algorithm
3.4.4 Least recently used algorithm for a cache
3.5 Cache performance
3.6 Virtual memory systems with cache memory
3.6.1 Addressing cache with real addresses
3.6.2 Addressing cache with virtual addresses
3.6.3 Access time
3.7 Disk caches
3.8 Caches in multiprocessor systems
Problems
4 Pipelined systems
4.1 Overlap and pipelining
4.1.1 Technique
4.1.2 Pipeline data transfer
4.1.3 Performance and cost
4.2 Instruction overlap and pipelines
4.2.1 Instruction fetch/execute overlap
4.2.2 Branch instructions
4.2.3 Data dependencies
4.2.4 Internal forwarding
4.2.5 Multistreaming
4.3 Arithmetic processing pipelines
4.3.1 General
4.3.2 Fixed point arithmetic pipelines
4.3.3 Floating point arithmetic pipelines
4.4 Logical design of pipelines
4.4.1 Reservation tables
4.4.2 Pipeline scheduling and control
4.5 Pipelining in vector computers
Problems
5 Reduced instruction set computers
5.1 Complex instruction set computers (CISCs)
5.1.1 Characteristics
5.1.2 Instruction usage and encoding
5.2 Reduced instruction set computers (RISCs)
5.2.1 Design philosophy
5.2.2 RISC characteristics
5.3 RISC examples
5.3.1 IBM 801
5.3.2 Early university research prototypes - RISC I/II and MIPS
5.3.3 A commercial RISC - MC88100
5.3.4 The Inmos transputer
5.4 Concluding comments on RISCs
Problems
Part II Shared memory multiprocessor systems 169

6 Multiprocessor systems and programming 171
6.1 General 171
6.2 Multiprocessor classification 173
6.2.1 Flynn's classification 173
6.2.2 Other classifications 175
6.3 Array computers 175
6.3.1 General architecture 175
6.3.2 Features of some array computers 177
6.3.3 Bit-organized array computers 180
6.4 General purpose (MIMD) multiprocessor systems 182
6.4.1 Architectures 182
6.4.2 Potential for increased speed 188
6.5 Programming multiprocessor systems 193
6.5.1 Concurrent processes 193
6.5.2 Explicit parallelism 198
6.5.3 Implicit parallelism 199
6.6 Mechanisms for handling concurrent processes 203
6.6.1 Critical sections 203
6.6.2 Locks 203
6.6.3 Semaphores 207
Problems 210

7 Single bus multiprocessor systems 213
7.1 Sharing a bus 213
7.1.1 General 213
7.1.2 Bus request and grant signals 215
7.1.3 Multiple bus requests 216
7.2 Priority schemes 218
7.2.1 Parallel priority schemes 218
7.2.2 Serial priority schemes 227
7.2.3 Additional mechanisms in serial and parallel priority schemes 234
7.2.4 Polling schemes 235
7.3 Performance analysis 237
7.3.1 Bandwidth and execution time 237
7.3.2 Access time 240
7.4 System and local buses
7.5 Coprocessors 243
7.5.1 Arithmetic coprocessors 243
7.5.2 Input/output and other coprocessors 247
Problems 248
8 Interconnection networks 250
8.1 Multiple bus multiprocessor systems 250
8.2 Cross-bar switch multiprocessor systems 252
8.2.1 Architecture 252
8.2.2 Modes of operation and examples 253
8.3 Bandwidth analysis 256
8.3.1 Methods and assumptions 256
8.3.2 Bandwidth of cross-bar switch
8.3.3 Bandwidth of multiple bus systems 260
8.4 Dynamic interconnection networks 262
8.4.1 General 262
8.4.2 Single stage networks 263
8.4.3 Multistage networks 263
8.4.4 Bandwidth of multistage networks 270
8.4.5 Hot spots
8.5 Overlapping connectivity networks 275
8.5.1 Overlapping cross-bar switch networks 276
8.5.2 Overlapping multiple bus networks 279
8.6 Static interconnection networks 282
8.6.1 General 282
8.6.2 Static interconnections 282
8.6.3 Limited static interconnections 282
8.6.4 Evaluation of static networks 287
Problems 290
Part III Multiprocessor systems without shared memory 293

9 Message-passing multiprocessor systems 295
9.1 General 295
9.1.1 Architecture 295
9.1.2 Communication paths 298
9.2 Programming 301
9.2.1 Message-passing constructs and routines 301
9.2.2 Synchronization and process structure 308
9.3 Message-passing system examples 308
9.3.1 Cosmic Cube 308
9.3.2 Intel iPSC system 309
9.4 Transputer 311
9.4.1 Philosophy 311
9.4.2 Processor architecture 312
9.5 Occam 314
9.5.1 Structure 314
9.5.2 Data types 315
9.5.3 Data transfer statements 316
9.5.4 Sequential, parallel and alternative processes 317
9.5.5 Repetitive processes 320
9.5.6 Conditional processes 321
9.5.7 Replicators 323
9.5.8 Other features
Problems 325

10 Multiprocessor systems using the dataflow mechanism 329
10.1 General 329
10.2 Dataflow computational model 330
10.3 Dataflow systems 334
10.3.1 Static dataflow 334
10.3.2 Dynamic dataflow 337
10.3.3 VLSI dataflow structures 342
10.3.4 Dataflow languages 344
10.4 Macrodataflow 349
10.4.1 General 349
10.4.2 Macrodataflow architectures 350
10.5 Summary and other directions 353
Problems 354

References and further reading 357

Index 366

Preface
Although computer systems employ a range of performance-improving techniques,
intense effort to improve present performance and to develop completely new types
of computer systems with this improved performance continues. Many design
techniques involve the use of parallelism, in which more than one operation is
performed simultaneously. Parallelism can be achieved by using multiple functional
units at various levels within the computer system. This book is concerned with
design techniques to improve the performance of computer systems, and mostly with
those techniques involving the use of parallelism.
The book is divided into three parts. In Part I, the fundamental methods to improve the performance of computer systems are discussed; in Part II, multiprocessor systems using shared memory are examined in detail and in Part III, computer systems not using shared memory are examined; these are often suitable for VLSI fabrication. Dividing the book into parts consisting of closely related groups of chapters helps delineate the subject matter.
Chapter 1 begins with an introduction to computer systems, microprocessor systems and the scope for improved performance. The chapter introduces the topics dealt with in detail in the subsequent chapters, in particular, parallelism within the processor, parallelism in the memory system, management of the memory for improved performance and multiprocessor systems. Chapters 2 and 3 concentrate upon memory management - Chapter 2 on main memory/secondary memory management and Chapter 3 on processor/high speed buffer (cache) memory management. The importance of cache memory has resulted in a full chapter on the subject, rather than a small section combined with main memory/secondary memory as almost always found elsewhere. Similarly, Chapter 4 deals exclusively with pipelining as applied within a processor, this being the basic technique for parallelism within a processor. Scope for overall improved performance exists when choosing the actual instructions to implement in the instruction set. In Chapter 5, the concept of the so-called reduced instruction set computer (RISC), which has a very limited number of instructions and is used predominantly for register-to-register operations, is discussed.
Chapter 6, the first chapter in Part II, introduces the design of shared memory multiprocessor systems, including a section on programming shared memory multiprocessor systems. Chapter 7 concentrates upon the design of a single bus multiprocessor system and its variant (system/local bus systems); the bus arbitration logic is given substantial treatment. Chapter 8 considers single stage and multistage interconnection networks for linking together processors and memory in a shared memory multiprocessor system. This chapter presents bandwidth analysis of cross-bar switch, multiple bus and multistage networks, including overlapping connectivity networks.
Chapter 9, the first chapter in Part III, presents multiprocessor systems having local memory only. Message-passing concepts and architectures are described and the transputer is outlined, together with its language, Occam. Chapter 10 is devoted to the dataflow technique, used in a variety of applications. Dataflow languages are presented and a short summary is given at the end of the chapter.
The text can serve as a course text for senior level/graduate computer science, computer engineering or electrical engineering courses in computer architecture and multiprocessor system design. The text should also appeal to design engineers working on 16-/32-bit microprocessor and multiprocessor applications. The material presented is a natural extension to material in introductory computer organization/computer architecture courses, and the book can be used in a variety of ways. Material from Chapters 1 to 6 could be used for a senior computer architecture course, whereas for a course on multiprocessor systems, Chapters 6 to 10 could be studied in detail. Alternatively, for a computer architecture course with greater scope, material could be selected from all or most chapters, though generally from the first parts of sections. It is assumed that the reader has a basic knowledge of logic design, computer organization and computer architecture. Exposure to computer programming languages, both high level programming languages and low level microprocessor assembly languages, is also assumed.
I would like to record my appreciation to Andrew Binnie of Prentice Hall, who helped me start the project, and to Helen Martin, also of Prentice Hall, for her support throughout the preparation of the manuscript. Special thanks are extended to my students in the graduate courses CPGR 6182, CSCI 5041 and CSCI 5080, at the University of North Carolina, Charlotte, who, between 1988 and 1990, helped me "classroom-test" the material; this process substantially improved the manuscript. I should also like to thank two anonymous reviewers who made constructive and helpful comments.
Barry Wilkinson
University of North Carolina
Charlotte

PART I
Computer design techniques

CHAPTER 1
Computer systems
In this chapter, the basic operation of the traditional stored program digital computer and microprocessor implementation are reviewed. The limitations of the single processor computer system are outlined and methods to improve the performance are suggested. A general introduction to one of the fundamental techniques of increasing performance - the introduction of separate functional units operating concurrently within the system - is also given.
1.1 The stored program computer
1.1.1 Concept
The computer system in which operations are encoded in binary, stored in a memory and performed in a defined sequence is known as a stored program computer. Most computer systems presently available are stored program computers. The concept of a computer which executes a sequence of steps to perform a particular computation can be traced back over 100 years to the mechanical decimal computing machines proposed and partially constructed by Charles Babbage. Babbage's Analytical Engine of 1834 contained program and data input (punched cards), memory (mechanical), a central processing unit (mechanical with decimal arithmetic) and output devices (printed output or punched cards) - all the key features of a modern computer system. However, a complete, large scale working machine could not be finished with the available mechanical technology and Babbage's work seems to have been largely ignored for 100 years, until electronic circuits, which were developed in the mid-1940s, made the concept viable.
The true binary programmable electronic computers began to be developed by several groups in the mid-1940s, notably von Neumann and his colleagues in the United States; stored program computers are often called von Neumann computers after his work. (Some pioneering work was done by Zuse in Germany during the 1930s and 1940s, but this work was not widely known at the time.) During the 1940s, immense development of the stored program computer took place and the basis of complex modern computing systems was created. However, there are alternative computing structures with stored instructions which are not executed in a sequence related to the stored sequence (e.g. dataflow computers, which are described in Chapter 10) or which may not even have instructions stored in memory at all (e.g. neural computers).
The basic von Neumann stored program computer has:

1. A memory used for holding both instructions and the data required by those instructions.
2. A control unit for fetching the instructions from memory.
3. An arithmetic processor for performing the specified operations.
4. Input/output mechanisms and peripheral devices for transferring data to and from the system.
The control unit and the arithmetic processor of a stored program computer are normally combined into a central processing unit (CPU), which results in the general arrangement shown in Figure 1.1. Binary representation is used throughout for the number representation and arithmetic, and corresponding Boolean values are used for logical operations and devices. Thus, only two voltages or states are needed to represent each digit (0 or 1). Multiple valued representation and logic have been, and are still being, investigated.

[Figure 1.1 Stored program digital computer: input and output devices connected through input/output interfaces to the central processing unit (CPU) and memory]

The instructions being executed (or about to be executed) and their associated data are held in the main memory. This is organized such that each binary word is stored in a location identified by a number called an address. Memory addresses are allocated in strict sequence, with consecutive memory locations given consecutive
addresses. Main memory must access individual storage locations in any order and at very high speed; such memory is known as random access memory (RAM) and is essential for the main memory of the system.

There is usually additional memory, known as secondary memory or backing store, provided to extend the capacity of the memory system more economically than when main memory alone is used. Main memory usually consists of semiconductor memory and is more expensive per bit than secondary memory, which usually consists of magnetic memory. However, magnetic secondary memory is not capable of providing the required high speed of data transfer, nor can it locate individual storage locations in a random order at high speed (i.e. it is not truly random access memory).
Using the same memory for data and instructions is a key feature of the von Neumann stored program computer. However, having data memory and program memory separated, with separate transfer paths between the memory and the processor, is possible. This scheme is occasionally called the Harvard architecture. The Harvard architecture may simplify memory read/write mechanisms (see Chapter 3), particularly as programs are normally only read during execution, while data might be read or altered. Also, data and unrelated instructions can be brought into the processor simultaneously with separate memories. However, using one memory to hold both the program and the associated data gives more efficient use of memory, and it is usual for the bulk of the main memory in a computer system to hold both. The early idea that stored instructions could be altered during execution was quickly abandoned with the introduction of other methods of modifying instruction execution.
The (central) processor has a number of internal registers for holding specific operands used in the computation, other numbers, addresses and control information. The exact allocation of registers is dependent upon the design of the processor. However, certain registers are always present. The program counter (PC), also called the instruction pointer (IP), is an internal register holding the address of the next instruction to be executed. The contents of the PC are usually incremented each time an instruction word has been read from memory in preparation for the next instruction word, which is often in the next location. A stack pointer register holds the address of the "top" location of the stack. The stack is a set of locations, reserved in memory, which holds return addresses and other parameters of subroutines.

A set of general purpose registers or sets of data registers and address registers are usually provided (registers holding data operands and addresses pointing to memory locations). In many instances these registers can be accessed more quickly than main memory locations and hence can achieve a higher computational speed.
The binary encoded instructions are known as machine instructions. The operations specified in the machine instructions are normally reduced to simple operations, such as arithmetic operations, to provide the greatest flexibility. Arithmetic and other simple operations operate on one or two operands, and produce a numeric result. More complex operations are created from a sequence of simple instructions by the user. From a fixed set of machine instructions available in the computer (the instruction set) the user selects instructions to perform a particular computation.
The list of instructions selected is called a computer program. The selection is done by a programmer. The program is stored in the memory and, when the system is ready, each machine instruction is read from (main) memory and executed.

Each machine instruction needs to specify the operation to be performed, e.g. addition, subtraction, etc. The operands also need to be specified either explicitly in the instruction or implicitly by the operation. Often, each operand is specified in the instruction by giving the address of the location holding it. This results in a general instruction format having three addresses:
1. Address of the first operand.
2. Address of the second operand.
3. Storage address for the result of the operation.
A further address could be included, that of the next instruction to be executed. This is the four-address instruction format. The EDVAC computer, which was developed in the 1940s, used a four-address instruction format (Hayes, 1988) and this format has been retained in some microprogrammed control units, but the fourth address is always eliminated for machine instructions. This results in a three-address instruction format by arranging that the next instruction to be executed is immediately following the current instruction in memory. It is then necessary to provide an alternative method of specifying non-sequential instructions, normally by including instructions in the instruction set which alter the subsequent execution sequence, sometimes under specific conditions.
The third address can be eliminated to obtain the two-address instruction format by always placing the result of arithmetic or logic operations in the location where the first operand was held; this overwrites the first operand. The second address can be eliminated to obtain the one-address instruction format by having only one place for the first operand and result. This location, which would be within the processor itself rather than in the memory, is known as an accumulator, because it accumulates results. However, having only one location for one of the operands and for the subsequent result is rather limiting, and a small group of registers within the processor can be provided, as selected by a small field in the instruction; the corresponding instruction format is the one-and-a-half-address instruction format or register type. All the addresses can be eliminated to obtain the zero-address instruction format, by using two known locations for the operands. These locations are specified as the first and second locations of a group of locations known as a stack. The various formats are shown in Figure 1.2. The one-and-a-half- or two-address formats are mostly used, though there are examples of three-address processors, e.g. the AT&T WE32100 processor.
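As a small illustration of these formats (a sketch, not taken from the book; the mnemonics and operand names are invented), the single assignment A = B + C might be encoded as follows under each format:

    # Hypothetical encodings of A = B + C under the different instruction formats.
    three_address = [("ADD", "A", "B", "C")]        # A <- B + C in one instruction
    two_address   = [("MOVE", "A", "B"),            # first operand location also
                     ("ADD",  "A", "C")]            #   receives the result
    one_address   = [("LOAD",  "B"),                # accumulator implied
                     ("ADD",   "C"),
                     ("STORE", "A")]
    zero_address  = [("PUSH", "B"), ("PUSH", "C"),  # operands taken from the stack;
                     ("ADD",), ("POP", "A")]        #   ADD combines the top two items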
[Figure 1.2 Instruction formats: (a) four-address format, (b) three-address format, (c) two-address format, (d) one-and-a-half-address format, (e) one-address format, (f) zero-address format]

Various methods (addressing modes) can be used to identify the locations of the operands. Five different methods are commonly incorporated into the instruction set:
1. Immediate addressing - when the operand is part of the instruction.
2. Absolute addressing - when the address of an operand is held in the instruction.
3. Register direct addressing - when the operand is held in an addressed register.
4. Register indirect addressing - when the address of the operand location is held in a register.
5. Various forms of relative addressing - when the address of the operand is computed by adding an offset held in the instruction to the contents of specific registers.
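A small sketch of how each mode yields an operand (illustrative only; the memory contents, register values and offset below are assumed, not taken from the text):

    # Hypothetical machine state.
    memory    = {100: 7, 108: 42}        # address -> stored value
    registers = {"R1": 100}              # register -> contents

    immediate         = 25                           # value held in the instruction itself
    absolute          = memory[100]                  # instruction holds address 100 -> 7
    register_direct   = registers["R1"]              # operand is the register contents -> 100
    register_indirect = memory[registers["R1"]]      # register holds the operand address -> 7
    relative          = memory[registers["R1"] + 8]  # offset 8 in the instruction plus R1 -> 42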
The operation of the processor can be divided into two distinct steps, as shown in Figure 1.3. First, an instruction is obtained from the memory and the program counter is incremented - this step is known as the fetch cycle. Then the operation is performed - this step is known as the execute cycle and includes fetching any operands and storing the result. Sometimes, more than one memory location is necessary to hold an instruction (depending upon the design of the instructions). When this occurs the program counter is incremented by one as each location is accessed to extract a part of the instruction. The contents of the program counter can be purposely altered by the execution of "jump" instructions, used to change the execution sequence. This facility is essential to create significant computations and different computations which depend upon previous computations.
[Figure 1.3 CPU mode of operation: (a) fetch cycle, (b) execute cycle (SP, stack pointer; PC, program counter; IR, instruction register; ALU, arithmetic and logic unit)]
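To make the fetch and execute cycles concrete, here is a minimal sketch of a one-address stored program machine (my own illustration; the instruction encoding is invented and is not that of any machine described in the text):

    # Program and data share one memory; each instruction is (operation, address).
    memory = [
        ("LOAD", 5), ("ADD", 6), ("STORE", 7), ("HALT", 0),   # program at addresses 0-3
        None, 3, 4, 0,                                        # data at addresses 5, 6 and 7
    ]
    pc, acc = 0, 0                          # program counter and accumulator
    while True:
        op, address = memory[pc]            # fetch cycle: read the instruction at PC
        pc += 1                             #   and increment the program counter
        if op == "LOAD":                    # execute cycle: perform the operation
            acc = memory[address]
        elif op == "ADD":
            acc += memory[address]
        elif op == "STORE":
            memory[address] = acc
        elif op == "HALT":
            break
    print(memory[7])                        # 3 + 4 = 7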
The operations required to execute (and fetch) an instruction can be divided into a number of sequential steps performed by the control unit of the processor. The control unit can be designed using interconnected logic gates and counters to generate the required signals (a random logic approach). Alternatively, each step could be binary-encoded into a microinstruction. A sequence of these microinstructions is formed for each machine instruction and is then stored in a control memory within the internal control unit of the processor. The sequence of microinstructions is known as a microprogram (or microcode) and one sequence must be executed for each machine instruction read from the main memory. This technique was first suggested by Wilkes in the early 1950s (Wilkes, 1951) but was not put into practice in the design of computers until the 1960s, mainly because the performance was limited by the control memory, which needs to operate much faster than the main memory. Given a control memory with alterable contents, it is possible to alter the machine instruction set by rewriting the microprograms; this leads to the concept of emulation. In emulation, a computer is microprogrammed to have exactly the same instruction set as another computer, and to behave in exactly the same manner, so that machine instruction programs written for the emulated computer will run on the microprogrammed computer.
The general arrangement of a microprogrammed control unit is shown in Figure 1.4. An instruction is fetched into an instruction register by a standard instruction fetch microprogram. The machine instruction "points" to the first microinstruction of the microprogram for that machine instruction. This microinstruction is executed, together with subsequent microinstructions for the machine instruction. The sequence can be altered by conditions occurring within or outside the processor. In particular, microprogram sequences of conditional jump instructions may be altered by conditions indicated in a processor condition code register. Also, subroutine microinstructions can be provided to reduce the size of the microprogram. Just as a stack is used to hold the return address of machine instruction subroutines, a control memory stack
can be provided to hold the return address of a microinstruction subroutine. The microinstructions can have one bit for each signal to be generated, binary-encoded fields, or a combination. A two-level approach is also possible, in which a short microinstruction points to a set of much longer nanoinstructions held in another control memory.

[Figure 1.4 Microprogrammed control unit: the machine instruction held in the instruction register selects a microprogram in the control memory; next address logic, steered by external inputs and the condition code register, selects each subsequent microinstruction, whose fields provide the control signals]
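The following sketch shows the idea in miniature (illustrative only; the control signals and the microprogram for an ADD instruction are invented, not taken from any actual control unit):

    # Control memory: each machine opcode selects a microprogram, i.e. a list of
    # microinstructions, each asserting a set of control signals for one step.
    control_memory = {
        "ADD": [
            {"REG_OUT_A", "ALU_LATCH_A"},   # route the first operand to the ALU
            {"REG_OUT_B", "ALU_ADD"},       # add the second operand
            {"ALU_OUT", "REG_WRITE"},       # write the result back to a register
        ],
    }

    def run_microprogram(opcode):
        # One microinstruction sequence is executed per machine instruction fetched.
        for step, signals in enumerate(control_memory[opcode]):
            print("step", step, "asserts", sorted(signals))

    run_microprogram("ADD")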
To summarize, we can identify the main operating characteristics of the stored program computer as follows:

1. Only elementary operations are performed (e.g. arithmetic addition, logical operations).
2. The user (programmer) selects operations to perform the required computation.
3. Encoded operations are stored in a memory.
4. Strict sequential execution of stored instructions occurs (unless otherwise directed).
5. Data may also be stored in the same memory.

The reader will find a full treatment of basic computer architecture and organization in Stallings (1987) and Mano (1982).
1.1.2 Improvements in performance
Since the 1940s the development of stored program computer systems has concentrated upon three general areas:

1. Improvements in technology.
2. Software development.
3. Architectural enhancements.

Improvements in technology, i.e. in the type of components used and in fabrication techniques, have led to dramatic increases in speed. Component speeds have typically doubled every few years during the period. Such improvements are unlikely to continue for electronic components because switching times now approach the limit set by the velocity of electrical signals (about 2/3 the speed of light, i.e. 0.2 m/ns) and the delay through interconnecting paths will begin to dominate. In fact, this limit has been recognized for some time and has led some researchers to look at alternative technologies, such as optical technology (optical computers).
After the overall design specification has been laid down and cost constraints are made, one of the first decisions made at the design stage of a computer is in the choice of technology. This is normally between TTL/CMOS (transistor-transistor logic/complementary metal oxide semiconductor) and ECL (emitter-coupled logic) for high performance systems. Factors to be taken into account include the availability of very large scale integration (VLSI) components and the consequences of the much higher power consumption of ECL. ECL has a very low level of integration compared to CMOS but has still been chosen for the highest performance systems because, historically, it is much faster than MOS (metal oxide semiconductor). Predictions need to be made as to the expected developments in technology, especially those developments that can be incorporated during the design phase of the system. For example, it might be possible to manufacture a chip with improved performance, if certain design tolerances are met (see Maytal et al., 1989).
A computer system can be characterized by its instruction execution speed, the internal processor cycle time or clock period, the capacity and cycle time of memory, the number of bits in each stored word and by features provided within its instruction set, among other characteristics. The performance of a high performance computer system is often characterized by the basic speed of machine operations, e.g. millions of operations per second, MOPS (or sometimes millions of instructions per second, MIPS). These operations are further specified as millions of floating point operations per second, MFLOPS, or even thousands of MFLOPS, called gigaflops, GFLOPS, especially for large, high performance computer systems. A computer is considered to be a supercomputer if it can perform hundreds of millions of floating point operations per second (100 MFLOPS) with a word length of approximately 64 bits and a main memory capacity of millions of words (Hwang, 1985). However, as technology improves, these figures need to be revised upwards. A Cray X-MP computer system, one of the fastest computer systems developed in the early 1980s, has a peak speed of about 2 GFLOPS. This great speed has only been achieved through the use of the fastest electronic components available, the most careful physical design (with the smallest possible distances between components), very high speed pipelined units with vector processing capability (see discussion, page 138 and Chapter 4), a very high speed memory system and, finally, multiple processors, which were introduced in the Cray X-MP and the Cray 2 after the single processor Cray 1.
The internal cycle time (clock period) specifies the period allotted to each basic internal operation of the processor. In some systems, notably microprocessor systems (see page 12), the clock frequency is a fundamental figure of merit, especially for otherwise similar processors. A clock frequency of 10 MHz would correspond to a clock period of 100 ns. If one instruction is completed after every 100 ns clock period, the instruction rate would be 10 MOPS. This would be the peak rate. One or more periods may be necessary to fetch an instruction and execute it, but very high speed systems can generate results at the end of each period by using pipelining and multiple unit techniques. The Cray X-MP computer had a 9.5 ns clock period in 1980 and finally achieved its original design objective of an 8.5 ns clock period in 1986, by using faster components (August et al., 1989). Each subsequent design has called for a shorter clock period, e.g. 4 ns and 1 ns for the Cray 2 and Cray 3, respectively. Other large "mainframe" computer systems have had cycle times/clock periods in the range 10-30 ns. For example, the IBM 308X, first delivered in 1981, had a cycle time of 26 ns (later reduced to 24 ns) using TTL circuits mounted on ceramic thermal conduction modules. The IBM 3090, a development of the 308X with faster components, first introduced in 1985, had a cycle time of 18.5 ns (Tucker, 1986).
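The relationship between clock frequency, clock period and peak instruction rate quoted above is simple arithmetic; a short sketch reproducing the 10 MHz figures in the text:

    clock_frequency = 10e6                    # 10 MHz
    clock_period = 1 / clock_frequency        # 1e-7 s = 100 ns
    peak_rate = clock_frequency / 1e6         # one result per period -> 10 MOPS (peak)
    print(clock_period * 1e9, "ns clock period,", peak_rate, "MOPS peak")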
Software development, i.e. the development of programming techniques and the support environment, have included various high level languages such as PASCAL and FORTRAN and complex multitasking operating systems for controlling more than one user on the system. Some developments in software have led to variations in the internal design of the computer. For example, computers have been designed for the efficient handling of common features of high level languages by providing special registers or operating system operations in hardware. Most computer systems now have some hardware support for system software.

In this text we are concerned with architectural developments, i.e. developments in the internal structure of the computer system to achieve improved performance. Such developments will be considered further in the next section. First though, let us examine the most striking technological development in recent years - the development of the microprocessor - as this device is central to the future development of multiprocessor systems, particularly those systems with large numbers of processors.
1.2 Microprocessor systems
1.2.1 Development

Since the late 1960s, logic components in computer systems have been fabricated on integrated circuits (chips) to achieve high component densities. Technological developments in integrated circuits have produced more logic components in a given area, allowing more complex systems to be fabricated on the integrated circuit, first in small scale integration (SSI, 1 to 12 gates) then medium scale integration (MSI, 12 to 100 gates), large scale integration (LSI, 100 to 1000 gates), through to very large scale integration (VLSI, usually much greater than 1000 gates). This process led directly to the microprocessor, a complete processor on an integrated circuit. The early microprocessors required the equivalent of large scale integration.

Later integration methods are often characterized by the applied integrated circuit design rules specifying the minimum features, e.g. 1.25 µm and then 0.8 µm line widths. Smaller line widths increase the maximum number of transistors fabricated on one integrated circuit and reduce the gate propagation delay time. The number of transistors that can be reasonably fabricated on one chip with acceptable yield and 1.25 µm design rules is in excess of one million, but this number is dependent upon the circuit complexity. Repetitive cells, as in memory devices, can be fabricated at higher density than irregular designs.
Microprocessors are often manufactured with different guaranteed clock frequencies, e.g. 10 MHz, 15 MHz or 20 MHz. There is a continual improvement in the clock frequencies due to an improved level of component density and the attendant reduced gate propagation delay times. By increasing the clock frequency the processor immediately operates more quickly, and in direct proportion to the increase in clock frequency, assuming that the main memory can also operate at the higher speed. The choice of clock frequency is often closely related to the speed of available memory.

Microprocessors are designated 4-bit, 8-bit, 16-bit, 32-bit or 64-bit depending upon the basic unit of data processed internally. For example, a 32-bit microprocessor will usually be able to add, subtract, multiply or divide two 32-bit integer numbers directly. A processor can usually operate upon smaller integer sizes in addition to its basic integer size. A 32-bit microprocessor can perform arithmetic operations upon 8-bit and 16-bit integers directly. Specific machine instructions operate upon specific word sizes. An interesting computer architecture not taken up in microprocessors (or in most other computer systems), called a tagged architecture, uses the same instruction to specify an operation upon all allowable sizes of integers. The size is specified by bits (a tag) attached to each stored number.
The first microprocessor, the Intel 4004, introduced in 1971, was extremely primitive by present standards, operating upon 4-bit numbers and with limited external memory, but it was a milestone in integrated circuits. Four-bit microprocessors are now limited to small system applications involving decimal arithmetic, such as pocket calculators, where 4 bits (a nibble) can conveniently represent one decimal digit. The 4004 was designed for such applications and in the ensuing period, more complex 8-bit, 16-bit and 32-bit microprocessors have been developed, in that order, mostly using MOS integrated circuit technology. Binary-coded decimal (BCD) arithmetic is incorporated into these more advanced processors as it is not subject to rounding, and is convenient for financial applications.
Eight-bit microprocessors became the standard type of microprocessor in the mid-1970s, typified by the Intel 8080, Motorola MC6800 and Zilog Z-80. At about this time, the microprocessor operating system CP/M, used in the 8080 and the Z-80, became widely accepted and marked the beginning of the modern microprocessor system as a computer system capable of being used in complex applications. Sixteen-bit microprocessors started to emerge as a natural development of the increasing capabilities of integrated circuit fabrication techniques towards the end of the 1970s, e.g. the Intel 8086 and Motorola MC68000, both introduced in 1978. Subsequent versions of these processors were enhanced to include further instructions, circuits and, in particular, memory management capabilities and on-chip cache memory (see pages 18-20 and Chapters 2 and 3). In the Intel 8086 family, the 80186 included additional on-chip circuits and instructions and the 80286 included memory management. In the Motorola family, the MC68010 included memory management. Thirty-two bit versions also appeared in the 1980s (e.g. the Intel 80386 with paged memory management, the Motorola MC68020 with cache memory and the MC68030 with instruction/data cache memories and paged memory
management). In 1989 the 32-bit Intel 80486 microprocessor was introduced.
Floating point numbers can be processed in more advanced microprocessors by additional special processors intimately attached to the basic microprocessor, though a floating point unit can also be integrated into the processor chip. Floating point numbers correspond to real numbers in high level languages and are numbers represented by two parts, a mantissa and an exponent, such that the number = mantissa x base^exponent, where the base is normally two for binary representation. For further details see Mano (1982).
1.2.2 Microprocessor architecture
The basic architecture of a microprocessor system is shown in Figure 1.5, and consists of a microprocessor, a semiconductor memory and input/output interface components all connected through a common set of lines called the bus. The memory holds the program currently being executed, those to be executed and the associated data. There would normally be additional secondary memory, usually disk memory, and input/output interfaces are provided for external communication. The bus-based architecture is employed in all microprocessor systems, but microprocessor systems were not the first or only computer systems to use buses; the PDP 8E minicomputer, introduced in 1971, used a bus called the Omnibus and the PDP 11, first introduced in 1970, used a bus called Unibus. The expansibility of a bus structure has kept the technique common to most small and medium size computer systems.
The bus is the communication channel between the various parts of the system,
and can be divided into three parts:
1. Data lines.
2. Address lines.
3. Control lines.
[Figure 1.5 Fundamental parts of a microprocessor system: microprocessor, memory, disk and input/output interfaces connected by the bus]
The data lines carry (1) the instructions from the memory to the processor during each instruction fetch cycle, and (2) data between the processor and memory or input/output interfaces during instruction execute cycles, dependent upon the instruction being executed. Eight-bit microprocessors have eight data lines, 16-bit microprocessors have sixteen data lines (unless eight lines are used twice for each 16-bit data transfer, as in some 16-bit microprocessors). Similarly, 32-bit microprocessors have thirty-two data lines, unless reduced by the same technique. Notice that the microprocessor bit size - 8-bit, 16-bit, 32-bit or whatever - does not specify the number of data lines. It specifies the basic size of the data being processed internally and the size of the internal arithmetic and logic unit (ALU).

The instructions fetched from memory to the processor comprise one or more 8-bit words (bytes), or one or more 16-bit words, depending upon the design of the microprocessor. The instructions of all 8-bit microprocessors have one or more bytes, typically up to five bytes. One byte is provided for the operation, including information on the number of subsequent bytes, and two bytes each for each operand address when required. Sixteen/32-bit microprocessors can have their instructions in multiples of bytes or in multiples of 16-bit words, generally up to six bytes or three words. When the data bus cannot carry the whole instruction in one bus cycle, additional cycles are performed to fetch the remaining parts of the instruction. Hence, the basic instruction fetch cycle can consist of several data bus transfers, and the timing of microprocessors is usually given in terms of bus cycles. Similarly, the operands (if any) transferred during the basic execute cycle may require several bus cycles. In all, the operation of the microprocessor is given in read and write bus transfer cycles, whether these fetch instructions or transfer operands/results.
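A rough sketch of the bus-cycle point (illustrative; the instruction lengths below are assumed examples, not figures for any particular processor): the number of bus transfers needed to fetch an instruction is its length divided by the data bus width, rounded up.

    from math import ceil

    def fetch_bus_cycles(instruction_bytes, data_bus_bytes):
        # Each bus cycle moves at most data_bus_bytes; a partial word still costs a cycle.
        return ceil(instruction_bytes / data_bus_bytes)

    print(fetch_bus_cycles(5, 1))   # five-byte instruction, 8-bit data bus  -> 5 cycles
    print(fetch_bus_cycles(6, 2))   # six-byte instruction, 16-bit data bus  -> 3 cycles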
During a bus cycle, the bus transfer might be to the processor, when an instruction or data operand is fetched from memory or a data operand is read from an input/output interface, or from the processor, to a location in the memory or an output interface to transfer a result. Hence, the data lines are bidirectional, though simultaneous transfers in both directions are impossible and the direction of transfer must be controlled by signals within the control section of the bus.
The address lines carry addresses of memory locations and input/output locations to be accessed. A sufficient number of lines must be available to address a large number of memory locations. Typically, 8-bit microprocessors in the 1970s provided for sixteen address lines, enabling 2^16 (65 536) locations to be specified uniquely. More recent microprocessors have more address lines, e.g. the 16-bit 8086 has twenty address lines (capable of addressing 1 048 576 bytes, i.e. 1 megabyte), the 16-bit 80286 and MC68000 have twenty-four (capable of addressing 16 megabytes) and the 32-bit MC68020, MC68030 and 80386 have thirty-two (capable of addressing 4294 megabytes, i.e. 4 gigabytes).
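The addressable memory quoted for each processor follows directly from the number of address lines, since n lines select one of 2^n locations; for instance:

    for lines in (16, 20, 24, 32):
        locations = 2 ** lines              # uniquely addressable byte locations
        print(lines, "address lines:", locations, "locations")
    # 16 -> 65 536; 20 -> 1 048 576 (1 Mbyte); 24 -> 16 777 216 (16 Mbyte);
    # 32 -> 4 294 967 296 (4 Gbyte)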
The control lines carry signals to activate the data/instruction transfers and other events within the system; there are usually twelve or more control lines. The control signals, as a group, indicate the time and type of a transfer. The types of transfer include transfers to or from the processor (i.e. read or write) and involve memory and input/output interfaces, which may be differentiated.
1.3 Architectural developments
1.3.1 General
There have been many developments in the basic architecture of the stored program computer to increase its speed of operation. Most of these developments can be reduced to applying parallelism, i.e. causing more than one operation to be performed simultaneously, but significant architectural developments have also come about to satisfy requirements of the software or to assist the application areas. A range of architectural developments has been incorporated into the basic stored program computer without altering the overall stored program concept. In general, important architectural developments can be identified in the following areas:

1. Those concerned with the processor functions.
2. Those concerned with the memory system hierarchy.
3. Those around the processor-memory interface.
4. Those involving use of multiple processor systems.

Let us briefly review some of these developments, which will be presented in detail in the subsequent chapters.
1.3.2 Processor functions
As we have noted, the operation of the processor is centered on two composite operations:

1. Fetching an instruction.
2. Executing the fetched instruction.

First, an instruction is read from memory using the program counter as a pointer to the memory location. Next, the instruction is decoded, that is, the specified operations are recognized. In the fetch/execute partition, the instruction decode occurs during the latter part of the fetch cycle and once the operation has been recognized, the instruction can be executed. The operands need to be obtained from registers or memory at the beginning of the execute cycle and the specified operation is then performed on the operands. The results are usually placed in a register or memory location at the end of the execute cycle.
The execution of an instruction and the fetching of the next instruction can be performed simultaneously in certain circumstances; this is known as instruction fetch/execute overlap. The principal condition for success of the instruction fetch/execute overlap is that the particular instruction fetched can be identified before the previous instruction has been executed. (This is the case in sequentially executed instructions. However, some instructions will not be executed sequentially, or may only be executed sequentially after certain results have been obtained.)

The two basic cycles, fetch and execute, can be broken down further into the following three steps which, in some cases, can be overlapped:

1. Fetch instruction.
2. Decode instruction and fetch operands.
3. Execute operation.
The execute operation can be broken into individual steps dependent upon the instruction being executed. Simple arithmetic operations operating upon integers may only need one step while more complex operations, such as floating point multiplication or division, may require several steps.

In high speed processors the sequence of operations to fetch and decode, and the steps to execute an instruction, are performed in a pipeline. In general, a pipeline consists of a number of stages, as shown in Figure 1.6, with each stage performing one sequential step of the overall task. Where necessary, the output of one stage is passed to the input of the next stage. Information required to start the sequence enters the first stage and results are produced by the final (and sometimes intermediate) stage.
The time taken to process one complete task in the pipeline will be at least as long as the time taken when one complex homogeneous functional unit, designed to achieve the same result as the multistage pipeline, is used. However, if a sequence of identical operations is required, the pipeline approach will generate results at the rate at which the inputs enter the pipeline, though each result is delayed by the processing time within the pipeline. For sequential identical operations, the pipeline could be substantially faster than one homogeneous unit.
Clearly, instruction operations are not necessarily identical, nor always sequential and predictable, and pipelines need to be designed to cope with non-sequential, dissimilar operations. Also, it is not always possible to divide a complex operation into a series of sequential steps, especially into steps which all take the same length of time. Each stage need not take the same time, but if the times are different, the pipeline must wait for the slowest stage to complete before processing the next set of inputs. However, substantial speed-up can be achieved using the pipeline technique and virtually all computer systems, even modern microprocessors, have a pipeline structure (Chapter 4 deals with pipelining and pipelined processors in detail).

[Figure 1.6 Processor pipeline: an operand fetch unit feeding a chain of pipeline stages]
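A short sketch of the timing argument (my own illustration; the stage times are assumed values, not taken from the text): the pipeline's first result takes the sum of the stage times, but a long run of identical operations then completes one result per cycle of the slowest stage.

    stage_times = [10, 10, 15, 10]             # ns per stage (assumed values)
    n_operations = 100

    first_result = sum(stage_times)            # latency through all stages
    cycle = max(stage_times)                   # pipeline advances at the slowest stage
    pipelined = first_result + (n_operations - 1) * cycle
    serial = n_operations * sum(stage_times)   # one homogeneous unit doing every step
    print(pipelined, "ns pipelined,", serial, "ns without pipelining")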
where m(n) is the miss ratio with n locations filled in the cache and p(n) is the probability that a miss results in a new location being filled. (p(n) is zero if the cache is filled, one if the cache is not filled and any free location can be used, i.e. in a fully associative cache, and less than one with direct and set-associative caches, which place restraints upon the availability of locations for incoming blocks.) Strecker assumes that the probability is numerically equal to the fraction of free cache locations, i.e.:

p(n) = (s - n)/s

where s is the size of the cache. The reasonably good approximation to the miss
ratio is given as:

m(n) = (a + bn)/(a + n)

where a and b are constants to be found from trace results. Hence we obtain:

dn/dt = (a + bn)(s - n)/((a + n)s)

It is left as an exercise to solve this equation (see Strecker, 1983).
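The equation can also be explored numerically rather than solved analytically; the sketch below simply steps the filling process forward one memory reference at a time (my own illustration: the values of a, b and s are arbitrary, not trace-derived constants, and the expressions follow the forms of m(n) and p(n) given above).

    # dn/dt = m(n) * p(n), with m(n) = (a + b*n)/(a + n) and p(n) = (s - n)/s.
    a, b, s = 20.0, 0.02, 1024               # arbitrary illustrative constants
    n = 0.0                                  # locations filled so far
    for t in range(200_000):                 # t counts memory references
        miss_ratio = (a + b * n) / (a + n)   # m(n)
        fill_prob = (s - n) / s              # p(n): chance a miss fills a free location
        n += miss_ratio * fill_prob          # one forward step of dn/dt
    print(round(n), "of", s, "cache locations filled")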
Thiebaut and Stone (1987) introduced the term footprint to describe the active portion of a process that is present in the cache. Footprints of two processes reside in the cache during a transition from one program to another. Probabilistic equations are derived (see Stone, 1987).

Mathematical modelling is useful in helping to see the effect of changing parameters, but mathematical models cannot capture the vast differences in programs.
3.6 Virtual memory systems with cache memory
In a computer system with virtual memory, we can insert the cache after the virtual-real address translation, so that the cache holds real address tags and the comparison of addresses is done with real addresses. Alternatively, we can insert the cache before the virtual-real translation so that the cache holds virtual address tags and the comparison of addresses is done using virtual addresses. Let us first consider the former case, which is much less complicated and has fewer repercussions on the rest of the system design.
3.6.1 Addressing cache with real addresses

Though it is not necessary for correct operation, it is common to perform the virtual-real translation at the same time as some independent part of the cache selection operation to gain an improvement in speed. The overlap is done in the following way. As we have seen in Chapter 2, the address from the processor in a paged virtual memory system is divided into two fields, the most significant field identifying the page and the least significant field identifying the word (line) within the page. The division into page and line is fixed for a particular system and made so that a suitable sized block of information is transferred between the main and the secondary memories. In a cache system, the address is also divided into fields - a most significant field (the tag field corresponding to the tags stored in the cache) and a less significant field (to select the set (in a set-associative cache) and to select the block and word within the block). If the tag field corresponds directly to the page field in the real address, then the set selection can be done with the next significant bits of the address before the virtual address translation is done, and the virtual address translation can be performed while the set selection is being done. When the address translation has been done, and a real page address produced, this address can be compared with the tags selected from the cache, as shown in Figure 3.11. On a cache miss, the real address is immediately available for selecting the block in main memory, assuming a page fault has not occurred, and the block can be transferred into the cache directly.

[Figure 3.11 Cache with translation look-aside buffer: the word-index bits of the virtual address select entries in the cache while the translation look-aside buffer (TLB) translates the page/tag field; the resulting real address is compared with the selected tag and the byte is then selected from the accessed word]
Clearly, as described, the overlap mechanism relies on the page size being the
same as the overall cache size (irrespective of the organization), although some
variations in the lengths of the fields are possible while still keeping some concurrent
operations. In particular, the page size can be larger, so that there are more bits for
the line than needed for the set/block/word selection in the cache. The extra bits are
then concatenated with the real page address before being compared with the tags.
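A sketch of the field arithmetic involved (illustrative sizes, not from the text): with a 4 Kbyte page and a 4 Kbyte direct mapped cache of 16-byte blocks, the block and word selection bits all lie within the line (page offset) field, so cache selection can start before the TLB delivers the real page number.

    from math import log2

    page_size  = 4096                        # bytes per page (assumed)
    cache_size = 4096                        # bytes in the cache (assumed)
    block_size = 16                          # bytes per block (assumed)

    line_bits  = int(log2(page_size))                 # untranslated page-offset field
    block_bits = int(log2(cache_size // block_size))  # selects the block (or set)
    word_bits  = int(log2(block_size))                # selects the word/byte in the block

    # Overlap with address translation is possible when the cache selection bits
    # fall entirely within the untranslated line field of the virtual address.
    print("overlap possible:", block_bits + word_bits <= line_bits)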
3.6.2 Addressing cache with virtual addresses
If the cache is addressed with virtual addresses these are immediately available for selecting a word within the cache and there is a potential increase in speed over a real addressed cache. Only on a cache miss would it be necessary to translate a virtual address into a real address, and there is more time then. Clearly, if the tag field of the virtual address is larger than the real address, the tag fields in the cache
would be larger and there would be more associated comparators. Similarly, if the
virtual address is smaller than the real address, the tag fields in the cache would be smaller and there would be fewer comparators. Often though, the virtual and real address tags have the same number of bits. A particular advantage of a virtual addressed cache is that there is no need for overlap between the virtual/real address translation and the cache operation, as there is no translation mechanism for cache hits. So the division of addresses into fields in the virtual/real addresses and the division of fields in the cache selection mechanism can be designed separately and need not have any interrelationship.
However, though the virtual addressed cache is an apparently attractive solution, it has a complication concerned with the relationship between virtual addresses in different processes which may be in the cache together. It is possible for different virtual addresses in different processes to map into the same real address. Such virtual addresses are known as synonyms - from the word denoting the same thing(s) as another but suitable for different contexts. Synonyms are especially likely if the addressed location is shared between processes, but can also occur if programs request the operating system to use different virtual addresses for the same real address. Synonyms can occur when an input/output device uses real addresses to access main memory accessible by the programs. They can also occur in multiprocessor systems when processors share memory using different virtual addresses. It is also possible for the same virtual address generated in different processes to map into different real addresses.
Process or other tags could be attached to the addresses to differentiate between
virtual addresses of processes, but this adds a complication to the cache design, and
would still allow multiple copies of the same real block in the cache simultaneously.
Of course, synonyms could be disallowed by placing restrictions on virtual addresses.
For example, each location in shared code could be forced to have only one virtual
address. This approach is only acceptable for shared operating system code and is
done in the IBM MVS operating system.
Otherwise, synonyms are handled in virtual addressed caches by the use of a
reverse translation buffer (RTB), also called an inverse translation buffer (ITB). On
a cache miss, the virtual address is translated into a real address using the virtual-
real translation look-aside buffer (TLB) to access the main memory. When the real
address has been formed, a reverse translation occurs to identify all virtual addresses
given under the same real address. This reverse translation can be performed at the
same time as the main memory cycle. If the real address is given by another virtual
address already existing in the cache, the virtual address is renamed to eliminate
multiple copies of the same block. The information from the main memory is not
needed and is discarded. If a synonym does not exist, the main memory information
is accepted and loaded into the cache.
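The miss-time handling just described can be made concrete with a small toy model. The sketch below is purely illustrative (the table sizes, one-entry-per-block reverse mapping and the function names are assumptions, not the design of any particular machine): on a miss the real block is looked up in the reverse translation table, and if it is already cached under another virtual tag that copy is renamed and the data from main memory is discarded.

#include <stdio.h>
#include <string.h>

/* Toy sketch of a virtually addressed cache with a reverse translation
 * buffer (RTB) consulted on a miss to detect synonyms. */
#define BLOCKS 8
static long cached_vtag[BLOCKS];        /* virtual tag held in each cache block */
static long rtb_vtag_for_real[BLOCKS];  /* RTB: real block -> cached virtual tag */

static void miss(long vtag, long real_block)
{
    long synonym = rtb_vtag_for_real[real_block];
    if (synonym != -1 && synonym != vtag) {
        /* Block already cached under another virtual address: rename it
         * and discard the copy fetched from main memory. */
        for (int i = 0; i < BLOCKS; i++)
            if (cached_vtag[i] == synonym) cached_vtag[i] = vtag;
        printf("synonym found: renamed vtag %ld to %ld\n", synonym, vtag);
    } else {
        cached_vtag[real_block % BLOCKS] = vtag;  /* accept main memory data */
    }
    rtb_vtag_for_real[real_block] = vtag;
}

int main(void)
{
    memset(cached_vtag, -1, sizeof cached_vtag);
    memset(rtb_vtag_for_real, -1, sizeof rtb_vtag_for_real);
    miss(0x10, 3);   /* one process caches real block 3 under vtag 0x10 */
    miss(0x20, 3);   /* another process, same real block: synonym detected */
    return 0;
}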
When there are direct accesses to the main memory by devices such as a direct
memory access (DMA) input/output device, the associated block in the cache, if
present, must be recognized and invalidated (see Section 3.2.2). To identify the
block, a real-virtual address translation also needs to be performed using a reverse
translation buffer.
3.6.3 Access time
The average access time of a system with both a cache and a paged virtual memory
has several components, depending on which of several situations arises - whether
the real address (assuming a real addressed cache) is in the translation look-aside
buffer, the cache or the main memory, and whether the data is in the cache or the
main memory. The translation look-aside buffer is used to perform the address
translation when the virtual page is in the translation look-aside buffer. If there is a
miss in the translation look-aside buffer, the translation is performed by accessing a
page table which may be in the cache or in the main memory. There are six
combinations of accesses, namely:
Address in the translation look-aside buffer, data in the cache.
Address in the translation look-aside buffer, data in the main memory.
Address in the cache, data in the cache.
Address in the cache, data in the main memory.
Address in the main memory, data in the cache.
Address in the main memory, data in the main memory.
(Part of the page table could be in the secondary memory, but we will not consider
this possibility.) Suppose there is no overlap between translation look-aside buffer
translation and cache access and the following times apply:
Translation look-aside buffer address translation time
  (or time to generate a TLB miss)                        25 ns
Cache time to determine whether address is in cache       25 ns
Cache data fetch if address is in cache                   25 ns
Main memory read access time                             200 ns
Translation look-aside buffer hit ratio                    0.9
Cache hit ratio                                            0.95
the access times and probabilities of the various access combinations are given in
Table 3.2.
Table 3.2 Access times and probabilities of the various access combinations

Access times                                Probabilities
25 + 25 + 25            =  75 ns            0.9 x 0.95         = 0.855
25 + 25 + 200           = 250 ns            0.9 x 0.05         = 0.045
25 + 25 + 25 + 25 + 25  = 125 ns            0.1 x 0.95 x 0.95  = 0.09025
25 + 25 + 25 + 25 + 200 = 300 ns            0.1 x 0.95 x 0.05  = 0.00475
25 + 25 + 200 + 25 + 25 + 25  = 325 ns      0.1 x 0.05 x 0.95  = 0.00475
25 + 25 + 200 + 25 + 25 + 200 = 500 ns      0.1 x 0.05 x 0.05  = 0.00025
The average access time is given by:

    (75 x 0.855) + (250 x 0.045) + (125 x 0.09025) + (300 x 0.00475) +
    (325 x 0.00475) + (500 x 0.00025) = 89.75 ns (64.125 ns on a cache hit)
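The same weighted average can be computed directly from the hit ratios, which is convenient when exploring other timing assumptions. A minimal C sketch, using the access times of Table 3.2:

#include <stdio.h>

/* Reproduces the average access time calculation above; the times and hit
 * ratios are those assumed in the text. */
int main(void)
{
    double t[6] = { 75, 250, 125, 300, 325, 500 };   /* ns, from Table 3.2 */
    double h_tlb = 0.9, h_cache = 0.95;
    double p[6] = {
        h_tlb * h_cache,                             /* TLB hit,  data in cache  */
        h_tlb * (1 - h_cache),                       /* TLB hit,  data in memory */
        (1 - h_tlb) * h_cache * h_cache,             /* PTE in cache, data hit   */
        (1 - h_tlb) * h_cache * (1 - h_cache),       /* PTE in cache, data miss  */
        (1 - h_tlb) * (1 - h_cache) * h_cache,       /* PTE in memory, data hit  */
        (1 - h_tlb) * (1 - h_cache) * (1 - h_cache)  /* PTE in memory, data miss */
    };
    double avg = 0.0;
    for (int i = 0; i < 6; i++) avg += t[i] * p[i];
    printf("average access time = %.2f ns\n", avg);  /* prints 89.75 */
    return 0;
}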
If the virtual memory system also incorporates two-level paging or segments,
further combinations exist. The calculation can easily be modified to take into
account partial overlap between the TLB access and cache access.
3.7 Disk caches
The concept of a cache can be applied to the main memory/secondary memory
interface. A disk cache is a random access memory introduced between the disk and
the normal main memory of the system. It can be placed within the disk unit, as
shown in Figure 3.12, or within the computer system proper. The disk cache has
considerable capacity, perhaps greater than 8 Mbytes, and holds blocks from the
disk which are likely to be used in the near future. The blocks are selected from
previous accesses in much the same way as blocks are placed in a main memory
cache. A disk cache controller activates the disk transfers. The principle of locality,
which makes main memory caches effective, also makes disk caches effective and
reduces the effective input/output data page transfer time, perhaps from 20-30 ms to
2-5 ms, depending upon the size of page transfer to the main memory. The disk
cache is implemented using semiconductor memory of the same type as normal
main memory, and clearly such memory could have been added to the main memory
as a design alternative. It is interesting to note that some operating systems, such as
UNIX, employ a software cache technique of maintaining an input/output buffer in
the main memory.

Figure 3.12 Disk cache in disk unit
The unit of transfer between the disk and the disk cache could be a sector,
multiple sectors or one or more tracks. A minimum unit of one track is one
candidate (Grossman, 1985), as is transferring the information from the selected
sector to the end of the track. A write-through policy has the advantage of simplifying
error recovery. Not all the information from/to the disk need pass through the disk
cache and some data/code might be better not using the cache. One possibility is to
have a dynamic cache on/off mechanism which causes the cache to be bypassed
under selected circumstances.
Perhaps one of the main attractions of placing the additional cache memory in the
disk unit is that existing software and hardware may not need to be changed and
substantial improvements in speed can be obtained in an existing system. Most
commercial disk caches are integrated into the disk units. Examples include the IBM
3880 Model 23 Cache Storage Controls with an 8-64 Mbyte cache. Disk caches
have also been introduced into personal computer systems. It is preferable to be able
to access the disk cache from the processor and to allow transfers
between the disk cache and disk simultaneously, as disk transfers might be one or
more tracks and such transfers can take a considerable time. Some early commercial
disk caches did not have this feature (for example the IBM 3880 Model 13).
Disk caches normally incorporate error detection and correction. For example the
disk cache incorporated into the IBM 3880 Model 23 has error detection/correction to
detect all triple-bit errors, and correct all double-bit errors and most triple-bit errors.
The earlier IBM 3880 Model 13, having a 4-8 Mbyte cache, could detect double-bit
errors and correct single-bit errors (Smith, 1985). Both these disk drives maintain
copies of data in the cache using a least recently used replacement algorithm.
3.8 Caches in multiprocessor systems
In this section we will briefly review the methods suggested to cope with multiple
processors each having caches, or having access to caches. Multiprocessor systems
will be discussed in detail in subsequent chapters. In a situation of more than one
cache, it is possible that copies of the same code/data are held in more than one
cache, and are accessed by different processors. Reading different copies of the
same code/data does not generally cause a problem. A complication only exists if
individual processors alter their copies of data, because shared data copies should
generally be kept identical for correct operation. We note that write-through is not
sufficient, or even necessary, for maintaining cache coherence, as more than one
processor writing-through the cache does not keep all the values the same and up to
date. Several possibilities exist to maintain cache coherence, in particular:
Shared caches.
Non-cacheable items.
Snoop bus mechanism.
Broadcast write mechanisms.
Directory methods.
Clearly, a single cache shared by all processors with the appropriate controls would
maintain cache coherence. Also, a shared cache might be feasible for DMA devices
accessing the cache directly rather than relying on a write policy. However, with
several processors the performance of the system would seriously degrade, due to
contention. In a multiprocessor system with more than one memory module access-
ible by all the processors, an appropriate place for each cache is attached to the
processor, as shown in Figure 3.13. It would also be possible to place the caches in
front of each memory module, but this arrangement would not decrease the inter-
connection traffic and contention.

Figure 3.13 Multiprocessor with local caches

Cache coherence problems only occur on data that can be altered, and such
writable data could be maintained only in the shared main memory and not placed
in the cache at all. Additional software mechanisms are needed to keep strict control
on the shared writable data, normally through the use of critical sections and
semaphores (see Chapter 6).
The snoop bus mechanism or bus watcher is particularly suitable for single bus
systems, as found in many microprocessor systems. In the snoop bus mechanism, a
bus watcher unit for each processor/cache observes the transactions on the bus and
in particular monitors all memory write operations. If a location in main memory is
altered and a copy exists in the cache, the bus watcher unit invalidates the cache
copy by resetting the corresponding valid bit in the cache. This action requires the
unit to have access not only to the valid bits in the cache, but also to the tags in the
cache, or copies of the cache tags, in order to compare the main memory address tag
with the cache tag. Alternatively, the cache word/block with the same index as the
main memory location can be invalidated, whether or not the tags correspond. The
unit then does not need to access the tags, though access to the valid bits is still
necessary. However, the unit might mark as invalid a cache block which does not
correspond to an altered main memory word, because the cache location with the
same index as the main memory location would be invalidated, irrespective of the
values of the tags.

Figure 3.14 Cache with snoop bus controller
Figure 3.14 shows a representative microprocessor implementation based upon an
80386 processor and an 82385 snoop bus cache controller (Intel, 1987b). The processor
accesses the cache through a local bus, all accesses being controlled by the cache
controller. For a cache miss that requires access to the main memory on the system
bus, the cache controller sends the request through the system bus to the main
memory and loads the returning data into the cache. Write accesses with a snoop bus
are conveniently handled by write-through.
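The bus-watching action can be sketched in a few lines of C. The layout below is an assumption made for illustration only (a direct mapped cache with one valid bit and one tag per line, word addresses): on every bus write the watcher checks whether its cache holds the altered word and, if so, clears the valid bit. As the text notes, a simpler unit without tag access would invalidate the indexed line unconditionally.

#include <stdio.h>

/* Minimal sketch of a bus watcher (snoop) invalidating a local copy. */
#define LINES 256
static unsigned long tag[LINES];
static int valid[LINES];

/* Called for every main memory write observed on the system bus. */
static void snoop_write(unsigned long word_addr)
{
    unsigned idx = word_addr % LINES;
    if (valid[idx] && tag[idx] == word_addr / LINES)
        valid[idx] = 0;          /* another processor altered our cached copy */
}

int main(void)
{
    unsigned long a = 0x1234;
    tag[a % LINES] = a / LINES;  /* pretend address a is cached locally */
    valid[a % LINES] = 1;
    snoop_write(a);              /* bus write to a by another processor */
    printf("line %lu valid = %d\n", a % LINES, valid[a % LINES]);
    return 0;
}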
An alternative write policy used with a snoop bus is write-once. In write-once,
the first time a processor makes a write reference to a location in the
cache, the main memory is also updated in a write-through manner. The fact is
recorded in such a way that other processors can recognize that the location has
been updated and now cannot be accessed by them. If the stored information was
also stored in any other cache, these copies are invalidated by resetting valid bits in
the caches. Subsequently, if the first processor again performs a write operation to
the location, only the cache is altered, and the main memory is updated only when
the block is replaced, as in write-back.
In broadcast writes, every cache write request is also sent to all other caches in
the system. Such broadcast writes interrogate the caches to discover whether each
cache holds the information being altered. The copy is then either updated (update
write) or an invalid bit associated with the cache line is set to show that the
copy is now incorrect. Invalidation is generally preferable to update
writes as multiple update writes by different processors to the same location might
cause inconsistencies between caches. In any event, significant additional memory
transactions occur with the broadcast method, though it has been implemented on
large computer systems (for example the IBM 3033).
In one directory method (Smith, 1982), if a block has not been modified it may
exist in several caches simultaneously, but once the block is altered by one
processor, all other copies are invalidated, and a valid block then exists only in one
cache, initially the cache associated with the processor that made the alteration. A
subsequent read operation to that block by another processor causes the block to be
transferred to the requesting cache so that multiple copies exist again, until a write
operation occurs, which invalidates the copies not immediately updated.
The mechanism is achieved through the use of a directory of bits created in the
main memory. One set of bits is created for each block that can be transferred into
the caches. One bit in each of these sets is provided for each cache, as shown in Figure 3.15.
Each set of bits has one further bit - a block modified bit to indicate that the block
has been altered. When this occurs only one cache may hold the block and only one
of the other bits can be set. If the block has not been altered, the modified bit is
reset. Then, it is possible for more than one cache to hold the block and corres-
pondingly more than one bit set in the directory to indicate this fact.

Figure 3.15 A directory method to maintain cache coherence
Each block in each cache has a bit which is set if the block is the only valid copy.
This bit is set upon a write operation and when the block has been transferred into the
cache from another cache. There are various situations that can arise in the multiple
cache system (i.e. combinations of read/write, hit/miss, present in another cache/not
present in another cache, altered/not altered) and the directory method must cope
with these situations. On a cache read operation when the block is already in the
cache, no directory and private bit operations are necessary. On a read operation
when the block is not in the cache, the modified bit of the block in the main
directory must be checked to see whether it has been altered in some other cache. If
the block is altered, it must be transferred into the cache and other copies invalidated.
The main directory is also updated, including resetting the modified bit. If the
missing block has not been altered, the copy is sent to the cache and the directory is
updated.
On a cache write operation when the block is in the cache, first the private bit is
checked to see whether it owns the only copy of the block. If it does own the only
copy, the block is simply updated. If it does not own the only copy, the main
directory is examined to find the other copies. These copies are invalidated if the
directory allows a change in ownership. On a write operation when the block is not
in the cache, the main directory is updated and the block is transferred to the cache.
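A greatly simplified sketch of the directory entry and of two of the transitions described above is given below. It is only an illustration of the idea (the cache and block counts are assumptions, and the full set of read/write, hit/miss cases is not modelled): one presence bit per cache and one modified bit per main memory block.

#include <stdio.h>

/* Sketch of a main memory directory for cache coherence. */
#define CACHES 4
#define BLOCKS 1024

struct dir_entry {
    unsigned char present[CACHES];   /* block held in cache c?               */
    unsigned char modified;          /* altered, so only one copy may exist  */
};
static struct dir_entry dir[BLOCKS];

/* Write to block b by cache c when the block is already held in cache c. */
static void write_hit(int c, int b)
{
    for (int i = 0; i < CACHES; i++)      /* invalidate all other copies      */
        if (i != c) dir[b].present[i] = 0;
    dir[b].present[c] = 1;
    dir[b].modified = 1;                  /* cache c now owns the only copy   */
}

/* Read of block b by cache c when the block is not in cache c. */
static void read_miss(int c, int b)
{
    if (dir[b].modified)
        dir[b].modified = 0;              /* sharing is allowed again         */
    dir[b].present[c] = 1;                /* several presence bits may be set */
}

int main(void)
{
    read_miss(0, 7);  read_miss(1, 7);    /* block 7 shared by caches 0 and 1 */
    write_hit(0, 7);                      /* cache 0 writes: cache 1 invalidated */
    printf("present: %d %d, modified: %d\n",
           dir[7].present[0], dir[7].present[1], dir[7].modified);
    return 0;
}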
There are several variations on these basic cache coherence techniques. A mathematical
performance analysis of seven different multiprocessor cache coherence techniques
for single bus systems is given in Yang, Bhuyan and Liu (1989).
PROBLEMS
3.1 Choose suitable memory interleaving to obtain an average access
time of less than 50 ns given that the main memory has an access time of
150 ns and a cache has an access time of 35 ns. If ten locations hold a
loop of instructions and the loop is repeated sixty times, what is the
average access time?
3.2 What is the average access time of a system having three levels of
memory - a cache memory, a semiconductor main memory and magnetic
disk secondary memory - if the access times of the memories are 20 ns,
200 ns and 2 ms, respectively? The cache hit ratio is 80 per cent and the
main memory hit ratio is 99 per cent.
3.3 A computer employs a 1 Mbyte 32-bit word main memory and a
cache of 512 words. Determine the number of bits in each field of the
address in the following organizations:
1. Direct mapping with a block size of one word.
2. Direct mapping with a block size of eight words.
3. Set-associative mapping with a set size of four words.
3.4 Derive an expression for the hit ratio of a direct mapped cache
assuming there is an equal likelihood of any location in the main
memory being accessed (in practice this assumption is not true). Repeat
for a two-way set-associative mapped cache. Determine the size of
memory at which the direct mapped cache has a hit ratio within 10 per
cent of the set-associative cache.
3.5 Design the logic to implement the least recently used replacement
algorithm for four blocks using a register stack.
3.6 Design the logic to implement the least recently used replacement
algorithm for four blocks using the reference matrix method.
3.7 Solve the equation given in Section 3.5:

    dn/dt = (a + bn)(s - n)/(a + ns)

for n, where n locations are filled in the cache, s is the size of the cache,
and a and b are constants.
3.8 Determine the conditions in which a write-through policy creates
more misses than a simple write-back policy, given that the hit ratio is the
same in both cases.
3.9 Determine the conditions in which a write-through policy with no
fetch on write creates more misses than a write-through policy with fetch
on write, given that fetch on write creates a 10 per cent higher hit ratio.
3.10 Determine the average access time in a computer system employing
a cache, given that the main memory access time is 125 ns, the cache
access time is 30 ns and the hit ratio is 85 per cent. The write-through
policy is used and 20 per cent of memory requests are write requests.
3.11 Repeat Problem 3.10 assuming a write-back policy is used, and
the block size is sixteen words fully interleaved.
3.12 Using aging counters to implement the least recently used algorithm,
as described in Section 3.4.4, derive the numbers held in the counters
after each of the following pages has been referenced:

2, 6, 9, 7, 2, 3, 2, 9, 6, 2, 7, 4

given that the cache holds four pages.
3.13 Show how a reference matrix, as described in Section 3.4.4, can be
used to implement the least recently used algorithm with the sequence:

2, 6, 9, 7, 2, 3, 2, 9, 6, 2, 7, 4

given that the cache holds four pages.
3.14 A cache in a system with virtual memory is addressed with a real
address. Both the real addresses and virtual addresses have thirty-two
bits and the page size is 512 words. The set size is two. Determine the
division of fields in the address to achieve full overlap between the page
translation and set selection. Supposing the cache can hold only two
pages, give a design showing the components and address formats.
3.15 A disk cache is introduced into a system and the access time
reduces from 20 ms to 3 ms. What is the access time of the disk cache,
given that the hit ratio is 70 per cent?
3.16 Work through all combinations of actions that can occur in the
directory method described in Section 3.8, drawing a flow diagram and
giving the values of the bits in the directories.
3.17 Choose a real computer system or processor with both a cache and
virtual memory and identify those methods described in Chapters 2 and 3
which have been employed. Describe how the methods have been imple-
mented (block diagram, etc.) and comment on the design choices made.

CHAPTER 4

Pipelined systems
Overlap and the associated concept, pipelining, are methods which can be used to
increase the speed of operation of the central processor. They are often applied to
the internal design of high speed computers, including advanced microprocessors, as
a type of multiprocessing. In this chapter, we will describe how pipelining is applied
to instruction processing and include some of the methods of designing pipelines.
4.1 Overlap and pipelining
4.1.1 Technique
Overlap and pipelining really refer to the same technique, in which a task or operation
is divided into a number of subtasks that need to be performed in sequence. Each
subtask is performed by its own logical unit, rather than by a single unit which
performs all the subtasks. The units are connected together in a serial fashion with
the output of one connecting to the input of the next, and all the units operate
simultaneously. While one unit is performing a subtask on the ith task, the preceding
unit in the chain is performing a different subtask on the (i+1)th task, as shown in
Figure 4.1.
The mechanism can be compared to a conveyor belt assembly line in a factory, in
which products are in various stages of completion. Each product is assembled in
stages as it passes along the assembly line. Similarly, in overlap/pipelining, a task is
presented to the first unit. Once the first subtask of this task is completed, the results
are presented to the second unit and another task can be presented to the first unit.
Results from one subtask are passed to the next unit as required and a task is
completed when the subtasks have been processed by all the units.
Suppose each unit in the pipeline has the same operating time to complete a
subtask, and that the first task is completed and a series of tasks is presented. The
time to perform one complete task is the same as the time for one unit to perform
one subtask of the task, rather than the summation of all the unit times. Ideally, each
subtask should take the same time, but if this is not the case, the overall processing
time will be that of the slowest unit, with the faster units being delayed. It may be
advantageous to equalize stage operating times with the insertion of extra delays.
We will pursue this technique later.

Figure 4.1 Pipeline processing (Tij = jth subtask in the ith task)
The term pipelining is often used to describe a system design for achieving a
specific computation by splitting the computation into a series of steps, whereas the
term overlap is often used to describe a system design with two or more clearly
distinct functions performed simultaneously. For example, a floating point arithmetic
operation can be divided into a number of distinct pipelined suboperations, which
must be performed in sequence to obtain the final floating point result. Conversely, a
computer system might perform central processor functions and input/output func-
tions with separate processors - a central processor and an input/output processor -
operating at the same time. The central processor and input/output processor opera-
tions are overlapped.
4.1.2 Pipeline data transfer
Two methods of implementing the data transfer in a pipeline can be identified:
1. Asynchronous method.
2. Synchronous method.
Figure 4.2 Transfer of information between units in a pipeline (a) Asynchronous
method (b) Synchronous method
These are shown in Figure 4.2. In the asynchronous method, a pair of "handshaking"
signals is used between each unit and the next unit - a ready signal and an
acknowledge signal. The ready signal informs the next unit that the sending unit has finished the
present operation and is ready to pass the task and any results onwards. The
acknowledge signal is returned when the receiving unit has accepted the task and
results. In the synchronous method, one timing signal causes all outputs of units to
be transferred to the succeeding units. The timing signal occurs at fixed intervals,
taking into account the slowest unit.
The asynchronous method provides the greatest flexibility in stage operating
times and naturally should make the pipeline operate at its fastest, limited as always
by the slowest unit. Though unlikely in most pipeline designs, the asynchronous
method would allow stages to alter their operating times with different input operands.
The asynchronous method also lends itself to the use of variable length first-in first-
out buffers between stages, to smooth the flow of results from one stage to the next.
However, most constructed instruction and arithmetic pipelines use the synchronous
method. An example of a pipeline that might use asynchronous handshaking is in
dataflow systems, where nodal instructions are only executed when all their operands
have been received (see Chapter 10). Other examples include the pipeline structures
formed with transputers, as described in Chapter 9.
Instruction and arithmetic pipelines almost always use the synchronous method, to
reduce logic timing and implementation problems. There is a staging latch between
each unit, and the clock signal activates all the staging latches simultaneously, as
shown in Figure 4.3.

Figure 4.3 Pipeline with staging latches
Pipelines could have been designed without staging latches between pipeline
stages and without a synchronizing clock signal - pipeline stages could produce
their outputs after natural logic delays, results could percolate through the pipeline
from one stage to the next and the final output could be sampled at the same regular
frequency as that at which new pipeline inputs are applied. This type of pipeline is
called a maximum-rate pipeline, as it should result in the maximum speed of
operation. Such pipelines are difficult to design because logic delays are not known
exactly - the delays vary between devices and depend upon the device inter-
connections. Testing such pipelines would be a distinct challenge. However, Cray
computers do not use staging latches in their pipelines; instead, path delays are
equalized.
4.1.3 Performance and cost
Pipelining is present in virtually all computer systems, including microprocessors. It
is a form of parallel computation; at any instant more than one task is being
performed in parallel (simultaneously). Pipelining is therefore done to increase the
speed of operation of the system, although as well as potentially increased speed, it
has the advantage of requiring little more logic than a non-pipelined solution in
many applications, and sometimes less logic than a high speed non-pipelined
solution. An alternative parallel implementation using n replicated units is shown in
Figure 4.4. Each unit performs the complete task. The system achieves an equivalent
increased speed of operation by applying n tasks simultaneously, one to each of the
n units, and producing n results n cycles later. However, complete replication
requires much more logic. As circuit densities increase and logic gate costs reduce,
complete replication becomes attractive. Replicated parallel systems will be described
in later chapters. We can make the general comment that pipelining is much more
economical than replication of complete units.

Figure 4.4 Replicated units (each unit performs a task in n cycles)
We see from Figure 4.1 that there is a staircase characteristic at the beginning of
pipelining; there is also a staircase characteristic at the end of a defined number of,
tasks. If s tasks are presented to an n-stage pipeline, it takes n clock periods before
the first task has been completed, and then another s - 1 clock periods before all
the tasks have been completed. Hence, the number of clock periods necessary is
given by n + (s — 1). Suppose a single, homogeneous non-pipelined unit with
equivalent function can perform s tasks in sn clock periods. Then the speed-up
available in a pipeline can be given by:
    Speed-up = sn/(n + (s - 1))
The potential maximum speed-up is n, though this would only be achieved for an
infinite stream of tasks and no hold-ups in the pipeline. The assumption that a single
homogeneous unit would take as long as the pipelined system to process one task is
also not true. Sometimes, a homogeneous system could be designed to operate faster
than the pipelined version.
There is a certain amount of inefficiency in that only in the steady state of a
continuous submission of tasks are all the units operating. Some units are not busy
during start-up and ending periods. We can describe the efficiency as:
    Efficiency = (sum of ti)/(n x (overall operating time)) = s/(n + (s - 1)) = Speed-up/n
where ti is the time unit i operates. Speed-up and efficiency can be used to characterize
pipelines.
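The two expressions are easily evaluated for particular values of n and s. The following small C program is a sketch only; the five-stage pipeline and the stream of 100 tasks are arbitrary illustrative figures.

#include <stdio.h>

/* Evaluates the speed-up and efficiency expressions above for an n-stage
 * pipeline processing s tasks. */
int main(void)
{
    double n = 5.0, s = 100.0;                  /* illustrative values        */
    double speedup    = (s * n) / (n + (s - 1.0));
    double efficiency = speedup / n;            /* equals s/(n + (s - 1))     */
    printf("speed-up = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}

For a long stream of tasks the speed-up approaches n, as the text notes.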
Pipelining can be applied to various subunits in a traditional uniprocessor com-
puter and to the overall operation. First, we will consider pipelining applied to
overall instruction processing, and then we shall consider how the arithmetic
operations within the execution phase of an instruction can be pipelined. Staging
latches are assumed to be present in the following.
4.2 Instruction overlap and pipelines
4.2.1 Instruction fetch/execute overlap
The fetch and execute cycles of a processor are often overlapped. Instruction
processing requires each instruction to be fetched from memory, decoded, and then
executed, in this sequence. In the first instance, we shall assume one fetch cycle
fetching a complete instruction and requiring one execute cycle, and no further
decomposition.
This technique requires two separate units, a fetch unit and an execute unit,
which are connected together as shown in Figure 4.5(a). Both units have access to
the main memory, the fetch unit to access instructions and the execute unit to fetch
operands and to store the result if either or both of these actions are necessary.
Processor registers, including the program counter, are accessible by the units if
necessary. Some dedicated processor registers might be contained within either unit,
depending upon the design.
The fetch unit proceeds to fetch the first instruction. Once this operation has been
completed, the instruction is passed to the execute unit which decodes the instruction
and proceeds to execute it. While this is taking place, the fetch unit fetches the next
instruction. The process is continued with the fetch unit fetching the ith instruction
while the execute unit is executing the (i-1)th instruction, as shown in Figure
4.5(b).
The execute time is often variable and depends upon the instruction. With fixed
length instructions, the fetch time would generally be a constant. With variable
length multibyte/word instructions, the fetch time would be variable if the complete
instruction needed to be formed before the instruction was executed. Figure 4.5(c)
shows variable instruction fetch and execution times. In this figure, the ith fetch
and the (i-1)th execute operations always begin operating together, irrespective of
the longer operating time of the previous execute and fetch operations. The overall
processing time is given by:
    Processing time = the sum over i = 1 to s + 1 of max(T(Fi), T(Ei-1))

where T(Fi) = time of the ith fetch operation and T(Ei) = time of the ith execute
operation (T(Fs+1) and T(E0) are taken as zero when there are s instructions).

Figure 4.5 Fetch/execute overlap (a) Fetch/execute stages (b) Timing with equal
stage times (c) Timing with unequal stage times
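The expression can be evaluated directly when the individual stage times are known. The C sketch below uses made-up fetch and execute times for four instructions (the figures are assumptions for illustration, not timings from the text); fetch i is overlapped with execute i-1 and each step costs the longer of the two.

#include <stdio.h>

/* Evaluates the overlapped fetch/execute processing time expression. */
#define S 4                                   /* number of instructions */

static double max2(double a, double b) { return a > b ? a : b; }

int main(void)
{
    double tf[S] = { 100, 100, 150, 100 };    /* fetch times, ns   (assumed) */
    double te[S] = { 120, 200, 100, 180 };    /* execute times, ns (assumed) */
    double total = 0.0;

    for (int i = 0; i <= S; i++) {
        double f = (i < S) ? tf[i] : 0.0;     /* no fetch after the last      */
        double e = (i > 0) ? te[i - 1] : 0.0; /* no execute before the first  */
        total += max2(f, e);                  /* fetch i overlaps execute i-1 */
    }
    printf("overall processing time = %.0f ns\n", total);
    return 0;
}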
Clearly the execute unit may operate for a different time to the fetch unit. In
particular, it is likely to require more time for complicated instructions, and will
dominate the processing time. To reduce this effect, the execute unit could be split
into further separate units. A separate instruction decode unit could be provided
after the fetch unit, followed by an execute unit, as shown in Figure 4.6(a). This
scheme is known as three-level overlap. The decode unit is responsible for identifying
the instruction, including fetching any values from memory in order to compute the
effective operand address. However, it is not usually possible for the fetch unit to
fetch an instruction and a decode unit to fetch any operands required for the
previously fetched instruction at the same time, if the program and data are held in
the same memory, as only one unit can access the memory at any instant.
Figure 4.6 Fetch/decode/execute overlap (a) Fetch/decode/execute stages
(b) Fetching two instructions simultaneously (c) Ideal overlap with
interleaved memory
One method of overcoming this problem is to fetch more than one instruction at a
time using multiple memory modules or using true memory interleaving (Section
1.3.4, page 20). In Figure 4.6(b), the fetch unit fetches two instructions
simultaneously and then becomes free while the decode unit can access the memory.
However, none of the units is operating all of the time, and only two instructions
are processed in every three cycles. Figure 4.6(c) shows the ideal utilization using
two-way interleaved memory. The usage might be different for particular instruc-
tions. The fetch unit fetches an instruction and the decode unit fetches an operand
from memory if necessary in the same cycle. Instructions are processed at the
maximum rate of one per cycle. Clearly, memory contention will arise if both the
fetch unit and decode unit request the same memory module. Contention can be
reduced with a greater degree of interleaving. In one scheme, the fetch unit fetches
two instructions simultaneously and becomes free on every alternate cycle, but still
allows the system to process one instruction per cycle.
Further instruction processing decomposition can be made. For example, we
could have five stages:

Fetch instruction.
Decode instruction.
Fetch operand(s).
Execute operation (ADD, etc.).
Store result.
This is shown in Figure 4.7. As we divide the processing into more stages, we hope
to reduce and equalize the stage processing times. Of the five stages given, stage 1
always requires access to the memory. Stages 3 and 5 require access to the memory
if the operands and results (respectively) are held in memory. However, the instruc-
tions of many computer systems, particularly microprocessor systems (the 68000
being one exception), do not allow direct memory to memory operations, and
provide only memory to processor register, register to register and register to
memory operations, which forces the use of processor registers as temporary
holding locations. In such situations, stages 3 and 5 do not occur in the same
instruction, and only one, at most, needs access to memory.

Figure 4.7 Five-stage instruction pipeline (a) Units (b) Timing
Unfortunately, at any given time during the processing, stage 3 will be processing
instruction n and stage 5 will be processing instruction n-2, and both might require
access to memory. When it is not possible to guarantee only one stage requesting
use of individual memory modules, or any other shared resource, additional logic
must be incorporated to arbitrate between requests.
In fact, there are several different hardware and software conditions that might
lead to hesitation in the instruction pipeline. Overlap and pipelining assume that
there is a sequence of tasks to be performed in one order, with no interaction
between tasks other than passing the results of one unit on to the next unit.
However, although programs are written as a linear sequence, the execution of one
instruction will often depend upon the results of a previous instruction, and the
order of execution may be changed by branch instructions.
We can identify three major causes for breakdown or hesitation of an instruction
pipeline:

1. Branch instructions.
2. Data dependencies between instructions.
3. Conflict in hardware resources (memory, etc.).

We will consider these factors separately in the following sections. The term
"branch" instructions will be used to include "jump" instructions.
4.2.2 Branch instructions
Given no other mechanism, each branch instruction (and the other instructions that
follow) could be processed normally in an instruction pipeline. When the branch
instruction is completely executed, or at least when the condition can be tested, it
would be known which instruction to process next. If this instruction is not the next
instruction in the pipeline, all instructions in the pipeline are abandoned and the
pipeline cleared. The required instruction is fetched and must be processed through
all units in the same way as when the pipeline is first started, and we obtain a space-
time diagram such as that shown in Figure 4.8.
Typically, 10-20 per cent of instructions in a program are branch instructions and
these instructions could reduce the speed of operation significantly. For example, if
a five-stage pipeline operated at 100 ns steps, and an instruction which subsequently
cleared the pipeline at the end of its execution occurred every ten instructions, the
average instruction processing time of the ten instructions would be:

    (9 x 100 ns + 1 x 500 ns)/10 = 140 ns
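The same calculation can be parameterized by pipeline depth, step time and branch frequency, which makes it easy to see how the penalty grows with deeper pipelines. A minimal C sketch, using the figures assumed in the text:

#include <stdio.h>

/* Average time per instruction when one instruction in 'group' clears the
 * pipeline and forces a refill. */
int main(void)
{
    double stage = 100.0;     /* ns per pipeline step              */
    int    depth = 5;         /* number of pipeline stages         */
    double group = 10.0;      /* instructions per pipeline flush   */

    double flushed = depth * stage;                        /* refill cost   */
    double avg = ((group - 1.0) * stage + flushed) / group;
    printf("average instruction time = %.0f ns\n", avg);   /* prints 140    */
    return 0;
}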
The longer the pipeline, the greater the loss in speed due to conditional branch
instructions. Very few instruction pipelines have more than twenty stages. We have
ignored the start-up time of the pipeline, that is, the time to fill the pipeline initially
when the system starts executing instructions.
Figure 4.8 Effect of conditional branch instruction in a pipeline
Unconditional branch instructions always cause a change in the sequence and the
change is predictable and fixed, but can also affect the pipeline. The fetch unit
responsible for fetching instructions takes the value held in the program counter as
the address of the next instruction and the program counter is then incremented.
Therefore for normal sequential processing, the address of the next instruction is
available for the fetch unit immediately the program counter has been incremented,
and the fetch unit can keep fetching instructions irrespective of the execution of the
instructions. The address of the next instruction for unconditional branch instruc-
tions is held in the instruction, or in a memory or register location, or is computed
from the contents of addressed locations. If the address is held in the instruction, it
would be available after the decode operation, otherwise it would be available after
the operand fetch operation if the more complex effective address computations are
done then. In any event, the fetch unit does not have the information immediately
available and, given no other mechanism, would fetch the next instruction in
sequence.
The fetch and decode units could be combined. Then, the fetch/decode unit might
have obtained the next address during decoding. In a two-stage pipeline having an
instruction fetch/decode unit and an instruction execution unit, the address of the
next instruction after an unconditional instruction would be available after the fetch/
decode unit has acted upon the unconditional branch instruction. It is often assumed
that unconditional branch instructions do not cause a serious problem in pipelines.
This is not justified with complex effective addresses computed in stages.
Conditional branch instructions do not always cause a change in the sequence, or
even necessarily cause a change in the majority of instances, but this is dependent
upon the use of the branch instruction. Conditional branch instructions might typically
cause a change 40-60 per cent of the time, on average over a wide range of
applications, though in some circumstances the percentage could be much greater or
much less. Conditional branch instructions are often used in programs for:
1. Creating repetitive loops of instructions, terminating the loop when a specific
   condition occurs (loop counter = 0 or arithmetic computational result occurs).
2. Exiting a loop if an error condition or other exceptional condition occurs.

The branch usually occurs in 1, but in 2 it does not usually occur. The same
instruction might be used for both applications, say branch if positive. Alternatively,
different instructions or different conditions might be used, say branch if not zero
for a loop, and branch if zero for an error condition, dependent upon the program.
The use is not generally known by the system. The programmer could be guided in
the choice, given a particular pipeline design which makes a fixed selection after a
conditional branch. As with an unconditional branch instruction, even a fixed
selection is not possible in hardware if the effective address has not yet been
computed.
Strategies exist to reduce the number of times the pipeline breaks down due to
conditional branch instructions, using additional hardware, including:
1. Instruction buffers to fetch both possible instructions.
2. Prediction logic to fetch the most likely next instruction after a branch
instruction.
3. Delayed branch instructions.
Instruction buffers
A first-in first-out instruction buffer is often used to hold instructions fetched from
the memory before passing them to the next stage of the instruction pipeline. The buffer
becomes part of the pipeline as additional delay stages, and extends the length of the
pipeline. The advantage of a first-in first-out buffer is that it smoothes the flow of
instructions into the instruction pipeline, especially when the memory is also
accessed by the operand fetch unit. It also enables multiple word instructions to be
formed. However, increasing the pipeline with the addition of buffers increases the
amount of information in the pipeline that must be discarded if the incorrect
instructions are fetched after a branch instruction. Most 16-/32-bit microprocessors
have a pipelined structure with buffers between stages. For example, the Intel 80286
and 80386 have a prefetch queue in the instruction fetch unit for instruction words
fetched from memory, and a decoded instruction queue in the subsequent decode
unit leading to the execute unit.
Figure 4.9 shows two separate first-in first-out instruction buffers to fetch both
possible instruction sequences after a conditional branch instruction. It is assumed
that both addresses are available immediately after fetching the branch instruction.
Conditional branch instructions cause both buffers to fill with instructions, assumed
from an interleaved memory. When the next address has been discovered, instruc-
tions are taken from the appropriate buffer and the contents of the other buffer are
discarded. The scheme is sometimes called multiple prefetching or branch bypassing.

Figure 4.9 Instruction buffers
A major disadvantage of multiple prefetching is the problem encountered when
more than one conditional branch instruction appears in the instruction stream. With
two sequential conditional branch instructions, there are four alternative paths; with
three instructions, eight alternative paths; and, in general, there are 2^n alternative
paths when there are n conditional branch instructions. The number of possible
conditional branch instructions to be considered will be given by the number of
stages in the pipeline. Of course it is unreasonable to provide a buffer for all
alternative paths except for small n.
A technique known as branch folding (Lilja, 1988) can be used with a two-stage
instruction pipeline having an instruction fetch/decode unit (an I unit) and an
instruction execute unit (an E unit). An instruction cache-type buffer is inserted
between the I and the E units. Instructions are fetched by the I unit, recognized, and
the decoded instructions placed in the instruction buffer, together with the address
of the next instruction in an associated field for non-branch instructions. If an
unconditional branch instruction is decoded, the next address field of the previous
(non-branch) instruction fetched is altered to correspond to the new target location,
i.e. the unconditional branch instruction folds into the previous instruction. Condi-
tional branch instructions have two next address fields in the buffer, one for each of
the next addresses. The execution unit selects one of the next address paths and the
other address is carried through the pipeline with the instruction until the instruction
has been fully executed and the branch can be resolved. At that time, either the
fetched path is used and the next address carried with the instruction is discarded, or
the path of the next address carried with the instruction is used and the pipeline is
cleared.
Prediction logic and branch history
There are various methods of predicting the next address, mainly based upon
expected repetitive usage of the branch instruction, though some methods are based
upon expected non-repetitive usage.
To make a prediction based upon repetitive historical usage, an initial prediction
is made when the branch instruction is first encountered. When the true branch
instruction target address has been discovered, it is stored in a high speed memory
look-up table, and used if the same instruction at the same address is encountered
again. Subsequently, the stored target address will always be the address used on the
last occasion. A stored bit might be included with each table entry to indicate that a
previous prediction has been made.
There are variations in the prediction strategy; for example, rather than update
the predicted address when it was found to be wrong, allow it to remain until the
next occasion and change it then if it is still found to be wrong. This algorithm
requires an additional bit stored with each entry to indicate that the previous
prediction was correct, but might produce better results.
One form of prediction table is a branch history table, which is implemented in a
similar fashion to a cache. A direct mapping scheme could be used, in which target
addresses are stored in locations whose addresses are the same as the least significant
bits of the addresses of the instructions, together with the most significant address
bits. We note that, as in the directly mapped data/instruction cache, all branch
instructions stored in the cache must have addresses with different least significant
bits. Alternatively, a fully associative or a set-associative cache-type table could be
employed, as shown in Figure 4.10, when a replacement algorithm, as used in
caches, is required. In any event, only a limited number of instruction addresses can
be stored.

Figure 4.10 Instruction pipeline with branch history table (prediction logic not
shown - sequential instructions taken until correct target address loaded)
The branch history table can be designed to be accessed after the decode
operation, rather than immediately the instruction is fetched. Then the target address
will often be known and hence it is only necessary to store a bit to indicate whether
the branch should be taken, rather than the full target address, and the table
requires fewer bits. This type of table is called a decode history table (Stone, 1987),
but it has the disadvantage that the next instruction will have been fetched before
the table has been interrogated and so this instruction may have to be discarded.
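The direct mapped branch history table described above can be sketched in a few lines. The entry format, table size and the prediction/update routines below are assumptions chosen for illustration only; the table simply remembers the target used on the last occasion and predicts it again.

#include <stdio.h>

/* Sketch of a direct mapped branch history table. */
#define ENTRIES 64

struct bht_entry {
    unsigned long branch_addr;   /* full address of the branch instruction */
    unsigned long target_addr;   /* target used on the last occasion       */
    int           valid;
};
static struct bht_entry bht[ENTRIES];

/* Predict the next fetch address for the branch at branch_addr. */
static unsigned long predict(unsigned long branch_addr)
{
    struct bht_entry *e = &bht[branch_addr % ENTRIES];
    if (e->valid && e->branch_addr == branch_addr)
        return e->target_addr;            /* assume the branch repeats       */
    return branch_addr + 1;               /* no history: fetch sequentially  */
}

/* Record the outcome once the branch has actually been resolved. */
static void update(unsigned long branch_addr, unsigned long actual_target)
{
    struct bht_entry *e = &bht[branch_addr % ENTRIES];
    e->branch_addr = branch_addr;
    e->target_addr = actual_target;
    e->valid = 1;
}

int main(void)
{
    update(0x100, 0x80);                  /* loop branch taken back to 0x80  */
    printf("prediction for 0x100: 0x%lx\n", predict(0x100));
    printf("prediction for 0x200: 0x%lx\n", predict(0x200));
    return 0;
}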
Delayed branch instructions
In the delayed branch instruction scheme, branch instructions operate such that the
sequence of execution is not altered immediately after the branch instruction is
executed (if at all) but after one or more subsequent non-branch instructions,
depending upon the design. The subsequent instructions are executed irrespective of
the branch outcome. For example, in a two-stage pipeline, a branch instruction
might be designed to have an effect after the next instruction, so that this instruction
need not be discarded in the pipeline. For an n-stage pipeline, the branch instruction
could be designed to have an effect after n - 1 subsequent instructions, as shown in
Figure 4.11. Clearly, the instructions after the branch must not affect the branch
outcome, and must be such that the computation is still correct when they are placed
after the branch instruction. It becomes more difficult for the programmer
or compiler to find an increasing number of suitable independent instructions to
place after a branch instruction. Typically, one instruction can be rearranged to be
after the branch instruction in about 70 per cent of occasions, but additional
instructions are harder to find.

Figure 4.11 Delayed branch technique (a) Two-stage pipeline (b) n-stage pipeline
A one-instruction delayed branch technique has been used extensively in micro-
programmed systems at the microinstruction level because microinstructions can
often be executed in one cycle and hence can use a two-stage microinstruction fetch/
execute pipeline. The one-stage delayed branch instruction has also found application
in RISCs (reduced instruction set computers) which have simple instructions execut-
able in one cycle (see Chapter 5).
A number of refinements have been suggested and implemented to improve the
performance of delayed branch instructions. For example, in the nullification method
for loops, the instruction following a conditional branch instruction at the end of the
loop is made to be the instruction at the top of the loop. When the loop is
terminated, this instruction is converted into a no-op, an instruction with no operation
and achieving nothing except an instruction delay.
4.2.3 Data dependencies
Suppose we wish to compute the value of C = 2 x (A + contents of memory location
100) with the program sequence given in 8086 code as:

ADD AX,[100]   ;Add memory location 100 contents
               ;to AX register
SAL AX,1       ;Shift AX one place left
MOV CX,AX      ;Copy AX into CX register

and these instructions are in a five-stage pipeline, as shown in Figure 4.12. (The
8086 does not have the pipeline shown.) It would be incorrect to begin shifting AX
before the add instruction completes and, similarly, it would be incorrect to load CX before the
shift operation. Hence, in this program each instruction must produce its result
before the next instruction can begin.
Should the programmer know that a pipeline organization exists in the computer
used, and also the operation of this pipeline, it may be possible to rewrite some
programs to separate data dependencies. Otherwise, when a data dependency does
occur, there are two possible strategies:
1. Detect the data dependencies at the input of the pipeline and then hold up the
pipeline completely until the dependencies have been resolved (by instructions
already in the pipeline being fully executed).
2. Allow all instructions to be fetched into the pipeline but only allow independent
instructions to proceed to their completion, and delay instructions which are
dependent upon other, not yet executed, instructions until these instructions
are executed.
Figure 4.12 Five-stage pipeline with data dependencies and memory contention
(the fetch of the MOV instruction cannot take place while instruction 1 fetches [100])
Data dependencies can be detected by considering read and write operations on
specific locations accessible by the instructions (including all operations, such as
arithmetic operations, which alter the contents of the locations).
In terms of two operations - read and write - operating upon a single location, a
write-after-write hazard exists if there are two write operations upon a location such
that the second write operation in the pipeline completes before the first. Hence the
written value will be altered by the first write operation when it completes. A read-
after-write hazard exists if a read operation occurs before a previous write operation
has been completed, and hence the read operation would obtain an incorrect value (a
value not yet updated). A write-after-read hazard exists when a write operation
occurs before a previous read operation has had time to complete, and again the read
operation would obtain an incorrect value (a prematurely updated value). Read-
after-read hazards, in which read operations occur out of order, do not normally
cause incorrect results. Figure 4.13 illustrates some of these hazards in terms of an
instruction pipeline in which read and write operations are done at various stages.

Figure 4.13 Read/write hazards (a) Write-after-write (b) Read-after-write
(c) Write-after-read

An instruction can usually only alter one location, but might read two locations.
For a two-address instruction format, one of the locations read will be the same as
the location altered. Condition code flags must also be included in the hazard
detection mechanism. The number of hazard conditions to be checked becomes quite
large for a long pipeline having many partially complete instructions.
We can identify a potential hazard between instruction i and instruction j when
one of the following conditions occurs:
For write-after-write    W(i) ∩ W(j) ≠ 0
For read-after-write     W(i) ∩ R(j) ≠ 0
For write-after-read     R(i) ∩ W(j) ≠ 0

W(i) indicates the set of locations altered by instruction i; R(i) indicates the set of
locations read by instruction i, and 0 indicates an empty set. For no hazard, neither
of the sets on the left hand side of each condition includes any of the same elements.
Clearly these conditions can cover all possible read/write arrangements in the
pipeline. It would be better to limit the detection only to the situations that are
possible.
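If the read and write sets are represented as bit vectors (one bit per register or location), the three conditions reduce to bitwise AND operations. The sketch below is an illustration only; the mapping of AX to bit 0 and the instruction encoding are assumptions, not part of any real hazard detection hardware.

#include <stdio.h>

/* Hazard detection using bit masks as the read and write sets. */
typedef unsigned int regset;                /* bit r set => register r used */

struct instr { regset reads, writes; };

/* Non-zero if instruction j may not overlap or overtake instruction i. */
static int hazard(struct instr i, struct instr j)
{
    int waw = (i.writes & j.writes) != 0;   /* write-after-write */
    int raw = (i.writes & j.reads)  != 0;   /* read-after-write  */
    int war = (i.reads  & j.writes) != 0;   /* write-after-read  */
    return waw | raw | war;
}

int main(void)
{
    /* AX is arbitrarily assigned bit 0 for this example. */
    struct instr add = { .reads = 1u << 0, .writes = 1u << 0 }; /* ADD AX,[100] */
    struct instr sal = { .reads = 1u << 0, .writes = 1u << 0 }; /* SAL AX,1     */
    printf("hazard between ADD and SAL: %d\n", hazard(add, sal));
    return 0;
}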
Detecting the hazard at the beginning of the pipeline and stopping the pipeline
completely until the hazard has passed is obviously much simpler than only stopping
the specific instruction creating the hazard from entering the pipeline, because a
satisfactory sequence of operations must be maintained to obtain the desired result
(though not necessarily the same order as in the program). Hazard detection must also
include any instructions held up at the entrance to the pipeline.
A relatively simple method of maintaining a proper sequence of read/write operations is to associate a 1-bit tag with each operand register. This tag indicates whether a valid result exists in the register, say 0 for not valid and 1 for valid. A fetched instruction which will write to the register examines the tag and, if the tag is 1, it sets the tag to 0 to show that the value will be changed. When the instruction has produced the value, it loads the register and sets the tag bit to 1, letting other instructions have access to the register. Any instruction fetched before the operand tags have been set has to wait. A form of this scoreboard technique is used on the Motorola MC88100 RISC microprocessor (Motorola, 1988a). The MC88100 also has delayed branch instructions.
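A minimal software sketch of this valid-bit idea is given below. The register names and the three small functions are purely illustrative and do not correspond to the MC88100 implementation; they simply show the clear-on-issue, set-on-writeback behaviour described above.

    valid = {"AX": 1, "BX": 1}       # 1 = register holds a valid result

    def issue_write(reg):
        valid[reg] = 0               # an instruction that will alter reg clears its tag

    def write_back(reg, value, regs):
        regs[reg] = value            # result produced: load the register...
        valid[reg] = 1               # ...and set the tag, releasing waiting readers

    def can_read(reg):
        return valid[reg] == 1       # a reading instruction must wait until the tag is 1

    regs = {"AX": 0, "BX": 0}
    issue_write("AX")                # e.g. ADD AX,BX enters the pipeline
    print(can_read("AX"))            # False - a following MOV BX,AX must wait
    write_back("AX", 42, regs)
    print(can_read("AX"))            # True - the dependent instruction may now proceed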
Figure 4.14 shows the mechanism in a five-stage pipeline having registers read only in stage 3 and altered only in stage 5. In this case, it is sufficient to reset the valid bit of the register to be altered during stage 3 of a write instruction, in preparation for setting it in stage 5. Figure 4.14 shows a write instruction followed by two read instructions. Both read instructions must examine the valid bit of their source registers prior to reading the contents of the registers, and will hesitate if they cannot proceed.

Figure 4.14 Register read/write hazard detection using valid bits (IF, instruction fetch; ID, instruction decode; RD, read operand; EX, execute phase; WR, write operand)

Notice that the five-stage pipeline described only has read-after-write hazards; write-after-read and write-after-write hazards do not occur if instruction sequencing is maintained, i.e. if instructions are executed in the order in the program, and if the pipeline is "stalled" by hazards, as in Figure 4.12. A somewhat more complicated scoreboard technique was used in the CDC 6600. The CDC 6600 scoreboard kept a record of the availability of operands and functional units for instructions as they were being processed, to allow instructions to proceed as soon as possible and out of sequence if necessary. The interested reader should consult Thornton (1970).
4.2.4 Internal forwarding
The term forwarding refers to the technique of passing the result of one instruction to another instruction via a processor register without storing the result in a memory location. Forwarding would generally increase the speed of operation, as access to processor operand registers is normally faster than access to memory locations. Three types of forwarding can be identified:

1. Store-fetch forwarding.
2. Fetch-fetch forwarding.
3. Store-store overwriting.
Store and fetch refer to writing operands into memory and reading operands from
memory respectively. In each case, unnecessary memory references can be eliminated.
In store-fetch forwarding, fetching an operand which has been stored and hence is
also held in a processor operand register is eliminated by taking the operand directly
from the processor operand register. For example, the code:
MOV [200],AX Copy contents of AX register
into memory location 200
ADD BX, [200] ;Add memory contents 200 to register BX
could be reduced to:
    MOV [200],AX
    ADD BX,AX
which eliminates one memory reference (in the final ADD instruction).
In fetch-fetch forwarding, multiple accesses to the same memory location are
eliminated by making all accesses to the operand in a processor operand register
once it has been read into the register. For example:
    MOV AX,[200]
    MOV BX,[200]
could be reduced to:

    MOV AX,[200]
    MOV BX,AX
In store-store overwriting, one or more write operations without intermediate operations on the stored information can be eliminated. For example:

    MOV [200],AX
    MOV [200],BX

could be reduced to:

    MOV [200],BX
though the last simplification is unlikely in most programs. Rearrangements could
be done directly by the programmer when necessary, or done automatically by the
system hardware after it detects the forwarding option, using internal forwarding.
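The rearrangement can be illustrated at the code level. The sketch below is illustrative only (the tuple encoding of instructions is invented for the example, and invalidation when a register is later overwritten is omitted): it applies store-fetch and fetch-fetch forwarding to a short instruction list, which is essentially what a programmer or compiler would do by hand, whereas internal forwarding achieves the same effect in hardware, invisibly to the programmer.

    def forward(code):
        # code: list of (op, dest, src); an int operand denotes a memory address
        reg_copy = {}                        # address -> register holding the same value
        out = []
        for op, dest, src in code:
            if isinstance(src, int) and src in reg_copy:
                src = reg_copy[src]          # store-fetch / fetch-fetch forwarding
            if op == "MOV" and isinstance(dest, int):
                reg_copy[dest] = src         # a store: [dest] now also held in src
            elif op == "MOV" and isinstance(src, int):
                reg_copy[src] = dest         # a fetch: [src] now also held in dest
            out.append((op, dest, src))
        return out

    code = [("MOV", 200, "AX"),              # MOV [200],AX
            ("ADD", "BX", 200),              # ADD BX,[200]
            ("MOV", "CX", 200)]              # MOV CX,[200]
    print(forward(code))
    # [('MOV', 200, 'AX'), ('ADD', 'BX', 'AX'), ('MOV', 'CX', 'AX')]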
Internal forwarding is hardware forwarding implemented by processor registers
not visible to the programmer. The most commonly quoted example of this type of
internal forwarding is in the IBM 360 Model 91, as reported by Tomasulo (1967).
The IBM 360 Model 91 is now only of historical interest and was rapidly super-
seded by other models with caches (the Model 85 and the Model 195). In internal
forwarding, the results generated by an arithmetic unit are passed directly to the
input of an arithmetic unit, by matching the destination address carried with the
result with the addresses of the units available. Operand pairs are held in buffers at
the input of the units. Operations are only executed when a unit receives a full
complement of operands, and then new results, which may become new source
operands, are generated. It may be that instructions are not executed in the sequence
in which they are held in memory, though the final result will be the same. The IBM
360 Model 91 internal forwarding mechanism is similar to dataflow computing
described in Chapter 10 and predates the implementation of the latter. A cache could
be regarded as a forwarding scheme which short-circuits the main memory. The
complicated forwarding scheme of the Model 91 may not be justified if a cache is
present. RISCs often use relatively simple internal forwarding (see Chapter 5).
4.2.5 Multistreaming
We have assumed that the instructions being processed are from one program and
that they depend upon the immediately preceding instructions. However, many large
computer systems operate in a multiuser environment, switching from one user to
another at intervals. Such activities often have a deleterious effect on cache-based
systems, as instructions/data for a new program need to be brought into the cache to
replace the instructions/data of a previous program. Eventually, the instructions/data
of a replaced program will need to be reinstated in the cache.
In contrast, this process could be used to advantage in a pipeline, by interleaving
instructions of different programs in the pipeline and by executing one instruction
from each program in sequence. For example, if there are ten programs to be executed, every tenth instruction would be from the same program. In a ten-stage pipeline, each instruction would be completely independent of the other instructions
in the pipeline and no hazard detection for conditional jump instructions or data
dependencies would be necessary. The instructions of the ten programs would
execute at the maximum pipeline rate of one instruction per cycle. This technique
necessitates a complete set of processor registers for each program, i.e. for ten programs, ten sets of operand registers, ten program counters and ten memory buffers, if used; tags are also needed in the instruction to identify the program. In the past, such duplication of registers might have been difficult to justify, but now it may be a reasonable choice, given that the maximum rate is obtained under the
special conditions of several time-shared programs and no complicated hazard
detection logic is necessary. The scheme may be difficult to expand to more time-
shared programs than the number of stages in the pipeline.
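The interleaving itself is trivial, as the short sketch below illustrates (the instruction streams shown are placeholders, and a real machine would hold a separate register set and program counter for each program): one instruction is taken from each program in turn, so with as many programs as pipeline stages no two instructions in the pipeline ever belong to the same program.

    from itertools import zip_longest

    programs = [["A1", "A2", "A3"], ["B1", "B2", "B3"], ["C1", "C2", "C3"]]

    issue_order = [instr
                   for rnd in zip_longest(*programs)   # one instruction per program per round
                   for instr in rnd if instr is not None]
    print(issue_order)   # ['A1', 'B1', 'C1', 'A2', 'B2', 'C2', 'A3', 'B3', 'C3']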
4.3 Arithmetic processing pipelines
4.3.1 General
In the previous sections we considered the arithmetic units as single entities. In fact,
arithmetic operations of the execute phase could be decomposed further into several
separate operations. Floating point arithmetic, in particular, can be naturally decom-
posed into several sequential operations. It is also possible to pipeline fixed point
operations to advantage, especially if several operations are expected in sequence.
We will briefly consider how arithmetic operations might be pipelined in the
following sections.
Note that an arithmetic pipeline designed to perform a particular arithmetic
operation, say floating point addition, could only be supplied with continuous tasks
in an instruction pipeline if a series of floating point instructions were to be
executed. Such situations arise in the processing of the elements of vectors, and hence pipelined arithmetic units find particular application in computers which can operate upon vectors and which have machine instructions specifying vector operations. Such computers are called vector computers, and the processors within them are vector processors. For general purpose (scalar) processors only capable of operating upon individual data elements, pipelined arithmetic units may not be kept fully occupied. Pipelined arithmetic units in scalar processors should be used for the following reasons:

1. Increased performance should a series of similar computations be encountered.
2. Reduced logic compared to non-pipelined designs in some cases.
3. Multifunction units might be possible.
Multifunction arithmetic pipelines can be designed with internal paths that can be
reconfigured statically to produce different overall arithmetic functions, or can be
reconfigured dynamically to produce different arithmetic functions on successive
input operands. In a dynamic pipeline, different functions are associated with sets of
operands as they are applied to the entrance of the pipeline. The pipeline does not
need to be cleared of existing partial results when a different function is selected
and the execution of previous functions can continue unaffected. Multifunction
pipelines have not been used much in practice because of the complicated logic
required, but they should increase the performance of a single pipeline in scalar
processors. Multifunction pipelines do not seem to have an advantage in vector
computers, as these computers often perform the same operation on a series of
elements fed into the pipeline.
4.3.2 Fixed-point arithmetic pipelines
The conventional method of adding two integers (fixed point numbers) is to use a parallel adder consisting of cascaded full adder circuits. Suppose the two numbers to be added together have digits A_{n-1} … A_0 and B_{n-1} … B_0. There are n full adders in the parallel adder. Each full adder adds two binary digits, A_i and B_i, together with a "carry-in" from the previous addition, C_{i-1}, to produce a sum digit, S_i, and a "carry-out" digit, C_i, as shown in Figure 4.15(a). A pipelined version of the parallel adder is shown in Figure 4.15(b). Here, the n full adders have been separated into different pipeline stages.
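The structure of Figure 4.15 can be expressed as a short sketch (an illustration only, not part of the original text): each loop iteration below corresponds to one full adder, and hence to one stage of the pipelined version, with the carry passed from stage to stage.

    def full_adder(a, b, c_in):
        s = a ^ b ^ c_in
        c_out = (a & b) | (a & c_in) | (b & c_in)
        return s, c_out

    def ripple_add(A, B, carry_in=0):
        # A and B are lists of bits, least significant first; stage i adds A[i] and B[i]
        carry, total = carry_in, []
        for a, b in zip(A, B):
            s, carry = full_adder(a, b, carry)
            total.append(s)
        return total, carry

    print(ripple_add([1, 0, 1], [1, 1, 0]))   # 5 + 3 = 8 -> ([0, 0, 0], 1)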
A multifunction version of the parallel adder pipeline giving both addition and subtraction can be achieved easily. Subtraction, A − B, can be performed in a parallel adder by complementing the B digits and setting the carry-in digit to the first stage to 1 (rather than to 0 for addition). Hence, one of each pair of digits passed on to the adjacent stage needs to be complemented, and this operation can be incorporated into the pipeline stage. The adder/subtractor pipeline could be static. In this case, the complementing operation occurs on the appropriate bits of each pair of operands applied to the pipeline as they pass through the pipeline. Alternatively, the adder/subtractor pipeline could be dynamic, and the complementing operation performed upon specific operands. These operands could be identified by attaching a tag to them; the tag is passed from one stage to the next with the operands. Additional functions could be incorporated, for example, single operand increment and decrement. Multiplication and division might be better performed in a separate unit, though it is possible to design a multifunction pipeline incorporating all of the basic arithmetic operations.
The previous addition pipeline is based upon a parallel adder in which the carry signal "ripples" from one pipeline stage to another. In a non-pipelined version, the speed of operation is limited by the time it takes for the carry to ripple through all the full adders. (This is also true in the pipelined version, but other additions can be started while the process takes place.) A well-known method of reducing ripple time is to predict the carry signals at each full adder by using carry-look-ahead logic rather than waiting for each to be generated by adjacent full adders. Such prediction logic can also be pipelined. The full details of carry-look-ahead adders can be found in Baer (1980).

Figure 4.15 Pipelined parallel adder (a) Parallel adder (b) Pipelined version
There are also various ways to perform multiplication. Most of these are suitable for arrangement as a pipeline as they involve a series of additions, each of which can be done in a pipeline stage. The conventional method to implement multiplication is a shift-and-add process using a parallel adder to successively add A to an accumulating sum when the appropriate bit of B is 1. Hence, one pipeline solution would be to unfold the iterative process and have n stages, each consisting of a parallel adder.
One technique applicable to multiplication is the carry-save technique. As an example, consider the multiplication of two 6-bit numbers:

    A                110101
    B                101011

                     110101
                    110101
                   000000
                  110101
                 000000
                110101

           100011100111
The partial products are divided into groups, with three partial products in each group. Therefore we have two groups in this example. The numbers in each group are added simultaneously, using one full adder for each triplet of bits in each group, without carry being passed from one stage to the next. All three inputs of the full adders are used. This process results in two numbers being generated for each group, namely a sum word and a carry word:
    Group 1      110101          Group 2      110101
                110101                        000000
               000000                        110101

    Sum 1       1011111          Sum 2       11100001
    Carry 1    01000000          Carry 2     00101000
Each carry word is moved one place left to give it the correct significance. The true sum of the three numbers in each case could be obtained by adding together the sum and carry words. The final product is the summation of Sum 1, Carry 1, Sum 2 and Carry 2. Taking three of these numbers, the carry-save process is repeated to produce Sum 3 and Carry 3, i.e.:
    Sum 1             1011111
    Carry 1          01000000
    Sum 2        11100001

    Sum 3        11100010111
    Carry 3      00010010000
The process is repeated taking Sum 3, Carry 3 and Carry 2 to produce Sum 4 and Carry 4:

    Sum 3        11100010111
    Carry 3      00010010000
    Carry 2      00101000

    Sum 4        11011000111
    Carry 4      01000100000
Finally, Sum 4 and Carry 4 are added together using a parallel adder:

    Sum 4         11011000111
    Carry 4       01000100000

    Final sum    100011100111
Each step can be implemented in one stage of a pipeline, as shown in Figure 4.16. The partial product bits can be generated directly from the logical AND of the corresponding A and B bits. The first partial product has the bits A_{n-1}B_0 … A_1B_0, A_0B_0. The second partial product has the bits A_{n-1}B_1 … A_1B_1, A_0B_1, etc.
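The whole scheme can be checked with a few lines of Python (an informal sketch, not the pipeline hardware): each call of carry_save below corresponds to one level of the carry-save reduction, although the order in which the partial results are combined differs slightly from the grouping used in the text; the final product is the same.

    def carry_save(x, y, z):
        s = x ^ y ^ z                              # sum word: no carry propagation
        c = ((x & y) | (x & z) | (y & z)) << 1     # carry word, moved one place left
        return s, c

    A, B = 0b110101, 0b101011
    partials = [(A << i) if (B >> i) & 1 else 0 for i in range(6)]   # AND-gate partial products
    while len(partials) > 2:
        s, c = carry_save(*partials[:3])           # reduce three numbers to two
        partials = [s, c] + partials[3:]
    print(bin(sum(partials)))                      # 0b100011100111, as in the worked example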
The multiplier using carry-save adders lends itself to become a feedback pipeline to save on components, as shown in Figure 4.17. Here, the carry-save adders are reused one or more times, depending upon the number of bits in the multiplier and on the organization. Note that the advantage of being able to submit new operands for multiplication on every cycle is now lost.
Another multiplication technique involves having a two-dimensional array of cells. Each cell performs a 3-bit full adder addition. There are several versions of the array multiplier, each of which passes on signals in different ways. The shift-and-add multiplier is in fact a form of an array multiplier when reconfigured for a pipeline. The reader is referred to Jump et al. (1978) for a study of array multipliers arranged for pipelining. Array multipliers are suitable for VLSI implementation.
4.3.3 Floating point arithmetic pipelines
Floating point arithmetic is particularly suitable for pipeline implementation as a
sequence of steps can be readily identified. It is perhaps the most commonly quoted example for pipeline implementation. Even in a non-pipelined computer system, floating point arithmetic would normally be computed as a series of steps (whereas fixed point arithmetic might be computed in one step).
Figure 4.16 6-bit × 6-bit carry-save multiplier
Each floating point number is represented by a mantissa and exponent, given by:

    number = mantissa × 2^exponent

where the base of the number system is 2. (The base could also be a power of 2; for example, it is occasionally 16.) The mantissa and exponent are stored as two numbers. The sign of the number is shown by a separate sign bit and the remaining mantissa is a positive number (i.e. the full mantissa is represented in the sign plus magnitude representation). A biased exponent representation is often used for the exponent such that the stored exponent is always positive, even when representing a negative exponent. In the biased exponent system, the stored exponent = actual exponent + bias. The bias is usually either 2^{n-1} or 2^{n-1} − 1, where there are n bits in the exponent. The biased exponent representation does not affect the basic floating point arithmetic algorithms but makes it easier to implement the comparison of exponents which is necessary in floating point addition (but not in floating point multiplication).

Figure 4.17 Carry-save adder with feedback
Numbers are also usually represented in a normalized form in which the most significant digit of the (positive) mantissa is made to be non-zero (i.e. 1, with a base of 2) and the exponent adjusted accordingly, to obtain the greatest possible precision of the number (the greatest number of significant digits in the mantissa). In fact, the most significant bit need not be stored in a base two system if it is always 1. The stored mantissa is normally a fraction, i.e. the binary point is to the immediate left of the stored mantissa, and the exponents are integers. The position of the binary point is immaterial to the algorithm.
The addition of two normalized floating point numbers, represented by the mantissa/exponent pairs m1, e1 and m2, e2, requires a number of sequential steps, for example:

1. Subtract exponents e1 and e2, and generate the difference e1 − e2.
2. Interchange mantissas m1 and m2 if e1 − e2 is negative and make the difference positive. Otherwise no action is performed in this step.
3. Shift mantissa m2 by e1 − e2 places right.
4. Add mantissas to produce the result mantissa, replacing m1.
5. Normalize the result as follows. If the mantissa is greater than 1, shift one place right and add 1 to the exponent. If the mantissa is less than 0.5, shift the mantissa left until the leftmost digit is 1 and subtract the number of shifts from the exponent. If the mantissa is 0, load the special zero pattern into the exponent. Otherwise do nothing. Check for underflow (number too small to be represented) or overflow (number too large to be represented) and in such cases generate the appropriate actions.
Some steps might be divided further, and any group of sequential steps in a pipeline can be formed into one step.
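The steps can be expressed compactly in software. The sketch below is illustrative only: it holds the normalized mantissa as a Python fraction between 0.5 and 1 rather than as a bit field, and omits the underflow/overflow checks, but otherwise follows the five steps directly.

    def fp_add(m1, e1, m2, e2):
        d = e1 - e2                          # step 1: subtract exponents
        if d < 0:                            # step 2: interchange so that e1 >= e2
            (m1, e1), (m2, e2), d = (m2, e2), (m1, e1), -d
        m2 = m2 / (2 ** d)                   # step 3: shift m2 right by e1 - e2 places
        m, e = m1 + m2, e1                   # step 4: add mantissas
        if m >= 1.0:                         # step 5: normalize the result
            m, e = m / 2, e + 1
        elif m != 0.0:
            while m < 0.5:
                m, e = m * 2, e - 1
        return m, e

    print(fp_add(0.75, 3, 0.5, 1))           # 6.0 + 1.0 = 7.0, i.e. (0.875, 3)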
Floating point multiplication is conceptually easier, having the steps:

1. Add exponents e1 and e2.
2. Multiply mantissas m1 and m2.
3. Normalize if necessary.
4. Round the mantissa to a single length result.
5. Renormalize if necessary (rounding may increase the mantissa by one digit, which necessitates renormalization).
However, the mantissa multiplication operation would typically be divided into two or more stages (perhaps iterative stages with feedback), which would make floating point multiplication a longer process than floating point addition. It is possible to combine floating point multiplication with addition, as the exponent addition of the floating point multiplication and the exponent subtraction of the floating point addition could both be performed with a parallel adder. Also, both operations require normalization.
A floating point multiply/divide unit can be designed as a feedback pipeline by internally feeding back partial product results until the final result is obtained. The general motive for designing feedback pipelines is a reduction in hardware, compared to a non-feedback pipeline. New inputs cannot be applied to a feedback pipeline (at least not when the feedback is to the input) until previous results have been generated and consumed, and hence this type of pipeline does not necessarily increase throughput; externally the unit may not be regarded as a pipeline. We will use the term linear pipeline to describe a pipeline without feedback paths.
4.4 Logical design of pipelines
4.4.1 Reservation tables
The reservation table is central to pipeline design. A reservation table is a two-dimensional diagram showing pipeline stages and their usage over time, i.e. a space–time diagram for the pipeline. Time is divided into equal time periods, normally equivalent to the clock periods in a synchronous pipeline. If a pipeline stage is used during a particular time period, an X is placed in the reservation table time slot. The reservation table is used to illustrate the operation of a pipeline and is also used in the design of pipeline control algorithms.
A reservation table of a five-stage linear pipeline is shown in Figure 4.18. In this particular case, each of the five stages operates for one time period, and in sequence. It is possible to have stages operate for more than one time period, which would be shown with Xs in adjacent columns of one row. More than one X in one row, not necessarily in adjacent columns, could also indicate that a stage is used more than once in a feedback configuration. A reservation table with more than one X in a column would indicate that more than one stage is operating simultaneously on the same or different tasks. Operating on the same task would indicate parallel processing, while operating on different tasks would generally indicate some form of feedback in the pipeline.
A reservation table describes the actions performed by the pipeline during each time period. A single function pipeline has only one set of actions and hence would have one reservation table; a multifunction pipeline would have one reservation table for each function of the pipeline. In a static multifunction pipeline, only one function can be selected for all entering tasks until the whole pipeline is reconfigured for a new function, and only one of the reservation tables is of interest at any instant, corresponding to the overall function selected. In a dynamic multifunction pipeline, different overall functions can be performed on entering data, and all of the reservation tables of the functions selected need to be considered as a set.
Pipelines generally operate in synchronism with a common clock signal and each time slot would be related to this clock period, the boundary between two adjacent slots normally corresponding to clocking the data from one pipeline stage to the next stage. Note, though, that the reservation table does not show the specific paths taken by information from one stage to another, and it is possible for two different pipelines to have the same reservation table.
Figure 4.18 Reservation table of a five-stage linear pipeline

The reservation table does help determine whether a new task can be applied after the last task has been processed by the first stage. Each time the pipeline is called
upon to process a new task is an initiation. Pipelines may not be able to accept
initiations at the start of every period. A collision occurs when two or more
initiations attempt to use the same stage in the pipeline at the same time.
Consider, for example, the reservation table of a static pipeline shown in Figure
4.19. This table has adjacent Xs in rows. Two consecutive initiations would cause a
collision at slots 1-2. Here, the stage is still busy with the first initiation when the
second reaches the input of the stage. Such collisions need to be avoided by
delaying the progress of the second initiation through this particular pipeline until
one cycle later. A potential collision can be identified by noting the distance in time slots between Xs in each row of the reservation table. Two adjacent Xs have a "distance" of 1 and indicate that two initiations cannot be applied in successive
cycles. A distance of 2 would indicate that two initiations could be separated by an
extra cycle.
Figure 4.19 Reservation table with collision
A collision vector is used to describe the potential collisions and is defined for a given reservation table in the following way:

    Collision vector C = C_{n-1} C_{n-2} … C_2 C_1 C_0

where the reservation table spans n time slots. C_i = 1 if a collision would occur with an initiation i cycles after another initiation (taking into account all existing tasks in the pipeline), otherwise C_i = 0. We note that C_0 will always be 1, as two simultaneous initiations would always collide. Hence, C_0 is sometimes omitted from the collision vector. C_n and subsequent bits are always 0, as initiations so separated would never collide: all previous initiations would have passed through the pipeline completely.
The initial collision vector is the collision vector after the first initiation is presented to the pipeline. To compute this it is only necessary to consider the distance between all pairs of Xs in each row of the reservation table. The distances between all pairs in the reservation table shown in Figure 4.19 are (5, 4, 1, 0) and the initial collision vector is 110011 (including C_0).
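The forbidden latencies and the initial collision vector follow mechanically from the reservation table. The sketch below is illustrative only: the table used is simply one whose pair distances are (5, 4, 1, 0), not a reproduction of Figure 4.19, and the table spans six time slots.

    def initial_collision_vector(rows, n_slots):
        # rows: one set of marked time slots per stage of the reservation table
        forbidden = set()
        for slots in rows:
            forbidden |= {abs(a - b) for a in slots for b in slots}   # distances between Xs
        # bit C_i is 1 if an initiation i cycles after another would collide
        return "".join("1" if i in forbidden else "0" for i in reversed(range(n_slots)))

    rows = [{0, 1, 5}, {2}, {3}, {4}]          # pair distances are {0, 1, 4, 5}
    print(initial_collision_vector(rows, 6))   # 110011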
4.4.2 Pipeline scheduling and control
Now let us consider the situations when a pipeline should not accept new initiations on every cycle because a collision would occur sometime during the processing of the task. The pipeline needs a control or scheduling mechanism to determine when new initiations can be accepted without a collision occurring.
Latency is the term used to describe the number of clock periods between two initiations. The average latency is the average number of clock periods between initiations, generally over a specific repeating cycle of initiations. The forbidden latency set contains those latencies which cause collisions, e.g. (5, 4, 1, 0) previously. This set is also represented in the collision vector. The smallest average latency considering all the possible sequences of tasks (initiation cycles) is called the minimum average latency (MAL). Depending upon the design criteria, the optimum scheduling strategy might be one which produces the minimum average latency.
The following scheduling strategy is due to Davidson (1971). A pipeline can be considered to be in a particular state; it changes from one state to another as a result of accepted initiations. A diagram of linked states becomes a state diagram. Each state in the state diagram is identified by the collision vector (sometimes called a status vector in the state diagram) which indicates whether a new initiation may be made to the pipeline. The initial state vector of an empty pipeline before any initiations have been made is 00…00, since no collision can occur with the first initiation. After the first initiation has been taken, the collision vector becomes the initial collision vector and C_1 in the collision vector will define whether another initiation is allowed in the next cycle.
First the collision vector is shifted one place right and 0 is entered into the left side. If C_0 = 1, indicating that an initiation is not allowed, the pipeline is now in another state defined by the shifted collision vector. If C_0 = 0, indicating that an initiation is allowed, there are two possible new states – one when the initiation is not taken, which is the same as when C_0 = 1, and one when the initiation is taken. If the initiation is taken, the initial collision vector is bit-wise logically ORed with the shifted collision vector to produce a new collision vector. This logical ORing of the shifted collision vector with the initial collision vector incorporates into the collision vector the effect of the new initiation and its effect on potential collisions.
Figure 4.20 illustrates the algorithm for computing the collision vector for a pipeline when initiations may or may not be taken. It immediately leads to a possible scheduling algorithm, i.e. after shifting the collision vector, if C_0 = 0, an initiation is taken and a new collision vector is computed by the ORing operation. The strategy of always taking the opportunity of submitting an initiation to the pipeline when it is known that a collision will not occur, i.e. choosing the minimum latency on every suitable occasion, is called a greedy strategy. Unfortunately, a greedy strategy will not necessarily result in the minimum average latency (an optimum strategy), though it normally comes fairly close to the optimum strategy, and is particularly easy to implement.
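The shift-and-OR state update with a greedy decision rule takes only a few lines; the sketch below (illustrative only, not part of the original text) reproduces the greedy latency sequence for the initial collision vector 110011.

    def greedy_latencies(initial_cv, n_initiations):
        # initial_cv: the initial collision vector as an integer, with C0 in the least
        # significant bit; the first initiation is assumed to have just been taken.
        state, latencies, gap = initial_cv, [], 0
        while len(latencies) < n_initiations:
            state >>= 1                      # one clock period: shift right, 0 enters at left
            gap += 1
            if state & 1 == 0:               # C0 of the shifted vector is 0: no collision
                latencies.append(gap)        # greedy: take the initiation immediately
                state |= initial_cv          # OR in the initial collision vector
                gap = 0
        return latencies

    print(greedy_latencies(0b110011, 6))     # [2, 6, 2, 6, 2, 6] - the greedy cycle (2,6)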
Figure 4.20 Davidson's pipeline control algorithm
The state diagram for the collision vector 110011 (the reservation table in Figure 4.19) is shown in Figure 4.21. All possible states are included, whether or not an initiation is taken. Clearly such state diagrams could become very large.
The state diagram can be reduced to only showing changes in state when an initiation is taken. The various possible cycles of initiations can be easily located from this modified (or reduced) state diagram. The modified state diagram is shown in Figure 4.22. The number beside each arc indicates the number of cycles necessary to reach a state. We can identify possible closed simple cycles (cycles in which a state is only visited once during the cycle), as given by 3,3,3,3,…, 2,6,2,6,…, 3,6,3,6,…, and 6,6,6,6,…. These simple cycles would be written as (3), (2,6), (3,6) and (6).
Figure 4.21 State diagram for collision vector 110011
Figure 4.22 Modified state diagram (6+ = 6 or more cycles to reach state)
There is usually more than one greedy cycle if the starting point for a cycle can be other than the initial state. In Figure 4.22, the greedy cycles are (2,6) starting at the initial state 110011 and (3) starting at 110111. The average latency of any greedy (simple) cycle is less than or equal to the number of 1s in the initial collision vector (see Kogge, 1981). More complex cycles exist, in which states are visited more than once. However, it has been shown (see Kogge (1981) for proof) that for any complex cycle with a given average latency, there is at least one simple cycle with an average latency no greater than this latency. In searching for an optimum strategy there is no need to consider complex cycles, as a simple cycle exists with the same or better latency, assuming the criterion is minimum latency.
The minimum average latency is always equal to or greater than the maximum number of Xs in the rows of the reservation table. This condition gives us the lower bound on latency and can be deduced as follows: let the maximum number of Xs in a reservation table row be X_max, which equals the number of times the most used stage is used by one initiation. Given t time slots, the maximum possible number of initiations is limited by the most used stage which, of course, can only be used by one initiation at a time; hence the maximum number of initiations is t/X_max. The minimum average latency is then t/(maximum number of initiations) = X_max.
We now have the conditions: maximum number of Xs in a row ≤ minimum average latency (MAL) ≤ greedy cycle average latency ≤ number of initial collision vector 1s, giving upper and lower bounds on the MAL.
A given pipeline design may not provide the required latency. A method of reducing the latency is to insert delays into the pipeline to expand the reservation table and reduce the chances of a collision. In general, any fixed latency equal to or greater than the lower bound can be achieved with the addition of delays, though it may never be possible to achieve a particular cycle of unequal latencies. Mathematical methods exist to determine whether a particular cycle could be achieved (see Kogge (1981)).
Given a simple cycle of equal latencies, and that all stages (Xs) in the reservation table depend upon previously marked stages, we have the following algorithm to identify where to place delays for a latency of n cycles:
1. Starting with the first X in the original reservation table, enter the X in a revised table and mark every n cycles from this position to indicate that these positions have been reserved for the initiations every n cycles. Mark with, say, an F (for forbidden).
2. Repeat for subsequent Xs in the original reservation table until an X falls on an entered forbidden F mark. Then delay the X one or more positions until a free position is found for it. Mark the delayed positions with a D. Delay all subsequent Xs by the same amount.

All Ds in the reservation table indicate where delays must be generated in the pipeline.
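A sketch of the algorithm is given below (illustrative only; the reservation table used in the example is invented and is not Figure 4.23). Each X is processed in time order; an X whose stage would be reused at a forbidden position, i.e. a multiple of n cycles after an earlier X in the same row, is delayed, and the accumulated delay is applied to all later Xs.

    def insert_delays(marks, n):
        # marks: (stage, time) pairs for one initiation; n: target fixed latency
        marks = sorted(marks, key=lambda m: m[1])
        reserved = {}                            # stage -> residues (mod n) already in use
        delay, revised = 0, []
        for stage, t in marks:
            t += delay                           # delays already inserted push this X back
            while t % n in reserved.get(stage, set()):
                t += 1                           # a D position: hold the data for one cycle
                delay += 1                       # and delay all subsequent Xs as well
            reserved.setdefault(stage, set()).add(t % n)
            revised.append((stage, t))
        return revised

    # Stage 1 is used at times 0 and 2; with n = 2 the second use must be delayed.
    print(insert_delays([(1, 0), (2, 1), (1, 2), (3, 3)], 2))
    # [(1, 0), (2, 1), (1, 3), (3, 4)]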
Figure 4.23(a) shows a reservation table with a collision vector 11011. There is one simple cycle (2,5), giving an MAL of 3.5. However, the lower bound (the number of Xs in any row) is 2. The previous algorithm is performed for a cycle of (2) in Figure 4.23(b).
Only one delay is necessary in Figure 4.23. This delay consists of a stage in the pipeline which simply holds the information for one cycle as it passes from one stage to the next. It can be implemented using only one extra stage latch. Multiple delays between processing stages, had they been required, might be best implemented using a dual port memory in which different locations can be read and written simultaneously, as shown in Figure 4.24. Locations read are those which were written n cycles previously, when an n-cycle delay is required.
Figure 4.23 Adding delays to reduce latency (a) Original reservation table (b) Reservation table with delay added
Figure 4.24 Using dual port memory for delay
The algorithm described assumes that Xs must be maintained in the same order as in the original reservation table. It may be that certain stages could be executed before others, though the relationship between the stages is not shown in the reservation table. In that case, it would not be necessary to delay all subsequent Xs, only those which depended upon the delayed stage.
Apart from having a strategy for accepting initiations, pipeline control logic is necessary to control the flow of data between stages. A flexible method of control is by microprogramming, in which the specific actions are encoded in a control memory. This method can be extended so that the specific actions are encoded in words which pass from one stage to the next with the data.
4.5. Pipelining in vector computers
We conclude this chapter with some remarks on the design of large, very high speed vector computers, these being a very successful application of pipelining. Apart from normal "scalar" instructions operating upon one or two single element operands, vector computers have instructions which can operate on strings of numbers formed as one-dimensional arrays (vectors). Vectors can contain either all integers or all floating point numbers. A vector computer might handle sixty-four element vectors. One operation can be specified on all the elements of vectors in a single instruction. Various vector instructions are possible, notably an arithmetical/logical operation requiring one or two vectors, or one scalar and one vector, producing a vector result, and an arithmetical/logical operation on all the elements of one vector to produce a scalar result. Vector processors can also be designed to attach to scalar computers to increase their performance on vector computations. Supercomputers normally have vector capability.
Vector computers can be register-to-register type, which use a large number of processor registers to hold the vectors (e.g. the Cray 1, 2, X-MP and Y-MP computers), or memory-to-memory type, which use main memory locations to hold the vectors (e.g. the Cyber 205). Most systems use vector registers. In either case, the general architecture is broadly as shown in Figure 4.25, where the data elements are held in main memory or processor registers. As in all the stored program computers described, instructions are read from a program memory by a processor. The vector processor accepts elements from one or two vectors and produces a stream of result elements.

Figure 4.25 Pipelined vector processing

Most large, high speed computer systems have more than one functional unit to perform arithmetical and logical operations. For example, in a vector computer, separate scalar and vector arithmetical functional units can be provided, as can different functional units for addition/subtraction and multiplication/division. Functional units can be pipelined and fed with operands before previous results have been generated if there are no hazard conditions. Figure 4.26 shows multiple functional units using vector registers to hold vector operands, as in Cray computers; scalar registers would also exist. The units take operands from vector registers and return results to the vector registers. Each vector register holds the elements of one vector, and individual elements are fed to the appropriate functional unit in succession.

Figure 4.26 Multiple functional units

Typically, a series of vector instructions will be received by the processor. To increase the speed of operation, the results of one functional unit pipeline can be fed into the input of another pipeline, as shown in Figure 4.27. This technique is known as chaining and overlaps pipeline operations to eliminate the "drain" time of the first pipeline. More than two pipelines can be chained when available. Details of vector pipelining and chaining in large vector processor systems can be found in Cheng (1989).

Figure 4.27 Chaining
PROBLEMS
4.1 Derive an expression for the minimum clock period in a ten-stage synchronous pipeline in terms of the stage operating time, t_s, the stage latch set-up time, t_l, and the clock propagation time from one stage to the next, t_prop, assuming that the clock passes from one stage to the next stage.
4.2 A microprocessor has two internal units, an instruction fetch unit and an instruction execute unit, with fetch/execute overlap. Compute the overall processing time of eight sequential instructions in each of the following cases:

1. T(F_i) = T(E_i) = 100 ns for i = 1 to 8
2. T(F_i) = 50 ns, T(E_i) = 100 ns for i = 1 to 8
3. T(F_i) = 100 ns, T(E_i) = 50, 75, 125, 100, 75, …, and 50 ns for i = 1, 2, 3, 4, 5, 6, 7 and 8 respectively

where T(F_i) is the time to fetch the ith instruction and T(E_i) is the time to execute the ith instruction.
4.3 A computer system has a three-stage pipeline consisting of an instruction fetch unit, an instruction decode unit and an instruction execute unit, as shown in Figure 4.6. Determine the time to execute twenty sequential instructions using two-way interleaved memory if the fetch unit fetches two instructions simultaneously. Draw the timing diagram for maximum concurrency given four-way interleaved memory.
4.4 A microprocessor has five pipelined internal units, an instruction fetch unit (IF), an instruction decode unit (ID), an operand fetch unit (OF), an operation execute unit (OE) and a result operand store unit (OS). Different instructions require particular units to operate for the periods shown in Table 4.1 (in cycles, one cycle = 100 ns).

Table 4.1 Pipeline unit operating times for instructions in Problem 4.4

    Instruction                 T(IF)  T(ID)  T(OF)  T(OE)  T(OS)
    Load memory to register       1      1      1      0      0
    Load register to register     1      1      0      1      0
    Store register to memory      1      1      0      0      1
    Add memory to register        1      1      1      1      0
Compute the overall processing time of the sequential instructions in each of the following cases.

1.  MOV AX,[100]   ;Copy contents of location 100
                   ;into AX register
    MOV BX,[200]
    MOV CX,[300]
    MOV DX,[400]

2.  MOV AX,[100]   ;Copy contents of location 100
                   ;into AX register
    MOV BX,[200]   ;Copy contents of location 200
                   ;into BX register
    ADD AX,BX      ;Add contents of BX to AX
    MOV [200],AX   ;Copy contents of AX
                   ;into location 200
4.5 Given that an instruction pipeline has five units, as described in Problem 4.4, deduce the times required for each unit to process the following instructions:

    ADD AX,[102]
    SUB BX,AX
    INC BX
    MOV AX,[DX]    ;Copy the contents of the
                   ;location whose address is in
                   ;register DX into register AX

Identify three types of instructions in which T(OF) = T(OE) = T(OS).
4.6 What is the average instruction processing time of a five-stage instruction pipeline if conditional branch instructions occur as follows: third instruction, ninth instruction, tenth instruction, twenty-fourth instruction, twenty-seventh instruction, given that there are thirty-six instructions to process? Assume that the pipeline must be cleared after a branch instruction has been decoded.
4.7 Identify potential data dependency hazards in the following code:

    MOV AX,[100]
    ADD AX,BX
    MOV CX,1       ;Load the literal 1 into CX register
    MOV [100],AX
    MOV [200],BX
    ADD CX,[200]

given a five-stage instruction pipeline. Suppose that hazards are recognized at the input to the pipeline, but that subsequent instructions are allowed to pass through the pipeline. Determine the sequence in which the instructions are processed.
4.8 Design a dynamic arithmetic pipeline which performs fixed point
(integer) addition or subtraction.
4.9 Design an arithmetic pipeline which performs shift-and-add unsigned
integer multiplication.
4.10 Design a static multifunction pipeline which will perform floating
point addition or floating point multiplication.
4.11 Draw the reservation table for the pipeline shown in Figure 4.28,
and draw an alternative pipeline which has the same reservation table.
Figure 4.28 Pipeline for Problem 4.11
4.12 Determine the initial collision vector for the reservation table
shown in Figure 4.29. Derive the state diagram and simplify the diagram
into a reduced state diagram. List the simple cycles, and give the minimum average latency (MAL).
Figure 4.29 Reservation table for Problem 4.12
4.13 For the reservation table shown in Figure 4.30, introduce delays to
obtain the cycle (3), i.e. an initiation every third cycle.
Figure 4.30 Reservation table for Problem 4.13

CHAPTER 5
Reduced instruction set
computers
In this chapter the concept of providing a limited number of instructions within the processor (reduced instruction set computers, RISCs) as an alternative to the more usual large number of instructions (complex instruction set computers, CISCs) will be discussed. This is a major departure from the previous trend of increasingly complex instructions, and is concerned with improving the performance of the processor.
5.1 Complex instruction set computers (CISCs)
5.1.1 Characteristics
The choice of instructions in the instruction set of the processor is a major design factor. Chapter 1 stated that operations in instructions are reduced to a simple form. However, throughout the development of computers until the 1980s, the instructions provided in the instruction set became more complex as more features were added to aid software development and close the so-called semantic gap between the hardware and software. Mostly, a simple instruction format was retained with one operation, one or two operands and one result, but specialized operations and addressing modes were added. The general argument for providing additional operations and addressing modes is that they can be performed at greater speed in hardware than as a sequence of primitive machine instructions.
Let us look first at the possibilities for more complex instructions provided in complex instruction set computers (CISCs). Complex instructions can be identified in the following areas:

1. To replace sequences of primitive arithmetic operations.
2. For alternative indirect methods of accessing memory locations.
3. For repetitive arithmetic operations.
4. In support of procedure calls and parameter passing.
5. In support of the operating system.
6. In support of multiprocessor systems.
Less common composite operations include checking for error conditions. For
example, the Motorola MC68000 has a “check register against bounds” (CHK)
instruction to compare the value held in a register with an upper bound. If the upper
bound is exceeded, or the register value is below zero, an exception (internal
interrupt) occurs. The upper bound is held in another register or in memory.
More than one arithmetic/logic operation could be specified in one instruction, for example, to add two operands and shift the result one or more places left or right, as in the Nova minicomputer of the early 1970s. Clearly the number of instances in a program that such operations are required in sequence is limited. Arithmetic operations followed by shift operations can be found in microprogrammed devices, for example in the 4-bit Am2901 microprogram device introduced in 1975. One application at the microprogram level is to implement multiplication and division algorithms.
Apart from adding more complex operations to increase the speed of the system, complex addressing modes have also been introduced into systems. Addressing modes can be combined, for example index register addressing and base register addressing (i.e. base plus index register addressing). Indirect addressing could be multilevel. In multilevel indirect memory addressing, the address specifies a memory location which holds either the address of the operand location or, if the most significant bit is set to 1, the remaining bits are interpreted as an address of another memory location. The contents of this location are examined in the same manner. The indirection mechanism will continue until the most significant bit is 0 and the required operand address is obtained. Such multilevel indirection was provided in the Nova computer of the 1970s. Multilevel indirection is an example of a mechanism which is relatively simple to implement but which is of limited application and is now rarely found.
Support for common repetitive operations is appealing because one instruction
could initiate a long sequence of similar operations without further instruction
fetches. Examples include instructions to access strings and queues, and many
CISCs have support for strings and queues. The Intel 8086 microprocessor family
has several instructions which access a consecutive sequence of memory locations.
The Motorola MC68000 microprocessor family has postincrement and predecrement
addressing modes, in which the memory address is automatically incremented after a
memory access and decremented prior to a memory access respectively. (Similar
addressing can also be found in the VAX family.)
Multiple operations are needed during procedure calls and returns. In addition to
saving and restoring the return address, more complex call and return instructions
can save all the main processor registers (or a subset) automatically. Mechanisms
for passing procedure parameters are helpful, as procedure calls and returns occur
frequently and can represent a significant overhead.
It is helpful for the operating system if some instructions (e.g. input/output instructions) simply cannot be executed by the user and are only available to the operating system. In addition, access to areas of memory can be restricted. We have seen in Chapter 2 that memory protection can be incorporated into the memory management system. Finally, multiprocessor systems (as we shall discuss in subsequent chapters) require hardware support in the form of special instructions to maintain proper access to shared locations.
CISCs often have between 100 and 300 instructions and 8–20 addressing modes. An often quoted extreme example of a CISC is the VAX-11/780, introduced in 1978, having 303 instructions and 16 addressing modes with complex instruction encoding. Microprocessor examples include the Intel 80386, with 111 instructions and 8 addressing modes, and the Motorola MC68020, with 109 instructions and 18 addressing modes. In many cases, the development came about by extending previous system designs and because of the view that the greatest speed can be achieved by providing operations in hardware rather than using software routines.
Large numbers of operations and addressing modes require long instructions for their specification. They also require more than one instruction format because different operations require different information to be specified. In a CISC, a general technique to reduce the instruction lengths and the program storage requirements, though increasing the complexity even further, is to encode those instructions which are most likely to be used into a short length.
5.1.2 Instruction usage and encoding
To discover which instructions are more likely to be used, extensive analyses of application programs are needed. It has been found that though high level languages allow very complex constructs, many programs use simple constructs. Tanenbaum (1990) identifies, on average, 47 per cent of program statements to be assignment statements in various languages and programs, and of these assignment statements, 80 per cent simply assign a single value. Other studies have shown that the complex addressing modes are rarely used. For example, DEC found during the development of the VAX architecture that 20 per cent of the instructions required 60 per cent of the microcode but were only used 0.2 per cent of the time (Patterson and Hennessy, 1985). This observation led to the MicroVAX-32 having a slightly reduced set of the full VAX instruction set (96 per cent) but a very significant reduction in control memory (five-fold).
Hennessy and Patterson (1990) present instruction frequency results for the VAX, IBM 360, Intel 8086 and their paper design, the DLX processor. Table 5.1 is based upon the 8086 results. Three programs are listed, all running under MS-DOS 3.3 on an 8086-processor IBM PC. The first is the Microsoft assembler, MASM, assembling a 500-line assembly language program. The second is the Turbo C compiler compiling the Dhrystone benchmark, and the third is a Lotus 1-2-3 program calculating a 128-cell worksheet four times. The Dhrystone benchmark has been proposed as a benchmark program embodying the operations of a "typical" program. This benchmark, and the other widely quoted benchmark program – the Whetstone benchmark – have been criticized as not being able to predict performance (see for example Hennessy and Patterson (1990), pp. 73 and 183). The test done here refers to the compiler, not to the execution of the Dhrystone benchmark.
Of course, each instruction frequency study will give different results depending upon the benchmark programs, the processor and other conditions. However, register accesses generally account for a large percentage of accesses, and a significant percentage of instructions are move operations (for example 51 per cent register addressing, 29 per cent MOV and 12 per cent PUSH/POP in Table 5.1). Conditional jump instructions also account for a significant percentage of instructions (10 per cent in Table 5.1) and, though not shown in Table 5.1, instructions using small literals are very commonly used for counters and indexing.
Table 5.1 8086 instruction usage

                               MASM            Turbo C        Lotus     Average
                               assembler (%)   compiler (%)   (%)       (%)

    Operand access
      Memory                   37              43             3         41
      Immediate                7               n              3         8
      Register                 55              46             2         51

    Memory access addressing
      Indirect                 2               9              1s        2
      Absolute                 36              18             34        30
      Displacement             2               B              31        59

    Instruction type
      Data transfer
        MOV                    30              30             2         29
        PUSH/POP               2               18             8         12
        LEA                    3               6              0         3
      Arithmetic/logical
        CMP                    9               3              3         7
        SAL/SHR/RCR            0               3              2         5
        INC/DEC                3               3              3         5
        ADD                    3               3              3         3
        OR/XOR                 1s              4s             3         3
        Other                  3 each
      Control/call
        JMP                    3               1s             1s        2
        LOOP                   0               0              12        4
        CALL/RET               3               6              3         4
        Cond. jump             2               2              6         10
CISC processors take account of this characteristic by using variable length instructions in units of bytes or 16-bit words. Totally variable length instructions, using Huffman coding, can be used and, in one study, led to a 43 per cent saving in code size (Katevenis, 1985). The Intel 432 microprocessor uses bit-encoded instructions, having from 6 to 321 bits. Instructions can be limited to multiples of bytes or words, which leads to 35 and 30 per cent savings, respectively. Limiting instructions in this way is often done because it matches the memory byte/word fetch mechanism. For example, an MC68000 instruction can be between one and five 16-bit words. An 8086 instruction can be between 1 and 6 bytes. The VAX-11/780 takes this technique to the extreme with between 2 and 57 bytes in an instruction.
The following frequently used operations are candidates for compact encoding:

1. Loading a constant to a register.
2. Loading a small constant (say 0 to 15) to a register.
3. Loading a register or memory with 0.
4. Arithmetic operations with small literals.

The MC68000 has "quick" instructions (move/add/subtract quick) in compact encoding with small constants. Similarly, the 8086 family has compact encoding for some register operations.
A significant consequence of complex instructions with irregular encoding is the need for complex decode logic and complex logic to implement the operations specified. Most CISCs use microcode (Chapter 1) to sequence through the execution steps, an ideal method for complex instructions. This can lead to a very large control store holding the microcode. Again, an extreme example is the 456 Kbyte microcode control store of the VAX-11/780. A consequence of bit-, byte- and word-encoded instructions is that the decoding becomes a sequential operation. Decoding continues as further parts of the instruction are received.
5.2 Reduced instruction set computers (RISCs)
5.2.1 Design philosophy
The policy of complex machine instructions with complex operations and long microprograms has been questioned. An alternative design surfaced in the early 1980s, that of having very simple instructions with few operations and few addressing modes, leading to short, fast instructions, not necessarily microprogrammed. Such computers are known as reduced instruction set computers (RISCs) and have been established as an alternative to complex instruction set computers. The general philosophy is to transfer the complexity into software when this results in improved overall performance. The most frequent primitive operations are provided in hardware. Less frequent operations are provided only if their inclusion does not adversely affect the speed of operation of the existing operations. The prime objective is to obtain the greatest speed of operation through the use of relatively simple hardware.
The following issues lead to the concept of RISCs:

1. The effect of the inclusion of complex instructions.
2. The best use of transistors in VLSI implementation.
3. The overhead of microcode.
4. The use of compilers.
Inclusion of complex instructions
The inclusion of complex instructions is a key issue. As we have mentioned, it was already recognized prior to the introduction of RISCs that some instructions are more frequently used than others. The CISC solution was to have shorter instruction lengths for commonly used instructions; the RISC solution is not to have the infrequently used instructions at all. To paraphrase Radin (1983), even if adding complex instructions only added one extra level of gates to a ten-level basic machine cycle, the whole CPU has been slowed down by 10 per cent. The frequency and performance improvement of the complex functions must first overcome this 10 per cent degradation and then justify the additional cost.
VLSI implementation
One of the arguments put forward for the RISC concept concerns VLSI implementation. In the opening paragraph of his award-winning thesis, Katevenis (1985) makes the point that "it was found that hardware support for complex instructions is not the most effective way of utilizing the transistors in a VLSI processor". There is a trade-off between size/complexity and speed. Greater VLSI complexity leads directly to decreased component speeds due to circuit capacitances and signal delays. With increasing circuit densities, a decision has to be made on the best way to utilize the circuit area. Is it to add complex instructions at the risk of decreasing the speed of other operations, or should the extra space on the chip be used for other purposes, such as a larger number of processor registers, caches or additional execution units, which can operate simultaneously with the main processor functions? The RISC proponents argue for the latter. Many RISCs employ silicon MOS technology; however, the RISC concept is also applicable to the emerging, lower density gallium arsenide (GaAs) technology and several examples of GaAs RISC processors have been constructed.
Microcode
A factor leading to the original RISC concept was changing memory technology.
CISCs often rely heavily on microprogramming, which was first used at a time when
the main memory was based upon magnetic core stores and faster read-only control
memory could be provided. With the move to semiconductor memory, the gap
between the achievable speed of operation of main memory and control memory
narrows; the cache memory concept has also been developed. Now, a considerable
overhead can appear in a microprogrammed control unit, especially when a simple
operation might correspond to one microinstruction. Microprogramming, in which
the programmer uses the microinstructions directly, was tried in the 1970s by
providing writable control stores, but is now not popular.
Compilers
There is an increased prospect for designing optimizing compilers with fewer
instructions. Some of the more exotic instructions are rarely used, particularly in
compilers which have to select an appropriate instruction automatically, as it is
difficult for the compiler to identify the situations where the instructions can be used
effectively. A key part of the RISC development is the provision for an optimizing
compiler which can take over some of the complexities from the hardware and make
best use of the registers. Many of the techniques that can be used in an optimizing
RISC compiler are known and can be used in CISC compilers.
Further advantages of the RISC concept include simplified interrupt service logic.
In a RISC, the processor can easily be interrupted at the end of simple instructions.
Long, complex instructions would cause a delay in interrupt servicing or necessitate
complex logic to enable an interrupt to be serviced before the instruction had
completed. A classic example of a complex instruction which could delay an
interrupt service is a string instruction.
The growth of RISC systems can be evidenced by the list of twenty-one RISC
processors given by Gimarc and Milutinović (1987), all developed in the 1980s; a
list which does not include the MC88100 introduced by Motorola just afterwards
and early prototype systems. The MC88100 is considered in Section 5.3.3 as
representative of commercial RISCs.
There are claims against the RISC concept. Disadvantages include the fact that
if the machine instructions are simple, it is reasonable to expect the programs to be
longer. There is some dispute over this point, as it is argued that compilers can
produce better optimized code from RISC instruction sets, and in any event, more
complex instructions are longer than RISC instructions. Certain features identified
with a RISC might also improve a CISC. For example, RISCs usually call for a
large number of general purpose registers. A large register file, with a suitable
addressing mechanism, could improve the performance of a CISC. Similarly,
optimizing compilers using information on the internal structure of the processor
can improve the performance of a CISC.
5.2.2 RISC characteristics
Though the RISC philosophy can be achieved after various architectural choices,
there are common characteristics. The number of different instructions is limited to
128, or fewer, carefully selected instructions which are likely to be most used.
These instructions are preferably encoded in one fixed-size word and execute in one
cycle without microcoding. Perhaps only four addressing modes are provided.
Indexed and PC-relative addressing modes are probably a minimum requirement;
others can be obtained from using these two addressing modes. All instructions
conform to one of a few instruction formats. Memory operations are limited to load
and store and all arithmetic/logical operations operate upon operands in processor
registers. Hence it is necessary to have a fairly large number of general purpose
processor registers, perhaps between thirty-two and sixty-four.
A memory stack is not often used for passing procedure parameters – internal
processor registers are used instead, because procedure calls and returns have been
identified as very common operations which could incur a heavy time penalty if they
require memory accesses.
A three-register address instruction format is commonly chosen for arithmetic
instructions, i.e. the operation takes operands from two registers and places the
result in a third register. This reduces the number of instructions in many applications
and differs from many CISC microprocessors, which often have two-register, or one-
register/one-memory, address instructions.
In keeping all instructions to a fixed size, some do not use all the bits in the
instruction for their specification, and unused bits would normally be set to zero.
Such wastage is accepted for simplicity of decoding. At least with fixed instruction
length we do not have the problem of instructions crossing over page boundaries
during a page fault. An implication of fixed instruction word length, say 32 bits, is
that it is not possible to specify a 32-bit literal in one instruction – at least two
instructions are needed if a 32-bit literal is necessary. It may be necessary to shift
one literal before adding to another literal. Similarly, it is not possible to specify a
full 32-bit address in one instruction.
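As an illustration (the encoding is assumed, not that of any particular RISC), a 32-bit constant could be built by one instruction that deposits the upper 16 bits of the constant into a register shifted sixteen places left, followed by one that adds the lower 16 bits; arithmetically, value = upper × 65536 + lower, so the constant 0001 0002 (hexadecimal) would be formed as 1 × 65536 + 2 = 65538.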
Those instructions which are likely to be used need to be identified; this usually
involves tracing program references of typical applications and identifying instruction
usage. In CISCs, a wide range of applications is supported. One possible approach
for RISCs is to limit the application area and provide instructions suitable for that
area, such as embedded computers for signal processing, artificial intelligence or
multiprocessing systems.
Like all processors, RISCs rely on pipeline processing. A two-stage pipeline
would seem appropriate for a RISC, one stage to fetch the instruction and one to
execute it. Branch instructions provided usually include the option of single cycle
delayed branch instructions (described in Chapter 4) which match a two-stage
pipeline well. Some RISCs do not conform to a two-stage pipeline, though all have
short pipelines. For register-to-register processing, an instruction could be divided
into four steps:
1. Instruction fetch/decode.
2. Register read.
3. Operate.
4. Register write.
Two-, three- and four-stage pipelines assuming register-to-register operations are
shown in Figure 5.1. In all pipelines, each register read calls for two accesses to the
internal register file to obtain both operands. Both accesses should preferably be
performed simultaneously, and then a two-port register file is necessary. The actual
implementation may put further requirements and constraints upon register/memory
accesses, for example, because of the need to precharge buses in a VLSI implementa-
tion.
Figure 5.1 Pipelines for register-to-register operations (a) Two-stage pipeline
(b) Three-stage pipeline (c) Four-stage pipeline
The two-stage pipeline assumes that an instruction fetch operation requires the
same time as the read-operate-write execution phase, a reasonable assumption for
register read-write operations and a main memory without a cache. A cache would
bring the instruction fetch time closer to register access times. With three stages, an
instruction fetch time equates with the read-operate and the write times; with four
stages the four steps (fetch, read, operate and write) should all take the same time,
including any circuit precharging.
With three or more stages in the pipeline, there may be register read-write hazards
(Chapter 4). For example, an instruction may attempt to read the contents of a
register whose value has not yet been updated by a previous instruction in the
pipeline. Logic can be introduced to detect the hazards (e.g. scoreboard bits) or,
in keeping with the RISC philosophy, such hazards could be recognized by the
compiler and the instruction sequence modified accordingly.
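For example (register names and ordering assumed, not taken from any particular RISC), consider the short sequence below on a three-stage pipeline. The second statement reads R3 before the first has written it back, a read-after-write hazard; either interlock logic must stall the pipeline or the compiler can move an independent instruction, such as the third one, between the first two:

R3 := R1 + R2;    { R3 is written back in the final pipeline stage }
R5 := R3 + R4;    { reads R3 one cycle too early }
R7 := R6 + 1;     { independent - could be moved between the two above }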
There may also be scope for internal forwarding; when a value is written into a
register it could also be transferred directly as one of the sources of a subsequent
instruction, saving a register read operation. A three-stage pipeline calls for the
execution of the read/operate part of one instruction at the same time as the register
write of another instruction. This would suggest a three-port register file with three
buses. This can be reduced to a two-port register file by arranging for the write
operation to occur during the operate part of the next instruction. The four-stage
pipeline would need a three-port register file with three buses, two read and one write.
RISCs have to access the main memory for data, though with a large number
of registers such access can be reduced. Accessing memory typically requires more
time than register read/write. During a memory access in some designs the pipeline
is stalled for one cycle, rather than having complex pipeline logic incorporated to
keep it busy with other operations. There is also a potential memory conflict
between an instruction fetch and a data access. Separate instruction and data
memory modules with separate buses can eliminate the memory bottleneck. Some
RISCs employ separate memory for data and instructions (Harvard architecture).
RISCs can employ pipelining much more extensively than the simple two- to four-stage
pipelining described, especially if they have multiple pipelined arithmetic units
which can be arranged to operate simultaneously. Memory accesses for both data
and instructions may be pipelined internally.
5.3 RISC examples
5.3.1 IBM 801
The first computer system designed on RISC principles was the IBM 801 machine,
designed over the period 1975–79 and publicly reported in 1982 (see Radin,
1983). The work marks the time when increasing computer instruction set complexity
was first questioned. The 801 establishes many of the features for subsequent RISC
designs. It has a three-register instruction format, with register-to-register arith-
metic/logical operations. The only memory operations are to load a register from
memory and to store the contents of a register in memory. All instructions have
32 bits with regular instruction formats. Immediate operands can appear as 16-bit
arithmetic/logical immediate operands, 11-bit mask constants, 16-bit constant displace-
ments for PC relative branch instructions and 26-bit offsets for PC relative addressing
or absolute addressing. The system was constructed using SSI/MSI ECL components
with a cycle time of 66 ns.
Programming features include:
+ 32 general purpose registers.
+ 120 32-bit instructions.
+ Two addressing modes:
    base plus index;
    base plus immediate.
+ Optimizing compiler.
Architectural features include:
+ Separate instruction cache and data cache.
+ Four-stage pipeline:
Instruction fetch;
Register read or address calculation;
ALU operation;
Register write.
+ Internal forwarding paths in pipeline.
+ Interrupt facility implemented in a separate controller.
Register fields in the instruction are 5 bits (to specify one of thirty-two registers).
The three-register format is carried out “pervasively” throughout. For example, it
allows shift operations to specify a source register, a destination register and the
number of shifts in a third register. Instruction memory contents cannot be altered
except to load the instructions. Instructions are provided for cache management to
reduce unnecessary cache load and store operations. Procedure parameters are
passed through registers when possible. A memory stack is not used. Data is stored
aligned to boundaries: words on word boundaries, half words (and bytes) on half word
boundaries, and instructions on word boundaries.
Branch instructions come in two versions, “delayed branch with execute” and
“delayed branch”. The delayed branch with execute delays execution of the branch
until after the next instruction, but executes the next instruction regardless of the
outcome of the branch instruction. The compiler will attempt to use the delayed
branch with execute version if possible, placing a suitable instruction immediately
after the branch, otherwise the non-delayed version is used.
Memory load instructions are also delayed instructions. When an instruction
which will load a register is fetched, the register is locked so that subsequent
instructions in the pipeline do not access it until it has been loaded properly. The
compiler attempts to place instructions which do not require access to the register
immediately after the “delayed load” instruction. Notice how the compiler needs to
know the operation of the pipeline intimately to gain the greatest possible speed in
the RISC. It is reported that 30 per cent of 801 instructions are load/store (Radin,
1983).
The 801 design team wanted all user programming to be done in a high level
language, which means that the only assembly language programming necessary
will be that for system programs. In conventional systems, hardware is provided to
protect the system against the user. For example, in memory management, protec-
tion mechanisms exist to stop users accessing operating system memory and operating
system instructions. The 801 team argument is that complex protection would slow
down the system. All users should use compilers supplied with the system, and the
complex protection is undesirable and unnecessary. Without hardware complexity it
becomes easier to accommodate changes in technology. The 801 programming
source language is called PL.8, which is based upon PL/I, but is without certain
features, such as those which would call for absolute pointers to Automatic or Static
storage (Radin, 1983).
A key aspect of the project was the design of an optimizing compiler. The project
depended upon being able to transfer complexity from the architecture into an
optimizing compiler. From a source code program, intermediate language code is
first produced and then optimized by the compiler. Conventional optimizing tech-
niques applicable to any system are used. For example, constants are evaluated at
compile time, loops are shortened by moving constant expressions to outside the
loop, intermediate values are reused when possible and some procedures are expanded
in-line to reduce register saving.
Allocation of variables to registers is done by considering all of the variables,
rather than local variables only. Register allocation is illustrated in Figure 5.2. First,
an arbitrary large number of registers is assumed to be present and the compiler uses
one register for each variable in the program. The “lifetime” of each variable is
identified, i.e. the time between the first and last use of the variable. Then the
variables are mapped onto the available set of registers in such a manner as to
minimize memory accesses. In the example of Figure 5.2, four registers are available
Figure 5.2 Register allocation with limited number of registers (a) Assuming
unlimited number of registers (b) With four registers
(called red, black, blue and green) and seven variables, initially calling for seven
registers (A, B, C, D, E, F and G). Those variables which cannot be allocated
registers are held in memory, for example G in Figure 5.2. The algorithm used in the
IBM project is fully described by Chaitin et al. (1981) and is based upon the notion
that the register allocation problem can be seen as a graph coloring problem. There
are other register allocation algorithms. Notice that the “lifetime” of a variable may
not always represent its usage. A variable with a short lifetime might be referenced
many times, and hence should be held in a register, while another variable might have
a long lifetime but not be referenced very often, and would have a lower overhead if
held in memory. Figure 5.2 does not show this aspect.
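The mapping of lifetimes onto a fixed register set can be sketched in a few lines of Pascal. This is only an illustration in the spirit of Figure 5.2, assuming variables are considered in order of the start of their lifetimes and given the first register that is free over the whole lifetime; the 801 compiler itself uses the graph coloring formulation of Chaitin et al. (1981), and the sample lifetimes below are hypothetical.

PROGRAM RegisterAllocation;
CONST NumVars = 5; NumRegs = 2;
VAR startTime, finishTime: ARRAY [1..NumVars] OF INTEGER;
    regFreeAt: ARRAY [1..NumRegs] OF INTEGER;  { time from which each register is free }
    assigned: ARRAY [1..NumVars] OF INTEGER;   { register given to a variable, 0 = kept in memory }
    v, r: INTEGER;
BEGIN
  { sample lifetimes, ordered by start time (hypothetical data) }
  startTime[1] := 1;  finishTime[1] := 6;
  startTime[2] := 2;  finishTime[2] := 4;
  startTime[3] := 3;  finishTime[3] := 9;
  startTime[4] := 5;  finishTime[4] := 7;
  startTime[5] := 6;  finishTime[5] := 8;
  FOR r := 1 TO NumRegs DO regFreeAt[r] := 0;
  FOR v := 1 TO NumVars DO
  BEGIN
    assigned[v] := 0;
    FOR r := 1 TO NumRegs DO
      IF (assigned[v] = 0) AND (regFreeAt[r] <= startTime[v]) THEN
      BEGIN
        assigned[v] := r;               { register r is free for the whole lifetime }
        regFreeAt[r] := finishTime[v]
      END;
    IF assigned[v] = 0 THEN
      writeln('variable ', v, ' held in memory')
    ELSE
      writeln('variable ', v, ' in register ', assigned[v])
  END
END.

With two registers and these lifetimes, variable 3 cannot be allocated a register and is held in memory, while the others share the two registers.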
5.3.2 Early university research prototypes — RISC I/II and MIPS
The first university-based RISC project was probably at the University of California
at Berkeley (Patterson, 1985 and Katevenis, 1985), very closely followed by the
MIPS (Microprocessor without Interlocked Pipeline Stages) project at Stanford
University. Both projects resulted in the first VLSI implementations of RISCs, the
Berkeley RISC I in 1982, and the Stanford MIPS and the Berkeley RISC II, both in
1983. These early VLSI RISCs did not have floating point arithmetic, though it was
anticipated that floating point units could be added to operate independently of other
units in the processor. Floating point operations are regarded as candidates for
inclusion in the instruction set, especially for numeric applications.
Features of these early VLSI RISCs are shown in Table 5.2. All processors are
32-bit, register-to-register processors and do not use microcode. Regular instruction
formats are used.
Figure 5.3 shows the two instruction formats of the RISC II, where Rs1 and Rs2
refer to the two source registers and Rd refers to the destination register. These
registers are specified by a 5-bit number, i.e. one from a group of 32 registers which
can be identified from the 138 registers at any instant. (A register window pointer
register is preloaded to specify which group of 32 registers is being referenced; see
page 159 for more details.) The flag SCC (Set Condition Codes) specifies whether
Table 5.2 Features of early VLSI RISCs

Features               RISC I     RISC II    MIPS
Registers              78         138        16
Instructions           31         39         55
Addressing modes       2          2          2
Instruction formats    2          2          4
Pipeline stages        2          3          5
Figure 5.3 RISC I/II instruction formats (a) Short-immediate format (register-to-register,
register-indexed memory load, memory store, control transfer instructions)
(b) Long-immediate format (PC-relative instructions)
the condition code flags are to be set according to the result of the operation. The
short-immediate format shown in Figure 5.3(a) is used for register-to-register,
register-indexed memory load, memory store and control transfer instructions. Two
fields in this format each have alternative interpretations, as shown. For non-
conditional instructions, a destination register, Rd, is specified. For conditional
instructions, a 4-bit condition is specified instead. One source operand can be held
in a register, Rs2, or given as a 13-bit immediate constant in the instruction. The
long-immediate format, shown in Figure 5.3(b), is used for PC-relative instruc
tions. As indicated earlier, two instructions are necessary to load a 32-bit constant
into a register.
Figure 5.4 shows the internal arrangement of the RISC II processor (slightly
simplified). The 138 word register file is addressed from busEXT and has two buses,
busA and busB. SHFTR is a 32-bit shifter using the left and right shift buses, busL
and busR. BusR is also used to load the BI input of the 32-bit arithmetic/logic unit
(ALU) and busL can be used to load data/constants into the data path. A full
description of the operation of the RISC II can be found in Katevenis (1985).
Notice the use of multiple program counters to specify the instructions in the
pipeline. This characteristic can be found in subsequent RISCs.
The three-stage pipeline of the RISC II is shown in Figure 5.5. In Figure 5.5(a) all
instructions are register-to-register. In Figure 5.5(b), the effect of a memory instruc-
tion is shown. Subsequent instructions are suspended while the memory access is in
progress. Internal forwarding is implemented. Dataflow of operands internally
forwarded to two subsequent instructions is shown by arrows. For example, while a
Figure 5.4 RISC II processor (DST, destination latch (a temporary pipeline latch);
SRC, source latch for the shifter; DIMM, combined data in/immediate latch
(holding data from memory or an immediate constant from the instruction); PC,
program counter (holding the address of the instruction being executed during the
current cycle); NXTPC, next-program counter (holding the address of the
instruction being fetched during the current cycle); LSTPC, last-PC-register (holding
the address of the instruction last executed or attempted to be executed); INC,
incrementer which generates NXTPC + 4.)
Figure 5.5 RISC II pipeline (a) Register-to-register
(b) Memory load (instruction 1)
register has been loaded with a value, this value becomes immediately available
without the subsequent instructions having to read the contents of the register.
The Berkeley RISC project introduced the concept of a register window to
simplify and increase the speed of passing parameters between nested procedures.
The internal register file holds parameters passed between procedures, as shown in
Figure 5.6. Each procedure has registers in the file allocated for its use. The central
registers are used only within the procedure. The upper portion is used by the
procedure and by the procedure that called it. The lower portion is used by the
procedure and the procedure it calls, i.e. both the upper and lower portions of the
registers allocated to one procedure overlap with the allocation of registers of other
procedures. In this way, it is not necessary to save parameters in memory during
procedure calls, assuming a sufficient number of registers is provided for the
procedures, otherwise main memory must be used to store some of the register
contents. Another potential disadvantage occurs when multiple tasks are performed,
which would necessitate allocating some of the registers to particular tasks or
saving registers when tasks are swapped.
Figure 5.6 RISC register window
The seventy-eight registers of the RISC I are configured as six windows, each of
fourteen registers with two groups of four overlapping registers and eighteen global
registers accessible by all procedures. Each window had six local registers available
to the procedure alone. The next version, the RISC II, has 138 registers configured
as eight windows, each of twenty-two registers with two groups of six overlapping
registers and ten global registers. It was found that procedures are not usually nested
to a depth of greater than eight and very rarely greater than ten or eleven, especially
over any reasonably short period of the computation.
The register windows can be viewed arranged in a circular fashion, as shown in
Figure 5.7 (for the RISC II). The current window pointer, CWP, points to the
window that can be accessed. The specific register within the window is specified as
a register number in the instruction. The register address is made up of a 3-bit
window address concatenated with a 5-bit register number. Note how a register in an
overlapping group has two addresses. For example, register 1:26 in window 1 is also
register 2:10 in window 2. Register numbers between 0 and 9 always refer to the
global registers irrespective of the current window. We would expect that during a
period in the computation, the procedures would nest to a limited extent, and the
circular nature of the windows accommodates this characteristic well.
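The window/register numbering can be modelled with a small sketch. This is a minimal illustration only, assuming the RISC II figures given above (ten globals, eight windows of twenty-two registers overlapping by six, 138 registers in all); the function name and the exact physical numbering are illustrative rather than the hardware's own.

PROGRAM WindowAddress;

{ Map a window number (0..7) and a register number as seen by the program
  (0..31) to an index in the 138-register file.  Registers 0..9 are the
  globals; each window advances by 16 physical registers (22 per window less
  6 overlapping), and the last window wraps round onto the first (8 * 16 = 128). }
FUNCTION PhysicalRegister(window, regnum: INTEGER): INTEGER;
BEGIN
  IF regnum <= 9 THEN
    PhysicalRegister := regnum
  ELSE
    PhysicalRegister := 10 + (window * 16 + (regnum - 10)) MOD 128
END;

BEGIN
  { register 1:26 and register 2:10 name the same physical register }
  writeln(PhysicalRegister(1, 26));   { prints 42 }
  writeln(PhysicalRegister(2, 10))    { prints 42 }
END.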
5.3.3 A commercial RISC - MC88100
The Motorola MC88100 RISC 32-bit microprocessor, introduced in 1988 (Motorola,
1988a), is one of the first RISCs to be produced by a major CISC microprocessor
manufacturer. The main characteristics of the MC88100 are:
1. Register-to-register (3-address) instructions, except load/store.
2. Thirty-two general purpose registers.
3. Fifty-one instructions.
4. All instructions fixed 32-bit length.
5. No microcode.
6. Four pipelined execution units that can operate simultaneously.
7. Separate data and address paths (Harvard architecture).
The instruction format is regular in that the destination and source specifications are
in the same places in the instruction, though there are several instruction formats.
The fifty-one instructions are given in Table 5.3, and include the common integer
and floating point arithmetic and logical operations. Unusual instructions include a
number of instructions for manipulating bit fields within registers. “Extract Unsigned
Bit Field”, extu, copies a group of bits in a source register into the least significant
end of the destination register. “Extract Signed Bit Field”, ext, is similar but sign
extends the result. The position of the field in the source register is specified in
terms of an offset from the least significant end and a width giving the number of
bits in the field. Offset and width are held either in the instruction or in a second
Figure 5.7 Register window addresses
source register. The reverse operation of copying a number of the least significant
bits of a source register into a destination register in a specified field position is also
available (“Make Bit Field”, mak). Fields can be set to 1s with “Set Bit Field”, set,
or cleared to 0s with “Clear Bit Field”, clr. The instruction ext can be used for
shift operations, the only specific shift instruction provided being rot, which rotates
the contents of a source register right by a specified number of places. Another
unusual instruction is “Find First Bit Clear”, ff0, which scans the source register
from the most significant bit towards the least significant bit and loads the
destination register with the bit number of the first bit found to be clear (0). “Find
First Bit Set”, ff1, loads the bit number of the most significant bit set.
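The effect of a field extraction is easy to model arithmetically. The sketch below shows extract-unsigned behaviour only, using multiplication and division in place of the hardware's shifting; it is an illustration of the operation, not of how the MC88100 implements it, and the names and example values are assumed. It works for non-negative values.

PROGRAM BitFieldDemo;

{ Return the "width" bits of src starting "offset" bits from the least
  significant end - the effect of extu with an offset/width pair. }
FUNCTION ExtractUnsigned(src, offset, width: INTEGER): INTEGER;
VAR i, shifted, mask: INTEGER;
BEGIN
  shifted := src;
  FOR i := 1 TO offset DO shifted := shifted DIV 2;   { shift right by offset }
  mask := 1;
  FOR i := 1 TO width DO mask := mask * 2;            { 2 to the power width }
  ExtractUnsigned := shifted MOD mask                 { keep the low width bits }
END;

BEGIN
  { bits 4..7 of 0101 1010 0000 (hexadecimal 5A0, decimal 1440) are 1010 }
  writeln(ExtractUnsigned(1440, 4, 4))   { prints 10 }
END.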
Table 5.3 MC88100 instruction set (courtesy of Motorola Inc.)

Integer arithmetic:
  add    Add
  addu   Add unsigned
  cmp    Compare
  div    Divide
  divu   Divide unsigned
  mul    Multiply
  sub    Subtract
  subu   Subtract unsigned

Load/store/exchange:
  ld     Load register from memory
  lda    Load address
  ldcr   Load from control register
  st     Store register to memory
  stcr   Store to control register
  xcr    Exchange control register
  xmem   Exchange register with memory

Floating point arithmetic:
  fadd   Floating point add
  fcmp   Floating point compare
  fdiv   Floating point divide
  fldcr  Load from floating point control register
  flt    Convert integer to floating point
  fmul   Floating point multiply
  fstcr  Store to floating point control register
  fsub   Floating point subtract
  fxcr   Exchange floating point control register
  int    Round floating point to integer
  nint   Round floating point to nearest integer
  trnc   Truncate floating point to integer

Flow control:
  bb0    Branch on bit clear
  bb1    Branch on bit set
  bcnd   Conditional branch
  br     Unconditional branch
  bsr    Branch to subroutine
  jmp    Unconditional jump
  jsr    Jump to subroutine
  rte    Return from exception
  tb0    Trap on bit clear
  tb1    Trap on bit set
  tbnd   Trap on bounds check
  tcnd   Conditional trap

Logical:
  and    AND
  mask   Logical mask immediate
  or     OR
  xor    Exclusive OR

Bit field:
  clr    Clear bit field
  ext    Extract signed bit field
  extu   Extract unsigned bit field
  ff0    Find first bit clear
  ff1    Find first bit set
  mak    Make bit field
  rot    Rotate register
  set    Set bit field
Figure 5.8 MC88100 system
There are seven addressing modes, three for accessing data and four for generating
instruction addresses, namely:
Data addressing
1. Register indirect with unsigned immediate index.
2. Register indirect with register index.
3. Register indirect with scaled register index.
Instruction addressing
1. Register with 9-bit vector number.
2. Register with 16-bit signed displacement.
3. Instruction pointer relative (26-bit signed displacement).
4. Register direct.
The internal architecture of the MC88100 is shown in Figure 5.8. We would expect
a RISC system to execute instructions in a single cycle and to produce a result after
each cycle, and the MC88100 can achieve this. Integer arithmetic/logical instruc-
tions execute in a single cycle. However, because of the multiple pipelined units, it
is possible for units to complete their operations in a different order to the one in
which they were started. An internal scoreboard technique is used to keep a record
of registers being updated.
Figure 5.9 shows the thirty-two general purpose registers and their usage. Except for r0
and r1, the uses are software conventions suggested by Motorola to aid software
compatibility. Register r0 holds the constant zero, which can be read but cannot be
altered. (This idea was present in the Berkeley RISC processors.) Register r1 is loaded
with a return pointer by bsr and jsr. Other registers exist in the system for floating
point numbers, the supervisor state, and three program counters: the execute instruction
pointer (XIP), next instruction pointer (NIP) and fetch instruction pointer (FIP).
Figure 5.9 MC88100 general purpose registers (r0, hardwired zero; r1, subroutine
return pointer; the remaining registers are assigned by software convention as called
procedure parameter and temporary registers, calling procedure reserved registers,
linker registers, frame pointer and stack pointer)
5.3.4 The Inmos transputer
The Inmos transputer was certainly one of the first processors to embody the
principles of the RISC; in fact the early work on the transputer took place at the
same time as the IBM 801, but independently and without knowledge of the latter,
though the actual implementation of the transputer was not made available for
some time afterwards. The transputer is a VLSI processor with external commun-
ication links to other transputers in a multiprocessor system. The multiprocessor
aspect of the device, and its high level programming language occam, are considered
in detail in Chapter 9. Occam is normally used in preference to assembly language.
Here we are interested in the RISC aspect of the device and hence will mention
some details of the machine language.
Basic machine instructions have one byte with the format shown in Figure 5.10.
The first 4 bits specify a data operand (from 0 to 15) and the second 4 bits
specify a function. The sixteen functions are allocated as follows:
+ Thirteen frequently occurring functions.
+ Two prefix/negative prefix functions.
+ One operate function.
The thirteen frequently occurring functions include the load/store functions:
+ Load constant.
+ Load/store local.
+ Load local pointer.
+ Load/store non-local.
and also:
+ Add constant.
+ Jump.
+ Conditional jump.
+ Call.
“Local” locations are relative to a workspace pointer, an internal processor register,
and sixteen local locations can be specified in single byte instructions. “Non-local”
locations are relative to the processor A register. Inmos claims that the instructions

Figure 5.10 Transputer instruction format
chosen for single byte encoding lead to 80 per cent of executed instructions
encoded in one byte.
The two prefix functions allow the operand to be extended in length in further
bytes. Operands specified in all instructions are stored in an internal operand
register and, apart from the prefix instructions, the operand register is cleared of
its contents after the instruction has been executed. The prefix instruction loads
the 4-bit operand of the instruction into the operand register and then shifts the
contents four places to the left. Thus, by including one prefix instruction before
another instruction, the operand can be increased to eight bits. The operand
register in 32-bit transputers has thirty-two bits and can be completely filled using
seven prefix instructions and a non-prefix instruction. The “negative prefix” instruction
loads the operand register but complements the contents before it shifts the contents
four places left.
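The way prefix instructions accumulate a long operand can be modelled with a few lines of Pascal. This is only a sketch of the mechanism described above (positive prefixes only, small example values); the procedure and function names are illustrative, not Inmos mnemonics.

PROGRAM PrefixDemo;
VAR oreg: INTEGER;                  { the operand register }

  PROCEDURE Prefix(nibble: INTEGER);
  BEGIN
    oreg := (oreg + nibble) * 16    { load 4 bits, then shift four places left }
  END;

  FUNCTION Operand(nibble: INTEGER): INTEGER;
  BEGIN
    Operand := oreg + nibble;       { the final instruction supplies the low 4 bits }
    oreg := 0                       { cleared after a non-prefix instruction }
  END;

BEGIN
  oreg := 0;
  Prefix(3);                        { building the 12-bit operand 3, 4, 5 }
  Prefix(4);
  writeln('operand = ', Operand(5)) { prints 837, i.e. 3*256 + 4*16 + 5 }
END.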
The “operate” function interprets the operand stored in the operand register as an
operation on operands held in an internal stack. Hence, without prefix, the operate
function extends the number of instructions to twenty-nine (i.e. thirteen frequently
occurring functions plus sixteen operate functions). Arithmetic instructions are
encoded as operate functions. With prefix, the number of instructions can be
extended further, and less frequently used instructions are encoded with a single
prefix.
Transputer instructions have either one address or zero address formats, the
operate instructions being zero address. Three processor registers, A, B and C, are
provided as an evaluation stack for zero address stack instructions (among other
purposes). For example, “load local/non-local” (load onto the evaluation
stack) moves the contents of B into C and the contents of A into B before loading A.
“Store local/non-local” stores A, moves the contents of B into A and copies the
contents of C into B. The add instruction adds the contents of A and B, putting the
result in the A register. One address instructions use the A register (top of the stack)
inherently and the specified location is usually relative to the workspace pointer. A
literal can be used.
5.4 Concluding comments on RISCs
The RISC concept has been established as a design philosophy leading away from
complex instructions. This is not to say that CISCs will not be designed, especially
those processors which must be compatible with existing CISC processors. For
example, the Motorola 68000 family, a true CISC processor family, has been
enhanced with various products since the introduction of the 68000 in 1979,
including the 16-bit 68010, and the 32-bit 68020, 68030 and 68040. The more
recent trend, as in the 68040, is to have multiple pipelined units so that instructions
can be executed in close to one cycle, on average (as in RISCs). Without the
constraint of hardware compatibility with CISCs, RISC designs such as the
Motorola 88100 are concerned fully with performance. It seems likely that to obtain
the greatest performance, processors will need to take on board RISC concepts.
PROBLEMS
5.1 A certain processor has 100 instructions in its instruction set and six
addressing modes. It has four instruction formats, one 16-bit and three
32-bit. What additional information, if any, would be needed to be able
to categorize the processor as a RISC or CISC?
5.2 Show how the addressing modes indexed plus literal and PC-relative
can be used to implement all other common addressing modes (as given
in Chapter 1, Section 1.1).
5.3 A processor has four general purpose registers (an artificially low
number for this problem). By trial and error, allocate four registers to
hold variables so as to minimize the number of variables held in memory,
given the following lifetimes:
Variable    Lifetime
a           1 to 10
b           1 to 8
c           4 to 12
d           6 to 8
e           10 to 13
f           6 to 8
g           1 to 5
h           10 to 13
The lifetime is given in execution periods of the program. Identify the
variables held in memory. When would the assignment result in non-
optimum processor speed (i.e. what additional information might be
needed for an optimum assignment)?
5.4 Design the logic required to decode the register addresses in the
register window given in Figure 5.7.

PART II
Shared memory multiprocessor systems

CHAPTER 6
Multiprocessor systems and programming
This chapter identifies various types of multiprocessor systems and outlines their
operation. Software techniques applicable to general purpose multiprocessors are
presented, in preparation for further study of general purpose multiprocessor archi-
tectures in subsequent chapters.
6.1 General
In previous chapters, we considered methods of improving the performance of
single processor computer systems. Now we will consider the extension of the
stored program computer concept to systems having more than one processor. Such
systems are called multiprocessor systems. Each processor executes the same or
Figure 6.6 Shared memory architectures (a) Single bus
(b) System and local buses (c) Multiple buses (d) Cross-bar switch
(e) Multiport memory (f) Multistage networks
simultaneous requests, first to select up to one request for each memory module and
then to select up to B of those requests to use the buses.
In the cross-bar switch system (Figure 6.6(d)), a direct path is made between each
processor and each memory module using one electronic bus switch to interconnect
the processor and memory module. Each bus switch connects the set of processor
bus signals, perhaps between forty and eighty signals, to a memory module. The
cross-bar architecture eliminates bus contention completely, though not memory
contention, and can allow processors and memory to operate at their maximum
speed.
The multiport memory architecture, as shown in Figure 6.6(e), uses one multiport
memory connecting to all the processors. Multiport memory is designed to enable
more than one memory location to be accessed simultaneously, in this case by
different processors. If there are, say, sixteen processors, sixteen ports would be
provided into the memory, one port for each processor. Though large multiport
memory could be designed, the design is too complex and expensive and con-
sequently “pseudomultiport” memory is used, which appears to access more than
one location simultaneously but in fact accesses the locations sequentially at high
speed. Pseudomultiport memory can be implemented using normal single-port high
speed random access memory with the addition of arbitration logic at the memory–
processor interface to allow processors to use the memory on a first-come first-
served basis. Using normal memory components, it is necessary for the memory to
operate substantially faster than the processors. To service N simultaneous requests,
the memory would need to operate at least N times faster than when servicing a
single request. The multiport architecture with pseudomultiport memory can be
considered as a variation of the cross-bar switch architecture, with each column of
cross-bar switches moved to be close to the associated memory module.
The cost and complexity of the cross-bar switch grows as O(N²) where there are N
processors and memory modules. Hence, the cross-bar interconnection network
would be unsuitable for large numbers of processors and memory modules. In such
cases, a multistage network (Figure 6.6(f)) can be used to reduce the number of
switches. In such networks, a path is established through more than one switching
element in an array of switching elements. Most multistage networks have three or
more stages and each path requires one switch element at each stage.
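As a rough worked comparison (assuming 2 × 2 switching elements, which the text does not specify): with N = 64 processors and 64 memory modules, a cross-bar needs 64 × 64 = 4096 switch points, whereas a multistage network built from 2 × 2 elements needs log₂64 = 6 stages of 32 elements each, i.e. 192 switching elements, at the cost of a path passing through six elements rather than one.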
Message-passing multiprocessor systems
There are various possible direct link (static) interconnection networks for message-
passing multiprocessor systems; some examples are shown in Figure 6.7. A very
restricted static interconnection network, but a particularly suitable scheme for
VLSI fabrication, is to connect processors directly to their nearest neighbors,
perhaps to other processors in a two-dimensional array of processors. Four links are
needed to make contact with four other processors, as shown in Figure 6.7(a), and 2n
links in all for n processors. In general, nm/2 bidirectional links are needed in the
array to connect n processors each to m other processors (each processor having m shared
links). In a system of many concurrent processes in individual processors, processes
are likely to communicate with their neighbors. Many multiprocessor algorithms are
structured to create this characteristic, to map on to static array connected multi-
processors. We will consider static networks in Chapter 8 and message-passing
systems in Chapter 9.
Figure 6.7 Some static interconnection networks (a) Nearest neighbor mesh
(b) Nodes with six links (c) Nodes with eight links (d) Exhaustive
(e) Cubic (f) Tree
Fault tolerant systems
We mentioned in Section 6.1 that multiprocessor systems are sometimes designed to
obtain increased reliability. The reliability of a system can be increased by adding
redundant components. If the probability that a single component is working (the
reliability) is given by P, the probability that at least one component is working with
n duplicated components is given by 1 − (1 − P)ⁿ, i.e. one minus the probability that
all of the components have failed. As n increases, the probability of failure
decreases. In this example, the fault tolerant system with duplicated components
must be designed so that only one of the components need work.
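For example (illustrative figures), if each component has reliability P = 0.9, then with two components the probability that at least one is working is 1 − (0.1)² = 0.99, and with three components it is 1 − (0.1)³ = 0.999.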
We can duplicate parts at the system level (extra systems), gate level (extra gates)
or component level (extra transistors, etc.). To be able to detect failures and
continue operating in the face of faults, the duplicate parts need to repeat actions
performed by other parts, and some type of combining operation is performed which
disregards the faulty actions. Alternatively, error detecting codes could be used; this
requires extra gates.
One arrangement for system redundancy is to use three systems together with a
voter circuit which examines the outputs of the systems, as shown in Figure 6.8.
Each system performs the same computations. If all three systems are working, the
corresponding outputs will be the same. If only two of the three systems are
working, the voter chooses the two identical outputs. If more than one system is not
working, the system fails. The probability that the system will operate is given by
Ps = P³ + 3P²(1 − P), i.e. the probability of all three systems operating or three
combinations of two systems working and one not working. The triplicated system
reliability is greater than for a single system during an initial operating period, but
becomes less reliable later if the reliability decreases with time (see Problem 6.4). It
is assumed that there is negligible probability of two faulty systems producing the
same output, and that the voter will not fail. The concept can be extended to handle
two faulty systems using five systems.
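A few lines of Pascal make the comparison concrete. The program below evaluates both expressions given above for a single assumed component reliability; the value 0.9 is purely illustrative, and the voter is assumed perfect.

PROGRAM Redundancy;
VAR p, duplicated, triplicated: REAL;
BEGIN
  p := 0.9;                                            { reliability of one component (assumed) }
  duplicated := 1.0 - (1.0 - p) * (1.0 - p);           { 1 - (1 - P)^2 }
  triplicated := p * p * p + 3.0 * p * p * (1.0 - p);  { P^3 + 3P^2(1 - P) }
  writeln('single system      : ', p:6:4);
  writeln('duplicated (1 of 2): ', duplicated:6:4);    { 0.9900 }
  writeln('triplicated + voter: ', triplicated:6:4)    { 0.9720 }
END.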
Figure 6.8 Triplicated system with a voter
6.4.2 Potential for increased speed
To achieve an improvement in speed of operation through the use of parallelism, it
is necessary to be able to divide the computation into tasks or processes which can
be executed simultaneously. We might use a different computational algorithm with
a multiprocessor rather than with a uniprocessor system, as it may not always be the
best strategy simply to take an existing sequential computation and find the parts
which can be executed simultaneously. Hence, a direct comparison is somewhat
complicated by the algorithms chosen for each system. However, let us ignore this
point for now. Suppose that a computation can be divided, at least partially, into
concurrent tasks for execution on a multiprocessor system. A measure of relative
performance between a multiprocessor system and a single processor system is the
speed-up factor, S(n), defined as:

S(n) = (Execution time using one processor (uniprocessor system)) /
       (Execution time using a multiprocessor with n processors)

which gives the increase in speed in using a multiprocessor. The efficiency, E, is
defined as:

E = S(n)/n × 100 per cent

We note that the maximum efficiency of 100 per cent occurs when the speed-up
factor, S(n), equals n.
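For example (figures assumed for illustration), if a computation takes 100 s on one processor and 8 s on a 16-processor system, then S(16) = 100/8 = 12.5 and the efficiency is 12.5/16 × 100 = 78 per cent (to the nearest per cent).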
There are various possible divisions of processes onto processors depending upon
the computation, and different divisions lead to different speed-up factors. Also, any
communication overhead between processors should be taken into account. Again,
there are various possible communication overheads, from exhaustive communica-
tion between all processors to very limited communication between processors. The
communication overhead is normally an increasing function of the number of
processors. Here we will investigate some idealized situations. We shall use the
term process to describe a contained computation performed by a processor; a
processor may be scheduled to execute more than one process.
Equal duration processes
The computation might be such that it can be divided into equal duration processes,
with one process mapped onto one processor. This ideal situation would lead to the
maximum speed-up of n, given n processors, and can be compared to a full pipeline
system (Chapter 4). The speed-up factor becomes

S(n) = t/(t/n) = n
where t is the time on a single processor. Suppose there is a communication
overhead such that each process communicates once with one other process, but
concurrently, as in a linear pipeline. The communications all occur simultaneously
and thus appear as only one communication, as shown in Figure 6.9. Then the
speed-up would be:

S(n) = t / ((t/n)(1 + c)) = n/(1 + c)

where c is the fractional increase in the process time which is taken up by
communication between a pair of processes. If c = 1 then the time taken to
communicate between processes is the same as the process time, S(n) = n/2, a
reduction to half the speed-up.
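As a further illustration (values assumed), if communication adds 25 per cent to each process time (c = 0.25), then with n = 16 processors S(16) = 16/1.25 = 12.8 rather than the ideal 16.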
In more general situations, the communication time will be a function of the
number of processes and the communications cannot be fully overlapped.
Parallel computation with a serial section
It is reasonable to expect that some part of a computation cannot be divided at all
into concurrent processes and must be performed serially. For example the computa-
tion might be divided as shown in Figure 6.10. During some period, perhaps an
initialization period or period before concurrent processes are being set up, only one
processor is doing useful work and, for the rest of the computation, all of the
available processors (n processors) are operating on the problem, i.e. the remaining
part of the computation has been divided into n equal processes.
Figure 6.9 Equal duration tasks
Figure 6.10 Parallel computation with serial section
If the fraction of the computation that cannot be divided into concurrent tasks is f,
and no overhead incurs when the computation is divided into concurrent parts, the
time to perform the computation with n processors is given by ft + (1 − f)t/n and the
speed-up factor is given by:

S(n) = t / (ft + (1 − f)t/n) = n / (1 + (n − 1)f)
This equation is known as Amdahl’s law. Figure 6.11 shows S(n) plotted against the
number of processors and plotted against f. We see that indeed a speed improvement
is indicated, but the fraction of the computation that is executed by concurrent
processes needs to be a substantial fraction of the overall computation if a significant
increase in speed is to be achieved. The point made in Amdahl’s law is that even
with an infinite number of processors, the maximum speed-up is limited to 1/f. For
example, with only 5 per cent of the computation being serial, the maximum speed-
up is 20, irrespective of the number of processors.
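The behaviour of the formula is easy to tabulate. The short program below evaluates S(n) = n/(1 + (n − 1)f) for a serial fraction of 5 per cent and a few processor counts; the chosen values are illustrative only.

PROGRAM Amdahl;
VAR f, s: REAL;
    n, i: INTEGER;
BEGIN
  f := 0.05;                        { serial fraction (assumed 5 per cent) }
  n := 1;
  FOR i := 1 TO 8 DO
  BEGIN
    s := n / (1.0 + (n - 1) * f);   { Amdahl's law }
    writeln(n:5, ' processors: speed-up = ', s:6:2);
    n := n * 2                      { 1, 2, 4, ..., 128 processors }
  END
END.

With f = 0.05 the speed-up reaches only about 17.4 at 128 processors, approaching but never exceeding the limit of 1/f = 20.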
In fact, the situation could be worse. There will certainly be an additional
computation to start the parallel section and general communication overhead
between processes. In the general case, when the communication overhead is some
function of n, say tfc(n), we have the speed-up given by:

S(n) = t / (ft + (1 − f)t/n + tfc(n)) = n / (1 + (n − 1)f + nfc(n))
Figure 6.11 Speed-up factor (a) Speed-up factor against number of processors
(b) Speed-up factor against serial fraction, f
In practice we would expect computations to use a variable number of processors, as
illustrated in Figure 6.12.
Optimum division of processes
We need to know whether utilizing all the available processors and dividing the
work equally among processors is the best strategy, or whether an alternative
strategy is better. Stone (1987) investigated this point and developed equations for
different communication overheads, finding that the overhead eventually dominates,
after which it is better not even to spread the processes among all the processors,
Figure 6.12 Parallel computation with variable processor usage
but to let only one processor do the work, i.e. a single processor system becomes
faster than a multiprocessor system. In our equations, this point is reached when the
denominator of the speed-up equations equals or exceeds n, making S(n) equal to or
less than one. Stone confirms that if dividing the process is best, spreading the
processes equally among processors is best (assuming that the number of processes
will divide exactly into the number of processors).
Speed-up estimates
It was once speculated that the speed-up is given by log₂n (Minsky’s conjecture).
Lea (1986) used the term applied parallelism for the parallelism achieved on a
particular system given the restricted parallelism processing capability of the system,
and suggested that the applied parallelism is typically log₂n. He used the term
natural parallelism for the potential in a program for simultaneous execution of
independent processes and suggested that the natural parallelism is n/log₂n.
Hwang and Briggs (1984) presented the following derivation for speed-up, of the
form S(n) ≈ n/ln n. Suppose at some instant i processors are active and sharing the work
equally with a load 1/i (seconds). Let the probability that i processors are active
simultaneously be Pᵢ = 1/n where there are n processors. There is an equal chance of
each number of processors (i = 1, 2, 3, ..., n) being active. The (normalized) overall
processing time on the multiprocessor is given by:
T = Σ Pᵢ(1/i) = (1/n) Σ 1/i     (summing over i = 1 to n)

The speed-up factor is given by:

S(n) = 1/T = n / (Σ 1/i) ≈ n/ln n     for large n

As noted by Baer (1980), the third condition reduces to O₁ ∩ O₂ = ∅ for high level
language statements.
If the three conditions are all satisfied, the two statements can be executed
concurrently. The conditions can be applied to processes of any complexity. A
process can be a single statement, in which case it can be determined whether two
statements can be executed simultaneously. Iᵢ corresponds to the variables on the
right hand side of the statements and Oᵢ corresponds to the variables on the left hand
side of the statements.
Example:
Suppose the two statements were (in Pascal):

A := X + Y;
B := X + Z;

We have:

I₁ = (X, Y)
I₂ = (X, Z)
O₁ = (A)
O₂ = (B)

and the conditions:

I₁ ∩ O₂ = ∅
I₂ ∩ O₁ = ∅
O₁ ∩ O₂ = ∅

are satisfied. Hence the statements A := X + Y and B := X + Z can be executed
simultaneously. Suppose the statements were:

A := X + Y;
B := A + B;

Now the condition I₂ ∩ O₁ ≠ ∅. Hence the two statements cannot be executed
simultaneously.
The technique can be extended to processes involving more than two statements.
Then, the set of inputs and outputs to each process, rather than each statement, is
considered. The technique can also be used to determine whether several statements
can be executed in parallel. In this case, the conditions are:

Iᵢ ∩ Oⱼ = ∅
Iⱼ ∩ Oᵢ = ∅
Oᵢ ∩ Oⱼ = ∅

for all i, j (excluding i = j).
Example:
Suppose the three statements were:

A := X + Y;
B := X + Z;
C := Y + Z;

Here I₁ = (X, Y), I₂ = (X, Z), I₃ = (Y, Z), O₁ = (A), O₂ = (B) and O₃ = (C).
All the conditions:

I₁ ∩ O₂ = ∅     I₁ ∩ O₃ = ∅     I₂ ∩ O₃ = ∅
I₂ ∩ O₁ = ∅     I₃ ∩ O₁ = ∅     I₃ ∩ O₂ = ∅
O₁ ∩ O₂ = ∅     O₁ ∩ O₃ = ∅     O₂ ∩ O₃ = ∅

are satisfied and hence the three statements can be executed simultaneously (or in
any order).
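Checking the conditions mechanically is straightforward with Pascal sets. The sketch below tests the two-statement cases from the first example; the enumerated variable names and the function name are illustrative.

PROGRAM BernsteinCheck;
TYPE Variable = (A, B, C, X, Y, Z);
     VarSet = SET OF Variable;

{ TRUE if two processes with the given input and output sets satisfy the
  conditions and so may execute concurrently. }
FUNCTION CanRunConcurrently(inA, outA, inB, outB: VarSet): BOOLEAN;
BEGIN
  CanRunConcurrently := (inA * outB = []) AND    { I1 and O2 disjoint }
                        (inB * outA = []) AND    { I2 and O1 disjoint }
                        (outA * outB = [])       { O1 and O2 disjoint }
END;

BEGIN
  { A := X + Y and B := X + Z from the example above }
  IF CanRunConcurrently([X, Y], [A], [X, Z], [B]) THEN
    writeln('statements can be executed simultaneously');
  { A := X + Y and B := A + B cannot run concurrently }
  IF NOT CanRunConcurrently([X, Y], [A], [A, B], [B]) THEN
    writeln('statements cannot be executed simultaneously')
END.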
Parallelism in loops
Parallelism can be found in high level language loops. For example the Pascal loop:
FOR i := 1 TO 20 DO
   A[i] := B[i];
could be expanded to:
A[1]  := B[1];
A[2]  := B[2];
A[3]  := B[3];
   ...
A[19] := B[19];
A[20] := B[20];
and, given twenty processors, these could all be executed in parallel (Bernstein’s
conditions being satisfied). If the result of the statement(s) within the body of the
loop does depend upon previous loop iterations, it may still be possible to split the
sequential statements into partitions which are independent. For example the Pascal
loop:
FOR i := 3 TO 20 DO
   A[i] := A[i-2] + 4;
computes:

A[3]  := A[1]  + 4;
A[4]  := A[2]  + 4;
A[5]  := A[3]  + 4;
   ...
A[18] := A[16] + 4;
A[19] := A[17] + 4;
A[20] := A[18] + 4;
Hence A[5] can only be computed after A[3], A[4] after A[2], and so on. The
computation can be split into two independent sequences (partitions):

A[3]  := A[1]  + 4;        A[4]  := A[2]  + 4;
A[5]  := A[3]  + 4;        A[6]  := A[4]  + 4;
         ...                        ...
A[17] := A[15] + 4;        A[18] := A[16] + 4;
A[19] := A[17] + 4;        A[20] := A[18] + 4;
or written as two loops:

i := 3;                        i := 4;
FOR j := 1 TO 9 DO             FOR j := 1 TO 9 DO
BEGIN                          BEGIN
   A[i] := A[i-2] + 4;            A[i] := A[i-2] + 4;
   i := i + 2                     i := i + 2
END                            END
Each loop can be executed by a separate processor in a multiprocessor system. The
approach can be applied to generate a number of partitions, dependent upon the
references within the body of the loop.
A parallelizing compiler accepts a high level language source program and makes
translations and code restructuring to create independent code which can be executed
concurrently. There are various recognition algorithms and strategies that can be
applied and incorporated into a parallelizing compiler apart from the methods
outlined previously. Further information can be found in Padua, Kuck and Lawrie
(1980) and Padua and Wolfe (1986). Some parallelizing compilers are designed to
translate code into parallel form for vector computers. Padua and Wolfe use the term
concurrentizing for code translation to create multiprocessor computations.
6.6 Mechanisms for handling concurrent processes
6.6.1 Critical sections
Suppose we have obtained, by either explicit or implicit parallelism, a set of
processes that are to be executed simultaneously. A number of questions arise. First,
we need a mechanism for processes to communicate and pass data, even if this only
occurs when a process terminates. Coupled with this, we need a mechanism to
ensure that communication takes place at the correct time, i.e. we need a syn-
chronization mechanism. A synchronization mechanism is also required to terminate
processes, as we have seen in the JOIN construct. If processes are to access
common variables (memory locations) or interact in some other way, we need to
ensure that incorrect data is not formed while two or more processes attempt to alter
variables.
A process typically accesses a shared resource from time to time. The shared
resource might be physical, such as an input/output device or a database contained
within shared memory, and may accept data from, or provide data to, the process.
More than one process might wish to access the same resource from time to time.
A mechanism for ensuring that only one process accesses a particular resource at
a time is to establish sections of code involving the resource as so-called critical
sections and arrange that only one such critical section is executed at a time, i.e.
mutual exclusion exists. The first process to reach a critical section for a particular
resource executes the critical section (“enters the critical section”) and prevents all
other processes from executing a critical section for the same resource by some as
yet undefined mechanism. Once the process finishes the critical section, another
process is allowed to enter it for the same resource.
6.6.2 Locks
The simplest mechanism for ensuring mutual exclusion of critical sections is the
use of a lock. A lock is a 1-bit variable which is set to 1 to indicate that a process has
entered the critical section and reset to 0 to indicate that no process is in the critical
section, the last process having left the critical section. The lock operates like a door
lock. A process coming to the "door" of a critical section and finding it open may
enter the critical section, locking the door to prevent other processes entering. Once
the process has finished the critical section, it unlocks the door and leaves.

Suppose that a process reaches a lock which is set, indicating that the process is
excluded from the critical section. It now has to wait until it is allowed to enter the
critical section. The process might need to examine the lock bit continually in a
tight loop, for example, equivalent to:
WHILE Lock = 1 DO SKIP;    {SKIP means no operation}
Lock := 1;                 {enter critical section}
Critical Section
Lock := 0;                 {leave critical section}
Such locks are called spin locks and the mechanism is called busy waiting. Busy
waiting is an inefficient use of processors, as no useful work is done while waiting for
the lock, though it is a common approach with locks.

Other computations could be done in place of SKIP. In some cases it may be
possible to deschedule the process from the processor and schedule another process
while waiting for a lock to open, though this in itself incurs an overhead in saving
and reading process information. If more than one process is busy waiting for a lock
to be reset, and the lock opens, a mechanism might be necessary to choose the best
or highest priority process to enter the critical section, rather than let this be
resolved by indeterminate busy waiting. Such a mechanism is incorporated into the
semaphore operation (see Section 6.6.3).
It is important that more than one process does not set the lock (open the door)
and enter the critical section simultaneously, or that one process finds the lock
reset (door open) but, before it can set it (close the door), another process also finds
the door open and enters the critical section. Hence the actions of examining
whether a lock is set and of setting it must be done as one uninterruptable operation,
and one during which no other process can operate upon the lock. This exclusion
mechanism is generally implemented in hardware by having special indivisible
machine instructions which perform the complete operation sequence. Most recent
microprocessors have such indivisible machine instructions.
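As a sketch only (using the C11 atomic_flag type rather than any particular processor's instruction; the function names acquire and release are assumptions), a spin lock built on an indivisible test-and-set might look like:

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;   /* clear = no process in critical section */

    /* Busy wait until the indivisible test-and-set finds the flag clear,
       i.e. until the lock is open, then hold it. */
    static void acquire(void)
    {
        while (atomic_flag_test_and_set(&lock))
            ;                                     /* SKIP: spin until lock opens */
    }

    /* Leaving the critical section clears the flag, opening the lock. */
    static void release(void)
    {
        atomic_flag_clear(&lock);
    }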
Intel 8086 lock prefix/signal
The Intel 8086 microprocessor implements a lock operation by providing a special
1-byte LOCK instruction which prevents the next instruction from being interrupted
by other bus transactions. The LOCK instruction causes a LOCK signal to be
generated for the duration of the LOCK instruction and the next instruction, whatever
type of instruction this may be. The LOCK signal is used with external logic to
inhibit bus transactions of other processors. If a bus request is received by the
processor, the request is recorded internally but not honored until after the LOCK
instruction and the next instruction. The exact timing is described by Intel (1979).
The lock operation preceding a critical section could be implemented in 8086
assembly language as follows:
L2:   MOV  CX,FFFFH    ;Set up value to load into lock
      LOCK             ;Make next instruction indivisible
      XCHG Lock,CX     ;Set lock
      JCXZ L1          ;Start critical section if
                       ; lock not originally set
      JMP  L2          ;Wait for lock to open
L1:                    ;Critical section
In this sequence, XCHG Lock,CX exchanges the contents of memory location Lock
and register CX. The exchange instruction takes two bus cycles to complete.
Without the LOCK prefix, the exchange operation could be interrupted between bus
cycles in a multiprocessor system, and lead to an incorrect result.
Motorola MC68000 Test and Set instruction
The MC68000 microprocessor has one indivisible instruction, the TAS instruction
(test and set an operand), having the format:

TAS effective address

where the effective address identifies a byte location using any of the 68000 "data
alterable addressing" modes (Motorola, 1984). There are two sequential operations,
"test" and "set". First, the value read from the addressed location is "tested" for
positive/negative and zero, i.e. the N (negative) and Z (zero) flags in the condition
code register are set according to the value in the location. The Z flag is set when
the value is zero and the N flag is set when the whole number held is negative. Next,
the most significant bit of the addressed location is set, irrespective of the previous
test, i.e. whether or not the bit was 1, it is set to 1 during the TAS instruction. The
addressed location is read, modified as necessary and the result written in one
indivisible read-modify-write bus cycle. A lock operation before a critical section
could be encoded using the TAS instruction in 68000 assembly language as:
L1:   TAS  Flag
      BNE  L1          ;Repeat if lock already set
The 68000 also has a test a bit and set instruction (BSET) which is not indivisible and
could not be used alone as a lock operation. Most processors have some form of
indivisible instruction. The 32-bit MC68020 microprocessor has an indivisible
compare and swap (CAS) instruction which can be used to maintain linked lists in a
multiprocessor environment. This instruction can also be found on mainframe
computers such as the IBM 370/168 (see Hwang and Briggs (1984) for more details).
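A sketch of the compare-and-swap idea in C, using a compiler builtin rather than the CAS machine instruction itself; the list layout and names are assumptions made for illustration:

    struct node { int value; struct node *next; };

    static struct node *head;                 /* shared head of a linked list */

    /* Push a node onto the front of the shared list.  The compare-and-swap
       only succeeds if head still holds the value read earlier; otherwise
       another processor changed the list and the operation is retried. */
    static void push(struct node *n)
    {
        struct node *old = __atomic_load_n(&head, __ATOMIC_ACQUIRE);
        do {
            n->next = old;
        } while (!__atomic_compare_exchange_n(&head, &old, n, 0,
                                              __ATOMIC_RELEASE, __ATOMIC_ACQUIRE));
    }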
Though indivisible instructions simplify locks, locks with mutual exclusion
can be implemented without indivisible TAS instructions. For example, one apparent
solution is given below using two shared variables, A and B:
Process 1                        Process 2

A := 0;                          B := 0;
Non-critical section             Non-critical section
A := 1;                          B := 1;
WHILE B = 1 DO SKIP;             WHILE A = 1 DO SKIP;
Critical section                 Critical section
A := 0;                          B := 0;
Non-critical section             Non-critical section
However, this scheme can easily be deadlocked. In deadlock, the processes cannot
proceed as each process is waiting for others to proceed. The code will deadlock
when both A and B are set to 1 and tested simultaneously. SKIP could be replaced
with code to avoid this type of deadlock.
The solution is still susceptible to both Process 1 and Process 2 entering the
critical section together if the sequence of instructions is not executed as specified in
the program, which is possible in some systems. We have seen in Chapter 4, for
example, that some pipelined systems might change the order of execution (Section
4.2.3). Memory contention and delays might also change the order of execution, if
queued requests for memory are not executed in the order presented to the memory.
The effect of such changes of execution was first highlighted by Lamport (1979),
who used code similar to that given for Process 1 and Process 2 to elucidate a
solution, namely that the following conditions must prevail:

1. Each processor issues memory requests in the order specified by its program.
2. Memory requests from all processors issued to an individual memory location
   are serviced from a single first-in first-out queue (in the order in which they
   are presented to the memory).

In fact, it is only necessary for memory requests to be serviced in the order that they
are made in the program, but in practice that always means that the two separate
Lamport conditions are satisfied.
To eliminate the busy waiting deadlock condition and maintain at most one
process in the critical section at a time, a third variable, P, can be introduced into
the code as below:
Process 1                              Process 2

A := 0;                                B := 0;
Non-critical section                   Non-critical section
A := 1;                                B := 1;
P := 2;                                P := 1;
WHILE B = 1 AND P = 2 DO SKIP;         WHILE A = 1 AND P = 1 DO SKIP;
Critical section                       Critical section
A := 0;                                B := 0;
Non-critical section                   Non-critical section
Irrespective of whether any of the instructions of one process are separated by
instructions of the other process, P can only be set to 1 or 2 and
hence the conditional loop will resolve the conflict and one process will be chosen
to enter its critical section. It does not matter whether both conditional loops are
performed simultaneously or are interleaved, though it is assumed that only one
process can access a variable at a time (read or write), which is true for normal
computer memory. Also, assuming that each critical section executes in a finite
time, both processes will eventually have the opportunity to enter their critical
sections (i.e. the algorithm is fair to both processes). It is left as an exercise to
determine whether Lamport's conditions must still be satisfied.
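The three-variable code is essentially Peterson's algorithm; a C sketch for Process 1 is given below (Process 2 is symmetric, with A and B interchanged and P set to 1). The sketch assumes that reads and writes of the shared variables are not reordered, which is exactly what Lamport's conditions require.

    volatile int A = 0, B = 0, P = 1;        /* shared variables */

    void process1_critical(void)
    {
        A = 1;                               /* announce intention to enter */
        P = 2;                               /* defer to Process 2 on a tie */
        while (B == 1 && P == 2)
            ;                                /* SKIP: busy wait */

        /* critical section */

        A = 0;                               /* leave critical section */
    }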
6.6.3 Semaphores
Dijkstra (1968) devised the concept of a semaphore, which is a positive integer
(including zero) operated upon by two operations named P and V. The P operation
on a semaphore, s, written as P(s), waits until s is greater than zero and then
decrements s by one and allows the process to continue. The V operation increments
s by one. The P and V operations are performed indivisibly. (The letter P is from the
Dutch word "passeren", meaning to pass, and the letter V is from the Dutch word
"vrijgeven", meaning to release.)
A mechanism for activating waiting processes is also implicit in the P and V
operations, though the exact algorithm is not specified; the algorithm is expected to
be fair. Delayed processes should be activated eventually, commonly in the order in
which they are delayed. Processes delayed by P(s) are kept in abeyance until
released by a V(s) on the same semaphore. Processes might be delayed using a spin
lock (busy waiting) or, more likely, by descheduling a process from its processor and
allocating in its place a process which is ready.
Mutual exclusion of critical sections of more than one process accessing the same
resource can be achieved with one semaphore having the value 0 or 1 (a binary
semaphore), which acts as a lock variable, but the P and V operations include a
process scheduling mechanism. The semaphore is initialized to 1, indicating that no
process is in its critical section associated with the semaphore. Each mutually
exclusive critical section is preceded by a P(s) and terminated with a V(s) on the
same semaphore, i.e.
Process 1               Process 2               Process 3

Non-critical section    Non-critical section    Non-critical section
P(s)                    P(s)                    P(s)
Critical section        Critical section        Critical section
V(s)                    V(s)                    V(s)
Non-critical section    Non-critical section    Non-critical section
Any process might reach its P(s) operation first (or more than one process may
reach it simultaneously). The first process to reach its P(s) operation, or to be
accepted, will set the semaphore to 0, inhibiting the other processes from proceeding
past their P(s)s, but any process reaching its P(s) operation will be recorded
in a first-in first-out queue. The accepted process executes its critical section. When
the process reaches its V(s) operation, it sets the semaphore s to 1 and allows one
of the waiting processes to proceed into its critical section.
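On a system providing POSIX semaphores, P and V correspond to sem_wait and sem_post; a minimal sketch of the binary-semaphore arrangement above (thread creation details are assumptions, not part of the original text):

    #include <semaphore.h>
    #include <pthread.h>

    static sem_t s;                    /* binary semaphore, initialized to 1 */

    static void *process(void *arg)
    {
        (void)arg;
        /* non-critical section */
        sem_wait(&s);                  /* P(s): wait until s > 0, then decrement */
        /* critical section */
        sem_post(&s);                  /* V(s): increment s, releasing a waiter */
        /* non-critical section */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[3];
        int i;
        sem_init(&s, 0, 1);            /* 0 = shared between threads, initial value 1 */
        for (i = 0; i < 3; i++)
            pthread_create(&t[i], NULL, process, NULL);
        for (i = 0; i < 3; i++)
            pthread_join(t[i], NULL);
        sem_destroy(&s);
        return 0;
    }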
A general semaphore (or counting semaphore) can take on positive values other
than zero and one. Such semaphores provide, for example, a means of recording the
number of "resource units" available or used. Consider the action of a "producer" of
data linked to a "consumer" of data through a first-in first-out buffer. The buffer
would normally be implemented as a circular buffer in memory, using a pointer to
indicate the front of the queue and a different pointer to indicate the back of the
queue. The locations currently not holding valid data are those locations between
the front and back pointers, in the clockwise direction, not including the locations
pointed at by each pointer. The locations holding valid items to be taken by the
consumer are those locations between the front and back pointers in the counter-
clockwise direction, including the locations pointed at by each pointer.

Loading the queue and taking items from the queue must be indivisible and
separate operations. Two counting semaphores can be used, one called empty, to
indicate the number of empty locations in the complete circular queue, and one
called full, to indicate the number of data items in the queue ready for the consumer.
When the queue is full, full = n, the total number of locations in the queue, and
empty = 0. When the queue is empty, the initial condition, full = 0 and empty = n.
The two semaphores can be used as shown below:
Producer                          Consumer

Produce data message
P(empty)                          P(full)
Load buffer                       Take next message from queue
V(full)                           V(empty)
Notice that the P and V operations surrounding each critical section do not operate
on the same semaphore, as they did in the previous example of a mutually exclusive
critical section.

When the producer has a message for the queue, it performs a P(empty)
operation. If empty = 0, indicating that there are no empty locations, the process is
delayed until empty ≠ 0, indicating that there is at least one free location. Then the
empty semaphore is decremented, indicating that one of the free locations is to be
used, and the producer enters its critical section to load the buffer using the back
pointer of the queue, updating the back pointer accordingly. On leaving the critical
section, a V(full) is performed, which increments the full semaphore to show that
one location has been filled.

When the consumer wants to take the next message from the queue, it performs a
P(full) operation, which delays the process if full = 0, i.e. if there are no
messages in the queue. When full ≠ 0, i.e. when there is at least one message in the
queue, full is decremented to indicate that one message is to be taken from the
queue. The consumer then enters its critical section to take the next message from
the queue, using the front pointer and updating this pointer accordingly. On leaving the
critical section, a V(empty) is performed, which increments the empty semaphore
to show that one more location is free.
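A sketch of the producer-consumer arrangement with the two counting semaphores and a circular buffer; the buffer size, the message type and the function names are assumptions, and a further mutual-exclusion semaphore would be needed if several producers or consumers shared the pointers:

    #include <semaphore.h>

    #define N 8                        /* locations in the circular queue (assumed) */

    static int buffer[N];
    static int front, back;            /* front and back pointers of the queue */
    static sem_t empty, full;

    void queue_init(void)
    {
        sem_init(&empty, 0, N);        /* initially all N locations are empty */
        sem_init(&full, 0, 0);         /* and no messages are available */
    }

    void producer_put(int msg)
    {
        sem_wait(&empty);              /* P(empty): wait for a free location */
        buffer[back] = msg;            /* load buffer using the back pointer */
        back = (back + 1) % N;
        sem_post(&full);               /* V(full): one more message for the consumer */
    }

    int consumer_get(void)
    {
        int msg;
        sem_wait(&full);               /* P(full): wait for a message */
        msg = buffer[front];           /* take next message using the front pointer */
        front = (front + 1) % N;
        sem_post(&empty);              /* V(empty): one more location free */
        return msg;
    }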
Figure 6.16 Deadlock (deadly embrace) (a) Two-process deadlock (b) n-process deadlock
The previous example can be extended to more than one buffer between a
producer and a consumer, and to more than two processes. An important factor is
to avoid deadlock (sometimes called a deadly embrace), which prevents processes
from ever proceeding. Deadlock can occur with two processes when one requires a
resource held by the other, and this process requires a resource held by the first
process, as shown in Figure 6.16(a). In this figure, each process has acquired one of
the resources. Both processes are delayed and, unless one process releases a resource
wanted by the other process, neither process will ever proceed.

Deadlock can also occur in a circular fashion, as shown in Figure 6.16(b), with
several processes having a resource wanted by another. Process P1 requires resource
R2, which is held by process P2, process P2 requires resource R3, which is held by process
P3, and so on, with process Pn requiring resource R1 held by P1, thus forming a
deadlock situation. Given a set of processes having various resource requests, a
circular path between any group indicates a potential deadlock situation. Deadlock
cannot occur if all processes hold at most one resource and release this
resource in a finite time. Deadlock can be eliminated between two processes
accessing more than one resource if both processes make requests first for one
resource and then for the other.
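As a sketch (POSIX mutexes standing in for the two resources; not part of the original discussion), deadlock is avoided if both processes request R1 and then R2 in the same order:

    #include <pthread.h>

    static pthread_mutex_t r1 = PTHREAD_MUTEX_INITIALIZER;   /* resource R1 */
    static pthread_mutex_t r2 = PTHREAD_MUTEX_INITIALIZER;   /* resource R2 */

    /* Both processes acquire the resources in the same order, so the
       circular wait of Figure 6.16(a) cannot arise. */
    static void use_both_resources(void)
    {
        pthread_mutex_lock(&r1);
        pthread_mutex_lock(&r2);
        /* ... use R1 and R2 ... */
        pthread_mutex_unlock(&r2);
        pthread_mutex_unlock(&r1);
    }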
It is widely recognised that semaphores, though capable of implementing most
critical section applications, are open to human error in use. For example, for every
P operation on a particular semaphore, there must be a corresponding V operation on
the same semaphore. Omission of a P or V operation, or misnaming the semaphore,
would create havoc. The semaphore mechanism is a very low level mechanism,
programmed into processes.
Semaphores combine two distinct purposes: first, they achieve mutual exclusion
of critical sections and, second, they achieve synchronization of processes. Mutual
exclusion is concerned with making sure that only one process accesses a particular
resource. The separate action of making sure that processes are delayed until
another process has finished with the resource has been called condition synchroniza-
tion, which leads to a conditional critical section, proposed independently by Hoare
and by Brinch Hansen (see Andrews and Schneider (1983) for details). Another
technique is to use a monitor (Hoare, 1974), a suite of procedures which provides
the only method to access a shared resource. Reading and writing can only be done
by using a monitor procedure and only one process can use a monitor procedure at
any instant. If a process requests a monitor procedure while another process is using
one, the requesting process is suspended and placed on a queue. When the active
process has finished using the monitor, the first process in the queue (if any) is
allowed to use a monitor procedure (see Grimsdale (1984)). A study of these
techniques is beyond the scope of this book.
PROBLEMS
6.1 Suggest two advantages of MIMD multiprocessors and two advan-
tages of SIMD multiprocessors.
6.2 Suggest two advantages of shared memory MIMD multiprocessor
systems and two advantages of message-passing MIMD multiprocessors.
6.3 How many systems are necessary to survive any four systems failing
in a fault tolerant system with a voter?
6.4 Determine when a triplicated system becomes less reliable than a
single system, given that the reliability of a single system is given by
e^(-λt), where λ is the failure rate.
6.5 Identify unique features of each of the following array computers:

Illiac IV.
BSP.
GF-11.
Blitzen.
6.6 Determine the execution time to add together all elements of a 33 × 33
element array in each of the following multiprocessor systems:

1. An MIMD computer system with sixty-four independent processors
   accessing a shared memory through an interconnection network.
2. An SIMD computer system with sixty-four processors connected
   through a north-south-east-west nearest neighbor connection
   network. The processors only have local memory.
3. As 2. but with sixteen processors.
4. An SIMD system having sixty-four processors connected to
   shared memory through an interconnection network.

One addition takes 1 μs. Make and state any necessary assumptions.
6.7 Show a suitable path taken between two nodes which are the maximum
distance apart in the Illiac IV system (with an 8 × 8 mesh nearest
neighbor network). Develop a routing algorithm to establish a path
between any two nodes. Repeat assuming that paths can only be left to
right or top to bottom (in Figure 6.3).
6.8 Develop the upper and lower bound for the speed-up factor of a
multiprocessor system given that each processor communicates with four
other processors but simultaneous communications are not allowed.
6.9 In a multiprocessor system, the time each processor executes a
critical section is given by t_c. Prove that the total execution time is given
by:

T = f T_1 + (1 - f) T_1 / p + p t_c
and hence prove that the best case time becomes:

T = f T_1 + 2√((1 - f) T_1 t_c)
where T_1 is the total execution time with one processor, p is the number
of processors in the system and f is the fraction of the operations which
must be performed sequentially. Differentiate the first expression to
obtain the number of processors for the minimum execution time. Assume
that a sufficient number of processors is always available for any program.
6.10 Using Bernstein's conditions, identify the statements that can be
executed simultaneously in the following:
D*EF
ASE;
AND;
ANB;
EH;
momop
Are there any statements that can be executed simultaneously and are not
identified by Bernstein's conditions? Is it possible for such statements to
be present?
6.11 Separate the following Pascal nested loop into independent loops
which can be executed on different processors simultaneously:

FOR i := 2 TO 12 DO
    FOR j := 1 TO 10 DO
        X[i] := X[i+j] * X[i]
6.12 Deduce what the following parallel code achieves (given in two
versions, one "C-like" and one "Pascal-like"):

C-like:

PARFOR (i = 2, j = 1; i <= 10; i++, j++) {
    pixel[i][j] = (pixel[i][j+1] + pixel[i+1][j]
                  + pixel[i][j-1] + pixel[i-1][j]) / 4;
}

Pascal-like:

j := 1;
PARFOR i := 1 TO 10 DO
BEGIN
    j := j + 1;
    pixel[i,j] := (pixel[i,j+1] + pixel[i+1,j]
                  + pixel[i,j-1] + pixel[i-1,j]) / 4
END

In what aspect is the Pascal version inefficient?
6.13 Identify the conditions (if any) which lead to deadlock or incorrect
operation in the code for a lock using the three shared variables A, B and
P (Section 6.6.2).

CHAPTER 7
Single bus multiprocessor systems
This chapter will consider the use of a bus to interconnect processors, notably
microprocessors. Substantial treatment of the arbitration function is given and the
extension of the single bus system to incorporate system and local buses is considered.
The operation of coprocessors on local buses is presented with microprocessor
examples.
7.1 Sharing a bus
7.1.1 General
Microprocessor systems with one processor normally use a bus to interconnect the
processor, memory modules and input/output units. This method serves well for
transferring instructions from the memory to the processor and for transferring data
operands to or from the memory. A single bus can be used in a multiprocessor
system for interconnecting all the processors with the memory modules and input/
output units, as shown in Figure 7.1. Clearly, only one transfer can take place on the
Figure 7.1 Time-shared bus system
bus at any instant; however, the scheme is practical and has been adopted by
microprocessor manufacturers.
In a single bus multiprocessor system, all memory modules connecting to the bus
become a single memory, available to all processors through the bus. Processors
make requests to bus arbitration circuitry for the bus, and one processor is allowed
to use the bus at a time. This processor can access any memory module and the
performance is unaffected by the selection of the memory module. Processors
compete for the use of the bus and a mechanism must be incorporated into the
system to select one processor at a time to use the bus. When more than one
processor wishes to use the bus, bus contention occurs.

A single bus can only improve processing speed if each processor attached to it
has times when it does not use the bus. If each processor requires the bus con-
tinuously, no increase in speed will result, because only one processor will be active
and all the other processors will be waiting for the bus. Most processors have times
when they do not require the bus, though processors without local memory may require
the bus perhaps 50-80 per cent of the time. If a processor requires the bus 50 per
cent of the time, two processors could use it alternately, giving a potential increase
of speed of 100 per cent over a single processor system.
A synchronous system could achieve this speed. For example, the Motorola 6800
8-bit microprocessor operates on a two-phase clock system with equal times in each
phase. Memory references are only made during one phase. Hence, two processors
could be arranged to operate on memory in opposite phases, and no bus arbitration
circuitry would be required. If the processors each required the bus 1/n of the time,
then n processors could use the bus in an interleaved manner, resulting in an n-fold
increase in speed. If further similar processors were added, no further increase in
speed would result. Below maximum utilization of the bus there is a linear increase
in speed, while at the point the bus is fully utilized, no increase in speed results as
further processors are added.
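The argument can be summarized in a small model (a sketch under the stated assumptions, not a measurement): if each processor needs the bus a fraction b of the time, the potential speed-up grows linearly with the number of processors until the bus saturates at about 1/b processors.

    #include <stdio.h>

    /* Ideal speed-up of p processors sharing one bus when each processor
       needs the bus a fraction b of the time: linear until saturation. */
    static double ideal_speedup(int p, double b)
    {
        double limit = 1.0 / b;              /* bus fully utilized at 1/b processors */
        return p < limit ? (double)p : limit;
    }

    int main(void)
    {
        int p;
        for (p = 1; p <= 6; p++)             /* b = 0.5: saturates at two processors */
            printf("p = %d  speed-up = %.1f\n", p, ideal_speedup(p, 0.5));
        return 0;
    }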
Synchronizing memory references is rather unusual and not applicable to more
recent microprocessors; microprocessors have times when they use the bus, which
change depending upon the instructions. For an asynchronous multiprocessor system,
where processors have to compete for the bus, processors will sometimes need to
wait for the bus to be given to them, and the speed-up becomes less than in a
synchronous system. A mathematical analysis is given in Section 7.3. It is rare for it
to be worthwhile to attach more than 4-5 processors to a single bus.

Processors can be provided with local caches holding instructions and data, which
will reduce the number of requests for memory attached to the bus and reduce bus
contention. First, though, let us discuss the various mechanisms for transferring
control of the bus from one processor to another. Processors, or any other device
that can control the bus, will be called bus masters. The processor controlling the
bus at any instant will be called the current bus master. Bus masters wishing to use
the bus and making a request for it will be called requesting bus masters.
7.1.2 Bus request and grant signals
There are two principal signals used in the transfer of control of the bus from one bus
master to another, namely the bus request signal and the bus grant signal, though
other signals are usually also present and the signals are variously titled depending
upon the system designer or microprocessor manufacturer. Transfer of the control of
the bus from one bus master to another uses a handshaking scheme. The bus master
wishing to use the bus makes a request to the current bus master by activating the bus
request signal. The current bus master releases the bus some time later, and passes
back a bus grant signal to the requesting bus master, as shown in Figure 7.2(a). The
exact timing is system dependent. Figure 7.2(b) shows one possible timing using the
two signals described. Bus request causes, in due course, bus grant to be returned.
When bus grant is received, bus request is deactivated, which causes bus grant to be
deactivated. Bus control signals are often active-low, meaning that the quiescent
state is 1 and that 0 indicates action. Such signals are shown with a bar over their
name. We shall use the word "activated" to indicate action.
Figure 7.2 Bus request/grant mechanism (a) Bus request and bus grant signals (b) Timing (c) State diagram
Buses can be classified as either synchronous or asynchronous. For all bus
transactions in the synchronous bus, the time for each transaction is known in
advance, and is taken into account by the source device in accepting information and
generating further signals. In the asynchronous bus, the source device does not know
how long it will take for the destination to respond. The destination replies with an
acknowledgement signal when ready. When applied to transferring control of the bus,
the asynchronous method involves not only a request signal from the requesting bus
master and a grant signal from the current bus master, but also a further grant
acknowledge signal from the requesting bus master acknowledging the grant signal.
In a synchronous bus, the two-signal handshake system is often augmented with a
bus busy signal, which indicates whether the bus is being used. It may be that an
acknowledge signal, rather than a grant signal, is returned from the current bus master
to the requesting bus master after the request has been received. The current bus master
then releases the bus busy line when it eventually releases the bus, and this action
indicates that the requesting master can take over the bus, as shown in Figure 7.3.

Microprocessors designed for multiprocessor operation have request/acknow-
ledge/grant signals at the pin-outs although, when there are more than two processors
in the system, additional logic may be necessary to resolve multiple requests for
particular schemes.
7.1.3 Multiple bus requests
It is necessary for the current bus master to decide whether to accept a particular
request and to decide between multiple simultaneous requests, should these occur.
In both cases, the decision is normally made on the basis of the perceived priority of
the incoming requests, in much the same way as deciding whether to accept an
interrupt signal in a normal single processor microprocessor system. Individual bus
masters are assigned a priority level, with higher priority level masters being able to
take over the bus from lower priority bus masters. The priority level may be fixed by
making specific connections in a priority scheme (i.e. static priority/fixed priority)
or, less commonly, altered by hardware which alters the priority according to some
algorithm (dynamic priority).
Arbitration schemes can generally be:
1. Parallel arbitration schemes.
2. Serial arbitration schemes.
In parallel arbitration schemes, the bus request signals enter the arbitration logic
separately and separate bus grant signals are generated. In serial arbitration schemes,
a signal is passed from one bus master to another, to establish which requesting bus
master, if any, is of higher priority than the current bus master. The serial configuration
is often called a daisy chain scheme.
Figure 7.3 Bus request/acknowledge/busy mechanism (a) Bus request, acknowledge and busy signals (b) Timing (c) State diagram
Arbitration schemes can also be:
1. Centralized arbitration schemes.
2. Decentralized arbitration schemes.
In centralized schemes, the request signals, either directly or indirectly, reach one
central location for resolution and the appropriate grant signal is generated from this
point back to the bus masters. In decentralized schemes, the signals are not resolved
at one point; the decision to generate a grant/acknowledge signal may be made at
various places, normally at the processor sites. The decentralized schemes often (but
not always) have the potential for fault tolerance, whereas the centralized schemes
are always susceptible to single point failures. Parallel and serial arbitration schemes can
either be centralized or decentralized, though the centralized forms are most common.
7.2 Priority schemes

7.2.1 Parallel priority schemes
The general centralized parallel priority scheme is shown in Figure 7.4. Each bus
master can generate a bus request signal which enters the centralized arbitration
logic (arbiter). One of the requests is accepted and a corresponding grant signal is
returned to the bus master. A bus busy signal is provided; this is activated by the
bus master using the bus. A bus master may use the bus when it receives a grant
signal and the bus is free, as indicated by the bus busy line being inactive. While a
bus master is using the bus, it must maintain its request and bus busy signals active.
Should a higher priority bus master make a request, the arbitration logic recognizes
the higher priority master and removes the grant from the current bus master. It also
provides a grant signal to the higher priority requesting bus master, but this bus
master cannot take over the bus until the current bus master has released it. The
current bus master recognizes that it has lost its grant signal from the arbitration
logic, but it will usually not be able to release the bus immediately if it is in the
process of making a bus transfer. When a suitable occasion has been reached, the
Figure 7.4 Centralized parallel arbitration
current bus master releases the bus and the bus busy line, which signals to the
requesting master that it can take over the bus.

Notice that it is necessary to provide a bus busy signal because bus masters are
incapable of releasing the bus immediately when they lose their grant signal.
Hence we have a three-signal system. There are various priority algorithms which
can be implemented by the arbitration logic to select a requesting bus master, all
implemented in hardware, as opposed to software, because of the required high speed
of operation. We have already identified static and dynamic priority. In the first
instance, let us consider static priority. Dynamic priority in parallel priority schemes
is considered on page 225.
In static (fixed) priority, requests always have the same priority. For example,
suppose that there were eight bus masters 0, 1, 2, 3, 4, 5, 6 and 7 with eight request
signals REQ0, REQ1, REQ2, REQ3, REQ4, REQ5, REQ6 and REQ7, and
eight associated grant signals GRANT0, GRANT1, GRANT2, GRANT3, GRANT4,
GRANT5, GRANT6 and GRANT7. Bus master 7 could be assigned the highest
priority, with the other bus masters assigned decreasing priority such that bus master
0 has the lowest priority. If the current master is bus master 3, any of the bus
masters 7, 6, 5 and 4 could take over the bus from the current master, but bus
masters 2, 1 and 0 could not. In fact, bus master 0 could only use the bus when it
was not being used and would be expected to release it to any other bus master
wishing to use it.
Static priority is relatively simple to implement. For eight request inputs and
eight "prioritized" grant signals, the Boolean equations to satisfy are (a prime
denoting a complemented, i.e. inactive, request):

GRANT7 = REQ7
GRANT6 = REQ7'.REQ6
GRANT5 = REQ7'.REQ6'.REQ5
GRANT4 = REQ7'.REQ6'.REQ5'.REQ4
GRANT3 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3
GRANT2 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2
GRANT1 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2'.REQ1
GRANT0 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2'.REQ1'.REQ0
which could be implemented as shown in Figure 7.5. This arrangement could be
extended for any number of bus masters, and standard logic components are available
to provide static priority (for example the SN74278 4-bit cascadable priority com-
ponent (Texas Instruments, 1984), which also has flip-flops to store requests).

Static priority circuit devices can generally be cascaded to provide further inputs
and outputs, as shown in Figure 7.6. In this figure, each priority circuit has an
enable input, EI, which must be activated to generate any output, and an enable
output, EO, which is activated when any one of the priority request outputs is active.
The EI of the highest priority circuit is permanently activated. When a request is
received by a priority circuit, the outputs of the lower priority circuits are disabled.
Figure 7.5 Parallel arbitration logic
Hence, after all requests have been applied and sufficient time has elapsed for all
logic levels to be established, only the highest priority grant signal will be generated.
The previous Boolean equations can easily be modified to incorporate enable
signals.
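The fixed-priority equations, together with the enable input of the cascadable parts, amount to a priority encoder; a C sketch of the selection is given below (the function name and bit ordering are illustrative, with bit 7 the highest priority).

    /* Static priority arbiter: req is an 8-bit request vector, bit 7 the
       highest priority.  Returns a one-hot grant vector, or 0 if the enable
       input is inactive or no request is present. */
    static unsigned arbiter(unsigned req, int enable_in)
    {
        int n;
        if (!enable_in)
            return 0;
        for (n = 7; n >= 0; n--)             /* highest priority request wins */
            if (req & (1u << n))
                return 1u << n;              /* corresponds to GRANTn */
        return 0;
    }

In a cascade, the enable passed on to the next lower stage would then be active only when this stage has no pending request, mirroring the EI/EO connection of Figure 7.6.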
To prevent transient output changes due to simultaneous asynchronous input
changes, the request input information can be stored in flip-flops, as in the SN74278.
This necessitates a clock input and, as in any synchronizing circuit, there is a finite
probability that an asynchronous input change occurs at about the same time as the
clock activation, and this might cause maloperation.
The speed of operation of cascaded priority arbiters is proportional to the number
of circuits cascaded. Clearly, the method is unsuitable for a large number of
requests. To improve the speed of operation of systems with more than one arbiter,
two-level parallel bus arbitration can be used, as shown in Figure 7.7. Groups of
requests are resolved by a first level of arbiters and a second level arbiter selects the
highest priority first level arbiter. For larger systems, the arrangement can be
extended to further levels.
Figure 7.6 Cascaded arbiters

Figure 7.7 Two-level parallel bus arbitration
Microprocessor example
The Motorola MC68452 Bus Arbitration Module (BAM) (Motorola, 1985a) is an
example of a microprocessor arbitration device designed to interface to the MC68000
microprocessor. The MC68452 BAM can accept up to eight device bus requests,
DBR0, DBR1, DBR2, DBR3, DBR4, DBR5, DBR6 and DBR7, and has eight
corresponding device bus grant outputs, DBG0, DBG1, DBG2, DBG3, DBG4,
DBG5, DBG6 and DBG7, generated according to a static priority (DBR7 is the
highest priority, through to DBR0, the lowest priority). The BAM generates a bus
request signal, BR, indicating that it has received one or more requests, according to
the Boolean AND function BR = DBG0.DBG1.DBG2.DBG3.DBG4.DBG5.DBG6.DBG7
(the signals being active-low). The BG input enables the DBG outputs.

An asynchronous three-signal handshake system is used for the transfer of bus
control. This consists of a bus request signal, DBRn, a bus grant signal, DBGn, and a
bus grant acknowledge signal, BGACK. This three-signal handshaking system matches
the general bus operation of the MC68000. The timing of the signals is shown in
Figure 7.8. When one or more bus requests are received and the grant outputs are
enabled, the BAM generates a bus grant signal corresponding to the highest priority
bus request input. The bus grant signal is returned to the requesting bus master,
which must then acknowledge receipt of the signal by activating the common bus
grant acknowledge signal. The requesting bus master can then take over the bus
immediately. While the bus master is using the bus, it must maintain the acknowledge-
ment, BGACK, low, and return BGACK high when it has finished with the bus. The
request, DBRn, must be returned high before BGACK. The requesting bus masters
must maintain their requests low until an acknowledgement is received.
The MC68000 does not generate a bus request signal directly at its pin-out;
external processor bus request circuitry is necessary to produce this signal, which is
dependent upon the system configuration. A bus request signal needs to be
Figure 7.8 MC68000 request/grant/acknowledge sequence
generated for every bus transaction. If there is a local bus, the logic needs to
incorporate a bus address decoder. Specific interface parts are available to interface
to the VME bus (MC68172/3) and for arbitration (MC68174).

A bus master can use the bus as long as it wishes, which may be for one
transaction or for several, upon condition that it maintains BGACK low throughout
the transaction(s). There is no mechanism built into the BAM for forcing masters off
the bus, though a bus clear signal (BCLR) is generated whenever a higher priority
bus master makes a request for the bus, and this signal could be used with additional
circuitry. Also, BGACK must be generated by circuitry in addition to the BAM.
The BAM can operate as an arbiter for a system with a central processor and
devices which can control the bus temporarily, such as DMA devices and display
controllers, or in a multiprocessor system where the control of the bus is not
returned to one particular processor. Figure 7.9 shows how the BAM can be used in
a single processor system containing other devices which can temporarily control
the bus. In this application, BR is not connected to BG. Whenever any device
connected to the BAM makes a request for the bus, the processor is informed via the
BR signal. Normally the MC68000 processor will relinquish the bus between 1.5 and
3.5 clock cycles after the request has reached it, and then return a bus grant signal to the
BAM. The BAM then passes a grant signal to the highest priority requesting device.
Figure 7.9 Using an MC68452 arbiter in a single processor system
Decentralized parallel priority scheme
In the decentralized parallel priority scheme, one arbiter is used at each processor
site, as shown in Figure 7.10, to produce the grant signal for that processor, rather
than a single arbiter producing all grant signals. All the request signals need to pass
along the bus to all the arbiters, but individual processor grant signals do not need to
pass to other processors. Each processor will use a different arbiter request input
and corresponding arbiter grant output. An implementation might use wire links for
the output of a standard arbiter part, as shown in Figure 7.10. Alternatively, the
arbiter function could be implemented from the basic Boolean equations given
earlier for parallel priority logic (see page 219), as shown in Figure 7.11. In this
case, the total arbitration logic of the system would be the same as in the centralized
parallel priority scheme.
The decentralized parallel priority scheme is potentially more reliable than the
centralized parallel priority scheme, as a failure of one arbiter should only affect the
associated processor. An additional mechanism would be necessary to identify
faulty arbiters (or processors), perhaps using a time-out signal. However, certain
arbiter and processor faults could affect the complete system. For example, if an
arbiter erroneously produced a grant signal which was not associated with the
highest priority request, the processor would attempt to control the bus, perhaps at
the same time as another processor. This particular fault could also occur in a
centralized system.

An advantage of the scheme is that it requires fewer signals on the bus. It does
not require grant signals on the bus. Also, in a multiboard system with individual
processors on separate boards, a special arbiter board is not necessary. All processor
boards can use the same design.
Figure 7.10 Decentralized parallel arbitration
Figure 7.11 Decentralized parallel arbitration using gates
Dynamic priority in parallel priority schemes
The implementation of the parallel priority schemes so far described assigns a fixed
priority to individual bus masters. More complex logic, which assigns different
priorities depending upon conditions present in the system, can be provided at the
arbitration sites. The general aim is to obtain more equitable use of the bus,
especially for systems in which no single processor should dominate the use of the
bus. Various algorithms can be identified, notably:

1. Simple rotating priority.
2. Acceptance-dependent rotating priority.
3. Random priority.
4. Equal priority.
5. Least recently used (LRU) algorithm.
After each arbitration cycle in simple rotating priority, all priority levels are
reduced one place, with the lowest priority processor taking the highest priority. In
acceptance-dependent rotating priority (usually called rotating priority), the pro-
cessor whose request has just been accepted takes on the lowest priority and the
others take on linearly increasing priority. Both forms of rotating policy give all
processors a chance of having their requests accepted, though the acceptance-dependent
rotating policy is most common. In random priority, after each arbitration cycle, the
priority levels are distributed in a random order, say by a pseudorandom number
generator. In equal priority, when two or more requests are made to the arbiter,
there is an equal chance of any one request being accepted. Equal priority is
applicable to asynchronous systems in which requests are processed by the arbiter as
soon as they are generated by processors operating independently. If two or more
requests occur simultaneously, the arbiter circuit resolves the conflict. In the least
recently used algorithm, the highest priority is given to the bus master which has
not used the bus for the longest time. This algorithm could also be implemented in
logic.
In the (acceptance-dependent) rotating priority algorithm, all possible requests
can be thought of as sequential entries in a circular list, as shown in Figure 7.12 for
a sixteen bus master system. A pointer indicates the last request accepted. The bus
master associated with this request becomes of the lowest priority after being
serviced. The next entry has the highest priority and subsequent requests in the list
are of decreasing priority. Hence, once a request has been accepted, all other
requests become of greater priority. When further requests are received, the highest
priority request is accepted, the pointer is adjusted to this request and a further request
Figure 7.12 Rotating priority algorithm (a) After request 3 accepted (b) After request 6 accepted
from this master becomes the lowest priority request. For example, the list shown in
Figure 7.12(a) shows the allocation of sixteen devices after request 3 has been
received and is serviced. In Figure 7.12(b) request number 6 has been received and
the pointer is moved accordingly.
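A sketch of the acceptance-dependent selection in C (the sixteen-entry list follows Figure 7.12; the data representation and function name are assumptions):

    #define MASTERS 16

    /* req[i] is non-zero if bus master i has a request pending; last is the
       index of the request accepted on the previous arbitration cycle.
       The search starts one place beyond last, so the master just serviced
       has the lowest priority.  Returns the accepted index, or -1 if none. */
    static int rotating_arbiter(const int req[MASTERS], int last)
    {
        int k;
        for (k = 1; k <= MASTERS; k++) {
            int i = (last + k) % MASTERS;
            if (req[i])
                return i;                    /* becomes the new "last" */
        }
        return -1;
    }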
Rotating priority has been used in interrupt controllers, for example the Advanced
Micro Devices Am9519, and in many ways the interrupt mechanism is similar to the
bus control mechanism but uses interrupt request and acknowledge/grant signals
rather than bus request and acknowledge/grant signals. Various features in the
Am9519 device can be preprogrammed, including a fixed priority or rotating
priority and particular responses to interrupts. Features such as mask registers to
lock out specific requests are not normally found in bus arbitration systems.
Rotating priority can also be performed in the serial priority scheme (see Section
7.2.2).
There are some schemes which assign priority according to some fixed strategy;
these schemes are not strictly dynamic, in so far as the assignment does not
necessarily change after each request is serviced. We can identify two such algorithms:

1. Queueing (first-come first-served) algorithm.
2. Fixed time slice algorithm.

The queueing (first-come first-served) algorithm is sometimes used in analytical
studies of bus contention and assumes a queue of requests at the beginning of an
arbitration cycle. The request accepted is the first request in the queue, i.e. the first
request received. This algorithm poses problems in implementation and is not
normally found in microprocessor systems. In the fixed time slice algorithm, each
bus master is allocated one period in a bus arbitration sequence. Each bus master
can only use the bus during its allocated period, and need not use the bus on every
occasion. This scheme is suitable for systems in which the bus transfers are
synchronized with a clock signal.
7.2.2 Serial priority schemes
The characteristic feature of serial priority schemes is the use of a signal which
passes from one bus master to another, in the form of a daisy chain, to establish
whether a request has the highest priority and hence can be accepted. There are
three general types, depending upon the signal which is daisy chained:

1. Daisy chained grant signal.
2. Daisy chained request signal.
3. Daisy chained enable signal.
The daisy chained grant scheme is the most common. In this scheme the bus
requests from bus masters pass along a common (wired-OR) request line, as shown
in Figure 7.13. A bus busy signal is also common and, when active, indicates that
the bus is being used by a bus master. When one or more bus masters make a
request, the requests are routed to the beginning of the daisy chain, sometimes
through a special bus controller and sometimes by direct connection to the highest
priority master. The signal is then passed from one bus master to the next until the
highest priority requesting bus master is found. This bus master prevents the signal
passing any further along the daisy chain and prepares to take over the bus.
In the daisy chained request scheme, as shown in Figure 7.14, the daisy chain
connection is again from the highest priority bus master through to the lowest
priority bus master, but with the request signal being routed along the daisy chain.
Each requesting bus master generates a bus request signal which is passed along the
daisy chain, eventually reaching the current bus master. This bus master is of lower
priority than any of the requesting bus masters to the left of it, and hence will honor
the request by generating a common (wired-OR) bus acknowledge/grant signal. All
requesting bus masters notice this signal, but only the one which has a request
pending and does not have a request present at its daisy chain input responds, as it
Figure 7.13 Centralized serial priority arbitration with daisy-chained grant signal

Figure 7.14 Centralized serial priority arbitration with daisy-chained request signal
must be the highest priority requesting bus master. Other requesting bus masters
have an opportunity to compete for the bus in future bus arbitration cycles. The
8086 microprocessor supports a form of daisy-chained request arbitration.

In the daisy chained enable scheme, both the bus request and bus acknowledge/
grant signals are common (wired-OR) signals and an additional enable signal is
daisy chained. When a bus master makes a request, it disables the daisy chained
enable output, indicating to lower priority bus masters that a higher priority bus
master has presented a request. The common request signal is routed to a bus
controller, which generates a common (wired-OR) bus acknowledge signal to all bus
masters. The highest priority requesting bus master will have its enable input
activated and this condition will allow it to take over the bus. The daisy chained
enable system was used in single processor Z-80 systems for organizing interrupts
from input/output interfaces.
In all types of daisy chain schemes, a key point is that the mechanism must be
such that a requesting bus master cannot take over the bus until it has been freed. A
bus controller can be designed to issue an acknowledge/grant signal only when the
bus is free. If there is no bus controller, there are two possible mechanisms, namely:

1. Bus masters are not allowed to make a request until the bus is free.
2. Bus masters are allowed to make a request at any time but are not allowed to
   take over the bus until the bus is free (and after receipt of a grant signal).

In 1, after the grant signal comes via the daisy chain, the bus master can take over
the bus immediately. In 2, the bus master must wait until the bus is free. When the
bus is taken over, the bus busy line is activated.

A strategy must be devised for terminating the control of the bus. One strategy
would be to allow a bus master only one bus cycle and to make it compete with
other bus masters for subsequent bus cycles. Alternatively, bus masters could be
forced off the bus by higher priority requests (and perhaps requested to terminate by
lower priority bus masters).
MC68000 microprocessor
The MC68000 microprocessor is particularly designed to use the daisy-chained
acknowledge scheme with its three processor signals: bus request input (BR), bus
grant output (BG) and bus grant acknowledge input (BGACK). The bus grant
acknowledge signal is, in effect, a bus busy signal and is activated when a bus
master has control of the bus. External circuitry is necessary to generate this signal
for each bus master. Bus request indicates that at least one bus master is making a
request for the bus. Again, external circuitry is necessary to generate this signal for
each bus master. The bus grant signal, BG, is generated by the processor and
indicates that it will release the bus at the end of the current bus cycle in response to
receiving the BR signal. The requesting processor waits for all of the following
conditions to be satisfied (Motorola, 1985b):
1. The bus grant, BG, has been received.
2. The address strobe, AS, is inactive, indicating that the processor is not using
   the bus.
3. The data transfer acknowledge signal, DTACK, is inactive, indicating that
   neither memory nor peripherals are using the bus.
4. The bus grant acknowledge signal, BGACK, is inactive, indicating that no other
   device still has control over the bus.
‘The scheme described allows masters to make requests even if the bus is being used,
but the transfer of control is inhibited until the bus becomes free. Hence the
arbitration cycle can be overlapped with the current bus cycle. In contrast, bus
requests in a Z8000 multiprocessor are inhibited until the bus is free, when
arbitration takes place to find the highest priority requesting bus master.
Decentralized serial priority scheme
Though the daisy chain distributes the arbitration among the bus master sites, the
daisy chain signal originates at one point and subsequently passes along the daisy
chain. Hence the daisy chain methods so far described are categorized as centralized
priority schemes. The daisy chain grant method can be modified to be a decentralized
scheme by making the current bus master generate the daisy chain grant signal and
arranging a circular connection, as shown in Figure 7.15. The daisy chain signal
now originates at different points each time control is transferred from one bus
master to another, which leads to a rotating priority. The current bus master has the
lowest priority for the next bus arbitration. The bus master immediately to the right of
the current bus master has the highest priority and bus masters further along the
daisy chain have decreasing priority.

When a bus master has control of the bus, it generates a grant signal which is
passed to the adjacent bus master. The signal is passed through bus masters that do
not have a request pending. Whenever a bus master makes a request, and has a grant
Figure 7.15 Rotating daisy chain
input signal active, it inhibits the grant signal from continuing along the daisy chain.
However, it cannot take over the bus until the current bus master releases the bus
(assuming a bus master is using the bus). When the current bus master finds that it
has lost its daisy chained grant, it must release the bus at the earliest opportunity
and release a common bus busy line. Then the requesting master can take over the
bus. When more than one bus master makes a request for the bus, the requesting bus
master nearest the current bus master in the clockwise direction is the first to inhibit the
daisy chain grant signal and claim the bus.
An implementation of the rotating daisy chain scheme typically requires one flip-
flop at each bus master to indicate that it was the last to use the bus or that it is
currently using the bus. One design is given by Nelson and Refai (1984). Flip-flops
are usually activated by a centralized clock signal, and request signals should not
change at about the time of the activating clock transition or the circuit might enter
a metastable state for a short period (with an output voltage not at a binary level).
Finally, note that though the scheme is decentralized, it still suffers from single point
failures. If one of the arbitration circuits fails to pass on the grant signal, the complete
system will eventually fail as the daisy chain signal propagates to the failed circuit.
Combined serial-parallel scheme
The serial priority scheme is physically easy to expand though the speed of
operation is reduced as the daisy chain length increases. The parallel priority
scheme is faster but requires extra bus lines. The parallel scheme cannot be
expanded easily in a parallel fashion beyond the original design since it is dependent
upon the number of lines available on the bus for request and acknowledge/grant
signals, and the arbitration logic. Typically eight or sixteen bus masters can be
handled with the parallel priority scheme.
The parallel priority scheme can be expanded by daisy chaining each request or
grant signal, thus combining the serial and parallel techniques. A scheme is shown
in Figure 7.16. Here the bus request signals pass to the parallel arbitration circuit as
before. However, these signals are wired-OR types and several bus masters may use
each line. The grant signals are daisy chained for each master using the same
request line, so that the requesting master can be selected. The operation is as
follows: the requesting master produces a bus request signal. If accepted by the
priority logic, the corresponding grant signal is generated. This signal passes down
the daisy chain until it reaches the requesting master. At the same time, an
additional common bus clear signal is generated by the priority logic and sent to all
the bus masters. On receiving this signal the current master will release the bus at
the earliest possible moment, indicating this by releasing the bus busy signal. The
new master will then take over the bus.
The parallel and serial schemes are in fact two extremes of implementing the
same Boolean equations for arbitration given in Section 7.2.1. From these equations,
we can obtain the equations implemented at each bus master site in a daisy chain
grant system. Defining INn as the nth daisy chain input and OUTn as the nth daisy
chain output, which are true if no higher priority request is present, then:
Figure 7.16 Parallel arbiter with daisy chained grant signals
(In the following equations, X' denotes the complement of X.)

GRANT7 = REQ7
OUT7 = IN6 = REQ7'
GRANT6 = REQ7'.REQ6 = IN6.REQ6
OUT6 = IN5 = REQ7'.REQ6' = IN6.REQ6'
GRANT5 = REQ7'.REQ6'.REQ5 = IN5.REQ5
OUT5 = IN4 = REQ7'.REQ6'.REQ5' = IN5.REQ5'
GRANT4 = REQ7'.REQ6'.REQ5'.REQ4 = IN4.REQ4
OUT4 = IN3 = REQ7'.REQ6'.REQ5'.REQ4' = IN4.REQ4'
GRANT3 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3 = IN3.REQ3
OUT3 = IN2 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3' = IN3.REQ3'
GRANT2 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2 = IN2.REQ2
OUT2 = IN1 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2' = IN2.REQ2'
GRANT1 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2'.REQ1 = IN1.REQ1
OUT1 = IN0 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2'.REQ1' = IN1.REQ1'
GRANT0 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2'.REQ1'.REQ0 = IN0.REQ0
Alternatively, we could have grouped two grant circuits together to get:
GRANT7 = REQ7
GRANT6 = REQ7'.REQ6
OUT7/6 = IN5/4 = REQ7'.REQ6'
GRANT5 = REQ7'.REQ6'.REQ5 = IN5/4.REQ5
GRANT4 = REQ7'.REQ6'.REQ5'.REQ4 = IN5/4.REQ5'.REQ4
OUT5/4 = IN3/2 = REQ7'.REQ6'.REQ5'.REQ4' = IN5/4.REQ5'.REQ4'
GRANT3 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3 = IN3/2.REQ3
GRANT2 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2 = IN3/2.REQ3'.REQ2
OUT3/2 = IN1/0 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2' = IN3/2.REQ3'.REQ2'
GRANT1 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2'.REQ1 = IN1/0.REQ1
GRANT0 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2'.REQ1'.REQ0 = IN1/0.REQ1'.REQ0
Similarly, groups of four arbitration circuits can be created with a daisy chain signal
between them, i.e.:

GRANT7 = REQ7
GRANT6 = REQ7'.REQ6
GRANT5 = REQ7'.REQ6'.REQ5
GRANT4 = REQ7'.REQ6'.REQ5'.REQ4
OUT7-4 = IN3-0 = REQ7'.REQ6'.REQ5'.REQ4'
GRANT3 = IN3-0.REQ3 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3
GRANT2 = IN3-0.REQ3'.REQ2 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2
GRANT1 = IN3-0.REQ3'.REQ2'.REQ1 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2'.REQ1
GRANT0 = IN3-0.REQ3'.REQ2'.REQ1'.REQ0 = REQ7'.REQ6'.REQ5'.REQ4'.REQ3'.REQ2'.REQ1'.REQ0
Figure 7.17 shows implementations for a purely serial approach, arbiters with
groups of two request inputs and arbiters with groups of four request inputs.
Figure 7.17 Serial-parallel arbitration logic (a) Serial (b) Groups of two requests (c) Groups of four requests
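To make the daisy chain grant equations above concrete, the following sketch (Python, purely illustrative rather than a circuit description) propagates the IN/OUT signal through eight arbitration sites and produces the GRANT outputs; REQ7 is the highest priority and REQ0 the lowest, matching the equations.

def daisy_chain_grants(req):
    """Evaluate the serial (daisy chain) grant equations.

    req -- list of 8 booleans, req[7] being the highest priority request.
    Returns a list of 8 booleans: grant[n] is True when REQn is granted.
    Implements GRANTn = INn.REQn and the next OUT = INn.REQn', with the
    daisy chain input to the highest priority site held true.
    """
    grant = [False] * 8
    daisy_in = True                         # IN7: no higher priority request exists
    for n in range(7, -1, -1):              # from the highest priority site downwards
        grant[n] = daisy_in and req[n]      # GRANTn = INn . REQn
        daisy_in = daisy_in and not req[n]  # OUTn = INn . REQn'
    return grant

# REQ5 and REQ2 asserted: only REQ5, the higher priority request, is granted.
req = [False] * 8
req[5] = req[2] = True
print([i for i, g in enumerate(daisy_chain_grants(req)) if g])   # -> [5]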
7.2.3. Additional mechanisms in serial and parallel priority schemes
Apart from the basic request, grant/acknowledge and bus busy signals, additional
bus signals may be present in arbitration schemes. For example, there may be two
types of “bus request/clear” signals, one signal as a top priority clear, causing the
current bus master to release the bus at the earliest possible moment, and one signal
indicating a request which does not need to be honored so quickly. With the second
request, the current bus master may complete a sequence of tasks, for example
complete a DMA (direct memory access) block transfer. The top priority clear might
be used for power down routines.
As described, the serial and parallel priority schemes with static priority will
generally prevent lower priority bus masters obtaining control of the bus while
higher priority bus masters have control. Consequently, it may be that these lower
priority masters may never obtain control of the bus. This can be avoided by using a
common bus request signal which is always activated whenever a bus request is
made. If the requesting master has a higher priority than the current master, the
normal arbitration logic will resolve the conflict and generate a bus request signal to
the current bus master, causing the master to relinquish control of the bus at the end
of the current cycle. If, however, the requesting master is of lower priority than the
current bus master, a signal is not generated by the priority logic, but the bus master
recognizes and takes note of the fact that the common bus request signal is
activated. The current bus master continues but when it does not require the bus,
perhaps while executing an internal operation, it releases the bus busy signal, thus
allowing the requesting master access to the bus until the current bus master
requires the bus again.
This scheme is particularly suitable if the master has an internal instruction queue
with instruction prefetch, so that after the queue is full there may be long periods
during which the bus is not required. Note that while the common bus request signal
is not activated by a requesting bus master, the current bus master might not release
the bus signal between bus transfers. The Intel 16-bit Multibus I system bus (the
IEEE 796 bus) uses the common bus request mechanism with the signal CBRQ
(Intel, 1979), though the actual microprocessors (e.g. 8086, 80286) do not generate
the common bus signal.
7.2.4 Polling schemes
In polling schemes, rather than the bus masters issuing requests whenever they wish
to take over control of the bus and a bus controller deciding which request to accept,
a bus controller asks bus masters whether they have a request pending. Such polling
schemes can be centralized, with one bus controller issuing request inquiries, or
decentralized, with several bus controllers.
The mechanism generally uses special polling lines between the bus controller(s)
and the bus masters to effect the inquiry. Given 2^n bus masters, 2^n lines could be
provided, one for each bus master, and one activated at a time to inquire whether the
bus master has a request pending. Alternatively, to reduce the number of polling
lines, a binary encoded polling address could be issued on the polling lines and then
only n lines would be necessary. In addition, there are bus request and busy lines.
Centralized polling schemes
A centralized polling scheme is shown in Figure 7.18. The bus controller asks each
bus master in turn whether it has a bus request pending, by issuing each bus master
polling address on the polling lines. If a bus master has a request pending when its
address is issued, it responds by activating the common request line. The controller
then allows the bus master to use the bus. The bus master addresses can be
sequenced in numerical order or according to a dynamic priority algorithm. The
former is easy to implement using a binary counter which is temporarily halted from
sequencing by the bus busy line.
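A centralized polling controller of the kind described can be sketched as a simple loop (Python, with hypothetical names; not a hardware design): the controller steps through the bus master addresses in numerical order and hands the bus to the first master that answers the poll with a request.

def centralized_poll(pending, start=0):
    """One pass of a centralized polling controller (sketch).

    pending -- list of booleans; pending[i] is True if bus master i would
               respond to its polling address by raising the request line.
    start   -- polling address at which the counter begins.
    Returns the index of the bus master given the bus, or None.
    The addresses are simply sequenced in numerical order, as a binary
    counter halted by the bus busy line would do.
    """
    n = len(pending)
    for step in range(n):
        address = (start + step) % n     # issue the next polling address
        if pending[address]:             # master activates the common request line
            return address               # controller lets this master use the bus
    return None

print(centralized_poll([False, False, True, True]))   # -> 2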
Decentralized polling schemes
A decentralized polling scheme is shown in Figure 7.19. Each bus master has a bus
controller consisting of an address decoder and an address generator. First, at the
beginning of the polling sequence, an address is generated which is recognized by a
controller. If the associated processor has a request outstanding, it may now use the
bus. On completion, the bus controller generates the address of the next processor in
the sequence and the process is repeated. It is usually necessary to have a hand-
shaking system, as shown in Figure 7.19, consisting of a request signal generated by
the address generator and an acknowledge signal generated by the address decoder.
The decentralized polling scheme, as described, is not resilient to single point
failures, i.e. if a bus controller fails to provide the next polling address, the whole
system fails. However, a time-out mechanism could be incorporated such that if a
bus controller fails to respond in a given time period, the next bus controller takes over.
Figure 7.18 Centralized polling scheme
Figure 7.19 Decentralized polling scheme
Software polling schemes
Although all the priority schemes presented are implemented in hardware to obtain
high speed of operation, the polling schemes lend themselves to a software approach.
The arbitration algorithms could be implemented in software, using processor-based
bus controllers if the speed of operation is sufficient. For example, the bus controller(s)
in the polling scheme could store the next polling addresses, and these could be
modified if a bus master is taken out of service. A degree of fault tolerance or the
ability to reconfigure the system could be built into a polling scheme. For example,
each bus master could be designed to respond on a common "I'm here" line when
polled. No response could be taken as a fault at the bus master, or a sign that the bus
master had been removed from service. However, such schemes are more appro-
priate for systems in which the shared bus is used to pass relatively long messages
between computers, or message-passing systems.
7.3 Performance analysis
In this section we will present an analysis of the single bus system and the
arbitration function. The methods will be continued in Chapter 8 for multiple bus
and other interconnection networks.
7.3.1 Bandwidth and execution time
Suppose requests for memory are generated randomly and the probability that a
processor makes a request for memory is r. The probability that the processor does
not make a request is given by 1 - r. The probability that no processors make a
request for memory is given by (1 - r)^p where there are p processors. The probability
that one or more processors make a request is given by 1 - (1 - r)^p. Since only one
request can be accepted at a time in a single bus system, the average number of
requests accepted in each arbitration cycle (the bandwidth, BW) is given by:

BW = 1 - (1 - r)^p
which is plotted in Figure 7.20. We see that at a high request rate, the bus soon
saturates.
If a request is not accepted, it would normally be resubmitted until satisfied, and
the request rate, r, would increase to an adjusted request rate, say a. Figure 7.21
shows the execution time, T, of one processor with all requests accepted, and the
execution time, Ta, with some requests blocked and resubmitted on subsequent
cycles (Yen et al., 1982). Since the number of cycles without requests is the same in
both cases, we have:

T(1 - r) = Ta(1 - a)

Let Pa be the probability that a request will be accepted with the adjusted request
rate, a, and BWa be the bandwidth. With a fair arbitration policy, each request will
have the same probability of being accepted and hence Pa is given by:

Pa = BWa/(pa) = (1 - (1 - a)^p)/(pa)
Figure 7.20 Bandwidth of a single bus system (--- using rate adjusted equations)
Figure 7.21 Memory references without contention and with contention
The number of requests before a request is accepted (including the accepted request)
is 1/Pa. Hence, we have:

Ta = T + rT(1/Pa - 1)

and then:

a = rT/(PaTa) = r/(Pa + r(1 - Pa))

Here the request rate with the presence of resubmissions is given in terms of the
original request rate and the acceptance probability at the rate a. The equations for
Pa and a can be solved by iteration.
On a single processor system, the execution time would be Tp. If all requests
were accepted, the time on the multiprocessor would be T and the speed-up factor
would be p. In the presence of bus contention and resubmitted requests, the
execution time is Ta and the speed-up factor is given by:

Speed-up = pT/Ta

Figure 7.22 shows the speed-up factor using iteration to compute Pa. We see that the
speed-up is almost linear until saturation sets in. The maximum speed-up is given by
1/r. For example, if r = 0.1 (10 per cent) the maximum speed-up is 10, irrespective
of the number of processors. With r = 0.1 and with ten processors, the speed-up is 9.
Note that the derivation uses random requests; in practice the sequence may not be
random.
Figure 7.22 Speed-up factor of a single bus system (--- using rate adjusted equations)
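The iteration mentioned above is straightforward to carry out. The sketch below (Python, an illustration rather than the book's program) repeatedly recomputes Pa = (1 - (1 - a)^p)/(pa) and a = r/(Pa + r(1 - Pa)) until the values settle, and then reports the adjusted bandwidth and the speed-up pT/Ta; for r = 0.1 and ten processors it gives a speed-up close to the figure of 9 quoted above.

def single_bus_performance(p, r, tol=1e-9):
    """Solve the rate-adjusted single bus equations by iteration (sketch).

    p -- number of processors; r -- original request rate per cycle.
    Returns (a, Pa, BWa, speedup): adjusted request rate, acceptance
    probability, bandwidth and the speed-up factor pT/Ta.
    """
    a = r                                        # start from the unadjusted rate
    for _ in range(1000):                        # simple fixed-point iteration
        Pa = (1.0 - (1.0 - a) ** p) / (p * a)    # Pa = BWa/(pa)
        a_next = r / (Pa + r * (1.0 - Pa))       # resubmissions raise the rate
        if abs(a_next - a) < tol:
            a = a_next
            break
        a = a_next
    Pa = (1.0 - (1.0 - a) ** p) / (p * a)
    BWa = 1.0 - (1.0 - a) ** p
    speedup = p * Pa / (Pa + r * (1.0 - Pa))     # pT/Ta, with Ta = T(Pa + r(1-Pa))/Pa
    return a, Pa, BWa, speedup

# Ten processors with a 10 per cent request rate: speed-up close to 9.
print(single_bus_performance(10, 0.1))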
7.3.2 Access time
From the number of requests before a request is accepted being given by 1/Pi, we
obtain the time before a rejected request from the ith processor is finally accepted
(the access time) as:

Access time = 1/Pi - 1
where P, is the probability that processor i successfully accesses the bus, that is, the
probability that a submitted request is accepted. (An alternative derivation is given
in Hwang and Briggs, 1984.) If a request is immediately accepted in the first
arbitration cycle (i.e. Pi = 1), the access time is zero. The access time is measured in
arbitration cycles. The probability that a processor successfully accesses the bus
will depend upon the arbitration policy, and the initial request rate, r.
Fair priority
The probability, Pi, for a fair priority giving equal chance to all processors was given
by Pa previously, i.e. Pi = (1 - (1 - r)^p)/pr or, if the adjusted rate is used,
Pi = (1 - (1 - a)^p)/pa. Figure 7.23 shows the access time against number of processors
using the adjusted rate equations with iteration.
Figure 7.23 Access time of a single bus system (--- using rate adjusted equations)
Fixed priority
Ignoring resubmitted requests, the probability, Pi, for fixed priority (e.g. daisy chain
arbitration) would seem to be given by:

Pi = (1 - r)^(i-1)

which is the probability that none of the processors with higher priority than
processor i makes a request. The lower processor number corresponds to the higher
priority. (Processor i has priority i and processor i-1 is the next higher priority
processor.) Resubmitted requests have a very significant effect on the computed
access time with fixed priority. Unfortunately it is very difficult to incorporate an
adjusted request rate into the equations as the probability of an individual processor
not making a request is dependent upon other processors. The previous equation is
invalid once a differs from r. Computer simulation can be performed to obtain the most accurate
results.
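The suggestion of simulation can be followed directly. The sketch below (Python, illustrative only, with simplifying assumptions of its own) models a single bus with fixed priority: each cycle, idle processors generate a request with probability r, the lowest-numbered pending request wins and blocked requests are resubmitted. It reports the mean access time, in arbitration cycles, seen by each processor, and also shows how the lowest priority processors can be starved once the bus saturates.

import random

def fixed_priority_access_times(p, r, cycles=200_000, seed=1):
    """Simulate a single bus with fixed (daisy chain style) priority.

    p -- processors (processor 0 has the highest priority), r -- request
    rate per idle processor per cycle.  Returns the mean access time for
    each processor (inf if a processor was never served, i.e. starved).
    """
    random.seed(seed)
    waiting = [None] * p                  # cycles each pending request has waited
    total_wait = [0] * p
    accepted = [0] * p
    for _ in range(cycles):
        for i in range(p):                # idle processors may generate a request
            if waiting[i] is None and random.random() < r:
                waiting[i] = 0
        pending = [i for i in range(p) if waiting[i] is not None]
        if pending:
            winner = min(pending)         # fixed priority: lowest index wins
            total_wait[winner] += waiting[winner]
            accepted[winner] += 1
            waiting[winner] = None
        for i in range(p):                # blocked requests wait another cycle
            if waiting[i] is not None:
                waiting[i] += 1
    return [total_wait[i] / accepted[i] if accepted[i] else float('inf')
            for i in range(p)]

# Four processors at r = 0.25: the highest priority processor waits least.
print(fixed_priority_access_times(p=4, r=0.25))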
7.4 System and local buses
We noted in Section 7.1 that a single bus cannot be used for all processor-memory
transactions with more than a few processors and we can see the bus saturation in
the previous analysis. The addition of a cache or local memory to each processor
would reduce the bus traffic. This idea results in each processor having a local bus
Figure 7.24 Multiple microprocessor system with local buses and system bus
for local memory and input/output, and a system bus for communication between
local buses and to a shared memory, as shown in Figure 7.24. Now let us look at this
type of architecture in detail. Bus arbitration is still necessary at the system bus
level and possibly also at the local bus level.
A local/system bus interface circuit connects the local and system buses together.
Memory addresses are divided into system addresses referring to memory on the
system bus, and local memory addresses referring to memory on the local bus. No
local bus arbitration is required if there is only one processor on the local bus, but
generally system bus arbitration logic is necessary to resolve multiple system bus
requests to select one processor to use the system bus. When a processor issues a
memory address, decode logic identifies the bus. Input/output addresses could be in
local or system space, depending upon the design.
Since blocks of memory locations generally need to be transferred from the
system memory to the local memory before being used, it is advantageous to
provide a direct path between the system bus and the local memory using two-port
memory. Two-port memory can be accessed by one of two buses, sometimes
simultaneously. Small two-port memory with simultaneous access characteristics
are available, but in any event two-port memory can be created (though without
simultaneous access characteristics) using normal random access memory com-
ponents and memory arbitration logic to select one of potentially two requests for
the memory. In effect, the bus arbitration logic is replaced by similar memory
arbitration logic. Care needs to be taken to ensure data consistency in the two-port
memory using critical section locks (see Chapter 6). Most recent microprocessors
have facilities for local and system buses, either built into the processors or
contained in the bus controller interfaces.
Example of microprocessor with local and system bus signals
The 8-bit Zilog Z-280 microprocessor (Zilog, 1986) (a substantial enhancement to the
Z-80 microprocessor) has the ability to distinguish between local bus addresses and
system bus addresses using internal logic. The processor can operate in a multi-
processor configuration or not, by setting a bit in the 8-bit internal processor register
called the Bus Timing and Initialization register. In the non-multiprocessor mode,
only a single bus, the local bus, is supported and the processor is the controlling bus
master for this bus. Other processor-like devices, such as DMA devices, must request
the use of the bus from the Z-280 using the bus request signal (BUSREQ) into the Z-
280. The Z-280 acknowledges the request with the bus acknowledge signal (BUSACK)
and releases the bus by the time the acknowledgement is issued.
In the multiprocessor configuration mode, both local and global buses can be
present and memory addresses are separated into those which refer to the local bus
and those which refer to the global bus using the four most significant bits of the
address. These four most significant bits can be selected as set to 1 or 0 for the local
bus using a processor register called the local address register. Four base bits in this
register are compared to the four address bits and if all four match, a local address
reference is made, otherwise a global memory reference is made. The other four bits
are mask enable bits to override global selection for each address bit when the
corresponding mask bit is set to 0. If all mask bits are set to 0, all memory
references are to the local memory. The Z-280 has four on-chip DMA channels,
which may use the global bus in the same way as the Z-280, using the local address
register to ascertain whether addresses are local or global.
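The base/mask comparison just described can be mimicked in a few lines. The sketch below (Python) is an illustration only: the function and field names are invented, and the assumed address width is a guess, not Zilog's register layout; it simply treats a reference as local when every unmasked one of the top four address bits matches the base bits, as the text describes.

def is_local_reference(address, base_bits, mask_bits, addr_width=24):
    """Decide local versus global bus for a memory reference (sketch).

    base_bits -- 4-bit pattern compared with the four most significant
                 address bits
    mask_bits -- 4-bit enable field; a 0 mask bit forces that bit to
                 'match' regardless, so a mask of 0b0000 makes every
                 reference local
    addr_width -- assumed physical address width (an assumption of this
                 sketch, not taken from the text)
    Returns True for a local memory reference, False for a global one.
    """
    top4 = (address >> (addr_width - 4)) & 0xF    # four most significant bits
    differing = (top4 ^ base_bits) & mask_bits    # mismatches on enabled bits only
    return differing == 0                         # all compared bits match -> local

# Base 0b0000 with only the top bit compared: lower half of memory is local.
print(is_local_reference(0x012345, 0b0000, 0b1000))   # True  (local)
print(is_local_reference(0x912345, 0b0000, 0b1000))   # False (global)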
The local bus is controlled in the same way as in the non-multiprocessor mode,
using BUSREQ and BUSACK, but the processor must request the global bus. This
request is made by issuing a Global Request output (GREQ) from the processor,
which is acknowledged by the Global Acknowledge input (GACK) to the processor.
GREQ would normally enter a global bus arbiter, which resolves multiple requests
and priorities for the global bus, issuing GACK to the processor allowed to use the
global bus.
7.5 Coprocessors
7.5.1 Arithmetic coprocessors
The local bus could, of course, carry more than one processor if suitably designed.
More commonly, it carries coprocessors and DMA devices which are allowed to use
the local bus, though overall control is always returned to the processor. Coprocessors
are devices designed to enhance the capabilities of the central processor and operate
in cooperation with the central processor. For example, an arithmetic coprocessor
enhances the arithmetical ability of the central processor by providing additional
arithmetical operations, such as floating point and high precision fixed point addition,
subtraction, compare, multiplication and division operations. Arithmetic coprocessors
also include floating point trigonometric functions such as sine, cosine and tangent,
inverse functions, logarithms and square root. The coprocessor can perform designated
operations at the same time as the central processor is performing other duties.
Not all the binary patterns available for encoding machine instructions are used
internally by a microprocessor, and it is convenient to arrange an arithmetic
coprocessor to respond to some of the unused bit patterns as they are fetched from the
memory. The main processor would expect the arithmetic coprocessor to supply the
results of any such operations, and in this way the arithmetic coprocessor is seen
simply as an extension to the processor. The coprocessor would be designed for
particular processors.
8086 family coprocessors
The Intel 8087 (Intel, 1979) numeric coprocessor is designed to match the 16-bit
Intel 8086 processors. The 80287 numeric coprocessor matches the 80286 processor.
Figure 7.25 shows an 8087 coprocessor attached to an 8086 processor and a
common bus. The 8086 processor fetches instructions in the normal way and all
instructions are monitored by the 8087 coprocessor. Instructions which begin with
the binary pattern 11011 are assigned in the 8086 instruction set for external
coprocessor operation and are grouped as ESC (escape) instructions. If an ESC
instruction is fetched, the 8087 prepares to act upon it. The ESC instruction also
indicates whether an operand is to be fetched from memory. If an operand fetch is
indicated, the address of the operand is provided in the third and fourth bytes of a
multibyte instruction, and the 8086 fetches the address of the operand; otherwise,
the 8086 will continue with the next instruction. The 8087 recognizes the ESC
instruction and performs the encoded operation. If an operand address is fetched by
the 8086, the address is accepted by the 8087. The 8087 will subsequently fetch the
operand. It is possible for both processors to be operating simultaneously, with the
8086 executing the next instruction. The operations provided in the 8087 coprocessor
include long word length, fixed point and floating point operations. The 8087 has an
internal 8-word, 80-bit-word stack to hold operands. Some coprocessor instructions
operate upon two operands held in the top two locations of the stack. Results can be
stored in the stack or in memory locations.
The operations are performed about 100 times faster than if the 8086 had
performed them using software algorithms. However, once the 8087 has begun
executing an instruction, the two processors act asynchronously. When the 8087 is
executing an instruction, its BUSY output is brought high. BUSY is usually connected
to the TEST input of the 8086. The TEST input can be examined via the 8086 WAIT
instruction. If TEST = 1, the WAIT instruction causes the 8086 to enter wait states,
until TEST = 0. Then the next instruction is executed. Typically, the WAIT instruction
would be executed before an ESC instruction, to ensure that the coprocessor is ready
to respond to the ESC instruction. Hence the two processors can be brought back
into synchronism. Other signals connect between the two processors, including bus
request and grant signals to enable the two processors to share the bus. The 8086
Figure 7.25 CPU with coprocessor
has an internal 6-byte queue used to hold instructions prior to their execution. The
state of the queue is indicated by queue status outputs which the 8087 uses to ensure
proper operation of ESC instructions. (The 8087 can also be connected to the 16-bit
8088 processor which has a 4-byte queue.)
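Since ESC instructions are identified purely by their leading bit pattern, the check the coprocessor performs on each fetched opcode can be written down directly. The short sketch below (Python, for illustration) tests whether the five most significant bits of the first instruction byte are 11011.

def is_esc_opcode(first_byte):
    """Return True if an 8086 instruction byte is an ESC (coprocessor) opcode.

    ESC instructions occupy the opcodes whose five most significant bits
    are 11011, i.e. the byte values 0xD8 to 0xDF.
    """
    return (first_byte & 0b11111000) == 0b11011000

print([hex(b) for b in range(256) if is_esc_opcode(b)])   # 0xd8 ... 0xdf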
MC68020 coprocessors
As with the 8086 family, the Motorola MC68000 family instruction set has some
instruction encoding patterns not used by the processor, and some patterns are
reserved for coprocessors (Beims, 1984). All instructions with the most significant
four bits 1010 (A hexadecimal) or 1111 (F hexadecimal) in the first word are
reserved for future expansion and external devices. Pattern 1111 (F) (called "line-F"
op-codes) is reserved for coprocessor instructions. The MC68020 32-bit processor
supports coprocessors, and coprocessors are attached to the local bus. Communica-
tion between the 68020 and the coprocessor is through data transfers to and from
internal registers within the coprocessor.
The address space of the system is divided into eight spaces using a 3-bit function
code (processor status outputs FC0-FC2) generated by the processor. In the
MC68020, five are defined - user data (001), user program (010), supervisor data
(101), supervisor program (110) and special processor-related functions (111), for
example, breakpoint acknowledge, access level control, coprocessor communication
or interrupt acknowledge. In function code 111, address lines 31 through to 20 are
not used, and address bits 19, 18, 17 and 16 differentiate between the functions.
Coprocessors use A19-A16 = 0010, and A15-A13 to identify the coprocessor,
leaving twelve address bits plus upper/lower byte select lines to identify internal
registers within a particular coprocessor, i.e. up to 8192 bytes can be addressed
within each coprocessor. Thirty-two bytes are defined as coprocessor registers used
for communication with the main processor.
Coprocessor instructions include a 3-bit code in the first word to identify the
coprocessor and the instructions may have extension words. In some cases, the first
word includes the same 6-bit encoding of the effective address as internal MC68000
instructions, and the same addressing modes are available. The instructions are
categorized in one of three groups - general, conditional and system control. In the
general group, the first extension word contains a coprocessor command (defined by
a particular coprocessor). In the conditional group, specific coprocessor tests are
given in a condition selector field. In the system group, operations for virtual
memory systems can be specified.
When the MC68020 fetches a coprocessor op-code (line-F op-code) the processor
communicates with the coprocessor by writing a value into the coprocessor register.
Coprocessors have eleven addressable registers used to hold commands and data
passed to or from the MC68020 processor. For the general coprocessor instruction,
the command in the fetched coprocessor instruction is transferred to the coprocessor
command register. For conditional instructions, the condition selector is transferred
to the coprocessor condition register.
The coprocessor should respond by issuing a 16-bit "primitive" command to the
main processor. The encoding of the primitive commands allows up to sixty-four
functions, though some are reserved. The functions are categorized into five groups
- processor synchronization, instruction manipulation, exception handling, general
operand transfer and register transfer. For example, in processor synchronization,
the MC68020 can be instructed to proceed with the next instruction. In the general
operand transfer group, the MC68020 can be instructed to evaluate the effective
address of the coprocessor instruction and pass the stored data or the address to the
coprocessor. If an addressed coprocessor does not exist in the system, hardware
should issue a bus error signal, and typically the processor will enter a software
routine to emulate the coprocessor operations. Bus error signals are normally
generated if the processor does not receive an acknowledgement after a specific duration.
An example of a Motorola coprocessor is the MC68881 floating point coprocessor.
The overhead of issuing and receiving commands is generally insignificant in
typical coprocessor operations, which might take perhaps 50 microseconds to
complete a complex floating point arithmetic operation.
Attached arithmetic processors
Some early attached arithmetic processors, for example the Intel 8231A Arithmetic
Processing Unit (Intel, 1982), were simply memory mapped or input/output mapped
devices which responded to particular commands from the central processor. Results
were held in an internal stack, which could be examined by the processor under
program control or under an interrupt scheme. These arithmetic processors did not
require special coprocessor instructions in the central processor instruction set and
could be attached to most microprocessor buses.
7.5.2 Input/output and other coprocessors
Apart from arithmetic coprocessors, coprocessors exist to perform other operations
independently, notably:

1. I/O (DMA) controllers/coprocessors.
2. Local area network coprocessors.
3. Graphics coprocessors.
In all cases, the main processor can continue with other operations while the
coprocessor is completing its task. Normally, these coprocessors are attached to the
local bus, though it is possible to provide a separate local bus, as shown in Figure
7.26. This eliminates memory conflicts if the transactions can be completed totally
on the coprocessor local bus. Coprocessors can be provided with their own instruction
set and execute their programs from local memory on a separate bus.
Figure 7.26 Input/output processor with local bus
PROBLEMS
7.1 Prove that the maximum speed-up of a multiprocessor system having
n processors, in which each processor uses the bus for the fraction m of
every cycle, is given by 1/m.
7.2 Identify the relative advantages of the synchronous bus and the
asynchronous bus.
7.3 Identify the relative advantages of parallel arbitration and serial
arbitration.
7.4 Identify the relative advantages of centralized arbitration and decen-
tralized arbitration.
7.5 Identify the relative advantages of the daisy chain grant arbitration
scheme and the daisy chain request arbitration scheme. Which would you
choose for a new microprocessor? Why?
7.6 An 8-to-3 line priority encoder is a logic component which accepts
eight request inputs and produces a 3-bit number identifying the highest
priority input request using fixed priority. A 3-to-8 line decoder accepts
a 3-bit number and activates one of its eight outputs, as identified by the
input number. Show how these components could be used to implement
parallel arbitration. Derive the Boolean expressions for each component
and show that these equations correspond to the parallel arbitration
expressions given in the text.
7.7 Design a parallel arbitration system using three levels of parallel
arbiter parts and determine the arbitration time of the system.
7.8 Suppose a rotating daisy chain priority circuit has the following
signals:
BR Bus request from bus master
BG Bus grant to bus master
BRIN Bus grant daisy chain Input
BROUT Bus grant daisy chain output
and contains one J-K flip-flop whose output, BMAST (bus master),
indicates that the master is the current bus master. Draw a state table
showing the eight different states of the circuit. Derive the Boolean
expressions for the flip-flop inputs, and for BROUT. (See Nelson and
Refai (1984) for solution.)
7.9 For any 16-/32-bit microprocessor that you know, develop the Boolean
expressions and logic required to generate bus request and grant signals
for both local and global (system) buses. The local bus addresses are 0 to
65535 and the global bus addresses are from 65536 onwards.
7.10 Derive Boolean expressions to implement a daisy chain scheme
having three processors at each arbitration site.
7.11 Derive an expression for the arbitration time of a combined serial-
parallel arbitration scheme having m processors, using one n-input
parallel arbiter. (m is greater than n.)
7.12 What is the access time for the highest and next highest priority
processor in a system using daisy chain priority, given that the request
rate is 0.25?
7.13 Suppose a new arithmetic coprocessor can have eight arithmetic
operations. List those operations you would choose in the coprocessor.
Justify.
7.14 Compare and contrast the features and mechanisms of 8086 co-
processors and 68020 coprocessors.

CHAPTER 8
Interconnection networks
Interconnection networks are of fundamental importance to the design and operation
of multiprocessor systems. They are required for processors to communicate
between themselves or with memory modules. This chapter will consider the
interconnection network as applicable to a wide range of multiprocessor archi-
tectures, though with particular reference to general purpose MIMD computers.
Multiple bus systems will be considered as an interconnection network, extending
the single bus interconnection scheme of Chapter 7.
8.1. Multiple bus multiprocessor systems
In the last chapter, we considered single bus multiprocessor systems. We can extend
the bus system to one with b buses, p processors and m memory modules, as shown
in Figure 8.1(a), in which no more than one processor can use one bus simul-
taneously. Each bus is dedicated to a particular processor for the duration of a bus
transaction. Each processor and memory module connects to each of the buses using
electronic switches (normally three-state gates). A connection between two com-
ponents is made, with two connections to the same bus. We will refer to processors
and memory modules only. (Memory and I/O interfaces can be considered as similar
components for basic bus transactions.) Processor-memory module transfers can use
any free bus, and up to b requests for different memory modules can be serviced
simultaneously using b buses. A two-stage arbitration process is necessary, as
shown in Figure 8.1(b). In the first phase, processors make requests for individual
memory modules using one arbiter for each memory module, and up to one request
for each memory module is accepted (as only one request can be serviced for each
module). There might be up to m requests accepted during this phase, with m
memory modules. In the second phase, up to b of the requests accepted in the first
phase are finally accepted and allocated to the b buses using a bus arbiter. If m is
less than b, not all the buses are used. Blocked requests need to be honored later.
Figure 8.1 Multiple bus system (a) Interconnection (b) Arbitration
Clearly, bus contention will be less than in a single bus system, and will reduce as
the number of buses increases; the complexity of the system then increases. Though
extensive analytical studies have been done to determine the performance character-
istics of multiple bus systems, few systems have been constructed for increased
speed. Apart from such applications, multiple bus systems (especially those with
‘two or three buses) have been used in fault tolerant systems. A single bus system
collapses completely if the bus is made inoperative (perhaps through a bus driver
short-circuited to a supply line).
Variations of the multiple bus system have been suggested. For example, not all
the memory modules need to be connected to all the buses. Memory modules can be
grouped together, making connections to some of the buses, as shown in Figure 8.2.
Multilevel multiple bus systems can be created in which multiple bus systems
connect to other multiple bus systems, either in a tree configuration or other
hierarchical, symmetrical or non-symmetrical configurations.
Lang et al. (1983) showed that some switches in a multiple bus system can be
removed (up to 25 per cent) while still maintaining the same connectivity and
throughput (bandwidth). In particular, Lang showed that a single "rhombic" multiple
Figure 8.2 Partial multiple bus system
bus system can be designed with the same connectivity of a full multiple bus
scheme and no reduction in performance whatever when:

1. p - b + 1 ≤ mi ≤ m
2. p + m + 1 - b - mi ≤ pi ≤ p

where mi memory modules and pi processors are connected to bus i. Lang also
showed that the minimum switch configuration can be achieved by keeping the
processor connections complete and minimizing the memory module connections.
We shall use Lang's observations in overlapping multiple bus networks (Section
8.5.2).
8.2 Cross-bar switch multiprocessor systems
8.2.1 Architecture
In the cross-bar switch system, processors and memories are interconnected through
an array of switches with one electronic cross-bar switch for each processor
memory connection. All permutations of processor-memory connections are poss-
ible simultaneously, though of course only one processor may use each memory at
any instant. The number of switches required is p x m where there are p processors
and m memory modules.
Each path between the processors and memories normally consists of a full bus
carrying data, address and control signals, and each cross-bar switch provides one
simultaneous switchable connection. Hence the switch may handle perhaps 60-100
lines if it is to be connected between each processor and each memory. The address
lines need only be sufficient to identify the location within the selected memory
module. For example, twenty address lines are sufficient with 1 Mbyte memory
modules. Additional addressing is necessary to select the memory module. The
memory module address is used to select the cross-bar switch. The cross-bar switch
connections may be made by:
1. Three-state gates.
2. Wired-OR gates.
3. Analog transmission gates.
4. Multiplexer components.
The cross-bar switch connections could be fabricated in VLSI, though the number of
input/output connections is significant. Analog transmission gates have the advantage
of being intrinsically bidirectional.
Each processor bus entering the cross-bar network contains all the necessary
signals to access the memory modules, and would include all the data lines,
sufficient address lines and memory transfer control signals. The switch network can
also be implemented using multiport memory. In effect, then, all of the switches in
one column of the cross-bar are moved to be within one memory module.
The number of switches in a cross-bar network becomes excessive and impractical
for large systems. However, the cross-bar is suitable for small systems, perhaps with
up to twenty processors and memories.
8.2.2 Modes of operation and examples
There are two basic modes of operation for cross-bar switch architectures, namely:

1. Master-slave architecture.
2. Architecture without a central control processor.

Each has distinct hardware requirements.
In the master-slave approach, one processor is assigned as the master processor
and all the other processors are slave processors. All cross-bar switches are controlled
by the master processor, as shown in Figure 8.3. The operating system for this
architecture could also operate on a master-slave principle, possibly with the whole
operating system on the master processor. Alternatively, the central part of the
operating system could be on the master processor, with some dedicated routines
passed over to slave processors which must report back to the master processor. The
slave processors are available for independent user programs. In any event, only the
master processor can reconfigure the network connections, and slave processors
executing user programs must request a reconfiguration through the master processor.
The master-slave approach is certainly the simplest, both for hardware and software
design.
In the cross-bar switch system without central control, each processor controls the
switches on its processor bus and arbitration logic resolves conflicts. Processors
Figure 8.3 Cross-bar switch system with central control (master-slave)
make independent requests for memory modules. Each memory module/bus has its
own arbitration logic and requests for that memory module are directed to the
corresponding memory arbitration logic. Up to one request will be accepted for each
memory module, and other requests are held back for subsequent arbitration cycles.
Arbitration is effected by one arbiter for each memory module receiving requests for
that module, as shown in Figure 8.4.
Perhaps the first example of a cross-bar switch multiple processor system (certainly
the first commercial example) was the Burroughs D-825 four processor/sixteen
memory module cross-bar switch system introduced in 1962 for military applications.
Subsequently, commercial cross-bar switch systems have occasionally appeared,
usually with small numbers of processors. There is at least one commercial example
of a master-slave architecture, the IP-1 (International Parallel Machines Inc.). The
basic configuration of the IP-1 has nine processors, one a master processor, with
eight cross-bar switch memory modules. The system can be expanded to thirty-three
processors. The cross-bar switch memory operates like multiport memory. There has
been at least one small master-slave architecture research project (Wilkinson and
Abachi, 1983).
A significant, influential and extensively quoted but now obsolete cross-bar
switch system without central control called the C.mmp (multi-miniprocessor) was
designed and constructed in the early 1970s at Carnegie-Mellon
University (Wulf and Harbison, 1978). C.mmp employed sixteen PDP-11 computer
systems, each with local memory, connected to sixteen memory modules. In 1978,
Figure 8.4 Cross-bar switch system without central control
at the end of the main investigation, the five original PDP-11s were PDP-11/20s and
the eleven introduced in 1974 to complete the system were the faster PDP-11/40s.
There were 3 Mbytes of memory in total (32 Mbytes possible). The total hardware
cost of $600 000 was divided into $300 000 for the processors, $200 000 for the
shared memory and $100 000 for the cross-bar switch. Apart from the cross-bar
switch communication paths between the processors and memory, a communication
path was established between the processors using an interprocessor (IP) bus. Input/
output devices and backing memory were connected to specific processors.
PDP-11 processor instructions can specify a 16-bit address. This address is
divided into eight 8 Kbyte pages (three most significant bits for the page and
thirteen bits for the location within the page). This address is extended to eighteen
bits on the local bus (Unibus) by concatenating two bits contained in the processor
status word with the 16-bit address. The two processor status word bits cannot be
changed by user programs and constrain user programs to operate within the 16-bit
address space, i.e. within 64 Kbytes (eight 8 Kbyte pages).
In the C.mmp, shared memory is accessed via the cross-bar switch with a 25-bit
address, with the most significant four bits selecting the memory module, i.e. with
high order interleaving. The 18-bit local bus address is translated into a 25-bit shared
memory address by an address translation unit called Dmap, using a direct mapping
technique (Section 2.2.2, page 32). Dmap contains four sets of relocation registers,
with eight registers in each set. One set is selected by the two processor status bits and
the register within the set is selected by the three next significant bits of the address,
i.e. by the page bits. Each register contains a 12-bit frame address and three bits for
memory management. The frame address selected is concatenated with the thirteen
remaining address bits to obtain a 25-bit address. The frame bits are divided into a 4-
bit port number selecting the memory module and an 8-bit page within port.256 Shared memory multiprocessor systems
As C.mmp employed the approach without central control, any processor could
execute any part of the operating system at any time. Shared data structures were
accessed by only one process at a time, using one of two general mechanisms -
either fast simple binary locks for small data structures, or semaphores with a
descheduling and queueing mechanism for larger data structures. A widely reported
disadvantage of the C.mmp, as constructed with PDP-11s, is the small user address
space allowed by the 16-bit addresses.
The cross-bar switch architecture without central control has been used more
recently, for example the S1 multiprocessor system developed for the United States
Navy. The S1 also has sixteen processors connected to sixteen memory modules
through a 16 x 16 cross-bar switch. However, the processors are specially designed
very high speed ECL (emitter-coupled logic) processors.
In a cross-bar system, input/output devices and backing memory can be associated
with particular processors, as in the C.mmp and S1. Alternatively, they can be made
accessible to all processors by interconnecting them to the processors via the same
cross-bar switch network as the memory modules; the cross-bar switch would then
need to be made larger. Input/output devices and backing memory could also be
connected to the processors via a separate cross-bar switch.
There are a number of possible variations in the arrangement of a cross-bar switch
network. For example, Hwang et al. (1989) proposed the orthogonal multiprocessor
using a network in which processors have switches to one of two orthogonal buses
in the cross-bar network. At any instant, the processors can all connect to the
vertical or horizontal buses. Each memory module needs to access only two buses.
Hwang develops several algorithms for this system. Various memory access patterns
are allowed. Overlapping connectivity networks, including cross-bar versions, are
considered in Section 8.5.
8.3 Bandwidth analysis
8.3.1 Methods and assumptions
One of the key factors in any interconnection network is the bandwidth, BW, which
is the average number of requests accepted in a bus cycle. Bandwidth gives the
performance of the system under bus contention conditions. Bandwidth and other
performance figures can be found via one of four basic techniques:
1. Using analytical probability techniques.
2. Using analytical Markov queueing techniques.
3. By simulation.
4. By measuring an actual system performance.
Simplifying assumptions are often made for techniques 1 and 2 to develop a closed-
form solution, which is then usually compared to simulations. Measurements on an
actual system can confirm or contradict analytical and simulation studies for one
particular configuration. We shall only consider probabilistic techniques. The principal
assumptions made for the probabilistic analysis are as follows:
1. The system is synchronous and processor requests are only generated at the
beginning of a bus cycle.
2. All processor requests are random and independent of each other.
3. Requests which are not accepted are rejected, and requests generated in the
next cycle are independent of rejected requests generated in previous cycles.
If bus requests are generated during a cycle, they are only considered at the
beginning of the next cycle. Arbitration actions are only taken at the beginning of
each bus cycle. Asynchronous operation, in which requests can occur and be acted
upon at any time, can be modelled by reducing the cycle time to that required to
arbitrate asynchronous requests. In practice, most bus-based multiprocessor systems
respond to bus requests only at the beginning of bus cycles, or sometimes only at
the beginning of instruction cycles. Instruction cycles would generally be of variable
time, but virtually all published probabilistic analyses assume a fixed bus cycle.
Assumption 2 ignores the characteristic that programs normally exhibit referential
locality for both data and instruction references. However, requests from different
processors are normally independent. A cross-bar switch system can be used to
implement an interleaved memory system and some bandwidth analysis is done in
the context of interleaved memory. Low order interleaving would generally ensure
that references are spread across all memory modules, and though not truly in a
random order, it would be closer to the random request assumption.
According to assumption 3, rejected requests are ignored and not queued for the
next cycle. This assumption is not generally true. Normally when a processor
request is rejected in one cycle, the same request will be resubmitted in the next
cycle. However, the assumption substantially simplifies the analysis and makes very
little difference to the results.
Though it is possible to incorporate memory read, write and arbitration times into
the probabilistic analysis, we will refrain from adding this complexity. Markov
queueing techniques take into account the fact that rejected requests are usually
resubmitted in subsequent cycles.
8.3.2 Bandwidth of cross-bar switch
In a cross-bar switch, contention appears for memory buses but not for processor
buses, because only one processor uses each processor bus but more than one
processor might compete for a memory module and its memory bus. In the multiple
bus system, to be considered later, both system bus contention and memory conten-
tion can limit the performance. In the cross-bar switch, we are concerned with the
probability that more than one request is made for a memory module as, in such
cases, only one of the multiple requests can be serviced, and the other requests must
be rejected.
First, let us assume that all processors make a request for some memory module
during each bus cycle. Taking a small numerical example, with two processors and
three memories, Table 8.1 lists the possible requests. Notice that there are nine
combinations of two requests taken from three possible requests. The average
bandwidth is given by the average number of requests that can be accepted. Fifteen
requests can be accepted and the average bandwidth is given as 15/9 = 1.67.
Memory contention occurs when both processors request the same memory module.
For our two processor/three memory system, we see that processor 1 makes a
request for memory 1 three times, memory 2 three times and memory 3 three times,
and similarly for processor 2. Hence there is a 1/3 chance of requesting a particular
memory.
Table 8.1 Processor requests with two processors and three memories

Memory requests                 Number of           Memory
Processor 1    Processor 2      requests accepted   contention
1              1                1                   YES
1              2                2                   NO
1              3                2                   NO
2              1                2                   NO
2              2                1                   YES
2              3                2                   NO
3              1                2                   NO
3              2                2                   NO
3              3                1                   YES
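The 15/9 figure can be checked by enumerating the nine request combinations of Table 8.1. The short sketch below (Python, illustration only) counts, for each combination, how many distinct memory modules are requested, i.e. how many requests can be accepted, and compares the result with the general formula derived next.

from itertools import product

def average_bandwidth(p, m):
    """Average accepted requests per cycle when each of p processors requests
    one of m memory modules, found by exhaustive enumeration."""
    total, combos = 0, 0
    for requests in product(range(m), repeat=p):
        total += len(set(requests))     # distinct modules = accepted requests
        combos += 1
    return total / combos

print(average_bandwidth(2, 3))                  # 1.666... (15/9)
print(3 * (1 - (1 - 1/3) ** 2))                 # formula m(1 - (1 - 1/m)^p)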
Now let us develop a general expression for bandwidth, given p processors and m
memory modules. We have the following probabilities: The probability that a
processor Pi makes a request for a particular memory module Mj is 1/m for any i and j
(as there is equal probability that any memory module is requested by a processor).
The probability that a processor, Pi, does not make a request for that memory
module, Mj, is 1 - 1/m. The probability that no processor makes a request for the
memory module is given by (1 - 1/m)^p. The probability that one or more processors
make a request for memory module Mj (i.e. the memory module has at least one
request) is 1 - (1 - 1/m)^p. Hence the cross-bar switch bandwidth, i.e. the number of
memory modules with at least one request, is given by:

BW = m(1 - (1 - 1/m)^p)
The bandwidth function increases with p and m and is asymptotically linear for
either p or m, given a constant p/m ratio (Baer, 1980). Alternative explanations and
derivations of bandwidth exist, perhaps the first being in the context of interleaved
memory (Hellerman, 1966). An early derivation for the bandwidth can be found in
Ravi (1972), also in the context of interleaved memory.
The cross-bar switch bandwidth can be derived for the situation in which
processors do not always generate a request during each bus cycle, for example, in a
system having local memory attached to the processors. Let r be the probability that
a processor makes a request. Then the probability that a processor makes a request
for a memory module, Mj, is r/m. For a simple derivation, this term can be
substituted into the previous derivation to get the bandwidth as:

BW = m(1 - (1 - r/m)^p)
Patel (1981) offers an alternative derivation for the bandwidth with requests not
necessarily always generated.
Figure 8.5 shows the bandwidth function. Simulation results (Lang et al., 1982;
Wilkinson, 1989) are also shown using a random number generator to specify the
requests and with blocked requests resubmitted on subsequent cycles. For a request,
rate of 1, the bandwidth equation derived will give a value higher than that found in
simulation and in practice, because rejected requests which are resubmitted in the next
cycle will generally lead to more contention. At request rates of less than 1, the
simulation results can give a higher bandwidth than analysis because there is then an
‘opportunity for blocked requests to be satisfied later, when some other processors do
not make requests.
Figure 8.5 Bandwidth of cross-bar switch network (— analysis, --- simulation)
The probability that an arbitrary request is accepted is given by:

Pa = BW/(rp) = m(1 - (1 - r/m)^p)/(rp)

and the expected wait time for a request to be accepted is (1/Pa - 1)tc, where tc is the
bus cycle time.
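A direct transcription of these cross-bar expressions into code (Python, illustrative sketch only) gives the bandwidth, the acceptance probability and the expected wait per request.

def crossbar_stats(p, m, r, cycle_time=1.0):
    """Cross-bar switch bandwidth, acceptance probability and expected wait.

    p -- processors, m -- memory modules, r -- request rate per cycle,
    cycle_time -- bus cycle time tc (arbitrary units).
    """
    bw = m * (1.0 - (1.0 - r / m) ** p)     # BW = m(1 - (1 - r/m)^p)
    pa = bw / (r * p)                       # probability a request is accepted
    wait = (1.0 / pa - 1.0) * cycle_time    # expected wait before acceptance
    return bw, pa, wait

print(crossbar_stats(p=16, m=16, r=1.0))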
8.3.3 Bandwidth of multiple bus systems
In the multiple bus system, processors and memory modules connect to a set of b
buses, and the bandwidth will depend upon both memory contention and bus
contention. Only a maximum of b requests can be serviced in one bus cycle, and
then only if the b requests are for different memory modules. We noted that
servicing a memory request can be regarded as a two-stage process. First, up to m
memory requests must be selected from all the requests. This mechanism has
already been analyzed in the cross-bar switch system as it is the only selection
process there. We found that the probability that a memory module has at least one
request is 1 - (1 - r/m)^p = q (say). Second, of all the different memory requests
received, only b requests can be serviced, due to the limitation of b buses. The
probability that exactly i different memory modules are requested during one bus
cycle is given in Mudge et al. (1984) (see also Goyal and Agerwala, 1984):

f(i) = C(m, i) q^i (1 - q)^(m-i)

where C(m, i) is the binomial coefficient. The overall bandwidth is given by:

BW = b Σ(i=b to m) f(i) + Σ(i=1 to b-1) i f(i)
‘The first term relates to b or more different requests being made and all b buses
being in use, and the second term relates to fewer than b different requests being
made and fewer than b buses being used. Figure 8.6 shows the bandwidth function
and also simulation results (Lang et al., 1982). As with the cross-bar switch for a
request rate of 1, the simulation bandwidth is slightly less than the analytical value,
Figure 8.6 Bandwidth of multiple bus system (— analysis, --- simulation)
but for request rates of less than 1, the analytical values are less than the simulation
values, as then there is more opportunity for rejected requests to be accepted in later
cycles.
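The two-term bandwidth expression can likewise be evaluated numerically. The sketch below (Python, illustration only) computes q = 1 - (1 - r/m)^p, the binomial probabilities f(i), and then the bandwidth; with as many buses as memory modules it reduces, as expected, to the cross-bar value.

from math import comb

def multiple_bus_bandwidth(p, m, b, r):
    """Bandwidth of a p-processor, m-module, b-bus system (analytical sketch)."""
    q = 1.0 - (1.0 - r / m) ** p                           # P(a module has >= 1 request)
    f = [comb(m, i) * q**i * (1 - q)**(m - i) for i in range(m + 1)]
    saturated = b * sum(f[i] for i in range(b, m + 1))     # b or more modules requested
    partial = sum(i * f[i] for i in range(0, b))           # fewer than b modules requested
    return saturated + partial

print(multiple_bus_bandwidth(p=16, m=16, b=4, r=1.0))
print(multiple_bus_bandwidth(p=16, m=16, b=16, r=1.0))     # matches the cross-bar figure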
In the analysis for the cross-bar switch and for the multiple bus system, we
assume that rejected requests are discarded and do not influence the bandwidth. In
Chapter 7, Section 7.3.1, we presented a method of computing the effect of rejected
requests being resubmitted by adjusting the request rate. This method can be applied
to multiple bus and cross-bar switch networks to obtain a more accurate value for
the bandwidth. However, the method assumes that the rejected requests will be
resubmitted to a memory module selected at random rather than to the same memory
module as would normally happen. This does not matter in the case of the single bus
system with a single path to all memory modules, but has an effect in the case of
multiple buses and cross-bar switches. However, the method does bring the results
closer to actual values from simulation.
Some work has been done to incorporate priority into the arbitration function (see
Liu and Jou, 1987) and to have a "favorite" memory module for each processor
which is more likely to be selected (see Bhuyan, 1985) and to characterize the
reliability (see Das and Bhuyan, 1985). An early example of the use of a Markov
chain model is given by Bhandarkar (1975). Markov models are used by Irani and
Onyüksel (1984) and Holliday and Vernon (1987). Actual measurements and
simulation are used to compare analytical models, for example as in Baskett and
Smith (1976).
8.4 Dynamic interconnection networks
In this section, we will describe various schemes for interconnecting processing elements (processors with memory) or interconnecting processors to memories, apart from using buses. The schemes are applicable to both MIMD and SIMD computer systems, though particular network characteristics might better suit one type of computer system. Our emphasis is on general purpose MIMD computer systems.
8.4.1 General
In a dynamic interconnection network, the connection between two nodes (processors, processor/memory) is made by electronic switches such that some (or all) of the switch settings are changed to make different node to node connections. For ease of discussion, we will refer to inputs and outputs, implying that the transfer is unidirectional; in practice most networks can be made bidirectional. (Of course, the whole network could be replicated, with inputs and outputs transposed.)
Networks sometimes allow simultaneous connections between all combinations of inputs and outputs; such networks are called non-blocking networks. Non-blocking networks are strictly non-blocking if a new interconnection between an arbitrary unused input and an arbitrary unused output can always be made, irrespective of existing connections, without disturbing the existing paths. Some non-blocking networks may require paths to be routed according to a prescribed routing algorithm to allow new input/output connections to be made without disturbing existing interconnections; such non-blocking networks are called wide-sense non-blocking networks. Many networks are formulated to reduce the number of switches and do not allow all combinations of input/output connections simultaneously; such networks are called blocking networks. A network is rearrangeable if any blocked input/output connection path can be re-established by altering the internal switches to reroute paths and free the blockage.
In general, the switches are grouped into switching stages which may have one (or more) inputs capable of connecting to one (or more) outputs. Dynamic networks
can be classified as:
1. Single stage.
2. Multistage.
In a single stage network, information being routed passes through one switching stage from input to output. In a multistage network, information is routed through more than one switching stage from input to output. Multistage networks generally have fewer internal switches, but are often blocking. Some networks have non-blocking characteristics for certain input/output combinations, which may be useful in particular applications.
8.4.2 Single stage networks
A fundamental example of a dynamic single stage network is the cross-bar switch network analyzed previously, in which the stage consists of n × m switches (n input nodes, m output nodes) and each switch allows one node to node connection. This network is non-blocking and has the minimum delay through the network compared to other networks, as only one switch is involved in any path. The number of switches increases as O(nm) (or O(n²) for a square network) and becomes impractical for large systems. We shall see that the non-blocking nature of the cross-bar switch network can be preserved in the multistage Clos network, with substantially fewer switches for large systems. However, the single stage cross-bar switch network is still a reasonable choice for a small system. The complete connectivity and flexibility of the cross-bar is a distinct advantage over multistage blocking networks for small systems. The term "cross-bar" stems from the mechanical switches used in old telephone exchanges.
8.4.3 Multistage networks
Multistage networks can be divided into two types:
1. Cross-bar switch-based networks.
2. Cell-based networks.
Cross-bar switch-based networks use switching elements consisting of cross-bar switches, and hence multistage cross-bar switch networks employ more than one cross-bar switch network within a larger network. Cell-based networks usually employ switching elements with only two inputs and two outputs, and hence could be regarded as a subset of the cross-bar switch network, though the 2 × 2 switching elements in some cell-based networks are not full cross-bar switches. Instead they have limited interconnections.
Multistage cross-bar switch-based networks — Clos networks.
In 1953 Clos showed that a multistage cross-bar switch network using three or more stages could give the full non-blocking characteristic of a single stage cross-bar switch with fewer switches for larger networks. This work was done originally in the context of telephone exchanges, but has direct application to computer networks, especially when the non-blocking characteristic is particularly important.
A general three-stage Clos network is shown in Figure 8.7, having r1 input stage cross-bar switches, m middle stage cross-bar switches and r2 output stage cross-bar switches. Each cross-bar switch in the first stage has n1 inputs and m outputs, with one output to each middle stage cross-bar switch. The cross-bar switches in the middle stage have r1 inputs, matching the number of input stage cross-bar switches, and r2 outputs, with one output to each output stage cross-bar switch. The cross-bar switches in the final stage have m inputs, matching the number of middle stage
Figure 8.7 Three-stage Clos network
cross-bar switches, and n2 outputs. Hence the numbers n1, n2, r1, r2 and m completely define the network. The number of inputs, N, is given by r1n1, and the number of outputs, M, is given by r2n2.
Clearly, any one network input has a path to any network output. Whether the network is non-blocking will depend upon the number of middle stages. Clos showed that the network is non-blocking if the number of cross-bar elements in the middle stage, m, satisfies:

m ≥ n1 + n2 − 1

For a network with the same number of inputs as outputs, the number of inputs/outputs N = r1n1 = r2n2. If n1 = n2 = n, the middle stages are square cross-bar switches and the non-blocking criterion reduces to:

m ≥ 2n − 1
Clos derived the number of switches in a square three-stage Clos network with input and output networks of the same size as:

Number of switches = (2n − 1)(2N + (N/n)²)
Figure 8.8 Three-stage Benes network
resulting in fewer switches than a single stage cross-bar switch when N is greater than about twenty-five for a square network (Broomell and Heath, 1983). It has been shown that a Clos network is rearrangeable if m ≥ n1; otherwise the network becomes blocking.
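A brief Python sketch (illustrative only; the function name is mine) compares the switch count of a non-blocking square three-stage Clos network, using m = 2n − 1 middle switches, against a single stage N × N cross-bar:

# Crosspoints in a square three-stage non-blocking Clos network with N
# inputs/outputs and n inputs per first-stage switch.
def clos_crosspoints(N, n):
    return (2 * n - 1) * (2 * N + (N / n) ** 2)

N = 64
for n in (2, 4, 8, 16):
    print(n, int(clos_crosspoints(N, n)), "vs", N * N)   # 64 x 64 cross-bar: 4096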
Clos networks can be created with five stages by replacing each switching element in the middle row with a three-stage Clos network. Similarly seven, nine, eleven stages, etc. can be created by further replacement. The Benes network is a special case of the Clos network with 2 × 2 cross-bar switch elements. A three-stage Benes network is shown in Figure 8.8. Benes networks could also be classified as cell-based networks.
Cell-based networks
The switching element (or cell) in cell-based networks typically has two inputs and two outputs. A full cross-bar switch 2 × 2 network cell has twelve different useful input/output connections (states). Three further 2 × 2 network patterns exist: one connecting the inputs together, leaving the outputs free; one connecting the outputs together, leaving the inputs free; and one connecting the inputs together and the outputs together; in these there is no input/output connection. A final state has no interconnections. Four binary control signals would be necessary to specify the states of a 2 × 2 network.
Some, if not most, cell-based networks employ 2 × 2 cells which do not have all possible states. The two-state (straight through or exchange) 2 × 2 network is the most common. In practice, once a path is chosen for one of the inputs to either the upper or the lower output, there is only one possible path allowed for the other input (which will be the upper output if the lower output has been taken, or the lower output if the upper output has been taken). Hence, the straight through/exchange states are sufficient and only one binary signal need be present to select which state should exist at any instant.
Most cell-based networks are highly blocking, which can be evidenced by the fact that if there are s switching cells, each with two states, there are only 2^s different states in the complete network. However, with, say, p inputs/outputs, there are p! different combinations of input/output connections, and usually p! is much larger than 2^s.
Each stage of cells can be interconnected in various ways. The baseline network (Feng, 1981), shown in Figure 8.9, is one example of a network with a very convenient self-routing algorithm (destination tag algorithm) in which successive bits of the destination address control successive stages of the network. Each stage of the baseline network divides the routing range into two. The first stage splits the route into two paths, one to the lower half of the network outputs and one to the upper half. Hence, the most significant bit of the destination address can be used to route the inputs to either the upper half of the second stage, when the bit is 0, or to the lower half if the bit is 1. The second stage splits the route into the upper quarter or second quarter if the upper half of the outputs has been selected, or into the third quarter or lower quarter if the lower half has been selected. The second most significant bit is used to select which quarter, once the most significant bit selection has been made. This process is repeated for subsequent stages if present. For eight inputs and outputs there would be three stages, for sixteen inputs and outputs there would be four stages, and so on. The least significant bit controls the last stage. Such self-routing networks suggest packet-switching data transmission.
Shuffle interconnection pattern
The perfect shuffle pattern finds wide application in multistage networks, and can also lead to destination tag self-routing networks. Originally, the perfect shuffle was developed by Pease, in 1968, for calculating the fast Fourier transform (Broomell and Heath, 1983), and was later developed for other interconnection applications by Stone and others. The input to output permutation of the perfect shuffle (2-shuffle) network is based upon shuffling a pack of cards by dividing the pack into two equal parts which are slid together with the cards from each half of the pack interleaved. The perfect shuffle network takes the first half of the inputs and interleaves the second half such that the first half of the inputs pass to the even numbered outputs and the second
Figure 8.9 8 × 8 baseline network
half to the odd numbered outputs. For example, with eight inputs, the first half of the inputs consists of 0, 1, 2 and 3 and the second half of 4, 5, 6 and 7. Input 0 passes to output 0, input 1 to output 2, input 2 to output 4, input 3 to output 6, input 4 to output 1, input 5 to output 3, input 6 to output 5 and input 7 to output 7.
Given that the input/output addresses have the form a_{n-1}a_{n-2} ... a_1a_0, the perfect shuffle performs the following transformation:

Shuffle(a_{n-1}a_{n-2} ... a_1a_0) = a_{n-2}a_{n-3} ... a_0a_{n-1}

i.e. the address bits are cyclically shifted one place left. The inverse perfect shuffle cyclically shifts the address bits one place right.
To make all possible interconnections with the shuffle pattern, a recirculating network can be created by recirculating the outputs back to the inputs until the required connection is made. Exchange "boxes" are introduced; these selectively swap pairs of inputs, as shown in Figure 8.10 (shuffle exchange network). Each exchange box has two inputs and two outputs. There are two selectable transfer patterns, one in which both inputs pass to the two corresponding outputs, and one in which each input passes to the other output (i.e. the inputs are transposed). The exchange boxes transform the address bits by complementing the least significant bit, i.e.
Figure 8.10 Shuffle exchange network
Exchange(a_{n-1}a_{n-2} ... a_1a_0) = a_{n-1}a_{n-2} ... a_1 ā_0

where ā_0 is the complement of a_0. For example, 6 (110) passes over to 7 (111) and 7 passes over to 6. The interconnection function is given by a number of shuffle exchange functions. Any input can be transferred to any output by repeated passes through the network. For example, to make a connection from 0 (000) to 6 (110) would require two passes, one pass to exchange to 1 (001) and shuffle to 2 (010), and one pass to exchange to 3 (011) and shuffle to 6 (110). A maximum of n recirculations are necessary to obtain all permutations.
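The two address transformations can be written as a couple of lines of Python; this sketch (my own naming, offered only as an illustration) reproduces the 0 to 6 route described above:

N_BITS = 3                                # 8-node network, 3-bit addresses

def shuffle(a):                           # cyclic left shift of the address bits
    msb = (a >> (N_BITS - 1)) & 1
    return ((a << 1) | msb) & ((1 << N_BITS) - 1)

def exchange(a):                          # complement the least significant bit
    return a ^ 1

route = 0
for _ in range(2):                        # two passes: exchange then shuffle
    route = shuffle(exchange(route))
print(route)                              # -> 6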
Multistage perfect shuffle networks - Omega network
Rather than recirculate the paths, perfect shuffle exchange networks can be cascaded to become the Omega network, as shown in Figure 8.11. The network (like the baseline network) has the particular feature of the very simple destination tag self-routing algorithm. Each switching cell requires one control signal to select either the upper cell output or the lower cell output (0 specifying the upper output and 1 specifying the lower). The most significant bit of the address of the required destination is used to control the first stage cell; if this is 0 the upper output is selected, and if it is 1, the lower output is selected. The next most significant bit of the destination address is used to select the cell output of the next stage, and so on until the final output has been selected.
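The following Python sketch (again purely illustrative, with names of my own) simulates this destination tag routing for an assumed 8 × 8 Omega network in which each stage is a perfect shuffle of the lines followed by a 2 × 2 cell whose output is chosen by the next destination bit:

N_BITS = 3                                  # 2**3 = 8 inputs and outputs

def shuffle(line):                          # cyclic left shift of the line number
    msb = (line >> (N_BITS - 1)) & 1
    return ((line << 1) | msb) & ((1 << N_BITS) - 1)

def omega_route(source, dest):
    pos = source
    for stage in range(N_BITS):             # most significant destination bit first
        bit = (dest >> (N_BITS - 1 - stage)) & 1
        pos = shuffle(pos)                  # stage interconnection pattern
        pos = (pos & ~1) | bit              # 2 x 2 cell: upper (0) or lower (1) output
    return pos

# every source reaches the requested destination, whatever the source
assert all(omega_route(s, d) == d for s in range(8) for d in range(8))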
The cells used need to be able to select either the upper or the lower output, and a 2 × 2 straight through/exchange cell is sufficient. The Omega network was proposed for array processing applications with four-state cells (straight through/exchange/broadcast upper/broadcast lower). The Omega network is highly blocking, though
Figure 8.11 Omega network
one path can always be made from any input to any output in a free network. The indirect binary n-cube network, which is similar to the Omega network, was proposed for processor to processor interconnections using only two-state cells. (The direct binary n-cube has links between particular nodes and is also called a hypercube; see page 286.) The indirect binary n-cube and Omega networks were found to be functionally equivalent by a simple address translation. Switching networks are deemed equivalent if they produce the same permutations of input/output connections irrespective of their internal connections or actual input/output address numbering system.
Generalized self-routing networks
The self-routing networks such as the Omega, baseline and indirect binary n-cube networks can be extended to use numbering system bases other than two and a generalized q-shuffle. In terms of cards, the q-shuffle takes qr cards and divides the cards into q piles of r cards. Then one card from each pile is taken in turn to create a shuffled pile.
The Delta network (Patel, 1981) is a generalization using a numbering base which can be other than 2 throughout. This network connects a^n inputs to b^n outputs through n stages of a × b cross-bar switches. (Omega, baseline and indirect binary n-cube networks use a = b = 2.) The destination address is specified in base b numbers and the destination tag self-routing algorithm applies. Each destination digit has a value from 0 to b − 1 and selects one of the b outputs of the a × b cross-bar element. An example of a Delta network is shown in Figure 8.12.
Figure 8.12 Delta network (base-4)
The stage to stage link pattern is a four-shuffle in this example. The destination tag self-routing networks have been further generalized into the generalized shuffle network (GSN) (Bhuyan and Agrawal, 1983). The GSN uses a shuffle network pattern constructed from arbitrary number system radices. An example is shown in Figure 8.13. Different radices can be used at each stage.
Note that now the basic 2 × 2 cell is not necessarily employed. Some studies have indicated that better performance/cost might be achieved by, for example, using 4 × 4 networks. In all destination tag routing networks (baseline, Omega, n-cube, and all networks that come under generalized networks) there can be only one route from each input to each output. Hence the networks are not resilient to cell failures. Extra stages can be introduced, as shown in Figure 8.14, to provide more than one path from input to output. This method has been studied by Raghavendra and Varma (1986).
8.4.4 Bandwidth of multistage networks
We derived the bandwidth of a single cross-bar switch as:

BW = m(1 − (1 − r/m)^p)

It follows that for a multistage network composed of stages of a × b cross-bar switches (Delta, GSN, etc.) the number of requests that are accepted and passed on to the next stage by one switch is given by:

b(1 − (1 − r_0/b)^a)

Figure 8.13 Generalized shuffle network stage
Figure 8.14 Extra stage Delta network
where r_0 is the request rate at the input of the first stage. The number of requests on any one of the b output lines of the first stage is given by:

r_1 = 1 − (1 − r_0/b)^a

These requests become the inputs to the next stage, and hence the number of requests at the output of the second stage is given by:

r_2 = 1 − (1 − r_1/b)^a

Hence the number of requests passed on to the output of the final stage can be found by recursively evaluating the function:

r_i = 1 − (1 − r_{i-1}/b)^a

for i = 1 to n, where n is the number of stages in the network, and r_0 = r. The bandwidth is given by:

BW = b^n r_n

as there are b^n outputs in all; there are a^n inputs. The probability that a request will be accepted is given by:

P_A = b^n r_n / (a^n r)
The derivation given is due to Patel (1981) in connection with Delta networks. Figure 8.15 shows the bandwidth and probability of acceptance of Omega networks (Delta networks with a = b = 2) compared to single stage N × N cross-bar switch networks, where N = 2^n. Note that the number of stages in the 2^n × 2^n multistage network is log2 N and this can be significant, i.e. for N = 4096 there are twelve stages.
Figure 8.15 Performance of multistage networks (— Omega, --- full cross-bar switch) (a) Bandwidth (b) Probability of acceptance
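Patel's recurrence is easily evaluated numerically. The Python sketch below (names are my own and the figures quoted in the comments are only indicative) computes the bandwidth and probability of acceptance for an n-stage Delta network and compares a 6-stage Omega network with a 64 × 64 cross-bar:

# a, b : switch size of each stage; n : number of stages; r : input request rate
def delta_bandwidth(a, b, n, r):
    rate = r
    for _ in range(n):                       # request rate on each stage output line
        rate = 1.0 - (1.0 - rate / b) ** a
    bw = (b ** n) * rate                     # b**n network outputs in all
    pa = bw / (a ** n * r)                   # a**n inputs each requesting at rate r
    return bw, pa

print(delta_bandwidth(2, 2, 6, 1.0))         # 6-stage Omega, 64 inputs: BW ~ 23
print(64 * (1 - (1 - 1.0 / 64) ** 64))       # single stage 64 x 64 cross-bar: ~ 40.6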
8.4.5 Hot spots
Though memory references in a shared memory multiprocessor might be spread across a number of memory locations, some locations may experience a disproportionate number of references, especially when used to store locks and synchronization variables. These shared locations have been called hot spots by Pfister and Norton (1985). When a multistage interconnection network is used between the memory and processors, some paths between processors and memories are shared. Accesses to shared information can cause widespread contention in the network, as the contention at one stage of the network can affect previous stages. Consider a multistage network with request queues at the input of each stage. A hot spot in memory occurs and the last stage request queue fills up. Next, requests entering the inputs of that stage become blocked and the queues at this stage fill up. Then requests at the inputs of previous stages become blocked and the queues fill up, and so on, if there are more stages. This effect is known as tree saturation and also blocks requests not even aimed at the hot spot. The whole network can be affected.
Pfister and Norton (1985) present the following analysis to highlight the effect of hot spots. Suppose there are N processors and N shared memory modules, and the memory request rate is r. Let the fraction of these requests which are for the hot spot be h. Then the number of hot-spot requests directed to the hot-spot memory is Nrh. The number of remaining non-hot-spot requests directed to each memory module is Nr(1 − h)/N = r(1 − h), assuming that these requests are uniformly distributed among all memory modules. The total number of requests for the hot-spot memory module is Nrh + r(1 − h) = r(h(N − 1) + 1). The asymptotic maximum number of requests that can be accepted by one memory module per cycle is 1. Hence the asymptotic maximum number of requests accepted from each processor is r/(r(h(N − 1) + 1)) = 1/(h(N − 1) + 1), and the maximum bandwidth is given by:

BW = N/(h(N − 1) + 1)

This equation is plotted in Figure 8.16. We see that even a small fraction of hot-spot requests can have a profound effect on the bandwidth. For example, with h = 0.1 per cent, the bandwidth is reduced to about 500 with 1000 processors. The request rate, r, has no effect on the bandwidth, and for large numbers of processors N, the bandwidth tends to 1/h. For example, when h = 1 per cent, the bandwidth is limited to 100 irrespective of the number of processors.
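The limit is simple enough to tabulate directly; the following few lines of Python (illustrative only) reproduce the behaviour just described:

# Asymptotic bandwidth with N processors/memories and hot-spot fraction h
def hotspot_bw(N, h):
    return N / (h * (N - 1) + 1)

for h in (0.0, 0.001, 0.01, 0.05):
    print(h, round(hotspot_bw(1000, h), 1))
# h = 0.1 per cent roughly halves the bandwidth of a 1000-processor system,
# and for large N the bandwidth tends to 1/h regardless of the request rate.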
Two approaches have been suggested to alleviate the effects of hot spots, namely:

1. Software combining trees.
2. Hardware combining circuits.

In the software approach, operations on a single variable are broken down into operations which are performed separately so as to distribute and reduce the hot spots. The operations are arranged in a tree-like manner and results are passed along
Figure 8.16 Asymptotic bandwidth in presence of hot spots
the tree to the root. Further information on the software approach can be found in Yew, Tzeng and Lawrie (1987).
In the hardware approach, circuits are incorporated into the network to recognize requests for shared locations and to combine the data accesses. In one relatively simple hardware combining network, read accesses to the same shared memory location are recognized at the switching cells and the requests combined to produce one read request to the memory module. The returning data is directed through the network to all required destinations.
Since shared variables are often used as synchronization variables, synchronization operations can be combined. The fetch-and-add operation suggested by Gottlieb et al. (1983) for combining networks returns the original value of a stored variable and adds a constant to the variable, as specified by an operand. The addition is performed by a cell within the network. When more than one such operation is presented to the network, the network recognizes the operations and performs the additions, leaving the memory to return the original value through the network and be presented with one final store operation. The network will modify the value returned to give each processor the value it would have received if the operations had been performed serially.
An example of fetch-and-add operations in a multistage network is shown in Figure 8.17. Three fetch-and-add operations are received from three processors to operate upon memory location M:

Processor 1   f&a M,x
Processor 2   f&a M,y
Processor 3   f&a M,z
Figure 8.17 Fetch-and-add operations in a multistage network
Suppose the original value stored in M is w. As requests are routed through the network, individual cells perform additions and store one of the increments internally. In Figure 8.17, the first two requests are routed through the same cell and x and y are added together to be passed forward, to be added to the z from the third operation. The result, x + y + z, is presented to the memory and added to the stored value, giving w + x + y + z stored in the memory. The original value, w, is passed back through the network. At the first cell encountered, x + y had been stored and this is added to the w to give w + x + y, which is routed towards processor 3, and w is routed towards processors 1 and 2. At the next cell, x had been stored, and this is added to the w to give w + x, which is routed to processor 2, while w is routed to processor 1. Hence the three processors receive w, w + x and w + x + y respectively, which are the values they might have received had the operations been performed separately (actually the values if the operations were performed in the order: first processor 1, then processor 2 and then processor 3).
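A much simplified Python sketch of the end result of this combining is given below. It does not model the switching cells themselves, only the property that the combined memory update equals the total increment and that each processor receives the prefix value it would have seen under serial execution; the function name and calling convention are my own:

# memory_value : original value w in location M
# increments   : increments in the combining order (processor 1, 2, 3, ...)
def combine_fetch_and_add(memory_value, increments):
    new_memory_value = memory_value + sum(increments)   # one final store to M
    returns, running = [], memory_value
    for inc in increments:                               # serial (prefix-sum) results
        returns.append(running)
        running += inc
    return new_memory_value, returns

# w = 10 stored in M; processors add x = 1, y = 2, z = 3
print(combine_fetch_and_add(10, [1, 2, 3]))   # (16, [10, 11, 13]) i.e. w, w+x, w+x+y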
8.5 Overlapping connectivity networks
In this section we will introduce a class of networks called overlapping connectivity networks (Wilkinson, 1989). These networks have the characteristic that each processor can connect directly to a group of memory modules and processors, and to other groups through intermediate processors. Adjacent interconnection groups include some of the same memories and processors. The networks are attractive, especially for very large numbers of processors which cannot be provided with full connectivity but need to operate with simultaneous requests to locally shared memory or with communication between processors. Applications for cascaded/overlapping connectivity networks include image processing, neural computers and dataflow computers.
8.5.1 Overlapping cross-bar switch networks
Two forms of an overlapping connectivity "rhombic" cross-bar switch scheme are shown in Figure 8.18. In Figure 8.18(a) each memory module has two ports, and processors can access whichever side their buses connect to. The buses form rings by connecting one edge in the figure to the opposite edge, and the scheme expands to any size. With four buses, as shown in the figure, processor P_i can connect to memory modules M_{i-3}, M_{i-2}, M_{i-1}, M_i, M_{i+1}, M_{i+2}, M_{i+3} and M_{i+4}, using one of the two ports on each memory, for all i, where M_i is the memory to the immediate left of processor P_i. Hence, each processor has an overlapping range of eight memory modules. In the general case of b vertical and b horizontal buses in each group, processor P_i can connect to memory modules M_{i-b+1} ... M_i, M_{i+1} ... M_{i+b}, i.e. 2b memory modules. Connections from processor to memory modules are
Figure 8.18 Cross-bar switch with overlapping connectivity (a) With two-port memory (b) With two-port processors
made with one cross-bar switch. Since two memory modules are accessed via each bus, there will be bus contention as well as memory contention. The bus contention could be removed by providing separate buses for the memory modules to the right and left of the processors, but this would double the number of switches and buses. We shall assume only one bus providing access to two memory modules, with separate memory addresses used to differentiate between the memory modules.
Let the total number of processors and memory modules in the system be P and M respectively, and the number of processors and memory modules in each section be p and m respectively. Then Mb switches are needed in the cascaded networks, compared with MP in a cross-bar switch (M² in a square switch).
In Figure 8.18(b), single port memory modules are used, together with processors having access to two buses. With four buses, as shown in the figure, all processors can connect to four memory modules on each side of the processor, or to 2b memory modules when there are b buses. There are 2b − 2 memory modules common to two adjacent sets of reachable elements, as in Figure 8.18(a). Note that not all requests can be honored, because the corresponding bus may be used to honor another request to a different memory module, i.e. the system has bus contention because two processors share each bus. The bus arbitration might operate by memory module arbiters independently choosing a request from those pending, and when two requests which require the same bus are chosen, only one is accepted. Ideally, the arbitration circuitry should consider all requests pending before making any selection of requests, so that alternative selections can be made to avoid bus contention when possible.
Bandwidth
The bandwidth of the networks in Figure 8.18 with one stage (i.e. a single stage "rhombic" cross-bar switch network with circular bus connections) can be derived in a similar fashion to that of a full cross-bar switch network and leads to:

BW = M(1 − (1 − r/m)^m)

where M is the total number of memory modules, m is the number of memory modules reached by each processor, and there are the same number of processors as memories. Figure 8.19 shows the bandwidth function plotted against the range of requests for a single stage network, and simulation results when rejected requests are resubmitted until satisfied.
The bandwidth of the unidirectional cascaded rhombic cross-bar network can be derived by deducing the probability that a request has been issued for a memory in an immediate group, r_im say, and by substituting r_im for r in the previous equation. Suppose that each processor generates an output request from either an internal program queue or from an input buffer holding requests passed on from adjacent processors, and that the program queue has priority over the input buffer. As a first approximation, we can consider the program queue from the nearest processor, and then, if no program requests are present, the program queue from the next processor in
Figure 8.19 Bandwidth of single stage rhombic cross-bar network (— analysis, --- simulation)
the previous cycle is passed forward, then that of the next processor in the cycle before. This leads to:

r_im = r/g + (1 − r)r/g + (1 − r)²r/g + ... + (1 − r)^(g−1)r/g
     = (1 − (1 − r)^g)/g

and hence:

BW = M(1 − (1 − (1 − (1 − r)^g)/(gm))^m)

where requests from each processor extend over g groups of memories. This equation ignores queueing, but has been found to compare reasonably well with simulation results of the network. Figures 8.20(a) and (b) show the bandwidth of the
cascaded network against range and against number of buses respectively. Simulation
results are also shown.
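A short Python sketch of these two expressions follows. It is only an illustration of the formulas as given above, assuming (by the symmetry of the rhombic scheme) that each memory module is reachable by the same number of processors, m, as the number of modules reachable by each processor; the function names are mine:

# M : total memory modules, m : modules reachable per processor, r : request rate
def rhombic_bw(M, m, r):
    return M * (1.0 - (1.0 - r / m) ** m)

# g : number of groups over which a processor's requests extend
def cascaded_rhombic_bw(M, m, r, g):
    r_im = (1.0 - (1.0 - r) ** g) / g     # rate directed at the immediate group
    return rhombic_bw(M, m, r_im)

print(rhombic_bw(64, 8, 1.0), cascaded_rhombic_bw(64, 8, 1.0, 4))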
The overlapping connectivity cross-bar switch network can be expanded into two or higher dimensional networks. A two-dimensional network is shown in Figure 8.21. The processors (or processing elements) are identified by the tuple (i, j) along the two diagonals. Each processor in Figure 8.21 can reach twelve other processors with an overlapping connectivity, P_{i,j} reaching its neighboring processors through the horizontal and vertical buses. The scheme can be expanded to provide more processors within each group. In the general case, if each bus has c switching elements, 4c + 4 processors can be reached
Figure 8.20 Bandwidth of cascaded rhombic cross-bar network (— analysis, --- simulation) (a) Bandwidth against range of requests (b) Bandwidth against number of buses
by any processor (with the edges wrapping round). The switch points could be three-state switches providing left, right and cross-over paths. However, two-state switches providing cross-over and either right or left turn are sufficient. By always crossing over or making one a right turn (say), a path will be established between two processors.
8.5.2 Overlapping multiple bus networks
Figure 8.22 shows two overlapping bus configurations. In Figure 8.22(a) there are four buses, with four processors connecting to the buses. As in all multiple bus systems, two connections need to be made for each path. Under these circumstances, with four buses, processor P_i can connect to a group of processing elements to the
Figure 8.21 Two-dimensional scheme with overlapping connectivity
immediate left, P_{i-3}, P_{i-2} and P_{i-1}, and to the immediate right, P_{i+1}, P_{i+2} and P_{i+3}, for all i. P_{i-3} can be reached through one bus, P_{i-2} through two buses, P_{i-1} through three buses, P_{i+1} through three buses, P_{i+2} through two buses and P_{i+3} through one bus, for all i. As the processor to be reached is further away, there are fewer buses available and consequently less likelihood of reaching the processor. In the general case, processor P_i can connect to processors P_{i-b+1} ... P_{i-1}, P_{i+1} ... P_{i+b-1}, i.e. 2(b − 1) other processors. There are b − 1 buses available to connect processors P_{i-1} and P_{i+1}, and a linearly decreasing number of buses for more distant processors, which is an appropriate characteristic. The scheme as described is appropriate for interconnecting processors with local memory. Global memory could be incorporated into the interconnection network by replacing some processors with memory modules.
An overlapping connectivity multiple bus scheme with fewer buses than elements in each group, and with both processors and memory modules, is presented in Figure 8.22(b). The processors are fully connected to the buses and the memory is partially connected to the buses. (Memory modules fully connected and processors partially connected is also possible.) Since each group of memory modules connects to two adjacent sets of buses, these modules can be shared between adjacent groups of processors. The
Figure 8.22 Multiple bus scheme with overlapping connectivity (a) With processors (b) With processors and memory modules
scheme can be considered as composed of a number of rhombic cross-bar switches cascaded together, similar to Lang's simplification (Section 8.1). A suitable rhombic configuration would be eight processors completely connected to eight buses and sixteen memory modules connecting to the buses in a rhombic pattern.
In Figure 8.22(b), the memory modules form the Lang rhombic pattern but are divided by processors which are fully connected to the buses. Hence, the same connectivity is possible between the processors and the memory modules on both sides of the processors (given suitable b and m to satisfy Lang's conditions). If we ignore contention arising when requests from adjacent rhombic groups are made to the same shared memory module, the bandwidth can be derived from the bandwidth of a
fully connected multiple bus system.

8.6 Static interconnection networks
8.6.1 General
Static interconnection networks are those which allow only direct fixed paths between two processing elements (nodes). Each path could be unidirectional or bidirectional. In the following, we will generally assume links capable of bidirectional transfers when counting the number of links. The number of links would, of course, be double if separate links were needed for each direction of transfer. Static interconnection networks are particularly suitable for regular processor-processor interconnections, i.e. in which all the nodes are processors and processors can process incoming data or pass the data on to other processors. We will find that static networks are used in the multiple processor VLSI structures described in Chapter 9.
In general, the number of links in a static interconnection network when each element has the same number of links is given by (number of nodes) × (number of links per node)/2, the factor of 1/2 being due to each link being shared by two nodes.
8.6.2 Exhaustive static interconnections
In exhaustive or completely connected networks, all nodes have paths to every other node. Hence n nodes can be exhaustively interconnected with n − 1 paths from each node to the other n − 1 nodes. There are n(n − 1)/2 paths in all. If each direction of transfer involves a separate path, there are n(n − 1) paths. Exhaustive interconnection has application for small n. For example, a set of four microprocessors could reasonably be exhaustively interconnected using three parallel or serial ports attached to each microprocessor. All four processors could send information simultaneously to other processors without contention. The absence of contention makes static exhaustive interconnections particularly attractive when compared to the non-exhaustive shared path connection schemes to be described. However, as n increases, the number of interconnections clearly becomes impractical for economic and engineering reasons.
8.6.3 Limited static interconnections
Interconnections could be limited to, say, a group of the neighboring nodes; there
are numerous possibilities. Here we will give some common examples.
Linear array and ring structures
A one-dimensional linear array has connections limited to the nearest two neighbors and can be formed into a ring structure by connecting the free ends, as shown in Figure 8.23. The interconnections might be unidirectional, in which case the former creates a linear pipeline structure; alternatively the links might be bidirectional. In either case, such arrays might be applicable to certain computations. Each node requires two links, one to each neighboring node, and hence an n node array requires n links. In the chordal ring network, shown in dotted lines, each node connects to its neighbors as in the ring, but also to one node three nodes apart. There are now three links on each node and 3n/2 links in all.
Two-dimensional arrays
A two-dimensional array or near-neighbor mesh can be created by having each node in a two-dimensional array connect to all of its four nearest neighbors, as shown in Figure 8.24. The free ends might circulate back to the opposite sides. Now each node has four links and there are 2n links in all. This particular network was used in the Illiac IV computer with an 8 × 8 array, and is popular in VLSI structures because of the ease of layout and expandability.
The two-dimensional array can be given extra diagonal links. For example, one, two, three or all four diagonal links can be put in place, allowing connections to diagonally adjacent nodes. With all four diagonal links, each node has eight links and the network has 4n links.
Figure 8.23 Linear array
Figure 8.24 Two-dimensional array
In Figure 8.25, each node has six links and there are 3n links in the network. This network is also called a systolic array, as it can be used in systolic multiprocessors.
Star network
The star connection has one central node into which all other nodes connect. There are n − 1 links in all, i.e. the number of links grows proportionally to n, which is generally the best one could hope for, and any two nodes can be reached in two paths. However, the central node must pass on all transfers to the required destinations, and substantial contention or a bottleneck might occur under high traffic. Also, should the central node fail, the whole system would fail. This might be the case in other networks if additional mechanisms were not incorporated into the system to route around faulty nodes but, given alternative routes, fault tolerance should be possible. Duplicated star networks would give additional routes.
Tree networks
The binary tree network is shown in Figure 8.26. Apart from the root node, each node has three links and the network fans out from the root node. At the first level below the root node there are two nodes. At the next level there are four nodes, and at the jth level below the root node there are 2^j nodes (counting the root node as level 0). The number of nodes in the system down to the jth level is:

n = N(j) = 1 + 2 + 2² + 2³ + ... + 2^j
         = (2^(j+1) − 1)/(2 − 1)
         = 2^(j+1) − 1
Figure 8.25 Hexagonal configuration
Figure 8.26 Tree structure
and the number of levels, j + 1, is log2(n + 1). This network requires n − 1 links. (The easiest way to prove this expression is to note that every additional node except the root node adds one link.)
The tree network need not be based upon base two. In an m-ary tree, each node connects to m nodes beneath it and one from above. The number of nodes in this system down to the jth level is:

n = N(j) = 1 + m + m² + ... + m^j
         = (m^(j+1) − 1)/(m − 1)

and the number of levels, j + 1, is log_m(n(m − 1) + 1). Again, the network requires n − 1 links, but fewer intermediate nodes are needed to connect nodes as the value of m is increased.
The binary and general m-ary tree networks are somewhat similar to the star network in terms of routing through combining nodes. The root node is needed to route from one side of the tree to the other. Intermediate nodes are needed to route between nodes which are not directly connected. This usually means travelling from the source node up the tree until a node common to the paths of both the source and the destination is reached, and then down to the destination node.
The networks so far described are generally regular, in that the structure is symmetrical. In irregular networks, the symmetry is lost in either the horizontal or vertical directions, or in both directions. An irregular network can be formed, for example, by removing existing links from a regular network or by inserting extra links. The binary tree network is only regular if all nodal sites are occupied, i.e. the tree has 1 node, 3 nodes, 7 nodes, 15 nodes, 31 nodes, etc.
Hypertree networks
In the hypertree network (Goodman and Séquin, 1981) specific additional links are put in place directly between nodes to reduce the "average distance" between nodes. (The average distance is the average number of links that must be used to connect two nodes; see page 287.) Each node is given a binary address, starting at the root node as node 1, the two nodes below it as nodes 2 and 3, with nodes 4, 5, 6 and 7 immediately below nodes 2 and 3. Node 2 connects to nodes 4 and 5, node 3 connects to nodes 6 and 7, and so on. The additional links of the hypertree connect nodes whose binary addresses differ by only one bit (a Hamming distance of one). Notice that the hypertree network is not regular.
Cube networks
In the 3-cube network, each node connects to its neighbors in three dimensions, as shown in Figure 8.27. Each node can be assigned an address which differs from those of adjacent nodes by one bit. This characteristic can be extended to higher dimension n-cubes, with each node connecting to all nodes whose addresses differ in one bit position, one for each dimension. For example, in a 5-cube, node number 11101 connects to 11100, 11111, 11001, 10101 and 01101. The number of bits in the nodal address is the same as the number of dimensions. N-cube structures, particularly higher dimensional n-cubes, are commonly called hypercube networks. The generalized hypercube (Bhuyan and Agrawal, 1984) can use nodal address radices other than 2, but still uses the characteristic that the addresses of interconnected nodes differ in each digit position. The (binary) hypercube is an important interconnection network; it has been shown to be suitable for a very wide range of applications. Meshes can be embedded into a hypercube by numbering the edges of the mesh in Gray code. In Chapter 9, we will describe message-passing multiprocessor systems using hypercube networks.
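The addressing rule is simple to state in code. The following small Python sketch (illustrative only, names mine) lists the neighbors of a node by complementing each address bit in turn, reproducing the 5-cube example above:

# Neighbours of a hypercube node differ in exactly one address bit.
def hypercube_neighbours(node, dimensions):
    return [node ^ (1 << d) for d in range(dimensions)]

print([format(n, '05b') for n in hypercube_neighbours(0b11101, 5)])
# ['11100', '11111', '11001', '10101', '01101']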
Numerous other networks have been proposed, though in most cases they have not been used to a significant extent. In the cube connected cycles network, 2^k nodes divided into 2^(k−r) groups of 2^r nodes are connected such that the 2^r nodes of each group are placed at one vertex of a (k − r)-cube network. Each group of 2^r nodes is connected in a loop,
Figure 8.27 Three-dimensional hypercube
with one connected to each of the two neighboring nodes and also one link to a
corresponding node in another dimension.
Though we have described direct link static networks in terms of communicating nodes, some networks could be used for shared memory systems. For example, the nodes in the network could contain shared memory which can be reached by processors in other nodes using links that operate as buses. A possibility is to have multiple buses which extend through to other nodes. This can, for example, lead to an overlapping connectivity mesh network. In a spanning bus hypercube network, each node connects to one bus in each dimension of the network. For a two-dimensional network, nodes connect to two buses or two sets of buses that stretch in each of the two dimensions. For a three-dimensional network, each node connects to three buses.
8.6.4 Evaluation of static networks
Clearly, there are numerous variations on limited interconnections, some of which suit particular computations. With limited interconnections, some transfers will require data to pass through intermediate nodes to reach the destination node. Whatever the limited connection network devised, there must be a means of locating the shortest route from the source to the destination. A routing algorithm which is easy and fast to implement is preferable.
Request paths
A critical factor in evaluating any interconnection network is the number of links between two nodes. The number of intermediate nodes/links is of interest because this governs the overall delay and the collision potential. The average distance is defined as (Agrawal et al., 1986):

Average distance = Σ(d=1 to Max) d N_d / (N − 1)

where N_d is the number of nodes separated by d links, Max is the maximum distance necessary to interconnect two nodes (not the maximum distance along any route, which would be unbounded) and N is the number of nodes. For any particular network, interconnection paths for all combinations of nodal connections would need to be computed, which is not always an easy task. Notice that the average distance formula may not give the actual average distance in an application.
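For small networks the definition above can be evaluated directly by a breadth-first search over the links. The Python sketch below (names mine, offered only as an illustration) does this for a three-dimensional hypercube:

from collections import deque

def average_distance(nodes, neighbours):
    total, count = 0, 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:                          # breadth-first search from src
            u = queue.popleft()
            for v in neighbours(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())           # distances from src to all other nodes
        count += len(nodes) - 1
    return total / count

dims = 3
nodes = range(2 ** dims)
print(average_distance(nodes, lambda u: [u ^ (1 << d) for d in range(dims)]))
# 3-cube: (1*3 + 2*3 + 3*1)/7 = 12/7, about 1.71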
Number of links
Another factor of interest is the number of links emanating from each node, as this gives the node complexity. The number of links is usually fairly obvious from the network definition. With an increased number of links, the average distance is shorter; the two are interrelated. A normalized average distance is defined as:

Normalized average distance = average distance × links/node

which gives an indication of network performance taking into account its complexity. The message density has been defined as:

Message density = (average distance × number of nodes)/(total number of links)
In a limited static interconnection network, distant nodes can be reached by passing requests from a source node (processor) through intermediate nodes (in "levels"). Links to four neighbors reach 4(2i − 1) nodes at the ith level from the node. For hexagonal groups (Figure 8.25), there are 6i nodes at the ith level, i.e. the number of nodes at each level increases proportionally, and the number of nodes that can be reached, n, is given by:

n = Σ(i=1 to L) 6i = 3L(L + 1)

where L is the number of levels. In the hexagonal configuration, every node at each level can be reached by one path from the previous level (this is not true for the square configuration). The average number of levels to reach a node, and hence the average number of requests in the system for each initial nodal request, is given by:

L_av = 6 Σ(i=1 to L) i² / n

To place an upper bound on the number of simultaneous requests in the system, requests from one processor to another can be passed on through only a fixed number of nodes.
Bandwidth of static networks
We have seen that the performance of dynamic networks is often characterized by their bandwidth and also by their probability of acceptance. The bandwidth and probability of acceptance metrics can be carried over to static networks, though this is rarely done. One example, for a direct binary n-cube (hypercube), is given in Abraham and Padmanabhan (1989). We can make the following general analysis for any static network.
Suppose that each node has input requests and can generate output requests, either by passing input requests onwards or from some internal program (internal requests). Let the probability that a node generates an internal request for another node be r. The requested node might be one directly connected to it, or it might be one which can be reached through intermediate nodes. In the latter case, the request must be presented to the intermediate nodes as external requests, but these nodes might also have internally generated requests and only one request can be generated from a node, irrespective of how many requests are present. There could be at most one internal request and as many external requests as there are links into the node. Let r_o be the probability that a node generates a request (either internally or by passing on an external request) and r_i be the probability that a node receives an external request. Some external requests will be for the node itself and only a fraction, say A, will be passed onwards to another node. Incorporating A, we get:

r_o = r + A r_i (1 − r)

and the bandwidth is given by:

BW = (1 − A) r_i N
where there are N nodes. The value of A will depend upon the network.
The probability that an external request is received by node i from a node j will depend upon the number of nodes that node j can request, i.e. the number of nodes connected directly to node j, and the probability is given by r_o/n, where n nodes connect directly to node j and all links are used. We shall assume that all nodes have the same number of links to other nodes and, for now, that all are used. The probability that node j has not requested node i is given by (1 − r_o/n). The probability that no node has requested node i is given by (1 − r_o/n)^n. The probability that node i has one or more external requests at its inputs is given by:

r_i = 1 − (1 − r_o/n)^n

The probability that a node generates a request, in terms of the probability of an internal request and the number of nodes directly connected (and communicating) to the node, is given by:

r_o = r + A(1 − r)(1 − (1 − r_o/n)^n)

which is a recursive formula that converges by repeated application. A suitable initial value for r_o is r; the converged r_o will be some value in excess of r.
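The recursion is easily evaluated numerically. The Python sketch below (illustrative only; the function and parameter names are mine) iterates the formula from the initial value r and then computes the bandwidth:

# N : nodes, n : links per node, r : internal request rate,
# A : fraction of received external requests passed onwards
def static_network_bw(N, n, r, A, iterations=50):
    r_o = r                                   # initial value for the recursion
    for _ in range(iterations):
        r_i = 1.0 - (1.0 - r_o / n) ** n      # node receives >= 1 external request
        r_o = r + A * (1.0 - r) * r_i         # internal request, or pass one on
    return (1.0 - A) * r_i * N

print(static_network_bw(N=64, n=4, r=0.5, A=0.5))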
The derivation assumes that an external request from node j to node i could be sent through node i and back to node j, which generally does not occur, i.e. an external request passing through node i can only be sent on to n − 1 nodes at most, and more likely only to nodes at the next level in the sphere of influence (up to two nodes in the hexagonal configuration), whereas internal requests will generally have an equal probability of requesting any of the connected nodes.
PROBLEMS
8.1 Suggest relative advantages of the cross-bar switch system with
central control and the cross-bar switch system without central control.
8.2 Design a 16 × 16 cross-bar switch multiprocessor system using
microprocessors (any type) for the master-slave mode of operation. Give
details at the block diagram level of the major components.
8.3 Repeat the design in Problem 8.2 for a system without central
control.
8.4 Derive an expression for the probability that i requests are made for a particular memory, given that the probability that a request is made by one processor is r and there are m memories. (Clue: look at the Bernoulli formula.) Using this expression, derive the general expression for the bandwidth of a p × m cross-bar switch system.
8.5 Derive an expression for the bandwidth of a cross-bar switch system, given that each processor has an equal probability of making a request for any memory or of not making a request at all.
8.6 Design an 8-bus multiple bus multiprocessor system using micro-
processors (any type) for a system without a master processor. Give
details at the block diagram level of the major components.
8.7 Suggest how a multiple bus system could be designed for a master-slave operation. Are there any advantages of such systems?
8.8 Derive an expression for a multiple bus system in which the bus
arbitration is performed before the memory arbitration. Show that this
arrangement leads to a lower bandwidth than the normal method of
having memory arbitration before the bus arbitration.
8.9 Figure 8.28 shows a combined cross-bar switch/shared bus system without central control. There are P processors and M memory modules in the system, with p processors sharing each horizontal bus. Show that the bandwidth of the system is given by:

BW = M(1 − (1 − (1 − (1 − r)^p)/M)^(P/p))
Figure 8.28 System for Problem 8.9
8.10 Design a non-blocking Clos network for sixty-four processors and sixty-four memories.
8.11 Identify relative advantages of multistage networks and single
stage networks.
8.12 Ascertain all input/output combinations in an 8 × 8 single stage recirculating shuffle exchange network which require the maximum number of passes through the network.
8.13 How many stages of a multistage Omega network are necessary to interconnect 900 processors and 800 memories? What is the bandwidth when the request rate is 40 per cent? Make a comparison with a single stage cross-bar switch network.
8.14 Design the logic necessary with each cell in an 8 x 8 Omega
network for self-routing.
8.15 Determine whether it is possible to connect input i to output i in an 8 × 8 Omega network for all i simultaneously.
8.16 Show that a three-stage indirect binary n-cube network and a three-
stage Omega network are functionally equivalent.
8.17 Illustrate the flow of information in a three-stage multistage network with fetch-and-add operations, given that four processors execute the following:

Processor 1   f&a 120,9
Processor 2   f&a 120,8
Processor 3   f&a 120,7
Processor 4   f&a 120,6
8.18 Derive the average distance between two nodes in a three-dimensional hypercube.
8.19 Demonstrate how each of the following structures can be implemented on a hypercube network:

1. Binary tree structure.
2. Mesh network.
8.20 Derive an expression for the number of nodes that can be reached in a north-south-east-west nearest neighbor mesh network at the Lth level from the node.

Part III  Multiprocessor systems without shared memory

CHAPTER 9
Message-passing multiprocessor systems
This chapter concentrates upon the design of multiprocessor systems which do not use global memory; instead, each processor has local memory and communicates with other processors via messages, usually through direct links between processors. Such systems are called message-passing multiprocessors and are particularly suitable when there is a large number of processors.
9.1 General
9.1.1 Architecture
The shared memory multiprocessors described in the previous chapters have some
distinct disadvantages, notably:
1. They do not easily expand to accommodate large numbers of processors.
2. Synchronization techniques are necessary to control access to shared variables.
3. Memory contention can significantly reduce the speed of the system.
Other difficulties can arise in shared memory systems. For example, data coherence must be maintained between caches holding shared variables. Shared memory is, however, a natural extension of a single processor system. Code and data can be placed in the shared memory to be accessed by individual processors.
One alternative to the shared memory multiprocessor system, which totally eliminates the problems cited, is to have only local memory and remove all shared memory from the system. Code for each processor is loaded into its local memory and any required data is stored locally. Programs are still partitioned into separate parts, as in a shared memory system, and these parts are executed concurrently by individual processors. When processors need to access information from other processors, or to send information to other processors, they communicate by sending messages, usually along direct communication links. Data words are not stored globally in the system; if more than one processor requires the data, it must be duplicated and sent to all requesting processors.
The basic architecture of the message-passing multiprocessor system is shown in
Figure 9.1. The message-passing multiprocessor consists of nodes, which are normally
connected by direct links to a few other nodes. Each node consists of an instruction
processor with local memory and input/output communication channels. The system
is usually controlled by @ host computer, which loads the local memories and
accepts results from the nodes. For communication purposes, the host can be
considered simply as another node, though the communication between the instruction
processor nodes and the host will be slower if it uses a single globally shared
channel (for example an Ethernet channel). There are no global memory locations.
The local memory of each nodal processor can only be accessed by that processor
and the local memory addresses only refer to the specific local memory. Each local
memory may use the same addresses. Since each node is a self-contained computer,
message-passing multiprocessors are sometimes called message-passing multi-
computers.
The number of nodes could be as small as sixteen (or less), or as large as several
thousand (or more). However, the message-passing architecture gains its greatest
advantage over shared memory systems for large numbers of processors. For small
multiprocessor systems, the shared memory system probably has better performance
and greater flexibility. The number of physical links between nodes is usually
between four and eight. A principal advantage of the message-passing architecture
is that it is readily scalable and has low cost for large systems. It suits VLSI
construction, with one or more nodes fabricated on one chip, or a few chips,
depending upon the amount of local memory provided.
Figure 9.1 Message-passing multiprocessor architecture

Each node executes one or more processes. A process often consists of sequential
code, as would be found on a normal von Neumann computer. If there is more than
one process mapped onto one nodal processor, one process is executed at a time. A
process may be descheduled when it is waiting for messages to be sent or received,
and in the meantime another process started. Messages can be passed between
processes on one processor using internal channels. Messages between processes in
different processors are passed through external channels using physical com-
munication links between processors. We will use the term link to refer to a physical
communication path between a pair of processors. Channel refers to a named
communication path either between processes in one processor or between processes
on different processors.
Ideally, the process and the processor which will execute the process are regarded
as completely separate entities, even at this level. The application problem is
described as a set of communicating processes which is then mapped onto the
physical structure of processors. A knowledge of the physical structure and composi-
tion of the nodes is necessary to plan an efficient computation.
The size of a process is determined by the programmer and can be described by
its granularity:
1. Coarse granularity.
2. Medium granularity.
3. Fine granularity.
In coarse granularity, each process contains a large number of sequential instruc-
tions and takes a substantial time to execute. In fine granularity, a process might
consist of a few instructions, even one instruction; medium granularity describes the
middle ground. As the granularity is reduced, the process communication overhead
usually increases. It is particularly desirable to reduce the communication overhead
because of the significant time taken by a nodal communication. Message-passing
multiprocessors usually employ medium/coarse granularity; fine granularity is poss-
ible and is found in dataflow systems. (Dataflow is described in Chapter 10.) A fine
grain message-passing system has been developed by Athas and Seitz (1988) after
pioneering work by Seitz on medium grain message-passing designs, which will be
described later. For fine grain computing, the overhead of message passing can be
reduced by mapping several processes onto one node and switching from one
process to another when a process is held up by message passing. The process
granularity is sometimes related to the amount of memory provided at each node.
Medium granularity may require megabytes of local memory whereas fine granularity
may require tens of kilobytes of local memory. Fine grain systems may have a much
larger number of nodes than medium grain systems.
Process scheduling is usually reactive – processes are allowed to proceed until
halted by message communication. Then the process is descheduled and another
process is executed, i.e. processes are message-driven in their execution. Processes
do not commonly migrate from one node to another at run time; they will be
assigned to particular nodes statically before the program is executed. The
programmer makes the selection of nodes. A disadvantage of static assignment is
that the proper load sharing, in which work is fairly distributed among available
processors, may be unclear before the programs are executed. Consideration has to
be given to spreading code/data across available local memory given limited local
memory.
Each node in a message-passing system typically has a copy of an operating
system kernel held in read-only memory. This will schedule processes within a node
and perform the message-passing operations at run time. The message-passing
routing operations should have hardware support, and should preferably be done
completely in hardware. Hardware support for scheduling operations is also desirable.
The whole system would normally be controlled by a host computer system.
However, there are disadvantages to message-passing multiprocessors. Code and
data have to be physically transferred to the local memory of each node prior to
execution, and this action can constitute a significant overhead. Similarly, results
need to be transferred from nodes to the host system. Clearly the computation to be
performed needs to be reasonably long to lessen the loading overhead. Similarly, the
application program should be computationally intensive, not input/output or message-
passing intensive. Code cannot be shared. If processes are to execute the same code,
which often happens, the code has to be replicated in each node and sufficient local
memory has to be provided for this purpose. Data words are difficult to share; the
data would need to be passed to all requesting nodes, which would give problems of
incoherence. Message-passing architectures are generally less flexible than shared
memory architectures. For example, shared memory multiprocessors could emulate
message passing by using shared locations to hold messages, whereas message-
passing multiprocessors are very inefficient in emulating shared memory multi-
processor operations. Both shared memory and message-passing architectures could
in theory perform single instruction stream-multiple data stream (SIMD) computing,
though the message-passing architecture would be least suitable and would normally
be limited to multiple instruction stream-multiple data stream (MIMD) computing.
9.1.2 Communication paths
Regular static direct link networks, which give local or nearest neighbor connections
(as described in Section 8.6, page 283), are generally used for large message-passing
systems, rather than indirect dynamic multistage networks. Some small dedicated or
embedded applications might use direct links to certain nodes chosen to suit the
message transfers of the application. Routing a message to a destination not directly
connected requires the message to be routed through intermediate nodes.
A network which has received particular attention for message-passing multi-
processors is the direct binary hypercube, described in Section 8.6.3 (page 286). The
direct binary hypercube network has good interconnection patterns suitable for a
wide range of applications, and expands reasonably well. The interconnection
pattern for binary hypercubes is defined by giving each node a binary address. Each
node connects to those nodes whose binary addresses differ by one bit only. Hence
each node in an n-dimensional hypercube requires n links to other nodes. A
six-dimensional hypercube is shown in Figure 9.2 laid out in one plane. Hyper-
cube connections could be made in one backplane, as shown in Figure 9.3 for a
three-dimensional hypercube.

Figure 9.2 Six-dimensional hypercube laid out in one plane
Figure 9.3 Three-dimensional hypercube (a) Interconnection pattern (b) Laid out in one plane (c) Connections along a backplane

Nearest neighbor two-dimensional mesh networks are
also candidates for message-passing systems, especially large systems.
The nodal links are bidirectional. The links could transfer the information one bit
at a time (bit-serial) or several bits at a time. Complete words could be transmitted
simultaneously. However, bit-serial lines are often used, especially in large systems,
to reduce the number of lines in each link. For coarse grain computations, message
passing should be infrequent and the bit-serial transmission may have sufficient
bandwidth. The network latency, the time to complete a message transfer, has two
components; first there is a path set-up time, which is proportional to the number of
nodes in the path; second is the actual transmission time, which is proportional to
the size of the message for a fixed link bandwidth. The link bandwidth should be
about the same as memory bandwidth; a greater bandwidth cannot be utilized by the
node. Since the message data can be more than one word, the links require DMA
(direct memory access) capabilities.
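As a rough illustration of these two components, the following C fragment computes the latency for assumed figures (the set-up time per node and the link bandwidth below are illustrative values, not taken from any particular machine):

#include <stdio.h>

/* Illustrative latency model: set-up time grows with the number of nodes
   in the path, transmission time with the message size. */
double network_latency_us(int nodes_in_path, int message_bits,
                          double setup_per_node_us, double bits_per_us)
{
    return nodes_in_path * setup_per_node_us + message_bits / bits_per_us;
}

int main(void)
{
    /* Assumed figures: a 5-node path, a 1024-bit message, 2 us set-up per
       node and a 2 Mbit/s (2 bits/us) bit-serial link. */
    printf("latency = %.0f us\n", network_latency_us(5, 1024, 2.0, 2.0));
    return 0;   /* prints: latency = 522 us */
}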
Each process is given an identification number (process ID) which is used in the
message-passing scheme. Message passing can use a similar format to computer
network message passing. For example, messages consist of a header and the data;
Figure 9.4 shows the format of a message. Because more than one process might be
mapped onto a node, the process ID has two parts, one part identifying the node and
one part the process within the node. The nodal part (physical address) of the ID is
used to route the message to the destination node. The message type enables
different messages along the same link to be identified.
Figure 9.4 Message format
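In C, a message with the fields of Figure 9.4 might be declared roughly as follows (a sketch only; the field widths and names are assumptions rather than the format of any particular machine):

/* Sketch of the Figure 9.4 message format; field sizes are assumptions. */
struct process_id {
    unsigned short node;       /* nodal (physical) part, used for routing */
    unsigned short process;    /* process within the node                 */
};

struct message {
    struct process_id destination;  /* destination process ID             */
    struct process_id sender;       /* sender process ID                  */
    unsigned short    type;         /* distinguishes messages on a link   */
    unsigned short    length;       /* number of data bytes that follow   */
    unsigned char     data[];       /* message data                       */
};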
Messages may need to pass through intermediate nodes to reach their destination.
Queues inside the nodes are used to hold pending messages not yet accepted.
However, the messages may be blocked from proceeding by previous messages not
being accepted, and would then become queued, until the queues become full and
eventually the blockage would extend back to the source process. The order in
which messages are sent to a particular process should normally be maintained, even
when messages are allowed to take different routes to the destination. Of course,
constraining the route to be the same for all messages between two processes
simplifies maintaining message order.
Messages can be routed in hypercube networks according to the following
algorithm, which minimizes the path distance. Suppose the current nodal address is
P = p_{n-1}p_{n-2} ... p_1p_0 and the destination address is D = d_{n-1}d_{n-2} ... d_1d_0. The
exclusive-OR function R = P ⊕ D is performed operating on pairs of bits, to obtain
R = r_{n-1}r_{n-2} ... r_1r_0.
Let r_i be the ith bit of R. The hypercube dimensions to use in the routing are
given by those values of i for which r_i = 1. At each node in the path, the exclusive-OR
function R = P ⊕ D is performed. One of the bits in R which is 1, say r_i, identifies the
ith dimension to select in passing the message forward, until none of the bits is 1,
and then the destination node has been found. The bits of R are usually scanned
from most significant bit to least significant bit until a 1 is found. For example,
suppose routing from node 5 (000101) to node 34 (100010) is sought in a six-
dimensional hypercube. The route taken would be node 5 (000101) to node 37
(100101) to node 33 (100001) to node 35 (100011) to node 34 (100010). This
hypercube routing algorithm is sometimes called the e-cube routing algorithm, or
left-to-right routing.
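The routing decision at each node is easily expressed in C; the sketch below (an illustration only, assuming a six-dimensional cube and unsigned node addresses) scans R from the most significant bit and reproduces the node 5 to node 34 route given above.

#include <stdio.h>

/* e-cube (left-to-right) routing: return the dimension to route on next,
   or -1 if the current node is the destination. */
int next_dimension(unsigned p, unsigned d, int dimensions)
{
    unsigned r = p ^ d;                   /* R = P exclusive-OR D          */
    for (int i = dimensions - 1; i >= 0; i--)
        if (r & (1u << i))                /* most significant 1 bit of R   */
            return i;
    return -1;                            /* destination reached           */
}

int main(void)
{
    unsigned p = 5, d = 34;               /* node 5 (000101) to node 34 (100010) */
    while (p != d) {
        p ^= 1u << next_dimension(p, d, 6);   /* move along that dimension */
        printf("-> node %u\n", p);            /* prints 37, 33, 35, 34     */
    }
    return 0;
}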
Deadlock is a potential problem. Deadlock occurs when messages cannot be
forwarded to the next node because the message buffers are filled with messages
waiting to be forwarded and these messages are blocked by other messages waiting
to be forwarded. Dally and Seitz (1987) developed the following solution to
deadlock.
The interconnections of processing nodes can be shown by a directed graph,
called an interconnection graph, depicting the communication paths. A channel
dependency graph is a directed graph showing the route taken by a message for a
particular routing function. In the channel dependency graph, the channels are
depicted by the vertices of the graph and the connections of channels are depicted
by the edges. A network is deadlock-free if there are no cycles in the channel
dependency graph. Given a set of nodes n_0, n_1, ..., n_{k-1} and corresponding channels
c_0, c_1, ..., c_{k-1}, there are no cycles if messages are routed in decreasing order
(subscript) of channel. Dally and Seitz introduced the concept of virtual channels.
Each channel, c_i, is split into two channels, a low channel, c_{0i}, and a high channel,
c_{1i}. For example, with four channels, c_0, c_1, c_2 and c_3, we have the low virtual
channels c_{00}, c_{01}, c_{02} and c_{03}, and the high channels c_{10}, c_{11}, c_{12} and c_{13}. If a
message is routed on high channels from a node numbered less than the destination
node and to low channels from a node numbered greater than the destination node,
there are no cycles and hence no deadlock.
Routing messages according to a decreasing order of dimension in a hypercube
(left-to-right routing) is naturally deadlock-free as it satisfies the conditions without
virtual channels.
9.2 Programming
9.2.1 Message-passing constructs and routines
Message-passing multiprocessor systems can be programmed in conventional sequen-
tial programming languages such as FORTRAN, PASCAL, or C, augmented with
mechanisms for passing messages between processes. In this case, message-passing
is usually implemented using external procedure calls or routines, though statement
extensions could be made to the language. Alternatively, special programming
languages can be developed which enable the message passing to be expressed.
Message-passing programming is not limited to message-passing architectures or
even multiprocessor systems; it is done on single processor systems, for example
between UNIX processes, and many high level languages for concurrent programming
have forms of message passing (see Gehani and McGettrick (1988) for examples).
Message-passing language constructs
Programming with specially developed languages with message-passing facilities is
usually at a much higher level than using standard sequential languages with
message-passing routines. The source and destination processes may only need to be
identified. For example, the construct:
SEND expression_list TO destination_identifier
causes a message containing the values in expression_list to be sent to the
destination specified. The construct:
RECEIVE variable_list FROM source_identifier
causes a message to be received from the specified source and the values of the
message assigned to the variables in variable_list. Sources and destinations can
be given direct names. We might, for example, have three processes – keyboard,
process1 and display – communicating via messages:
PROGRAM Comprocess

PROCESS keyboard
VAR key_value, ret_code: INTEGER;
REPEAT
    BEGIN
        read keyboard information
        SEND key_value TO process1;
    END
UNTIL key_value = ret_code
END

PROCESS process1
VAR key_value, ret_code, disp_value: INTEGER;
REPEAT
    BEGIN
        RECEIVE key_value FROM keyboard;
        compute disp_value from key_value
        SEND disp_value TO display;
    END
UNTIL key_value = ret_code
END

PROCESS display
VAR ret_code, disp_value: INTEGER;
REPEAT
    BEGIN
        RECEIVE disp_value FROM process1;
        display disp_value
    END
UNTIL disp_value = ret_code
END
It is also possible to have statements causing message-passing operations to occur
under specific conditions, for example the statement:
WHEN Boolean_expression RECEIVE variable_list FROM
source identifier
or alternatively, the “guarded” command:
IF Boolean_expression RECEIVE variable_list FROM
source_identifier
which will accept a message only when/if the Boolean expression is TRUE.
Sequential programming languages with message-passing routines
The send and receive message-passing routines attached to standard sequential pro-
gramming languages may be more laborious in specification and would only
implement the basic message-passing operations. For example, message-passing
send and receive routines with the format:

send(channel_ID, type, buffer, buffer_length, node, process_ID)
recv(channel_ID, type, buffer, buffer_length, message_byte_count, node, process_ID)

might be provided for FORTRAN programming. Such routines are usually found on
prototype and early message-passing multiprocessor systems and need further routines
to handle the message memory.
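Expressed as C prototypes rather than FORTRAN calls, such routines and a simple blocking exchange might look roughly as follows (a sketch only; the names and parameter lists follow the format above and are not any particular vendor's interface):

/* Hypothetical low-level routines modelled on the format given above. */
void send(int channel_ID, int type, void *buffer, int buffer_length,
          int node, int process_ID);
void recv(int channel_ID, int type, void *buffer, int buffer_length,
          int *message_byte_count, int node, int process_ID);

void blocking_example(void)
{
    int value = 42, reply, count;

    /* Send one integer to process 1 on node 3, then wait for a reply. */
    send(0, 1, &value, sizeof value, 3, 1);
    recv(0, 2, &reply, sizeof reply, &count, 3, 1);
}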
9.2.2 Synchronization and process structure
Message-passing send/receive routines can be divided into two types:
1. Synchronous or blocking.
2. Asynchronous or non-blocking.
Synchronous or blocking routines do not allow the process to proceed until the
operation has been completed. Asynchronous or non-blocking routines allow the
process to proceed even though the operation may not have been completed, i.e.
statements after a routine are executed even though the routine may need further
time to complete.
A blocking send routine will wait until the complete message has been transmitted
and accepted by the receiving process. A blocking receive routine will wait until the
message it is expecting is received. A pair of processes, one with a blocking send
operation and one with a matching blocking receive operation, will be synchronized
with neither the source process nor the destination process being able to proceed
until the message has been passed from the source process to the destination
process. Hence, blocking routines intrinsically perform two actions; they transfer
data and they synchronize processes. The term rendezvous is used to describe the
meeting and synchronization of two processes through blocking send/receive opera-
tions.
A non-blocking message-passing send routine allows a process to continue
immediately after the message has been constructed without waiting for it to be
accepted or even received. A non-blocking receive routine will not wait for the
message and will allow the process to proceed. This is not a common requirement as
the process cannot usually do any more computation until the required message has
been received. It could be used to test for blocking and to schedule another process
while waiting for a message. The non-blocking routines generally decrease the
process execution time. Both blocking and non-blocking variants may be available
for programmer choice in systems that use routines to perform the message passing.
Non-blocking message passing implies that the routines have buffers to hold
messages. In practice, buffers can only be of finite length and a point could be
reached when a non-blocking routine is blocked because all the buffer space has
been exhausted. Memory space needs to be allocated and deallocated for the
messages, and routines should be provided for this purpose; the send routine might
automatically deallocate memory space. For low level message passing, it is neces-
sary to provide an additional primitive routine to check whether a message buffer
space is reavailable.
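The following fragment sketches that usage; send_nb and buffer_free are hypothetical names for a non-blocking send and the buffer-reavailability check described above, not routines of any particular system.

/* Hypothetical non-blocking send and a primitive that reports when the
   message buffer may safely be reused. */
int send_nb(void *buffer, int length, int node, int process_ID);
int buffer_free(int request_ID);      /* non-zero when the buffer is free */
void do_other_work(void);

void producer(double *results, int n, int node, int process_ID)
{
    int request = send_nb(results, n * (int)sizeof(double), node, process_ID);

    /* Continue computing while the message is in transit ... */
    while (!buffer_free(request))
        do_other_work();

    /* ... only now is it safe to overwrite the results buffer. */
}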
Process structure
The basic programming technique for the system is to divide the problem into
concurrent communicating processes. We can identify two possible methods of
generating processes, namely:
1. Static process structure.
2. Dynamic process structure.
In the static process structure, the processes are specified before the program is
executed, and the system will execute a fixed number of processes. The programmer
usually explicitly identifies the processes. It might be possible for a compiler to
assist in the creation of concurrent message-passing processes, but this seems to be
an open research problem. In the dynamic process structure, processes can be
created during the execution of the program using process creation constructs;
processes can also be destroyed. Process creation and destruction might be done
conditionally. The number of processes may vary during execution.
Process structure is independent of the message-passing types, and hence we have
the following potential combinations in a language or system:
1. Synchronous communication with static process structure.
2. Synchronous communication with dynamic process structure.
3. Asynchronous communication with static process structure.
4. Asynchronous communication with dynamic process structure.
Language examples include Ada (having synchronous communication with static
process structure), CSP (having asynchronous communication with static process
structure) and MP (having synchronous communication with dynamic process struc-
ture) (Liskov, Herlihy and Gilbert, 1988). Asynchronous communication with
dynamic process structure is used in message-passing systems using procedure call
additions to standard sequential programming languages (e.g. Intel iPSC, see
Section 9.3.2). The combination is not found together in specially designed lan-
guages, though it would give all possible features. Liskov, Herlihy and Gilbert
suggest that either synchronous communication or static process structure should be
abandoned but suggest that it is reasonable to retain one of them in a language. The
particular advantage of asynchronous communication is that processes need not be
delayed by messages, and static process structure may then be sufficient. Dynamic
process structure can reduce the effects of delays incurred with synchronous com-
munication by giving the facility to create a new process while a communication
delay is in progress. The combination, synchronous communication with dynamic
process structure, seems a good choice.
Program example
Suppose the integral of a function f(x) is required. The integration can be performed
numerically by dividing the area under the curve f(x) into very small sections which
are approximated to rectangles (or trapeziums). Then the area of each section is
computed and added together to obtain the total area. One obvious parallel solution
is to use one process for each area or group of areas, as shown in Figure 9.5. A
single process is shown accepting the results generated by the other processes.
Figure 9.5 Integration using message-passing processes
Let the basic blocking message-passing primitives in the system be send(message,
destination process) and receive(message, source process). With the
integral processes numbered from 0 to n-1 and the accumulation process numbered
n, we have two basic programs, one for the processes performing the integrals and
one for the process performing the accumulation, i.e.:
Integral process j:

PROGRAM Integral
VAR area, n: INTEGER;
    compute jth area
    send(area, n)
END

Accumulation process:

PROGRAM Accumulate
VAR area, i, n, acc: INTEGER;
    FOR i = 0 TO n-1
    BEGIN
        receive(area, i);
        acc := acc + area
    END
    WRITE('Integral is', acc)
END
Variables are local and need to be declared in each process. The same names in
different processes refer to different objects. Note that processes are referenced
directly by number. The integral process requires information to compute the areas,
namely the function, the interval size and number of intervals to be computed in
each process. This information is passed to the integral processes perhaps via an
initiation process prior to the integral processes starting their computation.
The accumulation process could also perform one integration while waiting for
the results to be generated. A single program could be written for all processes
using conditional statements to select the actions a particular process should take,
and this program copied to all processes. This would be particularly advantageous if
there is a global host-node broadcast mode in which all nodes can receive the same
communication simultaneously. In this case, we have:
Composite process:

PROGRAM Comprocess
VAR mynode, area, i, n, acc: INTEGER;
    read input parameters
    identify nodal address, mynode
    IF mynode = n THEN
    BEGIN
        compute nth area
        FOR i = 0 TO n-1
        BEGIN
            receive(area, i);
            acc := acc + area
        END
        WRITE('Integral is', acc)
    END
    ELSE
    BEGIN
        compute jth area
        send(area, n)
    END
END
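For comparison, the composite process could be written in C using blocking primitives of the kind assumed earlier (a sketch only; send, receive, mynode and the area computation are placeholders rather than a real library):

/* Sketch of the composite integration process in C; send(), receive(),
   mynode() and area_of() are assumed primitives, not a real library. */
extern void   send(double *value, int destination_process);
extern void   receive(double *value, int source_process);
extern int    mynode(void);             /* this node's logical number     */
extern double area_of(int process);     /* area assigned to one process   */

#define N 16                            /* number of integral processes   */

void comprocess(void)
{
    double area, acc = 0.0;
    int me = mynode();

    if (me == N) {                      /* accumulation process           */
        for (int i = 0; i < N; i++) {
            receive(&area, i);
            acc += area;
        }
        /* write "Integral is", acc */
    } else {                            /* integral process me            */
        area = area_of(me);
        send(&area, N);
    }
}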
Various enhancements can be made to improve the performance. For example, since
the last accumulation is in fact a series of steps, it could be divided into groups of
accumulations which are performed on separate processors. The number of areas
computed by each process defines the process granularity and would be chosen to
gain the greatest throughput taking into account the individual integration time and
the communication time. In some cases, reducing the number of nodes involved has
been found to decrease the computation time (see Pase and Larrabee, 1988).
Host-node communication is usually much slower than node-node communication.
If separate transactions need to be performed for each node loaded (i.e. there is no
broadcast mode) the time to load the nodal program could be decreased by arranging
the program to be sent to the first node, which then passes a copy on to the next node
and so on. The most effective method to reduce the communication time is to
arrange for each node to transmit its information according to a minimal spanning
tree. Results could be collected in a pipeline or tree fashion with the results passed
from one node to the next. Each node adds its contribution before passing the
accumulation onwards. Pipeline structures are useful, especially if the computation
is to be repeated several times, perhaps with different initial values.
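On a hypercube, one such spanning tree is obtained by broadcasting dimension by dimension: in step i, every node that already holds the code forwards it across link i. The sketch below assumes a send_to_node primitive and the node numbering of Section 9.1.2; it reaches all 2^d nodes in d communication steps rather than one host transaction per node.

/* Dimension-by-dimension broadcast from node 0 of a d-dimensional hypercube.
   send_to_node() is an assumed primitive that forwards the code/data. */
extern void send_to_node(int destination_node, const void *data, int length);

void hypercube_broadcast(int my_node, int d, const void *data, int length)
{
    /* After step i, all nodes whose addresses fit in the low i+1 bits hold
       the data; each such node forwards across the next dimension in turn. */
    for (int i = 0; i < d; i++)
        if (my_node < (1 << i))
            send_to_node(my_node | (1 << i), data, length);
}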
9.3 Message-passing system examples
9.3.1 Cosmic Cube
The Cosmic Cube is a research vehicle designed and constructed at Caltech
(California Institute of Technology) under the direction of Seitz during the period
1981-5 (Seitz, 1985; Athas and Seitz, 1988) and is credited with being the first
working hypercube multiprocessor system (Hayes, 1988) though the potential of
hypercubes had been known for many years prior to its development. The Cosmic
Cube significantly influenced subsequent commercial hypercube systems, notably
the Intel iPSC hypercube system. Sixty-four-node and smaller Cosmic Cube systems
have been constructed. The Intel 8086 processor is used as the nodal instruction
processor with an Intel 8087 floating point coprocessor. Each node has 128 Kbytes
of dynamic RAM, chosen as a balance between increasing the memory or increasing
the number of nodes within given cost constraints. The memory has parity checking,
but not error correction. (A parity error was reported on the system once every
several days!) Each node has 8 Kbytes of read-only memory to store the initialization
and bootstrap loader programs. The kernel in each node occupies 9 Kbytes of code
and 4 Kbytes of tables. The interconnection links operate in asynchronous full-
duplex mode at a relatively slow rate of 2 Mbits/sec. The basic packet size is sixty-
four bits with queues at each node. Transmission is started with send and receive
calls. These calls can be non-blocking, i.e. the calls return after the request is put in
place. The request becomes “pending” until it can be completed. Hence a program
can continue even though the message request may not have been completed.
The nodal kernel, called the Reactive Kernel, RK, has been divided into an inner
kernel (written in assembly language) and an outer kernel. The inner kernel performs
the send and receive message handling and queues messages. Local communication
between processes in one node and between non-local processes is treated in a
similar fashion, though of course local communication is through memory buffers
and is much faster. The inner kernel also schedules processes in a node using a
round robin selection. Each process executes for a fixed time period or until it is
delayed by a system call. The outer kernel contains a set of processes for com-
munication between user processes using messages. These outer kernel processes
include processes to create, copy and stop processes.
The host run-time system, called the Cosmic Environment, CE, has routines to
establish the set of processes for a computation and other routines for managing the
whole system. The processes of a computation are called the process group. The
system can be used by more than one user but is not time-shared; each user can
specify the size of a hypercube required using a CE routine and will be allocated a
part of the whole system not used by other users – this method has been called
space-shared. In a similar manner to a virtual memory system, users reference
logical nodal addresses, which have corresponding physical nodal addresses. The
logical nodal addresses for a requested n-cube could be numbered from 0 to 2^n - 1.
Dynamic process structure with reactive process scheduling is employed. Pro-
gramming is done in the C language, with support routines provided for both
message passing and for process creation/destruction. The dynamic process creation
function – spawn(parameters) – creates a process consisting of a compiled
program in a node and process, all specified as function parameters. Specifying the
node/process as function parameters rather than letting the operating system make
this choice enables predefined structures to be built up and allows changes to be
made while the program is being executed. The send routine is xsend(parameters)
where the parameters specify the node/process and a pointer to a message block. The
xsend routine deallocates message space. Other functions available include block-
ing receive message, xrecvb, returning a pointer to the message block, allocating
message memory space, xmalloc, and freeing message space, xfree. Later develop-
ment of the system incorporated higher level message-passing mechanisms and fine
grain programming. Statements such as:

IF i = 10 THEN SEND(i+1) TO self ELSE EXIT FI

can be found in programs in the programming environment Cantor (see Athas and
Seitz (1988) for further details).
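A fragment using these routines might look roughly as follows; the C prototypes shown are assumptions based on the descriptions above, not the documented Cosmic Environment interface.

/* Assumed prototypes for the routines described above (illustration only). */
extern void *xmalloc(int length);                 /* allocate message space */
extern void  xsend(void *msg, int node, int pid); /* send; frees the space  */
extern void *xrecvb(void);                        /* blocking receive       */
extern void  xfree(void *msg);                    /* free received message  */

void exchange(int node, int pid)
{
    int *out = xmalloc(sizeof *out);
    *out = 42;
    xsend(out, node, pid);      /* message space is deallocated by xsend    */

    int *in = xrecvb();         /* wait for the next incoming message       */
    /* ... use *in ... */
    xfree(in);                  /* return the message space                 */
}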
Seitz introduced wormhole routing (Dally and Seitz, 1987) as an alternative to
normal store-and-forward routing used in distributed computer systems. In store-
and-forward routing, a packet is stored in a node and transmitted as a whole to the
next node when a free path can be established. In wormhole routing, only the head
of the packet is initially transmitted from the input to the output channel of a node.
Subsequent parts of the packet are transmitted when the path is available. The term
flit (flow control bits) has been coined to describe the smallest unit that can be
accepted or blocked in the transmission path. It is necessary to ensure that the flits
are received in the same order that they are transmitted and hence channels need to
be reserved for the flits until the packet has been transmitted. Other packets cannot
be interleaved with the flits along the same channels.
9.3.2 Intel iPSC system
The Intel Personal Supercomputer (iPSC) is a commercial hypercube system
developed after the Cosmic Cube. The iPSC/1 system uses Intel 80286 processors
with 80287 floating point coprocessors. The architecture of each node is shown in
Figure 9.6. Each node consists of a single board computer, having two buses, a
processor bus and an input/output bus. The PROM (programmable read-only memory)
has 64 Kbytes and the dual port memory has 512 Kbytes. The nodes are controlled
by a host computer system called a cube manager. The cube manager has 2-4
Mbytes of main memory, Winchester and floppy disk memory, and operates under
the XENIX operating system. As with the Cosmic Cube, each node has a small
operating system (called NX). Eight communication channels are provided at each
node, seven for links to other nodes in the hypercube and one used as a global
Ethernet channel for communication with the cube manager. Typical systems have
thirty-two nodes using five internode communication links. Internode communication
takes between 1 and 2.2 ms for messages between 0 and 1024 bytes. Cube manager
to node communication takes 18 ms for a 1 Kbyte message (Pase and Larrabee,
1988). The iPSC/2, an upgrade to the iPSC/1, uses Intel 80386 processors and
hardware wormhole routing. Additional vector features can be provided at each
node or selected nodes on one separate board per node.
FORTRAN message-passing routines are provided, including send and recv,
having the format given previously. Sendw and recvw are blocking versions of send
and recv. If non-blocking message passing is done, the routine status can be used to
Figure 9.6 Intel iPSC node
ascertain whether a message buffer area is reavailable for use. Messages sent and
received from the host use the commands sendmsg and recmsg and operate with
blocked messages without type selection.
9.4 Transputer
In this section, we will present the details of the transputer, the first single chip
computer designed for message-passing multiprocessor systems. A special high
level programming language called occam has been developed as an integral part of
the transputer development. Occam has a static process structure and synchronous
communication, and is presented in Section 9.5.
9.4.1 Philosophy
The transputer is a VLSI processor produced by Inmos (Inmos, 1986) in 16- and 32-
bit versions with high speed internal memory and serial interfaces. The device has
a RISC type of instruction set (Section 5.2, page 151) though programming in
machine instructions is not expected, as occam should be used.
Each transputer is provided with a processor, internal memory and originally four
high-speed DMA channels which enable it to connect to other transputers directly
using synchronous send/receive types of commands. A link consists of two serial
lines for bidirectional transfers. Data is transmitted as a single item or as a vector.
When one serial line is used for a data package, the other is used for an acknowledge-
ment package, which is generated as soon as a data package reaches the destination.
Various arrays of transputers can be constructed easily. Four links allow for a
two-dimensional array with each transputer connecting to its four nearest neighbors.
Other configurations are possible. For example, transputers can be formed into
groups and linked to other groups. Two transputers could be interconnected and
provide six free links, as shown in Figure 9.7(a). Similarly, a group of three
transputers could be fully interconnected and have six links free for connecting to
other groups, as shown in Figure 9.7(b). A group of four transputers could be fully
interconnected and have four links to other groups, as shown in Figure 9.7(c). A
group of five transputers, each having four links, could be fully interconnected but
with no free links to other systems.
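In general, with four links per transputer, a fully interconnected group of k transputers (k ≤ 5) uses k - 1 links within each member, leaving k(4 - (k - 1)) = k(5 - k) links free for connection to other groups: 6, 6, 4 and 0 free links for k = 2, 3, 4 and 5 respectively, matching Figure 9.7.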
A key feature of the transputer is the programming language, occam, which was
designed specifically for the transputer. The name occam comes from the fourteenth
century philosopher, William of Occam, who presented the concept of Occam's Razor:
“Entia non sunt multiplicanda praeter necessitatem”, i.e. “Entities should not be
multiplied beyond necessity” (May and Taylor, 1984). The language has been
designed for simplicity and provides the necessary primitive operations for point-to-
point data transfers and to specify explicit parallelism. The central concept in an
Figure 9.7 Groups of transputers fully interconnected (a) Two transputers (b) Three transputers (c) Four transputers
occam program is the process consisting of one or more program statements, which
can be executed in sequence or in parallel. Processes can be executed concurrently
and one or more processes are allocated to each transputer in the system. There is
hardware support for sharing one transputer among more than one process. The
statements of one process are executed until a termination statement is reached or a
point-to-point data transfer is held up by another process. Then, the process is
descheduled and another process started automatically.
9.4.2 Processor architecture
The internal architecture of the processor is shown in Figure 9.8 and has the
following subparts:
Processor.
Link interfaces.
Internal RAM.
Memory interface for external memory.
Event interface.
System services logic.
The first transputer product, the T212, announced in 1983, contained a 16-bit
integer arithmetic processor. Subsequent products included a 32-bit integer arith-
metic processor part (T414, announced in 1985) and a floating point version (the
T800, announced in 1988). The floating point version has an internal floating point
arithmetic processor attached to the integer processor and the data bus, such that
both processors can operate simultaneously. Though the processor itself is a RISC
type, it is microprogrammed internally and instructions take one or more processor