Building An Optimizing Compiler
ISBN 1-55558-179-X
Pub Date: 12/01/97
Contents

PREFACE
  PHILOSOPHY FOR CHOOSING COMPILER TECHNIQUES
  HOW TO USE THIS BOOK
CHAPTER 1 OVERVIEW
  1.1 WHAT IS AN OPTIMIZING COMPILER?
  1.2 A BIASED HISTORY OF OPTIMIZING COMPILERS
  1.3 WHAT HAVE WE GAINED WITH ALL OF THIS TECHNOLOGY?
  1.4 RULES OF THE COMPILER BACK-END GAME
  1.5 BENCHMARKS AND DESIGNING A COMPILER
  1.6 OUTLINE OF THIS BOOK
  1.7 USING THIS BOOK AS A TEXTBOOK
  1.8 REFERENCES
CHAPTER 2 COMPILER STRUCTURE
  2.1 OUTLINE OF THE COMPILER STRUCTURE
    2.1.1 Source Program Representation
    2.1.2 Order of Transformations
  2.2 COMPILER FRONT END
  2.3 BUILDING THE FLOW GRAPH
  2.4 DOMINATOR OPTIMIZATIONS
  2.5 INTERPROCEDURAL ANALYSIS
  2.6 DEPENDENCE OPTIMIZATION
  2.7 GLOBAL OPTIMIZATION
  2.8 LIMITING RESOURCES
  2.9 INSTRUCTION SCHEDULING
  2.10 REGISTER ALLOCATION
  2.11 RESCHEDULING
  2.12 FORMING THE OBJECT MODULE
  2.13 REFERENCES
CHAPTER 3 GRAPHS
  3.1 DIRECTED GRAPHS
  3.2 DEPTH-FIRST SEARCH
    Table 3.1 Classification of Graph Edges
  3.3 DOMINATOR RELATION
  3.4 POSTDOMINATORS
  3.5 DOMINANCE FRONTIER
    Table 3.2 Dominance Frontiers
  3.6 CONTROL DEPENDENCE
    Table 3.3 Control Dependences for the Example Program
  3.7 LOOPS AND THE LOOP TREE
    3.7.1 Infinite Loops
    3.7.2 Single- and Multiple-Entry Loops
    3.7.3 Computing the Loop Tree
  6.7.1 An Approximate Algorithm that Takes Less Time and Space
  6.8 INCLUDING THE EFFECTS OF LOCAL EXPRESSIONS
  6.9 SMALL AMOUNT OF FLOW-SENSITIVE INFORMATION BY OPTIMIZATION
    6.9.1 Handling the Pointer from a Heap Allocation Operation
  6.10 REFERENCES
CHAPTER 7 STATIC SINGLE ASSIGNMENT
  7.1 CREATING STATIC SINGLE ASSIGNMENT FORM
  7.2 RENAMING THE TEMPORARIES
  7.3 TRANSLATING FROM SSA TO NORMAL FORM
    7.3.1 General Algorithm for Translating from SSA to Normal Form
  7.4 REFERENCES
CHAPTER 8 DOMINATOR-BASED OPTIMIZATION
  8.1 ADDING OPTIMIZATIONS TO THE RENAMING PROCESS
  8.2 STORING INFORMATION AS WELL AS OPTIMIZATIONS
  8.3 CONSTANT PROPAGATION
    8.3.1 Representing Arithmetic
    8.3.2 Simulating the Arithmetic
    8.3.3 Simulating Conditional Branches
    8.3.4 Simulating the Flow Graph
    8.3.5 Other Uses of the Constant Propagation Algorithm
  8.4 COMPUTING LOOP-INVARIANT TEMPORARIES
  8.5 COMPUTING INDUCTION VARIABLES
  8.6 RESHAPING EXPRESSIONS
  8.7 STRENGTH REDUCTION
  8.8 REFORMING THE EXPRESSIONS OF THE FLOW GRAPH
  8.9 DEAD-CODE ELIMINATION
  8.10 GLOBAL VALUE NUMBERING
  8.11 REFERENCES
CHAPTER 9 ADVANCED TECHNIQUES
  9.1 INTERPROCEDURAL ANALYSIS
    9.1.1 The Call Graph
    9.1.2 Simple Interprocedural Analysis Information
    9.1.3 Computing Interprocedural Alias Information
  9.2 INLINING PROCEDURES
  9.3 CLONING PROCEDURES
  9.4 SIMPLE PROCEDURE-LEVEL OPTIMIZATION
  9.5 DEPENDENCE ANALYSIS
  9.6 DEPENDENCE-BASED TRANSFORMATIONS
  9.7 LOOP UNROLLING
  9.8 REFERENCES
CHAPTER 10 GLOBAL OPTIMIZATION
  10.1 MAIN STRUCTURE OF THE OPTIMIZATION PHASE
  10.2 THEORY AND ALGORITHMS
Preface
Building compilers has been a challenging activity since the advent of digital
computers in the late 1940s and early 1950s. At that time, implementing the concept of
automatic translation from a form familiar to mathematicians into computer instructions
was a difficult task. One needed to figure out how to translate arithmetic expressions into
instructions, how to store data in memory, and how to choose instructions to build
procedures and functions. During the late 1950s and 1960s these processes were
automated to the extent that simple compilers could be written by most computer science
professionals. In fact, the concept of small languages with corresponding translators is
fundamental in the UNIX community.
From the beginning, there was a need for translators that generated efficient code:
The translator must use the computer productively. Originally this constraint was due to
computers' small memories and slow speed of execution. With each generation of
hardware, new architectural ideas have been added. At each stage the compilers have
also needed to be improved to use these new machines more effectively. Curiously,
pundits keep predicting that less efficient and less expensive translators will do the job.
They argue that as machines keep getting faster and memory keeps expanding, one no
longer needs an optimizing compiler. Unfortunately, people who buy bigger and faster
machines want to use the proportionate increase in size and speed to handle bigger or
more complex problems, so we still have the need for optimizing compilers. In fact, we
have an increased need for these compilers because the performance of the newer
architectures is sensitive to the quality of the generated code. Small changes in the order
and choice of the instructions can have much larger effects on machine performance than
similar choices made with the complex instruction set computing (CISC) machines of the
1970s and 1980s.
The interplay between computer architecture and compiler performance has been
legitimized with the development of reduced instruction set computing (RISC)
architectures. Compilers and computer architecture have a mutually dependent
relationship that shares the effort to build fast applications. To this end, hardware has
been simplified by exposing some of the details of hardware operation, such as simple
load-store instruction sets and instruction scheduling. The compiler is required to deal with
these newly exposed details and provide faster execution than possible on CISC
processors.
This book describes one design for the optimization and code-generation phases of
such a compiler. Many compiler books are available for describing the analysis of
programming languages. They emphasize the processes of lexical analysis, parsing, and
semantic analysis. Several books are also available for describing compilation processes
for vector and parallel processors. This book describes the compilation of efficient
programs for a single superscalar RISC processor, including the ordering and structure of
algorithms and efficient data structures.
The book is presented as a high-level design document. There are two reasons for
this. First, I attempted to write a book that presented all possible alternatives so that the
reader could make his or her own choices of methods to use. This was too bulky, as the
projected size of the volume was several thousand pages, much too large for practical
purposes. There are a large number of different algorithms and structures in an optimizing
compiler. The choices are interconnected, so an encyclopedic approach to optimizing
compilers would not address some of the most difficult problems.
Second, I want to encourage this form of design for large software processes. The
government uses a three-level documentation system for describing software projects:
The A-level documents are overview documents that describe a project as a whole and list
its individual pieces. B-level documents describe the operation of each component in
sufficient detail that the reader can understand what each component does and how it
does it, whereas the C-level documents are low-level descriptions of each detail.
As a developer I found this structure burdensome because it degenerated into a
bureaucratic device involving large amounts of paper and little content. However, the
basic idea is sound. This book will describe the optimization and code-generation
components of a compiler in sufficient detail that the reader can implement these
components if he or she sees fit. Since I will be describing one method for each of the
components, the interaction between components can be examined in detail so that all of
the design and implementation issues are clear.
Each chapter will include a section describing other possible implementation
techniques. This section will include bibliographic information so that the interested reader
can find these other techniques.
In the course of writing this book, my view of it has evolved. It started out as a
recording of already known information. I have designed and built several compilers using
this existing technology. As the book progressed, I have learned much about integrating
these algorithms. What started out as a concatenation of independent ideas has thus
become melded into a more integrated whole. What began as a simple description of
engineering choices now contains some newer ideas. This is probably the course of any
intellectual effort; however, I have found it refreshing and encouraging.
Chapter 1 Overview
What is an optimizing compiler? Why do we need them? Where do they come
from? These questions are discussed in this chapter, along with how to use the book.
Before presenting a detailed design in the body of the book, this introductory chapter
provides an informal history of optimizing compiler development and gives a running
example that motivates the technology in the compiler and is used throughout the rest of
the book.
elements and memory references. It must choose the instructions well to use as few
instructions as possible while obtaining this balance. Of course, all of this is impossible,
but the compiler must do as well as it can.
Fortunately, standards groups are becoming more aware of the needs of compiler
writers when describing the language standards. Each major language standard now
describes in some way the limits of compiler optimization. Sometimes this is done by
leaving certain aspects of the language as undefined or implementation defined. Such
phrases mean that the compiler may do whatever it wishes when it encounters that aspect
of the language. However, be cautious: the user community frequently has expectations of
what the compiler will do in those cases, and a compiler had better honor those
expectations.
What does the compiler do when it encounters a portion of the source program that
uses language facilities in a way that the compiler does not expect? It must make a
conservative choice to implement that facility, even at the expense of runtime performance
for the program. Even when conservative choices are being made, the compiler may be
clever. It might, for example, compile the same section of code in two different ways and
generate code to check which version of the code is safe to use.
Should the compiler writer add special code for dealing with these anomalous
benchmarks? Yes and no. One has to add the special code in a competitive world, since
the competition is adding it. However, one must realize that one has not really built a
better compiler unless there is a larger class of programs that finds the feature useful. One
should always look at a benchmark as a source of general comments about programming.
Use the benchmark to find general improvements. In summary, the basis for the design of
optimizing compilers is as follows:
1. Investigate the important programs in the application areas of interest. Choose
compilation techniques that work well for these programs. Choose kernels as
benchmarks.
2. Investigate the source languages to be compiled. Identify their weaknesses from a
code quality point of view. Add optimizations to compensate for these weaknesses.
3. Make sure that the compiler does well on the standard benchmarks, and do so in a
way that generalizes to other programs.
Figure 1.2 is a version of the classic matrix multiply algorithm. It involves a large
amount of floating point computation together with an unbalanced use of the memory
system. As written, the inner loop consists of two floating point operations together with
three load operations and one store operation. The problem will be to get good
performance from the machine when more memory operations are occurring than
computations.
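Figure 1.2 itself is not reproduced in this extract, so the following C sketch of the
classic loop nest is offered as a stand-in (the function name and types are assumptions).
Its inner loop shows exactly the imbalance just described: one multiply and one add
against three loads and one store.

    /* A stand-in for Figure 1.2: classic matrix multiply.  The inner loop
       performs two floating point operations (multiply, add) but three loads
       (C[i][j], A[i][k], B[k][j]) and one store (C[i][j]). */
    void matmul(int n, double A[n][n], double B[n][n], double C[n][n])
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < n; k++)
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
    }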
Figure 1.3 computes the length of the longest monotone subsequence of the vector
A. The process uses dynamic programming. The array C(I) keeps track of the longest
monotone sequence that starts at position I. It computes the next element by looking at all
of the previously computed subsequences that can have A(I) added to the front of the
sequence computed so far. This example has few floating point operations. However, it
does have a number of load and store operations together with a significant amount of
conditional branching.
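Figure 1.3 is likewise not reproduced; the following sketch shows one way to write the
dynamic program described above (the array names and the choice of a nondecreasing
order are assumptions).

    /* A stand-in for Figure 1.3: length of the longest monotone subsequence
       of a[0..n-1], by dynamic programming.  c[i] holds the length of the
       longest monotone subsequence starting at position i.  Few floating
       point operations, but many loads, stores, and conditional branches. */
    int longest_monotone(int n, const double a[])
    {
        int c[n], best = 0;
        for (int i = n - 1; i >= 0; i--) {
            c[i] = 1;                          /* a[i] by itself */
            for (int j = i + 1; j < n; j++)
                if (a[j] >= a[i] && c[j] + 1 > c[i])
                    c[i] = c[j] + 1;           /* prepend a[i] to the sequence at j */
            if (c[i] > best)
                best = c[i];
        }
        return best;
    }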
Figure 1.4 is a binary search algorithm written as a recursive procedure. The
student may feel free to translate this into a procedure using pointers on a binary tree. The
challenge here is to optimize the use of memory and time associated with procedure calls.
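Figure 1.4 is not reproduced either; a minimal recursive binary search in the same
spirit might read as follows. The cost to be optimized lies in the procedure calls
themselves.

    /* A stand-in for Figure 1.4: recursive binary search over a sorted array.
       Returns the index of key, or -1 if absent. */
    int search(const int a[], int lo, int hi, int key)
    {
        if (lo > hi)
            return -1;                         /* range exhausted */
        int mid = lo + (hi - lo) / 2;
        if (a[mid] == key)
            return mid;
        if (a[mid] < key)
            return search(a, mid + 1, hi, key);
        return search(a, lo, mid - 1, key);
    }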
I recommend that the major grade in the course be associated with a project that
prototypes a number of the optimization algorithms. The implementation should be viewed
as a prototype so that it can be implemented quickly. It need not handle the complex
memory management problems existing in real optimizing compilers.
1.8 References
Auslander, M., and M. Hopkins. 1982. An overview of the PL.8 compiler. Proceedings of the
ACM SIGPLAN '82 Conference on Programming Language Design and Implementation, Boston, MA.
Backus, J. W., et al. 1957. The Fortran automatic coding system. Proceedings of AFIPS
1957 Western Joint Computing Conference (WJCC), 188-198.
Chaitin, G. J. 1982. Register allocation and spilling via graph coloring. Proceedings of
the SIGPLAN 82 Symposium on Compiler Construction, Boston, MA. Published as SIGPLAN
Notices 17(6): 98-105.
Chaitin, G. J., et al. 1981. Register allocation via coloring. Computer Languages 6(1):
47-57.
Cooper, K., and K. Kennedy. 1988. Interprocedural side-effect analysis in linear time.
Proceedings of the SIGPLAN 88 Symposium on Programming Language Design and
Implementation, Atlanta, GA. Published as SIGPLAN Notices 23(7).
Cytron, R., et al. 1989. An efficient method of computing static single assignment form.
Conference Record of the 16th ACM SIGACT/SIGPLAN Symposium on Principles of Programming Languages,
Austin, TX. 25-35.
Gross, T. 1983. Code optimization of pipeline constraints. (Stanford Technical Report CS
83-255.) Stanford University.
Hendren, L. J., G. R. Gao, E. Altman, and C. Mukerji. 1993. A register allocation
framework based on hierarchical cyclic interval graphs. (Technical report.) McGill
University.
Karr, M. 1975. P-graphs. (Report CA-7501-1511.) Wakefield, MA: Massachusetts Computer
Associates.
Kildall, G. A. 1973. A unified approach to global program optimization. Conference
Proceedings of Principles of Programming Languages I, 194-206.
Leverett, B. W., et al. 1979. An overview of the Production-Quality Compiler-Compiler
project. (Technical Report CMU-CS-79-105.) Pittsburgh, PA: Carnegie Mellon University.
Morel, E., and C. Renvoise. 1979. Global optimization by suppression of partial
redundancies. Communications of the ACM 22(2): 96-103.
Wulf, W., et al. 1975. The design of an optimizing compiler. New York: American Elsevier.
in the future the first team changes some part of its implementation. Suddenly the second
component will no longer work. Even worse problems can occur if the second team has
the first team save some information on the side to help their component. Now the
interface is no longer the intermediate representation of the program but the intermediate
representation plus this other (possibly undocumented) data. The only way to avoid this
problem is to require the interfaces to be documented and simple.
Optimizing compilers are complex. After years of development and maintenance, a
large fraction of a support team's effort will go to fixing the problems. Little further
development can be done because there is no time. This situation happens because most
compilers can only be tested as a whole. A test program will be compiled and some phase
will have an error (or the program compiles and runs incorrectly). Where is the problem? It
is probably not at the point in the compiler where you observe the problem. A pithy phrase
developed at COMPASS was "Expletive runs downhill." (The actual expletive was used, of
course.) This means that the problem occurs somewhere early in the compiler and goes
unnoticed until some later phase, typically the register allocation, or object module
formation. Several things can be done to avoid this problem:
Subroutines must be available to test the validity of the intermediate
representation. These routines can be invoked by compile-time switches to check which
phases create an inappropriate representation.
Assertions within the phases must be used frequently to check that situations
that are required to be true are in fact true. This is often done in production compilers.
A test and regression suite must be created for each phase. These tests involve
special versions of the IR representing a program that has been compiled up to the point of
this phase. This IR is input to the phase, and the output is then simulated to see if the
resulting program runs correctly.
Given these requirements, how is the program stored? The choice is based on
experience and then ratified by the manual simulations discussed earlier. In this compiler,
each procedure will be stored internally in a form similar to assembly language for a
generic RISC processor.
Experience with the COMPASS Compiler Engine taught that the concept of a value
computed by an operator must be general. The value may be a vector, scalar, or structural
value. Early in the compilation process, the concept of value must be kept as close to the
form in the source program as possible so that the program can be analyzed without
losing information.
These observations are almost contradictory. We need to be able to manipulate the
smallest pieces of the program while still being able to recover the overall structure
present in the source program. This contradiction led to the idea of the gradual lowering of
the intermediate representation. At first, LOAD instructions have a complete set of
subscript expressions. Later these specialized load instructions are replaced by machine-level load instructions.
What does an assembly language program look like? There is one machine
instruction per line. Each instruction contains an operation code, indicating the operation
to be performed; a set of operands; and a set of targets to hold the results. The following
gives the exact form for the intermediate representation, except that the representation is
encoded:
1. The instruction, encoded as a record that is kept in a linked list of instructions.
2. An operation code describing the action performed. This is represented as a built-in
enumeration of all operations.
3. A set of constant operands. Some instructions may involve constant operands.
These are less prone to optimization and so are inserted directly in the instruction.
The compiler initially will not use many constant operands because doing so
decreases the chances for optimization. Later, many constants will be stored in the
instructions rather than using registers.
4. A list of registers representing the inputs to the instruction. For most instructions
there is a fixed number of inputs, so they can be represented by a small array.
Initially, there is an assumption of an infinite supply of registers called temporaries.
5. A target register that is the output of the instruction.
The assembly program also has program labels that represent the places to which
the program can branch. To represent this concept, the intermediate representation is
divided into blocks representing straight-line sequences of instructions. If one instruction in
a block is executed, then all instructions are executed. Each block starts with a label (or is
preceded by a conditional branching instruction) and ends with a branching instruction.
Redundant branches are added to the program to guarantee that there is a branch under
every possible condition at the end of the block. In other words, there is no fall-through
into the next block.
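Rendered as C declarations, the records just described might look like the following
sketch. The field and type names are illustrative assumptions, not the book's own
definitions.

    /* An illustrative rendering of the instruction and block records. */
    typedef enum { IADD, IMUL, LOAD, STORE, BRANCH, COPY /* ... */ } Opcode;

    typedef struct Instruction {
        Opcode              op;          /* action performed */
        long                consts[2];   /* constant operands, if any */
        int                 inputs[3];   /* temporaries read: a small fixed array */
        int                 ninputs;
        int                 target;      /* temporary written by the instruction */
        struct Instruction *next;        /* linked list within the block */
    } Instruction;

    typedef struct Block {
        int          label;              /* each block begins at a label */
        Instruction *first;
        Instruction *last;               /* always a branch: no fall-through */
    } Block;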
The number of operation codes is large. There is a distinct operation code for each
instruction in the target machine. Initially these are not used; however, the lowering
process will translate the set of machine-independent operation codes into the target
machine codes as the compilation progresses. There is no need to list all of the operation
codes here. Instead the subset of instructions that are used in the examples is listed in
Figure 2.2.
Now the source program is modeled as a directed graph, with the nodes being the
blocks. There is a directed edge between two blocks if there is a possible branch from the
first block to the second. A unique node called Entry represents the entry point for the
source program. The entry node has no predecessors in the graph. Similarly, a unique
node called Exit represents the exit point for the source program, and that node has no
successors. In Figure 2.3 the entry node is node B0, and the exit node is node B5.
The execution of the source program is modeled by a path through the graph. The
path starts at the entry node and terminates at the exit node. The computations within
each node in the path are executed in order of the occurrence of nodes on the path. In
fact, the computations within the node are used to determine the next node in the path. In
Figure 2.3, one possible path is B0, B1, B2, B4, B1, B3, B4, B5. This execution path
means that all computations in B0 are executed, then all computations in B1, then B2, and
so on. Note that the computations in B1 and B4 are executed twice.
graph is sufficiently simple that a straightforward abstract syntax tree walk generating
instructions on the fly is sufficient to build the IR. While building the flow graph some initial
optimizations can be performed on instructions within each block.
The Dominator Optimization phase performs the initial global optimizations. It
identifies situations where values are constants, where two computations are known to
have the same value, and where instructions have no effect on the results of the program.
It identifies and eliminates most redundant computations. At the same time it reapplies the
optimizations that have already occurred within a single block. It does not move
instructions from one point of the flow graph to another.
Figure 2.4 Compiler Structure
The Interprocedural Optimization phase analyzes the procedure calls within this
flow graph and the flow graphs of all of the other procedures within the whole program. It
determines which variables might be modified by each procedure call, which variables and
expressions might be referencing the same memory location, and which parameters are
known to be constants. It stores this information for other phases to use.
The Dependence Optimization phase attempts to optimize the time taken to
perform load and store operations. It does this by analyzing array and pointer expressions
to see if the flow graph can be transformed to one in which fewer load/stores occur or in
which the load and store operations that occur are more likely to be in one of the cache
memories for the RISC chip. To do this it might interchange or unroll loops.
The Global Optimization phase lowers the flow graph, eliminating the symbolic
references to array expressions and replacing them with linear address expressions.
While doing so, it reforms the address expressions so that the operands are ordered in a
way that ensures that the parts of the expressions that are dependent on the inner loops
are separated from the operands that do not depend on the inner loop. Then it performs a
complete list of global optimizations, including code motion, strength reduction, and dead-code elimination.
After global optimization, the exact set of instructions in the flow graph has been
found. Now the compiler must allocate registers and reorder the instructions to improve
performance. Before this can be done, the flow graph is transformed by the Limiting
Resources phase to make these later phases easier. The Limiting Resources phase
modifies the flow graph to reduce the number of registers needed to match the set of
physical registers available. If the compiler knows that it needs many more registers than
are available, it will save some temporaries in memory. It will also eliminate useless
copies of temporaries.
Next, an initial attempt to schedule the instructions is performed. Register allocation
and instruction scheduling conflict, so the compiler schedules first. It counts on
the effects of the Limiting Resources phase to ensure that the register allocation can be
performed without further copying of values to memory. The instruction scheduler reorders
the instructions in several blocks simultaneously to decrease the time that the most
frequently executed blocks require for execution.
After instruction scheduling, the Register Allocation phase replaces temporaries by
physical registers. This is a three-step process in which temporaries computed in one
block and used in another are assigned first, then temporaries within a block that can
share a register with one already assigned, and finally the temporaries assigned and used
in a single block. This division counts on the work of the Limiting Resources phase to
decrease the likelihood that one assignment will interfere with a later assignment.
It is hoped that the Register Allocation phase will not need to insert store and load
operations to copy temporaries into memory. If such copies do occur, then the Instruction
Scheduling phase is repeated. In this case, the scheduler will only reschedule the blocks
that have had the instructions inserted.
Finally, the IR is in the form in which it represents an assembly language
procedure. The object module is now written in the form needed by the linker. This is a
difficult task because the documentation of the form of object modules is notoriously
inaccurate. The major work lies in discovering the true form. After that it is a clerical (but
large) task to create the object module.
Thus the assign node takes two expressions as operands: one representing the address of
the location for getting the result and the other representing the value of the right side of
the assignment. The fetch node translates between addresses and values. It takes one
argument, which is the address of a location. The result of the fetch node is the value
stored in that location.
Figure 2.5 Abstract Syntax Tree for MAXCOL
Note that this tree structure represents the complete structure of the program,
indicating which parts of the subroutine are contained in other parts.
A straightforward translation will result in the flow graph shown in Figure 2.6. It is
shown here to describe the process of translation. It is not actually generated, since
certain optimizations will be performed during the translation process. Note that the
temporaries are used in two distinct ways. Some temporaries, such as T5, are used just
like local variables, holding values that are modified as the program is executed. Other
temporaries, such as T7, are pure functions of their arguments. In the case of T7, it
always holds the constant 1. For these temporaries the same temporary is always used
for the result of the same operation. Thus any load of the constant 1 will always be into
T7. The translation process must guarantee that an operand is evaluated before it is used.
To guarantee that the same temporary is used wherever an expression is
computed, a separate table called the formal temporary table is maintained. It is indexed
by the operator and the temporaries of the operands and constants involved in the
instruction. The result of a lookup in this table is the name of the temporary for holding the
result of the operation. The formal temporary table for the example routine is shown in
Figure 2.7. Some entries that will be added later are listed here for future reference.
What is the first thing that we observe about the lengthy list of instructions in Figure 2.6?
Consider block B1. The constant 1 is loaded six times and the expression I - 1 is
evaluated three times. A number of simplifications can be performed as the flow graph is
created:
If there are two instances of the same computation without operations that modify
the operands between the two instances, then the second one is redundant and can be
eliminated since it will always compute the same value as the first.
If there are two instances of a computation X * Y and the first one occurs on all
paths leading from the Entry block to the second computation, then the second one can
be eliminated. This is a special case of the general elimination of redundant expressions,
which will be performed later. This simple case accounts for the largest number of
redundant expressions, so much of the work will be done here before the general
technique is applied.
Copy propagation or value propagation is performed. If X is a copy of Z, then
uses of X can be replaced by uses of Z as long as neither X nor Z changes between the
point at which the copy is made and the point of use. This transformation is useful for
improving the program flow graph generated by the compiler front end. There are many
compiler-generated temporaries such as loop counters or components of array dope
information that are really copy operations.
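A small sketch can make the formal temporary table concrete. The hashing scheme and
names below are assumptions; the point is that looking up (operator, operands) either
returns the one temporary that already holds the value, making a second computation
recognizably redundant, or assigns a fresh temporary, and that a copy operation simply
returns its source, which is copy propagation.

    #include <stdlib.h>

    #define TABLE_SIZE 1024
    #define OP_COPY    0

    typedef struct Entry {
        int op, left, right;             /* key: operator and operand temporaries */
        int result;                      /* the formal temporary for this value */
        struct Entry *next;
    } Entry;

    static Entry *table[TABLE_SIZE];
    static int next_temp = 1;

    /* Return the formal temporary for (op, left, right), creating it on first
       use.  This applies to pure temporaries such as T7, which are functions
       of their operands and are never modified. */
    int formal_temp(int op, int left, int right)
    {
        if (op == OP_COPY)
            return left;                 /* copy propagation: use the source */
        unsigned h = (unsigned)(op * 31 + left * 17 + right) % TABLE_SIZE;
        for (Entry *e = table[h]; e != NULL; e = e->next)
            if (e->op == op && e->left == left && e->right == right)
                return e->result;        /* redundant: reuse the earlier result */
        Entry *e = malloc(sizeof *e);
        e->op = op; e->left = left; e->right = right;
        e->result = next_temp++;         /* first occurrence: new temporary */
        e->next = table[h];
        table[h] = e;
        return e->result;
    }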
Figure 2.8 Flow Graph after Simplifications
patterns will be stored. When the actual shape does not fit one of the usual reference
patterns, a conservative choice will be made to expand the shape to one of the chosen
forms.
Consider the Fortran fragment in Figure 2.11 for copying array A into B twice. In
Fortran, the elements of a column are stored in sequential locations in memory. The
hardware will reference a particular element. The whole cache line for the element will be
read into the cache (typically 32 to 128 bytes), but the next element referenced will not
come from that cache line; instead it is the next element in the row, which may
be very far away in memory. By the time the inner loop is completed and the next iteration
of the outer loop is executing, the current elements in the cache will likely have been
removed.
The dependence-based optimizations will transform Figure 2.11 into the right-hand
column. The same computations are performed, but the elements are referenced in a
different order. Now the next element from A is the next element in the column, thus using
the cache effectively.
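Figure 2.11 is not reproduced in this extract. The following C analog shows the
interchange; since C stores arrays row by row (the mirror image of Fortran's column
order), the subscript roles are swapped relative to the Fortran fragment, but the cache
argument is the same.

    /* Before interchange: the inner loop steps down a column of a row-major
       array, so consecutive iterations touch elements n doubles apart and a
       fresh cache line almost every time. */
    void copy_before(int n, double B[n][n], const double A[n][n])
    {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                B[i][j] = A[i][j];       /* stride-n accesses */
    }

    /* After interchange: the inner loop walks consecutive elements, so each
       cache line is used completely before it is evicted. */
    void copy_after(int n, double B[n][n], const double A[n][n])
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                B[i][j] = A[i][j];       /* unit-stride accesses */
    }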
The phase will also unroll loops to improve later instruction scheduling, as shown in
Figure 2.12. The left column is the original loop; the right column is the unrolled loop. In
the original loop, the succeeding phases of the compiler would generate instructions that
would require that each store to B be executed before each subsequent load from A. With
the loop unrolled, the loads from A may be interwoven with the store operations, hiding the
time it takes to reference memory. Another optimization called software pipelining is
performed later, which increases the amount of interweaving even more.
Figure 2.12 Original (left) and Unrolled (right) Loop
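With the figure unavailable, here is a hedged sketch of the transformation it
illustrates, unrolling by four and assuming for brevity that the trip count is a multiple
of four (a real compiler would emit a cleanup loop for the remainder).

    /* Unrolled copy loop: the four loads from A are independent of the four
       stores to B, so the scheduler can interweave them and hide memory
       latency. */
    void copy_unrolled(int n, double B[], const double A[])
    {
        for (int i = 0; i < n; i += 4) {
            B[i]     = A[i];
            B[i + 1] = A[i + 1];
            B[i + 2] = A[i + 2];
            B[i + 3] = A[i + 3];
        }
    }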
This book will not address the concepts of parallelization and vectorization,
although those ideas are directly related to the work here. These concepts are covered in
books by Wolfe (1996) and Allen and Kennedy.
Lowering: The instructions are lowered so that each operation in the flow graph
represents a single instruction in the target machine. Complex instructions, such as
subscripted array references, are replaced by the equivalent sequence of
elementary machine instructions. Alternatively, multiple instructions may be folded
into a single instruction when constants, rather than temporaries holding the
constant value, can occur in instructions.
Reshaping: Before the global optimization techniques are applied, the program is
transformed to take into account the looping structure of the program. Consider the
expression I * J * K occurring inside a loop, with I being the index for the innermost
loop, J the index for the next outer loop, and K invariant in both. The normal
associativity of the programming language would evaluate this as (I * J) * K, when it
would be preferable to compute it as I * (J * K), because the computation of J * K is
invariant inside the innermost loop and so can be moved out of the loop (a source-level
sketch follows this list). At the same time we perform strength reduction and local
redundant expression elimination, and apply algebraic identities.
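At the source level, the reshaping of I * J * K might look like the following sketch
(the function and loop bounds are invented for illustration).

    /* Reassociation for reshaping: (i * j) * k is rewritten as i * (j * k),
       and the invariant product j * k is then computed once per outer
       iteration instead of once per inner iteration. */
    void reshape_demo(int ni, int nj, int k, long out[])
    {
        for (int j = 0; j < nj; j++) {
            long jk = (long)j * k;           /* invariant in the i loop: hoisted */
            for (int i = 0; i < ni; i++)
                out[j * ni + i] = i * jk;    /* was (i * j) * k */
        }
    }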
To have a place in which to put the code to initialize T28, we insert an empty block
between blocks B1 and B2. For mnemonic purposes we will call the block B12, standing
for the block between B1 and B2. The compiler puts two computations into the loop (if
they are not already available):
1. The expression to initialize the strength-reduced variable, in this case T28. This
involves copying all of the expressions involved in computing T28 and inserting
them into block B12.
2. The expression for the increment to the strength reduction expression. In this case,
it is the constant 8, which is already available.
While inserting these expressions into B12, the compiler will perform redundant
expression elimination, constant propagation, and constant folding. In this case, the
compiler knows that J has value 2 on entry to the loop, so that constant value will be
substituted for J, that is, for T6.
The code in Figure 2.13 represents the program after strength reduction has been
applied to the inner loop. T28 no longer represents a pure expression: It is now a
compiler-created local variable. This does not change how the compiler handles the load
and store operations involving T28. Since it is taking on the same values that it did when it
was a pure expression, the side effects of the load and store instructions are the same.
Figure 2.13 Strength-Reduced Inner Loop
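Since Figure 2.13 is not shown here, the following generic before/after pair illustrates
the shape of the transformation: the multiply by 8 hidden in the address expression
becomes a pointer that is initialized once (the role of block B12) and incremented by 8
each iteration, just as T28 is in the text.

    /* Before strength reduction: every iteration computes base + 8*j. */
    double sum_before(const double a[], int lo, int hi)
    {
        double s = 0.0;
        for (int j = lo; j < hi; j++)
            s += a[j];                       /* address = base + 8*j */
        return s;
    }

    /* After strength reduction: the address is a compiler-created variable,
       initialized before the loop and bumped by 8 bytes per step. */
    double sum_after(const double a[], int lo, int hi)
    {
        double s = 0.0;
        const double *p = a + lo;            /* initialization, as in block B12 */
        for (int j = lo; j < hi; j++)
            s += *p++;                       /* increment replaces the multiply */
        return s;
    }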
In this rough simulation of the compiler, we see that the compiler needs to perform
some level of redundant expression elimination, constant propagation, and folding before
strength reduction. We can get that information by performing strength reduction (and
expression reshaping) as a part of the dominator-based optimizations discussed earlier.
As a working hypothesis, assume that strength reduction for a single-entry loop is
performed after the dominator-based transformations for the loop entry and all of its
children in the dominator tree. If we perform strength reduction for a loop at that point, we
gain three advantages. First, strength reduction will be applied to inner loops before being
applied to outer loops. Second, the loop body will have been already simplified by the
dominator-based algorithms. And third, the information concerning available expressions
and constants is still available for a block inserted before the entry to the loop.
For the sake of description, the computations in block B3 that are no longer used
have been eliminated. In reality they are eliminated later by the dead-code elimination
phase. This order makes the implementation of strength reduction easier because the
compiler need not worry about whether a computation being eliminated is used someplace
else.
Now consider the contents of block B12. We know that the value of J, or T6, is 2.
So the compiler applies value numbering, constant propagation, and constant folding to
this block. One other optimization is needed to obtain good code. The compiler multiplies
by 8 after it has performed all additions. Applying the distribution of integer
multiplication over addition results in better code, since 8 can be added to an already
existing value, giving the code in Figure 2.14.
We now perform strength reduction on the outer loop. There are three candidates
for strength reduction: address(A(1,I)), or T33; address(VALUE(I)), or T17; and
address(LARGE(I)), or T13. Again we insert a block B01 between blocks B0 and B1 to
hold the initialization values for the loop B1, [B2, B6, B3], B4. The three pointers will be
initialized in block B01 and incremented in block B4.
One of the values of this simulation process is to observe situations that you would not
have imagined when designing the compiler. There are two such situations with strength
reduction:
The load of the constant 4 into T11 now happens too early. All uses of it have been
eliminated, except for updating the pointer at the end of the loop. In this case that is not a
problem because the constant will be folded into an immediate field of an instruction later.
More complex expressions may be computed much earlier than needed. There is no easy
solution to this problem.
The computation of the constant 8 in block B01 makes the computation in block B1
redundant. Later code-motion algorithms had better identify these cases and eliminate the
redundant expressions.
After strength reduction on both loops, the compiler has the flow graph in Figure
2.15.
This is a good point to review. The compiler has created the flow graph, simplified
expressions, eliminated most redundant expressions, applied strength reduction, and
performed expression reshaping. Except for some specialized code insertions for strength
reduction, no expressions have been moved. Code motion will move code out of loops.
The techniques proposed here for code motion are based on a technique called
elimination of partial redundancies devised by Etienne Morel (Morel and Renvoise, 1979).
Abstractly, this technique attempts to insert copies of an expression on some paths
through the flow graph to increase the number of redundant expressions. One example of
where it works is with loops. Elimination of partial redundancies will insert copies of loop
invariant expressions before the loop, making the original copies in the loop redundant.
Surprisingly, this technique works without knowledge of loops. We combine three other
techniques with code motion:
1. A form of strength reduction is included in code motion. The technique is
inexpensive to implement and has the advantage that it will apply strength
reduction in situations where there are no loops.
2. Load motion is combined with code motion. Moving load operations can be handled
as a code motion problem by pretending that any store operation is actually a store
operation followed by the corresponding load operation. So a store operation can
be viewed as having the same effect on the availability of an expression as a load
operation. As will be seen in this example, this will increase the number of load
operations that can be moved.
3. Store operations can also be moved by looking at the flow graph backward and
applying the same algorithms to the reverse graph that we apply for expressions to
the normal flow graph. We only look at the reverse graph for store operations.
In this particular example, code motion only removes the redundant loads of the
constants 4 and 8. The load of VALUE(I) is moved out of the inner loop. It is not a
loop-invariant expression since there is a store into VALUE(I) in the loop. However,
the observation that a store may be viewed as a store followed by a load into the
same register means that there is a load of VALUE(I) on each path to the use of
VALUE(I), making the load within the loop redundant. This gives the code in Figure
2.16.
Now, we can move the store operations forward using partial redundancy on the
reverse program flow graph, as shown in Figure 2.17. The stores into VALUE(I) and
LARGE(I) occurring in the loop can be moved to block B4. Although we think of this as a
motion out of the loop, the analysis has nothing to do with the loop. It depends on the
occurrence of these store operations on each path to B4 and the repetitive stores that do
occur in the loop. Together with dead-code elimination this gives us the final result of the
optimization phases.
Register allocation: The temporaries used for values in the program flow graph
must be replaced by the use of physical registers.
number of physical registers needed to hold values. On the other hand, if one allocates
the temporaries to physical registers before instruction scheduling, then the amount of
instruction reordering is limited. This is known as a phase-ordering problem. There is no
natural order for performing instruction scheduling and register allocation.
The LIMIT phase performs the first of these three tasks and prepares the code for
instruction scheduling and register allocation. It attempts to resolve this problem by
performing parts of the register allocation problem before instruction scheduling, then
allowing instruction scheduling to occur. Register allocation then follows, plus a possible
second round of instruction scheduling if the register allocator generated any instructions
itself (spill code).
Before preparing for instruction scheduling and register allocation, the compiler
lowers the program representation to the most efficient set of instructions. This is the last
of the code-lowering phases.
We begin by modifying the flow graph so that each operation corresponds to an
operation in the target machine. Since the instruction description was chosen to be close
to a RISC processor, most instructions already correspond to target machine instructions.
This step is usually called code generation; however, our view of code generation is more
diffuse. We began code generation when we built the flow graph, we progressed further
into code generation with each lowering of the flow graph, and we complete it now by
guaranteeing the correspondence between instructions in the flow graph and instructions
in the target machine.
To illustrate this code lowering, we assume that the target machine contains
instructions with small-constant immediate operands. For example, the addition of small
constants can be performed with an immediate operand. Or load and store operations can
take a constant as an additive part of the address computation. The target machine also
has instructions for adding a multiple of 4 or 8 times one register, adding another register,
and putting the result in a target register. In other words, we consider a target processor
such as the Alpha processor. While performing code lowering, the compiler will also
perform the following operations:
Replacing instructions in the flow graph by equivalent target machine instructions. If
the instruction in the flow graph is a target machine instruction, then the compiler leaves it
as it is.
Removing register-to-register copy operations. The compiler no longer honors the
convention that a particular expression is computed in a fixed symbolic register. Now all
effort is made to eliminate register-to-register copies.
In the process of code lowering, some blocks will become empty. The compiler
deletes them.
The important instructions for the Alpha processor that simplify this particular
example are as follows:
The S4ADDQ instruction computes 4 times one register plus another, simplifying
address arithmetic on integer arrays.
The S8ADDQ instruction computes 8 times one register plus another, simplifying
address arithmetic on double-precision arrays.
The CPYS instruction, which takes two operands, creates a floating point value
from the sign of one operand and the absolute value of another. It can be used to compute
the absolute value.
The use of these instructions may make other computations unnecessary, such as
an instruction that loads a constant, or the multiplication or shift operation (and its target
register). These unnecessary computations must be eliminated also. This can be
performed partially during the other optimizations or by the execution of the dead-code
elimination algorithm.
The compiler also orders the blocks so that the destination pairs in conditional
branches can be replaced with fall-through values; however, we do not eliminate the extra
part of the branches because register allocation may need to insert blocks and such
elimination would change the order of blocks. The code in Figure 2.18 shows the results of
code lowering. At this point the restriction that the same expression always be computed
in the same register is discarded since this would add unnecessary instructions. Hence
the loop variables are incremented by a single iADD instruction. Note that an S8ADDQ
instruction is used to increment the pointer referencing the A array in the inner loop.
At the same time that the code is being lowered, the LIMIT phase is preparing for
instruction scheduling and register allocation by performing the following transformations.
Rename: There are many situations in which the same temporary is used in two
independent parts of the procedure. This can happen through the source program
using the same automatic variable for two purposes, or through transformations
performed by earlier phases of the compiler. One of the sets of uses is now
renamed to reference a new temporary. By using independent names, register
allocation is more effective. Rename is illustrated in Figure 2.19. In the code on the
left, the same index variable is used for two loops. After renaming, two different
index variables are used, as seen in the code on the right.
Pressure: The register pressure at a point p in the program flow graph is the
number of registers needed at p to hold the values that are computed before
p and used after p. The maximum register pressure is an estimate of the
minimum number of registers needed for allocating registers for the
procedure. It is not a precise lower estimate because more registers may be
needed due to the interactions of multiple paths through the procedure.
However, if the register pressure is higher than the number of available
registers, then some temporaries will be stored in memory for part of the
procedure. This is called register spilling.
Spilling: LIMIT will consider each point where the register pressure exceeds
the number of physical registers. It will consider each enclosing loop
containing that point and find a temporary that is not used in the loop but
which holds a value to be used later (in other words, it is holding a value
passing through the loop). It takes the temporary that has that property on
the outermost loop, stores it in memory before the loop, and reloads it after
the loop (where necessary). This decreases the register pressure by 1
everywhere within the loop. If no loop contains a temporary of this form, a
temporary that holds a value but is unused in the block will be chosen. If no
such temporary exists, a temporary used or defined within the block will be
chosen. This whole process will be repeated until the register pressure has
been decreased below the number of available registers everywhere within
the procedure.
To compute the register pressure, the compiler needs to know for each point of the
flow graph the temporaries that hold a value used later, in other words, the set of
temporaries that are live at each point in the program. For illustrative purposes, the set of
points where each temporary is live is represented as a set of intervals using the numbers
we associated with each instruction in Figure 2.18. If a temporary is live at the beginning
of the first instruction of an interval, we will indicate that by using a closed bracket. If it
becomes live in the middle of an instruction, we will use an open parenthesis. Figure 2.20
indicates the range of instructions where each register is live.
This information can be used to compute the number of registers needed at each
point in the program, otherwise known as the register pressure. If the number of registers
needed exceeds the number of physical registers available, then not all temporaries will
be able to be assigned to registers. The registers that are live before and after each
instruction in the subroutine are shown in Figure 2.21. In this particular case the largest
register pressure occurs in the innermost loop. This is frequently true, but is not always
the case.
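As a sketch of the computation just described (with an invented bit-matrix
representation for the live sets, which come from a liveness analysis not shown),
register pressure at each instruction is simply the number of temporaries live there,
and its maximum estimates the registers needed. In practice this is run once per
register class, integer and floating point, as the text notes.

    #include <stdbool.h>

    #define MAX_TEMPS 256

    /* live[i][t] is true when temporary t is live immediately before
       instruction i. */
    int max_register_pressure(int ninsts, bool live[][MAX_TEMPS])
    {
        int max = 0;
        for (int i = 0; i < ninsts; i++) {
            int pressure = 0;
            for (int t = 0; t < MAX_TEMPS; t++)
                if (live[i][t])
                    pressure++;          /* one register needed to hold t here */
            if (pressure > max)
                max = pressure;          /* compare against physical registers */
        }
        return max;
    }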
Figure 2.20 Table of Live Ranges
One computes a separate register pressure for each register set: integer and
floating point. We have shown the register pressure for integer registers. The register
pressure for floating point registers is not shown in Figure 2.21 so as to make the table
more understandable; however, there are only three floating registers in the program, so
determining the register pressure is straightforward.
Now we compute the register pressure at the beginning of each statement. This is
a pair consisting of the number of integer and floating point symbolic or physical registers
that are live at the beginning of each instruction. Recall that the formal parameters are live
at the beginning of the program (if they are used anywhere in the program), so T1, T2, T3,
and T4 are live at the beginning of the subroutine.
As is frequently the case with small flow graphs, there is no register spilling
needed. The maximum register pressure is much lower than the number of registers.
However, let us pretend that the machine only has eight registers. The register pressure is
9 at the end of the inner loop, so we cannot fit the number of symbolic registers that are
live at that point into the available registers. The symbolic registers T1, T3, T4, T5, T6, T8,
T14, T24, and T28 are live at the point at which the pressure is 9; however, T1, T3, T4,
T5, and T8 are not referenced (defined or used) in the inner loop. Therefore one of them
can be spilled before the loop and reloaded after the loop. This will decrease the register
pressure by 1 throughout the loop. Ideally, we would choose the register that is referenced
in as few nested loops as possible. These temporaries are all referenced in the next loop,
however, so we will arbitrarily choose to store T5, which is the temporary representing I.
We use the stack (SP is a dedicated register) to spill registers to memory. Note that
the register pressure has peaked at one point, and that by spilling a register we have
decreased the register pressure at other points.
The insertion process takes two steps. First insert a store operation at the
beginning of the outermost loop where the temporary (T5) is not referenced, and insert
load operations at the exits from the loop if the temporary is live on exit. Second, optimize
the placement of the loads and stores by moving the loads as far as possible toward the
beginning of the program and the stores toward the end of the program. This gives us the
code in Figure 2.22.
Figure 2.22 Load and Store Operations for Spilling
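The two insertion steps might be sketched as follows, assuming the loop's preheader and exit blocks are known (loop.preheader, loop.exits, stack_slot, and the instruction constructors are hypothetical helpers, consistent with the use of the stack for spilling noted above):

    def insert_spill(temp, loop):
        # Step 1: store before the loop; reload at exits where still live.
        loop.preheader.append(store_instruction(temp, stack_slot(temp)))
        for exit_block in loop.exits:
            if temp in live_at_entry(exit_block):
                exit_block.prepend(load_instruction(temp, stack_slot(temp)))
        # Step 2 (elided): move the store as early and the loads as late
        # as possible, as described above.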
After the LIMIT phase, the compiler knows that the resources are available at each
point to perform the operations described in the program flow graph. The remaining
phases of the compiler will preserve this invariant whenever they perform a
transformation.
Scheduling limited to blocks does not use the multiple instruction-issue character of
RISC processors effectively. Blocks are usually small, and each instruction within them
depends on some other instructions in the block. Consider the problem of instruction
scheduling as filling in a matrix, with the number of columns being the number of
instructions that can be issued simultaneously and the number of rows being the number
of machine cycles it takes to execute the block. Block scheduling will fill in this matrix
sparsely: There will be many empty slots, indicating that the multiple-issue character of
the machine is not being used. This is particularly a problem for load, store, multiply,
divide, or floating point operations, which take many cycles to execute. RISC processors
usually implement other integer operations in one cycle. There are several techniques
incorporated in the compiler for ameliorating this problem:
Unroll: Earlier phases of the compiler have performed loop unrolling, which
increases the size of blocks, giving the block scheduler more chance to
schedule the instructions together.
Superblock: When there is a point in a loop where two paths join, it is difficult
to move instructions from after the join point to before it. When the
succeeding block in the loop is short, the compiler has earlier made a copy
of the block so that the joined path is replaced by two blocks, joined only at
the head of the loop. This transformation is applied at the same time that
loop unrolling is performed.
Move: The normal optimization techniques used for code motion attempt to
keep temporaries live for as short a sequence of instructions as is possible.
When scheduling, we will schedule each block separately. For blocks that
are executed frequently, we will repeat the code motion algorithm, but allow
the motion of instructions from one block to another even when there is no
decrease in the number of executions of the instruction.
Software pipeline: The instructions of a loop body are reordered so that the
operations of one iteration overlap those of the next, allowing each iteration to
start more quickly, thereby decreasing the execution time of the whole loop.
Blocks and loops that can be software pipelined are identified before other
scheduling occurs and are handled separately.
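Returning to the matrix picture of scheduling, the row-by-row filling can be sketched as a greedy list scheduler. This is a simplification: a real scheduler also respects issue restrictions such as the load/store rules described below. Here inst.predecessors are the dependence predecessors within the block, and latency maps opcodes to cycle counts:

    def schedule_block(instructions, issue_width, latency):
        # Greedy list scheduling: each packet is one row of the matrix,
        # with up to issue_width columns.
        issue_cycle, packets, cycle = {}, [], 0
        remaining = list(instructions)
        while remaining:
            packet = []
            for inst in list(remaining):
                ready = all(p in issue_cycle and
                            issue_cycle[p] + latency[p.opcode] <= cycle
                            for p in inst.predecessors)
                if ready and len(packet) < issue_width:
                    packet.append(inst)
                    issue_cycle[inst] = cycle
                    remaining.remove(inst)
            packets.append(packet)   # empty slots here are the sparseness
            cycle += 1
        return packets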
During instruction scheduling, some peephole optimization occurs. It can happen
during scheduling that instructions that were not adjacent have become adjacent, creating
situations such as a store followed by an immediate load from the same location. It is
therefore effective to apply some of the peephole optimizations again.
When instruction scheduling is completed, the order of instructions is fixed and
cannot be changed without executing the instruction scheduler again. In that case, it may
only be necessary to rerun the block scheduler.
We have shrunk the register requirements so the values in registers can fit in the
physical registers at each point in the flow graph. Now we will reorder the instructions to
satisfy the instruction-scheduling constraints of the target processor. We will assume a
processor such as the Alpha 21164, which can issue four instructions on each clock cycle.
Many of the integer instructions take one cycle to complete. Most floating point operations
take four cycles to complete. In any given cycle one can issue one or two load instructions
or a single store instruction. A store instruction cannot be issued in the same cycle as a
load instruction. We will assume that the other integer operations can be filled in as
necessary. Instructions such as integer multiply or floating point divide take a large
number of cycles.
The problem is to group the instructions into packets of one to four instructions such
that all the instructions in a packet can be issued simultaneously. The compiler also
reorders the instructions so that an operand is not used until enough cycles have passed
after the issue of the instruction that computes it, ensuring that the value is available.
The load and store operations take a variable amount of time, depending on the load on
the memory bus and whether the values are in caches. In the Alpha 21164, there are two
caches on the processor chip, and most systems have a further large cache on the
processor board. A load instruction takes two cycles for the cache nearest the processor,
eight cycles in the next cache, twenty cycles in the board cache, and a long time if data is
in memory. Furthermore, the processor contains hardware to optimize the loading of
consecutive memory locations. If two load operations are each issued on two consecutive
cycles to consecutive memory locations, the processor will optimize the use of the
memory bus.
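For concreteness, the latencies just quoted could be collected into a table such as the following; the dictionary is only an illustration of the machine model a scheduler consults, not code from the compiler:

    # Assumed issue latencies, in cycles, from the description above.
    LATENCY = {
        'integer_op': 1,     # most integer operations
        'float_op':   4,     # most floating point operations
        'load_L1':    2,     # hit in the cache nearest the processor
        'load_L2':    8,     # hit in the second on-chip cache
        'load_board': 20,    # hit in the board-level cache
        # Loads from memory, integer multiply, and floating point divide
        # take "a large number of cycles" and are modeled pessimistically.
    }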
It is important that useless branches at least not be counted when determining
the schedule. Such branches are marked with an asterisk (*) in the cycle location.
There are hardware bypasses so that a compare instruction and a branch
instruction can be issued in the same cycle. Note that the assignment to SI9 (in B1) can
be moved forward, eliminating an extra slot. Also note that B12 is only reached from the
preceding block, so NOPs do not need to be inserted.
Now note that the inner loop starting with block B2 consists of three blocks. The
first block is the conditional test and the third block updates the iterations. All but one of
the computations from the third block can be moved to the first block (hoisting), while the
remaining instructions can be scheduled more effectively by making a copy of the iteration
block (super block scheduling).
Note that NOPs were inserted in the middle of the code. The machine picks up four
instructions at a time, aligned on 16-byte boundaries. It must initiate all instructions in this
packet of four instructions before going on to the next packet. To execute the instructions
in the smallest amount of time, we must maximize the number of independent instructions
in each packet. The resulting scheduled instructions are shown in Figure 2.23.
Figure 2.23 Scheduled Instructions
This compiler's register allocator combines the two techniques. Because LIMIT has
been run, little register spilling will occur. Graph coloring is therefore used to assign
registers to temporaries that hold values at the beginning of some block, in other words, in
those situations in which graph coloring performs best. A modification of bin packing
suggested by Hendron (1993) will be used to schedule temporaries within each block.
Previous attempts at splitting the temporaries that are live at the beginning of
blocks (global allocation) from those that are live within a block (local allocation) have
encountered difficulties because performing either global or local allocation before the
other could affect the quality of register allocation. This problem is resolved by the
existence of the LIMIT phase, which has performed spilling of global temporaries before
either allocation occurs.
Note that the presence of LIMIT has eliminated most register spilling during register
allocation. It does not eliminate all of it. There can be secondary effects of conditional
branching that can cause register spilling during either graph coloring or bin packing. This
situation is unavoidable, since optimal register allocation is NP-complete. In the situations
in which spilling occurs, the register allocator will insert the required store and load
operations.
Now we apply register allocation to the example. First the compiler must recompute
the points where temporaries are live, because instruction scheduling has changed these
points (see Figure 2.24). Note that the scheduler has introduced a redefinition of a local
register, so we need to either perform superblock scheduling earlier (when we don't yet
know that it will pay off), redo the right number of names computation, or locally redo the
right number of names where we create these problems. We only deal with the integer
registers here; the floating point
registers in this case are simple because they all interfere and so one assigns each to a
different register.
After the lifetime information for temporaries has been computed, the compiler uses
a graph-coloring algorithm to allocate the registers that are live at the beginning of some
block, or registers which are directly assigned to a physical register. The ones assigned to
a physical register are preallocated; however, they must be considered here to avoid any
accidental assignments. The physical registers will be named using $0, $1, and so on.
Note that the temporaries corresponding to formal parameters are assigned to physical
registers specified by the calling standard for the target machine. The globally assigned
registers are listed in Figure 2.25, together with the kind of register. In this case all of the
registers needed are called scratch registers, which means that the value in the register
need not be saved and restored if the register is used in the procedure.
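The coloring itself can be sketched in the classic Chaitin style: simplify the interference graph by repeatedly removing low-degree nodes, then color in reverse order. This is a sketch assuming the interference graph has already been built from the live ranges; it is not the compiler's exact allocator:

    def color_temporaries(interference, k):
        # Simplification: a node with fewer than k neighbors can always be
        # colored after its neighbors, so remove it and recurse.
        graph = {n: set(adj) for n, adj in interference.items()}
        stack = []
        while graph:
            node = next((n for n in graph if len(graph[n]) < k), None)
            if node is None:
                # No low-degree node: optimistically push a spill candidate.
                node = max(graph, key=lambda n: len(graph[n]))
            stack.append(node)
            for m in graph.pop(node):
                if m in graph:
                    graph[m].discard(node)
        assignment = {}
        while stack:
            node = stack.pop()
            used = {assignment[m] for m in interference[node]
                    if m in assignment}
            free = [c for c in range(k) if c not in used]
            assignment[node] = free[0] if free else None   # None marks a spill
        return assignment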
Figure 2.24 Live Ranges after Scheduling
After the registers that are live at the beginning of any block have been
allocated, we can allocate the symbolic registers that are live only within a single block. In
this small example there are only a few. In realistic programs, these registers greatly
outnumber the globally live registers. These local registers are listed in Figure 2.26. A
register is reused if at all possible because the compiler wants to minimize the number of
registers used. This avoids the necessity of using a register that is not a scratch register
and would thus require that a store operation be inserted at the beginning of the
procedure to save its value and a load inserted at the exit to restore the value.
The resulting assembly code is shown in Figure 2.27. The temporaries have all
been replaced by registers. There were no spill instructions inserted, so the instruction
schedules have not changed.
Figure 2.26 Local Register Assignments
2.11 Rescheduling
The next phase is a rescheduling phase, which is only executed if the register
allocator has changed the set of instructions that are executed. This can happen due to
either a peephole optimization or the introduction of spill code. Neither of these occurred
in this case, so the rescheduling operation is ignored.
If the register allocator generated any instructions, that is, register spilling occurred,
then the instruction scheduler is executed again, but in this case only on blocks where
load or store operations have been inserted.
2.13 References
Allen, R., and K. Kennedy. Advanced compilation for vector and parallel computers. San
Mateo, CA: Morgan Kaufmann.
Fraser, C. W., and D. R. Hanson. 1995. A retargetable C compiler: Design and
implementation. Redwood City, CA: Benjamin/Cummings.
Morel, E., and C. Renvoise. 1979. Global optimization by suppression of partial
redundancies. Communications of the ACM 22(2): 96-103.
Wolfe, M. 1996. High performance compilers for parallel computing. Reading, MA:
Addison-Wesley.
Chapter 3 Graphs
A prime prerequisite for being a compiler writer is being a data structure junkie.
One must live, breathe, and love data structures, so we will not provide the
complete list of all background mathematics that usually appears in a compiler book. We
assume that you have access to any one of a number of data structure or introductory
compiler writing books, such as Lorho (1984) or Fischer and LeBlanc (1988). This design
assumes that you are familiar with the following topics, which are addressed by each of
the data structure books referenced.
Equivalence relations and partitions. The compiler frequently computes
equivalence relations, or partitions, of sets. An equivalence relation is frequently represented
as a partition: All of the elements that are mutually equivalent are grouped together into a
set of elements. Hence the whole set can be represented by a set of disjoint sets of
elements. Partitions are frequently implemented as UNION/FIND data structures. This
approach was pioneered by Tarjan (1975).
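A minimal sketch of a UNION/FIND partition with path compression (union by rank is omitted for brevity):

    class Partition:
        def __init__(self):
            self.parent = {}
        def find(self, x):
            # Follow parent links to the set representative, compressing
            # the path along the way.
            self.parent.setdefault(x, x)
            if self.parent[x] != x:
                self.parent[x] = self.find(self.parent[x])
            return self.parent[x]
        def union(self, x, y):
            # Merge the sets containing x and y.
            self.parent[self.find(x)] = self.find(y)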
Partial ordering relations on sets. A compiler contains a number of explicit and
implicit partial orderings. Operands must be computed before the expression for which
they are an operand, for example. The compiler must be able to represent these relations.
The topics that are addressed in this chapter concern graphs. A number of the data
structures within a compiler (the flow graph and the call graph, for instance) are represented
as directed graphs. Undirected graphs are used to represent the interference relationship
for register allocation. Thus these topics are addressed here to the extent that the theory
is used in implementing the compiler. The topics addressed are as follows:
Data structures for implementing directed and undirected graphs
Depth-first search and the classification of edges in a directed graph
Dominators, postdominators, and dominance frontiers
Computing loops in a graph
Representing sets
In an undirected graph the edges do not have a sense of direction. One is not
traveling from one node to another in a particular direction. Instead, undirected graphs
represent the idea of neighbors: two nodes are adjacent or they are not. The techniques
for implementing directed graphs are used to implement undirected graphs: for each edge
{X,Y} in the undirected graph, build two edges (X,Y) and (Y,X) in the implementation. In
the matrix form, this means that the matrix is symmetric and only half of the matrix need
be stored.
The second category consists of back edges. These are edges that go from a node
to another node that has started processing but has not finished yet. If you look at the
algorithm, this means that the edge must go back to a node that is still being processed by
a procedure that directly or recursively calls this one: In implementation terms, the head of
the edge is a node still on the stack, and that node will be an ancestor of the current node
in the depth-first tree. This edge goes from a node to an ancestor in the tree formed of
tree edges.
Figure 3.3 Depth-First Search Tree for MAXCOL
The opposite of backward edges are forward edges. A forward edge from n to S is
an edge that goes from a node to its successor; however, the successor has already been
processed. In fact, it was processed as a result of the processing of some other successor
of n. So this is an edge that goes from an ancestor to a descendant in the depth-first
search tree.
No other edge can go up the tree or down the tree, so the fourth category of edges
must go from one subtree to another. These are called cross edges. The classification of
the edges for Figure 3.3 is given in Table 3.1.
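The four categories fall directly out of a depth-first search that records discovery order and whether a node is still on the recursion stack. A sketch, assuming the graph is a successor map (white = unvisited, gray = still being processed, black = finished):

    def classify_edges(graph, entry):
        color, disc, time, kinds = {}, {}, [0], {}
        def dfs(n):
            color[n] = 'gray'; disc[n] = time[0]; time[0] += 1
            for s in graph[n]:
                if color.get(s, 'white') == 'white':
                    kinds[(n, s)] = 'tree'; dfs(s)
                elif color[s] == 'gray':
                    kinds[(n, s)] = 'back'      # target still on the stack
                elif disc[n] < disc[s]:
                    kinds[(n, s)] = 'forward'   # to a finished descendant
                else:
                    kinds[(n, s)] = 'cross'     # between subtrees
            color[n] = 'black'
        dfs(entry)
        return kinds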
not reachable, so B2 dominates these three nodes. B2 does not dominate B4 since there
is an alternate path from B1 to B4 that avoids B2.
Figure 3.4 Dominator Tree for MAXCOL
The current algorithm for computing the tree of immediate dominators was
developed by Lengauer and Tarjan (1979). This algorithm comes in two forms, with
runtime complexity either O(|N| ln |N|) or O(|N| α(|N|)), where α is the slowly growing inverse Ackermann function, depending on the complexity of the
implementation. I do not state the algorithm here, as it is too complex to describe
accurately in the space available. Instead I will give a rationalization for the algorithm and
then a simpler algorithm by Purdom that is easy to understand.
Tarjan calculates the dominator using information gathered during a depth-first
search of the program flow graph. Note that the dominator of B is an ancestor of B in any
depth-first search tree. Frequently it will be the immediate parent in the depth-first search
tree. When will it not be so? When there is an edge entering B that is not a tree edge in
the depth-first search tree. Such an edge means that there is another way to get to B
besides the path in the tree. In that case the closest block that can be a dominator of B is
the common ancestor in the tree of B and the tail of the edge. But now things get complex,
because that block may not be a dominator because of another edge entering one of the
blocks in between.
To resolve these problems and store the information we have been discussing,
Tarjan defines a quantity called the semi-dominator and computes these values in a
bottom-up walk of the depth-first search tree. Having these values, he can easily compute
the actual dominators.
The compiler stores the dominator information as a tree. The nodes of the tree are
the blocks in the flow graph; however, the tree edges are not necessarily the flow graph
edges. The parent of any node in the tree is its immediate dominator. For each block B,
the compiler keeps two attributes that store the dominator information:
idom(B) is the immediate dominator of B.
children(B) is the set of blocks for which B is the immediate dominator. Logically this
information is a set; however, it is useful to store the information as a linked list, with the
successors of B that are dominated by B coming first in the list. This will make some of the
later optimization algorithms work more efficiently.
This tree structure results in the tree in Figure 3.4 for the running example.
The compiler also needs to know the common dominator of a set of blocks. The common
dominator is the block that dominates each element of the set of blocks and is dominated
by every other block that dominates each of the blocks of the set. This common dominator
can be computed as shown in Figure 3.5. The algorithm works by observing that if Z does
not dominate B, and B does not dominate Z, then one can walk up the dominator tree from
one of them to find a block that dominates both.
Although it computes the common dominator of a pair, this algorithm is adequate
for any set of blocks because the common dominator can be found by pairwise computing
the common dominator of blocks.
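A sketch of the pairwise walk (dominates and idom follow the definitions above; this mirrors the idea of Figure 3.5 rather than reproducing it):

    def common_dominator(B, Z, idom, dominates):
        # Walk up the dominator tree from Z until reaching a block that
        # also dominates B; the walk terminates at Entry at the latest.
        while not dominates(Z, B):
            Z = idom[Z]
        return Z

For a set of blocks, fold this over the set pairwise, as the text observes.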
Here is a simple algorithm for computing dominators. Recall the basic principle of
depth-first searches. A depth-first search that visits a node n also visits all nodes
reachable from n. Now pretend that n is not in the graph by pretending that the edges
entering n do not exist and that n does not exist. Perform a depth-first search starting at
Entry on this mutilated graph. Which nodes are not reachable from Entry that were
reachable before? A node is not reachable if there is no path to it. If it was reachable
before, this means that n is on every path to these unreachable nodes. In other words, n is
a dominator of all of those unreachable nodes. Thus, the algorithm consists of performing
a single depth-first search to determine all of the reachable nodes. Discard the
unreachable nodes. Now for each node n in the flow graph, pretend that n is not in the
graph and repeat the depth-first search starting at Entry. The nodes that are not reachable
are the nodes dominated by n.
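A direct, if quadratic, transcription of this idea (reachable is an ordinary depth-first search that pretends the skipped node is absent):

    def dominated_by(graph, entry):
        # 'graph' maps each node to its successor list.
        def reachable(skip):
            seen, stack = set(), [entry]
            while stack:
                n = stack.pop()
                if n in seen or n == skip:
                    continue
                seen.add(n)
                stack.extend(graph[n])
            return seen
        nodes = reachable(skip=None)     # discard unreachable nodes first
        dom = {}
        for n in nodes:
            if n == entry:
                continue
            # Nodes that become unreachable once n is deleted are exactly
            # the nodes dominated by n (n itself is among them, since every
            # node trivially dominates itself).
            dom[n] = nodes - reachable(skip=n)
        return dom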
Figure 3.5 Computing the Common Dominator
3.4 Postdominators
If the compiler is moving computations to earlier points in the flow graph, then the
dominator information gives the safe positions in the flow graph to which to move the
computation. The compiler can move the computation to an earlier block that is on each
path to the current block. The opposite information is also useful. If the compiler wants to
move a computation to a later point, where can it be moved? This question leads to the
idea of postdominance, which has similar characteristics to dominance with the exception
that the path goes from B to Exit rather than from Entry to B, and successor blocks are
used rather than predecessor blocks.
Definition
Postdominance:
A block X postdominates a block B if and only if each path from B to Exit contains
the block X.
Consider the running example for which the dominator tree is in Figure 3.4. The
bottom-up dominator tree walk first visits blocks B3, B6, B2, B4, B1, B5, and then B0. As
the walk is performed, the dominance frontier is computed (see Table 3.2). In the
calculation of the dominance frontier, B3 finds B2 and B4 in its dominance frontier
because they are successors and are not dominated by B3. Similarly, B6 finds B3 in its
dominance frontier. During the computation of the dominance frontier of B2, B3 will not be
in its dominance frontier because B2 dominates B3. However, B2 is in the dominance
frontier of B2.
Observation
If B and X are blocks in a flow graph where there is a path from every block to Exit,
then X postdominates a successor S of B if and only if there is a non-null path from B to X
through S such that X postdominates every node after B on the path.
Proof
Assume the path exists. Since S is on the path, S is postdominated by X.
Conversely, assume that S is postdominated by X. There is some path from S to Exit.
Since S is postdominated by X, X is on this path. Cut the path short at X and add B and
the edge from B to S to the beginning of the path. This gives a path from B to X. Each
node except B on the path is postdominated by X. If it isn't, then there is a path from it to
Exit and by cutting the original path and pasting in the new path, one can create a path
from S to Exit that avoids X, a contradiction. So we have the path.
Observation
If S is a successor of B, then either S is the postdominator of B or pdom(S) is
postdominated by pdom(B).
Proof
Assume S is not the postdominator of B. Consider any path from S to Exit. It can be
extended to a path from B to Exit. Thus, pdom(B) is on this path. Thus pdom(B) is not
equal to S and is on each path from S to Exit, so it is a postdominator of S. Thus it must
postdominate pdom(S).
Now we can give an algorithm for computing the control dependence relation. Look
at the definition: the edge (B,S) is given. What blocks are control dependent on this edge?
Any block that postdominates S and does not postdominate B. These are the nodes in the
postdominator tree starting at S, pdom(S), pdom(pdom(S)), and stopping at but not
including pdom(B). The second observation indicates that, traversing the tree upward
through the parents (postdominators), the algorithm must reach pdom(B) eventually.
The algorithm in Figure 3.8 can be applied to each edge. Actually, it needs to be
applied to each edge that leaves a block with multiple successors, since a block with a
single successor can have no blocks control dependent on it. For our running example this
gives the results in Table 3.3. Sometimes the compiler needs the transpose of this
information: for each block, on what blocks it is control dependent. In that case the same
algorithm is used; however, the information is stored indexed by the dependent block
rather than by the edge leading to the dependence.
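A sketch of the walk for a single edge (B,S), following the idea of Figure 3.8 (pdom is the immediate postdominator attribute):

    def control_dependents(B, S, pdom):
        # Walk up the postdominator tree from S, stopping at pdom(B);
        # every block visited is control dependent on the edge (B, S).
        result, X = [], S
        while X != pdom[B]:
            result.append(X)
            X = pdom[X]
        return result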
Figure 3.8 Calculating Control Dependence
Table 3.3 Control Dependences for Each Edge

Edge       Control-Dependent Blocks
(B0,B1)    B1, B4
(B1,B4)
(B1,B2)    B2, B3
(B2,B3)
(B2,B6)    B6
(B3,B2)
(B3,B4)
(B4,B1)    B1, B4
(B4,B5)
The algorithm for computing the blocks in a loop for a single-entry loop is given in
Figure 3.9. Consider any block B. The only way that it can be the entry block for a single-entry loop is if there is a back edge entering it in some depth-first search walk of the flow graph.
Consider the alternative: An entry block in a loop must be involved in a cyclic path and be
the first block in the cycle that is reached in the walk. Thus, all of the blocks in the cycle
will be descendents of B in the walk, and the edge leading back to B is a back edge.
The idea behind the algorithm is walking the loop backward. Consider each predecessor
of B coming from a back edge. Walk the graph backward from these predecessors.
Eventually the walk leads back to B, and all of the blocks in the loop will be visited. The
algorithm implements this idea using a work-list algorithm. The set Queue contains all
blocks that are known to be in the loop but whose predecessors have not been processed
yet. Each block is inserted into Queue at most once because Queue ⊆ Loop and the
insertion occurs only when the block is not already in Loop.
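A sketch of the work-list computation for a single-entry loop with header B (back_preds are B's predecessors along back edges; predecessors is the predecessor map of the flow graph):

    def loop_body(B, back_preds, predecessors):
        loop = {B}
        queue = [p for p in back_preds if p != B]
        loop.update(queue)
        while queue:
            block = queue.pop()
            for p in predecessors[block]:
                if p not in loop:
                    loop.add(p)          # each block enqueued at most once
                    queue.append(p)
        return loop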
Later we will generalize this algorithm to handle multiple-entry loops, and use it to
compute the nesting structure of loops. The compiler not only needs to know the loops,
but needs to know which loops are contained in other loops. Note that the way the
compiler computes loops will ensure that the loops identified are either disjoint (no blocks
in common) or nested (one loop is a subset of another). The nesting structure is used for
three purposes:
1. The compiler uses the loop nest during dependence-based optimization since
these phases transform loops to improve program performance.
2. The loop nests are used to perform one kind of strength reduction. Values modified
in a regular fashion during each iteration of a loop may be computed in a more
effective way; for example, multiplications can be replaced by repeated additions.
3. The loop nests are used during register allocation to find points in the program
where values may be stored or loaded from memory.
Figure 3.9 Template of Code for Finding a Loop
Single-entry loops are frequently called reducible loops. Multiple-entry loops are called irreducible loops. This compiler uses
techniques that optimize single-entry loops. Multiple-entry loops are identified to ensure that no incorrect translations occur.
How does the compiler identify multiple-entry loops? A loop is a union of cyclic
paths. Consider one of these cyclic paths. During a depth-first search there is a first block
B on the path that is visited. All other blocks on the cycle are descendants of B, and the
cyclic edge entering B is a back edge. Thus a loop with entry B is found as in Figure 3.9
by considering these predecessors and walking the loop backward. The problem with a
multiple-entry loop is that this walk can escape from the loop (walking backward through
one of the other entries) and eventually lead all the way back to Entry. This means that B
does not dominate these predecessors. Consider the multiple-entry loop {C,D} in Figure
3.11. If the depth-first search visits the blocks in order {A,C,D,E,B}, then C is the first block
in the loop that is visited. The edge (D,C) is a back edge. When walking backward from D
one visits {D,C,B,A}.
To avoid this problem, the algorithm must be modified to stop the backward walk.
But where should the walk stop? The compiler wants a single-entry region, even if it is not
a loop. So stop the walk at the block that is closest to the loop and which dominates all of
the blocks in the loop. This will be the block that dominates the header B and all of B's
predecessors that reach B by a back edge. Recall that B dominates itself. Using this
information, the algorithm in Figure 3.9 is modified to the algorithm in Figure 3.12.
The algorithm implements the ideas that we have just discussed. Note that the
body of the loop is not computed at this point when a multiple-entry loop is encountered.
Instead, the set of blocks that lead to the loop body are recorded in an attribute called
generators. This set will be initialized to empty before the identification of loops is started.
A block that has a non-empty generators set is the immediate dominator of a multiple-entry loop. The loop body is not recognized immediately for the following reasons:
We will see shortly that this whole process is embedded in a depth-first search in
which the loop starting at a block is recognized after all blocks later in the walk have been
processed. Recording the generators set allows this to be true for multiple-entry loops as
well.
More than one multiple-entry loop can have the same immediate dominator. The
aggregate will be considered one loop for the process of forming the loop nest.
We will be able to handle loops contained in this loop more effectively. Consider a
multiple-entry loop with entry blocks B1 and B2 with common dominator C. By delaying
the identification of the loop until all successors have been identified, a loop that occurs on
the path between C and B1 or C and B2 will be handled as a nested loop. If this subloop
is a single-entry loop, then the full set of optimizations can be applied to it. If the body of
the multiple-entry loop were created when either B1 or B2 was processed, then these
subloops would not be considered a separate loop.
These attributes allow free movement around the loop tree with full knowledge of
which blocks and loops are contained in other blocks and loops.
As the loop tree is built, each loop is identified and entered in the tree. Once it has
been entered in the tree it is handled as a single entity. Its interior structure is not viewed
again during the construction process. The algorithm FIND_LOOP is modified to handle
tree nodes and augmented to be part of the complete construction process. To form this
tree, we need two modifications to the algorithm:
1. Consider the blocks in the graph in postorder. Due to the structure of a depth-first
search, a single-entry loop contained in another single-entry loop has an entry
block with a smaller postorder number. So by visiting blocks in postorder, the inner
loops are identified before the outer loops.
2. Once identified, handle each loop as if it were a single block. This is done by
keeping a datum for each block or loop indicating which block or loop it is contained
in (if any). When one finds a block, use this datum to scan outward to the outermost
identified loop that contains this block.
The compiler now has the complete algorithm. In Figure 3.13 we have the final
version of FIND_LOOP, which computes the blocks, called the generators, that determine
all the other blocks in the loop. If it is a single-entry loop, FIND_LOOP goes ahead and
builds the node in the loop tree using FIND_BODY.
FIND_BODY computes the set of nodes in the body of the loop by moving
backward from the blocks that generate the loop to the header (see Figure 3.14). All
blocks in between are in the loop. It builds the node in the loop tree and fills in all of the
attributes. Care must be taken to ensure the distinction between blocks and already
computed loops. The loop header and predecessors are always blocks. Before inserting a
node into the loop tree, the compiler must find the largest enclosing loop that has already
been computed. This is done by LoopAncestor, shown in Figure 3.15.
LoopAncestor finds the outermost processed loop that contains the current loop or
block by scanning up the LoopParent attribute until it finds a node that has a null entry.
Since this attribute is updated to a non-null entry by FIND_BODY as soon as an enclosing
loop has been identified, this algorithm gives the outermost existing loop.
Finally, the main procedure for computing loops can be described (see Figure
3.16). Calculate_Loop_Tree first performs a depth-first search to compute the postorder
numbers for each node and the back edges. The implementation may perform this depth-first walk at the same time that the rest of the algorithm is being computed: just embed the
calculations in a recursive depth-first search procedure after a node is visited.
First Calculate_Loop_Tree initializes all of the attributes for blocks. These could be
initialized when the blocks were created; however, the step is described here for
completeness. Then the procedure visits the blocks in postorder. If the generators set is
non-empty, then the block is the head of a multiple-entry loop, so that loop is built. Then
the procedure checks to see if the block is the head of a single-entry loop. Note that a
block may be the head of both a multiple-entry loop and a single-entry loop. In that case,
the compiler builds a nest of two loops: the multiple-entry loop is the innermost loop and
the single-entry loop is the outer loop. The loop tree for our standing example is given in
Figure 3.17.
Figure 3.14 Computing the Body of a Loop
operations; however, it takes an order of magnitude more space than bit vectors, so one
does not want to use it if one needs to have a large number of sets.
Consider our universe of integers, numbered from 0 to MAX. Allocate two arrays of
MAX + 1 elements, INDEX[0:MAX] and VALUE[0:MAX], and a single integer
variable, NEXTPLACE.
The idea behind the algorithm (Figure 3.18) is that the elements of the set are
stored in VALUE, starting at the bottom and piling them up in adjacent slots. As an
element X is added to VALUE, the index in VALUE where it is stored is placed in
INDEX(X). Otherwise the values of INDEX are not initialized. Curiously, the algorithm is
dealing with uninitialized data.
How does the algorithm know when a value is in the set? It checks the
corresponding INDEX(X). That information may be uninitialized, so first it checks to see if
the value is in range. If it is not, then the element is not in the set. If the value is in range it
can still be uninitialized, so it checks the corresponding value in the VALUE array. If the
value matches, then the algorithm knows that the element is in the set.
To remove an element from the set is a bit trickier. The algorithm must run in constant
time, so it cannot remove an element and move the others down. Instead it moves the last
element in the set down into the position that is being vacated. At the same time it adjusts
its INDEX value and decreases the counter NEXTPLACE.
Figure 3.18 Efficient Set Algorithm
The basic operations occur in O(1) time, and scanning the elements in the set takes time
proportional to the number of elements actually in the set. It does take more space, though. Consider
an implementation where the elements are represented by 16-bit numbers. Thus there are
32 bits for each element, indicating that this representation takes 32 times as much space
as a bit-vector approach. Thus this representation works well when only a small number of
sets (usually one or two) is necessary.
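The representation just described is the sparse set of Briggs and Torczon (1993). A minimal sketch (INDEX and VALUE become two lists and NEXTPLACE a counter; the lists are zero-filled here only because the language requires it, since the algorithm never trusts a slot it has not validated):

    class SparseSet:
        def __init__(self, max_value):
            self.index = [0] * (max_value + 1)
            self.value = [0] * (max_value + 1)
            self.next_place = 0
        def contains(self, x):
            # An uninitialized INDEX entry is caught by the range check
            # or by the cross-check against VALUE.
            i = self.index[x]
            return i < self.next_place and self.value[i] == x
        def insert(self, x):
            if not self.contains(x):
                self.index[x] = self.next_place
                self.value[self.next_place] = x
                self.next_place += 1
        def remove(self, x):
            if self.contains(x):
                # Move the last element into the vacated slot: O(1).
                last = self.value[self.next_place - 1]
                i = self.index[x]
                self.value[i] = last
                self.index[last] = i
                self.next_place -= 1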
3.9 References
Aho, A. V., J. E. Hopcroft, and J. D. Ullman. 1974. The design and analysis of computer
algorithms. Reading, MA: Addison-Wesley.
Briggs, P., and L. Torczon. 1993. An efficient representation for sparse sets. ACM
Letters on Programming Languages and Systems 2(1-4): 59-69.
Cytron, R., and J. Ferrante. 1987. An improved control dependence algorithm. (Technical
Report RC 13291.) White Plains, NY: International Business Machines, Thomas J. Watson
Research Center.
Cytron, R., J. Ferrante, and V. Sarkar. 1990. Compact representations for control
dependence. Proceedings of the SIGPLAN 90 Symposium on Programming Language Design and
Implementation. White Plains, NY. 241-255. In SIGPLAN Notices 25(6).
Cytron, R., J. Ferrante, B. Rosen, M. Wegman, and F. Zadeck. 1991. Efficiently computing
static single assignment form and the control dependence graph. ACM Transactions on
Programming Languages and Systems 13(4): 451-490.
Fischer, C. N., and R. J. LeBlanc, Jr. 1988. Crafting a compiler. Redwood City, CA:
Benjamin/Cummings.
Lengauer, T., and R. E. Tarjan. 1979. A fast algorithm for finding dominators in a flow
graph. ACM Transactions on Programming Languages and Systems 1(1): 121-141.
Lorho, B. 1984. Methods and tools for compiler construction: An advanced course.
Cambridge University Press.
Purdom, P. W., and E. F. Moore. 1972. Immediate predominators in a directed graph.
Communications of the ACM 15(8): 777-778.
Tarjan, R. E. 1975. Efficiency of a good but not linear set union algorithm. Journal
of the ACM 22(2): 215-225.
performed, a set of input operands that are used to perform the operation indicated by the
opcode, and a set of output targets that name the values being changed.
The individual instructions have operands that are constants or temporaries. The
set of temporaries is an arbitrarily large set of objects, like the physical registers in a real
processor. Each temporary holds a value for some portion of the execution of the
procedure. Some set of instructions will evaluate an expression and place it in the target
temporary. Instructions that use this value as an operand reference the temporary as an
operand.
The instructions form a program in the same manner that assembly code on a real
processor forms a program. The execution starts with the first instruction. Instructions are
executed in turn until a branching instruction is found. The instructions are broken into
sequences called blocks. The only instruction that is the destination of a branching
instruction is the first instruction in a block. The only branching instructions are the last
instructions in the block. At the end of the block there is a branching instruction
representing each possible path out of the block.
The blocks form a flow graph having the blocks as nodes in the graph. The edges
between the blocks represent the possible execution paths leaving the block. The edge
(B1, B2) indicates that there is some way that the execution of the procedure can travel
directly from B1 to B2. The flow graph will have two distinguished nodes: the start block
Entry and the exit block Exit.
Consider Figure 4.1 as a fragment of a procedure representing the computation of
the statement A = B + C * (B+A). The statement is broken into individual computations.
Before the value of a variable can be referenced, the address of the variable must be
loaded and a load operation for the variable must be executed. All values are loaded into
temporaries. For typographical purposes integer temporaries are represented by an
integer prefixed with a letter T. Note that the addresses of A and B are used twice and
loaded only once. The name A indicates the constant address of variable A. The value of
B is used twice and loaded only once. These are examples of redundant expression
elimination. The individual operation names (or opcodes) will be described later: iLDC
stands for load integer constant, iSLD stands for load integer value from static memory,
iADD is integer add, iMUL is integer multiply, and iSST is integer store into static memory.
These names are taken from the Massive Scalar Compiler Project at Rice University.
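One plausible rendering of such a fragment, using the opcodes just named (the temporary numbering is illustrative and the figure itself may differ in detail):

    iLDC  A        => T1    ; address of A (used by the load and the store)
    iLDC  B        => T2    ; address of B
    iSLD  T2       => T3    ; value of B (used twice, loaded once)
    iSLD  T1       => T4    ; value of A
    iADD  T3, T4   => T5    ; B + A
    iLDC  C        => T6    ; address of C
    iSLD  T6       => T7    ; value of C
    iMUL  T7, T5   => T8    ; C * (B + A)
    iADD  T3, T8   => T9    ; B + C * (B + A)
    iSST  T9, T1            ; store into A

Note how T1 serves both the load and the final store, and T3 serves both additions, reflecting the redundant expression elimination mentioned above.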
For an example involving loops and branches consider Figure 4.2, which computes
an integer power of 2. The argument is the power, and it controls the number of times the
loop is executed. The flow graph for this program (Figure 4.3) shows a number of
characteristics of flow graphs. Each flow graph starts with a special pseudo-instruction
called prolog and ends with the instruction epilog. These represent whatever computations
need to be performed at the beginning and end of the procedure. Note that prolog takes
as an argument the actual parameters of the procedure. In this case the single parameter
is i, which is stored in temporary T1.
The program flow graph is divided into blocks labeled B0, B1, and B2. Each label
begins a block in the directed flow graph. The block consists of some number of
computational instructions followed by branching instructions that end the block. The
conditional branching instructions iBCOND are assumed to be two-way branches, so there
is no implied flow of execution from one block to another. The first label is the address to
branch to if the condition is true. The second label is branched to if the condition is false.
Figure 4.3 Program Flow Graph for Sample in Figure 4.2
include the blocks that are possible destinations of the branches. An edge (P,F) ∈ E
occurs when there is a branching statement in P containing a possible destination, F.
added to the set of operators. Before the final optimization phases, the flow graph is
lowered to only use operators that have a single-instruction representation in the target
machine.
Some of the generic operations in the flow graph can be viewed as macros to be
expanded. The load and store byte operations on early Alpha processors are an example
of this. A multiple-instruction sequence is required to load a byte. If the exact sequence of
instructions is generated initially, then some optimizations are lost. Similarly, multiplication
by a constant needs to be expanded into a sequence of shifts and adds. Both of these are
examples of gradual lowering since each should not be lowered initially, but needs to be
done before the final optimization phases so that the individual instructions can be
optimized.
Some operations on the target machine may represent several instructions in the
flow graph. The easiest example is a load or store operation that takes two arguments, a
register and a constant. The address is computed as the sum of the register and constant.
This load operation performs two computations: the addition and the fetch. In the initial
flow graph, these are represented as two distinct instructions. Before the final optimization
phases, these two operations are folded together into a single instruction.
The compiler assumes that all flow of control is explicitly represented in the flow
graph. In other words, the flow of control is not gradually lowered. Some flow of control
can be hidden within instructions that are not yet lowered, such as maximum and
minimum operations. However, each instruction has a set of inputs and outputs, with flow
entering the beginning of the instruction and (except for branching instructions) executing
the next instruction at the end.
Recall that the semantics of a language can be divided into two distinct sets of
rules: static and dynamic semantics. Static semantics are the set of rules that describe the
structural rules of the language (beyond the lexical and parsing structure). For example, a
static semantic rule is that a symbol must be declared before it is used. The dynamic
semantics are the set of rules that describe the effect of each part of the language. It is
part of the dynamic semantics to state that the operands of an addition must be evaluated
before the addition is performed (and possibly to specify the order of evaluation of the
operands), or that the meaning of an assignment statement is to evaluate the address of
the left-hand side, evaluate the value of the right-hand side, and store the value from the
right side into the address specified by the left side. These rules are all part of the
language standard or language specification.[1]
[1] Many compiler writers, including myself, have made a good living from the fact that many people are not aware of dynamic
semantics. Many programmers think that a language is defined if a grammar has been written. The grammar is only a small part of the
total effort. The real effort comes in describing the static and dynamic semantics and the interactions between distinct dynamic
semantic rules.
The language definitions describe the dynamic semantics in terms of the language
construct and its operands. To build an assignment statement, the compiler must be able
to build the operands. This tree-structured approach is true of each construct. This fact
suggests that the flow graph can be built during a bottom-up walk of the abstract syntax
tree in which the children are walked in an order described by the dynamic semantics of
the language construct. For some tree nodes, such as loops, a bottom-up tree walk is
inadequate: Instructions may be generated before, during, and after the generation of the
children.
The tree walk is a little more complex than a simple bottom-up tree walk because
different operations may be needed depending on the context in which the tree occurs.
There are several contexts that occur, but more may be needed depending on the
complexity of the language:
Value Context: When the operand is an expression, the compiler will want to walk
the expression and create a temporary that contains the corresponding value. As a side
effect, it inserts instructions in the flow graph. This walk is implemented by calling the
procedure temporary value_walk(ast * node)
NoValue Context: When the subtree is a statement or an expression used as a
statement, the compiler walks the subtree creating instructions to represent the effect of
the subtree, but no temporary is created to hold any final value. There is an opportunity for
optimization here: the only instructions that the compiler needs to generate are those
representing side effects of the subtree, so some instructions need not be generated. This
walk is implemented by calling the procedure void novalue_walk(ast * node)
Size Context: If the size of the data represented by the subtree is needed, the
subtree must be walked to generate a temporary holding the value of the size of the data.
The calling sequence for this procedure is identical to the value context routine. It just
computes a different value, the size: temporary size_context(ast * node)
Before discussing the structure of each of these tree-walking procedures (they are
all similar), we must discuss the structure of the support routines used to build the flow
graph. These procedures are structured so that the tree walks will read much like dynamic
semantic rules.
Here are the support procedures for building the initial flow graph:
initialize_graph : This procedure creates an empty data structure for the flow
graph and associated tables. It builds two blocks, Entry and Exit, that are the start
and exit blocks for the flow graph. It then makes the Entry block the current block
so the initial instructions will be in that block.
start_block : This procedure takes a block as argument and makes it the current
block. All future instructions will be added to the end of this block.
xxx_instruct : For each class of instructions in the flow graph a separate support
procedure is present to create an instruction of that form. The arguments to the
instruction are the operation code, the input constants or temporaries, and the
output temporaries. For the load and store instructions, further data will be passed
indicating what storage locations these instructions might modify.
There are also support procedures for dealing with temporaries. We assume an
infinite supply of temporaries, so we create a new one at any point that a temporary is
needed. However, we need some conventions concerning the use of temporaries to ease
the work of later optimization phases. Later, during the LIMIT phase, some of these
conventions will be relaxed.
Basic Convention: Each time a formal expression, such as B + A, is computed, it is
computed in the same temporary. Why? The algorithms for code motion and eliminating
redundant expressions need to know where a value is stored. If one instance of B + A is
known to be redundant, the compiler wants to delete that computation. To do so, it must
search the rest of the flow graph looking for all points where B + A is computed and
copying the result into a temporary to be used in place of the redundant expression.
Instead, the compiler always computes B + A in the same temporary so that a redundant
computation need only be deleted.
The compiler ensures that the convention is met by building a data structure called
the formal temporary table,[2] consisting of records of the operation code and inputs for
each instruction together with the temporary for the result. There is a unique entry in the
formal temporary table even if the instruction occurs multiple times in the flow graph.
[2] This is a simplification of an idea first suggested by Chow (1983) in his thesis and later used by COMPASS in the
COMPASS Compiler Engine. The COMPASS approach attempted to use this table for too many purposes.
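A sketch of the formal temporary table as a map from an instruction's opcode and inputs to its result temporary (new_temporary is an assumed allocator of fresh expression temporaries):

    formal_table = {}

    def formal_temporary(opcode, operands):
        # All occurrences of the same formal expression share one result
        # temporary, so a redundant computation can simply be deleted
        # rather than rewired.
        key = (opcode, tuple(operands))
        if key not in formal_table:
            formal_table[key] = new_temporary()
        return formal_table[key]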
In fact, the temporaries are divided into two distinct classes: the variable
temporaries and the expression temporaries. The expression temporaries satisfy the
criteria stated above. All instructions that have one of these as the target register have
exactly the same form. The variable temporaries are all others. Different optimizations are
used on the two classes of instructions.
Now consider the entry for constants. Here there is no further tree to walk, so the
tree-walking procedure gets the data associated with the node (the constant value in this
case) and generates an instruction that has a single constant operand. Again it makes
sure that the same temporary is used for all instances of the same constant and that the
instruction is inserted at the end of the current block.
If a node of the abstract syntax tree cannot return a value, it has no alternative in
the case statement. If such a node occurs where an expression is expected, the compiler
will give a system error message. This check is valuable because it checks the abstract
syntax tree for legal structure at no overhead for correct trees.
Processing Structured Statements
The NoValue_Walk procedure is used for statements. For statements like
procedure calls, the processing is similar to the processing of expressions. Branching and
structured statements are different because they can change the block in which the
current instructions are being inserted.
Consider the case statement alternative for a while loop. Consider the flow graph
that the compiler needs to generate (Figure 4.7). This will describe the code in the
alternative. The compiler will generate two copies of the loop test. The first copy occurs in
the current block to decide if the loop needs to be executed at all. If the loop needs to be
executed, then the code for the body of the loop occurs. Another copy of the loop test
occurs at the end of the body to decide whether the loop needs to be executed again. This
is a more complex representation of the loop than appears in most textbooks. It is chosen
to improve the chances for moving code out of the loop.
Thus the compiler is going to start at least two blocks during the processing of a
while loop. The first block is the block for the body; the second block is the block following
the while loop. We need the second block because the compiler must be able to branch to
the block following the loop.
Figure 4.7 Flow Graph for while Loop
Recall that a break statement can occur inside a while statement. To handle such a
statement, the translator maintains a stack containing the blocks that follow a looping
statement. If a break statement occurs, then it is implemented as a branch to the block at
the top of this stack. With this information, we can describe the code in NoValue_Walk
corresponding to a while loop and break statement (see Figure 4.8).
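A sketch of the while-loop alternative, following the shape of Figure 4.7 (create_block, start_block, value_walk, and novalue_walk are the support procedures described earlier; break_stack is the stack of follow blocks, and conditional_branch is an assumed wrapper around the branching xxx_instruct procedure):

    def walk_while(node):
        body = create_block()
        follow = create_block()
        # First copy of the test, in the current block: skip the loop
        # entirely if the condition is initially false.
        t = value_walk(node.condition)
        conditional_branch(t, true_target=body, false_target=follow)
        break_stack.append(follow)       # a break branches to 'follow'
        start_block(body)
        novalue_walk(node.body)
        # Second copy of the test, at the end of the body.
        t = value_walk(node.condition)
        conditional_branch(t, true_target=body, false_target=follow)
        break_stack.pop()
        start_block(follow)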
Conditional Expressions
Special note is made of conditional expressions because they are one of the few
instances where an expression computing a value can have operands or parts of
operands in separate blocks. This is one of the reasons that a flow graph approach to the
program representation was chosen rather than a tree structure. The concept of a
temporary does not depend on being in the same block, so an operand can be computed
in one block and used in another.
Consider a conditional expression, (E0 ? Et : Ef). The conditional expression E0 is
computed in the current block. There are distinct blocks to compute the other operands.
Where is the result value placed? The compiler needs to generate a temporary to hold
that value. That temporary must be handled as a compiler-generated variable temporary.
It cannot satisfy the requirement placed on expression temporaries: The instructions for
which it is the target register are not all identical.
Figure 4.9 Structure of flow_walk
goto Statements
goto statements can be a problem with some translation techniques. Here we have
developed enough structure that they are quite easy. There are two parts to the
processing: the goto statement itself and the label position. The following operations need
to be performed:
A label is a symbol in the language. There needs to be a symbol table entry for the
label, with a field to hold the block that starts at that label.
A goto statement is translated into an unconditional branch to the block associated
with the label. If there is as yet no block associated with the label, use create_block to
produce one.
At the point at which the label occurs, insert an unconditional branch to the label in
the current block. Effectively end the previous block. Then perform start_block on the
block associated with the label. If there is no block associated with the label, create one
using create_block.
This processing translates the goto statement into an unconditional branch.
The only problems occur when the tail block of the edge has multiple successors
and the target block has multiple predecessors, as shown in Figure 4.11. Such an edge is
called a critical edge. A critical edge can be removed by creating an empty block and
replacing the original edge by two edges: one edge with the original source and the new
block as target, and another with the new block as source and the original target as target.
The two new edges are not critical edges because one of them has a target with only one
predecessor and the other one has a source with a single successor.
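A sketch of the splitting transformation (the graph operations are assumed helpers on the flow graph):

    def split_critical_edges(graph):
        # An edge is critical when its tail has multiple successors and
        # its head has multiple predecessors.
        for (tail, head) in list(graph.edges):
            if len(graph.successors(tail)) > 1 and \
               len(graph.predecessors(head)) > 1:
                middle = graph.create_block()
                graph.remove_edge(tail, head)
                graph.add_edge(tail, middle)   # middle: one predecessor
                graph.add_edge(middle, head)   # middle: one successor
                middle.append(unconditional_branch(head))
                # Any branch in 'tail' targeting 'head' must be retargeted
                # to 'middle'; that bookkeeping is elided here.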
Each definition refers to the concept that one instruction might kill a temporary. For
the direct computational instructions, it is easy to determine if one instruction might kill a
temporary: that instruction can only modify the output temporaries. Other instructions, such
as procedure calls, are more complex. The inner workings of the procedure may not be
available and even if available may not indicate which temporaries are modified. In that
case the compiler must make a conservative estimate. If the compiler cannot deduce that
a temporary is not killed, then it must assume that it is killed.
The language definition may help the compiler make less conservative decisions
about the temporaries killed by an instruction. In Fortran, the language standard indicates
that a legal program cannot modify the same location using two different names. Thus the
compiler can assume that a modification of a formal parameter (dummy argument) does
not modify any global variable, local variable, or other formal parameter. In ANSI C, a
pointer (when not a pointer to characters) cannot modify a storage location of a different
type.
In Figure 4.12 the temporaries I and J are variable temporaries, whereas the
temporaries T1, T2, and T3 are expression temporaries. The expression I * J, or T1, is
locally anticipated because it occurs in the block and no preceding instructions can modify
I or J. Similarly I + 1, or T2, is locally anticipated. However, I * 5, or T3, is not locally
anticipated since I is modified before the instruction. Similarly, T1 and T2 are not locally
available, whereas T3 is.
Figure 4.12 Sample Block
As shown in Figure 4.13, the compiler computes the local information for each
block by simulating the execution of the block. Not knowing the value of temporaries and
variables, it only keeps track of the temporaries that are evaluated and killed. A temporary
is locally anticipated if the first instruction that evaluates the temporary precedes any
instructions that kill the temporary. The compiler maintains a set of all temporaries that
have been killed by earlier instructions in the block, making the check for local
anticipatability straightforward.
The check for local availability is more difficult because the algorithm does not
know which temporaries are killed later in the block while it is simulating the execution. A
temporary is locally available if it is evaluated in the block and the temporary is not killed
by a later instruction. The algorithm computes this by assuming that a temporary is locally
available whenever it is evaluated in the block. When a temporary is killed, it is added to
the set of killed variables and is removed from the set of locally available temporaries.
To determine which temporaries might have been modified, the compiler needs a
set for each temporary called modifies. The set modifies(T) contains the set of
temporaries and memory locations that are killed by an instruction that has T as a target.
For expression temporaries this set is empty. For variable temporaries, it includes the set
of all temporaries that have this temporary as a leaf of the corresponding expression tree.
The calculation of this set is described in the chapter on alias analysis (Chapter 6).
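A sketch of the block simulation, under the assumption that modifies(T) for a variable temporary includes T itself; this matches the example below, in which the copy into I both kills I and makes it available:

    def local_sets(block, modifies):
        killed, anticipated, available = set(), set(), set()
        for inst in block.instructions:
            for target in inst.targets:
                # A definition kills every temporary that depends on the
                # target (for a variable temporary, including itself).
                for t in modifies.get(target, ()):
                    killed.add(t)
                    available.discard(t)
                # Locally anticipated: evaluated before anything kills it.
                if target not in killed:
                    anticipated.add(target)
                # Locally available: evaluated and not killed afterward.
                available.add(target)
        return anticipated, available, killed

On the block of Figure 4.12 this yields T1 and T2 anticipated but not available, T3 available but not anticipated, and I killed yet available, matching the discussion below.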
Apply the algorithm to the block in Figure 4.12 (see Figure 4.13). While simulating
the first instruction T1 is added to both available and anticipated. Since T1 is an
expression temporary, it does not kill any other temporaries. Similarly, the second
instruction adds T2 to both sets and no temporaries are killed. The copy into I kills T1, T2,
and any other temporaries that use I as an operand. They are both still anticipated;
however, they are removed from available because a killing instruction follows their
evaluation. However, T3 is added to available and is not later removed. I is both killed in
the block and available at the end of the block.
This algorithm may be the most expensive part of the optimizer. The algorithm is
simple, but each instruction must be simulated. Other algorithms will consider only the
blocks and not the instructions. The data structures need to be tuned for speed and
space. Here are the data structure choices that I have found most effective:
The collection of sets, modifies(T), is large and not dense. For expression temporaries the
set is empty. Each of these sets should therefore be stored as a small array of
temporaries.
The sets available and anticipated occur only once; hence their size is not much of
a factor. However, elements are repeatedly being added and union and differences are
being taken. The compiler uses the Briggs set implementation technique to store these
sets.
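Since the Briggs technique is referred to several times, here is a minimal sketch of it (in Python; the point of the representation is that membership, insertion, deletion, and clearing are all constant time, and the members can be enumerated without touching the whole universe):

class BriggsSet:
    def __init__(self, universe_size):
        self.dense = [0] * universe_size    # members, packed at the front
        self.sparse = [0] * universe_size   # sparse[e]: e's slot in dense
        self.n = 0                          # current number of members

    def member(self, e):
        return self.sparse[e] < self.n and self.dense[self.sparse[e]] == e

    def insert(self, e):
        if not self.member(e):
            self.dense[self.n] = e
            self.sparse[e] = self.n
            self.n += 1

    def remove(self, e):
        if self.member(e):
            last = self.dense[self.n - 1]   # move last member into e's slot
            self.dense[self.sparse[e]] = last
            self.sparse[last] = self.sparse[e]
            self.n -= 1

    def clear(self):                        # constant time: no zeroing
        self.n = 0

    def members(self):
        return self.dense[:self.n]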
The sets associated with each block, local_anticipated and local_available, should
have an efficient storage. The storage depends on how the global information is
computed.
The set killed(B) needs to be efficient. For each of the possible global optimization
algorithms, killed(B) is best stored as a bit vector.
Unfortunately, the definition does not give a direct algorithm for computing
anticipation. There are two algorithms in current use. Both will be presented here. The first
one is given in most compiler textbooks. The second one is the one recommended for this
compiler. However, the time/space trade-offs are such that a switch to the first algorithm
may be necessary to improve performance. In other words, the author may be wrong.
To describe anticipation in terms of formulas, consider the Boolean variables
ANTIN(B) and ANTOUT(B). ANTIN(B) is true if and only if the temporary T is anticipated
at the beginning of block B. Correspondingly, ANTOUT(B) is true if and only if the
temporary T is anticipated at the end of block B. What does it mean for ANTOUT(B) to be
true? Each path leaving B has a definition of T not preceded by a modification of T. If one
looks at the next block on the path, this means that ANTIN(S) is true for each successor of
B. Conversely, if ANTIN(S) is true for each successor then consider any path leaving B.
The next block in the path is one of the successors, S, and there are no computations
between the end of B and S, so each path leaving B has a computation of T before any
modification of T.
Now consider a path leaving the beginning of B. This path must travel through B to
reach the end of B. Three different events can happen as the path traverses B:
1. There is no instruction in B that either defines T or kills T. In that case nothing
happens to T in the block, so the path satisfies the anticipation definition if and only
if it satisfies the same definition at the end of the block; in other words, ANTIN(B) =
ANTOUT(B).
2. There is an instruction in B that defines T before any instruction that kills T, that is,
T ∈ local_anticipated(B). Since any path starting at the beginning of the block must
go through the block, this means that ANTIN(B) = true.
3. There is an instruction in B that kills T before there is any instruction that defines T.
(Whether there is an instruction in B that defines T is irrelevant.) Again the block
itself is the start of each path, so ANTIN(B) = false.
All of these conditions can be summarized in the set of equations in Figure 4.15.
The equations are a direct transcription of the analysis in the form of equations.
Unfortunately, there is not a unique solution to the equations.
Consider the flow graph fragment in Figure 4.16. From the definition, one has
ANTOUT(B1) = true. But consider the equations. If one inserts the known value for B3 and
eliminates the redundant equation, one gets two solutions to the equations (see Table
4.1). Which solution represents the collection of points where anticipation is true?
Figure 4.15 Anticipatability Equations
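The equations themselves did not survive reproduction here; a reconstruction from the three cases above (where ANTLOC(B) is the local anticipatability information and KILL(B) is true when some instruction in B kills T) is:

ANTOUT(B) = false, if B is the exit block
ANTOUT(B) = ∧ { ANTIN(S) : S a successor of B }, otherwise
ANTIN(B) = ANTLOC(B) ∨ (¬KILL(B) ∧ ANTOUT(B))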
Observation 1:
Given a solution of the anticipation equations, if ANTIN(B) is true then the
expression is anticipated at the beginning of B. Similarly, if ANTOUT(B) is true then the
expression is anticipated at the end of B.
Proof
We need to verify that the definition is satisfied. Consider any path starting at the
start (or end) of B and ending at the exit block. By assumption ANTIN(B) is true. Scan
through the blocks in the path, stopping first at the beginning of the block and then the end
of the block. If a block is reached when ANTLOC is true, stop, because the definition is
satisfied. By the equations, if ANTOUT of a block is true then so is ANTIN of the next
block, so the value can only change from true to false as we scan through a block. Stop at
the first block W where either Kill(W) is true or ANTOUT(W) is false. Consider the
following two cases:
Consider the case where Kill(W) is true. Since ANTIN(W) is true, the
expression must be in ANTLOC(W) to get the true value for ANTIN(W). Thus an
evaluation occurs at the beginning of W, verifying the definition.
Consider the case where ANTOUT(W) is false. Since ANTIN(W) is true, the expression
must be in ANTLOC(W) to get the true value for ANTIN(W), again verifying the definition.
In either case we have a path beginning with a sequence of blocks in which there are no
modifications to the operands of the expressions and ending with a block that contains the
expression before instructions that modify the operands. The definition is satisfied.
Observation 2:
Assume that ANTIN(B), ANTOUT(B) is the maximum solution to the anticipation
equations. If ANTIN(B) is false then T is not anticipated at the beginning of B.
Proof
The proof is given in Appendix A.
Table 4.1 Two Solutions to the Anticipation Equations
Solution 1: false, false, false, true
Solution 2: true, true, true, true
Observation 3:
Assume PANTIN(B) and PANTOUT(B) are a set of Boolean values that satisfy
the equations in Figure 4.17. If PANTIN(B) (correspondingly, PANTOUT(B)) is false, then
T is not partially anticipated at that point.
Proof
Given a solution to the equations, assume PANTIN(B) is false. Consider any path
from that point to Exit. We need to show that we reach a killing instruction or Exit before
we find an evaluation of T. Assume that we reach an evaluation of T before a killing
instruction. That means we find a block P with ANTLOC(P) = true. Now walk backward.
Since the values are a solution to the equations, PANTOUT of the previous block is true
also because it is a union operation. By assumption there is no killing instruction in the
block, so PANTIN is true. Repeat this whole process, walking backward until we reach the
original point, where we find PANTIN true rather than false. This is a contradiction, so the
assumption that there is an evaluation of T before a killing instruction is false. There is no
such evaluation, so T is not partially anticipated.
Figure 4.17 Partial Anticipation Equations
Observation 4:
Let PANTIN(B) (respectively, PANTOUT(B)) be true if and only if T is partially
anticipated at the beginning (respectively, the end) of B. Then this set of values is a
solution to the equation in Figure 4.17.
Proof
We must verify that the values satisfy the equations. Assume they do not. Then
there is a block B where the equations are not satisfied. Now look at the possibilities. The
equation for PANTOUT(B) is satisfied by the nature of the definition. Similarly, the
definition implies that the equation for PANTIN(B) is satisfied. Thus we have a contradiction.
Observation 5:
Let PANTIN(B) and PANTOUT(B) be the smallest solution to the equations in Figure
4.17. Then T is partially anticipated at the beginning of B if and only if PANTIN(B) is true.
Proof
This argument mimics the argument for anticipation in the Appendix. The roles of
true and false are switched and the smallest solution is used rather than the largest.
Definition
Available:
A temporary T is available at a point p in the flow graph if and only if given any path
from Entry to p there is an evaluation of T on the path that is not followed by any
instruction that kills T.
Definition
Partially Available:
A temporary T is partially available at a point p in the flow graph if and only if there
is some path from Entry to p with an evaluation of T on that path that is not followed by
any instruction that kills T.
To illustrate these ideas, consider the flow graph of the running example as shown
in Figure 4.18. The temporary T is available at the beginning of block B2 since every path
(including the ones that go through B2 and come back again) from B0 to B2 contains an
evaluation of T. The temporary S is partially available at the beginning of B2 since the
path B0, B1, B2, B3, B2 contains an evaluation of S that is not followed by an instruction
that kills it. It is interesting that this evaluation is in B2, and we will later see that this is a
potential condition for moving code out of the loop.
Figure 4.18 Flow Graph of Running Example
The reasoning that led to the equations for anticipatability can be used to create the
equations for availability. The only differences are that predecessors are used rather than
successors, and the reasoning involves paths from Entry rather than paths to Exit. This
gives us the equations in Figure 4.19. As before, if one takes the largest solution,
AVIN(B) (respectively, AVOUT(B)) is true if and only if T is available at the beginning
(respectively, end) of block B.
The equations for partial availability are derived by the same techniques as used
for partial anticipatability, giving the equations in Figure 4.20. With the smallest solution to
the equations in Figure 4.20, PAVIN(B) (respectively, PAVOUT(B)) is true if and only if T is
partially available at the beginning (respectively, end) of B.
The work-list algorithm for partial anticipatability maintains a set, WORKLIST,
of all blocks whose predecessors have not been investigated. The work-list loop takes an
arbitrary block B and considers each of its predecessors P. If P is not transparent, then T
will not be partially anticipated at the beginning of P unless it is locally anticipated there
already.
Note that WORKLIST ⊆ PANTIN, because elements are added to both at the same time
and elements are never removed from PANTIN. Thus each block can only be added to
WORKLIST once; hence, the algorithm has time complexity proportional to the number of
edges.
For efficient compilation, the structure of each of the sets is important. Consider
WORKLIST first. The operations performed on WORKLIST are insertion, deletion,
membership, initialization to the empty set, and testing for the empty set. The maximum
number of elements in this set is the number of blocks and there is only one instance of
the set. Assign distinct integer values to each of the blocks and use the Briggs set
algorithm to implement WORKLIST.
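A sketch of this computation for a single temporary T (in Python; ordinary sets and a list stand in for the Briggs representation):

def partially_anticipated(preds, antloc, transparent):
    # preds[b]    : predecessor blocks of b
    # antloc      : blocks where T is locally anticipated
    # transparent : blocks containing no instruction that kills T
    pantin = set(antloc)            # seed with locally anticipated blocks
    worklist = list(pantin)
    while worklist:
        b = worklist.pop()
        for p in preds.get(b, ()):
            # T propagates backward into p only if p kills nothing; a
            # non-transparent block is in PANTIN only when seeded above.
            if p in transparent and p not in pantin:
                pantin.add(p)       # each block enters the work list once,
                worklist.append(p)  # so the time is linear in the edges
    return pantin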
Example
Consider the temporary S in Figure 4.18. The only instruction that kills S is in B4,
so the backward walk implemented by the algorithm indicates that S is partially anticipated
at the beginning of blocks B2, B1, B0, B3, and B6. For partial availability one uses a
forward walk, so S is partially available at the end of blocks B2, B3, and B6.
To see that the algorithm is correct, consider the following argument. Consider a
block B that is in PANTIN but not in ANTIN. There is thus some path from the beginning of
B to Exit that does not contain an evaluation of T before any instruction that kills T. Walk
down that path. Since B ∈ PANTIN initially, you come to one of the following situations:
One arrives at a block that is not in PANTIN. The previous block is in PANTIN, so
the first loop in Figure 4.23 will identify that the preceding block is not in ANTIN and
remove it from the set, placing it in the work list to remove its predecessors on the
path.
One arrives at a block containing a killing instruction. Since we assumed that this
path did not contain a preceding evaluation of T, T is not locally anticipated in the
block and therefore the block is not in PANTIN, reducing to the previous case.
Thus the first loop initializes the work list by identifying the blocks at the boundaries of
paths that violate the definition. Then the work-list algorithm walks backward along the
path, successively removing each block until the block B is removed, as required by the
definition.
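A compact sketch of this refinement (in Python; the violation test folds together the figure's initialization loop and work-list loop, which are not reproduced here):

def anticipated(succs, preds, pantin, antloc, transparent):
    antin = set(pantin)

    def violates(b):                # ANTIN(b) cannot remain true if ...
        return (b not in antloc and
                (b not in transparent or not succs.get(b) or
                 any(s not in antin for s in succs[b])))

    worklist = [b for b in antin if violates(b)]
    for b in worklist:
        antin.discard(b)
    while worklist:
        b = worklist.pop()
        for p in preds.get(b, ()):  # removing b may invalidate p
            if p in antin and violates(p):
                antin.discard(p)
                worklist.append(p)
    return antin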
Example
In the example in Figure 4.18, consider the temporary S. We have already
computed the points where it is partially anticipated. We must go through that list and
throw out any block whose successors are not all in the list. Thus B3 is thrown out
because of B4, B6 is thrown out because of B3, B1 is thrown out because of B4, and B0 is
thrown out because of B5. Hence S is only anticipated at the beginning of B2.
Now consider the availability of S. It is partially available at the end of blocks B2,
B3, and B6. In this case there is nothing to throw out. B2 remains because S is locally
available in B2. The only predecessor of B6 is B2. The predecessors of B3 are B2 and B6,
where neither of those blocks kills or generates S. Hence the partially available
expressions are the available ones in this case.
Global is the set of temporaries that are live at the beginning of a block. This is the
same as the union of all of the LiveIn sets.
This information can be computed by scanning each block backward. Whenever a
use of a temporary is identified, mark the temporary as live. Whenever an evaluation of T
is found, add T to the LiveKill set and mark the temporary as not live. As the local
information is collected, the set Global is computed. It consists of all temporaries that are
live at the beginning of some block. The local information is summarized in Figure 4.24.
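A sketch of the backward scan for one block (in Python; LiveIn is the set of temporaries live at the beginning of the block, and Global is the union of the LiveIn sets over all blocks):

def live_local_info(block):
    live_in, live_kill = set(), set()
    for target, operands in reversed(block):
        live_kill.add(target)       # an evaluation of target in this block
        live_in.discard(target)     # so it is not live above this point
        for t in operands:
            live_in.add(t)          # a use: live above this point
    return live_in, live_kill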
The global information is computed in the same way as partial anticipation. Each
temporary is handled separately; however, there is no need to compute anything unless T
∈ Global, because otherwise the temporary is only live within a block, not across block
boundaries.
The algorithm is given in Figure 4.25.
Figure 4.24 Setup to Compute Lifetimes
To compute the more general definition of lifetime where a temporary is live from
an evaluation of T to a use of T, one needs to solve another condition, which is similar to
partial anticipatability. One must compute the set of blocks where there is a path to the
beginning of the block from an evaluation of T that does not include an instruction that kills
T. We will be using this information during register allocation when the compiler is only
dealing with single temporaries, so there is no need to consider which instructions kill a
temporary. The only thing in that case that kills a temporary is another evaluation of the
temporary. The problem thus reduces to a depth-first search starting from the evaluation
of a temporary. Any block that is marked live by the work-list algorithm in Figure 4.25 and
occurs on the depth-first search walk has the more general property of liveness that we
need for register allocation. However, performing the depth-first search alone may visit
a large number of blocks. One gets the same result by performing the following
process:
1. Calculate the set of blocks where T is live at the beginning of the block using the
work-list algorithm in Figure 4.25.
2. Perform a depth-first search starting at each evaluation of T, but only visit blocks
where the work-list algorithm indicated that T might be live.
3. Remove all blocks computed by the work-list algorithm that are not visited during
the depth-first search.
4. The result is the set of blocks where T is live at the beginning of the block.
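A sketch of steps 2 and 3 (in Python; may_live is the result of step 1, and eval_blocks are the blocks containing an evaluation of T):

def pruned_live_on_entry(may_live, succs, eval_blocks):
    visited = set()
    stack = list(eval_blocks)       # step 2: start at evaluations of T
    while stack:
        b = stack.pop()
        for s in succs.get(b, ()):
            if s in may_live and s not in visited:
                visited.add(s)      # T reaches the entry of s from a def
                stack.append(s)
    return visited                  # step 4: T is live on entry to these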
Thus we have the general definition of live easily implemented from the more
straightforward definition. Unfortunately, the work list naturally computes the set of blocks
where T is live at the beginning of the block. The compiler always needs the set of blocks
where T is live at the end of the block, or more correctly, it will need the set of temporaries
live at the end of each block. This can be computed from the other information by the
algorithm in Figure 4.26.
Note that T is in LiveOut(B) if and only if T is live on entry to one of the successors
and one can get to B by a depth-first search from some evaluation. The last nested loops
in the algorithm compute this fact. LiveOut is a sparse set, so it should be implemented as
a linked list. The test for membership is easy because all entries for T are made before
entries for any other temporary; thus, if T has already been added to LiveOut for a block, it
is the element at the head of the list. Therefore one need only check the head of the list to
see if T has already been added.
Figure 4.26 Total Lifetime Algorithm
4.14 References
Aho, A. V., and J. D. Ullman. 1977. Principles of compiler design. Reading, MA: Addison-Wesley.
Chow, F. 1983. A portable machine-independent optimizer: Design and measurements. Ph.D.
diss., Stanford University.
Drechsler, K.-H., and M. P. Stadel. 1988. A solution to a problem with Morel and
Renvoise's "Global optimization by suppression of partial redundancies." ACM Transactions
on Programming Languages and Systems 10(4): 635-640.
Morel, E., and C. Renvoise. 1979. Global optimization by suppression of partial
redundancies. Communications of the ACM 22(2): 96-103.
The compiler tracks the values in temporaries within the current block and applies
algebraic identities. In this case the compiler knows that T5 has the value 1. The next
computation asks whether T5 > T8, which the compiler knows is 1 > T8. The compiler
knows that this is the same as 0 ≥ T8.
After building the program flow graph and performing the initial aliasing analysis,
the compiler performs local optimizations and transformations to improve the structure of
the program flow graph for the rest of the optimizer and decrease its size so that it takes
less time and space. After cleaning up the program flow graph, this phase will perform
global constant propagation and folding so that later phases have complete constant
information.
Note that most of the algebraic simplifications are applied to integer arithmetic. They
must also be applied to floating-point arithmetic; however, the compiler must be careful on
two points.
The arithmetic must be done at compile time exactly the same way it is done at
runtime. Usually this is not a problem. It is a problem if the compiler is not running on the
same machine as the machine which executes the program (a cross compiler). It is also a
problem if the floating point rounding mode can change. If the rounding mode is not
known, constant folding should be avoided.
Precise IEEE floating arithmetic can also be a problem. Full IEEE arithmetic
includes the representation of infinities and NaN (Not a Number). The compiler must avoid
the evaluation of expressions where one of the operands may be a NaN. It must even
avoid replacing 0 * X by 0 if X might be a NaN.
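A hedged sketch of both cautions for a single floating multiply (in Python; the two flags stand for analyses the text leaves to the compiler writer and are assumptions of this sketch):

def fold_float_mul(a, b, same_arithmetic, may_be_nan):
    # a, b            : compile-time values, or None when unknown
    # same_arithmetic : compile-time arithmetic provably matches the
    #                   target (same format and known rounding mode)
    # may_be_nan      : the unknown operand could be a NaN at runtime
    if a is not None and b is not None:
        return a * b if same_arithmetic else None
    if (a == 0.0 or b == 0.0) and not may_be_nan:
        return 0.0                  # 0 * X -> 0 only if X cannot be NaN
    return None                     # otherwise keep the runtime multiply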
How does the compiler know when two instructions are equivalent? There are limits
to the analysis that the compiler can do. For the purposes of code generation the analysis
is simple:
Two instructions without side effects are equivalent if they involve the same
operator and have equivalent inputs.
Two load instructions are equivalent if they load equivalent addresses and no store
operation has occurred between them that might modify that storage location.
When in doubt, declare that two instructions are not equivalent. For example, a
procedure call may change a number of variables that are visible to it or procedures that it
might call. All such variables must be assumed to change at the procedure call.
To implement the value-numbering scheme, the compiler needs to construct tables
that will compute this information quickly. The data structures will be described in abstract
terms; however, the implementation is simple. The temporaries are represented as small
integers, representing indices into tables. Each abstract data structure can thus be
represented as an array or a chain-linked hash table. The following data structures are
needed:
constant_temporary(temporary) is a data structure that, given a temporary, returns
one of the three following classes of values. It returns top, or ⊤, if the temporary does not
contain a value already computed in this block. It can return the value bottom, or ⊥, which
indicates that the temporary has been computed in this block but does not have a
constant value. Or it can return the constant value that was assigned to the temporary.
This is the same information that we will use later when doing global constant
propagation. It is used here to combine the answers to these questions: Does the
temporary have a constant value? and What is the constant value associated with the
temporary? This can be implemented as an array in which each entry is a class or record
indicating one of the two alternative values or the value of the constant. The table value is
filled in each time a temporary is the target of an instruction.
value_number(temporary) is a data structure that gives the value number
associated with the particular temporary. It can be implemented as an array of integers.
An entry is filled in each time the temporary is the target of an instruction or an instruction
occurs with side effects that invalidate a previous value number.
defining_instruction(temporary) is a data structure that returns the instruction that
most recently defined the temporary in this block. It is updated as each instruction is
generated. If another instruction forces the value number of a temporary to change (due to
making the value unknown) or if there is no definition of the temporary in the block, then
the entry is NULL.
These data structures are used during the generation of the intermediate
representation. As an example, consider the generation of a binary operation using the
instruction-generation procedure, binary_instruct, discussed in Chapter 4. Its
implementation will look like the following pseudo-code:
temporary binary_instruct (opcode, first_operand, second_operand)
   if (constant_temporary(first_operand) is constant) and
         (constant_temporary(second_operand) is constant) then
      Get temporary T for loading folded constant from formal
         temporary table;
      Generate iLDC of folded constant into T;
      return T;
   endif;
   Get temporary T for (opcode, first_operand, second_operand) from
      formal temporary table;
   if value_number(T) == NULL then
      Generate the instruction I evaluating T;
      value_number(T) = new value number V;
      defining_instruction(T) = I;
   endif;
   return T;
end procedure;
Generating a register copy operation or store operation must destroy the value
numbers for any temporaries that use that temporary as an operand. This information is
available in the formal temporary table.
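A sketch of that invalidation (in Python; used_by stands for the operand index that the formal temporary table provides and is a name of my choosing):

def invalidate_after_copy(target, used_by, value_number, defining_instruction):
    # A copy or store into target destroys the value numbers of every
    # temporary that has target as an operand.
    for t in used_by.get(target, ()):
        value_number[t] = None
        defining_instruction[t] = None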
5.4 References
Bagwell, J. T., Jr. 1970. Local optimization. SIGPLAN Notices 5(7): 52-66.
Frailey, D. J. 1970. Expression optimization using unary complement operators. SIGPLAN
Notices 5(7): 67-85.
This compiler implements dependence analysis, so it can notice that A(N) in
Figure 6.1 is not modified by the store into A(I), since the value of I is always less than N in
the loop. However, that is not done in this section: using only the techniques in this section,
the compiler will be unable to differentiate A(I) from A(N), just as compilers without
dependence analysis cannot. However, the compiler will identify the following situations:
When the compiler knows that two addresses are distinct, then no modifies
relationship will exist between a store and a load. For example, B(I) and B(I+1) are not
related by the modifies relation. When the compiler is not sure, it must assume that there
is a modifies relationship, so A(I) and A(J) must be assumed to be related unless the
compiler knows something about the ranges of values that I and J can take.
Figure 6.1 Example for Describing Aliasing
The compiler knows that two fields of a nonoverlaid structure cannot be related by
the modifies relation because they are different offsets from the same address.
The compiler knows that the modifies relation is not transitive. A store into A(I) indicates
that the whole array A is modified. The modification of A indicates that A(I+1) is potentially
modified. However, the transitive conclusion, that a modification of A(I) indicates A(I+1) is
modified, is false.
Source language restrictions must be taken into account. In C, pointers are typed.
Except for pointers to characters (which can point to anything for historical reasons), a
storage modification using a pointer can only modify locations of the same type and data
structures containing locations of the same type.
The modification information can be expressed as a conjunction (logical AND) of
several different factors:
Consider two different memory references X and Y. If the address of X is never
taken in the source procedure and X and Y are not related by some overlaying
characteristic of the source language, then a store to X cannot modify the location Y. This
is the only modification information used during the creation of the flow graph. It allows the
identification of local variables that can be stored in temporaries and allows some
efficiency gain by the use of value numbering within a block during the creation of the flow
graph.
There are language-specific rules that also limit the locations that a store to X can
affect. In Fortran, the compiler is free to assume that a store to a global variable does not
modify a dummy argument (formal parameter). Furthermore, the compiler can assume
that a store to a dummy argument does not affect another dummy argument or a global
variable. In ANSI C, the compiler can assume that a store through a pointer of one type
does not affect a load or store through a pointer of another type unless one of the types is
a pointer to a character. The compiler is free to use these rules because the language
definition indicates that violation of the rules is a language error, in which case the
compiler is free to do whatever it wishes.
A store to X cannot affect a load or store to Y if X and Y are different offsets from
the beginning of the same area of storage. Of course, the difference in offsets must be
large enough so that no bit affected by X is in the storage area associated with Y.
These three conditions represent three very different conditions on the store
operation. If one of the conditions is not satisfied, then a store to X does not affect the load
or store of Y. Thus the modification relation is the conjunction (logical AND or set
intersection) of different conditions.
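As an illustrative sketch (in Python), the conjunction might look as follows; the Ref record and the three tests are simplified stand-ins for the conditions above, each answering conservatively when unsure:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Ref:                   # hypothetical memory-reference record
    base: str                # storage area (variable, COMMON block, ...)
    offset: Optional[int]    # byte offset within base, None if unknown
    size: int                # bytes read or written
    ctype: str               # source-language type
    address_taken: bool      # the address may escape

def may_modify(x, y):
    # Condition 1: distinct areas with no escaping address cannot overlap.
    if x.base != y.base and not (x.address_taken or y.address_taken):
        return False
    # Condition 2: language rules, e.g., ANSI C typed pointers.
    if x.ctype != y.ctype and 'char*' not in (x.ctype, y.ctype):
        return False
    # Condition 3: provably disjoint offset ranges in the same area.
    if (x.base == y.base and x.offset is not None and y.offset is not None
            and (x.offset + x.size <= y.offset or
                 y.offset + y.size <= x.offset)):
        return False
    return True              # conservatively assume a modification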
This property can be used to refine the modification information as the program
progresses through the compiler. In other words, computing the modifies attributes is a
process of successive refinement. Early in the compilation process, a less refined version
of the modification information is used; in fact, one based on the previous three conditions.
Later, more refined information is used that involves analysis of the flow graph. Finally,
dependence analysis is used to give the most refined information. This dependence
information is used only in some of the phases of the compiler since the more accurate
information is not needed in many of the phases.
First, in a language such as Fortran, each datum visible to a
procedure can only be named in one fashion. This means that the compiler can assume
that each store operation involving a formal parameter does not have the modifies
relationship with any other formal parameter or global variable named in the procedure.
Second, interprocedural analysis records the actions of procedure calls. In the
absence of interprocedural analysis, the compiler must assume that every datum
addressable by the called procedure has a STORE executed and a LOAD executed.
Hence the procedure call is modeled as a collection of simultaneous STORE and LOAD
instructions.
With interprocedural analysis, the compiler estimates which data are modified and
referenced by each procedure call. With this information, the compiler can model a
procedure call by a smaller set of store and load operations. The store operations
represent the data that might be modified by the procedure call, and the load operations
represent the data that might be referenced by the procedure call.
The formal temporary table is an acyclic graph, with the load operations being the
leaves of the graph and the store or copy operations being the roots. Rather than making
additions to the flow graph, the modification information is stored as the modifies
information that we described earlier. The algorithms operating on the normal form of the
flow graph will use this modifies information directly to restrict optimizations. This was
done earlier when the compiler computed local information for each block in the flow
graph.
The modifies relation is recorded in terms of the formal temporary table. Each store
and copy operation will have an added attribute (called modifies) that records the set of
load and store operations in the formal computation that have a modifies relationship with
this operation. The set of instructions that this operation modifies is the set of load
operations in its modifies set together with all instructions that use the value generated by
that load operation either directly or indirectly. Actually the temporary that is the result of
the load is used to represent the information. Remember that there is a one-to-one
correspondence between the formal load operations and temporaries.
Heap: There is a tag for each type of data stored in the heap. The word type refers
to the source language type. This allows the compiler to distinguish elements of different
types when dealing with pointer dereferencing.
COMMON block
Array
Structure or union
Atomic: Represents scalars and pointers in memory
Parent: This field is a pointer to the tag representing the parent tag that includes
this tag. If there is no parent then the entry is NULL. The following tags have parents:
An array access has the tag for the whole array as the parent.
A structure or union access has the tag for the whole structure as a parent.
An entry in a COMMON block has the tag for the whole COMMON block as
the parent.
Any variable allocated to the runtime stack has the tag for the runtime stack
as a parent.
Any datum residing in the heap (and which is not a component of a larger
object) has a parent that indicates a heap object of that type.
Children: This field lists all of the tags that have this tag as their parent. This is the
reverse of the parent attribute and is used to scan all of the children whenever necessary.
Offset: If the tag represents memory that is a constant offset within the parent tag,
then the offset is placed in this field. Fields of structures or unions are constant offsets
from the start of the structure or union. Similarly, references to arrays using constant
subscripts give a constant offset from the beginning of the array. If there is no constant
offset, then a distinguished value is stored in the field to represent the fact that the offset is
not known.
Size: The size of the datum represented by the tag is stored in this field. For
example, on a 32-bit machine, an integer and float will have a size of 4 bytes, whereas a
double-precision number has a size of 8 bytes. Size information is stored for structures
and COMMON blocks also. For arrays whose size is not known at compile time, a
distinguished value representing an unknown size is inserted.
The compiler cannot always determine all variables whose addresses might have been taken. If there is a procedure call
that is separately compiled, then the address might be taken inside the procedure call and stored in a globally available variable. In this
case the compiler must assume that all addresses that are visible to the procedure (or any procedure that it calls) might be taken.
To handle other cases, the compiler uses a safe approximation to the eventual
modifies information. As the flow graph is built, the compiler associates a tag with each
symbolic memory address encountered. It performs value numbering on single blocks as
was described during the construction of the flow graph.
Definition
Points-To Set:
Consider any tag or temporary X. The set of tags to which X can point is the
points-to set for X, or PT(X). When X is not used as a pointer, PT(X) is not needed.
PT(X) is flow-insensitive information. There is no indication of the point in the flow
graph where this information is being used; it is aggregated over the whole flow graph or
the whole program. As we will see shortly, the information will be more precise if it is
computed simultaneously over all of the flow graphs for all of the procedures in the
program.
The basic algorithm is simple, so I will describe it first. Then I will adjust the
algorithm to take care of the problems that we will see in the original description. Initialize
all of the sets PT(X) to the empty set, Ø. Now scan through the flow graph in either
normal form or static single assignment form. Scan through the instructions in any order
and consider the targets of each instruction.
The instructions can be divided into multiple classes. The largest class of
instructions consists of those that can never be used in an address computation; they add
nothing to the set PT(X) for any X that is a target. The second class
includes the instructions that have some unspecified effects such as procedure calls. In
this situation the compiler adjusts the PT(X) of each tag or temporary that might be
modified in the procedure call. When processing a single flow graph at a time, this means
that the set of all memory locations that might be referenced within the procedure call
must be added to each PT(X) for each tag or temporary of the same value type. When
dealing with all procedures simultaneously, the procedure call can be viewed as a set of
copy operations from the actual parameters to the formal parameters. Since the algorithm
is flow insensitive, the processing of these copies together with the later processing of the
body of the procedure will correctly update all of the PT(X) sets.
The last set of instructions are the instructions that might be involved in an address
computation. This includes load and store operations, load constant operations, addition
and subtraction of expressions, and the register-to-register copy operations. If the
instruction is the load of an address constant, then find the corresponding tag and add it to
the PT(X) for each temporary or tag that is an output of the instruction. If it is a
register-to-register copy operation, add PT(Y) to PT(X), where Y is the operand of the
instruction. If the instruction is a load instruction, add PT(tag) to PT(X), where tag is the
tag of the memory location and X is the output variable. Addition and subtraction can be
considered not to change the pointer for the purposes of computing PT(X).
As usual, the most difficulty occurs with the store operations. Performing a store
adds PT(Y) to PT(tag), where Y is the operand being stored and tag is the tag for the
primary address in the store operation. There are two cases in considering the other tags
associated with the store. If a tag represents a memory location that is aligned in the same
manner as the primary tag and is the same size, then PT(Y) can be added to PT(tag) also.
Where is the problem in this algorithm? What the compiler really wants is to merge
the final PT(Y) into PT(X) if Y is an address operand for an instruction computing an
address X. What the algorithm above does is merge the current value (at various points in
the algorithm) into PT(X). The way to handle this is to build a directed graph, which we will
call the address graph (the name is not standard). The nodes of the address graph are the
tags and temporaries in the set of flow graphs being processed, and there is an edge from
Y to X if Y is an address expression (or tag containing a pointer) used to compute X.
Instead of scanning the flow graph and updating the PT(X) sets as we go, the compiler
scans the instructions, inserting the constants and tags representing memory allocation
instructions into the corresponding PT(X) and building this graph as it goes.
Given this graph, the collection of PT(X) sets can be computed in one of two ways
(your choice). The first technique is a work-list algorithm. Place all of the nodes that
contain non-empty PT(X) sets on the work list; these are the nodes whose sets contain
address constants or memory-allocation tags. Then process this work list one node at a
time. Assume the compiler is processing the element Y. Then PT(Y) is added to PT(X) for
each of the successors X of Y in the address graph. X is added to the work list if PT(X)
changes.
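A sketch of the work-list technique (in Python; pt and succ are assumed to have been seeded while scanning the instructions and building the address graph):

def propagate_points_to(pt, succ):
    # pt[n]   : points-to set of address-graph node n
    # succ[y] : nodes x with an edge y -> x (y helps compute address x)
    worklist = [n for n in pt if pt[n]]     # nodes with non-empty sets
    while worklist:
        y = worklist.pop()
        for x in succ.get(y, ()):
            px = pt.setdefault(x, set())
            before = len(px)
            px |= pt[y]                     # merge PT(y) into PT(x)
            if len(px) != before:
                worklist.append(x)          # x changed: revisit it
    return pt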
Another way of computing the sets is to compute the strongly connected
components of the reverse of the address graph. Each element X in a strongly
connected component has the same PT(X) value. By processing the nodes in reverse
postorder on this (reverse) graph and handling a strongly connected component as a
single node, the PT(X) values can be computed in a single depth-first search.
How does the compiler compute the modifies information for each copy operation? The
algorithm here is based on one used in the Rice Massive Scalar Compiler Project.
To compute this information use an auxiliary data structure called DEPENDS.
There is one DEPENDS set for each temporary. Consider two temporaries, T and S. The
temporary T is in DEPENDS(S) if T is a variable temporary that is used to compute S.
These sets can be computed by performing a walk of the flow graph. Recall that all of the
operands of any expression temporary must be computed on all paths leading to the
occurrence of the instruction computing that temporary. Thus, a depth-first search through
the flow graph will visit instructions computing the operands before instructions computing
S.
Initialize all of the DEPENDS sets to empty. Perform a depth-first search of the flow
graph. When processing any instruction computing an expression temporary, make the
DEPENDS set of the target be the union of the DEPENDS sets for the operands. When
processing an instruction that computes a variable temporary, make the DEPENDS set of
the target be the target itself. When the walk is completed all of the DEPENDS sets have
been computed.
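A sketch of the computation (in Python; blocks are assumed to be presented in the depth-first order the text describes, so an operand's DEPENDS set exists before it is needed):

def compute_depends(blocks_in_order, is_variable):
    depends = {}
    for block in blocks_in_order:           # blocks of (target, operands)
        for target, operands in block:
            if is_variable(target):
                depends[target] = {target}  # a variable depends on itself
            else:
                depends[target] = set()     # expression temporary: union
                for op in operands:         # of the operands' sets
                    depends[target] |= depends.get(op, set())
    return depends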
Now scan through the set of all tags, considering the address expression portion of
the tag. For the sake of discussion consider a single tag X with an address expression
computed into T. X is in the modifies set for each temporary in DEPENDS(T).
This algorithm is particularly helpful when arrays are involved or in-line functions
have been inserted. It will replace the translation of arrays to pointers and the copying of
pointers by array semantics when possible.
6.10 References
Steensgaard, B. 1996. Points-to analysis by type inference of programs with structures
and unions. In International Conference on Compiler Construction, Lecture Notes in
Computer Science 1060, 136-150.
An alternative technique called USE-DEF chains can also be used. It frequently requires more space and time to compute
and is harder to incrementally update.
Definition
Static Single Assignment Form:
The flow graph is in static single assignment form if each temporary is the target in
a single instruction.
The definition of static single assignment is so restrictive that most programs
cannot be translated into SSA form. Consider the left flow graph in Figure 7.1. There are
two assignments to the variable X: one outside the loop and one incrementing X inside the
loop. There is no way to put X into SSA form without the introduction of a new operator.
To ensure that all program flow graphs can be put in SSA form, another special
instruction, called a φ-node, is added to the definition of static single assignment form.
Definition
φ-node:
Consider a block B in the flow graph with predecessors {P1, P2, ..., Pn}, where n >
1. A φ-node T0 = φ(T1, T2, ..., Tn) in B is an instruction that gives T0 the value that Ti
contains on entrance to B if the execution path leading to B traverses the block Pi as the
predecessor of B on the path. The set of φ-nodes in B is denoted by Φ(B).
Figure 7.1 Graph (left) and SSA Graph Equivalent (right)
Consider the program flow graph on the right in Figure 7.1. This graph is equivalent
to the one on the left (it computes the same values) and is in SSA form. The variable X
has been replaced by four variables (X0, X1, X2, X3) and two φ-nodes that indicate the
points in the program reached by multiple definitions of X. One of the φ-nodes is at the
beginning of the loop because there is a modification of X inside the loop and it is
initialized outside the loop. The other φ-node occurs at the merge of two paths through the
loop, where only one path contains a definition of X.
A flow graph in SSA form is interpreted in the same way as a normal program flow
graph, with the addition of φ-nodes. Consider a path from Entry to Exit:
Each normal instruction is evaluated in order, recording the results of each instruction so
that these values can be used in the evaluation of later instructions on the path.
All φ-nodes at the beginning of a block are evaluated simultaneously on entrance to the
block. The value of target temporary T0 is Ti if the path came to the φ-node through the ith
predecessor of the block.
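The simultaneity matters when one φ-node's target is another's operand. A small sketch (in Python; the swap example is hypothetical): all operands are read before any target is written, so entering from the second predecessor really exchanges the two values.

def eval_phi_nodes(phi_nodes, values, pred_index):
    # Read every operand first, then write every target.
    reads = {target: values[args[pred_index]] for target, args in phi_nodes}
    values.update(reads)
    return values

values = {'x1': 1, 'y1': 2}
phis = [('x0', ['x1', 'y1']), ('y0', ['y1', 'x1'])]
print(eval_phi_nodes(phis, values, 1))      # x0 = 2, y0 = 1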
The next two sections describe the fundamental operations of translating a flow
graph into and out of static single assignment form. Two areas that are typically
overlooked in the literature are emphasized: the simultaneous evaluation of φ-nodes at the
beginning of a block, and the handling of abnormal edges in the flow graph.
Computing these points is not intuitive; thus, we now descend to a theoretical discussion. A more intuitive algorithm was
used in an earlier form of static single assignment called p-graphs. P-graphs had all of the characteristics of static single assignment;
however, computing the points for the insertion of the birth points was quadratic in the size of the graph, so it was not practical in most
compilers.
Definition
Converging Paths:
Two non-null paths, p from B0 to Bn and q from B′0 to B′m, converge at a block Z if and
only if
B0 ≠ B′0; in other words, the paths start at different points.
Bn = Z = B′m; in other words, both paths end at Z.
If Bi = B′j, then either i = n or j = m; in other words, the only point
on the paths that is in common is the end point. Note that one of the
paths may loop through Z and come back to it.
If I1 and I2 are assignments to T, then any basic block Z at which two converging
paths from I1 and I2 meet will be a point where a φ-node is inserted, because two
different definitions of T lie on distinct paths reaching that point. But one must go further. Z
now contains a new definition of T because the φ-node has been inserted, so it too must be
included in the set of definitions of T and the process of computing merge nodes repeated.
Using the notation of Cytron et al. (1991), one obtains the notation and formula
shown in Figure 7.2 for the points where φ-nodes need to be inserted for T. The notation
is an abstraction of the idea of the previous paragraph. The function J1 takes a set of
blocks as an argument and returns the set of merge points associated with those blocks.
However, this process must be repeated with the set of blocks together with the merge
points, giving J2. By the definition of merge points, if the argument is a larger set, then
the result is larger also; in other words, Ji(S) ⊆ Ji+1(S). Since there is a finite number of
blocks, there must come a point where Ji(S) = Ji+1(S). This will be true for all larger values
of i, so the formula represented as an infinite union actually represents the value of Ji(S)
at the point where the sets stop increasing in size.
It is too difficult to directly compute the merge points; another formulation is
needed. An efficient algorithm is based on dominance frontiers. One point before I discuss
the algorithm: In forming static single assignment form, each temporary is assumed to
have an evaluation at Entry. Think of this evaluation as the undefined evaluation. It is the
evaluation used when no evaluation has really occurred. If this evaluation is used as an
operand, then the operand has an undefined value. This cannot happen with expressions
or compiler-generated temporaries. It can happen with user-defined variables stored as
temporaries.
Here is the algorithm. Consider two evaluations of T in blocks B and B′. There are
three possibilities:
B dominates B′. Consider a merge point Z for these two blocks. There are disjoint
paths from B to Z and from B′ to Z. B′ cannot dominate Z, because then B′ would be on
the path from B to Z, contradicting disjointness.
A loose interpretation of this proof is as follows: Start at B. Move down the path
until one finds a member B1 of the dominance frontier of B. Now do the same thing starting
at B1. Continue this process until one reaches the end of the path. The last block of this
form on the path dominates all the following ones.
Lemma
Let B ≠ C be two blocks. If there are two paths, p from B to Z and q from C to Z, that
converge at Z, then Z ∈ DF+({B}) ∪ DF+({C}).
Proof
Using the previous lemma twice, choose a block B′ ∈ DF+(B) on p that dominates
Z, and a block C′ ∈ DF+(C) on q that dominates Z. There are three cases:
Suppose B′ is on the path q as well as on the path p. By the definition of two paths
converging, this means that B′ = Z, so Z ∈ DF+({B}).
Now recall the definitions of converging paths and the join set J(S). What this lemma
shows is that for any set S one has J(S) ⊆ DF+(S). Now consider the iteration we are
using to form DF+ from DF and, similarly, J+ from J. The sequence of sets
DF1(S) ⊆ DF2(S) ⊆ . . . is a sequence of increasing finite sets with an upper bound
being the set of all blocks in the graph. Thus there is a point in the sequence where the sets
no longer increase; in other words, there is a point where DFi = DFi+1. After this point the
sets will always continue to be the same because the inputs on each iteration are the same
as on the previous iteration. Also note that DF+(DF+(S)) = DF+(S). We thus have
J+(S) ⊆ DF+(S).
We can now compute the points at which to insert the φ-nodes by computing the
points in the iterated dominance frontier.
Computing the iterative dominance frontier can be performed using a work-list
algorithm, as shown in Figure 7.4. We have computed the dominance frontier for each
block B earlier. The dominance frontier of a set S is just the union of the dominance
frontiers of the elements in the set. The iterative dominance frontier means that we must
include the dominance frontier of any block that we add into the dominance frontier. This
is done by keeping a work list of blocks that have been added to the dominance frontier
but which have not been processed to add the elements of their dominance frontiers yet.
Since the algorithm is stated in an abstract fashion, I include a number of implementation
hints here:
The set DF+(S) is written in the algorithm to indicate that it is dependent on S. The
compiler will use it on one set at a time, so the algorithm takes a single set as input and
computes a single set as output. No indexing is needed.
The only operation performed on the set S is to scan through the elements to initialize
both the Worklist and DF+(S) sets, so it can be implemented using any technique that
allows accessing all members in linear time on the size of the set. In this case, the most
likely implementation is as a linked list.
The Worklist is a set in which the operations are adding an element to the set only
when it is known that the element is not in the set, and taking an arbitrary element from
the set. Note in the algorithm that an element is added to the Worklist at most once
because an element can be added to the DF+(S) at most once because of the conditional
test. The most likely implementation of Worklist is as an array implementing a stack. The
maximum size of the array is the number of blocks in the graph.
The implementation of DF+(S) is more subtle. The operations performed on it are
initializing to empty, inserting an element, and checking membership. Outside the
algorithm, one will need to scan through all the elements in the set. Since it is a subset of
the blocks in the graph, its maximum size is known. The most likely implementation for
this set uses the set membership algorithm described in Chapter 3. This set algorithm
requires that the elements be mapped to a sequence of integers, which can be done using
any of the numerical orderings we have computed for the blocks, such as reverse
postorder.
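Following those hints, a sketch of the work-list computation (in Python; a built-in set and list stand in for the Chapter 3 membership structure and the stack-like Worklist):

def iterated_dominance_frontier(S, DF):
    # S     : input set of blocks
    # DF[b] : the already-computed dominance frontier of block b
    df_plus = set()
    worklist = list(S)
    while worklist:
        b = worklist.pop()
        for d in DF.get(b, ()):
            if d not in df_plus:    # each block is added at most once,
                df_plus.add(d)      # so each block enters the work list
                worklist.append(d)  # at most once
    return df_plus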
Now that we know how to compute DF+(S), we can piece together the algorithm for
computing the places in which to put φ-nodes. The basic algorithm, as shown in Figure
7.5, is simple. Handle each temporary or memory location separately. Form the set of all
basic blocks that modify the value being considered. Compute the iterated dominance
frontier and then insert a φ-node at each block that is in it. Initially the
inserted node has the same left side and operands. In the renaming phase coming shortly,
these names will be changed so that the program satisfies the SSA form conditions.
The algorithm in Figure 7.5 inserts too many φ-nodes. It inserts the minimum
number of φ-nodes to guarantee that each temporary and variable always has the correct
value and the program satisfies the SSA form, but many of these φ-nodes will define
temporaries that have no uses, so the φ-nodes can be eliminated. Consider a temporary
T that is defined and used only in one basic block B. The algorithm will still insert φ-nodes
at the basic blocks in DF+({B}) even though no uses of T occur outside the block. These
extra nodes can be eliminated by dead-code elimination; however, they take up space in
the compiler and require time to generate, slowing the compiler down. There are two
techniques, shown in Figures 7.6 and 7.7, for eliminating some of these nodes.
The first improvement on the basic algorithm is given in Figure 7.6: Do not compute
φ-nodes to be inserted for temporaries that do not carry information across a block
boundary. If the same temporary is used in multiple blocks but no information is stored in it
at a block boundary, the renaming algorithm will change these into multiple temporaries
appropriately.
Figure 7.7 Inserting Fewest Nodes
Recall that Globals is the set of temporaries that holds a value at the beginning of
some block. This is still too coarse; φ-nodes will be inserted at blocks where the value will
not be used. In other words, φ-nodes need only be inserted where the temporary is live.
Figure 7.7 shows the modified algorithm that computes which blocks have T live and only
inserts φ-nodes in these blocks.
The work-list algorithm for computing Live given in Chapter 4 is well suited for this
algorithm. The algorithm only processes temporaries in Global and the work-list algorithm
can then be applied to the small set of temporaries.
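Putting the pieces together, a sketch of the pruned placement (in Python, reusing the iterated_dominance_frontier sketch above; the map names are mine):

def place_phi_nodes(globals_set, defs_in, live_in, DF):
    # globals_set : temporaries holding a value at some block boundary
    # defs_in[t]  : blocks containing an evaluation of t
    # live_in[t]  : blocks where t is live at the beginning
    phi = {}
    for t in globals_set:               # Figure 7.6: skip block-local temps
        for b in iterated_dominance_frontier(defs_in[t], DF):
            if b in live_in[t]:         # Figure 7.7: only where t is live
                phi.setdefault(b, set()).add(t)
    return phi                          # block -> temporaries needing phi-nodes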
The renaming algorithm depends on the dominance relationship between
when a temporary is evaluated and when it is used. Consider a temporary T in the original
program. After the φ-nodes are inserted, the uses of T can be divided into two groups:
1. The uses of T that occurred in the original program. All of these uses are
dominated by the definition that computes the value used. If this were not true, then
there would be another path to the use that avoids the definition, which would mean
that there is a point where separate paths from definitions converge between the
definition and the use, thus inserting another definition. In other words, each use is
dominated by an evaluation of T or a φ-node with target T.
2. The uses of T in a φ-node. To each such use there is a corresponding predecessor
block. This predecessor must be dominated by the definition of T for the same
reasons that normal uses of T are dominated.
The renaming algorithm thus reduces to a walk of the dominator tree (see Figure
7.8). Each time one sees a definition of a temporary, a new name is given to the
temporary, and that name replaces all of the uses of the temporary that occur in blocks
dominated by the definition. After the subtree dominated by the definition has been
walked, the previous name is restored so that other subtrees can be walked with the
previous name. Uses of a temporary in φ-nodes are handled in the predecessor block. When
a block is traversed, all of the φ-nodes in each successor are traversed. Uses of a
temporary in the operand position corresponding to this (predecessor) block are renamed
in the same way that normal uses are renamed.
Figure 7.8 Basic SSA Renaming Algorithm
The whole point of static single assignment is to provide concise information about
the uses and definitions of temporaries, so we need to add attributes to the renaming
process that record this information. For each temporary there are two attributes,
Definition(T) and Uses(T). Definition(T) is the single instruction that defines T. Recall that
there can be more than one temporary defined by each instruction; however, there is only
one instruction that defines a particular temporary.
Uses(T) is the set of instructions that uses T. This is a set that most likely is
implemented as a linked list. Since each instruction is only visited once during the
renaming process, the only way that an instruction can be inserted twice into the set is
when the same operand is used two or more times in an instruction. I choose to let these
multiple insertions occur in the set because later a removal of one operand will only
remove one of the uses.
The form of NameStack is an implementation issue. NameStack is a collection of
stacks, one for each temporary. These stacks are implemented as linked lists to avoid
excessive storage. The Push operation adds an element to the head of the list, and the
Pop operation removes an element from the head of the list. Top looks at the element at
the head of the list.
If we are being mathematically pure, we should now prove a lemma that the
execution of the static single assignment form computes the same values on each path
through the flow graph as are computed with the normal form of the flow graph. The proof
is a clerical application of the ideas that we have discussed here, carefully checking that
the renaming algorithm is accurate to the execution of the flow graph. If you are not
convinced, then we leave the proof to you.
Also, one of the variables may be modified before it is later used on a different
path, as shown in Figure 7.11. In this example u, which is a copy of a1, is used later in the
program. It is eliminated by the optimizer and replaced by a use of a1. If a1 is assigned a
value at the end of B, then the value of a1 will be destroyed before its use. But note the
following:
If a block B has only one predecessor, then no φ-nodes can occur in B.
If a predecessor of B has only B as a successor, then there is no possible alternative path
out of that block.
Figure 7.10 Normal and Optimized SSA Form
Thus only critical edges can have this problem, so they must be eliminated before
translation by inserting an empty block in the middle of the edge. Since abnormal critical
edges cannot be removed, the optimization algorithms using SSA form must ensure that
there will be no need to insert a block on an abnormal critical edge.
The compiler must sort the copy operations so that uses of a temporary precede
their definitions, and it must create extra temporaries to handle copies among mutually
dependent temporaries. We describe a graph, R(B), to represent these two relationships.
The nodes of the graph are the members of the partition P. There is one node of
R(B) for each subset in the partition P that contains a temporary occurring in Φ(B).
Equivalent temporaries are represented by a single member of the partition.
There is an edge from FIND(Tk) to FIND(Tl) if there are temporaries Tk and Tl such
that Tk = φ(. . .) ∈ Φ(B) with Tl as the ith operand.
How does this graph describe the problem of ordering the copy operations? Each
node in the graph corresponds to the representative of a partition element that occurs as
the operand or target of some of the -nodes. Each representative can occur as the target
of at most one copy operation. If an ordering is found where uses occur before definitions,
then the copy operations can be generated in the same order. This is a topological sort of
the reverse graph.
Which nodes generate copy operations? In R(B), there is an edge out of a node if
and only if there is a copy operation. So each node in R(B) with a successor generates a
copy operation. The other nodes represent temporaries that are used but not defined.
What about the case in which there are mutually dependent temporaries? Then the
graph will have a strongly connected region and the topological sort will not succeed. The
strongly connected regions must be identified and extra temporaries must be introduced to
simulate simultaneous assignment.
The strongly connected regions have a special form because there is at most one
edge leaving each node. Look at the definition of an edge; only the ith operand counts,
and there can only be one assignment to any temporary in a subset in the partition. If
equivalent temporaries are assigned, then the operands must be equivalent. So there can
be at most one edge leaving a node in R(B). These two characteristics imply that the
strongly connected region has the following characteristics:
A strongly connected region is a simple cycle. There is a path from any member of
the region to any other member. Start at one of the nodes. Since there is only one edge
out, there is only one way to leave the node. As you walk from node to node, there are no
choices. Eventually you must get to the other node. If you keep walking from node to
node, you will eventually get back to the original node. You have a simple cycle.
The strongly connected region may have multiple predecessors outside the region,
but it can have no successors outside the region. The reasoning is the same as before.
Since there is only one edge out of each node and the strongly connected region is a
cycle, there is no way to get to any successors.
This makes the algorithm simpler. We can use the standard strongly connected
region algorithm4 to identify a reverse postorder for topological sorting and identify the
strongly connected regions. Each strongly connected region can be translated as follows:
4
Actually there are two related but distinct algorithms. Either one can be used. The one here is in most modern textbooks.
The original algorithm is by Tarjan (1972).
Enumerate the loop in some order where each successive node is a successor of the
previous one and the first node is a successor of the last. This can be performed during a
depth-first search.
1. Generate one extra temporary, T.
2. Generate an instruction to copy the temporary representing the first node into T.
3. Translate all of the other nodes except the last one as is done for the topologically
sorted nodes.
4. Generate an instruction to copy T into the temporary corresponding to the final
node.
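The four steps can be checked in miniature by simulating the values that the generated copy sequence computes. In this C sketch, temporaries are slots in an array t, and c lists the k nodes of the cycle in the enumeration order just described; the representation is an assumption made for illustration.

    /* Simulate the copies generated for the cycle
       t[c[0]] <- t[c[1]], t[c[1]] <- t[c[2]], ..., t[c[k-1]] <- t[c[0]]. */
    void translate_cycle(int t[], const int c[], int k) {
        int T = t[c[0]];               /* steps 1-2: save the first node in T */
        for (int i = 0; i < k - 1; i++)
            t[c[i]] = t[c[i + 1]];     /* step 3: uses precede redefinitions  */
        t[c[k - 1]] = T;               /* step 4: the final node receives T   */
    }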
Figure 7.13 Mutually Dependent Temporaries
The graph is represented by a set of nodes called NodeSet and two temporary
attributes called ELIM_PREDECESSORS and ELIM_SUCCESSORS, representing the predecessors and successors in the directed graph. NodeSet is implemented using the Briggs set
algorithm because we need to be able to efficiently scan the nodes, check for
membership, and insert a node. The predecessors and successors can be implemented
using either linked lists or arrays simulating linked lists. I recommend the latter or the use
of some collection-based memory allocation method because these data structures are
very temporary.
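The Briggs set mentioned here is, presumably, the sparse set representation of Briggs and Torczon: two arrays and a counter giving constant-time insertion, membership test, and clear, together with fast iteration over the dense array. A minimal sketch with a fixed universe size follows; declare instances with static storage (or clear the sparse array once) so the membership test is safe before the first insertion.

    #define UNIVERSE 1024

    typedef struct {
        int dense[UNIVERSE];   /* the members, packed for fast scanning */
        int sparse[UNIVERSE];  /* position of each element in dense[]   */
        int n;                 /* current number of members             */
    } SparseSet;

    int member(const SparseSet *s, int x) {
        return s->sparse[x] < s->n && s->dense[s->sparse[x]] == x;
    }

    void insert(SparseSet *s, int x) {
        if (!member(s, x)) {
            s->dense[s->n] = x;
            s->sparse[x] = s->n++;
        }
    }

    void clear(SparseSet *s) {
        s->n = 0;              /* constant time: no re-initialization */
    }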
The second part of the algorithm implements the topological sort and identifications
of strongly connected regions. These can be done in one algorithm. The topological sort
can be performed by pushing each node on a stack after all of its successors have been
walked in a depth-first search. The first element in the topological order is on top of the
stack, the second element is next on the stack, and so on. Hence the order can be found
by listing the elements in the order in which they are removed from this stack.
Figure 7.14 Converting Edge to Normal Form
The strongly connected regions can be identified using the same stack. Before
popping an element off the stack, perform a depth-first search using the predecessors
rather than the successors of a node. Do not visit any node more than once in this
predecessor walk. All of the unvisited nodes reached by this depth-first search of the
predecessors are the elements of the strongly connected region containing the
predecessor. The algorithm in Figure 7.16 is a transcription of this algorithm (Cormen,
Leiserson, and Rivest 1990).
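The two passes can be sketched compactly in C. This is an illustration of the idea rather than a transcription of Figure 7.16; the adjacency-matrix representation and the fixed graph size are simplifications chosen for brevity.

    #define N 8
    static int edge[N][N];     /* edge[u][v] != 0 means u -> v       */
    static int visited[N], stack[N], top;
    static int scc[N];         /* scc[v] = representative of region  */

    /* Pass 1: push each node after all of its successors are walked. */
    static void push_order(int u) {
        visited[u] = 1;
        for (int v = 0; v < N; v++)
            if (edge[u][v] && !visited[v]) push_order(v);
        stack[top++] = u;
    }

    /* Pass 2: walk predecessors to collect one strongly connected region. */
    static void pred_walk(int u, int root) {
        scc[u] = root;
        for (int v = 0; v < N; v++)
            if (edge[v][u] && scc[v] < 0) pred_walk(v, root);
    }

    void strongly_connected_regions(void) {
        for (int u = 0; u < N; u++) scc[u] = -1;
        for (int u = 0; u < N; u++)
            if (!visited[u]) push_order(u);
        while (top > 0) {      /* popping yields the topological order */
            int u = stack[--top];
            if (scc[u] < 0) pred_walk(u, u);
        }
    }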
There are three different possibilities when creating the copy to represent a node. If
the node has no successor, then there is no copy operation and the node can be ignored.
If the node has no unvisited predecessor, then it is a single node that is not in a strongly
connected region, so the copy operation can be generated where the operand is the
successor in the graph and the target is the current node.
The third possibility is a strongly connected region. In that case, perform a depth-first walk using the predecessors until you get back to the current node (see Figure 7.16).
Since a strongly connected region is known to be a cycle, this will describe the whole
strongly connected region. Before starting the walk create a temporary to hold the value of
the first node. Then generate all of the copies except the last after one has completely
visited a node (and its predecessors). This will force the copies to be generated in
topologically sorted order. The last copy uses the value held in the newly created
temporary as its operand. Note that the node at the head of the cycle is not officially
visited until the end of the depth-first walk. This forces the copy with the head as target to
be generated first.
As you will see in the following example, the additional temporary can be avoided if
there is a predecessor to the head. That temporary already holds the value of the target of
the first copy instruction and can be used in place of the generated temporary. The
algorithm in Figure 7.16 does not include this optimization to make the algorithm clearer;
the implementor should include it.
To see how the algorithm works, apply it to the set of φ-nodes in Figure 7.17. The φ-nodes are in the left column, with the corresponding auxiliary graph on the right side. Since there is no order among φ-nodes, the order of the nodes has been jumbled. Rather than using subscripted capital T names for the temporaries, single letters are used for distinct temporaries to make the graph easier to read.
The results for this example are given in Figure 7.18. The stack generated by the
first pass is given in the right column and the generated copies are given in the left. Recall
that most edges will not generate any copies at all because the algorithms will eliminate
them. This particular example was created to show as much about the algorithm as
possible. Note that H is not the target of a copy since it has no successor. Also note that
the new temporary U is not needed since E already holds that value.
Figure 7.17 Example Graph for an Edge
7.4 References
Aho, A. V., J. E. Hopcroft, and J. D. Ullman. 1983. Data structures and algorithms. Reading, MA: Addison-Wesley.
Cormen, T. H., C. E. Leiserson, and R. L. Rivest. 1990. Introduction to algorithms. New York: McGraw-Hill.
Cytron, R., J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. 1991. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems 13(4): 451-490.
Tarjan, R. E. 1972. Depth-first search and linear graph algorithms. SIAM Journal on Computing 1(2): 146-160.
Tarjan, R. E. 1975. Efficiency of a good but not linear set union algorithm. Journal of the ACM 22(2): 215-225.
Chapter 8 Global Optimization
The compiler now begins global optimization. Global optimization is divided into
four components: VALUE, DEPEND, RESHAPE, and MOTION, executed in order (Figure
8.1). VALUE simulates the execution of the flow graph. If it can determine that the value or
structure of an expression can be simplified, it replaces the expression with a simpler
form. DEPEND performs loop restructuring using dependence-based optimizations. It
relies on the simplifications performed by VALUE to make dependence analysis more
accurate. After loop transformations have been performed, the RESHAPE phase is
performed. RESHAPE includes all of the transformations in VALUE together with
expression reshaping and strength reduction. RESHAPE prepares for the code motion
performed in MOTION. MOTION performs code motion, including moving loads and
stores to complete the global optimization portion of the compiler.
This chapter describes the VALUE and RESHAPE phases of the optimizer. VALUE
limits its transformations so that DEPEND can operate more effectively. It does not do
code motion, because the loop structure may change dramatically in DEPEND, and it
does not do strength reduction, because DEPEND relies on the original form of
expressions for analyzing subscripts.
RESHAPE includes all of VALUE. It adds strength reduction to replace multiplications in loops with repeated additions. It also applies the distributive and
associative laws of arithmetic to integer operations. Several other simplifications are
added to improve the flow graph as it is prepared for code motion.
VALUE performs the following transformations. Implemented using the static single assignment (SSA) form, they are inexpensive and surprisingly effective.
The compiler can eliminate globally redundant expressions when there is an
instance of the expression in a dominator block. This eliminates many of the redundant
expressions in the procedure; however, it does not perform code motion. Later, the
compiler uses a technique called elimination of partial redundancies to do code motion.
Figure 8.1 Structure of Global Optimizer
The compiler performs global constant propagation and constant folding. This is
performed in two ways. Initially the compiler performs some constant propagation during
the construction of the SSA form at the same time that globally redundant expressions are
eliminated. Later a full global constant propagation algorithm is performed. During the dominator tree walk, each instruction is looked up in a table of available expressions; if it is already in that table, then the expression is redundant. When an expression is redundant,
do not give the target operand a new name; instead, give it the name of the target of the
instruction that is already in the table.
This table has the same characteristics as the available-expression table used
during value numbering. Algebraic identities and value numbering can be incorporated in
the same way that they were incorporated in the value-numbering algorithm for single
blocks. The operations required of this table are as follows:
Initialization:
Initialize the available-expression table to have no entries.
Start Block:
Begin a basic block. Remember the set of entries currently in the available-expression table so that the entries added during this block can be removed later.
End Block:
Restore the available-expression table to the state that it was in when the current
block was entered.
Find:
Given an instruction I in the flow graph, look up I in the available-expression table
using only the operator and operands. Insert I in the table if a matching entry is not
already there. Return an indication of whether I was already in the table.
Insert:
Insert an expression in the available-expression table even though it is not in the
flow graph. This is used to record added information that can be deduced during the
dominator tree walk. For example, if a conditional branch tests whether T = 0, then the
compiler can record that T has the value 0 on one of the alternative branches.
Finalization:
Eliminate all storage reserved for the available-expression table.
The available-expression table can be implemented using data structures similar to
a scope-based symbol table. It can be viewed as a stack: Elements are pushed onto the
stack if they are not already there. The stack is searched from the top of the stack down.
Elements are popped off the stack when a block is completed. Of course, the data
structure used will be more complex, using a chain-linked hash table to speed up the
searches and an auxiliary array to keep track of the elements in each block on the path
from the start block to the current block. See the description of symbol tables in
McKeeman (1974).
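A minimal sketch of this stack view in C, with the hash table omitted so that the search is a linear scan from the top down; the Expr layout, the fixed capacities, and the function names are assumptions of the sketch.

    typedef struct { int op, left, right, name; } Expr;

    #define MAX_EXPRS 4096
    #define MAX_DEPTH 256

    static Expr table[MAX_EXPRS];  /* the stack of available expressions */
    static int nexprs;
    static int mark[MAX_DEPTH];    /* table size at entry to each block  */
    static int depth;

    void start_block(void) { mark[depth++] = nexprs; }   /* Start Block */
    void end_block(void)   { nexprs = mark[--depth]; }   /* End Block   */

    /* Find: search from the top of the stack down; insert if absent.
       Returns the name of a matching entry, or -1 if newly inserted. */
    int find(int op, int left, int right, int name) {
        for (int i = nexprs - 1; i >= 0; i--)
            if (table[i].op == op && table[i].left == left &&
                table[i].right == right)
                return table[i].name;  /* redundant: reuse the old name */
        table[nexprs] = (Expr){ op, left, right, name };
        nexprs++;
        return -1;
    }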
Given the structure of the available-expression table, the full algorithm can be given
in Figure 8.2. Each instruction is handled by first renaming its operands. Then any
algebraic simplifications are incorporated. Finally, the instruction is entered in the
available-expression table and given a new name if it is not in the table already.
In this case the only way into C is through B. Look at the conditional expression controlling the branch from B to C, and perform the following insertions into the available-expression table:
When the conditional expression has the form T = constant, where constant is a
constant and C is the destination when the condition is true, enter the expression T in the
available-expression table with the same name as the constant. The renaming process
will now perform constant folding on the blocks dominated by C. Follow the same
procedure if the conditional expression has the form T ≠ constant and C is the destination
when the condition is false.
If C is the destination when the conditional expression is true, enter the conditional
expression in the available-expression table with the same name as the name for the
constant true.
If C is the destination when the conditional expression is false, enter the conditional
expression in the available-expression table with the same name as the name for the
constant false.
Now the normal available-expression processing, constant folding, and identities
processing will simplify the program. Consider the example described earlier of two
nested loops iterating over a square matrix, as shown in Figure 8.3. The code is written in
C, mimicking the code that a front end will generate for a Fortran DO loop or a C for loop.
The test for zero iterations of the loop is made explicit so that code motion out of loops
can be done. The constant propagation modifies the test for the zero-iteration case to test
n against 0. The occurrence of two tests of n against 0 is simplified by eliminating the second one; it is known, from the information stored in the available-expression table, that the condition is true at that point.
The same information can be used to simplify range checks or checks for pointers
being null. Although more complex methods can get better results, these tests are a good
preamble to the more complex solutions since most cases are eliminated here.
the compiler must come back later and make sure that the additional operand does not
violate the optimistic assumption made. During earlier dominator-based value numbering,
the compiler made the pessimistic choice. Here the compiler makes the optimistic choice
because it will find more opportunities for identifying constants.
What are the arithmetic rules in this extended arithmetic? If you think of ⊤ as undefined and ⊥ as varying, then the rules are what you would expect. Consider the
addition table in Figure 8.4. For two constant values the arithmetic is target-machine
arithmetic. If one of the operands is undefined, then the whole value is undefined. If one of
the operands has a varying value, then the whole addition has a varying value. The only
surprise is that an undefined added to a varying temporary could immediately be declared
a varying temporary. This is not done here so that the rules will match the rules for
multiplication, where the distinction is important.
Figure 8.4 Rules for Addition
What is the arithmetic of φ-nodes? This is where the optimistic view of constant propagation occurs in the algorithm. Any arguments that are undefined (⊤) are ignored in computing the value of a φ-node.
If any of the remaining arguments has value ⊥, then the value of the φ-node is ⊥.
If any two arguments of the φ-node are distinct constants, then the value of the φ-node is ⊥.
If there is at least one constant operand and neither of the previous conditions holds, then the value of the φ-node is that constant.
Otherwise, all arguments of the φ-node are undefined, so the value of the φ-node is undefined.
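These rules can be stated as a small meet function. The sketch below assumes a three-way lattice encoding, with TOP standing for the undefined value and BOT for the varying value; the type and function names are illustrative.

    typedef enum { TOP, CONST, BOT } Tag;  /* TOP = undefined, BOT = varying */
    typedef struct { Tag tag; long value; } Lattice;

    /* Meet rule for a phi-node: ignore TOP operands; any BOT operand or
       two distinct constants give BOT; otherwise keep the one constant;
       if every operand is TOP, the result is TOP. */
    Lattice phi_meet(const Lattice *args, int n) {
        Lattice result = { TOP, 0 };
        for (int i = 0; i < n; i++) {
            if (args[i].tag == TOP) continue;        /* ignored          */
            if (args[i].tag == BOT) return args[i];  /* varying wins     */
            if (result.tag == TOP) result = args[i]; /* first constant   */
            else if (result.value != args[i].value)
                return (Lattice){ BOT, 0 };          /* distinct constants */
        }
        return result;
    }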
For example, consider a block B with two predecessors and a φ-node in that block. If all of the operands have the same value, then that is the value of the target temporary. If two of the operands have different defined values (not ⊤), then the target temporary must be a varying temporary, i.e., ⊥. When one of the operands has value ⊤ we make the most optimistic assumption: that it will later have the same value as one of the other operands. This gives the arithmetic table in Figure 8.6, which uses the C language's conditional expression operator to indicate that the value is ⊥ if the two constants are different.
Conditional branching instructions are checked to see if more destinations can now
be reached. The algorithm computes the change using the attribute executable. If the
attribute is already true, then the block has already been on the work list so it need not be
entered again. The attribute is set to true and the destination entered in the work list for all
executable blocks that previously had the false attribute.
The current possible destinations for a branching instruction are as follows. If the controlling temporary has value ⊤, there are no destinations. If the controlling temporary is a constant, then the only destination is the one corresponding to that constant. If the controlling temporary is a varying temporary with value ⊥, then all destinations are possible.
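Stated as code, using the Lattice encoding from the previous sketch; the two-way branch and the bit-mask result convention are assumptions made for illustration.

    /* Destinations possible for a two-way branch controlled by cond:
       bit 0 = the true arm, bit 1 = the false arm. */
    int branch_destinations(Lattice cond) {
        switch (cond.tag) {
        case TOP:   return 0;                   /* undefined: none yet  */
        case CONST: return cond.value ? 1 : 2;  /* exactly one arm      */
        default:    return 3;                   /* varying: both arms   */
        }
    }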
Simulating the execution of the block is shown in Figure 8.8. Each block is simulated again when one of its predecessor edges becomes executable. This changes the values of the φ-nodes. In fact, this change of the values of φ-nodes is the only reason that the instructions cannot be evaluated just once in an order in which the operands are evaluated before the instructions that use them (reverse postorder, for instance). When a new edge becomes executable, a new operand becomes relevant in each φ-node.
The first time a block is processed, all of the other instructions in the block are evaluated. After that, an instruction is reprocessed only when the value of one of its operands changes.
The initialization code is shown in Figure 8.9. All of the sets are initialized to empty,
all edges are initialized to not executable, and all temporaries are initialized to have an
undefined value. The formal parameters need to take a different value. If a formal
parameter is known to be a constant because interprocedural analysis has indicated that
the same value is passed in all procedure calls within this program, then the formal
parameter is initialized to have that constant value. Otherwise it is initialized to indicate that it has a defined but unknown value; that is, it is initialized to ⊥.
To start the whole algorithm, the entry block Entry is placed on the work list for blocks. There are no φ-nodes in Entry; however, placing it on the work list will force the evaluation of the instructions in the block and cascade through all of the other blocks as the work-list algorithm progresses.
The algorithms are designed to avoid evaluating instructions until the instruction is known to be executable. This cannot be done completely, because instructions are put on the instruction work list as soon as an operand changes, whether or not the block containing them is executable. The only place that this can have an effect is in φ-nodes. The φ-nodes are the collectors of values when there is a merge of control flow.
Figure 8.10 Main Constant Propagation Procedure
The constant propagation algorithm can now be applied to this contrived arithmetic. At each instruction where such a contrived temporary is used, one can check the value to determine the character of the temporary as a pointer. If one is executing one of the null-pointer test instructions and the value of the operand is known not to be a null pointer, then the test can be eliminated.
What we have done is interpret a different idea as a system of arithmetic and apply the same constant propagation algorithm.
Alias Analysis Information
The alias analysis information can be improved by constant propagation in
languages such as C where pointers can be created to any data structure and pointers
can be incremented within a data structure.
Associate with each data structure a tag naming the data structure. Pretend that
each load of an address constant gives a temporary that tag as a value. Normal arithmetic
operations such as addition and subtraction take the same tag as one of the operands.
Now we can define an arithmetic system containing ⊤, the set of tags, and ⊥. Constant propagation can now be applied, giving each temporary a tag or ⊥. The temporaries that have a tag value represent pointers into the data structure with that tag.
A store through that temporary cannot modify the value in memory for any other region.
There are two different uses of the term strength reduction in compiler literature. One use is the replacement of a multiplication by a power of two with a shift, or of a multiplication by a constant with a collection of shift and add operations. I am using the term to refer to replacing the multiplication of a regularly varying temporary by a constant in a loop with an addition.
Definition
Loop Invariant:
A temporary T is a loop invariant in the loop L if it is either not computed in the loop
or its operands are loop invariants.2
2
This definition is used by Markstein, Markstein, and Zadeck in the ACM book on optimization that has yet to be published
(Wegman et al. forthcoming).
Had the definition instead specified that T is loop invariant only if its operands are evaluated outside of L, then this expression would not be loop invariant because X + Y would be evaluated inside the loop.
A loop-invariant temporary is one in which the leaves of the corresponding expression tree
are not evaluated in the loop.
The immediate reaction is to remove loop-invariant instructions from the loop. If
they always evaluate to the same value, compute them outside the loop. However, doing
so is not safe. A temporary (and its corresponding instruction) is loop invariant irrespective
of where it occurs in the loop. It may occur in a conditional statement. Later optimizations
will take care of that. The compiler only needs to know what is loop invariant and what is
not.
To record the loop-invariant information, we add an attribute to the temporary T
called variant(T), which contains the innermost loop in the loop tree in which T is not loop
invariant. If T is invariant in every loop, then variant(T) is the root of the loop tree. If T is
not invariant in any loop, then variant(T) is null. Recall that this is all being performed on
the SSA form of the flow graph, so there is a single definition for each temporary and that
definition dominates the uses in instructions.
Before describing the algorithm, let's consider each class of instructions and
determine the meaning of loop invariance for each:
Consider a φ-node T0 = φ(T1, . . . , Tn). To determine that T0 has the same value each time through a loop, the compiler must know the innermost loop in which each of the operands is invariant and know which block branches to the block containing the φ-node. The second condition is impractical to compute, so the compiler will assume that variant(T0 = φ(T1, . . . , Tn)) is the innermost loop containing it.
For an instruction that is a pure function, such as addition, multiplication, or
disjunction, the instruction varies in the innermost loop in which one of the operands
varies.
A copy operation is a pure function in this situation, so the target is variant in the
same containing loop in which the operand is variant.
A LOAD instruction varies in the innermost loop in which a store operation might modify the same location (that is, the same tag).
The compiler needs an auxiliary function that gives the nearest common ancestor
of two nodes of a tree, in this case the loop tree. The algorithm is simple: If either node is
an ancestor of the other, then that node is the result. Otherwise choose one of the nodes
and start walking toward the root until a node that is an ancestor to (or equal to) the other
is found. The algorithm uses the preorder/postorder number test to check if one loop is an
ancestor of the other. This check runs in constant time. The algorithm is given in Figure
8.11. The initial test for L2 being an ancestor of L1 is unnecessary for correct operation of
the algorithm, but is included for efficiency.
Figure 8.11 Finding the Nearest Ancestor in a Tree
If it does not improve the performance of the procedure, then it should be removed.
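A sketch of the computation in C; the Loop structure, with parent pointers and preorder/postorder numbers, is an assumed representation of the loop tree.

    typedef struct Loop {
        struct Loop *parent;   /* enclosing loop in the loop tree */
        int pre, post;         /* preorder and postorder numbers  */
    } Loop;

    /* a is an ancestor of (or equal to) b: a constant-time interval test. */
    static int is_ancestor(const Loop *a, const Loop *b) {
        return a->pre <= b->pre && a->post >= b->post;
    }

    Loop *common_ancestor(Loop *l1, Loop *l2) {
        if (is_ancestor(l2, l1))     /* the optional initial test */
            return l2;
        while (!is_ancestor(l1, l2)) /* walk l1 toward the root   */
            l1 = l1->parent;
        return l1;
    }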
For each instruction, the compiler computes the innermost loop in which the computation
varies: a direct encoding of the conditions for the varying of any temporary. This algorithm
is shown in Figure 8.12. The algorithm processes a block by first processing the -nodes
in the block. The target temporaries are modified in the block and hence are varying in the
innermost loop. The other temporaries are processed by looking at each of the operands.
Find the innermost common loop for the current block B and the point of definition of the
operand. Compare this value with the partially computed innermost varying loop for this
instruction, held in Varying. If the operand is modified in a more inner loop this becomes
the loop in which the instruction varies. After computing the results of all operands, the
targets are given this innermost loop on which all of their operands depend.
Figure 8.12 Computing Nested Varying Loop
Recall that static single assignment means that each temporary has a single
instruction that evaluates it. It does not mean that each instruction has only one output. In
fact, STORE instructions may be viewed as having multiple outputs.
The driving procedure must ensure that variant for each temporary is computed
before it is used. Since variant is computed for each φ-node without using the information
for the operands, a dominator tree walk will ensure that all operands have a value for
variant before the instruction in which the operands are used. Hence the driving procedure
uses the algorithm in Figure 8.13.
As with other static single assignment algorithms, this algorithm assumes that all
LOAD instructions have another operand, called a tag, which is handled like a temporary
for the purposes of renaming. Each STORE instruction modifies a particular storage
location and a number of tags. There are φ-nodes included for the tags also. The tags are
handled just like temporaries for the purposes of SSA computations and are handled like
operands in this algorithm for computing invariant temporaries (and tags). Thus a load
operation inside a loop will be invariant if the address expression is invariant and the tag is
not modified by any store operation in the loop.
As an example, consider the running example and the instruction dSLD (T17) =>
SF1 in block B2. Consider the flow graph after initial dominator-based optimization has
occurred. B2 is a block contained in the loop {B2,B6,B3}, which is contained in the loop
{B1,B2,B6,B3,B4}. T17 is assigned a value in block B1, which is in this second loop;
hence, Varying starts out pointing at this outer loop. However, the store operation in block
B6 also affects the load operation through the tag, so SF1 is marked as varying on the
loop {B2,B6,B3}. However, T17 itself is marked as varying on the outer loop.
The description here is based on a description of strength reduction by Markstein (Wegman et al. forthcoming). That work
and this are both based on the original paper by Allen, Cocke, and Kennedy (1981).
Definition
Induction Candidates:
Given a loop L, a temporary T is a candidate temporary for L if and only if T is
evaluated in L and the evaluation has one of the following forms:
T = Ti ± Tj, where one of the two operands is a candidate temporary and the other
operand is loop invariant in L.
A basic work-list algorithm is described in Figures 8.14 and 8.15. The algorithm
computes the set of temporaries that are candidates for induction temporaries. It includes
each temporary that is computed using the correct form of instruction from the definition
and then eliminates temporaries whose evaluating instructions do not have the correct
form of operands. When the algorithm stabilizes, the largest set of candidates available
has been computed. How does one prove that? Clearly all candidates are in the initial set
and only temporaries that would be removed with any set of candidates are removed, so
the algorithm computes the maximum set of candidates.
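A sketch of the pruning in C, restricted to the Ti ± Tj form given in the definition; the array-based instruction representation is an assumption, and a true work-list version would revisit only the uses of each removed temporary rather than rescanning.

    #define MAX_T 1024

    static int op1[MAX_T], op2[MAX_T]; /* operands of T = op1 +/- op2 */
    static int invariant[MAX_T];       /* loop invariant in L?        */
    static int candidate[MAX_T];       /* the current candidate set   */

    /* The definition requires one candidate operand and one
       loop-invariant operand. */
    static int valid(int a, int b) {
        return (candidate[a] && invariant[b]) ||
               (invariant[a] && candidate[b]);
    }

    /* Iterate to a fixed point, removing temporaries whose evaluating
       instructions no longer have the correct form of operands. */
    void prune_candidates(int ntemps) {
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int t = 0; t < ntemps; t++)
                if (candidate[t] && !valid(op1[t], op2[t])) {
                    candidate[t] = 0;
                    changed = 1;
                }
        }
    }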
Figure 8.15 Pruning to a Valid Set of Temporaries
The set of candidates describes the temporaries that are evaluated with the correct
instructions in the loop; however, induction temporaries represent temporaries that are
incremented in a regular fashion across iterations of the loop, with the value on the next
iteration differing by a fixed amount from the values on the previous iterations. There are
two possible interpretations of this idea.4 This compiler uses the following definition.
4
The other definition of induction variables also requires the induction temporary to change by the same amount each time
through the loop following all possible paths through the loop. This is the definition needed for dependence analysis. It is more
restrictive than needed for strength reduction.
Definition
Induction Sets and Temporaries:
An induction temporary T in a loop L is a candidate with the following property.
Consider the graph with the candidate temporaries as nodes with an edge between two
candidates T and U if T is used to compute the value of U. An induction temporary is a
candidate temporary that is a member of a strongly connected region in this graph. The
set of temporaries in such a strongly connected region is called an induction set.
In other words, the temporary T is used to compute other temporaries, and those
temporaries are used to compute others, until the resulting value is used to compute the
value of T on the next iteration. Eventually the value of T is used to compute the value of
T. For a single-entry loop, this means that the temporary is involved in a strongly
connected region that contains a φ-node at the beginning of the loop.
The algorithm that the compiler used to compute loops cannot be used here.
Starting at a -node and tracing backward may lead to a number of temporaries that are
not in the strongly connected region. Instead, the general algorithm for a strongly
connected region must be used. Since the algorithm is applied at several other places in
the design, it will not be repeated here. The algorithm is summarized in Figure 8.16.
If the loop is not single entry, we will not bother to apply strength reduction here. A more
limited version will be applied later. First calculate the candidate temporaries. Then
implicitly create the graph. It does not need to be explicitly created because the form of
the instructions evaluating the temporaries in Candidates is simple. Perform one of the
two standard strongly connected region algorithms. Any strongly connected region with at
least two members and which includes a temporary that is the target of a φ-node at the
entry is an induction set.
Figure 8.16 Pruning Candidates to Induction Sets
As an example, consider Figure 8.17. The left column represents the flow graph for
the loop, and the right column represents the implicit graph of the candidate temporaries.
Note that J1 and J2 are not in a strongly connected region; if one had started at J1 and traversed the arcs backward, one would never return to J1, unlike the case of computing loops in the flow graph. The set {I1, I2} represents the induction set in this example.
Consider one induction set {T1, . . ., Tk}. This is a strongly connected region in the
graph of Candidates temporaries. The only temporaries that can have multiple
predecessors in that graph are the φ-nodes. So the strongly connected regions have a special form: The normal instructions are divided into subsets that form paths, with the joins and separations occurring at the φ-nodes.
Now add one optimization after identifying the induction sets and induction
temporaries. Consider three temporaries in an induction set, T1, T2, and T3, where T2 =
T1 + RC1 and T3 = T2 + RC2. Recompute T3 as T3 = T1 + (RC1 + RC2).
The expression for E′ can be written to expose the induction variables. The most important form of strength reduction involves integer multiplication. If one operand of a multiplication is an induction variable and the other operand does not change in the loop, then the repeated evaluation of the multiplication each time through the loop can be replaced by a repeated addition. To identify these cases, the compiler divides up E′ into summands of the form
E′ = E′′ + FD1 * I1 + FD2 * I2 + . . . + FDm * Im
where FDj is a loop-invariant expression (FD is an abbreviation for first difference)
and Ij is one of the induction variables in the innermost loop. Induction variables for the
outer loops are loop constants here, so they are not a part of this expression.
There is danger in this transformation: Wholesale rewriting of expressions may
increase the size of the generated code and decrease the execution speed. Most of the
time this will not be the case. Consider the running example, in which strength reduction
and rewriting the expressions collapses the code remarkably. However, there are cases
where this is not true. The compiler attempts to avoid these situations by the following
devices.
It embeds the reassociation in a dominator tree walk used to eliminate redundant
expressions. Thus each expression will probably be evaluated once.
This phase of the compiler does not eliminate the original expressions. Later, after
global optimization, dead-code elimination will remove them. Why? One of the problems
with reassociation is that it can move expressions into loops. In fact it can move them into
conditional expressions within loops. The compiler cannot move them out of these
conditional expressions due to safety concerns. So the compiler leaves the original
expressions in place, which causes those expressions to be available where the
programmer originally placed them. If the compiler has not transformed the expression
within the loop, the compiler will find that the moved expression is redundant and eliminate
it from the loop.
There are four different categories of operators involved in reassociation. The rest of this discussion will use three representative operators: addition, subtraction, and multiplication. However,
there are many other operators that have the same characteristics: logical disjunction,
logical conjunction, and logical negation, for instance. The compiler applies the same
techniques to all of them. The techniques are not applied to floating-point operations
because they are not associative or distributive in the literal sense. When using the term
associative, the arithmetic must be literally associative, not approximately associative.
Commutative operators such as addition and multiplication have the property that x
+ y = y + x. For these operators, the compiler can reorder the operands in any order
desired. In our compiler, the operands are reordered so that the one with the highest
reverse-postorder number in the flow graph occurs first. This procedure combined with the
value_table structure will automatically identify commuted redundant expressions.
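In code the reordering is a single comparison; here rpo is assumed to map each temporary to the reverse-postorder number of its defining block.

    /* Canonicalize a commutative instruction T = a op b by placing the
       operand with the higher reverse-postorder number first, so that two
       commuted instances produce identical keys in the value table. */
    void canonicalize(int *a, int *b, const int rpo[]) {
        if (rpo[*a] < rpo[*b]) {
            int t = *a; *a = *b; *b = t;
        }
    }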
Operations such as subtraction that are the combination of a commutative
operation (addition) and an inverse operation (negation) are reordered like the
corresponding commutative operator; however, an extra flag is maintained, indicating that
a negation is also needed. When the instructions are regenerated after processing, the
negation flag is used to create a subtraction operation rather than an addition.
Associative operations (such as addition) allow more processing. Assume that an
associative operator is the root of an expression tree. Group together all of the operands
of the associative operator at the root. In other words, if the compiler has (x + y) + (w + z),
handle it as a single operator with a list of operands x, y, w, z. The associative operator
can then be rewritten as (x + (y + (w + z))). If the associative operator is also commutative,
the operands can be reordered so that the first operand is the one with the highest reverse
postorder. This will automatically set up things so that the expressions for the inner loop
are computed first, then the operands for the next-outer loop, and so on.5
Distributive operations, such as integer multiplication, are the fourth category. The
rule x * (y + z) = x * y + x * z can be used to rewrite combinations of addition, subtraction,
and multiplication as a sum of products. Each term in the sum of products is the product of
a constant (the subtraction contributes a -1 to the constant), induction temporaries, and
other temporaries. Now the elements of the products can be ordered using the reverse
postorder number as before, and the terms in the sum can be ordered by the maximum
reverse postorder number of the components of the product. This gives the expression the
form described at the beginning of the section.
Before the dominator walk of the blocks of the loop and at the same time that
induction variables are being identified, identify all expressions with the following
properties:
The evaluation of the temporary is implemented with an associative operator. For
these purposes, the compiler considers subtraction to be an addition with a negate flag.
The temporary is used as the operand of an instruction that is not the same
associative operator. In other words, the temporary represents the root of an expression
tree where the operations near the root are all the same associative operator.
The key insight is that the static single assignment form allows the compiler to view
the temporaries in the flow graph as nodes of the original expression trees. Consider two
temporaries T1 and T2. T1 can be considered the parent of T2 in an expression tree if T1
uses T2 as an operand. Thus the edge from T1 to T2 is given by the Definition attribute of
the Operand set for T1 to get to the instruction, and the temporary is reached as the
Target of that instruction.
Given the root of an expression tree, perform a tree walk of that expression tree,
analyzing it as a sum of products and combining like terms. This sum of products need not
be stored as instructions: It can be stored as a linked list of linked lists in temporary
storage. Stop the tree walk when the compiler gets to a constant, LOAD instruction,
variable temporary, or induction temporary.
Having recognized the tree, now rewrite it in instructions in the form described
above. The only problem is reapplying distributivity. Dividing the expressions into pieces
that are invariant in each of the enclosing loops is straightforward: The compiler has
already ordered the operands, so the compiler need only divide the sum into the parts
invariant in each loop.
For each of these sums, distribution should be applied in reverse. This is a greedy
algorithm. Consider a sum and find the component of a term that is an operand of the
largest number of terms. Apply distributivity to rewrite those terms as a product of that
component with the sum of the other terms involving this component. Keep reapplying
distributivity in this greedy fashion until no further rewriting can occur. Each of the
products can now be divided into parts that are invariant in each of the enclosing loops by
applying the same techniques as were used for addition.
Now we have a reformed expression tree in temporary storage. Rewrite the
expressions as instructions in the flow graph. Leave the old expressions there. Use the
dominator tree walk to determine if expressions are already available so that they need
not be generated again, taking up space and potentially escaping later optimization
phases and causing poor runtime performance. At this point go on to the next tree in
dominator order.
Consider an expression E of the form
E = E′′ + FD1 * I1 + FD2 * I2 + . . . + FDm * Im
where Ij are the induction variables and all of the other expressions are loop
invariant in the loop L. The idea of strength reduction is to compute this expression before
entering the loop and update it each time one of its operands changes. Since the only
operands that can change are the induction temporaries, we update E each time one of
the induction temporaries changes.6
6
This discussion is glossing over a hard problem. There may be many incremental expressions: Keeping and updating each
one ties up most of the registers for the machine. For a few incremental expressions, the discussion given here is best. When there are
more incremental expressions, it is probably better to consider the linear function of the induction temporaries as the incremental
expression and add in the loop-invariant part separately. The linear function of the induction temporaries is likely to be reused many
times in the same loop.
Since the flow graph is in static single assignment form, the compiler cannot update
the temporary holding the value for E. Instead, the compiler must generate a collection of
temporaries E0, . . . , Eq: one for each time one of the induction temporaries changes and
one for each RC constant involved in a φ-node. Assignments to compute the value of
each of these temporaries are inserted after the update of each one of the induction
temporaries.
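The intent of the transformation can be seen at the source level. The C fragments below are hand-written before and after pictures of a hypothetical loop, not compiler output.

    /* Before: each iteration multiplies the induction variable
       (the address computation is a + i*8 for a double). */
    void before(double a[], int n) {
        for (int i = 0; i < n; i++)
            a[i] = 0.0;
    }

    /* After strength reduction: the multiplication becomes a repeated
       addition, updated at the point where i is incremented. */
    void after(double a[], int n) {
        double *p = a;        /* E computed before entering the loop */
        for (int i = 0; i < n; i++) {
            *p = 0.0;
            p = p + 1;        /* E updated when i changes            */
        }
    }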
Besides generating new temporaries to hold the value of E, the compiler must
update uses of the expression E. At each point that E is used, the value is stored in some
temporary. The compiler must replace the uses of that temporary with the uses of these
new temporaries. This is not difficult: The compiler walks through the loop using the
dominator tree and keeps a table of redundant expressions. This information can be
inserted in this table as a previous computation of E, which will make the real computation
redundant and update the operands.
Where can the induction temporaries change? Let ISi be the induction set associated with the induction temporary Ii. As noted earlier, the induction set replaces the
idea of updating a temporary because the flow graph is in static single assignment form.
The expression E changes whenever one of the temporaries in ISi changes. Consider the
cases:
If Ti is not in the induction set, then insert a computation of E at the end of block P and place the temporary holding that value into the corresponding entry in the φ-node. Be careful! There may already be a computation of the same expression available in P, so do not insert it if it is not necessary.
These temporaries are no longer expression temporaries; they become variable temporaries like any other local variable.
All temporaries that are not expression temporaries can be renamed back to the
original temporary that created them. Again, no use of the temporary has been moved.
The compiler now has all of the leaves consistently named, so it reconstructs the
expression temporary names by using the formal temporary table, as was done during the
original building of the flow graph. How is this implemented? When the static single
assignment form is created, keep an added attribute that is the original name of the
temporary. Also keep a set of temporaries for each induction set.
One might ask, why not apply dead-code elimination immediately after strength reduction? Because reassociation has
occurred earlier. Reassociation has the effect of moving computations into a loop, possibly into a conditional statement within the loop.
Partial redundancy will not be able to move the computation back out unless the original occurrence of the expression is still there to
make the moved expression redundant.
The dead-code elimination algorithm represents a simple idea. First mark the
instructions that store into data structures outside the subprogram. These instructions are
not dead. Then mark the instructions that compute each of the operands of these
instructions. These instructions are not dead. Keep doing this until no more instructions
are marked. The unmarked instructions are not used directly or indirectly in producing
data that is available outside the subprogram so the instructions can be eliminated.
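A sketch of the marking pass in C; the instruction numbering, the operands table (which records, for each instruction, the instructions computing its operands), and the function name are assumptions of the sketch.

    #define MAX_I 4096

    static int noperands[MAX_I];
    static int operands[MAX_I][4]; /* defining instructions of operands */
    static int marked[MAX_I];

    /* Work-list marking: roots are the instructions that store data
       outside the subprogram; everything left unmarked is dead. */
    void mark_live(const int roots[], int nroots, int worklist[]) {
        int n = 0;
        for (int i = 0; i < nroots; i++) {
            marked[roots[i]] = 1;
            worklist[n++] = roots[i];
        }
        while (n > 0) {
            int i = worklist[--n];
            for (int k = 0; k < noperands[i]; k++) {
                int d = operands[i][k];
                if (!marked[d]) {      /* its operands become live too */
                    marked[d] = 1;
                    worklist[n++] = d;
                }
            }
        }
    }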
There are two options for operands of conditional branch statements. The simplest
option is to immediately declare that all instructions computing operands of conditional
branches are important. The idea is that the path of computation is important.8 This is a
conservative approach, but more instructions can be eliminated by eliminating conditional
branches where possible. This can be done with the following steps:
8
This is what we did in the COMPASS Compiler Engine. The result is that many instructions are eliminated but the
framework instructions implementing the flow graph remain, possibly generating loops with no computations within them except the
increment of the loop index, or basic blocks with no instructions.
When the work list becomes empty, all instructions that might affect data outside
the subprogram have been processed. All other instructions are eliminated, and branches
to blocks that contain no instructions are redirected to the immediate postdominator block.
Why the postdominator? If all instructions in the block have also been eliminated, then all
branching instructions in the block have also been eliminated, so there are no instructions
in blocks that are control dependent on it. But the control flow must branch somewhere.
The first available block on any path to Exit is the immediate postdominator.
Consider the example in Figure 8.19. It is a simple conditional statement containing
a simple loop. Assume that T is used to compute necessary data and the I temporaries are used as a loop index that is no longer needed. That means that the body of the loop is unnecessary. The assignment to I in the initial block can be eliminated. The initial block's branch to the loop is changed into a branch to the join block at the bottom of the example.
Two temporaries have the same value if they are computed using the same
operator and the operands have the same value. If there is ever any doubt, such as a
subroutine call instruction or a STORE instruction that may be killing a number of
temporaries, then the temporary is assumed to have an unknown value different from any
other temporary. In other words, the target temporary is put in a new equivalence class by
itself and becomes the representative for all other temporaries added later.
If there are no loops in the flow graph then computing this partition is simple. Visit
the blocks in reverse postorder and the instructions within each block in execution order.
When there are no loops, this guarantees that the instructions evaluating the operands are
processed before the instructions in which they are used.
Two temporaries that load the same constant are put in the same partition.
Two temporaries that load a value from memory where the addresses have the
same value and the memory locations have the same value are put in the same
partition.
Two instructions that have the same operation code with corresponding operands
having the same value_representative field will have the same value. That is, the
targets of two instructions with operands of the same value and the same operation
code will generate the same result.
The algorithm iterates over a data structure called the SSA graph. The graph is implicit, in that the information is already stored; this is just a different way of looking at it.9
9
This algorithm is from L. Taylor Simpson's thesis, Value-Driven Redundancy Elimination, at Rice University (1996).
Consider the set of temporaries to be a directed graph, called the SSA graph,
where the nodes are the temporaries and (T1, T2) is an edge if T1 is used as an operand of the instruction that computes T2. Then topologically sorting the SSA graph
orders the instructions so that the computations of all operands precede the temporaries
they are used to compute. The temporaries that cannot be sorted into such an order are
the strongly connected regions of this graph.
Recall that either standard algorithm for a strongly connected region computes the
sets of temporaries that form the strongly connected regions: C1, . . ., Cs. In the process it
also orders these sets of temporaries so that if T1 ∈ Ci and T2 ∈ Cj where i < j, then T1 precedes T2 in the reverse postorder of a depth-first search.
What do strongly connected regions look like in this graph? A strongly connected
region is either a single instruction or a loop. A single instruction can be processed as
discussed above.
All of the instructions in a loop should be processed simultaneously. Of course, this
is not possible. The compiler must iterate through the instructions of the loop. The only
problem is with the φ-nodes. The compiler will make an optimistic assumption about the effects of φ-nodes and then iterate through the loop, updating the assumptions until a
consistent set of value_representative fields is found.
When the strongly connected region contains a cycle, some operands have not been evaluated. These can only occur as operands of φ-nodes. First consider what it means to process a φ-node. There are three possibilities:
An entry has already been made in the value table for a φ-node in the same block with the corresponding equivalent operands. Thus the value_representative field for the target of the current φ-node is made the same as that of the entry already in the table.
If all operands of the φ-node are equivalent, then the target is made equivalent to each of the operands and this φ-node is entered in the value table.
If all operands are not equivalent and an equivalent φ-node is not already in the value table, then the target of the φ-node is placed in a new partition; that is, it is made its own entry in the value_representative field and the information is entered in the value table.
Here is where the optimism comes into play. Consider all of the temporaries in a
particular strongly connected region in the SSA graph. Order these temporaries in reverse
postorder (remember that there is a single instruction with a temporary as the target). Now
the instructions evaluating operands will be processed before the uses, except for some
φ-nodes.
Consider the φ-nodes optimistically: If an operand has not been processed yet, assume that it does not affect the result. For each temporary, initially assign it a value_representative value of NULL to indicate that the temporary is not processed yet. Then φ-nodes are processed by ignoring the operands that have not yet been processed, hoping that they will have the same value as the other operands.
If a corresponding entry is already available in the value table, then assign the
target of this φ-node the same value_representative value.
Consider the operands that do not have a value_representative attribute of NULL. If
at least two of them have different values, then the target of the φ-node is placed in a new
partition by itself and entered in the table. Since the temporaries are being scanned in
reverse postorder, at least one of the operands will have been processed already.
Consider the operands that do not have a value_representative attribute of NULL. If
all of them have the same value, then add the target to the same partition (give it the
same value_representative value) and enter the instruction in the table.
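A sketch of these cases in C, writing 0 for the NULL value of value_representative. The scratch-table lookup is reduced to a stub that always misses, and the refinement of not re-partitioning a target that has already been given a new partition is omitted for brevity; all names are illustrative.

    static int value_rep[1024];        /* 0 means not processed yet (NULL) */

    /* Stand-in for the scratch-table search keyed on the phi-node's block
       and its operands' representatives; a real table would be hashed. */
    static int scratch_lookup(int block, const int reps[], int m) {
        (void)block; (void)reps; (void)m;
        return 0;                      /* always reports "not found"       */
    }

    int process_phi(int block, int target, const int opnd[], int n) {
        int reps[16], m = 0;           /* assumes at most 16 operands      */
        for (int i = 0; i < n; i++)
            if (value_rep[opnd[i]] != 0)    /* ignore unprocessed operands */
                reps[m++] = value_rep[opnd[i]];
        int same = m > 0;
        for (int i = 1; i < m; i++)
            if (reps[i] != reps[0]) same = 0;
        if (same)                      /* all processed operands agree     */
            return value_rep[target] = reps[0];
        int found = scratch_lookup(block, reps, m);
        if (found)                     /* an equivalent phi-node was seen  */
            return value_rep[target] = found;
        return value_rep[target] = target;  /* start a new partition       */
    }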
The instructions in the strongly connected region are repeatedly processed until no
changes occur, that is, the value_representative fields are unchanged during a complete
scan of the instructions. This generates two problems: avoiding creation of unnecessary
partitions, and the pollution of the value table. During the processing of -nodes, a new
partition is generated for the target if two of the operands differ. The compiler does not
need to generate a new partition each time the set of instructions is scanned. Instead,
note that if the value_representative field for the target of a -node already has a non-null
value different from one of the operands, then it has already been placed in a new
partition and so it need not be changed.
Pollution of the value table is good news and bad news. Remember that we are
processing the SSA graph. There may be multiple strongly connected regions in this
graph for each loop in the flow graph. The false information is good news because it will
force potentially equivalent temporaries in another strongly connected region to make the
same choices and get the same value_representative fields as in the current strongly
connected region. It is bad news because false information is stored in the table, which
might have adverse consequences later when processing instructions after the current
strongly connected region.
The pollution is avoided by creating another table, called the scratch table, which has
the same structure as value_table. During the processing of a strongly connected region,
the scratch table is used rather than the value table. After the values have stabilized, the
instructions are reprocessed using the value table.
One more point before summarizing the algorithm and giving an example:
Algebraic simplification can be combined with this algorithm. Each time the compiler
processes instructions, it processes them in execution order. Algebraic simplification and
peephole optimizations can be applied during the processing step. In fact, constant folding
can be combined with the processing step simultaneously. The description of constant
propagation is not included here since constant propagation has already been applied at
the point in this compiler where global value numbering is applied.
The algorithm is described in Figure 8.20. Recall that the SSA graph is simply the
set of temporaries where (T1, T2) is an edge if T1 is used as an operand of the evaluation
of T2, so the successor relation in this graph is given by the Uses(T) attribute and the
predecessor in the graph is given by the Definition(T) attribute.
The algorithm calls CALCULATE_GLOBAL_VALUE_SCC to compute the global
value numbers for temporaries in a strongly connected region of the SSA graph. When
that calculation is completed, all of the temporaries have the correct value_representative
field; however, the value table must be updated to reflect the existence of these
instructions for future instructions outside the loop.
The description of the algorithm avoids some points that the implementer must
face. The semantics of an instruction (such as a copy operation) may indicate that the
target operand has the same value as one of the inputs, which the algorithm must reflect.
This also occurs with the special processing of φ-nodes.
The algorithm in Figure 8.21 describes the processing of instructions in strongly
connected regions. The processing is simpler than many algorithms since there are no
strongly connected regions within strongly connected regions. The instructions are
scanned repeatedly until there are no changes in the value_representative attributes.
When an instruction is inserted in the tables, the keys for looking up information are the
operation code (together with the block if it is a φ-node) and the value_representative
entries for the operands. The value to be retrieved from the table is the
value_representative value for the target. When an entry is made there are two
possibilities: the resulting value is either the same as one of the operands or the resulting
temporary is a new entry in a new partition. In the first case, the value_representative of
the operand is used as the value_representative of the target. In the second case, the
value_representative of the target is made to be itself, indicating a new partition.
Figure 8.20 Main Global Value-Numbering Algorithm
Consider the fragment of the flow graph in the left column of Figure 8.22. Note that
the corresponding I and J entries always have the same value. The right column
represents the SSA graph for the same code. In this example there are five strongly connected regions: C1 = {J0}, C2 = {J1, J2}, C3 = {I0}, C4 = {I1, I2}, and C5 = {U}. The
regions are in the same order in which the algorithm will process them. Note that all
operands are evaluated before use when the uses and definitions are in different strongly
connected regions and the temporaries are in reverse postorder within a strongly
connected region. Now let us walk through the algorithm.
Figure 8.21 Processing Strongly Connected Regions
The temporary J0 is processed and the corresponding entry is made in the value
table, indicating that this temporary has value 1. Now the first real strongly connected
region is processed. Initially the algorithm will assume that J1 is the same as J0 since the
latter is the only processed operand. It will then determine that J2 must be 2 by algebraic
simplification. The algorithm reprocesses J1 and finds that the two operands of the φ-node
are different, so J1 is given itself as a representative of the new partition. Now J2 does not
have any simplification, so J2 is given itself as a representative of the new partition. There
will be no more changes in processing, so all of the real entries are put into the value
table. However, the scratch table is left as it is.
The temporary I0 is processed next. Because the operands of that instruction are
the same as J0, I0 is given J0 as its value_representative entry. We now come to the
strongly connected region {I1, I2}. Initially, I1 is given the same value as J0 and I0 since I0
is the only processed operand of the φ-node. The nodes for this strongly connected region
are processed just like the nodes for the Js, and since the information is in the scratch table,
the same representatives are chosen. So I1 has the same value as J1, and I2 has the
same value as J2.
The compiler can determine that U has value 0 since it is the subtraction of two
equal values. This is an example of algebraic simplification. The simplification is limited,
though: if the example were changed so that U was computed using I before I is
incremented, the algorithm would be unable to determine that the I and J temporaries
were equal.
8.11 References
Allen, F. E., J. Cocke, and K. Kennedy. 1981. Reduction of operator strength. In Program
flow analysis: Theory and application, edited by S. Muchnick and N. D. Jones. New York:
Prentice-Hall.
Markstein, P. Forthcoming. Strength reduction. In unpublished book on optimization,
edited by M. N. Wegman et al. Association for Computing Machinery.
Markstein, P., V. Markstein, and F. K. Zadeck. Forthcoming. In unpublished book on
optimization, edited by M. N. Wegman et al. Association for Computing Machinery.
McKeeman, W. M. 1974. Symbol table access. In Compiler construction: An advanced course,
edited by F. L. Bauer et al. Berlin, Germany: Springer-Verlag.
Simpson, L. T. 1996. Value-driven redundancy elimination. Ph.D. thesis, Computer Science
Department, Rice University.
Wegman, M. N., and F. K. Zadeck. 1985. Constant propagation with conditional branches.
Conference Record of the Twelfth Annual ACM Symposium on Principles of Programming
Languages, 291-299.
If a cohesive set of algorithms is not published by one of the researchers, that may be grounds for a second book to
complete these ideas. I would rather see a book by one of the researchers that addresses these issues in sufficient detail. But if they do
not, then we engineers must publish the work to help ourselves.
At this point the flow graph has been simplified by constant folding, elimination of
most redundant expressions, and algebraic simplification. The compiler now can do the
advanced transformations that are the basis of much recent research. These are the
transformations for improving the use of the memory caches and identifying parallel
computations.
If there is only one call on a function, then it can be expanded inline. This will
decrease the amount of function-call overhead without increasing program size. This
situation occurs with programs that are written in a top-down programming style. Such a
programming style encourages the writing of functions called only once. If the resulting
function is estimated to be larger than some size, such as the size of the fastest cache,
then the expansion should not be performed automatically.
If the compiler estimates that the size of the function body is smaller than the size
of the function call, then the function can be expanded inline. The resulting program will be
smaller and more efficient.
If a procedure has one call site that represents a large fraction of all calls of the
procedure, then that one call site can be expanded inline, whereas all other calls of the
function will be implemented normally. This inlines the frequently executed call site
without inlining all of the calls.
If the procedure has a flow graph that breaks into many small independent sections
of code, combined in a branching structure that looks like a case or switch statement on
one of the formal parameters, and that formal parameter is always a constant, then expand
the procedure inline. In each call, only a small amount of code will remain after dead-code
elimination.
Otherwise, specify a heuristic choice function based on the size of the flow graph
being inlined, the number of call sites, and the frequency information for the calls. If the
function is small enough, it can always be inlined. If the function is a little bit larger, is
frequently called, and has few call sites, then it can still be inlined.
How does the compiler perform in-line expansion? This compiler performs it in the
interprocedural analysis phase, so it has all of the flow graphs for the procedures
available. These flow graphs have had an initial level of optimization applied to them to
clean up the flow graphs. In-line expansion consists of the following steps (a code sketch follows the list):
1. Consider the call site where a function is to be inlined. Break the block containing
the call site into three parts: the portion before the call, the portion after the call,
and the call itself.
2. Replace the block containing the call itself by a copy of the flow graph for the called
procedure. In the process, rename the temporary names associated with the flow
graph so they are all different from the temporaries that occur in other parts of the
larger flow graph. This is a textual renaming problem that can be solved as the
copy of the called flow graph is created.
3. In the entry block of the copied flow graph (the block that was the entry block of
the called procedure), insert copy operations to copy the actual parameters into
the temporaries representing the formal parameters of the called procedure.
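A sketch of these three steps follows; the flow-graph helpers (split_block, copy_with_fresh_temporaries, replace_block, prepend_copy) are hypothetical names invented for this illustration, not interfaces defined by the book.

    def inline_call(graph, call_site, callee):
        # Step 1: isolate the call in a block of its own.
        before, call_block, after = split_block(graph, call_site)
        # Step 2: splice in a copy of the callee's flow graph, renaming every
        # temporary so that it differs from the temporaries already in graph.
        body = copy_with_fresh_temporaries(callee)
        replace_block(graph, call_block, body)
        # Step 3: copy the actual parameters into the (renamed) temporaries
        # for the formal parameters at the entry block of the copied graph.
        for formal, actual in zip(body.formal_temporaries, call_site.actuals):
            prepend_copy(body.entry_block, source=actual, target=formal)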
For C, this form of in-line expansion is sufficient since all formal parameters are
passed by value. For languages with pass-by-reference, a different mechanism should be
used. Consider a formal parameter X that is bound to an actual parameter A(I). One could
compute the address of A(I) and copy it into a temporary within the inlined called
procedure. Each reference to X within the procedure can then be replaced by a
dereference of that pointer; however, all information about the array A has been lost.
Instead, a more complex mechanism is helpful. Identify all loads and stores of X
within the procedure. These are all simple load or store operations. To bind X to A(I),
create a new temporary T to hold the value of I and replace each simple load of X by an
array load of A(T). Similarly, replace each store of X by an array store of A(T). This
matches the semantics of pass-by-reference and keeps all information about A available
for use. The temporary for T can be eliminated by optimization.
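As a sketch of this mechanism, with loads_of and stores_of standing in for whatever analysis locates the simple LOAD and STORE instructions for the formal parameter (both helpers, and the instruction methods, are assumptions of the illustration):

    def bind_by_reference(body, X, A, I):
        # T holds the value of the subscript I at the point of the call.
        T = body.new_temporary()
        body.entry_block.prepend_copy(source=I, target=T)
        for load in loads_of(body, X):
            load.replace_with_array_load(array=A, index=T)    # X becomes load A(T)
        for store in stores_of(body, X):
            store.replace_with_array_store(array=A, index=T)  # X becomes store A(T)
        # The temporary T can often be removed later by optimization.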
Before leaving in-line expansion, let's touch on one problem, observed by a number
of implementers and studied by Keith Cooper (Cooper, Hall, and Torczon 1992), which
involves decreases in performance that may occur with in-line expansion. The problem is
exemplified by the procedure DAXPY in the LINPACK library. This is a simple loop to
compute the sum of two vectors, where the second one is multiplied by a constant. If this
procedure is expanded inline in LINPACK, the program may run slower. Why? This is a
Fortran program, so the compiler (using the language rules) can assume that the formal
parameters (dummy arguments) do not reference the same memory locations. However,
LINPACK calls DAXPY with two of the arguments referencing the same array. The
compiler is in a tough spot: Since the expressions are no longer formal parameters, it
cannot assume that they are distinct and will generate slower code to ensure that the
program runs properly.
There is no easy solution to this problem. The alias information needs to be
expanded to record that within these particular blocks in the flow graph, the compiler may
assume that two references to the same array are actually distinct locations. This would
require flow-sensitive alias information, which is beyond the scope of this book (or the
knowledge of the author).
As you can see, there are several different ways that the stores into A(I) affect and
are affected by the load operations from other elements of A. The loads and stores
involving other arrays do not affect the load or stores into the array A unless the memory
for the arrays overlaps.
Definition
Dependence:
Consider two instructions in the flow graph, I1 and I2. The instruction I2 is
dependent on I1 if and only if there is a path in the flow graph from I1 to I2 and the
instructions might reference the same memory location. There are several categories of
dependence: a true (or flow) dependence, in which I1 stores into the location and I2 loads
from it; an antidependence, in which I1 loads from the location and I2 stores into it; and
an output dependence, in which both instructions store into the location.
The compiler performs an analysis called dependence analysis to determine the character
of each of the loops, and records this information in a data structure called the
dependence graph. The dependence graph contains the set of instructions in the flow
graph as nodes. There is an edge between two instructions in the following cases:
When there is a true, anti-, or output dependence between the instructions. In such
cases, one of them must be a store operation and the other either a load or store
operation.
There is a dependence between two instructions I1 and I2, and therefore an edge,
if I1 evaluates a value into a temporary T and I2 uses that temporary.
The compiler builds the dependence graph by considering each pair of memory
reference instructions. Consider the first loop in Figure 9.4. There are two memory
reference instructions: the store into A(I) and the load from A(I - 1). The compiler must first
check to see if there is a path between the two instructions and then check to see if there
is any situation where the two references might refer to the same memory location. This
need not happen on the same iteration of the loop, so we are looking for two values I and
I′ such that

    1 ≤ I ≤ N,   1 ≤ I′ ≤ N,   I = I′ - 1

(the figure is not shown here; its loop bounds are taken to be 1 and N). The value of I is
the index for the store operation, and I′ is the index for the load
operation. The first two inequalities indicate that the indices must be within the loop
bounds. The equality indicates that the subscripts must be the same to be referencing the
same location. Clearly there are values where this set of conditions is satisfied, so there is
a dependence. The reference to the memory location by the load occurs on the next
iteration after the store.2
2 The compiler writer wishes to find no dependencies; in other words, one wants no solutions to exist. This will mean that
there are no dependencies and the compiler therefore has the maximum flexibility in reordering the instructions.
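Because the example is so small, the system above can even be checked by enumeration. The following sketch assumes the loop bounds 1 and n used in the reconstruction above; it is an illustration of the conditions, not a practical dependence tester.

    def has_dependence(n):
        # Store index i, load index i2; the references touch the same
        # location exactly when the subscripts agree: i == i2 - 1.
        return any(i == i2 - 1
                   for i in range(1, n + 1)
                   for i2 in range(1, n + 1))

    print(has_dependence(10))   # True: the load reads the value stored
                                # on the previous iteration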
If the array is multiply subscripted, then there is one equation for each subscript.
Assuming that the language standard specifies that array subscripts must be in bounds,
the only way that two memory reference instructions can reference the same location is if
each of the subscripts has identical value.
If there is more than one enclosing loop, then there would be extra pairs of loop
indices: one for each loop, thus generating more inequalities. The problem is to solve both
the set of equalities and inequalities simultaneously. There are four distinct techniques for
doing this in the literature:
1. Ignore the inequalities and see if there are integer solutions to the equalities. This
approach involves the use of Diophantine equation theory from number theory;
however, it works well for single subscripts and loops.
2. Ignore the fact that the compiler is looking for integer solutions and consider the
problem as a real-valued problem. Consider the inequalities as defining the domain
of a real function, and the difference of the left side and right side of the equations
as defining a real-valued function. This reduces the problem to determining that
there is a zero in the domain. That will be true if the maximum is positive and the
minimum is negative. There are clever formulas for estimating the maximum and
minimum.
3. A more recent general method called the Omega test (Pugh 1992) will replace
these two techniques. It uses a specialized form of integer programming solution
that works well with these particular problems. It is much more precise than the
previous two techniques.
4. Below you will see a simple test that works the vast majority of the time, leaving the
other tests to deal with the difficult problems.
The dependence test given here is sufficient for software pipelining, discussed later. The
test is described in terms of a doubly nested loop and a triply subscripted array. Consider
the loop in Figure 9.5. It has been made overly complicated to show all of the possibilities.
Note that the same loop index is used in the corresponding subscript positions in
references to A and that at most one loop index is used in any subscript position. The
lowercase letters refer to constants in the program, which are not specified so that a
general formula can be obtained. In summary:
Each subscript has the form <constant> * <loop index> + <another constant>
The same loop index occurs in the same subscript location. This does not mean
that the outermost index occurs in the first subscript; it means that if a loop index occurs in
one of the positions, then it occurs in the same position in the other array reference.
Figure 9.5 Simple Dependence Test
For this example, one can write down the inequalities and equalities as done
earlier. In this case, they will give
This seems complex; however, one can directly solve for I - I′ and J - J′, getting
In this case, there is an integer solution if and only if each denominator divides the
numerator evenly and both expressions for I - I′ give the same answer. This is adequate
to determine most dependencies. Of course the difference must also be less than the
upper bound of the loop minus the lower bound so that there are two iterations that
provide the difference.
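The test can be sketched as follows. Each subscript position that uses a loop index contributes a triple (c, d, d2) for subscripts of the form c*index + d in the first reference and c*index + d2 in the second, and the triples are grouped by the loop index they constrain. This is an illustration of the test as described (assuming each c is positive), not the book's code for Figure 9.5.

    def simple_test(groups, trip_spans):
        # groups: for each loop index, the list of (c, d, d2) triples.
        # trip_spans: for each loop index, upper bound minus lower bound.
        for triples, span in zip(groups, trip_spans):
            diff = None
            for c, d, d2 in triples:
                if (d2 - d) % c != 0:
                    return False      # no integer solution: independent
                k = (d2 - d) // c     # candidate value of index - index'
                if diff is not None and k != diff:
                    return False      # inconsistent differences: independent
                diff = k
            if diff is not None and abs(diff) > span:
                return False          # difference exceeds the loop's trip span
        return True                   # a dependence could not be ruled out

A result of False means the references are independent; True only means this simple test could not rule a dependence out, and one of the stronger tests above must decide.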
The laundry list of transformations has already been given in Chapter 2. Instead of
repeating the list, we will give the intent of each of the transformations in the following list:
Scalar replacement: As shown above, if the compiler can identify that a value is stored on
the previous iteration of the loop, then the loop can be rewritten, eliminating the load and
keeping the value in a temporary. This may involve some loop unrolling. (A small
illustration follows this list.)
Loop interchange: Besides decreasing the load and store operations, the compiler
wants to improve the chances that two references are in the same cache line. Thus
the loops may be interchanged so that consecutive values are loaded.
Loop fusion: If there is a sequence of loops, then they may be combined. This will
decrease loop overhead. If the loops load the same values it may also remove a
number of load operations.
Loop distribution: If a large loop uses too many cache lines, the compiler can divide
the loop into multiple loops (under the correct circumstances). This will decrease
the cache usage, making it easier to fit the data in the cache.
Unroll and jam: Sometimes two iterations of the outer loop can be merged,
decreasing the number of load operations if some load operations are shared
between iterations.
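As the small illustration promised above, consider the recurrence from the earlier dependence example. The rewrite shows the effect of scalar replacement, not the compiler's algorithm for finding it:

    def before(a, b, n):
        for i in range(1, n):
            a[i] = a[i - 1] + b[i]   # reloads a[i-1] on every iteration

    def after(a, b, n):
        t = a[0]                     # load once, before the loop
        for i in range(1, n):
            t = t + b[i]             # the value is carried in a scalar
            a[i] = t                 # the load of a[i-1] has been eliminated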
9.8 References
Callahan, D., K. D. Cooper, K. Kennedy, and L. Torczon. 1986. Interprocedural constant propagation. Proceedings of
the SIGPLAN Symposium on Compiler Construction, Palo Alto, CA. Published as SIGPLAN Notices 21(7): 152-161.
Cooper, K. D., M. W. Hall, and L. Torczon. 1992. Unexpected side effects of inline substitution: A case study. ACM
Letters on Programming Languages and Systems 1(1): 22-32.
Cooper, K., and K. Kennedy. 1988. Interprocedural side-effect analysis in linear time. Proceedings of the SIGPLAN 88
Symposium on Programming Language Design and Implementation, Atlanta, GA. Published as SIGPLAN Notices 23(7).
Cooper, K., and K. Kennedy. 1989. Fast interprocedural alias analysis. Conference Record of the Sixteenth Annual
Symposium on Principles of Programming Languages, Austin, TX.
Hall, M. W. 1991. Managing interprocedural optimization. Ph.D. Thesis, Computer Science Department, Rice University.
Hall, M. W., and K. Kennedy. 1992. Efficient call graph analysis. ACM Letters on Programming Languages and Systems
1(3): 227-242.
Hall, M. W., K. Kennedy, and K. S. McKinley. 1991. Interprocedural transformations for parallel code generation.
Proceedings of the 1991 Conference on Supercomputing, 424-434.
Pugh, W. 1992. The omega test: A fast and practical integer programming algorithm for dependence analysis.
Communications of the ACM 35(8): 102-114.
Ryder, B. G. 1979. Constructing the call graph of a program. IEEE Transactions on Software Engineering SE-5(3).
Torczon, L. 1985. Compilation dependencies in an ambitious optimizing compiler. Ph.D. thesis, Computer Science
Department, Rice University, Houston, TX.
Wolfe, M. 1996. High performance compilers for parallel computing. Reading, MA: Addison-Wesley.
COPY instructions play two roles in expression trees. They occur at the roots of the
tree, copying the result into the destination temporary or setting up for a STORE
instruction. COPY instructions are the most difficult to move in the flow graph because
moving the copy operation depends on both the uses and evaluation points of the
temporaries involved in the copy. The copy operations will be moved toward the Entry
block to a point of less frequent execution. This optimization will happen rarely; however, it
is important when it can happen. During this transformation, the copy operation is
represented by a pair: the source temporary and the target temporary. This pair uniquely
represents the instruction.
STORE instructions will be moved either toward Entry or toward Exit. In effect, they
do not occur in the expression trees. Rather, they occur after the copy operation at the
root of the tree. During the transformation, the store operation is represented by the
temporary being stored. This is a unique representation for all store operations; however,
it conflicts with the use of the same temporary for load operations, so separate sets of
global and local information are computed for store operations.
Computational instructions are pure functions that compute a value depending only
on their operands. These instructions will be moved toward Entry to a point of less
frequent execution. We will see that by using this form of partial redundancy elimination
these computations can be moved independent of LOAD, STORE, and other
computational instructions. During the transformation, the instruction is represented by the
destination temporary, which uniquely represents the instruction by the conventions for the
flow graph.
The final class of instructions consists of those that determine the structure of the flow
graph, such as procedure calls and branching instructions. These instructions are left in
place by these optimizations and are not moved.
The optimization algorithm computes points in the flow graph where the insertion of
evaluations of T will make all present evaluations of T redundant. Then the algorithm
eliminates all of the original evaluations. If the algorithm attempts to insert an evaluation
into a block that already contains an evaluation, neither the insertion nor the deletion will
be performed.
There are three different algorithms depending on the class of instructions to be
optimized. The rest of this chapter will describe the algorithms and apply them to the
running example.
Partial redundancy elimination (PRE): This transformation includes moving
expressions out of loops and eliminating redundant expressions. The compiler determines
new positions in the flow graph at which to insert evaluations of temporaries to make all of
the original evaluations redundant. The particular form of partial redundancy used here is
a derivative of lazy code motion, which ensures that computations are not moved unless
there is some path that guarantees a decrease of computations.
Strength reduction (SR): Using a modification of the partial redundancy algorithm,
the compiler can further optimize temporaries that are functions of induction temporaries.
This handles some of the strength-reduction situations that were not handled earlier in the
dominator-based transformation phases.
Load-store motion (LSM): Another generalization of partial redundancy elimination,
this technique will move loads of data backward out of loops and move stores forward out
of loops in some situations in which the data changes in the loop (that is, the data is not
loop invariant).
The original algorithm was developed by Knoop, Rüthing, and Steffen (1992). We will use a variant developed by Drechsler
and Stadel (1993) because it fits the compiler framework we have developed better.
2 Developed by Etienne Morel (Morel and Renvoise 1979).
We will consider the insertion of evaluations of T on the edges of the program flow
graph. This means that one needs to evaluate T if one traverses that edge. If the tail of the
edge has only one successor, this is the same as inserting the instruction at the end of the
tail block. If the head of the edge has only one predecessor, this is the same as inserting
the instruction at the beginning of the head block. If evaluations of T are going to be
inserted on each edge into a block, then
insert the evaluation in the beginning of the block instead. Otherwise, if the instruction is to
be inserted on an edge, create an empty basic block with one successor, the head of the
original edge. Make the new block be the successor of the tail. In effect, one has spliced
an empty basic block into the middle of the edge.
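A sketch of this edge-splitting rule follows; the flow-graph helpers are invented for the illustration.

    def insert_on_edge(graph, P, S, evaluation):
        if len(graph.successors(P)) == 1:
            P.append(evaluation)          # the edge is the only way out of P
        elif len(graph.predecessors(S)) == 1:
            S.prepend(evaluation)         # the edge is the only way into S
        else:
            # Splice an empty block into the middle of the edge.
            B = graph.new_block(instructions=[evaluation])
            graph.redirect_edge(P, S, through=B)

The rule about inserting at the beginning of a block when every incoming edge receives an insertion is a decision made by the caller of such a routine, before any edges are split.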
Now consider an arbitrary edge (P, S). Under what conditions would it be the
earliest point at which to insert an evaluation of T?
First, T must be anticipated at the beginning of S. If it is not, then the insertion is not
safe since there is some path to an exit that does not already contain an evaluation of T.
T should not be available at the end of P. If it is available at the end of P, then there
is no point in inserting a new copy, since it would only create two evaluations in
succession on some path.
There must be some reason that the computation cannot be placed earlier. There
are only two possibilities: Either T is killed in the preceding block P or T is not anticipated
at the end of P. The second condition means there is a path out of P that does not contain
an evaluation of T.
We can directly translate these conditions into a set of equations, as in Figure 10.3.
Unfortunately, the intuition above does not constitute a proof that placing evaluations of T
at these points will make all original evaluations redundant or decrease the number of
evaluations executed.
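Figure 10.3 is not reproduced here, but one rendering of the three conditions above that is consistent with the Drechsler and Stadel formulation, writing KILLP for "T is killed in P," is

    EARLIESTP,S = ANTINS ∧ ¬AVOUTP ∧ (KILLP ∨ ¬ANTOUTP)

where ANTINS and ANTOUTP denote anticipation at the start of S and the end of P, and AVOUTP denotes availability at the end of P.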
Lemma E1
Consider any path from Entry to a block B with ANTLOCB = true. Then either
There is an edge (P,S) with EARLIESTP,S = true and there are no evaluations of T
or instructions killing T between the beginning of S and the beginning of B on the path, or
There is some block C on the path that contains an evaluation of T and there are no
instructions that kill T on the path between the evaluation in C and the beginning of B.
Informally, this means that on each path through the flow graph each original
evaluation of T is either redundant on the path or preceded by an edge where EARLIEST
is true. So placement of evaluations at each point where EARLIEST is true will make all
original evaluations redundant.
Proof
Since we have two conditions, we will assume that one is false and prove that the
other is true. Assume that walking backward along the path from B one reaches either
Entry or an instruction killing T before reaching an evaluation of T. In other words, assume
that the second condition is false. Consider the subpath consisting of the part of the
original path from the killing instruction (or Entry) to B.
First note that AVOUT is false for each block on this subpath. Remember that
AVOUT is true only if there is an evaluation on each path leading to the point. The
subpath we are considering is one such path that begins either with Entry or an instruction
that kills T. If AVOUT is true, there must be some evaluation of T on the subpath,
contradicting the assumption made at the beginning of the proof.
Now consider the two cases:
Assume that the subpath goes all of the way back to Entry. Now start walking
backward on the subpath again, starting at B. T is anticipated at B since ANTLOCB
= true and there are no instructions on the subpath that kill T. Walk backward along
the path until one comes to a block S where T is anticipated at the beginning of S
but not at the end of its predecessor, P. Go back and look at the formulas for
EARLIEST. They are satisfied for (P, S). If T is anticipated at the beginning of each
block, simply make P be Entry and S its successor on the path.
Assume that the subpath starts with a block C that does kill T. Again perform the
backward walk starting at B looking for the first block S where T is anticipated in S
but not in its predecessor P. Then the formulas for EARLIEST are satisfied. If there
is no such block, then make P be C and S be the successor of C. Again the
formulas are satisfied because C kills T, satisfying the conjunction.
Lemma E4
Let (P, S) and (Q, R) be two edges in the flow graph. Assume that there is a path
from S to Q that contains no instructions that kill T. If EARLIESTP,S = true and
EARLIESTQ,R = true, then there is an evaluation of T on this path.
Proof
Look at the equations for EARLIEST. EARLIESTQ,R = true and Q is part of the
path, so Q does not kill T. So ANTOUTQ = false. Assume there is no evaluation of T on
the path. Then the path from S to Q, extended past Q, does not contain an
evaluation of T before an instruction killing T. So ANTINS = false by the definition of
anticipation. This is a contradiction, since EARLIESTP,S = true requires that ANTINS = true.
Theorem E
Consider the program transformation in which evaluations of T are inserted on each
edge (P,S) where EARLIESTP,S = true and the initial evaluation of T in each block B where
ANTLOCB = true is deleted. This transformation satisfies the following conditions:
Safety:
Every path from a point of insertion to Exit arrives at a point of evaluation of T
before Exit or an instruction that kills T. Thus no new side effects are generated.
Correctness:
Each path from Entry to a block B where ANTLOCB = true contains an edge (P,S)
where EARLIESTP,S = true. Thus the evaluation of T at the beginning of that block is
redundant and can be deleted.
Profitability:
The number of evaluations of T on any path from Entry to Exit after the insertions
and deletions is less than or equal to the number of evaluations before the transformation.
Proof
Safety.
By the definition of EARLIEST, T is anticipated at S, so there is an evaluation of T on
each path out of S. Hence the transformation is safe.
Correctness.
Consider any path from Entry to B where ANTLOCB = true. By Lemma E1, either
there is an edge (P,S) where EARLIESTP,S = true not followed by a killing instruction, or
an evaluation of T occurs on the path not followed by a killing instruction. In the first case,
an evaluation of T is going to be inserted on (P,S) that will satisfy the criteria for
availability. If there is an evaluation of T on the path in some block C, then there are two
possibilities. If C contains a killing instruction, then that evaluation of T will not be deleted
and satisfies the criteria for availability. If C does not contain a killing instruction for T, then
ANTLOCC = true and we can repeat the process on the same path, but only considering
the path from Entry to C. This process is repeated until a condition for availability is found.
The path keeps getting shorter and there are only a finite number of evaluations of T on
the path. If one reaches the first evaluation of T on the path without finding the criteria for
availability, then Lemma E1 indicates that an earlier edge must have EARLIEST being
true, so the condition is eventually satisfied.
Profitability.
Consider any path from Entry to Exit. Let I1, . . . , Im be the instructions on the path
that are either evaluations of T or instructions that kill T. Let I0 be a pretend instruction at
the beginning of the path that kills T. Now consider each pair of instructions Ik and Ik+1.
If Ik+1 is an instruction that kills T, then T is not anticipated between the two
instructions, so by the equations for EARLIEST there is no edge where an evaluation will
be inserted.
Suppose Ik and Ik+1 are both evaluations of T. Since we are assuming that local
optimization has been performed in a block, we know that Ik+1 is in a different block than
Ik. So Ik+1 is at the beginning of the block. Thus Ik+1 will be eliminated by the
transformation. Note that only one insertion can occur between the two instructions
because of Lemma E4 (there would be no evaluation between the two), so we have one
deletion and at most one insertion.
Suppose Ik is an instruction that kills T and Ik+1 is an evaluation of T. If they both
occur in the same block then there will be no insertions or deletions. Assume they occur in
separate blocks; thus again Ik+1 is at the beginning of a block and will be deleted by the
transformation. Again there is at most one insertion and one deletion.
Following the last evaluation on the path, T is not anticipated, so there are no
insertions following the last evaluation.
In summary, the worst case is that there is one deletion for each insertion, that is,
profitability is satisfied.
In one sense this transformation is optimal. There is no other set of insertions that
is safe, correct and profitable, and that will involve fewer evaluations of T in the
transformed flow graph. Later we will see another transformation that is better in a
different way.
Theorem EO
Consider another transformation that is safe, correct, and profitable in the sense of
Theorem E. The number of evaluations of T on any path from Entry to Exit after this
transformation will be no less than the number of evaluations of T after the EARLIEST
transformation.
Proof
The argument is much like the argument for profitability. Consider a path from Entry
to Exit and list the instructions I1, . . . , Im on the path that are either evaluations of T or
instructions that kill T. Pretend that there is an instruction I0 that kills T at the beginning of
the path. Now consider each pair of instructions.
If Ik+1 is an instruction that kills T, then T is not anticipated between the two
instructions. Since both transformations are safe, neither will insert instructions on any
edge between the two instructions.
Suppose Ik kills T, and Ik+1 is an evaluation of T. If both are in the same block,
then there is no modification of the two instructions and no insertion between them, so
assume that they are in different blocks. Thus Ik+1 is at the beginning of a block and will
be deleted, so both transformations must insert an evaluation of T between the two
instructions. However, EARLIEST will insert only one, by Lemma E4.
The case where Ik and Ik+1 are evaluations of T is the difficult case. We are
considering one path so we do not know whether T is available or anticipated on the
whole path since that involves the flow graph rather than the path: There may be edges
entering and leaving the path. Again we are only interested in the case in which the
instructions are in separate blocks, since local optimization will remove one of them
otherwise. Thus Ik+1 is locally anticipated in its block. Recall that there are no instructions
killing T between these two instructions. Walk backward toward Ik until you find an edge
(P, S) where T is anticipated at the beginning of S and not anticipated at the end of P. If
no such edge exists, then EARLIEST will make no insertions. If the edge does exist,
consider two further cases:
If T is available at the end of P, then there is no insertion for EARLIEST by the
defining equations.
Otherwise, T is not available at the end of P, so EARLIEST will insert an evaluation
on (P, S) and no other insertions between the two instructions, by Lemma E4. We must
show that the other transformation must insert a computation between the two instructions
also. Since T is not available at the end of P, there is a path from Entry to P that contains
no evaluation of T after the last instruction killing T. This can be pieced together with the
current path between P and Ik+1. To satisfy correctness (that is, make T redundant at
Ik+1), the other transformation must insert a computation on this constructed path after the
last killing instruction. This instruction cannot be before P because there is a path out of P
to either Exit or a killing instruction. That means the inserted evaluation is on the path from
Ik to Ik+1, proving that at least one insertion happens on this path.
Consider the portion of the path from Im to Exit. T is not anticipated at any point on
this path, so neither transformation will insert an evaluation of T because of safety.
In summary, in each case where EARLIEST inserted an evaluation of T, the other
transformation was forced to insert an evaluation of T, thus satisfying the theorem.
Investigating the proof of this optimality theorem reveals the reason that this
transformation is called EARLIEST. It inserts evaluations of T at the earliest possible
points that are safe, profitable, correct, and guarantee the fewest number of evaluations.
Example
In the flow graph of Figure 10.1, the temporary T will not have any insertions and
the evaluation in block B2 will be deleted. The temporary S has an insertion on edge (B1,
B2) and the evaluation in block B2 is deleted. Now consider a hypothetical evaluation of a
constant in block B4 where the constant is not evaluated anywhere else. There are no
instructions that kill a constant, so an evaluation will be inserted on the edge (B0,B1) and
the evaluation in block B4 will be eliminated. This is the weakness of EARLIEST: It can
evaluate temporaries long before they are needed.
The EARLIEST transformation has been included here for two reasons. The
primary reason is that it is preliminary to the LATEST transformation, which we will now
describe, and its proofs lead one gradually to the proof techniques for LATEST.
Secondarily, the compiler is going to use the EARLIEST transformation later during
register allocation to move register-spilling instructions to earlier points in the flow graph.
In that case, moving instructions further will free up more registers and be better.
As with most Boolean equations the compiler encounters, there is not a unique
solution to the equations in Figure 10.4. Consider a loop that contains no evaluations of T.
As with the equations for anticipatability and availability, one gets different values if one
assumes that the values in the loop are true or false. In this case we are trying to push a
value back through the loop to before the loop, so we want the maximal solution such that
going around a loop does not force the value to false.
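Figure 10.4 is likewise not reproduced here. A reading that is consistent with the discussion, and with the way LATER, LATERIN, INSERT, and DELETE are used in the lemmas below, is

    LATERINB = the conjunction of LATERP,B over all predecessors P of B (false for Entry)
    LATERP,B = EARLIESTP,B ∨ (LATERINP ∧ ¬ANTLOCP)
    INSERTP,S = LATERP,S ∧ ¬LATERINS
    DELETEB = ANTLOCB ∧ ¬LATERINB

where the INSERT and DELETE equations belong to Figure 10.5, and the maximal solution of the LATERIN system is the one intended.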
Lemma L1
Consider a block B. LATERINB = true if and only if each path from Entry to B
contains an edge (P, S) where EARLIESTP,S = true and there are no evaluations of T or
instructions that kill T between the beginning of S and the beginning of B on this path.
Proof
Assume LATERINB = true. Consider any path from Entry to B. Let B′ be the
predecessor of B on this path. LATERB′,B = true by the equations for LATER. This
means that either EARLIESTB′,B = true, satisfying the condition, or LATERINB′ = true and
B′ contains no evaluations of T. Walking backward on the path, repeating this argument
for each block, one eventually must come to an edge (P, S) where EARLIESTP,S = true
and there are no instructions that evaluate T between the start of S and the start of B.
Now EARLIESTP,S = true means that T is anticipated at S and there are no instructions
between S and B that evaluate T, so T must be anticipated at B and there must be no
instructions between S and B that kill T (otherwise it would not be anticipated at S). So the
condition is satisfied.
Assume that each path from Entry to B contains an edge (P, S) where
EARLIESTP,S = true and there are no instructions between the start of S and the start of B
that evaluate or kill T. To show that LATERINB = true, we assume that LATERINB = false
and derive a contradiction. Look at the equations.
For LATERINB = false there must be a predecessor P0 of B such that LATERP0,B =
false. By the equations, EARLIESTP0,B = false, and either P0 contains an evaluation of T
or LATERINP0 = false. If P0 contains an evaluation of T, consider any path from Entry
through P0 to B: the qualifying edge assumed for this path must precede the evaluation in
P0 (the only later edge is (P0, B), where EARLIEST is false), so the evaluation lies
between that edge and B, a contradiction. Otherwise LATERINP0 = false and the argument
repeats at P0; walking backward, one eventually reaches Entry, constructing a path from
Entry to B that contains no edge satisfying the assumed condition, again a contradiction.
Thus LATERINB = true.
Lemma L2
Consider a block B and one of its predecessors B′. LATERB′,B = true if and only if
each path from Entry to B′ contains an edge (P, S) where EARLIESTP,S = true and there
are no evaluations of T or instructions that kill T between the beginning of S and the end
of B′ on this path.
Proof
The argument is the same as in Lemma L1. The only addition is that one is dealing
with the end of the block B′ rather than the beginning of the block B. In this case either
EARLIESTB′,B is true (in which case the lemma is automatically satisfied) or B′ contains no
instructions that kill or evaluate T, so the argument for Lemma L1 can be used directly.
Figure 10.5 Insertion and Deletion Equations
We must now repeat the argument for correctness, profitability, and safety that we
made for the EARLIEST algorithm, but now for the LATEST algorithm.
We start with safety. That means that any insertion that is made must lead to an
evaluation of T on each path leaving the point of insertion. In other words, we must show
that T is anticipated at the head of the edge where an insertion occurs.
Lemma L3
If INSERTB′,B = true, then T is anticipated at the start of B.
Proof
INSERTB′,B = true means that LATERB′,B = true and LATERINB = false (look at the
formulas). By Lemma L2, LATERB′,B = true means that each path from Entry to B′
contains an edge (P, S) where EARLIESTP,S = true and there are no instructions that
evaluate or kill T between the start of S and the end of B′. We proved earlier that
EARLIESTP,S = true means that T is anticipated at S. Now there are no evaluations of T
between S and the end of B′. Assume T is not anticipated at B. Then there is a path from
B to Exit that does not contain an evaluation of T before an instruction that kills T. Piece
that path together with the path from S to B, and one has a path from S to Exit that does
not contain an evaluation of T before an instruction that kills T. So T is not anticipated at
S. Contradiction.
We next must prove that the insertions are correct. That means that there is an
undeleted or inserted evaluation of T on each path from Entry to an evaluation that is
deleted, so that T is known to be evaluated at the original points of evaluation.
Lemma L4
Assume that DELETEB = true. Then after the insertions have occurred, T is
available at B.
Proof
DELETEB = true means that ANTLOCB = true and LATERINB = false. We apply
Lemma L1. Consider any path from Entry to B. Either there is an edge (P, S) where
EARLIESTP,S = true with no instructions between S and B that evaluate or kill T, or there is
an evaluation of T on the path not followed by an instruction that kills T.
Consider the two cases.
If there is an evaluation of T on the path, then there are two cases. If that
evaluation does not have DELETE true, then the condition for availability of this path is
satisfied. If the evaluation does have DELETE true, then the same argument we are using
can be applied to that evaluation. Eventually we will reach either an evaluation with
DELETE false or the first evaluation on the path, at which point it will be impossible to
have a preceding evaluation.
Suppose EARLIESTP,S = true and there are no instructions between S and B that
evaluate or kill T. Then walk down the path from S to B investigating the values of
LATER and LATERIN. Since there are no instructions that evaluate T on the path, and
EARLIESTP,S = true, we see by the equations that LATER starts out being true and can
only become false by LATERIN becoming false. So walk the path until we find an edge
(P′, S′) where LATERINS′ = false. We must find such an edge by the time we reach B,
since LATERINB = false. By the formula for INSERT, there is an insertion on this edge.
Thus an insertion occurs without following instructions that might kill T.
These two cases together prove that T will become available at B.
We now know correctness and safety. We must prove that the algorithm is
profitable. Thus we must show that the number of evaluations of T does not increase on
any path from Entry to Exit. Before the proof, which is similar to the proof for EARLIEST,
we need to show that there must be an original evaluation of T in the flow graph between
any two insertions on a path without instructions that kill T.
Lemma L5
Consider a path from Entry to Exit. Let (P, S) and (Q, R) be two edges on the path
such that the path from S to Q does not contain instructions that might kill T. If INSERTP,S
= true and INSERTQ,R = true, then there is an evaluation of T on the path from S to Q.
Proof
We will use Lemma E4, the corresponding lemma about EARLIEST rather than
INSERT. INSERTP,S = true means that LATERP,S = true and LATERINS = false. So by
Lemma L2 there must be an earlier edge (P′, S′) on the path such that EARLIESTP′,S′ =
true and there are no instructions between S′ and P that evaluate or kill T. Similarly,
because INSERTQ,R = true there is an earlier edge (Q′, R′) where EARLIESTQ′,R′ = true
and there are no instructions between R′ and Q that evaluate or kill T.
Where is the edge (Q′, R′) in relation to the node S? (Q′, R′) must be later on the
path than S. Assume (Q′, R′) precedes S on the path. LATERINS = false, so Lemma L1
indicates that there is a path from Entry to S that does not contain an edge where
EARLIEST is true without following instructions that evaluate or kill T. Piece this path
together with the path from S to Q and we have a path from Entry to Q without an edge
with EARLIEST being true that is not followed by instructions that evaluate or kill T. But
this contradicts Lemma L2 and the fact that LATERQ,R = true.
Thus we have the edge (P′, S′) preceding S, which precedes the edge (Q′, R′),
which precedes Q. EARLIEST is true on these two edges and there are no instructions
that evaluate T between S′ and S or between R′ and Q. By Lemma E4, there must be an
evaluation of T between the two edges; however, the only place that evaluation can
occur is between S and Q. Thus we have an evaluation of T between the two edges
where INSERT is true.
Now we are ready to prove that the transformation is profitable. The proof is an
adaptation of the same proof for EARLIEST.
Lemma L6
Consider any path from Entry to Exit. The number of evaluations of T on the path
after the application of INSERT and DELETE is no greater than the number of evaluations
originally on the path.
Proof
Let I1, . . . , Im be the instructions on the path that are either evaluations of T or
instructions that kill T. Pretend that there is an initial instruction I0 at the beginning of the
path that kills T. Now consider each pair of instructions Ik and Ik+1.
If Ik+1 is an instruction that kills T, then T is not anticipated at any block or edge on
this piece of the path. Thus there is no edge where INSERT is true, so the number of
evaluations of T on this piece of the path is the same as the number of evaluations
originally.
Suppose Ik is an instruction that kills T, and Ik+1 is an evaluation of T. If both
instructions occur in the same block, then Ik+1 is not locally anticipated so there is no
insertion or deletion. Consider the case that they are in distinct blocks. Then Ik+1 is locally
anticipated in its block. Thus there is an edge (P, S) where EARLIESTP,S = true between
Ik and Ik+1. Thus LATERP,S = true. Now walk down the path from S toward Ik+1. Since there
are no instructions that can evaluate or kill T, the only way that LATER can become false
is if LATERIN becomes false. There are two cases:
1. Suppose there is an edge (P′, S′) between (P, S) and Ik+1 where LATERINS′ =
false. Then INSERTP′,S′ = true. Since there was only one edge where EARLIEST was true,
LATER and LATERIN remain false until we get to the instruction Ik+1. In this case,
DELETE is true for Ik+1. We have one insertion and one deletion; thus, there is no net gain
in evaluations.
2. If there is no such edge, we have LATERIN being true for the block containing
Ik+1, so there is no insertion or deletion and no net gain in evaluations.
Suppose Ik and Ik+1 are both evaluations of T. Since we are assuming that local
optimization has occurred, both evaluations are not in the same block, so Ik+1 is locally
anticipated in its block. There are three cases:
1. If there is no edge between the two instructions where EARLIEST is true, then
LATER cannot be true at any edge, nor LATERIN at any block, between the two
instructions. Thus LATERIN is false for the block containing Ik+1, so Ik+1 is deleted, thus
decreasing the number of evaluations by 1.
2. If there is an edge where EARLIEST is true between the two instructions, and
LATERIN is true for Ik+1, then LATERIN and LATER are true between that edge and Ik+1. If
they became false at any point there is no way for them to become true again, since there
is only one edge where EARLIEST is true. So there is no edge where INSERT is true and
there is no deletion. Thus there is no change in the number of evaluations.
3. If there is an edge where EARLIEST is true between the two instructions, and
LATERIN is false, then there is a first block where LATERIN is false and the edge
preceding it has INSERT being true. There can only be one such edge by Lemma L5. So
we have an insertion and a deletion, for no net change in the number of evaluations.
On the piece of the path after Im, T is not anticipated, so there can be no insertions.
We have investigated all possible segments, and in each case there was either no
increase in the number of evaluations or a decrease in the number of evaluations.
Therefore the number of evaluations on the whole path after the transformation is no
larger than the number of evaluations originally on the path.
We now know that the transformation is correct, safe, and profitable. Like
EARLIEST, it is also optimal in that it generates the minimum number of evaluations
possible.
Theorem LO
Consider another transformation that is safe, correct, and profitable in the sense of
Theorem E. The number of evaluations of T on any path from Entry to Exit after this
transformation will be no less than the number of evaluations of T after the
INSERT/DELETE transformation.
Proof
We could construct the proof using the same techniques used in Theorem EO;
however, a simpler observation makes the job easier. Look at the previous proof of
profitability for INSERT/DELETE. In the case analysis, whenever an insertion from
EARLIEST occurred, one of two cases happened with LATEST:
There was an insertion due to INSERT being true. In that case, just like the
EARLIEST case, there was a deletion of the succeeding evaluation of T.
There was an insertion by EARLIEST; however, LATEST pushed the insertion all
the way down to the next evaluation of T. This happened when LATERIN was true for the
block containing the next evaluation. In that case there was no insertion or deletion.
In other words, the number of evaluations after INSERT/DELETE is exactly the
same as the number of evaluations after EARLIEST. Since EARLIEST is optimal, so is
INSERT/DELETE.
We now know that INSERT/DELETE is as good as the EARLIEST transformation.
Now we show that it is better by showing that the inserted evaluations are as close to the
original evaluations as possible.
Theorem LC
Consider any other algorithm INSERT′/DELETE′ for insertion and deletion of
evaluations of T. Assume that INSERT′/DELETE′ is safe, correct, profitable, and optimal
in the sense that for each path from Entry to Exit the number of evaluations of T after the
transformation is the same as for EARLIEST or INSERT/DELETE. Consider a path from
Entry to Exit with instructions I0, I1, . . . , Im, Im+1, where I0 is a pretend instruction at Entry
that kills T, Im+1 is a pretend instruction at Exit that kills T, and the Ik are all other
instructions that either evaluate or kill T. For INSERT and INSERT′, handle an evaluation
of T at the beginning of a block that is not deleted as an insertion on the preceding edge
followed by a delete. Consider any pair of instructions, Ik and Ik+1, on the path. If any one
of the three transformations inserts an evaluation between Ik and Ik+1, then all three do,
and the insertion for EARLIEST occurs before or at the same edge as the insertion for
INSERT′, and the insertion for INSERT′ occurs before or on the same edge as the
insertion for INSERT.
This is an involved statement of the fact that EARLIEST inserts computations as far
away as is possible and INSERT inserts evaluations as late as possible while still
producing an optimal number of evaluations. Any other optimal transformation must be
somewhere in between. A nonoptimal transformation can perform insertions after
INSERT: just consider the transformation that makes no insertions or deletions. This
nonoptimal transformation has its insertion at the last edge.
Proof
Go back and look at the proof of Theorem EO. We proved that between any two
instructions Ik and Ik+1, if EARLIEST made an insertion then any other safe, correct,
profitable transformation had to make at least one insertion. Thus INSERT′/DELETE′
must make at least one insertion whenever EARLIEST does. It cannot make more than
one insertion or make an insertion when EARLIEST does not, since then a path
including the path from Ik to Ik+1 would contain more evaluations than for EARLIEST.
Since LATEST makes an insertion whenever EARLIEST does, this proves the first part of
the theorem.
Consider a pair of instructions Ik and Ik+1 where all three of the transformations
perform an insertion. Since INSERT′ is safe, the block at the head of its insertion edge
must have T anticipated. EARLIEST performs an insertion on an edge (P, S) where T is
anticipated at S and either T is not anticipated at the exit of P or T is killed in P. So the
insertion for INSERT′ must be either on the same edge as (P, S) or on a later edge,
because T is not anticipated at any earlier edge.
Consider the edge (R, Q) where INSERT inserts the evaluation. By the equations,
LATERR,Q is true and LATERINQ is false. The only way that LATERIN could have become
false is if some other edge had LATER false. By Lemma L1, this means that there is a
path from some evaluation of T, call it J, to Q that does not contain any insertion due to
EARLIEST. Consider a new path from Entry to Exit that includes the path from J to Q, the
path from Q to Ik+1, and any path on to Exit. If the insertion for INSERT′ between Ik and
Ik+1 follows Q, then the path from J to Ik+1 contains an insertion due to INSERT′ but no
insertion due to EARLIEST or INSERT/DELETE, contradicting the first part of the
theorem. Thus the insertion for INSERT′ must precede the insertion for INSERT.
If T is locally available at the end of a block, then T′ must be available at the end of
the same block. Since some optimization has occurred earlier, T′ might not be
locally available; however, it must have been previously computed on all paths. As
a formula, T ∈ AVLOC(B) implies that T′ ∈ AVOUT(B).
Proof
Consider a point p where T is available, and consider any path from Entry to p.
Since T is available, there is a point q on the path where an evaluation of T occurs. By the
assumptions above, T′ is available at q, so there is an earlier point r where an evaluation
of T′ occurs. There is no instruction between r and q that kills T′ and there is no instruction
between q and p that kills T. Since an instruction that kills T′ kills T, we have no instruction
between r and p that kills T′. Since this can be argued for all paths, we have T′ available at
p.
Lemma S2
If T is anticipated at p, then either T′ is available at p or T′ is anticipated at p.
Proof
Since we have two alternatives, we will assume that one is false and show that the
other is true. Assume T is anticipated at p and T′ is not available at p. That means that
there is a path from Entry to p not containing an evaluation of T′, or the last evaluation is
followed by an instruction that kills T′. Now consider any path from p to Exit. Since T is
anticipated at p, there is an evaluation of T at some point q that is not preceded by an
instruction that kills T. By the preconditions, there are two possibilities:
One possibility is that T′ is evaluated in the same block as q. The lack of
instructions that kill T between p and q means that there are no instructions between p
and q that kill T′, so there is an evaluation of T′ following p on this path with no intervening
instructions that kill T′.
The other possibility is that there is no evaluation of T′ in the same block as q. By
the preconditions on the flow graph, that means that T′ must be available at q. Now we
have constructed a path from Entry through p to q. This path must contain an evaluation of
T′ that is not followed by instructions that kill T′. By the construction of the path, this
evaluation cannot precede p, so it must be between p and q. Lacking instructions between
p and q that kill T, and hence T′, we have an evaluation of T′ with no preceding killing
instructions. Thus T′ is anticipated at p.
We now have the necessary tools to show that the operands are always evaluated
at or before the same point as the instructions of which they are operands.
Theorem ES
Consider a path from Entry to Exit, T an expression temporary, T′ one of its
operands, and an edge (P, S) on the path where T ∈ EARLIESTP,S. There are two
possibilities:
There is an edge (P′, S′) on the path preceding S where T′ ∈ EARLIESTP′,S′ and
there are no instructions between the start of S′ and the end of P that kill T′.
A block P′ is reached where T′ is not anticipated at the end of P′. Then let S′ be the
successor on the path, and (P′, S′) satisfies the condition.
Entry is reached. Then P′ is chosen to be Entry and S′ is its successor on the path.
There are no other possibilities. While walking backward on the path, one either
comes to an instruction that kills T′, an instruction that evaluates T′, or a point where T′ is
no longer anticipated. By assumption, the instruction that evaluates T′ is not possible.
The theorem is therefore proved. The compiler needs the same result for
INSERT/DELETE.
Theorem LS
Consider a path from Entry to Exit, T an expression temporary, T′ one of its
operands, and an edge (P, S) on the path where T ∈ INSERTP,S. There are two
possibilities:
There is an edge (P′, S′) on the path preceding S where T′ ∈ INSERTP′,S′ and there
are no instructions between the start of S′ and the end of P that kill T′.
that contain them. We will then optimize each of these computations in order. Consider
the evaluation of T.
Recall that EARLIEST does not require the solution of any equations once
availability and anticipatability have been computed, so it can be described by the function
in Figure 10.6. The function is written this way for descriptive purposes; it is
probably inefficient in the production version of the compiler, since the value of EARLIEST
will only be asked for when the calling procedure knows that T is anticipated; thus we have
redundant references to anticipatability. If the compiler used to compile this compiler
incorporates in-line expansion of functions, then there should be no inefficiency.
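Although Figure 10.6 is not reproduced here, the description above says its content amounts to a direct predicate over the previously computed attributes, along these lines (the accessor names are assumptions of the sketch):

    def earliest(P, S, T):
        # A direct translation of the EARLIEST condition; no new data-flow
        # equations are solved here.
        return (antin(S, T)                  # T anticipated at the start of S
                and not avout(P, T)          # T not available at the end of P
                and (kills(P, T)             # T killed in P, or
                     or not antout(P, T)))   # T not anticipated at the end of P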
The major computation for INSERT/DELETE is the computation of LATER and
LATERIN. We will compute and store the value of LATERIN since it is associated with
blocks and we will not need to find storage for a value in the data structure representing
edges. LATER can be computed from ANTLOC, LATERIN, and EARLIEST (Figure 10.7).
The computation of LATERIN has much the same form as the computation of availability.
It is an intersection of information from all of the predecessors.
Figure 10.6 Pseudo-code for EARLIEST
The major difference is that ANTLOC is the information in a block that kills the
transmission of LATER forward to the next block, and the EARLIEST information on an
edge (rather than the existence of an evaluation at the end of a block) is the information
that creates the value true.
The compiler uses a work-list algorithm for computing LATERIN, much like
availability, as is shown in Figure 10.8. The head of each edge that has EARLIEST true is
given the value true for LATERIN, then each succeeding block is added if there are no
intervening evaluations of T. The first phase gives all blocks between an edge where
EARLIEST is true and the following evaluation of T the value of true for LATERIN.
Figure 10.8 First Phase of LATERIN Computation
The second phase prunes the set of blocks where LATERIN is true by eliminating
blocks where LATER is not true for all predecessors, as shown in Figure 10.9. Initially the
algorithm eliminates all blocks that do not have LATER true for all incoming edges. At the
same time it builds a work list of all blocks that have been eliminated, but whose
successors have not yet been processed.
The second part of the work-list algorithm processes each of these eliminated
blocks. If the block contains an evaluation of T then no further processing is needed since
the absence of this block from LATERIN cannot affect the presence or absence of its
successors. The successors are added to the work list if they are removed from LATERIN
by the removal of this block. This will happen unless EARLIEST is true for the edge
between them.
Figure 10.9 Pruning the LATERIN Set
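The two phases can be sketched as follows; the graph accessors and the antloc and earliest predicates are assumptions of the sketch, and Figures 10.8 and 10.9 remain the authoritative versions.

    def compute_laterin(graph, T):
        laterin = set()
        # Phase 1: mark the head of every edge with EARLIEST true, then
        # propagate forward, stopping at blocks that evaluate T.
        work = [S for (P, S) in graph.edges if earliest(P, S, T)]
        while work:
            B = work.pop()
            if B in laterin:
                continue
            laterin.add(B)
            if not antloc(B, T):
                work.extend(graph.successors(B))

        def later(P, B):
            return earliest(P, B, T) or (P in laterin and not antloc(P, T))

        # Phase 2: prune blocks that lack LATER on some incoming edge,
        # propagating each removal to the block's successors.
        work = [B for B in list(laterin)
                if not all(later(P, B) for P in graph.predecessors(B))]
        for B in work:
            laterin.discard(B)
        while work:
            B = work.pop()
            if antloc(B, T):        # an evaluation blocks any further effect
                continue
            for S in graph.successors(B):
                # Removing B removes S unless EARLIEST holds on the edge.
                if S in laterin and not earliest(B, S, T):
                    laterin.discard(S)
                    work.append(S)
        return laterin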
Now that LATERIN has been computed, the compiler must compute the edges on
which to perform insertions and the evaluations of T to be deleted. This is done in Figure
10.10. The compiler does not need to look at all blocks to see if they are not in LATERIN
and the entering edge has the attribute LATER, since T is anticipated at the block at the
head of any edge where an insertion will occur. Thus we look for blocks in ANTIN −
LATERIN that have a preceding edge with attribute LATER. Insertions occur on these
edges. To avoid introducing unneeded blocks, perform a special-case check for the
situation in which the tail of the edge has only one successor. In that case, insert the
evaluation at the end of that block before the unconditional branch.
The algorithm will not attempt to insert an evaluation on an edge where the head of
the edge has only one predecessor, since the LATER computation will delay the insertion
until after the block. If a block S has only one predecessor P, then LATER(P,S) = LATERIN(S), so it
is not possible that LATERIN(S) is false when LATER(P,S) is true.
Abnormal edges (P, S) are more difficult because one cannot insert a block in the
middle of an edge, since there is no way of modifying P to represent a branch to the
constructed block in the middle of the edge. These edges do get executed, so an insertion
must occur someplace. There are two techniques that have been used, plus the technique
proposed here (making three):
A pessimistic technique is to pretend that all evaluations are killed at the beginning
of S, so ANTIN(S) is empty. Thus EARLIEST(P,S) is false and the INSERT/DELETE
computations cannot push the insertion to this edge. This is overkill, since it means that
there can be no redundant expression elimination in the neighborhood of the edge.
An alternative approach is to handle the abnormal edge like a normal edge, hoping
that nothing will be inserted on the edge. If a computation is inserted on the edge, then
insert it into the tail or the head. This is not safe or profitable, but it has been done in the
Rice Massively Scalar Compiler Project.
This compiler uses a different approach, which counts on the processing of
operands before the evaluations that use them. Consider a temporary T.
Apply partial redundancy elimination as described in the previous sections. Usually there
will be no insertion on an abnormal edge; in that case, the transformation is complete. The
compiler performs more processing if an evaluation is inserted on an abnormal edge.
Consider the set of abnormal edges on which an evaluation of T is inserted.
Pretend that T and all temporaries that use T directly or indirectly as operands have
another operand. For each abnormal edge (P, S) on which T is inserted, pretend that this
added operand is killed at the head of S. This kills all of the instructions dependent on T.
Now use partial redundancy to recompute the insertion points. There will now be no
insertions of T on the abnormal edges.
Of course, the recomputation may decree that insertions will occur on other
abnormal edges that did not have insertions before, so repeat the process until there are
no insertions of T on any abnormal edges. This must happen eventually since a safe
placement can be determined by killing this phony operand at the head of each abnormal
edge in the flow graph, and each iteration will increase the number of abnormal edges that
have the phony kill at their head.
The compiler may improve on this by noticing that the memory locations referenced by A(I + c), where c is a
compile-time constant, are not modified by this store operation.
One improvement can be made for load operations. While constructing the flow
graph, the compiler always uses the same temporary for the target of a LOAD instruction
representing a load from a particular symbolic expression. In other words, all load
operations for A(I) use the same target temporary. When the compiler is generating
instructions for a store operation, it first copies the value into the same temporary used for
a load from the same symbolic expression.
As an example, consider the subscripted reference VALUE(I) in the running
example, with code sequences extracted in Figure 10.12. The address of VALUE(I) is
always computed into the temporary T17. A load operation from VALUE(I) always occurs into
the double-precision temporary SF1. The store operation is implemented as two instructions.
First the value to be stored is copied into the temporary representing the fetch of a value,
in this case SF1, and then a store operation is inserted from that temporary into memory
using the address calculation, in this example, T17.
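The two-instruction convention is easy to express in code. The following is a small sketch of the lowering, with illustrative table and instruction names; it is not the book's procedure, just the shape of it.

def lower_store(sym_expr, value_tmp, load_tmp_of, addr_tmp_of, emit):
    # The canonical temporaries for the symbolic expression, e.g. SF1
    # and T17 for VALUE(I) in the running example.
    load_tmp = load_tmp_of[sym_expr]
    addr_tmp = addr_tmp_of[sym_expr]
    emit(("copy", load_tmp, value_tmp))    # SF1 = value to be stored
    emit(("store", addr_tmp, load_tmp))    # MEM[T17] = SF1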
Although initially inefficient, this code sequence will be improved by later compiler
phases. The copy operation will probably be removed by register renaming and register
coalescing during the register allocation process.
If it is not removed, the store will be moved to a less frequently executed point, thus
gaining from keeping the value in a register rather than storing it to memory.4
4
The IBM Tobey compiler team (O'Brien et al. 1995) made the same observation independently.
The compiler improves the optimization of load operations by observing that a store
operation can be viewed as having two actions:
1. First it kills all load operations that load from a memory location that might be
modified by this store operation, including the memory referenced by the address
specified in the store operation.
2. The store operation can be viewed as implementing an evaluation of the temporary
holding the value. In other words, the store can be viewed as a store followed by a
load from the same address into the same register.
Note that a store operation can never produce a value in ANTLOC for any block,
since the memory location is killed before the evaluation. The store operation can
contribute to the AVLOC information for a block. Since the store operation never
contributes to ANTLOC, the compiler needs no special checks to avoid deleting a store
operation when moving a load. With these comments, the load operation is handled just
like expression temporaries.
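As a concrete rendering of these two observations, here is a sketch of gathering the local sets for load temporaries in one block; may_alias and the instruction fields are assumptions for illustration, and TRANSP is simply the complement of the killed set.

def local_sets_for_loads(block, may_alias):
    antloc, avloc, killed = set(), set(), set()
    for inst in block:
        if inst.op == "load":
            if inst.target not in killed:
                antloc.add(inst.target)     # upward-exposed load
            avloc.add(inst.target)          # downward-exposed load
        elif inst.op == "store":
            # Action 1: kill every load temporary the store might modify.
            for t in may_alias(inst):
                avloc.discard(t)
                killed.add(t)
            # Action 2: the store then acts as an evaluation of its own
            # load temporary, so it contributes to AVLOC but, having been
            # killed first, never to ANTLOC.
            killed.add(inst.value_tmp)
            avloc.add(inst.value_tmp)
    return antloc, avloc, killed            # TRANSP(T) iff T not in killed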
The temporary T in Figure 10.1 is actually the references to VALUE(I), with the
following operations in these blocks:
B1: Initialization of VALUE(I). For complete realism there should have been a
modification of T before the evaluation of T in the example, but it would have
changed nothing.
B4: The mod T occurs when I is incremented. Changing the address kills the load.
B6: The mod T followed by an eval T is the store operations updating VALUE(I).
As we see, the store operation contributes to the elimination of the load in block B2.
One further improvement to LOAD optimization is based on the semantics of the
source language. Consider an uninitialized variable or data structure X that is allocated on
the runtime stack. Due to unexecutable paths5 in the flow graph, the compiler might
determine that there are points in the flow graph where the variable is referenced but not
initialized. The compiler should pretend that there are load operations for each of the
uninitialized elements of the structure at the beginning of the scope. This can be done by
adding the load operation to the AVLOC information for the block at the beginning of the
scope, even though the load is not there. This has the interpretation of an uninitialized
value being loaded on all unexecutable paths, making more load operations redundant.
5
There are paths through many programs that cannot be executed. Consider two IF statements in different parts of the
program that use the same conditional expression. It is not feasible for the compiler to determine that both IF statements always branch
on the same condition, so the branch on the opposite condition leads to an unexecutable path.
Now that we have the information needed, the same algorithm can be used for load
operations as is used for expression temporaries in Figure 10.10.
Store operations can also be moved toward the Entry block. This transformation is less useful than LOAD/STORE motion.
The author is not proposing it for the current compiler; however, the technique is discussed at the end of the chapter for reference.
Alternatively, the store operation can be moved toward Exit. This motion pays off
when the address of the store does not change in the loop, but the value being stored
does. Consider the loop in Figure 10.14. The value stored in B(J) changes with each
iteration of the loop; however, the address being stored into does not change, so the store
and the copy can be moved to after the loop.
Recall that all store operations have a special form. Each store operation takes two
arguments: the address to store to and the temporary holding the value to be stored.
The temporary holding the value is the same as the temporary used to fetch the value with
a load operation for the same address. Thus two store operations that store an explicit
value into the same address will always use the same temporary to hold the value. Of
course there may be other memory locations that might be modified by the store also.
When the compiler moves a store operation toward either Entry or Exit, it moves the
collection of store operations having the same temporary. These are guaranteed to have
the same address computation and the same temporary for the address.
availability means that some earlier instruction has already computed that effect. With
these understandings, go back and review the proofs and you will see that the proofs
actually show that lazy code motion applies to effect rather than value.
The second theoretical problem is, What about the theorems involving
subexpressions? They cannot be viewed as subeffects, since there is no such thing. The
theorems don't apply, so we make the main procedure apply all transformations for expressions first,
then the transformations for copies, and then the transformations for store operations.
This also has the advantage that we can avoid trying to move store operations if no copy
operations are moved.
The practical problem is as follows. We have used the temporary name to
represent the instruction computing it. This does not work for store operations, since the
temporary is already used to represent load operations. We therefore build a separate set
of data structures for store operations: STORE_ANTLOC rather than ANTLOC,
STORE_AVLOC rather than AVLOC, and STORE_TRANSP rather than TRANSP. Since
we compute global information on the fly, the global information is temporary and so is
not a problem.
All of this rationalization now allows the compiler to use the same algorithm for
moving store operations toward Entry as the compiler uses for expressions and load
operations.
values are moved toward Exit, the flow graph guarantees that the correct value is in the
temporary at all points where it can be used. The store operation must only guarantee that
the value eventually makes it to memory.
A copy operation into T does not kill the STORE instruction. Moving the store past
the copy does change the immediate value that is to be stored in memory; however, there
is a store after each copy into T, so interchanging the instructions makes one of the store
operations partially redundant.
Another store operation with the same temporary T has the same temporary for an
address so is identical to this store operation. The store operation is the equivalent of the
evaluation of T for STOREs (rather than LOADs).
A load or store involving a different temporary T1 that might reference the same
memory location kills the STORE instruction. If they might reference the same memory
location, then interchanging the order of the instructions might change the value in
memory. This is precisely the modifies relation for the store operation for T.
An instruction that kills the address computation for the memory location associated
with T also kills the store, since that changes the location in memory being referenced.
A load involving the temporary T also kills the store. If the store is moved after the
load, then the value in memory is not correct and an incorrect value will be loaded. This
should be a rare situation since earlier optimization of load operations used the existence
of the store operation to make the load redundant. However, some cases involving partial
redundancy can still exist.
As with load operations, there is an improvement that can be made for data that is
stored on the runtime stack, such as LOCAL variables implementing data structures.
Since the data ceases to exist at the end of the execution of the flow graph, the compiler
can pretend that there is a store into the memory location in the Exit block. This will make
some of the other store operations redundant, eliminating those store operations that put
data into the memory location when the data is never loaded again.
We now know the instructions that affect the store operations. Observe that the
lazy code motion form of partial redundancy can be recast in terms of the reverse graph.
The names that are used are given in Figure 10.15. Rather than using the name
EARLIEST, the name FARTHEST is used to represent the farthest toward Exit that the
store operation can be moved. Similarly the name NEARER is used to represent that the
store can be moved nearer to the original position of the store without increasing
execution frequency.
The definitions are direct transliterations of the definitions for normal optimization. T
∈ ST_AVLOC means that a store of T occurs in the block after any
instruction that would kill the store. Similarly, T ∈ ST_ANTLOC means that a store of T
occurs in the block before any instruction that would kill the store. T ∈ ST_TRANSP
means that no instruction in the block kills a store of T.
From the local information one can compute the global availability and
anticipatability information as shown in Figure 10.16. A store is available at a point if each
path from the Entry to that point contains an instance of the store that is not followed by
any instructions that kill T. Similarly, the store is anticipated if every path to Exit contains
an instance of the store that is not preceded by a killing instruction.
Given this information we can form the analog to EARLIEST, which is FARTHEST:
the edge nearest to Exit on which a store can be inserted that will have the same effect as
preceding stores. The analog to LATER is NEAR, which moves the store back toward the
original store as far as is possible without introducing extra store operations on any path.
These equations are given in Figure 10.17.
Figure 10.15 Transliteration on Reverse Graph
Now that the equations are recorded, each reader should go through the process of
convincing himself or herself that the equations do give the correct positions for inserting
the store operations nearer to the Exit node.
Figure 10.17 INSERT/DELETE Equations for Stores
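The figure is not reproduced here, but as a reading aid, one plausible transliteration, obtained by dualizing EARLIEST, LATER, and LATERIN on the reverse graph, is the following; it should be checked against Figure 10.17 rather than taken as the book's exact equations:

FARTHEST(P,S) = ST_AVOUT(P) and not ST_ANTIN(S)
                and (not ST_TRANSP(S) or not ST_AVIN(S))

NEARER(P,S) = FARTHEST(P,S) or (NEARERIN(S) and not ST_AVLOC(S))
NEARERIN(P) = the conjunction of NEARER(P,S) over all successors S of P

Here ST_AVLOC plays the blocking role that ANTLOC plays in the forward direction, since a downward-exposed store on the original graph is the upward-exposed occurrence on the reverse graph.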
Convince yourself by going through the proofs, viewing the instructions for their
effects rather than the individual value, and see that all of the proofs work on the reverse
graph as well as on the original graph.7
7
The author's first notice of this observation was in a paper by Dhamdhere, Rosen, and Zadeck (1992).
Now all of the algorithms that we have developed for normal computations can be
applied to store operations. When a store operation is moved, there may be more chance
for other optimization, so the local information for expressions should be updated and the
algorithm rerun for each expression that might be killed by the store. This can be done by
adding the expression to the expression work list.
As an example of moving store operations, consider the running example for the
book. The store operations into VALUE(I) and LARGE(I) are moved. The store operations
in blocks B1 and B6 are moved into block B3.
I kills the copy if the copy kills I. This takes care of the case in which the target of
the copy is a direct or indirect operand of I.
I kills the copy if I changes S. Here we have to be more careful, since the theorem
about subexpressions does not apply to copies because there are multiple copies
with the same target but different sources.
I kills the copy if I kills S. This is different from the preceding condition, since an
instruction that computes S does not necessarily kill S.
I kills the copy if I is not an identical instruction to the copy and I modifies T. A
different copy can kill this copy also. Interchanging them would change the values
computed. However, a copy that has exactly the same form should be viewed as
partially redundant, in other words, an evaluation rather than a killing instruction.
Copy operations are different from the expression, load, and store operations we
have discussed before in that the copy operation is determined by a pair of temporaries:
the source and the target. As noted above, there are multiple copies with the same target
and different sources. Rather than optimizing all of the copies together (which cannot be
done), each source/target pair is optimized separately. This includes collecting the local
information and computing the global information on the pair rather than on the single target
temporary, as in all of the other cases.
With these understandings, the algorithms for moving the copies toward Entry can
be performed using the same lazy code motion algorithms used for moving all of the other
instructions toward Entry. While transcribing the algorithms, remember that the compiler is
optimizing all of the copies with the same source/target pair at the same time. Remember
that TRANSFORM_COPIES takes the target temporary as a parameter. This means that
there is a loop within TRANSFORM_COPIES that loops over all of the possible source
temporaries and applies the lazy code motion algorithms to each.
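A sketch of this driver structure follows; TRANSFORM_COPIES and the pair grouping are paraphrased from the text, and everything else is an illustrative assumption.

def transform_copies(target, copies, lazy_code_motion):
    # All copies with the same source/target pair are optimized together,
    # one pair at a time.
    sources = {c.source for c in copies if c.target == target}
    for source in sources:
        # Local and global information is gathered for this pair only; a
        # copy with the same target but a different source acts as a kill,
        # not as an occurrence.
        pair = [c for c in copies
                if c.target == target and c.source == source]
        lazy_code_motion(pair)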
Again, the algorithms for moving store operations toward Exit can be applied to
copies, with the same understanding that one optimizes a source/target pair rather than a
single temporary.
There is no motion of copy operations within the running example. This is not
uncommon. The motion of copies will happen more frequently when radical
transformations of the loops have been performed by either the dependence analysis
phase or a transforming preprocessor. Copies are more likely to be movable when in-line
expansion occurs or when the source program was created by an application-dependent
preprocessor.8
8
Compiler writers frequently make the error of thinking that programs are written by programmers. The most troublesome
programs are written by other programs. These program generators will generate sequences of statements that no programmer in his
right mind would ever consider, for example, 9000 assignment statements in a single block.
Frequently the argument is made that multiplication and division operations are rare, so they need not be fast. Many times
this is true; however, the argument must be refined. Frequency counts should not be used to weight instructions, but rather the total
number of cycles that these instructions occupy in computational units in the hardware. Second, it does not matter whether an
instruction is rare if the points at which it occurs are in the critical paths of important programs for which the processor was designed.
Both of these factors make multiplication and division more important than the usual arguments show.
Another case that this technique will handle is when a variable is almost an
induction variable. If most modifications of the variable are increments by constants but a
few are more complex expressions, then this technique will increment a pointer near the
increments by constants and generate a new version of the pointer near the computations
that are general assignments.
The technique is based on a simple observation. Consider any computation E of
the form C0 + I * C1 or I * C1, where C0 and C1 are compile-time constants. When the
compiler sees an assignment to a variable of the form I = I + C or I = I - C, where C is also
a compile-time constant, it pretends that these assignments do not kill computations of
the form E. The compiler can pretend
not to kill these computations by modifying the gathering of local data ANTLOC and
AVLOC. When the compiler sees an increment or decrement by a constant, it
only signals that computations that are not of the form E are killed. Then the compiler
performs lazy code motion, which moves the occurrences of E using only information about
the nonincrement evaluations of I.
After the code motion, the compiler revisits each increment or decrement of I and
fixes up the value in the temporary for E so that it has the correct value after the
increment. If the assignment to I was an increment, then the temporary is modified by
adding C * C1. If the assignment was a decrement, then the value C * C1 is subtracted
from the temporary. This need only be done if there is a path from the increment or
decrement to a use of E that contains no instruction that evaluates E.
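A sketch of that fix-up pass, assuming lazy code motion has placed E = C0 + I * C1 in the temporary tE; the instruction shapes and the reaches_use_of_E test are illustrative stand-ins, not the book's procedures.

def fix_up_increments(increments, tE, C1, emit_after, reaches_use_of_E):
    for inst in increments:                 # each I = I + C or I = I - C
        if not reaches_use_of_E(inst):      # no later use of E without an
            continue                        # intervening evaluation of E
        delta = inst.C * C1
        if inst.is_increment:
            emit_after(inst, ("add", tE, tE, delta))   # tE = tE + C*C1
        else:
            emit_after(inst, ("sub", tE, tE, delta))   # tE = tE - C*C1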
For the example in Figure 10.18 assume that the array A and the values X, Y, and
Q are double-precision numbers requiring 8 bytes of storage and that I is not used after
the fragment. The result of this strength reduction is then the left column of Figure 10.19.
Note that there are two expressions that satisfy the conditions for E: 8 * I and address(A) +
8 * I. Strength reduction is applied to both of them. Note that there is no increment of
these two expressions at the end of the code fragment because we are assuming that I is
not used later.
Later, dead-code elimination is performed, resulting in the computations in the right
column of Figure 10.19. Since I is not used later, the increment to it is removed, as are
all references to I8 (the temporary holding 8 * I) except the first one. The others are not
needed, since the increment of the address expression removes the need for them.
There are two shortcomings of this technique. Consider two points in the flow
graph, p1 and p2, and consider these two possibilities:
Suppose that p1 and p2 are the positions of increment instructions for I with no
instructions between them that kill I. If there is no evaluation of E between p1 and
p2, and there is an evaluation of E after p2, then the multiplication will be replaced
by at least two additions. This is a minor problem since earlier phases of the
compiler have eliminated as many repetitive additions as possible. Thus the
compiler will ignore this problem.
the evaluation of E or a point of insertion. The algorithm for identifying the unnecessary
addition operations is given in Figure 10.21.
Now we fit the whole algorithm together (Figure 10.22). It is outlined here in a very
high-level pseudo-code. The idea is that lazy code motion is performed under the
assumption that all increments of I can be incorporated into the multiplications by
additions of constants. Then the increments that cause extra additions are computed.
These increments are then handled like normal assignments, and lazy code motion is
repeated. If there are no increments that cause extra additions, the algorithm is done.
Since this algorithm is used infrequently and the extra additions are infrequent, the
repetitions should be few.
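In outline, the driver might look like the following sketch; the helper names are illustrative stand-ins for Figure 10.22, not the book's exact procedures.

def strength_reduce(E, I, lcm, find_costly_increments, treat_as_kill):
    while True:
        # Lazy code motion with increments of I treated as transparent to E.
        lcm(E, transparent_increments_of=I)
        costly = find_costly_increments(E, I)
        if not costly:
            return                  # no extra additions: done
        for inc in costly:
            treat_as_kill(inc, E)   # handle like a normal assignment next pass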
10.10 References
Chow, F. 1983. A portable machine-independent optimizer: Design and measurements. Ph.D.
diss., Stanford University.
Dhamdhere, D. M., B. Rosen, and F. K. Zadeck. 1992. How to analyze large programs
efficiently and informatively. Proceedings of the SIGPLAN '92 Symposium on Programming
Language Design and Implementation, San Francisco, CA. Published in SIGPLAN Notices
27(7): 212-223.
Drechsler, K.-H., and M. P. Stadel. 1993. A variation of Knoop, Ruthing, and Steffen's
lazy code motion. ACM SIGPLAN Notices 28(5): 29-38.
Joshi, S. M., and D. M. Dhamdhere. 1982. A composite hoisting-strength reduction
transformation for global program optimization, parts I and II. International Journal of
Computer Mathematics 11: 21-41, 111-126.
Knoop, J., O. Ruthing, and B. Steffen. 1992. Lazy code motion. Proceedings of the ACM
SIGPLAN Conference on Programming Language Design and Implementation (PLDI '92), 224-234.
Knoop, J., O. Ruthing, and B. Steffen. 1993. Lazy strength reduction. Journal of
Programming Languages 1(1): 71-91.
Morel, E., and C. Renvoise. 1979. Global optimization by suppression of partial
redundancies. Communications of the ACM 22(2): 96-103.
O'Brien, K., et al. 1995. XIL and YIL: The intermediate languages of TOBEY. ACM SIGPLAN
Workshop on Intermediate Representations, San Francisco, CA. Published as SIGPLAN
Notices 30(3): 71-82.
2. The SCHEDULE phase reorganizes the target instructions to reduce the number of
machine cycles needed for executing the flow graph. At the same time, it avoids
increasing the register pressure beyond the available set of registers.
3. The REGISTER phase assigns the temporaries to physical registers. This is done
in three steps.
a. First, temporaries that are live between blocks (global temporaries) are
assigned to registers.
b. Within a block, temporaries that can share storage with a global temporary
are assigned registers.
c. Then unassigned temporaries that are live within a block are assigned
registers.
4. The RESCHEDULE phase is a reexecution of the SCHEDULE phase. It is only
performed if the register allocator has inserted load and store operations.
The register allocation phases must use the target resources effectively. That
means using the fewest possible registers. When there are insufficient registers, the
register allocator inserts the fewest possible load and store operations. Using the
minimum number of registers and inserting the minimum number of load and store
operations is unrealistic since the problems are NP-complete. Instead, we use heuristics
to do as good a job as possible.
compiler. We observe that much of renaming can occur as renaming on the static single
assignment form of the flow graph without the conflict graph. The conflict graph need only
be built for a small fraction of the temporary registers, decreasing its size.
Peephole optimization works best when the compiler can inspect the definitions of the
operands of each instruction. We have this with the static single assignment form, so
peephole optimization can be performed here. It certainly needs to be performed before
instruction scheduling.
As a cleanup phase, dead code must be eliminated. Again this algorithm operates
on the static single assignment form of the flow graph.
Since the algorithms all work on the static single assignment form, they can be
performed sequentially; however, they can also be combined. Peephole optimization can
be performed at the same time that register copies are initially being eliminated for register
coalescing. And we will see shortly that register renaming and register coalescing can be
combined into one algorithm that computes a partition of the temporaries for reforming the
normal flow graph.
After these algorithms based on static single assignment form, the algorithm
operates on the normal flow graph and the loop structure to insert load and store
operations to reduce the number of registers needed to match the registers available in
the target machine. Thus the main procedure for LIMIT has the form shown in Figure 11.2.
The algorithms could be combined further if abnormal edges in the flow graph did
not exist.1 Peephole optimization, local coalescing, and the construction of the static single
assignment form could be done simultaneously. Since the compiler must avoid copy
operations on abnormal edges, these edges and the corresponding φ-nodes must be
identified before any coalescing or peephole optimizations are performed. This
identification can be done during the construction of the static single assignment form;
however, it is described separately for simplicity.
1
By now, you have figured out that abnormal edges are the bane of the compiler writer's existence.
If the temporary is not a constant value, does the compiler know information about
some of the bits? Is the temporary positive? Are certain bits known to be 0 or
known to be 1?
The compiler contains a set of procedures for identifying patterns in the instruction
stream: there is one procedure for each kind of instruction. That procedure identifies all
patterns that end with the particular instruction and performs the transformation required
for a better code sequence. It also records the information for each temporary that is the
target of that kind of instruction. If the sequence of instructions changes, it restarts the
pattern matching with the first instruction in the transformed sequence. Thus multiple
patterns may be applied to the same instruction.
Although peephole optimization restarts the scan with the first transformed instruction
to allow the identification of multiple patterns, some patterns will still not be identified. The
whole peephole optimization phase is repeated until there are no patterns matched. The
information gathered from previous iterations is still true for each temporary. This
information can be used by φ-nodes to get better information on subsequent iterations;
however, repetition of the whole peephole optimization phase to gain better information at
φ-nodes alone should not be performed; there is not enough to be gained by it.
Transformations that involve adding a use of a temporary in
Occurs_In_Abnormal_φ_node, or moving the point at which such temporaries are
evaluated, must be avoided. By doing so, the compiler guarantees that no copies will be
introduced later at abnormal edges.
The algorithm in Figure 11.5 describes the processing of a block. The actions in the
block are performed in the same order as execution. First the φ-nodes are processed.
There are only a few transformations that might eliminate φ-nodes; however, information
can be gained about the value of the result from the information known about the
operands.
After φ-nodes are processed, the compiler simulates the execution of the block.
This is done by calling the peephole optimization procedure for each instruction in the list.
That procedure will perform any transformations. The value true is returned if a
transformation is performed. Here is the tricky part of peephole optimization. If no
transformations are performed, the compiler wants to go on to the next instruction. If a
transformation has been performed, it wants to reprocess the transformed instructions,
which may now be a different instruction from the original instruction. Care must be
applied to avoid skipping an instruction, attempting to reprocess a deleted instruction, or
generally crashing.
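The following sketch shows one way to keep that iteration safe, assuming each per-opcode procedure returns whether it transformed anything and, if so, the first instruction of the transformed (still linked) sequence; the names are illustrative.

def process_block(block, peephole):
    inst = block.first_instruction
    while inst is not None:
        changed, resume = peephole[inst.op](inst)
        if changed:
            inst = resume        # reprocess the transformed sequence
        else:
            inst = inst.next     # nothing happened: move on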
After the block has been processed, the walk of the dominator tree is continued by
processing the children of the block in the dominator tree.
We will not describe all of the procedures here since their number and patterns
depend on the target machine. Instead we will describe the processing of φ-nodes, copy
instructions, and integer multiplication. The reader can extrapolate to the structure for all
machines.
When creating the procedure for any of the instructions, first consider the
transformations that can be applied. With φ-nodes, the following transformations are
possible when the φ-node has the form T0 = φ(T1, . . ., Tm):
If each of T1 through Tm is the same temporary, then the φ-node can be changed
into a single copy operation, T0 = T1. If neither of these temporaries is involved in
an abnormal edge, then the copy can be eliminated.
If all except one of the temporaries T1 through Tm are the same and that one is the
same as T0, then again the φ-node can be turned into a copy operation and
potentially eliminated.
Processing a φ-node thus consists first of identifying these two possibilities and making
the transformation. Afterward, find all the characteristics that are the same between the
operands and give the target those characteristics (see Figure 11.6).
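A sketch of just the two transformations, with illustrative names; the abnormal-edge test uses the set computed earlier during the scan.

def simplify_phi(phi, occurs_in_abnormal_phi):
    t0, ops = phi.target, phi.operands       # T0 = phi(T1, ..., Tm)
    distinct = set(ops)
    if len(distinct) == 1:                   # all operands identical
        source = ops[0]
    elif len(distinct) == 2 and t0 in distinct:
        (source,) = distinct - {t0}          # all but one equal the target
    else:
        return None                          # neither transformation applies
    if t0 in occurs_in_abnormal_phi or source in occurs_in_abnormal_phi:
        return ("copy", t0, source)          # keep the copy: abnormal edge
    return ("replace_uses", t0, source)      # copy can be eliminated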
These always seem to happen. The compiler is carefully designed so that all instances of a particular instruction are transformed
at a single point in the compiler; however, later transformations might generate the same situation. So if it is not expensive, checks
should be made to see that the situation has not already occurred.
The other instruction to consider here is i2i, which is the integer copy operation in the
flow graph. Here there is only one transformation. If the source and target are not involved
in abnormal edges, the source can replace all uses of the target, eliminating the target
temporary completely. This is illustrated in Figure 11.8. The procedure checks to see if the
temporaries involved occur in abnormal edges; if not, all instructions that use the target
are modified.
While scanning for peephole optimizations, the compiler precomputes the set of
temporaries that occur in either a copy operation or a φ-node. Later the conflict graph will
be computed for only these temporaries, decreasing the size of the graph and speeding up
the compiler. The set Occurs_in_Copy holds the set of temporaries that occur in either a
copy or a φ-node. Note that this set is recomputed during each pass through peephole
optimization because the processing of copies may change the set of temporaries
occurring in copies (Figure 11.8).
Figure 11.6 Peephole Optimizing φ-nodes
This data structure is normally called the interference graph, but that would reuse the name of the data structure formed during
instruction scheduling. Thus I chose to use the name used on the PQCC project at Carnegie Mellon University (Leverett et al. 1979).
Definition
Conflict Graph: Given a set of temporaries R, the conflict graph for R is the
undirected graph formed from the set of temporaries R as the nodes, together with an
edge between T1, T2 ∈ R if there is any point p in the flow graph satisfying both of the
following conditions:
T1 and T2 are both live at p. This means that there is a path from an
evaluation of T1 to a use of T1 that includes the point p, and there is a path
from an evaluation of T2 to a use of T2 that includes the point p. Note that
this means that no edge is needed if either temporary is uninitialized.
keep both data structures simultaneously, using whichever is more efficient for the
particular operation. This costs memory in the compiler.
Our compiler optimizes the construction of the conflict graph in two ways. First, the
conflict graph is only constructed for a subset of the temporaries that is predetermined
by the compiler. By keeping the set of temporaries small, time and space are saved.
Second, the compiler implements the conflict graph as a combined hash table, with the
conflicting neighbors of each temporary represented as a list. The data structures are shared
between the hash table and graph representation to avoid additional memory
consumption.
The field smaller contains the number of the temporary with the smaller value.
The field larger contains the number of the temporary with the larger value.
Note that there is no data stored in the edge. The existence of the edge is the
important thing to the algorithms. Thus the data structure for the edge would look
something like the description in Figure 11.9.
To check for the existence of a particular conflict, the compiler uses a chain-linked
hash table, ConflictHash, of some size HASHSIZE, which can be a power of two since the
hash function is simple. Let Ti be the temporary represented by the integer i, and
correspondingly let Tj be the temporary represented by the integer j. Since we have no
knowledge of the frequencies and interrelationships of the temporaries, the hash function
consists of linearizing the entries in the corresponding symmetric bit matrix (which we did
not build) and dividing by the size of the table. In other words, the hash function is the
following, which generates an index to a chain in the chain table. Of course, hashnext is used to scan
down the chain until a matching smaller and larger are found, indicating the presence of
the edge.
Conflict(Ti,Tj) = (if i < j then
j(j - 1)/2 + i
else
i(i - 1)/2 + j) mod HASHSIZE
Figure 11.9 Structure of a Conflict Entry
During insertion, new edges are added at the head of the chain, since locality
indicates that once an insertion occurs it is likely that the same insertion will be attempted
shortly.
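A compact sketch of the whole structure follows, with the per-temporary neighbor lists elided; HASHSIZE and the field names follow the text, and the rest is illustrative.

HASHSIZE = 1 << 12                     # a power of two, as the text allows

def hash_index(i, j):
    i, j = (i, j) if i < j else (j, i)
    return (j * (j - 1) // 2 + i) % HASHSIZE

class ConflictGraph:
    def __init__(self):
        self.conflict_hash = [None] * HASHSIZE   # chain heads

    def contains(self, i, j):
        node = self.conflict_hash[hash_index(i, j)]
        while node is not None:
            if node["smaller"] == min(i, j) and node["larger"] == max(i, j):
                return True
            node = node["hashnext"]
        return False

    def insert(self, i, j):
        if self.contains(i, j):
            return
        h = hash_index(i, j)
        # New edges go at the head of the chain: the same insertion is
        # likely to be attempted again soon.
        self.conflict_hash[h] = {"smaller": min(i, j),
                                 "larger": max(i, j),
                                 "hashnext": self.conflict_hash[h]}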
The other operation is finding all of the neighbors of a temporary. Let Ti be the
temporary corresponding to the integer i. To scan down the list of temporaries that conflict
with Ti, use an algorithm like the one in Figure 11.10.
The compiler also keeps track of the number of a temporary's neighbors. This
can be accommodated by adding an attribute to the temporary, called NumNeighbors, that
is initialized to 0 and incremented each time a conflict is added.
A temporary that does not have a value can share a register with any other temporary. Since we do not care
what the value is, we can assign it the value in the other temporary.
By the definitions of live and dead, these are the only alternatives, so we have
proven the observation.
This observation means that we do not have to create conflicts for each pair of
temporaries live at a point. The compiler need only create conflicts between temporaries
evaluated at a point and the other temporaries that are live at that point. This gives the
algorithm in Figure 11.11. It computes the lifetime information for the relevant temporaries
in the same way that live/dead analysis computes the information and then uses
this information and the last observation to add conflicts to the conflict graph.
partition of the temporaries: Two temporaries are in the same partition if they have been
combined during register coalescing.
The SSA-form register-renaming algorithm can generate φ-nodes associated with
abnormal edges in the flow graph. These φ-nodes must not generate copy operations
when the graph is translated back into normal form. Thus the algorithm must avoid
eliminating copies in a way that will cause copies to occur on abnormal edges. As usual,
impossible edges are fine, since the code on them can never be executed anyway.
The algorithm consists of using the SSA form to eliminate most copies. Initially the
temporaries are partitioned so that each temporary is in an element of the partition by
itself. Then each φ-node and copy instruction is investigated. If an operand and the
destination temporary do not conflict, then both temporaries are put in the same
partition. The flow graph is then translated back into normal form.
Note the similarity between register coalescing and register renaming. Both are
implemented by creating a partition, and both partitions are created to eliminate the copies
at the φ-nodes.
Finally, the real work is done in CHECK_COALESCE in Figure 11.15. The conflict
information for the partition is stored as the conflict information of the representative
temporary, so first find the representative temporaries. If they are the same
representative, then the temporaries have already been coalesced either directly or
indirectly. Second, check to see if they conflict. If they do, then nothing is done; otherwise,
the two partitions are merged with a UNION operation and the conflict information for the
new representative is given the union of the conflict information for the original partitions.
Figure 11.15 Coalescing Two Temporaries
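A sketch of CHECK_COALESCE over a union-find partition; the conflict sets are stored on the representative, as the text describes, and the data-structure details are illustrative stand-ins.

def check_coalesce(t1, t2, find, union, conflicts):
    r1, r2 = find(t1), find(t2)
    if r1 == r2:
        return                  # already coalesced, directly or indirectly
    if r2 in conflicts[r1]:
        return                  # the temporaries conflict: keep the copy
    r = union(r1, r2)           # merge the two partitions
    conflicts[r] = conflicts[r1] | conflicts[r2]   # union of conflict sets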
There are some other target architectures that require a form of implied coalescing.
If the target machine is not a RISC processor, then it may have instructions in which one
of the operands is modified to get the result. With the intermediate representation
mimicking a RISC processor, the register allocator wants to make as many of these
targets as possible be the same as one of the operands. This is accomplished by
substituting two target machine instructions for a RISC instruction: a copy from one
operand to the target and the target instruction with the target and the (implied) operand
the same. Coalescing is used to eliminate the copy instruction, that is, make the operand
and target be the same temporary.
The register pressure is a synthesized attribute of the loop tree. The register pressure
for each node is the maximum of the register pressures for each of the children. So
computing the register pressure for a loop is just finding the maximum of the register
pressures for the enclosed loops and blocks, as shown in Figure 11.17.
Figure 11.16 Finding Register Pressure In Flow Graph
Computing the register pressure in a block is shown in Figure 11.18. The structure
mimics the computation of local lifetime information used for live/dead analysis. The block
is scanned in reverse execution order and each instruction is executed in backward order.
When a definition is found, the temporary becomes dead, and when a use is found, the
temporary becomes live if it was not already live. The register pressure is the number of
registers that are live between each pair of instructions.
Figure 11.18 Computing Pressure in a Block
Some processors, such as the INTEL i860, contain instructions that define the
target register before the operands are used. In those cases, this code must be changed
to reflect the hardware. For those particular instructions, the operands will be referenced
first in backward execution order, then the targets will be modified.
be in a register at each use of the temporary. The temporary is no longer live between the
STORE instruction and LOAD instruction, so the register pressure is decreased.
To summarize this situation, assume that the register pressure is too high at the
point p in the flow graph and a temporary T is being spilled to memory. A memory location
MEMORY(T) must be assigned to hold the value of T. Then instructions must be added to
the program to move T to and from the memory location. If T is live at a point p in the
program where the compiler wants to reuse the register holding T, then
It is not difficult to satisfy these conditions. The compiler could insert a store operation
after each instruction that computes a value into T, and a load operation before each
instruction that uses the value in T. The problem is that this generates too many memory-reference instructions. On modern processors, memory references are among the most
expensive operations, so the compiler needs to decrease the number of such instructions.
These instructions also take up space in the instruction cache, further decreasing
performance.
If there is a point in the program where the register pressure exceeds the number of
available registers, the compiler will spill a temporary to decrease the register pressure.5
Since the compiler is trying to decrease the number of load and store operations
performed, it will start spilling at the most frequently executed point in the program and
attempt to insert the load and store operations at less frequently executed points. To do
this it uses a three-step process applied at the point p in the procedure where the register
pressure is largest:
5
There are situations in which the register pressure is not an accurate measure of the number of registers needed.
In some situations, more registers are needed due to complex intertwining of register usage patterns. In the presence of
uninitialized temporaries and paths through the flow graph that are not executable, fewer registers may be needed.
However, the register pressure is typically very close to the number of registers needed.
1. Find the largest (outermost) loop containing p where there is some
temporary T that is live throughout the loop and not used within the loop. T is
holding a value that is passed through the loop. Insert a single store operation of T
into MEMORY(T) at the beginning of the loop, and a load operation from
MEMORY(T) into T at each loop exit where T is live. Attempt to move the store
operations toward the procedure Entry as far as is possible without increasing the
number of times they are executed. Attempt to move the load operations toward the
procedure Exit as far as is possible without increasing the number of times they are
executed. This may decrease the register pressure at other points.
2. If no loop and temporary T can be found, then apply the same technique to the
single block where the register pressure is too high. Find a temporary T that is live
throughout the block and not used in the block. Insert the store operation before the
block and the load operation after the block if T is live after the block. Again attempt
to move the store operation toward the procedure Entry block and the load
operation toward the Exit block.
3. If both previous techniques fail to reduce the register pressure, the load and store
operations must occur within the block where the register pressure is too high.
Choose a temporary T that is live at p and is not used for the largest number of
instructions. Insert a store operation after the definition of T (or at the beginning of
the block if there is no definition in the block). Insert a load operation before the
next use of T (or at the end of the block if there is not another use in the block). If a
load occurs at the beginning of the block, attempt to move the load as far toward
the procedure Entry as possible without increasing the frequency of execution.
Similarly, move the store operation toward the procedure Exit as far as is possible.
Once the compiler has inserted the load and store operations, it uses the techniques of
partial redundancy elimination to move the load toward the Entry block and the store
toward the Exit block. The EARLIEST algorithm is used so that the operations are moved
as far as possible.
Recall that register allocation is an NP-complete problem, so there is no likelihood of
finding an algorithm that works well in all cases. This means that the implementer (and the
author) must resist overly complex allocation mechanisms: past experience says that they do
not pay off.
It is more efficient to compute for each loop the temporaries that are available to spill
and then scan from the outermost loop to the innermost, spilling temporaries if the register
pressure is too high. An attribute Through(L) is computed for the flow graph, each loop,
and each block. The algorithm is given in Figures 11.19 and 11.20.
The procedure COMPUTE_THROUGH starts the recursive tree walk of the loop tree.
Since the attribute is only needed for loops with high register pressure, the attribute is not
computed for less complex loops. This will save some time. Note that this is not true of
loops contained within other loops. If the outer loop has high register pressure, the register
pressure for the inner, less complex loops is still computed. It is too complex to avoid the
unneeded computation.
The procedure COMPUTE_THROUGH_LOOP handles blocks separately from loops.
For a block, a temporary is live throughout the block without references in the block if and
only if the temporary is live at the beginning of the block and there are no references to it
in the block. Warning: It is not true that a temporary is live throughout the block if it is live
at the start and end of the block, since it may become dead within the block and then live
again. Of course, this cannot happen if there are no references to the temporary in the block.
Figure 11.19 Computing Transparent Temporaries
The set of temporaries that are live everywhere in a loop without references is the
intersection of the corresponding sets for each component of the loop. The procedure
COMPUTE_THROUGH_LOOP computes this intersection. The compiler is only interested
in the outermost loop in which a temporary is live throughout without references, so after
computing the Through set for a loop, it removes those temporaries from the Through sets
of the inner loops.
For single-entry loops, there is an easier way to compute the Through attribute. For
a single-entry loop, a temporary is live throughout the loop without references in the loop if
and only if it is live at the beginning of the entry block and has no references within the
loop. This is true because there is a path from every block to every other block in the loop.
This is not true for multiple-entry loops because the compiler has added blocks onto the
beginning of the loop to create a single entry region of the flow graph. In these added
blocks there is not a path from each block to every other block.
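A sketch of the computation, using the single-entry shortcut from the preceding paragraph; live-in sets and per-block reference sets are assumed to be available, and the pruning of inner loops' Through sets is elided.

def through(node, livein, refs):
    if node.is_block:
        return livein[node] - refs[node]     # live on entry, never referenced
    if node.single_entry:
        return livein[node.entry_block] - loop_refs(node, refs)
    result = None                            # multiple-entry region:
    for child in node.children:              # intersect over the components
        t = through(child, livein, refs)
        result = t if result is None else (result & t)
    return result if result is not None else set()

def loop_refs(loop, refs):
    out = set()
    for child in loop.children:
        out |= refs[child] if child.is_block else loop_refs(child, refs)
    return out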
number of registers. Finally it recomputes the pressure for use during instruction
scheduling.
There are two fundamental procedures implementing the algorithm: one reduces
the pressure in loops (see Figure 11.22), the other uses a different algorithm to reduce the
pressure within a block (described later in section 11.7.1). The algorithm we have been
discussing reduces pressure in loops. Reducing the pressure within a block is the last
resort and is only performed if there are no temporaries that are live throughout the block
and unused in it.
Now let's discuss reducing pressure within a loop, as described in Figure 11.22.
The algorithm description is more daunting than the actual idea. Compute the set of loops
or blocks, High_Pressure, that have an internal register pressure that is too high. The
compiler needs to spill a temporary that is live in each of these loops if possible. To that
end, it computes a priority queue, Excess_Pressure, consisting of the loops or blocks
contained in High_Pressure. The priority is given by the excess in register pressure. The
algorithm chooses a temporary to spill (described shortly) and then spills it (also described
shortly). When as much spilling as is possible has been performed in this loop, spilling is
performed in the subloops and blocks if necessary.
Figure 11.21 Driver for Reducing the Pressure
How is the temporary to spill chosen? Consider the algorithm in Figure 11.23. The
loop (or block) with the most excessive pressure is chosen. Each of the temporaries in
Through for that loop are candidates for spilling. The one chosen is that which is also a
candidate for spilling in the most other loops that need to spill temporaries. This gives the
algorithm that optimizes the placement of the load and store operations the biggest
chance of avoiding some load and store operations.
The algorithm in Figure 11.24 describes the insertion of the load and store
operations. First there must be a memory location to hold the value. The same memory
location must be used for all references to the same temporary. The store operation is
inserted before entry into the loop, and load operations are inserted at the exit points if the
temporary is still live there. Since there are no references to the temporary in the loop, this
guarantees that the new program has exactly the same computational effect as the
original program. Then the data structures are updated. If the loop no longer has excess
pressure, then the loop is removed from Excess_Pressure and High_Pressure. If it still
has excessive pressure, the priority is decreased by one.
Figure 11.23 Choosing which Loop Temporary to Spill
Updating the register pressure is the most expensive operation, so the compiler
uses an approximation that reduces the pressure in the preselected loops and blocks. The
optimization of the placement of load and store operations may decrease it in other
places. However, the choice to spill the same temporary in as many places as necessary
within a loop makes this approximation better. All of the loops or blocks that have high
pressure and in which the temporary can be spilled do have the temporary spilled and the
pressure adjusted. It is the blocks or loops where the pressure is not too high that fail to
have the pressure adjusted. So the algorithm performs a walk of the loop tree, decreasing
the recorded pressures by one. It stops when the leaves or a loop with low pressure is
reached. The algorithm is described in Figure 11.25 as a simple tree walk down the tree,
fixing the values of the attribute Pressure.
Which temporary should be stored? The temporary whose next use is furthest in
the future. In other words, scan the set of live temporaries and choose the one whose next
entry in the use list is latest. This is the classic heuristic used in one-pass register
allocators, and it makes a register available for as long as is possible.
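A sketch of the heuristic, assuming a precomputed next-use distance; the names are illustrative.

def choose_spill(live_temporaries, next_use, point):
    # Belady's rule: spill the temporary whose next use is furthest away.
    return max(live_temporaries, key=lambda t: next_use(t, point))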
The actual point where the register pressure is too high is within the instruction,
between the uses of the operands (which might decrease register pressure) and the
storing of values in the targets (which increases register pressure). If the pressure is too
high, the temporary being spilled is stored before this instruction (the temporary must be
one of the operands or another temporary that is live but not used in this instruction). The
value must be reloaded before the next use. If there are no more uses in the block but the
temporary is live, then the load operations must be placed on each exit edge where the
temporary is live and the algorithm to optimize the placement of the spill operations called.
Similarly, if the load operation is placed at the beginning of the block, then the spill
optimizer must be called to improve the placement of the spill (see Figure 11.28).
Figure 11.28 Inserting a Spill within a Block
This section describes an algorithm to improve the placement of these load and
store operations. Once the loop-based algorithm has determined the placement of
instructions outside of loops, these load and store operations together with the previous
load and store operations for the same temporary are used to find better places to put the
operations. The algorithm used is the EARLIEST algorithm for partial redundancy
elimination.
What instructions kill the store operation? These are the instructions that cause the
condition that the value in T is the same as the value in MEMORY(T) to be violated, which
will be any instructions that modify T. Note that a LOAD instruction first kills T and then
has the effect of an evaluation of a store operation.
Note that the uses of T as an operand do not affect the placement of store
operations. The store operations are being moved toward the Entry block and never
change the value of T, so a store can be moved past a use of T without affecting any
values in registers. This gives us the following definitions for anticipation and availability:
STORE_ANTLOC(I) = STORE_IN(T)
What instructions kill a LOAD instruction? A use or evaluation of the temporary kills
a LOAD instruction. The use kills it because moving the load past the use will destroy the
value for the use. An evaluation of the temporary will kill the load since it will generate a
value different from the one in memory.
A further optimization can be made by observing that some paths to Exit may not
contain any further uses of the temporary T. If a LOAD instruction is to be inserted at a
point where T is not live, then the insertion can be ignored.
11.9 References
Chaitin, G. J., et al. 1981. Register allocation via coloring. Computer Languages
6(1): 47-57.
Early RISC processors did not stall when a value was not ready. Instead they executed the instruction using
garbage as input; it was the responsibility of the compiler to ensure that such execution did not happen. All recent
processors will stall while waiting for operands, since the indeterminacy of some instructions, particularly LOAD,
multiply, and divide instructions, made scheduling difficult.
The processor will issue multiple instructions at the same time. The processor will
load a small set of instructions, called a packet, and analyze the relationship between the
instructions. If the instructions do not use or change the operands of the other instructions
in the set, then the instructions may be issued at the same time. This gains performance if
the processor has more than one computing unit, such as an integer arithmetic unit and
floating-point unit.
The processor may have more than one integer unit and more than one floating-point
unit. In that case, the packet of instructions that can be fetched is larger, and more
than one arithmetic instruction may be issued simultaneously. Not all of the arithmetic
units may be identical. In that case the compiler will reorder the instructions so that each
packet will have instructions that can execute on different arithmetic units.
A processor with these three characteristics is called a superscalar processor. Most
processors in use today are superscalar. Many of them have an additional characteristic
called out-of-order execution. Such a processor will operate as described above but will
allow later instructions in a packet to execute even when the instructions that precede
them are constrained from execution. This latter characteristic will not be discussed here,
since there are few things the compiler can do to enhance the execution of out-of-order
processors that are not also important for normal superscalar processors.
The Digital Alpha 21164 is an example of a superscalar processor. Consider how it
matches the criteria above. First, the Alpha is not an out-of-order execution processor. All
instructions are executed in order; if the execution of an instruction is delayed, all
instructions following it are delayed also.
The Alpha is pipelined. Most instructions for the integer arithmetic units take one
cycle; the floating-point instructions take four cycles. Some of the exceptions to these
rules are the conditional move, LOAD, multiplication, and floating-point division instructions.
The Alpha will attempt to issue four four-byte instructions during each clock cycle. The
block of instructions must be aligned on an address that is a multiple of sixteen. If the
address is not a multiple of sixteen, then the packet that contains the current instruction is
fetched and the initial instructions in the packet are ignored, thus decreasing the number
of instructions that can be issued during that clock cycle. If the instructions of a packet
contain dependences so that they cannot all be issued, then the initial part of the packet is
issued, up to the first instruction that cannot be issued immediately.
The Alpha contains two integer arithmetic units and two floating-point units. Both
integer arithmetic units can execute most integer instructions. The exceptions are that shift
operations are done on one of the units and branching operations are done on the other.
There is a floating-point multiplication unit and a floating-point addition unit. Some
instructions are shared between the two units.
Ideally, the Alpha will issue four instructions during a clock cycle. Two of the
instructions will be executed by the two integer arithmetic units. One of the other
instructions will be an instruction that can be executed by the floating-point multiplication
unit, and the final instruction can be executed by the floating-point addition unit. These
instructions can occur in any order in the packet of instructions.
There are other characteristics of the Alpha that the scheduler must take into
account. Consider load operations. These are performed by the integer arithmetic unit;
however, the length of time to fetch data from memory depends on which cache or
memory unit currently contains the data.
The Alpha contains three on-chip caches: one for data, one for instructions, and a
larger secondary cache. The primary cache contains 8K of data organized as a
direct-mapped cache with a cache line of 32 bytes. If data is in this cache, a load
operation takes two cycles. There is another 8K on-chip cache for instructions; however, it
will not be discussed here.
The secondary on-chip cache holds both instructions and data. The cache will hold
96K of information organized as a three-way set-associative cache with each cache line
containing 64 bytes of data. A load operation when data is in this cache takes nine cycles,
including moving the data into the primary cache.
There is usually a large board-level cache in an Alpha system. This cache contains
multiple megabytes of data and is organized as a direct-mapped cache with each
cache line containing 64 bytes of data. For data in this cache, a load operation takes
twenty cycles, including moving the data into the two higher-level caches.
When data is in memory and not in a cache, the load operation takes a long time:
somewhere in the range of a hundred cycles, depending on the system involved. This is
sufficiently long that there is no point in modeling the exact time in the scheduler. The
compiler can handle this in one of two ways. It can optimistically assume that data is in
one of the caches and schedule the instructions for that case, or the compiler can be
aware that these load operations take a huge amount of time and attempt to move them
as early in the schedule as possible.
The effect of pipelining on the execution of a program is that instructions are issued
at some point in the program and the result becomes available some number of clock
cycles later. Due to hardware optimizations, the number of cycles that the value takes to
become available may depend on how it is going to be used, so the delay, or latency, is a
function of both the instruction computing a value and the instruction that will use the
value.
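To make this concrete, a scheduler typically encodes latency as a lookup keyed by the producing and consuming instruction classes. The sketch below is ours, not taken from the text; the class names and cycle counts are illustrative stand-ins, not actual Alpha timings.

    # A minimal latency model, assuming instructions are tagged with a
    # class name. All classes and cycle counts here are hypothetical.
    DEFAULT_LATENCY = {
        "iadd": 1,   # integer result ready on the next cycle
        "load": 2,   # assumes a primary-cache hit
        "fmul": 4,
    }
    # Overrides for particular producer/consumer pairs, modeling bypass
    # paths that deliver a value faster than the default pipeline timing.
    PAIR_LATENCY = {
        ("load", "store"): 1,
    }

    def latency(producer_class, consumer_class):
        """Cycles after the producer issues before the consumer may issue."""
        return PAIR_LATENCY.get((producer_class, consumer_class),
                                DEFAULT_LATENCY.get(producer_class, 1))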
The SCHEDULE phase reorders the instructions in the procedure being compiled
to eliminate as many stalls as possible. There are three different types of schedulers,
based on the size of the pieces of the procedure that they attempt to reorder:
Block schedulers reorder the instructions within individual blocks. The form of the
program flow graph is not changed. The reordering of each block is independent of the
reordering of other blocks, with the possible exception of some knowledge about values
computed at the end of a block (or used at the beginning of a block).
Trace schedulers reorder the instructions in a simple path of blocks. The paths that
are reordered are chosen to be the most frequently executed paths in the program.
Instructions may be moved from one block to another. In fact, instructions may be moved
to places where the value computed is not guaranteed to be used (speculative execution).
By reordering these larger sequences of instructions, more opportunities can be found for
eliminating stalls.
Software pipeliners reorder and replicate instructions in loops to eliminate stalls.
The result of software pipelining is a new loop in which values are being simultaneously
computed for multiple iterations of the original loop.
For a superscalar processor, block scheduling is inadequate. If the machine can
issue four instructions on each clock cycle, then a one-cycle delay means that four
potential computations are not performed. The compiler must find ways to move
computations together so that they can be simultaneously executed. Since most blocks
are small, the compiler must combine computations from multiple blocks or multiple
iterations of a loop. In other words, the compiler must perform some form of trace
scheduling and software pipelining.
Up to this point, we have ignored the amount of time needed to execute instructions
and the way that the instructions are issued by the processor. The SCHEDULE phase
reorders the instructions to avoid stalls. The Alpha instruction sequence in Figure 12.1 for
the statement A = (B + C) + D * E will waste two cycles when the instructions are executed
in the order described by the source language. The initial load operation takes two cycles
to get the data from the data cache. These cycles can be used to execute the later load
operations, allowing the loads and the subsequent multiplication and addition to overlap.
Figure 12.1 Instructions Before (left) and After (right) Scheduling
Figure 12.1 indirectly shows three other concerns. First, the initial instruction
sequence can be executed using only three registers. The reordered sequence requires
four registers. Reordering the instructions can increase the number of registers needed,
therefore making register allocation harder. There are also cases where instruction
scheduling will decrease the number of registers needed; however, these are rare. On the
whole, instruction scheduling makes register allocation more difficult.
The second problem is, what happens when register allocation cannot put all
temporaries in physical registers? The register allocator will insert store and load
operations to move data to and from memory. These instructions destroy the original
schedule of instructions, so instruction scheduling may need to be repeated after register
allocation. The second time it is called, most temporaries will already have been assigned
to physical registers so less movement is possible.
The third problem is implicit in the example shown above. The Alpha processor can
issue four instructions at the same time. In this example, there are never more than two
instructions available to issue. Many cycles have no instructions to issue, thus many
opportunities for starting instructions (cycles) are wasted. How can the compiler reform the
program to make more instructions available? We have discussed one method already:
loop unrolling. If the code in Figure 12.1 were array references inside a loop, then four
iterations of the loop could be executed simultaneously, using many of the wasted cycles.
After register allocation, a simpler pass can be used in which scheduling only occurs
within a block. In that case only blocks with new instructions inserted need to be rescheduled.
Scheduling can create opportunities for peephole optimizations. It can move loads
and stores of the same location so that they are adjacent. Hence the scheduler must be
prepared to do some limited forms of peephole optimization as it schedules instructions.
After register allocation has been performed, the scheduler can be called again if there are
any instructions inserted by the register allocator. If there is no register spilling in the
register allocator, then the second execution of the scheduler is unnecessary.
12.3 Example
Two examples are used to illustrate instruction scheduling. First, in Figure 12.3, is
the inner loop of the running example. We will schedule the body of the loop, gaining
some performance even though the loop has few instructions. This is typical of many
loops in real programs.
The corresponding scheduled fragment of the flow graph is given in Figure 12.4. All
of the store operations have been removed from the loop and the superblock
transformation has replicated the instructions at the end of the loop to improve instruction
scheduling.
Figure 12.5 is used as an example for two purposes. The compiler will software
pipeline this loop, overlapping the execution of multiple iterations. As well as being used to
show software pipelining, we will use this example to illustrate how the compiler would
compile a loop that is not software pipelined. Such a loop may be unrolled to gain more
instructions for scheduling.
Figure 12.3 Inner Loop of Example
Figure 12.6 contains the instructions that are provided after previous compiler
phases if the loop will be software pipelined. The body of the loop contains the instructions
for one iteration of the loop. Figure 12.7 contains the instructions for the loop that were
generated when assuming that the loop would not be software pipelined. The loop has
been unrolled four times so that the computations can be overlapped. In this particular
case, the compiler might actually unroll the loop more than four times; however, nothing
would be gained by depicting more unrolling as an example.
Before describing the scheduling algorithm itself, we will discuss five topics that
form the basis for scheduling:
Rather than scheduling the instructions in a single block, the compiler will schedule
instructions in a collection of blocks, called a trace. First the compiler must compute the
traces. Then it will schedule the instructions in the trace as if they were a single block of
instructions.
As you will see momentarily, the traces are not necessarily sequentially contiguous
blocks in the flow graph. When they are not adjacent, the compiler must compute the
temporaries that are used or defined between the blocks in the trace.
When the trace and the interblock information is known, the compiler will compute a
data structure called the interference graph, which describes which instructions must be
evaluated before other instructions and how far in advance these prior instructions must
occur.
Just before the instructions are scheduled, the compiler must compute for each
instruction an estimate of how many cycles occur between it and the end of the trace. This
is called the critical path information and will be used to choose the instructions for
scheduling.
During scheduling the compiler simulates the execution of the instructions and
keeps track of which function units within the processor are busy during each execution
cycle. This can all be done by maintaining a collection of status information and updating it
on each cycle. It is more efficient to precompute all possible states that the function units
can be in and represent this as a finite state machine. The update of the state then
reduces to a state transition.
We will discuss each of these topics in turn and then give the scheduling algorithm
at the end.
If the trace gets longer than a fixed size (determined experimentally), terminate the
trace. The size should be measured in terms of instructions. Some of the scheduling
algorithms are not linear in the size of the trace, so avoid a trace that is too big.
Conversely, if the trace is large then a significant amount of instruction overlapping will
already be available, so little incremental advantage can be had by increasing the size of
the trace.
Given these conditions, the algorithm for computing the trace, given in Figure 12.8,
is straightforward. Form a priority queue of blocks ordered by execution frequency. Use
this queue to find the anchor of a trace and then extend it by the rules mentioned above.
Scan backward including the dominators until one must stop the trace. This gives us the
entry point. Now scan forward from the anchor, including either a path through the
extended block or a postdominator. These rules are flexible. The best choices for traces
depend on the programming styles of the users and the best programming styles in the
source language, so be prepared to modify this code to meet these needs.
Figure 12.8 Calculating Traces
The compiler needs to have a means to name a trace. The name the compiler uses
is the block that is the entry to the trace. Each block has an attribute trace(B), which is
either NULL because the block has yet to be inserted in a trace or is the block that is the
entry to the trace. Given this attribute, it is easy to find all of the blocks in the trace. The
trace consists of a set of blocks forming a path in the dominator tree starting at the entry
block to the trace. Simply scan down the tree looking at each child. If there is a child with
the same value for trace, then the trace includes that child. If no child has the same value
of trace as its parent, then the trace ends.
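A sketch of this scan, assuming trace is a table mapping each block to the entry block of its trace (or None) and dom_children(b) yields b's children in the dominator tree; both names are ours:

    def blocks_in_trace(entry, trace, dom_children):
        # Follow the dominator tree downward while a child belongs to the
        # same trace; the trace ends when no child matches.
        blocks = [entry]
        b = entry
        while True:
            nxt = next((c for c in dom_children(b) if trace.get(c) == entry),
                       None)
            if nxt is None:
                return blocks
            blocks.append(nxt)
            b = nxt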
Note that we use the vertical lines, |B|, to represent the number of instructions in B.
This is a reasonable notation since the vertical lines are used in mathematics to represent
cardinality.
The decision process for adding the dominators of the anchor to the trace is given
in Figure 12.9. The dominators are added if there are any (the compiler must stop at the
root) and they are not already in a trace. If the trace has gotten too long, stop the trace.
The compiler must also check whether the dominator is in a loop that does not include the
anchor directly or indirectly. It is appropriate for the dominator to be in an outer loop, but
not a loop in which the anchor is not directly or indirectly included.
Figure 12.9 Determining Whether Dominators Can Be Added to a Trace
For extending the trace from the anchor into an extended block, the algorithm in
Figure 12.10 is used. Find a successor that has only one predecessor. Choose the
successor with the highest frequency; that successor is the next block to add to the trace.
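A sketch of this selection step, with succs, preds, and freq as hypothetical accessors for successors, predecessors, and execution frequency:

    def next_trace_block(b, succs, preds, freq, trace):
        # Candidates extend the extended block: not yet in any trace and
        # having b as their only predecessor.
        candidates = [s for s in succs(b)
                      if trace.get(s) is None and len(preds(s)) == 1]
        # Pick the most frequently executed candidate, if any.
        return max(candidates, key=freq, default=None)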
Now consider the running example we have been using throughout the book. We
will use the flow graph without superblock formation occurring. Forming superblocks will
make for a better trace, but that is for later discussion. Assume that each loop is executed
one hundred times, so the inner loop is actually executed nearly ten thousand times. We
assume that the maximum value is changed about ten times each loop, so that the
number of executions of block B6 will be one thousand (see Table 12.1).
The compiler forms the priority queue of blocks and chooses one of the most
frequently executed. There is not a unique choice here. One possibility is that block B3 will
be chosen first. Then the immediate dominators of that block will be scanned, giving a first
trace of {B0, B1, B2, B3}. The next trace would be the single block {B6}. Then block {B4}
forms a trace, with the final trace being {B5}.
Definition
IDEFS:
For each block B, IDEFS(B) is the set of temporaries that are defined on some path
from IDOM(B) to B. This does not include definitions that occur in B or in IDOM(B).
In Figure 12.11, IDEFS(B4) includes T2 and T3 but does not include T1 or T4. T2
and T3 are included because each is defined on a path from B1 to B4 where B1 is the
immediate dominator of B4.
A similar set of information exists for uses rather than definitions. The idea is
identical and the computations that we see below are identical. The only difference is that
the occurrences of uses of temporaries and variables as operands are measured rather
than the targets of instructions.
Definition
IUSE:
For each block B, IUSE(B) is the set of temporaries that are used as operands on
some path from IDOM(B) to B. This does not include uses that occur in B or in IDOM(B).
We have the basic information. How does the compiler piece this together into an
algorithm? First the compiler must compute this information bottom-up in the dominator
tree: It needs the information for dominated blocks to compute the information for
dominator blocks. Because of the way IDEFS is defined, and the previous observation that
the information for one child in the dominator tree can affect the information for the other
children of a block, the information for all children of a node is computed simultaneously.
The compiler needs to know DEFS(Zi, P), where P is a predecessor of Zi+1. This is
the difficult information to store efficiently. The storage uses a UNION/FIND algorithm.
Consider a block B0 that is the current block being processed. Let Z1 through Zn be the
children of B0 in the dominator tree. Thus each block dominated by B0 is a member of
one of the subtrees rooted at one of the Zi. If one has a block P that is a predecessor of
Zi+1 on the same path, then one can walk up the dominator tree from P to the
corresponding child of B0 that is the root of the tree. As one performs this walk, one can
compute DEFS(Zi, P) using the formula above. By adding in OUT(P) one computes the
temporaries potentially modified between Zi and Zi+1. This is the information we need.
However, this tree walking is inefficient. A shadow data structure is therefore
created that contains the same information as walking the tree. The data structure is
collapsed as the walk progresses. This data structure is based on a UNION/FIND tree with
the addition of EVAL operations to compute the sets. Here is how it is structured. When a
block is processed, it is added into a UNION/FIND structure in which the representative of
the partition is the block at the root of the subtree that has been processed and all blocks
in the subtree are dominated by the representative. Of course the standard UNION/FIND
collapsing occurs to make this tree much shallower than the actual dominator tree.
Associated with each edge in this UNION/FIND structure is the DEFS set between the
parent in the structure and the child. When collapsing occurs, the DEFS set is updated to
represent the new parent and the child. When EVAL is called, collapsing occurs and the
resulting DEFS set is returned as the value.
We now have everything for the algorithm except the mutual computation of the
IDEFS sets for the children of a particular node. What does the previous discussion tell
us? We can view the children of B0 as a new graph where there is an edge between the
two children if there is a path from one to the other that does not go through the parent.
Given this new graph, the set of temporaries in IDEFS becomes the set of temporaries
modified on any path from the roots of the graph (they are the children that are direct
successors of the parent) to the nodes. This can all be done by topologically sorting the
children. Of course there can be strongly connected regions. Their effect is that arbitrary
paths through the strongly connected regions can occur, so the union of all temporaries
modified in a strongly connected region must be computed.
Figure 12.12 shows the algorithm for performing this computation. The children, the
Zi, are formed into a graph by looking at each child's predecessors and finding the
alternate child that dominates that predecessor. This gives the edge between two of the
children. As noted earlier, this could have been done by walking up the dominator tree.
Instead it is done with a UNION/FIND algorithm so that paths may be collapsed. Then the
strongly connected components are computed and ordered in reverse postorder. Now we
have the effect of a topological sort. Predecessors occur before successors except for
strongly connected regions.
Since a path can go around a strongly connected region any number of times, the
effect of a strongly connected region is the union of the effects of the blocks within it. For a
single block, there is no effect between the predecessor and the current block. Having
computed the summary effect, that information is added to the information already
computed for the predecessors to give the information about what can be computed on a
path from the direct dominator through one of its successors that is also a child (and root)
to the current node. This information is then added to the dominator tree to store the
result.
Figure 12.12 Algorithm for IDEFS
Figure 12.13 gives the support procedures for implementing the UNION/FIND and
EVAL. They are included because the EVAL operation is rarely used in the literature.
There are two attributes implementing these operations. DEFS indicates the set of
temporaries that are changed between the parent and the child; the information is stored
with the child. FindParent indicates the parent of a block. If it is null then this is the root of
the current tree.
The initialization consists of simply setting all FindParent attributes to null. The
DEFS attribute need not be initialized since it will only be used after being set. The FIND
operation consists of walking up the tree to find the root. Once that has occurred, the tree
is collapsed using the collapsing procedure to shorten any future walks.
The UNION operation has a fixed block that is made the parent. It is guaranteed to
be fed two blocks with FindParent being null, so no collapsing will occur. The other
attribute is the set of blocks modified between the parent and the child, which is simply
stored in the data structure.
The EVAL operation uses FIND to find the root. At the same time a collapse occurs
(within the FIND). Hence the EVAL consists of simply returning the stored data that has
now been updated to be between the root (now the parent) and the current block.
The real work occurs in the COLLAPSE procedure. If the parent is not the root,
collapse the parent first. Now there are two hops to the root. Collapse it to one hop by
using the definition of DEFS to compute the temporaries modified between the root and
the current block.
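The support operations can be sketched as follows; the attribute names (FindParent, DEFS) follow the text, but this Python rendering is ours and treats the DEFS information as ordinary sets of temporaries:

    find_parent = {}      # absent/None at the root of a tree
    defs_to_parent = {}   # temporaries modified between parent and child

    def collapse(b):
        # If b's parent is not the root, collapse the parent first; then b
        # is two hops from the root, and we splice out the middle node by
        # unioning the DEFS sets along the two edges.
        p = find_parent[b]
        if find_parent.get(p) is not None:
            collapse(p)
            defs_to_parent[b] = defs_to_parent[b] | defs_to_parent[p]
            find_parent[b] = find_parent[p]

    def find(b):
        # Locate the root, collapsing b's path to shorten future walks.
        if find_parent.get(b) is None:
            return b
        collapse(b)
        return find_parent[b]

    def union(parent, child, defs_between):
        # Both arguments are roots of their trees, as the text guarantees,
        # so no collapsing is needed here.
        find_parent[child] = parent
        defs_to_parent[child] = set(defs_between)

    def eval_defs(b):
        # Returns the temporaries modified between the root of b's tree
        # and b itself (empty if b is the root).
        if find_parent.get(b) is None:
            return set()
        collapse(b)
        return defs_to_parent[b]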
This is sufficiently complex that an example is necessary. Consider the normal flow
graph for the running example (refer back to Figure 2.1). We will deal with a single
temporary. In that case we can refer to Boolean values rather than sets: The value true
occurs if the temporary is in the set. Note that block B1 dominates block B2 and block B4.
Assume that a temporary T is modified in block B6. What is IDEFS(B4)?
Before processing block B1 (which computes the value of IDEFS(B4)), the
algorithm must process B2 (which computes the values of IDEFS(B3) and IDEFS(B6)).
The blocks B3 and B6 form a graph, with B6 preceding B3. When the algorithm is applied,
one gets the value of OUT(B6) added into IDEFS(B3) so that IDEFS(B3) is true.
Now apply the algorithm to B1, computing the values of IDEFS(B4) and IDEFS(B2).
One of the predecessors of B4 is B3, which is dominated by B2, so the graph of children is
formed with B2 preceding B4. When computing the IDEFS set for B4, its predecessor B3
is interrogated and we find that IDEFS(B3) is true so IDEFS(B4) is true.
The same algorithm can be applied to compute the IUSE sets using the IN set of
temporaries that are used as operands instead of the OUT set of temporaries modified.
Note that I said used rather than needed. It is possible to perform scheduling without building the interference
graph. Instead, the graph can be implicitly built by keeping track of the instructions computing the operands and their
placement. It is easier and more effective to build the interference graph, although this takes time and space.
Definition
Interference Graph:
Given a trace of blocks {B0, . . ., BN} the instruction interference graph is an acyclic
directed graph. There are three different kinds of nodes in the graph:
Each instruction in the trace is a node in the graph. These are the essential
elements of the graph.
For each block B in the trace, there is a Block Start node that will be referred to as
Block_Start(B). This node is present to determine where each block starts. It will
also carry dependence information necessary to inhibit reordering of instructions
that might destroy data needed later.
For each block B in the trace, there is likewise a Block End node, Block_End(B),
marking where the block ends. The interferences involving these boundary nodes
are described below.
An edge (Tail, Head) between two nodes indicates that Tail must precede Head in
the final order of instructions. The absence of an edge between two nodes means that
they can be arranged in any order. Each edge is labeled with an integer delay((Tail,
Head)) indicating the number of cycles after Tail issues that Head may issue. If the delay
is 1 then Head may issue on the cycle following Tail. It is possible for the delay to be 0.
This usually means that there is specialized hardware present to make the value of one
instruction available to another faster than the normal pipeline timing.
When is there an edge between two nodes? Two conditions are necessary. Tail
must precede Head in the initial instruction order; that is, Tail is executed before Head.
Second, both instructions must either use or define the same resource. There are four
cases:
True dependence: If Tail modifies some resource that is later used by Head, then
there is a true dependence. An edge exists between the two nodes, with a delay
indicating the length of time needed for Tail to complete the modification of the
resource. The length of the delay depends on both Tail and Head since the time for
the resource to be available is different for different instructions pairs.
Output dependence: If Tail and Head both modify the same resource, then the
initial order must be preserved so that later nodes will get the value of the resource
modified by Head. Normally the delay is 1, indicating that only the order counts.
Antidependence: If Tail uses some resource that Head later modifies, then the
initial order must be preserved so that Tail reads the value before Head overwrites
it. Again the delay is normally 1.
Input dependence: There is no restriction on order if both Tail and Head use a
resource without modifying it. No ordering of the instructions is required, so no
edge is created.
A resource is any quantity that indicates a change of execution state of the program.
Hence each temporary is a resource. Thus, there is an edge from an instruction that
evaluates a temporary to each instruction that uses the temporary. There is an edge from
an instruction that evaluates a temporary to the next instruction that evaluates the same
temporary. And there is an edge from each instruction that uses a temporary to the next
instruction that evaluates it.
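A sketch of these rules for a single temporary t, assuming defs(i) and uses(i) return the resources an instruction defines and uses, add_edge records an edge with its delay, and latency is the delay function; all of these names are ours:

    def add_temp_edges(instrs, t, defs, uses, add_edge, latency):
        last_def = None        # most recent instruction evaluating t
        uses_since_def = []    # uses of t since that evaluation
        for i in instrs:       # instrs is the trace in initial order
            if t in uses(i) and last_def is not None:
                # True dependence: evaluation of t to each use of t.
                add_edge(last_def, i, latency(last_def, i))
                uses_since_def.append(i)
            if t in defs(i):
                if last_def is not None:
                    # Output dependence: evaluation to next evaluation.
                    add_edge(last_def, i, 1)
                for u in uses_since_def:
                    if u is not i:
                        # Antidependence: each use to the next evaluation.
                        add_edge(u, i, 1)
                last_def = i
                uses_since_def = []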
If the target machine has condition codes, they are a resource. They are handled like
temporaries. If the set of instructions that set condition codes is pervasive, as in some
complex instruction set computing (CISC) architectures, then the condition codes should
be handled specially since the size of the interference graph will be huge. In most RISC
architectures only a few instructions set condition codes (if they exist) and a few read
them. In that case the condition codes are handled as implicit operands or targets of the
instructions, just as the temporaries are handled as actual arguments.
Interferences for LOAD and STORE instructions are computed using the region of
memory that each can reference. Each region of memory that the compiler can identify is
a resource; hence the tags previously used for alias analysis indicate separate resources.
The edges for the load and store operations match the kinds of dependencies that occur:
There is an edge between each store operation and each succeeding load operation
from the same area of memory. If the compiler can determine that the memory regions do
not overlap, then no edge is necessary. The compiler can determine this if the areas of
memory are different, if the addresses are known to be different (for example, if the
addresses differ by a constant), or if the dependence analyzer leaves information
indicating that the store and load do not reference the same location in memory.
There is an edge between each store operation and each succeeding store operation.
The same considerations as with load operations apply.
There is an edge from each load operation to the next store operation into the same
region of memory. Of course if the addresses are known to be different, the edge is not
inserted.
Not all edges need be inserted in the graph. Assume the compiler is building an edge
(Tail, Head) and there are already two edges (Tail, Middle) and (Middle, Head) in the
graph with
delay((Tail, Head)) ≤ delay((Tail, Middle)) + delay((Middle, Head)).
Then the new edge is unnecessary: the edges already in the graph place stronger
restrictions on instruction order than the new edge. Occurrences of this situation are easy to
identify:
Consider a node Head that uses a resource R. By the definition, there must be an
edge from every preceding node modifying R to Head. The compiler need only
record the edge from the last preceding node that modifies R to Head. The set of
nodes that modify R form a list of edges in the graph since there is an output
dependence from each such node to the next one.
Consider a node Tail that uses a resource R. There is an edge from Tail to the next
node that modifies R, recording an antidependence; however, there is no need to
record the antidependences with later nodes that modify R since that
antidependence is subsumed by the initial antidependence and the sequence of
output dependences between nodes that modify R.
What are the conditions for interferences with BlockStart(B) and BlockEnd(B)? These
nodes represent the boundaries of each block, so the compiler must ensure that the
BlockStart node occurs before the BlockEnd node and that the BlockEnd node for the
dominator occurs before the BlockStart node of the dominated block. The other way to
view the BlockStart node is that it represents all instructions that occur before the block
and after the dominator. These ideas give us the conditions for interferences with
BlockStart and BlockEnd:
There is an interference edge between BlockStart(B) and BlockEnd(B) and an
interference edge between BlockEnd(IDOM(B)) and BlockStart(B). Thus, the BlockStart
and BlockEnd nodes form a linked list in the graph. This can be implemented by either
forcing these edges to exist or by introducing an artificial resource that is written by each
BlockStart node and read by each BlockEnd node. This creates the same edges as noted
above.
Pretend that BlockStart(B) reads every resource that is read by an instruction between
B and IDOM(B) and writes each resource that is defined between B and IDOM(B). In other
words, make the set of resources used by BlockStart(B) be IUSE(B), and the set of
resources defined by BlockStart(B) be IDEFS(B).
Since not all instructions are implemented in a simple pipelined fashion, the formula
must be made more complex. Consider the following two cases in the Alpha 21164 as
representative cases:
Integer multiply instructions cannot be issued more frequently than every four to
twelve cycles, depending on the instructions and sources of the operands. The
latency of each multiply instruction is eight to sixteen cycles, so the multiply
instructions are partially pipelined.
A floating divide instruction cannot be issued until results of the previous divide
instruction are available.
To compute a more accurate value for priority, the compiler must compute the total
latency caused by instructions in each of these classes. The priority cannot be less than
each of these values.
The integer class does everything in one cycle, so it ties up the function unit for a
cycle and is completed. The floating-point add instruction uses the floating-point unit for
two cycles, so it is not fully pipelined. It can only start a floating-point instruction every
other cycle. The floating-point multiply instruction is fully pipelined. Actually, it should be
represented as multiple function units with one stage for each cycle; however, these
function units are only used by the floating-point multiplier and are completely determined
by the first stage in the pipe, so the machine model can be simplified to show only the first
stage.
If the scheduler schedules first a floating-point add instruction and then a floating-point multiply instruction in the same cycle, then the machine state looks like Table 12.6.
There are no more instructions that can be scheduled in this cycle.
To start the next cycle, the machine shifts all of the columns left by one to indicate
that the current cycle is completed and the next clock cycle has become the current clock
cycle. This gives us the state in Table 12.7. Note that the machine can issue an integer
instruction or a floating-point multiply. However, a floating-point add instruction cannot be
issued since the function unit is still busy. Recall that the floating-point add unit used the same
resource twice.
This description has been a simplified one. There are many more function units and
they are not all directly connected to the instruction class. For example, there may be an
integer register write function unit that writes the resulting data into the register file. Also
some instructions will use multiple major function units: A copy integer to floating register
instruction will involve some of the integer function units and some of the floating function
units.
The problem is that computing the machine state in this fashion is time consuming
and requires specialized code in the scheduler. This compiler uses a technique described
by Bala and Rubin (1996) for simplifying and speeding the processing of state.
Note that the finite state machine may be nondeterministic. Why? Isn't the
construction we have just described deterministic? If there are single function units for
each function, then yes. If there are multiple units for the same operation (multiple integer
function units, for example), then there will be multiple transitions under the same
instruction class to distinct states.
What are the start states for this machine? Clearly the state representing the matrix
of all false values is a start state; however, there are two other classes of start states:
When one machine cycle is completed, the state of the machine must be initialized
for the next cycle. This involves shifting the matrix left one. So we need a function
STATE_SHIFT(S) that takes a state S and gives the state that represents the
matrix with all values shifted left one column. The range of this function must be
considered a start state for the scheduling of the next cycle. This function is
represented internally as a vector indexed by S and giving the state number for the
state at the beginning of the next cycle. To decrease the number of start states, we
require the scheduler to issue NOP instructions for all function units if it has no
other instructions to schedule in a given cycle. This means that all the initial
function units will be occupied in a state that ends a cycle and we do not have to
perform the shift for intermediate states.
At the beginning of a block, the compiler needs to make an estimate for the state of
the machine after executing one of the preceding blocks. This does not need to be
exact: The more accurate a computation, the fewer stalls that will occur. Since the
compiler does not know which block is actually the predecessor, it forms a state
from the state at the end of each of the predecessor blocks by ORing the state
matrices together. Actually we need only consider two predecessors because more
can be handled by applying the process successively to the rest of the
predecessors in pairs. So we must form the OR of any two states that end a block
and introduce those as new start states. We need a function COMBINE_PRED(s 1,
s 2) that takes the OR of the two matrices and returns the shifted result as the start
state for the first instruction in the block.
We have outlined the procedure. All of this computation is done during the compiler
construction so that the code in the machine consists of matrices representing the
transition functions and the COMBINE_PRED function and a vector representing the
STATE_SHIFT machine. This is very much like the tables required in the use of LEX or
YACC.
The algorithm is outlined in Figure 12.16. Initially the machine starts with a single
state, which is the state with nothing busy. The algorithm is written in terms of matrices;
however, each matrix for a state is stored once and a unique integer to represent the state
is used to represent the matrix in all tables that are generated for use in the compiler.
There is a waiting list called StateQueue that holds each state after it is created.
Each state only enters that queue once because it is entered into the StateTable and the
StateQueue at the same time and nothing is ever removed from the StateTable. When a
state is processed, the generator attempts to create a transition under every possible
instruction class.
If there are no transitions generated, then the machine is full for the current clock
cycle and the compiler must generate a transition to a new start state for the next cycle.
This is done by performing that manipulation on the matrix for the state and then seeing if
a corresponding state already exists. If not, add it to the set of states also.
The whole process is continued until all states have been processed so that all transitions
are known. After the algorithm is performed, the equivalent deterministic finite state
machine must be found.
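The sketch below follows the outline of Figure 12.16 for a toy machine of our own invention (one integer unit, and one floating-point add unit that stays busy for two cycles); a state is a set of busy (unit, cycle) pairs standing in for the resource matrix. Unlike the figure, this sketch records the shifted next-cycle state for every state, not only for full ones.

    RESERVATIONS = {
        # Hypothetical reservation tables: the (unit, cycle) slots an
        # instruction class occupies, relative to its issue cycle.
        "int":  (("ialu", 0),),
        "fadd": (("fadd", 0), ("fadd", 1)),  # adder busy for two cycles
    }

    def issue(state, cls):
        need = RESERVATIONS[cls]
        if any(r in state for r in need):
            return None                  # structural conflict: no transition
        return state | frozenset(need)

    def shift(state):
        # One clock passes: column 0 retires, later columns move left.
        return frozenset((u, c - 1) for (u, c) in state if c > 0)

    def build_states():
        start = frozenset()              # nothing busy
        states = {start}
        queue = [start]
        trans = {}                       # (state, class) -> state
        state_shift = {}                 # state -> start state of next cycle
        while queue:
            s = queue.pop()
            for cls in RESERVATIONS:
                t = issue(s, cls)
                if t is not None:
                    trans[(s, cls)] = t
                    if t not in states:
                        states.add(t)
                        queue.append(t)
            t = shift(s)
            state_shift[s] = t
            if t not in states:
                states.add(t)
                queue.append(t)
        return states, trans, state_shift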
Consider the same set of states, but build the transitions in the reverse direction.
This gives us a very nondeterministic finite state machine from which we can build a
deterministic finite state machine. After we have scheduled a block, we run the reverse
state machine on the block to give a pair of state numbers for each instruction. The
forward state numbers indicate what legal instructions can occur in the future, and the
backward state numbers indicate what legal instructions can occur in the past.
We now have a representation of the state of the machine before an instruction is
executed and a representation of the machine following the instruction that is executed.
We store this information with each instruction. Give each instruction two temporary
attributes that exist during instruction scheduling and register allocation. ForwardState(I) is
the state of the machine before the execution of instruction I. BackwardState(I) is the state
representing the instructions already scheduled after I.
Some instructions are already placed at fixed points later in the schedule. Thus, some later
instructions must be scheduled before the next current instruction.
Rarely during register allocation, instructions will be spilled into memory. This
requires the insertion of load and store operations into the schedule. The best way to do
this is to find an empty slot in the schedule that can be replaced by the LOAD or STORE
instruction and directly place the instruction in the proper place in the schedule.
We therefore need to know the conditions under which one instruction can be
replaced by another. This includes the possibility of inserting an instruction into an empty
slot in the schedule.
Assume that we have the states ForwardState(I) and BackwardState(I) for each
slot in the schedule, whether there is an instruction there or not. Thus we can implement
the schedule as a sufficiently large array in which each instruction will occupy a slot. At the
beginning we initialize the ForwardState and BackwardState attributes to the start state for
each of the machines, indicating that all of the resource matrices are empty.
Now consider the conditions under which an instruction I can be inserted in slot IS.
To be able to be inserted in that position means that the instruction cannot conflict with
any instructions that have been scheduled in the past. This is the same as having a
transition out of ForwardState(IS) because we only created transitions when there was no
conflict. The BackwardState(IS) attribute indicates whether there is a future instruction
already scheduled that will conflict with I. If no future instruction conflicts with I, then there
is a valid transition out of BackwardState(IS) under I.
If the instruction I can be placed in slot IS, then the ForwardState and
BackwardState attributes of the slots must be updated. This involves recomputing the
ForwardState attribute forward from the slot IS and recomputing the BackwardState
backward from IS. This is less expensive than it seems. Since we are dealing with finite
state machines, we need only scan forward (or backward) as long as the newly computed
state differs from the previously stored state.
The recomputation of states will only differ for a few slots. Why? Remember the
construction of the finite state machine, which involved resource matrices and the shifting
of columns. As soon as all columns involving the current instruction have been shifted to
the left, the current instruction will not be visible in the state of the machine. In other
words, only a few shifts (the maximum number of columns in a matrix) can occur.
Practically, only a few iterations are required.
The pseudo-code summarizing the insertion is given in Figure 12.17. It elaborates
the discussion above. The value false is returned if the instruction cannot be inserted.
Otherwise, the insertion occurs and the states are updated.
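A sketch of the test and update, assuming slots is the schedule array (None for an empty slot), fwd and bwd hold the per-slot states, and ftrans/btrans are the forward and backward transition tables; these names are ours, and cycle boundaries (the STATE_SHIFT step) are ignored for brevity:

    def try_insert(slots, fwd, bwd, i, instr, ftrans, btrans):
        # The instruction must have a transition out of both states:
        # no conflict with the scheduled past or the scheduled future.
        if ftrans.get((fwd[i], instr)) is None:
            return False
        if btrans.get((bwd[i], instr)) is None:
            return False
        slots[i] = instr
        # Propagate the forward state; stop as soon as a recomputed state
        # matches the stored one (only a few slots can differ).
        state = ftrans[(fwd[i], instr)]
        for j in range(i + 1, len(slots)):
            if state == fwd[j]:
                break
            fwd[j] = state
            if slots[j] is not None:
                state = ftrans[(state, slots[j])]
        # bwd is updated symmetrically, scanning toward slot 0 (omitted).
        return True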
Why the use of value numbering here? Are not all redundant expressions
eliminated? No! Instruction scheduling can introduce redundant expressions. Consider the
source statement in Figure 12.19. If one of the branches is included in a trace with the
conditional expression at the beginning, then it is quite likely that A * B will be scheduled
before the conditional branch. It is therefore available before the beginning of the other
trace.
Figure 12.20 gives the actual algorithm for walking the dominator tree. First the
trace is determined as described earlier. It is headed by a block with Trace(B) = B. At most
one child in the dominator tree has the same value of trace, and so on down the tree until
no child has the value of B for the Trace. Then that trace is scheduled by a call to
SchedulePackets. After the trace is scheduled, walk the trace an instruction at a time, entering
the instructions into the value table. When a block boundary is reached, the walk
continues at the child in the trace; however, the instructions for each of the other children
are scheduled before that, since such blocks must be the beginning of a trace.
Figure 12.20 Determining the Trace and Walking It
The real work starts in Figure 12.21. SchedulePackets (notice the plural) first
computes the interference graph. From this it initializes the attributes Ready(I), which is
the first clock cycle where the instruction can be scheduled without instruction stalls, and
PredLeft(I), which is the number of predecessors that have not yet been scheduled.
PredLeft(I) is the same attribute used in many topological sorting algorithms to control a
topological sort. After all, an instruction schedule is a topological sort of the interference
graph. Ready(I) is the maximum of the times where the operands are available. The
operand is available after its instruction is scheduled and the delay associated with the
pair of instructions has occurred. Since it is a maximum, we initialize Ready(I) to 0 and
increase it whenever we find an operand that gives a larger value.
Before scheduling the instructions, the procedure checks to see if the instructions
at the root of the conflict graph are available outside the trace. If they are, the instruction is
replaced by a COPY instruction. We would like to do better, but we have a phase-ordering
problem here. Register coalescing has already occurred. We will attempt to make the
register allocator allocate the registers to the same register; however, this cannot be
guaranteed, so the copy must be made, which inhibits other optimizations.
Figure 12.21 Starting Trace and Scheduling Packets
The set Ready contains all of the instructions that can be scheduled on this cycle
without delay. The set Available is the set of instructions that are available to schedule
during this or a future cycle. In other words, all of the instructions that a member of that set
interferes with have been scheduled. To compute this set we keep an attribute for each
instruction called PredLeft(I), which is the number of predecessors in the conflict graph
that have not been scheduled. When this attribute reaches 0, the instruction is added to
the Available set.
With all of this machinery, the procedure Schedule_Packet in Figure 12.22 chooses
the instructions from Ready that can be scheduled. It chooses the most important
instructions first and only instructions that do not conflict with instructions that are already
scheduled. After all of the instructions have been scheduled, the Available set is updated.
The PredLeft attribute of each successor of an instruction in the packet is decremented.
When it reaches 0 its instruction is added to the Available set.
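Pulled together, the cycle-by-cycle loop looks roughly like the sketch below. Here succs(i) yields (successor, delay) pairs from the dependence graph, priority(i) is the critical-path estimate, pred_count gives the initial PredLeft values, and fits_in_packet stands in for the machine-state check; all of these names are ours.

    def schedule_trace(instrs, succs, priority, pred_count, fits_in_packet):
        pred_left = dict(pred_count)
        available = {i for i in instrs if pred_left[i] == 0}
        ready_at = {i: 0 for i in available}   # earliest stall-free cycle
        schedule = []
        cycle = 0
        while available:
            ready = [i for i in available if ready_at[i] <= cycle]
            packet = []
            # Most important instructions first, machine state permitting.
            for i in sorted(ready, key=priority, reverse=True):
                if fits_in_packet(packet, i):
                    packet.append(i)
            for i in packet:
                available.discard(i)
                for s, delay in succs(i):
                    ready_at[s] = max(ready_at.get(s, 0), cycle + delay)
                    pred_left[s] -= 1
                    if pred_left[s] == 0:
                        available.add(s)       # all predecessors scheduled
            schedule.append(packet)
            cycle += 1
        return schedule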
12.9.1 Refinements
There are two refinements to this scheduling algorithm that may improve it. It will
depend on the processor and the set of programs typically scheduled. The first refinement
involves looking at Schedule_Importance. If there is a critical instruction to schedule, we
schedule it; however, an earlier instruction that was not critical and was scheduled in an
earlier slot may prevent the scheduling of the critical instruction. How can the scheduler be
modified to (sometimes; remember, this is all NP-complete) prevent this?
Consider the set Available, which contains all of the instructions whose operands
have begun being evaluated. Choose the instruction in this set with the largest Priority.
Compute the instruction slot where that instruction will have its operands available. Then
before executing the normal scheduling process, schedule this critical instruction in this
slot.
This is a major modification to the scheduling algorithm; however, it may be useful
on some processors. This algorithm no longer schedules the instructions in order, so it is
not enough to keep track of the ForwardState alone. Now the instruction schedule is considered
to be a large array of instructions, initially empty, with the ForwardState and
BackwardState for each empty instruction slot being the initial state. The insertion of an
instruction in the schedule must then use the replacement algorithm rather than simple
insertion. For processors with complex processor architectures, this modification is
worthwhile.
The other modification to the scheduling algorithm is to schedule the instructions
backward. In other words, schedule the last packet first, then the preceding packet, and so
on back to the initial packet. To do this the compiler must build the reverse of the
interference graph and compute the number of cycles from an instruction to the beginning
of the block rather than to the end of the block. Otherwise the algorithm is identical.
There are two advantages to scheduling the trace backward. First, the scheduler
can track exact register pressure. As we have seen before, tracing through a sequence of
instructions backward allows the compiler to see which instruction is the last use, so the
compiler knows when a temporary is live or dead.
The other advantage to scheduling backward is more subtle. When scheduling
instructions forward, there are points at which there are no important instructions to
schedule; however, there may be other instructions that could be scheduled later but will
be scheduled early because there is nothing else to do. This needlessly increases the
register pressure. By scheduling instructions in reverse order, the compiler will schedule
an instruction at nearly the latest time possible for the value to be available when needed.
The disadvantage of backward trace scheduling is more of an unknown. Trace scheduling
has typically scheduled instructions in a forward order. How backward scheduling fits with
trace scheduling needs some experimentation. What one would like is to schedule
instructions away from the most frequently executed blocks in a trace. How does one do
this?
A final refinement to the scheduling algorithm can be made. If some of the
predecessors of the head of the trace have been scheduled already, then the beginning
state is not the initial start state of the finite state machine. Instead some of the function
units may already be busy. In building the finite state machine we computed the Join of
two states. This can be used on the predecessors to compute the initial state. If a
predecessor has not been processed, then ignore that predecessor in forming the Join.
The software-pipelined loop is the important concept. Multiple iterations of the loop
are folded upon one another. The first time through the software-pipelined loop, the
processor completes the last instructions for the first iteration of the loop, earlier instructions
for the second iteration of the loop, and so on. During the next iteration, the first iteration is
already completed. The instructions executed for the second iteration are the same as
those for the first iteration during the previous loop body execution, except they are for the
following iteration.
Figure 12.23 Schematic of Pipelined Loop
The software-pipelined loop contains instructions for multiple iterations of the loop.
We will see how to compute the number of iterations shortly. It is also unrolled to some
extent so that the temporaries can be renamed, permitting valid use of the physical registers. An
important point is that each instruction in the original loop occurs once in the software-pipelined loop (if the loop is unrolled, it occurs the number of times the loop is unrolled).
What is the benefit? Software pipelining works well when the separate iterations of
the loop reference independent data. In that case, the computations in one iteration have
nothing to do with computations in another iteration, so the execution of multiple iterations
allows a tighter scheduling of instructions (usually much tighter).
The software-pipelined loop executes until almost all iterations of the loop are
completed. It then exits into the epilogue code, which completes the final iterations of the
loop.
If the number of iterations of the original loop is small enough, there is no
advantage to software pipelining. In fact it makes the implementation of the software-pipelined loop more difficult. Also it will be simpler to implement software pipelining if the
number of iterations of the loop is a multiple of some number (to be determined later).
These two ideas can be combined by generating two copies of the loop: one is the
sequential copy and the other is the software-pipelined copy. The compiler compiles the
code as in Figure 12.24. The constant D, which is the number of iterations before software
pipelining is useful, and the constant S, representing the number of iterations of the loop,
are determined during the construction of the software-pipelined loop.
The compiler must generate the prologue, epilogue, and the software-pipelined
loop. Actually, it will generate the software-pipelined loop first, and all of the other
computations are determined by the loop. The agenda is as follows:
1. Determine one schedule of a single iteration of the loop constrained so that it can
be folded on itself. Think of rolling up a piece of transparent paper with marks on it
so that no two marks overlap and the marks are evenly spaced when rolled up.
2. In this schedule for a single iteration, determine the maximum number of cycles
that a temporary assigned in the loop is live. This will determine S and together with
the schedule determine II and the software-pipelined loop.
3. Then build the prologue by overlaying multiple executions of the first few iterations
of the loop as sequential (rather than looping) code.
4. Build the epilogue in the same way. Overlay the final executions of the loop as
sequential code.
To begin the process we need an estimate for the initiation interval. That is the length
of the software-pipelined loop. This is an initial estimate that will change for several
reasons while we determine the loop.
A loop-carried dependence occurs when a store on one iteration of a loop stores a value that might be loaded
on a separate iteration of the loop, or a load occurs on one iteration of a loop where a store may occur into the same
location on another iteration of the loop. Similarly for a store and a store.
The iterations are independent if the array that is stored into on all of the left-hand
sides of the assignment statements in the loop is not the same as the arrays on the right-hand side. The one exception is that a load from the same location that is stored into can
occur on the right-hand side. This is the minimal condition. It will find a number of the
loops for software pipelining but will not find many that occur in linear algebra codes.
Loops can be software pipelined if all load and store operations are referenced
through temporaries that change by the same amount through each loop and they can be
seen to reference different areas of memory. Surprisingly, this is both a specialized and a
general technique. It is specialized because it can only handle a few of the cases. However, if
it is used to generate two copies of the loop, one for sequential execution and one for
pipelined execution, the choice between loops can be made at the beginning of the loop
by comparing pointers.
This book describes a limited form of software pipelining. It is possible to
software-pipeline when there are loop-carried dependences. The same techniques as we
will discuss here apply, with added dependences in the interference graph. The number of
situations in which this gives an advantage over simple unrolling is limited.
Having said all of this, what is the initial estimate for the initiation interval II?
Consider the loop L. It is composed of a number of instructions. Each instruction must be
executed in the software-pipelined loop. Sort the instructions into buckets, one bucket for
each class of function unit: various floating-point units, integer units, and load/store units.
Divide the number of elements in each bucket by the number of units of that type. In the
Alpha there are two integer units, so divide the number of integer instructions by two. The
maximum of these ratios is the estimate for the initiation interval II.
Yes, I know they are not identical units. This process is to get an approximation. If it does not work, it will be
increased later.
This estimate simply means that we must have enough slots in the packets to put
all of the instructions. So make II be the smallest value at which there are enough slots in
the packets.
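A sketch of this estimate; the unit counts are partly from the text (two integer units, one floating-point add and one multiply unit) and partly assumed (the load/store count), so treat them as placeholders.

    import math
    from collections import Counter

    # Number of function units of each class; counts partly assumed.
    UNITS = {"int": 2, "fadd": 1, "fmul": 1, "mem": 2}

    def initial_ii(instr_classes):
        """instr_classes: the unit class of each instruction in the loop."""
        buckets = Counter(instr_classes)
        # Each class needs ceil(count / units) packets; II is the maximum.
        return max(math.ceil(buckets[c] / UNITS[c]) for c in buckets)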
kernel, and the previous II packets were performed in the previous kernel. So the prologue
can have all but the last II packets of an iteration added to it, shifted late by II packets, and
renamed to use the temporaries from the second iteration. This continues until there are
no more instructions to be added.
Although these instructions in the prologue form a valid schedule, the schedule
should be combined with other surrounding code and can be scheduled better. Thus the
prologue should be scheduled as part of a trace containing the head of the loop.
Note that at the end of the unrolled loop, we have executed S iterations of the
original loop. The complete execution of the unrolled loop will therefore represent some
multiple of S iterations of the original loop.
12.12 References
Bala, V., and N. Rubin. 1996. Efficient instruction scheduling using finite state
automata. Unpublished memo, available from authors. (Rubin is with Digital Equipment
Corp.)
Ball, T., and J. R. Larus. 1992. Optimally profiling and tracing programs. Proceedings
of the Nineteenth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming
Languages (POPL '92), Albuquerque, NM, 59-70.
Fisher, J. A. 1981. Trace scheduling: A technique for global microcode compaction. IEEE
Transactions on Computers C-30(7): 478-490.
Freudenberger, S. M., T. R. Gross, and P. G. Lowney. 1994. Avoidance and suppression of
compensation code in a trace scheduling compiler. ACM Transactions on Programming
Languages and Systems 16(4):1156-1214.
Huber, B. L. 1995. Path-selection heuristics for dominator-path scheduling. Master of
Science thesis, Michigan Technological University.
Reif, J. H., and H. R. Lewis. 1978. Symbolic program analysis in almost linear time.
Conference Proceedings of Principles of Programming Languages V, Association for
Computing Machinery.
Sweany, P. H., and S. Beaty. 1992. Dominator-path scheduling: A global scheduling
method. Proceedings of the 25th International Symposium on Microarchitecture (MICRO-25),
260-263.
Warren, H. S. 1990. Instruction scheduling for the IBM RISC System/6000 processor. IBM
Journal of Research and Development 34(1).
Avoid spilling: The compiler should assign temporaries to registers so that the
number of LOAD and STORE instructions inserted by the register allocator is as
small as possible during the execution of the program.
Use few registers: The compiler should use as few registers in the register set as
possible. Registers that are saved by the calling procedure should be used before
registers saved by the called procedure.
Many compilers take a simplistic view of register allocation. They describe register
allocation in terms of some algorithmic problem, such as graph coloring or bin packing, and
then use some heuristic solution for that particular formulation. Such register allocators
perform well on problems needing few registers; however, if the number of registers
needed is significantly greater than the number of registers available, each of these
register allocation methods generates a large number of spill instructions, namely, the
loads and stores to memory generated by the register allocator.
The problem is that each of these allocation techniques uses one of the two types of
information available. The graph-coloring allocators use the concept of interference or
conflict graphs. The conflict graph has no concept of which instructions are near each
other, so it performs poorly on blocks. The bin-packing register allocators perform well on
blocks, but have to use an approximation to handle control flow. It is possible to create
situations where one algorithm will work better than another. This approach to register
allocation was chosen to expose the best attributes of each of the algorithms.
This compiler combines the two. Recall that the compiler has already inserted spilling
instructions to reduce the register pressure to less than or equal to the number of registers
available. The compiler will now use three distinct allocation algorithms to allocate
registers:
The compiler uses a derivative of graph-coloring register allocation introduced by
Preston Briggs (1992) to perform allocation of temporaries that are live across block
boundaries.
The compiler uses a derivative of the FAT algorithm introduced by Laurie Hendron
(1993) to perform allocation of the local temporaries that can be allocated to the same
registers as global temporaries.
The compiler uses the standard single-pass register allocation algorithm to allocate
registers to temporaries that are live only within a single block. This is a bin-packing
algorithm that allocates the local temporaries one at a time as the block is walked in
reverse execution order.
By separating the assignment of local and global temporaries, the compiler introduces
the possibility of a phase-ordering problem: The assignment of global temporaries may
inhibit the assignment of local temporaries. This is unavoidable since the optimal
allocation of registers is an NP-complete problem. The design is such that the particular
choice of algorithms to use will avoid as much of the problem as is possible.
To illustrate the interplay between global and local register allocation, consider the
pictorial representation of a block in Figure 13.1. The set of instructions where a temporary
is live is represented by a horizontal distance. Each temporary is represented by a distinct
row. The global register allocator will create situations such as R1, R2, and R4. R1
contains a value assigned in another block and used for the last time in this block. R2 is
assigned a value in this block and used in another block. R4 combines the two: R4 is
assigned a value in this block, control flow leaves the block, and returns to the block using
the value earlier in the block. R3 is a typical local temporary. It is assigned a value in the
block and used for the last time later in the block. In large procedures, this is most of the
temporaries. R1, R2, and R4 are allocated by the global allocator. R2 and R4 are
combined with other local temporaries by the FAT algorithm. R3 is allocated using the
local allocator.
Figure 13.1 Pictorial Representation of Block
As Chaitin also noted, one can construct a program to have any undirected graph as its conflict graph, so very
general graphs can occur. However, most graphs are simpler. For example, most temporaries have only one point of
definition, and programs are mostly structured so the interactions between temporaries are much more limited.
Frequently, all nodes can be removed from the graph using this heuristic. In that
case the observation above gives a complete algorithm for coloring the graph. Chaitin
originally proposed an algorithm that stopped when there were no nodes with fewer
neighbors than colors. The algorithm then chose a node to spill to memory.
The heuristic and a more recent improvement are illustrated by the conflict graph in
Figure 13.3. There are four temporaries with the edges representing the conflicts. S3 has
one neighbor, so it can be removed from the graph, leaving S0, S1, and S2. After
removing S3, each of them has two neighbors, so any one of them can be removed from
the graph next, say S0, and then S1, followed by S2. In the end we have a sequence of
temporaries (S2, S1, S0, S3) that need to be assigned registers. S2 is first. Put it back into
the graph and assign it to any register, say R0. S1 is next: Put it back into the graph and
assign it any register except the one assigned to S2, say R1. Similarly for S0, assigning it
R2. Finally, S3 needs to be assigned a register. It conflicts only with S2 (which is assigned
R0), so it can be assigned to either R1 or R2. Thus the algorithm can assign registers
even though S2 has three neighbors.
Although the algorithm has been described in terms of sequences, note that
the nodes are removed from the graph in the opposite order to the order in which they
are assigned registers. Thus the nodes are pushed onto a stack as they are removed and
popped off the stack as they are reinserted into the conflict graph.
Of course, the nodes themselves are not removed from the conflict graph. All that
the algorithm uses as the nodes are removed is the number of neighbors in the conflict
graph. Thus the algorithm must keep a count of the number of neighbors still in the graph
and update it as nodes are removed.
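The following C sketch walks through this push/pop structure on the four-temporary graph of Figure 13.3, assuming three available registers. It is a minimal illustration, not the book's pseudo-code: it finds nodes by linear search where the real allocator uses the bucket structure described below.

    #include <stdio.h>

    #define N 4   /* S0..S3 from Figure 13.3 */
    #define K 3   /* number of available registers (colors) */

    /* Conflicts: S0-S1, S0-S2, S1-S2, S2-S3; S2 has three neighbors. */
    static const int adj[N][N] = {
        { 0, 1, 1, 0 }, { 1, 0, 1, 0 }, { 1, 1, 0, 1 }, { 0, 0, 1, 0 }
    };

    int main(void) {
        int degree[N], in_graph[N], stack[N], color[N], sp = 0;
        for (int i = 0; i < N; i++) {
            in_graph[i] = 1;
            degree[i] = 0;
            for (int j = 0; j < N; j++) degree[i] += adj[i][j];
        }
        /* Simplify: repeatedly push a node with fewer than K neighbors,
           keeping only the count of neighbors still in the graph. */
        while (sp < N) {
            int pick = -1;
            for (int i = 0; i < N; i++)
                if (in_graph[i] && degree[i] < K &&
                    (pick < 0 || degree[i] < degree[pick])) pick = i;
            if (pick < 0) { printf("stuck: would spill\n"); return 1; }
            in_graph[pick] = 0;
            for (int j = 0; j < N; j++)
                if (adj[pick][j]) degree[j]--;
            stack[sp++] = pick;
        }
        /* Select: pop in the reverse order, giving each node a register
           not used by any neighbor already reinserted in the graph. */
        while (sp > 0) {
            int t = stack[--sp], used[K] = { 0 };
            for (int j = 0; j < N; j++)
                if (adj[t][j] && in_graph[j]) used[color[j]] = 1;
            for (int c = 0; c < K; c++)
                if (!used[c]) { color[t] = c; break; }
            in_graph[t] = 1;
            printf("S%d -> R%d\n", t, color[t]);
        }
        return 0;   /* S2 -> R0, S1 -> R1, S0 -> R2, S3 -> R1 */
    }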
stored need to be assigned. Since these registers do not exist until the middle of a register
allocation pass, the most effective way to deal with them is to repeat the coloring algorithm
using the complete set of registers.
The proposed register allocator does not need to repeat the graph-coloring
algorithm. The new temporaries introduced by spilling are loaded and stored within single
blocks, so they can be handled later by the local register allocator. This implies that the
register pressure may exceed the number of physical registers during local register allocation.
To summarize, the registers that cannot be colored are assigned spill locations in memory
exactly as the earlier LIMIT phase assigned spill locations. The determination of the load
and store locations together with the assignment of registers for these temporaries occurs
later, during local register allocation. To do this the global allocator performs the following
transformations when the temporary T is to be spilled:
It adds the temporary to the set SpillRegisters, which indicates to the local register
allocator that a LOAD instruction should be inserted before the first use (if not
preceded by a definition) and a store operation should occur after the last definition
(unless the temporary is no longer live).
Note that this is the reverse of the role that spilling played in LIMIT. In LIMIT the
compiler assumed that temporaries are in registers and only moved the temporary to
memory when absolutely necessary. Here the temporary is assumed to be in memory and
is moved to a register when needed. So the load operations happen before the block and
store operations occur after the block. The load operation cannot be moved backward, nor
can the store operation be moved forward without affecting other already allocated
temporaries. The placement of these operations thus cannot be improved without moving
the operations into the block.
Which buckets should be inspected first, the ones containing nodes with the most
edges or the ones containing nodes with the fewest edges? This is not clear to the author.
If one looks at nodes with the most edges first, then the total number of edges being
removed with each node is greater and it is likely that more nodes will come to have fewer
edges than the number of registers. If one looks at nodes with fewer neighbors first, then
the nodes with fewer neighbors will be the last nodes to be colored, when there is less
latitude. The nodes with more neighbors will be colored first, when there are more
registers available. There is no clear answer. This design pushes the nodes with fewer
edges first because it makes the pseudo-code simpler. The only change needed to experiment
with different orders is in the loop that references the buckets.2
2
Keith Cooper of Rice University has commented that any plausible improvement to a register allocation
algorithm can only be validated by experimentation. From my own experience, many changes to algorithms that should
theoretically only improve the performance of the allocator have decreased the performance. This is the nature of
NP-complete problems.
The stacking algorithm, as described in Figure 13.5, has been stated without some
of the optimizations and data structure choices that can be made. Here are some notes.
The stack can be implemented as a preallocated array: its size need be no larger
than the number of global temporaries.
The compiler must be able to delete arbitrary nodes from buckets. The buckets can
be implemented as doubly linked lists. Insertions into buckets can always occur at the
beginning of the list.
The algorithm writes the manipulation of i as simply as possible. One can
experiment with the order in which nodes are chosen. One can also decrease the number
of increments. Consider the algorithm as stated: if the current node is in Bucket(i), then
the next node will by necessity be in Bucket(j), where j >= i - 1, so the loop can restart at
that point rather than at 0.
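As an illustration of these notes, here is a minimal C sketch of buckets as intrusive doubly linked lists; the node layout and names are assumptions, not the book's data structures. Deleting an arbitrary node and inserting at the head are both constant-time, which is exactly what moving a temporary from Bucket(d) to Bucket(d-1) requires.

    #include <stddef.h>

    /* One node per temporary; the node for a temporary with d remaining
       neighbors lives in Bucket[d]. */
    typedef struct Node {
        struct Node *prev, *next;
        int temporary;
    } Node;

    #define MAX_DEGREE 128
    static Node *Bucket[MAX_DEGREE + 1];

    /* Insert at the head of the list: O(1). */
    static void bucket_insert(int degree, Node *n) {
        n->prev = NULL;
        n->next = Bucket[degree];
        if (Bucket[degree]) Bucket[degree]->prev = n;
        Bucket[degree] = n;
    }

    /* Delete an arbitrary node, given only the node itself: O(1). */
    static void bucket_delete(int degree, Node *n) {
        if (n->prev) n->prev->next = n->next;
        else Bucket[degree] = n->next;
        if (n->next) n->next->prev = n->prev;
    }

    int main(void) {
        Node a = { NULL, NULL, 0 }, b = { NULL, NULL, 1 };
        bucket_insert(2, &a);
        bucket_insert(2, &b);    /* Bucket[2]: b, a                     */
        bucket_delete(2, &a);    /* a's neighbor count dropped...       */
        bucket_insert(1, &a);    /* ...so it moves to Bucket[1] in O(1) */
        return 0;
    }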
Note that the algorithm does not attempt to keep the number of neighbors that have
been returned to the graph up-to-date. It does keep the attribute InGraph up-to-date
because it is used to signal that a temporary has been colored.
If there are no registers left after looking at all of the neighbors, the temporary is
spilled. This consists of leaving the InGraph attribute false to indicate that it has no
associated physical register and adding the temporary to SpillRegisters. The local register
allocator will take care of inserting load and store operations to effect the spilling.
Figure 13.6 Register Coloring Algorithm
If this heuristic does not work, then attempt to assign T to a physical register that
has already been used. This will keep the number of registers used down. Remember that
scheduling has already occurred, so the compiler has already reordered instructions and
nothing will be gained by using more registers.
If no register is available that has already been used, then use one of the
CallerSave registers because there is no cost for saving and restoring them. Failing that,
use a CalleeSave register; however, code must be inserted in the prologue and epilogue
of the flow graph to save and restore the physical register.
Unfortunately, the compiler cannot precompute this information and save it for all
cases where spilling might occur, since the attribute NumberLeft(T) keeps changing during
the process of pushing temporaries on the stack. Instead, the compiler precomputes the
following formula and then performs the division when spilling is necessary:
Priority(T) = Σ {frequency(p) | p is a point where T is used or defined} / NumberLeft(T)
As far as the code is concerned, the subroutine Compute Priority3 does a walk over
the flow graph identifying all load and store operations involving temporaries and
computing the numerator of this expression, which it stores in the attribute Priority(T).
Later, when a temporary must be chosen to push on the stack, the division by the
denominator is performed and the temporary with the smallest resulting value is chosen.
3
The code for Compute Priority is not included as pseudo-code. It is a clerical walk of the flow graph using the
frequency information stored with the block and the occurrences of load and store operations to accumulate the
information.
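A small C sketch of this division of labor follows; the arrays and the candidate set are hypothetical simplifications. The walk accumulates the numerator once, and the division happens only when a spill candidate must actually be chosen.

    #include <stdio.h>
    #include <float.h>

    #define MAX_TEMPS 64
    static double Priority[MAX_TEMPS];     /* numerator: summed frequencies */
    static int    NumberLeft[MAX_TEMPS];   /* neighbors still in the graph  */

    /* Precomputation: each use or definition of T at a point p adds the
       execution frequency of p's block to Priority(T). */
    static void note_reference(int t, double block_frequency) {
        Priority[t] += block_frequency;
    }

    /* Only when spilling is necessary is the division performed; the
       temporary with the smallest resulting value is chosen. */
    static int choose_spill(const int *candidates, int n) {
        int best = candidates[0];
        double best_value = DBL_MAX;
        for (int i = 0; i < n; i++) {
            int t = candidates[i];
            double value = Priority[t] / (double)NumberLeft[t];
            if (value < best_value) { best_value = value; best = t; }
        }
        return best;
    }

    int main(void) {
        int candidates[2] = { 0, 1 };
        note_reference(0, 10.0); NumberLeft[0] = 2;
        note_reference(1, 1.0);  NumberLeft[1] = 5;
        printf("spill t%d\n", choose_spill(candidates, 2));
        return 0;   /* t1: rarely referenced and has many neighbors */
    }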
The algorithm in Figure 13.9 performs these three tasks in a two-step process. A
backward pass is performed that determines the last instruction that places a value in one
of these temporaries. A store operation is inserted after those instructions. At the same
time, it determines which temporaries need a load operation inserted before them. The
algorithm does this by assuming that the load is needed and deleting the assumption if an
earlier assignment to the temporary is found.
The second pass is a forward pass, using the attribute NewName to hold the local
name for the spilled temporary and inserting the load operation before the first use of the
temporary name.
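The following C sketch shows the shape of the two passes for a single spilled temporary, with instructions reduced to one definition and one use apiece and the insertion of LOAD and STORE instructions represented by printing; the renaming through NewName is omitted.

    #include <stdio.h>

    #define NONE (-1)
    typedef struct { int def, use; } Inst;   /* one def, one use, or NONE */

    /* Backward pass: insert a store after the last definition of t (if t
       is still live on exit) and decide whether a load is needed before
       the first use by assuming it is and cancelling the assumption when
       an earlier definition is found.  Forward pass: insert that load. */
    static void spill_in_block(const Inst *code, int n, int t, int live_out) {
        int need_load = 0, store_done = 0;
        for (int i = n - 1; i >= 0; i--) {            /* backward pass */
            if (code[i].def == t && !store_done && live_out) {
                printf("insert STORE t%d after instruction %d\n", t, i);
                store_done = 1;
            }
            if (code[i].def == t) need_load = 0;      /* earlier def found  */
            if (code[i].use == t) need_load = 1;      /* assume load needed */
        }
        if (!need_load) return;
        for (int i = 0; i < n; i++)                   /* forward pass */
            if (code[i].use == t) {
                printf("insert LOAD t%d before instruction %d\n", t, i);
                return;
            }
    }

    int main(void) {
        /* t7 is used, then defined, then used again; it is live on exit. */
        Inst code[3] = { { NONE, 7 }, { 7, NONE }, { NONE, 7 } };
        spill_in_block(code, 3, 7, 1);
        return 0;  /* store after instruction 1, load before instruction 0 */
    }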
After the spilling of global temporaries, the local register allocator classifies the
types of temporaries that occur in the block. Before this is described, the reader should be
made aware that all of the walks in the register allocator mimic the computation of live
information. In fact, most compute live information. They all perform a reverse-execution
walk of the flow graph, either implicitly or explicitly computing live information and
performing whatever processing takes place at the same time. In the case of classifying
the temporaries, the information collected is a set of sets of temporaries and the maximum
register pressure, that is, the maximum number of temporaries live at any given time.
These sets are listed below:
LiveThrough: The temporaries that are live at each point in the block. They may be
referenced in the block and possibly modified; however, there is no point between
instructions where these temporaries are not live. Thus each one of them occupies
a physical register for the complete block, effectively removing that physical register
from any consideration for local allocation.
LiveStart: The temporaries that are live at the beginning of the block and become
dead after some instruction in the block. These are the global temporaries that will
cause the local register allocator problems. This local register allocator walks the
block backward (remember the simulation of live computation) to allocate
temporaries, and it must take great care not to overlap the use of a temporary that
it allocates with a physical register allocated to a temporary in LiveStart. The
allocator uses the FAT heuristic for doing that.
LiveEnd: The temporaries that are live at the end of the block and become live at
some instruction in the block. These will cause the compiler no problems with local
register allocation. In effect, they are preallocated local temporaries for the
purposes of allocation within this block.
LiveTransparent: The temporaries that are live throughout the block but not
referenced in the block. As with LiveThrough, these temporaries occupy a physical register
throughout the block. However, they are useful when the register pressure is too
high because they can be spilled before and after the block, as was done in the
LIMIT phase.
LocalRegisters: Local temporaries that become live within the block and later
become dead within the block. In computationally intensive programs this is the
largest class of temporaries. Allocating physical registers to these temporaries is
the whole point of this section. Note that the newly created temporaries associated
with spilled temporaries are in this class.
The algorithm in Figure 13.10 is precisely a recomputation of live information within the
block and uses this live information to classify all of the temporaries using the definitions
above. For example, a temporary in LiveTransparent is live on exit from the block and has
no references to it. Thus LiveTransparent is initialized to be the set of temporaries live on
exit and then a temporary that is referenced is removed. The others are handled similarly.
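As a rough illustration, the C sketch below classifies a temporary from its live-in, live-out, and referenced bits. It approximates the point-wise walk of Figure 13.10: a temporary that is live-in and live-out is assumed live at every interior point, which the real algorithm verifies rather than assumes.

    #include <stdio.h>

    enum Class { LIVE_THROUGH, LIVE_START, LIVE_END, LIVE_TRANSPARENT, LOCAL };

    /* Classify a temporary from live-in, live-out, and referenced bits. */
    static enum Class classify(int live_in, int live_out, int referenced) {
        if (live_in && live_out)
            return referenced ? LIVE_THROUGH : LIVE_TRANSPARENT;
        if (live_in)  return LIVE_START;   /* dies after some instruction */
        if (live_out) return LIVE_END;     /* becomes live in the block   */
        return LOCAL;                      /* born and dies in the block  */
    }

    int main(void) {
        static const char *name[] = { "LiveThrough", "LiveStart", "LiveEnd",
                                      "LiveTransparent", "LocalRegisters" };
        /* R1, R2, and R3 from Figure 13.1, plus a transparent temporary. */
        printf("R1: %s\n", name[classify(1, 0, 1)]);   /* LiveStart       */
        printf("R2: %s\n", name[classify(0, 1, 1)]);   /* LiveEnd         */
        printf("R3: %s\n", name[classify(0, 0, 1)]);   /* LocalRegisters  */
        printf("T:  %s\n", name[classify(1, 1, 0)]);   /* LiveTransparent */
        return 0;
    }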
Having classified the temporaries, it is now time to prepare for register allocation.
Surprisingly, the compiler computes the conflict graph for the block. Although graph
coloring is not the basis for this allocator, the graph-coloring heuristic can be viewed in the
following useful way: A temporary that has fewer neighbors than available colors is always
easy to color, so it can be put aside. By repeating this process, the easy temporaries are
all put aside, leaving only the temporaries that are difficult to color to be dealt with in a
specialized way. In fact, the removal of these easy temporaries removes the clutter, making
the hard decisions apply only to the hard temporaries.
The compiler computes two data structures for the local register allocator (see Figure
13.11). The first is the local conflict graph, in which the only temporaries that occur in the
graph are the temporaries that occur in this block. This makes for a small graph, one
hopes. There are cases where one procedure is a large (think thousands of lines of code)
block. In that case, the global conflict graph is small and this one is big.4
4
Compiler writers frequently forget that there are two categories of program writers. Human program writers are
more easily dealt with. The compiler can estimate the patterns of usage. Programs that write programs are much more
difficult, creating programs with horrid structure.
The algorithm also computes the range in which the temporary is live. This
information is needed by the FAT algorithm. The information is recorded by assigning two
numbers to each instruction. The end of the block is numbered 0, and the numbers
increase toward the beginning of the block. The smaller number of the pair represents the
portion of the instruction that performs modifications of registers. The larger number
represents the portion of the instruction that fetches the operands.
Figure 13.11 Building Lifetimes and Local Conflict Graph
There are two attributes associated with each temporary. StartTime(T) is the
counter associated with the instruction that writes the temporary; if the temporary is live at
the beginning of the block, then StartTime references a point preceding the block.
EndTime(T) references the operand section of the last instruction to reference the
temporary; if the temporary is live at the end of the block, then the attribute references a
point off the end of the block. These attributes are computed in a single walk through the
block that simulates the live computation, assigning EndTime the first time the temporary
becomes live and StartTime when the temporary becomes dead.
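Here is a hedged C sketch of that single walk. The pair (2k, 2k+1) for the instruction k steps from the end of the block is one possible realization of the two numbers per instruction; the sentinels for "precedes the block" and "off the end of the block" are likewise assumptions.

    #include <stdio.h>

    #define NO_REF (-1)   /* instruction has no def or use */
    #define UNSET  (-2)

    typedef struct { int def, use; } Inst;

    #define MAX_T 8
    static int StartTime[MAX_T], EndTime[MAX_T];

    static void lifetimes(const Inst *code, int n, const int *live_out) {
        for (int t = 0; t < MAX_T; t++) {
            StartTime[t] = 2 * n;                   /* precedes the block */
            EndTime[t] = live_out[t] ? -1 : UNSET;  /* -1 = off the end   */
        }
        for (int k = 0; k < n; k++) {
            const Inst *i = &code[n - 1 - k];       /* backward walk      */
            int wr = 2 * k, rd = 2 * k + 1;         /* write, read parts  */
            if (i->use != NO_REF && EndTime[i->use] == UNSET)
                EndTime[i->use] = rd;   /* first time live: the last use  */
            if (i->def != NO_REF)
                StartTime[i->def] = wr; /* becomes dead: the definition   */
        }
    }

    int main(void) {
        Inst code[3] = { { 5, NO_REF }, { NO_REF, 5 }, { 6, 5 } };
        int live_out[MAX_T] = { 0 };
        live_out[6] = 1;
        lifetimes(code, 3, live_out);
        printf("t5: start %d end %d, t6: start %d end %d\n",
               StartTime[5], EndTime[5], StartTime[6], EndTime[6]);
        return 0;   /* t5: start 4 end 1; t6: start 0 end -1 (off the end) */
    }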
After the register allocator has computed the conflict and lifetime information, it
prepares to do the standard graph-coloring heuristic to remove easy temporaries. Just as
with the global allocator, the temporaries are bucket-sorted (see Figure 13.12). The same
attributes are set up in the same way as in the global register allocator.
Now we will describe the algorithm out of order, for ease of understanding. What we
want to do is go through the block assigning physical registers to temporaries as we go.
This algorithm is described later in Figure 13.15. Before allocation begins, all of the
physical registers are placed in a set called FreeRegisters, indicating that they are
available for use. As we scan through the block (again in reverse order, simulating the live
computation), we assign one of the FreeRegisters to a temporary the first time that we see
it become live, that is, when we find the last use of the temporary. We return the physical
register allocated to a temporary to FreeRegisters at the point where the temporary is
defined (unless it is also used as an operand of the defining instruction).
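Ignoring for the moment the complication described next, the basic scan might look like the following C sketch, with FreeRegisters as a bit mask and instructions again reduced to one definition and one use.

    #include <stdio.h>

    #define NONE (-1)
    typedef struct { int def, use; } Inst;

    #define MAX_T    8
    #define NUM_REGS 4

    static int assigned[MAX_T];

    static int take_register(unsigned *free_regs) {
        for (int r = 0; r < NUM_REGS; r++)
            if (*free_regs & (1u << r)) { *free_regs &= ~(1u << r); return r; }
        return NONE;   /* pressure too high: spilling would be needed here */
    }

    /* Reverse scan: a temporary receives a register at its last use (the
       first time the backward walk sees it become live) and returns the
       register at its definition, unless it is also an operand there. */
    static void allocate_block(const Inst *code, int n) {
        unsigned free_regs = (1u << NUM_REGS) - 1;   /* all registers free */
        for (int t = 0; t < MAX_T; t++) assigned[t] = NONE;
        for (int i = n - 1; i >= 0; i--) {
            int u = code[i].use, d = code[i].def;
            if (u != NONE && assigned[u] == NONE)
                assigned[u] = take_register(&free_regs);
            if (d != NONE && d != u && assigned[d] != NONE)
                free_regs |= 1u << assigned[d];
        }
        for (int t = 0; t < MAX_T; t++)
            if (assigned[t] != NONE) printf("t%d -> R%d\n", t, assigned[t]);
    }

    int main(void) {
        /* t1 = ...; t2 = op(t1); t3 = op(t2); use(t3) */
        Inst code[4] = { { 1, NONE }, { 2, 1 }, { 3, 2 }, { NONE, 3 } };
        allocate_block(code, 4);   /* two registers suffice: pressure is 2 */
        return 0;
    }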
The problem is that this does not work if there are global temporaries already
allocated physical registers at the other end of the block. We may pull a physical register
out of FreeRegisters and assign it to a temporary whose lifetime overlaps a global
temporary that is already using that register.
The solution is to preprocess the global temporaries that are live at the other end of
the block (in this case, the start of the block since we are going through the block
backward). This is the FAT heuristic. Take one of these temporaries, call it T. The FAT
heuristic does the following operations:
1. It scans through the block finding all of the points where the register pressure is
maximum. These are called the FAT points.
2. For each of these FAT points, it chooses a local temporary that is live at that FAT
point. We say that the temporary covers that FAT point. The temporaries are
chosen so that each FAT point is covered and no two temporaries chosen have
overlapping lifetimes or lifetimes that overlap with T. This may not be possible; in
that case, there will be further spilling. After all, this is a heuristic, not an algorithm.
3. Each one of these temporaries covering the FAT points is assigned the same
physical register as T.
4. The physical register associated with T and the temporaries that cover the FAT
points is taken out of consideration for further allocation. The register pressure is
reduced by 1 at each of the instructions where one of the covering temporaries is
live. In other words, we ignore the physical register, T, and the temporaries that
cover the FAT points.
5. We now repeat this process with the other global temporaries live at the beginning
of the block until we have processed them all.
At this point there are no temporaries live at the beginning of the block that we care
about, so we can apply the one-pass local register allocator as described above.
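A simplified C sketch of the cover selection in step 2 follows. Live ranges are reduced to intervals of point numbers, T's physical register is assumed to have been chosen by the global allocator, and the interleaving with the coloring heuristic is omitted; the greedy choice of covers is one plausible reading of the heuristic, not the book's pseudo-code.

    #include <stdio.h>

    typedef struct { int lo, hi; } Range;   /* live range as a point interval */

    static int contains(Range r, int p) { return r.lo <= p && p <= r.hi; }
    static int overlaps(Range a, Range b) { return a.lo <= b.hi && b.lo <= a.hi; }

    /* One round of the FAT heuristic for a global temporary T: cover each
       FAT point (a point of maximum pressure) with pairwise-disjoint local
       temporaries that do not overlap T, and give them all T's register.
       T covers the FAT points inside its own range, where it holds the
       register itself. */
    static void fat_cover(Range t_range, int t_reg, const Range *local,
                          int n_local, const int *pressure, int n_points) {
        Range chosen[16];
        int n_chosen = 0, max = 0;
        for (int p = 0; p < n_points; p++)
            if (pressure[p] > max) max = pressure[p];
        for (int p = 0; p < n_points; p++) {
            if (pressure[p] != max) continue;            /* not a FAT point */
            int covered = contains(t_range, p);
            for (int c = 0; c < n_chosen && !covered; c++)
                covered = contains(chosen[c], p);
            if (covered) continue;
            for (int i = 0; i < n_local; i++) {      /* pick a disjoint cover */
                Range r = local[i];
                int ok = contains(r, p) && !overlaps(r, t_range);
                for (int c = 0; c < n_chosen && ok; c++)
                    ok = !overlaps(r, chosen[c]);
                if (ok) {
                    printf("local [%d,%d] covers FAT point %d and gets R%d\n",
                           r.lo, r.hi, p, t_reg);
                    chosen[n_chosen++] = r;
                    break;
                }
            }
            /* If no cover exists, further spilling happens here. */
        }
    }

    int main(void) {
        Range T = { 0, 3 };                      /* the global temporary */
        Range local[2] = { { 4, 6 }, { 7, 9 } };
        int pressure[10] = { 3, 3, 4, 4, 4, 3, 3, 4, 4, 3 };
        fat_cover(T, 2, local, 2, pressure, 10);
        return 0;
    }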
Figure 13.12 Build Buckets for Local Coloring
This is the algorithm we use. The only modification is that between processing
each of these temporaries the compiler applies the coloring heuristic to remove the easy
temporaries. This is the algorithm described in Figure 13.8. We now describe the
support procedures.
The graph-coloring heuristic is implemented by two procedures,
ADD_TO_LOCAL_STACK (Figure 13.13) and
GIVE_STACKED_TEMPORARIES_COLOR (Figure 13.14). The algorithms are copies of
the algorithms used during global allocation and will not be described further. Note that the
variable NumberRegisters starts out being the same as the constant
MaxPhysicalRegisters and keeps decreasing each time the FAT algorithm is applied.
Note that there should be no spilling involved with the coloring heuristic.
Temporaries are pushed on the stack when they have fewer neighbors than colors.
Nothing is pushed on the stack that violates that condition. When the FAT heuristic is
applied, one physical register is taken out of participation, so the number of neighbors
allowed is decreased by one. This does not affect any earlier temporaries pushed on the
stack.
Figure 13.13 Building Local Graph-Coloring Stack
When spilling is needed, the classic spilling heuristic is used (Figure 13.17).
Consider the register allocation process at an instruction I where there is an operand that
needs a temporary assigned to a physical register. There are not enough physical
registers, so choose the temporary whose previous use is the furthest away. By inserting
a load operation after I and a store operation after the last definition of the temporary, a
register is freed up for use for the largest possible period of time in the block.
Figure 13.17 Spilling within the Block
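The choice of victim can be sketched in a few lines of C; the arrays are hypothetical simplifications of the allocator's state.

    #include <stdio.h>

    #define NONE (-1)

    /* At an instruction I, scanning backward, choose among the temporaries
       currently holding registers the one whose previous use (the next
       reference the backward scan will reach) is furthest away, that is,
       has the smallest instruction index. */
    static int choose_victim(const int *prev_use, const int *holds_reg, int n) {
        int victim = NONE;
        for (int t = 0; t < n; t++)
            if (holds_reg[t] &&
                (victim == NONE || prev_use[t] < prev_use[victim]))
                victim = t;
        return victim;
    }

    int main(void) {
        int prev_use[3]  = { 9, 2, 5 };   /* index of each previous use */
        int holds_reg[3] = { 1, 1, 1 };
        printf("spill t%d\n", choose_victim(prev_use, holds_reg, 3));
        return 0;  /* t1: its previous use, at instruction 2, is furthest */
    }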
13.3 References
Briggs, P., K. D. Cooper, and L. Torczon. 1992. Coloring register pairs. ACM Letters on
Programming Languages and Systems 1(1): 3-13.
Briggs, P., K. D. Cooper, and L. Torczon. 1994. Improvements to graph coloring register
allocation. ACM Transactions on Programming Languages and Systems 16(3): 428-455.
Chaitin, G. J. 1982. Register allocation and spilling via graph coloring. Proceedings of
the SIGPLAN 82 Symposium on Compiler Construction, Boston, MA. Published as SIGPLAN
Notices 17(6): 98-105.
Chaitin, G. J., et al. 1981. Register allocation via coloring. Computer Languages 6(1):
47-57.
Hendron, L. J., G. R. Gao, E. Altman, and C. Mukerji. 1993. Register allocation using
cyclic interval graphs: A new approach to an old problem. (Technical report.) McGill
University.
locations with respect to the other instructions and data in the procedure. If required to do
so, the linker must adjust the addresses created by the compiler to be absolute addresses
rather than the relative addresses created by the compiler.1 This process is called
relocation.
1
Some instructions represent addresses as offsets from the current program counter. In this case the linker
does not need to adjust the addresses. Many processors have a set of relative branches together with the absolute jump
instruction.
Each section has a unique name. Two sections that have the same name are either
concatenated together or overlaid by the linker. Thus multiple object modules can
contribute to the same section by using the same name. Similarly, separate parts of
the same object module can contribute to the same section.
Each section has a set of attributes. The most important attribute is whether this
section involves concatenation of data from separate section commands or
overlaying of data from separate section commands. Other attributes include the
read and write attributes of the section. The object module can specify that a
particular section can be read-only or read-write. This information can be used by
the operating system to invoke page protection when possible.
Each section has an alignment. Since some data must begin at an address that is
a multiple of some specified power of two, the section command must allow the
compiler to describe the power of two to which this portion of the section must
be aligned. This allows the compiler to allocate packets of instructions for multiple issue
or data that must be aligned at specified addresses.
Each section command indicates a size. This is the number of bytes of memory (or
whatever memory units are used) to be allocated by this section command.
The section may have data stored in the storage represented by this section
command. Frequently this data will be instructions; however, it can be data or
constants.
The compiler represents absolute addresses as a section name together with an offset
within the section. The linker will replace this pair by the absolute address. When the
compiler is storing an absolute address in an instruction or data, it will actually store the
offset and create a command to indicate to the linker that the address of this part of the
section must be added in also. There is a subtle point here: The linker keeps track of
which section command is associated with each component of the section and will add in
the relative offset of the beginning of the data for that section command.
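As a sketch of this representation in C: the section name ".text" and the field names are illustrative assumptions, not a real object module format.

    #include <stdio.h>

    /* An absolute address is a section name plus an offset; the compiler
       stores the offset and emits a fix-up command telling the linker to
       add the section's final base address. */
    typedef struct { const char *section; unsigned offset; } Address;
    typedef struct { unsigned where; const char *section; } Fixup;

    int main(void) {
        Address entry = { ".text", 0x40 };       /* section + offset    */
        Fixup   fix   = { 8, entry.section };    /* patch the word at 8 */
        printf("store 0x%x at byte %u; linker adds the base of %s\n",
               entry.offset, fix.where, fix.section);
        return 0;
    }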
The object module contains the following commands besides the segment commands
above.
A definition command defines a symbol. It contains two parts: the name of the
symbol being defined and an expression representing the symbol. For our
purposes that expression need only be a segment name and an offset. Thus the
name MAXCOL in our running example must be made known to other procedures
so it can be called. The entry point is described by a definition command, which
represents the offset within the segment representing the instructions where the
entry point occurs.
These two commands are used by the compiler to instruct the linker on where to adjust
data and where to place addresses. More complex commands may be available in the
object module; however, these are what are needed for basic compilation. Another
commonly available command is one to expand the current section by a fixed amount of
initialized data. This will decrease the size of the object modules considerably.
There are other parts of the object module for debugging. These are less standardized
and can vary from language compiler to language compiler on the same machine. The
basic form of this data is tables of information. There must be a table that describes the
address of each variable in the program indexed by the location where the program has
stopped. The data may be in memory, on the stack, or in registers, and the debugging
symbol table must store this information. Furthermore, a table of line numbers or program
statements must be included, indexed by the program counter where the program has
stopped.
Originally, each machine had a distinct object module format. This is still true, although
many of the object module formats are based on COFF or ELF, formats that were
developed within the UNIX community. However, each manufacturer has developed
additions or slightly different implementations so that even these are not standard. This is
particularly true in the area of the debugging tables.
The major problem with object modules, in my experience, is that all descriptions of
object modules are inaccurate. The only real definition of an object module is what the
linker accepts. The only way to find this out is by experimentation. Implementing the object
module generator once the object module format is known is easy. Finding out what the
object module format actually is (not what it is described to be) is difficult.
A segment to hold all local initialized static data. The name of this segment is frequently a
function of the procedure name; however, the data could be placed in one large
segment with all other static data.
A segment to hold all local uninitialized static data. Again the name can either be a
function of the procedure name or one large segment for all uninitialized data.
Similarly, segments for initialized and uninitialized external data. This is data that
can be referenced by other procedures. It could be combined with the local data if
desired. Actually, each external variable may be placed in a segment by itself. This
can be useful in languages where the originator of the data is not known (such as C
or common blocks in Fortran). In that case each procedure can create the segment
for the data and mark it as an overlay segment. Then only one area of storage will
be allocated.
each address that is associated with the beginning of a block in the flow graph. Since
each instruction has a fixed size, this can be done in one pass. At the same time each set
of variables is scanned and the relative address within the segment is determined, just as
described in introductory compiler textbooks describing the layout of data.
During the second pass, the compiler generates the object module. It now
knows all addresses relative to their corresponding segments, so it can lay out the
segments. As it creates the segments it keeps a table of definition and use commands to
describe the operations that must be performed to update each datum or instruction. Thus
an instruction that includes an absolute address is represented as a fixed number in the
segment together with a command describing the use of the segment name to be added
into the number to represent the full address.2
2
With RISC architectures, the addresses are rarely included in the instruction. Instead the addresses are
included in the constants, which are loaded into registers to perform an absolute branch. This simply replaces the
updating of instructions by the updating of constants for relocation.
After all of the segments for data have been generated, the segments for the
debugging tables must be generated. They are usually handled like any other data
(although they may not be loaded in the executable by the linker). Thus the debugging
symbol table will have a collection of absolute data, such as the names of the symbols.
However, references to memory locations in the program will be represented by the use of
a defined symbol or by a segment name plus offset.
A long unconditional jump instruction. This will require two instructions, one to load
the address constant and one to perform the jump operation. Thus this instruction
sequence requires 8 bytes on most RISC processors.
A long conditional jump instruction. This will require three instructions: a short
conditional branch, the load of an address constant, and an unconditional jump.
This requires 12 bytes on most RISC processors.
between the destination and the location of the branch is increasing. Whenever the
distance is greater than that represented by a short branch, change it to a long branch.
Repeat the process until there are no changes. At that point, leave all branches short that
have not already been changed.
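The iteration can be sketched in C as follows, under a toy model in which lengthening a branch inserts 8 extra bytes between every branch and target that span it. The byte counts and structure layout are assumptions for illustration.

    #include <stdio.h>

    typedef struct { int at, target, is_long; } Branch;

    /* Displacement under the toy model: each long branch between the
       branch and its target pushes the pair 8 bytes further apart. */
    static int displacement(const Branch *b, const Branch *all, int n) {
        int lo = b->at < b->target ? b->at : b->target;
        int hi = b->at < b->target ? b->target : b->at;
        int d = b->target - b->at;
        for (int i = 0; i < n; i++)
            if (all[i].is_long && lo < all[i].at && all[i].at < hi)
                d += (d > 0) ? 8 : -8;
        return d;
    }

    /* Iterative relaxation: start with every branch short, lengthen any
       branch whose displacement no longer fits, and repeat until stable.
       Branches only ever grow, so the process terminates. */
    static void relax(Branch *br, int n, int short_reach) {
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int i = 0; i < n; i++) {
                int d = displacement(&br[i], br, n);
                if (!br[i].is_long && (d > short_reach || d < -short_reach)) {
                    br[i].is_long = 1;
                    changed = 1;
                }
            }
        }
    }

    int main(void) {
        Branch br[2] = { { 0, 120, 0 }, { 4, 130, 0 } };
        relax(br, 2, 124);
        printf("branch 0 %s, branch 1 %s\n",
               br[0].is_long ? "long" : "short",
               br[1].is_long ? "long" : "short");
        return 0;  /* lengthening branch 1 pushes branch 0 out of range too */
    }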
14.5 Generating the Assembly File and Additions to the Listing File
The assembly file is an optional text file that attempts to create input to the
assembler so that the result of assembly will be an object file identical to the original object
file.3 This goal is not always achievable. The compiler may generate object modules that
are not directly expressible in the available assembly language. Of course, the assembly
language can be extended to support all of these features, but frequently the assembler is
a less important piece of software, so its support is limited.
3
Let me go on record with all other writers of optimizing compilers: Attempts to edit the assembly file to improve
performance are misguided. The results will frequently be less efficient, if it works at all. It is much more likely that an
edit will generate an incorrect program because the editor could not follow all of the assumptions made by the compiler.
However, advanced users find these files useful, so they should be generated.
This file can be generated during the second assembler pass, at the same time that
the object module is being generated. The compiler has a representation of the flow graph
that mimics the instructions to be generated. All that the assembly listing needs to do is
translate the internal representation of the instructions into an external representation
matching the instruction in the flow graph. Since the internal representation mirrors the
external representation, this is a clerical process.
At times, a similar representation of the program is desired in the listing file. This
too can be performed at the same time as the assembly file is generated. In fact, columns
can be added to indicate the address of the instruction in the segment and the binary
representation of the instruction.
14.7 References
Wulf, W., et al. 1975. The design of an optimizing compiler. New York: American Elsevier.
where there are global variables defined to represent each of the registers in the
register set. All of the characteristics of the machine can be simulated in C. What is the
advantage? The compiler can run and generate assembly language for the program. The
program can be executed on any reasonable processor. Each instruction in the assembly
language can be simulated in this way.
In fact, additional expressions can be added to the definitions to measure the
characteristics of the program. For example, a counter can be incremented in the macro
simulating a branch operation. Thus we can know how many times all of the branches are
executed or, even more specifically, how often a particular branch is executed.
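The flavor of such a simulation, as a hedged sketch in C with a hypothetical three-instruction target: registers become global variables, each instruction becomes a macro, and the branch counter is the instrumentation mentioned above.

    #include <stdio.h>

    /* Global variables stand in for the register set. */
    static long R0, R1, R2;
    static long branch_count;   /* instrumentation folded into a macro */

    /* Each target instruction is simulated by a macro. */
    #define MOVI(d, k)     ((d) = (k))
    #define ADD(d, a, b)   ((d) = (a) + (b))
    #define BLE(a, b, lbl) do { branch_count++; if ((a) <= (b)) goto lbl; } while (0)

    int main(void) {
        /* Simulated assembly: sum the integers 1 through 10. */
        MOVI(R0, 0);             /* sum   */
        MOVI(R1, 1);             /* index */
        MOVI(R2, 10);            /* limit */
    loop:
        ADD(R0, R0, R1);
        ADD(R1, R1, 1);
        BLE(R1, R2, loop);       /* branch while index <= limit */
        printf("sum = %ld, branches executed = %ld\n", R0, branch_count);
        return 0;                /* sum = 55, branches executed = 10 */
    }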
The StrongARM processor was chosen as the real machine to compile for because
there are inadequate optimizing compilers for that processor. I find no joy in building a
compiler that will always be worse than a compiler that already exists, so I am choosing a
processor with less support. Thus I have a chance of building a compiler that might be
useful.
Observation
Consider a flow graph in which there is a path from each block to Exit. Let
ANTIN and ANTOUT be the maximum solution to the anticipation equations for the
temporary T. If ANTOUT(B) is false, then T is not anticipated at the end of B. If
ANTIN(B) is false, then T is not anticipated at the beginning of B.
Proof
Consider the case in which ANTOUT(B) is false. The proof consists of showing
that there is a path from B to Exit that either contains an instruction that kills T
before an evaluation of T or contains no evaluations of T at all.
By assumption, ANTOUT(B) is false, so there is some successor B1 with
ANTIN(B1) equal to false. ANTIN(B1) being false means that T is not locally
anticipated in B1. If B1 kills T, then we are done because the path can be extended
with any path from B1 to Exit giving a path violating the definition of anticipation at
B.
Now continue to add blocks B = B0, B1, . . . , Bn to the path such that Bi is a
predecessor of Bi+1 and T is not locally anticipated or killed in any of the blocks after
B and ANTOUT(Bi) is false for each of these blocks. The problem is to add another
block to the path in such a way that a path to Exit can be constructed. There are three
possibilities:
If Bn has no successors, then Bn is equal to Exit since this is the only block
with no successors. When this situation occurs, a path from B to Exit has
been constructed containing no instructions that kill or evaluate T. The path
thus violates the definition of anticipation, and T is not anticipated at the end
of B.
Bn has a successor Bn+1 such that ANTIN(Bn+1) is false and Bn+1 is not on the
path. Since ANTIN(Bn+1) is false, T is not locally anticipated in the block. If T
is killed in the block, then the path can be extended from Bn+1 to Exit by any
path, giving a path that violates the definition of anticipation. If T is not killed
in the block, then ANTOUT(Bn+1) must also be false, so Bn+1 is added to the path
and the construction continues.
Bn has no successor with ANTIN equal to false that is not already on the path.
A way to continue expanding the path must be found. If I can show that there
is always a way to continue the path, then the proof is completed since the
two previous possibilities lead to a path violating the definition of
anticipation.
Observation
If Bn has no successors that have ANTIN equal to false and are not in the path, choose one of the
successors S that is already in the path. We have a cycle starting with S and continuing through the
other blocks on the path until Bn is reached. All of the blocks on this cycle satisfy the conditions of
the previous observation, so there is a successor Q of one of the blocks Bk that is not in the cycle
and has the value ANTIN(Q) equal to false.
Now add the blocks S, . . ., Bk, Q to the path after Bn. Although some blocks have been added to the
path multiple times, the path has been extended by at least one new block Q. This shows that the
third case always leads to the addition of at least one block and completes the proof of the
observation.
Bibliography
Aho, A. V., and J. D. Ullman. 1977. Principles of compiler design. Reading, MA: Addison-Wesley.
Aho, A. V., J. E. Hopcroft, and J. D. Ullman. 1974. The design and analysis of computer
algorithms. Reading, MA: Addison-Wesley.
Aho, A. V., J. E. Hopcroft, and J. D. Ullman. 1983. Data structures and algorithms. Reading, MA:
Addison-Wesley.
Aho, A. V., R. Sethi, and J. D. Ullman. 1986. Compilers: Principles, techniques, and tools.
Reading, MA: Addison-Wesley.
Allen, F. E., J. Cocke, and K. Kennedy. 1981. Reduction of operator strength. In Program flow
analysis: Theory and application, edited by S. Muchnick and N. D. Jones. New York: Prentice-Hall.
Allen, R., and K. Kennedy. Advanced compilation for vector and parallel computers. San Mateo,
CA: Morgan Kaufmann.
Alpern, B., M. N. Wegman, and F. K. Zadeck. 1988. Detecting equality of variables in programs.
Proceedings of the Conference on Principles of Programming Languages, POPL88, San Diego,
CA. 1-11.
Auslander, M., and M. Hopkins. 1982. An overview of the PL.8 compiler. Proceedings of the ACM
SIGPLAN 82 Conference on Programming Language Design and Implementation, Boston, MA.
Backus, J. W., et al. 1957. The Fortran automatic coding system. Proceedings of AFIPS 1957
Western Joint Computing Conference (WJCC), 188-198.
Bagwell, J. T. Jr. 1970. Local Optimization, SIGPLAN Notices, Association for Computing
Machinery 5(7): 52-66.
Bala, V., and N. Rubin. 1996. Efficient instruction scheduling using finite state automata.
Unpublished memo, available from authors. (Rubin is with Digital Equipment Corp.)
Ball, T., and J. R. Larus. 1992. Optimally profiling and tracing programs. Proceedings of the
Nineteenth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming
Languages, POPL92, Albuquerque, NM. 59-70.
Ball, T., and J. R. Larus. 1993. Branch prediction for free. Proceedings of the SIGPLAN 93
Symposium on Programming Language Design and Implementation, PLDI93, Albuquerque, NM.
Published as SIGPLAN Notices 28(7): 300-313.
Barrett, W. A., et al. 1986. Compiler construction: Theory and practice. Science Research
Associates, Inc.
Bauer, F. L., et al. 1974. Compiler construction: An advanced course. In Lecture notes in computer
science, vol. 21. Berlin, Germany: Springer-Verlag.
Beaty, S. J. 1991. Instruction scheduling using genetic algorithms. Ph.D. diss., Colorado State
University.
Briggs, P., and L. Torczon. 1993. An efficient representation for sparse sets. ACM Letters on
Programming Languages and Systems 2(1-4): 59-69.
Briggs, P., K. D. Cooper, and L. Torczon. 1992. Coloring register pairs. ACM Letters on
Programming Languages and Systems 1(1): 3-13.
Briggs, P., K. D. Cooper, and L. Torczon. 1992. Rematerialization. Proceedings of the Fifth ACM
SIGPLAN Conference on Programming Language Design and Implementation, San Francisco,
CA. Published as SIGPLAN Notices 27(7): 311-321.
Briggs, P., K. D. Cooper, and L. Torczon. 1994. Improvements to graph coloring register
allocation. ACM Transactions on Programming Languages and Systems 16(3): 428-455.
Callahan, D., K. D. Cooper, K. Kennedy, and L. Torczon. 1986. Interprocedural constant
propagation. Proceedings of the SIGPLAN Symposium on Compiler Construction, Palo Alto, CA.
Published as SIGPLAN Notices 21(7): 152-161.
Chaitin, G. J. 1982. Register allocation and spilling via graph coloring. Proceedings of the
SIGPLAN 82 Symposium on Compiler Construction, Boston, MA. Published as SIGPLAN Notices
17(6): 98-105.
Chaitin, G. J., et al. 1981. Register allocation via coloring. Computer Languages 6(1): 47-57.
Chase, D. R., M. Wegman, and F. K. Zadeck. 1990. Analysis of pointers and structures.
Proceedings of the Conference on Programming Language Design and Implementation, PLDI90,
White Plains, NY. 296-310.
Chow, F. 1983. A portable machine independent optimizer: Design and measurements. Ph.D.
diss., Stanford University.
Cooper, K. D., M. W. Hall, and L. Torczon. 1992. Unexpected side effects of inline substitution: A
case study. ACM Letters on Programming Languages and Systems 1(1): 22-32.
Cooper, K., and K. Kennedy. 1988. Interprocedural side-effect analysis in linear time. Proceedings
of the SIGPLAN 88 Symposium on Programming Language Design and Implementation, Atlanta,
GA. Published as SIGPLAN Notices 23(7).
Cooper, K., and K. Kennedy. 1989. Fast interprocedural alias analysis. Conference Record of the
Sixteenth Annual Symposium on Principles of Programming Languages, Austin, TX.
Cormen, T. H., C. E. Leiserson, and R. L. Rivest. 1990. Introduction to Algorithms. New York:
McGraw-Hill.
Coutant, D. S. 1986. Retargetable high-level alias analysis. Conference Record of the 13th
SIGACT/SIGPLAN Symposium on Principles of Programming Languages, St. Petersburg Beach,
FL.
Cytron, R., and J. Ferrante. 1987. An improved control dependence algorithm. (Technical Report
RC 13291.) White Plains, NY: International Business Machines, Thomas J. Watson Research
Center.
Cytron, R., et al. 1989. An efficient method of computing static single assignment form.
Conference Record of the 16th ACM SIGACT/SIGPLAN Symposium on Principles of Programming
Languages, Austin, TX. 25-35.
Cytron, R., J. Ferrante, and V. Sarkar. 1990. Compact representations for control dependence.
Proceedings of the SIGPLAN 90 Symposium on Programming Language Design and
Implementation, White Plains, NY. 241-255. Published as SIGPLAN Notices 25(6).
Cytron, R., J. Ferrante, B. Rosen, M. Wegman, and F. Zadeck. 1991. Efficiently computing static
single assignment form and the control dependence graph. ACM Transactions on Programming
Languages and Systems 13(4): 451-490.
Dhamdhere, D. M., B. Rosen, and F. K. Zadeck. 1992. How to analyze large programs efficiently
and informatively. Proceedings of the SIGPLAN 92 Symposium on Programming Language
Design and Implementation, San Francisco, CA. Published as SIGPLAN Notices 27(7): 212-223.
Drechsler, K.-H., and M. P. Stadel. 1988. A solution to a problem with Morel and Renvoise's
"Global optimization by suppression of partial redundancies." ACM Transactions on Programming
Languages and Systems 10(4): 635-640.
Drechsler, K.-H., and M. P. Stadel. 1993. A variation of Knoop, Ruthing, and Steffen's "Lazy code
motion." ACM SIGPLAN Notices 28(5): 29-38.
Ferrante, J., K. J. Ottenstein, and J. D. Warren. 1987. The program dependence graph and its use
in optimization. ACM Transactions on Programming Languages and Systems 9(3): 319-349.
Fischer, C. N., and R. J. LeBlanc, Jr. 1988. Crafting a compiler. Redwood City, CA:
Benjamin/Cummings.
Fischer, C. N., and R. J. LeBlanc, Jr. 1991. Crafting a compiler with C. Redwood City, CA:
Benjamin/Cummings.
Fisher, J. A. 1981. Trace scheduling: A technique for global microcode compaction. IEEE
Transactions on Computers C-30(7): 478-490.
Frailey, D. J. 1970. Expression Optimization Using Unary Complement Operators, SIGPLAN
Notices, Association for Computing Machinery 5(7): 67-85.
Fraser, C. W., and D. R. Hanson. 1995. A retargetable C compiler: Design and implementation.
Redwood City, CA: Benjamin/Cummings.
Freudenberger, S. M., T. R. Gross, and P. G. Lowney. 1994. Avoidance and suppression of
compensation code in a trace scheduling compiler. ACM Transactions on Programming
Languages and Systems 16(4): 1156-1214.
Golumbic, M. C., and V. Rainish. 1990. Instruction scheduling beyond basic blocks. IBM Journal of
Research and Development 34(1).
Gross, T. 1983. Code optimization of pipeline constraints. (Stanford Technical Report CS 83-255.)
Stanford University.
Hall, M. W. 1991. Managing interprocedural optimization. Ph.D. Thesis, Computer Science
Department, Rice University.
Hall, M. W., and K. Kennedy. 1992. Efficient call graph analysis. ACM Letters on Programming
Languages and Systems 1(3): 227-242.
Hall, M. W., K. Kennedy, and K. S. McKinley. 1991. Interprocedural transformations for parallel
code generation. Proceedings of the 1991 Conference on Supercomputing, 424-434.
Hendron, L. J., G. R. Gao, E. Altman, and C. Mukerji. 1993. A register allocation framework based
on hierarchical cyclic interval graphs. (Technical report.) McGill University.
Hendron, L. J., G. R. Gao, E. Altman, and C. Mukerji. 1993. Register allocation using cyclic
interval graphs: A new approach to an old problem. (Technical report.) McGill University.
Howland, M. A., R. A. Mueller, and P. H. Sweany. 1987. Trace scheduling optimization in a
retargetable microcode compiler. Proceedings of the Twentieth Annual Workshop on
Microprogramming (MICRO-20), 106-114.
Huber, B. L. 1995. Path-selection heuristics for dominator-path scheduling. Master of Science
thesis, Michigan Technological University.
Joshi, S. M., and D. M. Dhamdhere. 1982. A composite hoisting-strength reduction transformation
for global program optimization, parts I and II. International Journal of Computer Mathematics 11:
21-41, 111-126.
Karr, M. 1975. P-graphs. (Report CA-7501-1511.) Wakefield, MA: Massachusetts Computer
Associates.
Kerns, D. R., and S. J. Eggers. 1993. Balanced scheduling: Instruction scheduling when memory
latency is uncertain. Proceedings of Conference on Programming Language Design and
Implementation (PLDI93), Albuquerque, NM. 278-298.
Kildall, G. A. 1973. A unified approach to global program optimization. Conference Proceedings of
Principles of Programming Languages I, 194-206.
Knoop, J., O. Ruthing, and B. Steffen. 1992. Lazy code motion. Proceedings of the ACM SIGPLAN
Conference on Programming Language Design and Implementation PLDI92, 224-234.
Knoop, J., O. Ruthing, and B. Steffen. 1993. Lazy strength reduction. Journal of Programming
Languages 1(1): 71-91.
Knoop, J., O. Ruthing, and B. Steffen. 1994. Optimal code motion: Theory and practice. ACM
Transactions on Programming Languages and Systems 16(4): 1117-1155.
Lam, M. S. 1990. Instruction scheduling for superscalar architectures. Annual Review of Computer
Sciences.
Lengauer, T., and R. E. Tarjan. 1979. A fast algorithm for finding dominators in a flow graph. ACM
Transactions on Programming Languages and Systems 1(1): 121-141.
Leverett, B. W., et al. 1979. An overview of the Production-Quality Compiler-Compiler project.
(Technical Report CMU-CS-79-105.) Pittsburgh, PA: Carnegie Mellon University.
Lewis II, P. M., D. J. Rosenkrantz, and R. E. Stearns. 1978. Compiler design theory. Reading, MA:
Addison-Wesley.
Lorho, B. 1984. Methods and tools for compiler construction: An advanced course. Cambridge
University Press.
Markstein, P. Forthcoming. Strength reduction. In unpublished book on optimization, edited by M.
N. Wegman et al. Association for Computing Machinery.
Markstein, P., V. Markstein, and F. K. Zadeck. Forthcoming. In unpublished book on optimization,
edited by M. N. Wegman et al. Association for Computing Machinery.
McKeeman, W. M. 1974. Symbol table access. In Compiler construction: An advanced course,
edited by F. L. Bauer et al. Berlin, Germany: Springer-Verlag.
Morel, E., and C. Renvoise. 1979. Global optimization by suppression of partial redundancies.
Communications of the ACM 22(2): 96-103.
New York University Computer Science Department. 1970-1976. SETL Newsletters.
O'Brien, et al. 1995. XIL and YIL: The intermediate language of TOBEY. ACM SIGPLAN
Workshop on Intermediate Representations, San Francisco, CA. Published as SIGPLAN Notices
30(3): 71-82.
Pittman, T., and J. Peters. 1992. The art of compiler design: Theory and practice. New York:
Prentice-Hall.
Pugh, W. 1992. The omega test: A fast and practical integer programming algorithm for
dependence analysis. Communications of the ACM 35(8): 102-114.
Purdom, P. W., and E. F. Moore. 1972. Immediate predominators in a directed graph.
Communications of the ACM 15(8): 777-778.
Reif, J. H., and H. R. Lewis. 1978. Symbolic program analysis in almost linear time. Conference
Proceedings of Principles of Programming Languages V, Association for Computing Machinery.
Rosen, B. K., M. N. Wegman, and F. K. Zadeck. 1988. Global value numbers and redundant
computations. Conference Record of the 15th ACM SIGACT/SIGPLAN Symposium on Principles
of Programming Languages, POPL88, Austin, TX, pp. 12-27.
Ryder, B. G. 1979. Constructing the call graph of a program. IEEE Transactions on Software
Engineering SE-5(3).
Scarborough, R. G., and H. G. Kolsky. 1980. Improved optimization of Fortran programs. IBM
Journal of Research and Development 24: 660-676.
Sheridan, P. B. 1959. The arithmetic translation compiler of the IBM Fortran automatic coding
system. Communications of the ACM 2(3): 9-21.
Simpson, L. T. 1996. Value-driven redundancy elimination. Ph.D. thesis, Computer Science
Department, Rice University.
Sites, R. 1978. Instruction ordering for the CRAY-1 computer. (Technical Report 78-CS-023.)
University of California at San Diego.
Steensgaard, B. 1996. Points-to analysis by type inference of programs with structures and
unions. In International Conference on Compiler Construction, number 1060. In Lecture Notes in
Computer Science, 136-150.
Sweany, P. H., and S. Beaty. 1992. Dominator-path scheduling: A global scheduling method.
Proceedings of the 25th International Symposium on Microarchitecture (MICRO-25), 260-263.
Tarjan, R. E. 1972. Depth-first search and linear graph algorithms. SIAM Journal on Computing
1(2): 146-160.
Tarjan, R. E. 1975. Efficiency of a good but not linear set union algorithm. Journal of the ACM 22(2):
215-225.
Torczon, L. 1985. Compilation dependencies in an ambitious optimizing compiler. Ph.D. thesis,
Computer Science Department, Rice University, Houston, TX.
Waite, W. M., and G. Goos. 1984. Compiler construction. Berlin, Germany: Springer-Verlag.
Warren, H. S. 1990. Instruction scheduling for the IBM RISC System/6000 processor. IBM Journal
of Research and Development 34(1).
Wegman, M. N., and F. K. Zadeck. 1985. Constant propagation with conditional branches.
Conference Proceedings of Principles of Programming Languages XII, 291-299.
Wilhelm, R., and D. Maurer. 1995. Compiler Design. Reading, MA: Addison-Wesley.
Wilson, R. P., and M. Lam. 1995. Efficient context-sensitive pointer analysis for C programs.
Proceedings of the SIGPLAN 95 Conference on Programming Language Design and
Implementation, LaJolla, CA. Published as SIGPLAN Notices 30(6):1-12.
Wolfe, M. 1996. High performance compilers for parallel computing. Reading, MA: Addison-Wesley.
Wulf, W., et al. 1975. The design of an optimizing compiler. New York: American Elsevier.
Zadeck, F. K. 1984. Incremental data flow analysis in a structured program editor. Proceedings of
the ACM SIGPLAN 84 Symposium on Compiler Construction, Montreal, Quebec. Published as
SIGPLAN Notices 19(6).