Compiler Design Notes
G22.2130: Compiler Construction
2006-07 Fall
Allan Gottlieb
Tuesday 7-8:50pm Rm 101 CIWW
Chapter 0: Administrivia
I start at Chapter 0 so that when we get to chapter 1, the numbering will agree with the
text.
You can also find these lecture notes on the course home page. Please let me
know if you can't find it.
The notes are updated as bugs are found or improvements made.
The notes are being produced this semester, so they will be considerably enlarged by
the end of the course.
I will also produce a separate page for each lecture after the lecture is given.
These individual pages might not get updated as quickly as the large page.
0.3: Textbook
The course text is Aho, Sethi, and Ullman: Compilers: Principles, Techniques, and
Tools.
The first edition, which we shall use, is very well known.
Available in bookstore.
We will cover most of the first 9 chapters (plus some asides).
The second edition is (just now / very nearly) available. Should it be available
soon, I will supplement the first edition with new material from the second.
The first edition is a descendant of the classic Principles of Compiler Design.
Independent of the titles, each of the books is called “The Dragon Book”, due
to the cover picture.
0.5: Grades
Your grade will be a function of your final exam and laboratory assignments (see
below). I am not yet sure of the exact weightings for each lab and the final, but will let
you know soon.
I try very hard to remember to write all announcements on the upper left board and I
am normally successful. If, during class, you see that I have forgotten to record
something, please let me know. HOWEVER, if I forget and no one reminds me, the
assignment has still been given.
Labs are
Required.
Due several lectures later (date given on assignment).
Graded and form part of your final grade.
Penalized for lateness.
Computer programs you must write.
Homeworks are
Optional.
Due at the beginning of the next lecture.
Not accepted late.
Mostly from the book.
Collected and returned.
Able to help, but not hurt, your grade.
Homeworks are numbered by the class in which they are assigned. So any homework
given today is homework #1. Even if I do not give homework today, the homework
assigned next class will be homework #2. Unless I explicitly state otherwise, all
homework assignments can be found in the class notes. So the homework present in
the notes for lecture #n is homework #n (even if I inadvertently forgot to write it on
the upper left board).
You may solve lab assignments on any system you wish, but ...
You are responsible for any non-NYU machine. I extend deadlines if the NYU
machines are down, not if yours are.
Be sure to upload your assignments to the NYU systems.
In an ideal world, a program written in a high level language like Java, C, or C++
that works on your system would also work on the NYU system used by the grader.
Sadly this ideal is not always achieved, despite marketing claims to the contrary. So,
although you may develop your lab on any system, you must ensure that it runs on
the NYU system assigned to the course.
If somehow your assignment is misplaced by me and/or a grader, we need to have a
copy ON AN NYU SYSTEM that can be used to verify the date the lab was completed.
When you complete a lab (and have it on an NYU system), do not edit those files.
Indeed, put the lab in a separate directory and keep out of that directory. You do not
want to alter the dates.
You may write your lab in Java, C, or C++. Other languages may be possible, but
please ask in advance. I need to ensure that the TA is comfortable with the language.
If you have already had a compiler class, this course is probably not appropriate. For
example, if you can explain the following concepts/terms, the course is probably too
elementary for you.
Parsing
Lexical Analysis
Syntax analysis
Register allocation
LALR Grammar
I also assume that you have at least a passing familiarity with assembler language. In
particular, your compiler will produce assembler language. We will not, however,
write significant assembly-language programs.
I tend to spend too much time on introductory chapters, but will try not to.
1.1: Compilers
A Compiler is a translator from one language, the input or source language, to another
language, the output or target language.
Often, but not always, the target language is an assembler language or the machine
language for a computer processor.
Modern compilers contain two (large) parts, each of which is often subdivided. These
two parts are the front end and the back end.
The front end analyzes the source program, determines its constituent parts, and
constructs an intermediate representation of the program. Typically the front end is
independent of the target language.
The back end synthesizes the target program from the intermediate representation
produced by the front end. Typically the back end is independent of the source
language.
This front/back division very much reduces the work for a compiling system that can
handle several (N) source languages and several (M) target languages. Instead of NM
compilers, we need N front ends and M back ends. For gcc (originally standing for
Gnu C Compiler, but now standing for Gnu Compiler Collection), N=7 and M~30 so
the savings is considerable.
Syntax Trees
Other “compiler like” applications also use analysis and synthesis. Some examples
include
1. Pretty printer. Can be considered a real compiler with the target language a
formatted version of the source.
2. Interpreter. The synthesis traverses the tree and executes the operation at each
node (rather than generating code to do such).
We will be primarily focused on the second element of the chain, the compiler. Our
target language will be assembly language.
Preprocessors
Preprocessors are normally fairly simple as in the C language, providing primarily the
ability to include files and expand macros. There are exceptions, however. IBM's
PL/I, another Algol-like language, had quite an extensive preprocessor, which made
much of the PL/I language itself (e.g., loops and, I believe, procedure calls) available
at preprocessor time.
Assemblers
Assembly code is a mnemonic version of machine code in which names, rather than
binary values, are used for machine instructions and memory addresses.
Some processors have fairly regular operations and as a result assembly code for them
can be fairly natural and not too hard to understand. Other processors, in particular
Intel's x86 line, have, let us charitably say, more “interesting” instructions, with certain
registers used for certain things.
My laptop has one of these latter processors (a Pentium 4), so my gcc compiler produces
code that, from a pedagogical viewpoint, is less than ideal. If you have a Mac with a
PPC processor (the newest Macs are x86), your assembly language is cleaner. NYU's ACF
features Sun computers with SPARC processors, which also have regular instruction sets.
Two pass assembly
No matter what the assembly language is, an assembler needs to assign memory
locations to symbols (called identifiers) and use the numeric location address in the
target machine language produced. Of course the same address must be used for all
occurrences of a given identifier and two different identifiers must (normally) be
assigned two different locations.
The conceptually simplest way to accomplish this is to make two passes over the input
(read it once, then read it again from the beginning). During the first pass, each time a
new identifier is encountered, an address is assigned and the pair (identifier, address)
is stored in a symbol table. During the second pass, whenever an identifier is
encountered, its address is looked up in the symbol table and this value is used in the
generated machine instruction.
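As a concrete illustration, here is a minimal C sketch of the symbol-table half of such a
two-pass assembler. The names (struct Symbol, addSymbol, lookupAddress) and the fixed
table size are assumptions made for this sketch, not part of any particular assembler.

#include <string.h>

#define MAX_SYMS 1000

/* One (identifier, address) pair recorded during pass 1. */
struct Symbol { char name[32]; int address; };

static struct Symbol table[MAX_SYMS];
static int nsyms = 0;

/* Pass 1: called each time a new identifier is encountered; records the
   pair (identifier, address) in the symbol table. */
void addSymbol(const char *name, int address) {
    strncpy(table[nsyms].name, name, sizeof table[nsyms].name - 1);
    table[nsyms].address = address;
    nsyms++;
}

/* Pass 2: called for every occurrence of an identifier; returns the address
   assigned during pass 1 (or -1 if the identifier was never defined). */
int lookupAddress(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].address;
    return -1;
}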
A Trivial Assembler Program
Consider the following trivial C program that computes and returns the xor of the
characters in a string.
int xor (char s[])    // native C speakers say char *s
{
    int ans = 0;
    int i = 0;
    while (s[i] != 0) {
        ans = ans ^ s[i];
        i = i + 1;
    }
    return ans;
}
You should be able to follow everything from xor: to ret. Indeed most of the rest can
be omitted (the .globl xor line is needed). That is, the following assembly program gives
the same results.
        .globl  xor
xor:
        subl    $8, %esp
        movl    $0, 4(%esp)
        movl    $0, (%esp)
.L2:
        movl    (%esp), %eax
        addl    12(%esp), %eax
        cmpb    $0, (%eax)
        je      .L3
        movl    (%esp), %eax
        addl    12(%esp), %eax
        movsbl  (%eax), %edx
        leal    4(%esp), %eax
        xorl    %edx, (%eax)
        movl    %esp, %eax
        incl    (%eax)
        jmp     .L2
.L3:
        movl    4(%esp), %eax
        addl    $8, %esp
        ret
What is happening in this program?
Lab assignment 1 is available on the class web site. The programming is trivial; you
are just doing inclusive (i.e., normal) OR rather than the XOR I just did. The point of the
lab is to give you a chance to become familiar with your compiler and assembler.
Linkers
Linkers, a.k.a. linkage editors, combine the output of the assembler for several
different compilations. That is, the horizontal line of the diagram above should really
be a collection of lines converging on the linker. The linker has another input, namely
libraries, but to the linker the libraries look like other programs compiled and
assembled. The two primary tasks of the linker are relocating relative addresses and
resolving external references.
Relocating relative addresses
The assembler processes one file at a time. Thus the symbol table produced while
processing file A is independent of the symbols defined in file B, and conversely.
Thus, it is likely that the same address will be used for different symbols in each
program. The technical term is that the (local) addresses in the symbol table for file A
are relative to file A; they must be relocated by the linker. This is accomplished by
adding the starting address of file A (which in turn is the sum of the lengths of all the
files processed previously in this run) to the relative address.
Resolving external references
Assume a procedure in file A calls a procedure g that is compiled in file B. The solution
is for the compiler to indicate in the output of the file A compilation that the address of
g is needed. This is called a use of g. When processing file B, the compiler outputs the
(relative) address of g. This is called the definition of g. The assembler passes this
information to the linker.
The simplest linker technique is to again make two passes. During the first pass, the
linker records in its “external symbol table” (a table of external symbols, not a symbol
table that is stored externally) all the definitions encountered. During the second pass,
every use can be resolved by access to the table.
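A hedged C sketch of this two-pass idea follows. The names (struct ExternalSym,
recordDefinition, resolveUse) are invented for the illustration; a real linker must also
handle relocation records, sections, and libraries.

#include <string.h>

#define MAX_EXT 1000

/* One entry of the external symbol table built during pass 1. */
struct ExternalSym { char name[32]; int absoluteAddress; };

static struct ExternalSym extTable[MAX_EXT];
static int nExt = 0;

/* Pass 1: record each definition together with its relocated address,
   i.e., the starting address of its file plus its relative address. */
void recordDefinition(const char *name, int fileStart, int relativeAddr) {
    strncpy(extTable[nExt].name, name, sizeof extTable[nExt].name - 1);
    extTable[nExt].absoluteAddress = fileStart + relativeAddr;
    nExt++;
}

/* Pass 2: each use of an external symbol is resolved by a table lookup. */
int resolveUse(const char *name) {
    for (int i = 0; i < nExt; i++)
        if (strcmp(extTable[i].name, name) == 0)
            return extTable[i].absoluteAddress;
    return -1;   /* unresolved external reference */
}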
I will be covering the linker in more detail tomorrow at 5pm in 2250, OS Design
Loaders
After the linker has done its work, the resulting “executable file” can be loaded by the
operating system into central memory. The details are OS dependent. With early
single-user operating systems all programs would be loaded into a fixed address (say
0) and the loader simply copies the file to memory. Today it is much more
complicated since (parts of) many programs reside in memory at the same time.
Hence the compiler/assembler/linker cannot know the real location for an identifier.
Indeed, this real location can change.
More information is given in any OS course (e.g., 2250, given Wednesdays at 5pm).
The character stream input is grouped into tokens. For example, any one of the
following
x3 := y + 3;
x3 := y + 3 ;
x3 :=y+ 3 ;
but not
x 3 := y + 3;
would be grouped into the same sequence of tokens: an identifier x3, the assignment
symbol :=, an identifier y, the plus operator, the number 3, and a semicolon.
Note that non-significant blanks are normally removed during scanning. In C, most
blanks are non-significant. Blanks inside strings are an exception.
Note that identifiers, numbers, and the various symbols and punctuation can be defined
without recursion (compare with parsing below).
Note the recursive definition of expression (expr). Note also the hierarchical
decomposition in the figure on the right.
The division between scanning and parsing is somewhat arbitrary,
but invariably if a recursive definition is involved, it is considered
parsing not scanning.
Often we utilize a simpler tree called the syntax tree with operators
as interior nodes and operands as the children of the operator. The
syntax tree on the right corresponds to the parse tree above it.
Semantic analysis
This illustrates the use of hierarchical grouping for formatting languages (TeX and EQN
are used as examples); for example, it shows how you can get subscripted superscripts (or
superscripted subscripts).
We just examined the first three phases. Modern, high-performance compilers are
dominated by their extensive optimization phases, which occur before, during, and
after code generation. Note that optimization is most assuredly an inaccurate, albeit
standard, terminology, as the resulting code is not optimal.
Symbol-table management
As we have seen when discussing assemblers and linkers, a symbol table is used to
maintain information about symbols. The compiler uses the symbol table to maintain
information across phases as well as within each phase. One key item stored with each
symbol is its type, which is determined during semantic analysis and used
(among other places) during code generation.
As you have doubtless noticed, not all programming efforts produce correct programs.
If the input to the compiler is not a legal source language program, errors must be
detected and reported. It is often much easier to detect that the program is not legal
(e.g., the parser reaches a point where the next token cannot legally occur) than to
deduce what is the actual error (which may have occurred earlier). It is even harder to
reliably deduce what the intended correct program should be.
This is processed by the parser and semantic analyzer to produce the two trees shown
above. On some systems, the tree would not contain the symbols themselves as shown in
the figures. Instead the tree would contain leaves of the form idi, which in turn would
refer to the corresponding entries in the symbol table.
Many compilers first generate code for an “idealized machine”. For example, the
intermediate code generated would assume that the target has an unlimited number of
registers and that any register can be used for any operation. Another common
assumption is that all machine operations take three operands, two source and one
target.
With these assumptions one generates “three-address code” by walking the semantic
tree. Our example C instruction would produce
temp1 := inttoreal(3)
temp2 := id2 + temp1
temp3 := realtoint(temp2)
id1 := temp3
We see that three-address code can include instructions with fewer than 3 operands.
Sometimes three-address code is called quadruples because one can view the previous
code sequence as
inttoreal temp1 3 --
add temp2 id2 temp1
realtoint temp3 temp2 --
assign id1 temp3 --
Each “quad” has the form
operation target source1 source2
Code optimization
This is a very serious subject, one that we will not really do justice to in this
introductory course. Some optimizations are fairly easy to see.
1. Since 3 is a constant, the compiler can perform the int to real conversion and
   replace the first two quads with
   add temp2 id2 3.0
Code generation
Modern processors have only a limited number of registers. Although some processors,
such as the x86, can perform operations directly on memory locations, we will for
now assume only register operations. Some processors (e.g., the MIPS architecture)
use three-address instructions. However, some processors permit only two addresses;
the result overwrites the second source. With these assumptions, code something like
the following would be produced for our example, after first assigning memory
locations to id1 and id2.
MOVE id2, R1
ADD #3.0, R1
RTOI R1, R2
MOVE R2, id1
Aho, Sethi, and Ullman assert that there has been only limited success in producing
several compilers for a single machine using a common back end. That is a rather
pessimistic view, and I wonder if the 2nd edition will change in this area.
Passes
The term pass is used to indicate that the entire input is read during this activity. So
two passes means that the input is read twice. We have discussed two-pass
approaches for both assemblers and linkers. If we implement each phase separately
and use multiple passes for some of them, the compiler will perform a large number
of I/O operations, an expensive undertaking.
As a result techniques have been developed to reduce the number of passes. We will
see in the next chapter how to combine the scanner, parser, and semantic analyzer into
one phase. Consider the parser. When it needs another token, rather than reading the
input file (presumably produced by the scanner), the parser calls the scanner instead.
At selected points during the production of the syntax tree, the parser calls the “code
generator”, which performs semantic analysis as well as generating a portion of the
intermediate code.
One problem with combining phases, or with implementing a single phase in one
pass, is that it appears that an internal form of the entire program will need to be
stored in memory. This problem arises because the downstream phase may need early
in its execution information that the upstream phase produces only late in its
execution. This motivated the use of symbol tables and a two pass approach.
However, a clever one-pass approach is often possible.
Consider the assembler (or linker). The good case is when the definition precedes all
uses so that the symbol table contains the value of the symbol prior to that value being
needed. Now consider the harder case of one or more uses preceding the definition.
When a not yet defined symbol is first used, an entry is placed in the symbol table,
pointing to this use and indicating that the definition has not yet appeared. Further
uses of the same symbol attach their addresses to a linked list of “undefined uses” of
this symbol. When the definition is finally seen, the value is placed in the symbol
table, and the linked list is traversed inserting the value in all previously encountered
uses. Subsequent uses of the symbol will find its definition in the table.
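A minimal C sketch of this one-pass backpatching idea follows. The data structures
(struct Sym with a linked list of pending uses, and a flat memory array standing in for
the object code) are assumptions made purely for the illustration.

#include <stdlib.h>

static int memory[10000];                         /* the object code being produced */

struct Use { int location; struct Use *next; };   /* one not-yet-resolved use */
struct Sym { int defined; int value; struct Use *pending; };

/* A use of symbol s at the given location in the object code. */
void use(struct Sym *s, int location) {
    if (s->defined) {
        memory[location] = s->value;              /* good case: definition already seen */
    } else {                                      /* add to the list of undefined uses */
        struct Use *u = malloc(sizeof *u);
        u->location = location;
        u->next = s->pending;
        s->pending = u;
    }
}

/* The definition of s is finally seen: fill in every previously encountered use. */
void define(struct Sym *s, int value) {
    s->defined = 1;
    s->value = value;
    for (struct Use *u = s->pending; u != NULL; u = u->next)
        memory[u->location] = value;
    s->pending = NULL;                            /* subsequent uses find the value directly */
}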
We will study tools that generate scanners and parsers. This will involve us in some
theory, regular expressions for scanners and various grammars for parsers. These
techniques are fairly successful. One drawback can be that they do not execute as fast
as “hand-crafted” scanners and parsers.
We will also see tools for syntax-directed translation and automatic code generation.
The automation in these cases is not as complete.
Finally, there is the large area of optimization. This is not automated; however, a basic
component of optimization is “data-flow analysis” (how are values transmitted
between parts of a program) and there are tools to help with this task.
2.1: Overview
The source language is infix expressions consisting of digits, +, and -; the target is
postfix expressions with the same components. The compiler will convert
7+4-5 to 74+5-.
Actually, our simple compiler will handle a few other operators as well.
We will tokenize the input (i.e., write a scanner), model the syntax of the source, and
let this syntax direct the translation.
Example:
Terminals: 0 1 2 3 4 5 6 7 8 9 + -
Nonterminals: list digit
Productions: list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Start symbol: list
Watch how we can generate the input 7+4-5 starting with the start symbol, applying
productions, and stopping when no productions are possible (we have only terminals).
list → list - digit
→ list - 5
→ list + digit - 5
→ list + 4 - 5
→ digit + 4 - 5
→ 7 + 4 - 5
It is important that you see that this context-free grammar generates precisely
the set of infix expressions with digits (so 25 is not allowed) as operands and +
and - as operators.
The way you get different final expressions is that you make different choices
of which production to apply. There are 3 productions you can apply to list and
10 you can apply to digit.
The input cannot have blanks since blank is not a terminal.
The empty string is not a legal input since, starting from list, we cannot get to
the empty string. If we wanted to include the empty string, we would add the
production
list → ε
Homework: 2.1a, 2.1c, 2.2a-c (don't worry about “justifying” your answers).
Parse trees
You can read off the productions from the tree. For any internal (i.e., non-leaf) tree
node, its children give the right hand side (RHS) of a production having the node itself
as the LHS.
The leaves of the tree, read from left to right, form the yield of the tree. We call the
tree a derivation of its yield from its root. The tree on the right is a derivation of 7+4-5
from list.
Homework: 2.1b
Ambiguity
An ambiguous grammar is one in which there are two or more parse trees yielding the
same final string. We wish to avoid such grammars.
The grammar above is not ambiguous. For example 1+2+3 can be parsed only one
way; the arithmetic must be done left to right. Note that I am not giving a rule of
arithmetic, just of this grammar. If you reduced 2+3 to list you would be stuck since it
is impossible to generate 1+list.
Associativity of operators
Our grammar gives left associativity. That is, if you traverse the tree in postorder and
perform the indicated arithmetic you will evaluate the string left to right. Thus 8-8-8
would evaluate to -8. If you wished to generate right associativity (normally
exponentiation is right associative, so 2**3**2 gives 512 not 64), you would change
the first two productions to
list → digit + list and list → digit - list
Precedence of operators
We use | to indicate that a nonterminal has multiple possible right hand sides. So
A → B | C
is simply shorthand for
A → B
A → C
Statements
Keywords are very helpful for distinguishing statements from one another.
stmt → id := expr
| if expr then stmt
| if expr then stmt else stmt
| while expr do stmt
| begin opt-stmts end
opt-stmts → stmt-list | ε
stmt-list → stmt-list ; stmt | stmt
Remark:
opt-stmts stands for “optional statements”. The begin-end block can be empty in
some languages.
The epsilon stands for the empty string.
The use of “epsilon productions” will add complications.
Some languages do not permit empty blocks; e.g., Ada has a “null” statement, which
does nothing when executed, for this purpose.
The above grammar is ambiguous! The notorious “dangling else” problem. How do
you parse “if x then if y then z=1 else z=2”?
2.3: Syntax-Directed Translation
Specifying the translation of a source language construct in terms of attributes of its
syntactic components.
Postfix notation
In postfix notation the operator follows its operands: if E and F are expressions and op
is an operator, the postfix form of E op F is the postfix form of E, followed by the
postfix form of F, followed by op.
One question is, given say 1+2-3, what is E, F and op? Does E=1+2, F=3, and op=-?
Or does E=1, F=2-3 and op=+? This is the issue of precedence mentioned above. To
simplify the present discussion we will start with fully parenthesized infix
expressions.
Example: 1+2/3-4*5
Syntax-directed definitions
We want to “decorate” the parse trees we construct with “annotations” that give the
value of certain attributes of the corresponding node of the tree. We will do the
example of translating infix to postfix with 1+2/3-4*5. We use the following
grammar, which follows the normal arithmetic terminology where one multiplies and
divides factors to obtain terms, which in turn are added and subtracted to form
expressions.
expr → expr + term | expr - term | term
term → term * factor | term / factor | factor
factor → digit | ( expr )
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
This grammar supports parentheses, although our example does not use them. On the
right is a “movie” in which the parse tree is built from this example.
The attribute we will associate with the nodes is the text to be used to print the postfix
form of the string in the leaves below the node. In particular the value of this attribute
at the root is the postfix form of the entire source.
The book does a simpler grammar (no *, /, or parentheses) for a simpler example. You
might find that one easier. The book also does another grammar describing commands
to give a robot to move north, east, south, or west by one unit at a time. The attributes
associated with the nodes are the current position (for some nodes, including the root)
and the change in position caused by the current command (for other nodes).
Synthesized Attributes
For the bottom-up approach I will illustrate now, we annotate a node after having
annotated its children. Thus the attribute values at a node can depend on the children
of the node but not the parent of the node. We call these synthesized attributes, since
they are formed by synthesizing the attributes of the children.
We specify how to synthesize attributes by giving the semantic rules together with the
grammar. That is, we give the syntax-directed definition.
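Since the annotated figure is not reproduced in these notes, here is a sketch of the
semantic rules for the grammar above, written in the attribute notation used later in
this chapter (t is the translation attribute; subscripts distinguish multiple occurrences
of a nonterminal in one production).
Production                  Semantic rule
expr → expr1 + term         expr.t := expr1.t || term.t || '+'
expr → expr1 - term         expr.t := expr1.t || term.t || '-'
expr → term                 expr.t := term.t
term → term1 * factor       term.t := term1.t || factor.t || '*'
term → term1 / factor       term.t := term1.t || factor.t || '/'
term → factor               term.t := factor.t
factor → ( expr )           factor.t := expr.t
factor → digit              factor.t := digit.t
digit → 0                   digit.t := '0'
...
digit → 9                   digit.t := '9'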
We apply these rules bottom-up (starting with the geographically lowest productions,
i.e., the lowest lines on the page) and get the annotated graph shown on the right. The
annotations are drawn in green.
Depth-first traversals
Translation schemes
The bottom-up annotation scheme generates the final result as the annotation of the
root. In our infix → postfix example we get the result desired by printing the root
annotation. Now we consider another technique that produces its results
incrementally.
Instead of giving semantic rules for each production (and thereby generating
annotations) we can embed program fragments called semantic actions within the
productions themselves.
In diagrams the semantic action is connected to the node with a distinctive, often
dotted, line. The placement of the actions determine the order they are performed.
Specifically, one executes the actions in the order they are encountered in a postorder
traversal of the tree.
For our infix → postfix translator, the parent either just passes on the attribute of its
(only) child or concatenates them left to right and adds something at the end. The
equivalent semantic actions would be either to print the new item or print nothing.
Emitting a translation
Here are the semantic actions corresponding to a few of the rows of the table above.
Note that the actions are enclosed in {}.
expr → expr + term { print('+') }
expr → expr - term { print('-') }
term → term / factor { print('/') }
term → factor { null }
digit → 3 { print('3') }
The diagram for 1+2/3-4*5 with attached semantic actions is shown on the right.
Given an input, e.g. our favorite 1+2/3-4*5, we just do a depth first (postorder)
traversal of the corresponding diagram and perform the semantic actions as they
occur. When these actions are print statements as above, we can be said to be emitting
the translation.
Do a depth first traversal of the diagram on the board, performing the semantic actions
as they occur, and confirm that the translation emitted is in fact 123/+45*-, the postfix
version of 1+2/3-4*5
The table below shows the semantic actions or rules needed for our translator.
Homework: 2.8.
If the semantic rules of a syntax-directed definition all have the property that the new
annotation for the left hand side (LHS) of the production is just the concatenation of
the annotations for the nonterminals on the RHS in the same order as the
nonterminals appear in the production, we call the syntax-directed
definition simple. It is still called simple if new strings are interleaved with the
original annotations. So the example just done is a simple syntax-directed definition.
Remark: We shall see later that, in many cases a simple syntax-directed definition
permits one to execute the semantic actions while parsing and not construct the parse
tree at all.
2.4: Parsing
Objective: Given a string of tokens and a grammar, produce a parse tree yielding that
string (or at least determine if such a tree exists).
We will learn both top-down (begin with the start symbol, i.e. the root of the tree) and
bottom up (begin with the leaves) techniques.
In the remainder of this chapter we just do top down, which is easier to implement by
hand, but is less general. Chapter 4 covers both approaches.
Tools (so called “parser generators”) often use bottom-up techniques.
Top-down parsing
Parsing is easy in principle and for certain grammars (e.g., the two above) it
actually is easy. The two fundamental steps (we start at the root since this is top-down
parsing) are
1. Choose a production whose LHS is the nonterminal at the current node, and give
the node children corresponding to the symbols on the RHS.
2. Move to the next node needing a subtree and repeat.
When programmed this becomes a procedure for each nonterminal that chooses a
production for the node and calls procedures for each nonterminal in the RHS. Thus it
is recursive in nature and descends the parse tree. We call these parsers “recursive
descent”.
The big problem is what to do if the current node is the LHS of more than one
production. The small problem is what do we mean by the “next” node needing a
subtree.
The easiest solution to the big problem would be to assume that there is only one
production having a given nonterminal as LHS. There are two possibilities
Circularity:
expr → term + term
term → factor / factor
factor → ( expr )
This is even worse; there are no (finite) sentences, only an infinite sentence
beginning (((((((((.
So this won't work. We need to have multiple productions with the same LHS.
How about trying them all? We could do this! If we get stuck where the current tree
cannot match the input we are trying to parse, we would backtrack.
Instead, we will look ahead one token in the input and only choose productions that
can yield a result starting with this token. Furthermore, we will (in this section)
restrict ourselves to predictive parsing, in which there is only one production that can
yield a result starting with a given token. This solution to the big problem also solves
the small problem: since we are trying to match the next token in the input, we must
choose the leftmost (nonterminal) node to give children to.
Predictive parsing
Let's return to the Pascal array type grammar and consider the three productions having
type as LHS. Even when I write the short form
type → simple | ↑ id | array [ simple ] of type
I view it as three productions.
For each production P we wish to consider the set FIRST(P) consisting of those
tokens that can appear as the first symbol of a string derived from the RHS of P. We
actually define FIRST(RHS) rather than FIRST(P), but I often say “first set of the
production” when I should really say “first set of the RHS of the production”.
Definition: Let r be the RHS of a production P. FIRST(r) is the set of tokens that can
appear as the first symbol in a string derived from r.
Assumption: Let P and Q be two productions with the same LHS. Then FIRST(P)
and FIRST(Q) are disjoint. Thus, if we know both the LHS and the token that must be
first, there is (at most) one production we can apply. BINGO!
This table gives the FIRST sets for our pascal array type example.
Production                          FIRST
type → simple                       { integer, char, num }
type → ↑ id                         { ↑ }
type → array [ simple ] of type     { array }
simple → integer                    { integer }
simple → char                       { char }
simple → num dotdot num             { num }
The three productions with type as LHS have disjoint FIRST sets. Similarly the three
productions with simple as LHS have disjoint FIRST sets. Thus predictive parsing can
be used. We process the input left to right and call the current token lookahead since
it is how far we are looking ahead in the input to determine the production to use. The
movie on the right shows the process in action.
Homework:
ε-productions
1. For each nonterminal, write a procedure that chooses the unique(!) production
having lookahead in its FIRST set. Use the ε production if no other production
matches. If no production matches and there is no ε production, the parse fails.
2. These procedures mimic the RHS of the production. They call procedures for
each nonterminal and call match for each terminal. Write a match(terminal)
that advances lookahead to the next input token after confirming that the
previous value of lookahead equals the terminal argument.
3. Write a main program that initializes lookahead to the first input token and
invokes the procedure for the start symbol.
The book has code at this point. We will see code later in this chapter.
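Since that code is not reproduced in these notes, here is a rough C sketch of such a
predictive parser for the type/simple grammar above. The token codes, the global
lookahead, and the match() and error() routines are assumptions made for the sketch.

/* Assumed scanner interface. */
enum { ID = 256, INTEGER, CHAR, NUM, DOTDOT, ARROW, ARRAY, OF };   /* token codes */
extern int lookahead;        /* current token, supplied by the scanner */
void match(int t);           /* advances past terminal t, or reports an error */
void error(void);
void simple(void);

void type(void) {
    if (lookahead == INTEGER || lookahead == CHAR || lookahead == NUM)
        simple();                                  /* type -> simple */
    else if (lookahead == ARROW) {                 /* type -> ^ id */
        match(ARROW); match(ID);
    } else if (lookahead == ARRAY) {               /* type -> array [ simple ] of type */
        match(ARRAY); match('['); simple(); match(']'); match(OF); type();
    } else
        error();
}

void simple(void) {
    if (lookahead == INTEGER)       match(INTEGER);
    else if (lookahead == CHAR)     match(CHAR);
    else if (lookahead == NUM) {                   /* simple -> num dotdot num */
        match(NUM); match(DOTDOT); match(NUM);
    } else
        error();
}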
Left Recursion
Consider the left-recursive grammar
expr → expr + term
expr → term
For the first production the RHS begins with the LHS. This is called left recursion. If
a recursive descent parser were to pick this production, the result would be that the next
node to consider is again expr and the lookahead has not changed. An infinite loop
occurs.
Consider instead
expr → term rest
rest → + term rest
rest → ε
Both pairs of productions generate the same possible token strings, namely
term + term + ... + term
The second pair is called right recursive since the RHS ends with (has on the right) the
LHS. If you draw the parse trees generated, you will see that, for left recursive
productions, the tree grows to the left; whereas, for right recursive, it grows to the
right.
Note also that, according to the trees generated by the first pair, the additions are
performed right to left; whereas, for the second pair, they are performed left to right.
That is, for
term + term + term
the tree from the first pair has the left + at the top (why?); whereas, the tree from the
second pair has the right + at the top.
One problem that we must solve is that this grammar is left recursive.
We prefer not to have superfluous nonterminals as they make the parsing less
efficient. That is why we don't say that a term produces a digit and a digit produces
each of 0,...,9. Ideally the syntax tree would just have the operators + and - and the 10
digits 0,1,...,9. That would be called the abstract syntax tree. A parse tree coming from
a grammar is technically called a concrete syntax tree.
We eliminate the left recursion as we did in 2.4. This time there are two operators +
and - so we replace the triple
A → A α | A β | γ
with the quadruple
A → γ R
R → α R | β R | ε
This time we have actions so, for example
α is + term { print('+') }
However, the formulas still hold and we get
expr → term rest
rest → + term { print('+') } rest
| - term { print('-') } rest
| ε
term → 0 { print('0') }
. . .
| 9 { print('9') }
The C code is in the book. Note the else ; in rest(). This corresponds to the epsilon
production. As mentioned previously, the epsilon production is only used when all
others fail (that is why it is the else arm and not the then or the else if arms).
In the first edition this is about 40 lines of C code, 12 of which are single { or }. The
second edition has equivalent code in java.
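A condensed C sketch of that translator (with error handling and other details
simplified) might look like the following; it is a sketch of the same structure, not a
copy of the book's code.

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

int lookahead;

void error(void)  { printf("syntax error\n"); exit(1); }
void match(int t) { if (lookahead == t) lookahead = getchar(); else error(); }

void term(void) {                   /* term -> 0 { print('0') } | ... | 9 { print('9') } */
    if (isdigit(lookahead)) { putchar(lookahead); match(lookahead); }
    else error();
}

void rest(void) {                   /* rest -> + term {print '+'} rest | - term {print '-'} rest | epsilon */
    if (lookahead == '+')      { match('+'); term(); putchar('+'); rest(); }
    else if (lookahead == '-') { match('-'); term(); putchar('-'); rest(); }
    else ;                          /* the epsilon production */
}

void expr(void) { term(); rest(); } /* expr -> term rest */

int main(void) { lookahead = getchar(); expr(); putchar('\n'); return 0; }

Given the input 9-5+2, this emits 95-2+, the postfix form.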
These do not become tokens so that the parser need not worry about them.
The 2nd edition moves the discussion about x<y versus x<=y
into this new section. I have left it 2 sections ahead to more closely agree with our
(first edition).
2.6.3: Constants
This chapter considers only numerical integer constants. They are computed one digit
at a time by value=10*value+digit. The parser will therefore receive the
token num rather than a sequence of digits. Recall that our previous parsers
considered only one digit numbers.
The value of the constant is stored as the attribute of the token num. Indeed
<token,attribute> pairs are passed from the scanner to the parser.
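A hedged C sketch of this digit-at-a-time computation follows; the token code NUM and
the shared variable tokenval are the conventions used later in these notes, and the use
of getc/ungetc on stdin is an assumption for the sketch.

#include <stdio.h>
#include <ctype.h>

#define NUM 256            /* token code for a numerical constant (assumed value) */

int tokenval;              /* attribute of the token, shared with the parser */

/* If the next input characters form a constant, compute its value one digit
   at a time (value = 10*value + digit) and return the token NUM. */
int scanNumber(void) {
    int c = getc(stdin);
    if (!isdigit(c)) {
        ungetc(c, stdin);  /* not a constant: give the character back */
        return 0;
    }
    tokenval = 0;
    while (isdigit(c)) {
        tokenval = 10 * tokenval + (c - '0');
        c = getc(stdin);
    }
    ungetc(c, stdin);      /* push back the first non-digit */
    return NUM;
}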
The C statement
sum = sum + x;
contains 6 tokens (of 4 different kinds). The scanner will convert the input into
id = id + id ; (id standing for identifier).
Although there are three id tokens, the first and second represent the lexeme sum; the
third represents x. These must be distinguished. Many language keywords, for
example “then”, are syntactically the same as identifiers. These also must be
distinguished. The symbol table will accomplish these tasks.
Care must be taken when one lexeme is a proper subset of another. Consider
x<y versus x<=y
When the < is read, the scanner needs to read another character to see if it is an =. But
if that second character is y, the current token is < and the y must be “pushed back”
onto the input stream so that the configuration is the same after scanning < as it is
after scanning <=.
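A tiny C sketch of this pushback, using the standard getc/ungetc (the token codes LT
and LE are assumptions for the sketch):

#include <stdio.h>

enum { LT = 1, LE };            /* token codes, assumed for the illustration */

/* Called after a '<' has been read: peek at the next character and push it
   back if it does not belong to the current lexeme. */
int afterLess(FILE *in) {
    int c = getc(in);
    if (c == '=')
        return LE;              /* the lexeme is "<=" */
    ungetc(c, in);              /* e.g. c was 'y': push it back for the next token */
    return LT;                  /* the lexeme is just "<" */
}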
Also consider then versus thenewvalue: one is a keyword and the other an id.
Interface
As indicated the scanner reads characters and occasionally pushes one back to the
input stream. The “downstream” interface is to the parser to which <token,attribute>
pairs are passed.
A few comments on the program given in the text. One inelegance is that, in order to
avoid passing a record (struct in C) from the scanner to the parser, the scanner returns
the next token and places its attribute in a global variable.
Since the scanner converts digits into num's we can shorten the grammar. Here is the
shortened version before the elimination of left recursion. Note that the value attribute
of a num is its numerical value.
expr → expr + term { print('+') }
expr → expr - term { print('-') }
expr → term
term → num { print(num,value) }
In anticipation of other operators with higher precedence, we introduce factor and, for
good measure, include parentheses for overriding the precedence. So our grammar
becomes.
expr → expr + term { print('+') }
expr → expr - term { print('-') }
expr → term
term → factor
factor → ( expr ) | num { print(num,value) }
The factor() procedure follows the familiar recursive descent pattern: find a
production with lookahead in FIRST and do what the RHS says.
Interface
insert(s,t) returns the index of a new entry storing the pair (lexeme s, token t).
lookup(s) returns the index of the entry for s, or 0 if s is not there.
Reserved keywords
Simply insert them into the symbol table prior to examining any input. Then they can
be found when used correctly and, since their corresponding token will not be id, any
use of them where an identifier is required can be flagged.
insert("div",div)
Implementation
Arithmetic instructions
Stack manipulation
Translating expressions
To say this more formally we define two attributes. For any nonterminal, the attribute
t gives its translation and for the terminal id, the attribute lexeme gives its string
representation.
Assuming we have already given the semantic rules for expr (i.e., assuming that the
annotation expr.t is known to contain the translation for expr) then the semantic rule
for the assignment statement is
stmt → id := expr
{ stmt.t := 'lvalue' || id.lexeme || expr.t || ':=' }
Control flow
There are several ways of specifying conditional and unconditional jumps. We choose
the following 5 instructions. The simplifying assumption is that the abstract machine
supports “symbolic” labels. The back end of the compiler would have to translate this
into machine instructions for the actual computer, e.g. absolute or relative jumps
(jump 3450 or jump +500).
Fairly simple. Generate a new label using the assumed function newlabel(), which we
sometimes write without the (), and use it. The semantic rule for an if statement is
simply
stmt → if expr then stmt1 { out := newlabel();
                            stmt.t := expr.t || 'gofalse' || out || stmt1.t || 'label' || out }
Emitting a translation
Rewriting the above as a semantic action (rather than a rule) we get the following,
where emit() is a function that prints its arguments in whatever form is required for
the abstract machine (e.g., it deals with line length limits, required whitespace, etc).
stmt → if
expr { out := newlabel; emit('gofalse', out); }
then
stmt1 { emit('label', out) }
Don't forget that expr is itself a nonterminal. So by the time we reach out:=newlabel,
we will have already parsed expr and thus will have done any associated actions, such
as emit()'ing instructions. These instructions will have left a boolean on the tos (top of
stack). It is this boolean that is tested by the emitted gofalse.
More precisely, the action written to the right of expr will be the third child of stmt in
the tree. Since a postorder traversal visits the children in order, the second child
“expr” will have been visited (just) prior to visiting the action.
Look how simple it is! Don't forget that the FIRST sets for the productions having
stmt as LHS are disjoint!
procedure stmt
    integer test, out;
    if lookahead = id then            // first set is {id} for assignment
        emit('lvalue', tokenval);     // pushes lvalue of lhs
        match(id);                    // move past the lhs
        match(':=');                  // move past the :=
        expr;                         // pushes rvalue of rhs on tos
        emit(':=');                   // do the assignment (omitted in book)
    else if lookahead = 'if' then
        match('if');                  // move past the if
        expr;                         // pushes boolean on tos
        out := newlabel();
        emit('gofalse', out);         // out is an integer, emit makes a legal label
        match('then');                // move past the then
        stmt;                         // recursive call
        emit('label', out)            // emit again makes out legal
    else if ...                       // while, repeat/do, etc.
    else error();
end stmt;
Description
The grammar with semantic actions is as follows. All the actions come at the end
since we are generating postfix. This is not always the case.
start → list eof
list → expr ; list
list → ε // would normally use | as below
expr → expr + term { print('+') }
| expr - term { print('-'); }
| term
term → term * factor { print('*') }
| term / factor { print('/') }
| term div factor { print('DIV') }
| term mod factor { print('MOD') }
| factor
factor → ( expr )
| id { print(id.lexeme) }
| num { print(num.value) }
Lexer.c
Contains lexan(), the lexical analyzer, which is called by the parser to obtain the next
token. The attribute value is assigned to tokenval and white space is stripped.
Lexeme                                               Token      Attribute
white space
sequence of digits                                   NUM        numeric value
div                                                  DIV
mod                                                  MOD
other sequence of a letter then letters and digits   ID         index into symbol table
eof character                                        DONE
any other character                                  that char  NONE
Parser.c
Using a recursive descent technique, one writes routines for each nonterminal in the
grammar. In fact the book combines term and morefactors into one routine.
term() {
    int t;
    factor();
    // now we should call morefactors(), but instead code it inline
    while (true)                     // the morefactors nonterminal is right recursive
        switch (lookahead) {         // lookahead set by match()
        case '*': case '/': case DIV: case MOD:   // all treated the same
            t = lookahead;           // needed for emit() below
            match(lookahead);        // skip over the operator
            factor();                // see grammar for morefactors
            emit(t, NONE);
            continue;                // C semantics for case
        default:                     // the epsilon production
            return;
        }
}
The insert(s,t) and lookup(s) routines described previously are in symbol.c. The
routine init() preloads the symbol table with the defined keywords.
Error.c
Does almost nothing. The only help is that the line number, calculated by lexan() is
printed.
Two Questions
1. How come this compiler was so easy?
2. Why isn't the final exam next week?
One reason is that much was deliberately simplified. Specifically note that
No real machine code generated (no back end).
No optimizations (improvement to generated code).
FIRST sets disjoint.
No semantic analysis.
Input language very simple.
Output language very simple and closely related to input.
Also, I presented the material way too fast to expect full understanding.
There are two ways to produce a scanner.
1. By hand, beginning with a diagram of what lexemes look like. Then write code
to follow the diagram and return the corresponding token and possibly other
information.
2. Feed the patterns describing the lexemes to a “lexer-generator”, which then
produces the scanner. The historical lexer-generator is Lex; a more modern one
is flex.
Note that the speed (of the lexer, not of the code generated by the compiler) and error
reporting/correction are typically much better for a handwritten lexer. As a result most
production-level compiler projects write their own lexers.
The lexer also might do some housekeeping such as eliminating whitespace and
comments. Some call these tasks scanning, but others call the entire task scanning.
After the lexer, individual characters are no longer examined by the compiler; instead
tokens (the output of the lexer) are used.
Why separate lexical analysis from parsing? The reasons are basically software
engineering concerns.
1. Simplicity of design. When one detects a well defined subtask (produce the
next token), it is often good to separate out the task (modularity).
2. Efficiency. With the task separated it is easier to apply specialized techniques.
3. Portability. Only the lexer need communicate with the outside.
A token is a <name,attribute> pair. These are what the parser processes. The
attribute might actually be a tuple of several attributes
A pattern describes the character strings for the lexemes of the token. For
example “a letter followed by a (possibly empty) sequence of letters and
digits”.
A lexeme for a token is a sequence of characters that matches the pattern for the
token.
Homework: 3.3.
For tokens corresponding to keywords, attributes are not needed since the name of the
token tells everything. But consider the token corresponding to integer constants. Just
knowing that we have a constant is not enough; subsequent stages of the compiler
need to know the value of the constant. Similarly for the token identifier we need to
distinguish one identifier from another. The normal method is for the attribute to
specify the symbol table entry for this identifier.
3.1.4: Lexical Errors
We saw in this movie an example where parsing got “stuck” because we reduced the
wrong part of the input string. We also learned about FIRST sets that enabled us to
determine which production to apply when we are operating left to right on the input.
For predictive parsers the FIRST sets for a given nonterminal are disjoint and so we
know which production to apply. In general the FIRST sets might not be disjoint so
we have to try all the productions whose FIRST set contains the lookahead symbol.
All the above assumed that the input was error free, i.e. that the source was a sentence
in the language. What should we do when the input is erroneous and we get to a point
where no production can be applied?
The simplest solution is to abort the compilation stating that the program is wrong,
perhaps giving the line number and location where the parser could not proceed.
We would like to do better and at least find other errors. We could perhaps skip input
up to a point where we can begin anew (e.g. after a statement ending semicolon), or
perhaps make a small change to the input around lookahead so that we can proceed.
The book illustrates the standard programming technique of using two (sizable)
buffers to solve this problem.
3.2.2: Sentinels
A useful programming improvement is to combine testing for the end of a buffer with
determining the character read.
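A rough C sketch of the sentinel idea follows (a single buffer is shown for brevity,
whereas the book combines the sentinel with the two-buffer scheme; the names SENTINEL,
fillBuffer, and nextChar are assumptions for the sketch).

#include <stdio.h>

#define BUFSIZE 4096
#define SENTINEL '\0'            /* assumed never to occur in a source program */

static char buf[BUFSIZE + 1];    /* one extra slot holds the sentinel */
static char *forward = buf;
static size_t filled = 0;        /* number of real characters currently in the buffer */

/* Refill the buffer from stdin and plant the sentinel just past the data. */
static void fillBuffer(void) {
    filled = fread(buf, 1, BUFSIZE, stdin);
    buf[filled] = SENTINEL;
    forward = buf;
}

/* The common case is a single test of the character itself; the end-of-buffer
   bookkeeping is done only when the sentinel is actually seen. */
int nextChar(void) {
    int c = (unsigned char) *forward++;
    if (c != SENTINEL)
        return c;                            /* cheap, common case */
    if (forward - 1 == buf + filled) {       /* this sentinel marks the end of the buffer */
        if (feof(stdin))
            return EOF;                      /* no more input at all */
        fillBuffer();
        return nextChar();                   /* retry from the refilled buffer */
    }
    return c;                                /* a genuine NUL in the input (unlikely) */
}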
Example: Strings over {0,1}: ε, 0, 1, 111010. Strings over ascii: ε, sysy, the string
consisting of 3 blanks.
Example: All grammatical English sentences with five, eight, or twelve words is a
language over ascii. It is also a language over unicode.
Definition: The concatenation of strings s and t is the string formed by appending the
string t to s. It is written st.
A prefix of a string is a portion starting from the beginning and a suffix is a portion
ending at the end. More formally, a prefix of s is obtained by deleting zero or more
symbols from the end of s, and a suffix of s is obtained by deleting zero or more
symbols from the beginning of s.
Definition: The union of L1 and L2 is simply the set-theoretic union, i.e., it consists
of all words (strings) in either L1 or L2.
Example: The union of {Grammatical English sentences with one, three, or five
words} with {Grammatical English sentences with two or four words} is
{Grammatical English sentences with five or fewer words}.
Definition: The concatenation of L1 and L2 is the set of all strings st, where s is a
string of L1 and t is a string of L2.
We again view concatenation as a product and write LM for the concatenation of L
and M.
Example: {0,1,2,3,4,5,6,7,8,9}+ gives all unsigned integers, but with some ugly
versions. It has 3, 03, 000003.
{0} ∪ ( {1,2,3,4,5,6,7,8,9} ({0,1,2,3,4,5,6,7,8,9} * ) ) seems better.
In these notes I may write the closure operators * and + on the line rather than as
superscripts, but that is strictly speaking wrong and I will not do it on the board or on
exams or on lab assignments.
The book gives other examples based on L={letters} and D={digits}, which you
should read.
The book's definition includes many () and is more complicated than I think is
necessary. However, it has the crucial advantages of being correct and precise.
The wikipedia entry doesn't seem to be as precise.
I will try a slightly different approach, but note again that there is nothing wrong with
the book's approach (which appears in both first and second editions, essentially
unchanged).
The postfix unary operator * has the highest precedence. The book mentions that it is
left associative. (I don't see how a postfix unary operator can be right associative or
how a prefix unary operator such as unary - could be left associative.)
The book gives various algebraic laws (e.g., associativity) concerning these operators.
The reason we don't include the positive closure is that for any RE
r+ = rr*.
These will look like the productions of a context free grammar we saw previously, but
there are differences. Let Σ be an alphabet, then a regular definition is a sequence of
definitions
d1 → r1
d2 → r2
...
dn → rn
where the d's are unique and not in Σ and each ri is a regular expression over
Σ ∪ {d1,...,di-1}.
There are many extensions of the basic regular expressions given above. The
following three will be frequently used in this course as they are particularly useful for
lexical analyzers, as opposed to text editors or string oriented programming languages,
which have more complicated regular expressions.
All three are simply shorthand. That is, the set of possible languages generated using
the extensions is the same as the set of possible languages generated without using the
extensions.
1. One or more instances. This is the positive closure operator + mentioned above.
2. Zero or one instance. The unary postfix operator ? defined by
r? = r | ε for any RE r.
3. Character classes. If a1, a2, ..., an are symbols in the alphabet, then
[a1a2...an] = a1 | a2 | ... | an. In the special case where all the a's are consecutive,
we can simplify the notation further to just [a1-an].
Examples:
C-language identifiers
letter_ → [A-Za-z_]
digit → [0-9]
CId → letter_ ( letter_ | digit ) *
Recall that the terminals are the tokens, the nonterminals produce terminals.
For the parser all the relational ops are to be treated the same so they are all the same
token, relop. Naturally, other parts of the compiler will need to distinguish between
the various relational ops so that appropriate code is generated. Hence, they have
distinct attribute values.
It is fairly clear how to write code corresponding to this diagram. You look at the first
character, if it is <, you look at the next character. If that character is =, you return
(relop,LE) to the parser. If instead that character is >, you return (relop,NE). If it is
another character, return (relop,LT) and adjust the input buffer so that you will read
this character again since you have used it for the current lexeme. If the first character
was =, you return (relop,EQ).
The transition diagram below corresponds to the regular definition given previously.
We will continue to assume that the keywords are reserved, i.e., may not be used as
identifiers. (What if this is not the case, as in PL/I, which had no reserved words?
Then the lexer does not distinguish between keywords and identifiers and the parser
must.)
We will use the method mentioned last chapter and have the keywords installed into
the symbol table prior to any invocation of the lexer. The symbol table entry will
indicate that the entry is a keyword.
installID() checks if the lexeme is already in the table. If it is not present, the lexeme
is installed as an id token. In either case a pointer to the entry is returned.
gettoken() examines the lexeme and returns the token name, either id or a name
corresponding to a reserved keyword.
Both installID() and gettoken() access the buffer to obtain the lexeme of interest.
The text also gives another method to distinguish between identifiers and keywords.
Recognizing Whitespace
The diagram itself is quite simple reflecting the simplicity of the corresponding
regular expression.
The delim in the diagram represents any of the whitespace characters, say
space, tab, and newline.
The final star is there because we needed to find a non-whitespace character in
order to know when the whitespace ends and this character begins the next
token.
There is no action performed at the accepting state. Indeed the lexer
does not return to the parser, but starts again from its beginning as it still must
find the next token.
Recognizing Numbers
The diagram below is from the second edition. It is essentially a combination of the
three diagrams in the first edition.
This certainly looks formidable, but it is not that bad; it follows from the regular
expression.
In class go over the regular expression and show the corresponding parts in the
diagram.
When an accepting states is reached, action is required but is not shown on the
diagram. Just as identifiers are stored in a symbol table and a pointer is returned, there
is a corresponding number table in which numbers are stored. These numbers are
needed when code is generated. Depending on the source language, we may wish to
indicate in the table whether this is a real or an integer. A similar, but more complicated,
transition diagram could be produced if the language permitted complex numbers as
well.
Homework: Write transition diagrams for the regular expressions in problems 3.6 a
and b, 3.7 a and b.
The idea is that we write a piece of code for each transition diagram. I will show the
one for relational operators below (from the 2nd edition). This piece of code contains
a case for each state, which typically reads a character and then goes to the next case
depending on the character read. The numbers in the circles are the names of the
cases.
Accepting states often need to take some action and return to the parser. Many of
these accepting states (the ones with stars) need to restore one character of input. This
is called retract() in the code.
What should the code for a particular diagram do if at one state the character read is
not one of those for which a next state has been defined? That is, what if the character
read is not the label of any of the outgoing arcs? This means that we have failed to
find the token corresponding to this diagram.
The code calls fail(). This is not an error case. It simply means that the current input
does not match this particular token. So we need to go to the code section for another
diagram after restoring the input pointer so that we start the next diagram at the point
where this failing diagram started. If we have tried all the diagrams, then we have a
real failure and need to print an error message and perhaps try to repair the input.
Note that the order the diagrams are tried is important. If the input matches more than
one token, the first one tried will be chosen.
TOKEN getRelop()                          // TOKEN has two components
{
    TOKEN retToken = new(RELOP);          // first component set here
    while (true) {
        switch (state) {
        case 0: c = nextChar();
                if (c == '<')      state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else fail();
                break;
        case 1: ...
        ...
        case 8: retract();                // an accepting state with a star
                retToken.attribute = GT;  // second component
                return(retToken);
        }
    }
}
The description above corresponds to the one given in the first edition.
The newer edition gives two other methods for combining the multiple transition-
diagrams (in addition to the one above).
1. Unlike the method above, which tries the diagrams one at a time, the first new
method tries them in parallel. That is, each character read is passed to each
diagram (that hasn't already failed). Care is needed when one diagram has
accepted the input, but others still haven't failed and may accept a longer prefix
of the input.
2. The final possibility discussed, which appears to be promising, is to combine
all the diagrams into one. That is easy for the example we have been
considering because all the diagrams begin with different characters being
matched. Hence we just have one large start state with multiple outgoing edges. It is
more difficult when there is a character that can begin more than one diagram.
Lex is itself a compiler that is used in the construction of other compilers (its output is
the lexer for the other compiler). The lex language, i.e., the input language of the lex
compiler, is described in the next few sections. The compiler writer uses the lex language
to specify the tokens of their language as well as the actions to take at each state.
One of the procedures in lex.yy.c (call it pinkLex()) is the lexer itself, which reads a
character input stream and produces a sequence of tokens. pinkLex() also sets a global
value yylval that is shared with the parser. I then compile lex.yy.c together with the
parser (typically the output of lex's cousin yacc, a parser generator) to produce say
pinkfront, which is an executable program that is the front end for my pink compiler.
The lex program for the example we have been working with follows (it is typed in
straight from the book).
%{
/* definitions of manifest constants
LT, LE, EQ, NE, GT, GE,
IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim [ \t\n]
ws {delim}*
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
"<"       {yylval = LT; return(RELOP);}
"<="      {yylval = LE; return(RELOP);}
"="       {yylval = EQ; return(RELOP);}
"<>"      {yylval = NE; return(RELOP);}
">"       {yylval = GT; return(RELOP);}
">="      {yylval = GE; return(RELOP);}
%%
int installID() {/* function to install the lexeme, whose first character
is pointed to by yytext, and whose length is yyleng,
into the symbol table and return a pointer thereto
*/
}
int installNum() {/* similar to installID, but puts numerical constants
                     into a separate table */
}
The first, declaration, section includes variables and constants as well as the all-
important regular definitions that define the building blocks of the target language,
i.e., the language that the generated lexer will analyze.
The next, translation rules, section gives the patterns of the lexemes that the lexer will
recognize and the actions to be performed upon recognition. Normally, these actions
include returning a token name to the parser and often returning other information
about the token via the shared variable yylval.
If a return is not specified the lexer continues executing and finds the next lexeme
present.
Anything between %{ and %} is not processed by lex, but instead is copied directly to
lex.yy.c. So we could have had statements like
#define LT 12
#define LE 13
The regular definitions are mostly self explanatory. When a definition is later used it
is surrounded by {}. A backslash \ is used when a special symbol like * or . is to be
used to stand for itself, e.g. if we wanted to match a literal star in the input for
multiplication.
Each rule is fairly clear: when a lexeme is matched by the left, pattern, part of the rule,
the right, action, part is executed. Note that the value returned is the name (an integer)
of the corresponding token. For simple tokens like the one named IF, which
correspond to only one lexeme, no further data need be sent to the parser. There are
several relational operators so a specification of which lexeme matched RELOP is
saved in yylval. For ids and numbers, the lexeme is stored in a table by the install
functions and a pointer to the entry is placed in yylval for future use.
Everything in the auxiliary function section is copied directly to lex.yy.c. Unlike
declarations enclosed in %{ %}, however, auxiliary functions may be used in the
actions.
The first rule makes <= one instead of two lexemes. The second rule makes if a
keyword and not an id.
This pattern only matches IF when it is followed by a (, then some text, then a ), and then
a letter. The only FORTRAN statements that match this are the if/then statements shown
above; so we have found a lexeme that matches the if token. However, the lexeme is just
the IF and not the rest of the pattern. The slash tells lex to put the rest back into the
input and match it against the next and subsequent tokens.
Homework: 3.11.
Homework: Modify the lex program in section 3.5.2 so that: (1) the keyword while is
recognized, (2) the comparison operators are those used in the C language, (3) the
underscore is permitted as another letter (this problem is easy).
Finite automata are like the graphs we saw in transition diagrams but they simply
decide if a sentence (input string) is in the language (generated by our regular
expression). That is, they are recognizers of languages.
1. Deterministic finite automata (DFA) have for each state (circle in the diagram)
exactly one edge leading out for each symbol. So if you know the next symbol
and the current state, the next state is determined. That is, the execution is
deterministic; hence the name.
2. Nondeterministic finite automata (NFA) are the other kind. There are no
restrictions on the edges leaving a state: there can be several with the same
symbol as label and some edges can be labeled with ε. Thus there can be
several possible next states from a given state and a current lookahead symbol.
Surprising Theorem: Both DFAs and NFAs are capable of recognizing the same
languages, the regular languages, i.e., the languages generated by regular expressions
(plus the automata can recognize the empty language).
There are certainly NFAs that are not DFAs. But the language recognized by each
such NFA can also be recognized by at least one DFA.
The DFA that recognizes the same language as an NFA might be significantly larger
than the NFA.
The finite automaton that one constructs naturally from a regular expression is often
an NFA.
An NFA is basically a flow chart like the transition diagrams we have already seen.
Indeed an NFA (or a DFA, to be formally defined soon) can be represented by
a transition graph whose nodes are states and whose edges are labeled with elements
of Σ ∪ ε. The differences between a transition graph and our previous transition
diagrams are:
1. Possibly multiple edges with the same label leaving a single state.
2. An edge may be labeled with ε.
Consider aababb. If you choose the wrong edge for the initial a's you will get stuck or
not end at the accepting state. But an NFA accepts a word if any path (beginning at
the start state and using the symbols in the word in order) ends at an accepting state. It
essentially tries all such paths at once and accepts if any end at an accepting state.
Patterns like (a|b)*abb are useful regular expressions! If the alphabet is ascii, consider
*.java.
Homework: For the NFA to the right, indicate all the paths labeled aabb.
The downside of these tables is their size, especially if most of the entries are φ since
those entries would not take any space in a transition graph.
Homework: Construct the transition table for the NFA in the previous homework
problem.
An NFA accepts a string if the symbols of the string specify a path from the start to an
accepting state.
Homework: Does the NFA in the previous homework accept the string aabb?
Again note that these symbols may specify several paths, some of which lead to
accepting states and some that don't. In such a case the NFA does accept the string;
one successful path is enough.
Also note that if an edge is labeled ε, then it can be taken for free.
For the transition graph above any string can just sit at state 0 since every possible
symbol (namely a or b) can go from state 0 back to state 0. So every string can lead to
a non-accepting state, but that is not important since if just one path with that string
leads to an accepting state, the NFA accepts the string.
The language defined by an NFA or the language accepted by an NFA is the set of
strings (a.k.a. words) accepted by the NFA.
So the NFA in the diagram above (not the diagram with the homework problem)
accepts the same language as the regular expression (a|b)*abb.
This is realistic. We are at a state and examine the next character in the string;
depending on that character we go to exactly one new state. It looks like a switch
statement to me.
Minor point: when we write a transition table for a DFA, the entries are elements not
sets so there are no {} present.
Simulating a DFA
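Simulating a DFA is straightforward. Here is a minimal sketch in Java (mine, not the book's code), assuming the DFA is stored as a transition table move[state][character] with -1 marking the missing (dead-state) entries, and a set F of accepting states; all the names are just for illustration.

import java.util.Set;

class DfaSimulator {
    // move[s][c] gives the next state; -1 marks the dead state (no transition shown).
    static boolean accepts(int[][] move, Set<Integer> accepting, int start, String input) {
        int s = start;
        for (char c : input.toCharArray()) {
            s = move[s][c];              // exactly one next state: deterministic
            if (s < 0) return false;     // fell into the dead state
        }
        return accepting.contains(s);    // accept iff we end in an accepting state
    }
}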
Do not forget that the goal of the chapter is to understand lexical analysis. We saw, when
looking at Lex, that regular expressions are key to this task. So we want to recognize
regular expressions (say the ones representing tokens). We are going to see two
methods, built from the following algorithms.
1. Convert the regular expression to an NFA.
2. Simulate the NFA directly.
3. Convert the NFA to a DFA.
4. Simulate the DFA.
The list I just gave is in the order the algorithms would be applied—but you would
use either 2 or (3 and 4).
The two editions differ in the order the techniques are presented, but neither does it in
the order I just gave. Indeed, we just did item #4.
I will follow the order of 2nd ed but give pointers to the first edition where they differ.
Remark: I forgot to assign homework for section 3.6. I have added one problem
spread into three parts. It is not assigned but it is a question I believe you should be
able to do.
(This is item #3 above and is done in section 3.6 in the first edition.)
The book gives a detailed proof; I am just trying to motivate the ideas.
Let N be an NFA; we construct a DFA D that accepts the same strings as N does. Call
the states of N N-states and the states of D D-states.
The idea is that each D-state corresponds to a set of N-states, and hence this is called
the subset algorithm. Specifically, for each string X of symbols we consider all the N-
states that can result when N processes X. This set of N-states is a D-state. Let us
consider the transition graph on the right, which is an NFA that accepts strings
satisfying the regular expression
(a|b)*abb.
The start state of D is the set of N-states that can result when N processes the empty
string ε. This is called the ε-closure of the start state s0 of N, and consists of those
N-states that can be reached from s0 by following edges labeled with ε. Specifically
it is the set {0,1,2,4,7} of N-states. We call this state D0 and enter it in the transition
table we are building for D, shown below.

NFA states          DFA state   a    b
{0,1,2,4,7}         D0          D1   D2
{1,2,3,4,6,7,8}     D1          D1   D3
{1,2,4,5,6,7}       D2          D1   D2
{1,2,4,5,6,7,9}     D3          D1   D4
{1,2,4,5,6,7,10}    D4          D1   D2
Next we want the a-successor of D0, i.e., the D-state that occurs when we start at
D0 and move along an edge labeled a. We call this successor D1. Since D0 consists of
the N-states corresponding to ε, D1 is the N-states corresponding to εa=a. We compute
the a-successor of all the N-states in D0 and then form the ε-closure.
Next we compute the b-successor of D0 the same way and call it D2.
We continue forming a- and b-successors of all the D-states until no new D-states
result (there is only a finite number of subsets of all the N-states so this process does
indeed stop).
This gives the table on the right. D4 is the only D-accepting state as it is the only D-
state containing the (only) N-accepting state 10.
Theoretically, this algorithm is awful since for a set with k elements, there
are 2^k subsets. Fortunately, normally only a small fraction of the possible subsets
occur in practice.
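Here is a rough sketch (mine) of the subset algorithm in Java. The representation—a map from each N-state to its labeled edges, with a special EPS label standing for ε—is an assumption for illustration, not the book's.

import java.util.*;

class SubsetConstruction {
    static final char EPS = 0;   // stands for the epsilon label

    // epsilon-closure of a set of NFA states
    static Set<Integer> closure(Set<Integer> s, Map<Integer, Map<Character, Set<Integer>>> edges) {
        Deque<Integer> work = new ArrayDeque<>(s);
        Set<Integer> result = new HashSet<>(s);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (int r : edges.getOrDefault(q, Map.of()).getOrDefault(EPS, Set.of()))
                if (result.add(r)) work.push(r);
        }
        return result;
    }

    // Build the DFA transition table: each D-state is a set of N-states.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> buildDfa(
            Set<Integer> start, Set<Character> alphabet,
            Map<Integer, Map<Character, Set<Integer>>> edges) {
        Set<Integer> d0 = closure(start, edges);
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>(List.of(d0));
        while (!work.isEmpty()) {
            Set<Integer> d = work.pop();
            if (dfa.containsKey(d)) continue;          // already processed this D-state
            Map<Character, Set<Integer>> row = new HashMap<>();
            for (char a : alphabet) {                  // a-successor: move on a, then closure
                Set<Integer> moved = new HashSet<>();
                for (int q : d)
                    moved.addAll(edges.getOrDefault(q, Map.of()).getOrDefault(a, Set.of()));
                Set<Integer> next = closure(moved, edges);
                row.put(a, next);
                work.push(next);
            }
            dfa.put(d, row);
        }
        return dfa;
    }
}

Each key of the returned map is a D-state (a set of N-states); a D-state is accepting exactly when it contains an accepting N-state, and the empty set plays the role of the dead state mentioned later.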
Homework: Convert the NFA from the homework for section 3.6 to a DFA.
Instead of producing the DFA, we can run the subset algorithm as a simulation itself.
This is item #2 in my list of techniques above.
S = ε-closure(s0);
c = nextChar();
while ( c != eof ) {
    S = ε-closure(move(S,c));
    c = nextChar();
}
if ( S ∩ F != φ ) return yes;   // F is the set of accepting states
else return no;
The book also gives a slick implementation, which computes the ε-closures efficiently
using a stack of states still to be processed together with a boolean array recording
which states are already in the set.
The pictures on the right illustrate the base and inductive cases.
Remarks:
1. The generated NFA has at most twice as many states as there are operators and
operands in the RE. This is important for studying the complexity of the NFA.
2. The generated NFA has one start and one accepting state. The accepting state
has no outgoing arcs and the start state has no incoming arcs.
3. Note that the diagram for st correctly indicates that the final state of s and the
initial state of t are merged. This uses the previous remark that there is only one
start and final state.
4. Except for the accepting state, each state of the generated NFA has either one
outgoing arc labeled with a symbol or two outgoing arcs labeled with ε.
Do the NFA for (a|b)*abb and see that we get the same diagram that we had before.
Do the steps in the normal leftmost, innermost order (or draw a normal parse tree and
follow it).
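Here is a rough sketch (mine, not the book's code) of the construction in Java. The Nfa fragment representation is an assumption, and for simplicity the concatenation case adds an ε edge instead of literally merging the two states as remark 3 describes.

import java.util.*;

class ThompsonSketch {
    static int nextState = 0;

    static class Nfa {
        int start, accept;
        // edges: state -> list of (label, target); '\0' plays the role of epsilon
        Map<Integer, List<int[]>> edges = new HashMap<>();
        void add(int from, char label, int to) {
            edges.computeIfAbsent(from, k -> new ArrayList<>()).add(new int[]{label, to});
        }
    }

    static Nfa symbol(char a) {                 // basis case: a single symbol (or epsilon)
        Nfa n = new Nfa();
        n.start = nextState++; n.accept = nextState++;
        n.add(n.start, a, n.accept);
        return n;
    }

    static Nfa concat(Nfa s, Nfa t) {           // st: link s's accepting state to t's start
        s.edges.putAll(t.edges);
        s.add(s.accept, '\0', t.start);         // an epsilon edge instead of a true merge
        s.accept = t.accept;
        return s;
    }

    static Nfa union(Nfa s, Nfa t) {            // s|t: new start and accept, epsilon edges
        Nfa n = new Nfa();
        n.start = nextState++; n.accept = nextState++;
        n.edges.putAll(s.edges); n.edges.putAll(t.edges);
        n.add(n.start, '\0', s.start); n.add(n.start, '\0', t.start);
        n.add(s.accept, '\0', n.accept); n.add(t.accept, '\0', n.accept);
        return n;
    }

    static Nfa star(Nfa s) {                    // s*: loop back, plus a bypass for zero copies
        Nfa n = new Nfa();
        n.start = nextState++; n.accept = nextState++;
        n.edges.putAll(s.edges);
        n.add(n.start, '\0', s.start); n.add(n.start, '\0', n.accept);
        n.add(s.accept, '\0', s.start); n.add(s.accept, '\0', n.accept);
        return n;
    }
}

For example, concat(concat(concat(star(union(symbol('a'), symbol('b'))), symbol('a')), symbol('b')), symbol('b')) builds an NFA equivalent to the one for (a|b)*abb.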
The remaining large question is how the lex input is converted into one of these
automata.
Also
1. Lex permits functions to be passed through to the yy.lex.c file. This is fairly
straightforward to implement.
2. Lex also supports actions that are to be invoked by the simulator when a match
occurs. This is also fairly straightforward.
3. The lookahead operator is not so simple in the general case and is discussed
briefly below.
In this section we will use transition graphs. Lexer-generators do not draw pictures;
instead they use the equivalent transition tables.
At each of the accepting states (one for each NFA in step 1), the simulator executes
the actions specified in the lex program for the corresponding pattern.
The simulator starts reading characters and calculates the set of states it is at.
At some point the input character does not lead to any state or we have reached the
eof. Since we wish to find the longest lexeme matching the pattern we proceed
backwards from the current point (where there was no state) until we reach an
accepting state (i.e., the set of NFA states, N-states, contains an accepting N-state).
Each accepting N-state corresponds to a matched pattern. The lex rule is that if a
lexeme matches multiple patterns we choose the pattern
listed first in the lex-program.
Example

Consider the following example with three patterns and their associated actions, and
consider processing the input aaba.

Pattern   Action to perform
a         Action1
abb       Action2
a*b+      Action3
1. We begin by constructing the three NFAs. To save space, the third NFA is not
the one that would be constructed by our algorithm, but is an equivalent smaller
one. For example, some unnecessary ε-transitions have been eliminated. If one
views the lex executable as a compiler transforming lex source into NFAs, this
would be considered an optimization.
2. We introduce a new start state and ε-transitions as in the previous section.
3. We start at the ε-closure of the start state, which is {0,1,3,7}.
4. The first a (remember the input is aaba) takes us to {2,4,7}. This includes an
accepting state and indeed we have matched the first pattern. However, we
do not stop since we may find a longer match.
5. The next a takes us to {7}.
6. The b takes us to {8}.
7. The next a fails since there are no a-transitions out of state 8. So we must back
up to the point just before the last a was read.
8. We are back in {8} and ask if one of these N-states (I know there is only one,
but there could be more) is an accepting state.
9. Indeed state 8 is accepting for the third pattern. If there were more than one
accepting state in the list, we would choose the one in the earliest listed pattern.
10. Action3 would now be performed.
Technical point. For a DFA, there must be an outgoing edge from each D-state for each
possible character. In the diagram, when there is no NFA state possible, we do not
show the edge. Technically we should show these edges, all of which lead to the same
D-state, called the dead state, which corresponds to the empty subset of N-states.
This has some tricky points. Recall that the lookahead operator is for when you must
look further down the input but the extra characters matched are not part of the
lexeme. We write the pattern r1/r2. In the NFA we match r1, then treat the / as an ε, and
then match r2. It would be fairly easy to describe the situation when the NFA has only
one ε-transition at the state where r1 is matched. But it is tricky when there is more
than one such transition.
Skipped
Skipped
Skipped
4.1: Introduction
4.1.1: The role of the parser
Conceptually, the parser accepts a sequence of tokens and produces a parse tree.
As we saw in the previous chapter the parser calls the lexer to obtain the next token.
In practice this might not occur.
1. universal
2. top-down
3. bottom-up
The universal parsers are not used in practice as they are inefficient.
As expected, top-down parsers start from the root of the tree and proceed downward;
whereas, bottom-up parsers start from the leaves and proceed upward.
The commonly used top-down and bottom-up parsers are not universal. That is, there are
grammars that cannot be used with them.
The LL and LR parsers are important in practice. Hand written parsers are often LL.
Specifically, the predictive parsers we looked at in chapter two are for LL grammars.
The LR grammars form a larger class. Parsers for this class are usually constructed
with the aid of automatic tools.
The following left-recursive grammar takes care of precedence.
E → E + T | T
T → T * F | F
F → ( E ) | id
But, as we saw before, it gives us trouble since it is left-recursive and we are doing
top-down parsing. So we use the following non-left-recursive grammar that generates
the same language.
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
The following ambiguous grammar will be used for illustration, but in general we try
to avoid ambiguity. This grammar does not enforce precedence.
E → E + E | E * E | ( E ) | id
Report errors clearly and accurately. One difficulty is that one error can mask
another and can cause correct code to look faulty.
Recover quickly enough to not miss other errors.
Add minimal overhead.
Print an error message when parsing cannot continue and then terminate parsing.
Panic-Mode Recovery
The first level improvement. The parser discards input until it encounters
a synchronizing token. These tokens are chosen so that the parser can make a fresh
beginning. Good examples are ; and }.
Phrase-Level Recovery
Locally replace some prefix of the remaining input by some string. Simple cases are
exchanging ; with , and = with ==. The difficulty arises when the real error occurred
long before it was detected.
Error Productions
Augment the grammar with productions for common errors. When the parser uses one of
these error productions, it can issue an appropriate diagnostic and continue.
Global Correction
Change the input I to the closest correct input I' and produce the parse tree for I'.
1. Terminals: The basic components found by the lexer. They are sometimes
called token names, i.e., the first component of the token as produced by the
lexer.
2. Nonterminals: Syntactic variables that help define the syntactic structure of the
language.
3. Start Symbol: A designated nonterminal that serves as the root of every parse tree.
4. Productions:
a. Head or left (hand) side or LHS. A single nonterminal.
b. →
c. Body or right (hand) side or RHS. A string of terminals and
nonterminals.
4.2.3: Derivations
Assume we have a production A → α. We would then say that A derives α and write
A⇒α
We generalize this. If, in addition, β and γ are strings, we say that βAγ derives βαγ
and write
βAγ ⇒ βαγ
The notation used is ⇒ with a * over it, written here as ⇒*. This should be
read derives in zero or more steps. Formally, α ⇒* α for any string α, and if
α ⇒* β and β ⇒ γ, then α ⇒* γ.
Definition: Two grammars generating the same language are called equivalent.
We see that id + id is a sentence. Indeed it can be derived in two ways from the start
symbol E
E ⇒ E + E ⇒ id + E ⇒ id + id
E ⇒ E + E ⇒ E + id ⇒ id + id
When one wishes to emphasize that a (one step) derivation is leftmost they write an
lm under the ⇒. To emphasize that a (general) derivation is leftmost, one writes an lm
under the ⇒*. Similarly one writes rm to indicate that a derivation is rightmost. I
won't do this in the notes but will on the board.
The leaves of a parse tree (or of any other tree), when read left to right, are called
the frontier of the tree. For a parse tree we also call them the yield of the tree.
Given a derivation of a string from a nonterminal A, it is easy to write a parse tree
with A as the root and the derived string as the leaves. Just do what (the
productions contained in) each step of the derivation says. The LHS of each
production used is a nonterminal in the frontier of the current tree, so replace it with
the RHS to get the next tree.
Do this for both the leftmost and rightmost derivations of id+id above.
So there can be many derivations that wind up with the same final tree.
But for any parse tree there is a unique leftmost derivation that produces that tree and a
unique rightmost derivation that produces the tree. There may be other derivations as
well (e.g., sometimes choose the leftmost nonterminal to expand; other times choose the
rightmost).
Homework: 4.1 b
4.2.5: Ambiguity
Recall that an ambiguous grammar is one for which there is more than one parse tree
for a single sentence. Since each parse tree corresponds to exactly one leftmost (or
rightmost) derivation, an ambiguous grammar is one for which there is more than one
leftmost (or rightmost) derivation of a given sentence.
The grammar
E → E + E | E * E | ( E ) | id
is ambiguous because we have seen (a few lectures ago) two parse trees for
id + id * id
So there must be at least two leftmost derivations. Here they are.
E ⇒ E + E                  E ⇒ E * E
  ⇒ id + E                   ⇒ E + E * E
  ⇒ id + E * E               ⇒ id + E * E
  ⇒ id + id * E              ⇒ id + id * E
  ⇒ id + id * id             ⇒ id + id * id
4.2.6: Verification
Skipped
The book starts with (a|b)*abb and then uses the short NFA on the left below. Recall
that the NFA generated by our construction is the longer one on the right.
The book gives the simple grammar for the short diagram.
The grammar
A → a A b | ε
generates all strings of the form a^n b^n, i.e., n a's followed by n b's. In a sense the
grammar has counted. No RE can generate this language (the proof is in the book).
Recall the ambiguous grammar with the notorious dangling else problem.
stmt → if expr then stmt
| if expr then stmt else stmt
| other
On the board try to find leftmost derivations of the problem sentence above.
Previously we did it separately for one production and for two productions with the
same nonterminal A on the LHS. Not surprisingly, this can be done for n such
productions (together with other non-left recursive productions involving A). The
general transformation replaces
A → A x1 | A x2 | ... | A xn | y1 | y2 | ... | ym
by
A → y1 A' | y2 A' | ... | ym A'
A' → x1 A' | x2 A' | ... | xn A' | ε
where the x's and y's are strings, no x is ε, and no y begins with A.
This removes direct left recursion where a production with A on the left hand side
begins with A on the right. If you also had direct left recursion with B, you would
apply the procedure twice.
The harder general case is where you permit indirect left recursion, where, for
example one production has A as the LHS and begins with B on the RHS, and a
second production has B on the LHS and begins with A on the RHS. Thus in two
steps we can turn A into something starting again with A. Naturally, this indirection
can involve more than 2 nonterminals.
Homework: Eliminate left recursion in the following grammar for simple postfix
expressions.
S → S S + | S S * | a
If two productions with the same LHS have their RHS beginning with the same
symbol, then the FIRST sets will not be disjoint so predictive parsing (chapter 2) will
be impossible and more generally top down parsing (later this chapter) will be more
difficult as a longer lookahead will be needed to decide which production to use.
So convert A → x y1 | x y2 into
A → x A'
A' → y1 | y2
Although our grammars are powerful, they are not all-powerful. For example, we
cannot write a grammar that checks that all variables are declared before used.
1. Start with the root of the parse tree, which is always the start symbol of the
grammar. That is, initially the parse tree is just the start symbol.
2. Choose a nonterminal in the frontier.
a. Choose a production having that nonterminal as LHS.
b. Expand the tree by making the RHS the children of the LHS.
3. Repeat above until the frontier is all terminals.
4. Hope that the frontier equals the input string.
The above has two nondeterministic choices (the nonterminal, and the production) and
requires luck at the end. Indeed, the procedure will generate the entire language. So
we have to be really lucky to get the input string.
We also process the terminals in the RHS, checking that they match the input. By
doing the expansion depth-first, left to right, we ensure that we encounter the
terminals in the order they will appear in the frontier of the final tree. Thus if the
terminal does not match the corresponding input symbol now, it never will and the
expansion so far will not produce the input string as desired.
1. Initially, the tree is the start symbol, the nonterminal we are processing.
2. Choose a production having the current nonterminal A as LHS. Say the RHS is
X1 X2 ... Xn.
3. for i = 1 to n
4. if Xi is a nonterminal
5. process Xi // recursive
6. else if Xi (a terminal) matches current input symbol
7. advance input to next symbol
8. else // trouble: Xi doesn't match and never will with this choice of production
9. backtrack: restore the input position and return to step 2 to try another production for A
Note that the trouble mentioned at the end of the algorithm does not signify an
erroneous input. We may simply have chosen the wrong production in step 2.
The good news is that we will work with grammars where we can control the
nondeterminism much better. Recall that for predictive parsing, the use of 1 symbol of
lookahead made the algorithm fully deterministic, without backtracking.
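As a concrete illustration (a sketch of mine, not from the book), here is a predictive recursive-descent parser in Java for the non-left-recursive expression grammar given earlier (E → T E', etc.). The token stream is simplified to a string in which the single character i stands for the id token.

class PredictiveParser {
    private final String toks;   // simplified token stream: i stands for id
    private int pos = 0;

    PredictiveParser(String toks) { this.toks = toks; }

    private char peek() { return pos < toks.length() ? toks.charAt(pos) : '$'; }

    private void match(char c) {             // consume the expected terminal or fail
        if (peek() != c) throw new RuntimeException("expected " + c + " at position " + pos);
        pos++;
    }

    void E()      { T(); Eprime(); }
    void Eprime() { if (peek() == '+') { match('+'); T(); Eprime(); } }   // else E' -> epsilon
    void T()      { F(); Tprime(); }
    void Tprime() { if (peek() == '*') { match('*'); F(); Tprime(); } }   // else T' -> epsilon
    void F() {
        if (peek() == '(') { match('('); E(); match(')'); }
        else match('i');                                                   // F -> id
    }

    public static void main(String[] args) {
        PredictiveParser p = new PredictiveParser("i+i*i");
        p.E();
        p.match('$');                        // make sure all the input was consumed
        System.out.println("accepted");
    }
}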
4.4.2: FIRST and FOLLOW
Now we learn the whole truth about these two sets, which prove to be quite useful for
several parsing techniques (and for error recovery).
The basic idea is that FIRST(α) tells you what the first symbol can be when you fully
expand the string α and FOLLOW(A) tells what terminals can immediately follow the
nonterminal A.
Definition: For any string α of grammar symbols, we define FIRST(α) to be the set of
terminals that occur as the first symbol in a string derived from α. So, if α⇒*xQ for x
a terminal and Q a string, then x is in FIRST(α). In addition if α⇒*ε, then ε is in
FIRST(α).
Definition: For any nonterminal A, FOLLOW(A) is the set of terminals x, that can
appear immediately to the right of A in a sentential form. Formally, it is the set of
terminals x, such that S⇒*αAxβ. In addition, if A can be the rightmost symbol in a
sentential form, the endmarker $ is in FOLLOW(A).
Note that there might have been symbols between A and x during the derivation,
provided they all derived ε so that eventually x immediately follows A.
Unfortunately, the algorithms for computing FIRST and FOLLOW are not as simple
to state as the definitions suggest, in large part because of ε-productions.
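As a rough illustration of one way to compute FIRST for every nonterminal (a sketch of mine, not the book's statement of the algorithm): iterate over the productions until nothing changes. Here the grammar is assumed to be a map from each nonterminal (a single character) to its bodies, with "" standing for an ε body and '#' standing for ε inside the computed sets.

import java.util.*;

class FirstSets {
    static final char EPS = '#';   // stands for epsilon inside the FIRST sets

    static Map<Character, Set<Character>> first(Map<Character, List<String>> grammar) {
        Map<Character, Set<Character>> first = new HashMap<>();
        for (char a : grammar.keySet()) first.put(a, new HashSet<>());
        boolean changed = true;
        while (changed) {                                  // repeat until a fixed point
            changed = false;
            for (Map.Entry<Character, List<String>> e : grammar.entrySet()) {
                char head = e.getKey();
                for (String body : e.getValue()) {
                    boolean allNullable = true;            // does the whole body derive epsilon?
                    for (char x : body.toCharArray()) {
                        Set<Character> fx = grammar.containsKey(x)
                                ? first.get(x)             // x is a nonterminal
                                : Set.of(x);               // x is a terminal: FIRST(x) = {x}
                        for (char t : fx)
                            if (t != EPS) changed |= first.get(head).add(t);
                        if (!fx.contains(EPS)) { allNullable = false; break; }
                    }
                    if (allNullable) changed |= first.get(head).add(EPS);   // covers "" as well
                }
            }
        }
        return first;
    }
}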
Definition: A grammar is LL(1) if, for every pair of productions A → α | β with the
same LHS,
1. FIRST(α) ∩ FIRST(β) = φ.
2. If β ⇒* ε, then no string derived from α begins with a terminal in
FOLLOW(A). Similarly, if α ⇒* ε.
The 2nd condition may seem strange; it did to me for a while. Let's consider the
simplest case that condition 2 is trying to avoid.
S → A b // b is in FOLLOW(A)
A → b // α=b so α derives a string beginning with b
A → ε // β=ε so β derives ε
The goal is to produce a table telling us at each situation which production to apply.
A situation means a nonterminal in the parse tree and an input symbol in lookahead.
We start with an empty table M and populate it as follows. (The 2nd edition has a typo
here, A instead of α.) For each production A → α:
1. For each terminal a in FIRST(α), add A → α to M[A,a].
2. If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add A → α to M[A,b];
if in addition $ is in FOLLOW(A), add A → α to M[A,$] as well.
a. S → 0 S 1 | 0 1
b. the prefix grammar S → + S S | * S S | a
Skipped.
For bottom up parsing, we are not as fearful of left recursion as we were with top
down. Our first few examples will use the left recursive expression grammar
E → E + T | T
T → T * F | F
F → ( E ) | id
4.5.1: Reductions
Remember that running a production in reverse, i.e., replacing the RHS by the LHS is
called reducing. So our goal is to reduce the input string to the start symbol.
On the right is a movie of parsing id*id in a bottom-up fashion. Note the way it is
written. For example, from step 1 to 2, we don't just put F above id*id. We draw it as
we do because it is the current top of the tree (really forest) and not the bottom that we
are working on, so we want the tops to be in a horizontal line and hence easy to read.
The tops of the forest are the roots of the subtrees present in the diagram. For the
movie those are
id * id, F * id, T * F, T, E
Note that (since the reduction successfully reaches the start symbol) each of these sets
of roots is a sentential form.
The steps from one frame of the movie, when viewed going down the page, are
reductions (replace the RHS of a production by the LHS). Naturally, when viewed
going up the page, we have a derivation (replace LHS by RHS). For our example the
derivation is
E ⇒ T ⇒ T * F ⇒ T * id ⇒ F * id ⇒ id * id
Note that this is a rightmost derivation and hence each of the sets of roots identified
above is a right sentential form. So the reduction we did in the movie was a rightmost
derivation in reverse.
Remember that for a non-ambiguous grammar there is only one rightmost derivation
and hence there is only one rightmost derivation in reverse.
Remark: You cannot simply scan the string (the roots of the forest) from left to right
and choose the first substring that matches the RHS of some production. If you try it
in our movie you will reduce T to E right after T appears. The result is not a right
sentential form.
Homework: 4.23 a c
A technical point, which explains the usage of a stack is that a handle is always at the
TOS. See the book for a proof; the idea is to look at what rightmost derivations can do
(specifically two consecutive productions) and then trace back what the parser will do
since it does the reverse operations (reductions) in the reverse order.
We have not yet discussed how to decide whether to shift or reduce when both are
possible. We have also not discussed which reduction to choose if multiple reductions
are possible. These are crucial questions for bottom-up (shift-reduce) parsing and will
be addressed.
Homework: 4.23 b
There are grammars (non-LR) for which no viable algorithm can decide whether to
shift or reduce when both are possible or which reduction to perform when several are
possible. However, for most languages, choosing a good lexer yields an LR(k)
language of tokens. For example, ada uses () for both function calls and array
references. If the lexer returned id for both array names and procedure names then a
reduce/reduce conflict would occur when the stack was ... id ( id and the input )
... since the id on TOS should be reduced to parameter if the first id was a procedure
name and to expr if the first id was an array name. A better lexer (and an assumption,
which is true in ada, that the declaration must precede the use) would return proc-id
when it encounters a lexeme corresponding to a procedure name. It does this by
consulting the symbol table as it builds it.
Indeed, I will have much more to say about SLR than the other LR schemes. The
reason is that SLR is simpler to understand, but does capture the essence of shift-
reduce, bottom-up parsing. The disadvantage of SLR is that there are LR grammars
that are not SLR.
I will just say the following about operator precedence. We shall see that a major
consideration in all the bottom-up, shift-reduce parsers is deciding when to shift and
when to reduce. Consider parsing A+B*C in C/java/etc. When the stack is A+B and
the remaining input is *C, the parser needs to know whether to reduce A+B or shift in
* and then C. (Really the A+B will probably by now be more like E+T.) The idea of
operator precedence is that we give * higher precedence so when the parser sees * on
the input it knows not to reduce +. More details are in the first (i.e., your) edition of
the text.
These compiler writers claim that they are able to produce much better error messages
than can readily be obtained by going to LR (with its attendant requirement that a
parser-generator be used since the parsers are too large to construct by hand). Note
that compiler error messages are a very important user interface issue and that with
recursive descent one can augment the procedure for a nonterminal with statements
like
if (nextToken == X) then error(expected Y here)
We now come to grips with the big question: How does a shift-reduce parser know
when to shift and when to reduce? This will take a while to answer in a satisfactory
manner. The unsatisfactory answer is that the parser has tables that say in
each situation whether to shift or reduce (or announce error, or announce acceptance).
To begin the path toward the answer, we need several definitions.
An item is a production with a marker saying how far the parser has gotten with this
production. Formally, an (LR(0)) item is a production with a dot (·) placed somewhere
in the RHS.
Examples:
A. E → E + T generates 4 items.
1. E → · E + T
2. E → E · + T
3. E → E + · T
4. E → E + T ·
B. A → ε generates A → · as its only item.
The item E → E · + T signifies that the parser has just processed input that is
derivable from E and will look for input derivable from + T.
Line 4 indicates that the parser has just seen the entire RHS and must consider
reducing it to E. Important: consider does not mean do.
The parser groups certain items together into states. As we shall see, the items within a
given state are treated similarly.
Our goal is to construct first the canonical LR(0) collection of states and then a DFA
called the LR(0) automaton (technically not a DFA since no dead state).
To construct the canonical LR(0) collection formally and present the parsing
algorithm in detail we shall first augment the grammar and then define two functions
on sets of items, CLOSURE and GOTO.
Augmenting the grammar is easy. We simply add a new start symbol S' and one
production S'→S. The purpose is to detect success, which occurs when the parser is
ready to reduce S to S'.
I hope the following interlude will prove helpful. In preparing to present SLR, I was
struck how it looked like we were working with a DFA that came from some
(unspecified and unmentioned) NFA. It seemed that by first doing the NFA, I could
give some rough insight. Since for our current example the NFA has more states and
hence a bigger diagram, let's consider the following extremely simple grammar.
E → E + T
E → T
T → id
When the dots are added we get 10 items (4 from the second production, 2 each from
the other three).
See the diagram at the right. We begin at E'→·E since it is the start item. The diagram
has four kinds of edges.
1. Edges labeled with terminals. These correspond to shift actions, where the
indicated terminal is shifted from the input to the stack.
2. Edges labeled with nonterminals. These will correspond to reduce actions when
we construct the DFA. The stack is reduced by a production having the given
nonterminal as LHS. Reduce actions do more as we shall see.
3. Edges labeled with ε. These are associated with the closure operation to be
discussed and are the source of the nondeterminism (i.e., why the diagram is an
NFA).
4. An edge labeled $. This edge, which can be thought of as shifting the
endmarker, is used when we are reducing via the E'→E production and
accepting the input.
If we were at the item E→E·+T (the dot indicating that we have seen an E and now
need a +) and shifted a + from the input to the stack we would move to the item
E→E+·T. If the dot is before a non-terminal, the parser needs a reduction with that
non-terminal as the LHS.
Now we come to the idea of closure, which I illustrate in the diagram with the ε's.
Please note that this is rough; we are not doing regular expressions again, but I hope
this will help you understand the idea of closure, which, like the ε-transitions in the
NFA construction, leads to nondeterminism.
Look at the start state. The placement of the dot indicates that we next need to see an
E. Since E is a nonterminal, we won't see it in the input, but will instead have to
generate it via a production. Thus by looking for an E, we are also looking for any
production that has E on the LHS. This is indicated by the two ε's leaving the top left
box. Similarly, there are ε's leaving the other three boxes where the dot is immediately
to the left of a nonterminal.
I0, I1, etc are called (LR(0)) item sets, and the collection with the arcs (i.e., the DFA)
is called the LR(0) automaton.
Now we put the diagram to use to parse id+id, as shown in the following table. The
symbols column is not needed since it can be determined from the stack, but it is useful
for understanding. The first edition merges the stack and symbols columns, but I think
it is clearer when they are separate as in the 2nd edition.

Stack   Symbols   Input    Action
0                 id+id$   Shift to 3
03      id        +id$     Reduce by T→id
02      T         +id$     Reduce by E→T
01      E         +id$     Shift to 4
014     E+        id$      Shift to 3
0143    E+id      $        Reduce by T→id
0145    E+T       $        Reduce by E→E+T
01      E         $        Accept

We start in the initial state with the stack empty and the input full. The $'s are just
end markers. From state 0, called I0 in my diagram (following the book they are called
I's since they are sets of items), we can only shift in the id (the nonterminals will
appear in the symbols column). This brings us to I3 so we push a 3 onto the stack.
In I3 we see a completed production in the box (the dot is on the extreme right). Thus
we can reduce by this production. To reduce we pop the stack for each symbol in the
RHS since we are replacing the RHS by the LHS; this time the RHS has one symbol
so we pop the stack once and also remove one symbol. The stack corresponds to
moves so we are undoing the move to 3 and we are temporarily in 0 again. But the
production has a T on the LHS so we follow the T edge from 0 to 2, push T onto
Symbols, and push 2 onto the stack.
The next two steps are shifts of + and id. We then reduce the id to T and are in step 5
ready for the big one.
The reduction in 5 has three symbols on the RHS so we pop (back up) three times,
again temporarily landing in 0, but the LHS (E) puts us in 1.
Perfect! We have just E as a symbol and the input is empty so we are ready to reduce
by E'→E, which signifies acceptance.
Say I is a set of items and one of these items is A→α·Bβ. This item represents the
parser having seen α and records that the parser might soon see the remainder of the
RHS. For that to happen the parser must first see a string derivable from B. Now
consider any production starting with B, say B→γ. If the parser is to making progress
on A→α·Bβ, it will need to be making progress on one such B→·γ. Hence we want to
add all the latter productions to any state that contains the former. We formalize this
into the notion of closure.
1. Initialize CLOSURE(I) = I
2. If A → α · B β is in CLOSURE(I) and B → γ is a production, then add B → · γ
to the closure and repeat.
CLOSURE({E' → ·E}) contains 7 elements. The 6 new elements are the 6 original
productions, each with a dot right after the arrow.
If X is a grammar symbol, then moving from A→α·Xβ to A→αX·β signifies that the
parser has just processed (input derivable from) X. The parser was in the former
position and X was on the input; this caused the parser to go to the latter position. We
(almost) indicate this by writing GOTO(A→α·Xβ,X) is A→αX·β. I said almost
because GOTO is actually defined from item sets to item sets not from items to items.
I really believe this is very clear, but I understand that the formalism makes it seem
confusing. Let me begin with the idea.
We augment the grammar and get this one new production; take its closure. That is
the first element of the collection; call it Z. Try GOTOing from Z, i.e., for each
grammar symbol, consider GOTO(Z,X); each of these (almost) is another element of
the collection. Now try GOTOing from each of these new elements of the collection,
etc. Start with jane smith, add all her friends F, then add the friends of everyone in F,
called FF, then add all the friends of everyone in FF, etc
This GOTO gives exactly the arcs in the DFA I constructed earlier. The formal
treatment does not include the NFA, but works with the DFA from the beginning.
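Here is a rough sketch (mine) of CLOSURE and GOTO operating on sets of items. An item is represented as a head, a body, and a dot position; the nonterminals are exactly the keys of the grammar map; all the names are assumptions for illustration only.

import java.util.*;

class Lr0Items {
    record Item(char head, String body, int dot) {
        char afterDot() { return dot < body.length() ? body.charAt(dot) : 0; }  // 0 if dot at end
    }

    // Step 2 of the closure algorithm above, repeated until nothing new is added.
    static Set<Item> closure(Set<Item> items, Map<Character, List<String>> grammar) {
        Set<Item> result = new HashSet<>(items);
        Deque<Item> work = new ArrayDeque<>(items);
        while (!work.isEmpty()) {
            char b = work.pop().afterDot();
            if (grammar.containsKey(b))                      // dot sits just before nonterminal B
                for (String body : grammar.get(b)) {
                    Item added = new Item(b, body, 0);       // add B -> . body
                    if (result.add(added)) work.push(added);
                }
        }
        return result;
    }

    // GOTO(I, X): advance the dot over X in every item that allows it, then take the closure.
    static Set<Item> goTo(Set<Item> items, char x, Map<Character, List<String>> grammar) {
        Set<Item> moved = new HashSet<>();
        for (Item it : items)
            if (it.afterDot() == x)
                moved.add(new Item(it.head(), it.body(), it.dot() + 1));
        return closure(moved, grammar);
    }
}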
Homework:
1. Construct the LR(0) set of items for the following grammar (which produces
simple postfix expressions).
S → S S + | S S * | a
Don't forget to augment the grammar.
2. Draw the DFA for this item set.
The DFA for our Main Example

Our main example is larger than the toy I did before. The NFA would have
2+4+2+4+2+4+2=20 states (a production with k symbols on the RHS gives k+1 N-states
since there are k+1 places to place the dot). This gives rise to 11 D-states. However, the
development in the book, which we are following now, constructs the DFA directly. The
resulting diagram is on the right.
Start constructing the diagram on the board. Begin with {E' → ·E}, take the closure,
and then keep applying GOTO.
The LR-parsing algorithm must decide when to shift and when to reduce (and in the
latter case, by which production). It does this by consulting two tables, ACTION and
GOTO. The basic algorithm is the same for all LR parsers, what changes are the
tables ACTION and GOTO.
Technical point that may, and probably should, be ignored: our GOTO was defined on
pairs [item-set,grammar-symbol]. The new GOTO is defined on pairs
[state,nonterminal]. A state (except the initial state) is an item set together with the
grammar symbol that was used to generate it (via the old GOTO). We will not use the
new GOTO on terminals so we just define it on nonterminals.
The ACTION entry for a state and terminal a is one of the following.
1. Shift j. The terminal a is shifted onto the stack and the parser enters state j.
2. Reduce A → α. The parser reduces α on the TOS to A.
3. Accept.
4. Error.
So ACTION is the key to deciding shift vs. reduce. We will soon see how this table is
computed for SLR.
This formalism is useful for stating the actions of the parser precisely, but I believe it
can be explained without it. A configuration of the parser is a pair
( s0 s1 ... sm , ai ai+1 ... an $ )
where the s's are states and the a's input symbols. This state could also be represented
by the right-sentential form
X1 ... Xm ai ... an
where Xj is the grammar symbol associated with state sj. All arcs into a state are labeled
with this symbol. The initial state has no symbol.
The parser consults the combined ACTION-GOTO table for its current state (TOS)
and next input symbol, formally this is ACTION[sm,ai], and proceeds as follows based
on the value in the table. We have done this informally just above; here we use the
formal treatment.
1. Shift s. The state s is pushed and becomes the new TOS (the shifted input symbol
ai is the symbol associated with s). The new configuration is
( s0 ... sm s , ai+1 ... an $ )
2. Reduce by A → β. The top |β| states are popped, exposing some state t; the state
GOTO[t,A] is then pushed. The new configuration is
( s0 ... sm-|β| GOTO[t,A] , ai ... an $ )
3. Accept. Parsing is complete.
4. Error. The parser calls an error-recovery routine.
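Here is a rough sketch (mine) of the table-driven driver loop just described. The encodings of the ACTION entries (Shift, Reduce, Accept) and the table representation are my own assumptions, and the tables themselves are assumed to have been built elsewhere (e.g., by the SLR construction that follows).

import java.util.*;

class LrDriver {
    interface Action {}
    record Shift(int state) implements Action {}
    record Reduce(char head, int bodyLength) implements Action {}
    record Accept() implements Action {}

    static boolean parse(Map<Integer, Map<Character, Action>> action,
                         Map<Integer, Map<Character, Integer>> goTo,
                         String input) {
        Deque<Integer> stack = new ArrayDeque<>(List.of(0));   // start in state 0
        int i = 0;
        while (true) {
            char a = i < input.length() ? input.charAt(i) : '$';
            Action act = action.getOrDefault(stack.peek(), Map.of()).get(a);
            if (act instanceof Shift s) { stack.push(s.state()); i++; }
            else if (act instanceof Reduce r) {
                for (int k = 0; k < r.bodyLength(); k++) stack.pop();   // pop |RHS| states
                stack.push(goTo.get(stack.peek()).get(r.head()));       // then take the GOTO
            }
            else if (act instanceof Accept) return true;
            else return false;                                          // blank entry: error
        }
    }
}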
A Terminology Point
The book (both editions) and the rest of the world seem to use GOTO for both the
function defined on item sets and the derived function on states. As a result we will be
defining GOTO in terms of GOTO. (I notice that the first edition uses goto for both; I
have been following the second edition, which uses GOTO. I don't think this is a real
problem.) Item sets are denoted by I or Ij, etc. States are denoted by s or si or (get
ready) i. Indeed both editions use i in this section. The advantage is that on the stack we
placed integers (i.e., i's) so this is consistent. The disadvantage is that we are defining
GOTO(i,A) in terms of GOTO(Ii,A), which looks confusing. Actually, we view the
old GOTO as a function and the new one as an array (mathematically, they are the
same) so we actually write GOTO(i,A) and GOTO[Ii,A].
The GOTO columns can also be read directly off the DFA. Since there is an E-
transition (arc labeled E) from I0 to I1, the column labeled E in row 0 contains a 1.
Since the column labeled + is blank for row 7, we see that it would be an error if we
arrived in state 7 when the next input character is +.
Finally, if we are in state 1 when the input is exhausted ($ is the next input character),
then we have successfully parsed the input.
Example: The table below shows the actions when SLR parsing id*id+id. On the
blackboard let's do id+id*id and see how the precedence is handled.

Stack    Symbols   Input        Action
0                  id*id+id$    shift
05       id        *id+id$      reduce by F→id
03       F         *id+id$      reduce by T→F
02       T         *id+id$      shift
027      T*        id+id$       shift
0275     T*id      +id$         reduce by F→id
027 10   T*F       +id$         reduce by T→T*F
02       T         +id$         reduce by E→T
01       E         +id$         shift
016      E+        id$          shift
0165     E+id      $            reduce by F→id
0163     E+F       $            reduce by T→F
0169     E+T       $            reduce by E→E+T
01       E         $            accept

Homework: Construct the SLR parsing table for the following grammar
S → S S + | S S * | a
You already constructed the LR(0) automaton for this example in the previous
homework.

4.6.5: Viable Prefixes

Skipped.

4.7: More Powerful LR Parsers
We consider very briefly two alternatives to SLR, canonical-LR or LR, and
lookahead-LR or LALR.
SLR used the LR(0) items, that is the items used were productions with an embedded
dot, but contained no other (lookahead) information. The LR(1) items contain the
same productions with embedded dots, but add a second component, which is a
terminal (or $). This second component becomes important only when the dot is at the
extreme right (indicating that a reduction can be made if the input symbol is in the
appropriate FOLLOW set). For LR(1) we do that reduction only if the input symbol is
exactly the second component of the item. This finer control of when to perform
reductions enables the parsing of a larger class of grammars.
Skipped.
Skipped.
4.7.4: Constructing LALR Parsing Tables
For LALR we merge various LR(1) item sets together, obtaining nearly the LR(0)
item sets we used in SLR. LR(1) items have two components, the first, called the core,
is a production with a dot; the second a terminal. For LALR we merge all the item sets
that have the same cores by combining the 2nd components (thus permitting
reductions when any of these terminals is the next input symbol). Thus we obtain the
same number of states (item sets) as in SLR since only the cores distinguish item sets.
Unlike SLR, we limit reductions to occurring only for certain specified input symbols.
LR(1) gives finer control; it is possible for the LALR merger to have reduce-reduce
conflicts when the LR(1) item sets on which it is based are conflict free.
Although these conflicts are possible, they are rare and the size reduction from LR(1)
to LALR is quite large. LALR is the current method of choice for bottom-up, shift-
reduce parsing.
Skipped.
Skipped.
Skipped.
Skipped.
The structure of the user input is similar to that for lex, but instead of regular
definitions, one includes productions with semantic actions.
There are ways to specify associativity and precedence of operators. It is not done by
introducing extra grammar symbols, as we did to enforce precedence in a pure grammar,
but rather via declarations.
Skipped.
Skipped
Skipped
Again we are redoing, more formally and completely, things we briefly discussed
when breezing over chapter 2.
Recall that a syntax-directed definition (SDD) adds semantic rules to the productions
of a grammar. For example to the production T → T1 / F we might add the rule
T.code = T1.code || F.code || '/'
if we were doing an infix to postfix translator.
Rather than constantly copying ever larger strings to finally output at the root of the
tree after a depth first traversal, we can perform the output incrementally by
embedding semantic actions within the productions themselves. The above example
becomes
T → T1 / F { print '/' }
Since we are generating postfix, the action comes at the end (after we have generated
the subtrees for T1 and F, and hence performed their actions). In general the actions
occur within the production, not necessarily after the last symbol.
For SDD's we conceptually need to have the entire tree available after the parse so
that we can run the depth first traversal. (It is depth first since we are doing postfix;
we will see other orders shortly.) Semantic actions can be performed during the parse,
without saving the tree.
Terminals can have synthesized attributes, which are given to them by the lexer (not the
parser). There are no rules in an SDD giving values to attributes for terminals.
Terminals do not have inherited attributes. A nonterminal A can have both inherited
and synthesized attributes. The difference is how they are computed by rules
associated with a production at a node N of the parse tree. We sometimes refer to the
production at node N as production N.
Draw the parse tree for 7+6/3 on the board and verify that L.val is 9, the value of the
expression.
Definition: This example uses only synthesized attributes; such SDDs are called S-
attributed and have the property that the rules give the attribute of the LHS in terms of
attributes of the RHS.
Inherited attributes are more complicated since the node N of the parse tree with
which the attribute is associated (which is also the natural node to store the value) does
not contain the production with the corresponding semantic rule.
Note that when viewed from the parent node P (the site of the semantic rule), the
inherited attribute depends on values at P and at P's children (the same as for
synthesized attributes). However, and this is crucial, the nonterminal B is the LHS of
a child of P and hence the attribute is naturally associated with that child. It is possibly
stored there and is shown there in the diagrams below.
Definition: Often the attributes are just evaluations without side effects. In such cases
we call the SDD an attribute grammar.
Remark: Do 7+6/3 on board using the SDD from the end of the previous lecture
(should have been done last time).
5.1.2: Evaluating an SDD at the Nodes of a Parse Tree
If we are given an SDD and a parse tree for a given sentence, we would like to
evaluate the annotations at every node. Since, for synthesized annotations parents can
depend on children, and for inherited annotations children can depend on parents,
there is no guarantee that one can in fact find an order of evaluation. The simplest
counterexample is the single production A→B with synthesized attribute A.syn,
inherited attribute B.inh, and rules A.syn=B.inh and B.inh=A.syn+1. This means to
evaluate A.syn at the parent node we need B.inh at the child and vice versa. Even
worse it is very hard to tell, in general, if every sentence has a successful evaluation
order.
All this not withstanding we will not have great difficulty because we will not be
considering the general case.
We computed the values to put in this tree for 7+6/3 and on the right is (7-6).
Homework: 5.1
When doing top-down parsing, we need to avoid left recursion. Consider the grammar
below, which is the result of removing the left recursion; again its parse tree is shown
on the right. Try not to look at the semantic rules for the moment.
T → F T'
T' → * F T' | ε
F → num

Production      Semantic Rules                  Type
T → F T'        T'.lval = F.val                 Inherited
                T.val = T'.tval                 Synthesized
T' → * F T'1    T'1.lval = T'.lval * F.val      Inherited
                T'.tval = T'1.tval              Synthesized
T' → ε          T'.tval = T'.lval               Synthesized
F → num         F.val = num.lexval              Synthesized
Now where on the tree should we do the multiplication 3*5? There is no node that has
3 and * and 5 as children. The second production is the one with the * so that is the
natural candidate for the multiplication site. Make sure you see that this production
(for 3*5) is associated with the blue highlighted node in the parse tree. The right
operand (5) can be obtained from the F that is the middle child of this T'. F gets the
value from its child, the number itself; this is an example of the simple synthesized
case we have already seen, F.val=num.lexval (see the last semantic rule in the table).
But where is the left operand? It is located at the sibling of T' in the parse tree, i.e., at
the F immediately to the left of this T'. This F is not mentioned in the production associated
with the T' node we are examining. So, how does T' get F.val from its sibling? The
common parent, in this case T, can get the value from F and then our node can inherit
the value from its parent.
Bingo! ... an inherited attribute. This can be accomplished by having the following
two rules at the node T.
T.tmp = F.val
T'.lval = T.tmp
Since we have no other use for T.tmp, we combine the above two rules into the first
rule in the table.
Now lets look at the second multiplication (3*5)*4, where the parent of T' is another
T'. (This is the normal case. When there are n multiplies, n-1 have T' as parent and
only one has T).
The red-highlighted T' is the site for the multiplication. However, it needs as left
operand, the product 3*5 that its parent can calculate. So we have the parent (another
T' node, the blue one in this case) calculate the product and store it as an attribute of
its right child namely the red T'. That is the first rule for T' in the table.
We have now explained the first, third, and last semantic rules. These are enough to
calculate the answer. Indeed, if we trace it through, 60 does get evaluated and stored
in the bottom right T', the one associated with the ε-production. Our remaining goal is
to get the value up to the root where it represents the evaluation of this term T and can
be combined with other terms to get the value of a larger expression.
Going up is easy, just synthesize. I named the attribute tval, for term-value. It is
generated at the ε-production from the lval attribute (which at this node is not a good
name) and propagated back up. At the T node it is called simply val. At the right we
see the annotated parse tree for this input.
Homework: Extend this SDD to handle the left-recursive, more complete expression
evaluator given earlier in this section. Don't forget to eliminate the left recursion first.
It clearly requires some care to write the annotations.
Another question is how does the system figure out the evaluation order if one exists?
That is the subject of the next section.
Remark: Consider the identifier table. The lexer creates it initially, but as the
compiler performs semantic analysis and discovers more information about various
identifiers, e.g., type and visibility information, the table is updated. One could think
of this as some inherited/synthesized attribute pair that during each phase of analysis is
pushed down and back up the tree. However, it is not implemented this way; the table
is made a global data structure that is simply updated. The compiler writer must then
ensure manually that the updates are performed in an order respecting any
dependences.
Each green arrow points to the attribute calculated from the attribute at the tail of the
arrow. These arrows either go up the tree one level or stay at a node. That is because a
synthesized attribute can depend only on the node where it is defined and that node's
children. The computation of the attribute is associated with the production at the
node at its arrowhead. In this example, each synthesized attribute depends on only one
other, but that is not required.
Each red arrow also points to the attribute calculated from the attribute at the tail.
Note that two red arrows point to the same attribute. This indicates that the common
attribute at the arrowheads, depends on both attributes at the tails. According to the
rules for inherited attributes, these arrows either go down the tree one level, go from a
node to a sibling, or stay within a node. The computation of the attribute is associated
with the production at the parent of the node at the arrowhead.
The graph just drawn is called the dependency graph. In addition to being generally
useful in recording the relations between attributes, it shows the evaluation order(s)
that can be used. Since the attribute at the head of an arrow depends on the one
at the tail, we must evaluate the head attribute after evaluating the tail attribute.
Thus what we need is to find an evaluation order respecting the arrows. This is called
a topological sort. The rule is that the needed ordering can be found if and only if
there are no (directed) cycles. The algorithm is simple: repeatedly choose a node with
no incoming arrows, place it next in the order, and delete it together with its outgoing
arrows.
If the algorithm terminates with nodes remaining, there is a directed cycle and no
suitable evaluation order.
If the algorithm succeeds in deleting all the nodes, then the deletion order is a suitable
evaluation order and there were no directed cycles.
Given an SDD and a parse tree, it is easy to tell (by doing a topological sort) whether
a suitable evaluation exists (and to find one).
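Here is a minimal sketch (mine) of this topological sort, i.e., Kahn's algorithm: repeatedly remove and output a node with no incoming arrows. The graph representation, mapping each node to the nodes that depend on it, is an assumption for the sketch.

import java.util.*;

class TopoSort {
    static List<Integer> order(Map<Integer, List<Integer>> dependents, Set<Integer> nodes) {
        Map<Integer, Integer> indegree = new HashMap<>();
        for (int n : nodes) indegree.put(n, 0);
        for (List<Integer> outs : dependents.values())
            for (int m : outs) indegree.merge(m, 1, Integer::sum);

        Deque<Integer> ready = new ArrayDeque<>();          // nodes with no incoming arrows
        for (int n : nodes) if (indegree.get(n) == 0) ready.add(n);

        List<Integer> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            int n = ready.remove();
            order.add(n);                                    // safe to evaluate n now
            for (int m : dependents.getOrDefault(n, List.of()))
                if (indegree.merge(m, -1, Integer::sum) == 0) ready.add(m);
        }
        if (order.size() != nodes.size())
            throw new IllegalStateException("cycle: no evaluation order exists");
        return order;
    }
}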
However, a very difficult problem is, given an SDD, are there any parse trees with
cycles in their dependency graphs, i.e., are there suitable evaluation orders
for all parse trees. Fortunately, there are classes of SDDs for which a suitable
evaluation order is guaranteed.
Since postorder corresponds to the actions of an LR parser when reducing the body of
a production to its head, it is often convenient to evaluate synthesized attributes during
an LR parse.
1. Synthesized.
2. Inherited from the left, and hence the name L-
attributed.
If the production is A → X1X2...Xn, then the inherited attributes for Xj can
depend only on
a. Inherited attributes of A, the LHS.
b. Any attribute of X1, ..., Xj-1, i.e. only on symbols to the left of Xj.
3. Attributes of Xj, *BUT* you must guarantee (separately) that these attributes
do not by themselves cause a cycle.
The picture shows that there is an evaluation order for L-attributed definitions (again
assuming no case 3). More formally, do a depth first traversal of the tree. The first
time you visit a node, evaluate its inherited attributes (since you will know the value
of everything it depends on), and the last time you visit it, evaluate the synthesized
attributes. This is two-thirds of an Euler-tour traversal.
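As a small illustration (mine, not from the book), here is how the L-attributed SDD given earlier for T → F T', T' → * F T' | ε, F → num can be evaluated during a recursive-descent parse: the inherited attribute lval becomes a parameter and the synthesized attribute tval becomes the return value. Tokens are simplified to single digits.

class TermEvaluator {
    private final String input;
    private int pos = 0;

    TermEvaluator(String input) { this.input = input; }

    int T() {                        // T.val = T'.tval, with T'.lval = F.val
        int fval = F();
        return Tprime(fval);
    }

    private int Tprime(int lval) {   // T' -> * F T'1 : T'1.lval = T'.lval * F.val
        if (pos < input.length() && input.charAt(pos) == '*') {
            pos++;                   // consume '*'
            int fval = F();
            return Tprime(lval * fval);
        }
        return lval;                 // T' -> epsilon : T'.tval = T'.lval
    }

    private int F() {                // F -> num : F.val = num.lexval
        return input.charAt(pos++) - '0';
    }

    public static void main(String[] args) {
        System.out.println(new TermEvaluator("3*5*4").T());   // prints 60, as in the example
    }
}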
Homework: Suppose we have a production A → B C D. Each of the four
nonterminals has two attributes s, which is synthesized, and i, which is inherited. For
each set of rules below, tell whether the rules are consistent with (i) an S-attributed
definition, (ii) an L-attributed definition, (iii) any evaluation order at all.
5.2.5: Semantic Rules with Controlled Side Effects

When we have side effects such as printing or adding an entry to a table, we must
ensure that we have not added a constraint to the evaluation order that causes a cycle.

Production     Semantic Rule                 Type
D → T L        L.type = T.type               inherited
T → INT        T.type = integer              synthesized
L → L1 , ID    L1.type = L.type              inherited
               addType(ID.entry, L.type)     synthesized, side effect
L → ID         addType(ID.entry, L.type)     synthesized, side effect
For example, the left-recursive SDD shown in the table above propagates type
information from a declaration to entries in an identifier table.
The function addType adds the type information in the second argument to the
identifier table entry specified in the first argument. Note that the side effect, adding
the type info to the table, does not affect the evaluation order.
Draw the dependency graph on the board. Note that the terminal ID has
an attribute (given by the lexer) entry that gives its entry in the identifier table. The
nonterminal L has (in addition to L.type) a dummy synthesized attribute, say
AddType, that is a place holder for the addType() routine. AddType depends on the
arguments of addType(). Since the first argument is from a child, and the second is an
inherited attribute of this node, we have legal dependences for a synthesized attribute.
Homework: For the SDD above, give the annotated parse tree for
INT a,b,c
================ Start Lecture #9 ================
Remark: See the new section Evaluating L-Attributed Definitions in section 5.2.4.
Assume we have two functions Leaf(op,val) and Node(op,c1,...,cn) that create leaves
and interior nodes respectively of the syntax tree. Leaf is called for terminals. Op is
the label of the node (op for operation) and val is the lexical value of the token. Node
is called for nonterminals and the ci's refer (are pointers) to the children.

The first table below shows a left-recursive grammar that is S-attributed (so all
attributes are synthesized). Try this for x-2+y and see that we get the syntax tree.

Production   Semantic Rules                            Type
E → E1 + T   E.node = new Node('+', E1.node, T.node)   Synthesized
E → E1 - T   E.node = new Node('-', E1.node, T.node)   Synthesized
E → T        E.node = T.node                           Synthesized
T → ( E )    T.node = E.node                           Synthesized
T → ID       T.node = new Leaf(ID, ID.entry)           Synthesized
T → NUM      T.node = new Leaf(NUM, NUM.val)           Synthesized

When we eliminate the left recursion, we get the second table below. It is a good
illustration of dependencies. Follow it through and see that you get the same syntax
tree as for the left-recursive version.

Production     Semantic Rules                              Type
E → T E'       E.node = E'.syn                             Synthesized
               E'.node = T.node                            Inherited
E' → + T E'1   E'1.node = new Node('+', E'.node, T.node)   Inherited
               E'.syn = E'1.syn                            Synthesized
E' → - T E'1   E'1.node = new Node('-', E'.node, T.node)   Inherited
               E'.syn = E'1.syn                            Synthesized
E' → ε         E'.syn = E'.node                            Synthesized
T → ( E )      T.node = E.node                             Synthesized
T → ID         T.node = new Leaf(ID, ID.entry)             Synthesized
T → NUM        T.node = new Leaf(NUM, NUM.val)             Synthesized
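To make Leaf and Node concrete, here is a minimal Java sketch. The class names and fields are mine (hypothetical), not something the labs require; it simply mirrors the java/object-like notation used in the tables above.

import java.util.Arrays;
import java.util.List;

abstract class SyntaxNode {
    final String op;                       // label of the node, e.g. "+", "ID", "NUM"
    SyntaxNode(String op) { this.op = op; }
}

class Leaf extends SyntaxNode {            // created for terminals
    final Object val;                      // lexical value (identifier-table entry, numeric value, ...)
    Leaf(String op, Object val) { super(op); this.val = val; }
}

class Node extends SyntaxNode {            // created for nonterminals
    final List<SyntaxNode> children;       // the ci's: references to the children
    Node(String op, SyntaxNode... c) { super(op); this.children = Arrays.asList(c); }
}

class BuildExample {
    public static void main(String[] args) {
        // The syntax tree for x-2+y, built exactly as the semantic rules dictate.
        SyntaxNode root = new Node("+",
                new Node("-", new Leaf("ID", "x"), new Leaf("NUM", 2)),
                new Leaf("ID", "y"));
        System.out.println(root.op);       // prints +
    }
}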
Remarks:
1. In the first edition (section 8.1) we have nearly the same table. The main
difference is the switch from algol/pascal-like notation (mknode) to a
java/object-like new.
2. These two functions, new Node and new Leaf (or their equivalent), are needed
for lab 3 (part 4), if you are doing a recursive-descent parser. When processing
a production
i. Create a parse tree node for the LHS.
ii. Call subroutines for RHS symbols and connect the resulting nodes to the
node created in i.
iii. Return a reference to the new node so the parent can hook it into the
parse tree.
3. It is the lack of a call to new in the third and fourth productions that causes the
(abstract) syntax tree to be produced rather than the parse (concrete syntax)
tree.
4. Production compilers do not produce a parse tree, but only the syntax tree. The
syntax tree is smaller, and hence more (space and time) efficient for subsequent
passes that walk the tree. The parse tree might be (I believe) very slightly easier
to construct as you don't have to decide which nodes to produce; you simply
produce them all.
This course emphasizes top-down parsing (at least for the labs) and hence we must
eliminate left recursion. The resulting grammars need inherited attributes, since
operations and operands are in different productions. But sometimes the language
itself demands inherited attributes. Consider two ways to
describe a 3x4, two-dimensional array.
array [3] of array [4] of int and int[3][4]
Assume that we want to produce a structure like the one on the right
for the array declaration given above. This structure is generated by calling a function
array(num,type). Our job is to create an SDD so that the function gets called with the
correct arguments.
For the first language representation of arrays (found in Ada and similar to that in lab
3), it is easy to generate an S-attributed (non-left-recursive) grammar based on
A → ARRAY [ NUM ] OF A | INT | FLOAT
This is shown in the first table below.

Production                Semantic Rule
A → ARRAY [ NUM ] OF A1   A.t = array(NUM.val, A1.t)
A → INT                   A.t = integer
A → FLOAT                 A.t = float

On the board draw the parse tree and see that the simple synthesized attributes above
suffice.

For the second language representation of arrays (the C-style), we need some smarts
(and some inherited attributes) to move the int all the way to the right. Fortunately, the
result, shown in the second table below, is L-attributed and therefore all is well.

Production       Semantic Rules                Type
T → B C          T.t = C.t                     Synthesized
                 C.b = B.t                     Inherited
B → INT          B.t = integer                 Synthesized
B → FLOAT        B.t = float                   Synthesized
C → [ NUM ] C1   C.t = array(NUM.val, C1.t)    Synthesized
                 C1.b = C.b                    Inherited
C → ε            C.t = C.b                     Synthesized
Homework: 5.6
The idea is that instead of the SDD approach, which requires that we build a parse tree
and then perform the semantic rules in an order determined by the dependency graph,
we can attach semantic actions to the grammar (as in chapter 2) and perform these
actions during parsing, thus saving the construction of the parse tree.
But except for very simple languages, the tree cannot be eliminated. Modern
commercial quality compilers all make multiple passes over the tree, which is actually
the syntax tree (technically, the abstract syntax tree) rather than the parse tree (the
concrete syntax tree).
5.4.1: Postfix Translation Schemes
If parsing is done bottom up and the SDD is S-attributed, one can generate an SDT
with the actions at the end (hence, postfix). In this case the action is performed at the
same time as the RHS is reduced to the LHS.
Skipped.
Skipped
Skipped
Skipped
1. Build the parse tree and annotate. Works as long as no cycles are present
(guaranteed by L- or S-attributed).
2. Build the parse tree, add actions, and execute the actions in preorder. Works
for any L-attributed definition. Can add actions based on the semantic rules of
the SDD.
3. Translate During Recursive Descent Parsing. See below.
4. Generate Code on the Fly. Also uses recursive descent, but is restrictive.
5. Implement an SDT during LL-parsing. Skipped.
6. Implement an SDT during LR-parsing of an LL Language. Skipped.
Recall that in recursive-descent parsing there is one procedure for each nonterminal.
Assume the SDD is L-attributed. Pass the procedure the inherited attributes it might
need (different productions with the same LHS need different attributes). The
procedure keeps variables for attributes that will be needed (inherited for nonterminals
in the body; synthesized for the head). Call the procedures for the nonterminals.
Return all synthesized attributes for this nonterminal.
analyze (tree-node)
This procedure is basically a big switch statement where the cases correspond to the
different productions in the grammar. The tree-node is the LHS of the production and
the children are the RHS. So by first switching on the tree-node and then inspecting
enough of the children, you can tell the production.
As described in 5.5.1 above, you have received as parameters (in addition to tree-
node), the attributes you are to inherit. You then call yourself recursively, with the
tree-node argument set to your leftmost child, then call again using the next child, etc.
Each time, you pass to the child the attributes it needs to inherit (You may be giving it
too many since you know the nonterminal represented by this child but not the
production; you could find out the production by examining the child's children, but
probably don't bother doing so.)
When each child returns, it supplies as its return value the synthesized attributes it is
passing back to you.
After the last child returns, you return to your caller, passing back the synthesized
attributes you are to calculate.
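A minimal Java sketch of this pattern; the TreeNode interface, the production strings, and the use of maps to bundle attributes are all my own illustrative choices:

import java.util.Map;

interface TreeNode {            // hypothetical parse-tree interface
    String production();        // which production this node derives
    TreeNode child(int i);      // the i-th child (RHS symbol)
}

class Analyzer {
    // Inherited attributes come in as a map; synthesized attributes are returned as a map.
    Map<String, Object> analyze(TreeNode n, Map<String, Object> inherited) {
        switch (n.production()) {
            case "E -> T E'": {
                Map<String, Object> t = analyze(n.child(0), inherited);
                // E'.node = T.node (an inherited attribute passed to the second child)
                Map<String, Object> ePrime = analyze(n.child(1), Map.of("node", t.get("node")));
                // E.node = E'.syn (the synthesized attribute passed back to our caller)
                return Map.of("node", ePrime.get("syn"));
            }
            // ... one case per production ...
            default:
                throw new IllegalStateException("unknown production " + n.production());
        }
    }
}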
Variations
1. Instead of a giant switch, you could have separate routines for each nonterminal
as done in the parser and just switch on the productions having this nonterminal
as LHS.
2. You could have separate routines for each production (requires looking 2-deep,
as mentioned above).
3. If you like actions instead of rules, perform the actions where indicated in the
SDT.
4. Global variables can be used (with care) instead of parameters.
5. As illustrated earlier in the notes, you can call routines instead of setting an
attribute (see addType in 5.2.5).
The difference between a syntax DAG and a syntax tree is that the former can
have undirected cycles. DAGs are useful where there are multiple, identical portions
in a given input. The common case of this is for expressions where there often are
common subexpressions. For example in the expression
X+a+b+c-X+(a+b+c)
each individual variable is a common subexpression. But a+b+c is not since the first
occurrence has the X already added. This is a real difference when one considers the
possibility of overflow or of loss of precision. The easy case is
x+y*z*w-(q+y*z*w)
where y*z*w is a common subexpression.
It is easy to find these. The constructor Node() above checks if an identical node
exists before creating a new one. So Node ('/',left,right) first checks if there is a node
with op='/' and children left and right. If so, a reference to that node is returned; if not,
a new node is created as before.
Homework: Construct the DAG for
((x+y)-((x+y)*(x-y)))+((x+y)*(x-y))
Often one stores the tree or DAG in an array, one entry per node. Then the index of a
node in the array is called the node's value-number. Searching an unordered
array is slow; there are many better data structures to use. Hash tables are a good
choice.
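Here is a small Java sketch of that idea (all names are mine): nodes are identified by value numbers, and a hash table maps an (operator, children) key to the value number of an already-existing node, if any.

import java.util.HashMap;
import java.util.Map;

class DagBuilder {
    // Key describing an interior node: its operator plus the value numbers of its children.
    private record Key(String op, int left, int right) {}

    private final Map<Key, Integer> table = new HashMap<>();  // key -> value number
    private int nextValueNumber = 0;

    // Returns the value number of the node (op, left, right), creating it only if
    // no identical node already exists.
    int node(String op, int left, int right) {
        Key k = new Key(op, left, right);
        Integer existing = table.get(k);
        if (existing != null) return existing;    // reuse the common subexpression
        table.put(k, nextValueNumber);
        return nextValueNumber++;
    }
}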
If we are starting with a DAG (or syntax tree if less aggressive), then transforming
into 3-address code is just a topological sort and an assignment of a 3-address
operation with a new name for the result to each interior node (the leaves already have
names and values).
We use the term 3-address when we view the (intermediate-) code as having
one elementary operation with three operands, each of which is an address. Typically
two of the addresses represent source operands or arguments of the operation and the
third represents the result. Some of the 3-address operations have fewer than three
addresses; we simply think of the missing addresses as unused (or ignored) fields in
the instruction.
Possible addresses
1. (Source program) Names. Really the intermediate code would contain a
reference to the (identifier) table entry for the name. For convenience, the
actual identifier is often written.
2. Constants. Again, this would often be a reference to a table entry. An important
issue is type conversion that will be discussed later. Type conversion also
applies to identifiers.
3. (Compiler-generated) Temporaries. Although it may at first seem
wasteful, modern practice assigns a new name to each temporary, rather than
reusing the same temporary. (Remember that a DAG node is considered one
temporary even if it has many parents.) Later phases can combine several
temporaries into one (e.g., if they have disjoint lifetimes).
In the list below, x, y, and z are addresses, i is an integer, and L is a symbolic label, as
used in chapter 2. The instructions can be thought of as numbered and the labels can
be converted to the numbers with another pass over the output or via backpatching,
which is discussed below.
1. Binary ops. x = y op z
2. Unary ops. x = op y (includes copy, where op is the identity f(x)=x)
3. Jump. goto L
4. Conditional unary op jumps. if x goto L and ifFalse x goto L
5. Conditional binary op jumps. if x relop y goto L
6. Procedure/Function calls and returns. param x, call p,n, y = call p,n, return, and return y
7. Indexed copy ops. x = y[i] and x[i] = y
8. Address and pointer ops. x = &y, x = *y, and *x = y
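For example (an illustration of mine, using only the forms above), the source statement x = (a+b) * (a+b) could be translated, working from a syntax tree, into

t1 = a + b
t2 = a + b
t3 = t1 * t2
x = t3

Starting from a DAG instead, the second addition would be recognized as a common subexpression: t1 = a + b, t2 = t1 * t1, x = t2.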
Homework: 8.1
Optimization to save a field. The result field of a quad is omitted in a triple since the
result is often a temporary.
If the result field of a quad is not a temporary then two triples may be needed: One to
do the operation and place the result into a temporary (which is not a field of the
instruction). The second operation is a copy operation from the temporary to the final
home. Recall that a copy does not use all the fields of a quad, so it fits into a triple
without omitting the result.
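To illustrate (my example), the single quad for a = b + c becomes two triples, because a, not a temporary, receives the result:

Quad:     (+, b, c, a)

Triples:  (0)  (+, b, c)
          (1)  (=, a, (0))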
With indirect triples we keep an array of pointers to the triples and reorder these
pointers rather than the triples themselves. This has two advantages.
1. The pointers are (probably) smaller than the triples so faster to move. This is a
generic advantage and could be used for quads and many other reordering
applications (e.g., sorting large records).
2. Since the triples don't move, the references they contain to past results remain
accurate. This is specific to triples (or similar situations).
Homework: 8.2
This has become a big deal in modern optimizers, but we will largely ignore it. The
idea is that you have all assignments go to unique (temporary) variables. So if the
code is
if x then y=4 else y=5
it is treated as though it was
if x then y1=4 else y2=5
The interesting part comes when y is used later in the program and the compiler must
choose between y1 and y2.
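The standard SSA answer (not developed further in these notes) is to insert a φ-function at the point where the two control-flow paths merge; it selects the value coming from whichever branch was actually taken:

if x then y1=4 else y2=5
y3 = φ(y1, y2)

Subsequent uses of y then refer to y3.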
A type expression is either a basic type or the result of applying a type constructor.
1. A basic type.
2. A type name.
3. Applying an array constructor array(number,type-expression). In 1e, the
number argument is an index set. This is where the C/java syntax is, in my
view, inferior to the more algol-like syntax of e.g., ada and lab 3
array [ index-type ] of type.
4. Applying a record constructor record(field names and types).
5. Applying a function constructor type→type.
6. The product type×type.
7. A type expression may contain variables (that are type expressions).
This generates a type error in Ada, which has name equivalence since the types of x
and MyX do not have the same name, although they have the same structure.
When you have an object of an anonymous type as in
x : array [5] of integer;
it doesn't have the same type as any other object even
y : array [5] of integer;
But x[2] has the same type as y[3]; both are integers.
6.3.3: Declarations
The following from 2ed uses C/Java array notation. The 1ed has pascal-like material
(section 6.2). Although I prefer Ada-like constructs as in lab 3, I realize that the class
knows C/Java best so like the authors I will go with the 2ed. I will try to give lab3-like
grammars as well.
The lab 3 grammar doesn't support records and the support for multidimensional
arrays is flawed (you can define the type, but not a (constrained) object). Here is the
part of the lab3 grammar that handles declarations of ints, reals and arrays.
declarations → declaration declarations | ε
declaration → object-declaration | type-declaration
object-declaration → defining-identifier : object-definition ;
object-definition → type-name | type-name [ NUMBER ]
type-declaration → TYPE defining-identifier IS ARRAY OF type-name ;
defining-identifier → IDENTIFIER
type-name → IDENTIFIER | INT | REAL
So that the tables below are not too wide, let's use shorter names
ds → d ds | ε
d → od | td
od → di : odef ;
odef → tn | tn [ NUM ]
td → TYPE di IS ARRAY OF tn ;
di → ID
tn → ID | INT | REAL
You might wonder why we want the unconstrained type. These types permit a
procedure to have a parameter that is an array of integers of unspecified size.
Remember that the declaration of a procedure specifies only the type of the parameter;
the object is determined at the time of the procedure call.
See section 8.2 in 1e (we are going back to chapter 8 from 6, so perhaps Doc Brown
from BTTF should give the lecture).
We are considering here only those types for which the storage can be computed at
compile time. For others, e.g., string variables, dynamic arrays, etc, we would only be
reserving space for a pointer to the structure; the structure itself is created at run time
and is discussed in the next chapter.
The idea is that the basic type determines the width of the data, and the size of an
array determines the height. These are then multiplied to get the size (area) of the
data.
The book uses semantic actions (i.e., a syntax directed translation SDT). I added the
corresponding semantic rules so that we have an SDD as well.
Remember that for an SDT, the placement of the actions within the production is
important. Since it aids reading to have the actions lined up in a column, we
sometimes write the production itself on multiple lines. For example the production
T→BC has the B and C on separate lines so that the action can be in between even
though it is written to the right of both.
The actions use global variables t and w to carry the base type (INT or FLOAT) and
width down to the ε-production, where they are then sent on their way up and become
multiplied by the various dimensions. In the rules I use inherited attributes bt and bw.
This is similar to the comment above that instead of having the identifier table passed
up and down via attributes, the bullet is bitten and a globally visible table is used.
The base types and widths are set by the lexer or are constants in the parser.
Production: C → [ NUM ] C1

Semantic Rules                        Type
C.type = array(NUM.value, C1.type)    Synthesized
C.width = NUM.value * C1.width        Synthesized
C1.bt = C.bt                          Inherited
C1.bw = C.bw                          Inherited

Semantic Actions
{ C.type = array(NUM.value, C1.type);
  C.width = NUM.value * C1.width; }
First let's ignore arrays. Then we get the simple table below. All the attributes are
synthesized, so we have an S-attributed grammar.

We dutifully synthesize the width attribute all the way to the top and then do not use
it. We shall use it in the next section when we consider multiple declarations.

Scalar Declarations

d → od            d.width = od.width
d → td            d.width = 0
od → di : odef ;  addType(di.entry, odef.type)
                  od.width = odef.width
di → ID           di.entry = ID.entry
odef → tn         odef.type = tn.type
                  odef.width = tn.width
                  tn.type must be integer or real
tn → INT          tn.type = integer
                  tn.width = 4
tn → REAL         tn.type = real
                  tn.width = 8
Recall that addType is viewed as a synthesized attribute since its parameters come
from the RHS, i.e., from children of this node.
It has a side effect (of modifying the identifier table) so we must be sure that we are
not depending on some order of evaluation that is not simply parent after children. In
fact, later when we evaluate expressions, we will need some of this information. We
will need to enforce declaration before use since we will be looking up information
that we are setting here. So in evaluation, we check the entry in the identifier table to
be sure that the type (for example) has already been set.
Note the comment tn.type must be integer or real. This is an example of a type check,
a key component of semantic analysis, that we will learn about soon. The reason for it
here is that we are only able to handle one-dimensional arrays with the lab 3 grammar.
(It would require a more complicated grammar, with other type-check rules, to handle
the general case found in Ada.)
Once again all attributes are synthesized (including those with side effects) so we
have an S-attributed SDD.
Array Declarations

d → od             d.width = od.width
d → td             d.width = 0
od → di : odef ;   addType(di.entry, odef.type)
                   od.width = odef.width
di → ID            di.entry = ID.entry
odef → tn          odef.type = tn.type
                   odef.width = tn.width
                   tn.type must be integer or real
odef → tn [ NUM ]  odef.type = array(NUM.value, getBaseType(tn.entry.type))
                   odef.width = sizeof(odef.type)
                              = NUM.value * sizeof(getBaseType(tn.entry.type))
                   tn must be ID
tn → ID            tn.entry = ID.entry
                   ID.entry.type must be array()
tn → INT           tn.type = integer
                   tn.width = 4
tn → REAL          tn.type = real
                   tn.width = 8
The top diagram on the right shows the result of applying the semantic actions in the
table above to the declaration
type t is array of real;
To summarize, the identifier table (and the others we have used) is not present when
the program is run. But there must be run-time storage for objects. We need to know the
address each object will have during execution. Specifically, we need to know its
offset from the start of the area used for object storage.
Multiple Declarations
The goal is to permit multiple declarations in the same procedure (or program or
function). For C/java like languages this can occur in two ways.
In either case we need to associate with the object being declared its storage location.
Specifically we include in the table entry for the object, its offset from the beginning
of the current procedure. We initialize this offset at the beginning of the procedure
and increment it after each object declaration.
The programming languages Ada and Pascal do not permit multiple objects in a single
declaration. Both languages are of the
object : type
school. Thus lab 3, which follows Ada, and 1e, which follows pascal, do not support
multiple objects in a single declaration. C/Java certainly does permit multiple objects,
but surprisingly the 2e grammar does not.
The lab 3 grammar has a list of declarations (each of which ends in a semicolon).
Shortening declarations to ds we have
ds → d ds | ε
The name top is used to signify that we work with the top symbol table (when we
have nested scopes for record definitions we need a stack of symbol tables). Top.put
places the identifier into this table with its type and storage location. We then bump
offset for the next variable or next declaration.
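A sketch of the kind of 2e-style snippet being referred to (reconstructed from the description above, so treat the exact attribute names as illustrative):

P → { offset = 0; } D
D → T id ; { top.put(id.lexeme, T.type, offset);
             offset = offset + T.width; } D1
D → ε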
Rather than figure out how to put this snippet together with the previous 2e code that
handled arrays, we will just present the snippets and put everything together on the lab
3 grammar.
In the function-def (fd) and procedure-def (pd) productions we add the inherited
attribute offset to declarations (ds.offset) and set it to zero. We then inherit this offset
down to an individual declaration. If this is an object declaration, we store it in the
entry for the identifier being declared and we increment the offset by the size of this
object. When we get to the end of the declarations (the ε-production), the offset
value is the total size needed. So we turn it around and send it back up the tree.
fd → FUNC di ( ps ) RET tn    ds.offset = 0                                  Inherited
     IS ds BEG s ss END ;

pd → PROC di ( ps ) IS ds     ds.offset = 0                                  Inherited
     BEG s ss END ;           s.next = newlabel()                            Inherited
                              ss.next = newlabel()                           Inherited
                              pd.code = s.code || label(s.next) || ss.code
                                        || label(ss.next)                    Synthesized

odef → tn [ NUM ]             odef.type = array(NUM.value,
                                          getBaseType(tn.entry.type))        Synthesized
                              odef.width = sizeof(odef.type)                 Synthesized
                              tn must be ID
Since records can essentially have a bunch of declarations inside, we only need add
T → RECORD { D }
to get the syntax right. For the semantics we need to push the environment and offset
onto stacks since the namespace inside a record is distinct from that on the outside.
The width of the record itself is the final value of (the inner) offset.
T → record {   { Env.push(top); top = new Env();
                 Stack.push(offset); offset = 0; }
    D }        { T.type = record(top); T.width = offset;
                 top = Env.pop(); offset = Stack.pop(); }
This does not apply directly to the lab 3 grammar since the grammar does not have
records. It does, however, have procedures that can be nested. If we wanted to
generate code for nested procedures we would need to stack the symbol table as done
here in 2e.
Homework: Determine the types and relative addresses for the identifiers in the
following sequence of declarations.
float x;
record { float x; float y; } rec;
float y;
We will use two attributes, code and addr. For a parse-tree node the code attribute
gives the three-address code to evaluate the input derived from that node. In particular,
the code at the root performs the entire assignment statement.
The attribute addr at a node is the address that holds the value calculated by the code
at the node. Recall that unlike real code for a real machine our 3-address code doesn't
reuse addresses.
As one would expect for expressions, all the attributes in the table to the right are
synthesized. The table is for the expression part of the lab 3 grammar. To save space
let's use as for assignment-statement, lv for lvalue, e for expression, t for term, and f
for factor. Since we will be covering arrays a little later, we do not consider LET
array-element.
The method in the previous section generates long strings as we walk the tree. By
using an SDT instead of an SDD, you can output parts of the string as each node is
processed.
The idea is that you associate the base address with the array name. That is, the offset
stored in the identifier table is the address of the first element of the array. The indices
and the array bounds are used to compute the amount, often called the offset
(unfortunately, we have already used that term), by which the address of the
referenced element differs from the base address.
For one dimensional arrays, this is especially easy: The address increment is the width
of each element times the index (assuming indexes start at 0). So the address of
A[i] is the base address of A plus i times the width of each element of A.
The width of each element is the width of what we have called the base type. So for
an ID the element width is sizeof(getBaseType(ID.entry.type)). For convenience we
define getBaseWidth by the formula
getBaseWidth(ID.entry) = sizeof(getBaseType(ID.entry.type))
Let us assume row major ordering. That is, the first element stored is A[0,0], then
A[0,1], ... A[0,k-1], then A[1,0], ... . Modern languages use row major ordering.
With the alternative column major ordering, after A[0,0] comes A[1,0], A[2,0], ... .
For two dimensional arrays the address of A[i,j] is the sum of three terms: the base address of A, i times the width of a row, and j times the width of an element.
Since the goal of the semantic rules is precisely generating such code, I could have
used a[i]. I did not because
1. Since we are restricted to one dimensional arrays, the full code generation for
the address of an element is not hard and
2. I thought it would be instructive to see the full address generation without
hiding some of it under the covers.
It was definitely instructive for me! The rules for addresses in 3-address code also
include
a = &b
a = *b
*a = b
which are other special forms. They have the same meaning as in C.
I believe the SDD on the right, if given a[3]=5 with a an integer array, will generate
t$1 = 3 * 4        // t$n are the temporary names from new Temp()
t$2 = &a
t$3 = t$2 + t$1
*t$3 = 5
I also added an & to the non-array production lv→ID so that both could be handled by
the same semantic rule for as→lv=e.
Homework: Write the SDD using the a[i] special form instead of the & and * special
forms.
This is an exciting moment. At long last we can compile a full program!
1. The language comes with a type system, i.e., a set of rules saying what types
can appear where.
2. The compiler assigns a type expression to parts of the source program.
3. The compiler checks that the type usage in the program conforms to the type
system for the language.
All type checking could be done at run time: The compiler generates code to do the
checks. Some languages have very weak typing; for example, variables can change
their type during execution. Often these languages need run-time checks. Examples
include lisp, snobol, apl.
A sound type system guarantees that all checks can be performed prior to execution.
This does not mean that a given compiler will make all the necessary checks.
1. We will learn type synthesis where the types of parts are used to infer the type
of the whole. For example, integer+real=real.
2. Type inference is very slick. The type of a construct is determined from usage.
This permits languages like ML to check types even though names need not be
declared.
We consider type checking for expressions. Checking statements is very similar. View
the statement as a function having its components as arguments and returning void.
A very strict type system would do no automatic conversion. Instead it would offer
functions for the programmer to explicitly convert between selected types. Then either
the program has compatible types or is in error.
However, we will consider a more liberal approach in which the language permits
certain implicit conversions that the compiler is to supply. This is called type
coercion. Explicit conversions supplied by the programmer are
called casts.
We continue to work primarily with the two types used in lab 3, namely
integer and real, and postulate a unary function denoted (real) that
converts an integer into the real having the same value. Nonetheless, we
do consider the more general case where there are multiple types some
of which have coercions (often called widening). For example in
C/Java, int can be widened to long, which in turn can be widened to
float as shown in the figure to the right.
Mathematically the hierarchy on the right is a partially ordered set (poset) in which each
pair of elements has a least upper bound (LUB). For many binary operators (all the
arithmetic ones we are considering, but not exponentiation) the two operands are
converted to the LUB. So adding a short to a char requires both to be converted to an
int. Adding a byte to a float requires the byte to be converted to a float (the float
remains a float and is not converted).
The steps for addition, subtraction, multiplication, and division are all essentially the
same: convert each operand, if necessary, to the LUB and then perform the arithmetic on
the (converted or original) values. Note that conversion requires the generation of
code.
1. LUB(t1,t2) returns the type that is the LUB of the two given types. It signals an
error if there is no LUB, for example if one of the types is an array.
2. widen(a,t,w,newcode,newaddr). Given an address a of type t, and a (hopefully)
wider type w, produce the instructions newcode needed so that the address
newaddr is the conversion of address a to type w.
LUB is simple: just look at the type lattice. If one of the type arguments is not in the
lattice, signal an error; otherwise find the lowest common ancestor.
widen is more interesting. It involves n² cases for n types. Many of these are error
cases (e.g., if t is wider than w). Below is the code for our situation with two possible
types, integer and real. The four cases consist of 2 nops (when t=w), one error (t=real;
w=integer), and one conversion (t=integer; w=real).
widen (a:addr, t:type, w:type, newcode:string, newaddr:addr)
    if t = w
        newcode = ""
        newaddr = a
    else if t = integer and w = real
        newaddr = new Temp()
        newcode = gen(newaddr = (real) a)
    else
        signal error
With these two functions it is not hard to modify the rules to catch type errors and
perform coercions for arithmetic expressions.
1. Maintain the type of each operand by defining type attributes for e, t, and f.
2. Coerce each operand to the LUB.
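For example (my sketch, for a production like e → e1 + t and using the names above), the rules become roughly:

e → e1 + t    e.type = LUB(e1.type, t.type)
              widen(e1.addr, e1.type, e.type, code1, addr1)
              widen(t.addr, t.type, e.type, code2, addr2)
              e.addr = new Temp()
              e.code = e1.code || t.code || code1 || code2 || gen(e.addr = addr1 + addr2)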
This requires that we have type information for the base entities, identifiers and
numbers. The lexer can supply the type of the numbers. We retrieve it via
get(NUM.type).
It is more interesting for the identifiers. We insert that information when we process
declarations. So we now have another semantic check: Is the identifier declared before
it is used?
I will use the function get(ID.type), which returns the type from the identifier table
and signals an error if it is not there. The original SDD for assignment statements
was here and the changes for arrays were here.
Assignment Statements With Type Checks and Coercions

lv → let ae    lv.addr = ae.addr
               lv.type = ae.type
               lv.code = ae.code

ae → ID [ e ]  ae.type = getBaseType(ID.entry.type)
               ae.t1 = new Temp()
               ae.t2 = new Temp()
               ae.addr = new Temp()
               ae.code = e.code || gen(ae.t1 = e.addr * getBaseWidth(ID.entry)) ||
                         gen(ae.t2 = &get(ID.lexeme)) ||
                         gen(ae.addr = ae.t2 + ae.t1)

e → t          e.addr = t.addr
               e.type = t.type
               e.code = t.code

t → f          t.addr = f.addr
               t.type = f.type
               t.code = f.code

f → ( e )      f.addr = e.addr
               f.type = e.type
               f.code = e.code

f → ID         f.addr = get(ID.lexeme)
               f.type = get(ID.type)
               f.code = ""

f → NUM        f.addr = get(NUM.lexeme)
               f.type = get(NUM.type)
               f.code = ""
Homework: Same question as the previous homework (What code is generated for
the program written above?). But the answer is different!
6.5.3: Overloading of Functions and Operators
Skipped.
Skipped.
Skipped.
Control flow includes the study of Boolean expressions, which have two roles.
1. They can be computed and treated similarly to integers or reals. One can declare
Boolean variables; there are Boolean constants and Boolean operators. There are
also relational operators that produce Boolean values from arithmetic operands.
From this point of view, Boolean expressions are similar to the expressions we
have already treated. Our previous semantic rules could be modified to generate
the code needed to evaluate these expressions.
2. They are used in certain statements that alter the normal flow of control. In this
regard, we have something new to learn.
One question that comes up with Boolean expressions is whether both operands need
be evaluated. If we need to evaluate A or B and find that A is true, must we evaluate
B? For example, consider evaluating
A=0 OR 3/A < 1.2
when A is zero.
This comes up sometimes in arithmetic as well. Consider A*F(x). If the compiler
knows that for this run A is zero, must it evaluate F(x)? Don't forget that functions can
have side effects.
6.6.2: Short-Circuit Code
This is also called jumping code. Here the Boolean operators AND, OR, and NOT do
not appear in the generated instruction stream.
Instead we just generate jumps to either the true branch or the false branch.
For a statement like
if x < 5 OR ( x > 10 AND x == y ) then x = 3
we get
if x < 5 goto L2
goto L3
L3: if x > 10 goto L4
goto L1
L4: if x == y goto L2
goto L1
L2: x = 3
Note that there are three extra gotos. One is a goto to the next statement. The two others
could be eliminated by using ifFalse.
Remark: As mentioned before, 6.6 in the notes is 6.6 in 2e and 8.4 in 1e. However, the
third-level material is not in the same order. In particular this section (6.6.6) is very
early in 8.4.
If there are Boolean variables (or variables into which a Boolean value can be placed),
we can have Boolean assignment statements. That is, we might evaluate Boolean
expressions outside of control flow statements.
Recall that the code we generated for boolean expressions (inside control flow
statements) used inherited attributes to push down the tree the exit labels B.true and
B.false. How are we to deal with Boolean assignment statements?
Up to now we have used the so called jumping code method for Boolean quantities.
We evaluated Boolean expressions (in the context of control flow statements) by
using inherited attributes to push down the tree the true and false exits (i.e., the target
locations to jump to if the expression evaluates to true and false).
With this method if we have a Boolean assignment statement, we just let the true and
false exits lead to statements
LHS = true
LHS = false
respectively.
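For example (my illustration), with this method the Boolean assignment b = x < y generates something like

     if x < y goto L1
     b = false
     goto L2
L1:  b = true
L2: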
In the second method we simply treat boolean expressions as expressions. That is, we
just mimic the actions we did for integer/real evaluations. Thus Boolean assignment
statements like
a = b OR (c AND d AND (x < y))
just work.
However, this is wrong. In C, if (a==0 || 1/a > f(a)) is guaranteed not to divide by zero,
and the above implementation fails to provide this guarantee. We must somehow
implement short-circuit Boolean evaluation.
6.7: Backpatching
Skipped.
Our intermediate code uses symbolic labels. At some point these must be translated
into addresses of instructions. If we use quads all instructions are the same length so
the address is just the number of the instruction. Sometimes we generate the jump
before we generate the target so we can't put in the instruction number on the fly.
Indeed, that is why we used symbolic labels. The easiest method of fixing this up is to
make an extra pass (or two) over the quads to determine the correct instruction
number and use that to replace the symbolic label. This is extra work; a more efficient
technique, of interest independent of compilation, is called backpatching.
The C language is unusual in that the various cases are just labels for a
giant computed goto at the beginning. The more traditional idea is that you execute
just one of the arms, as in a series of
if
else if
else if
...
end if
The lab 3 grammar does not have a switch statement so we won't do a detailed SDD.
1. When you process the switch (E) ... production, call newlabel() to generate
labels for next and test which are put into inherited and synthesized attributes
respectively.
2. Then the expression is evaluated with the code and the address synthesized up.
3. The code for the switch has after the code for E a goto test.
4. Each case begins with a newlabel(). The code for the case begins with this label
and then the translation of the arm itself and ends with a goto next. The
generated label paired with the value for this case is added to an inherited
attribute representing a queue of these pairs (actually this is done by some
production like
cases → case cases | ε).
As usual the queue is sent back up the tree by the ε-production.
5. When we get to the end of the cases we are back at the switch production which
now adds code to the end. Specifically, the test label is gen'ed and then a series
of
if E.addr = Vi goto Li
statements, where each Li,Vi pair is from the generated queue.
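Putting these steps together, the generated code has roughly the following shape (my sketch; the labels are illustrative):

      code to evaluate E into E.addr
      goto test
L1:   code for the arm of case V1
      goto next
L2:   code for the arm of case V2
      goto next
      ...
test: if E.addr = V1 goto L1
      if E.addr = V2 goto L2
      ...
next: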
Recall the SDD for declarations. These semantic rules pass up the totalSize to the
ds → d ds
production.
Our lexer doesn't support this. So you would remove table building from the lexer and
instead do it in the parser and when a new scope (procedure definition, record
definition, begin block) arises you push the current tables on a stack and begin a new
one. When the nested scope ends, you pop the tables.
This should be compared with an operating systems treatment, where we worry about
how to effectively map this configuration to real memory. For example, see these
two diagrams in my OS class notes, which illustrate an OS difficulty with our
allocation method (it uses a very large virtual address range) and one solution.
Some system require various alignment constraints. For example 4-byte integers
might need to begin at a byte address that is a multiple of four. Unaligned data might
be illegal or might lower performance. To achieve proper alignment padding is often
used.
Much (often most) data cannot be statically allocated. Either its size is not known at
compile time or its lifetime is only a subset of the program's execution.
Early versions of Fortran used only statically allocated data. This required that each
array had a constant size specified in the program. Another consequence of supporting
only static allocation was that recursion was forbidden (otherwise the compiler could
not tell how many versions of a variable would be needed).
Modern languages, including newer versions of Fortran, support both static and
dynamic allocation of memory.
The advantage of supporting dynamic storage allocation is the increased flexibility and
storage efficiency possible (instead of declaring an array to have a size adequate for
the largest data set, just allocate what is needed). The advantage of static storage
allocation is that it avoids the run-time costs for allocation/deallocation and may permit
faster code sequences for referencing the data.
An (unfortunately, all too common) error is a so-called memory leak, where a long-
running program repeatedly allocates memory that it fails to delete, even after it can no
longer be referenced. To avoid memory leaks and ease programming, several
programming language systems employ automatic garbage collection. That means the
runtime system itself can determine if data can no longer be referenced and if so
automatically deallocates it.
1. Space shared by procedure calls that have disjoint durations (despite being
unable to check disjointness statically).
2. The relative address of each nonlocal variable is constant throughout execution.
Recall the fibonacci sequence 1,1,2,3,5,8, ... defined by f(1)=f(2)=1 and, for n>2,
f(n)=f(n-1)+f(n-2). Consider the function calls that result from a main program calling
f(5). On the left we show the calls and returns linearly and on the right in tree form.
The latter is sometimes called the activation tree or
call tree.
System starts main
enter f(5)
enter f(4)
enter f(3)
enter f(2)
exit f(2)
enter f(1)
exit f(1)
exit f(3)
enter f(2)
exit f(2)
exit f(4)
enter f(3)
enter f(2)
exit f(2)
enter f(1)
exit f(1)
exit f(3)
exit f(5)
main ends

int a[10];
int main(){
    int i;
    for (i=0; i<10; i++){
        a[i] = f(i);
    }
}
int f (int n) {
    if (n<3) return 1;
    return f(n-1)+f(n-2);
}
At the far right is a later state in the execution when f(4) has been called by main and
has in turn called f(2). There are three activation records, one for main and two for f.
It is these multiple activations for f that permits the recursive execution. There are two
locations for n and two for the result.
The calling sequence, executed when one procedure (the caller) calls another (the
callee), allocates an activation record (AR) on the stack and fills in the fields. Part of
this work is done by the caller; the remainder by the callee. Although the work is
shared, the AR is called the callee's AR.
Since the procedure being called is defined in one place, but called from many, there
are more instances of the caller activation code than of the callee activation code.
Thus it is wise, all else being equal, to assign as much of the work as possible to the callee.
1. Values computed by the caller are
placed before any items of size
unknown by the caller. This way they
can be referenced by the caller using
fixed offsets. One possibility is to place
values computed by the caller at the
beginning of the activation record (AR),
i.e., near the AR of the caller. The
number of arguments may not be the
same for different calls of the same
function (so called varargs, e.g. printf()
in C).
2. Fixed length items are placed next.
These include the links and the saved status.
3. Finally come items allocated by the callee whose size is known only at run-
time, e.g., arrays whose size depends on the parameters.
4. The stack pointer sp is between the last two so the temporaries and local data
are actually above the stack. This would seem more surprising if I used the
book's terminology, which is top_sp. Fixed length data can be referenced by
fixed offsets (known to the intermediate code generator) from the sp.
The top picture illustrates the situation where a pink procedure (the caller) calls a blue
procedure (the callee). Also shown is Blue's AR. Note that responsibility for this
single AR is shared by both procedures. The picture is just an approximation: for
example, the returned value is actually Blue's responsibility (although the space
might well be allocated by Pink). Also, some of the saved status, e.g., the old sp, is
saved by Pink.
Calling Sequence
Return Sequence
1. The callee stores the return value near the parameters. Note that this address
can be determined by the caller using the old (soon-to-be-restored) sp.
2. The callee restores sp and the registers.
3. The callee jumps to the return address.
It is the second flavor that we wish to allocate on the stack. The goal is for the (called)
procedure to be able to access these arrays using addresses determinable at compile
time even though the size of the arrays (and hence the location of all but the first) is
not known until the procedure is called and indeed often differs from one call to the next.
The solution is to leave room for pointers to the arrays in the AR. These are fixed size
and can thus be accessed using static offsets. Then when the procedure is invoked and
the sizes are known, the pointers are filled in and the space allocated.
A small change caused by storing these variable size items on the stack is that it no
longer is obvious where the real top of the stack is located relative to sp. Consequently
another pointer (call it real-top-of-stack) is also kept. This is used on a call to tell
where the new allocation record should begin.
7.3: Access to Nonlocal Data on the Stack
As we shall see the ability of procedure p to access data declared outside of p (either
declared globally outside of all procedures or declared inside another procedure q)
offers interesting challenges.
In languages like standard C without nested procedures, visible names are either local
to the procedure in question or are declared globally.
1. For global names the address is known statically at compile time, providing
there is only one source file. If there are multiple source files, the linker knows. In
either case no reference to the activation record is needed; the addresses are known
prior to execution.
2. For names local to the current procedure, the address needed is in the AR at a
known-at-compile-time constant offset from the sp. In the case of variable size
arrays, the constant offset refers to a pointer to the actual storage.
With nested procedures a complication arises. Say g is nested inside f. So g can refer
to names declared in f. These names refer to objects in the AR for f; the difficulty is
finding that AR when g is executing. We can't tell at compile time where the (most
recent) AR for f will be relative to the current AR for g since a dynamically-
determined number of routines could have been called in the middle.
There is an example in the next section, in which g refers to x, which is declared in the
immediately outer scope (main), but the AR is 2 away because f was invoked in
between. (In that example you can tell at compile time what was called in what order,
but with a more complicated program having data-dependent branches, it is not
possible.)
As we have discussed, the 1e, which you have, uses pascal, which many of you don't
know. The 2e, which you don't have, uses C, which you do know.
Since pascal supports nested procedures, this is what the 1e uses to give examples.
The 2e asserts (correctly) that C doesn't have nested procedures so introduces ML,
which does (and is quite slick), but which unfortunately many of you don't know and I
haven't used. Fortunately a common extension to C is to permit nested procedures. In
particular, gcc supports nested procedures. To check my memory I compiled and ran
the following program.
#include <stdio.h>
int main()
{
    int x = 10;              /* x is declared in main (nesting depth 1) */
    void g(int y)            /* gcc extension: g is nested inside main (depth 2) */
    {
        int z = x;           /* g refers to x, declared in the immediately outer scope */
        return;
    }
    int f (int y)            /* f is also nested inside main (depth 2) */
    {
        g(y);
        return y+1;
    }
    printf("%d\n", f(x));    /* prints 11 */
    return 0;
}
The program compiles without errors and the correct answer of 11 is printed.
Outermost procedures have nesting depth 1. Other procedures have nesting depth 1
more than the nesting depth of the immediately outer procedure. In the example above
main has nesting depth 1; both f and g have nesting depth 2.
The AR for a nested procedure contains an access link that points to the AR of the
(most recent activation of the immediately outer procedure). So in the example above
the access link for all activations of f and g would point to the AR of the (only)
activation of main. Then for a procedure P to access a name defined in the 3-outer
scope, i.e., the unique outer scope whose nesting depth is 3 less than that of P, you
follow the access links three times.
Without procedure parameters, the compiler knows the name of the called procedure
and, since we are assuming the entire program is compiled at once, knows the nesting
depth.
Let the caller be procedure R (the last letter in caller) and let the called procedure be
D. Let N(f) be the nesting depth of f. I did not like the presentation in 2e (which had
three cases and I think did not cover the example above). I made up my own and
noticed it is much closer to 1e (but makes clear the direct recursion case, which is
explained in 2e). I am surprised to see a regression from 1e to 2e, so make sure I have
not missed something in the cases below.
Let k = N(R) - N(D), and let P be the procedure that immediately encloses the
declaration of D, so that N(P) = N(D) - 1. Our goal while creating the AR for D at the
call from R is to set the access link to point to the (most recent) AR for P. Note that
the nesting structure of the program is visible to the compiler. Thus, the current (at
the time of the call) AR is the one for R, and if we follow the access links k+1 times
we get a pointer to the AR for P, which we can then place in the access link for the
being-created AR for D.
When k=0 we get the gcc code I showed before and also the case of direct recursion
where D=R.
7.3.7: Access Links for Procedure Parameters
Basically skipped. The problem is that, if f calls g passing it h as a parameter (or a
pointer to h in C-speak) and then g calls this parameter (i.e., calls h), g might not know
the context of h. The solution is for f to pass to g the pair (h, the access link of h)
instead of just h. Naturally, this is done by the compiler; the programmer is
unaware of access links.
7.3.8: Displays
Basically skipped. In theory access links can form long chains (in practice nesting
depth rarely exceeds a dozen or so). A display is an array in which entry i points to the
most recent (highest on the stack) AR of depth i.
Covered in OS.
Covered in Architecture.
Covered in OS.
Covered in OS.
Stack data is automatically deallocated when the defining procedure returns. What
should we do with heap data explicitly allocated with new/malloc?
The manual method is to require that the programmer explicitly deallocate these data.
Two problems arise. One is the memory leak mentioned above: as a program continues
to run, it will require more and more storage even though its actual usage is not
increasing significantly.
Skipped
Skipped
7.5.2: Reachability
Skipped.
Skipped.
Skipped.
7.6.2:Basic Abstraction
Skipped.
Skipped.
Skipped.
Skipped.
Skipped.
Skipped.
Skipped.
Skipped.
Skipped.
Skipped.
Skipped.
Skipped.
Skipped.
Skipped.
================ Start Lecture #13 ================
Goal: Transform intermediate code + tables into final machine (or assembly) code.
Code generation + Optimization is the back end of the compiler.
As expected the input is the output of the intermediate code generator. We assume
that all syntactic and semantic error checks have been done by the front end. Also all
needed type conversions are already done and any type errors have been detected.
We are using three-address instructions for our intermediate language. These
instructions have several representations: quads, triples, indirect triples, etc. In this
chapter I will tend to write quads (for brevity) when I should write three-address
instructions.
8.1.2: The Target Program
A RISC (Reduced Instruction Set Computer), e.g. PowerPC, Sparc, MIPS (popular for
embedded systems), is characterized by
Many registers
Three address instructions
Simple addressing modes
Relatively simple ISA (instruction set architecture)
Only loads and stores touch memory
Homogeneous registers
Very few instruction lengths
A CISC (Complex Instruction Set Computer), e.g. the x86, is characterized by
Few registers
Two address instructions
Variety of addressing modes (some complex)
Complex ISA
Register classes
Multiple instruction lengths
A stack-based machine is characterized by
1. No registers
2. Zero address instructions (operands/results implicitly on the stack)
3. Top portion of stack kept in hidden registers
A Little History
Stack-based machines were believed to be good compiler targets. They became very
unpopular when it was believed that register architectures would perform better:
better compilation (code generation) techniques had appeared that could take advantage
of the multiple registers.
Pascal P-code and Java byte-code are the machine instructions for a hypothetical
stack-based machine, the JVM (Java Virtual Machine) in the case of Java. This code
can be interpreted, or compiled to native code.
For maximum compilation speed, the compiler accepts the entire program at once and
produces code that can be loaded and executed (the compilation system can include a
simple loader and can start the compiled program). This was popular for student
jobs when computer time was expensive. The alternative, where each procedure can
be compiled separately, requires a linkage editor.
It eases the compiler's task to produce assembly code instead of machine code and we
will do so. This decision increased the total compilation time since it requires an extra
assembler pass (or two).
A big question is the level of code quality we seek to attain. For example, can we
simply translate one quadruple at a time? The quad
x=y+z
can always (assuming x, y, and z are statically allocated, i.e., their address is a
compile time constant off the sp) be compiled into
LD R0, y
ADD R0, R0, z
ST x, R0
But if we apply this to each quad separately (i.e., as a separate problem), then
a=b+c
d=a+e
is compiled into
LD R0, b
ADD R0, R0, c
ST a, R0
LD R0, a
ADD R0, R0, e
ST d, R0
The fourth statement is clearly not needed since we are loading into R0 the same
value that it contains. The inefficiency is caused by our compiling the second quad
with no knowledge of how we compiled the first quad.
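Using even this small bit of cross-quad knowledge, the redundant load disappears and we get

LD R0, b
ADD R0, R0, c
ST a, R0
ADD R0, R0, e
ST d, R0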
8.1.4: Register Allocation
Since registers are the fastest memory in the computer, the ideal solution is to store all
values in registers. However, there are normally not nearly enough registers for this to
be possible. So we must choose which values are in the registers at any given time.
The reason for the second problem is that often there are register requirements, e.g.,
floating-point values in floating-point registers and certain requirements for even-odd
register pairs (e.g., 0&1 but not 1&2) for multiplication/division.
Sometimes better code results if the quads are reordered. One example occurs with
modern processors that can execute multiple instructions concurrently, providing
certain restrictions are met (the obvious one is that the input operands must already be
evaluated).
1. Load. LD dest, addr loads the register dest with the contents of the address
addr.
As will be seen below, we charge more for a memory location than for a register.
2. Store. ST addr, src stores the value of the source src (register) into the address
addr.
3. Computation. OP dest, src1, src2 performs the operation OP on the two source
operands src1 and src2. For RISC the three operands must be registers. If the
destination is one of the sources the source is read first and then overwritten
(using a master-slave flip-flop if it is a register).
The addressing modes are not RISC-like at first glance, as they permit memory
locations to be operands. Again, note that we shall charge more for these.
1. Variable name. A variable name x is shorthand (or assembler-speak) for the memory
location containing x, i.e., the l-value of x.
2. Indexed address. The address a(r), where a is a variable name and r is a register
(number), specifies the address that is value-in-r bytes past the address
specified by a. That is, a(r) refers to
contents(a+contents(r)) NOT
contents(contents(a)+contents(r))
Array assignment statements are also four instructions. We can't do A[i]=B[j] because
that needs four addresses.
Here we just determine the first cost, and use quite a simple metric. We charge for
each instruction one plus the cost of each addressing mode used.
Addressing modes using just registers have zero cost, while those involving memory
addresses or constants are charged one. This corresponds to the size of the instruction
since a memory address or a constant is assumed to be stored in a word right after the
instruction word itself.
You might think that we are measuring the memory (or space) cost of the program not
the time cost, but this is mistaken: The primary space cost is the size of the data, not
the size of the instructions. One might say we are charging for the pressure on the I-
cache.
For example, LD R0, *50(R2) costs 2; the additional cost is for the constant 50.
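As a worked example of this metric (my arithmetic), price the six-instruction sequence shown earlier for a=b+c; d=a+e:

LD  R0, b       // cost 2 (1 for the instruction, 1 for the memory address b)
ADD R0, R0, c   // cost 2
ST  a, R0       // cost 2
LD  R0, a       // cost 2
ADD R0, R0, e   // cost 2
ST  d, R0       // cost 2
                // total cost 12; the five-instruction version without the redundant load costs 10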
Homework: 9.1, 9.3
1. The text or code area. The size of this area is statically determined.
2. The static area holding global constants. The size of this area is statically
determined.
3. The stack holding activation records. The size of this area is not known at
compile time.
4. The heap. The size of this area is not known at compile time.
Returning to the glory days of Fortran, we consider a system with static allocation.
Remember, that with static allocation we know before execution where all the data
will be stored. There are no recursive procedures; indeed, there is no run-time stack of
activation records. Instead the ARs are statically allocated by the compiler.
In this simplified situation, calling a parameterless procedure just uses static addresses
and can be implemented by two instructions. Specifically,
call procA
can be implemented by
ST callee.staticArea, #here+20
BR callee.codeArea
We are assuming, for convenience, that the return address is the first location in the
activation record (in general it would be a fixed offset from the beginning of the AR).
We use the attribute staticArea for the address of the AR for the given procedure
(remember again that there is no stack and heap).
As we know, the # signifies an immediate constant. We use here to represent the address
of the current instruction (the compiler knows this value since we are assuming that
the entire program, i.e., all procedures, are compiled at once). The two instructions
listed contain 3 constants, which means that the entire sequence takes 5 words or 20
bytes. Thus here+20 is the address of the instruction after the BR, which is indeed the
return address.
Callee Returning
With static allocation, the compiler knows the address of the AR for the callee and
we are assuming that the return address is the first entry. Then a procedure return is
simply
BR *callee.staticArea
Example
We consider a main program calling a procedure P and then halting. Other actions by
Main and P are indicated by subscripted uses of other.
// Quadruples of Main
other1
call P
other2
halt
// Quadruples of P
other3
return
Let us arbitrarily assume that the code for Main starts in location 1000 and the code
for P starts in location 2000 (there might be other procedures in between). Also
assume that each otheri requires 100 bytes (all addresses are in bytes). Finally, we
assume that the ARs for Main and P begin at 3000 and 4000 respectively. Then the
following machine code results.
// Code for Main
1000: other1
1100: ST 4000, #1120    // P.staticArea, #here+20
1112: BR 2000           // the two constants in the previous instruction take 8 bytes
1120: other2
1220: HALT
...
// Code for P
2000: other3
2100: BR *4000
...
// AR for Main
3000: // Return address stored here (not used)
3004: // Local data for Main starts here
...
// AR for P
4000: // Return address stored here
4004: // Local data for P starts here
We now move the ARs onto the run-time stack. The key distinction is that the location
of the current AR is not known at compile time; instead, a pointer into the stack must be
maintained dynamically.
We dedicate a register, call it SP, for this purpose. In this chapter we let SP point to
the bottom of the current AR; that is, the entire AR is above the SP. (I do not know
why last chapter it was decided to be more convenient to have the stack pointer point
to the end of the statically known portion of the activation. However, since the
difference between the two is known at compile time it is clear that either can be
used.)
The first procedure (or the run-time library code called before any user-written
procedure) must initialize SP with
LD SP, #stackStart
where stackStart is a constant known at compile time (indeed, even before compile time).
The caller increments SP (which now points to the beginning of its AR) to point to the
beginning of the callee's AR. This requires an increment by the size of
the caller's AR, which of course the caller knows.
Both editions treat this size as a constant. The only part that is not known at compile time is
the size of the dynamic arrays. Strictly speaking this is not part of the AR, but it must
be skipped over since the callee's AR starts after the caller's dynamic arrays.
Perhaps for simplicity we are assuming that there are no dynamic arrays being stored
on the stack. If there are arrays, their size must be included in some way.
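Putting the pieces together, the caller's calling sequence (with no dynamic arrays, in the notation of the example below, and assuming as before that the return address occupies the first word of the AR) is roughly:

ADD SP, SP, #caller.ARSize  // skip over the caller's own AR
ST *SP, #here+16            // store the return address in the first word of the callee's AR
BR callee.codeArea          // jump to the callee

The +16 reflects the sizes of the ST and BR instructions; in the example below, caller.ARSize is 400 and here+16 works out to 1132.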
Callee Returning
The return requires code from both the Caller and Callee. The callee transfers control
back to the caller with
BR *0(SP)
Upon return, the caller restores the stack pointer with
SUB SP, SP, #caller.ARSize
Example
We again consider a main program calling a procedure P and then halting. Other
actions by Main and P are indicated by subscripted uses of `other'.
// Quadruples of Main
other[1]
call P
other[2]
halt
// Quadruples of P
other[3]
return
Recall our assumptions that the code for Main starts in location 1000, the code for P
starts in location 2000, and each other[i] requires 100 bytes. Let us assume the stack
begins at 9000 (and grows to larger addresses) and that the AR for Main is of size 400
(we don't need P.ARSize since P doesn't call any procedures). Then the following
machine code results.
// Code for Main
1000: LD SP, #9000
1008: other[1]
1108: ADD SP, SP, #400
1116: ST *SP, #1132
1124: BR 2000
1132: SUB SP, SP, #400
1140: other[2]
1240: HALT
...
// Code for P
2000: other[3]
2100: BR *0(SP)
...
// AR for Main
9000: // Return address stored here (not used)
9004: // Local data for Main starts here
9396: // Last word of the AR is bytes 9396-9399
...
// AR for P
9400: // Return address stored here
9404: // Local data for P starts here
Homework: 9.2
Another problem is that we don't make much use of the registers. That is, translating a
single quad needs just one or two registers, so we might as well throw out all the other
registers on the machine.
Both of the problems are due to the same cause: Our horizon is too limited. We must
consider more than one quad at a time. But wild flow of control can make it unclear
which quads are dynamically near each other. So we want to consider, at one time, a
group of quads within which the dynamic order of execution is tightly controlled. We
then also need to understand how execution proceeds from one group to another.
Specifically the groups are called basic blocks and the execution order among them is
captured by the flow graph.
Definition: A flow graph has the basic blocks as vertices and has edges from one
block to each possible dynamic successor.
Constructing the basic blocks is not hard. Once you find the start of a block, you keep
going until you hit a label or jump. But, as usual, to say it correctly takes more words.
Definition: A basic block leader (i.e., first instruction) is any of the following (except
for the instruction just past the entire program).
1. The first instruction of the program.
2. Any instruction that is the target of a (conditional or unconditional) jump.
3. Any instruction immediately following a (conditional or unconditional) jump.
Given the leaders, a basic block starts with a leader and proceeds up to but not
including the next leader.
Example
1 is a leader by definition. The jumps are 9, 11, and 17. So 10 and 12 are leaders as
are the targets 3, 2, and 13.
The basic blocks are {1}, {2}, {3,4,5,6,7,8,9}, {10,11}, {12}, and {13,14,15,16,17}.
Here is the code written again with the basic blocks indicated.
1) i = 1
2) j = 1
3) t1 = 10 * i
4) t2 = t1 + j // element [i,j]
5) t3 = 8 * t2 // offset for a[i,j] (8 byte numbers)
6) t4 = t3 - 88 // we start at [1,1] not [0,0]
7) a[t4] = 0.0
8) j = j + 1
9) if j <= 10 goto (3)
10) i = i + 1
11) if i <= 10 goto (2)
12) i = 1
13) t5 = i - 1
14) t6 = 88 * t5
15) a[t6] = 1.0
16) i = i + 1
17) if i <= 10 goto (13)
We can see that once you execute the leader you are assured of executing the rest of
the block in order.
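As a sketch, the leader computation of the three rules above can be coded as follows (the quad representation is invented; only the jump/target information matters).

#include <stdbool.h>

/* Invented quad representation for this sketch. */
struct Quad {
    bool is_jump;  /* conditional or unconditional jump */
    int  target;   /* index of the quad jumped to, if is_jump */
};

/* Mark the leaders among the quads q[0..n-1]. */
void find_leaders(const struct Quad *q, int n, bool *leader) {
    for (int i = 0; i < n; i++) leader[i] = false;
    if (n > 0) leader[0] = true;                  /* rule 1: first instruction */
    for (int i = 0; i < n; i++) {
        if (q[i].is_jump) {
            leader[q[i].target] = true;           /* rule 2: target of a jump */
            if (i + 1 < n) leader[i + 1] = true;  /* rule 3: instruction after a jump */
        }
    }
}

Each basic block then runs from a leader up to, but not including, the next leader. (The sketch numbers quads from 0; the example above numbers them from 1.)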
We want to record the flow of information from instructions that compute a value to
those that use the value. One advantage we will achieve is that if we find a value has
no subsequent uses, then it is dead and the register holding that value can be used for
another value.
Assume that a quad p assigns a value to x (some would call this a def of x).
Definition: Another quad q uses the value computed at p (uses the def) and x is live at
q if q has x as an operand and there is a possible execution path from p to q that does
not pass any other def of x.
Since the flow of control is trivial inside a basic block, we are able to compute the
live/dead status and next-use information at the block leader (i.e., on entrance to the
block) by a simple backwards scan of the quads (algorithm below).
Note that if x is dead (i.e., not live) on entrance to B the register containing x can be
reused in B.
Our goal is to determine whether a block uses a value and if so in which statement.
Initialize all variables in B as being live
Examine the quads of the block in reverse order.
Let the quad q compute x and read y and z
Mark x as dead; mark y and z as live and used at q
When the loop finishes, those values that are read before being written are marked as
live and their first use is noted. The locations x that are set before being read are
marked dead, meaning that the value of x on entrance is not used.
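A C sketch of this backward scan follows (the representation is invented; every quad is assumed to be a binary operation x = y OP z). Besides the status on entrance, it also records, for each quad, the status holding just after that quad, which is what the code generator will later consult.

#include <stdbool.h>

/* Sketch only: every quad is x = y OP z, with x, y, z small integers naming variables. */
struct Quad3 {
    int x, y, z;
    bool x_live, y_live, z_live;  /* liveness just after this quad */
    int  x_use,  y_use,  z_use;   /* index of the next use, or -1  */
};

void scan_block(struct Quad3 *q, int n, bool *live, int *use, int nvars) {
    /* Lacking global flow information, assume every variable is live on exit. */
    for (int v = 0; v < nvars; v++) { live[v] = true; use[v] = -1; }
    for (int i = n - 1; i >= 0; i--) {
        /* The table currently describes the situation just after quad i; record it. */
        q[i].x_live = live[q[i].x];  q[i].x_use = use[q[i].x];
        q[i].y_live = live[q[i].y];  q[i].y_use = use[q[i].y];
        q[i].z_live = live[q[i].z];  q[i].z_use = use[q[i].z];
        /* x is written here, so the value of x arriving at quad i is dead. */
        live[q[i].x] = false;  use[q[i].x] = -1;
        /* y and z are read here, so they are live with next use i. */
        live[q[i].y] = true;   use[q[i].y] = i;
        live[q[i].z] = true;   use[q[i].z] = i;
    }
    /* On return, live[v] and use[v] give the status on entrance to the block. */
}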
The flow graph has an edge from block P to block S if the last quad of P
1. is a jump to (the leader of) S, or
2. is not a jump and S immediately follows P.
Note that jump targets are no longer quads but blocks. The reason is that various
optimizations within blocks will change the instructions and we would have to change
the jump to reflect this.
8.4.5: Loops
i. Produce quads of the form we have been using. Assume each element requires
8 bytes.
ii. What are the basic blocks of your program?
iii. Construct the flow graph.
iv. Identify the loops in your flow graph.
We have seen that a simple backwards scan of the statements in a basic block enables
us to determine the variables that are live-on-entry and those that are dead-on-entry.
Those variables that do not occur in the block are in neither category; perhaps we
should call them ignored by the block.
We shall see below that it would be lovely to know which variables are live/dead-on-
exit. This means which variables hold values at the end of the block that will / will not
be used. To determine the status of v on exit of a block B, we need to trace all
possible execution paths beginning at the end of B. If all these paths reach a block
where v is dead-on-entry before they reach a block where v is live-on-entry, then v is
dead on exit for block B.
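A sketch of this path-tracing test, assuming the per-block summaries (reads v before writing it, writes v) and the successor lists of the flow graph are already available (all names are invented; the visited flags must be cleared before each query):

#include <stdbool.h>

struct Block {
    struct Block **succ;
    int nsucc;
    bool *use_before_def;  /* indexed by variable: read before any write in this block */
    bool *defines;         /* indexed by variable: this block writes the variable */
    bool visited;          /* scratch flag for one query */
};

static bool live_on_entry(struct Block *S, int v);

/* Is v live on exit from B?  True iff some path from the end of B reaches a
   use of v without first passing a definition of v. */
bool live_on_exit(struct Block *B, int v) {
    for (int i = 0; i < B->nsucc; i++)
        if (live_on_entry(B->succ[i], v)) return true;
    return false;
}

static bool live_on_entry(struct Block *S, int v) {
    if (S->visited) return false;           /* this block is already covered by the search */
    S->visited = true;
    if (S->use_before_def[v]) return true;  /* v is read here before being written */
    if (S->defines[v]) return false;        /* v is overwritten before any later use */
    return live_on_exit(S, v);              /* v passes through S untouched */
}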
The goal is to obtain a visual picture of how information flows through the block. The
leaves will show the values entering the block and, as we proceed up the DAG, we
encounter uses of these values, defs (and redefs) of values, and uses of the new values.
1. Create a leaf for the initial value of each variable appearing in the block. (We
do not know what that value is, nor even whether the variable has ever been given
a value.)
2. Create a node N for each statement s in the block.
i. Label N with the operator of s. This label is drawn inside the node.
ii. Attach to N those variables for which N is the last def in the block.
These additional labels are drawn alongside N.
iii. Draw edges from N to the node for each statement (or to the initial-value
leaf) that is the last def of an operand used by s.
3. Designate as output nodes those N whose values are live on exit, an officially-
mysterious term meaning values possibly used in another block. (Determining
the live on exit values requires global, i.e., inter-block, flow analysis.)
As we shall see in the next few sections various basic-block
optimizations are facilitated by using the DAG.
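To fix ideas, one possible node representation follows (the field names are invented; the operator encoding and identifier handling are left open).

#include <stdbool.h>

#define MAX_ATTACHED 8

/* One possible node representation for the basic-block DAG. */
struct DagNode {
    int op;                              /* operator, or a special marker for a leaf */
    struct DagNode *left, *right;        /* last defs of the operands (NULL at a leaf) */
    const char *attached[MAX_ATTACHED];  /* variables whose last def in the block is this node */
    int nattached;
    bool live_on_exit;                   /* output node?  needs inter-block flow information */
};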
You might think that with only three computation nodes in the DAG, the block could be
reduced to three statements (dropping the computation of b). However, this is wrong.
Only if b is dead on exit can we omit the computation of b. We can, however, replace
the last statement with the simpler b = c.
Some of these are quite clear. We can of course replace x+0 or 0+x by simply x.
Similar considerations apply to 1*x, x*1, x-0, and x/1.
Other uses of algebraic identities are possible; many require a careful reading of the
language reference manual to ensure their legality. For example, even though it might
be advantageous to convert
((a + b) * f(x)) * a
to
((a + b) * a) * f(x)
it is illegal in Fortran since the programmer's use of parentheses to specify the order of
operations cannot be violated.
Does the pair of statements
a = b + c
x = y + c + b + r
contain the common subexpression b+c? Only if the compiler may use the
commutativity and associativity of addition to reorder the second right-hand side;
whether that is permitted is again a question for the language reference manual.
A statement of the form x = a[i] generates a node labeled with the operator =[] and the
variable x, and having children a0, the initial value of a, and the value of i.
Indirect references through pointers, as in the pair of statements
x = *p
*p = y
are worse. (If we know precisely the value of p, say because it was just assigned
p = &x, then the second statement kills only the nodes with x attached; in general we
have no such knowledge.)
We must treat the first statement as a use of every variable; pictorially the =* operator
takes all current nodes with identifiers as arguments. This impacts dead code
elimination.
We must treat the second statement as writing every variable. That is, all existing
nodes are killed, which impacts common subexpression elimination.
In our basic-block level approach, a procedure call has properties similar to a pointer
reference: For all x in the scope of P, we must treat a call of P as using all nodes with
x attached and also as killing those same nodes.
8.5.7: Reassembling Basic Blocks From DAGs
Now that we have improved the DAG for a basic block, we need to regenerate the
quads. That is, we need to obtain the sequence of quads corresponding to the new
DAG.
We need to construct a quad for every node that has a variable attached. If there are
several variables attached, we choose a live-on-exit variable (assuming we have done
the necessary global flow analysis to determine such variables).
If there are several live-on-exit variables we need to compute one and make a copy so
that we have both. An optimization pass may eliminate the copy if it is able to assure
that one such variable may be used whenever the other is referenced.
Example
If b is dead on exit, the first three instructions suffice. If not we produce instead
a = b + c
c = a + x
d = b + c
b = c
which is still an improvement as the copy instruction is less expensive than the
addition on most architectures.
If global analysis shows that, whenever this definition of b is used, c contains the
same value, we can eliminate the copy and use c in place of b.
Note that of the following five rules, two are due to arrays and two are due to pointers.
Homework: 9.14,
9.15 (just simplify the 3-address code of 9.14 using the two cases given in 9.15), and
9.17 (just construct the DAG for the given basic block in the two cases given).
For this section we assume a RISC architecture. Specifically, we assume only loads
and stores touch memory; that is, the instruction set consists of
LD reg, mem
ST mem, reg
OP reg, reg, reg
where there is one OP for each operation type used in the three address code.
The 1e uses CISC-like instructions (2 operands). Perhaps the 2e switched to RISC in part
due to the success of the ROPs in the Pentium Pro.
A major simplification is we assume that, for each three address operation, there is
precisely one machine instruction that accomplishes the task. This eliminates the
question of instruction selection.
We do, however, consider register usage. Although we have not done global flow
analysis (part of optimization), we will point out places where live-on-exit
information would help us make better use of the available registers.
Recall that the mem operand in the load LD and store ST instructions can use any of
the previously discussed addressing modes.
Addressing Mode Usage
Remember that in 3-address instructions, the variables written are addresses, i.e., they
represent l-values.
Let us assume a is 500 and b is 700, i.e., a and b refer to locations 500 and 700
respectively. Assume further that location 100 contains 666, location 500 contains 100,
location 700 contains 900, and location 900 contains 123. This initial state is shown in
the upper left picture.
In the four other pictures the contents of the pink location have been changed to the
contents of the light green location. These correspond to the three-address assignment
statements shown below each picture. The machine instructions indicated below
implement each of these assignment statements.
a = b
LD R1, b
ST a, R1
a = *b
LD R1, b
LD R1, 0(R1)
ST a, R1
*a = b
LD R1, b
LD R2, a
ST 0(R2), R1
*a = *b
LD R1, b
LD R1, 0(R1)
LD R2, a
ST 0(R2), R1
These are the primary data structures used by the code generator. They keep track of
what values are in each register as well as where a given value resides.
Each register has a register descriptor containing the list of variables currently
stored in this register. At the start of the basic block all register descriptors are
empty.
Each variable has an address descriptor containing the list of locations where
this variable is currently stored. Possibilities are its memory location and one or
more registers. The memory location might be in the static area, the stack, or
presumably the heap (but not mentioned in the text).
The register descriptor could be omitted since you can compute it from the address
descriptors.
There are basically three parts to (this simple algorithm for) code generation.
1. Choosing registers
2. Generating instructions
3. Managing descriptors
1. Call getReg(OP x, y, z) to get Rx, Ry, and Rz, the registers to be used for x, y,
and z respectively.
Note that getReg merely selects the registers, it does not guarantee that the
desired values are present in these registers.
2. Check the register descriptor for Ry. If y is not present in Ry, check the address
descriptor for y and issue
LD Ry, y
The 2e uses y' (not y) as the source of the load, where y' is some location
containing y (1e suggests this as well). I don't see how the value of y can
appear in any memory location other than y. Please check me on this.
It would be a serious bug in the algorithm if the first were true, and I am
confident it is not. The second might be a possible design, but when we study
getReg(), we will see that if the value of y is in some register, then the chosen
Ry will contain that value.
3. Similarly, check the register descriptor for Rz. If z is not present in Rz, issue
LD Rz, z
4. Finally, issue the operation itself
OP Rx, Ry, Rz
When processing
x=y
steps 1 and 2 are the same as above (getReg() will set Rx=Ry). Step 3 is vacuous and
step 4 is omitted. This says that if y was already in a register before the copy
instruction, no code is generated at this point. Since the value of y is not in its
memory location, we may need to store this value back into y at block exit.
You probably noticed that we have not yet generated any store instructions; they
occur here (and during spill code in getReg()). We need to ensure that all variables
needed by (dynamically) subsequent blocks (i.e., those live-on-exit) have their current
values in their memory locations.
1. Temporaries are ignored; by definition they are not used outside the block.
2. Variables dead on exit (thank you, global flow analysis, for determining such variables)
are also ignored.
Check the address descriptor for each live on exit variable. If its own memory
location is not listed, generate
ST x, R
where R is a register listed in the address descriptor
This is fairly clear. We just have to think through what happens when we do a load, a
store, an OP, or a copy. For R a register, let Desc(R) be its register descriptor. For x a
program variable, let Desc(x) be its address descriptor.
1. Load: LD R, x
o Desc(R) = x (removing everything else from Desc(R))
o Add R to Desc(x) (leaving alone everything else in Desc(x))
o Remove R from Desc(w) for all w ≠ x (not in 2e please check)
2. Store: ST x, R
o Add the memory location of x to Desc(x)
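A C sketch of the descriptors and of the load/store bookkeeping above. The representation is invented; the usual rule for an OP (not listed above) is included as well: after OP Rx, Ry, Rz computing x, Rx holds only x and x is only in Rx.

#include <stdbool.h>

#define NREGS 3
#define NVARS 8

/* reg_var[r][v]: register r currently holds the value of variable v.
   var_in_mem[v]: v's own memory location currently holds v's value.   */
static bool reg_var[NREGS][NVARS];
static bool var_in_mem[NVARS];

/* At block entry: all register descriptors empty, every variable in memory. */
void init_descriptors(void) {
    for (int r = 0; r < NREGS; r++)
        for (int v = 0; v < NVARS; v++) reg_var[r][v] = false;
    for (int v = 0; v < NVARS; v++) var_in_mem[v] = true;
}

/* Rule 1: after emitting LD R, x. */
void note_load(int R, int x) {
    for (int v = 0; v < NVARS; v++) reg_var[R][v] = false; /* Desc(R) = {x} ...        */
    reg_var[R][x] = true;                                  /* ... and add R to Desc(x) */
    /* Clearing R's row also removed R from Desc(w) for every w != x,
       which is the last bullet of rule 1 above. */
}

/* Rule 2: after emitting ST x, R. */
void note_store(int x, int R) {
    (void)R;               /* the register descriptors are unchanged */
    var_in_mem[x] = true;  /* the memory location of x now also holds x */
}

/* After emitting OP Rx, Ry, Rz computing x (the usual rule). */
void note_op(int Rx, int x) {
    for (int v = 0; v < NVARS; v++) reg_var[Rx][v] = false; /* Rx holds only x       */
    for (int r = 0; r < NREGS; r++) reg_var[r][x] = false;  /* x is nowhere else     */
    reg_var[Rx][x] = true;
    var_in_mem[x] = false;                                  /* x's memory copy is stale */
}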
Example
Since we haven't specified getReg() yet, we will assume there are an unlimited
number of registers so we do not need to generate any spill code (saving the register's
value in memory). One of getReg()'s jobs is to generate spill code when a register
needs to be used for another purpose and the current value is not presently in memory.
Despite having ample registers and thus not generating spill code, we will not be
wasteful of registers.
When a register holds a temporary value and there are no subsequent uses of
this value, we reuse that register.
When a register holds the value of a program variable and there are no
subsequent uses of this value, we reuse that register providing this value is also
in the memory location for the variable.
When a register holds the value of a program variable and all subsequent uses
of this value are preceded by a redefinition, we could reuse this register. But to
know about all subsequent uses may require live/dead-on-exit knowledge.
This example is from the book. I give another example after presenting getReg(), which
I believe justifies my claim that the book is missing an action, as indicated above.
t = a - b
LD R1, a
LD R2, b
SUB R2, R1, R2
u = a - c
LD R3, c
SUB R1, R1, R3
v = t + u
ADD R3, R2, R1
a = d
LD R2, d
d = v + u
ADD R1, R3, R1
exit
ST a, R2
ST d, R1
What follows describes the choices made. Confirm that the values in the descriptors
match the explanations.
1. For the first quad, we need all three instructions since nothing is register
resident on block entry. Since b is not used again, we can reuse its register.
(Note that the current value of b is in its memory location.)
2. We do not load a again since its value is R1, which we can reuse for u since a is
not used below.
3. We again reuse a register for the result; this time because c is not used again.
4. The copy instruction required a load since d was not in a register. As the
descriptor shows, a was assigned to the same register, but no machine
instruction was required.
5. The last instruction uses values already in registers. We can reuse R1 since u is
a temporary.
6. At block exit, lacking global flow analysis, we must assume all program
variables are live and hence must store back to memory any values located only
in registers.
Consider
x = y OP z
Picking registers for y and z is done the same way; we describe the process for y.
Choosing a register for x is a little different.
A copy instruction
x=y
is easier.
Choosing Ry
Similar to demand paging, where the goal is to produce an available frame, our
objective here is to produce an available register we can use for Ry. We apply the
following steps in order until one succeeds. (Step 2 is a special case of step 3.)
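Here is a sketch of one reasonable getReg policy, consistent with the choices made in the example below: reuse a register already holding y; otherwise take an empty register; otherwise take a register every one of whose values is dead or also available in memory; only as a last resort spill. The representation matches the descriptor sketch given earlier but is repeated so the fragment is self-contained; dead[] comes from the backward scan.

#include <stdbool.h>
#include <stdio.h>

#define NREGS 3
#define NVARS 8

static bool reg_var[NREGS][NVARS];  /* register r holds variable v        */
static bool var_in_mem[NVARS];      /* v's memory location is up to date  */

/* Emit spill code for every value that lives only in register r. */
static void spill(int r) {
    for (int v = 0; v < NVARS; v++)
        if (reg_var[r][v] && !var_in_mem[v]) {
            printf("ST v%d, R%d\n", v, r);
            var_in_mem[v] = true;
        }
}

/* Choose Ry for operand y; dead[v] says v has no next use and is dead on exit. */
int getreg_operand(int y, const bool *dead) {
    for (int r = 0; r < NREGS; r++)               /* 1. y is already in a register */
        if (reg_var[r][y]) return r;
    for (int r = 0; r < NREGS; r++) {             /* 2. some register is empty */
        bool empty = true;
        for (int v = 0; v < NVARS; v++) if (reg_var[r][v]) empty = false;
        if (empty) return r;
    }
    for (int r = 0; r < NREGS; r++) {             /* 3. evictable with no spill */
        bool safe = true;
        for (int v = 0; v < NVARS; v++)
            if (reg_var[r][v] && !var_in_mem[v] && !dead[v]) safe = false;
        if (safe) return r;
    }
    spill(0);                                     /* 4. last resort: spill some register */
    return 0;
}

Note that, as remarked above, step 2 is just a special case of step 3: an empty register is trivially safe to take.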
Example
// register descriptors | address descriptors
R1      R2      R3     |   a      b      c      d      e
                       |   a      b      c      d      e
a = b + c
LD R1, b
LD R2, c
ADD R3, R1, R2
R1      R2      R3     |   a      b      c      d      e
b       c       a      |   R3     b,R1   c,R2   d      e
d = a + e
LD R1, e
ADD R2, R3, R1
        R1      R2      R3     |   a      b      c      d      e
2e →    e       d       a      |   R3     b,R1   c      R2     e,R1
me →    e       d       a      |   R3     b      c      R2     e,R1
We needed registers for d and e; none were free. getReg() first chose R2 for d since
R2's current contents, the value of c, was also located in memory. getReg() then chose
R1 for e for the same reason.
Using the 2e algorithm, b might appear to be in R1 (it depends on whether you look at
the address descriptors or the register descriptors).
a = e + d
ADD R3, R1, R2
Descriptors unchanged
e = a + b
ADD R1, R3, R1 ← possible wrong answer from 2e
R1      R2      R3     |   a      b      c      d      e
e       d       a      |   R3     b,R1   c      R2     R1
LD R1, b
ADD R1, R3, R1
R1      R2      R3     |   a      b      c      d      e
e       d       a      |   R3     b      c      R2     R1
The 2e might think R1 has b (address descriptor) and also conclude R1 has only e
(register descriptor) so might generate the erroneous code shown.
Really b is not in a register so must be loaded. R3 has the value of a so was already
chosen for a. R2 or R1 could be chosen. If R2 was chosen, we would need to spill d
(we must assume d is live on exit, since we have no global flow analysis). We choose R1
since no spill is needed: the value of e (the current occupant of R1) is also in its
memory location.
exit
ST a, R3
ST d, R2
ST e, R1
We would like to be able to describe the machine OPs in a way that enables us to find
a sequence of OPs (and LDs and STs) to do the job.
The idea is that you express the quad as a tree and express each OP as a (sub-)tree
simplification, i.e., the op replaces a subtree by a simpler subtree. In fact the simpler
subtree is just a single node.
Compare this to grammars: A production replaces the RHS by the LHS. We consider
context free grammars where the LHS is a single nonterminal.
Another example is that ADD Ri, Ri, Rj replaces a subtree consisting of a + with both
children registers (i and j) with a Register node (i).
As you do the pattern matching and reductions (apply the productions), you emit the
corresponding code (semantic actions). So to support a new processor, you need to
supply the tree transformations corresponding to every instruction in the instruction
set.
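As a toy illustration of the ADD rule just mentioned (the tree representation is invented, and real instruction selectors are table driven rather than hand coded):

#include <stdio.h>

/* Invented expression-tree nodes for this sketch. */
enum Kind { REG, PLUS /* ..., one tag per operator */ };
struct Node { enum Kind kind; int reg; struct Node *left, *right; };

/* Rewrite rule for ADD Ri, Ri, Rj: a '+' whose children are both register
   nodes is replaced by the register node Ri, emitting the instruction. */
struct Node *reduce_plus(struct Node *n) {
    if (n->kind == PLUS && n->left->kind == REG && n->right->kind == REG) {
        printf("ADD R%d, R%d, R%d\n", n->left->reg, n->left->reg, n->right->reg);
        return n->left;    /* the whole subtree is now just the register node Ri */
    }
    return n;              /* this rule does not apply here */
}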
We assume all operators are binary and label the instruction tree with something like
the height. This gives the minimum number of registers needed so that no spill code is
required. A few details follow.
8.10.1: Ershov Numbers
1. The code generator is a recursive algorithm starting at the root. Each node puts its
answer in the highest-numbered register it is assigned; the idea is that a node uses
(mostly) the same registers as its sibling.
a. If the labels on the children are both equal to L, the parent's label is L+1.
i. Give one child L regs; its answer appears in its top reg.
ii. Give the other child L regs, but numbered one higher; its answer again
appears in its top reg.
iii. The parent uses a two-address OP to compute its answer in the same reg
used by the second child, which is the top reg assigned to the parent.
b. If the labels on the children are M < L, the parent is labeled L.
i. Give the bigger child all L regs.
ii. Give the other child M regs, ending one below the bigger child's top reg.
iii. The parent uses a two-address OP, computing its answer in the bigger
child's (and hence the parent's) top reg.
c. At a leaf (operand), load the operand into its assigned reg.
The rough idea is to apply the above recursive algorithm but, at each recursive step, if
there are not enough registers, to store (spill) the result of the first child computed
before starting the second.
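A C sketch of the labeling and of code generation from the labeled tree, assuming enough registers so that no spilling is needed. It uses the three-address OP reg, reg, reg form assumed earlier in this chapter and a register-numbering scheme in which both children of an unequal-label node share the same base; the register counts are the same as in the list above. The tree representation and names are invented.

#include <stdio.h>

/* Invented expression tree: leaves are variables, interior nodes binary ops. */
struct Expr {
    char op;               /* 0 at a leaf */
    const char *name;      /* variable name, used at leaves */
    int label;             /* Ershov number, filled in by label_tree() */
    struct Expr *left, *right;
};

/* Ershov number: 1 at a leaf; max of the children's labels if they differ,
   one more than the common label if they are equal. */
int label_tree(struct Expr *e) {
    if (!e->op) return e->label = 1;
    int l = label_tree(e->left), r = label_tree(e->right);
    return e->label = (l == r) ? l + 1 : (l > r ? l : r);
}

/* Generate code for e using registers R[base] .. R[base + e->label - 1];
   the answer ends up in the top register of that range.  The operator
   character stands in for the real opcode in the printfs. */
void gen(struct Expr *e, int base) {
    int top = base + e->label - 1;
    if (!e->op) {                          /* leaf: load into its assigned register */
        printf("LD R%d, %s\n", top, e->name);
        return;
    }
    if (e->left->label == e->right->label) {
        gen(e->right, base + 1);           /* right child: one higher; answer in R[top] */
        gen(e->left, base);                /* left child: answer in R[top-1]            */
        printf("%c R%d, R%d, R%d\n", e->op, top, top - 1, top);
    } else if (e->left->label > e->right->label) {
        gen(e->left, base);                /* bigger child first; answer in R[top]      */
        gen(e->right, base);               /* smaller child reuses the low registers    */
        printf("%c R%d, R%d, R%d\n", e->op, top, top, base + e->right->label - 1);
    } else {
        gen(e->right, base);               /* bigger (right) child first; answer in R[top] */
        gen(e->left, base);
        printf("%c R%d, R%d, R%d\n", e->op, top, base + e->left->label - 1, top);
    }
}

After label_tree(root), calling gen(root, 1) uses registers R1 through R(label of the root), the minimum possible with no spilling.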
8.11: Dynamic Programming Code-Generation
Skipped