Compiler Construction Lecture Notes
You may never write a commercial compiler, but that's not why we study compilers; the concepts and techniques of compiler construction are useful far beyond compilers themselves.
Purpose: translate a program in some language (the source language) into a lower-level language
(the target language).
Phases:
Lexical Analysis:
Converts a sequence of characters into words, or tokens
Syntax Analysis:
Converts a sequence of tokens into a parse tree
Semantic Analysis:
Manipulates parse tree to verify symbol and type information
Intermediate Code Generation:
Converts parse tree into a sequence of intermediate code instructions
Optimization:
Manipulates intermediate code to produce a more efficient program
Final Code Generation:
Translates intermediate code into final (machine/assembly) code
The term "token" usually refers to an object (or struct) containing complete information about a
single lexical entity, but it is often also used to refer the category ("class" if you prefer) of that
entity. The term "lexeme" denotes the actual string of characters that comprise a particular
occurrence ("instance" if you like) of a token.
Regular Expressions
The notation we use to precisely capture all the variations that a given category of token may
take is called a "regular expression" (or, less formally, a "pattern"; the word "pattern" is really
vague, though, and there are lots of other notations for patterns besides regular expressions). Regular
expressions are a shorthand notation for sets of strings. In order to even talk about "strings" you
have to first define an alphabet: the set of characters which can appear.
1. Epsilon is a regular expression denoting the set containing the empty string
2. Any letter in the alphabet is also a regular expression denoting the set containing a one-
letter string consisting of that letter.
3. For regular expressions r and s,
r | s
is a regular expression denoting the union of r and s
4. For regular expressions r and s,
r s
is a regular expression denoting the set of strings consisting of a member of r followed by
a member of s
5. For regular expression r,
r*
is a regular expression denoting the set of strings consisting of zero or more occurrences
of r.
6. You can parenthesize a regular expression to specify operator precedence (otherwise,
alternation is like plus, concatenation is like times, and closure is like exponentiation)
Although these operators are sufficient to describe all regular languages, in practice everybody
uses extensions such as + (one or more), ? (optional), and character sets like [a-z]; these are
described in the lex notation below.
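For example, the extended pattern [0-9]+ is just an abbreviation: written with only the basic
operators it would be (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*.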
Finite Automata
A finite automaton is an abstract, mathematical machine, also known as a finite state machine,
with the following components:
1. A set of states S
2. A set of input symbols E (the alphabet)
3. A transition function move(state, symbol) : new state(s)
4. A start state S0
5. A set of final states F
For a deterministic finite automaton (DFA), the function move(state, symbol) goes to at most one
state, and symbol is never epsilon.
Finite automata correspond in a 1:1 relationship to transition diagrams; from any transition
diagram one can write down the formal automaton in terms of items #1-#5 above, and vice versa.
To draw the transition diagram for a finite automaton:
draw a circle for each state s in S; put a label inside the circles to identify each state by
number or name
draw an arrow from Si to Sj, labeled with x, whenever the transition function says move(Si,
x) : Sj
draw a "wedgie" into the start state S0 to identify it
draw a second circle inside each of the final states in F
DFA Implementation
The nice part about DFA's is that they are efficiently implemented on computers. What DFA
does the following code correspond to? What is the corresponding regular expression? You can
speed this code fragment up even further if you are willing to use goto's or write it in assembler.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
   int state = 0;                /* start state S0 */
   int input = getchar();
   for (;;) {
      switch (state) {
      case 0:
         switch (input) {
         case 'a': state = 1; input = getchar(); break;
         case 'b': input = getchar(); break;     /* stay in state 0 */
         default: printf("dfa error\n"); exit(1);
         }
         break;
      case 1:                    /* accepting state */
         switch (input) {
         case EOF: printf("accept\n"); exit(0);
         default: printf("dfa error\n"); exit(1);
         }
      }
   }
}
Notational convenience motivates more flexible machines, called nondeterministic finite automata
(NFAs), in which the function move() can go to more than one state on a given input symbol, and
some states can move to other states even without consuming an input symbol (an epsilon
transition).
Fortunately, one can prove that for any NFA, there is an equivalent DFA; NFAs are just a
notational convenience. So, finite automata help us get from a set of regular expressions to a
computer program that recognizes them efficiently.
Each rule in the definition of regular expressions has a corresponding NFA; NFA's are composed
using epsilon transitions. This is cited in the text as "Thompson's construction" (Algorithm 3.3).
We will work examples such as (a|b)*abb in class and during lab.
The subset construction converts an NFA into an equivalent DFA.
In: NFA N
Out: DFA D
Method: Construct transition table Dtran (a.k.a. the "move function"). Each DFA state is a set of
NFA states. Dtran simulates in parallel all possible moves N can make on a given string.
e_closure(s)
set of states reachable from state s via epsilon
e_closure(T)
set of states reachable from any state in set T via epsilon
move(T,a)
set of states to which there is an NFA transition from states in T on symbol a
Algorithm:
Dstates := {e_closure(start_state)}
while T := unmarked_member(Dstates) do {
mark(T)
for each input symbol a do {
U := e_closure(move(T,a))
if not member(Dstates, U) then
insert(Dstates, U)
Dtran[T,a] := U
}
}
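As a rough illustration (not from the text), here is a minimal C sketch of e_closure over a small
NFA, assuming the epsilon edges are stored in an adjacency matrix and that a set of NFA states
fits in one machine word as a bit vector; NSTATES and eps are made-up names:

#define NSTATES 8                     /* hypothetical small NFA */
int eps[NSTATES][NSTATES];            /* eps[s][t] != 0 iff s --epsilon--> t */

unsigned e_closure(unsigned T)        /* T is a bit vector of NFA states */
{
   unsigned closure = T;
   int changed = 1;
   while (changed) {                  /* iterate to a fixed point */
      changed = 0;
      for (int s = 0; s < NSTATES; s++) {
         if (!(closure & (1u << s))) continue;
         for (int t = 0; t < NSTATES; t++)
            if (eps[s][t] && !(closure & (1u << t))) {
               closure |= 1u << t;    /* t reachable via epsilon: add it */
               changed = 1;
            }
      }
   }
   return closure;
}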
Lexical analyzer generators such as lex(1) and flex(1) take a lexical specification given in a .l file
and create a corresponding C language lexical analyzer in a file named lex.yy.c. The lexical
analyzer is then linked with the rest of your compiler.
The C code generated by lex provides a function yylex() that returns the next token's integer
category, along with globals such as yytext (the text of the current lexeme) and yyin (the file
being scanned). Note the use of global variables instead of parameters, and the use of the prefix
yy to distinguish scanner names from your program names. This prefix is also used in the YACC
parser generator. The layout of a lex specification (.l) file is:
header
%%
body
%%
helper functions
The header consists of C code fragments enclosed in %{ and %} as well as macro definitions
consisting of a name and a regular expression denoted by that name. lex macros are invoked
explicitly by enclosing the macro name in curly braces. Following are some example lex macros.
letter [a-zA-Z]
digit [0-9]
ident {letter}({letter}|{digit})*
The body consists of a sequence of regular expressions for different token categories and other
lexical entities. Each regular expression can have a C code fragment enclosed in curly braces that
executes when that regular expression is matched. For most of the regular expressions this code
fragment (also called a semantic action) consists of returning an integer that identifies the token
category to the rest of the compiler, particularly for use by the parser to check syntax. Some
typical regular expressions and semantic actions might include:
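(A sketch; the names IDENT, INTLIT, WHILE, and LESSEQ are assumed to be integer category
codes #defined in a header shared with the parser, and {ident} is the macro defined above.)

{ident}      { return IDENT; }
[0-9]+       { return INTLIT; }
"while"      { return WHILE; }
"<="         { return LESSEQ; }
[ \t\n]+     { /* discard whitespace */ }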
The helper functions in a lex file typically compute lexical attributes, such as the actual integer
or string values denoted by literals.
Lex further extends the regular expressions with several helpful operators. Lex's regular
expressions include:
c
normal characters mean themselves
\c
backslash escapes remove the meaning from most operator characters. Inside character
sets and quotes, backslash performs C-style escapes.
"s"
Double quotes mean to match the C string given as itself. This is particularly useful for
multi-byte operators and may be more readable than using backslash multiple times.
[s]
This character set operator matches any one character among those in s.
[^s]
A negated-set matches any one character not among those in s.
.
The dot operator matches any one character except newline: [^\n]
r*
match r 0 or more times.
r+
match r 1 or more times.
r?
match r 0 or 1 time.
r{m,n}
match r between m and n times.
r1r2
concatenation. match r1 followed by r2
r1|r2
alternation. match r1 or r2
(r)
parentheses specify precedence but do not match anything
r1/r2
lookahead. match r1 when r2 follows, without consuming r2
^r
match r only when it occurs at the beginning of a line
r$
match r only when it occurs at the end of a line
Besides the token's category, the rest of the compiler may need several pieces of information about a
token in order to perform semantic analysis, code generation, and error handling. These are
stored in an object instance of class Token, or in C, a struct. The fields are generally something
like:
struct token {
   int category;          /* integer code for the token category */
   char *text;            /* the lexeme itself */
   int linenumber;
   int column;
   char *filename;
   union literal value;   /* computed value of a literal, if any */
};
The union literal will hold computed values of integers, real numbers, and strings.
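As a sketch of how such a struct might get filled in from a lex semantic action (newtoken() is a
hypothetical helper; yytext comes from lex, yylineno from flex's %option yylineno, and the
column and filename are assumed to be tracked by your own scanner code, since lex does not do
that for you):

#include <stdlib.h>
#include <string.h>

extern char *yytext;               /* current lexeme, provided by lex/flex */
extern int yylineno;               /* current line, e.g. flex %option yylineno */

struct token *newtoken(int category, char *filename, int column)
{
   struct token *t = malloc(sizeof(struct token));
   t->category = category;
   t->text = strdup(yytext);       /* save the lexeme */
   t->linenumber = yylineno;
   t->column = column;
   t->filename = filename;
   /* t->value gets filled in later by a helper function, for literals */
   return t;
}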
In many compilers, the symbol table and memory management components of the compiler
interact with several phases of compilation, starting with lexical analysis.
Storing a separate copy of the lexeme inside every token wastes a lot of space when the same
names occur over and over. A hash table or other efficient data structure can avoid this duplication.
The software engineering design pattern to use is called the "flyweight". Example: Figure 3.18,
p. 109. Use "install_id()" instead of "strdup()" to avoid duplication in the lexical data.
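A minimal sketch of the flyweight idea, assuming a small fixed-size chained hash table (the sizes
and names here are made up); install_id() hands back the same stored copy every time the same
lexeme is seen, instead of strdup()ing a fresh one:

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211
static struct entry { char *s; struct entry *next; } *tab[NBUCKETS];

static unsigned hash(const char *s)
{
   unsigned h = 0;
   while (*s) h = h * 31 + (unsigned char)*s++;
   return h % NBUCKETS;
}

char *install_id(const char *lexeme)
{
   unsigned h = hash(lexeme);
   for (struct entry *e = tab[h]; e != NULL; e = e->next)
      if (strcmp(e->s, lexeme) == 0)
         return e->s;              /* already installed: share the copy */
   struct entry *e = malloc(sizeof *e);
   e->s = strdup(lexeme);          /* first occurrence: make the one copy */
   e->next = tab[h];
   tab[h] = e;
   return e->s;
}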
Syntax Analysis
Parsing is the act of performing syntax analysis to verify an input program's compliance with the
source language. A by-product of this process is typically a tree that represents the structure of
the program.
To derive a string from a grammar, start with X equal to the start symbol and repeatedly replace
one of the nonterminals in X with the right hand side of one of that nonterminal's production
rules. When X consists only of terminal symbols, it is a string of the language denoted by the
grammar. Each iteration of the loop is a derivation step. If an iteration has several
nonterminals to choose from at some point, the rules of derivation would allow any of
these to be applied. In practice, parsing algorithms tend to always choose the leftmost
nonterminal, or the rightmost nonterminal, resulting in strings that are leftmost
derivations or rightmost derivations.
Grammar Ambiguity
The grammar
E -> E + E
E -> E * E
E -> ( E )
E -> ident
allows two different derivations for strings such as "x + y * z". The grammar is
ambiguous, but the semantics of the language dictate a particular operator precedence
that should be used. One way to eliminate such ambiguity is to rewrite the grammar. For
example, we can force the precedence we want by adding some nonterminals and
production rules.
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( E )
F -> ident
Perhaps the simplest parsing method, for a large subset of context free grammars, is
called recursive descent. It is simple because the algorithm closely follows the production
rules of nonterminal symbols.
The grammar
S -> A B C
A -> a A
A -> epsilon
B -> b
C -> c
procedure S()
if A() & B() & C() then succeed
end
procedure A()
if currtoken == a then # use production 2
currtoken = scan()
return A()
else
succeed # production rule 3, match epsilon
end
procedure B()
if currtoken == b then
currtoken = scan()
succeed
else fail
end
procedure C()
if currtoken == c then
currtoken = scan()
succeed
else fail
end
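Here is one way the same parser might look in C (a sketch, assuming scan() just returns the next
character and that the single letters a, b, c stand in for real tokens):

#include <stdio.h>
#include <stdlib.h>

int currtoken;                          /* one token of lookahead */

int scan(void) { return getchar(); }

void fail(void) { printf("parse error\n"); exit(1); }

void A(void)                            /* A -> a A | epsilon */
{
   if (currtoken == 'a') {
      currtoken = scan();
      A();
   }                                    /* else match epsilon: consume nothing */
}

void B(void)                            /* B -> b */
{
   if (currtoken == 'b') currtoken = scan(); else fail();
}

void C(void)                            /* C -> c */
{
   if (currtoken == 'c') currtoken = scan(); else fail();
}

void S(void)                            /* S -> A B C */
{
   A(); B(); C();
}

int main(void)
{
   currtoken = scan();
   S();
   if (currtoken != EOF && currtoken != '\n') fail();
   printf("accept\n");
   return 0;
}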
We can remove the left recursion by introducing new nonterminals and new production
rules.
E -> T E'
E' -> + T E' | epsilon
T -> F T'
T' -> * F T' | epsilon
F -> ( E ) | ident
Getting rid of such immediate left recursion is not enough; one must also get rid of indirect
left recursion, where two or more nonterminals are mutually left-recursive. One can
rewrite any CFG to remove left recursion (Algorithm 4.1).
for i := 1 to n do begin
   for j := 1 to i-1 do
      replace each production of the form Ai -> Aj gamma with the productions
         Ai -> delta1 gamma | delta2 gamma | ... | deltak gamma,
      where Aj -> delta1 | delta2 | ... | deltak are the current Aj productions
   eliminate the immediate left recursion among the Ai productions
end
Backtracking?
What if the current token could begin more than one of your possible production rules? Try all of
them: remember and reset the parser's state for each attempt. For example, with the grammar
S -> cAd
A -> ab
A -> a
and input "cad", a recursive descent parser must try A -> ab first, then back up and try A -> a.
Left factoring the grammar removes the need to backtrack:
S -> cAd
A -> a A'
A' -> b
A' -> epsilon
One can also perform left factoring (Algorithm 4.2) to reduce or eliminate the lookahead
or backtracking needed to tell which production rule to use. If the end result has no
lookahead or backtracking needed, the resulting CFG can be solved by a "predictive
parser" and coded easily in a conventional language. If backtracking is needed, a
recursive descent parser takes more work to implement, but is still feasible. As a more
concrete example:
S -> if E then S
S -> if E then S1 else S2
Automatic techniques for constructing parsers start with computing some basic functions for
symbols in the grammar. These functions are useful in understanding both recursive descent and
bottom-up LR parsers.
First(alpha)
First(alpha) is the set of terminals that begin strings derived from alpha. If alpha can derive the
empty string, then epsilon is also in First(alpha).
Follow(A)
Follow(A) for nonterminal A is the set of terminals that can appear immediately to the right of A
in some sentential form. To compute Follow, apply these rules to all nonterminals in the
grammar:
1. Add $ to Follow(S), where S is the start symbol.
2. If A -> alpha B beta, then add First(beta) - {epsilon} to Follow(B).
3. If A -> alpha B, or A -> alpha B beta where epsilon is in First(beta), then add Follow(A) to Follow(B).
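Worked example: for the expression grammar with left recursion removed (E -> T E',
E' -> + T E' | epsilon, T -> F T', T' -> * F T' | epsilon, F -> ( E ) | id), these rules give:

First(E) = First(T) = First(F) = { (, id }
First(E') = { +, epsilon }
First(T') = { *, epsilon }
Follow(E) = Follow(E') = { ), $ }
Follow(T) = Follow(T') = { +, ), $ }
Follow(F) = { +, *, ), $ }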
Bottom Up Parsing
Bottom up parsers start from the sequence of terminal symbols and work their way back up to
the start symbol by repeatedly replacing grammar rules' right hand sides by the corresponding
non-terminal. This is the reverse of the derivation process, and is called "reduction".
(1) S->aABe
(2) A->Abc
(3) A->b
(4) B->d
the string "abbcde" can be parsed bottom-up by the following reduction steps:
abbcde
aAbcde
aAde
aABe
S
Definition: a handle is a substring that (a) matches the right hand side of some production rule, and
(b) whose reduction to the nonterminal on the left hand side of the rule represents one step along
the reverse of a rightmost derivation.
A shift-reduce parser operates with a parse stack and performs four kinds of actions:
1. Shift one symbol from the input onto the parse stack
2. Reduce one handle on the top of the parse stack. The symbols from the right hand side of
a grammar rule are popped off the stack, and the nonterminal symbol is pushed on the
stack in their place.
3. Accept is the operation performed when the start symbol is alone on the parse stack and
the input is empty.
4. Error actions occur when no successful parse is possible.
LR Parsers
LR denotes a class of bottom up parsers that is capable of handling virtually all programming
language constructs. LR is efficient; it runs in linear time with no backtracking needed. The class
of languages handled by LR is a proper superset of the class of languages handled by top down
"predictive parsers". LR parsing detects an error as soon as it is possible to do so. Generally
building an LR parser is too big and complicated a job to do by hand, we use tools to generate
LR parsers.
The LR parsing algorithm is given below. See Figure 4.29 for a schematic.
ip = first symbol of input
repeat {
s = state on top of parse stack
a = *ip
case action[s,a] of {
SHIFT s': { push(a); push(s'); advance ip to the next input symbol }
REDUCE A->beta: {
pop 2*|beta| symbols; s' = new state on top
push A
push goto(s', A)
}
ACCEPT: return 0 /* success */
ERROR: { error("syntax error", s, a); halt }
}
}
"Conflicts" occur when an ambiguity in the grammar creates a situation where the parser does
not know which step to perform at a given point during parsing. There are two kinds of conflicts
that occur.
shift-reduce
a shift reduce conflict occurs when the grammar indicates that different successful parses
might occur with either a shift or a reduce at a given point during parsing. The vast
majority of situations where this conflict occurs can be correctly resolved by shifting.
reduce-reduce
a reduce reduce conflict occurs when the parser has two or more handles at the same time
on the top of the stack. Whatever choice the parser makes is just as likely to be wrong as
not. In this case it is usually best to rewrite the grammar to eliminate the conflict,
possibly by factoring.
Example shift reduce conflict:
S->if E then S
S->if E then S else S
In many languages two nested "if" statements produce a situation where an "else" clause could
legally belong to either "if". The usual rule (to shift) attaches the else to the nearest (i.e. inner) if
statement. Example reduce-reduce conflict: any grammar in which two different nonterminals
derive the same right hand side, such as
A -> x y
B -> x y
When x y is on top of the stack, the parser cannot tell whether to reduce it to A or to B.
An LR(0) item is a production rule with a dot marking a position in its right hand side, for
example E -> E . + T; the dot marks how much of the rule has been seen so far.
Closure: if I is a set of items for a grammar G, then closure(I) is the set of items constructed as
follows:
1. Every item in I is in closure(I).
2. If A -> alpha . B beta is in closure(I) and B -> gamma is a production, then add the item
B -> . gamma to closure(I).
These two rules are applied repeatedly until no new items can be added.
Intuition: if A -> alpha . B beta is in closure(I), then we hope to see a string derivable from B beta
in the input. So if B -> gamma is a production, we should also hope to see a string derivable from
gamma. Hence, B -> . gamma is in closure(I).
Goto: if I is a set of items and X is a grammar symbol, then goto(I,X) is defined to be the closure
of the set of all items [A -> alpha X . beta] such that [A -> alpha . X beta] is in I.
Intuition:
[A -> alpha . X beta] in I means we've seen a string derivable from alpha; we hope to see a string
derivable from X beta.
Now suppose we see a string derivable from X.
Then, we should "goto" a state where we've seen a string derivable from alpha X, and where
we hope to see a string derivable from beta. The item corresponding to this is [A -> alpha X . beta].
E -> E+T | T
T -> T*F | F
F -> (E) | id
Let I = {[E -> E . + T]}; then
goto(I,+) = closure({[E -> E + . T]})
          = closure({[E -> E + . T], [T -> . T * F], [T -> . F]})
          = closure({[E -> E + . T], [T -> . T * F], [T -> . F],
                     [F -> . ( E )], [F -> . id]})
          = { [E -> E + . T], [T -> . T * F], [T -> . F],
              [F -> . ( E )], [F -> . id] }
1. Given a grammar G with start symbol S, construct the augmented grammar by adding a
special production S' -> S, where S' does not appear in G.
2. Algorithm for constructing the canonical collection of LR(0) items for an augmented
grammar G':
begin
   C := { closure({[S' -> .S]}) };
   repeat
      for each set of items I in C:
         for each grammar symbol X:
            if goto(I,X) is not empty and goto(I,X) is not in C then
               add goto(I,X) to C;
   until no new sets of items can be added to C;
   return C;
end
Valid Items: an item A -> beta1 . beta2 is valid for a viable prefix alpha beta1 if there is a
rightmost derivation S' =>* alpha A w => alpha beta1 beta2 w.
Suppose A -> beta1 . beta2 is valid for alpha beta1, and alpha beta1 is on the parsing stack. Then:
1. if beta2 != epsilon, we should shift
2. if beta2 = epsilon, A -> beta1 is the handle, and we should reduce by this production
Note: two valid items may tell us to do different things for the same viable prefix. Some of these
conflicts can be resolved using lookahead on the input string.
1. Given a grammar G, construct the augmented grammar G' by adding the production S' -> S.
2. Construct C = {I0, I1, ... In}, the set of sets of LR(0) items for G'.
3. State i of the parser is constructed from Ii, with its parsing actions determined as follows:
   o if [A -> alpha . a beta] is in Ii, where a is a terminal, and goto(Ii,a) = Ij: set action[i,a] = "shift j"
   o if [A -> alpha .] is in Ii: set action[i,a] to "reduce A -> alpha" for all a in FOLLOW(A), where
     A != S'
   o if [S' -> S .] is in Ii: set action[i,$] to "accept"
4. The goto transitions are constructed as follows: for all nonterminals A: if goto(Ii, A) = Ij, then
goto[i,A] = j
5. All entries not defined by (3) & (4) are made "error". If there are any multiply defined
entries, the grammar is not SLR.
6. The initial state S0 of the parser is the one constructed from the set of items containing [S' -> .S].
Example:
YACC
A yacc specification, like a lex specification, has a header section, a grammar section, and a
helper-function section, separated by %%. The grammar section gives the production rules,
interspersed with program code fragments called semantic actions that let the programmer do
what's desired when the grammar productions are reduced. Rules follow the syntax
A : body ;
where body is a sequence of 0 or more terminals, nonterminals, or semantic actions (code, in
curly braces) separated by spaces. As a notational convenience, multiple production rules with the
same left hand side may be grouped together using the vertical bar (|).
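For example, a fragment like the following (a sketch; PLUS, TIMES, and NUMBER are assumed
to be declared with %token, and $$, $1, $3 are yacc's pseudo-variables for the semantic values of
the rule and of the first and third symbols on its right hand side):

expr : expr PLUS expr    { $$ = $1 + $3; }
     | expr TIMES expr   { $$ = $1 * $3; }
     | NUMBER            { $$ = $1; }
     ;

builds the value of each expression as it is reduced. Written this way the grammar is ambiguous,
which is where the precedence declarations described next come in.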
YACC headers can specify precedence and associativity rules for otherwise heavily ambiguous
grammars. Precedence is determined by the order of these declarations: each declaration line has
higher precedence than the lines that come before it. Example:
%right ASSIGN
%left PLUS MINUS
%left TIMES DIVIDE
%right POWER
%%
expr: expr ASSIGN expr
| expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIVIDE expr
| expr POWER expr
;
Semantic Analysis
Semantic ("meaning") analysis refers to a phase of compilation in which the input program is
studied in order to determine what operations are to be carried out. The two primary components
of a classic semantic analysis phase are variable reference analysis and type checking. These
components both rely on an underlying symbol table.
What we have at the start of semantic analysis is a tree built during syntax analysis. Semantic
rules can be attached to the grammar's production rules, and they can compute all the
synthesized attributes they want. In practice, attributes get stored in parse tree nodes and the
semantic rules are evaluated either (a) during parsing (for easy rules) or (b) during one or more
(sub)tree traversals.
Symbol tables are used to resolve names within name spaces. Symbol tables are generally
organized hierarchically according to the scope rules of the language. See the operations defined
on pages 474-476 of the text. To wit:
mktable(parent)
creates a new symbol table, whose scope is local to (or inside) parent
enter(table, symbolname, type, offset)
insert a symbol into a table
addwidth(table)
sums the widths of all entries in the table
enterproc(table, name, newtable)
enters newtable, the local symbol table for the named procedure, into the enclosing table
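A rough sketch of what mktable() and enter() might look like, assuming a simple linked-list
representation (a real implementation would hash within each table); struct c_type is the type
representation described in the next section:

#include <stdlib.h>

struct c_type;                       /* type representation, described below */

struct sym {
   char *name;
   struct c_type *type;
   int offset;
   struct sym *next;
};

struct symtab {
   struct symtab *parent;            /* the enclosing scope */
   struct sym *syms;
   int width;                        /* total bytes of entries, for addwidth() */
};

struct symtab *mktable(struct symtab *parent)
{
   struct symtab *t = calloc(1, sizeof(struct symtab));
   t->parent = parent;
   return t;
}

void enter(struct symtab *t, char *name, struct c_type *type, int offset)
{
   struct sym *s = malloc(sizeof(struct sym));
   s->name = name; s->type = type; s->offset = offset;
   s->next = t->syms;
   t->syms = s;
}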
Type Checking
Perhaps the primary component of semantic analysis in many traditional compilers consists of
the type checker. In order to check types, one first must have a representation of those types (a
type system) and then one must implement comparison and composition operators on those types
using the semantic rules of the source language being compiled. Lastly, type checking involves
adding synthesized attributes to those parts of the language grammar that involve expressions
and values.
Type Systems
Types are defined recursively according to rules defined by the source language being compiled.
A type system might start with rules like:
1. int, char, float, and double are (base) types.
2. If T is a type, then "array of T" and "pointer to T" are types.
3. If T1 ... Tn are types, then a struct with fields of types T1 ... Tn is a type.
In addition, a type system includes rules for assigning these types to the various parts of the
program; usually this will be performed using attributes assigned to grammar symbols.
The type system is represented using data structures in the compiler's implementation language.
In the symbol table and in the parse tree attributes used in type checking, there is a need to
represent and compare source language types. You might start by trying to assign a numeric code
to each type, kind of like the integers used to denote each terminal symbol and each production
rule of the grammar. But what about arrays? What about structs? There are an infinite number of
types; any attempt to enumerate them will fail. Instead, you should create a new data type to
explicitly represent type information. This might look something like the following:
struct c_type {
   int base_type;                 /* 1 = int, 2 = float, ... */
   union {
      struct array {
         int size;
         struct c_type *elemtype;
      } a;                        /* used when base_type says "array" */
      struct c_type *p;           /* used when base_type says "pointer" */
      struct struc {
         char *label;
         struct field **f;        /* the fields */
      } s;                        /* used when base_type says "struct" */
   } u;
};

struct field {
   char *name;
   struct c_type *elemtype;
};
Given this representation, how would you initialize a variable to represent each of the following
types:
int [10][20]
struct foo { int x; char *s; }
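One plausible answer, using the struct c_type declaration above (the base_type codes other than
1 and 2 are assumptions here, since the notes only fix 1 = int and 2 = float):

/* assumed codes: 1 = int, 3 = char, 4 = array, 5 = pointer, 6 = struct */
struct c_type int_t  = { 1 };
struct c_type char_t = { 3 };

/* int [10][20]: an array of 10 elements, each an array of 20 ints */
struct c_type arr20_int = { 4, .u.a = { 20, &int_t } };
struct c_type arr10x20  = { 4, .u.a = { 10, &arr20_int } };

/* struct foo { int x; char *s; } */
struct c_type charptr = { 5, .u.p = &char_t };
struct field f_x = { "x", &int_t };
struct field f_s = { "s", &charptr };
struct field *foo_fields[] = { &f_x, &f_s, NULL };   /* null-terminated */
struct c_type foo_t = { 6, .u.s = { "foo", foo_fields } };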
Run-time Environments
Relationship between source code names and data objects during execution
Procedure activations
Memory management and layout
Library functions
Scope rules for each language determine how to go from names to declarations.
Each use of a variable name must be associated with a declaration. This is generally done via a
symbol table. In most compiled languages it happens at compile time (in contrast, for
example, with LISP).
Environment maps source code names onto storage addresses (at compile time), while state maps
storage addresses into values (at runtime). Environment relies on binding rules and is used in
code generation; state operations are loads/stores into memory, as well as allocations and
deallocations. Environment is concerned with scope rules, state is concerned with things like the
lifetimes of variables.
To find a non-local name, that is, a name declared in a lexically enclosing procedure:
if there is no recursion, just count back some number of frame pointers based on source
code nesting
if there is recursion, you need an extra pointer field in the activation record to keep track of
the "static link"; follow the static link back some number of times to find a name defined in
an enclosing scope
Garbage Collection
Automatic storage management is one of the single most important features that makes
programming easier.
Basic problem in garbage collection: given a piece of memory, are there any pointers to it? (And
if so, where exactly are all of them please). Approaches:
reference counting
traversal of known pointers (marking)
o copying
o compacting
o generational
conservative collection
The goal of intermediate code generation is a list of machine-independent instructions for each
procedure/method in the program, together with the basic data layout of all variables.
add new attributes where necessary, e.g. for expression E we might have
E.place
the name that holds the value of E
E.code
the sequence of intermediate code statements evaluating E.
newtemp()
returns a new temporary variable each time it is called
newlabel()
returns a new label each time it is called
Three-Address Code
Basic idea: break down source language expressions into simple pieces, each of which performs
at most one operation and refers to at most three addresses (typically two operands and a result),
hence the name three-address code.
Instruction set:
Declarations (Pseudo instructions): These declarations list size units as "bytes"; in a uniform-
size environment offsets and counts could be given in units of "slots", where a slot (4 bytes on
32-bit machines) holds anything.

global x,n1,n2   declare a global named x at offset n1 having n2 bytes of space
proc x,n1,n2     declare a procedure named x with n1 bytes of parameter space and n2 bytes of
                 local variable space
local x,n        declare a local named x at offset n from the procedure frame
label Ln         designate that label Ln refers to the next instruction
end              declare the end of the current procedure
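For example, a source statement like x = a + b * c might break down into three-address
instructions along these lines, with t1 and t2 coming from newtemp():

t1 = b * c
t2 = a + t1
x = t2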
Code for control flow (if-then, switches, and loops) consists of code to test conditions, and the
use of goto instructions and labels to route execution to the correct code. Each chunk of code that
is executed together (no jumps into or out of it) is called a basic block. The basic blocks are
nodes in a control flow graph, where goto instructions, as well as falling through from one basic
block to another, are edges connecting basic blocks.
Depending on your source language's semantic rules for things like "short-circuit" evaluation for
boolean operators, the operators like || and && might be similar to + and * (non-short-circuit) or
they might be more like if-then code.
A general technique for implementing control flow code is to add new attributes to tree nodes to
hold labels that denote the possible targets of jumps. The labels in question are sort of analogous
to FIRST and FOLLOW; for any given list of instructions corresponding to a given tree node, we
might want a .first attribute to hold the label for the beginning of the list, and a .follow attribute
to hold the label for the next instruction that comes after the list of instructions. The .first
attribute can be easily synthesized. The .follow attribute must be inherited from a sibling. The
labels have to actually be allocated and attached to instructions at appropriate nodes in the tree
corresponding to grammar production rules that govern control flow. An instruction in the
middle of a basic block needs neither a first nor a follow.
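For instance, code for "if E then S1 else S2" might be laid out as follows, where L1 and L2 come
from newlabel(), t1 holds the value of E, and label is the pseudo-instruction defined above:

     (code that evaluates E into t1)
     if t1 == 0 goto L1
     (code for S1)
     goto L2
label L1
     (code for S2)
label L2
     (the instruction that is the statement's .follow)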
Different languages have different semantics for booleans; for example Pascal treats them as
identical to arithmetic operators, while the C family of languages (and many others) specify
"short-circuit" evaluation in which operands are not evaluated once the answer to the boolean
result is known. Some ("kitchen-sink" design) languages have two sets of boolean operators:
short circuit and non-short-circuit.
1. treat boolean operators same as arithmetic operators, evaluate each and every one into
temporary variable locations.
2. add extra attributes to keep track of code locations that are targets of jumps. Boolean
expressions' results evaluate to jump instructions.
3. one could change the machine execution model so it implicitly routes control from
expression failure to the appropriate location. In order to do this one would
o mark boundaries of code in which failure propagates
o maintain a stack of such marked "expression frames"
a<="" pre="">
translates into
100: if a1 = 0
goto 104
103: t1 = 1
104: if c2 = 0
goto 108
107: t2 = 1
108: if e3 = 0
goto 112
111: t3 = 1
112: t4 = t2 AND t3
t5 = t1 OR t4
Short-Circuit Example
a<="" pre="">
translates into
if a<="" pre="">
Instruction Selection
The hardware may have many different sequences of instructions to
accomplish a given task. Instruction selection must choose a particular
sequence. At issue may be: how many registers to use, whether a special
case instruction is available, and what addressing mode(s) to use. Given
a choice among equivalent/alternative sequences, the decision on which
sequence of instructions to use is based on estimates or measurements of
which sequence executes the fastest. This is usually determined by the
number of memory references incurred during execution, including the
memory references for the instructions themselves. Simply picking the
shortest sequence of instructions is often a good approximation of the
optimal result, since fewer instructions usually translate into fewer
memory references.
Accessing values in registers is much much faster than accessing main memory.
Register allocation denotes the selection of which variables will go
into registers. Register assignment is the determination of exactly
which register to place a given variable. The goal of these operations
is generally to minimize the total number of memory accesses required
by the program.
On a machine (or virtual machine) with few or no general purpose registers,
register allocation and assignment is not needed; the compiler has few
choices to make. Otherwise, a simple approach is to keep track of which
registers are being used, load a value into a free register when it is
needed, and then spill it back out to main memory when the register is
needed for something else.
A virtual machine architecture such as the JVM changes the "final" code
generation somewhat. We have seen several changes, some of which
simplify final code generation and some of which complicate things.