Syntax Analysis
CSC 3205: Compiler Design
Marriette Katarahweire
26th February 2020
CSC 3205: Compiler Design 1/63
Phases of a Compiler
The analysis phase of a compiler breaks up a source program
into constituent pieces and produces an internal
representation for it, called intermediate code.
The synthesis phase translates the intermediate code into the
target program.
The syntax of a programming language describes the proper
form of its programs.
The semantics of the language defines what its programs
mean; that is, what each program does when it executes.
Syntax Analysis - Basics
2nd phase in the compilation process
Part of front end analysis
Also known as parsing
The parser obtains a string of tokens from the lexical analyzer
and verifies that the string of token names can be generated
by the grammar for the source language.
Where lexical analysis splits the input into tokens, syntax
analysis recombines these tokens, not back into
a list of characters, but into something that reflects the
structure of the text. This “something” is typically a data
structure called the syntax tree of the text.
The parser should report any syntax errors in an intelligible
fashion and recover from commonly occurring errors so that it can
continue processing the remainder of the program.
Syntax Analysis - Basics ...
Every programming language (PL) has precise rules that prescribe the syntactic
structure of well-formed programs.
For example, in C a program is made up of functions, a
function of declarations and statements, a statement
of expressions, and so on.
The syntax of PL constructs can be specified by context-free
grammars, usually written in BNF (Backus-Naur Form) notation.
What benefits do grammars offer to both language designers
and compiler writers?
Syntax Analysis
For well-formed programs, the parser constructs a parse tree and
passes it to the rest of the compiler for further processing. Parsing
is the process of determining how a string of terminals can be
generated by a grammar.
Context-Free Grammar
The syntax of a programming language is described by a context-free
grammar (CFG).
We will use BNF (Backus-Naur Form) notation to
describe CFGs.
The parser checks whether a given source program satisfies
the rules of the context-free grammar.
If it does, the parser creates the parse tree of that program.
Otherwise, the parser reports error messages.
There are three general types of parsers for grammars: universal, top-down,
and bottom-up.
The methods commonly used in compilers are
either top-down or bottom-up.
Parsers
Top-Down Parser: the parse tree is created top to bottom,
starting from the root to the leaves.
Bottom-Up Parser: the parse tree is created bottom to top,
starting from the leaves and working up to the root.
Both top-down and bottom-up parsers scan the input from
left to right (one symbol at a time).
Efficient top-down and bottom-up parsers can be
implemented only for subclasses of context-free grammars,
but these subclasses are expressive enough to describe most programming language constructs:
LL for top-down parsing (Left-to-right scanning of input,
Leftmost derivation)
LR for bottom-up parsing (Left-to-right scanning of input,
Rightmost derivation in reverse)
Parsers implemented by hand often use LL grammars
Parsers for the larger class of LR grammars are usually
constructed using automated tools
Context Free Grammars
Inherently recursive structures of a programming language are
defined by a context-free grammar. In a context-free grammar, we
have:
A finite set of terminals (in our case, this will be the set of
tokens)
A finite set of non-terminals (syntactic-variables)
A finite set of production rules of the following form:
A → α, where A is a non-terminal and α is a string of
terminals and non-terminals (possibly the empty string)
A start symbol (one of the non-terminal symbols)
Example:
E → E + E | E − E | E ∗ E | E / E | − E
E → ( E )
E → id
Context Free Grammars
Recall (Sections 2.2 and 4.2 in the Dragon Book):
Derivations: leftmost and rightmost
Grammar ambiguity and syntax trees
Operator precedence and associativity
Bottom-up Parsing
Parsing starts with the input symbols and tries to construct
the parse tree up to the start symbol.
Input string: a + b ∗ c (each of a, b, and c is an id token)
Production rules:
S → E
E → E + T
E → E ∗ T
E → T
T → id
Bottom-up Parsing
Read the input and, whenever part of it matches the body of a production, reduce it to
that production’s head:
a + b ∗ c
T + b ∗ c
E + b ∗ c
E + T ∗ c
E ∗ c
E ∗ T
E
S
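The reduction sequence above can be replayed with a small sketch. It is a deliberately naive shift-reduce loop, not a real LR parser: it assumes that reducing greedily (trying longer production bodies first) happens to find the right handles for this particular grammar, and it writes each of a, b, c as the token id. The names PRODUCTIONS and shift_reduce are illustrative only.

```python
# Productions from the slide; longer bodies are listed first so that
# E -> E + T is preferred over E -> T when both match the stack top.
PRODUCTIONS = [
    ("E", ("E", "+", "T")),
    ("E", ("E", "*", "T")),
    ("E", ("T",)),
    ("T", ("id",)),
]


def shift_reduce(tokens):
    """Return the list of sentential forms produced by each reduction."""
    stack, rest, trace = [], list(tokens), []
    while True:
        for head, body in PRODUCTIONS:
            n = len(body)
            if tuple(stack[-n:]) == body:      # body found on top of the stack
                stack[-n:] = [head]            # reduce it to the head
                trace.append(" ".join(stack + rest))
                break
        else:
            if not rest:                       # nothing to reduce or shift
                break
            stack.append(rest.pop(0))          # shift the next token
    if stack == ["E"]:                         # final step: S -> E
        trace.append("S")
    return trace


for form in shift_reduce(["id", "+", "id", "*", "id"]):
    print(form)
```

A greedy strategy like this fails on many grammars; LR parsers use a table to decide when to shift and when to reduce.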
Top-Down Parsing
The parse tree is constructed from the top down, from left to right.
Pick a production and try to match the input:
• A bad “pick” may require backtracking
• Some grammars are backtrack-free (predictive parsing)
Terminals are seen in order of appearance in the token stream:
t2 t5 t6 t8 t9
Methods: Recursive descent and Predictive parsing
Top-Down Parsing Methods
Top-Down Parsing
The grammar in the figure generates a subset of the statements of
C or Java.
Top-Down Parsing
The top-down construction of the parse tree.
Top-Down Parsing
Top-down construction of the parse tree starts with the
root, labeled with the starting nonterminal stmt, and repeatedly
performs the following two steps:
1: At node N, labeled with nonterminal A, select one of the
productions for A and construct children at N for the symbols
in the production body.
2: Find the next node at which a subtree is to be constructed,
typically the leftmost unexpanded nonterminal of the tree.
The key is picking the right production in step 1. That choice
should be guided by the input string.
The current terminal being scanned in the input is frequently
referred to as the lookahead symbol.
Parse Tree Construction
Given the input string: for ( ; expr ; expr ) other
Top-Down Parsing - Example
Expression grammar and the input x - 2 * y
0 Goal → Expr
1 Expr → Expr + Term
2 | Expr - Term
3 | Term
4 Term → Term * Factor
5 | Term / Factor
6 | Factor
7 Factor → ( Expr )
8 | Number
9 | id
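As a worked illustration, one leftmost derivation of x - 2 * y under this grammar is sketched below. It assumes the lexer classifies x and y as id and 2 as Number; the rule numbers refer to the numbered productions above.

```latex
\begin{align*}
\mathit{Goal} &\Rightarrow \mathit{Expr} && \text{(rule 0)}\\
 &\Rightarrow \mathit{Expr} - \mathit{Term} && \text{(rule 2)}\\
 &\Rightarrow \mathit{Term} - \mathit{Term} && \text{(rule 3)}\\
 &\Rightarrow \mathit{Factor} - \mathit{Term} && \text{(rule 6)}\\
 &\Rightarrow \mathit{id} - \mathit{Term} && \text{(rule 9)}\\
 &\Rightarrow \mathit{id} - \mathit{Term} * \mathit{Factor} && \text{(rule 4)}\\
 &\Rightarrow \mathit{id} - \mathit{Factor} * \mathit{Factor} && \text{(rule 6)}\\
 &\Rightarrow \mathit{id} - \mathit{Number} * \mathit{Factor} && \text{(rule 8)}\\
 &\Rightarrow \mathit{id} - \mathit{Number} * \mathit{id} && \text{(rule 9)}
\end{align*}
```

A top-down parser has to discover this sequence of choices from the input alone, which is where the choice of production at each step becomes the central problem.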
Recursive Descent Parsing
A top-down method of syntax analysis.
Uses recursive procedures to model the parse tree being
constructed.
For each nonterminal in the grammar, a procedure that
parses that nonterminal is written.
Each of these procedures may read input, match terminal
symbols, or call other procedures to read input and match
terminals in the right-hand side of a production.
Recursively parses the input to build a parse tree, which may
or may not require backtracking.
It is called recursive because the procedures call one another recursively,
mirroring the recursive structure of the context-free grammar.
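A minimal sketch of the one-procedure-per-nonterminal idea follows. It assumes the right-recursive expression grammar that appears later in these slides (E → T E’, E’ → + T E’ | ε, T → F T’, T’ → ∗ F T’ | ε, F → ( E ) | id | num) and a token list already produced by a lexer. The class and method names are illustrative, and the parser only recognizes input; it does not build a tree.

```python
class Parser:
    """Recursive-descent recognizer: one method per nonterminal."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" marks the end of input
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.peek()!r}")
        self.pos += 1

    def parse_E(self):            # E -> T E'
        self.parse_T()
        self.parse_E_prime()

    def parse_E_prime(self):      # E' -> + T E' | epsilon
        if self.peek() == "+":
            self.match("+")
            self.parse_T()
            self.parse_E_prime()

    def parse_T(self):            # T -> F T'
        self.parse_F()
        self.parse_T_prime()

    def parse_T_prime(self):      # T' -> * F T' | epsilon
        if self.peek() == "*":
            self.match("*")
            self.parse_F()
            self.parse_T_prime()

    def parse_F(self):            # F -> ( E ) | id | num
        if self.peek() == "(":
            self.match("(")
            self.parse_E()
            self.match(")")
        elif self.peek() in ("id", "num"):
            self.match(self.peek())
        else:
            raise SyntaxError(f"unexpected token {self.peek()!r}")


def accepts(tokens):
    """Return True if the token list is a sentence of the grammar."""
    parser = Parser(tokens)
    try:
        parser.parse_E()
        parser.match("$")
        return True
    except SyntaxError:
        return False


print(accepts(["id", "+", "num", "*", "(", "id", ")"]))
print(accepts(["id", "+"]))
```

Because each procedure decides what to do by looking only at the current token, this particular sketch never backtracks.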
Backtracking
If one alternative of a production fails, the syntax analyzer
restarts the process using a different alternative for the same nonterminal.
This technique may process the input string more than once
to determine the right production.
It would be better if we always knew the correct action to take.
Backtracking is time-consuming and therefore inefficient.
Thus a special case, predictive parsing, was developed, in which no
backtracking is required.
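To make the cost of backtracking concrete, here is a small sketch of a backtracking recognizer. It tries each alternative in turn and, on failure, falls back to the same input position to try the next one. The grammar A → a b1 | a b2 | a b3 is borrowed from the left-factoring slides; the function names are illustrative, and the approach assumes a grammar without left recursion (otherwise it would recurse forever).

```python
def match(symbols, tokens, pos, grammar):
    """Yield every input position reachable after matching `symbols` from `pos`."""
    if not symbols:
        yield pos
        return
    head, rest = symbols[0], symbols[1:]
    if head in grammar:                          # nonterminal: try each alternative
        for body in grammar[head]:
            for mid in match(body, tokens, pos, grammar):
                yield from match(rest, tokens, mid, grammar)
    elif pos < len(tokens) and tokens[pos] == head:
        yield from match(rest, tokens, pos + 1, grammar)   # terminal matched


def accepts(tokens, grammar, start):
    """True if some sequence of choices consumes the whole input."""
    return any(end == len(tokens)
               for end in match((start,), tokens, 0, grammar))


GRAMMAR = {"A": [("a", "b1"), ("a", "b2"), ("a", "b3")]}

print(accepts(["a", "b3"], GRAMMAR, "A"))
```

For the input a b3 the token a is scanned three times, once per alternative, before the third alternative succeeds.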
Left Recursion
A production is left recursive if the same nonterminal that
appears on the LHS appears first on the RHS of the
production.
Recursive descent parsers cannot deal with left recursion.
However, we can rewrite the grammar to represent the same
language without the need for left recursion.
Removing Left Recursion
In general, we can eliminate all immediate left recursion of the form:
A → A x | y
by changing the grammar to:
A → y A’
A’ → x A’ | ε
Not all left recursion is immediate; it may be hidden across multiple
production rules:
A → B C | D
B → A E | F
There is a general approach for removing indirect left
recursion, but we will not worry about it in this course.
Removing Left Recursion
Given the grammar:
Fee → Fee α | β
Introduce a new nonterminal, Fee’, and transfer the recursion
onto Fee’.
Fee → β Fee’
Fee’ → α Fee’
Add the rule Fee’ → ε, where ε represents the empty string
Final grammar:
Fee → β Fee’
Fee’ → α Fee’ | ε
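The transformation can be sketched as a small function. This is an illustration under stated assumptions, not a general tool: it handles only immediate left recursion for a single nonterminal, represents each production body as a list of symbols with [] standing for the empty string, and names the new nonterminal by appending a prime. The function name is hypothetical.

```python
def eliminate_immediate_left_recursion(head, alternatives):
    """Rewrite  head -> head x | y  as  head -> y head',  head' -> x head' | epsilon.

    `alternatives` is a list of bodies; each body is a list of symbols.
    Returns a dict mapping each nonterminal to its new list of bodies.
    """
    new_head = head + "'"
    # Tails x of the left-recursive alternatives  head -> head x
    recursive = [body[1:] for body in alternatives if body[:1] == [head]]
    # Alternatives y that do not start with head
    others = [body for body in alternatives if body[:1] != [head]]
    if not recursive:
        return {head: alternatives}            # nothing to do
    return {
        head: [body + [new_head] for body in others],
        new_head: [tail + [new_head] for tail in recursive] + [[]],
    }


print(eliminate_immediate_left_recursion("Fee", [["Fee", "alpha"], ["beta"]]))
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
```

The second call applies the same rewrite to E → E + T | T, giving E → T E’ and E’ → + T E’ | ε.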
Left Factoring
Another useful grammar transformation technique used in
parsing.
It consists of “factoring out” prefixes that are common to
two or more productions.
For example, from:
A → α β | α γ
to:
A → α A’
A’ → β | γ
Left Factoring
Given the grammar: A → a b1 | a b2 | a b3
All alternatives share the common prefix a, so whichever
one we choose, we cannot be sure that we will not need
to backtrack.
The grammar is nondeterministic in this sense: seeing a alone does not
tell the parser which alternative leads to the desired
string.
We rewrite the grammar so that the choice is deferred until after the
common prefix; it then generates the same strings
without any need for backtracking.
The grammar becomes:
A → a A’
A’ → b1 | b2 | b3
Left Factoring
Suppose we have the grammar S → a b S | a S b
Rewrite the productions as:
S → a S’
S’ → b S | S b
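Both examples can be reproduced with a short sketch. It is a simplified version of left factoring: it only factors a prefix shared by all alternatives of one nonterminal, whereas the general technique groups alternatives by common prefix and may need to be repeated. Bodies are lists of symbols and the function name is illustrative.

```python
def left_factor(head, alternatives):
    """Factor out the longest prefix common to ALL alternatives of `head`.

    Returns a dict mapping each nonterminal to its new list of bodies.
    """
    prefix = []
    for symbols in zip(*alternatives):         # walk the bodies in lockstep
        if len(set(symbols)) != 1:             # bodies disagree here: stop
            break
        prefix.append(symbols[0])
    if not prefix or len(alternatives) < 2:
        return {head: alternatives}            # nothing to factor
    new_head = head + "'"
    return {
        head: [prefix + [new_head]],
        new_head: [body[len(prefix):] for body in alternatives],
    }


print(left_factor("A", [["a", "b1"], ["a", "b2"], ["a", "b3"]]))
print(left_factor("S", [["a", "b", "S"], ["a", "S", "b"]]))
```

When only some of the alternatives share a prefix, this simplified version leaves the grammar unchanged; handling that case requires factoring each group of alternatives separately.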
Question
What is the difference between left factoring and left recursion?
What is the result of left factoring the grammar below?
S → if E then S | if E then S else S | a
E → b
Predictive Parsing
A simple form of recursive-descent parsing that does not
require any backtracking.
The parser can “predict” which production to use.
A predictive parser always knows which production to use, so it
avoids backtracking.
The lookahead symbol unambiguously determines the flow of
control through the procedure body for each nonterminal.
The sequence of procedure calls during the analysis of an
input string implicitly defines a parse tree for the input, and
can be used to build an explicit parse tree, if desired
Predictive Parsing
Example: for the productions
stmt -> if ( expr ) stmt else stmt
| while ( expr ) stmt
| for ( stmt expr stmt ) stmt
A recursive descent parser would always know which production to
use, depending on the input token.
Predictive Parsing
If it picks the wrong production, a top-down parser may have to backtrack.
The alternative is to look ahead in the input and use context to pick correctly.
Predictive parsers accept LL(k) grammars:
The first L means “left-to-right” scan of input
The second L means “leftmost derivation”
k means “predict based on k tokens of lookahead”
In practice, LL(1) is used.
LL(1) Grammars
A class of grammars that can be used to construct predictive
parsers, that is, recursive-descent parsers needing no
backtracking.
The first “L” in LL(1) stands for scanning the input from left
to right.
The second “L” stands for producing a leftmost derivation.
The “1” stands for using one input symbol of lookahead at each
step to make parsing action decisions.
In recursive descent, for each non-terminal and input token
there may be a choice of production.
LL(1) means that for each non-terminal and token there is
at most one applicable production.
Backtrack Free Grammars
Need to formalize the property that makes the right-recursive
expression grammar backtrack free.
At each point in the parse, the choice of an expansion is
obvious because each alternative for the leftmost nonterminal
leads to a distinct terminal symbol.
Comparing the next word in the input stream against those
choices reveals the correct expansion.
The intuition is clear, but formalizing it will require some
notation
FIRST and FOLLOW Sets
FIRST and FOLLOW computations for a grammar help to
construct predictive parsing tables.
Predictive parsing tables make explicit the choice of
production during top-down parsing.
The construction of both top-down and bottom-up parsers is
aided by two functions, FIRST and FOLLOW, associated with
a grammar G.
During top-down parsing, FIRST and FOLLOW allow us to
choose which production to apply, based on the next input
symbol.
During panic-mode error recovery, sets of tokens produced by
FOLLOW can be used as synchronizing tokens.
To build the parsing table, we need the notion of nullability
and the two functions FIRST and FOLLOW.
FIRST and FOLLOW Sets
Given a grammar G, we define the functions FIRST and
FOLLOW on the symbols of G.
For a grammar symbol α,
FIRST(α) is the set of terminal symbols that can appear as
the first word in some string derived from α.
FOLLOW(α) is the set of all terminals that can appear immediately after α in some
derivation.
Nullability
A nonterminal A is nullable if A ⇒∗ ε
Clearly, A is nullable if it has a production A → ε
But A is also nullable if there are, for example, productions
A → B C
B → A | a C | ε
C → a B | C b | ε
Nullability
In other words, A is nullable if there is a production A → ε,
or there is a production A → B1 B2 ... Bn, where B1, B2, ...,
Bn are all nullable.
Nullability
In the grammar
E → T E’
E’ → + T E’ | ε
T → F T’
T’ → ∗ F T’ | ε
F → ( E ) | id | num
E’ and T’ are nullable.
E, T, and F are not nullable
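The nullability rule can be computed by iterating until nothing changes. The sketch below assumes a grammar stored as a dictionary from each nonterminal to a list of bodies, with the empty tuple standing for ε; the names GRAMMAR and nullable_set are illustrative.

```python
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],          # () is the empty string
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",), ("num",)],
}


def nullable_set(grammar):
    """Return the set of nonterminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:                          # repeat until a fixed point is reached
        changed = False
        for head, bodies in grammar.items():
            if head in nullable:
                continue
            # head is nullable if every symbol of some body is nullable;
            # all() of an empty body is True, which covers head -> epsilon.
            if any(all(sym in nullable for sym in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable


print(sorted(nullable_set(GRAMMAR)))
```

Terminals are never added to the set, so any body containing a terminal can never make its head nullable.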
Nullability
Nonterminal Nullable?
E ??
E’ ??
T ??
T’ ??
F ??
FIRST Set
For a grammar symbol X, FIRST(X) is defined as follows:
For every terminal X, FIRST(X) = {X}.
For every nonterminal X, if X → Y1 Y2 ... Yn is a
production, then FIRST(Y1) ⊆ FIRST(X).
Furthermore, if Y1, Y2, ..., Yk are all nullable (k < n), then FIRST(Yk+1)
⊆ FIRST(X).
FIRST Set
We are concerned with FIRST(X) only for the nonterminals of
the grammar.
FIRST(X) for terminals is trivial.
According to the definition, to determine FIRST(A), we must
inspect all productions that have A on the left.
FIRST Set Example
Find FIRST(E) given the grammar below
E → T E’
E’ → + T E’ | ε
T → F T’
T’ → ∗ F T’ | ε
F → ( E ) | id | num
FIRST Set Example
Find FIRST(E).
E occurs on the left in only one production
E → T E’
Therefore, FIRST(T) ⊆ FIRST(E).
Furthermore, T is not nullable. Therefore, FIRST(E) =
FIRST(T).
We have yet to determine FIRST(T).
FIRST Set Example
Find FIRST(T).
T occurs on the left in only one production
T → F T’
Therefore, FIRST(F) ⊆ FIRST(T).
Furthermore, F is not nullable. Therefore, FIRST(T) =
FIRST(F).
We have yet to determine FIRST(F).
FIRST Set Example
Find FIRST(F).
FIRST(F) = {(, id, num}.
Therefore, FIRST(E) = {(, id, num}.
FIRST(T) = {(, id, num}.
FIRST Set Example
Find FIRST(E’).
FIRST(E’) = {+}.
Find FIRST(T’).
FIRST(T’) = {*}.
Summary
Nonterminal  Nullable  FIRST
E            No        {(, id, num}
E’           Yes       {+}
T            No        {(, id, num}
T’           Yes       {*}
F            No        {(, id, num}
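The FIRST sets in the summary can be reproduced with a fixed-point computation that applies the two inclusion rules repeatedly. This is a sketch under the same assumed representation as before (dictionary of nonterminal to bodies, empty tuple for ε); following the slides, ε is tracked separately through the nullable set rather than placed inside FIRST. The function names are illustrative.

```python
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",), ("num",)],
}


def nullable_set(grammar):
    """Nonterminals that can derive the empty string."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head not in nullable and any(
                all(sym in nullable for sym in body) for body in bodies
            ):
                nullable.add(head)
                changed = True
    return nullable


def first_sets(grammar, nullable):
    """FIRST(X) for every nonterminal X, as sets of terminals."""
    first = {head: set() for head in grammar}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for sym in body:
                    # FIRST of a terminal is just that terminal.
                    addition = first[sym] if sym in grammar else {sym}
                    if not addition <= first[head]:
                        first[head] |= addition
                        changed = True
                    if sym not in nullable:
                        break               # later symbols cannot come first
    return first


FIRST = first_sets(GRAMMAR, nullable_set(GRAMMAR))
for nonterminal, terminals in FIRST.items():
    print(nonterminal, sorted(terminals))
```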
FOLLOW Set
For a grammar symbol X, FOLLOW(X) is defined as follows
If S is the start symbol, then $ ∈ FOLLOW(S).
If A → α B β is a production, then FIRST(β) ⊆ FOLLOW(B).
If A → α B is a production, or A→ α B β is a production and
β is nullable, then FOLLOW(A) ⊆ FOLLOW(B).
FOLLOW Set
We are concerned about FOLLOW(X) only for the
nonterminals of the grammar.
According to the definition, to determine FOLLOW(A), we
must inspect all productions that have A on the right.
FOLLOW Set Example
Let the grammar be
E → T E’
E’ → + T E’ | ε
T → F T’
T’ → ∗ F T’ | ε
F → ( E ) | id | num
FOLLOW Set Example
Find FOLLOW(E)
E is the start symbol, therefore $ ∈ FOLLOW(E).
E occurs on the right in only one production: F → ( E ).
There E is followed by ), so ) ∈ FOLLOW(E). Therefore FOLLOW(E) = {$, )}.
FOLLOW Set Example
Find FOLLOW(E’).
E’ occurs on the right in two productions.
E → T E’
E’ → + T E’
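In both productions E’ is the last symbol of the body, so only FOLLOW(E) (and FOLLOW(E’) itself) contributes.
Therefore, FOLLOW(E’) = FOLLOW(E) = {$, )}.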
FOLLOW Set Example
Find FOLLOW(T).
T occurs on the right in two productions.
E → T E’
E’ → + T E’
Therefore, FOLLOW(T) contains FIRST(E’) = {+}
However, E’ is nullable, therefore it also contains FOLLOW(E)
= {$, )} and FOLLOW(E’) = {$, )}
Therefore, FOLLOW(T) = {+, $, )}
FOLLOW Set Example
Find FOLLOW(T’).
T’ occurs on the right in two productions.
T → F T’
T’ → * F T’
Therefore, FOLLOW(T’) = FOLLOW(T) = {$, ), +}
FOLLOW Set Example
Find FOLLOW(F).
F occurs on the right in two productions.
T → F T’
T’ → * F T’
Therefore, FOLLOW(F) contains FIRST(T’) = {*}
However, T’ is nullable, therefore it also contains
FOLLOW(T) = {+, $, )} and FOLLOW(T’) = {$, ), +}
Therefore, FOLLOW(F) = {*, $, ), +}.
Summary
Nonterminal  Nullable  FIRST          FOLLOW
E            No        {(, id, num}   {$, )}
E’           Yes       {+}            {$, )}
T            No        {(, id, num}   {$, ), +}
T’           Yes       {*}            {$, ), +}
F            No        {(, id, num}   {*, $, ), +}
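The FOLLOW column can be checked with another fixed-point sketch that applies the three FOLLOW rules until no set grows. As before, the grammar representation and all function names are assumptions made for illustration; nullable_set and first_sets are repeated so that the block runs on its own.

```python
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",), ("num",)],
}


def nullable_set(grammar):
    nullable, changed = set(), True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head not in nullable and any(
                all(s in nullable for s in body) for body in bodies
            ):
                nullable.add(head)
                changed = True
    return nullable


def first_sets(grammar, nullable):
    first, changed = {h: set() for h in grammar}, True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for sym in body:
                    addition = first[sym] if sym in grammar else {sym}
                    if not addition <= first[head]:
                        first[head] |= addition
                        changed = True
                    if sym not in nullable:
                        break
    return first


def follow_sets(grammar, start, nullable, first):
    follow = {h: set() for h in grammar}
    follow[start].add("$")                       # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:
                        continue                 # FOLLOW is kept only for nonterminals
                    addition = set()
                    for nxt in body[i + 1:]:     # rule 2: FIRST of what comes after
                        addition |= first[nxt] if nxt in grammar else {nxt}
                        if nxt not in nullable:
                            break
                    else:                        # rule 3: the rest is empty or nullable
                        addition |= follow[head]
                    if not addition <= follow[sym]:
                        follow[sym] |= addition
                        changed = True
    return follow


NULLABLE = nullable_set(GRAMMAR)
FOLLOW = follow_sets(GRAMMAR, "E", NULLABLE, first_sets(GRAMMAR, NULLABLE))
for nonterminal, terminals in FOLLOW.items():
    print(nonterminal, sorted(terminals))
```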
Example 2
Consider the grammar
Z → d | X Y Z
Y → c | ε
X → Y | a
Find:
Nullable Set?
Nonnullable Set?
FIRST(X), FIRST(Y), FIRST(Z)?
FOLLOW(X), FOLLOW(Y), FOLLOW(Z)?
Example 2
Nullable Set: {X, Y}
Nonnullable Set: {Z}
FIRST(X) = {a, c}
FIRST(Y) = {c}
FIRST(Z) = {d, a, c}
FOLLOW(X) = {a, c, d}
FOLLOW(Y) = {a, c, d}
FOLLOW(Z) = {$}
Exercise
Consider the grammar
E → T X
X → + E | ε
T → int Y | ( E )
Y → ∗ T | ε
Find:
Nullable Set?
Nonnullable Set?
FIRST(E), FIRST(X), FIRST(Y), FIRST(T)?
FOLLOW(E), FOLLOW(X), FOLLOW(Y), FOLLOW(T)?
Predictive Parsing Table
Grammar:
S → E $
E → T E’
E’ → ε | "+" T E’
T → F T’
T’ → ε | "*" F T’
F → id | num | "(" E ")"
Predictive Parsing Table
Nonterminal  Nullable  FIRST       FOLLOW
S            False     ( id num    (empty)
E            False     ( id num    ) $
E’           True      +           ) $
T            False     ( id num    ) + $
T’           True      *           ) + $
F            False     ( id num    ) * + $
Predictive Parsing Table
Rows: Non-terminals
Columns: Terminals
Entries: Productions
Enter production X → α in row X, column t for each t in
FIRST(α). If α is nullable, also enter the production in row X, column
t for each t in FOLLOW(X).
      +               *               id         num        (               )         $
S                                     S → E $    S → E $    S → E $
E                                     E → T E’   E → T E’   E → T E’
E’    E’ → "+" T E’                                                         E’ → ε    E’ → ε
T                                     T → F T’   T → F T’   T → F T’
T’    T’ → ε          T’ → "*" F T’                                         T’ → ε    T’ → ε
F                                     F → id     F → num    F → "(" E ")"
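The table-filling rule, and the way a parser then uses the table, can be sketched as follows. The Nullable, FIRST, and FOLLOW values are copied from the slide rather than recomputed; the data layout, the function names, and the stack-based driver loop are assumptions made for illustration. The input token list is assumed to end with the end marker $.

```python
GRAMMAR = [
    ("S",  ("E", "$")),
    ("E",  ("T", "E'")),
    ("E'", ()),                               # E' -> epsilon
    ("E'", ("+", "T", "E'")),
    ("T",  ("F", "T'")),
    ("T'", ()),                               # T' -> epsilon
    ("T'", ("*", "F", "T'")),
    ("F",  ("id",)),
    ("F",  ("num",)),
    ("F",  ("(", "E", ")")),
]
NULLABLE = {"E'", "T'"}
FIRST = {
    "S": {"(", "id", "num"}, "E": {"(", "id", "num"}, "E'": {"+"},
    "T": {"(", "id", "num"}, "T'": {"*"}, "F": {"(", "id", "num"},
}
FOLLOW = {
    "S": set(), "E": {")", "$"}, "E'": {")", "$"},
    "T": {")", "+", "$"}, "T'": {")", "+", "$"}, "F": {")", "*", "+", "$"},
}


def first_of_string(body):
    """Return (FIRST set, nullable?) for a sequence of grammar symbols."""
    result = set()
    for sym in body:
        result |= FIRST.get(sym, {sym})       # a terminal's FIRST is itself
        if sym not in NULLABLE:
            return result, False
    return result, True


def build_table():
    """Map (nonterminal, lookahead terminal) to the production body to use."""
    table = {}
    for head, body in GRAMMAR:
        lookaheads, body_nullable = first_of_string(body)
        if body_nullable:
            lookaheads |= FOLLOW[head]
        for t in lookaheads:
            if (head, t) in table:
                raise ValueError(f"conflict at ({head}, {t}): not LL(1)")
            table[(head, t)] = body
    return table


def parse(tokens, table, start="S"):
    """Table-driven predictive recognizer; returns True if the input is accepted."""
    stack, pos = [start], 0
    while stack:
        top = stack.pop()
        lookahead = tokens[pos] if pos < len(tokens) else None
        if top in FIRST:                      # nonterminal: consult the table
            body = table.get((top, lookahead))
            if body is None:
                return False                  # empty entry means a syntax error
            stack.extend(reversed(body))      # leftmost symbol ends up on top
        elif top == lookahead:
            pos += 1                          # terminal matches the input
        else:
            return False
    return pos == len(tokens)


TABLE = build_table()
print(parse(["id", "+", "num", "*", "id", "$"], TABLE))
print(parse(["id", "+", "$"], TABLE))
```

No cell receives two productions for this grammar, which is exactly the LL(1) condition; the ValueError branch is where a non-LL(1) grammar would be detected.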
Next Topic
Read about:
- Error Handling
- Bottom-up Parsing