Compiler Design Dire Dawa University [DDIT]
Chapter Three
Syntax Analysis or Parsing
Syntax analysis is the second phase of the compiler. It takes as input the tokens produced by the lexical analyzer and generates a syntax tree or parse tree.
Syntax is the way in which tokens are put together to form expressions, statements, or blocks of statements; it comprises the rules governing the formation of statements in a programming language. Parsing is the process of determining whether a string of tokens can be generated by a grammar. A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or not. If it does, the syntax analyzer creates a parse tree for the given program; otherwise, the parser gives error messages.
The syntax of a programming language is usually given by the grammar rules of a context free grammar
(CFG).
A CFG:
Gives a precise syntactic specification of a programming language.
A grammar can be directly converted into a parser by some tools (e.g., yacc).
The Role of the Parser
The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and verifies that the string
can be generated by the grammar for the source language and creates the syntactic structure (parse tree) of the
given source program. It reports any syntax errors in the program. It also recovers from commonly occurring
errors so that it can continue processing its input.
The parser can be categorized into two groups:
1. Top-down parser: The parse tree is created top to bottom, starting from the root to leaves.
2. Bottom-up parser: known as shift-reduce parsing, the parse tree is created bottom to top, starting from
the leaves to root.
Prepared by: Andualem T. Page 1
Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time).
Efficient top-down and bottom-up parsers can be implemented only for suitable subclasses of context-free grammars:
1. LL grammars for top-down parsing
2. LR grammars for bottom-up parsing
Functions of the parser:
1. It verifies the structure generated by the tokens based on the grammar.
2. It constructs the parse tree.
3. It reports the errors.
4. It performs error recovery.
Question: Which errors can the parser not detect?
Parser cannot detect errors such as:
1. Variable re-declaration
2. Use of a variable before initialization
3. Data type mismatch for an operation
The above issues are handled by the semantic analysis phase.
Syntax Error Handling
If a compiler had to process only correct programs, its design and implementation would be simplified greatly.
However, a compiler is expected to assist the programmer in locating and tracking down errors that
unavoidably (inevitably) creep into programs, despite the programmer's best efforts. In case of any syntax
errors in the program, the parser tries to report as many errors as possible.
Error reporting and recovery are a very important part of the syntax analyzer. The error handler in the parser
has the following goals:
It should report the presence of errors clearly and accurately.
It should recover from each error quickly enough to be able to detect subsequent errors.
It should not significantly slow down the processing of correct programs.
Common programming errors can occur at many different levels.
Example:
1. Lexical: includes misspelling an identifier, keyword or operator. E.g., ebigin instead of begin
2. Syntactic: includes an arithmetic expression with unbalanced parentheses, misplaced semicolons, extra or missing braces { }, or a case statement without an enclosing switch.
3. Semantic: include type mismatches between operators and operands. A return statement in a Java
method with result type void. Operator applied to incompatible operand
4. Logical: can be anything from incorrect reasoning. E.g., assignment operator = instead of the
comparison operator == and an infinitely recursive call.
5. Run-time: an operation that is impossible to carry out, e.g., division by zero.
Syntax Error-Recovery Strategies
Once an error is detected, how should an error handler recover from an error?
The different strategies that a parser uses to recover from a syntactic error are:
1. Panic-mode
2. Phrase-level (Perform local correction )
3. Error Productions
4. Global-correction (not practical )
1. Panic mode: It is the simplest, most popular method.
On discovering an error, the parser discards input symbols one at a time until one of a designated set of
synchronizing tokens is found.
The synchronizing tokens are those whose roles in the source program are clear and unambiguous; with this strategy the parser is guaranteed not to go into an infinite loop.
Example: consider the erroneous expression
(2 + + 8) + 9
Recovery using panic mode: skip the extra + ahead to the next integer (8) and then continue. Although panic mode skips a considerable amount of input without checking it for additional errors, it has the advantage of simplicity.
2. Phrase level recovery: On discovering an error, a parser may perform local correction on the remaining
input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue.
A typical local correction is, example in C++,
To replace a comma by a semicolon
To delete an extraneous semicolon, or
To insert a missing semicolon
The choice of the local correction is left to the compiler designer.
Example (C++): the erroneous statement
int x: // the statement must be terminated by ;
After recovery from the error:
int x ; // the compiler deletes : and replaces it by ; automatically
3. Error Productions: A parser constructed from a grammar augmented by these error productions detects
the expected errors when an error production is used during parsing. The parser can then generate
appropriate error diagnostics about the erroneous construct that has been recognized in the input.
4. Global-correction: Choosing minimal sequence of changes to obtain a globally least-cost correction
Given an incorrect input string x and grammar G, certain algorithms can be used to find a parse tree for a
string y, such that the number of insertions, deletions and changes of tokens is as small as possible.
However, these methods are in general too costly in terms of time and space.
Context free grammar (CFG)
A context-free grammar is a specification for the syntactic structure of a programming language.
It has 4-tuples: G = (T, N, P, S) where
T is a finite set of terminals (a set of tokens); are the basic symbols from which strings are formed.
N is a finite set of non-terminals (syntactic variables); these are the syntactic variables that denote sets of strings. They help to define the language generated by the grammar.
P is a finite set of productions of the form
A→α where A is non-terminal and α is a string of terminals and non-terminals
(including the empty string)
It specifies the manner in which terminals and non-terminals can be combined to form strings. Each
production consists of a non-terminal, followed by an arrow, followed by a string of non-terminals and
terminals
S∈ N is a designated start symbol (one of the non- terminal symbols): One non-terminal in the grammar
is denoted as the “Start-symbol” and the set of strings it denotes is the language defined by the grammar
Example: grammar for simple arithmetic expressions
expression → expression + term
expression → expression - term
expression → term
term → term * factor
term → term / factor
term → factor
factor → ( expression )
factor → id
Terminal symbols: id + - * / ( )
Non-terminals: expression, term, factor
Start symbol: expression
In compact notation, the corresponding grammar and its left-recursion-free form are:
E → E + T | T        E → TE'
T → T * F | F        E' → +TE' | ε
F → (E) | id         T → FT'
                     T' → *FT' | ε
                     F → (E) | id
Notational Conventions Used
Terminal symbols:
We use lowercase letters such as a, b, c, …
Operator symbols such as +, *, <, and so on.
Punctuation symbols such as the comma, parentheses, brackets, and so on.
Digits 0, 1, …, 9.
Boldface strings such as id, num, each of which represents a single terminal symbol, and
All keywords: if, else, …
Non-terminal symbols:
Uppercase letters, such as A, B, C, …
Lowercase, italic names such as expr, stmt, …
Note: Greek letters such as α, β, γ represent strings of grammar symbols. Thus a generic production rule can have the form A → α, where A ∈ N (a single variable) and α ∈ (N ∪ T)* (a string of variables and/or terminals).
Derivations
Derivation is a process that generates a valid string with the help of the grammar by replacing a non-terminal on the left-hand side with the string on the right-hand side of a production (LHS → RHS). It is a sequence of applications of production rules that yields the input string.
A derivation begins with a single structure name and ends with a string of token symbols. At each step in a
derivation, a single replacement is made using one choice from a grammar rule.
We write this as S ⇒* β, where ⇒* denotes zero or more derivation steps.
Example 1: the grammar G with the single grammar rule E → (E) | a
This grammar generates the language
L(G) = { a, (a), ((a)), (((a))), … } = { (^n a )^n | n ≥ 0 }
Show Derivation for ((a)):
E => (E) =>((E)) =>((a))
Example 2: E → E + E | E - E | E * E | E / E | -E
E → ( E )
E → id
Show a derivation for -(id+id) from the above G:
• E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id) OR
• E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
N.B.: E ⇒ E + E means that E + E is derived from E; we can replace E by E + E.
At each derivation step, we can choose any of the non-terminals in the sentential form of G for the replacement.
Types of derivation
There are two types of derivations:
a. Left most derivation
b. Right most derivation
To understand how parsers work, we shall consider derivations in which the non-terminal to be replaced at
each step is chosen as follows:
a. If we always choose the leftmost non-terminal in each derivation step, the derivation is called a leftmost derivation (LMD).
Example: E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
b. If we always choose the rightmost non-terminal in each derivation step, the derivation is called a rightmost derivation (RMD).
Example: E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
We will see that a top-down parser tries to find the leftmost derivation of the given source program, while a bottom-up parser tries to find the rightmost derivation of the given source program in reverse order.
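The leftmost-replacement policy can be illustrated with simple string rewriting. A toy Python sketch (the helper name and list encoding are our own; the sequence of production choices that derives -(id+id) is hard-coded):

```python
# Toy leftmost derivation for the grammar  E -> -E | (E) | E+E | id.
# At every step we expand the LEFTMOST occurrence of the non-terminal "E".

def expand_leftmost(sentential, lhs, rhs):
    """Replace the leftmost occurrence of non-terminal `lhs` with `rhs`."""
    i = sentential.index(lhs)               # leftmost occurrence
    return sentential[:i] + rhs + sentential[i + 1:]

steps = [["E"]]
for rhs in [["-", "E"], ["(", "E", ")"], ["E", "+", "E"], ["id"], ["id"]]:
    steps.append(expand_leftmost(steps[-1], "E", rhs))

for s in steps:
    print(" ".join(s))
# The last line printed is: - ( id + id )
```

Feeding the same rule choices but always expanding the rightmost "E" would reproduce the rightmost derivation shown above.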
Strings that appear in a leftmost derivation are called left sentential forms.
Strings that appear in a rightmost derivation are called right sentential forms.
Sentential forms: Given a grammar G with start symbol S, if S ⇒* α, where α may contain non-terminals or terminals, then α is called a sentential form of G.
Parse Trees
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived from the
start symbol. The start symbol of the derivation becomes the root of the parse tree.
A parse tree corresponding to a derivation is a labeled tree in which:
the interior nodes are labeled by non-terminals,
the leaf nodes are labeled by terminals, and
The children of each internal node represent the replacement of the associated non-terminal in one step of the derivation.
Parse tree and Derivation
Ambiguity
A grammar that produces more than one parse tree for some sentence is called an ambiguous grammar. Equivalently, it produces more than one leftmost derivation or more than one rightmost derivation for the same sentence (input).
A CFG G is ambiguous if some string w ∈ L(G) has two or more leftmost derivations, rightmost derivations, or parse trees from the start symbol. We should remove the ambiguity in the grammar during the design phase of the compiler. An unambiguous grammar should be written to eliminate the ambiguity. No method can detect and remove ambiguity automatically; it is removed either by re-writing the whole grammar without ambiguity, or by setting and following associativity and precedence constraints.
Example: The arithmetic expression grammar
E → E + E | E * E | ( E ) | id
Permits two distinct leftmost derivations for the sentence id + id * id:
(a) E ⇒ E + E          (b) E ⇒ E * E
      ⇒ id + E               ⇒ E + E * E
      ⇒ id + E * E           ⇒ id + E * E
      ⇒ id + id * E          ⇒ id + id * E
      ⇒ id + id * id         ⇒ id + id * id
Example: Construct parse trees for the expression id + id * id using the grammar
E → E + E | E * E | ( E ) | - E | id
According to the grammar, both parse trees are correct. But a grammar that produces more than one parse tree for some input sentence is said to be an ambiguous grammar.
A CFG G may not be suitable for a particular parser. This problem happens if:
1. G is an ambiguous grammar
2. G has a left-recursive rule
3. G has a common prefix in two or more right-hand sides of a single production rule,
i.e., S → βX | βY | H. Here the common prefix of the two right-hand sides of the single production is β.
1. Elimination of ambiguity
There are several ways to handle ambiguity. We will discuss two of them:
a. Rewriting the grammar
b. Enforce Associativity and precedence rules using parser-generator declarations
a. Rewrite the grammar: use a different new non-terminal for each precedence level, starting with the lowest precedence.
Example: G1: E → E - E | E / E | (E) | id
Is this grammar ambiguous? Check it.
Answer: yes. Rewriting with one non-terminal per precedence level gives, for example:
E → T - E | T
T → F / T | F
F → (E) | id
Such a grammar captures operator precedence, but it can still be deficient: it fails to express that both subtraction and division are left-associative;
e.g., 5-3-2 is equivalent to (5-3)-2 and not to 5-(3-2).
b. Enforce Associativity and precedence using parser-generator declarations
Most parser generators allow precedence and Associativity declarations to disambiguate grammars.
Associativity of operators
By convention, 9+5+2 is equivalent to (9+5)+2 and 9-5-2 is equivalent to (9-5)-2. When an operand like 5 has
operators to its left and right, conventions are needed for deciding which operator applies to that operand. We
say that the operator + associates to the left, because an operand with plus signs on both sides of it belongs to
the operator to its left. In most programming languages the four arithmetic operators, addition, subtraction,
multiplication, and division are left-associative.
Precedence of Operators
Consider the expression 9+5*2. There are two possible interpretations of this expression: (9+5) *2 or 9+ (5*2).
The associatively rules for + and * apply to occurrences of the same operator, so they do not resolve this
ambiguity. Rules defining the relative precedence of operators are needed when more than one kind of operator
is present.
We say that * has higher precedence than + if * takes its operands before + does. In ordinary arithmetic,
multiplication and division have higher precedence than addition and subtraction. Therefore, 5 is taken by * in
both 9+5*2 and 9*5+2; i.e., the expressions are equivalent to 9+ (5*2) and (9*5) +2, respectively.
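These associativity and precedence conventions can be made concrete with a small precedence-climbing evaluator. A Python sketch (the PREC table and function names are our own assumptions, not from the text):

```python
# Precedence-climbing evaluation of flat token lists such as [9, '+', 5, '*', 2].
# Higher PREC number = binds tighter; the "+1" on recursion gives left-associativity.

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def evaluate(tokens):
    def parse(min_prec, pos):
        value = tokens[pos]
        pos += 1
        while pos < len(tokens) and PREC[tokens[pos]] >= min_prec:
            op = tokens[pos]
            # Recurse with precedence PREC[op] + 1, so an equal-precedence
            # operator to the right does NOT grab the operand: left-assoc.
            rhs, pos = parse(PREC[op] + 1, pos + 1)
            value = {"+": value + rhs, "-": value - rhs,
                     "*": value * rhs, "/": value / rhs}[op]
        return value, pos
    return parse(1, 0)[0]

print(evaluate([9, "+", 5, "*", 2]))   # 19, i.e. 9 + (5*2): * binds tighter
print(evaluate([9, "*", 5, "+", 2]))   # 47, i.e. (9*5) + 2
print(evaluate([9, "-", 5, "-", 2]))   # 2,  i.e. (9-5) - 2: - is left-assoc
```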
Level of precedence    Operator                  Associativity
1 (highest)            ++ -- (postfix)           Right to left
2                      !, ++ -- (prefix)         Right to left
3                      *, /, %                   Left to right
4                      +, -                      Left to right
5                      <, >, <=, >=              Left to right
6                      ==, !=                    Left to right
8                      &&                        Left to right
9                      ||                        Left to right
10                     ?:                        Right to left
11 (lowest)            =, +=, -=, *=             Right to left
E.g. Evaluate the expression a == b + c * d
2. Eliminating Left Recursion
Let us consider the left-recursive grammar S → Sa | B; the set of strings generated from this grammar consists of a B followed by any number of a's. We can rewrite it using right recursion by introducing a new variable Z, giving S → BZ, Z → aZ | λ; this still accepts the same set of strings as before: a B followed by any number of a's.
A grammar is said to be left recursive if it has a non-terminal A such that there is a derivation A=>Aα for some
string α. Top-down parsing methods cannot handle left-recursive grammars. Hence, left recursion can be
eliminated as follows
To eliminate left recursion for a single production (immediate left recursion):
A → Aα | β can be replaced by the non-left-recursive productions
A → β A'
A' → α A' | ε
This left-recursive grammar:
E → E + T | T
T → T * F | F
F → ( E ) | id
can be re-written to eliminate the immediate left recursion:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → ( E ) | id
Generally, we can eliminate immediate left recursion by the following technique. First we group the A-productions as:
A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn, where no βi begins with A. Then we replace the A-productions by:
A → β1A' | β2A' | … | βnA'
A' → α1A' | α2A' | … | αmA' | ε
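The β A' / α A' scheme above can be sketched as a small Python function (the function name and grammar encoding are our own, not from the text):

```python
# Eliminating immediate left recursion: split the right-hand sides of A into
# those of the form A alpha (left-recursive) and the remaining betas, then
# build  A -> beta A'  and  A' -> alpha A' | ε.

def eliminate_immediate(lhs, productions):
    """productions: list of right-hand sides, each a list of symbols.
    Returns a dict mapping each (possibly new) non-terminal to its productions."""
    new = lhs + "'"
    alphas = [p[1:] for p in productions if p and p[0] == lhs]   # A -> A alpha
    betas  = [p for p in productions if not p or p[0] != lhs]    # A -> beta
    if not alphas:
        return {lhs: productions}                # no left recursion: unchanged
    return {
        lhs: [b + [new] for b in betas],             # A  -> beta A'
        new: [a + [new] for a in alphas] + [["ε"]],  # A' -> alpha A' | ε
    }

# E -> E + T | T  becomes  E -> T E' ,  E' -> + T E' | ε
print(eliminate_immediate("E", [["E", "+", "T"], ["T"]]))
```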
Algorithm for eliminating left recursion (direct or indirect)
• Input: grammar G with no cycles or ε-productions
• Arrange/list the non-terminals in some order A1, A2, …, An
for i = 1, …, n do
for j = 1, …, i-1 do
replace each production Ai → Aj γ with Ai → δ1 γ | δ2 γ | … | δk γ,
where Aj → δ1 | δ2 | … | δk are all the current Aj-productions
end do
eliminate the immediate left recursion in Ai
end do
Example: Eliminate left recursion
Given a grammar G:
A → BC | a
B → CA | Ab
C → AB | CC | a
Choose the arrangement A, B, C (meaning A = 1, B = 2, C = 3).

i = 1: nothing to do.

i = 2, j = 1: substitute for A in B → Ab:
B → CA | BCb | ab
Eliminate the immediate left recursion in B:
B → CA B' | ab B'
B' → Cb B' | ε

i = 3, j = 1: substitute for A in C → AB:
C → BCB | aB | CC | a
i = 3, j = 2: substitute for B:
C → CA B' CB | ab B' CB | aB | CC | a
Eliminate the immediate left recursion in C:
C → ab B' CB C' | aB C' | a C'
C' → A B' CB C' | C C' | ε

Final result:
A → BC | a
B → CA B' | ab B'
B' → Cb B' | ε
C → ab B' CB C' | aB C' | a C'
C' → A B' CB C' | C C' | ε
Exercises: eliminating indirect left recursion
Let Σ = {a, b, c, d} and N = {S, A}, where S is the start symbol.
Exercise 1: S → Aa | b
A → Sc | d ; eliminate all left recursion from this grammar.
Exercise 2: S → Aa | b
A → Ac | Sd | ε ; eliminate all left recursion from this grammar.
Exercise 3: S → Aa | b
A → Ac | Sd | f ; eliminate all left recursion from this grammar.
3. Left Factoring a grammar
If more than one production rule of a grammar shares a common prefix string, then a top-down parser cannot decide which production to use to parse the string at hand, since the alternatives begin with the same terminal (or non-terminal). To remove this confusion, we use a technique called left factoring. Left factoring transforms the grammar to make it useful for top-down parsers: we make one production for each common prefix, and the remainders of the alternatives are moved into new productions.
• A recursive decent, predictive, parser (a top-down parser without backtracking) insists that the grammar
must be left-factored.
• When a non-terminal has two or more productions whose right-hand sides start with the same grammar
symbols, the grammar is not LL(1) and cannot be used for predictive parsing
• To remove this problem, replace productions of the form
A → αβ1 | αβ2 | … | αβn | γ with
A → αZ | γ
Z → β1 | β2 | … | βn
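One factoring step of the A → αZ | γ scheme can be sketched in Python (the helper name and encoding are our own; the step factors one common first symbol at a time, so repeating it on the new non-terminals continues the factoring):

```python
# One step of left factoring: group right-hand sides by their first symbol;
# any group with more than one alternative gets the common symbol pulled out
# into a new non-terminal.

def left_factor_once(lhs, productions):
    """productions: list of right-hand sides, each a non-empty list of symbols."""
    groups = {}
    for p in productions:
        groups.setdefault(p[0], []).append(p)
    result, counter = {lhs: []}, 0
    for first, alts in groups.items():
        if len(alts) == 1:
            result[lhs].append(alts[0])          # unique prefix: keep as-is
        else:
            counter += 1
            new = lhs + "'" * counter            # fresh non-terminal A', A'', ...
            result[lhs].append([first, new])
            # the rest of each alternative (ε if nothing remains)
            result[new] = [p[1:] or ["ε"] for p in alts]
    return result

# S -> iEtS | iEtSeS | a  becomes  S -> iS' | a ,  S' -> EtS | EtSeS
print(left_factor_once("S",
      [["i", "E", "t", "S"], ["i", "E", "t", "S", "e", "S"], ["a"]]))
```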
Example: given the following grammar:
S → iEtS | iEtSeS | a
E → b
Left factored, this grammar becomes:
S → iEtSS' | a
S' → eS | ε
E → b
Example 2:
A → abB | aB | cdg | cdeB | cdfB
Step 1 (factor a):
A → aA' | cdg | cdeB | cdfB
A' → bB | B
Step 2 (factor cd):
A → aA' | cdA''
A' → bB | B
A'' → g | eB | fB
Exercise: A → ad | a | ab | abc | b
First and Follow Sets
The construction of both top-down and bottom-up parsers is aided by two functions, FIRST and FOLLOW, associated with a grammar G.
During top-down parsing, FIRST and FOLLOW allow us to choose which production to apply, based on the
next input symbol.
During panic-mode error recovery, sets of tokens produced by FOLLOW can be used as synchronizing tokens.
An important part of parser table construction is creating the FIRST and FOLLOW sets. These sets indicate which terminals can appear at a given point in a derivation, and they drive the construction of the parsing table, where an entry T[A, t] = α records the decision to expand A by production α on lookahead t.
We need to build a FIRST set and a FOLLOW set for each symbol in the grammar; the elements of FIRST and FOLLOW are terminal symbols.
Computing FIRST set
FIRST(α) is the set of terminal symbols that can begin any string derived from α, where α is any string of grammar symbols.
To compute FIRST(A) for all grammar symbols A, apply the following rules until no more terminals or λ can be added to any FIRST set:
1. If x is a terminal, then FIRST(x) = {x}.
2. If A → λ is a production, then add λ to FIRST(A).
3. If A is a non-terminal and A → aα is a production, then add a to FIRST(A).
4. If A → BX and B does not derive λ, then FIRST(A) = FIRST(B).
5. If A → BX and B ⇒ λ, then FIRST(A) = (FIRST(B) − {λ}) ∪ FIRST(X).
Examples: given a grammar, compute the FIRST set of the grammar's elements.
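The rules above can be run to a fixed point in code. A Python sketch for the digit-expression grammar used later in this section (the dictionary encoding and helper names are our own assumptions, not from the text):

```python
# Fixed-point computation of FIRST for the grammar
#   E -> T R ;  R -> + T R | - T R | ε ;  T -> 0 | 1 | ... | 9

GRAMMAR = {
    "E": [["T", "R"]],
    "R": [["+", "T", "R"], ["-", "T", "R"], ["ε"]],
    "T": [[str(d)] for d in range(10)],
}

def first_of_string(symbols, grammar, first):
    """FIRST of a string of grammar symbols, per rules 1-5 above."""
    out = set()
    for sym in symbols:
        if sym == "ε":
            out.add("ε")
            break
        if sym not in grammar:           # rule 1: a terminal begins the string
            out.add(sym)
            break
        out |= first[sym] - {"ε"}        # rules 4/5: leading non-terminal
        if "ε" not in first[sym]:        # sym cannot vanish: stop here
            break
    else:
        out.add("ε")                     # every symbol could derive ε
    return out

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                       # iterate until nothing new is added
        changed = False
        for lhs, prods in grammar.items():
            for rhs in prods:
                new = first_of_string(rhs, grammar, first)
                if not new <= first[lhs]:
                    first[lhs] |= new
                    changed = True
    return first

FIRST = first_sets(GRAMMAR)
print(sorted(FIRST["R"]))    # ['+', '-', 'ε']
```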
Computing FOLLOW set
FOLLOW(A) is the set of terminal symbols that can appear immediately to the right of A in some sentential form: t ∈ FOLLOW(A) if there is a derivation S ⇒* αAtβ.
To compute FOLLOW (A) for all non-terminals A, apply the following rules until nothing can be added to any
FOLLOW set
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the right end marker of the input.
2. If there is a production A → XYZ and Z does not derive λ, then FOLLOW(Y) = FIRST(Z).
3. If A → YZ, then FOLLOW(Z) = FOLLOW(A).
4. If there is a production A → XYZ and Z ⇒ λ, then FOLLOW(Y) = (FIRST(Z) − {λ}) ∪ FOLLOW(A).
Example for FIRST and FOLLOW sets
Given a grammar, compute the FOLLOW sets of the grammar. (Note: we always compute FOLLOW sets of variables only, not of terminals.)
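The FOLLOW rules can likewise be iterated to a fixed point. A Python sketch for the grammar E → TR, R → +TR | -TR | ε, T → 0 | … | 9, with its FIRST sets written out by hand so the sketch stays self-contained (the encoding and names are our own, not from the text):

```python
# Fixed-point computation of FOLLOW per the four rules above.

GRAMMAR = {
    "E": [["T", "R"]],
    "R": [["+", "T", "R"], ["-", "T", "R"], ["ε"]],
    "T": [[str(d)] for d in range(10)],
}
FIRST = {"E": set("0123456789"), "T": set("0123456789"), "R": {"+", "-", "ε"}}

def follow_sets(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                  # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, prods in grammar.items():
            for rhs in prods:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue            # FOLLOW is computed for non-terminals only
                    tail = rhs[i + 1:]
                    add, nullable_tail = set(), True
                    for t in tail:          # FIRST of what comes after sym (rule 2/4)
                        if t == "ε":
                            continue
                        if t not in grammar:
                            add.add(t); nullable_tail = False; break
                        add |= first[t] - {"ε"}
                        if "ε" not in first[t]:
                            nullable_tail = False; break
                    if nullable_tail:       # rules 3/4: inherit FOLLOW(lhs)
                        add |= follow[lhs]
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return follow

FOLLOW = follow_sets(GRAMMAR, FIRST, "E")
print(sorted(FOLLOW["T"]))    # ['$', '+', '-']
```

Running it reproduces the sets listed in the example below: FOLLOW(E) = {$}, FOLLOW(T) = {+, -, $}, FOLLOW(R) = {$}.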
Example 2: Consider the following grammars G, find FIRST and FOLLOW set
Example 3: Find FIRST and FOLLOW sets for the following grammar G:
E → TR
R → +TR
R → -TR
R → ε
T → 0 | 1 | … | 9

FIRST(E) = FIRST(T) = {0, 1, …, 9}     FIRST(R) = {+, -, ε}
FOLLOW(E) = {$}     FOLLOW(T) = {+, -, $}     FOLLOW(R) = {$}
Exercise: Consider the following grammar over the alphabet {g, h, i, b}:
A → BCD
B → bB | ε
C → Cg | g | Ch | i
D → AB | ε
Fill in the table below with the FIRST and FOLLOW sets for the non-terminals in this grammar:
FIRST FOLLOW
A
B
C
D
Types of Parsing
Parsing is the process of analyzing a continuous stream of input in order to determine its grammatical structure
with respect to a given formal grammar.
Syntax analyzers follow production rules defined by means of context-free grammar. The way the production
rules are implemented (derivation) divides parsing into two types: top-down parsing and bottom-up parsing.
1. Top-down Parsing: Construction of the parse tree starts at the root (from the start symbol) and proceeds
towards leaves (token or terminals).
2. Bottom-up parsing: Constructions of the parse tree starts from the leaf nodes (tokens or terminals of the
grammar) and proceeds towards root (start symbol)
Top-down Parsing
Top-down parsing can be viewed as finding a leftmost derivation for an input string. To construct a parse tree
for a string, we initially create a tree consisting of a single node (root node) labeled by the start symbol.
Thereafter, we repeat the following steps to construct the parse tree by starting at the root labeled by start
symbol:
• At a node labeled with non-terminal A, select one of the productions of A and construct the children nodes.
• Find the next node at which a sub-tree is to be constructed.
There are three types of top-down parsing
Recursive-Descent Parsing with backtracking (RDP)
Recursive-Descent Predictive Parsing without backtracking (RPP)
Non Recursive-Descent Predictive Parsing (PP)
1. Recursive-Descent Parsing with backtracking (RDP)
It is a common form of top-down parsing. It is called recursive, as it uses recursive procedures to process the
input. It is a parsing technique, but not widely used and not efficient. Recursive descent parsing suffers from
backtracking.
Backtracking means, if one derivation of a production fails, the syntax analyzer restarts the process using
different rules of same production.
Before applying it, eliminate left recursion (if the CFG is left-recursive).
The parse tree is constructed:
• from the top (root),
• from left to right at each step,
• with productions and terminals examined in order at each step.
Exercise: Using the grammar below, draw a parse tree for the following string using RDP algorithm:
( ( id . id ) id ( id ) ( ( ) ) )
S→E
E → id
|(E.E)
|(L)
|()
L→LE
|E
2. Recursive-descent (RD) parser without backtracking (RD predictive parser)
Recursive-descent parsing is a top-down method of syntax analysis in which a set of recursive procedures is used to process the input. In this parsing method no backtracking is required: the parser relies on information about the first symbols that can be generated by a production body. It is more efficient than recursive descent with backtracking.
Steps to work with an RD predictive parser:
1. Eliminate left recursion from grammar
2. Left factor the grammar
3. Compute FIRST and FOLLOW sets. Then we apply predictive recursive-descent parsing. This is
through function call.
Example: Consider the grammar with non-terminals {E, T} and terminals {+, *, int, (, )}:
E → T + E | T
T → int | int * T | (E)
Question: parse the input int*int
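A predictive recursive-descent parser for this grammar can be sketched in Python after left factoring it (E → TE', E' → +E | ε; T → int T' | (E), T' → *T | ε). The class and method names are our own assumptions, not from the text:

```python
# A recursive-descent predictive parser (no backtracking): one procedure per
# non-terminal, and each alternative is chosen by peeking at the next token.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]      # $ marks the end of the input
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def eat(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok}, got {self.peek()}")
        self.pos += 1

    def E(self):                          # E  -> T E'
        self.T()
        self.E_prime()

    def E_prime(self):                    # E' -> + E | ε   (predict on lookahead)
        if self.peek() == "+":
            self.eat("+")
            self.E()

    def T(self):                          # T  -> int T' | ( E )
        if self.peek() == "int":
            self.eat("int")
            self.T_prime()
        else:
            self.eat("(")
            self.E()
            self.eat(")")

    def T_prime(self):                    # T' -> * T | ε
        if self.peek() == "*":
            self.eat("*")
            self.T()

def parse(tokens):
    p = Parser(tokens)
    p.E()
    p.eat("$")                            # the whole input must be consumed
    return True

print(parse(["int", "*", "int"]))         # True: int*int is accepted
```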
3. Non-Recursive (Table Driven) Predictive Parsing (NRPP) - LL(1) Parser
Non-Recursive predictive parsing is a table-driven parser. It is also known as LL (1) Parser.
It contains the following components:
What is LL (K) parser?
An LL (k) Parser accepts LL (k) grammar.
LL (k) grammar is a subset of context-free grammar but with some restrictions to get the simplified
version, in order to achieve easy implementation.
LL (k) grammar can be implemented by means of table-driven algorithm.
In LL(k):
the first L means “left-to-right” scan of the input,
the second L means a “leftmost derivation” is applied,
k means the parser predicts based on k tokens of lookahead.
In practice, LL(1) is used.
Components of LL (1) Parser
Non-Recursive predictive parsing is a table-driven parser. a non-recursive predictive parser can be built by
maintaining a stack explicitly, rather than implicitly via recursive calls
The table-driven parser has 5 components:
1. An input buffer: The input buffer contains the string to be parsed, followed by the end marker $.
2. A stack: it contains a sequence of grammar symbols; at the bottom of the stack there is a special end marker symbol $. Initially the stack contains only the symbol $ and the start symbol S:
$S (initial stack)
When the stack is emptied (i.e., only $ is left in the stack), parsing is completed.
3. A parsing table ,
4. An output stream and
5. Predictive parsing program (algorithm)
Parsing table: is
A two-dimensional array T[A,a]
Each row is a non-terminal symbol like A
Each column is a terminal symbol like a or the special symbol $
Each entry holds a production rule.
Blank entries in the table indicate errors.
Construction of Non-recursive Predictive Parsing Tables
For any grammar G, the following algorithm can be used to construct the predictive parsing table. The
algorithm to construct a parsing table is given below
• Input : Given Grammar G:
It is free from left recursion and it is Left factored
• Output : Parsing table T
• Method (step)
1. For each production A → α of the grammar, do steps A and B.
A. For each terminal a in FIRST(α), add A → α to T[A, a].
B. If λ is in FIRST(α), add A → α to T[A, b] for each terminal b in FOLLOW(A). If λ is in FIRST(α) and $ is in FOLLOW(A), add A → α to T[A, $].
2. Make each undefined entry of T an error.
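Steps A and B translate directly into code. A Python sketch for the small grammar E → TE', E' → +TE' | ε, T → id, with the FIRST and FOLLOW sets written out by hand so the sketch stays self-contained (the encoding and names are our own assumptions, not from the text):

```python
# Building the LL(1) table T[A, a]: step A fills entries from FIRST of each
# right-hand side; step B handles ε-productions via FOLLOW. A multiply
# defined entry means the grammar is not LL(1).

GRAMMAR = [                      # (lhs, rhs) pairs
    ("E",  ["T", "E'"]),
    ("E'", ["+", "T", "E'"]),
    ("E'", ["ε"]),
    ("T",  ["id"]),
]
FIRST_OF_RHS = {                 # FIRST of each right-hand side, by hand
    ("E",  ("T", "E'")):      {"id"},
    ("E'", ("+", "T", "E'")): {"+"},
    ("E'", ("ε",)):           {"ε"},
    ("T",  ("id",)):          {"id"},
}
FOLLOW = {"E": {"$"}, "E'": {"$"}, "T": {"+", "$"}}

def build_table(grammar):
    table = {}
    for lhs, rhs in grammar:
        first = FIRST_OF_RHS[(lhs, tuple(rhs))]
        for a in first - {"ε"}:              # step A
            assert (lhs, a) not in table, "multiply defined: not LL(1)"
            table[(lhs, a)] = rhs
        if "ε" in first:                     # step B ($ is in FOLLOW already)
            for b in FOLLOW[lhs]:
                assert (lhs, b) not in table, "multiply defined: not LL(1)"
                table[(lhs, b)] = rhs
    return table

table = build_table(GRAMMAR)
print(table[("E'", "+")])    # ['+', 'T', "E'"]
print(table[("E'", "$")])    # ['ε']
```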
Note: a given grammar is said to be LL(1) if there is no multiply defined entry in the constructed parsing table.
If a table entry is multiply defined, then the grammar is not LL(1). This may be because the grammar is:
not left factored,
left recursive, or
ambiguous.
Even if the grammar is free from left recursion and is left factored, there is no 100% guarantee that it is unambiguous, because some grammars are inherently ambiguous.
Parsing algorithm
The parser considers X, the symbol on top of the stack, and a, the current input symbol.
These two symbols determine the action to be taken by the parser.
Assume that $ is a special token that sits at the bottom of the stack and terminates the input string.
Step 1: Set ip to point to the first symbol of w // ip = input pointer
Step 2: Set X to the top stack symbol
Step 3: Repeat {
Step 3.1: if X == a then pop(X) and ip++ // X is a terminal matching a
Step 3.2: else if X is a non-terminal then
Step 3.2.1: if T[X, a] = { X → Y1Y2Y3…Yn } then
Step 3.2.2: begin pop X; push Yn, Y(n-1), …, Y2, Y1 onto the stack, so that Y1 is on top of the stack
Step 3.3: else error()
Step 4: } until X == $
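The stack-driven loop above can be run on the LL(1) table for the grammar E → TE', E' → +TE' | ε, T → id. A Python sketch (the table is hard-coded and the names are our own assumptions, not from the text):

```python
# Table-driven LL(1) parsing: match terminals, expand non-terminals by the
# table entry T[X, a], pushing the right-hand side in reverse.

TABLE = {
    ("E",  "id"): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"],
    ("E'", "$"):  ["ε"],
    ("T",  "id"): ["id"],
}
NONTERMS = {"E", "E'", "T"}

def ll1_parse(tokens, start="E"):
    input_ = tokens + ["$"]
    stack = ["$", start]            # $ at the bottom, start symbol on top
    ip = 0
    while stack[-1] != "$":
        X, a = stack[-1], input_[ip]
        if X == a:                  # terminal on top matches input: pop, advance
            stack.pop()
            ip += 1
        elif X in NONTERMS:
            rhs = TABLE.get((X, a))
            if rhs is None:         # blank entry: syntax error
                raise SyntaxError(f"no rule for T[{X}, {a}]")
            stack.pop()
            for sym in reversed(rhs):      # push Yn ... Y1 so Y1 ends up on top
                if sym != "ε":             # ε contributes nothing to the stack
                    stack.append(sym)
        else:
            raise SyntaxError(f"unexpected {a}, expected {X}")
    return input_[ip] == "$"        # accept iff all input was consumed

print(ll1_parse(["id", "+", "id"]))   # True
```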
Generally: a grammar G is LL(1) iff the following conditions hold for every pair of distinct production rules A → α and A → β:
1. α and β cannot both derive strings starting with the same terminal symbol;
2. at most one of α and β can derive ε;
3. if β can derive ε, then α cannot derive any string starting with a terminal in FOLLOW(A).
Non- LL (1) Grammar: Examples
Types of Error Recovery Techniques in LL (1) parsing
Panic-Mode Error Recovery: Skipping the input symbols until a synchronizing token is found.
Phrase-Level Error Recovery: each empty entry in the parsing table is filled with a pointer to a specific error routine that takes care of that error case by making a local correction to the input. It works only in limited situations:
– a common programming error which is easily detected,
– for example, inserting a ";" after a declaration such as float x in C++;
– it does not work very well in general!
Panic-Mode Error Recovery in LL(1) Parsing: An error is detected during predictive parsing when the
terminal on top of the stack does not match the next input symbol or when non-terminal A is on top of the
stack, a is the next input symbol, and T[A, a] is error (i.e., the parsing-table entry is empty).
Panic-mode error recovery works based on the idea of skipping symbols on the input until a token in a selected
set of synchronizing tokens appears.
• All the terminal-symbols in the follow set of a non-terminal can be used as a synchronizing token
set for that non-terminal.
The recovery proceeds as follows:
1. As a starting point, place all symbols in FOLLOW(A) into the synchronizing set (synch) for non-terminal A. If the parser looks up entry T[A, a] and finds that it is synch, then the driver pops the current non-terminal A; alternatively it skips input until a token in FIRST(A) is found, after which it is likely that parsing can continue. That is, synch is an entry of the form T[A, a] = synch.
2. If a non-terminal can generate the empty string, then the production deriving λ can be used as a default.
Doing so may postpone some error detection, but cannot cause an error to be missed. This approach reduces
the number of non-terminals that have to be considered during error recovery.
3. If a terminal on top of the stack cannot be matched, a simple idea is to pop the terminal from stack, issue a
message saying that the terminal was inserted, and continue parsing. In effect, this approach takes the
synchronizing set of a token to consist of all other tokens.
Note:
• If the parser looks up entry T[A, a] and finds that it is blank, then the input symbol a is skipped.
• If the parser looks up entry T[A, a] and finds that it is synch, then input symbols are skipped until a
symbol in the follow set of the non-terminal A is seen, and then A is popped.
• If a token on top of the stack does not match the input symbol, then we pop the token from the stack,
as mentioned above.
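The three rules above can be sketched as a small table-driven LL(1) parser. The grammar, parse table, and synch entries below are illustrative assumptions (E → TE', E' → +TE' | ε, T → id, with synch entries placed at FOLLOW positions), not a table from this chapter:

```python
# A minimal sketch of panic-mode recovery in a table-driven LL(1) parser.
# Grammar (assumed for illustration): E -> T E',  E' -> + T E' | epsilon,  T -> id
# synch entries are placed at FOLLOW(A) positions for each non-terminal A.

TABLE = {
    ('E',  'id'): ['T', "E'"],
    ("E'", '+'):  ['+', 'T', "E'"],
    ("E'", '$'):  [],                 # E' -> epsilon
    ('T',  'id'): ['id'],
    # synchronizing entries (from the FOLLOW sets)
    ('E',  '$'):  'synch',
    ('T',  '+'):  'synch',
    ('T',  '$'):  'synch',
}
NONTERMS = {'E', "E'", 'T'}

def parse(tokens):
    tokens = tokens + ['$']
    stack = ['$', 'E']                # bottom marker, then start symbol
    i, errors = 0, 0
    while stack[-1] != '$':
        top, a = stack[-1], tokens[i]
        if top not in NONTERMS:       # terminal on top of the stack
            if top == a:
                stack.pop(); i += 1   # match
            else:                     # mismatch: "insert" the terminal
                errors += 1; stack.pop()
        else:
            entry = TABLE.get((top, a))
            if entry is None:         # blank entry: skip the input symbol
                errors += 1; i += 1
            elif entry == 'synch':    # synch entry: pop the non-terminal
                errors += 1; stack.pop()
            else:                     # expand: pop A, push the body reversed
                stack.pop()
                stack.extend(reversed(entry))
    return ('accept' if tokens[i] == '$' else 'reject'), errors

print(parse(['id', '+', 'id']))       # well-formed input, no errors
print(parse(['id', '+', '+', 'id']))  # recovers from the extra '+'
```

On the erroneous input id + + id, the parser pops T via the synch entry, reports one error, and still finishes the parse.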
Exercise: 1. Consider the following grammar G:
A’ → A
A → xA | yA | y
a) Find FIRST and FOLLOW sets for G:
b) Construct the LL(1) parse table for this grammar.
c) Explain why this grammar is not LL(1).
d) Transform the grammar into a grammar that is LL(1).
e) Give the parse table for the grammar created in (d).
2. Given the following grammar:
S → WAB | ABCS
A → B | WB
B → ε | yB
C → z
W → x
a) Find FIRST and FOLLOW sets of the grammar.
b) Construct the LL (1) parse table.
c) Is the grammar LL (1)? Justify your answer.
3. Consider the following grammar:
S → ScB | B
B → e | efg | efCg
C → SdC | S
a) Is the grammar LL(1)? Justify your answer.
b) If not, translate the grammar into LL(1).
c) Construct predictive parsing table for the above grammar.
4. Consider the grammar:
E → BA
A → &BA | ε
B → TRUE | FALSE
Note: &, TRUE, FALSE are terminals.
a) Construct the LL(1) parse table for this grammar.
b) Parse the input string TRUE & FALSE & TRUE.
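The FIRST/FOLLOW sets needed for these exercises can be computed mechanically by fixed-point iteration. The sketch below encodes the grammar of exercise 4 (E → BA, A → &BA | ε, B → TRUE | FALSE) and can be used to check the exercise answers; the grammar encoding (head, body pairs, epsilon as the empty tuple) is an assumption of this sketch:

```python
# A generic FIRST/FOLLOW computation (fixed-point iteration), sketched in
# Python. Each production is (head, body); epsilon is the empty tuple.

GRAMMAR = [
    ('E', ('B', 'A')),
    ('A', ('&', 'B', 'A')),
    ('A', ()),                      # A -> epsilon
    ('B', ('TRUE',)),
    ('B', ('FALSE',)),
]
NONTERMS = {h for h, _ in GRAMMAR}
START = 'E'
EPS = 'eps'

def first_of_seq(seq, first):
    # FIRST of a sequence of grammar symbols
    out = set()
    for X in seq:
        f = first[X] if X in NONTERMS else {X}
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                    # every symbol in seq can derive epsilon
    return out

def compute_first_follow():
    first = {A: set() for A in NONTERMS}
    changed = True
    while changed:                  # iterate FIRST to a fixed point
        changed = False
        for head, body in GRAMMAR:
            f = first_of_seq(body, first)
            if not f <= first[head]:
                first[head] |= f; changed = True
    follow = {A: set() for A in NONTERMS}
    follow[START].add('$')
    changed = True
    while changed:                  # iterate FOLLOW to a fixed point
        changed = False
        for head, body in GRAMMAR:
            for i, X in enumerate(body):
                if X not in NONTERMS:
                    continue
                f = first_of_seq(body[i + 1:], first)
                new = (f - {EPS}) | (follow[head] if EPS in f else set())
                if not new <= follow[X]:
                    follow[X] |= new; changed = True
    return first, follow

first, follow = compute_first_follow()
print(first)   # FIRST(E)=FIRST(B)={TRUE,FALSE}, FIRST(A)={&,eps}
print(follow)  # FOLLOW(E)=FOLLOW(A)={$}, FOLLOW(B)={&,$}
```

Replacing GRAMMAR and START with the grammars of exercises 1-3 gives their FIRST/FOLLOW sets the same way.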
2. Bottom-Up Parsing (BUP)
A bottom-up parse corresponds to the construction of a parse tree for an input string beginning at the leaves (the bottom) and
working up towards the root (the top). It is a more general, efficient, and preferred type of parsing because it does
not require a left-factored grammar; it works with a more natural grammar. It uses a rightmost derivation in reverse and
scans the input from left to right.
The following example shows a bottom-up parse for the string id * id.
Important concepts (Terms) in BUP
The following are important concepts in bottom-up parsing that we cover before discussing the different
bottom-up parsing techniques:
Reduction
Handle and Handle Pruning
Shift-Reduce Parsing
Conflicts During Shift-Reduce Parsing
a. Reduction: We can think of the process as one of “reducing” a string w to the start symbol of a grammar.
At each reduction step a particular substring matching the right side of a production is replaced by the symbol
on the left of that production, and if the substring is chosen correctly at each step, a rightmost derivation is
traced out in reverse. Bottom-up parsing reduces a string to the start symbol by inverting productions.
For example, consider the grammar:
S → aABe
A → Abc | b
B → d
The sentence abbcde can be reduced to S by the following steps:
abbcde
aAbcde
aAde
aABe
S
These reductions trace out the following rightmost derivation in reverse:
S ⇒ aABe ⇒ aAde ⇒ aAbcde ⇒ abbcde
That is, for the grammar
S → aABe
A → Abc | b
B → d
the reduction sequence is: abbcde → aAbcde → aAde → aABe → S
At each step, we have to find a substring α of the sentence such that there is a production A → α, and replace α by A.
The key decisions during bottom-up parsing are when to reduce and which production to apply,
as the parse proceeds.
In LR parsing the two major problems are:
locate the substring that is to be reduced
locate the production to use
For the input id * id (using the expression-grammar productions E → T, T → T*F | F, F → id), the reductions can be described in terms of the sequence of strings
id * id, F * id, T * id, T * F, T, E
The strings in this sequence are formed from the roots of all the sub-trees in the snapshots. The sequence starts
with the input string id * id. The first reduction produces F * id by reducing the leftmost id to F, using the
production F → id. The second reduction produces T * id by reducing F to T. Now, we have a choice between
reducing the string T, which is the body of E → T, and the string consisting of the second id, which is the body
of F → id. Rather than reduce T to E, the second id is reduced to F, resulting in the string T * F. This string
then reduces to T. The parse completes with the reduction of T to the start symbol E.
By definition, a reduction is the reverse of a step in a derivation (recall that in a derivation, a non-terminal
in a sentential form is replaced by the body of one of its productions).
The goal of bottom-up parsing is therefore to construct a derivation in reverse.
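The reduction process for abbcde above can be sketched as a naive reducer. Note the strategy here (try the longest right-hand side first, at the leftmost position) is an assumption that happens to pick the correct handle for this grammar; real parsers decide this with parse tables, as the following sections show:

```python
# A naive "reduce a string to the start symbol" sketch for the grammar
#   S -> aABe,  A -> Abc | b,  B -> d
# At each step it tries the longest right-hand side first, at the leftmost
# position. This greedy choice happens to find the correct handle for this
# example; in general a parse table is needed to choose the handle.

PRODUCTIONS = [('S', 'aABe'), ('A', 'Abc'), ('A', 'b'), ('B', 'd')]
# longest RHS first, so 'Abc' is tried before the single 'b'
PRODUCTIONS.sort(key=lambda p: len(p[1]), reverse=True)

def reduce_to_start(w, start='S'):
    trace = [w]
    while w != start:
        for head, rhs in PRODUCTIONS:
            pos = w.find(rhs)
            if pos != -1:                     # leftmost occurrence of rhs
                w = w[:pos] + head + w[pos + len(rhs):]
                trace.append(w)
                break
        else:
            raise ValueError('stuck: no right-hand side found in ' + w)
    return trace

print(reduce_to_start('abbcde'))
# ['abbcde', 'aAbcde', 'aAde', 'aABe', 'S']
```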
b. Handle and Handle Pruning
During a left-to-right scan of the input, bottom-up parsing constructs a parse tree corresponding to a rightmost
derivation in reverse.
Informally, a "handle" is a string that matches the right-hand side (RHS) of a production and whose replacement
represents one step in the reverse of a rightmost derivation.
If S ⇒*rm αAw ⇒rm αβw, then β (corresponding to the production A → β) in the position following α is a handle
of αβw. The string w consists of only terminal symbols. We only want to reduce handles, not just any RHS.
Handle pruning: If β is a handle and A → β is a production, then replacing β by A is called handle pruning; that
is, removing the children of A from the parse tree (stack).
A right most derivation in reverse can be obtained by handle pruning.
A handle is a substring of grammar symbols in a right-sentential form that matches a right-hand side of a
production
c. Shift-Reduce Parsing
Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols and an input
buffer holds the rest of the string to be parsed. As we shall see, the handle always appears at the top of the
stack just before it is identified as the handle.
We use $ (or the initial state of the DFA) to mark the bottom of the stack, and $ to mark the right end of the
input.
Conventionally, when discussing bottom-up parsing, we show the top of the stack on the right, rather
than on the left as we did for top-down parsing.
Initial Configuration: Initially, the stack contains only $ and the entire string w is on the input (Stack: $, Input: w$).
During a left-to-right scan of the input string, the parser shifts zero or more input symbols onto the
stack, until it is ready to reduce a string β of grammar symbols on top of the stack
It then reduces β to the head of the appropriate production.
The parser repeats this cycle until it has detected an error or until the stack contains the start symbol and the
input is empty (Final Configuration: Stack: $S, Input: $).
Upon entering this configuration, the parser halts and announces successful completion of parsing.
Stack Implementation of Shift-Reduce Parsing
One scheme to implement a handle-pruning, bottom-up parser is called a shift-reduce parser. Shift-reduce
parser uses a stack and input buffer
Algorithm:
Initialize the stack with $ and take w$ as the input.
LOOP: repeat (a) and (b)
a) Find the handle: if there is no handle on top of the stack, shift an input symbol onto the stack until a handle
is found.
b) Prune the handle: if there is a handle A → β on the stack, invoke the reduce action:
i) pop β from the stack and ii) push A onto the stack
until the top of the stack is the start symbol ($S) and the input symbol is $; then announce successful parsing.
Bottom up parsing has four actions. While the primary operations are shift and reduce, there are actually four
possible actions a shift-reduce parser can make: (1) shift, (2) reduce, (3) accept, and (4) error.
1. Shift. Shift the next input symbol onto the top of the stack.
2. Reduce. The right end of the string to be reduced must be at the top of the stack. Locate the left end of
the string within the stack and decide with what non-terminal to replace the string.
3. Accept. Announce successful completion of parsing.
4. Error. Discover a syntax error and call an error recovery routine.
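The four actions can be sketched on the id * id example from the previous section. Note that the action sequence below is supplied by hand as an oracle, an assumption of this sketch; an LR parser would compute each action from its parsing table:

```python
# A sketch of the four shift-reduce actions for the productions
#   E -> T,  T -> T*F | F,  F -> id
# on the input id*id. The action sequence is given by hand (an oracle);
# an LR parser would look each action up in its parsing table.

PRODUCTIONS = {'F->id': ('F', ['id']), 'T->F': ('T', ['F']),
               'T->T*F': ('T', ['T', '*', 'F']), 'E->T': ('E', ['T'])}

def run(tokens, actions):
    stack, inp = ['$'], tokens + ['$']
    for act in actions:
        if act == 'shift':                    # 1. shift next input symbol
            stack.append(inp.pop(0))
        elif act == 'accept':                 # 3. accept
            return stack == ['$', 'E'] and inp == ['$']
        elif act.startswith('reduce '):       # 2. reduce
            head, body = PRODUCTIONS[act[len('reduce '):]]
            if stack[-len(body):] != body:    # handle must be on top
                return False                  # 4. error
            del stack[-len(body):]
            stack.append(head)
        print(stack, inp)                     # show each configuration
    return False

ok = run(['id', '*', 'id'],
         ['shift', 'reduce F->id', 'reduce T->F', 'shift', 'shift',
          'reduce F->id', 'reduce T->T*F', 'reduce E->T', 'accept'])
print('accepted:', ok)
```

The printed configurations reproduce the stack trace $id, $F, $T, $T*, $T*id, $T*F, $T, $E.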
d. Conflicts during Shift-Reduce Parsing
During Shift-Reduce Parsing if there is no handle on the stack then use shift action and If there is a handle
then use reduce action. However,
1. What action should be taken when both shift and reduce are valid? (This is a situation where the parser
knows the entire stack content and the next k symbols but still cannot decide
whether it should shift or reduce; ambiguity.)
• In such a situation we say that a shift-reduce conflict has occurred.
• To resolve this conflict, bison will choose to shift.
2. Which rule should be used for reduction when reduction is possible by more than one rule? (The parser
cannot decide which of several productions it should use for a reduction.)
In such a situation we say that a reduce-reduce conflict has occurred.
To resolve this conflict, bison will report an error.
For example, with the productions E → T, E → id, and T → id, and an id on top of the stack, the parser
cannot decide whether to reduce by E → id or by T → id.
Conflicts arise either because the grammar is ambiguous or because the parsing method is not powerful enough.
Example: Consider the grammar S → E+E | E+n | E, E → n and the input n+n.
The grammar has a reduce-reduce conflict. Show the stack implementation: where does the conflict happen?
There are two main categories of shift-reduce parsers
1. Operator-Precedence Parser
simple, but handles only a small class of grammars.
2. LR-Parsers
Covers wide range of grammars.
1. SLR – simple LR parser
2. LR – most general LR parser
3. LALR – intermediate LR parser (look- ahead LR parser)
SLR, LR, and LALR work the same way; only their parsing tables are different.
Introduction to LR
The most dominant type of bottom-up parser today is based on a concept called LR (k) parsing;
The "L" stands for left-to-right scanning of the input;
The "R" stands for constructing a rightmost derivation in reverse;
The "k" stands for the number of input symbols of look-ahead that are used in making parsing
decisions. The cases k = 0 and k = 1 are of practical interest, and we shall only consider LR parsers with
k <= 1 here. When (k) is omitted, k is assumed to be 1.
Why LR Parsers?
The most powerful and efficient shift-reduce parsing is LR parsing. LR parsing is attractive because:
• LR parsing is most general non-backtracking shift-reduce parsing, yet it is still efficient.
• The class of grammars that can be parsed using LR methods is a proper superset of the class of
grammars that can be parsed with predictive parsers: LL(1) grammars ⊂ LR(1) grammars.
• An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the
input.
The principal drawback of the LR method is that it is too much work to construct an LR parser by hand for a
typical programming-language grammar. Instead, a specialized tool, an LR parser generator, is used. Fortunately,
many such generators are available.
Types of LR Parsers method
• LR(0) parser: the least powerful
• SLR(1): simple LR(1) parser
• LALR(1): intermediate LR parser (look-ahead LR parser)
• CLR(1): canonical LR parser, the most powerful
Note: All work the same way (they use the same driver algorithm); only their parsing tables are different.
LR Parsing components
An LR parser contains the following components
1. Input buffer: contains the input string to be parsed
2. Stack: The program uses a stack to store a string of the form S0X1S1X2S2…XmSm, where Sm is on
top. Each Xi is a grammar symbol and each Si is a state
3. LR Parsing Algorithm: It is the same for all LR parsers
4. Parsing Table: consists of two parts: action and goto functions.
5. Output
In the parsing table there are two parts Action and Goto:
Action: The parsing program determines Sm ( the state currently on top of stack ) and ai (the current input
symbol). It then consults action[Sm, ai] in the action table which can have one of four values:
1. Shift s, where s is a state,
2. Reduce by a grammar production of the form A→β,
3. Accept
4. Error.
Goto: The function goto takes a state and a non-terminal (variable) as arguments and produces a state.
Constructing LR (0) parsing tables
Item, canonical LR (0) collection, and the LR (0) Automaton
Generating an LR (0) parsing table consists of identifying the possible states and arranging the transitions among
them. At the heart of the table construction is the notion of an LR (0) configuration or item. A configuration
(item) is a production of the grammar with a dot (•) at some position in its right side. For example, A –> XYZ
has four possible items:
A –> •XYZ
A –> X•YZ
A –> XY•Z
A –> XYZ•
The production A –> λ generates only one item, A –> •.
For example, the item A –> •XYZ indicates that we hope to see a string derivable from XYZ next on the input.
The item A –> X•YZ indicates that we have just seen on the input a string derivable from X and that we hope
next to see a string derivable from YZ.
The item A –> XYZ• indicates that we have seen the body XYZ and that it may be time to reduce XYZ to A.
One collection of sets of LR (0) items, called the canonical LR (0) collection, provides the basis for
constructing a deterministic finite automaton that is used to make parsing decisions. Such an automaton
is called an LR (0) automaton.
Construct the canonical LR (0) collection for a grammar
We define an augmented grammar and two functions; CLOSURE and GOTO (transition).
If G is a grammar with start symbol S, then G’ (the augmented grammar for G) is G with a new start
symbol S’ and the production S’ → S.
The purpose of this new starting production is to indicate to the parser when it should stop parsing and
announce acceptance of the input. That is, acceptance occurs when and only when the parser is about
to reduce by S’ → S.
Closure of Item Sets
If I is a set of item for a grammar G, then CLOSURE(I) is the set of items constructed from I by the
two rules:
Rule 1: Initially, add every item in I to CLOSURE(I).
Rule 2: If A → α•Bβ is in CLOSURE(I) and B → γ is a production (B a variable), then add the item B → •γ to
CLOSURE(I) if it is not already there. Apply this rule until no more new items can be added to
CLOSURE(I).
We divide all the sets of items of interest into two classes:
1. Kernel items: the initial item S' → •S, and all items whose dots are not at the left end
2. Non-kernel items: all items with their dots at the left end, except for S' → •S
The Function GOTO (Transition)
The second useful function is GOTO (I, X) where I is a set of items and X is a grammar symbol.
Intuitively, the GOTO function is used to define the transitions in the LR (0) automaton for a grammar.
The states of the automaton correspond to sets of items, and GOTO (I, X) specifies the transition from the
state for I under input X.
GOTO(I, X) is defined to be the closure of the set of all items [A → αX•β] such that [A → α•Xβ] is in I.
Note: CLOSURE(I) and the GOTO function work together; consider the augmented grammar given below:
Given: E’ → E, E → E+T | T, T → T*F | F, F → (E) | id
Some general information about the LR(0) automaton:
The start state of the LR(0) automaton is CLOSURE({S’ → •S}), where S’ is the start
symbol of the augmented grammar.
All states are accepting states.
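CLOSURE and GOTO can be sketched directly for the expression grammar above. The item encoding below, a (head, body, dot-position) triple, is an implementation assumption of this sketch:

```python
# CLOSURE and GOTO for LR(0) items, sketched for the augmented grammar
#   E' -> E,  E -> E+T | T,  T -> T*F | F,  F -> (E) | id
# An item is (head, body, dot); bodies are tuples of grammar symbols.

GRAMMAR = [("E'", ('E',)),
           ('E', ('E', '+', 'T')), ('E', ('T',)),
           ('T', ('T', '*', 'F')), ('T', ('F',)),
           ('F', ('(', 'E', ')')), ('F', ('id',))]
NONTERMS = {h for h, _ in GRAMMAR}

def closure(items):
    items = set(items)
    changed = True
    while changed:                          # Rule 2, to a fixed point
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in NONTERMS:
                B = body[dot]               # non-terminal after the dot
                for h, b in GRAMMAR:
                    if h == B and (h, b, 0) not in items:
                        items.add((h, b, 0)); changed = True
    return frozenset(items)

def goto(items, X):
    # advance the dot over X, then close the resulting kernel
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == X}
    return closure(moved)

I0 = closure({("E'", ('E',), 0)})
print(len(I0))                # I0 contains all 7 items (every dot at the left)
I1 = goto(I0, 'E')
print(sorted(I1))             # the two items E' -> E.  and  E -> E.+T
```

I0 here is exactly the start state of the LR(0) automaton; repeating goto over all grammar symbols yields the whole canonical collection.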
Construct LR (0) parse table
Construct C= {I0, …, In} the collection of sets of LR(0) items
1. If A → α•aβ is in Ii and transition(Ii, a) = Ij, then action[i, a] = shift j, where a is a terminal
2. If A → α• is in Ii, then apply the reduce rule [A → α] for all terminals
3. If S' → S• is in Ii, then action[i, $] = accept
4. If transition(Ii, A) = Ij, then GOTO[i, A] = j for all non-terminals A
5. All entries not defined are errors
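The five rules above can be sketched end to end for a tiny grammar (the grammar of Example 2 below: S → AB, A → a, B → b). The code builds the canonical collection and then fills the table; a conflict would show up as two different entries competing for the same action cell:

```python
# Canonical LR(0) collection and LR(0) table for the augmented grammar
#   S' -> S,  S -> AB,  A -> a,  B -> b
# Rules 1-4 from the text are applied directly; rule 5 (error) is every
# cell left empty. A conflict = two different entries for one action cell.

GRAMMAR = [("S'", ('S',)), ('S', ('A', 'B')), ('A', ('a',)), ('B', ('b',))]
NONTERMS = {h for h, _ in GRAMMAR}
TERMS = {'a', 'b', '$'}

def closure(items):
    items = set(items)
    again = True
    while again:
        again = False
        for h, b, d in list(items):
            if d < len(b) and b[d] in NONTERMS:
                for h2, b2 in GRAMMAR:
                    if h2 == b[d] and (h2, b2, 0) not in items:
                        items.add((h2, b2, 0)); again = True
    return frozenset(items)

def goto(I, X):
    return closure({(h, b, d + 1) for h, b, d in I if d < len(b) and b[d] == X})

# collect the canonical collection C = {I0, ..., In} with a worklist
C = [closure({("S'", ('S',), 0)})]
for I in C:
    for X in NONTERMS | TERMS:
        J = goto(I, X)
        if J and J not in C:
            C.append(J)

action, goto_tbl, conflicts = {}, {}, 0
for i, I in enumerate(C):
    for h, b, d in I:
        if d < len(b) and b[d] in TERMS:                 # rule 1: shift
            entry = ('shift', C.index(goto(I, b[d])))
            if action.get((i, b[d]), entry) != entry: conflicts += 1
            action[(i, b[d])] = entry
        elif d == len(b) and h == "S'":                  # rule 3: accept
            action[(i, '$')] = ('accept',)
        elif d == len(b):                                # rule 2: reduce everywhere
            for t in TERMS:
                entry = ('reduce', h, b)
                if action.get((i, t), entry) != entry: conflicts += 1
                action[(i, t)] = entry
    for A in NONTERMS:                                   # rule 4: goto
        if goto(I, A):
            goto_tbl[(i, A)] = C.index(goto(I, A))

print(len(C), 'states,', conflicts, 'conflicts')   # no conflicts: LR(0) grammar
```

Since no cell is multiply defined, this grammar is LR(0), which answers the check requested in Example 2.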
To Construct SLR (1) parse table
Construct C={I0, …, In} the collection of sets of LR(0) items
1. If A → α•aβ is in Ii and transition(Ii, a) = Ij, then action[i, a] = shift j
2. If A → α• is in Ii, then action[i, a] = reduce A → α for all terminals a in FOLLOW(A)
3. If S' → S• is in Ii, then action[i, $] = accept
4. If transition(Ii, A) = Ij, then GOTO[i, A] = j for all non-terminals A
5. All entries not defined are errors
SLR(1) is the simplest of the LR parsing methods. It is too weak to handle some languages, but it is more
powerful than the LR(0) parser.
If the SLR parse table for a grammar has no multiply-defined entries, the grammar is an SLR(1) grammar.
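The difference between the two methods can be demonstrated on the grammar of Example 3 below (E → T+E | T, T → a): the state containing {E → T•+E, E → T•} has a shift on + and an LR(0) reduce on every terminal, while the SLR(1) rule restricts the reduce to FOLLOW(E) = {$} and removes the conflict. The FOLLOW sets below are precomputed by hand for this grammar:

```python
# Why the grammar  E -> T+E | T,  T -> a  is SLR(1) but not LR(0):
# LR(0) reduces on every terminal, SLR(1) only on FOLLOW of the head.
# FOLLOW sets are precomputed by hand: FOLLOW(E)={$}, FOLLOW(T)={+,$}.

GRAMMAR = [("E'", ('E',)), ('E', ('T', '+', 'E')), ('E', ('T',)), ('T', ('a',))]
NONTERMS = {h for h, _ in GRAMMAR}
TERMS = {'a', '+', '$'}
FOLLOW = {'E': {'$'}, 'T': {'+', '$'}}

def closure(items):
    items = set(items)
    again = True
    while again:
        again = False
        for h, b, d in list(items):
            if d < len(b) and b[d] in NONTERMS:
                for h2, b2 in GRAMMAR:
                    if h2 == b[d] and (h2, b2, 0) not in items:
                        items.add((h2, b2, 0)); again = True
    return frozenset(items)

def goto(I, X):
    return closure({(h, b, d + 1) for h, b, d in I if d < len(b) and b[d] == X})

C = [closure({("E'", ('E',), 0)})]
for I in C:                                   # worklist over a growing list
    for X in NONTERMS | TERMS:
        J = goto(I, X)
        if J and J not in C:
            C.append(J)

def count_conflicts(reduce_lookaheads):
    conflicts = 0
    for I in C:
        cell = {}                             # terminal -> set of actions
        for h, b, d in I:
            if d < len(b) and b[d] in TERMS:              # shift action
                cell.setdefault(b[d], set()).add(('shift', b[d]))
            elif d == len(b) and h != "E'":               # reduce action
                for t in reduce_lookaheads(h):
                    cell.setdefault(t, set()).add(('reduce', h, b))
        conflicts += sum(1 for acts in cell.values() if len(acts) > 1)
    return conflicts

print('LR(0) conflicts:', count_conflicts(lambda h: TERMS))       # 1
print('SLR(1) conflicts:', count_conflicts(lambda h: FOLLOW[h]))  # 0
```

The single LR(0) conflict sits in the cell for + of the {E → T•+E, E → T•} state; with FOLLOW-restricted reduces that cell holds only the shift.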
Example 1: Given a grammar, show the parsing process using LR (0) parsing techniques,
Note: please check whether the grammar is LR (0) grammar
S → AA
A → aA
A → b
input : aabb$
Example 2: Given a grammar, construct an LR (0) item collection (Automata) and the parse table
Note: please check whether the grammar is LR (0) grammar
S → AB
A → a
B → b
Example 3: Given a grammar, construct an LR (0) item collection (Automata)
Note: please check whether the grammar is LR (0) grammar
E → T+E
E → T
T → a
Exercises 5 & 6: Given a grammar, check whether it is an LR(0) or SLR(1) grammar.
Notes
When do we say that a grammar is LR (0) grammar?
When there is a shift-reduce or reduce-reduce conflict in the table, the grammar is not an LR(0) grammar.
Exercise: Given the input id * id, show the stack implementation of parsing this input.
Canonical LR (1) or CLR (1) parser
LR(1) parsing uses look-ahead to avoid unnecessary conflicts in the parsing table. It carries extra information in the
state so that wrong reductions by A → α will be ruled out.
LR(1) items are redefined to include a terminal symbol as a second component (the look-ahead symbol).
The general form of an item becomes [A → α•β, a], which is called an LR(1) item.
The item [A → α•, a] calls for reduction only if the next input is a. The set of symbols "a" will be a subset of
FOLLOW(A).
LR (1) item = LR(0) item + look ahead
The Closure Operation for LR(1) Items
1. Start with CLOSURE(I) = I, which contains the item [S’ → •S, $]
2. If [A → α•Bβ, a] is in CLOSURE(I), then for each production B → γ in the grammar and each terminal b in
FIRST(βa), add the item [B → •γ, b] to I if not already in I
3. Repeat 2 until no new items can be added
Alternative algorithm to compute Closure (I) in LR(1)
repeat
for each item [A → α•Bβ, a] in I
for each production B → γ in G‘ and for each terminal b in FIRST(βa)
add item [B → •γ, b] to I
Until no more additions to I
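The LR(1) closure can be sketched for the grammar S' → S, S → AA, A → aA | b. The item encoding (head, body, dot, lookahead) is an assumption of this sketch, and since the grammar has no ε-productions, FIRST of a sequence reduces to FIRST of its first symbol:

```python
# LR(1) CLOSURE for the grammar  S' -> S,  S -> AA,  A -> aA | b
# An LR(1) item is (head, body, dot, lookahead). This grammar has no
# epsilon productions, so FIRST of a sequence is FIRST of its first symbol.

GRAMMAR = [("S'", ('S',)), ('S', ('A', 'A')), ('A', ('a', 'A')), ('A', ('b',))]
NONTERMS = {h for h, _ in GRAMMAR}

def compute_first():
    first = {A: set() for A in NONTERMS}
    changed = True
    while changed:                      # fixed-point iteration
        changed = False
        for h, b in GRAMMAR:
            f = first[b[0]] if b[0] in NONTERMS else {b[0]}
            if not f <= first[h]:
                first[h] |= f; changed = True
    return first

FIRST = compute_first()

def first_of_seq(seq):
    # valid here because no symbol derives epsilon
    X = seq[0]
    return FIRST[X] if X in NONTERMS else {X}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for h, b, d, a in list(items):
            if d < len(b) and b[d] in NONTERMS:
                # lookaheads for B -> .gamma come from FIRST(beta a)
                for la in first_of_seq(b[d + 1:] + (a,)):
                    for h2, b2 in GRAMMAR:
                        if h2 == b[d] and (h2, b2, 0, la) not in items:
                            items.add((h2, b2, 0, la)); changed = True
    return items

I0 = closure({("S'", ('S',), 0, '$')})
for item in sorted(I0, key=str):
    print(item)
```

The resulting start state holds [S' → •S, $], [S → •AA, $], and the four A-items with lookaheads a and b, exactly the I0 of the worked CLR example below.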
The Goto Operation for LR (1) Items
1. For each item [A → α•Xβ, a] in I, add the set of items CLOSURE({[A → αX•β, a]}) to GOTO(I, X) if not
already there
2. Repeat step 1 until no more items can be added to goto(I,X)
Construction of Canonical LR(1) parse table: CLR(1)
Construct C={I0, …,In} the sets of LR(1) items.
1. If [A → α•aβ, b] is in Ii and transition(Ii, a) = Ij, then action[i, a] = shift j
2. If [A → α•, a] is in Ii, then action[i, a] = reduce A → α
3. If [S′ → S•, $] is in Ii, then action[i, $] = accept
4. If transition(Ii, A) = Ij, then goto[i, A] = j for all non-terminals A
5. All undefined entries are error
Example: Given a grammar with augmented rule, show the parsing process using LR (1) parsing techniques,
Note: please check whether the grammar is LR(1) grammar
S’ → S
S → AA
A → aA | b
input : aabb$
LALR Parsing
LR(1) parsing tables have many states.
LALR parsing (look-ahead LR) merges two or more LR(1) states that have the same core into one state to reduce table size.
It is less powerful than CLR(1):
– Merging will not introduce shift-reduce conflicts, because shifts do not use look-ahead.
– Merging may introduce reduce-reduce conflicts, but seldom does so for grammars of programming languages.
Constructing LALR Parsing Tables
1. Construct sets of LR(1) items
2. Combine LR(1) sets with sets of items that share the same first part (the same core)
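Both steps can be sketched for the CLR example grammar (S' → S, S → AA, A → aA | b). The FIRST sets are precomputed by hand as an assumption, since this grammar has no ε-productions:

```python
# A sketch of LALR state merging for the grammar  S' -> S,  S -> AA,
# A -> aA | b. Step 1 builds the canonical LR(1) collection; step 2 merges
# states whose items share the same core (the items minus their lookaheads).

GRAMMAR = [("S'", ('S',)), ('S', ('A', 'A')), ('A', ('a', 'A')), ('A', ('b',))]
NONTERMS = {h for h, _ in GRAMMAR}
FIRST = {'S': {'a', 'b'}, 'A': {'a', 'b'}}   # precomputed; no epsilon here

def first_of_seq(seq):
    X = seq[0]                               # no symbol derives epsilon
    return FIRST[X] if X in NONTERMS else {X}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for h, b, d, a in list(items):
            if d < len(b) and b[d] in NONTERMS:
                for la in first_of_seq(b[d + 1:] + (a,)):
                    for h2, b2 in GRAMMAR:
                        if h2 == b[d] and (h2, b2, 0, la) not in items:
                            items.add((h2, b2, 0, la)); changed = True
    return frozenset(items)

def goto(I, X):
    return closure({(h, b, d + 1, a) for h, b, d, a in I
                    if d < len(b) and b[d] == X})

# step 1: canonical LR(1) collection (worklist over a growing list)
C = [closure({("S'", ('S',), 0, '$')})]
for I in C:
    for X in NONTERMS | {'a', 'b'}:
        J = goto(I, X)
        if J and J not in C:
            C.append(J)

# step 2: merge states with the same core, uniting their lookaheads
cores = {}
for I in C:
    core = frozenset((h, b, d) for h, b, d, a in I)
    cores.setdefault(core, set()).update(I)

print(len(C), 'LR(1) states merge into', len(cores), 'LALR(1) states')
```

For this grammar the 10 canonical LR(1) states collapse into 7 LALR(1) states, and no reduce-reduce conflict is introduced by the merging.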
LL, LR(0), SLR(1), LR(1), LALR(1) Summary
• LL parse tables computed using FIRST/FOLLOW
– Non-terminals × terminals → productions
– Computed using FIRST/FOLLOW
• LR parsing tables computed using closure/goto
– LR states × terminals → shift/reduce actions
– LR states × non-terminals → goto state transitions
• A grammar is
– LL(1) if its LL(1) parse table has no conflicts
– SLR if its SLR parse table has no conflicts
– LALR if its LALR parse table has no conflicts
– LR(1) if its LR(1) parse table has no conflicts