Lecture05-Syntax Analysis-CFG
A. Overview of Parser
[Figure: Source Program → Lexical Analyzer → Parser. Labels: "token" (lexical analyzer to parser), "get next token" (parser to lexical analyzer), "Symbol table".]
a) Role of Parser
1. Report any syntax errors.
2. Recover from commonly occurring errors so that it can
continue processing the remainder of the program.
3. Create its output, which varies by compiler:
* Some compilers create a symbol table.
* Some compilers create a parse tree.
* Some compilers directly produce an intermediate representation (IR)
(the parser and the rest of the front end are in one module).
• Syntactic errors:
misplaced semicolons,
extra or missing braces or parentheses ('{', '}', '(', ')'),
a case statement without an enclosing switch.
• Semantic errors:
type mismatches between
operators and operands.
• Logical errors:
anything that stems from incorrect reasoning on the part of the
programmer. The program is well-formed but produces
unintended or undesired output or behavior.
B. Context-Free Grammar (CFG)
A context-free grammar (CFG) is a type of formal
grammar: a set of production rules that describe
all possible strings in a given formal language. A CFG is
used to specify the syntax of a language.
a) Format
A CFG production has the following form:
𝑠𝑡𝑚𝑡 → 𝐢𝐟 ( 𝑒𝑥𝑝𝑟 ) 𝑠𝑡𝑚𝑡 𝐞𝐥𝐬𝐞 𝑠𝑡𝑚𝑡
• Terminals:
components of the tokens output by the lexical analyzer.
They become the terminal nodes (leaves) of a
parse tree.
• Nonterminals:
syntactic variables that denote sets of strings.
• Start symbol:
one of the nonterminals.
By convention, the head of the first production is
often used as the start symbol.
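The three notions above can be illustrated with a minimal Python sketch. The dictionary layout and the function names are assumptions made for illustration, and the 𝑒𝑥𝑝𝑟 production is a placeholder added only so that 𝑒𝑥𝑝𝑟 has a head; none of this is notation from the lecture.

```python
# Hypothetical representation: each head maps to a list of bodies,
# and each body is a tuple of grammar symbols.
productions = {
    "stmt": [("if", "(", "expr", ")", "stmt", "else", "stmt")],
    "expr": [("id",)],  # placeholder production, assumed for the sketch
}

def nonterminals(prods):
    """Nonterminals are exactly the heads of the productions."""
    return set(prods)

def terminals(prods):
    """Terminals are body symbols that never appear as a head."""
    heads = set(prods)
    return {sym
            for bodies in prods.values()
            for body in bodies
            for sym in body
            if sym not in heads}

def start_symbol(prods):
    """By convention, the head of the first production listed."""
    return next(iter(prods))
```

Here `start_symbol(productions)` picks `stmt` only because it is listed first, which mirrors the convention rather than any property of the grammar itself.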
Example)
Suppose we have the following program.
while ( i > 0 )
if ( i % 2 == 0 )
i = i / 2;
else
i = i - 1;
We want to write a grammar 𝐺 that supports the above
program. The grammar 𝐺 can be:
b) Notational conventions
1. Terminals
(a) Lowercase letters early in the alphabet, such as a, b, c
(b) Operator symbols: +, *, /, …
(c) Punctuation symbols: parentheses, comma, …
(d) Digits: 0, 1, …, 9
(e) Boldface strings: 𝐢𝐝, 𝐢𝐟
Underlined strings (notation in my class): id, if
2. Nonterminals
(a) Uppercase letters early in the alphabet, such as A, B, C
(b) 'S' is usually the start symbol.
(c) Lowercase, italic names such as 𝑒𝑥𝑝𝑟 or 𝑠𝑡𝑚𝑡.
Lowercase without underline (notation in my class): expr, stmt
We call
• such a sequence of replacements a
"derivation of −(𝐢𝐝) from 𝐸", and we say that
• "𝐸 derives −(𝐢𝐝)".
⊛ Symbols:
• "⇒" means "derives in one step",
• "⇒*" means "derives in zero or more steps", and
• "⇒+" means "derives in one or more steps".
⊛ Rules:
1. 𝛼 ⇒* 𝛼, for any string 𝛼.
2. If 𝛼 ⇒* 𝛽 and 𝛽 ⇒ 𝛾, then 𝛼 ⇒* 𝛾.
⊛ Sentence:
If 𝑆 is the start symbol and 𝑆 ⇒* 𝛼, then 𝛼 is a sentential form
of grammar G. A sentential form may contain both terminals
and nonterminals, and may be empty.
A sentence of G is a sentential form that contains only terminals.
The language generated by a grammar is its set of sentences.
A string of terminals 𝑤 is in the language 𝐿(G) if and only if 𝑤 is a
sentence of G, that is, 𝑆 ⇒* 𝑤: 𝐿(G) = { 𝑤 | 𝑆 ⇒* 𝑤 }
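To make the definition of 𝐿(G) concrete, the following Python sketch enumerates the short sentences of a toy grammar by repeatedly expanding the leftmost nonterminal of each sentential form. The grammar 𝐸 → 𝐸 + 𝐸 | 𝐢𝐝, the function name, and the length bound are assumptions chosen for illustration. The pruning by length is only valid because this grammar has no empty bodies, so sentential forms never shrink.

```python
from collections import deque

# Assumed toy grammar: E -> E + E | id
GRAMMAR = {"E": [("E", "+", "E"), ("id",)]}
START = "E"

def sentences(grammar, start, max_len):
    """Collect every sentence of at most max_len symbols by expanding
    the leftmost nonterminal of each sentential form (breadth first)."""
    found = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            # No nonterminal is left, so this sentential form is a sentence.
            found.add(" ".join(form))
            continue
        for body in grammar[form[idx]]:
            new_form = form[:idx] + body + form[idx + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return found
```

Each tuple placed on the queue is a sentential form reachable from the start symbol; only the forms without nonterminals are reported as members of the language.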
d) Parse Tree
• A parse tree is a graphical representation of a derivation.
• Each interior node of a parse tree represents the application of a production.
• Interior nodes are labeled by nonterminals, and leaves are labeled by terminals.
• A parse tree filters out the order in which productions are applied to replace nonterminals.
• There is a one-to-one relationship between parse trees and leftmost (or rightmost) derivations.
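The correspondence between a parse tree and its leftmost derivation can be sketched in Python. The nested-tuple encoding of the tree and the grammar 𝐸 → − 𝐸 | ( 𝐸 ) | 𝐢𝐝 are assumptions made for this illustration.

```python
# A parse tree node is (symbol, [children]); a leaf is a plain string.
# Assumed grammar for illustration: E -> - E | ( E ) | id
tree = ("E", ["-", ("E", ["(", ("E", ["id"]), ")"])])

def leaves(node):
    """Read the leaves from left to right (the yield of the tree)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1] for leaf in leaves(child)]

def show(form):
    """Print a sentential form using only the node labels."""
    return " ".join(n if isinstance(n, str) else n[0] for n in form)

def leftmost_derivation(root):
    """Replay the tree as the sequence of sentential forms obtained by
    always expanding the leftmost interior node first."""
    form = [root]
    steps = [show(form)]
    while any(not isinstance(n, str) for n in form):
        i = next(i for i, n in enumerate(form) if not isinstance(n, str))
        form[i:i + 1] = form[i][1]  # apply the production at that node
        steps.append(show(form))
    return steps
```

Expanding the rightmost interior node instead would replay the same tree as a rightmost derivation, which is why the tree itself does not record the order of replacement.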
e) Ambiguity
A grammar is ambiguous if
• it produces more than one parse tree for some
sentence, or, equivalently,
• it produces more than one leftmost
(or rightmost) derivation for the same sentence.
For most parsers, it is desirable that the grammar be
unambiguous.
Example)
The following grammar 𝐺 is ambiguous, because it
permits two distinct leftmost derivations for some
sentences:
𝐸 → 𝐸 + 𝐸 | 𝐸 ∗ 𝐸 | 𝐢𝐝
Show that the grammar 𝐺 is ambiguous by:
1. Giving a sentence 𝛼 derived from 𝐺;
2. Showing two different parse trees of 𝛼 or
two distinct leftmost derivations for 𝛼.
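One way to check a candidate sentence is to count its parse trees mechanically. The Python sketch below does this for the grammar 𝐸 → 𝐸 + 𝐸 | 𝐸 ∗ 𝐸 | 𝐢𝐝 only; the function name and the token encoding (with `*` standing for ∗) are assumptions for illustration, and the counting relies on the fact that a tree for this grammar is fixed by its top-level operator and its two subtrees.

```python
from functools import lru_cache

def count_trees(tokens):
    """Count the parse trees of tokens under E -> E + E | E * E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of trees that derive tokens[lo:hi] from E.
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        for k in range(lo + 1, hi - 1):  # candidate top-level operator
            if tokens[k] in ("+", "*"):
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))
```

Any sentence for which the count exceeds one is a witness of ambiguity; for example, `count_trees("id + id * id".split())` can be used to test that sentence.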
f) CFG is a more powerful notation than regular expressions
Every construct described by a regular expression can be
described by a CFG: every regular language is a
context-free language, but not vice versa.
[Figure: DFA accepting ... (not reproduced)]
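A standard illustration of the "not vice versa" direction, assumed here rather than taken from the lecture, is the language { aⁿbⁿ | n ≥ 0 }. It is generated by the CFG 𝑆 → a 𝑆 b | ε, but no regular expression describes it. A minimal Python recognizer that follows the two productions directly:

```python
def in_anbn(s):
    """Recognize { a^n b^n | n >= 0 } by following S -> a S b | epsilon."""
    if s == "":
        return True                  # S -> epsilon
    if s[0] == "a" and s[-1] == "b":
        return in_anbn(s[1:-1])      # S -> a S b
    return False
```

The recursion depth tracks how many a's still need a matching b, which is exactly the unbounded counting that a finite automaton cannot do.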
g) Non-context-free language
Some constraints checked during semantic analysis cannot be expressed by a CFG.
Example 1)
Consider the abstract language 𝐿1 = { 𝑤𝑐𝑤 | 𝑤 is in (𝑎|𝑏)* }.
In a programming language, the first 𝑤 represents
• the declaration of an identifier (such as a variable name),
the second 𝑤 represents
• the use of the identifier, and
𝑐 represents
• the part of the program between the two 𝑤s.
In a programming language like C/C++/Java,
𝐿1 abstracts the problem of
• checking that identifiers are declared before they are used.
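Because a CFG cannot express this constraint, compilers usually enforce it in a later phase with the help of a symbol table. The Python sketch below shows the idea under a simplifying assumption: the program has already been reduced to a list of hypothetical ("decl", name) and ("use", name) events in source order, and scoping is ignored.

```python
def undeclared_uses(events):
    """events is a list of ("decl", name) or ("use", name) pairs in
    program order; return the names used before any declaration."""
    declared = set()   # stands in for the symbol table
    errors = []
    for kind, name in events:
        if kind == "decl":
            declared.add(name)
        elif name not in declared:
            errors.append(name)
    return errors
```

The set `declared` plays the role of the symbol table: the check depends on remembering an arbitrary identifier, which is the part of 𝐿1 that a context-free grammar cannot capture.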
Example 2)
In 𝐿2 = { 𝑎ⁿ𝑏ᵐ𝑐ⁿ𝑑ᵐ | 𝑛 ≥ 1 and 𝑚 ≥ 1 },
𝑎ⁿ and 𝑏ᵐ represent
• the lists of parameters in the declarations of two functions, and
𝑐ⁿ and 𝑑ᵐ represent
• the uses of the two functions, passing 𝑛 and 𝑚 arguments respectively.
The new parse tree of 𝑎 = 𝑏 = 𝑐 is:
An ambiguous grammar for the expression 9 + 5 − 2 is
given as follows:
𝐸 → 𝐸 + 𝐸 | 𝐸 − 𝐸 | 𝐧𝐮𝐦
The operators + and − are, in general, left-associative. Write
a new grammar that removes the ambiguity by making
the operators + and − left-associative.
b) Precedence of operators
We have the following grammar for + and ∗, which are
left-associative:
𝐸 → 𝐸 + 𝐹 | 𝐸 ∗ 𝐹 | 𝐹
𝐹 → 𝐧𝐮𝐦
Consider the expression 9 + 5 ∗ 2. There are two ways of
interpreting it: (9 + 5) ∗ 2, which this grammar produces by
left associativity alone, and 9 + (5 ∗ 2), which is the usual
reading. Associativity rules cannot fix this incorrect
interpretation, because the issue is that ∗ has higher
precedence than +, and the grammar treats both operators alike.
The problem can be eliminated by rewriting the grammar:
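One standard rewriting, assumed here with an extra nonterminal 𝑇 introduced for illustration, gives each precedence level its own nonterminal: 𝐸 → 𝐸 + 𝑇 | 𝑇, 𝑇 → 𝑇 ∗ 𝐹 | 𝐹, 𝐹 → 𝐧𝐮𝐦. The Python sketch below evaluates a token list along that grammar, with each left-recursive production written as a left-to-right loop; the function names are hypothetical and `*` stands for ∗.

```python
def evaluate(tokens):
    """Evaluate tokens using E -> E + T | T, T -> T * F | F, F -> num."""
    pos = 0

    def factor():                 # F -> num
        nonlocal pos
        value = int(tokens[pos])
        pos += 1
        return value

    def term():                   # T -> T * F | F, as a left-to-right loop
        nonlocal pos
        value = factor()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            value *= factor()
        return value

    def expr():                   # E -> E + T | T, as a left-to-right loop
        nonlocal pos
        value = term()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            value += term()
        return value

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return result
```

Because ∗ is handled one level below +, the call `evaluate("9 + 5 * 2".split())` groups 5 ∗ 2 first and then adds 9.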
Example)
Write a new grammar that eliminates the immediate left
recursion of the following grammar:
𝐸 → 𝐸 + 𝑇 | 𝐸 − 𝑇 | 𝑇
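The usual transformation replaces 𝐴 → 𝐴𝛼1 | ⋯ | 𝐴𝛼m | 𝛽1 | ⋯ | 𝛽n with 𝐴 → 𝛽1𝐴' | ⋯ | 𝛽n𝐴' and 𝐴' → 𝛼1𝐴' | ⋯ | 𝛼m𝐴' | ε. A minimal Python sketch of that rule follows; the function name, the tuple encoding of bodies, and the choice of naming the new nonterminal by appending a prime are assumptions for illustration.

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Split the bodies of `head` into left-recursive ones (A -> A alpha)
    and the rest (A -> beta), then rebuild them with a fresh nonterminal."""
    new_head = head + "'"
    alphas = [body[1:] for body in bodies if body and body[0] == head]
    betas = [body for body in bodies if not body or body[0] != head]
    if not alphas:
        return {head: bodies}     # no immediate left recursion to remove
    return {
        head: [beta + (new_head,) for beta in betas],
        # The empty tuple () stands for epsilon.
        new_head: [alpha + (new_head,) for alpha in alphas] + [()],
    }

# The grammar from the example above, with "-" standing for the minus sign.
result = eliminate_immediate_left_recursion(
    "E", [("E", "+", "T"), ("E", "-", "T"), ("T",)]
)
```

For the example grammar, `result` holds the productions 𝐸 → 𝑇 𝐸' and 𝐸' → + 𝑇 𝐸' | − 𝑇 𝐸' | ε.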
2a. 𝑖 = 2.
Make the 𝐴-production immediately left recursive
by substituting for its first body symbol 𝑆.
3a. 𝑖 = 3.
Make the 𝐵-production immediately left recursive
by substituting for its first body symbol 𝑆.
Example:
Consider the following if-statements:
𝑠𝑡𝑚𝑡 → 𝐢𝐟 𝑒𝑥𝑝𝑟 𝐭𝐡𝐞𝐧 𝑠𝑡𝑚𝑡 𝐞𝐥𝐬𝐞 𝑠𝑡𝑚𝑡
| 𝐢𝐟 𝑒𝑥𝑝𝑟 𝐭𝐡𝐞𝐧 𝑠𝑡𝑚𝑡
Generalize:
When the choice between two or more alternative
productions is not clear for a nonterminal 𝐴, find the
longest prefix 𝛼 common to two or more of its
alternatives. Let the 𝐴-productions have the following form:
𝐴 → 𝛼𝛽1 | 𝛼𝛽2 | ⋯ | 𝛼𝛽n | 𝛾
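Under the standard left-factoring transformation, these productions become 𝐴 → 𝛼𝐴' | 𝛾 and 𝐴' → 𝛽1 | 𝛽2 | ⋯ | 𝛽n. The Python sketch below applies that step for a prefix supplied by the caller; the function name, the tuple encoding, and the primed name for the new nonterminal are assumptions for illustration, and finding the longest common prefix is left out to keep the sketch short.

```python
def left_factor(head, bodies, prefix):
    """Factor `prefix` (alpha) out of the bodies of `head` that start with it."""
    n = len(prefix)
    new_head = head + "'"
    betas = [body[n:] for body in bodies if body[:n] == prefix]
    gammas = [body for body in bodies if body[:n] != prefix]
    return {
        head: [prefix + (new_head,)] + gammas,
        new_head: betas,          # an empty tuple () stands for epsilon
    }

# The if-statement productions, with the common prefix "if expr then stmt".
factored = left_factor(
    "stmt",
    [("if", "expr", "then", "stmt", "else", "stmt"),
     ("if", "expr", "then", "stmt")],
    ("if", "expr", "then", "stmt"),
)
```

After factoring, the parser reads the common prefix once and postpones the choice between the alternatives until it sees whether `else` follows.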