Lecture05-Syntax Analysis-CFG

This document provides an overview of syntax analysis and context-free grammars. It discusses: 1) The role of parsers in reporting syntax errors, recovering from errors, and creating symbol tables and parse trees. 2) Context-free grammars and how they are used to specify programming language syntax. Grammars have terminals, nonterminals, productions, and a start symbol. 3) How context-free grammars generate languages through derivations and parse trees. Grammars can be ambiguous if they allow multiple parse trees for a sentence.


Syntax Analysis – Context Free Grammar

CMPSC 470 Lecture 05


Topics:
• Overview of parser
• CFG, RegEx→NFA→CFG
• Remove Ambiguity

A. Overview of Parser

[Block diagram] Source Program → Lexical Analyzer → Parser. The Lexical Analyzer returns a token each time the Parser requests one ("get next token"); both components consult the Symbol Table.

a) Role of Parser
1. Report any syntax errors.
2. Recover from commonly occurring errors to
continue processing the remainder of the program.
3. Create output:
* Some compilers create a symbol table.
* Some compilers create a parse tree.
* Some compilers directly produce an intermediate representation (IR)
(the parser and the rest of the front end are in one module).

b) Parsers use grammars

• Types of grammar representation:
* CFG (Context-Free Grammar)
* BNF (Backus-Naur Form, or Backus Normal Form)

• A grammar gives a precise, yet easy-to-understand,
syntactic specification of a programming
language.
• Parsers can be automatically constructed from
certain classes of grammars.
c) Types of parsers & grammars
• Commonly used parsing methods: top-down, bottom-up

• Common grammar classes: LL, LR

• Left-recursive grammars

• Non-left-recursive grammars

• A grammar is ambiguous if it permits more than
one parse tree for the same sentence.
d) Types of programming errors
• Lexical error:
misspelling of identifiers,
keywords, or operators.

• Syntactic errors:
misplaced semicolons,
extra or missing braces or parentheses ('{', '}', '(', ')'),
case statement without an enclosing switch.

• Semantic error:
type mismatches between
operators and operands.

• Logical error:
anything from incorrect reasoning on the part of the
programmer to the misuse of a language construct
(e.g., using the assignment operator = in place of the
comparison operator == in C). It produces
unintended or undesired output or behavior.
B. Context-Free Grammar (CFG)
A context-free grammar (CFG) is a type of formal
grammar: a set of production rules that describe
all possible strings in a given formal language. A CFG is
used to specify the syntax of a language.

a) Format
A CFG production has the following form:
stmt → if ( expr ) stmt else stmt

• Terminals:
the token symbols output by the lexical analyzer.
They become the terminal nodes (leaves) of a
parse tree.

• Nonterminals:
syntactic variables that denote a set of strings.

• Productions (of a grammar):
they specify how terminals and nonterminals can
be combined to form strings.
Each production consists of:
1. a head, or left side,
2. "→" for CFG ("∷=" for BNF),
3. a body, or right side. The body can be ε.

• Start symbol:
one of the nonterminals.
Conventionally, the head of the first production is
used as the start symbol.
Example)
Suppose we have the following program:
while ( i > 0 )
if ( i % 2 == 0 )
i = i / 2;
else
i = i – 1;
We want to write a grammar G that supports the above
program. The grammar G can be:
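One possible grammar G for this program (a sketch of the kind of answer expected; the exact productions used in class may differ):

```
stmt → while ( expr ) stmt
     | if ( expr ) stmt else stmt
     | id = expr ;
expr → expr > expr
     | expr % expr
     | expr == expr
     | expr / expr
     | expr − expr
     | id | num
```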
b) Notational conventions
1. Terminals
(a) Lowercase letters early in the alphabet, such as a, b, c
(b) Operator symbols: +, *, /, …
(c) Punctuation symbols: parentheses, comma, …
(d) Digits: 0, 1, …, 9
(e) Boldface strings: id, if.
Underlined strings (notation in this class): id, if

2. Nonterminals
(a) Uppercase letters early in the alphabet, such as A, B, C
(b) S is usually the start symbol.
(c) Lowercase, italic names such as expr or stmt.
Lowercase without underline (notation in this class): expr, stmt

3. Uppercase letters late in the alphabet, such as X, Y, Z,
represent grammar symbols; that is, either
nonterminals or terminals.

4. Lowercase letters late in the alphabet, chiefly u, v, …, z,
represent (possibly empty) strings of terminals.

5. Lowercase Greek letters, α, β, γ for example, represent
(possibly empty) strings of grammar symbols. Thus, a
generic production can be written as A → α, where A
is the head and α the body.

6. A set of productions A → α1, A → α2, …, A → αk
with common head A (call them A-productions) may
be written A → α1 | α2 | ⋯ | αk. Call α1, α2, …, αk the
alternatives for A.

7. The head of the first production is the start symbol,
unless another is explicitly stated.
c) Derivations
Suppose we have the following grammar G:
E → E + E | −E | (E) | id
Starting from E, we can obtain the sentence −(id) by
sequentially replacing E:
E ⇒ −E ⇒ −(E) ⇒ −(id)

We call
• such a sequence of replacements a
"derivation of −(id) from E", and say
• "E derives −(id)".

⊛ Symbols:
• "⇒" means "derives in one step",
• "⇒*" means "derives in zero or more steps", and
• "⇒⁺" means "derives in one or more steps".

⊛ Rules:
1. α ⇒* α, for any string α.
2. If α ⇒* β and β ⇒ γ, then α ⇒* γ.

⊛ Sentential forms and sentences:

If S is the start symbol and S ⇒* α, then α is a sentential form
of grammar G. A sentential form may contain both terminals
and nonterminals, and may be empty.
If α is a sentence of G, then α contains only terminals.
The language generated by a grammar is its set of sentences.
A terminal string w is in the language L(G) if and only if w is a
sentence of G (i.e., S ⇒* w). L(G) = { w | S ⇒* w }

A context-free language is a language that can be generated
by a CFG.

Two grammars are equivalent if they generate the same
language.
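To make L(G) concrete, the following sketch (my own code, not from the lecture) enumerates all sentences of up to four tokens for the grammar E → E + E | −E | (E) | id, using leftmost derivations:

```python
from collections import deque

GRAMMAR = {"E": [["E", "+", "E"], ["-", "E"], ["(", "E", ")"], ["id"]]}
TERMINALS = {"+", "-", "(", ")", "id"}

def sentences(start="E", max_len=4):
    """Collect every sentence of at most max_len tokens derivable from start."""
    seen = {(start,)}
    out = set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        nts = [k for k, sym in enumerate(form) if sym not in TERMINALS]
        if not nts:                       # sentential form with only terminals: a sentence
            if len(form) <= max_len:
                out.add(" ".join(form))
            continue
        if len(form) > max_len:           # every nonterminal yields >= 1 token, so prune
            continue
        i = nts[0]                        # leftmost derivation: expand leftmost nonterminal
        for body in GRAMMAR[form[i]]:
            new = form[:i] + tuple(body) + form[i + 1:]
            if new not in seen:
                seen.add(new)
                queue.append(new)
    return out
```

For example, the derivation E ⇒ −E ⇒ −(E) ⇒ −(id) shows up as the sentence "- ( id )" in the enumeration.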
⊛ Leftmost derivation:
• Always replace the leftmost nonterminal.
• Written with the "⇒lm" symbol.

⊛ Rightmost derivation (also called canonical derivation):
• Always replace the rightmost nonterminal.
• Written with the "⇒rm" symbol.

d) Parse Tree
• A parse tree is a graphical representation of a derivation.
• Each interior node of a parse tree represents the application of a production.
• Interior nodes are nonterminals, and leaves are terminals.
• A parse tree filters out the order in which productions are applied to replace nonterminals.
• There is a one-to-one relationship between parse trees and leftmost (or rightmost) derivations.
e) Ambiguity
A grammar is ambiguous if
• it produces more than one parse tree for some
sentence, or
• it produces more than one leftmost
(or rightmost) derivation for the same sentence.
For most parsers, it is desirable that the grammar be
unambiguous.
Example)
The following grammar G is ambiguous, because it
permits two distinct leftmost derivations for some
sentences:
E → E + E | E ∗ E | id
Show that the grammar G is ambiguous by:
1. giving a sentence α derived from G;
2. showing two different parse trees of α, or
two distinct leftmost derivations of α.
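The ambiguity can also be checked mechanically (a sketch; the span-counting function is my own): counting, for each span of tokens, how many ways E can derive it under E → E + E | E ∗ E | id. For the sentence id + id ∗ id the count is 2, one tree per derivation.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count distinct parse trees deriving tokens from E -> E + E | E * E | id."""
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):                           # parse trees for the span toks[i:j]
        n = 1 if j - i == 1 and toks[i] == "id" else 0
        for k in range(i + 1, j - 1):          # try each operator as the root production
            if toks[k] in ("+", "*"):
                n += count(i, k) * count(k + 1, j)
        return n

    return count(0, len(toks))
```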
f) CFG is a more powerful notation than regular expressions
Every construct that can be described by a regular expression
can be described by a CFG: every regular language is a context-
free language, but not vice versa.

Convert a regular expression to a CFG:

Example) Determine a CFG accepting the regular expression (a|b)*a(bb|ε).

1. Construct an NFA accepting the regular expression.

2. For each state i of the NFA, create a nonterminal Ai.

3. If state i has a transition to state j on input a (i →a j),
add the production Ai → aAj.
If i →ε j, add the production Ai → Aj.

4. If i is an accepting state, add Ai → ε.

5. If i is the start state, make Ai the start symbol of the
grammar.
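The steps above can be sketched as follows (the NFA here is a hypothetical one for the simpler expression (a|b)*a, chosen just to show the construction; state numbers and names are my own assumptions):

```python
def nfa_to_cfg(transitions, start, accepting):
    """Apply steps 2-5: one nonterminal A_i per NFA state i."""
    productions = []
    for i, symbol, j in transitions:
        if symbol == "eps":                    # i --eps--> j  gives  A_i -> A_j
            productions.append((f"A{i}", f"A{j}"))
        else:                                  # i --a--> j    gives  A_i -> a A_j
            productions.append((f"A{i}", f"{symbol} A{j}"))
    for i in accepting:                        # accepting state i gives A_i -> eps
        productions.append((f"A{i}", "eps"))
    return f"A{start}", productions

# A small NFA for (a|b)*a: loop on a/b at state 0, accept after a final 'a'.
start_symbol, prods = nfa_to_cfg(
    [(0, "a", 0), (0, "b", 0), (0, "a", 1)], start=0, accepting={1}
)
```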

The language L = { aⁿbⁿ | n ≥ 1 }, with equal numbers of a's
and b's, can be described by a grammar (CFG), but not by a
regular expression. We say that "finite automata cannot
count".
A CFG accepting L = { aⁿbⁿ | n ≥ 1 } is:
S → a S b | a b
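The grammar S → a S b | a b can be transcribed directly into a recognizer (a sketch; the function name is my own):

```python
def in_L(s):
    """True iff s is in {a^n b^n | n >= 1}, following S -> a S b | a b."""
    if s == "ab":                      # production S -> a b
        return True
    # production S -> a S b: strip one 'a' and one 'b', recurse on the middle
    return len(s) >= 4 and s[0] == "a" and s[-1] == "b" and in_L(s[1:-1])
```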

g) Non-context-free languages
Some program properties cannot be specified by a CFG;
checking them is deferred to the semantic analysis phase.

Example 1)
Consider the abstract language L1 = { wcw | w is in (a|b)* }.
In a programming language, the first w represents
• the declaration of an identifier (such as a variable name),
the second w represents
• a use of that identifier, and
c represents
• the part of the program between the two w's.
In a programming language like C/C++/Java,
L1 abstracts the problem of
• checking that identifiers are declared before they are used.

A CFG cannot describe a non-context-free language like L1,
so this checking must be done in the semantic
analysis phase.

Example 2)
In L2 = { aⁿbᵐcⁿdᵐ | n ≥ 1 and m ≥ 1 },
aⁿ and bᵐ represent
• the parameter lists in two function declarations, and
cⁿ and dᵐ represent
• uses of the two functions, passing n and m arguments respectively.

A CFG cannot describe the language L2, so the
semantic analysis phase must check the number
of arguments used in function calls.
C. Eliminate Ambiguity in CFG
An ambiguous grammar can have more than one parse
tree generating a given string of terminals. Since a string
with more than one parse tree has more than one
meaning, we desire unambiguous grammars. Sometimes
an ambiguous grammar can be rewritten to eliminate the
ambiguity.
a) Associativity of operators
Suppose we have a grammar G for assignments, such as
a = b = c:
A → A = A | a | b | c | ⋯ | z
Given the sentence a = b = c, there are two parse trees.

• Draw the parse tree built with a grammar that is
left-associative, where b belongs to the left "=".
Then write a grammar having left associativity: a new
grammar that eliminates the ambiguity by making the
operator = associate to the left, and draw the new
parse tree of a = b = c.

• Draw the parse tree built with a grammar that is
right-associative, where b belongs to the right "=".
Then write a grammar having right associativity: a new
grammar that eliminates the ambiguity by making the
operator = associate to the right, and draw the new
parse tree of a = b = c.
An ambiguous grammar for the expression 9 + 5 − 2 is
given as follows:
E → E + E | E − E | num
Operators + and − are, in general, left-associative. Write
a new grammar that removes the ambiguity by making
the operators + and − left-associative.
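A likely intended answer is E → E + num | E − num | num (my guess; the notes leave the answer blank). The parse trees of this grammar compute left-to-right folds, which the following sketch mirrors:

```python
def eval_left_assoc(tokens):
    """Evaluate per E -> E + num | E - num | num: fold from the left."""
    value = int(tokens[0])
    for op, num in zip(tokens[1::2], tokens[2::2]):
        value = value + int(num) if op == "+" else value - int(num)
    return value
```

Under left associativity, 9 + 5 − 2 groups as (9 + 5) − 2.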

b) Precedence of operators
We have the following grammar for + and ∗, which are
left-associative:
E → E + F | E ∗ F | F
F → num
Consider the expression 9 + 5 ∗ 2. There are two ways of
interpreting it: (9 + 5) ∗ 2 by left associativity, and
9 + (5 ∗ 2) by our usual understanding. The associativity
rule cannot remove this incorrect interpretation,
because ∗ has higher precedence than +.
The ambiguity of the grammar can be eliminated by
rewriting it:

The parse trees of 9 + 5 ∗ 2 and 9 ∗ 5 + 2 are:


In a similar manner, the grammar can be rewritten to
include parentheses, which have the highest precedence.

Draw the parse tree of (1 + 2 ∗ 3) ∗ 4:
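One standard layered rewriting is E → E + T | E − T | T, T → T ∗ F | F, F → num | ( E ) (this is my sketch of the usual answer, with − included alongside +; the notes leave the rewriting blank). An evaluator realizes it, with loops standing in for the left-recursive productions while preserving left associativity:

```python
import re

def evaluate(src):
    """Evaluate +, -, * and parentheses with * binding tighter than + and -."""
    toks = re.findall(r"\d+|[+*()-]", src)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def eat():
        nonlocal pos
        pos += 1
        return toks[pos - 1]

    def factor():                      # F -> num | ( E )
        if peek() == "(":
            eat()
            value = expr()
            assert eat() == ")"
            return value
        return int(eat())

    def term():                        # T -> T * F | F, written as a loop
        value = factor()
        while peek() == "*":
            eat()
            value *= factor()
        return value

    def expr():                        # E -> E + T | E - T | T, written as a loop
        value = term()
        while peek() in ("+", "-"):
            value = value + term() if eat() == "+" else value - term()
        return value

    return expr()
```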


c) Eliminate dangling else
Consider the following grammar for if-statements:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | other
Here, other stands for any other statement, like
"i = i + 1", "{ i = i + 1; }", etc.

The grammar is ambiguous, since the following sentence

if E1 then if E2 then S1 else S2

has two parse trees.

This problem is called the "dangling else".

In all common programming languages, the first parse tree is
preferred; that is, "match each else with the closest
unmatched then".
Rewrite the grammar to eliminate the "dangling else".
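One standard rewriting, found in the usual compilers literature (the notes leave the answer blank), distinguishes "matched" statements, whose every then already has an else, from "open" ones:

```
stmt         → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
             | other
open_stmt    → if expr then stmt
             | if expr then matched_stmt else open_stmt
```

Because an else can only follow a matched_stmt, each else is forced to pair with the closest unmatched then.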
d) Eliminate immediate left recursion
A grammar has immediate left recursion if there is a
derivation A ⇒ Aα for some string α of terminals
and/or nonterminals. A top-down parser that uses leftmost
derivations cannot handle an (immediately) left-
recursive grammar.
Example) Given the following grammar:
E → E + T | T    T → id
the body of the first E-production begins with E. Therefore,
in a top-down parser (such as a recursive-descent parser),
the procedure for E would be called recursively forever;
equivalently, the leftmost derivation would keep deriving E.
This case is called immediate left recursion.

The immediate left recursion can be eliminated by
rewriting the grammar, as follows.

Generalize: suppose we have a grammar:

(a) A → Aα | β

A is immediately left recursive, and the left recursion
can be removed by rewriting the grammar using a new
nonterminal R:

(b) A → βR
    R → αR | ε

Now R is right recursive, and the grammar does not
have left recursion. Note that the two grammars (a)
and (b) produce the same sentences "βα⋯α".
Generalized elimination of immediate left recursion:
Consider the immediate left recursion in the following
A-productions:
A → Aα1 | Aα2 | ⋯ | Aαm | β1 | β2 | ⋯ | βn
Rewrite the grammar to eliminate the immediate left
recursion:

Example)
Write a new grammar that eliminates the immediate left
recursion of the following grammar:
E → E + T | E − T | T
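As code, one step of this transformation might look like the following (representation and names are mine: bodies are lists of symbols, "eps" stands for ε, and the new nonterminal is the head plus a prime):

```python
def eliminate_immediate(head, bodies):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn per the rule above."""
    alphas = [b[1:] for b in bodies if b and b[0] == head]     # tails of A -> A alpha
    betas = [b for b in bodies if not (b and b[0] == head)]    # non-recursive bodies
    if not alphas:
        return {head: bodies}                                  # nothing to eliminate
    prime = head + "'"
    return {
        head: [b + [prime] for b in betas],                    # A  -> beta A'
        prime: [a + [prime] for a in alphas] + [["eps"]],      # A' -> alpha A' | eps
    }
```

On E → E + T | E − T | T it produces E → T E′ and E′ → + T E′ | − T E′ | ε.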

e) Eliminate left recursion

A grammar is left recursive if there is a derivation
A ⇒⁺ Aα for some string α.
Algorithm that removes left recursion:
1. Arrange the nonterminals in some order A1, A2, …, An
2. for i = 1 … n
3.   for j = 1 … i − 1
4.     let Aj → δ1 | ⋯ | δk be the current Aj-productions;
       replace each Ai → Ajγ by Ai → δ1γ | ⋯ | δkγ
5.   eliminate immediate left recursion among the Ai-productions
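The five steps can be sketched as one function (my own representation: a grammar is a dict mapping each head to a list of bodies, each body a list of symbols). As in the standard treatment, the input is assumed to have no cycles or ε-productions, so this sketch does not directly cover the ε-productions in the example below.

```python
def eliminate_left_recursion(grammar, order):
    """Steps 1-5: substitute earlier nonterminals, then fix A_i -> A_i ... bodies."""
    g = {head: [list(body) for body in bodies] for head, bodies in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for body in g[ai]:
                if body and body[0] == aj:        # A_i -> A_j gamma: substitute A_j
                    expanded += [delta + body[1:] for delta in g[aj]]
                else:
                    expanded.append(body)
            g[ai] = expanded
        # eliminate immediate left recursion among the A_i-productions
        alphas = [b[1:] for b in g[ai] if b and b[0] == ai]
        betas = [b for b in g[ai] if not (b and b[0] == ai)]
        if alphas:
            prime = ai + "'"
            g[ai] = [b + [prime] for b in betas]
            g[prime] = [a + [prime] for a in alphas] + [["eps"]]
    return g
```

On S → Aa | b, A → Ac | Sd | e with order S, A, this yields A → b d A′ | e A′ and A′ → c A′ | a d A′ | ε.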
Example)
Consider the following grammar:
S → Bb | c
A → Ac | Bd | Se | ε
B → Af | Sg | ε

A is immediately left recursive.

S and B are not immediately left recursive, but they are left recursive.

Eliminate the left recursion:
1a. i = 1.
Expose any immediate left recursion in the S-productions.

1b. Eliminate immediate left recursion
in the S-productions.

2a. i = 2.
Expose immediate left recursion in the A-productions
by substituting for the leading body symbol S.

2b. Eliminate immediate left recursion
in the A-productions.

3a. i = 3.
Expose immediate left recursion in the B-productions:
1. by substituting for the leading body symbol S,
2. by substituting for the leading body symbol A.

3b. Eliminate immediate left recursion
in the B-productions.

4. Stop, because there are no other productions
from which to eliminate left recursion.

Finally, the resulting grammar has no left
recursion.
f) Left factoring
Left factoring is a grammar transformation that is useful
for producing a grammar suitable for predictive (or top-
down) parsing.

Example:
Consider the following if-statement grammar:
stmt → if expr then stmt else stmt
     | if expr then stmt

Given the input token if in "if expr then stmt", we cannot
immediately choose which alternative production should
be applied.
Rewrite the grammar as follows:

Now the if-statement is left-factored. This new grammar
defers the decision until enough of the input has been seen
to make it clear.
The new grammar is still ambiguous, because it still has the
"dangling else" problem. This problem will be resolved
later.

Generalize:
When the choice between two or more alternative
productions for a nonterminal A is not clear, find the
longest prefix α common to two or more of its
alternatives. Let the A-productions have the following form:
A → αβ1 | αβ2 | ⋯ | αβn | γ

where α, β1, …, βn, γ are strings of terminals and nonterminals.
The grammar can be rewritten to defer the decision until the
input makes it clear, using a new nonterminal A′, as follows:
A → αA′ | γ
A′ → β1 | β2 | ⋯ | βn
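One factoring step can be sketched as code (my own representation: bodies are lists of symbols, "eps" stands for ε; repeated application may be needed until no common prefix remains):

```python
def left_factor(head, bodies):
    """One step of left factoring: pull out the longest prefix shared by >= 2 bodies."""
    best = []
    for i in range(len(bodies)):               # find the longest common prefix alpha
        for j in range(i + 1, len(bodies)):
            k = 0
            while (k < len(bodies[i]) and k < len(bodies[j])
                   and bodies[i][k] == bodies[j][k]):
                k += 1
            if k > len(best):
                best = bodies[i][:k]
    if not best:
        return {head: bodies}                  # nothing to factor
    prime = head + "'"
    factored = [b[len(best):] or ["eps"] for b in bodies if b[:len(best)] == best]
    rest = [b for b in bodies if b[:len(best)] != best]
    return {head: [best + [prime]] + rest, prime: factored}
```

On the if-statement grammar above, the common prefix is "if expr then stmt", giving stmt → if expr then stmt stmt′ and stmt′ → else stmt | ε.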
