BCSE307L - Compiler Design
Module 1

by

Dr. P. Mercy Rajaselvi Beaulah
Assistant Professor Senior Grade 2
School of Computer Science and Engineering
VIT Chennai
Syllabus
Why Compiler?
• All the software running on all computers was written in some programming language.
• But before a program can be run, it first must be translated into a form in which it can be executed by a computer.
• The software systems that do this translation are called compilers.
What is a Compiler?

Source Program → Compiler → Target Program
                     ↓
                   Errors

• A compiler is a program that can read a program in one language (the source language) and translate it into an equivalent program in another language (the target language).
• It also reports any errors detected in the source program.
What is a Compiler?

Source Program → Compiler → Target Program
                     ↓
                   Errors

Input → Target Program → Output
What is an Interpreter?

Source Program + Input → Interpreter → Output

Instead of producing a target program as a translation, an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user.
Hybrid Compiler

Source Program → Compiler → Intermediate Program
Intermediate Program + Input → Interpreter → Output

• Java language processors combine compilation and interpretation.
• A Java source program may first be compiled into an intermediate form called bytecodes.
• The bytecodes are then interpreted by a virtual machine. A benefit of this arrangement is that bytecodes compiled on one machine can be interpreted on another machine, perhaps across a network.
A Language Processing System

Source Program
  ↓
Preprocessor
  ↓ (Modified Source Program)
Compiler
  ↓ (Target Assembly Program)
Assembler
  ↓ (Relocatable Machine Code)
Loader / Linker
  ↓
Target Machine Code
Parts of Compilation
• Analysis (Front End)
 breaks up the source program into constituent pieces
 imposes a grammatical structure on them
 creates an intermediate representation of the source program
 detects errors (syntactically ill-formed or semantically unsound constructs)
 updates the symbol table
• Synthesis (Back End)
 constructs the desired target program from the intermediate representation and the information in the symbol table
Phases of Compilation

Character Stream
  ↓
Lexical Analysis
  ↓ (Token Stream)
Syntax Analysis
  ↓ (Syntax Tree)
Semantic Analysis
  ↓ (Syntax Tree)
Intermediate Code Generator
  ↓ (Intermediate Representation)
Code Optimizer
  ↓ (Optimized Intermediate Representation)
Code Generator
  ↓
Target Machine Code

Software Tools that Manipulate Source Programs

Structure editors
A structure editor takes as input a sequence of commands to build a source program. It not only performs the text-creation and modification functions of an ordinary text editor, but also analyzes the program text, putting an appropriate hierarchical structure on the source program. Thus, the structure editor can perform additional tasks that are useful in the preparation of programs.

Pretty printers
A pretty printer analyzes a program and prints it in such a way that the structure of the program becomes clearly visible. For example, comments may appear in a special font, and statements may appear with an amount of indentation proportional to the depth of their nesting in the hierarchical organization of the statements.

Static checkers
A static checker reads a program, analyzes it, and attempts to discover potential bugs without running the program.

Interpreters
Instead of producing a target program as a translation, an interpreter performs the operations implied by the source program.
Software Tools that Manipulate Source Programs

Silicon compilers
A silicon compiler has a source language that is similar or identical to a conventional programming language. However, the variables of the language represent not locations in memory but logical signals (0 or 1) or groups of signals in a switching circuit. The output is a circuit design in an appropriate language.

Query interpreters
A query interpreter translates a predicate containing relational and Boolean operators into commands to search the database for records satisfying that predicate.

Cross compilers
A cross compiler is a compiler capable of creating executable code for a platform other than the one on which the compiler is running, e.g. a compiler that runs on Windows but generates code that runs on an Android smartphone.
Lexical Analysis
• The first phase of a compiler is called lexical analysis or scanning.
• The lexical analyzer reads the stream of characters making up the
source program and groups the characters into meaningful sequences
called lexemes.
• For each lexeme, the lexical analyzer produces as output a token of the form
  (token-name, attribute-value)
  which it passes on to the subsequent phase, syntax analysis.
• In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token.
position = initial + rate * 60

  position - identifier
  =        - assignment operator
  initial  - identifier
  +        - arithmetic operator
  rate     - identifier
  *        - arithmetic operator
  60       - constant
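As a rough illustration, the (token-name, attribute-value) pair can be modelled as a small C structure; the enum and field names below are illustrative assumptions, not part of the original slides.

#include <stdio.h>

/* Illustrative token names; a real compiler defines one per lexical unit. */
enum token_name { ID, ASSIGN, PLUS, TIMES, NUMBER };

/* A token pairs an abstract name with an attribute value: for identifiers,
   a symbol-table index; for numeric constants, the value itself. */
struct token {
    enum token_name name;
    int attribute;
};

int main(void) {
    /* Token stream for: position = initial + rate * 60
       (identifiers point at symbol-table entries 1, 2, 3). */
    struct token stream[] = {
        {ID, 1}, {ASSIGN, 0}, {ID, 2}, {PLUS, 0},
        {ID, 3}, {TIMES, 0}, {NUMBER, 60}
    };
    for (int i = 0; i < 7; i++)
        printf("<%d, %d>\n", stream[i].name, stream[i].attribute);
    return 0;
}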
Syntax Analysis
• The second phase of the compiler is syntax analysis or parsing.
• The parser uses the first components of the tokens produced by the
lexical analyzer to create a tree-like intermediate representation that
depicts the grammatical structure of the token stream.
• A typical representation is a syntax tree in which each interior node
represents an operation and the children of the node represent the
arguments of the operation.
Semantic analyzer
• The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.
• It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands.
• Examples
  1. Array index is a floating-point value: reports an error
  2. D is int, C & A are float, D = C + A: reports an error
  3. D is float, C & A are int, D = C + A: converts C & A to float
Semantic analyzer
position = initial + rate * 60
Intermediate Code Generation
• Syntax trees are a form of intermediate representation; they are
commonly used during syntax and semantic analysis.
• After syntax and semantic analysis of the source program, many compilers generate an explicit low-level or machine-like intermediate representation, an intermediate form called three-address code.
• Three-address code consists of a sequence of assembly-like instructions with three operands per instruction. Each operand can act like a register.
Intermediate Code Generation
• First, each three-address assignment instruction has at most one operator on the right side.
• Second, the compiler must generate a temporary name to hold the value computed by a three-address instruction.
• Third, some "three-address instructions" like the first and last in the sequence in the previous slide have fewer than three operands.
Code Optimization
• The machine-independent code-optimization phase attempts to
improve the intermediate code so that better target code will result.
• Usually better means faster, shorter code, or target code that
consumes less power.
• A significant amount of time is spent on this phase.
• There are simple optimizations that significantly improve the running
time of the target program
Code Generation
• The code generator takes as input an intermediate representation of
the source program and maps it into the target language.
• Registers or memory locations of the Target machine are selected for
each of the variables used by the program.
• Then, the intermediate instructions are translated into sequences of
machine instructions that perform the same task.
• A crucial aspect of code generation is the judicious assignment of
registers to hold variables.
• The first operand of each instruction specifies a destination.
Code Generation
• The F in each instruction tells us that it deals with floating-point
numbers.
Translation of an Assignment Statement

position = initial + rate * 60

Lexical Analyzer
  <id,1> <=> <id,2> <+> <id,3> <*> <60>

Syntax Analyzer
            =
          /   \
      <id,1>   +
             /   \
        <id,2>    *
                /   \
           <id,3>    60

Semantic Analyzer
            =
          /   \
      <id,1>   +
             /   \
        <id,2>    *
                /   \
           <id,3>   inttofloat
                        |
                        60

Intermediate Code Generator
  t1 = inttofloat(60)
  t2 = id3 * t1
  t3 = id2 + t2
  id1 = t3

Code Optimizer
  t1 = id3 * 60.0
  id1 = id2 + t1

Code Generator
  LDF  R2, id3
  MULF R2, R2, #60.0
  LDF  R1, id2
  ADDF R1, R1, R2
  STF  id1, R1
Symbol Table
• The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name.
• These attributes may provide information about the storage allocated for the name, its type, and its scope.
• In the case of procedure names, they record the number and types of its arguments, the method of passing each argument (call by value or call by reference), and the type returned.
Items stored in Symbol Table
• Variable names and numeric constants
• Data type, value, scope, and storage allocated
• Literal constants and strings
• Compiler-generated temporary variables
• Labels in source languages
• Procedure and function names
• For parameters, details about whether the parameter is passed by value or by reference
• Number and type of arguments passed to a function
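A minimal sketch of what one such symbol-table record might look like in C; the field names and sizes are illustrative assumptions, not a fixed layout.

/* One symbol-table record holding the attributes listed above. */
struct symtab_entry {
    char name[64];      /* variable, label, or procedure name       */
    char type[16];      /* data type, e.g. "int", "float"           */
    int  scope;         /* nesting level where the name is declared */
    int  offset;        /* storage allocated: relative address      */
    int  n_args;        /* procedures: number of arguments          */
    int  by_reference;  /* parameter passing: 1 = by reference      */
    char ret_type[16];  /* procedures: type returned                */
};

/* e.g. the entry installed for rate in: position = initial + rate * 60 */
struct symtab_entry rate = { "rate", "float", 0, 8, 0, 0, "" };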
Example
int x;

The lexer will
1. recognize the keyword int and return the token to the parser,
2. recognize the identifier x, add it to the symbol table, and return a pointer to its symbol-table entry to the parser, and
3. recognize the semicolon ';' and return the token to the parser.
Lexical Analysis

Source Program → Lexical Analyzer ⇄ Syntax Analyzer
  (the lexical analyzer sends a token; the syntax analyzer requests
   getNextToken; both interact with the Symbol Table)

• Reads the input characters of the source program,
• groups them into lexemes, and
• produces as output a sequence of tokens, one for each lexeme in the source program.
• The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well.
Lexical Analysis

X = Y + Z → Lexical Analyser → (id,1) (=) (id,2) (+) (id,3)

Symbol Table:
  1  X  ...
  2  Y  ...
  3  Z  ...
Role of Lexical Analyzer
• The lexical analyzer scans the input character by character and groups the characters to form valid lexemes.
• The lexemes are checked against the stored patterns.
• When a match occurs, the corresponding token is returned to the next phase.
• Regular expressions are useful in representing the patterns.
Role of Lexical Analyzer (Contd.)
• Stripping out comments and whitespace (blank, newline, tab, and
perhaps other characters that are used to separate tokens in the
input).
• Another task is correlating error messages generated by the compiler
with the source program.
• For instance, the lexical analyzer may keep track of the number of
newline characters seen, so it can associate a line number with each
error message.
• If the source program uses a macro-preprocessor, the expansion of
macros may also be performed by the lexical analyzer.
Lexemes, Patterns & Tokens
• A lexeme is a sequence of characters in the source program that matches the pattern for a token.
• A pattern is a description of the form that the lexemes of a token may take.
• A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit.

Ex: net = gross – deduction
  Lexeme  : net
  Pattern : letter followed by letters or digits [ l(l|d)* ]
  Token   : identifier
Lexemes, Patterns & Tokens

Examples of Tokens
Tokens
• One token for each keyword. The pattern for a keyword is the same as the keyword itself.
• Tokens for the operators.
• One token representing all identifiers.
• One or more tokens representing constants, such as numbers and literal strings.
• Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.

The tokens for E = M * C ^ 2 are written below as a sequence of pairs:
E - <id, pointer to symbol-table entry for E>
= - <assign-op>
M - <id, pointer to symbol-table entry for M>
* - <mult-op>
C - <id, pointer to symbol-table entry for C>
^ - <exp-op>
2 - <number, integer value 2>
Lexical Errors
fi ( a == f(x) ) ...

• fi: a misspelling of the keyword if, or an undeclared function identifier?
• Probably the parser, in this case, handles an error due to transposition of the letters.
• The lexical analyzer is unable to proceed when none of the patterns for tokens matches any prefix of the remaining input.
• The simplest recovery strategy is "panic mode" recovery: delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left.
Lexical Errors
• Other possible error-recovery actions are:
1. Delete one character from the remaining input. [WHILEE]
2. Insert a missing character into the remaining input. [ELE]
3. Replace a character by another character. [WHLLE]
4. Transpose two adjacent characters. [FI]
Input Buffering
• The lexical analysis phase can be sped up by a buffer between the source program and the lexical analyzer.
• What should the size of the buffer be?
  (2N, where N is the number of characters in one memory read)
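A minimal sketch of the buffer-pair idea in C, assuming N-character reads and a '\0' sentinel at the end of each half (names are illustrative, and the source text is assumed to contain no NUL bytes):

#include <stdio.h>

#define N 4096                      /* characters per memory read        */
static char buf[2][N + 1];          /* buffer pair: 2N chars + sentinels */
static int half = 0, pos = 0;

/* One end-of-buffer test per character: hitting the '\0' sentinel
   refills the other half with the next N characters and switches. */
static int next_char(FILE *in) {
    if (buf[half][pos] == '\0') {
        size_t n = fread(buf[1 - half], 1, N, in);
        buf[1 - half][n] = '\0';    /* sentinel: end of buffer or input  */
        if (n == 0) return EOF;
        half = 1 - half;
        pos = 0;
    }
    return buf[half][pos++];
}

int main(void) {
    int c;
    while ((c = next_char(stdin)) != EOF)
        putchar(c);                 /* echo input to demonstrate the scheme */
    return 0;
}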
Specification of Tokens

Regular expressions are an important notation for specifying the patterns of lexemes.
Strings and Languages
Alphabet, String, Language
• The term alphabet or character class denotes any finite set of
symbols.
• A string over some alphabet is a finite sequence of symbols drawn
from that alphabet.
• The term language denotes any countable set of strings over some
fixed alphabet.
Strings and Languages
Operations on Language
union, concatenation, and closure
Strings and Languages
Operations on Language
Let L be the set {A, B, ..., Z, a, b, ..., z} and D the set {0, 1, ..., 9}.

New languages can be created from L and D by applying the operators:
1. L ∪ D is the set of letters and digits.
2. LD is the set of strings consisting of a letter followed by a digit.
3. L4 is the set of all four-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
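For instance, membership in L(L ∪ D)*, the identifier language of item 5, can be tested directly; a small illustrative C check:

#include <ctype.h>
#include <stdio.h>

/* Membership test for L(L U D)*: a letter followed by any mix of
   letters and digits, exactly the identifier set of item 5. */
int in_identifier_language(const char *s) {
    if (!isalpha((unsigned char)s[0])) return 0;     /* first symbol in L */
    for (int i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i])) return 0; /* rest in L U D     */
    return 1;
}

int main(void) {
    printf("%d %d\n", in_identifier_language("rate2"),   /* prints 1 */
                      in_identifier_language("2rate"));  /* prints 0 */
    return 0;
}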
Regular Expressions
• Regular expressions r are used to describe languages L(r).
• Languages can be built by applying operators to the symbols of some alphabet.
• The regular expression r for an identifier is
    letter ( letter | digit )*
• The vertical bar above means union,
• the parentheses are used to group subexpressions, and
• the star means "zero or more occurrences of".
Regular Expressions
The following rules define the regular expressions over some alphabet ∑ and the languages that those expressions denote.

There are two rules that form the basis:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.

Note that by convention, we use italics for symbols and boldface for their corresponding regular expressions.
Regular Expressions
• INDUCTION: There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
1. (r) is a regular expression denoting the language L(r).
2. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
3. (r)(s) is a regular expression denoting the language L(r)L(s).
4. (r)* is a regular expression denoting (L(r))*.
• The unary operator * has highest precedence and is left associative.
• Concatenation has second highest precedence and is left associative.
• | has lowest precedence and is left associative.
Regular Expressions
Example: Let ∑ = {a, b}.

Generate the language for each of the following regular expressions:
1. a | b
2. (a | b)(a | b)
3. a*
4. (a | b)*
5. a | a*b
Regular Expressions
1. The regular expression a | b denotes the language {a, b}.
2. (a | b)(a | b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet ∑. Another regular expression for the same language is aa | ab | ba | bb.
3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
Regular Expressions
4. (a | b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, ...}. Another regular expression for the same language is (a*b*)*.
5. a | a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a and all strings consisting of zero or more a's and ending in b.
Algebraic Laws for Regular Expressions
Regular Definitions
For notational convenience, a regular expression may be given a name; such a named expression is a regular definition, e.g.

    id → letter ( letter | digit )*

If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form

    d1 → r1
    d2 → r2
    ...
    dn → rn

where
1. Each di is a new symbol, not in ∑ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet ∑ ∪ {d1, d2, ..., di-1}.
Regular Definitions
Regular definitions for an identifier:

    letter → A | B | ... | Z | a | b | ... | z
    digit  → 0 | 1 | ... | 9
    id     → letter (letter | digit)*

Regular definitions for an unsigned number:

    digit             → 0 | 1 | ... | 9
    digits            → digit+
    optional_fraction → . digits | ε
    optional_exponent → ( E (+ | - | ε) digits ) | ε
    number            → digits optional_fraction optional_exponent
Regular Definitions Using Shorthands

Notational shorthands:
    [A-Za-z]  character class
    *         zero or more times
    +         one or more times
    ?         zero or one time

Regular definitions for an identifier:
    letter → [A-Za-z]
    digit  → [0-9]
    id     → letter (letter | digit)*

Regular definitions for an unsigned number:
    digit  → [0-9]
    digits → digit+
    number → digits (. digits)? (E [+-]? digits)?
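The unsigned-number definition above can be exercised with the POSIX regex library; a minimal sketch (error handling omitted, test strings illustrative):

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* number -> digits (. digits)? (E [+-]? digits)?  as a POSIX ERE */
    regcomp(&re, "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$",
            REG_EXTENDED | REG_NOSUB);
    const char *tests[] = { "6336", "3.1415", "6.02E23", "1.", "E23" };
    for (int i = 0; i < 5; i++)
        printf("%-8s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "number" : "no match");
    regfree(&re);
    return 0;
}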
Regular Expressions
Recognition of Tokens
• The patterns for the tokens are described using regular definitions.

Recognition of Tokens
Transition Tables / Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we convert the patterns (regular expressions) into flowcharts called "transition diagrams".

The lexical analyzer will isolate the lexeme for the next token from the input buffer and produce a pair consisting of the appropriate token and attribute value, using the transition table / transition diagrams.
Transition Diagrams
• Transition diagrams have a collection of nodes or circles, called states.
• Edges are directed from one state of the transition diagram to another.
• Each edge is labeled by a symbol or set of symbols.
• One state is designated the start state, or initial state.
• Certain states are said to be accepting, or final; they are drawn with a double circle.
• The transition diagram always begins in the start state before any input symbols have been read.
• Transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.
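Such a diagram can be hand-coded directly, one switch per state; below is an illustrative C sketch of the relational-operator diagram on the next slide, using the token codes from the Lex example later in this module (the function name and retract convention are assumptions):

#include <stdio.h>

enum relop { LT, LE, EQ, NE, GT, GE, NONE };

/* Walk the relational-operator transition diagram: read a character,
   move to the next state, and on the starred states retract by
   reporting a shorter match length. */
enum relop scan_relop(const char *s, int *len) {
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1; return LT;        /* accept with retract */
    case '=':
        *len = 1; return EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1; return GT;        /* accept with retract */
    }
    *len = 0; return NONE;          /* no relop at this position */
}

int main(void) {
    int n;
    printf("%d %d\n", scan_relop("<=", &n), n);  /* prints 1 2 (LE) */
    return 0;
}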
Transition Diagram for Relational Operators

Transition Diagram for Identifiers
    start → (letter) → state with a self-loop on letter or digit → (other) → accept*
    (* = retract one character)

Transition Diagram for Unsigned Numbers
    number → digits (. digits)? (E [+-]? digits)?

Transition Diagram for Whitespace
Recognition of Reserved Words and Identifiers

Recognizing keywords and identifiers presents a problem. There are two methods to solve it.

Method 1:
• Install the reserved words in the symbol table initially.
• If the lexeme is not found in the symbol table, the analyzer calls installID to place the lexeme in the symbol table and return a pointer to the symbol-table entry.
• If the lexeme is found in the symbol table, the function getToken examines the symbol-table entry for the lexeme and returns the relevant token, i.e. either id or the keyword's token.
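A minimal C sketch of Method 1, with the reserved words pre-installed; the table layout and the names install_keywords and get_token are illustrative stand-ins for the installID and getToken routines above:

#include <stdio.h>
#include <string.h>

#define MAX 256
enum { IF = 1, THEN, ELSE, ID };               /* token codes, illustrative */

static struct { char lexeme[32]; int token; } symtab[MAX];
static int n_entries;

/* Install the reserved words in the symbol table initially. */
static void install_keywords(void) {
    const char *kw[] = { "if", "then", "else" };
    const int  tk[] = { IF, THEN, ELSE };
    for (int i = 0; i < 3; i++) {
        strcpy(symtab[n_entries].lexeme, kw[i]);
        symtab[n_entries++].token = tk[i];
    }
}

/* One lookup decides keyword vs. identifier; unknown lexemes are
   installed and reported as id. */
static int get_token(const char *lexeme) {
    for (int i = 0; i < n_entries; i++)
        if (strcmp(symtab[i].lexeme, lexeme) == 0)
            return symtab[i].token;
    strcpy(symtab[n_entries].lexeme, lexeme);
    symtab[n_entries++].token = ID;
    return ID;
}

int main(void) {
    install_keywords();
    printf("%d %d\n", get_token("then"), get_token("rate"));  /* prints 2 4 */
    return 0;
}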
Recognition of Reserved Words and Identifiers

Method 2:
• Create a separate transition diagram for each keyword.
• An example for the keyword then is shown in the figure above.
• After each successive letter of the keyword is seen, a test for a "nonletter-or-digit" is made, to ensure it really is the keyword.
• Otherwise, an identifier like thennextval would be recognized as the keyword then.
FINITE AUTOMATA
A recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise.

A regular expression can be compiled into a recognizer by constructing a generalized transition diagram called a finite automaton.
Nondeterministic Finite Automata
A nondeterministic finite automaton (NFA) is a mathematical model that consists of
1. A finite set of states S
2. A set of input symbols Σ; we assume that ε is never a member of Σ
3. A transition function that gives, for each state and for each symbol in Σ ∪ {ε}, a set of next states
4. A state s0 in S that is distinguished as the start (or initial) state
5. A set of states F, a subset of S, distinguished as the accepting (or final) states

Transition Table

NFA for (a|b)*abb
Deterministic Finite Automata
A deterministic finite automaton (DFA) is a special case of a nondeterministic finite automaton in which
1. no state has an ε-transition, i.e., a transition on input ε, and
2. for each state s and input symbol a, there is at most one edge labeled a leaving s.

DFA accepting (a|b)*abb
Conversion of NFA to DFA
• Regular expression to NFA
• Converting the NFA to a DFA
• Minimised DFA
Construction of an NFA from a Regular Expression (Thompson's Construction)
Input: A regular expression r over an alphabet Σ.
Output: An NFA N accepting L(r).

Decomposition of the Regular Expression (a|b)*abb

NFA for (a|b)*abb
Conversion of NFA into DFA

ε-closure(start state) = {0, 1, 2, 4, 7} = A
Conversion of NFA into DFA
ε-closure(start state) = {0, 1, 2, 4, 7} = A
ε-closure(move(A, a)) = {1, 2, 3, 4, 6, 7, 8} = B
ε-closure(move(A, b)) = {1, 2, 4, 5, 6, 7} = C
ε-closure(move(B, a)) = {1, 2, 3, 4, 6, 7, 8} = B
ε-closure(move(B, b)) = {1, 2, 4, 5, 6, 7, 9} = D
ε-closure(move(C, a)) = {1, 2, 3, 4, 6, 7, 8} = B
ε-closure(move(C, b)) = {1, 2, 4, 5, 6, 7} = C
ε-closure(move(D, a)) = {1, 2, 3, 4, 6, 7, 8} = B
ε-closure(move(D, b)) = {1, 2, 4, 5, 6, 7, 10} = E
ε-closure(move(E, a)) = {1, 2, 3, 4, 6, 7, 8} = B
ε-closure(move(E, b)) = {1, 2, 4, 5, 6, 7} = C
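ε-closure itself is a depth-first search over the ε-edges; a small C sketch for the NFA of (a|b)*abb above, assuming the usual Thompson numbering with states 0 to 10 (the adjacency representation is an assumption):

#include <stdio.h>

#define NSTATES 11
/* eps[s] lists the ε-successors of state s, terminated by -1. */
static const int eps[NSTATES][3] = {
    {1, 7, -1}, {2, 4, -1}, {-1}, {6, -1}, {-1},
    {6, -1}, {1, 7, -1}, {-1}, {-1}, {-1}, {-1}
};

/* Mark every state reachable from s through ε-edges alone. */
static void closure(int s, int in[]) {
    if (in[s]) return;
    in[s] = 1;
    for (int k = 0; eps[s][k] != -1; k++)
        closure(eps[s][k], in);
}

int main(void) {
    int in[NSTATES] = {0};
    closure(0, in);                      /* ε-closure of the start state */
    for (int s = 0; s < NSTATES; s++)
        if (in[s]) printf("%d ", s);     /* prints: 0 1 2 4 7  (state A) */
    printf("\n");
    return 0;
}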
Result of Applying the Subset Construction

Transition Table of the Reduced DFA

Operations on NFA States

Computation of ε-closure

The Subset Construction
Construct a nondeterministic finite automaton for each of the following regular expressions:

1. (a|b)*
2. (a*|b*)*
3. ((ε|a)b*)*
4. (a|b)*abb
5. (a|b)*abb(a|b)*
6. (0+1)*(0+1)01

Then construct a deterministic finite automaton from each nondeterministic finite automaton derived, using the subset construction method.
Direct DFA
• To construct a DFA directly from a regular expression, we construct its syntax tree and then compute four functions:
  • nullable,
  • firstpos,
  • lastpos, and
  • followpos
nullable(n) & firstpos(n)
• nullable(n) is true for a syntax-tree node n if and only if the subexpression represented by n has ε in its language. That is, the subexpression can be "made null", i.e. derive the empty string, even though there may be other strings it can represent as well.
• firstpos(n) is the set of positions in the subtree rooted at n that correspond to the first symbol of at least one string in the language of the subexpression rooted at n.
lastpos(n) & followpos(p)
• lastpos(n) is the set of positions in the subtree rooted at n that correspond to the last symbol of at least one string in the language of the subexpression rooted at n.
• followpos(p), for a position p, is the set of positions q in the entire syntax tree such that there is some string x = a1 a2 ... an in L((r)#) such that for some i, ai is matched to position p of the syntax tree and ai+1 to position q.
Computing nullable, firstpos, lastpos

Node n: a leaf labeled ε
    nullable(n): true
    firstpos(n): ∅
    lastpos(n):  ∅

Node n: a leaf labeled with position i
    nullable(n): false
    firstpos(n): {i}
    lastpos(n):  {i}

Node n: an or-node with left child c1 and right child c2
    nullable(n): nullable(c1) or nullable(c2)
    firstpos(n): firstpos(c1) ∪ firstpos(c2)
    lastpos(n):  lastpos(c1) ∪ lastpos(c2)

Node n: a cat-node with left child c1 and right child c2
    nullable(n): nullable(c1) and nullable(c2)
    firstpos(n): if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n):  if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)

Node n: a star-node with child c1
    nullable(n): true
    firstpos(n): firstpos(c1)
    lastpos(n):  lastpos(c1)
Computing followpos
The function followpos(i) tells us what positions can follow position i in the syntax tree. Two rules define all the ways one position can follow another.

1. If n is a cat-node with left child c1 and right child c2, and i is a position in lastpos(c1), then all positions in firstpos(c2) are in followpos(i).

2. If n is a star-node, and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).
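Both rules reduce to set unions; a compact C sketch with position sets as bit masks (the Node layout is an assumed representation of the syntax tree, not a prescribed one):

#include <stdio.h>

typedef struct Node {
    int is_cat, is_star;           /* node kind flags                 */
    unsigned firstpos, lastpos;    /* bit i set <=> position i in set */
    struct Node *c1, *c2;          /* children (unused for leaves)    */
} Node;

/* Apply the two followpos rules at one node. */
void apply_followpos_rules(const Node *n, unsigned followpos[]) {
    if (n->is_cat)                 /* rule 1: lastpos(c1) gets firstpos(c2) */
        for (int i = 0; i < 32; i++)
            if (n->c1->lastpos & (1u << i))
                followpos[i] |= n->c2->firstpos;
    if (n->is_star)                /* rule 2: lastpos(n) gets firstpos(n)   */
        for (int i = 0; i < 32; i++)
            if (n->lastpos & (1u << i))
                followpos[i] |= n->firstpos;
}

int main(void) {
    /* The star node of (a|b)*: firstpos = lastpos = {1, 2}. */
    Node star = { .is_star = 1,
                  .firstpos = (1u << 1) | (1u << 2),
                  .lastpos  = (1u << 1) | (1u << 2) };
    unsigned followpos[32] = {0};
    apply_followpos_rules(&star, followpos);
    printf("followpos(1) = 0x%x\n", followpos[1]);  /* 0x6, i.e. {1, 2} */
    return 0;
}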
firstpos and lastpos for the nodes in the syntax tree for (a|b)*abb#
Construction of a DFA from a Regular Expression r
Algorithm: Construction of a DFA from a regular expression r.
Input: A regular expression r.
Output: A DFA D that recognizes L(r).

1. Construct a syntax tree T for the augmented regular expression (r)#, where # is a unique endmarker appended to (r).

2. Construct the functions nullable, firstpos, lastpos, and followpos by making depth-first traversals of T.

3. Construct Dstates, the set of states of D, and Dtran, the transition table for D, by the procedure below. The states in Dstates are sets of positions; initially, each state is "unmarked", and a state becomes "marked" just before we consider its out-transitions. The start state of D is firstpos(root), and the accepting states are all those containing the position associated with the endmarker #.
Construction of a DFA

initially, the only unmarked state in Dstates is firstpos(root),
    where root is the root of the syntax tree for (r)#;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        let U be the set of positions that are in followpos(p)
            for some position p in T,
            such that the symbol at position p is a;
        if U is not empty and is not in Dstates then
            add U as an unmarked state to Dstates;
        Dtran[T, a] = U
    end
end
Construction of a DFA

firstpos(root) = {1, 2, 3} = A

Dtran[A, a] = followpos(1, 3) = {1, 2, 3, 4} = B
Dtran[A, b] = followpos(2) = {1, 2, 3} = A
Dtran[B, a] = followpos(1, 3) = {1, 2, 3, 4} = B
Dtran[B, b] = followpos(2, 4) = {1, 2, 3, 5} = C
Dtran[C, a] = followpos(1, 3) = B
Dtran[C, b] = followpos(2, 5) = {1, 2, 3, 6} = D
Dtran[D, a] = followpos(1, 3) = B
Dtran[D, b] = followpos(2) = A

Dtran:
    state   a   b
      A     B   A
      B     B   C
      C     B   D
      D     B   A
NFA Without ε-Transitions
1. Make all positions in firstpos of the root be start states,
2. label each directed edge (i, j) by the symbol at position j, and
3. make the position associated with # be the only accepting state.

[figure: the resulting ε-free NFA for (a|b)*abb#]
Minimizing the Number of States of the DFA

Design of a Lexical Analyzer Generator

Pattern Matching Based on NFAs

DFA for a Lexical Analyzer
Solve by Direct DFA
• (a | b | c)* c a b*
• (a | d*)(b | c)* d a* b
Simulating the NFA

Simulating a DFA
s = s0;
c = nextchar;
while c != eof do
    s = move(s, c);
    c = nextchar;
end;
if s is in F then
    return "yes"
else
    return "no";
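Filling in this loop with the Dtran table computed earlier for (a|b)*abb gives a complete recognizer; a minimal C version (state encoding and names are illustrative):

#include <stdio.h>

enum { A, B, C, D };                    /* the four DFA states; D accepts */

static const int dtran[4][2] = {        /* [state][0 for a, 1 for b]      */
    {B, A},                             /* A */
    {B, C},                             /* B */
    {B, D},                             /* C */
    {B, A}                              /* D */
};

static const char *simulate(const char *s) {
    int state = A;                      /* s = s0         */
    for (; *s != '\0'; s++) {           /* while c != eof */
        if (*s != 'a' && *s != 'b') return "no";
        state = dtran[state][*s - 'a']; /* s = move(s, c) */
    }
    return state == D ? "yes" : "no";   /* is s in F?     */
}

int main(void) {
    printf("abb  -> %s\n", simulate("abb"));   /* yes */
    printf("aabb -> %s\n", simulate("aabb"));  /* yes */
    printf("abab -> %s\n", simulate("abab"));  /* no  */
    return 0;
}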
Compiler Construction Tools
'Parser generators' that automatically produce syntax analyzers from a grammatical description of a programming language.

'Scanner generators' that produce lexical analyzers from a regular-expression description of the tokens of a language.

'Syntax-directed translation engines' that produce collections of routines for walking a parse tree and generating intermediate code.

'Code-generator generators' that produce a code generator from a collection of rules for translating each operation of the intermediate language into the machine language for a target machine.

'Data-flow analysis engines' that facilitate the gathering of information about how values are transmitted from one part of a program to each other part. This is a key part of code optimization.
Lexical Analyzer Generator: Lex
• Lex allows one to specify a lexical analyzer by writing regular expressions to describe patterns for tokens.
• The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler.
• The Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c.
Creating a Lexical Analyzer with Lex

Lex source program lex.l → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out → sequence of tokens
Design of a Lexical Analyzer Generator

Structure of a Lex Program
A Lex program has the following form:

    declarations
    %%
    translation rules
    %%
    auxiliary functions

• The declarations section includes declarations of variables, manifest constants, and regular definitions.
• The translation rules each have the form
    pattern { action }
• Each pattern is a regular expression.
Lex Program
%{
/* definitions of manifest constants LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
Lex Program
{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
"<"       {yylval = LT; return(RELOP);}
"<="      {yylval = LE; return(RELOP);}
"="       {yylval = EQ; return(RELOP);}
"<>"      {yylval = NE; return(RELOP);}
">"       {yylval = GT; return(RELOP);}
">="      {yylval = GE; return(RELOP);}

%%
Lex Program
int installID()
{ /* function to install the lexeme, whose first character is
     pointed to by yytext and whose length is yyleng, into the
     symbol table, and return a pointer thereto */
}

int installNum()
{ /* similar to installID, but puts numerical constants into a
     separate table */
}
Lex Variables
• yyin is a variable of type FILE* and points to the input file. yyin is defined by Lex automatically; it points either to an input file or to stdin (console input).
• yytext is of type char* and contains the lexeme currently found.
• yyleng is a variable of type int and stores the length of the lexeme pointed to by yytext.
• yylval is a global variable used to pass the semantic value associated with a token from the lexer to the parser.
Lex Functions
• yylex() is a function of return type int. Lex automatically defines yylex() in lex.yy.c but does not call it. The programmer must call yylex() in the auxiliary-functions section of the Lex program.
• yywrap(): Lex declares the function yywrap(), of return type int, in the file lex.yy.c. yylex() makes a call to yywrap() when it encounters the end of input. If yywrap() returns zero, yylex() assumes there is more input and continues scanning from the location pointed to by yyin. If yywrap() returns a non-zero value (indicating true), yylex() terminates the scanning process and returns 0 (i.e. it "wraps up").
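Putting the pieces together, a complete minimal Lex program might look as follows; this is a sketch (the patterns and counters are illustrative), built roughly with: lex count.l && cc lex.yy.c -o count

%{
#include <stdio.h>
int ids = 0, nums = 0;     /* counters updated by the rules below */
%}
letter  [A-Za-z]
digit   [0-9]
%%
{letter}({letter}|{digit})*   { ids++;  }
{digit}+                      { nums++; }
[ \t\n]+                      { /* skip whitespace */ }
.                             { /* ignore anything else */ }
%%
int yywrap(void) { return 1; }   /* non-zero: no more input after EOF */

int main(void) {
    yylex();                     /* scan until end of input */
    printf("identifiers: %d, numbers: %d\n", ids, nums);
    return 0;
}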
