CD DSTC Notes
1. Introduction to Compiler
Lexical Analysis
The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of
characters and converts it into meaningful lexemes. The lexical analyzer represents each lexeme as a token
of the form:
<token-name, attribute-value>
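The idea can be illustrated with a minimal lexer sketch in Python (the token classes and patterns below are illustrative assumptions, not from any particular compiler):

import re

# Each match is emitted as a <token-name, attribute-value> pair.
TOKEN_SPEC = [
    ("NUM", r"\d+"),          # integer literals
    ("ID", r"[A-Za-z_]\w*"),  # identifiers
    ("OP", r"[+\-*/=]"),      # operators
    ("SKIP", r"\s+"),         # whitespace, discarded
]

def tokenize(source):
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    for m in re.finditer(pattern, source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("count = count + 1")))
# [('ID', 'count'), ('OP', '='), ('ID', 'count'), ('OP', '+'), ('NUM', '1')]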
Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical analysis as
input and generates a parse tree (or syntax tree). In this phase, token arrangements are checked against the
source-language grammar, i.e. the parser checks whether the expression made by the tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language. For example,
it checks that values are assigned only between compatible data types and flags errors such as adding a string
to an integer. The semantic analyzer also keeps track of identifiers, their types and expressions, and checks
whether identifiers are declared before use. It produces an annotated syntax tree as output.
Code Optimization
The next phase optimizes the intermediate code. Optimization can be viewed as removing unnecessary
code lines and arranging the sequence of statements in order to speed up program execution without
wasting resources (CPU, memory).
Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code and maps it to
the target machine language. The code generator translates the intermediate code into a sequence of
(generally) relocatable machine code. This sequence of machine instructions performs the same task as the
intermediate code.
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All identifier names, along
with their types, are stored here. The symbol table makes it easy for the compiler to quickly search for an
identifier's record and retrieve it. The symbol table is also used for scope management.
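As a rough illustration, a symbol table with scope management can be sketched in Python as a chain of dictionaries (the structure below is an assumption for illustration; real compilers store many more attributes per entry):

class SymbolTable:
    def __init__(self, parent=None):
        self.entries = {}     # name -> attributes (type, offset, ...)
        self.parent = parent  # enclosing scope, for scope management

    def insert(self, name, **attrs):
        self.entries[name] = attrs

    def lookup(self, name):
        # Search the current scope first, then the enclosing scopes.
        table = self
        while table is not None:
            if name in table.entries:
                return table.entries[name]
            table = table.parent
        return None

globals_ = SymbolTable()
globals_.insert("x", type="int")
locals_ = SymbolTable(parent=globals_)
print(locals_.lookup("x"))  # {'type': 'int'}, found in the enclosing scope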
2. What is the pass of a compiler? Explain how the single and multi-pass
compilers work.
A pass is one complete traversal of the source program, or of an intermediate representation of it, by the compiler.
In a single-pass compiler, the source code is transformed directly into machine code in one traversal. Pascal,
for example, was designed to be compiled in a single pass.
The two-pass compiler method also simplifies the retargeting process and allows multiple front ends.
Multipass Compilers:
A multipass compiler processes the source code or syntax tree of a program several times.
It divides a large program into multiple small programs and processes them.
It develops multiple intermediate codes. Each pass takes the output of the previous pass as its input,
so it requires less memory. A multipass compiler is also known as a 'Wide Compiler'.
2. Lexical Analyzer
Move(B, a) = {3,8}
ε-closure(Move(B, a)) = {1,2,3,4,6,7,8} = B
Move(B, b) = {5,9}
ε-closure(Move(B, b)) = {1,2,4,5,6,7,9} = D (new state)
Move(C, a) = {3,8}
ε-closure(Move(C, a)) = {1,2,3,4,6,7,8} = B
Move(C, b) = {5}
ε-closure(Move(C, b)) = {1,2,4,5,6,7} = C
Move(D, a) = {3,8}
ε-closure(Move(D, a)) = {1,2,3,4,6,7,8} = B
Move(D, b) = {5,10}
ε-closure(Move(D, b)) = {1,2,4,5,6,7,10} = E (new state)
Move(E, a) = {3,8}
ε-closure(Move(E, a)) = {1,2,3,4,6,7,8} = B
Move(E, b) = {5}
ε-closure(Move(E, b)) = {1,2,4,5,6,7} = C
States | a | b
A      | B | C
B      | B | D
C      | B | C
D      | B | E
E      | B | C
Table: Transition table for (a+b)*abb
Figure: DFA transition diagram for (a+b)*abb over states A-E (E is the accepting state).
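The ε-closure/Move computation above is the subset construction. A small Python sketch is given below; it encodes the Thompson NFA for (a|b)*abb using the common textbook 0-10 state numbering (the worked example above numbers the states slightly differently, but the algorithm is identical):

# epsilon-transitions and symbol transitions of the NFA for (a|b)*abb
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
DELTA = {(2, "a"): {3}, (4, "b"): {5}, (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10}}

def eps_closure(states):
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, symbol):
    return set().union(*(DELTA.get((s, symbol), set()) for s in states))

start = eps_closure({0})
dstates, worklist, dtran = {start}, [start], {}
while worklist:
    S = worklist.pop()
    for a in "ab":
        T = eps_closure(move(S, a))
        dtran[(S, a)] = T
        if T not in dstates:
            dstates.add(T)
            worklist.append(T)

print(len(dstates))  # 5 DFA states, matching A-E in the table above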
To find followpos, traverse the cat (concatenation) and star nodes of the syntax tree in depth-first search order.
1. Cat node: i ∈ lastpos(c1) = {5}, firstpos(c2) = {6}, so followpos(5) = {6}
2. Cat node: i ∈ lastpos(c1) = {4}, firstpos(c2) = {5}, so followpos(4) = {5}
3. Cat node: i ∈ lastpos(c1) = {3}, firstpos(c2) = {4}, so followpos(3) = {4}
4. Cat node: i ∈ lastpos(c1) = {1,2}, firstpos(c2) = {3}, so followpos(1) ⊇ {3} and followpos(2) ⊇ {3}
5. Star node: i ∈ lastpos(c1) = {1,2}, firstpos(c1) = {1,2}, so followpos(1) ⊇ {1,2} and followpos(2) ⊇ {1,2}
Position | followpos(i)
1        | {1,2,3}
2        | {1,2,3}
3        | {4}
4        | {5}
5        | {6}
Table: followpos table
Construct DFA
Initial state = firstpos(root node) = {1,2,3} = A
δ(A, a) = followpos(1) ∪ followpos(3) = {1,2,3} ∪ {4} = {1,2,3,4} = B
δ(A, b) = followpos(2) = {1,2,3} = A
δ(B, a) = followpos(1) ∪ followpos(3) = {1,2,3,4} = B
δ(B, b) = followpos(2) ∪ followpos(4) = {1,2,3} ∪ {5} = {1,2,3,5} = C
δ(C, a) = followpos(1) ∪ followpos(3) = {1,2,3,4} = B
δ(C, b) = followpos(2) ∪ followpos(5) = {1,2,3} ∪ {6} = {1,2,3,6} = D
δ(D, a) = followpos(1) ∪ followpos(3) = {1,2,3,4} = B
δ(D, b) = followpos(2) = {1,2,3} = A
D contains position 6, the endmarker, so D is the accepting state.
Transition Table
States | a | b
A      | B | A
B      | B | C
C      | B | D
D      | B | A
Table: Transition table for (a+b)*abb
Figure: DFA for (a+b)*abb constructed from followpos, over states A-D (D is the accepting state).
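The same construction can be sketched in Python directly from the followpos table above (positions 1-5 carry the symbols a, b, a, b, b; position 6 is the endmarker #):

SYM = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b"}  # symbol at each position
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}}

start = frozenset({1, 2, 3})  # firstpos of the root
states, worklist, delta = {start}, [start], {}
while worklist:
    S = worklist.pop()
    for a in "ab":
        # delta(S, a) = union of followpos(p) for every position p in S labeled a
        T = frozenset().union(*(FOLLOWPOS[p] for p in S if SYM.get(p) == a))
        delta[(S, a)] = T
        if T not in states:
            states.add(T)
            worklist.append(T)

accepting = [S for S in states if 6 in S]  # states containing the endmarker
print(len(states), len(accepting))  # 4 states, 1 accepting, as in the table above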
3. Parsing Theory
6. Write down the algorithm for left factoring and left recursion.
Left Factoring:
Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive parsing.
Algorithm to left factor a grammar
Input: Grammar G
Output: An equivalent left factored grammar.
1. For each nonterminal A, find the longest prefix α common to two or more of its alternatives.
2. If α ≠ ε, i.e., there is a nontrivial common prefix, replace all the A-productions
A → αβ1 | αβ2 | … | αβn | γ, where γ represents all alternatives that do not begin with α,
by
A → αA' | γ
A' → β1 | β2 | … | βn
Here A' is a new nonterminal. Repeatedly apply this transformation until no two
alternatives for a nonterminal have a common prefix.
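A Python sketch of one round of this transformation is given below (alternatives are represented as tuples of grammar symbols; the fresh nonterminal name A' and the 'epsilon' marker are naming assumptions of the sketch):

def common_prefix(xs, ys):
    n = 0
    while n < min(len(xs), len(ys)) and xs[n] == ys[n]:
        n += 1
    return xs[:n]

def left_factor(A, alts):
    # Step 1: longest prefix alpha common to two or more alternatives.
    alpha = max((common_prefix(p, q) for i, p in enumerate(alts) for q in alts[i + 1:]),
                key=len, default=())
    if not alpha:
        return {A: alts}  # no nontrivial common prefix
    A2 = A + "'"          # fresh nonterminal A'
    # Step 2: A -> alpha A' | gamma ;  A' -> beta1 | beta2 | ...
    withp, without = [], []
    for alt in alts:
        if alt[:len(alpha)] == alpha:
            withp.append(alt[len(alpha):] or ("epsilon",))
        else:
            without.append(alt)
    return {A: [alpha + (A2,)] + without, A2: withp}

# Dangling-else example: S -> iEtS | iEtSeS | a
print(left_factor("S", [("i", "E", "t", "S"), ("i", "E", "t", "S", "e", "S"), ("a",)]))
# S -> i E t S S' | a ;  S' -> epsilon | e S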
Left Recursion:
• A grammar is said to be left recursive if it has a nonterminal A such that there is a derivation
A ⇒+ Aα for some string α.
• Top-down parsing methods cannot handle left-recursive grammars, so a transformation that
eliminates left recursion is needed.
Algorithm to eliminate left recursion:
Input: Grammar G
Output: An equivalent grammar with no left recursion.
1. Assign an ordering A1, …, An to the nonterminals of the grammar.
2. for i := 1 to n do
begin
for j := 1 to i−1 do begin
replace each production of the form Ai → Ajγ
by the productions Ai → δ1γ | δ2γ | … | δkγ,
where Aj → δ1 | δ2 | … | δk are all the current Aj-productions;
end
eliminate the immediate left recursion among the Ai-productions
end
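A Python sketch of the algorithm is shown below (each nonterminal maps to a list of alternatives, each a tuple of symbols; the empty tuple stands for ε, and the grammar is assumed to have no cycles or ε-productions):

def eliminate_immediate(A, alts):
    recursive = [alt[1:] for alt in alts if alt and alt[0] == A]  # A -> A gamma
    rest = [alt for alt in alts if not alt or alt[0] != A]        # A -> beta
    if not recursive:
        return {A: alts}
    A2 = A + "'"
    return {A: [beta + (A2,) for beta in rest],                   # A  -> beta A'
            A2: [g + (A2,) for g in recursive] + [()]}            # A' -> gamma A' | eps

def eliminate_left_recursion(grammar, order):
    g = {A: list(alts) for A, alts in grammar.items()}
    for i, Ai in enumerate(order):
        for Aj in order[:i]:
            new = []
            for alt in g[Ai]:
                if alt and alt[0] == Aj:  # replace Ai -> Aj gamma
                    new += [delta + alt[1:] for delta in g[Aj]]
                else:
                    new.append(alt)
            g[Ai] = new
        g.update(eliminate_immediate(Ai, g[Ai]))
    return g

g = {"E": [("E", "+", "T"), ("T",)], "T": [("T", "*", "F"), ("F",)],
     "F": [("(", "E", ")"), ("id",)]}
print(eliminate_left_recursion(g, ["E", "T", "F"]))  # yields E -> T E', E' -> + T E' | eps, etc.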
7. Draw parsing table for Table Driven Parser for the given grammar. Is the
grammar LL (1)?
Consider the following grammar
E → E + T | T
T → T * F | F
F → (E) | id
Eliminating immediate left recursion from the above grammar, we obtain:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
Compute FIRST & FOLLOW:
   | FIRST   | FOLLOW
E  | {(, id} | {$, )}
E' | {+, ε}  | {$, )}
T  | {(, id} | {+, $, )}
T' | {*, ε}  | {+, $, )}
F  | {(, id} | {*, +, $, )}
Predictive Parsing Table
   | id       | +          | *          | (        | )        | $
E  | E → TE'  |            |            | E → TE'  |          |
E' |          | E' → +TE'  |            |          | E' → ε   | E' → ε
T  | T → FT'  |            |            | T → FT'  |          |
T' |          | T' → ε     | T' → *FT'  |          | T' → ε   | T' → ε
F  | F → id   |            |            | F → (E)  |          |
There are no multiple entries in the parsing table, so the given grammar is LL(1).
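For illustration, here is a sketch of the table-driven parser using the table above, written in Python (ε bodies are empty tuples; error handling is reduced to returning False):

TABLE = {
    ("E", "id"): ("T", "E'"), ("E", "("): ("T", "E'"),
    ("E'", "+"): ("+", "T", "E'"), ("E'", ")"): (), ("E'", "$"): (),
    ("T", "id"): ("F", "T'"), ("T", "("): ("F", "T'"),
    ("T'", "+"): (), ("T'", "*"): ("*", "F", "T'"), ("T'", ")"): (), ("T'", "$"): (),
    ("F", "id"): ("id",), ("F", "("): ("(", "E", ")"),
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    tokens = tokens + ["$"]
    stack, i = ["$", "E"], 0
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            body = TABLE.get((top, tokens[i]))
            if body is None:
                return False              # empty table entry: syntax error
            stack.extend(reversed(body))  # push the body right to left
        elif top == tokens[i]:
            i += 1                        # match a terminal (or $)
        else:
            return False
    return i == len(tokens)

print(parse(["id", "+", "id", "*", "id"]))  # True
print(parse(["id", "+", "*"]))              # False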
Example:
E→ E + T | T
T→ TF | F
F→ F * | a | b
Augmented grammar: E' → E, with initial item E' → .E. For the reduce entries below, the productions are
numbered 1: E → E + T, 2: E → T, 3: T → TF, 4: T → F, 5: F → F*, 6: F → a | b.
      Action                            Goto
State | +  | *  | a  | b  | $      | E | T | F
0     |    |    | S4 | S5 |        | 1 | 2 | 3
1     | S6 |    |    |    | Accept |   |   |
2     | R2 |    | S4 | S5 | R2     |   |   | 7
3     | R4 | S8 | R4 | R4 | R4     |   |   |
4     | R6 | R6 | R6 | R6 | R6     |   |   |
5     | R6 | R6 | R6 | R6 | R6     |   |   |
6     |    |    | S4 | S5 |        |   | 9 | 3
7     | R3 | S8 | R3 | R3 | R3     |   |   |
8     | R5 | R5 | R5 | R5 | R5     |   |   |
9     | R1 |    | S4 | S5 | R1     |   |   | 7
10. Write down steps to set precedence relationship for Operator Precedence
Grammar. Design precedence table for:
• Operator Grammar: A grammar in which no production has ε on its RHS and no two adjacent
nonterminals is called an operator precedence grammar.
• In operator precedence parsing, we define three disjoint precedence relations <·, =· and ·>
between certain pairs of terminals.
Relation | Meaning
a <· b   | a yields precedence to b
a =· b   | a has the same precedence as b
a ·> b   | a takes precedence over b
Leading:
Leading of a nonterminal is the first terminal or operator in a production of that nonterminal.
Trailing:
Trailing of a nonterminal is the last terminal or operator in a production of that nonterminal.
     | + | * | id | $
f(x) | 2 | 4 | 4  | 0
g(x) | 1 | 3 | 5  | 0
Table: Operator precedence functions
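A sketch of a parser driven by these precedence functions is given below (Python; it only accepts or rejects, without building a parse tree, and like all precedence-function parsers it can miss some errors, since f and g lose information relative to the full relation table):

f = {"+": 2, "*": 4, "id": 4, "$": 0}
g = {"+": 1, "*": 3, "id": 5, "$": 0}

def parse(tokens):
    stack, i = ["$"], 0  # the stack holds terminals only
    tokens = tokens + ["$"]
    while not (stack == ["$"] and tokens[i] == "$"):
        top, look = stack[-1], tokens[i]
        if f[top] < g[look]:    # top <. look : shift
            stack.append(look)
            i += 1
        elif f[top] > g[look]:  # top .> look : reduce the handle
            stack.pop()
        else:
            return False
    return True

print(parse(["id", "+", "id", "*", "id"]))  # True: * is reduced before +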
4. Error Recovery
11. Explain Error Recovery strategies of compiler.
1. Panic mode
This strategy is used by most parsing methods and is simple to implement.
In this method, on discovering an error, the parser discards input symbols one at a time. This
process continues until one of a designated set of synchronizing tokens is found.
Synchronizing tokens are delimiters such as semicolon or end, which indicate the
end of an input statement.
Thus, in panic-mode recovery, a considerable amount of input may be skipped without
checking it for additional errors.
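A minimal sketch of the skipping step in Python (the synchronizing set is an assumption for illustration):

SYNC = {";", "end"}  # designated synchronizing tokens

def skip_to_sync(tokens, i):
    # Discard input symbols one at a time until a synchronizing token.
    while i < len(tokens) and tokens[i] not in SYNC:
        i += 1  # this input is skipped without being checked
    return i + 1 if i < len(tokens) else i  # resume just past the delimiter

tokens = ["x", "=", "@", "#", ";", "y", "=", "1", ";"]
print(tokens[skip_to_sync(tokens, 2):])  # ['y', '=', '1', ';'] - parsing resumes here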
2. Phrase level recovery
In this method, on discovering an error, the parser performs a local correction on the remaining
input.
It can replace a prefix of the remaining input by some string, which allows the parser to
continue its job.
Typical local corrections are replacing a comma with a semicolon, deleting an extraneous
semicolon, or inserting a missing semicolon. The choice of local correction is left to the
compiler designer.
While making such replacements, care must be taken not to enter an infinite loop.
This method is used in many error-repairing compilers.
3. Error production
If we have good knowledge of the common errors that might be encountered, then we can
augment the grammar of the corresponding language with error productions that
generate the erroneous constructs.
If an error production is used during parsing, we can generate an appropriate error message to
indicate the erroneous construct that has been recognized in the input.
This method is difficult to maintain, because if we change the grammar, it becomes
necessary to change the corresponding error productions.
4. Global correction
We would like a compiler to make as few changes as possible in processing an incorrect
input string.
Given an incorrect input string x and a grammar G, such an algorithm finds a parse tree for
a related string y such that the number of insertions, deletions and changes of tokens required
to transform x into y is as small as possible.
Such methods greatly increase the time and space requirements of parsing.
Global correction is thus mostly a theoretical concept.
12. Explain how type checking & error reporting is performed in a compiler.
A compiler must check that the source program follows both syntactic and semantic conventions of
the source language. This checking, called static checking, detects and reports programming errors.
Constructors include:
Arrays: If T is a type expression, then array(I, T) is a type expression denoting the type of an array with
elements of type T and index set I.
Products: If T1 and T2 are type expressions, then their Cartesian product T1 × T2 is a type expression.
Records: The difference between a record and a product is that the fields of a record have names. The
record type constructor is applied to a tuple formed from field names and field types.
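These constructors can be mirrored by a small Python sketch (the representation is an illustrative assumption):

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Array:          # array(I, T)
    index: range      # index set I
    elem: object      # element type T

@dataclass(frozen=True)
class Product:        # T1 x T2
    left: object
    right: object

@dataclass(frozen=True)
class Record:         # named fields distinguish a record from a plain product
    fields: Tuple[Tuple[str, object], ...]

INT, CHAR = "int", "char"
entry = Record((("lexeme", Array(range(0, 15), CHAR)), ("line", INT)))
print(entry)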
Error Recovery
Since type checking has the potential for catching errors in a program, it is desirable for the type checker to
recover from errors so it can check the rest of the input. Error handling has to be designed into the type
system right from the start; the type-checking rules must be prepared to cope with errors.
13. Explain quadruple, triple and indirect triple with suitable example.
There are three types of representations used for three-address code:
1. Quadruples
2. Triples
3. Indirect triples
Consider the input statement x := -a * b + -a * b.
Three-address code for the above statement:
t1 := -a
t2 := t1 * b
t3 := -a
t4 := t3 * b
t5 := t2 + t4
x := t5
Quadruple representation
A quadruple is a structure with at most four fields: op, arg1, arg2 and result.
The op field is used to represent the internal code for the operator.
The arg1 and arg2 fields represent the two operands, and the result field stores the result of the expression.
Statements with unary operators, like x := -y, do not use arg2.
#   | op     | arg1 | arg2 | result
(0) | uminus | a    |      | t1
(1) | *      | t1   | b    | t2
(2) | uminus | a    |      | t3
(3) | *      | t3   | b    | t4
(4) | +      | t2   | t4   | t5
(5) | :=     | t5   |      | x
Triples
To avoid entering temporary names into the symbol table, we can refer to a temporary value by
the position of the statement that computes it.
If we do so, three-address statements can be represented by records with only three fields: op, arg1
and arg2.
#   | op     | arg1 | arg2
(0) | uminus | a    |
(1) | *      | (0)  | b
(2) | uminus | a    |
(3) | *      | (2)  | b
(4) | +      | (1)  | (3)
(5) | :=     | x    | (4)
Indirect Triples
In the indirect triple representation, a separate listing of pointers to triples is maintained, and statements
refer to triples through these pointers rather than by their absolute positions; this makes it easy to reorder
statements without renumbering.
This implementation is called indirect triples.
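A compact Python sketch of the three representations for the statement above (field order follows the tables; None marks an unused field):

quads = [  # quadruples: explicit result field
    ("uminus", "a", None, "t1"),
    ("*", "t1", "b", "t2"),
    ("uminus", "a", None, "t3"),
    ("*", "t3", "b", "t4"),
    ("+", "t2", "t4", "t5"),
    (":=", "t5", None, "x"),
]
triples = [  # triples: a result is referred to by the index of its statement
    ("uminus", "a", None),  # (0)
    ("*", 0, "b"),          # (1) - 0 refers to triple (0)
    ("uminus", "a", None),  # (2)
    ("*", 2, "b"),          # (3)
    ("+", 1, 3),            # (4)
    (":=", "x", 4),         # (5)
]
listing = [0, 1, 2, 3, 4, 5]  # indirect triples: statements are reordered by
                              # permuting this pointer list, not the triples
print(quads[4], triples[4])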
14. Write syntax directed definition of a simple desk calculator and draw an
annotated parse tree for 3 * 5 + 4 n
The value of a synthesized attribute at a node can be computed from the values of attributes at
the children of that node in the parse tree.
A syntax-directed definition that uses synthesized attributes exclusively is said to be an
S-attributed definition.
A parse tree for an S-attributed definition can always be annotated by evaluating the
semantic rules for the attributes at each node bottom-up, from the leaves to the root.
An annotated parse tree is a parse tree showing the values of the attributes at each node.
The process of computing the attribute values at the nodes is called annotating or
decorating the parse tree.
The syntax-directed definition for a simple desk calculator is given in the following table:
Production   | Semantic Rule
L → E n      | print(E.val)
E → E1 + T   | E.val = E1.val + T.val
E → T        | E.val = T.val
T → T1 * F   | T.val = T1.val * F.val
T → F        | T.val = F.val
F → (E)      | F.val = E.val
F → digit    | F.val = digit.lexval
Table: Syntax directed definition of a simple desk calculator
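The bottom-up evaluation can be sketched in Python, with one function per production applying its semantic rule (the hand-built call tree below mirrors the parse tree of 3 * 5 + 4 n):

def F(digit):  # F -> digit : F.val = digit.lexval
    return digit

def T(t1=None, f=None):  # T -> T1 * F : T.val = T1.val * F.val ; T -> F
    return f if t1 is None else t1 * f

def E(e1=None, t=None):  # E -> E1 + T : E.val = E1.val + T.val ; E -> T
    return t if e1 is None else e1 + t

def L(e):  # L -> E n : print(E.val)
    print(e)

# Annotating the parse tree for 3 * 5 + 4 n from the leaves to the root:
L(E(e1=E(t=T(t1=T(f=F(3)), f=F(5))), t=T(f=F(4))))  # prints 19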
Stack allocation
All compilers for languages that use procedures, functions or methods as units of user-defined
actions manage at least part of their run-time memory as a stack.
Each time a procedure is called, space for its local variables is pushed onto the stack, and when
the procedure terminates, that space is popped off the stack.
Calling Sequences: (How is task divided between calling & called program for stack updating?)
Procedure calls are implemented by what are known as calling sequences, which consist of
code that allocates an activation record on the stack and enters information into its fields.
A return sequence is similar code that restores the state of the machine so the calling procedure
can continue its execution after the call.
The code in a calling sequence is often divided between the calling procedure (the caller) and
the procedure it calls (the callee).
When designing calling sequences and the layout of activation records, the following principles are helpful:
1. Values communicated between caller and callee are generally placed at the beginning
of the callee's activation record, so they are as close as possible to the caller's
activation record.
2. Fixed-length items are generally placed in the middle. Such items typically include
the control link, the access link, and the machine status field.
3. Items whose size may not be known early enough are placed at the end of the
activation record.
4. We must locate the top-of-stack pointer judiciously. A common approach is to
have it point to the end of the fixed-length fields in the activation record. Fixed-length
data can then be accessed by fixed offsets, known to the intermediate code generator,
relative to the top-of-stack pointer.
The calling sequence and its division between caller and callee are as follows:
1. The caller evaluates the actual parameters.
2. The caller stores a return address and the old value of top_sp into the callee's activation record.
The caller then increments top_sp to the respective position.
3. The callee saves the register values and other status information.
4. The callee initializes its local data and begins execution.
A corresponding return sequence is:
1. The callee places the return value next to the parameters.
2. Using the information in the machine status field, the callee restores top_sp and the other
registers, and then branches to the return address that the caller placed in the status
field.
3. Although top_sp has been decremented, the caller knows where the return value is
relative to the current value of top_sp; the caller therefore may use that value.
Activation Records (Most IMP)
Fields of an activation record:
Return value
Actual parameters
Control link
Access link
Local variables
Temporaries
Figure: Layout of a general activation record
18. For what purpose compiler uses symbol table? How characters of a name are
stored in symbol table? What do you mean by dangling reference?
Definition: A symbol table is a data structure used by the compiler to keep track of the semantics of
variables; that is, it stores scope and binding information about names.
The symbol table is built during the lexical and syntax analysis phases.
Symbol table entries
The items to be stored into symbol table are:
1) Variable names
2) Constants
3) Procedure names
4) Function names
5) Literal constants and strings
6) Compiler generated temporaries
7) Labels in source language
The compiler uses the following types of information from the symbol table:
1) Data type
2) Name
3) Declaring procedure
4) Offset in storage
5) If structure or record then pointer to structure table
6) For parameters, whether parameter passing is by value or reference?
7) Number and type of arguments passed to the function
8) Base address
How to store names in symbol table?
There are two types of representation:
1. Fixed-length name
A fixed amount of space for each name is allocated in the symbol table. With this type of storage,
if a name is too short, space is wasted.
The name can be referred to by a pointer to its symbol-table entry.
Name      | Attribute
calculate |
sum       |
a         |
b         |
Figure: Fixed-length name storage
2. Variable-length name
The names are stored in one large character array, with an end marker ($) written after each name;
the symbol-table entry for a name holds its starting index and length.
Name (starting index, length) | Attribute
(0, 10)  |
(10, 4)  |
(14, 2)  |
(16, 2)  |
Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Chars: c a l c u l a t e $ s  u  m  $  a  $  b  $
Figure: Variable-length name storage
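A Python sketch of the variable-length scheme (here the stored length excludes the '$' separator, a small deviation from the figure made for convenience):

class NamePool:
    def __init__(self):
        self.chars = []    # one big character array shared by all names
        self.entries = []  # (starting index, length) per name

    def add(self, name):
        start = len(self.chars)
        self.chars.extend(name + "$")  # '$' terminates each name, as in the figure
        self.entries.append((start, len(name)))
        return len(self.entries) - 1   # names are referred to by entry index

    def get(self, idx):
        start, length = self.entries[idx]
        return "".join(self.chars[start:start + length])

pool = NamePool()
i = pool.add("calculate"); pool.add("sum"); pool.add("a"); pool.add("b")
print(pool.get(i), "".join(pool.chars))  # calculate calculate$sum$a$b$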
Dangling Reference:
A dangling reference is a reference to an object that no longer exists. Garbage is an object that cannot be
reached through a reference. Dangling references do not exist in garbage collected languages because objects
are only reclaimed when they are no longer accessible (only garbage is collected).
In short, when you have a pointer (or reference) that points to a location whose data has been deleted,
it is known as a dangling reference.
7. Code Optimization
The common subexpression t4 := 4*i is eliminated, as its computation is already available in t1 and
the value of i has not changed between that definition and this use.
3. Variable propagation
Variable propagation means use of one variable instead of another.
Example:
x = pi;
area = x * r * r;
Using variable propagation, the code becomes: area = pi * r * r;
Here the variable x is eliminated. The necessary condition is that the variable must have been
assigned another variable or some constant.
4. Code movement
There are two basic goals of code movement:
I. To reduce the size of the code.
II. To reduce the frequency of execution of code.
Example:
for(i=0;i<=10;i++)
{
x=y*5; k=(y*5)+50;
}
Since y*5 is loop invariant, it can be computed once before the loop:
t=y*5;
for(i=0;i<=10;i++)
{
x=t; k=t+50;
}
Example:
i=0;
if(i==1)
{
a=x+5;
}
The if statement is dead code, as its condition will never be satisfied; hence the statement
can be eliminated as an optimization.
Peephole optimization improves code by examining a short sequence of target instructions (a peephole)
and replacing it by a shorter or faster sequence with the same behavior. Common peephole transformations
include the following:
o Redundant instruction elimination.
In particular, redundant loads and stores can be eliminated by this type of transformation.
Example:
MOV R0, x
MOV x, R0
We can eliminate the second instruction, since x is already in R0. But if MOV x, R0
carries a label, we cannot remove it.
o Unreachable code.
Code that can never be executed may be removed.
An unlabeled instruction immediately following an unconditional jump may be removed.
This operation can be repeated to eliminate a sequence of instructions.
Example:
#define debug 0
...
if (debug) { print debugging information }
In the intermediate representation, the if-statement may be translated as:
if debug = 1 goto L1
goto L2
L1: print debugging information
L2:
Now, since debug is set to 0 at the beginning of the program, constant propagation
should replace the test by:
if 0 ≠ 1 goto L2
print debugging information
L2:
As the condition of the first statement always evaluates to true, it can be
replaced by goto L2.
Then all the statements that print debugging aids are manifestly unreachable and
can be eliminated one at a time.
o Flow of control optimization.
The unnecessary jumps can be eliminated in either the intermediate code or the
target code by the following types of peephole optimizations.
We can replace the jump sequence
goto L1
...
L1: goto L2
by the sequence
goto L2
...
L1: goto L2
If there are no other jumps to L1, it may then be possible to eliminate the statement L1:
goto L2, provided it is preceded by an unconditional jump. Similarly, the sequence
if a < b goto L1
...
L1: goto L2
can be replaced by
if a < b goto L2
...
L1: goto L2
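A toy Python sketch of this jump-chain peephole transformation (the instruction format is an assumption for illustration):

code = ["goto L1", "x = 1", "L1: goto L2", "L2: y = 2"]

# Map each label to the instruction written after it.
targets = {line.split(":")[0]: line.split(":", 1)[1].strip()
           for line in code if ":" in line}

def retarget(label):
    seen = set()
    while label in targets and targets[label].startswith("goto ") and label not in seen:
        seen.add(label)  # guard against goto cycles
        label = targets[label].split()[1]
    return label

optimized = [f"goto {retarget(line.split()[1])}" if line.startswith("goto ") else line
             for line in code]
print(optimized)  # ['goto L2', 'x = 1', 'L1: goto L2', 'L2: y = 2']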
o Algebraic simplification.
Algebraic identities can be used to simplify code: statements such as x := x + 0 or
x := x * 1 compute nothing new and can simply be eliminated.
8. Code Generation
Consider the source statements a := b + c and d := a + e. A straightforward translation is:
MOV b, R0
ADD c, R0
MOV R0, a
MOV a, R0
ADD e, R0
MOV R0, d
Here the fourth statement, MOV a, R0, is redundant since a is already in R0, so we can
eliminate it.
5. Register allocation
Instructions involving register operands are usually shorter and faster than those involving
operands in memory.
The use of registers is often subdivided into two subproblems:
During register allocation, we select the set of variables that will reside in registers at each
point in the program.
During a subsequent register assignment phase, we pick the specific register in which each
variable will reside.
Finding an optimal assignment of registers to variables is difficult, even for single-register
machines; mathematically, the problem is NP-complete.
6. Choice of evaluation order
The order in which computations are performed can affect the efficiency of the target
code. Some computation orders require fewer registers to hold intermediate results
than others. Picking a best order is another difficult, NP-complete problem.
7. Approaches to code generation
The most important criterion for a code generator is that it produce correct code.
Correctness takes on special significance because of the number of special cases a
code generator must face.
Given the premium on correctness, designing a code generator so it can be easily
implemented, tested, and maintained is an important design goal.