Principles of Compiler Design
(SENG471)
Chapter One
Introduction
Preliminaries Required
• Basic knowledge of programming languages.
• Basic knowledge of automata and context-free grammars.
Textbook:
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman,
“Compilers: Principles, Techniques, and Tools”
Addison-Wesley, 2007.
Objective
At the end of this session students will be able to:
Understand the basic concepts and principles of Compiler Design
Understand the term compiler, its functions and how it works.
Be familiar with the different classifications of compilers.
Be familiar with cousins of the compiler: Linkers, Loaders, Interpreters, Assemblers
Understand the need for studying compiler design and construction
Understand the phases of compilation and the steps of compilation
Understand the history of compilers. (Assignment For G1 Students)
Be familiar with the different compiler tools (Assignment For G2 Students)
Introduction
Definition
A compiler is an executable program that can read a program in one high-level language and
translate it into an equivalent executable program in machine language.
More generally, a compiler is a computer program that translates a program written in a source
language into an equivalent program in a target language.
A source program/code is a program/code written in the source language, which is usually a
high-level language.
A target program/code is a program/code written in the target language, which often is a
machine language or an intermediate code (Object Code).
[Diagram: source program → Compiler → target program, with error messages reported along the way.]
Contd.
As a discipline, compiler design involves multiple computer science and
engineering courses, such as:
Programming Languages
Data structures and Algorithms
Theory of Computation (Automata and formal language theory)
Assembly language
Software Engineering
Computer Architecture
Operating Systems and
Discrete Mathematics
Why Study Theory of Compiler?
Curiosity
Prerequisite for developing advanced compilers, an area that remains active as
new computer architectures emerge
To improve capabilities of existing compiler/interpreter
To write more efficient code in a high-level language
Useful to develop software tools that parse computer codes or strings
E.g., editors, debuggers, interpreters, preprocessors, …
Important to understand how compilers work in order to program more effectively
To provide a solid foundation in parsing theory for parser writing
To make compiler design an excellent “capstone” project
To apply almost all of the major computer science fields, such as: automata theory,
computer programming, programming language design theory, assembly language,
computer architecture, data structures, algorithms, and software engineering.
Classification of Compilers
Compilers can be viewed from many perspectives:
By construction: Single Pass, Multiple Pass, Load & Go
By function: Debugging, Optimizing
o Classifying compilers by number of passes has its background in the
hardware resource limitations of computers.
o Compiling involves performing lots of work, and early computers did not have
enough memory to contain one program that did all of this work.
So compilers were split up into smaller programs which each
made a pass over the source (or some representation of it),
performing some of the required analysis and translations.
However, all utilize the same basic tasks to accomplish their actions.
Contd.
1. Single (One) Pass Compilers:- a compiler that passes through the source code of
each compilation unit only once.
Also called narrow compilers.
The ability to compile in a single pass has classically been seen as a benefit because
it simplifies the job of writing a compiler.
Single-pass compilers generally perform compilations faster than multi-pass
compilers.
Due to the resource limitations of early systems, many early languages were
specifically designed so that they could be compiled in a single pass (e.g., Pascal).
Disadvantage of single pass Compilers:
It is not possible to perform many of the sophisticated optimizations needed to
generate high quality code.
It can be difficult to count exactly how many passes an optimizing compiler
makes.
Contd.
2. Multi-Pass Compilers:- a compiler that processes the source code
or abstract syntax tree of a program several times.
Also called wide compilers.
Phases are separate "programs", which run sequentially.
By splitting the compiler up into small programs, correct programs are
more likely to be produced:
Proving the correctness of a set of small programs often requires less
effort than proving the correctness of a larger, single, equivalent
program.
Many programming languages cannot be compiled with a single-pass
compiler; for example, most modern languages, such as Java, require a
multi-pass compiler.
Contd.
3. Load and Go Compilers:- a compiler that generates machine code and then
immediately executes it.
Compilers usually produce either absolute code that is executed
immediately upon conclusion of the compilation or object code that is
transformed by a linking loader into absolute code.
These compiler organizations will be called Load & Go and
Link/Load.
Both Load & Go and Link/Load compilers use a number of passes to
translate the source program into absolute code.
A pass reads some form of the source program and transforms it into
another intermediate representation.
Contd.
4. Optimizing Compilers:- a compiler that tries to minimize or maximize
some attributes of an executable computer program.
The most common requirement is to minimize the time taken to execute a
program; a less common one is to minimize the amount of memory occupied.
The growth of portable computers has created a market for minimizing the
power consumed by a program.
Compiler optimization is generally implemented using a sequence of
optimizing transformations, algorithms which take a program and
transform it to produce a semantically equivalent output program that
uses fewer resources.
Types of optimization include peephole optimization, local optimization,
global optimization, loop optimization, machine-code optimization, etc.
Cousins of Compilers
A. Assembler:- is a translator that converts programs written in assembly language
into machine code.
It translates mnemonic operation codes to their machine-language equivalents and
assigns machine addresses to symbolic labels.
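As a toy sketch of these two tasks, consider the classic two-pass structure: one pass assigns addresses to labels, the next translates mnemonics. The mnemonics and opcode numbers below are invented for illustration and do not belong to any real machine.

```python
# Toy two-pass assembler sketch. Mnemonics and opcodes are made up.
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "STORE": 0x03, "JMP": 0x04, "HALT": 0xFF}

def assemble(lines):
    # Pass 1: assign machine addresses to symbolic labels.
    labels, addr = {}, 0
    for line in lines:
        if line.endswith(":"):
            labels[line[:-1]] = addr      # label names the next instruction
        else:
            addr += 1
    # Pass 2: translate mnemonics to opcodes and label operands to addresses.
    code = []
    for line in lines:
        if line.endswith(":"):
            continue
        parts = line.split()
        operand = None
        if len(parts) > 1:
            operand = labels.get(parts[1], parts[1])
        code.append((OPCODES[parts[0]], operand))
    return code

print(assemble(["LOAD x", "loop:", "ADD y", "JMP loop"]))
# [(1, 'x'), (2, 'y'), (4, 1)]
```

Note how `loop` resolves to address 1, the address of the instruction following the label.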
B. Interpreter:- is a computer program that translates high level
instructions/programs into machine code as they are encountered.
It produces the output of statements as they are interpreted
It generally uses one of the following strategies for program execution:
i. execute the source code directly
ii. translate source code into some efficient intermediate representation and
immediately execute this
iii. explicitly execute stored precompiled code made by a compiler which is part of the
interpreter system
Contd.
[Diagram: a language-processing system]
source program → preprocessor → modified source program → compiler →
target assembly program → assembler → relocatable machine code →
linker/loader (together with library files) → target machine code
C. Linker:- is a program that takes one or more objects generated by a
compiler and combines them into a single executable program.
D. Loader:- is the part of an operating system that is responsible for loading
programs from executables (i.e., executable files) into memory, preparing
them for execution and then executing them.
Compiler vs. Interpreter
Ideal concept:
Compiler: Source code → Compiler → Executable; then Executable + Input data → Output data
Interpreter: Source code + Input data → Interpreter → Output data
Most languages are usually thought of as using either one or the other:
Compilers: FORTRAN, COBOL, C, C++, Pascal, PL/1
Interpreters: Lisp, scheme, BASIC, APL, Perl, Python, Smalltalk
BUT: not always implemented this way
Basic Compiler Design
Write a huge program that takes as input another program in the source
language for the compiler, and gives as output an executable that we can
run.
To allow the code to be modified easily, we usually use a modular design
(decomposition) methodology to design a compiler.
Two design strategies:
1. Write a “front end” of the compiler (i.e. the lexer, parser, semantic
analyzer, and assembly tree generator), and write a separate back end for
each platform that you want to support
2. Write an efficient, highly optimized back end, and write a different front
end for each of several languages, such as Fortran, C, C++, and Java.
Source code → Front End → Intermediate code → Back End → Target code
The Analysis-Synthesis Model of Compilation
There are two parts to compilation: analysis & synthesis.
During analysis, the operations implied by the source program are
determined and recorded in a hierarchical structure called a tree.
During synthesis, those recorded operations are used to produce the translated code.
Front End (Analysis):
1. Lexical Analysis: breaks up the source program into constituent pieces
2. Syntax Analysis
3. Semantic Analysis: creates an intermediate representation of the source program
Back End (Synthesis):
4. Code Generation: constructs the target program from the intermediate representation
5. Optimization: takes the tree structure and translates the operations into the target program
Analysis
In compiling, analysis has three phases:
1. Linear analysis: stream of characters read from left-to-right and grouped into
tokens; known as lexical analysis or scanning
Converting input text into stream of known objects called tokens.
It simplifies parsing process
2. Hierarchical analysis: tokens grouped hierarchically with collective meaning;
known as parsing or syntax analysis
Translating code to rules of grammar.
Building representation of code.
3. Semantic analysis: check if the program components fit together meaningfully
Checks source program for semantic errors
Gathers type information for subsequent code generation (type checking)
Identifies operators and operands of expressions and statements
Phases of Compilation
General Structure of a Compiler:
Stream of characters → scanner → Stream of tokens → parser →
Parse/syntax tree → semantic analyzer → Annotated tree →
intermediate code generator → Intermediate code → code optimizer →
Intermediate code → code generator → Target code → code optimizer →
Target code
Phase I: Lexical Analysis
The low-level text processing portion of the compiler
The source file, a stream of characters, is broken into larger chunks called
token.
For example:
void main()
{
int x;
x=3;
}
It will be broken into 13 tokens, as below:
void main ( ) { int x ; x = 3 ; }
The lexical analyzer (scanner) reads a stream of characters and puts them
together into some meaningful (with respect to the source language) units
called tokens.
Typically, spaces, tabs, end-of-line characters and comments are ignored by
the lexical analyzer.
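The scanning step just described can be sketched in a few lines. The token categories and patterns below are illustrative assumptions for a tiny C-like subset, not a complete lexer:

```python
import re

# Token patterns for a tiny C-like subset (illustrative only).
# Order matters: keywords are tried before general identifiers.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:void|int)\b"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("NUM",     r"\d+"),
    ("SYMBOL",  r"[(){};=]"),
    ("SKIP",    r"\s+"),      # spaces, tabs, newlines are ignored
]

def tokenize(source):
    tokens, pos = [], 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "SKIP":
                    tokens.append(m.group())
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
    return tokens

print(tokenize("void main() { int x; x=3; }"))
# ['void', 'main', '(', ')', '{', 'int', 'x', ';', 'x', '=', '3', ';', '}']
```

Running it on the example program yields exactly the 13 tokens listed above.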
Phase II: Parsing (Syntax Analysis)
A parser gets a stream of tokens from the scanner, and determines if the syntax
(structure) of the program is correct according to the (context-free) grammar of
the source language.
Then, it produces a data structure, called a parse tree or an abstract syntax tree,
which describes the syntactic structure of the program.
The parser ensures that the sequence of tokens returned by the lexical
analyzer forms a syntactically correct program
It also builds a structured representation of the program called an abstract
syntax tree that is easier for the type checker to analyze than a stream of
tokens
It catches the syntax errors as the statement below:
if if (x > 3) then x = x + 1
Context-free grammars will be used (as the input) by the parser generator to
describe the syntax of the compiling language.
Parse Tree
The output of parsing, giving a top-down description of program syntax
Root node is entire program and leaves are tokens that were identified during lexical
analysis
Constructed by repeated application of rules in Context Free Grammar (CFG)
Syntax structures are analyzed by DPDA (Deterministic Push Down Automata)
Example: parse tree for position:=initial + rate*60
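A hand-written recursive-descent parser for assignments like the one above can be sketched as follows. This is an illustrative sketch, not a generated parser: the tuple tree shape is an assumption, and the grammar (shown in the comment) covers only `:=`, `+`, and `*`, with `*` binding tighter than `+`:

```python
import re

# Minimal recursive-descent parser for the grammar:
#   assign -> ID ':=' expr
#   expr   -> term ('+' term)*
#   term   -> factor ('*' factor)*
#   factor -> ID | NUM
def parse_assign(text):
    tokens = re.findall(r":=|[+*]|\w+", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if expected and tok != expected:
            raise SyntaxError(f"expected {expected}, got {tok}")
        pos += 1
        return tok

    def factor():
        return eat()                      # leaf: identifier or number

    def term():
        node = factor()
        while peek() == "*":
            eat("*")
            node = ("*", node, factor())  # left-associative chain of *
        return node

    def expr():
        node = term()
        while peek() == "+":
            eat("+")
            node = ("+", node, term())
        return node

    target = eat()
    eat(":=")
    return (":=", target, expr())

print(parse_assign("position := initial + rate * 60"))
# (':=', 'position', ('+', 'initial', ('*', 'rate', '60')))
```

The tree mirrors the parse tree in the slide: `*` appears deeper than `+` because a `term` is parsed before `+` is considered.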
Phase III: Semantic Analysis
It gets the parse tree from the parser together with information about some
syntactic elements
It determines whether the semantics (meaning) of the program is correct.
It detects semantic errors in the program, such as using variables before they are
declared, or assigning an integer value to a Boolean variable, …
This part deals with static semantics:
semantics of programs that can be checked by reading off from the
program text only, i.e.,
the parts of the language syntax which cannot be described by a
context-free grammar.
Mostly, a semantic analyzer does type checking (i.e. Gathers type information
for subsequent code generation.)
It modifies the parse tree in order to get (statically) semantically correct code.
Contd.
The main tool used by the semantic analyzer is a symbol table
Symbol table:- is a data structure with a record for each identifier and its
attributes
Attributes include storage allocation, type, scope, etc
All the compiler phases insert and modify the symbol table
Discovery of meaning in a program using the symbol table
Do static semantics check
Simplify the structure of the parse tree ( from parse tree to abstract syntax tree
(AST) )
Static semantics check
Making sure identifiers are declared before use
Type checking for assignments and operators
Checking the types and number of parameters to subroutines
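A minimal sketch of such a symbol table, assuming one table per scope chained to its enclosing scope; the class and attribute names are invented for illustration, and a real table records far more attributes (storage allocation, scope level, parameter lists, …):

```python
# Hypothetical chained symbol table: one instance per scope,
# linked to the enclosing scope for outward lookup.
class SymbolTable:
    def __init__(self, parent=None):
        self.parent = parent          # enclosing scope, or None for globals
        self.symbols = {}             # identifier -> attribute record

    def declare(self, name, attrs):
        if name in self.symbols:      # same identifier twice in one scope
            raise NameError(f"redeclaration of {name!r}")
        self.symbols[name] = attrs

    def lookup(self, name):
        scope = self
        while scope is not None:      # walk outward through enclosing scopes
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        raise NameError(f"{name!r} used before declaration")

globals_ = SymbolTable()
globals_.declare("x", {"type": "int"})
inner = SymbolTable(parent=globals_)  # entering a nested block
print(inner.lookup("x"))
# {'type': 'int'}  (found in the enclosing scope)
```

The "declared before use" check above is exactly a failed `lookup`, and redeclaration in the same scope is a failed `declare`.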
Phase IV: Intermediate Code Generation
An intermediate code generator
takes a parse tree from the semantic analyzer
generates a program in the intermediate language.
In some compilers, a source program is translated into an intermediate code
first and then the intermediate code is translated into the target language.
In other compilers, a source program is translated directly into the target
language.
Compiler makes a second pass over the parse tree to produce the translated
code
If there are no compile-time errors, the semantic analyzer translates the
abstract syntax tree into the abstract assembly tree.
Contd.
Using an intermediate code is beneficial when compilers that translate a
single source language to many target languages are required.
The front-end of a compiler:- scanner to intermediate code generator, can
be shared by all such compilers.
Different back-ends:- code optimizer and code generator, are required for
each target language.
One of the most popular intermediate codes is three-address code.
A three-address code instruction is of the form x = y op z.
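As a sketch, three-address code for the earlier example `position := initial + rate * 60` can be produced by flattening an expression tree bottom-up. The tuple tree shape and the temporary names `t1`, `t2`, … are illustrative conventions, not a fixed format:

```python
import itertools

# Flatten an expression tree into three-address code, where each
# generated instruction has the form x = y op z.
def gen_tac(tree):
    code, counter = [], itertools.count(1)
    def walk(node):
        if isinstance(node, str):          # leaf: identifier or constant
            return node
        op, left, right = node
        l, r = walk(left), walk(right)
        t = f"t{next(counter)}"            # fresh temporary for this result
        code.append(f"{t} = {l} {op} {r}")
        return t
    return code, walk(tree)

code, result = gen_tac(('+', 'initial', ('*', 'rate', '60')))
code.append(f"position = {result}")
print("\n".join(code))
# t1 = rate * 60
# t2 = initial + t1
# position = t2
```

Each instruction has at most one operator on the right-hand side, which is what makes the form "three-address".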
Phase V: Assembly Code Generation
The code generator converts the abstract assembly tree into the actual assembly
code.
To do code generation,
the generator covers the abstract assembly tree with tiles (each tile
represents a small portion of an abstract assembly tree), and
outputs the actual assembly code associated with the tiles that were used
to cover the tree.
Phase VI: Machine Code Generation and Linking
The final phase of compilation converts the assembly code into machine code
and links in (by a linker) the appropriate language libraries.
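A drastically simplified sketch of the tiling idea: here every tile covers just a single tree node and emits one instruction, whereas real tilers match larger multi-node patterns. The instruction names (MOV/ADD/MUL) and the naive register allocation are invented for illustration:

```python
import itertools

# Toy "tiling" code generator: one tile per tree node, one invented
# instruction per tile. Not a real instruction set.
def emit_asm(tree):
    out, regs = [], itertools.count(1)
    def cover(node):
        if isinstance(node, str):              # leaf tile: load the operand
            r = f"r{next(regs)}"
            out.append(f"MOV {node}, {r}")
            return r
        op, left, right = node
        rl, rr = cover(left), cover(right)     # cover subtrees first
        out.append(f"{'ADD' if op == '+' else 'MUL'} {rl}, {rr}")
        return rr                              # result left in second register
    cover(tree)
    return out

for line in emit_asm(('+', 'initial', ('*', 'rate', '60'))):
    print(line)
# MOV initial, r1
# MOV rate, r2
# MOV 60, r3
# MUL r2, r3
# ADD r1, r3
```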
Code Optimization
Replacing an inefficient sequence of instructions with a better sequence of
instructions.
Sometimes called code improvement.
Code optimization can be done:
after semantic analysis
performed on a parse tree
after intermediate code generation
performed on an intermediate code
after code generation
performed on a target code
Two types of optimization:
1. Local
2. Global
Local Optimization
The compiler looks at a very small block of instructions and tries to determine
how it can improve the efficiency of this local code block
Relatively easy; included as part of most compilers
Examples of possible local optimizations
1. Constant evaluation
2. Strength reduction
3. Eliminating unnecessary operations
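The first two of these can be sketched as rewrites on a toy three-address instruction, represented here as a tuple `(dst, lhs, op, rhs)`; the representation and the handled cases are assumptions made for this sketch:

```python
# Illustrative local optimizations on a toy instruction format.

def constant_evaluation(inst):
    """Fold 'x = c1 op c2' at compile time when both operands are literals."""
    dst, lhs, op, rhs = inst
    if lhs.isdigit() and rhs.isdigit():
        value = {"+": int(lhs) + int(rhs), "*": int(lhs) * int(rhs)}[op]
        return (dst, str(value))          # reduced to a simple copy
    return inst

def strength_reduction(inst):
    """Replace an expensive operation with a cheaper one, e.g. x * 2 -> x + x."""
    dst, lhs, op, rhs = inst
    if op == "*" and rhs == "2":
        return (dst, lhs, "+", lhs)
    return inst

print(constant_evaluation(("x", "3", "+", "4")))   # ('x', '7')
print(strength_reduction(("y", "a", "*", "2")))    # ('y', 'a', '+', 'a')
```

Both rewrites preserve the meaning of the instruction while making it cheaper, which is the defining property of an optimization pass.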
Global Optimization
The compiler looks at large segments of the program to decide how to improve
performance
Much more difficult; usually omitted from all but the most sophisticated and
expensive production-level “optimizing compilers”
Optimization cannot make an inefficient algorithm efficient
The Phases of a Compiler

Phase                                   | Output                              | Sample
----------------------------------------|-------------------------------------|---------------------------
Programmer (source code producer)       | Source string                       | A=B+C;
Scanner (performs lexical analysis)     | Token string, and symbol table      | ‘A’, ‘=’, ‘B’, ‘+’, ‘C’, ‘;’
                                        | with names                          |
Parser (performs syntax analysis based  | Parse tree or abstract syntax tree  | tree with ‘=’ at the root,
on the grammar of the programming       |                                     | children A and ‘+’; ‘+’ has
language)                               |                                     | children B and C
Semantic analyzer (type checking, etc.) | Annotated parse tree or abstract    |
                                        | syntax tree                         |
Intermediate code generator             | Three-address code, quads, or RTL   | int2fp B t1
                                        |                                     | + t1 C t2
                                        |                                     | := t2 A
Optimizer                               | Three-address code, quads, or RTL   | int2fp B t1
                                        |                                     | + t1 #2.3 A
Code generator                          | Assembly code                       | MOVF #2.3,r1
                                        |                                     | ADDF2 r1,r2
                                        |                                     | MOVF r2,A
Summary of Phases of Compiler
[Figure: diagram summarizing the phases of a compiler]
Compiler Construction Tools
Software development tools are available to implement one or more compiler phases:
Scanner generators
Parser generators
Syntax-directed translation engines
Automatic code generators
Data-flow engines
Scanner generators for C/C++: Flex, Lex.
Parser generators for C/C++: Bison, YACC.
Available scanner generators for Java:
JLex, a scanner generator for Java, very similar to Lex.
JFlex, flex for Java.
Available parser generators for Java:
CUP, a parser generator for Java, very similar to YACC.
BYACC/J, a different version of Berkeley YACC for Java. It is an extension of
the standard YACC (a -j flag has been added to generate Java code).
Other compiler tools:
JavaCC, a parser generator for Java, including a scanner generator and a parser
generator. Input specifications are different from those suitable for Lex/YACC.
Also, unlike YACC, JavaCC generates a top-down parser.
ANTLR, a set of language translation tools (formerly PCCTS). Includes
scanner/parser generators for C, C++, and Java.