Compilers Crash Course
Slides Acknowledgment:
Prof. Andrew Myers (Cornell)
Source Code
• Source code: optimized for human readability
– expressive: matches human notions of grammar
– redundant to help avoid programming errors
– computation possibly not fully determined by code
int expr(int n)
{
    int d;
    d = 4 * n * n * (n + 1) * (n + 1);
    return d;
}
Machine Code
• Optimized for hardware
– Redundancy, ambiguity reduced
– Information about intent and reasoning lost
– Assembly code ≈ machine code
expr:
    pushl   %ebp                55
    movl    %esp, %ebp          89 e5
    subl    $4, %esp            83 ec 04
    movl    8(%ebp), %eax       8b 45 08
    movl    %eax, %edx          89 c2
    imull   8(%ebp), %edx       0f af 55 08
    movl    8(%ebp), %eax       8b 45 08
    incl    %eax                40
    imull   %eax, %edx          0f af d0
    movl    8(%ebp), %eax       8b 45 08
    incl    %eax                40
    imull   %edx, %eax          0f af c2
    sall    $2, %eax            c1 e0 02
    movl    %eax, -4(%ebp)      89 45 fc
    movl    -4(%ebp), %eax      8b 45 fc
    leave                       c9
    ret                         c3
Example (Output assembly code)
Unoptimized Code                    Optimized Code
expr:                               expr:
    pushl   %ebp                        pushl   %ebp
    movl    %esp, %ebp                  movl    %esp, %ebp
    subl    $4, %esp                    movl    8(%ebp), %edx
    movl    8(%ebp), %eax               movl    %edx, %eax
    movl    %eax, %edx                  imull   %edx, %eax
    imull   8(%ebp), %edx               incl    %edx
    movl    8(%ebp), %eax               imull   %edx, %eax
    incl    %eax                        imull   %edx, %eax
    imull   %eax, %edx                  sall    $2, %eax
    movl    8(%ebp), %eax               leave
    incl    %eax                        ret
    imull   %edx, %eax
    sall    $2, %eax
    movl    %eax, -4(%ebp)
    movl    -4(%ebp), %eax
    leave
    ret
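The difference between the two listings can be read back into C. The following is a hand-written sketch, not compiler output: expr_opt is a hypothetical name, and its body mirrors the optimized assembly by loading n once, computing n + 1 once, and reusing both values.

```c
/* Direct transcription of the slide's source: n + 1 is written twice,
   and the unoptimized code reloads n from the stack each time. */
int expr(int n)
{
    int d;
    d = 4 * n * n * (n + 1) * (n + 1);
    return d;
}

/* Hypothetical hand-optimized equivalent, following the optimized listing. */
int expr_opt(int n)
{
    int d = n * n;   /* movl %edx, %eax ; imull %edx, %eax */
    int t = n + 1;   /* incl %edx: n + 1 is computed once   */
    d = d * t;       /* imull %edx, %eax                    */
    d = d * t;       /* imull %edx, %eax                    */
    return d * 4;    /* the compiler emits sall $2, %eax    */
}
```

Both functions agree whenever the result fits in an int; for example, n = 3 gives 4 * 9 * 16 = 576 in either version.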
How to translate?
• Source code and machine code mismatch
• Goals:
– source-level expressiveness for task
– best performance for concrete computation
– reasonable translation efficiency (< O(n^3))
– maintainable compiler code
How to translate correctly?
• Programming languages describe computation precisely
• Therefore: translation can be precisely described (a compiler can be correct)
• Correctness is very important!
– hard to debug programs with broken compiler…
– non-trivial: programming languages are expressive
– implications for development cost, security
– some compilers have been proven correct!
[X. Leroy, Formal Verification of a Realistic Compiler, CACM 52(7), 2009]
Idea: translate in steps
• Compiler uses a series of different intermediate representations (IRs) of programs.
• Different IRs are good for different phases of compilation
Compilation in a Nutshell 1
Source code (character stream):
    if (b == 0) a = b;
        ↓ Lexical analysis
Token stream:
    if  (  b  ==  0  )  a  =  b  ;
        ↓ Parsing
Abstract syntax tree (AST):
            if
          /    \
        ==      =
       /  \    /  \
      b    0  a    b
        ↓ Semantic Analysis
Decorated AST:
                 if
              /      \
      boolean ==      int =
        /    \         /     \
    int b   int 0   int a    int b
                    (lvalue)
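The first step of this pipeline, turning the character stream into a token stream, can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not how any particular compiler does it: tokenize and MAX_TOKEN_LEN are hypothetical names, only identifiers, integer literals, `==`, and single-character punctuation are recognized, and keywords such as `if` are not yet distinguished from identifiers.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum { MAX_TOKEN_LEN = 31 };   /* assumption: longer tokens are truncated */

/* Splits src into tokens, storing at most max_tokens of them in out.
   Returns the number of tokens stored. */
size_t tokenize(const char *src, char out[][MAX_TOKEN_LEN + 1], size_t max_tokens)
{
    size_t count = 0;
    const char *p = src;

    while (*p != '\0' && count < max_tokens) {
        if (isspace((unsigned char)*p)) {   /* whitespace separates tokens */
            p++;
            continue;
        }
        const char *start = p;
        if (isalpha((unsigned char)*p) || *p == '_') {
            /* identifier or keyword */
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
        } else if (isdigit((unsigned char)*p)) {
            /* integer literal */
            while (isdigit((unsigned char)*p))
                p++;
        } else if (p[0] == '=' && p[1] == '=') {
            p += 2;                         /* two-character operator */
        } else {
            p++;                            /* any other character is its own token */
        }
        size_t len = (size_t)(p - start);
        if (len > MAX_TOKEN_LEN)
            len = MAX_TOKEN_LEN;
        memcpy(out[count], start, len);
        out[count][len] = '\0';
        count++;
    }
    return count;
}
```

Called on the slide's input `if (b == 0) a = b;`, this yields the ten tokens shown in the token stream above. A real lexer would also attach a token kind (keyword, identifier, operator) and a source position to each token.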
Compilation in a Nutshell 2
Decorated AST:
                 if
              /      \
      boolean ==      int =
        /    \         /     \
    int b   int 0   int a    int b
                    (lvalue)
        ↓ Intermediate Code Generation
    if b == 0 goto L1 else L2
    L1: a = b
    L2:
        ↓ Assembly Code Generation
    cmp rb, 0
    jnz L2
    L1: mov ra, rb
    L2:
        ↓ Register allocation, Optimization
    cmp ecx, 0
    cmovz [ebp+8],ecx
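The intermediate code generation step can be sketched as a function that walks the AST and prints the branch-and-label form. The code below is a deliberately narrow illustration: Node, Kind, and lower_if are hypothetical names, and it handles only the shape used on this slide (an if whose condition is an equality test and whose body is an assignment, both with leaf operands).

```c
#include <stdio.h>
#include <stddef.h>

typedef enum { N_LEAF, N_EQ, N_ASSIGN, N_IF } Kind;

typedef struct Node {
    Kind kind;
    const char *text;          /* variable name or literal; N_LEAF only      */
    const struct Node *left;   /* N_EQ/N_ASSIGN: left operand; N_IF: test    */
    const struct Node *right;  /* N_EQ/N_ASSIGN: right operand; N_IF: body   */
} Node;

/* Lowers "if (x == y) d = s;" to three lines of intermediate code in buf,
   using labels L<label> and L<label + 1>. Returns the snprintf result,
   or -1 if the tree does not have the supported shape. */
int lower_if(const Node *n, int label, char *buf, size_t size)
{
    if (n->kind != N_IF || n->left->kind != N_EQ || n->right->kind != N_ASSIGN)
        return -1;

    const Node *cond = n->left;
    const Node *body = n->right;

    return snprintf(buf, size,
                    "if %s == %s goto L%d else L%d\n"
                    "L%d: %s = %s\n"
                    "L%d:\n",
                    cond->left->text, cond->right->text, label, label + 1,
                    label, body->left->text, body->right->text,
                    label + 1);
}
```

Building the slide's tree (leaves b, 0, a, b under == and =) and calling lower_if with label 1 produces the same three lines of intermediate code shown above. A full translator would recurse over arbitrary expressions and allocate fresh labels and temporaries as it goes.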
Simplified Compiler Structure
Source code (character stream):
    if (b == 0) a = b;
        ↓ Lexical analysis                  | Front end
Token stream                                | (machine-independent)
        ↓ Parsing                           |
Abstract syntax tree
        ↓ Intermediate Code Generation      | Program analysis
Intermediate code                           | & Optimization
Control flow graphs                         |
        ↓ Assembly Code generation          | Back end
Assembly code:                              | (machine-dependent)
    cmp 0, %rcx
    cmovz %rcx, %rdx
Assembly code
        ↓ Assembler
Object code (machine code + symbol tables)
        ↓ Linker
Fully-resolved object code (machine code + symbol tables, relocation info)
        ↓ Loader
Executable image in memory
Where to Learn More
– Compilers: Principles, Techniques and Tools. Aho, Lam, Sethi and Ullman (The Dragon Book)
  (strength: parsing)
– Modern Compiler Implementation in Java. Andrew Appel.
  (strength: translation)
– Advanced Compiler Design and Implementation. Steve Muchnick.
  (strength: optimization)