Module Lecture2 2x2

Lecture 2: Lexical Analysis

Review: Front End Compiler Structure
[Figure: Source code → Lexical Analysis → Tokens → Parsing → AST → Semantic Analysis → Decorated AST, with a “We are here” marker.]
• A sequence of translations that each:

– Filter out errors
– Remove or put aside extraneous information
– Make data more conveniently accessible.
• Strategy: find tools that partially automate this procedure.
• For lexical analysis: convert a description that uses patterns (extended regular expressions) into a program.
Last modified: Thu Mar 8 15:13:16 2018 CS164: Lecture #2
Tokens
• Token consists of syntactic category (like “noun” or “adjective”) plus semantic information (like a particular name).
• Parsing (the “customer”) only needs syntactic category:
– “Joe went to the store” and “Harry went to the beach” have same grammatical structure.
• For programming, semantic information might be text of identifier or numeral.
• Example from Notes:

    if(i== j)
      z = 0; /* No work needed */
    else
      z= 1;

  ⟹

    IF, LPAR, ID("i"), EQUALS, ID("j"), RPAR, ID("z"),
    ASSIGN, INTLIT("0"), SEMI, ELSE, ID("z"), ASSIGN,
    INTLIT("1"), SEMI

Classical Regular Expressions
• Regular expressions denote formal languages, which are sets of strings (of symbols from some alphabet).
• Appropriate since internal structure not all that complex yet.
• Expression R denotes language L(R):
– L(ε) = L("") = {""}.
– If c is a character, L(c) = {"c"}.
– If R1, R2 are r.e.s, L(R1R2) = {x1x2 | x1 ∈ L(R1), x2 ∈ L(R2)}.
– L(R1|R2) = L(R1) ∪ L(R2).
– L(R∗) = L(ε) ∪ L(R) ∪ L(RR) ∪ · · ·.
– L((R)) = L(R).
• Precedence is ‘*’ (highest), concatenation, union (lowest). Parentheses also provide grouping.
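The rules that define L(R) can be read directly as a recursive procedure. The following is a minimal sketch, not taken from the course notes: the tuple-based expression encoding and the function name `lang` are hypothetical, and because L(R∗) is generally infinite, the function returns only the strings of length at most `max_len`.

```python
# Hypothetical mini-AST for classical regular expressions (an assumption
# made for this sketch, not something defined in the slides):
#   ("eps",)          the empty-string expression
#   ("chr", c)        a single character c
#   ("cat", r1, r2)   concatenation R1R2
#   ("alt", r1, r2)   union R1|R2
#   ("star", r)       closure R*

def lang(r, max_len):
    """Return the strings of L(r) whose length is at most max_len."""
    kind = r[0]
    if kind == "eps":
        return {""}
    if kind == "chr":
        return {r[1]} if max_len >= 1 else set()
    if kind == "cat":
        left, right = lang(r[1], max_len), lang(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "alt":
        return lang(r[1], max_len) | lang(r[2], max_len)
    if kind == "star":
        base = lang(r[1], max_len)
        result = {""}      # the L(eps) term of the union
        frontier = {""}
        while frontier:
            # Append one more string of L(R) to everything found last round;
            # keep only strings that are new and still short enough.
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node kind {kind!r}")


# a(b|c)* : closure binds tighter than concatenation.
A_THEN_BC_STAR = ("cat", ("chr", "a"),
                  ("star", ("alt", ("chr", "b"), ("chr", "c"))))
# ab|c* : parsed as (ab)|(c*), since union has the lowest precedence.
AB_OR_C_STAR = ("alt", ("cat", ("chr", "a"), ("chr", "b")),
                ("star", ("chr", "c")))
```

With these definitions, `lang(A_THEN_BC_STAR, 2)` is `{"a", "ab", "ac"}` and `lang(AB_OR_C_STAR, 2)` is `{"", "c", "cc", "ab"}`, which illustrates the precedence rule for union.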
Abbreviations
• Character lists, such as [abcf-mxy] in Java, Perl, or Python.
• Negative character lists, such as [^aeiou].
• Character classes such as . (dot), \d, \s in Java, Perl, Python.
• L(R+) = L(RR∗).
• L(R?) = L(ε|R).

Extensions
• “Capture” parenthesized expressions:
– After m = re.match(r'\s*(\d+)\s*,\s*(\d+)\s*', '12,34'), have m.group(1) == '12', m.group(2) == '34'.
• Lazy vs. greedy quantifiers:
– re.match(r'(\d+).*', '1234ab') makes group(1) match '1234'.
– re.match(r'(\d+?).*', '1234ab') makes group(1) match '1'.
• Boundaries:
– re.search(r'(^abc|qef)', L) matches abc only at beginning of string, and qef anywhere.
– re.search(r'(?m)(^abc|qef)', L) matches abc only at beginning of string or of any line.
– re.search(r'rowr(?=baz)', L) matches an instance of ‘rowr’, but only if ‘baz’ follows (does not match baz).
– re.search(r'(?<=rowr)baz', L) matches an instance of ‘baz’, but only if immediately preceded by ‘rowr’ (does not match rowr).
• Non-linear patterns: re.search(r'(\S+),\1', L) matches a word followed by the same word after a comma.
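The abbreviations and extensions listed here can be tried out with Python's standard `re` module. The sketch below is only an illustration: the sample strings (`SAMPLE`, `'say hello,hello now'`, and so on) and the variable names are made up for the demonstration and do not come from the slides.

```python
import re

# R+ abbreviates RR*: both forms accept exactly the same sample strings.
plus_agrees = all(
    bool(re.fullmatch(r'a+', s)) == bool(re.fullmatch(r'aa*', s))
    for s in ['', 'a', 'aaa', 'ab']
)

# Negative character list: first character that is not a lowercase vowel.
first_non_vowel = re.search(r'[^aeiou]', 'aeixou').group()

# Capture groups.
m = re.match(r'\s*(\d+)\s*,\s*(\d+)\s*', '12,34')
pair = (m.group(1), m.group(2))

# Greedy vs. lazy quantifiers.
greedy = re.match(r'(\d+).*', '1234ab').group(1)
lazy = re.match(r'(\d+?).*', '1234ab').group(1)

# Boundaries: ^ anchors at the start of the string, or of any line with (?m).
SAMPLE = 'xx abc\nabc qef'          # hypothetical sample text
no_multiline = re.search(r'(^abc|qef)', SAMPLE)
multiline = re.search(r'(?m)(^abc|qef)', SAMPLE)

# Lookahead and lookbehind test context without consuming it.
ahead = re.search(r'rowr(?=baz)', 'rowrbaz')
ahead_fail = re.search(r'rowr(?=baz)', 'rowrbar')
behind = re.search(r'(?<=rowr)baz', 'rowrbaz')

# Backreference: \1 must repeat whatever group 1 matched.
repeat = re.search(r'(\S+),\1', 'say hello,hello now')
```

Here `no_multiline.group()` is `'qef'` because `abc` never occurs at the start of the string, while `multiline.group()` is `'abc'` (the occurrence at the start of the second line). `ahead.group()` is `'rowr'` and `behind.group()` is `'baz'`: in both cases the context is checked but not included in the match.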
An Example
SL/1 “language”:

    + - * / = ; , ( ) < >
    >= <= -->
    if def else fi while
    identifiers
    decimal numerals

Comments start with # and go to end of line.
(Review of programs in Chapter 2 of Course Notes.)

Problems
• Decimal numerals in C, Java.
• All numerals in C, Java.
• Floating-point numerals.
• Identifiers in C, Java.
• Identifiers in Ada.
• Comments in C++, Java.
• XHTML markups.
• Python bracketing.
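As an illustration of turning a pattern description into a program, here is a rough sketch of a lexer for the SL/1 token set using Python's `re` module. It is not the program from Chapter 2 of the Course Notes. The category names (`ARROW`, `RELOP`, `OP`, `NUM`, `ID`), the identifier pattern, and the numeral pattern are all assumptions, since the slide lists only the lexemes.

```python
import re

# Order matters: Python tries alternatives left to right (first match, not
# longest match), so '-->' must come before '-' and '>=' before '>'.
TOKEN_SPEC = [
    ("COMMENT", r"#.*"),                     # '#' to end of line
    ("WS",      r"\s+"),
    ("ARROW",   r"-->"),
    ("RELOP",   r">=|<=|<|>"),
    ("OP",      r"[-+*/=;,()]"),
    ("NUM",     r"[0-9]+"),                  # assumed numeral syntax
    ("ID",      r"[A-Za-z_][A-Za-z_0-9]*"),  # assumed identifier syntax
]
KEYWORDS = {"if", "def", "else", "fi", "while"}
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (category, lexeme) pairs for an SL/1-like text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"illegal character {text[pos]!r} at offset {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("WS", "COMMENT"):
            continue                         # put aside extraneous information
        if kind == "ID" and lexeme in KEYWORDS:
            kind = lexeme.upper()            # keywords get their own category
        tokens.append((kind, lexeme))
    return tokens
```

For example, `tokenize("if x >= 10 # cmt\n  y = y --> 1 fi")` yields the pairs `IF`, `ID x`, `RELOP >=`, `NUM 10`, `ID y`, `OP =`, `ID y`, `ARROW -->`, `NUM 1`, `FI`. Matching keywords as identifiers first and then reclassifying them keeps a name such as `fix` from being split into `fi` and `x`.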
Some Problem Solutions
• Decimal numerals in C, Java: 0|[1-9][0-9]*

• All numerals in C, Java: [1-9][0-9]*|0[xX][0-9a-fA-F]+|0[0-7]*
• Floating-point numerals: (\d+\.\d*|\d*\.\d+)([eE][-+]?\d+)?|[0-9]+[eE][-+]?\d+
• Identifiers in C, Java (ASCII only, no dollar signs):
[a-zA-Z_][a-zA-Z_0-9]*
• Identifiers in Ada: [a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])*
• Comments in C++, Java: //.*|/\*([^*]|\*+[^*/])*\*+/
or, using some extended features: //.*|/\*(.|\n)*?\*/
• Python bracketing: Nothing much you can do here, except to note
blanks at the beginnings of lines and to do some programming in the
actions.
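A quick way to sanity-check patterns like these is to run them against a few accepted and rejected strings. The sketch below does so with Python's `re.fullmatch`. Several details are assumptions made for the sketch: the test strings, the helper name `accepts`, the underscore in the identifier patterns, `[1-9][0-9]*` in the integer pattern (so that single digits such as `7` are accepted), the `[-+]?\d+` exponent tail of the last floating-point alternative, and the `\*+[^*/]` form inside the comment loop (so that a run of stars before the closing slash cannot be swallowed).

```python
import re

DECIMAL      = r"0|[1-9][0-9]*"
ALL_INT      = r"[1-9][0-9]*|0[xX][0-9a-fA-F]+|0[0-7]*"
FLOAT        = r"(\d+\.\d*|\d*\.\d+)([eE][-+]?\d+)?|[0-9]+[eE][-+]?\d+"
C_IDENT      = r"[a-zA-Z_][a-zA-Z_0-9]*"
ADA_IDENT    = r"[a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])*"
COMMENT      = r"//.*|/\*([^*]|\*+[^*/])*\*+/"
COMMENT_LAZY = r"//.*|/\*(.|\n)*?\*/"       # relies on a lazy quantifier


def accepts(pattern, text):
    """True if the whole of text is in the language of pattern."""
    return re.fullmatch(pattern, text) is not None


checks = {
    "decimal":  accepts(DECIMAL, "120") and not accepts(DECIMAL, "012"),
    "hex":      accepts(ALL_INT, "0x1F") and accepts(ALL_INT, "7"),
    "octal":    accepts(ALL_INT, "017") and not accepts(ALL_INT, "08"),
    "float":    accepts(FLOAT, ".5e-3") and accepts(FLOAT, "1e10")
                and not accepts(FLOAT, "1"),
    "c_ident":  accepts(C_IDENT, "_tmp1") and not accepts(C_IDENT, "1x"),
    "ada":      accepts(ADA_IDENT, "a_b") and not accepts(ADA_IDENT, "a__b")
                and not accepts(ADA_IDENT, "a_"),
    "comment":  accepts(COMMENT, "/* a **/")
                and not accepts(COMMENT, "/* **/ x */"),
}

# The lazy version stops at the first '*/' when used as a prefix match.
first_comment = re.match(COMMENT_LAZY, "/* a */ b */").group()
```

Every entry of `checks` should come out `True`, and `first_comment` is `'/* a */'`. Note that `re.fullmatch` would force the lazy pattern to stretch to the end of the string, which is why the last line uses `re.match` instead.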