Module Lecture2 2x2

The document discusses the process of lexical analysis in compiler design, focusing on the conversion of patterns into programs for token generation. It covers the structure of tokens, classical regular expressions, and various extensions and problems associated with parsing different programming constructs. Examples are provided for identifying numerals, identifiers, and comments in languages like C, Java, and Python.

Uploaded by

Edrian Rodriguez

Lecture 2: Lexical Analysis

Review: Front End Compiler Structure

Source code =⇒ Lexical Analysis =⇒ Tokens =⇒ Parsing =⇒ AST =⇒ Semantic Analysis =⇒ Decorated AST
(We are here: Lexical Analysis)

• A sequence of translations that each:
– Filter out errors
– Remove or put aside extraneous information
– Make data more conveniently accessible.
• Strategy: find tools that partially automate this procedure.
• For lexical analysis: convert description that uses patterns (extended regular expressions) into program.

Last modified: Thu Mar 8 15:13:16 2018 CS164: Lecture #2

Tokens

• Token consists of syntactic category (like “noun” or “adjective”) plus semantic information (like a particular name).
• Parsing (the “customer”) only needs syntactic category:
– “Joe went to the store” and “Harry went to the beach” have same grammatical structure.
• For programming, semantic information might be text of identifier or numeral.
• Example from Notes:

    if(i== j)
        z = 0; /* No work needed */
    else
        z= 1;

  =⇒

    IF, LPAR, ID("i"), EQUALS, ID("j"), RPAR, ID("z"),
    ASSIGN, INTLIT("0"), SEMI, ELSE, ID("z"), ASSIGN,
    INTLIT("1"), SEMI

Classical Regular Expressions

• Regular expressions denote formal languages, which are sets of strings (of symbols from some alphabet).
• Appropriate since internal structure not all that complex yet.
• Expression R denotes language L(R):
– L(ε) = L("") = {""}.
– If c is a character, L(c) = {"c"}.
– If R1, R2 are r.e.s, L(R1R2) = {x1x2 | x1 ∈ L(R1), x2 ∈ L(R2)}.
– L(R1|R2) = L(R1) ∪ L(R2).
– L(R*) = L(ε) ∪ L(R) ∪ L(RR) ∪ ···.
– L((R)) = L(R).
• Precedence is ‘*’ (highest), concatenation, union (lowest). Parentheses also provide grouping.
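The precedence rule for classical regular expressions can be checked with Python's re module. This is only a minimal sketch: the pattern ab|c*, the sample strings, and the helper name in_language are illustrative assumptions, not taken from the lecture.

```python
import re

def in_language(pattern, s):
    """Return True if the whole string s belongs to L(pattern)."""
    return re.fullmatch(pattern, s) is not None

# '*' binds tighter than concatenation, which binds tighter than '|',
# so ab|c* is read as (ab)|(c*), not as (ab|c)*.
UNGROUPED = r"ab|c*"
GROUPED = r"(ab|c)*"

samples = ["", "ab", "c", "ccc", "abab", "abc", "ac"]

ungrouped_hits = [s for s in samples if in_language(UNGROUPED, s)]
grouped_hits = [s for s in samples if in_language(GROUPED, s)]

print(ungrouped_hits)
print(grouped_hits)
```

The first list contains only the empty string, "ab", and runs of "c", while the parenthesized version also accepts mixtures such as "abab" and "abc".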
Abbreviations

• Character lists, such as [abcf-mxy] in Java, Perl, or Python.
• Negative character lists, such as [^aeiou].
• Character classes such as . (dot), \d, \s in Java, Perl, Python.
• L(R+) = L(RR*).
• L(R?) = L(ε|R).

Extensions

• “Capture” parenthesized expressions:
– After m = re.match(r'\s*(\d+)\s*,\s*(\d+)\s*', '12,34'), have m.group(1) == '12', m.group(2) == '34'.
• Lazy vs. greedy quantifiers:
– re.match(r'(\d+).*', '1234ab') makes group(1) match '1234'.
– re.match(r'(\d+?).*', '1234ab') makes group(1) match '1'.
• Boundaries:
– re.search(r'(^abc|qef)', L) matches abc only at beginning of string, and qef anywhere.
– re.search(r'(?m)(^abc|qef)', L) matches abc only at beginning of string or of any line.
– re.search(r'rowr(?=baz)', L) matches an instance of ‘rowr’, but only if ‘baz’ follows (does not match baz).
– re.search(r'(?<=rowr)baz', L) matches an instance of ‘baz’, but only if immediately preceded by ‘rowr’ (does not match rowr).
• Non-linear patterns: re.search(r'(\S+),\1', L) matches a word followed by the same word after a comma.
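The extension examples above can be run as written. The sketch below collects them in one place; the sample input strings ('rowrbaz', 'xyz\nabc', 'say hello,hello again') and the variable names are assumptions chosen for illustration.

```python
import re

# Capture groups: each parenthesized sub-pattern records what it matched.
m = re.match(r'\s*(\d+)\s*,\s*(\d+)\s*', '12,34')
captured = (m.group(1), m.group(2))

# Greedy \d+ takes as many digits as it can; lazy \d+? takes as few as it can.
greedy = re.match(r'(\d+).*', '1234ab').group(1)
lazy = re.match(r'(\d+?).*', '1234ab').group(1)

# Without (?m), ^ anchors only at the start of the string;
# with (?m), it also anchors at the start of every line.
text = 'xyz\nabc'
single_line = re.search(r'(^abc|qef)', text)
multi_line = re.search(r'(?m)(^abc|qef)', text)

# Lookahead and lookbehind test the context without consuming it.
ahead = re.search(r'rowr(?=baz)', 'rowrbaz').group(0)
behind = re.search(r'(?<=rowr)baz', 'rowrbaz').group(0)

# Backreference \1 must repeat exactly the text captured by group 1.
repeated = re.search(r'(\S+),\1', 'say hello,hello again').group(0)

print(captured, greedy, lazy)
print(single_line, multi_line.group(0))
print(ahead, behind, repeated)
```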

An Example

SL/1 “language”:

    + - * / = ; , ( ) < >
    >= <= -->
    if def else fi while
    identifiers
    decimal numerals

Comments start with # and go to end of line.
(Review of programs in Chapter 2 of Course Notes.)

Problems

• Decimal numerals in C, Java.
• All numerals in C, Java.
• Floating-point numerals.
• Identifiers in C, Java.
• Identifiers in Ada.
• Comments in C++, Java.
• XHTML markups.
• Python bracketing.
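As a rough illustration of turning the SL/1 token description above into a program, here is a small hand-rolled lexer built on Python's re module. It is not the program from the Course Notes. The category names (KEYWORD, ID, NUMERAL, OP), the identifier and numeral patterns, and the function name tokenize are all assumptions, since the slide does not spell them out.

```python
import re

KEYWORDS = {"if", "def", "else", "fi", "while"}

# Python tries alternatives left to right rather than picking the longest
# match, so the multi-character operators are listed before the
# single-character ones.
TOKEN_SPEC = [
    ("SKIP",    r"[ \t\n]+|#.*"),            # whitespace; comments run to end of line
    ("NUMERAL", r"\d+"),                     # assumed form of decimal numerals
    ("ID",      r"[A-Za-z_][A-Za-z_0-9]*"),  # assumed form of identifiers
    ("OP",      r"-->|>=|<=|[-+*/=;,()<>]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (category, lexeme) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue                          # put aside extraneous information
        if kind == "ID" and lexeme in KEYWORDS:
            kind = "KEYWORD"                  # keywords look like identifiers
        tokens.append((kind, lexeme))
    return tokens

print(tokenize("if x >= 10 # a comment\n y = x --> 3; fi"))
```

The input string is just a stream of tokens and is not meant to be a valid SL/1 program. Keywords are recognized by first matching an identifier and then looking it up in a set, which avoids having to order keyword patterns ahead of the identifier pattern.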
Some Problem Solutions

• Decimal numerals in C, Java: 0|[1-9][0-9]*
• All numerals in C, Java: [1-9][0-9]*|0[xX][0-9a-fA-F]+|0[0-7]*
• Floating-point numerals: (\d+\.\d*|\d*\.\d+)([eE][-+]?\d+)?|[0-9]+[eE][-+
• Identifiers in C, Java (ASCII only, no dollar signs): [a-zA-Z_][a-zA-Z_0-9]*
• Identifiers in Ada: [a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])*
• Comments in C++, Java: //.*|/\*([^*]|\*[^/])*\*+/
or, using some extended features: //.*|/\*(.|\n)*?\*/
• Python bracketing: Nothing much you can do here, except to note
blanks at the beginnings of lines and to do some programming in the
actions.
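A few of these solutions can be tried out in Python. The sketch below assumes the identifier patterns contain an underscore where the extracted text shows a gap, and the helper name accepts and the test strings are made up for illustration.

```python
import re

DECIMAL = r"0|[1-9][0-9]*"
C_IDENT = r"[a-zA-Z_][a-zA-Z_0-9]*"                   # underscore assumed
ADA_IDENT = r"[a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])*"    # underscore assumed
COMMENT = r"//.*|/\*(.|\n)*?\*/"                      # version using lazy *?

def accepts(pattern, s):
    """Return True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# Decimal numerals: no leading zeros except for 0 itself.
decimal_results = {s: accepts(DECIMAL, s) for s in ["0", "42", "042"]}

# C/Java identifiers may start with an underscore but not with a digit.
c_results = {s: accepts(C_IDENT, s) for s in ["_tmp1", "x", "1abc"]}

# Ada identifiers allow only single underscores between letters or digits.
ada_results = {s: accepts(ADA_IDENT, s)
               for s in ["Max_Value", "Max__Value", "Value_", "_x"]}

# The lazy quantifier stops at the first */ instead of the last one.
source = "/* a */ x /* b */"
first_comment = re.match(COMMENT, source).group(0)
line_comment = re.match(COMMENT, "// note\nx = 1").group(0)

print(decimal_results)
print(c_results)
print(ada_results)
print(repr(first_comment), repr(line_comment))
```

The comment pattern is tested with re.match rather than re.fullmatch on purpose: forcing a full match would make the lazy quantifier stretch to the end of the string and hide the difference from the greedy form.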
