
Regular expression


Module- 2

REGULAR EXPRESSIONS
Introduction
• Instead of focusing on the power of a computing
device, let's look at the task that we need to
perform.
• Let's consider problems in which our goal is to match
finite or repeating patterns.
• Lexical analysis in a compiler.
• Filtering email for spam.
• Sorting email into appropriate mailboxes based on
sender and/or content words and phrases.
• Searching a complex directory structure by specifying
patterns that are known to occur in the file we want.
REGULAR EXPRESSIONS

• A regular expression is a pattern description written in a
meta language, a language that we use to describe
particular patterns of interest.
• A regular expression provides a concise and flexible
means for "matching" strings of text, such as
particular characters, words, or patterns of characters.

[ ] A character class which matches any character within the brackets

[^ \t\n] matches any character except space, tab and newline character.
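
As a small illustration of the negated character class above, here is a minimal sketch using Python's standard `re` module. The helper name `non_blank_chars` is made up for this example and is not part of the slides.

```python
import re

# [^ \t\n] matches one character that is not a space, tab, or newline.
NON_BLANK = re.compile(r"[^ \t\n]")

def non_blank_chars(text):
    """Return every character of text that the class matches, in order."""
    return NON_BLANK.findall(text)

print(non_blank_chars("a b\tc\nd"))  # ['a', 'b', 'c', 'd']
```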
• The regular expression language that we are about to
describe is built on an alphabet that contains two
kinds of symbols:

1. A set of special symbols to which we will attach
particular meanings when they occur in a regular
expression. These symbols are Ø, U, ε, (, ), *, and +.

2. An alphabet ∑ which contains the symbols that
regular expressions will match against.
What is a Regular Expression?
A regular expression is a string that can be formed according to
the following rules:
1. Ø is a regular expression.
2. ε is a regular expression.
3. Every element in ∑ is a regular expression.
4. Given two regular expressions α and β, αβ is a regular
expression.
5. Given two regular expressions α and β, α U β is a regular
expression.
6. Given a regular expression α, α* is a regular expression.
7. Given a regular expression α, α+ is a regular expression.
8. Given a regular expression α, (α) is a regular expression.
Let ∑ = {a, b }
The following strings are regular expressions
Ø
ε
a
b
(aUb)*
abba U ε, etc.

Every regular expression has a meaning, given by the
semantic interpretation function L for the language of regular
expressions.

1. L(Ø) = Ø, the language that contains no strings.

2. L(ε) = {ε}, the language that contains just the empty
string.

3. For any c ϵ ∑, L(c) = {c}, the language that contains
the single one-character string c.
• For any regular expressions α and β, L(αβ) = L(α) L(β).
That is, the meaning of a concatenation is the concatenation
of the two languages.
The concatenation of two languages L1 and L2
is { w = xy, where x ϵ L1 and y ϵ L2 }.
If either L(α) or L(β) is equal to Ø, then the
concatenation will also be equal to Ø.
• For any regular expressions α and β, L(α U β) = L(α) U
L(β). That is, the meaning of a union is the union of the two languages.
• For any regular expression α, L(α*) = (L(α))*, where *
is the Kleene star operator.
• For any regular expression α, L(α+) = L(αα*)
= L(α)(L(α))*.
If L(α) = Ø, then L(α+) is also equal to Ø.
L(α+) is the language that is formed by concatenating
together one or more strings drawn from L(α).
• For any regular expression α, L((α)) = L(α).
That means parentheses have no effect on meaning
except to group the constituents in an expression.
• L(Ø*) = { w: w is formed by concatenating
together zero or more strings from Ø }, so
L(Ø*) = { ε }.
L((a U b)* b) = ?

L((a U b)* b) = L((a U b)*) L(b)
= (L((a U b)))* L(b)
= (L(a) U L(b))* L(b)
= ({a} U {b})* {b}
= {a, b}* {b}.

This is the set of all strings over the alphabet {a, b} that end in b.
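
The semantic function L can be sketched in code. The version below is only an illustration under stated assumptions: regular expressions are encoded as nested Python tuples (an encoding invented for this example), and since most of these languages are infinite, `lang` returns only the strings of length at most `max_len`.

```python
# A regular expression is modelled as a nested tuple (illustrative encoding):
#   ("empty",) for Ø, ("eps",) for ε, ("sym", c) for a symbol c in Σ,
#   ("cat", α, β), ("union", α, β), ("star", α), ("plus", α).

def concat(l1, l2, max_len):
    """Concatenation of two finite languages, truncated to max_len."""
    return {x + y for x in l1 for y in l2 if len(x) + len(y) <= max_len}

def star(base, max_len):
    """Kleene star of a finite language, truncated to max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more piece; keep only unseen results.
        frontier = concat(frontier, base, max_len) - result
        result |= frontier
    return result

def lang(regex, max_len):
    """Strings of L(regex) whose length is at most max_len."""
    kind = regex[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {regex[1]} if max_len >= 1 else set()
    if kind == "union":
        return lang(regex[1], max_len) | lang(regex[2], max_len)
    if kind == "cat":
        return concat(lang(regex[1], max_len), lang(regex[2], max_len), max_len)
    if kind == "star":
        return star(lang(regex[1], max_len), max_len)
    if kind == "plus":
        base = lang(regex[1], max_len)
        return concat(base, star(base, max_len), max_len)
    raise ValueError(f"unknown node: {kind}")

a_or_b = ("union", ("sym", "a"), ("sym", "b"))
ends_in_b = ("cat", ("star", a_or_b), ("sym", "b"))
print(sorted(lang(ends_in_b, 2)))      # ['ab', 'b', 'bb']
print(lang(("star", ("empty",)), 3))   # {''}
```

The two printed results match the slides: (a U b)* b yields the strings that end in b, and L(Ø*) contains only the empty string.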
Regular expression      Meaning
a*                      Strings consisting of any number of a’s (zero or more a’s)
a+                      Strings consisting of at least one a (one or more a’s)
(a, b)                  Strings consisting of either a or b
(a, b)*                 Strings consisting of any number of a’s and b’s, including ε
(a, b)* ab              Strings of a’s and b’s ending with ab
ab(a, b)*               Strings of a’s and b’s starting with ab
(a, b)* ab (a, b)*      Strings of a’s and b’s with substring ab

In these slides (a, b), (a + b) and (a U b) all denote the union of a and b.
Strings of a’s and b’s having length 2:
Regular expression = (a , b) (a, b)

Strings of a’s and b’s of even length (L = {w ϵ {a, b}*: |w| is even})
Regular expression = ((a, b) (a, b))*

Strings of a’s and b’s of odd length (L = {w ϵ {a, b}*: |w| is odd})
Regular expression = (a, b)((a, b) (a, b))*
Strings of a’s of even length
Regular expression = (aa)*
Strings of a’s of odd length
Regular expression = a(aa)*
L = {w ϵ {a, b }*: w contains an odd number of a’s}

Regular expression : = b*a b* (ab* ab*)*

or
b* (ab* ab*)* a b*

Obtain a regular expression to accept the language of strings
containing at least one a and one b over Σ = {a, b, c}.
(a+b+c)* a (a+b+c)* b(a+b+c)* + (a+b+c)* b(a+b+c)*a(a+b+c)*
Obtain regular expression to accept the language containing strings of a’s and
b’s ending with b and has no substring aa.
Regular expression = ( b + ab) (b + ab)*
Obtain regular expression to accept the language containing
strings of 0’s and 1’s having no two consecutive 0’s.
Regular expression = (1+01)* (0 + ε )
Strings of a’s and b’s with alternate a’s and b’s.
Regular expression = (ε +b) (ab)*(ε +a)
Obtain a regular expression to accept the language containing strings of
a’s and b’s such that L = { a^(2n) b^(2m) | n, m ≥ 0 }
Regular expression = (aa)* (bb)*
Obtain a regular expression to accept the language containing strings of
a’s and b’s such that L = { a^(2n+1) b^(2m) | n, m ≥ 0 }
Regular expression = a (aa)* (bb)*

Obtain a regular expression to accept the language containing strings of 0’s
and 1’s with exactly one 1 and an even number of 0’s.
Regular expression = (00)* 1 (00)* + 0(00)* 1 0(00)*
Strings of a’s and b’s such that 4th symbol from right end is b and the 5th symbol
from right end is a.
Regular expression = (a + b)* ab(a+b)(a+b)(a+b)
or = (a, b)* ab( a, b) (a, b) (a, b)
Strings of a’s and b’s whose lengths are multiples of 3.
OR
L = { w ϵ {a, b}* : |w| mod 3 = 0 }

Regular expression =((a+b) (a+b) (a+b))*


Strings of a’s and b’s containing not more than 3 a’s:

Regular expression = b*(ε + a) b*(ε + a) b* (ε + a) b*


Obtain the regular expression to accept the words with two or more letters but
beginning and ending with the same letter. Σ = { a, b}
Regular expression = a (a+b)* a + b (a+b)* b
Obtain the regular expression to accept the language L = { a^n b^m | m + n is even }

Regular expression = (aa)*(bb)* + a(aa)* b(bb)*


Obtain the regular expression to accept the language L = { a^n b^m c^p | n ≥ 4, m ≤ 3, p ≤ 2 }

Regular expression = aaaa(a)* (ε+b) (ε+b) (ε+b) (ε+c) (ε+c)

Obtain the regular expression for the language L = { a^n b^m | m ≥ 1, n ≥ 1, nm ≥ 3 }

Regular expression = abbb(b)* + aaa(a)*b + aa(a)*bb(b)*
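
Answers like the ones above can be sanity-checked by brute force. The sketch below is not part of the slides. It assumes the slide notation is translated by hand into Python `re` syntax (`+` or `,` for union becomes `|`, and ε becomes an empty alternative), and it compares each pattern with a plain membership predicate on every string up to a small length. Agreement up to that length is evidence, not a proof.

```python
import re
from itertools import product

def all_strings(alphabet, max_len):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def agrees(pattern, predicate, alphabet, max_len=8):
    """True if the pattern and the predicate accept the same short strings."""
    compiled = re.compile(pattern)
    return all(
        (compiled.fullmatch(w) is not None) == predicate(w)
        for w in all_strings(alphabet, max_len)
    )

CHECKS = [
    # (Python pattern, membership predicate, alphabet)
    (r"b*ab*(ab*ab*)*", lambda w: w.count("a") % 2 == 1, "ab"),
    (r"b*(ab*ab*)*ab*", lambda w: w.count("a") % 2 == 1, "ab"),
    (r"(b|ab)(b|ab)*", lambda w: w.endswith("b") and "aa" not in w, "ab"),
    (r"(1|01)*(0|)", lambda w: "00" not in w, "01"),
    (r"((a|b)(a|b)(a|b))*", lambda w: len(w) % 3 == 0, "ab"),
    (r"b*(|a)b*(|a)b*(|a)b*", lambda w: w.count("a") <= 3, "ab"),
    (r"a(a|b)*a|b(a|b)*b", lambda w: len(w) >= 2 and w[0] == w[-1], "ab"),
]

for pattern, predicate, alphabet in CHECKS:
    print(pattern, agrees(pattern, predicate, alphabet))
```

Each line is expected to end in True. A wrong answer, such as (aa)* for "even length over {a, b}", makes `agrees` return False.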
The regular expression language provides three

operators (precedence order from highest to lowest)

1. Kleene star

2. Concatenation, and

3. Union
(α U ε) → The expression can be satisfied either by matching α or by matching the empty string.

(a U b)* → Describes the set of all strings composed of the characters a and b.

a* U b* ≠ (a U b)* Every string in the language on the left contains only a’s or only b’s,
while strings in the language on the right can contain both a’s and b’s.
(ab)* ≠ a* b* The language on the left contains the string abab, while the
language on the right does not. The language on the right
contains the string aaabbbb, while the language on the left does
not.
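
Witness strings make these comparisons concrete. The following sketch uses Python's `re.fullmatch`, with union written as `|`. The helper name `in_language` is an assumption of this example.

```python
import re

def in_language(pattern, w):
    """True if the whole string w matches the pattern."""
    return re.fullmatch(pattern, w) is not None

# "ab" mixes both letters: it is in (a U b)* but not in a* U b*.
print(in_language(r"a*|b*", "ab"), in_language(r"(a|b)*", "ab"))            # False True
# "abab" is in (ab)* but not in a*b*.
print(in_language(r"(ab)*", "abab"), in_language(r"a*b*", "abab"))          # True False
# "aaabbbb" is in a*b* but not in (ab)*.
print(in_language(r"(ab)*", "aaabbbb"), in_language(r"a*b*", "aaabbbb"))    # False True
```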

The regular expression a* is simply a string. It is different from the
language L(a*) = {w: w is composed of zero or more a's}.
a. NO

b. YES

c. NO

d. YES
Kleene's Theorem

• The regular expression language is a useful way to
define patterns.
• Any language that can be defined by a regular
expression can be accepted by some finite state
machine.
• Any language that can be accepted by a finite state
machine can be defined by some regular expression.
Building an FSM from a Regular Expression
Regular Expression to FSM (For Every Regular Expression
There is an Equivalent FSM )

Theorem: Any language that can be defined with a
regular expression can be accepted by some FSM and
so is regular.
Proof: The proof is by construction.

For a given regular expression α, we can construct an
FSM M such that L(α) = L(M).

If α is any c ϵ ∑, we construct for it the simple FSM as:


If α is Ø, we construct for it the simple FSM as:

If α is ε we construct for it the simple FSM as:


Let us construct FSMs to accept languages that are
defined by regular expressions that exploit the
operations of concatenation, union, and Kleene star.

Let β and γ be regular expressions that define languages
over the alphabet ∑.
If L(β) is regular, then it is accepted by some FSM
M1 = (K1, ∑, δ1, s1, A1).
If L(γ) is regular, then it is accepted by some FSM
M2 = (K2, ∑, δ2, s2, A2).
Union Operation:

If regular expression α = β U γ and if both L(β) and L(γ)
are regular, then we construct M3 = (K3, ∑, δ3, s3, A3)
such that L(M3) = L(α) = L(β) U L(γ).
• If necessary, rename the states of M1 and M2 so that K1
∩ K2 = Ø.
• Construct a new machine M3 by creating a new start
state s3 and connecting it to the start states of M1 and M2
via ε-transitions.
• M3 accepts if either M1 or M2 accepts.
• So M3 = ({s3} U K1 U K2, ∑, δ3, s3, A1 U A2) where
δ3 = δ1 U δ2 U {((s3, ε), s1), ((s3, ε), s2)}
L(M3) = L(α) = L(β) U L(γ).
Concatenation Operation:

If regular expression α = βγ and if both L(β) and L(γ)
are regular, then we construct M3 = (K3, ∑, δ3, s3, A3)
such that L(M3) = L(α) = L(β)L(γ).
• If necessary, rename the states of M1 and M2 so that
K1 ∩ K2 = Ø.
• Construct a new machine M3 by connecting every
accepting state of M1 to the start state of M2 via an
ε-transition. M3 will start in the start state of M1 and
will accept if M2 does.
• So M3 = (K1 U K2, ∑, δ3, s1, A2) where
δ3 = δ1 U δ2 U {((q, ε), s2) : q ϵ A1}
L(M3) = L(α) = L(β)L(γ).
Kleene Star Operation:

If regular expression α = β* and L(β) is regular, accepted by
M1 = (K1, ∑, δ1, s1, A1), then we construct
M2 = (K2, ∑, δ2, s2, A2) such that L(M2) = L(α) = L(β)*.
• M2 is constructed by creating a new start state s2 and
making it an accepting state, thus ensuring that M2 accepts
ε.
• We link the new s2 to s1 via an ε-transition. Finally,
we create ε-transitions from each of M1's accepting
states back to s1.
• So M2 = ({s2} U K1, ∑, δ2, s2, {s2} U A1) where
δ2 = δ1 U {((s2, ε), s1)} U {((q, ε), s1) : q ϵ A1}
L(M2) = L(α) = L(β)*.
• Finite state machines constructed from regular
expressions are typically highly nondeterministic
because of their use of ε-transitions.
• These FSMs have a large number of unnecessary
states.
• As a practical matter, this is not a problem, since, given
an arbitrary NDFSM M, we have an algorithm that
can construct an equivalent DFSM M’. We also have
an algorithm that can minimize M’.
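
The three constructions (union, concatenation, Kleene star) can be sketched directly. This is an illustrative sketch, not the slides' own code. It assumes a machine is a tuple `(start, accepting, delta)`, where `delta` maps `(state, label)` to a set of states and the empty string labels an ε-transition. States come from a counter so that state sets stay disjoint, which means every sub-machine has to be built fresh and used only once.

```python
from itertools import count

_fresh = count()   # source of new state names, so state sets never overlap
EPS = ""           # label used for ε-transitions

def merge(d1, d2):
    """Union of two transition relations {(state, label): set_of_states}."""
    out = {k: set(v) for k, v in d1.items()}
    for k, v in d2.items():
        out.setdefault(k, set()).update(v)
    return out

def symbol(c):
    """FSM for a single character c: start --c--> accepting state."""
    s, f = next(_fresh), next(_fresh)
    return (s, {f}, {(s, c): {f}})

def union(m1, m2):
    (s1, a1, d1), (s2, a2, d2) = m1, m2
    s3 = next(_fresh)                       # new start state with ε to s1 and s2
    return (s3, a1 | a2, merge(merge(d1, d2), {(s3, EPS): {s1, s2}}))

def concat(m1, m2):
    (s1, a1, d1), (s2, a2, d2) = m1, m2
    links = {(q, EPS): {s2} for q in a1}    # accepting states of M1 to start of M2
    return (s1, set(a2), merge(merge(d1, d2), links))

def star(m1):
    s1, a1, d1 = m1
    s2 = next(_fresh)                       # new accepting start state
    links = {(q, EPS): {s1} for q in a1}    # loop back from M1's accepting states
    links[(s2, EPS)] = {s1}
    return (s2, a1 | {s2}, merge(d1, links))

def eps_closure(states, delta):
    """All states reachable from states using only ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def accepts(machine, w):
    """Simulate the ε-NFA on w by tracking the set of current states."""
    start, accepting, delta = machine
    current = eps_closure({start}, delta)
    for c in w:
        moved = set()
        for q in current:
            moved |= delta.get((q, c), set())
        current = eps_closure(moved, delta)
    return bool(current & accepting)

# (b U ab)*
m = star(union(symbol("b"), concat(symbol("a"), symbol("b"))))
print([w for w in ["", "b", "ab", "bab", "a", "aab", "ba"] if accepts(m, w)])
# ['', 'b', 'ab', 'bab']
```

Simulating with sets of states avoids building the equivalent DFSM explicitly; the subset construction mentioned above would precompute the same sets.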
Construct a FSM for the regular expression (b U ab)*
OR
Convert the regular expression (b + ab)* to an ε- NFA
OR
Convert the regular expression (b, ab)* to a FSM.
FSM for b

FSM for a
FSM for ab
FSM for (b U ab )
FSM for (b U ab)*
Convert the regular expression 0* + 1* + 2* to an ε- NFA or a
FSM
FSM to Regular Expression (State Elimination)
• How to build a regular expression for a FSM.
• Instead of limiting the labels on the transitions of an
FSM to a single character or ε, we will allow entire regular
expressions as labels.
• For a given input FSM M, we will construct a machine M’ such
that M and M’ are equivalent and M’ has only two states,
start state and a single accepting state.
• M’ will also have just one transition, which will go from its
start state to its accepting state.
• The label on that transition will be a regular expression that
describes all the strings that could have driven the original
machine M from its start state to some accepting state.
Consider the following FSM M:

Show a regular expression for L(M).

OR
Obtain the regular expression for the above finite
automata using state elimination method.
• We can build an equivalent machine M' by
eliminating state q2 and replacing it by a transition
from q1 to q3 labeled with the regular expression
ab*a.
So M' is:

Regular Expression = ab*a


Algorithm to create a regular expression from FSM

1. Remove any states from the given FSM M that are
unreachable from the start state.
2. If M has no accepting states, then halt and return the
simple regular expression Ø.
3. If the start state of M is part of a loop (i.e., it has any
transitions coming into it), then create a new start state s and
connect it to M's start state via an ε-transition. This
new start state s will have no transitions into it.
4. If there is more than one accepting state of M or if
there is just one but there are any transitions out of
it, create a new accepting state and connect each of
M’s accepting states to it via an ε-transition. Remove
the old accepting states from the set of accepting
states. Note that the new accepting state will have
no transitions out from it.
5. At this point, if M has only one state, then that state
is both the start state and the accepting state and M
has no transitions. So L(M) = {ε}. Halt and return the
simple regular expression ε.
6. Until only the start state and the accepting state
remain do:
6.1 Select some state rip of M. Any state except the start
state or the accepting state may be chosen.
6.2 Remove rip from M.
6.3 Modify the transitions among the remaining states so
that M accepts the same strings. The labels on the
rewritten transitions may be any regular expression.
7. Return the regular expression that labels the one remaining
transition from the start state to the accepting state.
Consider the following FSM M:
Show a regular expression for L(M).

OR
Obtain the regular expression for the above finite automata
using state elimination method.
• Create a new start state and a new accepting state
and link them to M:
Remove state 3:
Remove state 2:
Remove state 1:

Regular Expression = (ab U aaa* b)* (a U ε )


If we attempt to remove state [2], this changes not just the way that M can
move from state [1] to state [4].
It also changes, for example, the way that M can move from state [1] to state
[3], because it changes how M can move from state [1] back to itself.
Kleene’s Theorem:

Theorem: Every regular language (i.e., every language
that can be accepted by some FSM) can be defined
with a regular expression.

The proof is by construction: given an FSM M, the following steps build
a regular expression for L(M).


1. Remove any states from the given FSM M that are unreachable from the
start state.

2. If the start state of M is part of a loop (i.e., it has any transitions coming
into it), then create a new start state s and connect it to M's start state via
an ε-transition. This new start state s will have no transitions into it.

3. If there is more than one accepting state of M or if there is just one but
there are any transitions out of it, create a new accepting state and
connect each of M’s accepting states to it via an ε-transition. Remove
the old accepting states from the set of accepting states. Note that the
new accepting state will have no transitions out from it.
• If there is more than one transition between states p and q,
collapse them into a single transition.
• If there is a pair of states p, q and there is no transition
between them and p is not the accepting state and q is not
the start state, then create a transition from p to q labeled Ø.
• At this point, if M has only one state, then that state is both the
start state and the accepting state and M has no transitions. So
L(M) = {ε}. Halt and return the simple regular expression ε.
• If M has no accepting states, then halt and return the simple
regular expression Ø.
• Until only the start state and the accepting state remain do:

i. Select some state rip of M. Any state except the start
state or the accepting state may be chosen.

ii. For every transition from some state p to some state q,
if both p and q are not rip, then, using the current
labels given by the expressions R, compute the new
label R' for the transition from p to q using the formula:

R'(p, q) = R(p, q) U R(p, rip) R(rip, rip)* R(rip, q)

iii. Remove rip and all transitions into and out of
it.
• Return the regular expression that labels the
one remaining transition from the start state
to the accepting state.
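
The ripping loop and the formula R'(p, q) = R(p, q) U R(p, rip) R(rip, rip)* R(rip, q) can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: labels are Python `re` pattern strings, `None` stands for Ø and the empty string for ε, the input is a deterministic machine given as a dictionary, and a new start and a new accepting state are always added (their names `start'` and `accept'` are assumed not to clash with existing states). Missing transitions are treated as Ø rather than stored explicitly, and unreachable states are not removed first since they cannot contribute to the final label.

```python
import re

# Labels are Python regex strings; None stands for Ø and "" for ε.
def union(r, s):
    if r is None:
        return s
    if s is None:
        return r
    return r if r == s else f"(?:{r}|{s})"

def concat(r, s):
    if r is None or s is None:
        return None          # Ø concatenated with anything is Ø
    return r + s

def star(r):
    if r is None or r == "":
        return ""            # Ø* = ε and ε* = ε
    return f"(?:{r})*"

def fsm_to_regex(states, delta, start, accepting):
    """delta maps (state, symbol) -> state. Returns a pattern, or None for Ø."""
    if not accepting:
        return None
    new_start, new_accept = "start'", "accept'"
    R = {}

    def add(p, q, label):
        # Collapse parallel transitions into a single union label.
        R[(p, q)] = union(R.get((p, q)), label)

    for (p, c), q in delta.items():
        add(p, q, re.escape(c))
    add(new_start, start, "")
    for q in accepting:
        add(q, new_accept, "")

    for rip in states:
        loop = star(R.get((rip, rip)))
        preds = [p for (p, q) in R if q == rip and p != rip]
        succs = [q for (p, q) in R if p == rip and q != rip]
        for p in preds:
            for q in succs:
                # R'(p, q) = R(p, q) U R(p, rip) R(rip, rip)* R(rip, q)
                add(p, q, concat(concat(R[(p, rip)], loop), R[(rip, q)]))
        for key in [k for k in R if rip in k]:
            del R[key]       # remove rip and all transitions into and out of it
    return R.get((new_start, new_accept))

# Three-state machine: q1 --a--> q2, q2 --b--> q2, q2 --a--> q3 (accepting).
pattern = fsm_to_regex(
    ["q1", "q2", "q3"],
    {("q1", "a"): "q2", ("q2", "b"): "q2", ("q2", "a"): "q3"},
    "q1",
    {"q3"},
)
print(pattern)  # a Python pattern equivalent to ab*a
```

The resulting strings are not simplified, so they tend to be longer than a hand-derived answer while still describing the same language.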
Construct the regular expression for the following FSM using
Kleene’s Theorem
By Adding all the required transitions.
Ripping States Out One at a Time

Let rip be state 2. Then:


R'(1, 3) = R(1, 3) U R(1, rip)R(rip, rip)*R(rip, 3).
= R(1, 3) U R(1, 2) R(2, 2)* R(2, 3)
= Ø U a b* a = ab*a
Applications of Regular Expressions

• Text editors: programs used for processing text.
Example: UNIX text editors use regular expressions for substituting strings.

• Lexical analysers: for example, an identifier can be described as (letter) (letter + digit)*
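
As an illustration of the identifier pattern, here is a minimal sketch in Python. It assumes that "letter" means an ASCII letter and "digit" an ASCII digit, which the slides do not specify, and the helper name `is_identifier` is invented for the example.

```python
import re

# (letter)(letter + digit)* written with character classes.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_identifier(token):
    """True if the whole token is a letter followed by letters or digits."""
    return IDENTIFIER.fullmatch(token) is not None

print(is_identifier("count1"), is_identifier("1count"))  # True False
```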


If P, Q and R are regular expressions, then the identity rules are:

Example              Rule
1ε = ε1 = 1          εR = Rε = R
1Ø = Ø1 = Ø          ØR = RØ = Ø
ε* = ε               (Ø)* = ε
Ø + 1 = 1            Ø + R = R + Ø = R
1 U 1 = 1            R + R = R
00* = 0+             RR* = R*R = R+
(1*)* = 1*           (R*)* = R*
ε + 1+ = 1*          ε + RR* = R*

In the examples 0+ and 1+ use the postfix + operator (one or more), so 1+ = 11*.

Further rules:
R* R* = R*
(P + Q)R = PR + QR
(P + Q)* = (P*Q*)* = (P* + Q*)*
R*(ε + R) = (ε + R) R* = R*
(ε + R)* = R*
ε + R* = R*
(PQ)* P = P(QP)*
R*R + R = R*R = R+
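
Instances of these identities can be spot-checked by comparing the two sides on all short strings. The sketch below is an assumption-laden illustration: it fixes P = 0, Q = 1 and R = 01, writes union as `|` and ε as an empty alternative in Python `re` syntax, and only looks at strings up to length 6, so it can refute an identity but not prove one.

```python
import re
from itertools import product

def language(pattern, alphabet="01", max_len=6):
    """Set of strings over alphabet, up to max_len, that fully match pattern."""
    compiled = re.compile(pattern)
    return {
        "".join(t)
        for n in range(max_len + 1)
        for t in product(alphabet, repeat=n)
        if compiled.fullmatch("".join(t))
    }

def same(p1, p2):
    """True if both patterns accept exactly the same short strings."""
    return language(p1) == language(p2)

print(same(r"(0|1)*", r"(0*1*)*"))        # (P+Q)* = (P*Q*)*
print(same(r"(0|1)*", r"(0*|1*)*"))       # (P+Q)* = (P*+Q*)*
print(same(r"(01)*0", r"0(10)*"))         # (PQ)*P = P(QP)*
print(same(r"(|01(01)*)", r"(01)*"))      # ε + RR* = R*
print(same(r"(01)*(01)*", r"(01)*"))      # R*R* = R*
print(same(r"(0|1)*", r"0*1*"))           # without the outer star the languages differ
```

The first five comparisons are expected to print True and the last one False, since 10 is in (0 + 1)* but not in 0*1*.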
Kleene’s Theorem

For every FA recognizing a language L, there exists
an equivalent regular expression R for the
regular language L, such that L = L(R).
Let us consider a DFA with n states, say q1,
q2, …, qn.

R_ij^(k) is the regular expression for the paths from state i
to state j that pass only through intermediate states whose
number is not higher than k. It is given by

R_ij^(k) = R_ij^(k-1) + R_ik^(k-1) (R_kk^(k-1))* R_kj^(k-1)
R_12^(2) = R_12^(1) + R_12^(1) (R_22^(1))* R_22^(1)

R_12^(1) = R_12^(0) + R_11^(0) (R_11^(0))* R_12^(0)

R_22^(1) = R_22^(0) + R_21^(0) (R_11^(0))* R_12^(0)

Base cases:
R_11^(0) = 1 + ε        R_21^(0) = Ø
R_22^(0) = 0 + 1 + ε    R_12^(0) = 0

R_12^(1) = R_12^(0) + R_11^(0) (R_11^(0))* R_12^(0)
= (0) + (1 + ε)(1 + ε)* (0)
= 0 + 1* (0)          (we know that (ε + R)* = R*)
R_12^(1) = 1*0

R_22^(1) = R_22^(0) + R_21^(0) (R_11^(0))* R_12^(0)
= (0 + 1 + ε) + Ø(1 + ε)* 0
R_22^(1) = (0 + 1 + ε)    (we know that ØR = Ø and R + Ø = R)

R_12^(2) = R_12^(1) + R_12^(1) (R_22^(1))* R_22^(1)
= 1*0 + (1*0) (0 + 1 + ε)* (0 + 1 + ε)
= 1*0 + (1*0) (0 + 1)*
= (1*0) (0 + 1)*
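
The recurrence can also be run mechanically. The sketch below is an illustration only: it encodes the four base cases above as Python `re` patterns (`None` for Ø, an empty alternative for ε), applies the recurrence for k = 1 and k = 2 without any simplification, and then compares the unsimplified result with the hand-derived answer on all short strings. The function names are invented for this example.

```python
import re
from itertools import product

# Patterns are Python regex strings; None stands for Ø and "" for ε.
def union(r, s):
    if r is None:
        return s
    if s is None:
        return r
    return f"(?:{r}|{s})"

def concat(r, s):
    return None if r is None or s is None else r + s

def star(r):
    return "" if r in (None, "") else f"(?:{r})*"

def kleene(n, base):
    """Apply the recurrence for k = 1..n; base[(i, j)] holds R_ij^(0)."""
    R = dict(base)
    for k in range(1, n + 1):
        # Every new entry is computed from the previous level only.
        R = {
            (i, j): union(R[(i, j)],
                          concat(concat(R[(i, k)], star(R[(k, k)])), R[(k, j)]))
            for i in range(1, n + 1)
            for j in range(1, n + 1)
        }
    return R

# Base cases: R_11 = 1 + ε, R_12 = 0, R_21 = Ø, R_22 = 0 + 1 + ε
BASE = {(1, 1): "(?:1|)", (1, 2): "0", (2, 1): None, (2, 2): "(?:0|1|)"}

def same_language(p1, p2, alphabet="01", max_len=7):
    """Compare two patterns on every string up to max_len."""
    words = ("".join(t) for n in range(max_len + 1)
             for t in product(alphabet, repeat=n))
    return all((re.fullmatch(p1, w) is None) == (re.fullmatch(p2, w) is None)
               for w in words)

derived = kleene(2, BASE)[(1, 2)]
print(same_language(derived, r"1*0(0|1)*"))  # True
```

The mechanical result is much longer than (1*0)(0 + 1)*, which is why the identity rules are applied at each step of the hand derivation.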
