Unit-3 Regular Expressions
Example 2:
L2 = {w | w in {(, )}* and w is balanced }
- Balanced parentheses are those that can appear in an arithmetic expression.
L2 = { (), ()(), (()), (()()),… }
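Membership in L2 can be tested with a single counter. The following is a minimal sketch, not part of the original notes; the function name is_balanced is hypothetical, and it assumes that the empty string counts as balanced.

```python
def is_balanced(w):
    """Return True if w is a balanced string over the alphabet {(, )}."""
    depth = 0                      # number of currently open parentheses
    for ch in w:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        else:
            return False           # symbol outside the alphabet
        if depth < 0:
            return False           # a ")" appeared before its matching "("
    return depth == 0              # every "(" must have been closed


for w in ["()", "()()", "(())", "(()())", ")(", "(()"]:
    print(repr(w), is_balanced(w))
```

The counter can grow without bound, which hints at why a machine with a fixed number of states cannot recognise this language.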
Regular Expression and Grammar:
REGULAR LANGUAGE => Basic language + Regular operator
The basic language: The simple languages are of the form {a} where a ∈ Ʃ, together with
the empty language ∅ and the language {ε} containing only the empty string.
Regular operator: The regular operators used to generate a language are listed below
(positive closure is derived from concatenation and Kleene closure):
1. Union (U): L1 U L2 = {s | s ∈ L1 or s ∈ L2}
2. Concatenation (.): L1.L2 = {s.t | s ∈ L1 and t ∈ L2}
3. Kleene closure (*): L* = 0 or more concatenations of strings of L
4. Positive closure (+): L+ = 1 or more concatenations of strings of L
NOTE: Precedence of regular operators:
The star operator is of highest precedence, i.e. it applies to the smallest well-formed RE to its left.
Next precedence is taken by the concatenation operator.
Finally, unions are taken.
Example:- If L1 = {11, 00}, L2 = {01, 10} over Ʃ = {0, 1} then,
L1 U L2 = {11, 00, 01, 10}
L1.L2 = {1101, 1110, 0001, 0010}
L1* = {ε, 11, 00, 1111, 1100, 0011, 0000, …}
L1+ = {11, 00, 1111, 1100, 0011, 0000, …}
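The operators can be tried out on finite languages represented as Python sets. This is a small sketch rather than a standard library API: the function names are hypothetical, and because L* is infinite the closure is cut off at a chosen maximum length.

```python
from itertools import product


def union(l1, l2):
    """L1 U L2: strings that belong to L1 or to L2."""
    return l1 | l2


def concat(l1, l2):
    """L1.L2: every string of L1 followed by every string of L2."""
    return {s + t for s, t in product(l1, l2)}


def closure(lang, max_len, positive=False):
    """Strings of L* (or L+ when positive=True) of length <= max_len."""
    result = set() if positive else {""}
    frontier = {""}
    while frontier:
        # Append one more string of lang to everything found in the last round.
        frontier = {s + t for s in frontier for t in lang
                    if len(s + t) <= max_len} - result
        result |= frontier
    return result


L1 = {"11", "00"}
L2 = {"01", "10"}
print(sorted(union(L1, L2)))
print(sorted(concat(L1, L2)))
print(sorted(closure(L1, 4)))                  # includes the empty string
print(sorted(closure(L1, 4, positive=True)))   # excludes it here
```

Every string produced by the closure of L1 has even length, since both strings of L1 have length 2.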
Regular Sets
Any set that represents the value of the Regular Expression is called a Regular Set.
Properties of Regular Sets
Regular Expression    Regular Set
ε                     {ε}
a + b                 {a, b}
a.b                   {ab}
Ex. 2: Find a regular expression corresponding to the language of all strings over
the alphabet {a, b } that contain exactly two a's.
Solution: A string in this language must have exactly two a's. Since any string of
b's can be placed in front of the first a, behind the second a and between the two
a's, and since an arbitrary string of b's can be represented by the regular
expression b*, b*a b*a b* is a regular expression for this language.
Ex. 3: Let r1 and r2 be arbitrary regular expressions over some alphabet. Find a simple (the
shortest and with the smallest nesting of * and +) regular expression which is equal to each of
the following regular expressions.
(a) (r1 + r2 + r1r2 + r2r1)*
(b) (r1(r1 + r2)*)+
Solution: One general strategy to approach this type of question is to try to see whether or not
they are equal to simple regular expressions that are familiar to us such as a, a*, a+, (a + b)*, (a
+ b)+ etc.
(a) Since (r1 + r2)* represents all strings consisting of strings of r1 and/or r2 , r1r2 + r2r1 in the
given regular expression is redundant, that is, they do not produce any strings that are not
represented by (r1 + r2)*. Thus (r1 + r2 + r1r2 + r2r1)* is reduced to (r1 + r2)*.
(b) (r1(r1 + r2)*)+ means that all the strings represented by it must consist of one or more
strings of (r1(r1 + r2)*). However, the strings of (r1(r1 + r2)*) start with a string of r1 followed
by any number of strings taken arbitrarily from r1 and/or r2. Thus anything that comes after the
first r1 in (r1(r1 + r2)*)+ is represented by (r1 + r2)*. Hence (r1(r1 + r2)*) also represents the
strings of (r1(r1 + r2)*)+, and conversely (r1(r1 + r2)*)+ represents the strings represented by
(r1(r1 + r2)*). Hence (r1(r1 + r2)*)+ is reduced to (r1(r1 + r2)*).
Ex. 4: For the two regular expressions given below,
(a) find a string corresponding to r2 but not to r1 and
(b) find a string corresponding to both r1 and r2.
r1 = a* + b* r2 = ab* + ba* + b*a + (a*b)*
Solution:
(a) Any string consisting of only a's or only b's and the empty string are in
r1. So we need to find strings of r2 which contain at least one a and at
least one b. For example ab and ba are such strings.
(b) A string corresponding to r1 consists of only a's or only b's or the empty
string. The only strings corresponding to r2 which consist of only a's or only b's
are a, the empty string and the strings consisting of only b's (the last two from (a*b)*).
Ex. : Find a regular expression corresponding to the language of all strings over the alphabet {a, b } that do not
end with ab.
Solution:
Any nonempty string over { a , b } must end in a or b.
Hence if a string does not end with ab then it is empty, or it ends with a, or it ends with b and is either the
single symbol b or has its last b preceded by another b.
Since any string can appear in front of the last a or bb, ( a + b )*( a + bb ) covers the strings ending with a or bb.
Adding the two short strings b and ε gives a regular expression for the language:
(a+b)*(a+bb) + b + ε
OR
(a|b)*(a|bb) | b | ε
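A regular expression for an exercise like this can be sanity-checked by brute force against the English description. The sketch below is an illustration only: the helper names all_strings and mismatches are hypothetical, the bound of 6 symbols is arbitrary, and Python's re syntax writes union as | and the empty alternative as nothing at all.

```python
import re
from itertools import product


def all_strings(alphabet, max_len):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def mismatches(pattern, predicate, alphabet="ab", max_len=6):
    """Strings on which the regex and the predicate disagree."""
    regex = re.compile(pattern)
    return [w for w in all_strings(alphabet, max_len)
            if bool(regex.fullmatch(w)) != predicate(w)]


def does_not_end_with_ab(w):
    return not w.endswith("ab")


# (a|b)*(a|bb) on its own misses the two shortest strings of the language.
print(mismatches(r"(a|b)*(a|bb)", does_not_end_with_ab))     # ['', 'b']
# Adding b and the empty alternative removes the disagreement.
print(mismatches(r"(a|b)*(a|bb)|b|", does_not_end_with_ab))  # []
```

An empty list is evidence, not a proof: it only says the two descriptions agree on all strings up to the chosen length.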
Ex.: Find a regular expression corresponding to the language of strings of even
lengths over the alphabet of { a, b }.
Solution:
Since any string of even length can be expressed as the concatenation of strings
of length 2 and since the strings of length 2 are aa, ab, ba, bb, a regular
expression corresponding to the language is ( aa + ab + ba + bb )*.
Note that 0 is an even number. Hence the empty string ε is in this language.
Ex.: Describe as simply as possible in English the language corresponding to the
regular expression a*b(a*ba*b)*a* .
Solution:
A string in the language can start and end with a or b, it has at least one b, and
after the first b all the b's in the string appear in pairs. Any number of a's can
appear any place in the string.
Thus, simply put, it is the set of strings over the alphabet { a, b } that contain an
odd number of b's.
Ex. : Describe as simply as possible in English the language corresponding to the
regular expression (( a + b )^3)*( ε + a + b ) .
Solution:
(( a + b )^3) represents the strings of length 3. Hence (( a + b )^3)* represents the
strings whose length is a multiple of 3.
Since (( a + b )^3)*( a + b ) represents the strings of length 3n + 1, where n is
a natural number, the given regular expression represents the strings of length 3n
or 3n + 1, where n is a natural number.
Ex. : Describe as simply as possible in English the language corresponding to the
regular expression ( b + ab )*( a + ab )*.
Solution:
( b + ab )* represents strings which do not contain any substring aa and which
are empty or end in b, and ( a + ab )* represents strings which do not contain any
substring bb and which are empty or start with a.
Hence altogether it represents any string consisting of a substring with no aa
that is empty or ends in b, followed by a substring with no bb that is empty or starts with a.
Some RE Examples
Regular Expression     Regular Set
(0 + 10*)              L = { 0, 1, 10, 100, 1000, 10000, … }
(0*10*)                L = { 1, 01, 10, 010, 0010, … }
(0 + ε)(1 + ε)         L = { ε, 0, 1, 01 }
(a+b)*                 Set of strings of a's and b's of any length, including the null string. So L = { ε, a, b, aa, ab, bb, ba, aaa, … }
(a+b)*abb              Set of strings of a's and b's ending with the string abb. So L = { abb, aabb, babb, aaabb, ababb, … }
(11)*                  Set of strings consisting of an even number of 1's, including the empty string. So L = { ε, 11, 1111, 111111, … }
(aa)*(bb)*b            Set of strings consisting of an even number of a's followed by an odd number of b's. So L = { b, aab, aabbb, aabbbbb, aaaab, aaaabbb, … }
(aa + ab + ba + bb)*   Strings of a's and b's of even length, obtained by concatenating any combination of the strings aa, ab, ba and bb, including null. So L = { ε, aa, ab, ba, bb, aaab, aaba, … }
∑ = {a, b} and r is a regular expression of language made using these symbols
5. Idempotent law:
If R is an R.E then R + R = R
6. Law of closure:
If R is an R.E then ((R)*)* = R*
closure of ɸ = ɸ* = ε
closure of ε = ε* = ε
7. Identities for regular expression
• There are many identities for regular expressions. Let p, q and r be regular
expressions.
∅ + r = r
∅.r = r.∅ = ∅
ε.r = r.ε = r
ε* = ε and ∅* = ε
r + r = r
r*.r* = r*
r.r* = r*.r = r+
(r*)* = r*
ε + r.r* = r* = ε + r*.r
(p.q)*.p = p.(q.p)*
(p + q)* = (p*.q*)* = (p* + q*)*
(p + q).r = p.r + q.r and r.(p + q) = r.p + r.q
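The identities can be spot-checked on concrete languages. The sketch below makes several assumptions that are not in the notes: languages are modelled as Python sets of strings cut off at length N, the names cat and star are hypothetical helpers, and p, q and r are arbitrary sample languages. Agreement up to length N supports an identity but does not prove it.

```python
N = 6  # only strings of length <= N are compared (an arbitrary bound)


def cat(x, y):
    """Concatenation x.y, keeping only strings of length <= N."""
    return frozenset(s + t for s in x for t in y if len(s + t) <= N)


def star(x):
    """Kleene closure x*, keeping only strings of length <= N."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = set(cat(frontier, x)) - result
        result |= frontier
    return frozenset(result)


EPS = frozenset({""})      # the language {ε}
EMPTY = frozenset()        # the empty language ∅
p = frozenset({"ab"})      # sample languages, chosen arbitrarily
q = frozenset({"b", "ba"})
r = frozenset({"a", "bb"})

checks = {
    "∅ + r = r": (EMPTY | r, r),
    "∅.r = ∅": (cat(EMPTY, r), EMPTY),
    "ε.r = r": (cat(EPS, r), r),
    "∅* = ε": (star(EMPTY), EPS),
    "r*.r* = r*": (cat(star(r), star(r)), star(r)),
    "r.r* = r*.r": (cat(r, star(r)), cat(star(r), r)),
    "(r*)* = r*": (star(star(r)), star(r)),
    "ε + r.r* = r*": (EPS | cat(r, star(r)), star(r)),
    "(p.q)*.p = p.(q.p)*": (cat(star(cat(p, q)), p), cat(p, star(cat(q, p)))),
    "(p + q)* = (p*.q*)*": (star(p | q), star(cat(star(p), star(q)))),
    "(p + q)* = (p* + q*)*": (star(p | q), star(star(p) | star(q))),
    "(p + q).r = p.r + q.r": (cat(p | q, r), cat(p, r) | cat(q, r)),
}

for name, (lhs, rhs) in checks.items():
    print(name, lhs == rhs)
```

Cutting off at length N is safe for this comparison because any string of length at most N in a concatenation or closure is built from pieces that are themselves no longer than N.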
Examples
Consider Σ = {0, 1}, then some regular expressions over Σ are ;
• 0*10* is RE that represents language {w|w contains a single 1}
• Σ*1Σ* is RE for language {w|w contains at least one 1}
• Σ*001Σ* = {w|w contains the string 001 as substring}
• (ΣΣ)* or ((0+1).(0+1))* is RE for {w|w is string of even length}
• 1*(01*01*)* is RE for {w|w is string containing even number of zeros}
• 0*10*10*10* is RE for {w|w is a string with exactly three 1’s}
• For string that have substring either 001 or 100, the regular expression is
(1+0)*.001.(1+0)*+(1+0)*.(100).(1+0)*
• For strings that have at most two 0's in them, the regular expression is
1*.(0+ε).1*.(0+ε).1*
• For the strings ending with 11, the regular expression is
(1+0)*.(11)+
• Regular expression that denotes the C language identifiers:
(Alphabet + _ )(Alphabet + digit + _ )*
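The identifier expression carries over almost directly to a practical regex engine. This is a minimal sketch using Python's re module; it assumes "Alphabet" means the ASCII letters, the name is_identifier is hypothetical, and reserved words such as int or while are not excluded.

```python
import re

# (Alphabet + _)(Alphabet + digit + _)* written in Python's regex syntax.
# Assumption: "Alphabet" is taken to be the ASCII letters A-Z and a-z.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(text):
    """Return True if the whole of text has the shape of a C identifier."""
    return IDENTIFIER.fullmatch(text) is not None


for candidate in ["count", "_tmp1", "9lives", "a-b", ""]:
    print(repr(candidate), is_identifier(candidate))
```

fullmatch is used rather than search so that the entire string, not just some substring of it, must belong to the language.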
Application of regular languages:
Search and Selection: Identifying a subset of items from a larger set on the
basis of a pattern match.
Solution −
Here the initial state is q1 and the final state is q2.
Now we write down the equations −
q2 = 0*1 + q2.0
q2 = 0*1(0)* [By Arden's theorem]
SOLUTION:
Let the equations be:
q1 = q2.1 + q3.0 + ε ... (i)
q2 = q1.0 ... (ii)
q3 = q1.1 ... (iii)
q4 = q2.0 + q3.1 + q4.0 + q4.1 ... (iv)
Now put q2 and q3 in eqn (i):
q1 = q1.01 + q1.10 + ε
= ε + q1.(01 + 10)
This has the form R = Q + R.P of Arden's theorem, where Q = ε, R = q1 and P = 01 + 10.
Therefore, q1 = ε.(01 + 10)*
Since q1 is the final state,
R.E = ε.(01 + 10)*
= (01 + 10)* is the required R.E from the given diagram.
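Arden's theorem says that an equation of the form R = Q + RP has the solution R = QP* (the solution is unique when P does not contain ε). That QP* really satisfies the equation can be confirmed by substitution, using the identity ε + r*.r = r*. The derivation below is a worked check added for illustration, with Q, R and P as in the statement of the theorem.

```latex
\begin{align*}
Q + RP &= Q + (QP^{*})P            && \text{substitute } R = QP^{*} \\
       &= Q(\varepsilon + P^{*}P)  && \text{factor out } Q \\
       &= QP^{*}                   && \text{since } \varepsilon + P^{*}P = P^{*} \\
       &= R
\end{align*}
```

With Q = ε and P = 01 + 10 this gives R = (01 + 10)*, and with Q = 0*1 and P = 0 it gives R = 0*1(0)*.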
Construction of an FA from an RE
We can use Thompson's Construction to obtain a Finite Automaton from a Regular Expression. We
reduce the regular expression into its smallest sub-expressions, convert these to NFAs, and finally
convert the result to a DFA.
Some basic RE cases are the following −
• Case 1 − For a regular expression ‘a’, we can construct the following FA −
A minimal DFA
• In practice, we are interested in the DFA with the minimal number of states.
– Use less memory
– Use less hardware (flip-flops)
• We can find a minimal DFA for any given DFA and their languages are equal.
Minimization of DFA
Given a DFA M that accepts a language L(M), we construct a DFA M′ with as few states as
possible such that L(M′) = L(M). Minimization involves identifying the equivalent states and
the distinguishable states.
For minimization, the table filling algorithm is used.
• Distinguishable states:
Two states p & q are said to be distinguishable if there exists a string x such
that exactly one of δ(p, x) and δ(q, x) is a final state.
• Indistinguishable states:
Two indistinguishable states behave the same for all possible strings.
– So, we do not need all of the states from a set of indistinguishable states.
– We can eliminate all but one of them, keeping that one to represent the whole set of
indistinguishable states.
Equivalent States: Two states p & q are called equivalent states,
denoted by p ≡ q if and only if for each input string x, δ(p, x) is a final
state if and only if δ(q, x) is a final state.
Finding Distinguishable States – Table Filling Algorithm
• We can compute distinguishable states with an inductive table filling algorithm.
Basis:
• Any non-accepting state is distinguishable from any accepting state.
Induction:
• States p and q are distinguishable if there is some input symbol a such that
δ(p,a) is distinguishable from δ(q,a).
• All other pairs of states are indistinguishable, and can be merged
appropriately.
We can also use table filling algorithm to minimize a DFA by merging all
equivalent states.
That is, we replace a state p with its equivalence class found by the table filling
algorithm.
Table filling algorithm steps (Minimize DFA):
The table shows that q2 and q0 can be combined, and that q5 and q3 can be combined.
Third Iteration:
Continue as in the second iteration for the combined states (i.e. q2q0 & q5q3).
=> No change, so stop iterating.
Eg: Drawing the minimized DFA
Since q2q0 and q5q3 can be combined, the 6 states are reduced to 4 states.
Table Filling Algorithm: Minimizations of DFA
Example 2:
Now to solve this problem, first we should determine whether each pair of states is
distinguishable or not.
PASS 0: Distinguish accepting states from non-accepting states
C is the only accepting state, so it is distinguishable from all the non-accepting states.
NOTE: A ≢B means they are distinguishable
PASS 1:
Consider column A
A ≢B since δ(A,1)=F, δ(B,1)=C and F ≢C
so mark in AB box as distinguishable
A ≢D since δ(A,0)=B, δ(D,0)=C and B ≢C
so mark in AD box as distinguishable
A ≡E since
• δ(A,0)=B, δ(E,0)=H and B ≡H
• δ(A,1)=F, δ(E,1)=F and F ≡ F
so no mark in AE box
A ≢F since δ(A,0)=B, δ(F,0)=C and B ≢C
so mark in AF box (i.e. distinguishable)
A ≡G since
• δ(A,0)=B, δ(G,0)=G and B ≡G
• δ(A,1)=F, δ(G,1)=E and F ≡E
so no mark in AG box
A ≢H since δ(A,1)=F, δ(H,1)=C and F ≢C so mark AH
PASS 1:
Consider column B
B ≢D since δ(B,1)=C, δ(D,1)=G and C ≢G
B ≢E since δ(B,1)=C, δ(E,1)=F and C ≢F
B ≢F since δ(B,1)=C, δ(F,1)=G and C ≢G
B ≢G since δ(B,1)=C, δ(G,1)=E and C ≢E
B ≡H since
• δ(B,0)=G, δ(H,0)=G and G ≡G
• δ(B,1)=C, δ(H,1)=C and C ≡C
PASS 1:
Consider column D
D ≢E since δ(D,0)=C, δ(E,0)=H and C ≢H
D ≡F since
• δ(D,0)=C, δ(F,0)=C and C ≡C
• δ(D,1)=G, δ(F,1)=G and G ≡G
PASS 2:
A ≡E since
• δ(A,0)=B, δ(E,0)=H and B ≡H
• δ(A,1)=F, δ(E,1)=F and F ≡F
B ≡H since
• δ(B,0)=G, δ(H,0)=G and G ≡G
• δ(B,1)=C, δ(H,1)=C and C ≡C
D ≡F since
• δ(D,0)=C, δ(F,0)=C and C ≡C
• δ(D,1)=G, δ(F,1)=G and G ≡G
PASS 3:
Consider column A,B,D
A ≡E since
• δ(A,0)=B, δ(E,0)=H and B ≡H
• δ(A,1)=F, δ(E,1)=F and F ≡F
B ≡H since
• δ(B,0)=G, δ(H,0)=G and G ≡G
• δ(B,1)=C, δ(H,1)=C and C ≡C
D ≡F since
• δ(D,0)=C, δ(F,0)=C and C ≡C
• δ(D,1)=G, δ(F,1)=G and G ≡G
Hence A ≡ E, B ≡ H and D ≡ F: each of these pairs is equivalent and can be merged into a single state.
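The passes above can be reproduced mechanically. The following is a sketch of the table filling algorithm under stated assumptions: the function names are hypothetical, the transitions are the ones quoted in the passes of Example 2, and the outgoing transitions of C are left out because they are not listed in the text and are never consulted (every pair containing C is already marked in the basis step).

```python
from itertools import combinations


def distinguishable_pairs(states, alphabet, delta, accepting):
    """Return the set of marked (distinguishable) pairs of states."""
    # Basis: an accepting and a non-accepting state are distinguishable.
    marked = {frozenset(pair) for pair in combinations(states, 2)
              if (pair[0] in accepting) != (pair[1] in accepting)}
    changed = True
    while changed:                       # repeat sweeps until nothing new is marked
        changed = False
        for p, q in combinations(states, 2):
            pair = frozenset((p, q))
            if pair in marked:
                continue
            # Induction: mark (p, q) if some symbol leads to a marked pair.
            if any(frozenset((delta[p, a], delta[q, a])) in marked
                   for a in alphabet):
                marked.add(pair)
                changed = True
    return marked


def equivalent_pairs(states, alphabet, delta, accepting):
    """Pairs left unmarked when the table stops changing."""
    marked = distinguishable_pairs(states, alphabet, delta, accepting)
    return [pair for pair in combinations(states, 2)
            if frozenset(pair) not in marked]


STATES = "ABCDEFGH"
DELTA = {  # transitions as quoted in Example 2; those of C are not given there
    ("A", "0"): "B", ("A", "1"): "F",
    ("B", "0"): "G", ("B", "1"): "C",
    ("D", "0"): "C", ("D", "1"): "G",
    ("E", "0"): "H", ("E", "1"): "F",
    ("F", "0"): "C", ("F", "1"): "G",
    ("G", "0"): "G", ("G", "1"): "E",
    ("H", "0"): "G", ("H", "1"): "C",
}

print(equivalent_pairs(STATES, "01", DELTA, accepting={"C"}))
```

The order in which pairs are visited can change the sweep in which a pair gets marked, but not the final table, so this need not mirror the column-by-column order of the slides.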
Example 3 :
d
a b c d e
b
c ✔ ✔
d ✔ ✔
e ✔ ✔
f ✔ ✔ ✔ ✔ ✔
Proving Language not to be Regular
• The class of languages known as regular languages has at least four
different descriptions: the languages accepted by DFAs, by NFAs and by ε-NFAs,
and those defined by REs.
• Not every language is regular. To show that a language is not regular, the
powerful technique used is known as the Pumping Lemma.
• That is, we can always find a nonempty string y not too far from the
beginning of w that can be "pumped";
• i.e. repeating y any number of times, or deleting it (the case k= 0),
keeps the resulting string in the language L.
Proof: The Pumping Lemma - Proof
• Suppose L is regular
• Then L is recognized by some DFA A with n states, and L = L(A).
• Let a string w = a1a2...am ∈ L, where m ≥ n
• Let pi = δ(q0, a1a2...ai) for i = 0, 1, ..., n (so p0 = q0)
• These are n + 1 states drawn from only n, so by the pigeonhole principle there exist i and j
with 0 ≤ i < j ≤ n such that pi = pj
• Now we have w = xyz where
1. x = a1a2...ai
2. y = ai+1ai+2...aj
3. z = aj+1aj+2...am
The Pumping Lemma – Proof (cont.)
• That is, x takes us to pi, once; y takes us from pi back to pi. (since pi is also
pj), and z is the balance of w.
• So we have the following figure, and every string longer than the number of
states must cause a state to repeat.
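The proof is constructive: running the DFA on w and stopping at the first repeated state yields x, y and z. The sketch below illustrates this on a hypothetical two-state DFA (not from the notes) that accepts strings over {0, 1} with an even number of 0's; the function names are also assumptions.

```python
def pumping_split(delta, start, w):
    """Split w into (x, y, z) at the first state that repeats."""
    seen = {start: 0}       # state -> number of symbols read when first reached
    state = start
    for j, symbol in enumerate(w, start=1):
        state = delta[state, symbol]
        if state in seen:
            i = seen[state]             # p_i == p_j with i < j
            return w[:i], w[i:j], w[j:]
        seen[state] = j
    raise ValueError("no state repeats: w is shorter than the number of states")


def accepts(delta, start, accepting, w):
    """Run the DFA on w and report whether it ends in an accepting state."""
    state = start
    for symbol in w:
        state = delta[state, symbol]
    return state in accepting


# Hypothetical example DFA: even number of 0's.
DELTA = {
    ("even", "0"): "odd", ("even", "1"): "even",
    ("odd", "0"): "even", ("odd", "1"): "odd",
}

x, y, z = pumping_split(DELTA, "even", "0110")
print((x, y, z))
for k in range(4):
    pumped = x + y * k + z
    print(pumped, accepts(DELTA, "even", {"even"}, pumped))
```

Because y leads from the repeated state back to itself, repeating or deleting y does not change the state reached before z is read, so every pumped string is accepted exactly when w is.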
We have w = xyz
w = a^n b^n
In the given language we have an equal number of a's and b's. Since w ∈ L, it must satisfy
this condition. Let us take i = 0.
Then xz = a^n b^{n-m} if y = b^m, and
xz = a^{n-k} b^n if y = a^k,
where k, m >= 1, so the numbers of a's and b's differ.
So xz ∉ L. This case also gives a contradiction.
Case 3: The string y consists of both a's and b's, i.e. y = a^k b^m (k, m >= 1).
We have w = xyz
w = a^n b^n
w = a^{n-k} a^k b^m b^{n-m}
In the given language we have an equal number of a's and b's. Since w ∈ L, it must satisfy this
condition. Let us take i = 2.
xy^2z = xyyz
= a^{n-k} a^k b^m a^k b^m b^{n-m}
In this case, the string xyyz has an equal number of a's and b's, but they are out
of order, with some b's before a's. Hence it is not a member of L, which contradicts
our assumption.
Thus, in all cases we get a contradiction. Therefore, L is not regular.
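The case analysis can be confirmed by exhaustive search for small n. The sketch below is an illustration with hypothetical function names: it tries every split w = xyz with nonempty y and checks whether both xz (i = 0) and xyyz (i = 2) stay in L = { a^n b^n }.

```python
def in_language(w):
    """Membership in L = { a^n b^n : n >= 0 }."""
    half = len(w) // 2
    return len(w) % 2 == 0 and w == "a" * half + "b" * half


def pumpable_splits(w, exponents=(0, 2)):
    """Splits w = xyz (y nonempty) for which every x y^i z stays in L."""
    survivors = []
    for start in range(len(w)):
        for end in range(start + 1, len(w) + 1):
            x, y, z = w[:start], w[start:end], w[end:]
            if all(in_language(x + y * i + z) for i in exponents):
                survivors.append((x, y, z))
    return survivors


for n in range(1, 6):
    w = "a" * n + "b" * n
    print(w, pumpable_splits(w))
```

No split survives for any of the tested strings, which matches the argument: deleting a block of a's or of b's unbalances the counts, and doubling a mixed block puts b's before a's.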
Pumping lemma for regular languages
The pumping lemma for regular languages describes an essential property of all
regular languages.
Informally, it says that all sufficiently long words in a regular language may be
pumped, that is, have a middle section of the word repeated an arbitrary number
of times, to produce a new word which also lies within the same language.