UNIT-V String Matching
➢Introduction
➢Naïve String-Matching Algorithm
➢Rabin-Karp Algorithm
➢Knuth-Morris-Pratt Algorithm
➢Tries
➢Suffix Tries
➢NP-hard and NP-complete problems
➢Cook’s Theorem
Naïve pattern searching is the simplest of the pattern searching algorithms. It is a straightforward (Brute Force) method: it checks every character of the main string against the pattern. The algorithm is practical for smaller texts. It needs no pre-processing phase, finds the substring in a single scan of the string, and occupies no extra space to perform the operation.
The time complexity of the Naïve Pattern Search method is O(m*n), where m is the size of the pattern and n is the size of the main string.
• The Brute Force algorithm compares the pattern to the text, one character at a time, until mismatching characters are found:
TWO ROADS DIVERGED IN A YELLOW WOOD
ROADS
TWO ROADS DIVERGED IN A YELLOW WOOD
 ROADS
TWO ROADS DIVERGED IN A YELLOW WOOD
  ROADS
TWO ROADS DIVERGED IN A YELLOW WOOD
   ROADS
TWO ROADS DIVERGED IN A YELLOW WOOD
    ROADS        (match found at shift 4)
• The algorithm can be designed to stop on either the first occurrence of the pattern, or
upon reaching the end of the text.
Algorithm:
begin
    patlen = pattern length;
    textlen = text length;
    for i = 0 to (textlen - patlen) do
        for j = 0 to (patlen - 1) do
            if (text[i + j] ≠ p[j]) then
                break the inner loop;
        if (j == patlen) then    // inner loop ran to completion: full match
            display the index i;
End.
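The pseudocode above translates directly into Python; the function name and the choice to return a list of match positions are our own, not prescribed by the notes:

```python
def naive_search(text, pattern):
    """Return every index where pattern occurs in text (brute force)."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):           # each candidate shift of the pattern
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                       # characters match so far
        if j == m:                       # whole pattern matched at shift i
            matches.append(i)
    return matches
```

Because it restarts the comparison from scratch at every shift, this sketch exhibits exactly the O(m*n) worst case discussed below.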
Example:
tetththeheehthtehtheththehehtht
the
Worst case:
• Given a pattern M characters in length and a text N characters in length, the algorithm compares the pattern to each substring of the text of length M.
For example, M = 5.
1) A A A A A A A A A A A A A A A A A A A A A A A A A A A H
AAAAH 5 comparisons made
2) A A A A A A A A A A A A A A A A A A A A A A A A A A A H
AAAAH 5 comparisons made
3) A A A A A A A A A A A A A A A A A A A A A A A A A A A H
AAAAH 5 comparisons made
4) A A A A A A A A A A A A A A A A A A A A A A A A A A A H
AAAAH 5 comparisons made
5) A A A A A A A A A A A A A A A A A A A A A A A A A A A H
AAAAH 5 comparisons made
6) ....
N-M+1) A A A A A A A A A A A A A A A A A A A A A A A A A A A H
AAAAH 5 comparisons made
• Total number of comparisons: M (N-M+1)
• Worst case time complexity: Ο(MN)
Rabin-Karp Algorithm
1. The Rabin-Karp string searching algorithm calculates a hash value for the
pattern, and for each M-character subsequence of text to be compared.
2. If the hash values are unequal, the algorithm will calculate the hash value for
next M-character sequence.
3. If the hash values are equal, the algorithm will do a Brute Force comparison
between the pattern and the M-character sequence.
4. In this way, there is only one comparison per text subsequence, and Brute
Force is only needed when hash values match.
Let us assign a numerical value v(c) to each character we will be using in the problem. Here, we have taken the first ten letters only (i.e., A to J), with v(A) = 1, v(B) = 2, ..., v(J) = 10. Let 'm' be the length of the pattern and 'n' the length of the text, and take d = 10 (the number of characters) and the prime modulus 13.
Example: pattern CDD (m = 3), with hash value
t(CDD) = (3 * 10^2 + 4 * 10^1 + 4 * 10^0) mod 13 = 344 mod 13 = 6.
The first text window is ABC:
t(ABC) = (1 * 10^2 + 2 * 10^1 + 3 * 10^0) mod 13 = 123 mod 13 = 6.
The hash values match, but the character-by-character check shows ABC ≠ CDD (a spurious hit), so we slide the window.
We calculate the hash value of the next window by subtracting the contribution of the first character and adding the next character, as shown below:
t = (d * (t - v[character to be removed] * h) + v[character to be added]) mod 13
  = (10 * (6 - 1 * 9) + 3) mod 13 = 12
where h = d^(m-1) mod 13 = 10^2 mod 13 = 100 mod 13 = 9.
Computed directly for the window BCC, the same value is
t = (2 * 10^2 + 3 * 10^1 + 3 * 10^0) mod 13 = 233 mod 13 = 12.
For BCC, t = 12 (≠ 6), so we go on to the next window. After a few more slides, we reach the window CDD in the text, whose hash value equals 6, and the brute-force comparison confirms the match.
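The procedure can be sketched in Python, assuming (as in the worked example) text characters drawn from A to J, d = 10 and modulus 13; the function name and default arguments are our own:

```python
def rabin_karp(text, pattern, d=10, q=13):
    """Rabin-Karp search using the letter values A=1 .. J=10 from the example."""
    v = lambda c: ord(c) - ord('A') + 1   # numerical value of a character
    n, m = len(text), len(pattern)
    h = pow(d, m - 1, q)                  # weight of the leading character, mod q
    p_hash = t_hash = 0
    for i in range(m):                    # hash of the pattern and first window
        p_hash = (d * p_hash + v(pattern[i])) % q
        t_hash = (d * t_hash + v(text[i])) % q
    matches = []
    for s in range(n - m + 1):
        # brute-force comparison only on a hash hit (filters spurious hits)
        if p_hash == t_hash and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:                     # roll the hash to the next window
            t_hash = (d * (t_hash - v(text[s]) * h) + v(text[s + m])) % q
    return matches
```

Python's `%` always yields a non-negative result, so no extra correction is needed after the subtraction in the rolling step.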
Knuth, Morris, and Pratt introduced a linear-time algorithm for the string-matching problem. A matching time of O(n) is achieved by avoiding comparisons with elements of 'S' that have previously been involved in a comparison with some element of the pattern 'p' to be matched; i.e., backtracking on the string 'S' never occurs.
1. The Prefix Function (Π): The Prefix Function, Π for a pattern encapsulates
knowledge about how the pattern matches against the shift of itself. This information
can be used to avoid a useless shift of the pattern 'p.' In other words, this enables
avoiding backtracking of the string 'S.'
2. The KMP Matcher: With string 'S', pattern 'p', and prefix function 'Π' as inputs, it finds the occurrences of 'p' in 'S' and returns the number of shifts of 'p' after which an occurrence is found.
In the pseudocode for computing the prefix function (COMPUTE-PREFIX-FUNCTION), the for loop from step 4 to step 10 runs 'm' times, and steps 1 to 3 take constant time. Hence the running time of computing the prefix function is O(m).
Solution:
Initially: m = length [p] = 7
Π [1] = 0
k=0
KMP-MATCHER (T, P)
1. n = length [T];
2. m = length [P];
3. Π = COMPUTE-PREFIX-FUNCTION (P)
4. q = 0                  // number of characters matched
5. for i = 1 to n do      // scan T from left to right
6. {
7.     while ((q > 0) and (P [q + 1] ≠ T [i])) do
8.         q = Π [q]      // next character does not match
9.     if (P [q + 1] = T [i]) then q = q + 1    // next character matches
10.    if (q = m) then    // is all of P matched?
11.    {
12.        print "Pattern occurs with shift" i - m
13.        q = Π [q]      // look for the next match
14.    }
15. }
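The matcher and the prefix function together translate into the following Python sketch (0-indexed, unlike the 1-indexed pseudocode; the function names are our own):

```python
def compute_prefix_function(p):
    """pi[q] = length of the longest proper prefix of p[:q+1] that is also its suffix."""
    m = len(p)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]             # fall back to the next shorter border
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_search(t, p):
    """Return every shift at which p occurs in t, scanning t only once."""
    pi = compute_prefix_function(p)
    shifts = []
    q = 0                             # number of pattern characters matched
    for i, c in enumerate(t):
        while q > 0 and p[q] != c:
            q = pi[q - 1]             # mismatch: reuse prefix information
        if p[q] == c:
            q += 1
        if q == len(p):               # all of p matched
            shifts.append(i - len(p) + 1)
            q = pi[q - 1]             # look for the next, possibly overlapping, match
    return shifts
```

Because `q` never decreases past what the prefix function allows, the text index `i` never moves backwards, which is exactly the "no backtracking on S" property claimed above.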
Let us execute the KMP Algorithm to find whether 'P' occurs in 'T.'
For 'p', the prefix function Π was computed previously and is as follows:
Solution:
Initially: n = size of T = 15, m = size of P = 7
Pattern 'P' has been found to occur in string 'T'. The total number of shifts that took place for the match to be found is i - m = 13 - 7 = 6 shifts.
1. In computer science, a trie is also called a digital tree and sometimes a radix tree or prefix tree.
2. A trie is a tree-based data structure for storing strings in order to support fast pattern matching.
3. Tries are used for information retrieval.
4. A trie stores a character in each node, not the whole key.
5. The path from the root to a node is associated with a key.
6. A trie uses the characters of a key to guide the search process.
7. All the descendants of a node have a common prefix of the string associated with that node.
Types:
I. Standard Tries:
The standard trie for a set of strings S is an ordered tree such that:
- each node but the root is labelled with a character;
- the children of a node are alphabetically ordered;
- the paths from the root to the external nodes yield the strings of S.
(Figure: the standard trie for the set S = {bear, bell, bid, bull, buy, sell, stock, stop}; each node carries a True/False flag marking whether a complete word ends there.)
II. Suffix Tries:
A suffix trie is the trie that stores all the suffixes of a given string. Example string, with character indices:
Index: 0 1 2 3 4 5 6 7
Char:  M I N I M I Z E
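A minimal standard-trie sketch in Python (class and method names are our own choices), using the word-end flag described above:

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # maps a character to its child node
        self.is_word = False   # True if some stored string ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:        # characters of the key guide the descent
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True    # mark the end of a complete word

    def search(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word    # a bare prefix is not a match
```

Inserting the eight strings from the figure and then searching shows that "bell" is found while the bare prefix "be" is not.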
Example (nondeterministic sorting): Choice, Failure, and Success below are the nondeterministic primitives; Choice(1, n) nondeterministically picks a value between 1 and n.
Algorithm Nsort(A, n)
{
for i = 1 to n do B[i] = 0; // initialize B
for i = 1 to n do
{
j = Choice (1, n);
If (B[j] ≠ 0) then Failure ();
B[j] = A[i] ;
}
for i = 1 to n-1 do //verify order
If(B[i] > B[i + 1]) then Failure();
write (B[1…n]);
Success ();
}
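Nsort splits into a nondeterministic guess (the Choice calls) and a deterministic polynomial-time check. The check alone can be sketched deterministically in Python (the function name and the certificate encoding are our own):

```python
def verify_sorted_placement(a, placement):
    """Check in O(n log n) whether `placement` (a guessed list of target
    positions, one per element) arranges list `a` into nondecreasing order.
    This is the 'verify' half of Nsort; the guessing half is Choice."""
    n = len(a)
    if sorted(placement) != list(range(n)):   # placement must be a permutation
        return False                          # corresponds to Failure()
    b = [None] * n
    for i, j in enumerate(placement):         # B[j] = A[i], as in Nsort
        b[j] = a[i]
    # verify order, as in the final loop of Nsort
    return all(b[i] <= b[i + 1] for i in range(n - 1))
```

The verification is polynomial even though finding a correct placement by blind guessing is not; that asymmetry is the essence of the class NP defined next.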
Class P:
P is the set of all problems solvable by deterministic algorithms in polynomial time.
Examples: sorting, searching, MST, Huffman coding, etc.
Class NP:
NP is the set of all problems solvable by nondeterministic algorithms in polynomial time.
For example, for three boolean variables x1, x2, x3 there are 2^3 = 8 possible truth assignments that a nondeterministic algorithm can guess:
S. No   X1   X2   X3
  1      0    0    0
  2      0    0    1
  3      0    1    0
  4      0    1    1
  5      1    0    0
  6      1    0    1
  7      1    1    0
  8      1    1    1
Reduces to:
Let L1 and L2 be two problems. Problem L1 reduces to L2 (also written L1 ∝ L2) if and only if there is a way to solve L1 by a deterministic polynomial-time algorithm that uses, as a subroutine, a deterministic algorithm that solves L2 in polynomial time. That is, if we have a polynomial-time algorithm for L2, then we can solve L1 in polynomial time.
NP-hard:
A problem L is NP-hard if and only if satisfiability reduces to L (satisfiability ∝ L).
NP-complete:
A problem L is NP-complete if and only if L is NP-hard and L is in NP.
Example:
The satisfiability problem is NP-hard and satisfiability is in NP. So, the satisfiability problem is NP-complete.
The satisfiability problem reduces to the 0/1 knapsack problem: a formula for the satisfiability problem can be converted into an instance of the 0/1 knapsack problem in polynomial time.
Suppose the 0/1 knapsack instance is n = 3, m = 8, (w1, w2, w3) = (5, 4, 3) and (p1, p2, p3) = (10, 8, 12).
The solution set is X = {x1, x2, x3} with each xi = 0/1, so the possibilities for the xi's are:
S. No   X1   X2   X3
  1      0    0    0
  2      0    0    1
  3      0    1    0
  4      0    1    1
  5      1    0    0
  6      1    0    1
  7      1    1    0
  8      1    1    1
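For the instance above, the 8 assignments can be checked exhaustively; the sketch below (names are ours) returns the maximum-profit assignment whose weight fits the capacity m:

```python
from itertools import product

def best_knapsack(weights, profits, capacity):
    """Try all 0/1 assignments and keep the feasible one with maximum profit."""
    best_x, best_p = None, -1
    for x in product((0, 1), repeat=len(weights)):   # the 2^n possibilities
        w = sum(wi * xi for wi, xi in zip(weights, x))
        p = sum(pi * xi for pi, xi in zip(profits, x))
        if w <= capacity and p > best_p:             # feasible and better
            best_x, best_p = x, p
    return best_x, best_p
```

On the instance n = 3, m = 8, w = (5, 4, 3), p = (10, 8, 12), rows 7 and 8 of the table are infeasible (weight > 8) and the best feasible row is x = (1, 0, 1) with profit 22.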
Similarly, the satisfiability problem reduces to the clique decision problem. Take the formula
F = C1 ∧ C2 ∧ C3 = (x1 ∨ x2) ∧ (¬x1 ∨ ¬x2) ∧ (x1 ∨ x3)
and build a graph with a vertex <literal, clause number> for every literal occurrence; F is satisfiable if and only if the graph has a clique of size 3 (the number of clauses).
Suppose we take the subgraph V = {<¬x1, 2>, <x2, 1>, <x3, 3>} of clique size 3.
The values are x1 = 0, x2 = 1 and x3 = 1, so
F = C1 ∧ C2 ∧ C3 = (0 ∨ 1) ∧ (1 ∨ 0) ∧ (0 ∨ 1) = 1.
Suppose we take the subgraph V = {<x1, 1>, <¬x2, 2>, <x3, 3>} of clique size 3.
The values are x1 = 1, x2 = 0 and x3 = 1, so
F = C1 ∧ C2 ∧ C3 = (1 ∨ 0) ∧ (0 ∨ 1) ∧ (1 ∨ 1) = 1.
Suppose we take the subgraph V = {<x1, 1>, <¬x2, 2>, <x1, 3>} of clique size 3.
The values are x1 = 1, x2 = 0 and x3 = 0, so
F = C1 ∧ C2 ∧ C3 = (1 ∨ 0) ∧ (0 ∨ 1) ∧ (1 ∨ 0) = 1.
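The three evaluations can be checked mechanically with a small Python function encoding the formula F = (x1 ∨ x2) ∧ (¬x1 ∨ ¬x2) ∧ (x1 ∨ x3) (the function name is ours):

```python
def F(x1, x2, x3):
    """Evaluate F = (x1 or x2) and (not x1 or not x2) and (x1 or x3),
    with the variables given as 0/1."""
    c1 = x1 or x2                 # clause C1
    c2 = (not x1) or (not x2)     # clause C2
    c3 = x1 or x3                 # clause C3
    return bool(c1) and bool(c2) and bool(c3)
```

All three assignments obtained from the size-3 cliques satisfy F, while e.g. the all-zero assignment does not.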
So, the satisfiability problem reduces to the clique decision problem.
Hence the clique decision problem is an NP-hard problem.
Cook’s theorem:
Satisfiability is in P if and only if P = NP.
Proof:
Suppose P = NP.
We already know that satisfiability is in NP; since NP = P, satisfiability is in P.
Conversely, suppose satisfiability is in P.
We show that P = NP. Clearly P ⊆ NP.
Now we show that NP ⊆ P.
To do this, we show how to obtain from any polynomial time nondeterministic
decision algorithm A and input I a formula Q (A, I) such that
Q is satisfiable if and only if A has a successful termination with input I.
If the length of the input I is 'n' and the time complexity of A is p(n), then the
length of Q is O(p^3(n) log n) = O(p^4(n)).
The time needed to construct Q is also O(p^3(n) log n).
A deterministic algorithm Z to determine the outcome of A on any input I can
be easily obtained.
Algorithm Z simply computes Q and then uses a deterministic algorithm for
satisfiability problem to determine whether Q is satisfiable.
If O(q(m)) is the time needed to determine whether a formula of length m is
satisfiable, then the complexity of Z is O(p^3(n) log n + q(p^3(n) log n)).
If satisfiability is in P, then q(m) is a polynomial function of m and the
complexity of Z becomes O(r(n)) for some polynomial r().
Hence, if satisfiability is in P, then P = NP.