
String Matching

The document provides an overview of string matching algorithms, including the naive algorithm, Rabin-Karp algorithm, and Knuth-Morris-Pratt (KMP) algorithm. The naive algorithm has a worst-case running time of Θ((n-m+1)m) as it compares the pattern to every length-m substring of the text. Rabin-Karp uses hashing to reduce the average running time to O(n+m) but can have spurious hits. KMP runs in optimal O(n+m) time by intelligently shifting the pattern on a mismatch using a failure function.


Outline

String Matching
Introduction
Naive Algorithm
Rabin-Karp Algorithm
Knuth-Morris-Pratt (KMP) Algorithm

Introduction
What is string matching?
Finding all occurrences of a pattern in a given text (or body of text).

Many applications:
Searching while using an editor/word processor/browser
Login name & password checking
Virus detection
Header analysis in data communications
DNA sequence analysis, web search engines (e.g. Google), image analysis

String-Matching Problem
The text is in an array T[1..n] of length n.
The pattern is in an array P[1..m] of length m.
Elements of T and P are characters from a finite alphabet Σ.
E.g., Σ = {0, 1} or Σ = {a, b, ..., z}

Usually T and P are called strings of characters

String-Matching Problem

(cont'd)

We say that pattern P occurs with shift s in text T if:

a) 0 ≤ s ≤ n - m, and
b) T[(s+1)..(s+m)] = P[1..m]

If P occurs with shift s in T, then s is a valid shift; otherwise s is an invalid shift.
String-matching problem: finding all valid shifts for a given T and P.

Example 1
text T (positions 1..13):    a b c a b a a b c a b a c
pattern P (positions 1..4):  a b a a

With s = 3, T[4..7] = a b a a = P[1..4], so shift s = 3 is a valid shift
(n = 13, m = 4, and 0 ≤ s ≤ n - m holds).

Example 2

text T (positions 1..13):    a b c a b a a b c a b a a
pattern P (positions 1..4):  a b a a

Both s = 3 (T[4..7] = a b a a) and s = 9 (T[10..13] = a b a a) are valid shifts.

Naive String-Matching Algorithm


Input: Text string T[1..n] and pattern P[1..m]
Result: All valid shifts displayed

NAIVE-STRING-MATCHER(T, P)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n - m
      if P[1..m] = T[(s+1)..(s+m)]
          print "pattern occurs with shift" s
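A minimal Python sketch of the same naive matcher, using 0-based shifts (the function name and the list-returning interface are choices made here, not from the slides):

def naive_string_matcher(T, P):
    """Return all valid shifts s (0-based) where P occurs in T."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):
        # Compare the pattern against the length-m window starting at s.
        if T[s:s + m] == P:
            shifts.append(s)
    return shifts

# Example from the later slides: the pattern occurs with shifts 0 and 3.
print(naive_string_matcher("abcabca", "abca"))   # [0, 3]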

Naive Algorithm

The naive algorithm checks, at every position in the text between 0 and n - m, whether an occurrence of the pattern starts there or not. After each attempt, it shifts the pattern by exactly one position to the right.

Example (scanning left to right):
  text:       a b c a b c a
  shift = 0:  a b c a          (match)
  shift = 1:    a b c a        (mismatch)
  shift = 2:      a b c a      (mismatch)
  shift = 3:        a b c a    (match)

Analysis: Worst-case Example


pattern P (positions 1..4):  a a a b
text T (positions 1..13):    a a a a a a a a a a a a a

At every shift the first three characters match and the comparison against b fails,
so all m = 4 pattern characters are examined before the pattern is shifted by one.

Worst-case Analysis

There are m comparisons for each shift in the worst case.
There are n - m + 1 shifts.
So, the worst-case running time is Θ((n - m + 1)m).
In the example on the previous slide, we have (13 - 4 + 1)·4 = 40 comparisons in total.

The naive method is inefficient because information gained from one shift is not used again at later shifts.

Analysis
Brute force pattern matching runs in time O(mn) in the worst case.
But most searches of ordinary text take O(m+n), which is very quick.

continued

Brute-force Analysis (Best)


Best Case

Example 1: pattern found at the first position of the text
  Text: 0000000000000000001    Pattern: 000    Cost = O(m)

Example 2: pattern not found, with a mismatch on the first character at every shift
  Text: 0000000000000000001    Pattern: 11     Cost = O(n + m)

Naive Algorithm

Example (scanning right to left):
  text:       a b c a b c a
  shift = 3:        a b c a
  shift = 2:      a b c a
  shift = 1:    a b c a
  shift = 0:  a b c a

The pattern occurs with shifts 0 and 3.

Rabin-Karp Algorithm

Has a worst-case running time of O((n - m + 1)m), but its average-case running time is O(n + m).

Also works well in practice.

Based on the number-theoretic notion of modular equivalence.
We assume that Σ = {0, 1, 2, ..., 9}, i.e., each character is a decimal digit.
In general, use radix d where d = |Σ|.

Rabin-Karp Approach
We can view a string of k characters (digits) as a length-k decimal number.
E.g., the string 31425 corresponds to the decimal number 31,425.

Given a pattern P[1..m], let p denote the corresponding decimal value.
Given a text T[1..n], let t_s denote the decimal value of the length-m substring T[(s+1)..(s+m)], for s = 0, 1, ..., n - m.

Rabin-Karp Approach

(cont'd)

t_s = p iff T[(s+1)..(s+m)] = P[1..m], so s is a valid shift iff t_s = p.

p can be computed in O(m) time (Horner's rule):
  p = P[m] + 10(P[m-1] + 10(P[m-2] + ... + 10·P[1]))

t_0 can similarly be computed in O(m) time.
The remaining values t_1, t_2, ..., t_{n-m} can be computed in O(n - m) time, since t_{s+1} can be computed from t_s in constant time.
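A tiny Python sketch of this O(m) preprocessing step (the function name is a choice made here; the digit string is the slides' example):

def decimal_value(digits):
    """Horner's rule: evaluate a digit string as a decimal number in O(m) time."""
    p = 0
    for d in digits:
        p = 10 * p + int(d)
    return p

print(decimal_value("31425"))   # 31425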

Rabin-Karp Approach

(cont'd)

t_{s+1} = 10(t_s - 10^{m-1}·T[s+1]) + T[s+m+1]

E.g., if T = ..., 3, 1, 4, 1, 5, 2, ..., m = 5 and t_s = 31,415, then
  t_{s+1} = 10(31415 - 10000·3) + 2 = 14152

Thus we can compute p in Θ(m) time and t_0, t_1, ..., t_{n-m} in Θ(n - m + 1) time, and we can find all occurrences of the pattern P[1..m] in the text T[1..n] with Θ(m) preprocessing time and Θ(n - m + 1) matching time.

But there is a problem: this assumes p and t_s are small numbers. They may be too large to work with easily.

Rabin-Karp Approach

(cont'd)

Solution: we can use modular arithmetic with a suitable modulus q:

  t_{s+1} ≡ (10(t_s - T[s+1]·h) + T[s+m+1]) (mod q)
  where h = 10^{m-1} (mod q)

q is chosen as a small prime number, e.g., q = 13 for radix 10.

Generally, if the radix is d, then dq should fit within one computer word.

How values modulo 13 are computed

Sliding the window over ... 3 1 4 1 5 2 ...: the old high-order digit is 3 and the new low-order digit is 2, so

  14152 ≡ ((31415 - 3·10000)·10 + 2)  (mod 13)
        ≡ ((7 - 3·3)·10 + 2)          (mod 13)    [since 31415 ≡ 7 and h = 10000 ≡ 3 (mod 13)]
        ≡ 8                           (mod 13)
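A quick Python check of this computation, using the recurrence above (the variable names are choices made here):

q = 13
h = pow(10, 4, q)                     # h = 10^(m-1) mod q = 3
t_s = 31415 % q                       # 31415 ≡ 7 (mod 13)
t_next = (10 * (t_s - 3 * h) + 2) % q
print(t_next, 14152 % q)              # 8 8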

Problem of Spurious Hits


t_s ≡ p (mod q) does not imply that t_s = p.
Modular equivalence does not necessarily mean that two integers are equal.

A case in which t_s ≡ p (mod q) but t_s ≠ p is called a spurious hit.
On the other hand, if two integers are not modular equivalent, then they cannot be equal.

Example

pattern: 3 1 4 1 5,  p ≡ 7 (mod 13)

text (positions 1..14):            2  3  1  4  1  5  2  6  7  3  9  9  2  1
window values mod 13 (s = 0..9):   1  7  8  4  5  10 11  7  9  11

The window with value 7 at s = 1 (3 1 4 1 5) is a valid match; the window with value 7 at s = 7 (6 7 3 9 9) is a spurious hit.

Rabin-Karp Algorithm

Basic structure is like the naive algorithm, but it uses modular arithmetic as described.
For each hit, i.e., for each s where t_s ≡ p (mod q), verify character by character whether s is a valid shift or a spurious hit.
In the worst case, every shift is verified, so the running time can be shown to be O((n - m + 1)m).

The average-case running time is O(n + m).

Example 2
Let T = a b c b a b and P = a b c.
Take a = 97, b = 98, c = 99 (i.e., the ASCII values of the characters), so the radix is d = 256.
The integer value of P is p = 99 + 256(98 + 256·97) = 6,382,179; in practice we keep only its value modulo a suitable small prime q.
In similar fashion, we can calculate the hash value of each length-m window of the text and compare it with p mod q to check for a valid match or a spurious hit (as in the previous slides).

Analysis
In the worst case, every shift is verified, so the running time can be shown to be O((n - m + 1)m).
The average-case running time is O(n + m).
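Putting the pieces together, a minimal Python sketch of Rabin-Karp with radix d = 256 (character codes) and a small prime modulus (the function name, the choice q = 101, and the list-returning interface are choices made here, not from the slides):

def rabin_karp(T, P, d=256, q=101):
    """Return all valid shifts of P in T (0-based), verifying every hit."""
    n, m = len(T), len(P)
    if m > n:
        return []
    h = pow(d, m - 1, q)                 # h = d^(m-1) mod q
    p = t = 0
    for i in range(m):                   # preprocessing: hashes of P and of T[0..m-1]
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if t == p and T[s:s + m] == P:   # verify each hit to rule out spurious hits
            shifts.append(s)
        if s < n - m:                    # rolling hash: drop T[s], bring in T[s+m]
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q
    return shifts

print(rabin_karp("abcbab", "abc"))       # [0]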

3. The KMP Algorithm


The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm).
But it shifts the pattern more intelligently than the brute force algorithm.

continued

If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
Answer: shift so that the largest prefix of P[0 .. j-1] that is also a suffix of P[1 .. j-1] lines up with the end of the text matched so far; the length of that prefix becomes the new j.

Example

Suppose the first five pattern characters "a b a a b" have matched the text and a mismatch occurs at P[j] with j = 5; the new value is j = 2.

Why, with j == 5?
Find the largest prefix (start) of "a b a a b" (P[0..j-1]) which is also a suffix (end) of "b a a b" (P[1..j-1]).
Answer: "a b", so set j = 2 (the new j value).

KMP Failure Function


KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
  j = mismatch position in P[]
  k = position before the mismatch (k = j - 1)
The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].

Failure Function Example


P: "abaaba"
index: 0 1 2 3 4 5

  k      0  1  2  3  4
  F(k)   0  0  1  1  2

(k == j - 1)

F(k) is the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
In code, F() is represented by an array, like the table.

P: "abaaba" — why is F(4) == 2?

F(4) means:
  find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaab" that is also a suffix of "baab"
  = the size of "ab"
  = 2
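The slides do not give the preprocessing code; one standard linear-time way to build this table, sketched in Python (the function name is a choice made here), is:

def failure_function(P):
    """F[k] = size of the largest prefix of P[0..k] that is also a suffix of P[1..k]."""
    m = len(P)
    F = [0] * m
    k = 0                          # length of the prefix matched so far
    for i in range(1, m):
        while k > 0 and P[i] != P[k]:
            k = F[k - 1]           # fall back to the next-shorter prefix that still matches
        if P[i] == P[k]:
            k += 1
        F[i] = k
    return F

print(failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]; the slide's table lists F(0)..F(4)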

Using the Failure Function


Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm:
if a mismatch occurs at P[j] (i.e. P[j] != T[i]) and j > 0, then
  k = j - 1
  j = F(k)    // obtain the new j
(if j == 0, the text position i simply advances by one)
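A minimal Python sketch of the full KMP matcher built on this rule, reusing the failure_function sketch above (the names and the list-returning interface are choices made here):

def kmp_matcher(T, P):
    """Return all 0-based shifts where P occurs in T, scanning T left to right."""
    n, m = len(T), len(P)
    F = failure_function(P)        # O(m) preprocessing
    shifts = []
    j = 0                          # number of pattern characters currently matched
    for i in range(n):             # i never moves backwards in the text
        while j > 0 and T[i] != P[j]:
            j = F[j - 1]           # mismatch at P[j]: k = j - 1, j = F(k)
        if T[i] == P[j]:
            j += 1
        if j == m:                 # full match ending at text position i
            shifts.append(i - m + 1)
            j = F[j - 1]           # continue searching for further matches
    return shifts

# Example from the next slide: P = "abacab" occurs at shift 10 in T.
print(kmp_matcher("abacaabaccabacabaabb", "abacab"))   # [10]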

Example

T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b

Failure function for P:
  k      0  1  2  3  4
  F(k)   0  0  1  0  1

Trace (the numbers 1..19 are the character comparisons, in order):
  1st alignment: comparisons 1-6    (mismatch at P[5]; j becomes F(4) = 1)
  2nd alignment: comparison 7       (mismatch at P[1]; j becomes F(0) = 0)
  3rd alignment: comparisons 8-12   (mismatch at P[4]; j becomes F(3) = 0)
  4th alignment: comparison 13      (mismatch at P[0]; shift by one)
  5th alignment: comparisons 14-19  (all six characters match: P is found)

P: "abacab" — why is F(4) == 1?

F(4) means:
  find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaca" that is also a suffix of "baca"
  = the size of "a"
  = 1

KMP Advantages
KMP runs in optimal time: O(m+n)
very fast

The algorithm never needs to move backwards in the input text, T


this makes the algorithm good for processing very large files that are read in from external devices or through a network stream

KMP Disadvantages
KMP doesn't work as well as the size of the alphabet increases:
there is more chance of a mismatch (more possible mismatches), and
mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later.
