
String Matching

The document provides an overview of string matching algorithms, including the naive algorithm, Rabin-Karp algorithm, and Knuth-Morris-Pratt (KMP) algorithm. The naive algorithm has a worst-case running time of Θ((n-m+1)m) as it compares the pattern to every length-m substring of the text. Rabin-Karp uses hashing to reduce the average running time to O(n+m) but can have spurious hits. KMP runs in optimal O(n+m) time by intelligently shifting the pattern on a mismatch using a failure function.


Outline

String Matching
Introduction
Naive Algorithm
Rabin-Karp Algorithm
Knuth-Morris-Pratt (KMP) Algorithm

Introduction
What is string matching?
Finding all occurrences of a pattern in a given text (or body of text).

Many applications:
Searching while using an editor/word processor/browser
Login name & password checking
Virus detection
Header analysis in data communications
DNA sequence analysis, web search engines (e.g. Google), image analysis

String-Matching Problem
The text is in an array T[1..n] of length n.
The pattern is in an array P[1..m] of length m.
Elements of T and P are characters from a finite alphabet Σ.
E.g., Σ = {0, 1} or Σ = {a, b, ..., z}

Usually T and P are called strings of characters

String-Matching Problem

(cont'd)

We say that pattern P occurs with shift s in text T if:

a) 0 ≤ s ≤ n - m, and
b) T[(s+1)..(s+m)] = P[1..m]

If P occurs with shift s in T, then s is a valid shift; otherwise s is an invalid shift.
String-matching problem: finding all valid shifts for a given T and P.

Example 1
text T (positions 1..13):    a b c a b a a b c a b a c
pattern P (positions 1..4):  a b a a

With s = 3, T[4..7] = a b a a = P[1..4], so shift s = 3 is a valid shift
(n = 13, m = 4, and 0 ≤ s ≤ n - m holds).

Example 2

text T (positions 1..13):    a b c a b a a b c a b a a
pattern P (positions 1..4):  a b a a

Both s = 3 (T[4..7] = a b a a) and s = 9 (T[10..13] = a b a a) are valid shifts.

Naive String-Matching Algorithm


Input: Text string T[1..n] and pattern P[1..m]
Result: All valid shifts displayed

NAIVE-STRING-MATCHER(T, P)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n - m
      if P[1..m] = T[(s+1)..(s+m)]
          print "pattern occurs with shift" s
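A minimal Python sketch of the same naive matcher, using 0-based shifts (the function name and the list-returning interface are choices made here, not from the slides):

def naive_string_matcher(T, P):
    """Return all valid shifts s (0-based) where P occurs in T."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):
        # Compare the pattern against the length-m window starting at s.
        if T[s:s + m] == P:
            shifts.append(s)
    return shifts

# Example from the later slides: the pattern occurs with shifts 0 and 3.
print(naive_string_matcher("abcabca", "abca"))   # [0, 3]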

Naive Algorithm

The naive algorithm checks, at every position in the text between 0 and n - m, whether an occurrence of the pattern starts there or not. After each attempt, it shifts the pattern by exactly one position to the right.

Example (scanning left to right):
  text:       a b c a b c a
  shift = 0:  a b c a          (match)
  shift = 1:    a b c a        (mismatch)
  shift = 2:      a b c a      (mismatch)
  shift = 3:        a b c a    (match)

Analysis: Worst-case Example


pattern P (positions 1..4):  a a a b
text T (positions 1..13):    a a a a a a a a a a a a a

At every shift the first three characters match and the comparison against b fails,
so all m = 4 pattern characters are examined before the pattern is shifted by one.

Worst-case Analysis

There are m comparisons for each shift in the worst case.
There are n - m + 1 shifts.
So, the worst-case running time is Θ((n - m + 1)m).
In the example on the previous slide, we have (13 - 4 + 1)·4 = 40 comparisons in total.

The naive method is inefficient because information gained from one shift is not used again at later shifts.

Analysis
Brute force pattern matching runs in time O(mn) in the worst case.
But most searches of ordinary text take O(m+n), which is very quick.

continued

Brute-force Analysis (Best)


Best Case

Example 1: pattern found at the first position of the text
  Text: 0000000000000000001    Pattern: 000    Cost = O(m)

Example 2: pattern not found, with a mismatch on the first character at every shift
  Text: 0000000000000000001    Pattern: 11     Cost = O(n + m)

Naive Algorithm

Example (scanning right to left):
  text:       a b c a b c a
  shift = 3:        a b c a
  shift = 2:      a b c a
  shift = 1:    a b c a
  shift = 0:  a b c a

The pattern occurs with shifts 0 and 3.

Rabin-Karp Algorithm

Has a worst-case running time of O((n - m + 1)m), but its average-case running time is O(n + m).

Also works well in practice.

Based on the number-theoretic notion of modular equivalence.
We assume that Σ = {0, 1, 2, ..., 9}, i.e., each character is a decimal digit.
In general, use radix d where d = |Σ|.

Rabin-Karp Approach
We can view a string of k characters (digits) as a length-k decimal number.
E.g., the string 31425 corresponds to the decimal number 31,425.

Given a pattern P[1..m], let p denote the corresponding decimal value.
Given a text T[1..n], let t_s denote the decimal value of the length-m substring T[(s+1)..(s+m)], for s = 0, 1, ..., n - m.

Rabin-Karp Approach

(cont'd)

t_s = p iff T[(s+1)..(s+m)] = P[1..m], so s is a valid shift iff t_s = p.

p can be computed in O(m) time (Horner's rule):
  p = P[m] + 10(P[m-1] + 10(P[m-2] + ... + 10·P[1]))

t_0 can similarly be computed in O(m) time.
The remaining values t_1, t_2, ..., t_{n-m} can be computed in O(n - m) time, since t_{s+1} can be computed from t_s in constant time.
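A tiny Python sketch of this O(m) preprocessing step (the function name is a choice made here; the digit string is the slides' example):

def decimal_value(digits):
    """Horner's rule: evaluate a digit string as a decimal number in O(m) time."""
    p = 0
    for d in digits:
        p = 10 * p + int(d)
    return p

print(decimal_value("31425"))   # 31425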

Rabin-Karp Approach

(cont'd)

t_{s+1} = 10(t_s - 10^{m-1}·T[s+1]) + T[s+m+1]

E.g., if T = ..., 3, 1, 4, 1, 5, 2, ..., m = 5 and t_s = 31,415, then
  t_{s+1} = 10(31415 - 10000·3) + 2 = 14152

Thus we can compute p in Θ(m) time and t_0, t_1, ..., t_{n-m} in Θ(n - m + 1) time, and we can find all occurrences of the pattern P[1..m] in the text T[1..n] with Θ(m) preprocessing time and Θ(n - m + 1) matching time.

But there is a problem: this assumes p and t_s are small numbers. They may be too large to work with easily.

Rabin-Karp Approach

(cont'd)

Solution: we can use modular arithmetic with a suitable modulus q:

  t_{s+1} ≡ (10(t_s - T[s+1]·h) + T[s+m+1]) (mod q)
  where h = 10^{m-1} (mod q)

q is chosen as a small prime number, e.g., q = 13 for radix 10.

Generally, if the radix is d, then dq should fit within one computer word.

How values modulo 13 are computed

Sliding the window over ... 3 1 4 1 5 2 ...: the old high-order digit is 3 and the new low-order digit is 2, so

  14152 ≡ ((31415 - 3·10000)·10 + 2)  (mod 13)
        ≡ ((7 - 3·3)·10 + 2)          (mod 13)    [since 31415 ≡ 7 and h = 10000 ≡ 3 (mod 13)]
        ≡ 8                           (mod 13)
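A quick Python check of this computation, using the recurrence above (the variable names are choices made here):

q = 13
h = pow(10, 4, q)                     # h = 10^(m-1) mod q = 3
t_s = 31415 % q                       # 31415 ≡ 7 (mod 13)
t_next = (10 * (t_s - 3 * h) + 2) % q
print(t_next, 14152 % q)              # 8 8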

Problem of Spurious Hits


t_s ≡ p (mod q) does not imply that t_s = p.
Modular equivalence does not necessarily mean that two integers are equal.

A case in which t_s ≡ p (mod q) but t_s ≠ p is called a spurious hit.
On the other hand, if two integers are not modular equivalent, then they cannot be equal.

Example

pattern: 3 1 4 1 5,  p ≡ 7 (mod 13)

text (positions 1..14):            2  3  1  4  1  5  2  6  7  3  9  9  2  1
window values mod 13 (s = 0..9):   1  7  8  4  5  10 11  7  9  11

The window with value 7 at s = 1 (3 1 4 1 5) is a valid match; the window with value 7 at s = 7 (6 7 3 9 9) is a spurious hit.

Rabin-Karp Algorithm

Basic structure is like the naive algorithm, but it uses modular arithmetic as described.
For each hit, i.e., for each s where t_s ≡ p (mod q), verify character by character whether s is a valid shift or a spurious hit.
In the worst case, every shift is verified, so the running time can be shown to be O((n - m + 1)m).

The average-case running time is O(n + m).

Example 2
Let T = a b c b a b and P = a b c.
Take a = 97, b = 98, c = 99 (i.e., the ASCII values of the characters), so the radix is d = 256.
The integer value of P is p = 99 + 256(98 + 256·97) = 6,382,179; in practice we keep only its value modulo a suitable small prime q.
In similar fashion, we can calculate the hash value of each length-m window of the text and compare it with p mod q to check for a valid match or a spurious hit (as in the previous slides).

Analysis
In the worst case, every shift is verified, so the running time can be shown to be O((n - m + 1)m).
The average-case running time is O(n + m).
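Putting the pieces together, a minimal Python sketch of Rabin-Karp with radix d = 256 (character codes) and a small prime modulus (the function name, the choice q = 101, and the list-returning interface are choices made here, not from the slides):

def rabin_karp(T, P, d=256, q=101):
    """Return all valid shifts of P in T (0-based), verifying every hit."""
    n, m = len(T), len(P)
    if m > n:
        return []
    h = pow(d, m - 1, q)                 # h = d^(m-1) mod q
    p = t = 0
    for i in range(m):                   # preprocessing: hashes of P and of T[0..m-1]
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if t == p and T[s:s + m] == P:   # verify each hit to rule out spurious hits
            shifts.append(s)
        if s < n - m:                    # rolling hash: drop T[s], bring in T[s+m]
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q
    return shifts

print(rabin_karp("abcbab", "abc"))       # [0]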

3. The KMP Algorithm


The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm).
But it shifts the pattern more intelligently than the brute force algorithm.

continued

If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
Answer: shift so that the largest prefix of P[0 .. j-1] that is also a suffix of P[1 .. j-1] lines up with the end of the text matched so far; the length of that prefix becomes the new j.

Example

Suppose the first five pattern characters "a b a a b" have matched the text and a mismatch occurs at P[j] with j = 5; the new value is j = 2.

Why, with j == 5?
Find the largest prefix (start) of "a b a a b" (P[0..j-1]) which is also a suffix (end) of "b a a b" (P[1..j-1]).
Answer: "a b", so set j = 2 (the new j value).

KMP Failure Function


KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
  j = mismatch position in P[]
  k = position before the mismatch (k = j - 1)
The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].

Failure Function Example


P: "abaaba"
index: 0 1 2 3 4 5

  k      0  1  2  3  4
  F(k)   0  0  1  1  2

(k == j - 1)

F(k) is the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
In code, F() is represented by an array, like the table.

P: "abaaba" — why is F(4) == 2?

F(4) means:
  find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaab" that is also a suffix of "baab"
  = the size of "ab"
  = 2
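The slides do not give the preprocessing code; one standard linear-time way to build this table, sketched in Python (the function name is a choice made here), is:

def failure_function(P):
    """F[k] = size of the largest prefix of P[0..k] that is also a suffix of P[1..k]."""
    m = len(P)
    F = [0] * m
    k = 0                          # length of the prefix matched so far
    for i in range(1, m):
        while k > 0 and P[i] != P[k]:
            k = F[k - 1]           # fall back to the next-shorter prefix that still matches
        if P[i] == P[k]:
            k += 1
        F[i] = k
    return F

print(failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]; the slide's table lists F(0)..F(4)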

Using the Failure Function


Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm:
if a mismatch occurs at P[j] (i.e. P[j] != T[i]) and j > 0, then
  k = j - 1
  j = F(k)    // obtain the new j
(if j == 0, the text position i simply advances by one)
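A minimal Python sketch of the full KMP matcher built on this rule, reusing the failure_function sketch above (the names and the list-returning interface are choices made here):

def kmp_matcher(T, P):
    """Return all 0-based shifts where P occurs in T, scanning T left to right."""
    n, m = len(T), len(P)
    F = failure_function(P)        # O(m) preprocessing
    shifts = []
    j = 0                          # number of pattern characters currently matched
    for i in range(n):             # i never moves backwards in the text
        while j > 0 and T[i] != P[j]:
            j = F[j - 1]           # mismatch at P[j]: k = j - 1, j = F(k)
        if T[i] == P[j]:
            j += 1
        if j == m:                 # full match ending at text position i
            shifts.append(i - m + 1)
            j = F[j - 1]           # continue searching for further matches
    return shifts

# Example from the next slide: P = "abacab" occurs at shift 10 in T.
print(kmp_matcher("abacaabaccabacabaabb", "abacab"))   # [10]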

Example

T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b

Failure function for P:
  k      0  1  2  3  4
  F(k)   0  0  1  0  1

Trace (the numbers 1..19 are the character comparisons, in order):
  1st alignment: comparisons 1-6    (mismatch at P[5]; j becomes F(4) = 1)
  2nd alignment: comparison 7       (mismatch at P[1]; j becomes F(0) = 0)
  3rd alignment: comparisons 8-12   (mismatch at P[4]; j becomes F(3) = 0)
  4th alignment: comparison 13      (mismatch at P[0]; shift by one)
  5th alignment: comparisons 14-19  (all six characters match: P is found)

P: "abacab" — why is F(4) == 1?

F(4) means:
  find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaca" that is also a suffix of "baca"
  = the size of "a"
  = 1

KMP Advantages
KMP runs in optimal time: O(m+n)
very fast

The algorithm never needs to move backwards in the input text, T


this makes the algorithm good for processing very large files that are read in from external devices or through a network stream

KMP Disadvantages
KMP doesn't work as well as the size of the alphabet increases:
there is more chance of a mismatch (more possible mismatches), and
mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later.
