
Data Structures

R18 CSE/IT II year

By
D. Subhashini
Associate Professor
Department of Computer Science and Engineering
Aurora’s Technological and Research Institute

D. Subhashini, Assoc Prof CSE dept Aurora’s Technological and Research Institute
1
UNIT V
Pattern Matching and Tries: Pattern matching
algorithms - Brute force, the Boyer-Moore algorithm,
the Knuth-Morris-Pratt algorithm, Standard Tries,
Compressed Tries, Suffix Tries.

Pattern Matching/Searching
Pattern searching algorithms are sometimes also referred to as string searching algorithms
and are considered part of the string algorithms. These algorithms are useful for
searching for a string within another string.

Naive Algorithm for Pattern Matching

Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char
txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.

Examples:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10

Input: txt[] = "AABAACAADAABAABA"
pat[] = "AABA"
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 12

Pattern searching is an important problem in computer science. When we do search for a
string in notepad/word file or browser or database, pattern searching algorithms are used
to show the search results.
Naive Pattern Matching:
Slide the pattern over the text one position at a time and check for a match. If a match is
found, slide the pattern by 1 again to check for subsequent matches.
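The sliding-and-checking procedure just described can be sketched in Python (a minimal sketch; the function name follows the problem statement above):

```python
def search(pat, txt):
    """Naive pattern matching: try every alignment of pat in txt."""
    n, m = len(txt), len(pat)
    matches = []
    for i in range(n - m + 1):                 # every possible shift
        j = 0
        while j < m and txt[i + j] == pat[j]:  # compare characters one by one
            j += 1
        if j == m:                             # all m characters matched
            print("Pattern found at index", i)
            matches.append(i)
    return matches

search("TEST", "THIS IS A TEST TEXT")   # Pattern found at index 10
search("AABA", "AABAACAADAABAABA")      # found at indices 0, 9 and 12
```

Each shift costs up to m comparisons, which is where the O(m*(n-m+1)) worst case discussed below comes from.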

What is the best case?


The best case occurs when the first character of the pattern is not present in the
text at all.
txt[] = "AABCCAADDEE";
pat[] = "FAA";

The number of comparisons in best case is O(n).


What is the worst case?
The worst case of naive pattern searching occurs in the following scenarios.

1) When all characters of the text and pattern are the same.


txt[] = "AAAAAAAAAAAAAAAAAA";
pat[] = "AAAAA";

2) The worst case also occurs when only the last character is different.
txt[] = "AAAAAAAAAAAAAAAAB";
pat[] = "AAAAB";

The number of comparisons in the worst case is O(m*(n-m+1)). Although strings
which have repeated characters are not likely to appear in English text, they may
well occur in other applications (for example, in binary texts). The KMP matching
algorithm improves the worst case to O(n).

Boyer Moore Algorithm for Pattern
Searching
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char
txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.

Examples:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10

Input: txt[] = "AABAACAADAABAABA"
pat[] = "AABA"
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 12

Boyer Moore is a combination of the following two approaches.


1) Bad Character Heuristic
2) Good Suffix Heuristic

Both of the above heuristics can also be used independently to search a pattern in a text.
Let us first understand how two independent approaches work together in the Boyer Moore
algorithm. If we take a look at the Naive algorithm, it slides the pattern over the text one
by one. KMP algorithm does preprocessing over the pattern so that the pattern can be
shifted by more than one. The Boyer Moore algorithm does preprocessing for the same
reason. It processes the pattern and creates different arrays for both heuristics. At every

step, it slides the pattern by the maximum of the shifts suggested by the two heuristics, so it
uses the best of the two heuristics at every step. Unlike the previous pattern searching
algorithms, the Boyer Moore algorithm starts matching from the last character of the
pattern.

Bad Character Heuristic

The idea of the bad character heuristic is simple. The character of the text which doesn't
match the current character of the pattern is called the bad character. Upon
mismatch, we shift the pattern until

1) The mismatch becomes a match, or

2) The pattern P moves past the mismatched character.

Case 1: The mismatch becomes a match

We look up the position of the last occurrence of the mismatching character in the pattern;
if the mismatching character exists in the pattern, we shift the pattern so that it gets aligned
with the mismatching character in the text T.

Explanation:

In the above example, we get a mismatch at position 3. Here our mismatching character is
"A". Now we search for the last occurrence of "A" in the pattern. We find "A" at position 1 in
the pattern (displayed in blue), and this is its last occurrence. Now we shift the pattern 2
positions so that the "A" in the pattern gets aligned with the "A" in the text.

Case 2: The pattern moves past the mismatched character

We look up the position of the last occurrence of the mismatching character in the pattern;
if the character does not exist, we shift the pattern past the mismatching character.

Explanation:

Here we have a mismatch at position 7. The mismatching character "C" does not exist in the
pattern before position 7, so we shift the pattern past position 7, and eventually in the above
example we get a perfect match of the pattern (displayed in green). We do this
because "C" does not occur in the pattern, so at every shift before position 7 we would get a
mismatch and our search would be fruitless.

In the following implementation, we preprocess the pattern and store the last occurrence of
every possible character in an array of size equal to alphabet size. If the character is not
present at all, then it may result in a shift by m (length of pattern). Therefore, the bad
character heuristic takes O(n/m) time in the best case.

The Bad Character Heuristic may take O(mn) time in worst case. The worst case occurs
when all characters of the text and pattern are same. For example, txt[] =
“AAAAAAAAAAAAAAAAAA” and pat[] = “AAAAA”.
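The preprocessing just described can be sketched in Python. The 256-entry table is an assumption (an extended-ASCII alphabet); the function names are illustrative:

```python
NO_OF_CHARS = 256  # assumed alphabet size (extended ASCII)

def bad_char_table(pat):
    """last[c] = index of the last occurrence of character c in pat, -1 if absent."""
    last = [-1] * NO_OF_CHARS
    for i, ch in enumerate(pat):
        last[ord(ch)] = i
    return last

def bm_bad_char_search(pat, txt):
    n, m = len(txt), len(pat)
    last = bad_char_table(pat)
    matches, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pat[j] == txt[s + j]:  # match from the last character
            j -= 1
        if j < 0:                               # full match at shift s
            matches.append(s)
            # align the next text character with its last occurrence in pat
            s += m - last[ord(txt[s + m])] if s + m < n else 1
        else:
            # shift so the bad character aligns with its last occurrence in pat;
            # always move at least 1 (the occurrence may lie to the right of j)
            s += max(1, j - last[ord(txt[s + j])])
    return matches

print(bm_bad_char_search("AABA", "AABAACAADAABAABA"))  # [0, 9, 12]
```

The `max(1, ...)` guard is what keeps the worst case from looping: when the bad character's last occurrence lies to the right of the mismatch, the raw shift would be negative.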

Boyer Moore Algorithm | Good Suffix
heuristic
Just like the bad character heuristic, a preprocessing table is generated for the good suffix heuristic.

Good Suffix Heuristic

Let t be the substring of the text T which is matched with a substring of the pattern P. Now
we shift the pattern until:

1) Another occurrence of t in P is matched with t in T,

2) A prefix of P matches a suffix of t, or

3) P moves past t.

Case 1: Another occurrence of t in P is matched with t in T

The pattern P might contain a few more occurrences of t. In such a case, we try to shift the
pattern to align that occurrence with t in the text T. For example:

Explanation:

In the above example, we have a substring t of the text T matched with the pattern P (in green)
before the mismatch at index 2. Now we search for an occurrence of t ("AB") in P. We find
an occurrence starting at position 1 (in yellow), so we right-shift the
pattern 2 positions to align t in P with t in T. This is the weak rule of the original Boyer
Moore algorithm and is not very effective.

Case 2: A prefix of P matches a suffix of t in T
We will not always find an occurrence of t in P. Sometimes there is no
occurrence at all; in such cases we can sometimes find a suffix of t matching
a prefix of P and try to align them by shifting P. For example:

Explanation:

In the above example, we have t ("BAB") matched with P (in green) at indices 2-4 before the
mismatch. Because there is no other occurrence of t in P, we search for a prefix
of P which matches a suffix of t. We find the prefix "AB" (in yellow)
starting at index 0, which matches not the whole of t but the suffix "AB" of t
starting at index 3. So now we shift the pattern 3 positions to align the prefix with the suffix.

Case 3: P moves past t

If the above two cases are not satisfied, we shift the pattern past t. For example:

Explanation:

In the above example, there is no occurrence of t ("AB") in P, and there is also no prefix of
P which matches a suffix of t. So, in that case, we can never find a perfect match
before index 4, and we shift P past t, i.e., to index 5.
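The three cases can be folded into a single shift table. Below is a Python sketch of the widely used strong good suffix preprocessing (border positions bpos[] plus shift[]); the variable names follow common textbook presentations and are not from these notes:

```python
def preprocess_strong_suffix(pat):
    """Build shift[]: how far to slide for a mismatch just before position i."""
    m = len(pat)
    shift = [0] * (m + 1)
    bpos = [0] * (m + 1)   # bpos[i] = start of the widest border of pat[i..m-1]
    i, j = m, m + 1
    bpos[i] = j
    while i > 0:
        while j <= m and pat[i - 1] != pat[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i   # case 1: another occurrence of t in pat
            j = bpos[j]
        i -= 1
        j -= 1
        bpos[i] = j
    # cases 2 and 3: a prefix of pat matches a suffix of t, or pat moves past t
    j = bpos[0]
    for i in range(m + 1):
        if shift[i] == 0:
            shift[i] = j
        if i == j:
            j = bpos[j]
    return shift

def bm_good_suffix_search(pat, txt):
    n, m = len(txt), len(pat)
    shift = preprocess_strong_suffix(pat)
    matches, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pat[j] == txt[s + j]:  # match right to left
            j -= 1
        if j < 0:                               # full match at shift s
            matches.append(s)
            s += shift[0]
        else:
            s += shift[j + 1]                   # mismatch after suffix pat[j+1..m-1]
    return matches
```

In the full Boyer Moore algorithm this shift would be combined with the bad character shift by taking the maximum of the two at every step.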

KMP Algorithm for Pattern Searching
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char
txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.

Examples:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10

Input: txt[] = "AABAACAADAABAABA"
pat[] = "AABA"
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 12

KMP (Knuth Morris Pratt) Pattern Searching

The Naive pattern searching algorithm doesn’t work well in cases where we see many
matching characters followed by a mismatching character. Following are some examples.

txt[] = "AAAAAAAAAAAAAAAAAB"

pat[] = "AAAAB"

txt[] = "ABABABCABABABCABABABC"

pat[] = "ABABAC" (not a worst case, but a bad case for Naive)

The KMP matching algorithm uses the degenerating property of the pattern (the pattern
having the same sub-patterns appearing more than once) and improves the worst-case
complexity to O(n). The basic idea behind KMP is: whenever we detect a
mismatch (after some matches), we already know some of the characters in the text of the
next window. We take advantage of this information to avoid matching the characters that
we know will anyway match. Let us consider the example below to understand this.

Matching Overview

txt = "AAAAABAAABA"

pat = "AAAA"

We compare first window of txt with pat

txt = "AAAAABAAABA"

pat = "AAAA" [Initial position]

We find a match. This is same as Naive String Matching.

In the next step, we compare next window of txt with pat.

txt = "AAAAABAAABA"

pat = "AAAA" [Pattern shifted one position]

This is where KMP does optimization over Naive. In this second window,
we only compare the fourth A of the pattern with the fourth character of the current
window of the text to decide whether the current window matches or not. Since
we know the first three characters will anyway match, we skip matching the
first three characters.

Need for Preprocessing

An important question arises from the above explanation: how do we know how
many characters to skip? To know this, we pre-process the pattern and
prepare an integer array lps[] that tells us the count of characters to
be skipped.

Preprocessing Overview:

 The KMP algorithm preprocesses pat[] and constructs an auxiliary array lps[] of size m (the
same as the size of the pattern) which is used to skip characters while matching.

 The name lps indicates the longest proper prefix which is also a suffix. A proper prefix is a
prefix that is not the whole string. For example, the prefixes of “ABC” are “”, “A”,
“AB” and “ABC”. Its proper prefixes are “”, “A” and “AB”. The suffixes of the string are “”,
“C”, “BC” and “ABC”.

 We search for lps in sub-patterns. More precisely, we focus on substrings of the pattern that
are both a prefix and a suffix.

 For each sub-pattern pat[0..i] where i = 0 to m-1, lps[i] stores length of the maximum
matching proper prefix which is also a suffix of the sub-pattern pat[0..i].

 lps[i] = the longest proper prefix of pat[0..i]

which is also a suffix of pat[0..i].

Note: lps[i] could also be defined as the longest prefix which is also a proper suffix. We need
to use "proper" in one of the two places to make sure that the whole substring is not considered.

Examples of lps[] construction:

 For the pattern “AAAA”,

lps[] is [0, 1, 2, 3]

 For the pattern “ABCDE”,

lps[] is [0, 0, 0, 0, 0]

 For the pattern “AABAACAABAA”,

lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

 For the pattern “AAACAAAAAC”,

lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]

 For the pattern “AAABAAA”,

lps[] is [0, 1, 2, 0, 1, 2, 3]
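The lps[] construction above can be sketched in Python; this is the standard linear-time computation, where on a mismatch we fall back to the previous longest prefix-suffix instead of restarting:

```python
def compute_lps(pat):
    """lps[i] = length of the longest proper prefix of pat[0..i]
    that is also a suffix of pat[0..i]."""
    m = len(pat)
    lps = [0] * m
    length = 0          # length of the previous longest prefix-suffix
    i = 1
    while i < m:
        if pat[i] == pat[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]   # fall back; do not advance i
        else:
            lps[i] = 0
            i += 1
    return lps

print(compute_lps("AABAACAABAA"))   # [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
```

Each character is compared at most twice overall (i only moves forward, and length can only shrink as often as it grew), so the construction is O(m).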

Searching Algorithm:
Unlike Naive algorithm, where we slide the pattern by one and compare all characters at
each shift, we use a value from lps[] to decide the next characters to be matched. The idea
is to not match a character that we know will anyway match.

How to use lps[] to decide next positions (or to know a number of characters to be skipped)?

 We start the comparison of pat[j], with j = 0, with the characters of the current window of the
text.

 We keep matching characters txt[i] and pat[j] and keep incrementing i and j
while pat[j] and txt[i] keep matching.

 When we see a mismatch

 We know that the characters pat[0..j-1] match with txt[i-j…i-1] (note that
j starts at 0 and is incremented only when there is a match).

 We also know (from the above definition) that lps[j-1] is the count of
characters of pat[0…j-1] that are both a proper prefix and a suffix.

 From the above two points, we can conclude that we do not need to match
these lps[j-1] characters with txt[i-j…i-1], because we know that these
characters will anyway match. Let us consider the above example to
understand this.

Example:

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

lps[] = {0, 1, 2, 3}

i = 0, j = 0

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 1, j = 1

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 2, j = 2

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 3, j = 3

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 4, j = 4

Since j == m, print pattern found and reset j,

j = lps[j-1] = lps[3] = 3

Here, unlike the Naive algorithm, we do not match the first three characters of
this window. The value of lps[j-1] (in the above step) gave us the index of the
next character to match.

i = 4, j = 3

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++, j++

i = 5, j = 4

Since j == m, print pattern found and reset j,

j = lps[j-1] = lps[3] = 3

Again, unlike the Naive algorithm, we do not match the first three characters of
this window. The value of lps[j-1] (in the above step) gave us the index of the
next character to match.

i = 5, j = 3

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] do NOT match and j > 0, change only j

j = lps[j-1] = lps[2] = 2

i = 5, j = 2

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] do NOT match and j > 0, change only j

j = lps[j-1] = lps[1] = 1

i = 5, j = 1

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] do NOT match and j > 0, change only j

j = lps[j-1] = lps[0] = 0

i = 5, j = 0

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] do NOT match and j is 0, we do i++.

i = 6, j = 0

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++ and j++

i = 7, j = 1

txt[] = "AAAAABAAABA"

pat[] = "AAAA"

txt[i] and pat[j] match, do i++ and j++

We continue this way...
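The whole procedure, preprocessing plus the matching loop traced above, can be sketched in Python as:

```python
def kmp_search(pat, txt):
    """KMP search: report every index where pat occurs in txt."""
    n, m = len(txt), len(pat)

    # preprocessing: lps[i] = longest proper prefix of pat[0..i] also a suffix
    lps = [0] * m
    length, i = 0, 1
    while i < m:
        if pat[i] == pat[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]
        else:
            i += 1

    matches = []
    i = j = 0                      # i indexes txt, j indexes pat
    while i < n:
        if txt[i] == pat[j]:
            i += 1
            j += 1
            if j == m:             # full match ending at i-1
                matches.append(i - j)
                j = lps[j - 1]     # reset j using lps, keep i
        elif j > 0:
            j = lps[j - 1]         # mismatch: fall back in pat only
        else:
            i += 1                 # mismatch at j == 0: advance in txt
    return matches
```

Because i never moves backward and j can only decrease as often as it increased, the matching loop is O(n), matching the worst case stated above.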

Types of Tries
A trie is a tree-like information retrieval data structure whose nodes store the letters of an
alphabet. It is also known as a digital tree or a prefix tree (its compressed form is also
called a radix tree). Tries are classified into three categories:

1. Standard Trie

2. Compressed Trie

3. Suffix Trie

Standard Trie A standard trie has the following properties:

1. It is an ordered tree-like data structure.

2. Each node (except the root node) in a standard trie is labeled with a character.

3. The children of a node are in alphabetical order.

4. Each node or branch represents a possible character of keys or words.

5. Each node or branch may have multiple branches.

6. The last node of every key or word is marked to indicate the end of that word.

Below is the illustration of the Standard Trie:
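A minimal Python sketch of a standard trie with insert and search. Children are stored in a dictionary, so the alphabetical ordering of children (property 3) is implicit rather than enforced, and the word list is only illustrative:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # one character per edge: char -> TrieNode
        self.is_end = False  # marks the last node of a stored word

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_end = True       # mark the end of the word (property 6)

def search(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_end       # must end exactly at a marked node

root = TrieNode()
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    insert(root, w)
```

Both insert and search cost O(length of the word), independent of how many words are stored.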

Compressed Trie A compressed trie has the following properties:

1. A Compressed Trie is an advanced version of the standard trie.

2. Each node (except the leaf nodes) has at least 2 children.

3. It is used to achieve space optimization.

4. To derive a Compressed Trie from a Standard Trie, compression of chains of redundant
nodes is performed.

5. It consists of grouping, re-grouping and un-grouping of keys of characters.

6. While performing the insertion operation, it may be required to un-group the already
grouped characters.

7. While performing the deletion operation, it may be required to re-group the already
grouped characters.

8. A compressed trie T storing s strings (keys) has s external nodes and O(s) total nodes.

Below is the illustration of the Compressed Trie:
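A sketch of a compressed trie in Python, where each edge stores a whole substring. Inserting may split (un-group) an existing edge, as point 6 describes; the class and function names are illustrative:

```python
class CNode:
    def __init__(self):
        self.children = {}   # edge label (string) -> CNode
        self.is_word = False

def insert(root, word):
    node = root
    while word:
        for label in list(node.children):
            k = 0                          # length of the common prefix
            while k < min(len(label), len(word)) and label[k] == word[k]:
                k += 1
            if k == 0:
                continue                   # no overlap with this edge
            if k < len(label):             # un-group: split the edge at k
                child = node.children.pop(label)
                mid = CNode()
                mid.children[label[k:]] = child
                node.children[label[:k]] = mid
                node = mid
            else:                          # consumed the whole edge label
                node = node.children[label]
            word = word[k:]
            break
        else:                              # no edge shares a prefix: new leaf
            leaf = CNode()
            leaf.is_word = True
            node.children[word] = leaf
            return
    node.is_word = True

def search(root, word):
    node = root
    while word:
        for label, child in node.children.items():
            if word.startswith(label):
                node, word = child, word[len(label):]
                break
        else:
            return False
    return node.is_word

root = CNode()
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    insert(root, w)
```

Splitting an edge is exactly the un-grouping of point 6: the shared prefix becomes one edge and the two divergent remainders become separate child edges.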

Suffix Trie A suffix trie has the following properties:
1. A Suffix Trie is an advanced version of the compressed trie.

2. The most common application of suffix trie is Pattern Matching.

3. While performing the insertion operation, both the word and its suffixes are stored.

4. A suffix trie is also used in word matching and prefix matching.

5. To generate a suffix trie, all the suffixes of the given string are considered as individual
words.

6. Using these suffixes, a compressed trie is built.

Below is the illustration of the Suffix Trie:
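A Python sketch that builds a suffix trie by inserting every suffix of the string as an individual word (left uncompressed here for clarity; point 6 would additionally compress chains as in the compressed trie above). Because every substring of the text is a prefix of some suffix, pattern matching is just a walk from the root:

```python
class Node:
    def __init__(self):
        self.children = {}   # char -> Node

def build_suffix_trie(text):
    """Insert every suffix of text as an individual word (standard-trie style)."""
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
    return root

def is_substring(root, pattern):
    """pattern occurs in text iff it is a prefix of some suffix of text."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True

root = build_suffix_trie("banana")
```

Once the trie is built, checking whether a pattern of length m occurs takes only O(m) time, which is why pattern matching is the most common application.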

