Data Structures Unit 5
By
D. Subhashini
Associate Professor
Department of Computer Science and Engineering
Aurora’s Technological and Research Institute
UNIT V
Pattern Matching and Tries: Pattern matching algorithms - Brute force, the Boyer-Moore algorithm, the Knuth-Morris-Pratt algorithm, Standard Tries, Compressed Tries, Suffix Tries.
Pattern Matching/Searching
Pattern searching algorithms are sometimes also referred to as string searching algorithms and are considered a part of the string algorithms. These algorithms are useful for searching for a string within another string.
Examples:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10
Pattern searching is an important problem in computer science. When we search for a string in a notepad/word file, a browser, or a database, pattern searching algorithms are used to show the search results.
Naive Pattern Matching:
Slide the pattern over the text one by one and check for a match. If a match is found, slide by 1 again to check for subsequent matches.
1) The worst case occurs when all characters of the text and pattern are the same, e.g.
txt[] = "AAAAAAAAAAAAAAAAAA";
pat[] = "AAAAA";
2) The worst case also occurs when only the last character is different, e.g.
txt[] = "AAAAAAAAAAAAAAAAB";
pat[] = "AAAAB";
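For reference, here is a minimal C sketch of the brute-force (Naive) search described above; the driver in main() reuses the example from the earlier slide:

#include <stdio.h>
#include <string.h>

/* Brute-force search: try every shift s of pat[] over txt[]. */
void search(char pat[], char txt[])
{
    int m = strlen(pat);
    int n = strlen(txt);

    for (int s = 0; s <= n - m; s++) {       /* candidate shift */
        int j;
        for (j = 0; j < m; j++)              /* compare character by character */
            if (txt[s + j] != pat[j])
                break;
        if (j == m)                          /* all m characters matched */
            printf("Pattern found at index %d\n", s);
    }
}

int main(void)
{
    char txt[] = "THIS IS A TEST TEXT";
    char pat[] = "TEST";
    search(pat, txt);                        /* prints: Pattern found at index 10 */
    return 0;
}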
Boyer Moore Algorithm for Pattern Searching
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char
txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
Examples:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10
The Boyer Moore algorithm is a combination of two approaches: the Bad Character Heuristic and the Good Suffix Heuristic. Each of these heuristics can also be used independently to search a pattern in a text. Let us first understand how the two independent approaches work together in the Boyer Moore algorithm. If we take a look at the Naive algorithm, it slides the pattern over the text one by one. The KMP algorithm does preprocessing over the pattern so that the pattern can be shifted by more than one. The Boyer Moore algorithm does preprocessing for the same reason: it processes the pattern and creates different arrays for the two heuristics.
At every step, it slides the pattern by the maximum of the shifts suggested by the two heuristics, so it uses the best of the two heuristics at every step. Unlike the previous pattern searching algorithms, the Boyer Moore algorithm starts matching from the last character of the pattern.
The idea of the bad character heuristic is simple. The character of the text which does not match with the current character of the pattern is called the bad character. Upon mismatch, we shift the pattern until:
1. the mismatch becomes a match, or
2. the pattern P moves past the mismatched character.
Case 1: The mismatch becomes a match
Explanation:
In the above example, we got a mismatch at position 3. The mismatching character here is "A". We now search for the last occurrence of "A" in the pattern. We find "A" at position 1 in the pattern, and this is its last occurrence, so we shift the pattern 2 positions so that the "A" in the pattern gets aligned with the "A" in the text.
Case 2: The pattern moves past the mismatched character
Explanation:
Here we have a mismatch at position 7. The mismatching character "C" does not exist in the pattern before position 7, so we shift the pattern past position 7, and eventually, in the above example, we get a perfect match of the pattern. We do this because "C" does not exist in the pattern, so at every shift before position 7 we would get a mismatch and our search would be fruitless.
In the following implementation, we preprocess the pattern and store the last occurrence of every possible character in an array of size equal to the alphabet size. If the character is not present at all, this may result in a shift by m (the length of the pattern). Therefore, the bad character heuristic takes O(n/m) time in the best case.
The bad character heuristic may take O(mn) time in the worst case. The worst case occurs when all characters of the text and pattern are the same, for example txt[] = "AAAAAAAAAAAAAAAAAA" and pat[] = "AAAAA".
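Below is a minimal C sketch of the bad character heuristic along the lines described above; the table size NO_OF_CHARS = 256 and the function names are assumptions for illustration:

#include <stdio.h>
#include <string.h>

#define NO_OF_CHARS 256

/* Store the index of the last occurrence of every character of pat[];
   -1 means the character does not occur in the pattern at all. */
void badCharHeuristic(char *pat, int m, int badchar[NO_OF_CHARS])
{
    for (int i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;
    for (int i = 0; i < m; i++)
        badchar[(unsigned char)pat[i]] = i;
}

/* Boyer Moore search using only the bad character heuristic. */
void search(char *txt, char *pat)
{
    int m = strlen(pat);
    int n = strlen(txt);
    int badchar[NO_OF_CHARS];

    badCharHeuristic(pat, m, badchar);

    int s = 0;                               /* shift of the pattern over the text */
    while (s <= n - m) {
        int j = m - 1;
        while (j >= 0 && pat[j] == txt[s + j])
            j--;                             /* match from the last character */
        if (j < 0) {
            printf("Pattern found at index %d\n", s);
            /* Align the next text character with its last occurrence in pat,
               or shift by 1 if the pattern is at the end of the text. */
            s += (s + m < n) ? m - badchar[(unsigned char)txt[s + m]] : 1;
        } else {
            /* Align the bad character with its last occurrence in pat;
               shift at least by 1 to avoid a zero or negative shift. */
            int shift = j - badchar[(unsigned char)txt[s + j]];
            s += (shift > 1) ? shift : 1;
        }
    }
}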
Boyer Moore Algorithm | Good Suffix Heuristic
Just like the bad character heuristic, a preprocessing table is generated for the good suffix heuristic. Let t be the substring of the text T which is matched with a substring of the pattern P. Upon mismatch, we shift the pattern until:
1. another occurrence of t in P is aligned with t in T,
2. a prefix of P matches a suffix of t, or
3. P moves past t.
Case 1: Another occurrence of t in P matched with t in T
Explanation:
In the above example, we have a substring t of the text T matched with the pattern P before the mismatch at index 2. We now search for another occurrence of t ("AB") in P. We find an occurrence starting at position 1, so we right-shift the pattern 2 positions to align t in P with t in T. This is the weak rule of the original Boyer Moore algorithm and is not very effective.
Case 2: A prefix of P which matches with a suffix of t in T
It is not always likely that we will find an occurrence of t in P. Sometimes there is no occurrence at all; in such cases we can search for some suffix of t matching some prefix of P and try to align them by shifting P. For example:
Explanation:
In the above example, we have t ("BAB") matched with P at indices 2-4 before the mismatch. But because there exists no other occurrence of t in P, we search for some prefix of P which matches some suffix of t. We find the prefix "AB", starting at index 0, which matches not the whole t but the suffix of t "AB" starting at index 3. So we shift the pattern 3 positions to align the prefix with the suffix.
Case 3: P moves past t
Explanation:
In the above example, there exists no occurrence of t ("AB") in P and also no prefix of P which matches a suffix of t. So, in that case, we can never find any match before index 4, so we shift P past t, i.e., to index 5.
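Below is a minimal C sketch of a search that uses only the good suffix heuristic, with preprocessing for the strong rule (case 1) and the prefix rule (case 2); the array names shift[] and bpos[] (border positions) are assumptions for illustration:

#include <stdio.h>
#include <string.h>

/* Case 1 (strong good suffix rule): bpos[i] is the start of the widest
   border of the suffix beginning at i; shift[i] is the shift to apply
   when a mismatch happens just before position i of the pattern. */
void preprocessStrongSuffix(int *shift, int *bpos, char *pat, int m)
{
    int i = m, j = m + 1;
    bpos[i] = j;
    while (i > 0) {
        while (j <= m && pat[i - 1] != pat[j - 1]) {
            if (shift[j] == 0)
                shift[j] = j - i;            /* shift suggested by this border */
            j = bpos[j];
        }
        i--; j--;
        bpos[i] = j;
    }
}

/* Case 2: a prefix of the pattern matches a suffix of t. */
void preprocessCase2(int *shift, int *bpos, char *pat, int m)
{
    int j = bpos[0];
    for (int i = 0; i <= m; i++) {
        if (shift[i] == 0)
            shift[i] = j;
        if (i == j)
            j = bpos[j];
    }
}

/* Search using only the good suffix heuristic. */
void search(char *txt, char *pat)
{
    int m = strlen(pat);
    int n = strlen(txt);
    int bpos[m + 1], shift[m + 1];

    for (int i = 0; i <= m; i++)
        shift[i] = 0;
    preprocessStrongSuffix(shift, bpos, pat, m);
    preprocessCase2(shift, bpos, pat, m);

    int s = 0;                               /* shift of the pattern over the text */
    while (s <= n - m) {
        int j = m - 1;
        while (j >= 0 && pat[j] == txt[s + j])
            j--;                             /* match from the last character */
        if (j < 0) {
            printf("Pattern found at index %d\n", s);
            s += shift[0];
        } else {
            s += shift[j + 1];
        }
    }
}

The full Boyer Moore algorithm would, at each step, take the maximum of the shift suggested here and the shift suggested by the bad character sketch given earlier.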
KMP Algorithm for Pattern Searching
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char
txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
Examples:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10
The Naive pattern searching algorithm doesn’t work well in cases where we see many
matching characters followed by a mismatching character. Following are some examples.
txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"
txt[] = "ABABABCABABABCABABABC"
pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
The KMP matching algorithm uses the degenerating property of the pattern (the same sub-patterns appearing more than once in the pattern) and improves the worst-case complexity to O(n). The basic idea behind KMP's algorithm is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid matching the characters that we know will anyway match. Let us consider the example below to understand this.
Matching Overview
txt = "AAAAABAAABA"
pat = "AAAA"
We compare the first window of txt with pat:
txt = "AAAAABAAABA"
pat = "AAAA" [initial position]
We find a match. This is the same as Naive String Matching.
In the next step, we compare the next window of txt with pat:
txt = "AAAAABAAABA"
pat = "AAAA" [pattern shifted one position]
This is where KMP does optimization over Naive. In this second window, we only compare the fourth A of the pattern with the fourth character of the current window of the text to decide whether the current window matches or not. Since we know the first three characters will anyway match, we skip matching the first three characters.
Need of Preprocessing?
An important question arises from the above explanation: how do we know how many characters can be skipped? To know this, we pre-process the pattern and prepare an integer array lps[] that tells us the count of characters to be skipped.
Preprocessing Overview:
The KMP algorithm preprocesses pat[] and constructs an auxiliary array lps[] of size m (the same as the size of the pattern) which is used to skip characters while matching.
The name lps indicates "longest proper prefix which is also a suffix". A proper prefix is a prefix that is not the whole string. For example, the prefixes of "ABC" are "", "A", "AB" and "ABC"; its proper prefixes are "", "A" and "AB". The suffixes of the string are "", "C", "BC" and "ABC".
We search for the lps in sub-patterns. More precisely, we focus on sub-strings of the pattern that are both a prefix and a suffix.
For each sub-pattern pat[0..i], where i = 0 to m-1, lps[i] stores the length of the maximum matching proper prefix which is also a suffix of the sub-pattern pat[0..i].
Note: lps[i] could also be defined as the longest prefix which is also a proper suffix. We need to use "proper" in one of the two places to make sure that the whole substring is not considered.
For example:
For pat[] = "AAAA", lps[] is [0, 1, 2, 3]
For pat[] = "ABCDE", lps[] is [0, 0, 0, 0, 0]
For pat[] = "AABAACAABAA", lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
For pat[] = "AAACAAAAAC", lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
For pat[] = "AAABAAA", lps[] is [0, 1, 2, 0, 1, 2, 3]
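A minimal C sketch of the lps[] construction (the helper name computeLPSArray is an assumption, not from the slides):

/* Fill lps[] for pat[0..m-1]: lps[i] is the length of the longest proper
   prefix of pat[0..i] which is also a suffix of pat[0..i]. */
void computeLPSArray(char *pat, int m, int *lps)
{
    int len = 0;                 /* length of the previous longest prefix-suffix */
    int i = 1;
    lps[0] = 0;                  /* lps[0] is always 0 */

    while (i < m) {
        if (pat[i] == pat[len]) {
            len++;
            lps[i] = len;
            i++;
        } else if (len != 0) {
            len = lps[len - 1];  /* fall back to the previous border; keep i */
        } else {
            lps[i] = 0;
            i++;
        }
    }
}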
Searching Algorithm:
Unlike the Naive algorithm, where we slide the pattern by one and compare all characters at each shift, we use a value from lps[] to decide the next characters to be matched. The idea is to not match a character that we know will anyway match.
How do we use lps[] to decide the next positions (or to know the number of characters to be skipped)?
We keep matching characters txt[i] and pat[j] and keep incrementing i and j while pat[j] and txt[i] keep matching.
When we see a mismatch:
1. We know that the characters pat[0..j-1] match with txt[i-j..i-1] (note that j starts with 0 and is incremented only when there is a match).
2. We also know (from the definition of lps[]) that lps[j-1] is the count of characters of pat[0..j-1] that are both a proper prefix and a suffix.
From the above two points, we can conclude that we do not need to match these lps[j-1] characters with txt[i-j..i-1] because we know that these characters will anyway match. Let us consider the example below to understand this.
Example:
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
lps[] = {0, 1, 2, 3}
i = 0, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, so we do i++ and j++ (the same happens for the next three steps).
i = 1, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
i = 2, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
i = 3, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
i = 4, j = 4
Since j == m, the pattern is found at index i - j = 0, and we reset
j = lps[j-1] = lps[3] = 3
i = 4, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
i = 5, j = 4
Since j == m, the pattern is found at index i - j = 1, and we reset
j = lps[j-1] = lps[3] = 3
i = 5, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do not match and j > 0, so we change only j:
j = lps[j-1] = lps[2] = 2
i = 5, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do not match and j > 0, so we change only j:
j = lps[j-1] = lps[1] = 1
i = 5, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do not match and j > 0, so we change only j:
j = lps[j-1] = lps[0] = 0
i = 5, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
i = 6, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
i = 7, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
Types of Tries
A trie is a tree-like information retrieval data structure whose nodes store the letters of an alphabet. It is also known as a digital tree, a radix tree, or a prefix tree. Tries are classified into three categories:
1. Standard Trie
2. Compressed Trie
3. Suffix Trie
Standard Trie
A standard trie has the following properties:
1. Each node (except the root node) in a standard trie is labeled with a character.
2. The last node of every key (word) is marked as the end of that word.
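A minimal C sketch of a standard trie for lowercase words; the 26-letter alphabet, the node layout, and the function names are assumptions for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define ALPHABET_SIZE 26                    /* assume lowercase English words */

struct TrieNode {
    struct TrieNode *children[ALPHABET_SIZE];
    bool isEndOfWord;                       /* marks the last node of a key */
};

struct TrieNode *newNode(void)
{
    /* calloc gives NULL children and isEndOfWord = false */
    return calloc(1, sizeof(struct TrieNode));
}

/* Insert a key: each character labels one node on the path from the root. */
void insert(struct TrieNode *root, const char *key)
{
    struct TrieNode *cur = root;
    for (int i = 0; key[i]; i++) {
        int idx = key[i] - 'a';
        if (cur->children[idx] == NULL)
            cur->children[idx] = newNode();
        cur = cur->children[idx];
    }
    cur->isEndOfWord = true;                /* mark the end of the word */
}

/* Search returns true only if the whole key was previously inserted. */
bool searchTrie(struct TrieNode *root, const char *key)
{
    struct TrieNode *cur = root;
    for (int i = 0; key[i]; i++) {
        int idx = key[i] - 'a';
        if (cur->children[idx] == NULL)
            return false;
        cur = cur->children[idx];
    }
    return cur->isEndOfWord;
}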
Compressed Trie
A compressed trie has the following properties:
1. A compressed trie is obtained from a standard trie by compressing each chain of single-child nodes into a single node labeled with the corresponding group of characters.
2. While performing the insertion operation, it may be required to un-group the already grouped characters.
3. While performing the deletion operation, it may be required to re-group the already grouped characters.
4. A compressed trie T storing s strings (keys) has s external nodes and O(s) total nodes.
Suffix Trie
A suffix trie has the following properties:
1. A suffix trie is an advanced version of the compressed trie, built over all the suffixes of a given string.
2. While performing the insertion operation, both the word and its suffixes are stored.
3. To generate a suffix trie, all the suffixes of the given string are considered as individual words, as sketched below.
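Following the last property, a minimal sketch of building an (uncompressed) suffix trie by inserting every suffix as an individual word, reusing newNode() and insert() from the standard trie sketch above:

/* Build a suffix trie of txt by inserting every suffix as a word. */
struct TrieNode *buildSuffixTrie(const char *txt)
{
    struct TrieNode *root = newNode();
    for (int i = 0; txt[i]; i++)
        insert(root, txt + i);              /* insert the suffix starting at i */
    return root;
}

A substring query then simply walks the trie from the root (a prefix search), since every substring of the text is a prefix of some suffix. A full suffix trie implementation would additionally compress single-child chains, as in the compressed trie.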