Information Retrieval - Chapter 10 - String Searching Algorithms
10.1 INTRODUCTION
String searching is an important component of many problems, including text editing, data retrieval, and symbol manipulation. Despite the use of indices for searching large amounts of text, string searching may still help in an information retrieval system. For example, it may be used for filtering potential matches or for searching for retrieval terms that will be highlighted in the output.

The string searching or string matching problem consists of finding all occurrences (or the first occurrence) of a pattern in a text, where the pattern and the text are strings over some alphabet. We are interested in reporting all the occurrences. It is well known that to search for a pattern of length m in a text of length n (where n > m) the search time is O(n) in the worst case (for fixed m). Moreover, in the worst case, at least n - m + 1 characters must be inspected. This result is due to Rivest (1977). However, the constant in the linear term can be very different for different algorithms. For example, in the worst case, the constant multiple in the naive algorithm is m, whereas for the Knuth-Morris-Pratt (1977) algorithm it is two.

We present the most important algorithms for string matching: the naive or brute force algorithm, the Knuth-Morris-Pratt (1977) algorithm, different variants of the Boyer-Moore (1977) algorithm, the Shift-or algorithm of Baeza-Yates and Gonnet (1989), and the Karp-Rabin (1987) algorithm, which is probabilistic. Experimental results for random text and one sample of English text are included, and we also survey the main theoretical results for each algorithm. Although we only cover string searching, references for related problems are given. We use the C programming language described by Kernighan and Ritchie (1978) to present our algorithms.
10.2 PRELIMINARIES
We use the following notation:

n: the length of the text
m: the length of the pattern (string)
c: the size of the alphabet
Cn: the expected number of comparisons performed by an algorithm while searching the pattern in a text of length n

Theoretical results are given for the worst case number of comparisons, and the average number of comparisons between a character in the text and a character in the pattern (text-pattern comparisons) when finding all occurrences of the pattern in the text, where the average is taken uniformly with respect to strings of length n over a given finite alphabet. Quoting Knuth et al. (1977): "It might be argued that the average case taken over random strings is of little interest, since a user rarely searches for a random string. However, this model is a reasonable approximation when we consider those pieces of text that do not contain the pattern, and the algorithm obviously must compare every character of the text in those places where the pattern does occur." Our experimental results show that this is the case.

The empirical data, for almost all algorithms, consist of results for two types of text: random text and English text. The two cost functions we measured were the number of comparisons performed between a character in the text and a character in the pattern, and the execution time. To determine the number of comparisons, 100 runs were performed. The execution time was measured while searching 1,000 patterns. In each case, patterns of lengths 2 to 20 were considered. In the case of random text, the text was of length 40,000, and both the text and the pattern were chosen uniformly and randomly from an alphabet of size c. Alphabets of size c = 4 (DNA bases) and c = 30 (approximately the number of lowercase English letters) were considered. For the case of English text we used a document of approximately 48,000 characters, with the patterns chosen at random from words inside the text in such a way that a pattern was always a prefix of a word (typical searches). The alphabet used was the set of lowercase letters, some digits, and punctuation symbols, giving 32 characters. Unsuccessful searches were not considered, because we expect unsuccessful searches to be faster than successful searches (fewer comparisons on average). The results for English text are not statistically significant because only one text sample was used. However, they show the correlation of searching patterns extracted from the same text, and we expect that other English text samples will give similar results. Our experimental results agree with those presented by Davies and Bowsher (1986) and Smit (1982).

We define a random string of length ℓ as a string built as the concatenation of ℓ characters chosen independently and uniformly from the alphabet. That is, the probability of two characters being equal is 1/c. Our random text model is similar to the one used in Knuth et al. (1977) and Schaback (1988). For example, the probability of finding a match between a random text of length m and a random pattern of length m is 1/c^m.
The expected number of matches of a random pattern of length m in a random text of length n is (n - m + 1)/c^m.
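For instance, with the parameters used in our experiments (c = 4, n = 40,000), a random pattern of length m = 10 is expected to occur (40,000 - 10 + 1)/4^10 ≈ 0.04 times; a random pattern of even moderate length is thus unlikely to appear at all in the text.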
10.3 THE NAIVE ALGORITHM

The naive, or brute force, algorithm tries to match the pattern at every possible text position, as shown in Figure 10.1.

naivesearch( text, n, pat, m )   /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
    int i, j, k, lim;

    lim = n - m + 1;
    for( i = 1; i <= lim; i++ )      /* Search */
    {
        k = i;
        for( j = 1; j <= m && text[k] == pat[j]; j++ ) k++;
        if( j > m ) Report_match_at_position( i );
    }
}
Figure 10.1: The naive or brute force string matching algorithm

The expected number of text-pattern comparisons performed by the naive or brute force algorithm when searching a random pattern of length m in a random text of length n is

Cn = (c/(c - 1)) (1 - 1/c^m) (n - m + 1) + O(1)

so for random text this algorithm is linear on average, with a constant factor close to c/(c - 1).
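As a small illustration of the 1-based calling convention used by all the algorithms in this chapter (text[1..n] and pat[1..m], with index 0 unused), the following test driver is a minimal sketch; the function name naivesearch and the stub for Report_match_at_position are assumptions of this sketch, not part of the original figure.

#include <stdio.h>
#include <string.h>

Report_match_at_position( p )    /* assumed stub: just print the position */
int p;
{
    printf( "match at %d\n", p );
}

main()
{
    static char text[] = " abracadabra abracadabra";  /* text[1..n]; slot 0 unused */
    static char pat[]  = " abra";                     /* pat[1..m] */
    int n = strlen( text ) - 1;
    int m = strlen( pat ) - 1;

    naivesearch( text, n, pat, m );  /* should report positions 1, 8, 13, and 20 */
    return 0;
}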
10.4 THE KNUTH-MORRIS-PRATT ALGORITHM

The basic idea behind this algorithm is that each time a mismatch is detected, the "false start" consists of characters that we have already examined. We can take advantage of this information instead of repeating comparisons with the known characters. Moreover, it is always possible to arrange the algorithm so that the pointer in the text is never decremented. To accomplish this, the pattern is preprocessed to obtain a table that gives the next position in the pattern to be processed after a mismatch. The exact definition of this table (called next in Knuth et al. [1977]) is

next[j] = max{ i | (pattern[k] = pattern[j - i + k] for k = 1, . . . , i - 1) and pattern[i] ≠ pattern[j] }
for j = 1, . . . , m. In other words, we consider the maximal matching prefix of the pattern such that the next character in the pattern is different from the character of the pattern that caused the mismatch. This algorithm is presented in Figure 10.2.

Example 1 The next table for the pattern abracadabra is

j         1  2  3  4  5  6  7  8  9 10 11 12
pat[j]    a  b  r  a  c  a  d  a  b  r  a
next[j]   0  1  1  0  2  0  2  0  1  1  0  5
When the value in the next table is zero, we have to advance one position in the text and start comparing again from the beginning of the pattern. The last value of the next table (five) is used to continue the search after a match has been found. In the worst case, the number of comparisons is 2n + O(m). Further explanation of how to preprocess the pattern in time O(m) to obtain this table can be found in the original paper or in Sedgewick (1983; see Figure 10.3).
kmpsearch( text, n, pat, m )   /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
    int j, k, resume;
    int next[MAX_PATTERN_SIZE];

    pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;
    initnext( pat, m+1, next );        /* Preprocess pattern */
    resume = next[m+1];
    j = k = 1;
    do {                               /* Search */
        if( j == 0 || text[k] == pat[j] )
        {
            k++;  j++;
        }
        else j = next[j];
        if( j > m )
        {
            Report_match_at_position( k-m );
            j = resume;
        }
    } while( k <= n );
    pat[m+1] = END_OF_STRING;
}

Figure 10.2: The Knuth-Morris-Pratt algorithm
initnext( pat, m, next )   /* Preprocess pattern of length m */
char pat[];
int m, next[];
{
    int i, j;

    i = 1;  j = next[1] = 0;
    do
    {
        if( j == 0 || pat[i] == pat[j] )
        {
            i++;  j++;
            if( pat[i] != pat[j] ) next[i] = j;
            else next[i] = next[j];
        }
        else j = next[j];
    } while( i <= m );
}
Figure 10.3: Pattern preprocessing in the Knuth-Morris-Pratt algorithm

An algorithm for searching a set of strings, similar to the KMP algorithm, was developed by Aho and Corasick (1975). However, the space used and the preprocessing time to search for one string are improved in the KMP algorithm. Variations that compute the next table "on the fly" are presented by Barth (1981) and Takaoka (1986). Variations of the Aho and Corasick algorithm are presented by Bailey and Dromey (1980) and Meyer (1985).
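As a quick check of Example 1, the following sketch prints the next table for abracadabra, including the resume value. It assumes the initnext of Figure 10.3 is linked in; the values chosen for MAX_PATTERN_SIZE and CHARACTER_NOT_IN_THE_TEXT are our own assumptions.

#include <stdio.h>

#define MAX_PATTERN_SIZE 64               /* assumed value */
#define CHARACTER_NOT_IN_THE_TEXT '\0'    /* assumed sentinel value */

main()
{
    static char pat[MAX_PATTERN_SIZE] = " abracadabra";  /* pat[1..m] */
    int next[MAX_PATTERN_SIZE], m = 11, j;

    pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;  /* sentinel, as in kmpsearch */
    initnext( pat, m+1, next );
    for( j = 1; j <= m+1; j++ )            /* prints 0 1 1 0 2 0 2 0 1 1 0 5 */
        printf( "%d ", next[j] );
    printf( "\n" );
    return 0;
}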
10.5 THE BOYER-MOORE ALGORITHM

Example 2 The dd̂ table for the pattern abracadabra is

j        1  2  3  4  5  6  7  8  9 10 11
pat[j]   a  b  r  a  c  a  d  a  b  r  a
dd̂[j]   17 16 15 14 13 12 11 13 12  4  1
The occurrence heuristic is obtained by noting that we must align the position in the text that caused the mismatch with the first character of the pattern that matches it. Formally, calling this table d, we have

d[x] = min{ s | s = m or (0 <= s < m and pattern[m - s] = x) }
for every symbol x in the alphabet. See Figure 10.5 for the code to compute both tables (i.e., dd̂ and d) from the pattern.

#define max(a, b) ((a) > (b) ? (a) : (b))

bmsearch( text, n, pat, m )   /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
    int k, j, skip;
    int dd[MAX_PATTERN_SIZE], d[MAX_ALPHABET_SIZE];

    initd( pat, m, d );          /* Preprocess the pattern */
    initdd( pat, m, dd );
    skip = dd[1] + 1;            /* shift after a match */
    k = m;
    while( k <= n )              /* Search */
    {
        j = m;
        while( j > 0 && text[k] == pat[j] )
        {
            j--;  k--;
        }
        if( j == 0 )
        {
            Report_match_at_position( k+1 );
            k += skip;
        }
        else k += max( d[text[k]], dd[j] );
    }
}
Figure 10.4: The Boyer-Moore algorithm

Example 3 The d table for the pattern abracadabra is

d['a'] = 0   d['b'] = 2   d['c'] = 6   d['d'] = 4   d['r'] = 1
and the value for any other character is 11. Both shifts can be precomputed based solely on the pattern and the alphabet. Hence, the space needed is m + c + O(1). Given these two shift functions, the algorithm chooses the larger one. The same shift strategy can be applied after a match. In Knuth et al. (1977) the preprocessing of the pattern is shown to be linear in the size of the pattern, as it is for the KMP algorithm. However, their algorithm is incorrect. The corrected version can be found in Rytter's paper (1980; see Figure 10.5).
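For example, using the tables of Examples 2 and 3: if the first comparison of a window (j = 11) finds the character d in the text, then d['d'] = 4 and dd̂[11] = 1, so the pattern is shifted by max(4, 1) = 4 positions, which aligns the mismatched d in the text with the d in position 7 of the pattern.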
#define min(a, b) ((a) < (b) ? (a) : (b))

initd( pat, m, d )    /* Preprocess pattern of length m: d table */
char pat[];
int m, d[];
{
    int k;

    for( k = 0; k < MAX_ALPHABET_SIZE; k++ ) d[k] = m;
    for( k = 1; k <= m; k++ ) d[pat[k]] = m - k;
}

initdd( pat, m, dd )  /* Preprocess pattern of length m: dd hat table */
char pat[];
int m, dd[];
{
    int j, k, t, t1, q, q1;
    int f[MAX_PATTERN_SIZE+1];

    for( k = 1; k <= m; k++ ) dd[k] = 2*m - k;
    for( j = m, t = m+1; j > 0; j--, t-- )    /* set up the dd hat table */
    {
        f[j] = t;
        while( t <= m && pat[j] != pat[t] )
        {
            dd[t] = min( dd[t], m-j );
            t = f[t];
        }
    }
    q = t;  t = m + 1 - q;  q1 = 1;           /* Rytter's correction */
    for( j = 1, t1 = 0; j <= t; t1++, j++ )
    {
        f[j] = t1;
        while( t1 >= 1 && pat[j] != pat[t1] ) t1 = f[t1];
    }
    while( q < m )
    {
        for( k = q1; k <= q; k++ ) dd[k] = min( dd[k], m + q - k );
        q1 = q + 1;  q = q + t - f[t];  t = f[t];
    }
}
Figure 10.5: Preprocessing of the pattern in the Boyer-Moore algorithm

Knuth et al. (1977) have shown that, in the worst case, the number of comparisons is O(n + rm), where r is the total number of matches. Hence, this algorithm can be as bad as the naive algorithm when we have many matches, namely, Θ(n) matches. A simpler alternative proof can be found in a paper by Guibas and Odlyzko (1980). In the best case Cn = n/m. Our simulation results agree well with the empirical and theoretical results in the original Boyer-Moore paper (1977). Some experiments in a distributed environment are presented by Moller-Nielsen and Staunstrup (1984). A variant of the BM algorithm when m is similar to n is given by Iyengar and Alia (1980). Boyer-Moore type algorithms to search a set of strings are presented by Commentz-Walter (1979) and Baeza-Yates and Regnier (1990).

To improve the worst case, Galil (1979) modifies the algorithm so that it remembers how many overlapping characters it can have between two successive matches. That is, we compute the length, ℓ, of the longest proper prefix of the pattern that is also a suffix of the pattern. Then, instead of going from m to 1 in the comparison loop, the algorithm goes from m to k, where k = ℓ + 1 if the last event was a match, or k = 1 otherwise. For example, ℓ = 3 for the pattern ababa. This algorithm is truly linear, with a worst case of O(n + m) comparisons. Recently, Cole (1991) proved that the exact worst case is 3n + O(m) comparisons. However, according to empirical results, as expected, it only improves the average case for small alphabets, at the cost of using more instructions. Recently, Apostolico and Giancarlo (1986) improved this algorithm to a worst case of 2n - m + 1 comparisons.
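The prefix length ℓ used by Galil's modification can be computed with the same failure-function machinery used to build the next table; the following function is a sketch of one way to do it (the name border_length is ours, not from the original paper):

border_length( pat, m )   /* length of the longest proper prefix of pat[1..m] */
char pat[];               /* that is also a suffix of pat[1..m] */
int m;
{
    int f[MAX_PATTERN_SIZE], i, j;

    f[1] = 0;
    for( i = 2, j = 0; i <= m; i++ )      /* classical failure function */
    {
        while( j > 0 && pat[j+1] != pat[i] ) j = f[j];
        if( pat[j+1] == pat[i] ) j++;
        f[i] = j;
    }
    return( f[m] );    /* e.g., 3 for ababa, 4 for abracadabra */
}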
Example 4 For the Horspool variant, the d table for the pattern abracadabra is

d['a'] = 3   d['b'] = 2   d['c'] = 6   d['d'] = 4   d['r'] = 1

and the value for any other character is 11. The code for an efficient version of the Boyer-Moore-Horspool algorithm is extremely simple and is presented in Figure 10.6, where MAX_ALPHABET_SIZE is the size of the alphabet. In this algorithm, the order of the comparisons is not relevant, as noted by Baeza-Yates (1989c) and Sunday (1990); thus, the algorithm compares the pattern from left to right. This algorithm also includes the idea of using the character of the text that corresponds to position m + 1 of the pattern, a modification due to Sunday (1990). Further improvements are due to Hume and Sunday (1991).
bmhsearch( text, n, pat, m )   /* Search pat[1..m] in text[1..n] */
char text[], pat[];
int n, m;
{
    int d[MAX_ALPHABET_SIZE], i, j, k, lim;

    pat[m+1] = CHARACTER_NOT_IN_THE_TEXT;     /* To avoid having code */
                                              /* for special case n-k+1 = m */
    for( i = 0; i < MAX_ALPHABET_SIZE; i++ )  /* Preprocess the pattern */
        d[i] = m + 1;
    for( i = 1; i <= m; i++ ) d[pat[i]] = m + 1 - i;
    lim = n - m + 1;
    for( k = 1; k <= lim; k += d[text[k+m]] ) /* Search */
    {
        i = k;
        for( j = 1; text[i] == pat[j]; j++ ) i++;
        if( j == m + 1 ) Report_match_at_position( k );
    }
    pat[m+1] = END_OF_STRING;    /* restore pat[m+1] if necessary */
}
Figure 10.6: The Boyer-Moore-Horspool-Sunday algorithm

Based on empirical and theoretical analysis, the BMH algorithm is simpler and faster than the SBM algorithm, and is as good as the BM algorithm for alphabets of size at least 10. Also, it is not difficult to prove that the expected shift is larger for the BMH algorithm. Improvements to the BMH algorithm for searching in English text are discussed by Baeza-Yates (1989b, 1989a) and Sunday (1990). A hybrid algorithm that combines the BMH and KMP algorithms is proposed by Baeza-Yates (1989c). Figure 10.7 shows, for the algorithms studied up to this point, the expected number of comparisons per character for random text with c = 4. The codes used are the ones given in this chapter, except that the Knuth-Morris-Pratt algorithm was implemented as suggested by its authors. (The version given here is slower but simpler.)
10.6 THE SHIFT-OR ALGORITHM

The algorithm keeps the state of the search as a vector

state = s_m s_m-1 . . . s_1

of m bits, where bit s_i is 0 if and only if the last i characters read from the text match the first i characters of the pattern, and where the s_i are the individual states. We report a match if s_m is 0, or equivalently, if state < 2^(m-1). That match ends at the current position. To update the state after reading a new character in the text, we must:

1. Shift the vector state 1 bit to the left to reflect that we have advanced one position in the text. In practice, this sets the initial state of s_1 to be 0 by default.

2. Update the individual states according to the new character. For this we use a table T, defined by preprocessing the pattern, with one entry per alphabet symbol, and the bitwise operator or that, given the old vector state and the table value, gives the new state.
After these two steps, the new state is

state = (state << 1) or T[x]

where x is the new text character and << denotes the shift left operation. The definition of the table T is

T[x] = δ(pat[1] = x) 2^0 + δ(pat[2] = x) 2^1 + . . . + δ(pat[m] = x) 2^(m-1)

for every symbol x of the alphabet, where δ(C) is 0 if the condition C is true, and 1 otherwise. Therefore, we need one m-bit entry of extra memory per alphabet symbol, but if the word size is at least m, only c words are needed. We set up the table by preprocessing the pattern before the search. This can be done in O(m + c) time.
Example 5 Let {a, b, c, d} be the alphabet, and ababc the pattern. The entries for table T (one digit per position in the pattern) are then:
T[a] = 11010   T[b] = 10101   T[c] = 01111   T[d] = 11111
We finish the example by searching for the first occurrence of ababc in the text abdabababc. The initial state is 11111.
text:   a     b     d     a     b     a     b     a     b     c
T[x]:   11010 10101 11111 11010 10101 11010 10101 11010 10101 01111
state:  11110 11101 11111 11110 11101 11010 10101 11010 10101 01111
For example, the state 10101 means that in the current position we have two partial matches to the left, of lengths two and four, respectively. The match at the end of the text is indicated by the value 0 in the leftmost bit of the state of the search.

The complexity of the search time in both the worst and average case is O(⌈m/w⌉n), where ⌈m/w⌉ is the time to compute a shift or other simple operation on numbers of m bits using a word size of w bits. In practice, for small patterns (word size 32 or 64 bits), we have O(n) time for the worst and the average case. Figure 10.8 shows an efficient implementation of this algorithm. The programming is independent of the word size insofar as possible. We use the following symbolic constants:

MAXSYM: size of the alphabet. For example, 128 for ASCII code.
WORD: word size in bits (32 in our case).
B: number of bits per individual state; in this case, one.
sosearch( text, n, pat, m )   /* Search pat[1..m] in text[1..n] */
register char *text;
char pat[];
int n, m;
{
    register char *end;
    register unsigned int state, lim;
    unsigned int T[MAXSYM], i, j;
    char *start;

    if( m > WORD ) Error( "Use pattern size <= word size" );

    for( i = 0; i < MAXSYM; i++ ) T[i] = ~0;       /* Preprocessing */
    for( lim = 0, j = 1, i = 1; i <= m; lim |= j, j <<= B, i++ )
        T[pat[i]] &= ~j;
    lim = ~(lim >> B);

    text++;  end = text + n + 1;

    state = ~0;                                    /* Initial state */
    for( start = text; text < end; text++ )        /* Search */
    {
        state = (state << B) | T[*text];           /* Next state */
        if( state < lim )
            Report_match_at_position( text - start - m + 2 );
    }
}
Figure 10.8: Shift-Or algorithm for string matching (simpler version)

The changes needed for a more efficient implementation of the algorithm (that is, scan the text until we see the first character of the pattern) are shown in Figure 10.9. The speed of this version depends on the frequency of the first letter of the pattern in the text. The empirical results for this code are shown in Figures 10.11 and 10.12. Another implementation is possible using the bitwise operator and instead of the or operation, and complementing the value of T[x] for all x.
    initial = ~0;  first = pat[1];  start = text;      /* Search */
    do
    {
        while( text < end && *text != first ) text++;  /* Scan for first char */
        if( text >= end ) break;
        state = initial;
        do                                             /* Run the automaton */
        {
            state = (state << B) | T[*text];
            if( state < lim )
                Report_match_at_position( text - start - m + 2 );
            text++;
        } while( state != initial && text < end );
    } while( text < end );
Figure 10.9: Shift-Or algorithm for string matching
By just changing the definition of table T we can search for patterns such that every pattern position is: a set of characters (for example, match a vowel), a "don't care" symbol (match any character), or the complement of a set of characters. Furthermore, we can have "don't care" symbols in the text. This idea has been recently extended to string searching with errors and other variants by Wu and Manber (1991).
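As a concrete illustration, the preprocessing of Figure 10.8 can be extended so that a given pattern position accepts a whole class of characters. The helper below is our own sketch, written for B = 1; the name add_class and its calling convention are not part of the original formulation.

add_class( T, i, class, k )    /* make position i of the pattern match any */
unsigned int T[];              /* of the k characters in class[0..k-1] */
int i, k;
char class[];
{
    int c;

    for( c = 0; c < k; c++ )
        T[class[c]] &= ~( (unsigned int) 1 << (i-1) );
}

For instance, add_class( T, 2, "aeiou", 5 ) makes the second pattern position match any vowel; clearing the bit for all MAXSYM symbols yields a "don't care" position.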
10.7 THE KARP-RABIN ALGORITHM

The idea of the Karp-Rabin (1987) algorithm is to compute a signature (hash value) of each text window of length m, and to compare characters only when the signature of the window matches the signature of the pattern. Thus, the period of the signature function is much larger than m for any practical case, as shown in Gonnet and Baeza-Yates (1990).
#define q 8388593    /* a large prime */
#define d 32         /* the radix of the signature */

rksearch( text, n, pat, m )   /* Search pat[1..m] in text[1..n] */
char text[], pat[];           /* (0 < m <= n) */
int n, m;
{
    int h1, h2, dM, i, j;

    dM = 1;
    for( i = 1; i < m; i++ ) dM = (dM*d) % q;   /* dM = d^(m-1) mod q */
    h1 = h2 = 0;                   /* Compute the signature */
    for( i = 1; i <= m; i++ )      /* of the pattern and of */
    {                              /* the beginning of the text */
        h1 = (h1*d + pat[i]) % q;
        h2 = (h2*d + text[i]) % q;
    }
    for( i = 1; i <= n-m+1; i++ )  /* Search */
    {
        if( h1 == h2 )             /* Potential match */
        {
            for( j = 1; j <= m && text[i+j-1] == pat[j]; j++ );
            if( j > m )            /* true match */
                Report_match_at_position( i );
        }
        if( i <= n-m )                               /* Update the signature */
        {                                            /* of the text */
            h2 = (h2 + d*q - text[i]*dM % q) % q;
            h2 = (h2*d + text[i+m]) % q;
        }
    }
}
Figure 10.10: The Karp-Rabin algorithm

In practice, this algorithm is slow due to the multiplications and the modulus operations. However, it becomes competitive for long patterns. We can avoid the computation of the modulus function at every step by using implicit modular arithmetic given by the hardware. In other words, we use the maximum value of an integer (determined by the word size) for q. The value of d is selected such that d^k mod 2^r has maximal cycle length (cycle of length 2^(r-2)), for r from 8 to 64, where r is the size, in bits, of a word. For example, an adequate value for d is 31. With these changes, the evaluation of the signature at every step (see Figure 10.10) is
    h2 = h2*d - text[i]*dM + text[i+m];   /* update the signature value */
and overflow is ignored. In this way, we use two multiplications instead of one multiplication and two modulus operations.
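A minimal sketch of the whole search with these changes is given below, assuming 32-bit unsigned arithmetic (so the modulus 2^32 is implicit) and d = 31. Note that here dM holds d^m rather than the d^(m-1) used in Figure 10.10, matching the update above; the function name rksearch32 is ours.

rksearch32( text, n, pat, m )   /* Karp-Rabin with implicit modulus 2^32 */
char text[], pat[];
int n, m;
{
    unsigned int h1, h2, dM;
    int i, j;

    for( dM = 1, i = 1; i <= m; i++ ) dM *= 31;   /* dM = d^m mod 2^32 */
    for( h1 = h2 = 0, i = 1; i <= m; i++ )
    {
        h1 = h1*31 + pat[i];           /* signature of the pattern */
        h2 = h2*31 + text[i];          /* signature of text[1..m] */
    }
    for( i = 1; i <= n-m+1; i++ )      /* Search */
    {
        if( h1 == h2 )                 /* potential match: verify directly */
        {
            for( j = 1; j <= m && text[i+j-1] == pat[j]; j++ );
            if( j > m ) Report_match_at_position( i );
        }
        if( i <= n-m )                 /* slide the window; overflow ignored */
            h2 = h2*31 - text[i]*dM + text[i+m];
    }
}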
10.8 CONCLUSIONS
We have presented the most important string searching algorithms. Figure 10.11 shows the execution time of searching 1,000 random patterns in random text for all the algorithms considered, with c = 30. Based on the empirical results, it is clear that Horspool's variant is the best known algorithm for almost all pattern lengths and alphabet sizes. Figure 10.12 shows the same empirical results as Figure 10.11, but for English text instead of random text. The results are similar. For the shift-or algorithm, the given results are for the efficient version. The results for the Karp-Rabin algorithm are not included because in all cases the time exceeds 300 seconds.

The main drawback of the Boyer-Moore type algorithms is the preprocessing time and the space required, which depends on the alphabet size and/or the pattern size. For this reason, if the pattern is small (1 to 3 characters long) it is better to use the naive algorithm. If the alphabet size is large, then the Knuth-Morris-Pratt algorithm is a good choice. In all the other cases, in particular for long texts, the Boyer-Moore algorithm is better. Finally, the Horspool version of the Boyer-Moore algorithm is the best algorithm, according to execution time, for almost all pattern lengths.

The shift-or algorithm has a running time similar to the KMP algorithm. However, its main advantage is that we can search for more general patterns ("don't care" symbols, complement of a character, etc.) using exactly the same searching time (see Baeza-Yates and Gonnet [1989]); only the preprocessing is different.
Figure 10.11: Simulation results for all the algorithms in random text (c = 30)

The linear time worst case algorithms presented in previous sections are optimal in the worst case with respect to the number of comparisons (see Rivest [1977]). However, they are not space optimal in the worst case because they use space that depends on the size of the pattern, the size of the alphabet, or both. Galil and Seiferas (1980, 1983) show that it is possible to have linear time worst case algorithms using constant space. (See also Slisenko [1980, 1983].) They also show that the delay between reading two characters of the text is bounded by a constant, which is interesting for any real time searching algorithm (Galil 1981). Practical algorithms that achieve optimal worst case time and space are presented by Crochemore and Perrin (1988, 1989). Optimal parallel algorithms for string matching are presented by Galil (1985) and by Vishkin (1985). (See also Berkman et al. [1989] and Kedem et al. [1989].)
Figure 10.12: Simulation results for all the algorithms in English text

Many of the algorithms presented may be implemented with hardware (Haskin 1980; Hollaar 1979); for example, Aho and Corasick machines (see Aoe et al. [1985], Cheng and Fu [1987], and Wakabayashi et al. [1985]). If we allow preprocessing of the text, we can search a string in worst case time proportional to its length. This is achieved by using a Patricia tree (Morrison 1968) as an index. This solution needs O(n) extra space and O(n) preprocessing time, where n is the size of the text. See also Weiner (1973), McCreight (1976), Majster and Reiser (1980), and Kemp et al. (1987). For other kinds of indices for text, see Faloutsos (1985), Gonnet (1983), and Blumer et al. (1985, 1987). For further references and problems, see Gonnet and Baeza-Yates (1991).
REFERENCES
AHO, A. 1980. "Pattern Matching in Strings," in Formal Language Theory: Perspectives and Open Problems, ed. R. Book, pp. 325-47. London: Academic Press.
AHO, A., and M. CORASICK. 1975. "Efficient String Matching: An Aid to Bibliographic Search." Communications of the ACM, 18, 333-40.
AOE, J., Y. YAMAMOTO, and R. SHIMADA. 1985. "An Efficient Implementation of Static String Pattern Matching Machines," in IEEE Int. Conf. on Supercomputing Systems, vol. 1, pp. 491-98, St. Petersburg, Fla.
APOSTOLICO, A., and R. GIANCARLO. 1986. "The Boyer-Moore-Galil String Searching Strategies Revisited." SIAM J. on Computing, 15, 98-105.
BAEZA-YATES, R. 1989a. Efficient Text Searching. Ph.D. thesis, Dept. of Computer Science, University of Waterloo. Also as Research Report CS-89-17.
BAEZA-YATES, R. 1989b. "Improved String Searching." Software-Practice and Experience, 19(3), 257-71.
BAEZA-YATES, R. 1989c. "String Searching Algorithms Revisited," in Workshop in Algorithms and Data Structures, F. Dehne, J.-R. Sack, and N. Santoro, eds., pp. 75-96, Ottawa, Canada. Springer-Verlag Lecture Notes on Computer Science 382.
BAEZA-YATES, R., and G. GONNET. 1989. "A New Approach to Text Searching," in Proc. of 12th ACM SIGIR, pp. 168-75, Cambridge, Mass. (To appear in Communications of the ACM.)
BAEZA-YATES, R., and M. REGNIER. 1990. "Fast Algorithms for Two Dimensional and Multiple Pattern Matching," in 2nd Scandinavian Workshop in Algorithmic Theory, SWAT'90, R. Karlsson and J. Gilbert, eds., Lecture Notes in Computer Science 447, pp. 332-47, Bergen, Norway: Springer-Verlag.
BAILEY, T., and R. DROMEY. 1980. "Fast String Searching by Finding Subkeys in Subtext." Inf. Proc. Letters, 11, 130-33.
BARTH, G. 1981. "An Alternative for the Implementation of Knuth-Morris-Pratt Algorithm." Inf. Proc. Letters, 13, 134-37.
BERKMAN, O., D. BRESLAUER, Z. GALIL, B. SCHIEBER, and U. VISHKIN. 1989. "Highly Parallelizable Problems," in Proc. 20th ACM Symp. on Theory of Computing, pp. 309-19, Seattle, Washington.
BLUMER, A., J. BLUMER, A. EHRENFEUCHT, D. HAUSSLER, and R. MCCONNELL. 1987. "Completed Inverted Files for Efficient Text Retrieval and Analysis." JACM, 34, 578-95.
BLUMER, A., J. BLUMER, D. HAUSSLER, A. EHRENFEUCHT, M. CHEN, and J. SEIFERAS. 1985. "The Smallest Automaton Recognizing the Subwords of a Text." Theoretical Computer Science, 40, 31-55.
BOYER, R., and S. MOORE. 1977. "A Fast String Searching Algorithm." CACM, 20, 762-72.
CHENG, H., and K. FU. 1987. "VLSI Architectures for String Matching and Pattern Matching." Pattern Recognition, 20, 125-41.
COLE, R. 1991. "Tight Bounds on the Complexity of the Boyer-Moore String Matching Algorithm," in 2nd Symp. on Discrete Algorithms, pp. 224-33, SIAM, San Francisco, Cal.
COMMENTZ-WALTER, B. 1979. "A String Matching Algorithm Fast on the Average," in ICALP, Lecture Notes in Computer Science 71, pp. 118-32. Springer-Verlag.
CROCHEMORE, M. 1988. "String Matching with Constraints," in Mathematical Foundations of Computer Science, Carlsbad, Czechoslovakia. Lecture Notes in Computer Science 324. Springer-Verlag.
CROCHEMORE, M., and D. PERRIN. 1988. "Pattern Matching in Strings," in 4th Conference on Image Analysis and Processing, ed. D. Gesu, pp. 67-79. Springer-Verlag.
CROCHEMORE, M., and D. PERRIN. 1989. "Two Way Pattern Matching." Technical Report 98-8, L.I.T.P., University Paris 7 (submitted for publication).
DAVIES, G., and S. BOWSHER. 1986. "Algorithms for Pattern Matching." Software-Practice and Experience, 16, 575-601.
FALOUTSOS, C. 1985. "Access Methods for Text." ACM Computing Surveys, 17, 49-74.
GALIL, Z. 1979. "On Improving the Worst Case Running Time of the Boyer-Moore String Matching Algorithm." CACM, 22, 505-08.
GALIL, Z. 1981. "String Matching in Real Time." JACM, 28, 134-49.
GALIL, Z. 1985. "Optimal Parallel Algorithms for String Matching." Information and Control, 67, 144-57.
GALIL, Z., and J. SEIFERAS. 1980. "Saving Space in Fast String-Matching." SIAM J. on Computing, 9, 417-38.
GALIL, Z., and J. SEIFERAS. 1983. "Time-Space-Optimal String Matching." JCSS, 26, 280-94.
GONNET, G. 1983. "Unstructured Data Bases or Very Efficient Text Searching," in ACM PODS, vol. 2, pp. 117-24, Atlanta, Ga.
GONNET, G., and R. BAEZA-YATES. 1990. "An Analysis of the Karp-Rabin String Matching Algorithm." Information Processing Letters, 34, 271-74.
GONNET, G. H., and R. BAEZA-YATES. 1991. Handbook of Algorithms and Data Structures, 2nd ed. Reading, Mass.: Addison-Wesley.
GUIBAS, L., and A. ODLYZKO. 1980. "A New Proof of the Linearity of the Boyer-Moore String Searching Algorithm." SIAM J. on Computing, 9, 672-82.
HARRISON, M. 1971. "Implementation of the Substring Test by Hashing." CACM, 14, 777-79.
HASKIN, R. 1980. "Hardware for Searching Very Large Text Databases," in Workshop Computer Architecture for Non-Numeric Processing, vol. 5, pp. 49-56, California.
HOLLAAR, L. 1979. "Text Retrieval Computers." IEEE Computer, 12, 40-50.
HORSPOOL, R. N. 1980. "Practical Fast Searching in Strings." Software-Practice and Experience, 10, 501-06.
HUME, A., and D. M. SUNDAY. 1991. "Fast String Searching." AT&T Bell Labs Computing Science Technical Report No. 156. To appear in Software-Practice and Experience.
IYENGAR, S., and V. ALIA. 1980. "A String Search Algorithm." Appl. Math. Comput., 6, 123-31.
KARP, R., and M. RABIN. 1987. "Efficient Randomized Pattern-Matching Algorithms." IBM J. Res. Development, 31, 249-60.
KEDEM, Z., G. LANDAU, and K. PALEM. 1989. "Optimal Parallel Suffix-Prefix Matching Algorithm and Applications," in SPAA '89, Santa Fe, New Mexico.
KEMP, M., R. BAYER, and U. GUNTZER. 1987. "Time Optimal Left to Right Construction of Position Trees." Acta Informatica, 24, 461-74.
KERNIGHAN, B., and D. RITCHIE. 1978. The C Programming Language. Englewood Cliffs, N.J.: Prentice-Hall.
KNUTH, D., J. MORRIS, and V. PRATT. 1977. "Fast Pattern Matching in Strings." SIAM J. on Computing, 6, 323-50.
MAJSTER, M., and A. REISER. 1980. "Efficient On-Line Construction and Correction of Position Trees." SIAM J. on Computing, 9, 785-807.
MCCREIGHT, E. 1976. "A Space-Economical Suffix Tree Construction Algorithm." JACM, 23, 262-72.
MEYER, B. 1985. "Incremental String Matching." Inf. Proc. Letters, 21, 219-27.
MOLLER-NIELSEN, P., and J. STAUNSTRUP. 1984. "Experiments with a Fast String Searching Algorithm." Inf. Proc. Letters, 18, 129-35.
MORRIS, J., and V. PRATT. 1970. "A Linear Pattern Matching Algorithm." Technical Report 40, Computing Center, University of California, Berkeley.
MORRISON, D. 1968. "PATRICIA - Practical Algorithm to Retrieve Information Coded in Alphanumeric." JACM, 15, 514-34.
RIVEST, R. 1977. "On the Worst-Case Behavior of String-Searching Algorithms." SIAM J. on Computing, 6, 669-74.
RYTTER, W. 1980. "A Correct Preprocessing Algorithm for Boyer-Moore String-Searching." SIAM J. on Computing, 9, 509-12.
SCHABACK, R. 1988. "On the Expected Sublinearity of the Boyer-Moore Algorithm." SIAM J. on Computing, 17, 648-58.
SEDGEWICK, R. 1983. Algorithms. Reading, Mass.: Addison-Wesley.
SLISENKO, A. 1980. "Determination in Real Time of All the Periodicities in a Word." Sov. Math. Dokl., 21, 392-95.
SLISENKO, A. 1983. "Detection of Periodicities and String-Matching in Real Time." J. Sov. Math., 22, 1316-86.
SMIT, G. 1982. "A Comparison of Three String Matching Algorithms." Software-Practice and Experience, 12, 57-66.
SUNDAY, D. 1990. "A Very Fast Substring Search Algorithm." Communications of the ACM, 33(8), 132-42.
TAKAOKA, T. 1986. "An On-Line Pattern Matching Algorithm." Inf. Proc. Letters, 22, 329-30.
VISHKIN, U. 1985. "Optimal Parallel Pattern Matching in Strings." Information and Control, 67, 91-113.
WAKABAYASHI, S., T. KIKUNO, and N. YOSHIDA. 1985. "Design of Hardware Algorithms by Recurrence Relations." Systems and Computers in Japan, 8, 10-17.
WEINER, P. 1973. "Linear Pattern Matching Algorithm," in FOCS, vol. 14, pp. 1-11.
WU, S., and U. MANBER. 1991. "Fast Text Searching With Errors." Technical Report TR-91-11, Department of Computer Science, University of Arizona. To appear in Proceedings of USENIX Winter 1992 Conference, San Francisco, January 1992.