03-Rabinkarp Dfa Bitap

The document discusses using a finite automaton (FA) to search for patterns in text. It describes how an FA can search in linear time by remembering partial matches, whereas naive string search discards what it learned from a partial match. It provides an example of constructing an FA to search for the pattern "pappar" in text, with the FA's states representing the pattern prefixes matched so far. Overall, the document introduces finite automata for efficient string matching.

CS481/CS583:

Bioinformatics Algorithms

Can Alkan
EA509
calkan@[Link]

■ RABIN-KARP ALGORITHM
❑ Fingerprint idea
❑ Algorithm with Fingerprints
❑ Using a Hash Function
❑ Preprocessing and Stepping
❑ Stepping
❑ Rabin-Karp Algorithm
❑ Analysis
❑ Rabin-Karp in Practice
❑ RABIN-KARP – EXAMPLE #2

■ FINITE AUTOMATA
❑ Searching in n comparisons
❑ Finite automaton search
❑ Need some Notation …
❑ FA Construction
❑ P=ababaca
❑ Search
❑ Analysis of FA

■ BITAP ALGORITHM: SHIFT/AND

❑ Extension to Bitap
RABIN-KARP ALGORITHM
Fingerprint idea

■ Assume:
❑ We can compute a fingerprint f(P) of P in O(m) time.
❑ If f(P) ≠ f(T[s .. s+m–1]), then P ≠ T[s .. s+m–1]
❑ We can compare fingerprints in O(1)
❑ We can compute f’ = f(T[s+1 .. s+m]) from f(T[s .. s+m–1]) in O(1)

AALG, lecture 3, © Simonas Šaltenis, 2004
Algorithm with Fingerprints
■ Let the alphabet Σ = {0,1,2,3,4,5,6,7,8,9}
■ Let the fingerprint be just a decimal number, i.e.,
f(“1045”) = 1*10^3 + 0*10^2 + 4*10^1 + 5 = 1045
■ Fingerprint-Search(T,P)
  fp ← compute f(P)
  f ← compute f(T[0..m–1])
  for s ← 0 to n – m do
    if fp = f return s
    if s < n – m then f ← (f – T[s]*10^(m–1))*10 + T[s+m]
  return –1
[Figure: sliding window over T; the new f is obtained from f by dropping T[s] and appending T[s+m]]

■ Running time 2O(m) + O(n–m) = O(n)
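
The Fingerprint-Search idea can be sketched in Python as follows. This is a minimal illustration, assuming T and P are strings of decimal digits; the name `fingerprint_search` is hypothetical. Python integers have arbitrary precision, so the m-digit arithmetic works here, but each operation is not O(1) for large m.

```python
def fingerprint_search(T: str, P: str) -> int:
    """Return the first shift s with T[s:s+m] == P, or -1 (digit strings only)."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return -1
    fp = int(P)            # f(P)
    f = int(T[:m])         # f(T[0..m-1])
    high = 10 ** (m - 1)   # weight of the leading digit of the window
    for s in range(n - m + 1):
        if fp == f:
            return s
        if s < n - m:      # roll the window: drop T[s], append T[s+m]
            f = (f - int(T[s]) * high) * 10 + int(T[s + m])
    return -1
```

Because both fingerprints cover exactly m digits, equal fingerprints here imply equal strings, so no extra verification step is needed in this version.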

Using a Hash Function

■ Problem:
❑ we cannot assume we can do arithmetic with m-digit-long numbers in O(1) time
■ Solution: Use a hash function h = f mod q
❑ For example, if q = 7, h(“52”) = 52 mod 7 = 3
❑ h(S1) ≠ h(S2) ⇒ S1 ≠ S2
❑ But h(S1) = h(S2) does not imply S1 = S2
■ For example, if q = 7, h(“73”) = 3, but “73” ≠ “52”
■ Basic “mod q” arithmetic:
❑ (a+b) mod q = ((a mod q) + (b mod q)) mod q
❑ (a*b) mod q = ((a mod q)*(b mod q)) mod q
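
The collision example and the two modular identities above can be checked with a few lines of Python. This is only a sanity check; the helper name `h` follows the slide, and the values 52 and 73 are the ones used above.

```python
def h(s: str, q: int = 7) -> int:
    """Hash a decimal digit string by reducing its numeric value mod q."""
    return int(s) % q

a, b, q = 52, 73, 7

# Different strings, same hash: a collision.
same_hash = h("52") == h("73")

# The identities that allow reducing mod q after every step.
sum_ok = (a + b) % q == ((a % q) + (b % q)) % q
prod_ok = (a * b) % q == ((a % q) * (b % q)) % q
```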

Preprocessing and Stepping

■ Preprocessing:
❑ fp = (P[m-1] + 10*(P[m-2] + 10*(P[m-3] + … + 10*(P[1] + 10*P[0])…))) mod q
❑ In the same way compute ft from T[0..m-1]
❑ Example: P = “2531”, q = 7, fp = ?
■ Stepping:
❑ ft = ((ft – T[s]*(10^(m-1) mod q))*10 + T[s+m]) mod q
❑ 10^(m-1) mod q can be computed once in the preprocessing
❑ Example: Let T[…] = “5319”, q = 7, what is the corresponding ft?
[Figure: sliding window over T; the new ft is obtained from ft by dropping T[s] and appending T[s+m]]
Stepping

■ T = 25319…, m = 4, q = 7
■ T0 = “2531”
❑ ft = 2531 mod 7 = 4
■ T1 = “5319”
❑ ft = ((ft – T[s]*(10^(m-1) mod q))*10 + T[s+m]) mod q
❑ ft = ((ft – T[0]*(10^3 mod 7))*10 + T[0+4]) mod 7
■ = ((4 – 2*(1000 mod 7))*10 + T[4]) mod 7
■ = ((4 – 2*6)*10 + 9) mod 7 = (–8*10 + 9) mod 7
■ = –71 mod 7 = 6
❑ Check: 5319 mod 7 = 6
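
The stepping computation above can be reproduced in Python. This is a small sketch; the function name `roll` and its parameters are illustrative, not part of the slides. Note that Python's `%` returns a non-negative result for a positive modulus, which matches the "–71 mod 7 = 6" step; in languages such as C the remainder of a negative number can be negative and would need adjusting.

```python
def roll(ft: int, out_digit: int, in_digit: int, m: int, q: int, d: int = 10) -> int:
    """Slide a length-m window one position: drop out_digit, append in_digit."""
    c = pow(d, m - 1, q)  # d^(m-1) mod q, normally precomputed once
    return ((ft - out_digit * c) * d + in_digit) % q

ft0 = 2531 % 7                      # fingerprint of "2531"
ft1 = roll(ft0, 2, 9, m=4, q=7)     # fingerprint of "5319"
```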
Rabin-Karp Algorithm
Rabin-Karp-Search(T,P)
  q ← a prime larger than m
  c ← 10^(m-1) mod q // run a loop multiplying by 10 mod q
  fp ← 0; ft ← 0
  for i ← 0 to m-1 // preprocessing
    fp ← (10*fp + P[i]) mod q
    ft ← (10*ft + T[i]) mod q
  for s ← 0 to n – m // matching
    if fp = ft then // run a loop to compare strings
      if P[0..m-1] = T[s..s+m-1] return s
    if s < n – m then // no character left to append after the last shift
      ft ← ((ft – T[s]*c)*10 + T[s+m]) mod q
  return –1
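
A runnable Python sketch of the pseudocode above. Two choices here are assumptions rather than part of the slides: characters are mapped to digits with `ord` and radix d = 256, and q defaults to the prime 1,000,000,007 instead of "a prime larger than m". Every fingerprint hit is verified by a string comparison, so collisions cannot produce a false match.

```python
def rabin_karp(T: str, P: str, d: int = 256, q: int = 1_000_000_007) -> int:
    """Return the first shift s with T[s:s+m] == P, or -1 if there is none."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return -1
    c = pow(d, m - 1, q)              # d^(m-1) mod q
    fp = ft = 0
    for i in range(m):                # preprocessing
        fp = (d * fp + ord(P[i])) % q
        ft = (d * ft + ord(T[i])) % q
    for s in range(n - m + 1):        # matching
        if fp == ft and T[s:s + m] == P:
            return s                  # fingerprint hit confirmed
        if s < n - m:                 # roll the window one position
            ft = ((ft - ord(T[s]) * c) * d + ord(T[s + m])) % q
    return -1
```

Passing a small modulus such as `q=7` forces frequent fingerprint collisions; the verification step keeps the result correct at the cost of extra O(m) comparisons.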


Analysis

■ If q is a prime, the hash function distributes m-digit strings approximately evenly among the q values
❑ Thus, on average only every qth value of shift s will result in matching fingerprints (which will require comparing strings with O(m) comparisons)
■ Expected running time (if q > m):
❑ Preprocessing: O(m)
❑ Outer loop: O(n-m)
❑ All inner loops: ((n-m)/q) * O(m) = O(n-m), since q > m
❑ Total time: O(m) + O(n-m) = O(n)
■ Worst-case running time: O(nm)

Rabin-Karp in Practice

■ If the alphabet has d characters, interpret characters as radix-d digits (replace 10 with d in the algorithm).
■ Choosing a prime q > m can be done with randomized algorithms in O(m), or q can be fixed to be the largest prime such that 10*q (d*q for radix d) fits in a computer word.

RABIN-KARP – EXAMPLE #2
Rabin-Karp
T = TACGTAGCTAGTCGA        A:0; C:1; G:2; T:3
P = CTAG                   Q = 997

Fingerprint(P) = code(G) + 4*code(A) + 16*code(T) + 64*code(C)
= 2 + 4*0 + 16*3 + 64*1 = 114

Fingerprint(T) (first window, TACG) = code(G) + 4*code(C) + 16*code(A) + 64*code(T)
= 2 + 4*1 + 16*0 + 64*3
= 198

F(P) != F(T) skip

Rabin-Karp
T = TACGTAGCTAGTCGA        A:0; C:1; G:2; T:3
P = CTAG                   Q = 997
F (P) = 114
F (T) = 198

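i=2
Fingerprint(T) = (F(T) – code(T[1])*c) * 4 + code(T[5]), where c = 4^3 = 64
(all values stay below Q = 997, so reducing mod Q leaves them unchanged)
= (198 – code(T)*64) * 4 + code(T)
= (198 – 3*64) * 4 + 3
= 6*4 + 3 = 27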

F(P) != F(T) skip

Rabin-Karp
T = TACGTAGCTAGTCGA        A:0; C:1; G:2; T:3
P = CTAG                   Q = 997
F (P) = 114
F (T) = 27

i=3
Fingerprint(T) = (F(T) – T[2]*c) * 4 + T[6]
= (27– code(A)*64) * 4 + code(A)
= (27 – 0*64) * 4 + 0
= 27 * 4 = 108

F(P) != F(T) skip

Rabin-Karp
T = TACGTAGCTAGTCGA        A:0; C:1; G:2; T:3
P = CTAG                   Q = 997
F (P) = 114
F (T) = 108

i=4
Fingerprint(T) = (F(T) – T[3]*c) * 4 + T[7]
= (108– code(C)*64) * 4 + code(G)
= (108 – 1*64) * 4 + 2
= 44 * 4 + 2 = 178

F(P) != F(T) skip

Rabin-Karp
T = TACGTAGCTAGTCGA        A:0; C:1; G:2; T:3
P = CTAG                   Q = 997
F (P) = 114
F (T) = 178

i=5
Fingerprint(T) = (F(T) – T[4]*c) * 4 + T[8]
= (178– code(G)*64) * 4 + code(C)
= (178 – 2*64) * 4 + 1
= 50 * 4 + 1 = 201

F(P) != F(T) skip

Rabin-Karp
T = TACGTAGCTAGTCGA        A:0; C:1; G:2; T:3
P = CTAG                   Q = 997
F (P) = 114
F (T) = 201

i=6
Fingerprint(T) = (F(T) – T[5]*c) * 4 + T[9]
= (201– code(T)*64) * 4 + code(T)
= (201– 3*64) * 4 + 3
= 9 * 4 + 3 = 39

F(P) != F(T) skip

Rabin-Karp
T = TACGTAGCTAGTCGA        A:0; C:1; G:2; T:3
P = CTAG                   Q = 997
F (P) = 114
F (T) = 39

i=7
Fingerprint(T) = (F(T) – T[6]*c) * 4 + T[10]
= (39– code(A)*64) * 4 + code(A)
= (39– 0*64) * 4 + 0
= 39 * 4 + 0 = 156

F(P) != F(T) skip

Rabin-Karp
T = TACGTAGCTAGTCGA        A:0; C:1; G:2; T:3
P = CTAG                   Q = 997
F (P) = 114
F (T) = 156

i=8
Fingerprint(T) = (F(T) – T[7]*c) * 4 + T[11]
= (156 – code(G)*64) * 4 + code(G)
= (156 – 2*64) * 4 + 2
= 28 * 4 + 2 = 114

F(P) = F(T): fingerprints match, so CHECK the strings character by character

CTAG = CTAG: match at position i = 8
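
The window fingerprints from this example can be reproduced with a short Python sketch. The function name `window_fingerprints` is illustrative; the letter codes, radix 4 and Q = 997 are taken from the slides. Indices in the code are 0-based, so the match found at i = 8 on the slides appears at index 7.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def window_fingerprints(T: str, m: int, q: int = 997, d: int = 4) -> list[int]:
    """Fingerprints of all length-m windows of T, computed by rolling."""
    c = pow(d, m - 1, q)                 # 4^3 = 64 for m = 4
    f = 0
    for ch in T[:m]:                     # first window
        f = (d * f + CODE[ch]) % q
    fps = [f]
    for s in range(len(T) - m):          # roll through the remaining windows
        f = ((f - CODE[T[s]] * c) * d + CODE[T[s + m]]) % q
        fps.append(f)
    return fps

fps = window_fingerprints("TACGTAGCTAGTCGA", 4)
fp = window_fingerprints("CTAG", 4)[0]   # fingerprint of the pattern
first_hit = fps.index(fp)                # 0-based shift of the first fingerprint hit
```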
FINITE AUTOMATA
Searching in n comparisons

■ The goal: each character of the text is compared only once!
■ Problem with the naïve algorithm:
❑ Forgets what was learned from a partial match!
❑ Examples:
■ T = “Tweedledee and Tweedledum” and
P = “Tweedledum”
■ T = “pappappappar” and P = “pappar”

Finite automaton search
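[Figure: state-transition diagram of the automaton for P = ababaca, states 0 to 7; the forward edges 0→1→…→7 are labeled a, b, a, b, a, c, a, with back edges labeled a and b]

Transition table (rows: current state; columns: input character; the P column gives the pattern character that advances from that state):

state   a  b  c   P
0       1  0  0   a
1       1  2  0   b
2       3  0  0   a
3       1  4  0   b
4       5  0  0   a
5       1  4  6   c
6       7  0  0   a
7       1  2  0

Example run:

i            --  1  2  3  4  5  6  7  8  9  10  11
T[i]         --  a  b  a  b  a  b  a  c  a  b   a
state φ(i)    0  1  2  3  4  5  4  5  6  7  2   3

Processing time takes Θ(n).
But have to first construct FA.
Main Issue: How to construct FA?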
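
The search itself is a single table lookup per text character. The sketch below hard-codes the transition table for P = ababaca from this slide and replays the example text; the names `DELTA` and `run` are illustrative.

```python
# Transition table for P = ababaca: DELTA[state][char] -> next state
DELTA = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 3, "b": 0, "c": 0},
    3: {"a": 1, "b": 4, "c": 0},
    4: {"a": 5, "b": 0, "c": 0},
    5: {"a": 1, "b": 4, "c": 6},
    6: {"a": 7, "b": 0, "c": 0},
    7: {"a": 1, "b": 2, "c": 0},
}

def run(T: str, delta: dict, start: int = 0) -> list[int]:
    """Return the state sequence visited while reading T (one lookup per char)."""
    states = [start]
    for ch in T:
        states.append(delta[states[-1]][ch])
    return states

states = run("abababacaba", DELTA)
# A match ends wherever the final state 7 is reached.
match_ends = [i for i, q in enumerate(states) if q == 7]
```

Here `match_ends` holds the 1-based text positions at which an occurrence of P ends, matching the i row of the example run.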
Need some Notation …
φ(w) = state FA ends up in after processing w.

Example: φ(abab) = 4.

σ(x) = max{k : P_k is a suffix of x}, where P_k = P[1..k] is the length-k prefix of P. Called the suffix function.

Examples: Let P = ab.

σ(ε) = 0
σ(ccaca) = 1
σ(ccab) = 2
Note: If |P| = m, then σ(x) = m indicates a match.
T:      a b a b b a b b a c …
States: 0 1 … m … m …   (each time state m is reached, a match ends at that position)
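
The suffix function can be written directly from its definition. This is a naive sketch for illustration (the name `sigma` mirrors σ; the pattern is passed explicitly as a second argument): it tries prefix lengths from longest to shortest.

```python
def sigma(x: str, P: str) -> int:
    """Length of the longest prefix of P that is a suffix of x."""
    for k in range(min(len(P), len(x)), -1, -1):
        if x.endswith(P[:k]):   # P[:0] is empty and is a suffix of every string
            return k
    return 0  # not reached: k = 0 always succeeds
```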
FA Construction
Given: P[1..m]. Let Q = states = {0, 1, …, m}, where 0 is the initial state and m is the final state.

Define transition function δ as follows:

δ(q, a) = σ(P_q a) for each q and a.

Example: δ(5, b) = σ(P_5 b) (P = ababaca)
= σ(ababab)
= 4
Intuition: Encountering a ‘b’ in state 5 means the current substring doesn’t match. But you know this substring ends with “abab”, and this is the longest suffix that matches the beginning of P. Thus, we go to state 4 and continue processing “abab…”.
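
The definition δ(q, a) = σ(P_q a) translates into a straightforward, if slow, construction. The sketch below is a naive version under that definition; the name `build_delta` and the list-of-dicts layout are illustrative choices, and its running time is roughly O(m^3 |Σ|), so it is meant for small patterns only.

```python
def build_delta(P: str, alphabet: str) -> list[dict[str, int]]:
    """delta[q][a] = longest k such that P[:k] is a suffix of P[:q] + a."""
    m = len(P)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            x = P[:q] + a
            k = min(m, q + 1)            # cannot match more than q + 1 characters
            while not x.endswith(P[:k]):
                k -= 1                   # stops at k = 0 at the latest
            row[a] = k
        delta.append(row)
    return delta

delta = build_delta("ababaca", "abc")
```

For P = ababaca this yields δ(5, b) = 4, as in the example above.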
P=ababaca

[Figure: automaton with states 0 to 7 and forward edges labeled a, b, a, b, a, c, a; state 0 has a self-loop labeled b,c]

m=7; Q={0,1,2,3,4,5,6,7}
Prefixes: a, ab, aba, abab, ababa, ababac, ababaca
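P=ababaca

[Figure: automaton with states 0 to 7 and forward edges labeled a, b, a, b, a, c, a; state 0 has a self-loop labeled b,c; state 1 has a self-loop labeled a]

Prefixes: a, ab, aba, abab, ababa, ababac, ababaca
δ(1, a) = σ(P_1 a) = σ(aa) = σ(a) = 1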
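P=ababaca

[Figure: automaton with states 0 to 7 and forward edges labeled a, b, a, b, a, c, a; state 0 has a self-loop labeled b,c; state 1 has a self-loop labeled a and an edge back to state 0 labeled c]

Prefixes: a, ab, aba, abab, ababa, ababac, ababaca
δ(1, a) = σ(P_1 a) = σ(aa) = σ(a) = 1
δ(1, c) = σ(P_1 c) = σ(ac) = 0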
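P=ababaca

[Figure: automaton with states 0 to 7 and forward edges labeled a, b, a, b, a, c, a; state 0 has a self-loop labeled b,c; state 1 has a self-loop labeled a and an edge back to state 0 labeled c; state 2 has edges back to state 0 labeled b and c]

Prefixes: a, ab, aba, abab, ababa, ababac, ababaca
δ(2, b) = σ(P_2 b) = σ(abb) = 0
δ(2, c) = σ(P_2 c) = σ(abc) = 0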
P=ababaca (fast forward & simplified)