CS481/CS583:
Bioinformatics Algorithms
Can Alkan
EA509
calkan@[Link]
[Link]
■ RABIN-KARP ALGORITHM
❑ Fingerprint idea
❑ Algorithm with Fingerprints
❑ Using a Hash Function
❑ Preprocessing and Stepping
❑ Stepping
❑ Rabin-Karp Algorithm
❑ Analysis
❑ Rabin-Karp in Practice
❑ RABIN-KARP – EXAMPLE #2
■ FINITE AUTOMATA
❑ Searching in n comparisons
❑ Finite automaton search
❑ Need some Notation …
❑ FA Construction
❑ P=ababaca
❑ Search
❑ Analysis of FA
■ BITAP ALGORITHM: SHIFT/AND
❑ Extension to Bitap
RABIN-KARP ALGORITHM
Fingerprint idea
■ Assume:
❑ We can compute a fingerprint f(P) of P in O(m)
time.
❑ If f(P) ≠ f(T[s .. s+m–1]), then P ≠ T[s .. s+m–1]
❑ We can compare fingerprints in O(1)
❑ We can compute f’ = f(T[s+1 .. s+m]) from f(T[s .. s+m–1]) in O(1)
AALG, lecture 3, © Simonas
Šaltenis, 2004
Algorithm with Fingerprints
■ Let the alphabet Σ ={0,1,2,3,4,5,6,7,8,9}
■ Let the fingerprint be just the decimal number, i.e.,
f(“1045”) = 1*10^3 + 0*10^2 + 4*10^1 + 5 = 1045
■ Fingerprint-Search(T,P)
   fp ← compute f(P)
   f ← compute f(T[0..m–1])
   for s ← 0 to n – m do
      if fp = f return s
      if s < n – m then          // T[s+m] does not exist at the last shift
         f ← (f – T[s]*10^(m–1))*10 + T[s+m]
   return –1
■ Running time 2O(m) + O(n–m) = O(n)
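The decimal fingerprint search can be sketched in Python as follows. This is a minimal sketch, assuming the text and pattern are strings of decimal digits; the function name `fingerprint_search` is illustrative and not from the slides. It relies on Python's arbitrary-precision integers, which is exactly the O(1)-arithmetic assumption that does not hold on fixed-size machine words.

```python
def fingerprint_search(text: str, pattern: str) -> int:
    """Return the first 0-based shift where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    fp = int(pattern)        # f(P), the pattern read as a decimal number
    f = int(text[:m])        # f(T[0..m-1])
    high = 10 ** (m - 1)     # weight of the leading digit of a window
    for s in range(n - m + 1):
        if f == fp:
            return s
        if s < n - m:        # there is no T[s+m] after the last shift
            f = (f - int(text[s]) * high) * 10 + int(text[s + m])
    return -1
```

For example, `fingerprint_search("1045", "45")` slides over the windows 10, 04, 45 and returns shift 2.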
Using a Hash Function
■ Problem:
❑ We cannot assume that arithmetic on m-digit numbers takes O(1) time
■ Solution: Use a hash function h = f mod q
❑ For example, if q = 7, h(“52”) = 52 mod 7 = 3
❑ h(S1) ≠ h(S2) ⇒ S1 ≠ S2
❑ But h(S1) = h(S2) does not imply S1=S2
■ For example, if q = 7, h(“73”) = 3, but “73” ≠ “52”
■ Basic “mod q” arithmetic:
❑ (a+b) mod q = (a mod q + b mod q) mod q
❑ (a*b) mod q = (a mod q)*(b mod q) mod q
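The collision example and the two “mod q” identities can be checked with a few lines of Python. This is a small sketch; the helper name `h` mirrors the slide's notation, and the range of tested values is an arbitrary choice.

```python
q = 7

def h(digits: str) -> int:
    """Hash a decimal string by reducing its numeric value mod q."""
    return int(digits) % q

# Equal hashes do not imply equal strings: "52" and "73" collide for q = 7.
collision = h("52") == h("73") and "52" != "73"

# Both identities hold for every pair (a, b) in a small test range.
identities_hold = all(
    (a + b) % q == ((a % q) + (b % q)) % q
    and (a * b) % q == ((a % q) * (b % q)) % q
    for a in range(50)
    for b in range(50)
)
```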
Preprocessing and Stepping
■ Preprocessing:
❑ fp = P[m-1] + 10*(P[m-2] + 10*(P[m-3]+ … +
10*(P[1] + 10*P[0])…)) mod q
❑ In the same way compute ft from T[0..m-1]
❑ Example: P = “2531”, q = 7, fp = ?
■ Stepping:
❑ ft = ((ft – T[s]*(10^(m-1) mod q))*10 + T[s+m]) mod q
❑ 10^(m-1) mod q can be computed once in the preprocessing
❑ Example: Let T[…] = “5319”, q = 7, what is the corresponding ft?
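Preprocessing (Horner's rule mod q) and one stepping update can be sketched in Python as follows. The names `horner_mod` and `roll` are illustrative, not from the slides, and the radix parameter `d` defaults to 10 to match the decimal alphabet used here.

```python
def horner_mod(digits: str, q: int, d: int = 10) -> int:
    """Fingerprint of a digit string mod q, reducing after every step."""
    h = 0
    for ch in digits:
        h = (d * h + int(ch)) % q
    return h

def roll(ft: int, out_digit: int, in_digit: int, c: int, q: int, d: int = 10) -> int:
    """Drop the leading digit, append a new one; c must equal d^(m-1) mod q."""
    return ((ft - out_digit * c) * d + in_digit) % q

q = 7
c = pow(10, 3, q)                 # 10^(m-1) mod q for m = 4
fp = horner_mod("2531", q)        # the slide's first exercise
ft = roll(fp, 2, 9, c, q)         # slide from "2531" to "5319"
```

Python's `%` always returns a non-negative result for a positive modulus, so the intermediate negative value needs no special handling. In languages where the remainder keeps the sign of the dividend (C, Java), adding q before the final reduction is a common workaround.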
Stepping
■ T = 25319446766…, m = 4, q = 7
■ T0 = “2531”
❑ ft = 2531 mod 7 = 4
■ T1 = “5319”
❑ ft = ((ft – T[s]*(10^(m-1) mod q))*10 + T[s+m]) mod q
❑ ft = ((ft – T[0]*(10^3 mod 7))*10 + T[0+4]) mod 7
■ = ((4 – 2*(1000 mod 7))*10 + T[4]) mod 7
■ = ((4 – 2*6)*10 + 9) mod 7 = (–8*10 + 9) mod 7
■ = –71 mod 7 = 6
❑ Check: 5319 mod 7 = 6
Rabin-Karp Algorithm
Rabin-Karp-Search(T,P)
   q ← a prime larger than m
   c ← 10^(m-1) mod q            // run a loop multiplying by 10 mod q
   fp ← 0; ft ← 0
   for i ← 0 to m-1              // preprocessing
      fp ← (10*fp + P[i]) mod q
      ft ← (10*ft + T[i]) mod q
   for s ← 0 to n – m            // matching
      if fp = ft then            // run a loop to compare strings
         if P[0..m-1] = T[s..s+m-1] return s
      if s < n – m then          // no next window after the last shift
         ft ← ((ft – T[s]*c)*10 + T[s+m]) mod q
   return –1
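The pseudocode translates to Python roughly as follows. This is a sketch under assumptions not in the slides: characters are mapped to digits with `ord`, the radix `d = 256` and modulus `q = 997` are illustrative defaults, and the function name is hypothetical.

```python
def rabin_karp_search(text: str, pattern: str, q: int = 997, d: int = 256) -> int:
    """Return the first 0-based shift where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    c = pow(d, m - 1, q)                      # d^(m-1) mod q
    fp = ft = 0
    for i in range(m):                        # preprocessing
        fp = (d * fp + ord(pattern[i])) % q
        ft = (d * ft + ord(text[i])) % q
    for s in range(n - m + 1):                # matching
        # Equal fingerprints are only a hint; confirm with a string comparison.
        if fp == ft and text[s:s + m] == pattern:
            return s
        if s < n - m:                         # step to the next window
            ft = ((ft - ord(text[s]) * c) * d + ord(text[s + m])) % q
    return -1
```

The explicit string comparison keeps the result correct for any choice of q; a poor q only costs time through spurious hits.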
Analysis
■ If q is a prime, the hash function distributes m-digit
strings evenly among the q values
❑ Thus, on average only about one in every q shifts s will result in matching
fingerprints (each of which requires comparing the strings with O(m)
comparisons)
■ Expected running time (if q > m):
❑ Preprocessing: O(m)
❑ Outer loop: O(n-m)
❑ All inner loops: ((n-m)/q)*m = O(n-m), since q > m
❑ Total time: O(m) + O(n-m) = O(n)
■ Worst-case running time: O(nm)
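The difference between true matches and spurious hits can be made concrete by counting both. The sketch below assumes decimal digit strings with m ≤ n; the function name `count_hits` and the test strings are illustrative.

```python
def count_hits(text: str, pattern: str, q: int, d: int = 10):
    """Return (true matches, spurious hits) over all shifts of pattern in text."""
    n, m = len(text), len(pattern)
    c = pow(d, m - 1, q)
    fp = ft = 0
    for i in range(m):
        fp = (d * fp + int(pattern[i])) % q
        ft = (d * ft + int(text[i])) % q
    true_hits = spurious = 0
    for s in range(n - m + 1):
        if fp == ft:                          # fingerprints agree
            if text[s:s + m] == pattern:      # O(m) verification
                true_hits += 1
            else:
                spurious += 1
        if s < n - m:
            ft = ((ft - int(text[s]) * c) * d + int(text[s + m])) % q
    return true_hits, spurious
```

With q = 7, the text "25316446766" and pattern "2531" give one true match and one spurious hit (6766 mod 7 = 4 = 2531 mod 7); with q = 997 the spurious hit disappears. A text such as "0000000" with pattern "000" matches at every shift, which is the O(nm) worst case when all occurrences are verified.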
Rabin-Karp in Practice
■ If the alphabet has d characters, interpret
characters as radix-d digits (replace 10 with d in
the algorithm).
■ Choosing prime q > m can be done with
randomized algorithms in O(m), or q can be fixed
to be the largest prime so that 10*q fits in a
computer word.
RABIN-KARP – EXAMPLE #2
Rabin-Karp
T = TACGTAGCTAGTCGA     Q = 997
Codes: A:0; C:1; G:2; T:3
P = CTAG
Fingerprint(P) = code(G) + 4*code(A) + 16*code(T) + 64*code(C)
= 2 + 4*0 + 16*3 + 64*1 = 114
Fingerprint(T) = code(G) + 4*code(C) + 16*code(A) + 64*code(T)
= 2 + 4*1 + 16*0 + 64*3
= 198
F(P) != F(T) skip
Rabin-Karp
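T = TACGTAGCTAGTCGA     Q = 997     c = 4^3 = 64
Codes: A:0; C:1; G:2; T:3
(Every fingerprint in this example stays below Q, so the mod Q step is omitted.)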
P = CTAG
F (P) = 114
F (T) = 198
i=2
Fingerprint(T) = (F(T) – T[1]*c) * 4 + T[5]
= (198 – code(T)*64) * 4 + code(T)
= (198 – 3*64) * 4 + 3
= 6*4 + 3 = 27
F(P) != F(T) skip
Rabin-Karp
T = TACGTAGCTAGTCGA     Q = 997
Codes: A:0; C:1; G:2; T:3
P= CTAG
F (P) = 114
F (T) = 27
i=3
Fingerprint(T) = (F(T) – T[2]*c) * 4 + T[6]
= (27– code(A)*64) * 4 + code(A)
= (27 – 0*64) * 4 + 0
= 27 * 4 = 108
F(P) != F(T) skip
Rabin-Karp
T = TACGTAGCTAGTCGA     Q = 997
Codes: A:0; C:1; G:2; T:3
P= CTAG
F (P) = 114
F (T) = 108
i=4
Fingerprint(T) = (F(T) – T[3]*c) * 4 + T[7]
= (108– code(C)*64) * 4 + code(G)
= (108 – 1*64) * 4 + 2
= 44 * 4 + 2 = 178
F(P) != F(T) skip
Rabin-Karp
T = TACGTAGCTAGTCGA     Q = 997
Codes: A:0; C:1; G:2; T:3
P= CTAG
F (P) = 114
F (T) = 178
i=5
Fingerprint(T) = (F(T) – T[4]*c) * 4 + T[8]
= (178– code(G)*64) * 4 + code(C)
= (178 – 2*64) * 4 + 1
= 50 * 4 + 1 = 201
F(P) != F(T) skip
Rabin-Karp
T = TACGTAGCTAGTCGA     Q = 997
Codes: A:0; C:1; G:2; T:3
P= CTAG
F (P) = 114
F (T) = 201
i=6
Fingerprint(T) = (F(T) – T[5]*c) * 4 + T[9]
= (201– code(T)*64) * 4 + code(T)
= (201– 3*64) * 4 + 3
= 9 * 4 + 3 = 39
F(P) != F(T) skip
Rabin-Karp
T = TACGTAGCTAGTCGA     Q = 997
Codes: A:0; C:1; G:2; T:3
P= CTAG
F (P) = 114
F (T) = 39
i=7
Fingerprint(T) = (F(T) – T[6]*c) * 4 + T[10]
= (39– code(A)*64) * 4 + code(A)
= (39– 0*64) * 4 + 0
= 39 * 4 + 0 = 156
F(P) != F(T) skip
Rabin-Karp
T = TACGTAGCTAGTCGA     Q = 997
Codes: A:0; C:1; G:2; T:3
P= CTAG
F (P) = 114
F (T) = 156
i=8
Fingerprint(T) = (F(T) – T[7]*c) * 4 + T[11]
= (156– code(G)*64) * 4 + code(G)
= (156– 2*64) * 4 + 2
= 28 * 4 + 2 = 114
F(P) = F(T) CHECK!!!
CTAG = CTAG
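The whole walk-through can be reproduced with a short Python sketch. The function name `dna_fingerprints` is illustrative; the base codes, radix 4 and Q = 997 are taken from the example, and shifts are 0-based in the code while the slides count windows from i = 1.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def dna_fingerprints(text: str, m: int, q: int = 997, d: int = 4) -> list:
    """Fingerprints of every length-m window of text, computed by rolling."""
    c = pow(d, m - 1, q)                  # 4^3 = 64 for m = 4
    f = 0
    for ch in text[:m]:                   # first window, Horner's rule
        f = (d * f + CODE[ch]) % q
    out = [f]
    for s in range(len(text) - m):        # roll through the remaining windows
        f = ((f - CODE[text[s]] * c) * d + CODE[text[s + m]]) % q
        out.append(f)
    return out

windows = dna_fingerprints("TACGTAGCTAGTCGA", 4)
target = dna_fingerprints("CTAG", 4)[0]
first_hit = windows.index(target)         # 0-based shift of the first candidate
```

The first eight window fingerprints are 198, 27, 108, 178, 201, 39, 156, 114, and the first candidate is shift 7 (window i = 8 in the slides), where the strings are then compared directly.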
FINITE AUTOMATA
Searching in n comparisons
■ The goal: each character of the text is
compared only once!
■ Problem with the naïve algorithm:
❑ Forgets what was learned from a partial match!
❑ Examples:
■ T = “Tweedledee and Tweedledum” and
P = “Tweedledum”
■ T = “pappappappar” and P = “pappar”
Finite automaton search
[Figure: state diagram of the automaton for P = ababaca, states 0 to 7]

Transition table (rows: state; columns: input character; P: next pattern character)
state   a  b  c   P
0       1  0  0   a
1       1  2  0   b
2       3  0  0   a
3       1  4  0   b
4       5  0  0   a
5       1  4  6   c
6       7  0  0   a
7       1  2  0

Trace on T = abababacaba
i            --  1  2  3  4  5  6  7  8  9  10  11
T[i]         --  a  b  a  b  a  b  a  c  a  b   a
state φ(i)   0   1  2  3  4  5  4  5  6  7  2   3

Processing time takes Θ(n).
But we first have to construct the FA.
Main issue: how to construct the FA?
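Scanning with a ready-made transition table takes one lookup per text character. The sketch below hard-codes the table for P = ababaca from this slide; the names `DELTA`, `fa_trace` and `fa_search` are illustrative.

```python
# DELTA[state][character] -> next state, for P = ababaca over {a, b, c}
DELTA = [
    {"a": 1, "b": 0, "c": 0},
    {"a": 1, "b": 2, "c": 0},
    {"a": 3, "b": 0, "c": 0},
    {"a": 1, "b": 4, "c": 0},
    {"a": 5, "b": 0, "c": 0},
    {"a": 1, "b": 4, "c": 6},
    {"a": 7, "b": 0, "c": 0},
    {"a": 1, "b": 2, "c": 0},
]

def fa_trace(text: str) -> list:
    """State after each prefix of text, starting from state 0."""
    states = [0]
    for ch in text:
        states.append(DELTA[states[-1]][ch])
    return states

def fa_search(text: str, m: int = 7) -> int:
    """0-based start of the first match, or -1; m is the accepting state."""
    state = 0
    for j, ch in enumerate(text):
        state = DELTA[state][ch]
        if state == m:
            return j - m + 1
    return -1
```

Running `fa_trace("abababacaba")` reproduces the φ(i) row of the trace, and `fa_search` reports the match that ends at the ninth character.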
Need some Notation …
φ(w) = state FA ends up in after processing w.
Example: φ(abab) = 4.
σ(x) = max{k : P_k is a suffix of x}, where P_k = P[1..k] is the prefix of P of length k. Called the suffix function.
Examples: Let P = ab.
σ(ε) = 0
σ(ccaca) = 1
σ(ccab) = 2
Note: If |P| = m, then σ(x) = m indicates a match.
[Figure: the state sequence while scanning T = a b a b b a b b a c …, starting in state 0; each time the state reaches m, a match is reported.]
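The suffix function can be computed directly from its definition. This is a brute-force sketch for illustration only; the function name `sigma` follows the slide's symbol, and the pattern is passed explicitly.

```python
def sigma(x: str, pattern: str) -> int:
    """Length of the longest prefix of pattern that is a suffix of x."""
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):   # k = 0 always succeeds (empty prefix)
            return k
    return 0
```

With P = ab this gives σ(ε) = 0, σ(ccaca) = 1 and σ(ccab) = 2, matching the examples above.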
FA Construction
Given: P[1..m]. Let Q = states = {0, 1, …, m}, where 0 is the initial state and m is the final state.
Define transition function δ as follows:
δ(q, a) = σ(P_q a) for each q and a.
Example: δ(5, b) = σ(P_5 b) (P = ababaca)
= σ(ababab)
= 4
Intuition: Encountering a ‘b’ in state 5 means the current substring
doesn’t match. But, you know this substring ends with “abab” -- and
this is the longest suffix that matches the beginning of P. Thus, we
go to state 4 and continue processing “abab…” .
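The definition δ(q, a) = σ(P_q a) can be turned into a table-building routine directly. The sketch below is a naive construction that recomputes σ for every state and character; it illustrates the definition and is not the fastest known way to build the table. The names `sigma` and `build_delta` are illustrative.

```python
def sigma(x: str, pattern: str) -> int:
    """Length of the longest prefix of pattern that is a suffix of x."""
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):
            return k
    return 0

def build_delta(pattern: str, alphabet: str) -> list:
    """delta[q][a] = sigma(P_q + a) for every state q and character a."""
    return [
        {a: sigma(pattern[:q] + a, pattern) for a in alphabet}
        for q in range(len(pattern) + 1)
    ]

delta = build_delta("ababaca", "abc")
rows = [(r["a"], r["b"], r["c"]) for r in delta]
```

For P = ababaca, `delta[5]["b"]` evaluates to 4, as in the example above.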
P=ababaca
m = 7; Q = {0,1,2,3,4,5,6,7}
[Figure: states 0 to 7 joined by forward edges labeled a, b, a, b, a, c, a; state 0 loops to itself on b, c.]
Prefixes: a, ab, aba, abab, ababa, ababac, ababaca
P=ababaca
[Figure: automaton under construction; the transition δ(1, a) = 1 is added.]
Prefixes: a, ab, aba, abab, ababa, ababac, ababaca
δ(1, a) = σ(P_1 a) = σ(aa) = σ(a) = 1
P=ababaca
[Figure: automaton under construction; the transitions δ(1, a) = 1 and δ(1, c) = 0 are added.]
Prefixes: a, ab, aba, abab, ababa, ababac, ababaca
δ(1, a) = σ(P_1 a) = σ(aa) = σ(a) = 1
δ(1, c) = σ(P_1 c) = σ(ac) = 0
P=ababaca
[Figure: automaton under construction; the transitions δ(2, b) = 0 and δ(2, c) = 0 are added.]
Prefixes: a, ab, aba, abab, ababa, ababac, ababaca
δ(2, b) = σ(P_2 b) = σ(abb) = 0
δ(2, c) = σ(P_2 c) = σ(abc) = 0
P=ababaca (fast forward & simplified)
[Figure: automaton under construction; the transitions δ(5, a) = 1 and δ(5, b) = 4 are added.]
Prefixes: a, ab, aba, abab, ababa, ababac, ababaca
δ(5, a) = σ(P_5 a) = σ(ababaa) = σ(a) = 1
δ(5, b) = σ(P_5 b) = σ(ababab) = σ(abab) = 4
P=ababaca (final, simplified)
[Figure: the complete automaton for P = ababaca (simplified drawing).]
Search
[Figure: the automaton for P = ababaca, stepped through the text one character at a time.]
T = abababacaba
Search
[Figure: the automaton for P = ababaca after reaching state 7.]
T = abababacaba
Accept state, we are done
Analysis of FA
■ Searching: O(n) → good
■ Preprocessing: O(m|Σ|) → bad
■ Memory: O(m|Σ|) → bad
BITAP ALGORITHM: SHIFT/AND
The Shift-And Method
■ Define M to be a binary n by m matrix (in this section n = |P| and m = |T|) such that:
M(i,j) = 1 iff the first i characters of P exactly match
the i characters of T ending at character j.
M(i,j) = 1 iff P[1 .. i] ≡ T[j-i+1 .. j]
The Shift-And Method
■ Let T = california
■ Let P = for
        j:  1  2  3  4  5  6  7  8  9  10   (m = 10)
     T(j):  c  a  l  i  f  o  r  n  i  a
M =  i=1    0  0  0  0  1  0  0  0  0  0
     i=2    0  0  0  0  0  1  0  0  0  0
     i=3    0  0  0  0  0  0  1  0  0  0
■ M(i,j) = 1 iff the first i characters of P exactly
match the i characters of T ending at character j.
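The matrix M can be built straight from its definition, which is useful as a reference when checking a faster construction. The sketch below is deliberately brute force; the function name `match_matrix` is illustrative, and rows and columns are returned as 0-based Python lists for the 1-based indices i and j.

```python
def match_matrix(text: str, pattern: str) -> list:
    """M[i-1][j-1] = 1 iff P[1..i] equals T[j-i+1..j] (1-based indices)."""
    n, m = len(pattern), len(text)
    return [
        [1 if j >= i and text[j - i:j] == pattern[:i] else 0
         for j in range(1, m + 1)]
        for i in range(1, n + 1)
    ]

M = match_matrix("california", "for")
```

For T = california and P = for, the only 1 entries are M(1,5), M(2,6) and M(3,7); the 1 in the last row marks the end of an occurrence of P.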
How to construct M
■ We will construct M column by column.
■ Two definitions:
■ Bit-Shift(j-1) is the vector derived by shifting the
vector for column j-1 down by one and setting the first
bit to 1.
■ Example: if column j-1 is (0, 1, 0, 0, 0), then Bit-Shift(j-1) is (1, 0, 1, 0, 0).
How to construct M
■ We define the n-length binary vector U(x) for each
character x in the alphabet. U(x) is set to 1 for the
positions in P where character x appears.
■ Example:
P = abaac
U(a) = 10110, U(b) = 01000, U(c) = 00001
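The U vectors follow mechanically from the pattern. A minimal sketch, with the illustrative name `u_vectors`, that returns one 0/1 list per character occurring in P:

```python
def u_vectors(pattern: str) -> dict:
    """U[x][i] = 1 iff character x appears at position i+1 of the pattern."""
    return {
        x: [1 if ch == x else 0 for ch in pattern]
        for x in sorted(set(pattern))
    }

U = u_vectors("abaac")
```

Characters that do not occur in P have an all-zero vector, so a caller can fall back to `[0] * len(pattern)` for them.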
How to construct M
■ Initialize column 0 of M to all zeros
■ For j ≥ 1, column j is obtained by M(j) = Bit-Shift(j-1) AND U(T(j))
An example j = 1
T = xabxabaaca (positions 1..10)
P = abaac (positions 1..5)
     init  1
1     0    0
2     0    0
3     0    0
4     0    0
5     0    0
An example j = 2
T = xabxabaaca (positions 1..10)
P = abaac (positions 1..5)
     init  1  2
1     0    0  1
2     0    0  0
3     0    0  0
4     0    0  0
5     0    0  0
An example j = 3
T = xabxabaaca (positions 1..10)
P = abaac (positions 1..5)
     init  1  2  3
1     0    0  1  0
2     0    0  0  1
3     0    0  0  0
4     0    0  0  0
5     0    0  0  0
An example j = 8
T = xabxabaaca (positions 1..10)
P = abaac (positions 1..5)
     init  1  2  3  4  5  6  7  8
1     0    0  1  0  0  1  0  1  1
2     0    0  0  1  0  0  1  0  0
3     0    0  0  0  0  0  0  1  0
4     0    0  0  0  0  0  0  0  1
5     0    0  0  0  0  0  0  0  0
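The column-by-column construction can be sketched with plain Python lists, one list per column. The names `bit_shift` and `shift_and_columns` are illustrative; column 0 is the all-zero initial column.

```python
def bit_shift(col: list) -> list:
    """Shift the column down by one position and set the first bit to 1."""
    return [1] + col[:-1]

def shift_and_columns(text: str, pattern: str) -> list:
    """Columns 0..m of M, each obtained as Bit-Shift(previous) AND U(T(j))."""
    col = [0] * len(pattern)
    cols = [col]
    for ch in text:
        u = [1 if p == ch else 0 for p in pattern]       # U(ch)
        col = [s & b for s, b in zip(bit_shift(col), u)]
        cols.append(col)
    return cols

cols = shift_and_columns("xabxabaaca", "abaac")
```

Here `cols[8]` is the column for j = 8 in the example, and a 1 in the last position of a column marks the end of an occurrence of P.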
Correctness
■ For i > 1, Entry M(i,j) = 1 iff
1) The first i-1 characters of P match the i-1
characters of T ending at character j-1.
2) Character P(i) ≡ T(j).
■ 1) is true when M(i-1,j-1) = 1.
■ 2) is true when the ith bit of U(T(j)) = 1.
■ The algorithm computes the and of these two bits.
Correctness
T = xabxabaaca (positions 1..10)
P = abaac
     init  1  2  3  4  5  6  7  8  9  10
1     0    0  1  0  0  1  0  1  1  0  1
2     0    0  0  1  0  0  1  0  0  0  0
3     0    0  0  0  0  0  0  1  0  0  0
4     0    0  0  0  0  0  0  0  1  0  0
5     0    0  0  0  0  0  0  0  0  1  0
■ M(4,8) = 1, this is because a b a a is a prefix of P of length 4
that ends at position 8 in T.
■ Condition 1) – We had a b a as a prefix of length 3 that ended
at position 7 in T ↔ M(3,7) = 1.
■ Condition 2) – The fourth character of P equals the eighth character of T ↔ The
fourth bit of U(T(8)) = 1.
How much did we pay?
■ Formally the running time is Θ(mn).
■ However, the method is very efficient if the pattern length |P| fits in
a single or a few computer words.
■ Furthermore, only two columns of M are needed at
any given time. Hence, the space used by the
algorithm is O(|P|).
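When a column is packed into one integer, the shift and the AND each become a single machine operation. The sketch below uses Python integers as bit vectors (bit i-1 stands for row i) and keeps only the current column; the function name `shift_and_search` is illustrative. Python integers are unbounded, so this shows the logic rather than the word-size limit.

```python
def shift_and_search(text: str, pattern: str) -> list:
    """Return the 0-based start positions of all occurrences of pattern."""
    n = len(pattern)
    if n == 0:
        return []
    masks = {}                                  # U(x) packed into an integer
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (n - 1)                       # bit of the last row
    state = 0                                   # column 0: all zeros
    starts = []
    for j, ch in enumerate(text):
        # Bit-Shift (shift and set the first bit), then AND with U(T(j)).
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            starts.append(j - n + 1)
    return starts
```

Only `state` and the per-character masks are stored, which matches the observation that two columns of M suffice.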
Extension to Bitap
■ Supports edits (approximate rather than only exact matching)
■ Supports backtrack
■ Low-power, processing-in-memory design