0% found this document useful (0 votes)
53 views31 pages

Notes 05 Parallel String Matching

The document discusses parallelizing the string matching problem by dividing it into subproblems that can be solved independently, such as checking guesses in different blocks of the text string simultaneously or using "duels" between guesses based on periodicity witnesses to narrow the search space. It introduces techniques like overlapping blocks, computing periodicity witnesses, and dueling guesses to potentially solve the string matching problem more efficiently in parallel.

Uploaded by

hussein hammoud
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
Download as pdf or txt
0% found this document useful (0 votes)
53 views31 pages

Notes 05 Parallel String Matching

The document discusses parallelizing the string matching problem by dividing it into subproblems that can be solved independently, such as checking guesses in different blocks of the text string simultaneously or using "duels" between guesses based on periodicity witnesses to narrow the search space. It introduces techniques like overlapping blocks, computing periodicity witnesses, and dueling guesses to potentially solve the string matching problem more efficiently in parallel.

Uploaded by

hussein hammoud
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
Download as pdf or txt
Download as pdf or txt
You are on page 1/ 31

5 Parallel

String Matching
Sebastian Wild
2 March 2020

version 2020-03-03 14:48


Outline

5 Parallel String Matching


5.1 Elementary Tricks
5.2 Periodicity
5.3 String Matching by Duels
Parallelizing string matching
� We have seen a plethora of string matching methods

� But all efficient methods seem inherently sequential


Indeed, they became efficient only after building on knowledge from previous steps!

Sounds like the opposite of parallel!

� This unit:
� How well can we parallelize string matching?
� What new ideas can help?

Here: string matching = find all occurrences of 𝑃 in 𝑇 (more natural problem for parallel)
always assume 𝑚 ≤ 𝑛

1
5.1 Elementary Tricks
Embarrassingly Parallel
� A problem is called “embarrassingly parallel”
if it can immediately be split into many, small subtasks
that can be solved completely independently of each other
� Typical example: sum of two large matrices (all entries independent)

� best case for parallel computation (simply assign each processor one subtask)

� Sorting is not embarrassingly parallel


� no obvious way to define many small (=efficiently solvable) subproblems
� but: some subtasks of our algorithms are, e. g., comparing all elements with pivot

2
Clicker Question

Is the string-matching problem “embarrassingly parallel”?


A Yes

B No

C Only for 𝑛 � 𝑚

D Only for 𝑛 ≈ 𝑚

pingo.upb.de/622222
3
Elementary parallel string matching
Subproblems in string matching:
� string matching = check all guesses 𝑖 = 0, . . . , 𝑛 − 𝑚 − 1
� checking one guess is a subtask!

4
Elementary parallel string matching
Subproblems in string matching:
� string matching = check all guesses 𝑖 = 0, . . . , 𝑛 − 𝑚 − 1
� checking one guess is a subtask!

Approach 1:
� Check all guesses in parallel
� Time: Θ(𝑚)
� Work: Θ((𝑛 − 𝑚)𝑚) � not great . . .

4
Elementary parallel string matching
Subproblems in string matching:
� string matching = check all guesses 𝑖 = 0, . . . , 𝑛 − 𝑚 − 1
� checking one guess is a subtask!

Approach 1:
� Check all guesses in parallel
� Time: Θ(𝑚)
� Work: Θ((𝑛 − 𝑚)𝑚) � not great . . .

Approach 2:
� Divide 𝑇 into overlapping blocks of 2𝑚 characters:
𝑇[0..2𝑚), 𝑇[𝑚..3𝑚), 𝑇[2𝑚..4𝑚), 𝑇[3𝑚..5𝑚). . .
� Find matches inside blocks in parallel, using efficient sequential method
� Θ(2𝑚 + 𝑚) = Θ(𝑚) each
� Time: Θ(𝑚) Work: Θ( 𝑚𝑛 · 𝑚) = Θ(𝑛)
4
Clicker Question

Is the string-matching problem “embarrassingly parallel”?


A Yes

B No

C Only for 𝑛 � 𝑚

D Only for 𝑛 ≈ 𝑚

pingo.upb.de/622222
5
Clicker Question

Is the string-matching problem “embarrassingly parallel”?


A Yes

B No

C Only for 𝑛 � 𝑚 �
D Only for 𝑛 ≈ 𝑚

pingo.upb.de/622222
5
Elementary parallel matching – Discussion
very simple methods

could even run distributed with access to part of 𝑇

parallel speedup only for 𝑚 � 𝑛

Goal:
� methods with better parallel time! � higher speedup

� must genuinely parallelize the matching process! (and the preprocessing of the pattern)

� need new ideas

6
5.2 Periodicity
Periodicity of Strings
� 𝑆 = 𝑆[0..𝑛 − 1] has period 𝑝 iff ∀𝑖 ∈ [0..𝑛 − 𝑝) : 𝑆[𝑖] = 𝑆[𝑖 + 𝑝]
� 𝑝 = 0 and any 𝑝 ≥ 𝑛 are trivial periods but these are not very interesting . . .

Examples:
� 𝑆 = baaababaaab has period 6:

𝑆 b a a a b a b a a a b

=
=
=
=
=
𝑝=6
𝑆 b a a a b a b a a a b

� 𝑆 = abaabaabaaba has period 3:

𝑆 a b a a b a a b a a b a
=
=
=
=
=
=
=
=
𝑝=3 =
𝑆 a b a a b a a b a a b a

7
Periodicity and KMP
Lemma 5.1 (Periodicity = Longest Overlap)
𝑝 ∈ [1..𝑛] is the shortest period in 𝑆 = 𝑆[0..𝑛 − 1]
iff 𝑆[0..𝑛 − 𝑝) is the longest prefix that is also a suffix of 𝑆[𝑝..𝑛). �

8
Periodicity and KMP
Lemma 5.1 (Periodicity = Longest Overlap)
𝑝 ∈ [1..𝑛] is the shortest period in 𝑆 = 𝑆[0..𝑛 − 1]
iff 𝑆[0..𝑛 − 𝑝) is the longest prefix that is also a suffix of 𝑆[𝑝..𝑛). �

𝑆[0..𝑛 − 1] has minimal period 𝑝 ⇐⇒ fail[𝑛] = 𝑛 − 𝑝

0 1 2 3 4 5 6 7 8 9 10 11

𝑆 a b a a b a a b a a b a
=
=
=
=
=
=
=
=
𝑝=3 =
𝑆 a b a a b a a b a a b a
fail[𝑛] = 9

8
Periodicity Lemma
Lemma 5.2 (Periodicity Lemma)
If string 𝑆 = 𝑆[0..𝑛 − 1] has periods 𝑝 and 𝑞 with 𝑝 + 𝑞 ≤ 𝑛,
then it has also period gcd(𝑝, 𝑞). �
greatest common divisor

Proof: see tutorials; hint: recall Euclid’s algorithm

9
Periodic strings
� What does the smallest period 𝑝 tell us about a string 𝑆[0..𝑛 − 1]?
� Two distinct regimes:
1. 𝑆 is periodic: 𝑝 ≤ 𝑛2
More precisely: 𝑆 is totally determined by a string 𝐹 = 𝐹[0..𝑝 − 1] = 𝑆[0..𝑝 − 1]
𝑆 keeps repeating 𝐹 until 𝑛 characters are filled

� 𝑆 is highly repetitive!

2. 𝑆 is aperiodic (also non-periodic): 𝑝 > 𝑛2


𝑆 cannot be written as 𝑆 = 𝐹 𝑘 · 𝑌 with 𝑘 ≥ 2 and 𝑌 a prefix of 𝐹

10
Clicker Question

Is 𝑆 = aaaaaaaaaaab a periodic string?

A Yes

B No

pingo.upb.de/622222
11
Clicker Question

Is 𝑆 = aaaaaaaaaaab a periodic string?

A Yes

B No �
� “looking repetitive” is not enough for periodic!

pingo.upb.de/622222
11
5.3 String Matching by Duels
Periods and Matching
Witnesses for non-periodicity:
� Assume, 𝑃[0..𝑚 − 1] does not have period 𝑝

� ∃ witness against periodicity: position 𝜔 ∈ [0..𝑚 − 𝑝) : 𝑃[𝜔] ≠ 𝑃[𝜔 + 𝑝]

12
Periods and Matching
Witnesses for non-periodicity:
� Assume, 𝑃[0..𝑚 − 1] does not have period 𝑝

� ∃ witness against periodicity: position 𝜔 ∈ [0..𝑚 − 𝑝) : 𝑃[𝜔] ≠ 𝑃[𝜔 + 𝑝]

Dueling via witnesses:


� If 𝑃[0..𝑚 − 1] does not have period 𝑝, then
at most one of positions 𝑖 and 𝑖 + 𝑝 can be (the starting position of) an occurrence of 𝑃.

Proof: Cannot have 𝑇[(𝑖 + 𝑝) + 𝜔] = 𝑃[𝜔] ≠ 𝑃[𝜔 + 𝑝] = 𝑇[𝑖 + (𝜔 + 𝑝)].

12
Periods and Matching
Witnesses for non-periodicity:
� Assume, 𝑃[0..𝑚 − 1] does not have period 𝑝

� ∃ witness against periodicity: position 𝜔 ∈ [0..𝑚 − 𝑝) : 𝑃[𝜔] ≠ 𝑃[𝜔 + 𝑝]

Dueling via witnesses:


� If 𝑃[0..𝑚 − 1] does not have period 𝑝, then
at most one of positions 𝑖 and 𝑖 + 𝑝 can be (the starting position of) an occurrence of 𝑃.

Proof: Cannot have 𝑇[(𝑖 + 𝑝) + 𝜔] = 𝑃[𝜔] ≠ 𝑃[𝜔 + 𝑝] = 𝑇[𝑖 + (𝜔 + 𝑝)].

� Duel between guess 𝑖 and 𝑖 + 𝑝:


compare text character overlapped with witness 𝜔

12
String Matching by Duels – Sequential
Assume that pattern 𝑃 is aperiodic. (can deal with periodic case separately; details omitted)

Algorithm:
1. Set 𝜇 := � 𝑚2 �

2. Compute witnesses 𝜔[1..𝜇] against periodicity for all 𝑝 ≤ 2.


𝑚

3. For each block of 𝜇 consecutive indices [0..𝜇), [𝜇..2𝜇), [2𝜇..3𝜇), . . .


run 𝜇 − 1 duels to eliminate all but one guesses in the block

4. check remaining � 𝜇𝑛 � = 𝑂(𝑛/𝑚) guesses naively

13
String Matching by Duels – Sequential
Assume that pattern 𝑃 is aperiodic. (can deal with periodic case separately; details omitted)

Algorithm: Analysis:
1. Set 𝜇 := � 𝑚2 � 1. 𝑂(1)

2. Compute witnesses 𝜔[1..𝜇] against periodicity for all 𝑝 ≤ 2.


𝑚
2. 𝑂(𝑚) � later

3. For each block of 𝜇 consecutive indices [0..𝜇), [𝜇..2𝜇), [2𝜇..3𝜇), . . . 3. 𝑂( 𝑚𝑛 ) blocks


run 𝜇 − 1 duels to eliminate all but one guesses in the block 𝑂(𝑚) duels each

4. check remaining � 𝜇𝑛 � = 𝑂(𝑛/𝑚) guesses naively 4. 𝑂( 𝑚𝑛 ),


≤ 𝑚 cmps each

� another worst-case 𝑂(𝑛 + 𝑚) string matching method!

13
String Matching by Duels – Parallel
Assume that pattern 𝑃 is aperiodic. (can deal with periodic case separately; details omitted)

Algorithm:
1. Set 𝜇 := � 𝑚
2�

2. Compute witnesses 𝜔[1..𝜇] against periodicity for all 𝑝 ≤ 𝑚


2.

3. For each block of 𝜇 consecutive indices [0..𝜇), [𝜇..2𝜇), [2𝜇..3𝜇), . . .


run 𝜇 − 1 duels to eliminate all but one guesses in the block

4. check remaining � 𝜇𝑛 � = 𝑂(𝑛/𝑚) guesses naively

14
String Matching by Duels – Parallel
Assume that pattern 𝑃 is aperiodic. (can deal with periodic case separately; details omitted)

Algorithm: How to parallelize:


1. Set 𝜇 := � 𝑚
2� 1. —

2. Compute witnesses 𝜔[1..𝜇] against periodicity for all 𝑝 ≤ 𝑚


2. 2. 𝑂(log2 (𝑚)) � later

3. For each block of 𝜇 consecutive indices [0..𝜇), [𝜇..2𝜇), [2𝜇..3𝜇), . . . 3. blocks in parallel (indep.),
run 𝜇 − 1 duels to eliminate all but one guesses in the block tournament of �lg 𝜇� rounds

4. check remaining � 𝜇𝑛 � = 𝑂(𝑛/𝑚) guesses naively 4. check in parallel


collect result (like prefix sum)
Tournament of duals:
� each dual eliminates one guess
� declare other guess winner
� conceptually like prefix sum! 0 1 2 3 4 5 6 7

14
String Matching by Duels – Parallel
Assume that pattern 𝑃 is aperiodic. (can deal with periodic case separately; details omitted)

Algorithm: How to parallelize:


1. Set 𝜇 := � 𝑚
2� 1. —

2. Compute witnesses 𝜔[1..𝜇] against periodicity for all 𝑝 ≤ 𝑚


2. 2. 𝑂(log2 (𝑚)) � later

3. For each block of 𝜇 consecutive indices [0..𝜇), [𝜇..2𝜇), [2𝜇..3𝜇), . . . 3. blocks in parallel (indep.),
run 𝜇 − 1 duels to eliminate all but one guesses in the block tournament of �lg 𝜇� rounds

4. check remaining � 𝜇𝑛 � = 𝑂(𝑛/𝑚) guesses naively 4. check in parallel


collect result (like prefix sum)
Tournament of duals:
� each dual eliminates one guess
� declare other guess winner
� conceptually like prefix sum! 0 1 2 3 4 5 6 7

� Matching part can be done in 𝑂(log 𝑚) parallel time and 𝑂(𝑛) work!
14
Computing witnesses
It remains to find the witnesses 𝜔[1..𝜇].

sequentially:
� an elementary procedure is similar in spirit to KMP failure array

� can be computed in Θ(𝑚) time

parallel:
� much more complicated � beyond scope of the module
� first 𝑂(log2 (𝑚)) time on CREW-RAM
� later 𝑂(log 𝑚) time and 𝑂(𝑚) work using pseudoperiod method

15
Parallel Matching – State of the art
� 𝑂(log 𝑚) time & work-efficient parallel string matching
� this is optimal for CREW-PRAM

� on CRCW-PRAM: matching part even in 𝑂(1) time(!)


but preprocessing requires Θ(log log 𝑚) time

16

You might also like