A Parallel Bloom Filter String Searching Algorithm
I. INTRODUCTION
String searching algorithms are a fundamental
method of finding an element, or a set of elements with
specific properties, among a collection of elements. Since the
advent of computing and digital storage technology, the
capacity to store substantial amounts of data, from which
individual records are to be retrieved, identified or located,
has paved the way for the design and development of various
string searching algorithms.
Rudimentary linear string searching algorithms were
initially applied because of their simple design, at the expense
of poor searching performance for large datasets. Binary string
searching algorithms substantially improved searching
performance, albeit requiring the dataset to be sorted, which
reduces their overall efficiency [1]. Hash tables were later
introduced, using associative arrays as a structure that
maps keys to dataset values, which proves to be a good choice
when the keys are sparsely distributed [2]. A hash table is also chosen
when the primary concern is the speed of access to the data.
However, sufficient memory is needed to store the hash table.
This leads to limitations on dataset size, as a larger hash table is
required to cater for a larger dataset.
978-1-4799-0285-9/13/$31.00 2013 IEEE
One approach to address the limitations of the hash table
would be the utilization of the Bloom filter string searching
algorithm as a memory-efficient data structure, which rapidly
indicates whether an element is present in a dataset [3]. In
spite of improved speed and efficiency, a serial Bloom filter
design, which runs on a single logical processor, would still
incur longer processing time for substantial increases in
dataset size.
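As a rough illustration of the data structure described above, the following is a minimal serial Bloom filter sketch in Python. The class name, the method names, and the salted SHA-256 hashing are assumptions made for this example only; they are not the hash functions or the implementation used in this paper.

```python
import hashlib


class BloomFilter:
    """Minimal serial Bloom filter: an m-bit table probed by k hash functions."""

    def __init__(self, m_bits: int, k_hashes: int) -> None:
        self.m = m_bits
        self.k = k_hashes
        self.table = bytearray((m_bits + 7) // 8)  # bit table, 8 bits per byte

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash with the function index
        # (an illustrative choice, not the paper's hash functions).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        # Set the k bits selected for this item.
        for pos in self._positions(item):
            self.table[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(
            self.table[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A query that returns False is guaranteed to be absent, whereas a query that returns True may be a false positive, which is the price paid for the compact bit table.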
Advances in computing technology have brought
significant developments in multi-core processing
technology, in line with the progression of parallel computing
solutions. This in turn has fueled the development of
many-core processing technology, such as the incorporation of
generic stream processing units into a Graphics Processing
Unit (GPU), which yields generalized computing devices with
large numbers of processing cores; hence the term General
Purpose computing on GPU (GPGPU). The deployment of the
Compute Unified Device Architecture (CUDA) parallel
computing platform by NVIDIA for GPGPUs has further
increased the adoption of parallel computing solutions using a
many-core architecture. A GPU offers high-throughput
performance with little overhead between threads. By
performing uniform operations on independent data, massive
data parallelism in an application can be achieved.
However, a serial Bloom filter algorithm cannot
leverage the advancements of either the multi-core or
many-core architectures to improve searching performance
for large datasets. Hence, this paper investigates the performance of
a serial Bloom filter algorithm for large datasets, to justify
the need for a parallel architecture in accelerating the string
searching process. To achieve an effective parallel string
searching design, a parallel Bloom filter algorithm using
software application threads on a multi-core architecture is
first investigated as a benchmark, which exhibits improved
performance speedups against a serial Bloom filter. To further
improve the speedup, a parallel Bloom filter algorithm using
the CUDA parallel computing platform is proposed to support
batch-oriented string search operations in a data-intensive high
performance system. The proposed algorithm segments the
string list into blocks of words and threads when generating the
bit table, which is used during the string searching process.
This method maximizes the computational performance and
sustains consistent string searching results.
2013 IEEE Conference on Open Systems (ICOS), December 2 - 4, 2013, Sarawak, Malaysia
An existing CUDA implementation of the Bloom filter
string searching algorithm by Costa et al. [4] limits the false
positive probability (fpp) to 10^-4 and the number of strings
tested for both the insertion and searching processes to 1
million. In this paper, however, the proposed parallel Bloom
filter algorithm using CUDA removes the limit on fpp and
instead computes the fpp based on the number of strings
involved in the insertion process. In addition, the number of
strings tested is increased to 10 million for a more accurate
assessment of the performance speedup.
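To make the dependence of the fpp on the number of inserted strings concrete, the sketch below evaluates the standard textbook approximation fpp ≈ (1 - e^(-kn/m))^k for n inserted strings, an m-bit table and k hash functions. This approximation and the function name are assumptions for illustration; the excerpt does not state which expression the proposed algorithm actually uses.

```python
import math


def false_positive_probability(n: int, m: int, k: int) -> float:
    """Textbook approximation of a Bloom filter's false positive probability.

    n: number of inserted strings, m: bit-table size in bits,
    k: number of hash functions. Assumed formula, not taken from the paper.
    """
    return (1.0 - math.exp(-k * n / m)) ** k
```

Under this approximation, the fpp grows as n increases for a fixed m and k, which is why fixing the fpp in advance constrains the table size for a given number of strings.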
The paper is organized as follows. Section II investigates
the performance impact of a serial Bloom filter string
searching algorithm. Section III describes the parallel Bloom
filter algorithm on a multi-core architecture. Section IV
proposes the parallel Bloom filter algorithm using the CUDA
parallel computing platform on many-core architecture.
Section V concludes this paper.
[Figure 3 listing, partially recoverable: m ← |M|; bit_mask ← masking array (8 elements); T ← set of application threads; t_p ← p-th thread of T, p = 0, 1, 2, ..., |T| - 1; update step M[bit_index / 8] |= bit_mask[bit].]
Figure 3. Parallel Bloom filter insertion on a multicore architecture.
Fig. 4 illustrates the execution model of the parallel Bloom
filter searching process using OpenMP. The parallel Bloom
filter searching algorithm in Figure 5 adopts an
implementation similar to that of the serial Bloom filter
searching discussed in the aforementioned section. The
atomic operation applied in the insertion process is not
required here, as the application threads only read the bit
table M from memory, leaving the contents of M
unchanged. As in the parallel filter insertion
process, |T| application threads concurrently execute the
searching process based on an n / |T| workload distribution,
with the results of the search operation stored in memory.
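The n / |T| workload distribution for the read-only search can be sketched as follows. This is an illustrative Python analogue using a thread pool, not the OpenMP implementation evaluated in this paper; the function names and the salted SHA-256 hashing are assumptions. In CPython the global interpreter lock prevents a real speedup for this CPU-bound loop, so the sketch only demonstrates the partitioning and the absence of locking on reads.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def bit_positions(item, m, k):
    # k bit positions from a salted hash (illustrative, not the paper's hashes).
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % m
        for i in range(k)
    ]


def build_table(items, m, k):
    # Serial insertion, used here only to prepare a bit table M to search.
    table = bytearray((m + 7) // 8)
    for item in items:
        for pos in bit_positions(item, m, k):
            table[pos // 8] |= 1 << (pos % 8)
    return table


def search_chunk(table, queries, m, k):
    # Threads only read the table, so no atomic operation or lock is needed.
    return [
        all(table[p // 8] & (1 << (p % 8)) for p in bit_positions(q, m, k))
        for q in queries
    ]


def parallel_search(table, queries, m, k, num_threads=8):
    # Split the n queries into roughly n / |T| strings per thread.
    n = len(queries)
    chunk = max(1, (n + num_threads - 1) // num_threads)
    chunks = [queries[i:i + chunk] for i in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = list(pool.map(lambda c: search_chunk(table, c, m, k), chunks))
    # Flatten per-thread results back into query order.
    return [hit for part in parts for hit in part]
```

Because each thread writes only to its own result list and reads the shared table, the results are identical to those of a serial search over the same queries.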
[Algorithm listing, partially recoverable: k ← number of hash functions; M ← filter data table array; m ← |M|; bit_mask ← masking array (8 elements) for collision avoidance.]
Figure 5. Performance analysis of a parallel Bloom filter algorithm on a
multicore architecture against a serial Bloom filter for an increasing n.
Replicating the test setup in Section II, a performance
analysis was carried out to evaluate the effectiveness of the
multi-core parallel Bloom filter algorithm in comparison with
the serial Bloom filter algorithm. Fig. 5 illustrates the average
operation time (i.e., for both insertion and searching) of the
parallel Bloom filter algorithm implemented using OpenMP
with |T| = 8 in comparison with the serial Bloom filter
algorithm. Fig. 5 also plots the performance
speedup (green circle), showing the scale of the operation time improvement of
the multi-core parallel Bloom filter algorithm over the serial
Bloom filter algorithm. The results in Fig. 5
show marked speedups of the multi-core parallel Bloom
filter algorithm against the serial Bloom filter algorithm. A
notable improvement in computational performance is visible
for n = 10,000,000, with the multi-core parallel algorithm
registering a 109 second computational time against a 360
second computational time for the serial algorithm, which
yields a 3.3x performance speedup. In addition, a steep
increase in performance speedup is initially observed as n
increases from 1000 to 100,000 strings, after which the speedup
remains consistent for n > 500,000 strings. This consistent
speedup is attributed to saturation of the logical
processors (i.e., |T| = 8) when processing large
values of n. Further speedups are attainable if |T| increases
along with the number of available logical processors.
This motivates leveraging the many-core
compute capabilities of a GPU for general purpose
computing, which will be explored in the following section.
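The speedup quoted above is the ratio of the serial time to the parallel time. A small sketch of that arithmetic, using the two timings reported for n = 10,000,000 (the function name is an assumption for illustration):

```python
def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Performance speedup as serial time divided by parallel time."""
    return serial_seconds / parallel_seconds


# Timings reported in the text for n = 10,000,000 strings.
multicore_speedup = speedup(360.0, 109.0)
print(f"{multicore_speedup:.1f}x")  # prints 3.3x
```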
[Figure 6 diagram; recoverable labels: String 49999; SM 7; Bloom Filter Bits Array; Queued blocks; Block (21,0,0) (String 21); Block (22,0,0) (String 22); Block (23,0,0) (String 23); Thread (99,0,0) (Hash function number 500).]
Figure 6. Execution Model for filter insertion (for one batch of data with batch
size of 50,000) using CUDA
[Figure 8 plot; series: Serial Bloom filter, Parallel Bloom filter using CUDA; left axis 0 to 400,000; right axis 0.00 to 5.20; x-axis: Number of strings, n (1.00E+03 to 1.00E+07).]
Figure 8. Performance analysis of a parallel Bloom filter algorithm using
CUDA against a serial Bloom filter for an increasing n.
The results in Fig. 8 also show increased speedups of the
proposed parallel Bloom filter algorithm using CUDA against
the serial Bloom filter algorithm. For n = 10,000,000, the
proposed parallel algorithm on CUDA registered a 69 second
computational time, which translates into an improved
performance speedup of 5.5x against the serial Bloom filter
algorithm.
Although the proposed algorithm exploits shared
memory within blocks to reduce the
computational time, the M bit-table remained in GPU
global memory, because updates to the M bit-table
must be visible to all threads. Consequently, the performance
of the algorithm was affected by the high latency of
continuous GPU global memory access, in addition to the
latency caused by data transfer between host and device
memory. In spite of this limitation, the batch size
concept used in the proposed design exhibited improved
performance speedups against a serial Bloom filter algorithm.
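The batch concept can be sketched as a sequential Python emulation. This is not GPU code: the mapping of one block per string and one thread per hash function is an assumption read from the labels of Figure 6, the salted SHA-256 hashing is a stand-in for the paper's hash functions, and all names are hypothetical.

```python
import hashlib

BATCH_SIZE = 50_000  # batch size quoted in the Figure 6 caption


def hash_position(item: str, hash_index: int, m: int) -> int:
    # Illustrative salted hash; not the hash functions used in the paper.
    digest = hashlib.sha256(f"{hash_index}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def insert_in_batches(strings, m, k, batch_size=BATCH_SIZE):
    table = bytearray((m + 7) // 8)  # stands in for M in GPU global memory
    for start in range(0, len(strings), batch_size):
        # One batch would correspond to one host-to-device transfer.
        batch = strings[start:start + batch_size]
        for block_idx, item in enumerate(batch):  # one block per string (assumed)
            for thread_idx in range(k):  # one thread per hash function (assumed)
                pos = hash_position(item, thread_idx, m)
                # On a real GPU this shared update would need an atomic OR.
                table[pos // 8] |= 1 << (pos % 8)
    return table
```

Since setting bits is order-independent, the resulting table does not depend on the batch size; the batch size only governs how much data would be transferred and processed per launch.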
[Figure 9 plot; series: Parallel Bloom filter using OpenMP, Parallel Bloom filter using CUDA, Performance Speedup; left axis 0 to 120,000; right axis 0.00 to 1.60; x-axis: Number of strings, n (1.00E+03 to 1.00E+07).]
Figure 9. Performance analysis of a parallel Bloom filter algorithm using
CUDA against a parallel Bloom filter on a multicore architecture.
To further evaluate the proposed
parallel Bloom filter algorithm using CUDA against the
benchmarked multi-core parallel Bloom filter algorithm, Fig. 9
compares the computational performance of these two
algorithms. Fig. 9 also plots the performance speedup
between the benchmarked and proposed parallel Bloom filter
algorithms (green circle). The results in Fig. 9
show noticeable performance improvements of the
proposed parallel Bloom filter algorithm using CUDA over
the multi-core version using OpenMP, with a peak
performance speedup of 1.65x for n = 10,000,000 strings.
V. CONCLUSION
In this paper, the underlying architecture of a serial Bloom
filter was analyzed to identify the performance impact for
large datasets. A parallel multi-core Bloom filter algorithm
using software application threads was implemented as a
benchmark. Despite exhibiting performance speedups of up to
3.3x for n = 10,000,000 against a serial Bloom filter
algorithm, the limited number of logical processors (|T| = 8)
on a multi-core architecture constrained the speedup levels. As
such, to improve the speedup, this paper proposed a parallel
many-core Bloom filter algorithm using the CUDA parallel
computing platform, based on batch string
processing. The proposed algorithm improved the performance
speedup to 5.5x against a serial Bloom filter algorithm.
However, latency in data transfer between host and device
memory, and the placement of M within GPU
global memory, limited the performance speedup.
Optimization of the data transfer between host and
device memory should be given additional attention in future
work to further improve the performance speedup.
VI. ACKNOWLEDGMENT
This research was done under the Joint Lab of "NVIDIA - HP
- MIMOS GPU R&D and Solution Center", which was
established in October 2012. Funding for the work came from
MOSTI, Malaysia.
REFERENCES
[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, "Introduction
to algorithms," 3rd ed., McGraw-Hill Education, 2001, pp. 253-255.
[2] I. Chai and J. D. White, "Structuring data and building algorithms,"
Updated edition, McGraw-Hill Education, 2009, pp. 263-275.
[3] C. Zhiwang, X. Junggang, and S. Jian, "A multi-layer bloom filter for
duplicated URL detection," in the 3rd Intl. Conf. on Advanced Computer
Theory and Engineering 2010, pp. 586-591, China, August 2010.
[4] L. B. Costa, S. Al-Kiswany, and M. Ripeanu, "GPU support for batch
oriented workload," in IEEE 28th Intl. Perf. Comp. and Comm. Conf., pp.
231-238, USA, December 2009.
[5] A. Natarajan, S. Subramanian, and K. Premalatha, "A comparative study
of cuckoo search and bat algorithm for bloom filter optimisation in spam
filtering," Intl. J. of Bio-Inspired Computation, vol. 4, no. 2, pp. 89-99,
2012.
[6] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal, "The bloomier filter: an
efficient data structure for static support lookup tables," in Proc. of the
Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp.
30-39, USA, January 2004.
[7] N. S. Artan, K. Sinkar, J. Patel, and H. J. Chao, "Aggregated bloom
filters for intrusion detection and prevention hardware," in the IEEE
Global Telecommunications Conf., pp. 349-354, USA, November 2007.
[8] D. Eppstein and M. T. Goodrich, "Space-efficient straggler identification
in round-trip data streams via newton's identities and invertible bloom
filters," LNCS, vol. 4619, pp. 637-648, Aug. 2007.
[9] A. Partow (2012, April 3). General-purpose hash function algorithms.
[Online]. Available:
http://www.partow.net/programming/hashfunctions/index.html.
[10] Y. Liu, L. J. Guo, J. B. Li, M. R. Ren, and K. Q. Li, "Parallel algorithms
for approximate string matching with k mismatches on CUDA," in IEEE
26th Intl. Parallel and Distributed Processing Symposium Workshop &
PhD Forum, pp. 2414-2422, China, May 2012.
[11] T-Y. Liang, Y-W. Chang, and H-F. Li, "A CUDA programming toolkit
on grids," Intl. J. of Grid and Utility Comp., vol. 3, no. 2-3, pp. 97-111,
July 2012.