A Survey On Compression Algorithms in Hadoop
Volume: 2 Issue: 3
ISSN: 2321-8169
479-482
______________________________________________________________________________________________
Abstract- Nowadays, big data is a hot term in IT. It refers to very large volumes of data, which may be structured, unstructured, or semi-structured.
Each big data source has different characteristics, such as the frequency, volume, velocity, and veracity of the data. Reasons for the growth in
volume include the use of the Internet, smartphones, social networks, GPS devices, and so on. However, analyzing big data is a very challenging problem today.
Traditional data warehouse systems are not able to handle such large amounts of data. Because the size is so large, compression adds a clear benefit
when storing this data. This paper explains various compression techniques in Hadoop.
Keywords- bzip2, gzip, lzo, lz4, snappy
______________________________________________________*****___________________________________________________
I. INTRODUCTION
Hadoop supports several compression formats, including LZO, Gzip, Bzip2, LZ4, and Snappy.
A. LZO
The LZO compression format is composed of many small blocks of compressed data, which allows jobs to be split along block boundaries. The block size must be the same for compression and decompression. LZO is fast and splittable. It is a lossless data compression library written in ANSI C. LZO compresses at good speed, and decompression is very fast. Its source code and its compressed data format are designed to be portable across platforms. Characteristics of LZO are [4, 11]:
- Data compression is similar to other popular compression techniques, such as gzip and bzip2.
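As an illustration of how LZO is wired into a cluster, the following configuration fragment sketches how the codec can be registered in core-site.xml. This assumes the third-party hadoop-lzo library (which provides the com.hadoop.compression.lzo classes) is installed; property names and values should be checked against the Hadoop version in use.

```
<!-- core-site.xml (sketch, assuming hadoop-lzo is installed) -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

Note that .lzo files only become splittable after an index has been built for them (for example with the indexer tool shipped with hadoop-lzo), which is why the comparison table below lists LZO as splittable only if indexed.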
C. Bzip2
bzip2 [7, 8] is a freely available, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques. bzip2 compresses data in blocks of sizes between 100 and 900 kB. bzip2's performance is asymmetric: decompression is relatively fast. The current version is 1.0.6. It supports (limited) recovery from media errors: if you are trying to restore compressed data from a backup tape or disk and that data contains some errors, bzip2 may still be able to decompress those parts of the file which are undamaged. It is very portable and should run on any 32- or 64-bit machine with an ANSI C compiler. bzip2 compresses large files in blocks; the block size affects both the compression ratio achieved and the amount of memory needed for compression and decompression. The header of bzip2 data starts with the letters BZ.
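The block-size and header behaviour described above can be illustrated with Python's standard-library bz2 module (used here purely as an illustration; Hadoop itself uses its own BZip2Codec):

```python
import bz2

# Repetitive sample payload; block-sorting compressors like bzip2
# handle this kind of data very well.
data = b"hadoop compression survey " * 1000

# compresslevel selects the block size: level n uses n * 100 kB blocks,
# so 1 = 100 kB blocks and 9 = 900 kB blocks (more memory, better ratio).
small_blocks = bz2.compress(data, compresslevel=1)
large_blocks = bz2.compress(data, compresslevel=9)

# The stream header starts with the letters "BZ", as noted above.
print(small_blocks[:2])

# Decompression restores the original bytes exactly (lossless).
print(bz2.decompress(large_blocks) == data)
```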
D. LZ4
LZ4 is a lossless data compression algorithm that is focused on compression and decompression speed. It provides compression speeds of about 400 MB/s per core, with a fast decoder that reaches speeds of multiple GB/s per core [10].
Format    Tool    Algorithm    File extension    Splittable
Gzip      gzip    DEFLATE      .gz               No
bzip2     bzip2   bzip2        .bz2              Yes
LZO       lzop    LZO          .lzo              Yes, if indexed
Snappy    N/A     Snappy       .snappy           No

File            Size (GB)    Compression Time (s)    Decompression Time (s)
some_logs       8.0          -                       -
some_logs.gz    1.3          241                     72
some_logs.lzo   2.0          55                      35
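The kind of size/time trade-off shown in the table above can be reproduced on a small scale with Python's standard-library codecs (gzip and bz2 here; the figures in the table come from multi-gigabyte log files, so absolute numbers will differ):

```python
import bz2
import gzip
import time

# Synthetic "log" data: highly repetitive lines, as in real server logs.
data = b"127.0.0.1 - GET /index.html HTTP/1.1 200 512\n" * 50_000

results = {}
for name, compress in ((".gz", gzip.compress), (".bz2", bz2.compress)):
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    results[name] = (len(out), elapsed)
    # Report compressed size, ratio, and wall-clock compression time.
    print(f"{name}: {len(out):>8} bytes "
          f"({len(out) / len(data):.1%} of original) in {elapsed:.3f} s")
```

As in the table, the general pattern is that stronger codecs shrink the data further at the cost of more compression time, which is exactly the trade-off that drives codec choice in Hadoop.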
V. CONCLUSION
We are in the era of big data, which brings various challenges and issues. Large amounts of data are generated from various sources in structured, semi-structured, or unstructured form, and such data are scattered across the Internet. Hadoop supports various compression types and compression formats. Different compression algorithms have been discussed and summarized in this paper, and finally a comparison of the algorithms has been presented.
REFERENCES
[1] http://doc.mapr.com/display/MapR/Compression
[2] Dhruba Borthakur, "The Hadoop Distributed File System: Architecture and Design", pp. 4-5.
[3]
[4] http://porky.linuxjournal.com:8080/LJ/220/11186.html
[5] http://hbase.apache.org/book/gzip.compression.html
[6] http://www.gzip.org/
[7] http://comphadoop.weebly.com/
[8] http://www.bzip.org/
[9] http://code.google.com/p/snappy/
[10] http://code.google.com/p/lz4/
[11] http://www.oberhumer.com/opensource/lzo/
[12] http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
[13] Jean Yan, "Big Data, Bigger Opportunities", White Paper, April 9, 2013.
IJRITCC | March 2014, Available @ http://www.ijritcc.org