Data Compression
Cleophas Mochoge
INTE 412
1
In computer science and information theory, data
compression, source coding, or bit-rate
reduction involves encoding information using
fewer bits than the original representation.
Compression can be either lossy or lossless.
Lossless compression reduces bits by identifying
and eliminating statistical redundancy. No
information is lost in lossless compression.
Lossy compression reduces bits by identifying
marginally important information and removing it.
2
The process of reducing the size of a data file is
popularly referred to as data compression,
although its formal name is source coding
(coding done at the source of the data, before it
is stored or transmitted). Compression is useful
because it helps reduce resource usage, such
as data storage space or transmission capacity.
Because compressed data must be decompressed
before it can be used, this extra processing
imposes computational or other costs.
3
Lossless
• Lossless data compression algorithms
usually exploit statistical redundancy to
represent data more concisely without
losing information. Lossless compression is
possible because most real-world data has
statistical redundancy.
4
For example, an image may have areas
of colour that do not change over several
pixels; instead of coding "red pixel, red
pixel, ..." the data may be encoded as
"279 red pixels". This is a simple example
of run-length encoding; there are many
schemes to reduce size by eliminating
redundancy.
5
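To make the idea concrete, here is a minimal run-length encoding sketch. Python and the function name rle_encode are assumptions for illustration, not part of the original slides.

def rle_encode(data):
    """Collapse runs of repeated symbols into (symbol, count) pairs."""
    runs = []
    for symbol in data:
        if runs and runs[-1][0] == symbol:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([symbol, 1])  # start a new run
    return [(s, c) for s, c in runs]

# "red pixel, red pixel, ..." collapses to a single (symbol, count) pair
print(rle_encode(["red"] * 279))      # [('red', 279)]
print(rle_encode("AAABBC"))           # [('A', 3), ('B', 2), ('C', 1)]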
Lossy
Lossy data compression is contrasted
with lossless data compression. In these
schemes, some loss of information is
acceptable. Depending upon the
application, detail can be dropped from
the data to save storage space. Generally,
lossy data compression schemes are guided by
research on how people perceive the data in
question.
6
For example, the human eye is more sensitive
to subtle variations in luminance than it is to
variations in color. JPEG image compression
works in part by "rounding off" less-important
visual information. There is a corresponding
trade-off between the information lost and the
reduction in size. A number of popular
compression formats exploit these perceptual
differences, including those used in music
files, images, and video.
7
Introduction
8
9
Compression Principles
Entropy Encoding
Run-length encoding
Lossless & independent of the type of source
information
Used when the source information comprises
long substrings of the same character or
binary digit
Encoded as (string or bit pattern, # of
occurrences) pairs, as in FAX
e.g.) 000000011111111110000011……
→ 0,7 1,10 0,5 1,2…… or simply 7,10,5,2……
10
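As a rough sketch of the bit-oriented variant above (Python assumed; bit_run_lengths is an illustrative name), the runs alternate between 0 and 1, so only the first bit and the run counts need to be sent:

from itertools import groupby

def bit_run_lengths(bits):
    # Runs alternate between 0 and 1, so the first bit plus the
    # sequence of run lengths is enough to reconstruct the stream.
    return bits[:1], [sum(1 for _ in group) for _, group in groupby(bits)]

first_bit, runs = bit_run_lengths("000000011111111110000011")
print(first_bit, runs)   # 0 [7, 10, 5, 2]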
Compression Principles
Entropy Encoding
Statistical encoding
Based on the probability of occurrence of a
pattern
The more probable, the shorter codeword
“Prefix property”: a shorter codeword must not
form the start of a longer codeword
11
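The prefix property can be checked mechanically. The following is a small sketch (Python assumed; is_prefix_free is an illustrative name, not from the slides):

def is_prefix_free(codewords):
    # After sorting lexicographically, any prefix violation shows up
    # between adjacent entries, so checking neighbours is enough.
    codes = sorted(codewords)
    return all(not codes[i + 1].startswith(codes[i])
               for i in range(len(codes) - 1))

print(is_prefix_free(["1", "01", "001", "000"]))   # True  - valid prefix code
print(is_prefix_free(["01", "010"]))               # False - "01" starts "010"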
Huffman Coding
Huffman coding is one type of entropy
coding in which each character is encoded
according to the probability of its
occurrence. The Huffman coding
algorithm determines the optimal code
using the minimum number of bits. The
length (number of bits) of the coded
characters will differ.
12
To determine the Huffman code, it is useful to
construct a binary tree. The leaves (nodes) of
the tree represent the characters that are to be
encoded. Every node contains the probability of
occurrence; 0 and 1 are assigned to the
branches of the tree. Every character has an
associated weight equal to the number of times
the character occurs in the data stream.
13
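One common way to carry out this construction is with a min-heap, repeatedly merging the two least-probable nodes. The sketch below assumes Python and its heapq module; it illustrates the technique rather than reproducing the exact procedure pictured on the slides.

import heapq
from itertools import count

def huffman_codes(freqs):
    # Build Huffman codes from a {symbol: weight} map.
    tiebreak = count()   # keeps heap comparisons away from the node tuples
    heap = [(w, next(tiebreak), (sym, None, None)) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two least-probable nodes
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (None, left, right)))
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        sym, left, right = node
        if sym is not None:
            codes[sym] = prefix or "0"   # lone-symbol edge case
        else:
            walk(left, prefix + "0")     # 0 assigned to one branch
            walk(right, prefix + "1")    # 1 assigned to the other
    walk(root, "")
    return codes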
Stream of characters
• P(A) = 0.16
• P(B) = 0.51
• P(C) = 0.09
• P(D) = 0.13
• P(E) = 0.11
14
15
e.g.) symbols A, B, C, D, E with probabilities
A(0.16), B(0.51), C(0.09), D(0.13), E(0.11)
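Feeding these probabilities to the huffman_codes sketch given earlier (an assumption for illustration, not the slides' own worked tree) yields codeword lengths matching the expected Huffman result:

codes = huffman_codes({"A": 0.16, "B": 0.51, "C": 0.09, "D": 0.13, "E": 0.11})
print(codes)
# The exact 0/1 labels depend on tie-breaking, but B always gets a 1-bit
# codeword and A, C, D, E get 3-bit codewords, so the average length is
# 0.51*1 + 0.49*3 = 1.98 bits per symbol.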
Huffman Encoding
Entropy, H: theoretical min. avg. # of bits required
to transmit a particular stream
H = -Σ (i=1..n) Pi log2(Pi)
Efficiency, E = H / H'
where H' = avg. # of bits per codeword = Σ (i=1..n) Ni Pi
Ni: # of bits of symbol i
17
e.g.) symbols M(10), F(11), Y(010), N(011), 0(000),
1(001) with probabilities 0.25, 0.25, 0.125, 0.125,
0.125, 0.125
18
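A quick check of these formulas on the example above (plain Python; the variable names are illustrative):

import math

probs   = [0.25, 0.25, 0.125, 0.125, 0.125, 0.125]   # M, F, Y, N, 0, 1
lengths = [2, 2, 3, 3, 3, 3]                          # bits per codeword

H     = -sum(p * math.log2(p) for p in probs)         # entropy = 2.5 bits
H_bar = sum(p * n for p, n in zip(probs, lengths))    # avg length = 2.5 bits
print(H, H_bar, H / H_bar)                            # efficiency E = 1.0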
Huffman Algorithm
19
Huffman Algorithm
20
• Huffman coding
21
Huffman coding
22
• Right branch of the binary tree: 1
• Left branch of the binary tree: 0
• Prefix (example)
– e: "01", b: "010"
– "01" is a prefix of "010", so "010" would be
misread as "e" followed by "0"; the prefix
property is violated
• Same frequency: need a consistent rule for
assigning left or right
23
• Example (64 data values)
• R K K K K K K K
• K K K R R K K K
• K K R R R R G G
• K K B C C C R R
• G G G M C B R R
• B B B M Y B B R
• G G G G G G G R
• G R R R R G R R
24
• Color   Frequency   Huffman code
• =================================
• R          19          00
• K          17          01
• G          14          10
• B           7          110
• C           4          1110
• M           2          11110
• Y           1          11111
25
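As a quick sanity check of the gain, a sketch assuming a fixed-length code would need 3 bits per symbol for these 7 colors:

freqs = {"R": 19, "K": 17, "G": 14, "B": 7, "C": 4, "M": 2, "Y": 1}
codes = {"R": "00", "K": "01", "G": "10", "B": "110",
         "C": "1110", "M": "11110", "Y": "11111"}

huffman_bits = sum(freqs[s] * len(codes[s]) for s in freqs)   # 152 bits
fixed_bits   = sum(freqs.values()) * 3                        # 64 * 3 = 192 bits
print(huffman_bits, fixed_bits)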
26
Static Huffman Coding
Huffman (Code) Tree
Given: a number of symbols (or characters) and their
relative probabilities in advance
Must hold the "prefix property" among codes

Symbol   Occurrence
A        4/8
B        2/8
C        1/8
D        1/8

[Huffman code tree: the root node (weight 8) branches
1 to leaf A and 0 to a branch node (weight 4), which
branches 1 to leaf B and 0 to a branch node (weight 2),
which branches 1 to leaf C and 0 to leaf D]

Symbol   Code
A        1
B        01
C        001
D        000

4x1 + 2x2 + 1x3 + 1x3 = 14 bits are required to
transmit "AAAABBCD"
Prefix property!
27
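A closing sketch using the code table above (Python assumed) to confirm the 14-bit total and show why the prefix property makes greedy decoding unambiguous:

codes = {"A": "1", "B": "01", "C": "001", "D": "000"}
encoded = "".join(codes[ch] for ch in "AAAABBCD")
print(encoded, len(encoded))          # 11110101001000 -> 14 bits

decode = {v: k for k, v in codes.items()}
out, buf = [], ""
for bit in encoded:                   # prefix-free, so read greedily
    buf += bit
    if buf in decode:
        out.append(decode[buf])
        buf = ""
print("".join(out))                   # AAAABBCD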
The end
Thank you
28