
Data Compression

Cleophas Mochoge
INTE 412

In computer science and information theory, data compression, source coding, or bit-rate reduction involves encoding information using fewer bits than the original representation. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost in lossless compression. Lossy compression reduces bits by identifying marginally important information and removing it.

The process of reducing the size of a data file is popularly referred to as data compression, although its formal name is source coding (coding done at the source of the data, before it is stored or transmitted). Compression is useful because it reduces resource usage, such as data storage space or transmission capacity. Because compressed data must be decompressed before it can be used, this extra processing imposes computational or other costs.

Lossless
• Lossless data compression algorithms
usually exploit statistical redundancy to
represent data more concisely without
losing information. Lossless compression is
possible because most real-world data has
statistical redundancy.

For example, an image may have areas
of colour that do not change over several
pixels; instead of coding "red pixel, red
pixel, ..." the data may be encoded as
"279 red pixels". This is a simple example
of run-length encoding; there are many
schemes to reduce size by eliminating
redundancy.

Lossy
Lossy data compression is contrasted with lossless data compression. In these schemes, some loss of information is acceptable. Depending upon the application, detail can be dropped from the data to save storage space. Generally, lossy data compression schemes are guided by research on how people perceive the data in question.

For example, the human eye is more sensitive to subtle variations in luminance than it is to variations in colour. JPEG image compression works in part by "rounding off" less important visual information. There is a corresponding trade-off between the information lost and the size reduction. A number of popular compression formats exploit these perceptual differences, including those used in music files, images, and video.

Introduction

 Compression is used to reduce the volume of information that must be stored, and to reduce the communication bandwidth required for its transmission over networks.

Compression Principles
 Entropy Encoding
 Run-length encoding
 Lossless and independent of the type of source information
 Used when the source information comprises long substrings of the same character or binary digit
 The data are sent as (string or bit pattern, # of occurrences) pairs, as in FAX
 e.g.) 000000011111111110000011…
 (0,7) (1,10) (0,5) (1,2) … → 7, 10, 5, 2 …

Compression Principles

 Entropy Encoding
 Statistical encoding
 Based on the probability of occurrence of a pattern
 The more probable the pattern, the shorter its codeword
 “Prefix property”: a shorter codeword must not form the start of a longer codeword

Huffman Coding
Huffman coding is a type of entropy coding in which characters are encoded according to their probability of occurrence. The Huffman coding algorithm determines the optimal code, i.e. the one using the minimum average number of bits. The length (number of bits) of the coded characters differs from character to character.

To determine a Huffman code, it is useful to construct a binary tree. The leaves of the tree represent the characters that are to be encoded. Every node contains its probability of occurrence, and 0 and 1 are assigned to the branches of the tree. Every character has an associated weight equal to the number of times it occurs in the data stream.

Stream of characters
• P(A) = 0.16
• P(B) = 0.51
• P(C) = 0.09
• P(D) = 0.13
• P(E) = 0.11

E.g.) symbols A, B, C, D, E with probabilities A(0.16), B(0.51), C(0.09), D(0.13), E(0.11)

Building the Huffman tree for these probabilities gives codeword lengths Ni of 1 bit for B and 3 bits for each of A, C, D and E, so

H’ = Σ_{i=1..5} Ni Pi = (1 × 0.51) + 3 × (0.16 + 0.09 + 0.13 + 0.11) = 1.98 bits/codeword

H = -Σ_{i=1..5} Pi log2(Pi) = -((0.16 log2 0.16) + (0.51 log2 0.51) + (0.09 log2 0.09) + (0.13 log2 0.13) + (0.11 log2 0.11)) ≈ 1.96

E = H/H’ ≈ 99 %
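
These figures can be checked numerically. The sketch below (not from the slides) assumes the codeword lengths produced by the Huffman procedure for these probabilities, namely 1 bit for B and 3 bits for each of the other symbols:

```python
from math import log2

probs = {"A": 0.16, "B": 0.51, "C": 0.09, "D": 0.13, "E": 0.11}
lengths = {"A": 3, "B": 1, "C": 3, "D": 3, "E": 3}   # assumed Huffman codeword lengths

H = -sum(p * log2(p) for p in probs.values())        # entropy: theoretical minimum
H_prime = sum(lengths[s] * probs[s] for s in probs)  # average bits per codeword
print(round(H, 2), round(H_prime, 2), f"{H / H_prime:.0%}")   # 1.96 1.98 99%
```
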
Compression Principles

 Huffman Encoding
 Entropy, H: theoretical minimum average number of bits required to transmit a particular stream

H = -Σ_{i=1..n} Pi log2(Pi)

where n = # of symbols and Pi = probability of symbol i

 Efficiency, E = H/H’
where H’ = average # of bits per codeword = Σ_{i=1..n} Ni Pi
and Ni = # of bits of symbol i

 E.g.) symbols M(10), F(11), Y(010), N(011), 0(000), 1(001) with probabilities 0.25, 0.25, 0.125, 0.125, 0.125, 0.125

 H’ = Σ_{i=1..6} Ni Pi = 2 × (2 × 0.25) + 4 × (3 × 0.125) = 2.5 bits/codeword
 H = -Σ_{i=1..6} Pi log2(Pi) = -(2 × (0.25 log2 0.25) + 4 × (0.125 log2 0.125)) = 2.5
 E = H/H’ = 100 %
 3 bits/codeword would be needed if fixed-length codewords were used for the six symbols

Huffman Algorithm

Method of construction for an encoding tree


• Full Binary Tree Representation
• Each edge of the tree has a value
(0 for the left branch, 1 for the right branch)
• Data is at the leaves, not internal nodes
• Result: encoding tree
• “Variable-Length Encoding”

Huffman Algorithm

• 1. Maintain a forest of trees
• 2. Weight of a tree = sum of the frequencies of its leaves
• 3. Repeat N - 1 times (for N symbols):
– Select the two smallest-weight trees
– Merge them to form a new tree

• Huffman coding

• a variable-length code whose codeword length is roughly inversely proportional to the character’s frequency
• must satisfy the prefix property (no codeword may be the prefix of another) to be uniquely decodable
• two-pass algorithm
– the first pass accumulates the character frequencies and generates the codebook
– the second pass performs the compression with the codebook

Huffman coding

• create codes by constructing a binary tree

1. consider all characters as free nodes
2. assign the two free nodes with the lowest frequencies to a parent node whose weight equals the sum of their frequencies
3. remove the two free nodes and add the newly created parent node to the list of free nodes
4. repeat steps 2 and 3 until only one free node is left; it becomes the root of the tree
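
A minimal Python sketch of steps 1 to 4 (illustrative only, not from the slides): it keeps the free nodes in a min-heap and then walks the finished tree to read off the codes. The exact bit assignments depend on how ties are broken.

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code table from a {symbol: frequency} mapping."""
    # Step 1: every character starts as a free node (weight, tie-breaker, tree).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    # Steps 2-4: repeatedly merge the two lowest-weight free nodes.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (left, right)))
        next_id += 1
    # The last free node is the root; 0 labels left branches, 1 labels right branches.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"R": 19, "K": 17, "G": 14, "B": 7, "C": 4, "M": 2, "Y": 1}))
```

For the colour frequencies used in the example a few slides ahead, this yields codeword lengths of 2, 2, 2, 3, 4, 5 and 5 bits, consistent with the code table shown there.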

• Right branch of the binary tree: 1
• Left branch of the binary tree: 0
• Prefix violation (example)
– e: “01”, b: “010”
– “01” is a prefix of “010”, so the bits “010” would be read as “e” followed by a stray “0”; the code is not uniquely decodable
• equal frequencies: break ties consistently (always left or always right)
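
A small illustrative sketch (with hypothetical codes, not from the slides) of why the prefix property makes decoding unambiguous: the decoder simply accumulates bits until they match a codeword, which can never be the start of a longer one.

```python
def decode(bits, codebook):
    """Greedily decode a bit string using a prefix-free codebook {symbol: code}."""
    lookup = {code: sym for sym, code in codebook.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:      # safe: no codeword is a prefix of another
            out.append(lookup[current])
            current = ""
    return "".join(out)

codebook = {"e": "0", "b": "10", "c": "11"}   # hypothetical prefix-free codes
print(decode("0100110", codebook))            # -> "ebece"
```

With the invalid assignment above (e = “01”, b = “010”), the same loop would emit “e” after two bits and then be left with a stray “0”, which is exactly the ambiguity the prefix rule prevents.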

• Example (64 colour values)
• R K K K K K K K
• K K K R R K K K
• K K R R R R G G
• K K B C C C R R
• G G G M C B R R
• B B B M Y B B R
• G G G G G G G R
• G R R R R G R R

Colour   Frequency   Huffman code
=================================
R            19       00
K            17       01
G            14       10
B             7       110
C             4       1110
M             2       11110
Y             1       11111
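
With these codes, the 64 values need 19×2 + 17×2 + 14×2 + 7×3 + 4×4 + 2×5 + 1×5 = 152 bits, compared with 64 × 3 = 192 bits for a fixed 3-bit code covering the seven colours, a saving of roughly 21 %.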

Static Huffman Coding
 Huffman (Code) Tree
 Given: a number of symbols (or characters) and their relative probabilities, known in advance
 The codes must hold the “prefix property”

Symbol   Occurrence
A        4/8
B        2/8
C        1/8
D        1/8

[Huffman tree: the root node (weight 8) has branch 1 to leaf A and branch 0 to a branch node (weight 4); that node has branch 1 to leaf B and branch 0 to a branch node (weight 2), whose branches 1 and 0 lead to leaves C and D]

Symbol   Code
A        1
B        01
C        001
D        000

4×1 + 2×2 + 1×3 + 1×3 = 14 bits are required to transmit “AAAABBCD”
Prefix Property!

The end
Thank you
