Huffman Coding: Greedy Algorithm
Topics Covered
• Huffman Problem
• Problem Analysis
– Binary coding techniques
– Prefix codes
• Algorithm of Huffman Coding Problem
• Time Complexity
– Analysis of the greedy algorithm
• Conclusion
Using ASCII Code: Text Encoding
• Our objective is to develop a code that represents a given
text as compactly as possible.
• A standard encoding is ASCII, which represents every
character using 7 bits
Example
Represent “An English sentence” using ASCII code
– 1000001 (A) 1101110 (n) 0100000 ( ) 1000101 (E)
1101110 (n) 1100111 (g) 1101100 (l) 1101001 (i)
1110011 (s) 1101000 (h) 0100000 ( ) 1110011 (s)
1100101 (e) 1101110 (n) 1110100 (t) 1100101 (e)
1101110 (n) 1100011 (c) 1100101 (e)
= 133 bits ≈ 17 bytes
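As a quick check, the 7-bit ASCII count can be reproduced with a short sketch (Python, using only the standard `ord`/`format` built-ins):

```python
# Sketch: encode a phrase in 7-bit ASCII and count the bits.
phrase = "An English sentence"
codewords = [format(ord(ch), "07b") for ch in phrase]  # one 7-bit string per character
total_bits = sum(len(cw) for cw in codewords)
print(codewords[0], total_bits)  # first codeword is 1000001 (A); 19 chars * 7 = 133 bits
```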
Refinement in Text Encoding
• Now a better code is given by the following encoding:
‹space› = 000, A = 0010, E = 0011, s = 010,
c = 0110, g = 0111, h = 1000, i = 1001,
l = 1010, t = 1011, e = 110, n = 111
• Then we encode the phrase as
0010 (A) 111 (n) 000 ( ) 0011 (E) 111 (n) 0111 (g)
1010 (l) 1001 (i) 010 (s) 1000 (h) 000 ( ) 010 (s)
110 (e) 111 (n) 1011 (t) 110 (e) 111 (n) 0110 (c)
110 (e)
• This requires only 65 bits ≈ 9 bytes — a substantial improvement.
• The technique behind this improvement is Huffman
coding, which we discuss below.
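The 65-bit figure can be verified by encoding the phrase with the table above (a minimal sketch; the dictionary simply transcribes the code listed earlier):

```python
# Sketch: encode the same phrase with the variable-length code above.
code = {" ": "000", "A": "0010", "E": "0011", "s": "010",
        "c": "0110", "g": "0111", "h": "1000", "i": "1001",
        "l": "1010", "t": "1011", "e": "110", "n": "111"}
encoded = "".join(code[ch] for ch in "An English sentence")
print(len(encoded))  # 65 bits, down from 133
```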
Major Types of Binary Coding
There are many ways to represent a file of information.
Binary Character Code (or Code)
– each character is represented by a unique binary string.
• Fixed-Length Code
– If Σ = {0, 1}, then all possible combinations of
two-bit strings are
Σ × Σ = {00, 01, 10, 11}
– If there are at most four characters, two-bit
strings are enough
– If there are fewer than three characters, two-bit
strings are not economical (one bit would suffice)
Fixed Length Code
• Fixed-Length Code
– All possible combinations of three-bit strings:
Σ × Σ × Σ = {000, 001, 010, 011, 100, 101, 110, 111}
– If there are at most eight characters, three-bit
strings are enough
– If there are fewer than five characters, three-bit
strings are not economical; two-bit strings should
be considered instead
– If there are six characters, 3 bits are needed;
the following could be one representation:
a = 000, b = 001, c = 010,
d = 011, e = 100, f = 101
Variable Length Code
• Variable-Length Code
– better than a fixed-length code
– it gives short codewords to frequent
characters and
– long codewords to infrequent characters
• Assigning variable-length codes requires some care
• Before we use variable-length codes, we discuss
prefix codes, which let us assign such codes to a
given set of characters
Prefix Code (Variable Length Code)
• A prefix code is a code, typically a variable-length
code, with the “prefix property”
• The prefix property requires that no codeword is a
prefix of any other codeword in the set.
Examples
1. The code words {0, 10, 11} have the prefix property
2. A code consisting of {0, 1, 10, 11} does not,
because “1” is a prefix of both “10” and “11”.
Other names
• Prefix codes are also known as prefix-free codes,
prefix condition codes, comma-free codes, and
instantaneous codes.
Why prefix codes?
• Encoding is simple for any binary character code;
• Decoding is also easy for prefix codes, because
no codeword is a prefix of any other.
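The prefix property is easy to test mechanically. A small sketch (the function name `has_prefix_property` is ours, for illustration):

```python
def has_prefix_property(codewords):
    """True if no codeword is a proper prefix of another (a prefix-free code)."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

print(has_prefix_property(["0", "10", "11"]))       # True
print(has_prefix_property(["0", "1", "10", "11"]))  # False: "1" prefixes "10" and "11"
```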
Example 1
• If a = 0, b = 101, and c = 100 in a prefix code, then the
string 0101100 decodes unambiguously as 0·101·100, i.e., “abc”
Example 2
• In the code {0, 1, 10, 11}, a receiver reading “1” at
the start of a codeword would not know whether
– that was the complete codeword “1”, or
– a prefix of the codeword “10” or of “11”
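Because no codeword is a prefix of another, decoding can proceed greedily left to right, emitting a character the moment the accumulated bits match a codeword. A sketch using Example 1’s code (the helper name `decode_prefix` is ours):

```python
# Decode table for Example 1's prefix code: a = 0, b = 101, c = 100.
decode = {"0": "a", "101": "b", "100": "c"}

def decode_prefix(bits):
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in decode:        # unambiguous: no codeword is a prefix of another
            out.append(decode[buf])
            buf = ""
    return "".join(out)

print(decode_prefix("0101100"))  # "abc", i.e., 0 . 101 . 100
```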
Prefix codes and binary trees
• A prefix code can be represented by a binary tree:
label each left edge 0 and each right edge 1; each
codeword is the sequence of edge labels on the path
from the root to a leaf.
• Example (the code shown in the tree figure):
A = 00, B = 010, C = 0110, D = 0111, E = 10, F = 11
Huffman Codes
• In Huffman coding, a variable-length code is used
• The data are considered to be a sequence of characters.
• Huffman codes are a widely used and very
effective technique for compressing data
– Savings of 20% to 90% are typical, depending on the
characteristics of the data being compressed.
• Huffman’s greedy algorithm uses a table of the
frequencies of occurrence of the characters to
build up an optimal way of representing each
character as a binary string.
• Now let us see an example to understand the
concepts used in Huffman coding
Example: Huffman Codes
a b c d e f
Frequency (in thousands) 45 13 12 16 9 5
Fixed-length codeword 000 001 010 011 100 101
Variable-length codeword 0 101 100 111 1101 1100
• The variable-length code corresponds to the following
Huffman tree (node labels give frequencies in thousands):
the root (100) has a:45 on its 0-branch and an internal
node 55 on its 1-branch; 55 splits into 25 (with children
c:12 and b:13) and 30; 30 splits into an internal node 14
(with children f:5 and e:9) and d:16.
• The cost of a tree T is
B(T) = Σ_{c ∈ C} f(c) · d_T(c)
where f(c) is the frequency of character c and d_T(c) is
the depth of c’s leaf in T, i.e., the length of c’s codeword.
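The cost B(T) can be evaluated directly for the example table above, comparing the 3-bit fixed-length code against the variable-length codeword lengths (a minimal numeric check):

```python
# Cost B(T) = sum over characters c of f(c) * d_T(c), with d_T(c) = codeword length.
freq  = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}  # in thousands
fixed = {ch: 3 for ch in freq}                                # 3-bit fixed-length code
var   = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}      # variable-length lengths

def cost(depths):
    return sum(freq[ch] * depths[ch] for ch in freq)

print(cost(fixed), cost(var))  # 300 vs 224 thousand bits: about 25% saved
```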
Constructing Huffman Codes
• The algorithm maintains a min-priority queue Q keyed on
frequency. Initially Q: f:5 e:9 c:12 b:13 d:16 a:45

for i ← 1
Allocate a new node z
left[z] ← x ← Extract-Min (Q) = f:5
right[z] ← y ← Extract-Min (Q) = e:9
f [z] ← f [x] + f [y] (5 + 9 = 14)
Insert (Q, z)
Q: c:12 b:13 14 d:16 a:45

for i ← 2 — merge c:12 and b:13 into a node 25
Q: 14 d:16 25 a:45

for i ← 3 — merge 14 and d:16 into a node 30
Q: 25 30 a:45

for i ← 4 — merge 25 and 30 into a node 55
Q: a:45 55

for i ← 5 — merge a:45 and 55 into the root 100,
completing the tree
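The Extract-Min/Insert loop above maps naturally onto a binary heap. A compact sketch that tracks only codeword lengths (leaf depths), using Python’s standard `heapq` (the function name `huffman_code_lengths` is ours):

```python
import heapq

def huffman_code_lengths(freq):
    # Heap entries: (frequency, tiebreak id, {char: current depth}).
    heap = [(f, i, {ch: 0}) for i, (ch, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)                        # x <- Extract-Min(Q)
        fy, _, y = heapq.heappop(heap)                        # y <- Extract-Min(Q)
        merged = {ch: d + 1 for ch, d in {**x, **y}.items()}  # leaves sink one level
        heapq.heappush(heap, (fx + fy, counter, merged))      # f[z] = f[x] + f[y]
        counter += 1
    return heap[0][2]

lengths = huffman_code_lengths({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
print(lengths)  # a: 1, b: 3, c: 3, d: 3, e: 4, f: 4 -- matching the example tree
```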
Lemma 1: Greedy Choice
There exists an optimal prefix code such that the two
characters with smallest frequency are siblings and have
maximal depth in T.
Proof:
• Let x and y be two such characters, and let T be a
tree representing an optimal prefix code.
• Let a and b be two sibling leaves of maximal depth
in T, and assume without loss of generality
that f(x) ≤ f(y) and f(a) ≤ f(b).
• This implies that f(x) ≤ f(a) and f(y) ≤ f(b).
• Let T' be the tree obtained from T by exchanging
a with x and b with y, so that x and y become
sibling leaves of maximal depth.
Contd..
The cost difference between trees T and T' is