COMS3100/COMS7100

Introduction to Communications
Lecture 31: Source Coding
This lecture:
1. Information Theory.
2. Entropy.
3. Source Coding.
4. Huffman Coding.
Ref: CCR pp. 697–709, A Mathematical Theory of Communication.

Information Theory

Claude Shannon’s paper, A Mathematical Theory of Communication, showed that reliable communication is possible at non-zero rate.
- Shannon proposed the following model for point-to-point communication.

Information Theory (2)
- Shannon showed that there is a very widely applicable, quantitative definition of information.
- The source generates information at a certain rate.
- The noisy channel can be shown to have a capacity at which it can reliably transmit information.
- Reliable transmission is possible if and only if the source’s rate is not greater than the channel’s capacity.

The Mathematical Model

The source produces symbols from an alphabet.
- The alphabet is the (discrete) set of all possible symbols.
  E.g., the Roman alphabet, the Unicode character set or the range of possible pixel intensities from a camera sensor.
- The source produces these symbols in a discrete-time sequence.
- The channel has its own alphabet or, rather, alphabets: one at the input and one at the output.
  E.g., consider a channel including a polar NRZ modulator and demodulator, so the input and output alphabets are {−A, +A}.

The Mathematical Model (2)
- At each use of the channel, the channel output is random (because of noise) but dependent on the input.
- The rate at which the source generates symbols may be different to the rate at which the channel is used.
- Some mechanism is needed to translate between the source and the channel alphabets and back again.
- Shannon found that this can always be broken into two independent processes at the transmitter: source and channel coding.
  - Similarly, at the receiver, source and channel decoding.

The Mathematical Model (3)

- Without loss of generality, the common language of the source and channel (de)coder is a bitstream.
- Source symbol generation, intermediate bitstream and channel use may all be at different rates.

Information & Entropy
We measure information and entropy in terms of probability and random variables.

Self-Information
- In order to use random variables, map the source alphabet to the numbers {0, . . . , M − 1}, where M is the size of the alphabet.
- Let the source symbol selected for transmission at a certain time instant be represented as a discrete r.v. X.
- Let i represent one possible value for X and define p_i = P(X = i).

Self-Information
The amount of information (or surprise) at learning that X = i is log(1/p_i) = − log p_i.
- Hence, the rarer the event, i.e., the less probable, the more surprising and the more informative it is.
- Shannon calls this self-information.
- The base of the logarithm has not been specified, but it is usually taken to be 2, in which case the unit is bits.
  - If we take the natural logarithm, the unit is nats.
  - (Technically, self-information is dimensionless.)
- Measurement in bits is natural: if eight symbols are equally likely, it makes sense that they have 3 bits of information each.
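As a quick check of that last point, here is a minimal Python sketch (mine, not part of the original slides):

```python
import math

def self_information_bits(p):
    """Self-information ('surprise') of an outcome with probability p, in bits."""
    return -math.log2(p)

# Eight equally likely symbols: each carries log2(8) = 3 bits, as claimed above.
print(self_information_bits(1 / 8))   # 3.0

# Rarer events are more informative:
print(self_information_bits(0.01))    # ~6.64 bits
```
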
Entropy
The expected self-information is called the entropy of X:

    H(X) = E[log(1/p_X)] = − ∑_{i=0}^{M−1} p_i log p_i.

- Shannon used this name because of the similar expression that arises in statistical thermodynamics.
- It can be regarded as a measure of randomness or disorder.
- Degenerate r.v.s have zero entropy.
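A small Python sketch of this definition (my own illustration, not from the slides; the example distribution is the one used in the Huffman example later in the lecture):

```python
import math

def entropy_bits(pmf):
    """H(X) = -sum_i p_i log2 p_i, in bits. Terms with p_i = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy_bits([0.25, 0.25, 0.2, 0.15, 0.15]))  # ~2.29 bits
print(entropy_bits([1.0, 0.0, 0.0]))                # 0.0: a degenerate r.v.
```
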
Entropy (2)

- It can be shown that uniformly distributed r.v.s have the highest entropy for a given M, so that

    0 ≤ H(X) ≤ log M.

- We’ll use H_b(X) when we need to be explicit about using a logarithm to the base b.
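A quick numerical illustration of the bound and of the base-b notation (again a sketch, not from the slides):

```python
import math

def entropy(pmf, base=2):
    """Entropy H_b(X) of a pmf, using logarithms to the given base."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

uniform = [1 / 8] * 8
print(entropy(uniform))               # 3.0 = log2(8): uniform attains log M
print(entropy([0.7, 0.1, 0.1, 0.1]))  # ~1.36 bits, below log2(4) = 2

# Changing the base only rescales the value: H_e(X) = H_2(X) * ln 2.
print(entropy(uniform, base=math.e))  # ~2.079 = 3.0 * ln 2
```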

The Source Coding Theorem
If, each time the source emits a symbol, it is independent of previous symbols and identically distributed, we call it a discrete memoryless source (DMS).
- Suppose the DMS emits r symbols per second.
- Shannon showed that it is possible to use source coding to encode the symbols in a bitstream at r·H_2(X) bits per second.
  - Conversely, he showed it is impossible to have a uniquely decodable bitstream at a lower rate.
- This is Shannon’s source coding theorem (and converse).
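As a concrete (made-up) instance: a DMS with the entropy-2.29-bit distribution used later in the Huffman example, emitting 1000 symbols per second, needs a bitstream of roughly 2.29 kbit/s and, by the converse, no less:

```python
import math

pmf = [0.25, 0.25, 0.2, 0.15, 0.15]        # example source distribution
H2 = -sum(p * math.log2(p) for p in pmf)   # ~2.29 bits per symbol

r = 1000                                   # hypothetical symbol rate (symbols/s)
print(f"minimum bitstream rate ≈ {r * H2:.0f} bits/s")   # about 2.29 kbit/s
```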

Shannon’s Source Coding Procedure

In proving the source coding theorem, Shannon devised a simple but impractical source coding scheme.
- We group N symbols together into a block.
- All likely sequences (in a certain sense) are identified.
- These sequences are enumerated using binary words of N·H_2(X) bits (rounded down; ignoring some sequences if too many).
- This constitutes the codebook.

Coding Procedure (2)

- To perform source coding, compare a given symbol sequence against those in the codebook & output the code, if there is one.
- The probability that this scheme doesn’t work → 0 as N → ∞.
- This scheme is impractical because there is not necessarily much structure in the codebook.
  ⇒ The codebook may require massive storage space.
  ⇒ The codebook may need to be exhaustively searched.
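A toy realization of this procedure (my own sketch: for simplicity it keeps the 2^⌊N·H₂(X)⌋ most probable blocks rather than constructing the formal typical set):

```python
import math
from itertools import product

def block_codebook(pmf, N):
    """Map the ~2^(N*H2) most probable length-N symbol blocks to fixed-length binary words."""
    H2 = -sum(p * math.log2(p) for p in pmf if p > 0)
    n_bits = int(N * H2)                              # N*H2(X) bits, rounded down
    blocks = sorted(product(range(len(pmf)), repeat=N),
                    key=lambda b: -math.prod(pmf[s] for s in b))
    kept = blocks[:2 ** n_bits]                       # drop the least likely blocks
    return {block: format(i, f"0{n_bits}b") for i, block in enumerate(kept)}

def encode(block, codebook):
    """Compare the block against the codebook; None means 'no code' (the scheme fails)."""
    return codebook.get(block)

pmf = [0.25, 0.25, 0.2, 0.15, 0.15]
book = block_codebook(pmf, N=4)                       # 4 * 2.29 -> 9-bit words
print(encode((0, 1, 0, 2), book))                     # a likely block: gets a 9-bit word
print(encode((4, 4, 4, 4), book))                     # an unlikely block: None
```

Even for this tiny alphabet the codebook has 2⁹ = 512 unstructured entries; for realistic N it becomes astronomically large, which is exactly the impracticality noted above.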

Variable-Length Codes

An ideal source-coding scheme is theoretically easy but practically difficult.
- Also, for finite N, the scheme is unreliable (not all sequences have codes!).
- How to make the best possible codes for finite block sizes?
- We’ll start with codes for a single symbol, i.e., N = 1.
- Consider a variable-length source code where each symbol maps to a variable number of bits.

Variable-Length Codes (2)

- Suppose symbol i is assigned a code of n_i bits.
- It turns out that the code can be made uniquely decodable if and only if the Kraft inequality is satisfied:

    ∑_{i=0}^{M−1} 2^{−n_i} ≤ 1.
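A direct check of the inequality (a sketch; the first set of lengths is the one the Huffman example below produces):

```python
def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality for a list of codeword lengths n_i."""
    return sum(2 ** -n for n in lengths)

print(kraft_sum([2, 2, 2, 3, 3]))   # 1.0  <= 1: a uniquely decodable code exists
print(kraft_sum([1, 1, 2]))         # 1.25 >  1: no uniquely decodable code has these lengths
```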

Huffman Coding

In 1952, David Huffman discovered a simple method of constructing an optimal variable-length, uniquely decodable source code.
- The method constructs a binary tree, i.e., a tree in which each node has at most two children.
- It proceeds from the bottom up, combining leaves into ‘twigs’, twigs into ‘branches’ and so on until the tree is built.
- Let’s call any partially assembled portion of the tree a twig.

Huffman Coding Technique

- To start with, there are M twigs, which consist only of the leaves themselves, the symbols.
- The probability of a twig is the sum of the probabilities of all of its leaves.
- The code construction algorithm is simply the following step, iterated (M − 1 times) until only one twig remains:
  - Choose the two twigs with the least probability and assemble them together to make a larger twig.
- To read off the codes, descend through the tree towards the symbol’s leaf.
  - Each time we take a left branch, output a ‘0’, otherwise a ‘1’.
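A compact Python sketch of this construction (my own illustration, not the lecture's code; it uses a heap to pick the two least probable twigs and then reads the codes off the finished tree with the left-'0'/right-'1' convention):

```python
import heapq

def huffman_code(pmf):
    """Construct a Huffman code for symbols 0..M-1 with probabilities pmf."""
    # A twig is either a bare symbol index (a leaf) or a [left, right] pair.
    # Heap entries carry a unique tie-breaker so twigs are never compared directly.
    heap = [(p, i, i) for i, p in enumerate(pmf)]
    heapq.heapify(heap)
    counter = len(pmf)
    while len(heap) > 1:                              # iterated M-1 times
        p1, _, t1 = heapq.heappop(heap)               # the two least probable twigs...
        p2, _, t2 = heapq.heappop(heap)
        counter += 1
        heapq.heappush(heap, (p1 + p2, counter, [t1, t2]))   # ...become one larger twig
    codes = {}
    def walk(twig, prefix):                           # descend towards each leaf
        if isinstance(twig, list):
            walk(twig[0], prefix + "0")               # left branch  -> '0'
            walk(twig[1], prefix + "1")               # right branch -> '1'
        else:
            codes[twig] = prefix
    walk(heap[0][2], "")
    return codes

print(huffman_code([0.25, 0.25, 0.2, 0.15, 0.15]))
# {2: '00', 0: '01', 1: '10', 3: '110', 4: '111'} -- ties are broken differently from
# the tree on the next slide, but the lengths (2, 2, 2, 3, 3) and hence the average
# of 2.3 bits are the same.
```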

Huffman Code Example
Consider a source with M = 5 for which the probabilities are
p_0 = p_1 = 0.25, p_2 = 0.2, p_3 = p_4 = 0.15.

[Figure: Huffman tree for this source. Step 1 combines p_3 and p_4 into a twig of probability 0.3; Step 2 combines p_1 and p_2 (probability 0.45); Step 3 combines p_0 with the 0.3 twig (probability 0.55); Step 4 joins the 0.55 and 0.45 twigs at the root, p = 1. Reading off the branches gives 2-bit codewords for symbols 0, 1 and 2, and 3-bit codewords for symbols 3 and 4.]

- Average code length is 2.3 bits and the entropy is 2.29 bits.
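A quick numerical check of those two figures (assuming the codeword lengths 2, 2, 2, 3, 3 read off the tree above):

```python
import math

pmf     = [0.25, 0.25, 0.2, 0.15, 0.15]
lengths = [2, 2, 2, 3, 3]                 # codeword lengths from the tree above

avg_len = sum(p * n for p, n in zip(pmf, lengths))
H       = -sum(p * math.log2(p) for p in pmf)
print(f"average length = {avg_len:.2f} bits, entropy = {H:.2f} bits")
# average length = 2.30 bits, entropy = 2.29 bits
```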

Developments of Source Coding

Source coding is also known as data compression.
- More particularly, lossless data compression, since the input and output symbols are identical.
- Our exposition required that the probability distribution be known in advance.
- If we don’t know it, we can use universal source coding.
  - Examples: Lempel-Ziv (LZ77) & Lempel-Ziv-Welch (LZW) algorithms & derivatives such as DEFLATE in ZIP & gzip software.
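For example, DEFLATE is exposed in Python's standard-library zlib module; a tiny demonstration of lossless round-tripping (the sample data is made up):

```python
import zlib

data = b"abracadabra " * 100                # made-up, highly repetitive sample text
compressed = zlib.compress(data)            # zlib wraps the DEFLATE algorithm

print(len(data), "->", len(compressed), "bytes")
assert zlib.decompress(compressed) == data  # lossless: the round trip is exact
```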

Source Coding Applications

- For sources like English text, lossless coding is very important.
- In other applications, like audio, images and video, we may be able to put up with some distortion for a lower bit rate.
- In 1959, Shannon developed rate-distortion theory, the basis of modern lossy data compression.
  - Examples: voice coding in mobile phones, MP3 for music, JPEG for images, MPEG for video.
