
EE376A Information Theory

Lecture 3 - 01/13/2015

Lecture 3: Entropy, Relative Entropy, and Mutual Information


Lecturer: Tsachy Weissman

Scribe: Alon Devorah, David Hallac, Kevin Shutzberg

In this lecture (reading: Chapter 2 of Cover and Thomas), we will introduce key measures of information that play crucial roles in the theoretical and operational characterizations developed throughout the course. These include the entropy, the mutual information, and the relative entropy. We will also establish some key properties of these information measures.

Notation

A quick summary of the notation:

1. Random variables (objects): used more loosely, i.e. X, Y, U, V
2. Alphabets: $\mathcal{X}, \mathcal{Y}, \mathcal{U}, \mathcal{V}$
3. Specific values: x, y, u, v

For a discrete random variable (object) U, the p.m.f. is $P_U(u) \triangleq P(U = u)$. Often, we'll just write $p(u)$. Similarly, we write $p(x, y)$ for $P_{X,Y}(x, y)$ and $p(y|x)$ for $P_{Y|X}(y|x)$, etc.

Entropy

Definition 1. Surprise Function:
$$ s(u) \triangleq \log \frac{1}{P_U(u)} \tag{1} $$

Definition 2. Entropy: Let U be a discrete R.V. taking values in $\mathcal{U}$. The entropy of U is defined by:
$$ H(U) \triangleq E[s(U)] = E\left[\log \frac{1}{P_U(U)}\right] = \sum_{u \in \mathcal{U}} P_U(u) \log \frac{1}{P_U(u)} \tag{2} $$
Note: The entropy H(U) is not a random variable. In fact, it is not a function of the object U, but rather a functional (or property) of the underlying distribution $P_U(u),\ u \in \mathcal{U}$. An analogy is E[U], which is also a number (the mean) corresponding to the distribution.
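To make the definition concrete, here is a minimal numerical sketch (my own illustration, not from the lecture) that computes the entropy of a small p.m.f. as the expected surprise. It uses base-2 logarithms so the result is in bits, and the distribution `p` is an arbitrary choice for demonstration.

```python
from math import log2

def surprise(prob):
    """s(u) = log 1/P_U(u), in bits (base-2 log)."""
    return log2(1.0 / prob)

def entropy(pmf):
    """H(U) = E[s(U)] = sum_u p(u) * log(1/p(u)); zero-probability symbols contribute 0."""
    return sum(p * surprise(p) for p in pmf if p > 0)

# An arbitrary p.m.f. on a 4-symbol alphabet (illustrative only).
p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p))          # 1.75 bits
print(entropy([1.0]))      # 0.0 bits: a deterministic U carries no surprise
```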
Jensen's Inequality: Let Q denote a convex function, and X denote any random variable. Jensen's inequality states that
$$ E[Q(X)] \geq Q(E[X]). \tag{3} $$
Further, if Q is strictly convex, equality holds iff X is deterministic.

Example: $Q(x) = e^x$ is a convex function. Therefore, for a random variable X, we have by Jensen's inequality:
$$ E[e^X] \geq e^{E[X]}. $$
Conversely, if Q is a concave function, then
$$ E[Q(X)] \leq Q(E[X]). \tag{4} $$
Example: $Q(x) = \log x$ is a concave function. Therefore, for a random variable $X > 0$,
$$ E[\log X] \leq \log E[X]. \tag{5} $$
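As a quick numerical sanity check of (5) (not part of the original notes), the sketch below compares E[log X] with log E[X] for an arbitrary positive random variable; the values and probabilities are made up.

```python
from math import log2

# An arbitrary positive random variable X: support points with their probabilities.
values = [1.0, 4.0, 16.0]
probs  = [0.5, 0.3, 0.2]

e_log_x = sum(p * log2(x) for p, x in zip(probs, values))   # E[log X]
log_e_x = log2(sum(p * x for p, x in zip(probs, values)))   # log E[X]

print(e_log_x, log_e_x)     # E[log X] <= log E[X], as Jensen's inequality predicts
assert e_log_x <= log_e_x
```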

2.1 Properties of Entropy

W.l.o.g. suppose $\mathcal{U} = \{1, 2, \ldots, m\}$.

1. $H(U) \leq \log m$, with equality iff $P(u) = \frac{1}{m}$ for all $u$ (i.e. uniform).

Proof:
$$
\begin{aligned}
H(U) &= E\left[\log \frac{1}{P(U)}\right] & (6)\\
&\leq \log E\left[\frac{1}{P(U)}\right] \quad \text{(Jensen's inequality, since log is concave)} & (7)\\
&= \log \sum_{u} P(u)\,\frac{1}{P(u)} & (8)\\
&= \log m. & (9)
\end{aligned}
$$
Equality in Jensen's holds iff $\frac{1}{P(U)}$ is deterministic, iff $p(u) = \frac{1}{m}$.

2. $H(U) \geq 0$, with equality iff U is deterministic.

Proof:
$$ H(U) = E\left[\log \frac{1}{P(U)}\right] \geq 0, \quad \text{since } \log \frac{1}{P(U)} \geq 0. \tag{10} $$
Equality occurs iff $\log \frac{1}{P(U)} = 0$ with probability 1, iff $P(U) = 1$ w.p. 1, iff U is deterministic.
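The following small sketch (my own, with arbitrary distributions) checks properties 1 and 2 numerically: the entropy stays between 0 and log m, hitting log m for the uniform p.m.f. and 0 for a deterministic one. Base-2 logs are assumed.

```python
from math import log2

def entropy(pmf):
    """H(U) in bits; zero-probability symbols contribute 0."""
    return sum(p * log2(1.0 / p) for p in pmf if p > 0)

m = 4
uniform       = [1.0 / m] * m
skewed        = [0.7, 0.1, 0.1, 0.1]   # arbitrary non-uniform p.m.f.
deterministic = [1.0, 0.0, 0.0, 0.0]

for pmf in (uniform, skewed, deterministic):
    h = entropy(pmf)
    assert 0.0 <= h <= log2(m) + 1e-12   # 0 <= H(U) <= log m
    print(h)                             # 2.0, ~1.357, 0.0
```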

3. For a PMF q defined on the same alphabet as p, define
$$ H_q(U) \triangleq \sum_{u \in \mathcal{U}} p(u) \log \frac{1}{q(u)}. \tag{11} $$
Note that this is again an expected surprise, but instead of the surprise associated with p, it is the surprise when U, which is distributed according to the PMF p, is incorrectly assumed to have the PMF q. The following result stipulates that we will (on average) be more surprised if we had the wrong distribution in mind. This makes intuitive sense! Mathematically,
$$ H(U) \leq H_q(U), \tag{12} $$
with equality iff q = p.


Proof:
$$
\begin{aligned}
H(U) - H_q(U) &= E\left[\log \frac{1}{p(U)}\right] - E\left[\log \frac{1}{q(U)}\right] & (13)\\
&= E\left[\log \frac{q(U)}{p(U)}\right] & (14)
\end{aligned}
$$
By Jensen's, we know that $E\left[\log \frac{q(U)}{p(U)}\right] \leq \log E\left[\frac{q(U)}{p(U)}\right]$, so
$$
\begin{aligned}
H(U) - H_q(U) &\leq \log E\left[\frac{q(U)}{p(U)}\right] & (15)\\
&= \log \sum_{u \in \mathcal{U}} p(u)\,\frac{q(u)}{p(u)} & (16)\\
&= \log \sum_{u \in \mathcal{U}} q(u) & (17)\\
&= \log 1 & (18)\\
&= 0. & (19)
\end{aligned}
$$
Therefore, we see that
$$ H(U) - H_q(U) \leq 0. $$
Equality only holds when Jensen's yields equality, which happens only when $\frac{q(U)}{p(U)}$ is deterministic, which only occurs when q = p, i.e. the distributions are identical.

Definition 3. Relative Entropy. An important measure of distance between probability measures is the relative entropy, or Kullback-Leibler divergence:
$$ D(p\|q) \triangleq \sum_{u \in \mathcal{U}} p(u) \log \frac{p(u)}{q(u)} = E\left[\log \frac{p(U)}{q(U)}\right]. \tag{20} $$
Note that property 3 is equivalent to saying that the relative entropy is always greater than or equal to 0, with equality iff q = p (convince yourself).
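Here is a brief sketch (illustrative only, with made-up distributions p and q on a common alphabet) that computes D(p||q) from equation (20) and checks its non-negativity; it assumes q(u) > 0 wherever p(u) > 0, and uses base-2 logs.

```python
from math import log2

def kl_divergence(p, q):
    """D(p||q) = sum_u p(u) * log(p(u)/q(u)), in bits; assumes q(u) > 0 where p(u) > 0."""
    return sum(pu * log2(pu / qu) for pu, qu in zip(p, q) if pu > 0)

p = [0.5, 0.25, 0.25]        # "true" p.m.f. (arbitrary)
q = [1/3, 1/3, 1/3]          # mismatched p.m.f. (arbitrary)

print(kl_divergence(p, q))   # > 0, since p != q
print(kl_divergence(p, p))   # 0.0: equality iff q = p
assert kl_divergence(p, q) >= 0
```

Note that `kl_divergence(p, q)` equals $H_q(U) - H(U)$, matching property 3.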
4. If $X_1, X_2, \ldots, X_n$ are independent random variables, then
$$ H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i). \tag{21} $$

Proof:
$$
\begin{aligned}
H(X_1, X_2, \ldots, X_n) &= E\left[\log \frac{1}{p(X_1, X_2, \ldots, X_n)}\right] & (22)\\
&= E\left[-\log p(X_1, X_2, \ldots, X_n)\right] & (23)\\
&= E\left[-\log p(X_1)p(X_2)\cdots p(X_n)\right] & (24)\\
&= E\left[\sum_{i=1}^{n} -\log p(X_i)\right] & (25)\\
&= \sum_{i=1}^{n} E\left[-\log p(X_i)\right] & (26)\\
&= \sum_{i=1}^{n} H(X_i). & (27)
\end{aligned}
$$
Therefore, the entropy of independent random variables is the sum of the individual entropies. This is also intuitive, since the uncertainty (or surprise) associated with each random variable is independent.
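To see property 4 numerically, the sketch below (my own example) builds the joint p.m.f. of two independent variables as the product of arbitrary marginals and checks that the joint entropy equals the sum of the individual entropies.

```python
from math import log2

def entropy(pmf):
    """Entropy in bits of a p.m.f. given as a flat list of probabilities."""
    return sum(p * log2(1.0 / p) for p in pmf if p > 0)

p_x = [0.5, 0.5]             # arbitrary marginal of X1
p_y = [0.2, 0.3, 0.5]        # arbitrary marginal of X2

# Joint p.m.f. of independent X1, X2: p(x1, x2) = p(x1) * p(x2).
joint = [px * py for px in p_x for py in p_y]

print(entropy(joint))                  # equals H(X1) + H(X2)
print(entropy(p_x) + entropy(p_y))
```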

Definition 4. Conditional Entropy of X given Y:
$$
\begin{aligned}
H(X|Y) &\triangleq E\left[\log \frac{1}{P(X|Y)}\right] & (28)\\
&= \sum_{x,y} P(x, y) \log \frac{1}{P(x|y)} & (29)\\
&= \sum_{y} P(y) \sum_{x} P(x|y) \log \frac{1}{P(x|y)} & (30)\\
&= \sum_{y} P(y)\, H(X|Y = y). & (31)
\end{aligned}
$$

Note: The conditional entropy is a functional of the joint distribution of (X, Y). It is also a number, and denotes the average surprise in X when we observe Y; here, by definition, we also average over the realizations of Y. Note that the conditional entropy is NOT a function of the random variable Y. In this sense, it is very different from a familiar object in probability, the conditional expectation E[X|Y], which is a random variable (and a function of Y).
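Here is a minimal sketch (with an arbitrary joint p.m.f. of my own choosing, base-2 logs) computing H(X|Y) via equation (31), i.e. averaging H(X|Y = y) over the marginal of Y.

```python
from math import log2

# Arbitrary joint p.m.f. P(x, y): rows indexed by x, columns by y.
joint = [[0.3, 0.1],
         [0.1, 0.5]]

p_y = [sum(joint[x][y] for x in range(2)) for y in range(2)]    # marginal of Y

h_x_given_y = 0.0
for y in range(2):
    # Conditional p.m.f. P(x | y) and its entropy H(X | Y = y).
    p_x_given_y = [joint[x][y] / p_y[y] for x in range(2)]
    h_given_y = sum(p * log2(1.0 / p) for p in p_x_given_y if p > 0)
    h_x_given_y += p_y[y] * h_given_y                           # weight by P(y)

print(h_x_given_y)   # H(X|Y): a single number, not a random variable
```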
5. $H(X|Y) \leq H(X)$, with equality iff $X \perp Y$.

Proof:
$$
\begin{aligned}
H(X) - H(X|Y) &= E\left[\log \frac{1}{P(X)}\right] - E\left[\log \frac{1}{P(X|Y)}\right] & (32)\\
&= E\left[\log \frac{P(X|Y)\,P(Y)}{P(X)\,P(Y)}\right] = E\left[\log \frac{P(X, Y)}{P(X)P(Y)}\right] & (33)\\
&= \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)} & (34)\\
&= D(P_{X,Y} \| P_X P_Y) & (35)\\
&\geq 0, & (36)
\end{aligned}
$$
with equality iff $X \perp Y$. The last step follows from the non-negativity of relative entropy. Equality holds iff $P_{X,Y} \equiv P_X P_Y$, i.e. X and Y are independent.
Definition 5. Joint Entropy of X and Y:
$$
\begin{aligned}
H(X, Y) &\triangleq E\left[\log \frac{1}{P(X, Y)}\right] & (37)\\
&= E\left[\log \frac{1}{P(X)P(Y|X)}\right] & (38)
\end{aligned}
$$

6. Chain rule for entropy:
$$
\begin{aligned}
H(X, Y) &= H(X) + H(Y|X) & (39)\\
&= H(Y) + H(X|Y) & (40)
\end{aligned}
$$

7. Sub-additivity of entropy:
$$ H(X, Y) \leq H(X) + H(Y), \tag{41} $$
with equality iff $X \perp Y$ (follows from the property that conditioning does not increase entropy).

Definition 6. Mutual information between X and Y:
We now define the mutual information between random variables X and Y distributed according to the joint PMF P(x, y):
$$
\begin{aligned}
I(X, Y) &\triangleq H(X) + H(Y) - H(X, Y) & (42)\\
&= H(Y) - H(Y|X) & (43)\\
&= H(X) - H(X|Y) & (44)\\
&= D(P_{X,Y} \| P_X P_Y). & (45)
\end{aligned}
$$

The mutual information is a canonical measure of the information conveyed by one random variable about another. The definition tells us that it is the reduction in average surprise upon observing a correlated random variable. The mutual information is again a functional of the joint distribution of the pair (X, Y). It can also be viewed as the relative entropy between the joint distribution and the product of the marginals.
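To tie the identities together, here is a sketch (again with an arbitrary joint p.m.f. of my own choosing, base-2 logs) that computes I(X, Y) both as H(X) + H(Y) - H(X, Y) and as the relative entropy D(P_{X,Y} || P_X P_Y), and confirms the two agree.

```python
from math import log2

def entropy(pmf):
    """Entropy in bits of a flat list of probabilities."""
    return sum(p * log2(1.0 / p) for p in pmf if p > 0)

# Arbitrary joint p.m.f. P(x, y): rows indexed by x, columns by y.
joint = [[0.3, 0.1],
         [0.1, 0.5]]

p_x = [sum(row) for row in joint]                               # marginal of X
p_y = [sum(joint[x][y] for x in range(2)) for y in range(2)]    # marginal of Y

# I(X, Y) = H(X) + H(Y) - H(X, Y)
flat_joint = [p for row in joint for p in row]
mi_entropy = entropy(p_x) + entropy(p_y) - entropy(flat_joint)

# I(X, Y) = D(P_{X,Y} || P_X P_Y)
mi_kl = sum(joint[x][y] * log2(joint[x][y] / (p_x[x] * p_y[y]))
            for x in range(2) for y in range(2) if joint[x][y] > 0)

print(mi_entropy, mi_kl)   # the two expressions agree (up to floating point)
```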
