Lecture 3: Entropy, Relative Entropy, and Mutual Information
Lecture 3 - 01/13/2015
In this lecture, we will introduce certain key measures of information that play crucial roles in theoretical
and operational characterizations throughout the course. These include the entropy, the mutual information,
and the relative entropy. We will also present some key properties of these information measures.
1 Notation
U denotes a discrete random variable taking values in an alphabet \mathcal{U}, with PMF P_U(u), u \in \mathcal{U}.

2 Entropy
Definition 1. Surprise: The surprise associated with observing the outcome u is defined by
\log \frac{1}{P_U(u)}.   (1)
Definition 2. Entropy: Let U be a discrete random variable taking values in \mathcal{U}. The entropy of U is defined by
H(U) \triangleq E\left[\log \frac{1}{P_U(U)}\right] = \sum_{u \in \mathcal{U}} P_U(u) \log \frac{1}{P_U(u)}.   (2)
Note: The entropy H(U) is not a random variable. In fact, it is not a function of the object U, but
rather a functional (or property) of the underlying distribution P_U(u), u \in \mathcal{U}. An analogy is E[U], which is
also a number (the mean) corresponding to the distribution.
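For concreteness, here is a small Python sketch (not part of the original notes; the PMF and the helper name entropy are illustrative choices) that evaluates (2) for a PMF stored as a dictionary, using base-2 logarithms.

```python
import math

def entropy(pmf, base=2):
    """H(U) = sum_u P(u) * log(1 / P(u)); zero-probability outcomes contribute nothing."""
    return sum(p * math.log(1.0 / p, base) for p in pmf.values() if p > 0)

# Example: a biased coin with P(heads) = 0.9.
coin = {"heads": 0.9, "tails": 0.1}
print(entropy(coin))  # ~0.469 bits, less than the 1 bit of a fair coin
```

Note that the value depends only on the probabilities, not on the labels of the outcomes, consistent with H(U) being a functional of P_U rather than a function of U.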
Jensen's Inequality: Let Q denote a convex function, and X denote any random variable. Jensen's
inequality states that
E[Q(X)] \geq Q(E[X]).   (3)
If Q is instead concave, the inequality is reversed: E[Q(X)] \leq Q(E[X]).
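As a quick numerical sanity check (my own illustration, not from the notes), the sketch below compares E[Q(X)] with Q(E[X]) for the convex function Q(x) = x^2 and a small discrete X.

```python
# Jensen's inequality check for the convex function Q(x) = x**2.
xs = [(1, 0.2), (3, 0.5), (10, 0.3)]      # (value, probability) pairs for X
E_X = sum(x * p for x, p in xs)           # E[X] = 4.7
E_QX = sum((x ** 2) * p for x, p in xs)   # E[Q(X)] = 34.7
print(E_QX >= E_X ** 2)                   # True: 34.7 >= 22.09
```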
2.1 Properties of Entropy
1. H(U) \leq \log m, where m = |\mathcal{U}|, with equality iff P_U(u) = \frac{1}{m} for all u \in \mathcal{U} (i.e. uniform).
Proof:
H(U) = E\left[\log \frac{1}{P(U)}\right]   (6)
\leq \log E\left[\frac{1}{P(U)}\right]   (Jensen's inequality, since \log is concave)   (7)
= \log \sum_{u \in \mathcal{U}} P(u) \cdot \frac{1}{P(u)}   (8)
= \log m.   (9)
Equality holds in (7) iff \frac{1}{P(U)} is deterministic, i.e. P(u) = \frac{1}{m} for all u \in \mathcal{U}.

2. H(U) \geq 0:
H(U) = E\left[\log \frac{1}{P(U)}\right] \geq 0 \quad \text{since } \log \frac{1}{P(U)} \geq 0.   (10)
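The following sketch (my own; the alphabet size m = 4 and the PMFs are arbitrary examples) illustrates properties 1 and 2: every PMF satisfies 0 <= H(U) <= log m, with the maximum attained by the uniform distribution.

```python
import math

def entropy_bits(probs):
    # H(U) in bits; zero-probability symbols contribute nothing.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

m = 4
uniform = [1.0 / m] * m
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy_bits(uniform), math.log2(m))  # 2.0 2.0: equality at the uniform PMF
print(entropy_bits(skewed))                 # ~1.357, strictly between 0 and log m
```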
3. Suppose U has PMF p, but we incorrectly assume its PMF to be q, and define the corresponding expected surprise
H_q(U) \triangleq \sum_{u \in \mathcal{U}} p(u) \log \frac{1}{q(u)}.   (11)
Note that this is the expected surprise function, but instead of the surprise associated with p, it is the
surprise associated with q, averaged over U, which is distributed according to the PMF p but incorrectly
assumed to have the PMF q. The following result stipulates that we will (on average) be more surprised if we had
the wrong distribution in mind. This makes intuitive sense! Mathematically,
H(U) \leq H_q(U),   (12)
with equality iff q = p.
Proof:
H(U) - H_q(U) = E\left[\log \frac{1}{p(U)}\right] - E\left[\log \frac{1}{q(U)}\right]   (13)
= E\left[\log \frac{q(U)}{p(U)}\right].   (14)
By Jensen's inequality, we know that E\left[\log \frac{q(U)}{p(U)}\right] \leq \log E\left[\frac{q(U)}{p(U)}\right], so
H(U) - H_q(U) \leq \log E\left[\frac{q(U)}{p(U)}\right]   (15)
= \log \sum_{u \in \mathcal{U}} p(u) \frac{q(u)}{p(u)}   (16)
= \log \sum_{u \in \mathcal{U}} q(u)   (17)
= \log 1   (18)
= 0,   (19)
with equality iff \frac{q(U)}{p(U)} is deterministic, which happens iff q = p.
Note that property 3 is equivalent to saying that the relative entropy
D(p \| q) \triangleq \sum_{u \in \mathcal{U}} p(u) \log \frac{p(u)}{q(u)} = H_q(U) - H(U)
is always greater than or equal to 0, with equality iff q = p (convince yourself).
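As a numerical companion to property 3 (my own sketch; p and q are arbitrary example PMFs on a three-letter alphabet), the code below evaluates H(U), H_q(U), and D(p||q), and checks that H_q(U) - H(U) = D(p||q) >= 0.

```python
import math

p = {"a": 0.5, "b": 0.3, "c": 0.2}   # true PMF of U
q = {"a": 0.2, "b": 0.2, "c": 0.6}   # mistakenly assumed PMF

H = sum(p[u] * math.log2(1.0 / p[u]) for u in p)      # H(U)
H_q = sum(p[u] * math.log2(1.0 / q[u]) for u in p)    # H_q(U): surprise under q, averaged over p
D = sum(p[u] * math.log2(p[u] / q[u]) for u in p)     # relative entropy D(p||q)

print(H <= H_q, D >= 0)             # True True
print(abs((H_q - H) - D) < 1e-12)   # True: D(p||q) = H_q(U) - H(U)
```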
4. If X_1, X_2, \ldots, X_n are independent random variables, then
H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i).   (21)
Proof:
H(X_1, X_2, \ldots, X_n) = E\left[\log \frac{1}{p(X_1, X_2, \ldots, X_n)}\right]   (22)
= E\left[-\log p(X_1, X_2, \ldots, X_n)\right]   (23)
= E\left[-\log \prod_{i=1}^{n} p(X_i)\right]   (by independence)   (24)
= E\left[-\sum_{i=1}^{n} \log p(X_i)\right]   (25)
= \sum_{i=1}^{n} E\left[-\log p(X_i)\right]   (26)
= \sum_{i=1}^{n} H(X_i).   (27)
Therefore, the entropy of independent random variables is the sum of the individual entropies. This is
also intuitive: since the random variables are independent, their individual uncertainties (or surprises) simply add up.
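A small check of property 4 for n = 2 (my own sketch; the two marginal PMFs are arbitrary): under independence the joint PMF is the product of the marginals, and the joint entropy equals the sum of the marginal entropies.

```python
import math

def H(pmf):
    # Entropy of a PMF given as a dictionary of probabilities.
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

p1 = {"0": 0.25, "1": 0.75}
p2 = {"a": 0.5, "b": 0.3, "c": 0.2}

# Joint PMF of (X1, X2) under independence: p(x1, x2) = p1(x1) * p2(x2).
joint = {(x1, x2): p1[x1] * p2[x2] for x1 in p1 for x2 in p2}

print(H(joint))       # ~2.296 bits ...
print(H(p1) + H(p2))  # ... equal to H(X1) + H(X2)
```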
The conditional entropy of X given Y is defined by
H(X|Y) \triangleq E\left[\log \frac{1}{P(X|Y)}\right]   (28)
= \sum_{x,y} P(x,y) \log \frac{1}{P(x|y)}   (29)
= \sum_{y} P(y) \sum_{x} P(x|y) \log \frac{1}{P(x|y)}   (30)
= \sum_{y} P(y) H(X|Y = y).   (31)
Note: The conditional entropy is a functional of the joint distribution of (X, Y ). Note that this is also
a number, and denotes the average surprise in X when we observe Y. Here, by definition, we also
average over the realizations of Y. Note that the conditional entropy is NOT a function of the random
variable Y. In this sense, it is very different from a familiar object in probability, the conditional
expectation E[X|Y ] which is a random variable (and a function of Y ).
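The sketch below (my own; the joint PMF is an arbitrary example) computes H(X|Y) directly from (28) as an average over the joint distribution, using P(x|y) = P(x, y)/P(y).

```python
import math

# Joint PMF P(x, y), keyed by the pair (x, y); an arbitrary example.
P = {(0, 0): 0.30, (0, 1): 0.20,
     (1, 0): 0.10, (1, 1): 0.40}

P_y = {}  # marginal PMF of Y
for (x, y), pxy in P.items():
    P_y[y] = P_y.get(y, 0.0) + pxy

# H(X|Y) = sum_{x,y} P(x,y) * log(1 / P(x|y)) = sum_{x,y} P(x,y) * log(P(y) / P(x,y)).
H_X_given_Y = sum(pxy * math.log2(P_y[y] / pxy) for (x, y), pxy in P.items() if pxy > 0)
print(H_X_given_Y)  # a single number (~0.875 bits here), not a random variable
```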
5. H(X|Y) \leq H(X), with equality iff X \perp Y (X and Y independent).
Proof:
H(X) - H(X|Y) = E\left[\log \frac{1}{P(X)}\right] - E\left[\log \frac{1}{P(X|Y)}\right]   (32)
= E\left[\log \frac{P(X|Y)}{P(X)}\right]   (33)
= E\left[\log \frac{P(X|Y) P(Y)}{P(X) P(Y)}\right] = E\left[\log \frac{P(X,Y)}{P(X) P(Y)}\right]   (34)
= \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x) P(y)}   (35)
= D(P_{X,Y} \| P_X P_Y)   (36)
\geq 0,
with equality iff X \perp Y. The last step follows from the non-negativity of relative entropy. Equality holds iff
P_{X,Y} = P_X P_Y, i.e. X and Y are independent.
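Using the same kind of toy joint PMF (my own sketch, continuing the example above), the following code verifies numerically that H(X) - H(X|Y) = D(P_{X,Y} || P_X P_Y) >= 0.

```python
import math

P = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}  # joint PMF P(x, y)

P_x, P_y = {}, {}  # marginals
for (x, y), pxy in P.items():
    P_x[x] = P_x.get(x, 0.0) + pxy
    P_y[y] = P_y.get(y, 0.0) + pxy

H_X = sum(p * math.log2(1.0 / p) for p in P_x.values())
H_X_given_Y = sum(pxy * math.log2(P_y[y] / pxy) for (x, y), pxy in P.items())
D = sum(pxy * math.log2(pxy / (P_x[x] * P_y[y])) for (x, y), pxy in P.items())

print(H_X - H_X_given_Y, D)  # both ~0.1245 bits, and nonnegative
```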
Definition 5. Joint Entropy of X and Y:
H(X,Y) \triangleq E\left[\log \frac{1}{P(X,Y)}\right]   (37)
= E\left[\log \frac{1}{P(X) P(Y|X)}\right]   (38)
= H(X) + H(Y|X)   (39)
= H(Y) + H(X|Y).   (40)
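A quick numerical check of (37)-(40) (my own sketch, reusing the same arbitrary joint PMF): the joint entropy equals H(Y) + H(X|Y).

```python
import math

P = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}  # joint PMF P(x, y)

P_y = {}  # marginal PMF of Y
for (x, y), pxy in P.items():
    P_y[y] = P_y.get(y, 0.0) + pxy

H_XY = sum(pxy * math.log2(1.0 / pxy) for pxy in P.values())                   # H(X, Y)
H_Y = sum(p * math.log2(1.0 / p) for p in P_y.values())                        # H(Y)
H_X_given_Y = sum(pxy * math.log2(P_y[y] / pxy) for (x, y), pxy in P.items())  # H(X|Y)

print(H_XY, H_Y + H_X_given_Y)  # both ~1.846 bits
```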
7. Sub-additivity of entropy:
H(X,Y) \leq H(X) + H(Y),   (41)
with equality iff X \perp Y (this follows from (39) together with the property that conditioning does not increase entropy).
The mutual information between X and Y is defined by
I(X;Y) \triangleq H(X) + H(Y) - H(X,Y)   (42)
= H(Y) - H(Y|X)   (43)
= H(X) - H(X|Y)   (44)
= D(P_{X,Y} \| P_X P_Y).   (45)
The mutual information is a canonical measure of the information conveyed by one random variable
about another. The definition tells us that it is the reduction in average surprise upon observing a
correlated random variable. The mutual information is again a functional of the joint distribution of
the pair (X, Y). It can also be viewed as the relative entropy between the joint distribution and the
product of the marginals.
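To tie the equivalent expressions together, here is a final sketch (my own, with the same arbitrary joint PMF) computing I(X;Y) in three ways: as H(X) + H(Y) - H(X,Y), as H(X) - H(X|Y), and as D(P_{X,Y} || P_X P_Y).

```python
import math

P = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}  # joint PMF P(x, y)

P_x, P_y = {}, {}  # marginals
for (x, y), pxy in P.items():
    P_x[x] = P_x.get(x, 0.0) + pxy
    P_y[y] = P_y.get(y, 0.0) + pxy

def H(pmf):
    # Entropy of a PMF given as a dictionary of probabilities.
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

H_X, H_Y, H_XY = H(P_x), H(P_y), H(P)
H_X_given_Y = sum(pxy * math.log2(P_y[y] / pxy) for (x, y), pxy in P.items())
D = sum(pxy * math.log2(pxy / (P_x[x] * P_y[y])) for (x, y), pxy in P.items())

# All three expressions for the mutual information I(X;Y) coincide.
print(H_X + H_Y - H_XY, H_X - H_X_given_Y, D)  # each ~0.1245 bits
```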