ECE 7680 Lecture 2: Definitions and Basic Facts
Objective: To learn a collection of definitions about entropy and information measures that will be useful throughout the quarter, and to present some simple but important theorems: Jensen's inequality and the information inequality.
In this lecture a number of definitions and a few simple theorems are presented. While these might be somewhat bewildering at first, you should just hang on and dig in. There is a sort of algebra of informational quantities that must be presented as preliminary material before we can get much further.
The binary entropy function

Suppose $X$ is a binary random variable,
\[
X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p. \end{cases}
\]
Then the entropy of $X$ is
\[
H(X) = -p \log p - (1-p) \log(1-p).
\]
Since this depends only on $p$, it is also sometimes written as $H(p)$. Plot $H(p)$ as a function of $p$ and observe: it is a concave function of $p$. (What does this mean?) Note that $H(0) = 0$ and $H(1) = 0$. Why? Where is the maximum? More generally, the entropy of a binary discrete random variable with probability $p$ is written as either $H(X)$ or $H(p)$.
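As a quick numerical sketch (not part of the original notes), the binary entropy function can be computed directly; the Python function name `binary_entropy` and the choice of base-2 logarithms are my own here.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits; by convention 0 log 0 = 0."""
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# H(p) is concave, zero at the endpoints, and maximized at p = 1/2.
for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"H({p}) = {binary_entropy(p):.4f} bits")
```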
Joint entropy
Often we are interested in the entropy of pairs of random variables $(X, Y)$. Another way of thinking of this is as a vector of random variables.

Definition 1  If $X$ and $Y$ are jointly distributed according to $p(x, y)$, then the joint entropy $H(X, Y)$ is
\[
H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y),
\]
or
\[
H(X,Y) = -E \log p(X,Y). \qquad \Box
\]

Definition 2  If $(X, Y) \sim p(x, y)$, then the conditional entropy $H(Y|X)$ is
\[
H(Y|X) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y|x).
\]
This can also be written in the following equivalent (and also useful) ways:
\[
H(Y|X) = \sum_{x \in \mathcal{X}} p(x) \left[ -\sum_{y \in \mathcal{Y}} p(y|x) \log p(y|x) \right]
        = \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x). \qquad \Box
\]
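To make Definitions 1 and 2 concrete, here is a small numerical sketch using an arbitrary joint pmf; the table values and the helper name `entropy` are illustrative choices of mine, not something from the notes.

```python
import numpy as np

# An arbitrary joint pmf p(x, y) on a 2x3 alphabet (rows index x, columns index y).
p_xy = np.array([[1/8, 1/16, 1/4],
                 [1/16, 1/4, 1/4]])

def entropy(p):
    """Entropy in bits of a pmf given as an array; zero-probability terms contribute 0."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Joint entropy H(X,Y) = -sum_{x,y} p(x,y) log p(x,y)
H_XY = entropy(p_xy)

# Conditional entropy H(Y|X) = sum_x p(x) H(Y | X = x)
p_x = p_xy.sum(axis=1)
H_Y_given_X = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(len(p_x)))

print(f"H(X,Y) = {H_XY:.4f} bits")
print(f"H(Y|X) = {H_Y_given_X:.4f} bits")
```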
Theorem 1 (chain rule)
\[
H(X,Y) = H(X) + H(Y|X).
\]
Interpretation: The uncertainty (entropy) about both $X$ and $Y$ is equal to the uncertainty we have about $X$, plus whatever uncertainty remains about $Y$ given that we know $X$.

Proof  This proof is very typical of the proofs in this class: it consists of a long string of equalities. (Later proofs will consist of long strings of inequalities. Some people make their livelihood out of inequalities!)
\begin{align*}
H(X,Y) &= -\sum_x \sum_y p(x,y) \log p(x,y) \\
 &= -\sum_x \sum_y p(x,y) \log \big[ p(x)\, p(y|x) \big] \\
 &= -\sum_x \sum_y p(x,y) \log p(x) - \sum_x \sum_y p(x,y) \log p(y|x) \\
 &= -\sum_x p(x) \log p(x) + H(Y|X) \\
 &= H(X) + H(Y|X).
\end{align*}
This can also be done in the following streamlined manner: write
\[
\log p(X,Y) = \log p(X) + \log p(Y|X)
\]
and take the expectation of both sides. $\Box$

We can also have a joint entropy with a conditioning on it, as shown in the following corollary:

Corollary 1
\[
H(X,Y|Z) = H(X|Z) + H(Y|X,Z).
\]
The proof is similar to the one above. (This is a good one to work on your own.)
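A brief numeric check of the chain rule, reusing the style of joint table from the earlier sketch; again, the table and variable names are illustrative only.

```python
import numpy as np

p_xy = np.array([[1/8, 1/16, 1/4],
                 [1/16, 1/4, 1/4]])   # arbitrary joint pmf, rows = x, cols = y

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_x = p_xy.sum(axis=1)
H_X = entropy(p_x)
H_Y_given_X = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(len(p_x)))
H_XY = entropy(p_xy)

# Chain rule: H(X,Y) = H(X) + H(Y|X)
print(f"H(X,Y) = {H_XY:.6f},  H(X) + H(Y|X) = {H_X + H_Y_given_X:.6f}")
assert abs(H_XY - (H_X + H_Y_given_X)) < 1e-12
```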
Definition 3  The relative entropy or Kullback-Leibler distance between two probability mass functions $p(x)$ and $q(x)$ is defined as
\[
D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}. \qquad \Box
\]
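Here is a small numerical sketch of relative entropy; the distributions are arbitrary and the function name `kl_divergence` is my own. It also makes the asymmetry discussed next concrete.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits, for pmfs p, q with q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]

print(f"D(p||q) = {kl_divergence(p, q):.4f} bits")
print(f"D(q||p) = {kl_divergence(q, p):.4f} bits")   # generally not equal to D(p||q)
print(f"D(p||p) = {kl_divergence(p, p):.4f} bits")   # zero when the arguments agree
```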
Note that this is not symmetric, and $q$ (the second argument) appears only in the denominator.

Another important concept is that of mutual information: how much information does one random variable tell about another? In fact, this is perhaps the central idea in much of information theory. When we look at the output of a channel, we see the outcomes of a random variable. What we want to know is what went into the channel: we want to know what was sent, and the only thing we have is what came out. The channel coding theorem (which is one of the high points we are trying to reach in this class) is basically a statement about mutual information.

Definition 4  Let $X$ and $Y$ be random variables with joint distribution $p(x, y)$ and marginal distributions $p(x)$ and $p(y)$. The mutual information $I(X; Y)$ is the relative entropy between the joint distribution and the product distribution:
\[
I(X;Y) = D\big( p(x,y) \,\|\, p(x)p(y) \big) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)p(y)}. \qquad \Box
\]

Note that when $X$ and $Y$ are independent, $p(x,y) = p(x)p(y)$ (the definition of independence), so $I(X;Y) = 0$. This makes sense: if they are independent random variables, then $Y$ can tell us nothing about $X$.

An important interpretation of mutual information comes from the following.

Theorem 2
\[
I(X;Y) = H(X) - H(X|Y).
\]
Interpretation: The information that $Y$ tells us about $X$ is the reduction in uncertainty about $X$ due to the knowledge of $Y$.

Proof
\begin{align*}
I(X;Y) &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \\
 &= \sum_{x,y} p(x,y) \log \frac{p(x|y)}{p(x)} \\
 &= -\sum_{x,y} p(x,y) \log p(x) + \sum_{x,y} p(x,y) \log p(x|y) \\
 &= H(X) - H(X|Y). \qquad \Box
\end{align*}

Observe that by symmetry $I(X;Y) = H(Y) - H(Y|X) = I(Y;X)$. That is, $Y$ tells as much about $X$ as $X$ tells about $Y$. Using $H(X,Y) = H(X) + H(Y|X)$ we get
\[
I(X;Y) = H(X) + H(Y) - H(X,Y).
\]
The information that $X$ tells about $Y$ is the uncertainty in $X$ plus the uncertainty about $Y$, minus the uncertainty in both $X$ and $Y$. We can summarize a bunch of statements about entropy as follows:
\begin{align*}
I(X;Y) &= H(X) - H(X|Y) \\
I(X;Y) &= H(Y) - H(Y|X) \\
I(X;Y) &= H(X) + H(Y) - H(X,Y) \\
I(X;Y) &= I(Y;X) \\
I(X;X) &= H(X)
\end{align*}
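A minimal sketch that checks these identities numerically on an arbitrary joint pmf (my own example table); the conditional entropies are obtained through the chain rule of Theorem 1.

```python
import numpy as np

p_xy = np.array([[1/8, 1/16, 1/4],
                 [1/16, 1/4, 1/4]])   # arbitrary joint pmf, rows = x, cols = y

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(p_xy)

# I(X;Y) directly from the definition as D(p(x,y) || p(x)p(y))
I_def = float(np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y))))

# Conditional entropies via the chain rule
H_X_given_Y = H_XY - H_Y
H_Y_given_X = H_XY - H_X

print(f"I from definition    : {I_def:.6f}")
print(f"H(X) - H(X|Y)        : {H_X - H_X_given_Y:.6f}")
print(f"H(Y) - H(Y|X)        : {H_Y - H_Y_given_X:.6f}")
print(f"H(X) + H(Y) - H(X,Y) : {H_X + H_Y - H_XY:.6f}")
```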
Theorem 3 (chain rule for entropy)
\[
H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_{i-1}, \ldots, X_1).
\]
Proof  Observe that
\begin{align*}
H(X_1, X_2) &= H(X_1) + H(X_2|X_1) \\
H(X_1, X_2, X_3) &= H(X_1) + H(X_2, X_3|X_1) \\
 &= H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1) \\
 &\;\;\vdots \\
H(X_1, X_2, \ldots, X_n) &= H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1) + \cdots + H(X_n|X_{n-1}, X_{n-2}, \ldots, X_1).
\end{align*}
An alternate proof can be obtained by observing that
\[
p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n p(x_i \mid x_{i-1}, \ldots, x_1)
\]
and taking an expectation. $\Box$

We sometimes have two variables that we wish to consider, both conditioned upon another variable.

Definition 5  The conditional mutual information of random variables $X$ and $Y$ given $Z$ is defined by
\[
I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = E \log \frac{p(X,Y|Z)}{p(X|Z)\,p(Y|Z)}. \qquad \Box
\]
In other words, it is the same as mutual information, but everything is conditioned upon $Z$.

The chain rule for entropy leads us to a chain rule for mutual information.
Theorem 4
\[
I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^n I(X_i; Y \mid X_{i-1}, \ldots, X_1).
\]
Proof
\begin{align*}
I(X_1, X_2, \ldots, X_n; Y) &= H(X_1, X_2, \ldots, X_n) - H(X_1, X_2, \ldots, X_n \mid Y) \\
 &= \sum_{i=1}^n H(X_i \mid X_{i-1}, \ldots, X_1) - \sum_{i=1}^n H(X_i \mid X_{i-1}, \ldots, X_1, Y) \\
 &= \sum_{i=1}^n I(X_i; Y \mid X_{i-1}, \ldots, X_1). \qquad \Box
\end{align*}
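As a sketch (my own, with an arbitrary randomly generated joint pmf), the two-term case of Theorem 4, $I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y | X_1)$, can be checked by computing each term from entropies of marginals.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 3))          # arbitrary positive joint pmf p(x1, x2, y)
p /= p.sum()

def entropy(t):
    t = t[t > 0]
    return float(-np.sum(t * np.log2(t)))

H_x1x2   = entropy(p.sum(axis=2))                        # H(X1, X2)
H_x1x2_y = entropy(p) - entropy(p.sum(axis=(0, 1)))      # H(X1, X2 | Y)
H_x1     = entropy(p.sum(axis=(1, 2)))                   # H(X1)
H_x1_y   = entropy(p.sum(axis=1)) - entropy(p.sum(axis=(0, 1)))  # H(X1 | Y)
H_x2_x1  = H_x1x2 - H_x1                                 # H(X2 | X1)
H_x2_x1y = entropy(p) - entropy(p.sum(axis=1))           # H(X2 | X1, Y)

lhs = H_x1x2 - H_x1x2_y                            # I(X1, X2; Y)
rhs = (H_x1 - H_x1_y) + (H_x2_x1 - H_x2_x1y)       # I(X1; Y) + I(X2; Y | X1)
print(f"I(X1,X2;Y) = {lhs:.6f},  I(X1;Y) + I(X2;Y|X1) = {rhs:.6f}")
```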
Convexity and Jensen's inequality

Definition 6  A function $f(x)$ is convex over an interval $(a,b)$ if for every $x_1, x_2 \in (a,b)$ and $0 \le \lambda \le 1$,
\[
f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2).
\]
It is strictly convex if equality holds only when $\lambda = 0$ or $\lambda = 1$. $\Box$

To understand the definition, recall that $\lambda x_1 + (1-\lambda) x_2$ is simply a line segment connecting $x_1$ and $x_2$ (in the $x$ direction), and $\lambda f(x_1) + (1-\lambda) f(x_2)$ is a line segment connecting $f(x_1)$ and $f(x_2)$. Pictorially, the function is convex if it lies below the straight line segment connecting two points, for any two points in the interval.

Definition 7  A function $f$ is concave if $-f$ is convex. $\Box$
You will need to keep reminding me of which is which since, when I learned this, the nomenclature was "convex $\cup$" (cup) and "convex $\cap$" (cap).

Example 1  Convex: $x^2$, $e^x$, $|x|$, $x \log x$. Concave: $\log x$, $\sqrt{x}$. $\Box$

One reason why we are interested in convex functions is that it is known that over the interval of convexity there is only one minimum. This can strengthen many of the results we might want.

Theorem 5  If $f$ has a second derivative which is non-negative (positive) everywhere, then $f$ is convex (strictly convex).
Proof  The Taylor-series expansion of $f$ about the point $x_0$ is
\[
f(x) = f(x_0) + f'(x_0)(x - x_0) + \tfrac{1}{2} f''(x^*)(x - x_0)^2
\]
for some $x^*$ between $x_0$ and $x$. If $f''(x) \ge 0$, the last term is non-negative. Let $x_0 = \lambda x_1 + (1-\lambda) x_2$ and let $x = x_1$. Then
\[
f(x_1) \ge f(x_0) + f'(x_0)\big[(1-\lambda)(x_1 - x_2)\big].
\]
Now let $x = x_2$ and get
\[
f(x_2) \ge f(x_0) + f'(x_0)\big[\lambda(x_2 - x_1)\big].
\]
Multiply the first by $\lambda$ and the second by $1-\lambda$, and add together to get the convexity result. $\Box$

We now introduce Jensen's inequality.

Theorem 6  If $f$ is a convex function and $X$ is a random variable, then
\[
E f(X) \ge f(E X).
\]
Put another way,
\[
\sum_x p(x) f(x) \ge f\!\left( \sum_x p(x)\, x \right).
\]
If $f$ is strictly convex, then equality in the theorem implies that $X = EX$ with probability 1. If $f$ is concave, then $E f(X) \le f(EX)$. The theorem allows us (more or less) to pull a function outside of a summation in some circumstances.

Proof  The proof is by induction. When $X$ takes on two values, the inequality is
\[
p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2).
\]
This is true by the definition of convex functions. Inductive hypothesis: suppose the theorem is true for distributions with $k-1$ values. Then letting $p_i' = p_i/(1 - p_k)$ for $i = 1, 2, \ldots, k-1$,
\begin{align*}
\sum_{i=1}^{k} p_i f(x_i) &= p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i) \\
 &\ge p_k f(x_k) + (1 - p_k)\, f\!\left( \sum_{i=1}^{k-1} p_i' x_i \right) \\
 &\ge f\!\left( p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i \right) \\
 &= f\!\left( \sum_{i=1}^{k} p_i x_i \right),
\end{align*}
where the first inequality uses the inductive hypothesis and the second uses the two-value (definition) case. $\Box$
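A minimal numerical illustration of Jensen's inequality, assuming an arbitrary pmf and the convex function $f(x) = x^2$ (and $\log$ for the concave direction); none of these choices come from the notes.

```python
import numpy as np

# Arbitrary pmf over a few points x, and a convex function f(x) = x^2.
x = np.array([-2.0, 0.5, 1.0, 3.0])
p = np.array([0.1, 0.4, 0.3, 0.2])

f = lambda t: t ** 2
Ef_X = float(np.sum(p * f(x)))   # E f(X)
f_EX = float(f(np.sum(p * x)))   # f(E X)
print(f"E f(X) = {Ef_X:.4f} >= f(EX) = {f_EX:.4f}")   # Jensen for convex f

# For a concave function (log) the inequality reverses; use positive support.
xg = np.array([0.5, 1.0, 2.0, 4.0])
Eg = float(np.sum(p * np.log(xg)))
gE = float(np.log(np.sum(p * xg)))
print(f"E log(X) = {Eg:.4f} <= log(EX) = {gE:.4f}")
```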
There is another inequality that got considerable use (in many of the same ways as Jensen's inequality) way back in the dark ages when I took information theory. I may refer to it simply as the information inequality.
Theorem 7  $\log x \le x - 1$, with equality if and only if $x = 1$.

This can also be generalized by taking the tangent line at different points along the function.

With these simple inequalities we can now prove some facts about the information measures we have defined so far.

Theorem 8  $D(p \,\|\, q) \ge 0$, with equality if and only if $p(x) = q(x)$ for all $x$.

Proof
\begin{align*}
D(p \,\|\, q) &= \sum_x p(x) \log \frac{p(x)}{q(x)} \\
 &= -\sum_x p(x) \log \frac{q(x)}{p(x)} \\
 &\ge -\log \sum_x p(x) \frac{q(x)}{p(x)} \qquad \text{(Jensen's inequality)} \\
 &= -\log \sum_x q(x) = -\log 1 = 0.
\end{align*}
Since $\log$ is a strictly concave function, we have equality if and only if $q(x)/p(x)$ is constant; since both sum to one, that constant must be 1, i.e., $p(x) = q(x)$ for all $x$. $\Box$

Proof  Here is another proof using the information inequality:
\begin{align*}
-D(p \,\|\, q) &= \sum_x p(x) \log \frac{q(x)}{p(x)} \\
 &\le \sum_x p(x) \left( \frac{q(x)}{p(x)} - 1 \right) \qquad (\log x \le x - 1) \\
 &= \sum_x q(x) - \sum_x p(x) = 1 - 1 = 0. \qquad \Box
\end{align*}
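A brief sketch (my own) that spot-checks both inequalities numerically: $\log x \le x - 1$ on a grid, and $D(p\|q) \ge 0$ for randomly drawn pmfs.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

# The information inequality log x <= x - 1 (natural log), checked on a grid.
x = np.linspace(0.05, 5.0, 100)
assert np.all(np.log(x) <= x - 1 + 1e-12)

# D(p || q) >= 0 for randomly drawn pmfs, with D(p || p) = 0.
for _ in range(5):
    p = rng.random(6); p /= p.sum()
    q = rng.random(6); q /= q.sum()
    print(f"D(p||q) = {kl(p, q):.4f} >= 0,  D(p||p) = {kl(p, p):.4f}")
```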
Corollary 2  Mutual information is non-negative: $I(X;Y) \ge 0$, with equality if and only if $X$ and $Y$ are independent. (Apply Theorem 8 to $p(x,y)$ and $p(x)p(y)$.)

Let $\mathcal{X}$ be the set of values that the random variable $X$ takes on, and let $|\mathcal{X}|$ denote the number of elements in the set. For discrete random variables, the uniform distribution over the range $\mathcal{X}$ has the maximum entropy.

Theorem 9  $H(X) \le \log |\mathcal{X}|$, with equality if and only if $X$ has a uniform distribution.

Proof  Let $u(x) = 1/|\mathcal{X}|$ be the uniform distribution and let $p(x)$ be the distribution for $X$. Then
\[
0 \le D(p \,\|\, u) = \sum_x p(x) \log \frac{p(x)}{u(x)} = \sum_x p(x) \log p(x) + \sum_x p(x) \log |\mathcal{X}| = \log |\mathcal{X}| - H(X). \qquad \Box
\]
Note how easily this optimizing value drops into our lap by means of an inequality. There is an important principle of engineering design here: if you can show that some performance criterion is upper-bounded by some function, and then show how to achieve that upper bound, you have an optimum design. No calculus required!

The more we know, the less uncertainty there is:

Theorem 10  Conditioning reduces entropy: $H(X|Y) \le H(X)$, with equality if and only if $X$ and $Y$ are independent.

Proof  $0 \le I(X;Y) = H(X) - H(X|Y)$. $\Box$
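A short numerical sketch of Theorems 9 and 10, again with arbitrary example distributions of my own choosing.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Theorem 9: H(X) <= log|X|, with equality for the uniform distribution.
p = np.array([0.5, 0.2, 0.2, 0.1])
print(f"H(X) = {entropy(p):.4f} <= log|X| = {np.log2(len(p)):.4f}")
print(f"H(uniform) = {entropy(np.full(4, 0.25)):.4f}")

# Theorem 10: H(X|Y) <= H(X), using an arbitrary joint pmf (rows = x, cols = y).
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])
H_X = entropy(p_xy.sum(axis=1))
H_X_given_Y = entropy(p_xy) - entropy(p_xy.sum(axis=0))   # H(X,Y) - H(Y)
print(f"H(X|Y) = {H_X_given_Y:.4f} <= H(X) = {H_X:.4f}")
```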
Theorem 11
\[
H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^n H(X_i),
\]
with equality if and only if the $X_i$ are independent.

Proof  By the chain rule for entropy,
\[
H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_{i-1}, \ldots, X_1) \le \sum_{i=1}^n H(X_i),
\]
where the inequality follows from Theorem 10 (conditioning reduces entropy).