Lecture Notes
John Duchi
3 Concentration Inequalities
  3.1 Basic tail inequalities
    3.1.1 Sub-Gaussian random variables
    3.1.2 Sub-exponential random variables
    3.1.3 First applications of concentration: random projections
    3.1.4 A second application of concentration: codebook generation
  3.2 Martingale methods
    3.2.1 Sub-Gaussian martingales and Azuma-Hoeffding inequalities
    3.2.2 Examples and bounded differences
  3.3 Uniformity, basic generalization bounds, and complexity classes
    3.3.1 Symmetrization and uniform laws
Chapter 1
This set of lecture notes explores some of the (many) connections relating information theory,
statistics, computation, and learning. Signal processing, machine learning, and statistics all revolve
around extracting useful information from signals and data. In signal processing and information
theory, a central question is how to best design signals—and the channels over which they are
transmitted—to maximally communicate and store information, and to allow the most effective
decoding. In machine learning and statistics, by contrast, it is often the case that there is a
fixed data distribution that nature provides, and it is the learner’s or statistician’s goal to recover
information about this (unknown) distribution.
A central aspect of information theory is the discovery of fundamental results: results that
demonstrate that certain procedures are optimal. That is, information theoretic tools allow a
characterization of the attainable results in a variety of communication and statistical settings. As
we explore in these notes in the context of statistical, inferential, and machine learning tasks, this
allows us to develop procedures whose optimality we can certify—no better procedure is possible.
Such results are useful for myriad reasons: we would like to avoid making bad decisions or false inferences; we may realize a task is impossible; and we can explicitly calculate the amount of data necessary for solving different statistical problems.
In this context, we provide two main high-level examples, one for each of these tasks.
Example 1.1 (Source coding): The source coding, or data compression problem, is to
take information from a source, compress it, decompress it, and recover the original message.
Graphically, we have
$$\text{Source} \longrightarrow \text{Compressor} \longrightarrow \text{Decompressor} \longrightarrow \text{Receiver}.$$
The question, then, is how to design a compressor (encoder) and decompressor (decoder) that use the fewest bits to describe a source (or a message) while preserving all the information, in the sense that the receiver receives the correct message with high probability. This minimal number of bits is then the information content of the source (signal). ♦
Example 1.2: The channel coding, or data transmission problem, is the same as the source
coding problem of Example 1.1, except that between the compressor and decompressor is a
source of noise, a channel. In this case, the graphical representation is
$$\text{Source} \longrightarrow \text{Compressor} \longrightarrow \text{Channel} \longrightarrow \text{Decompressor}.$$
Here the question is the maximum number of bits that may be sent per channel use in the sense that the receiver may reconstruct the desired message with low probability of error. Because the channel introduces noise, we require some redundancy, and information theory studies the exact amount of redundancy and number of bits that must be sent to allow such reconstruction. ♦
Here, we estimate $\hat{P}$—an empirical version of the distribution P that is easier to describe than the original signal $X_1, \ldots, X_n$—with the hope that we learn information about the generating distribution P, or at least describe it efficiently.
In our analogy with channel coding, we make a connection with estimation and inference.
Roughly, the major problem in statistics we consider is as follows: there exists some unknown
function f on a space X that we wish to estimate, and we are able to observe a noisy version
of f (Xi ) for a series of Xi drawn from a distribution P . Recalling the graphical description of
Example 1.2, we now have a channel P (Y | f (X)) that gives us noisy observations of f (X) for each
$X_i$, but we may (generally) no longer choose the encoder/compressor. That is, we have
$$\text{Source } (P) \xrightarrow{\;X_1,\ldots,X_n\;} \text{Compressor} \xrightarrow{\;f(X_1),\ldots,f(X_n)\;} \text{Channel } P(Y\mid f(X)) \xrightarrow{\;Y_1,\ldots,Y_n\;} \text{Decompressor}.$$
Example 1.3: A classical example of the statistical paradigm in this lens is the usual linear
regression problem. Here the data $X_i$ belong to $\mathbb{R}^d$, and the compression function is $f(x) = \theta^\top x$ for some vector $\theta \in \mathbb{R}^d$. Then the channel is often of the form
$$Y_i = \underbrace{\theta^\top X_i}_{\text{signal}} + \underbrace{\varepsilon_i}_{\text{noise}},$$
where $\varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathsf{N}(0, \sigma^2)$ are independent mean-zero normal perturbations. The goal is, given a
sequence of pairs (Xi , Yi ), to recover the true θ in the linear model.
In active learning or active sensing scenarios, also known as (sequential) experimental design,
we may choose the sequence Xi so as to better explore properties of θ. Later in the course we
will investigate whether it is possible to improve estimation by these strategies. As one concrete
idea, if we allow infinite power, which in this context corresponds to letting $\|X_i\| \to \infty$—choosing very “large” vectors $x_i$—then the signal $\theta^\top X_i$ should swamp any noise and make estimation easier. ♦
For the remainder of the class, we explore these ideas in substantially more detail.
on the relationship between coding and entropy (Chapter 13), but we also provide an interpreta-
tion of entropy and information as measures of uncertainty in statistical experiments and statistical
learning, which is a perspective typically missing from information-theoretic treatments of entropy
(Chapters TBD). We also relate these ideas to game-playing and maximum likelihood estimation.
Finally, we relate generic divergence measures to questions of optimality and consistency in statisti-
cal and machine learning problems, which allows us to delineate when (at least in asymptotic senses)
it is possible to computationally efficiently learn good predictors and design good experiments.
Chapter 2
In this chapter, we discuss and review many of the basic concepts of information theory. Our
presentation is relatively brisk, as our main goal is to get to the meat of the lectures on applications
of these inequalities, but we must provide a starting point.
2.1.1 Definitions
Here, we provide the basic definitions of entropy, information, and divergence, assuming the random
variables of interest are discrete or have densities with respect to Lebesgue measure.
Entropy: We begin with a central concept in information theory: the entropy. Let P be a distri-
bution on a finite (or countable) set X , and let p denote the probability mass function associated
with P . That is, if X is a random variable distributed according to P , then P (X = x) = p(x). The
entropy of X (or of P) is defined as
$$H(X) := -\sum_x p(x)\log p(x).$$
Because p(x) ≤ 1 for all x, it is clear that this quantity is nonnegative. We will show later that if X
is finite, the maximum entropy distribution on X is the uniform distribution, setting p(x) = 1/|X |
for all x, which has entropy log(|X |).
Later in the class, we provide a number of operational interpretations of the entropy. The
most common interpretation—which forms the beginning of Shannon’s classical information the-
ory [130]—is via the source-coding theorem. We present Shannon’s source coding theorem in
Chapter 13, where we show that if we wish to encode a random variable X, distributed according
to P, with a k-ary string (i.e., each entry of the string takes on one of k values), then the minimal expected length of the encoding is given by $H(X) = -\sum_x p(x)\log_k p(x)$. Moreover, this is achievable (to within a length of at most 1 symbol) by using Huffman codes (among many other types of codes). As an example of this interpretation, we may consider encoding a random variable X with the equi-probable distribution on m items, which has H(X) = log(m). In base-2, this makes sense: we simply assign an integer to each item and encode each integer with the natural (binary) integer encoding of length $\lceil\log_2 m\rceil$.
We can also define the conditional entropy, which is the amount of information left in a random
variable after observing another. In particular, we define
$$H(X\mid Y=y) = -\sum_x p(x\mid y)\log p(x\mid y) \quad\text{and}\quad H(X\mid Y) = \sum_y p(y)\,H(X\mid Y=y).$$
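As a quick illustration of these definitions, the following sketch computes H(X) and H(X | Y) from a made-up joint p.m.f.; natural logarithms match the convention above.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; 0 log 0 = 0 by convention."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# An invented joint p.m.f. p(x, y): rows index x, columns index y.
pxy = np.array([[0.3, 0.1],
                [0.1, 0.5]])
py = pxy.sum(axis=0)
H_X = entropy(pxy.sum(axis=1))
H_X_given_Y = sum(py[j] * entropy(pxy[:, j] / py[j]) for j in range(len(py)))
print(H_X, H_X_given_Y)   # conditioning reduces entropy: H(X | Y) <= H(X)
```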
Example 2.2 (Bernoulli random variables): Let h2 (p) = −p log p − (1 − p) log(1 − p) denote
the binary entropy, which is the entropy of a Bernoulli(p) random variable. ♦
Example 2.3 (Geometric random variables): A random variable X is Geometric(p), for some
p ∈ [0, 1], if it is supported on {1, 2, . . .}, and P (X = k) = (1 − p)k−1 p; this is the probability
distribution of the number X of Bernoulli(p) trials until a single success. The entropy of such
a random variable is
$$H(X) = -\sum_{k=1}^\infty (1-p)^{k-1} p\left[(k-1)\log(1-p) + \log p\right] = -\sum_{k=0}^\infty (1-p)^k p\left[k\log(1-p) + \log p\right].$$
As $\sum_{k=0}^\infty \alpha^k = \frac{1}{1-\alpha}$ and $\frac{d}{d\alpha}\frac{1}{1-\alpha} = \frac{1}{(1-\alpha)^2} = \sum_{k=1}^\infty k\alpha^{k-1}$, we have
$$H(X) = -p\log(1-p)\sum_{k=0}^\infty k(1-p)^k - p\log p\sum_{k=0}^\infty (1-p)^k = -\frac{1-p}{p}\log(1-p) - \log p,$$
which is $h_2(p)/p$, the binary entropy scaled by the expected number of trials $1/p$. ♦
Example 2.4 (A random variable with infinite entropy): While most “reasonable” discrete
random variables have finite entropy, it is possible to construct distributions with infinite
entropy. Indeed, let X have p.m.f. on {2, 3, . . .} defined by
$$p(k) = \frac{A}{k\log^2 k}, \quad\text{where } A^{-1} = \sum_{k=2}^\infty \frac{1}{k\log^2 k} < \infty,$$
the last sum finite as $\int_2^\infty \frac{1}{x\log^\alpha x}\,dx < \infty$ if and only if $\alpha > 1$: for $\alpha = 1$, we have $\int_e^x \frac{dt}{t\log t} = \log\log x$, while for $\alpha > 1$, we have
$$\frac{d}{dx}(\log x)^{1-\alpha} = (1-\alpha)\frac{1}{x\log^\alpha x},$$
so that $\int_e^\infty \frac{dt}{t\log^\alpha t} = \frac{1}{\alpha-1}$. To see that the entropy is infinite, note that $-\log p(k) = \log k + 2\log\log k - \log A \geq \log k - \log A$, so that
$$H(X) \geq A\sum_{k=2}^\infty \frac{\log k - \log A}{k\log^2 k} = \infty,$$
since the series $\sum_k \frac{1}{k\log k}$ diverges. ♦
KL-divergence: Now we define two additional quantities, which are actually much more funda-
mental than entropy: they can always be defined for any distributions and any random variables,
as they measure distance between distributions. Entropy simply makes no sense for non-discrete
random variables, let alone random variables with continuous and discrete components, though it
proves useful for some of our arguments and interpretations.
Before defining these quantities, we recall the definition of a convex function $f : \mathbb{R}^k \to \mathbb{R}$ as any bowl-shaped function, that is, one satisfying
$$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y) \tag{2.1.1}$$
for all λ ∈ [0, 1] and all x, y. The function f is strictly convex if the convexity inequality (2.1.1) is strict for λ ∈ (0, 1) and x ≠ y. We recall a standard result:
Proposition 2.5 (Jensen's inequality). Let f be convex. Then for any random variable X,
$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)].$$
Moreover, if f is strictly convex, then $f(\mathbb{E}[X]) < \mathbb{E}[f(X)]$ unless X is constant.
Now we may define and provide a few properties of the KL-divergence. Let P and Q be
distributions defined on a discrete set X . The KL-divergence between them is
$$D_{\mathrm{kl}}(P\|Q) := \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}.$$
We observe immediately that Dkl (P ||Q) ≥ 0. To see this, we apply Jensen’s inequality (Proposi-
tion 2.5) to the function − log and the random variable q(X)/p(X), where X is distributed according
to P :
$$D_{\mathrm{kl}}(P\|Q) = -\mathbb{E}_P\left[\log\frac{q(X)}{p(X)}\right] \geq -\log\mathbb{E}_P\left[\frac{q(X)}{p(X)}\right] = -\log\sum_x p(x)\frac{q(x)}{p(x)} = -\log(1) = 0.$$
Moreover, as − log is strictly convex, we have $D_{\mathrm{kl}}(P\|Q) > 0$ unless P = Q. Another consequence of the positivity of the KL-divergence is that whenever the set X is finite with cardinality |X| < ∞,
for any random variable X supported on X we have H(X) ≤ log |X|. Indeed, letting m = |X|, letting Q be the uniform distribution on X so that q(x) = 1/m, and letting X have distribution P on X, we have
$$0 \leq D_{\mathrm{kl}}(P\|Q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = -H(X) - \sum_x p(x)\log q(x) = -H(X) + \log m, \tag{2.1.2}$$
so that H(X) ≤ log m. Thus, the uniform distribution has the highest entropy over all distributions on the set X.
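A short numerical check of the identity (2.1.2): with q uniform, the KL-divergence is exactly the entropy gap log m − H(X). The distribution below is drawn at random purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6
p = rng.dirichlet(np.ones(m))     # an arbitrary p.m.f. on m items
q = np.full(m, 1.0 / m)           # the uniform distribution

H = -np.sum(p * np.log(p))
Dkl = np.sum(p * np.log(p / q))
print(Dkl, np.log(m) - H)         # identical, and both are nonnegative
```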
Mutual information: Having defined KL-divergence, we may now describe the information
content between two random variables X and Y . The mutual information I(X; Y ) between X and
Y is the KL-divergence between their joint distribution and the product of their marginal distributions.
More mathematically,
$$I(X;Y) := \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}. \tag{2.1.3}$$
We can rewrite this in several ways. First, using Bayes’ rule, we have p(x, y)/p(y) = p(x | y), so
$$I(X;Y) = \sum_{x,y} p(y)p(x\mid y)\log\frac{p(x\mid y)}{p(x)} = -\sum_x\sum_y p(y)p(x\mid y)\log p(x) + \sum_y p(y)\sum_x p(x\mid y)\log p(x\mid y) = H(X) - H(X\mid Y).$$
Similarly, we have I(X; Y ) = H(Y ) − H(Y | X), so mutual information can be thought of as the
amount of entropy removed (on average) in X by observing Y . We may also think of mutual infor-
mation as measuring the similarity between the joint distribution of X and Y and their distribution
when they are treated as independent.
Comparing the definition (2.1.3) to that for KL-divergence, we see that if $P_{XY}$ is the joint distribution of X and Y, while $P_X$ and $P_Y$ are their marginal distributions (distributions when X and Y are treated independently), then
$$I(X;Y) = D_{\mathrm{kl}}(P_{XY}\,\|\,P_X\times P_Y).$$
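The following sketch verifies, on the same toy joint table as before (invented numbers), that the KL form of mutual information agrees with H(X) − H(X | Y):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

pxy = np.array([[0.3, 0.1],       # invented joint table, rows = x, cols = y
                [0.1, 0.5]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

I_kl = np.sum(pxy * np.log(pxy / np.outer(px, py)))   # Dkl(P_XY || P_X x P_Y)
I_ent = entropy(px) - sum(py[j] * entropy(pxy[:, j] / py[j])
                          for j in range(len(py)))    # H(X) - H(X | Y)
print(I_kl, I_ent)                                    # the two agree
```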
Entropies of continuous random variables For continuous random variables, we may define
an analogue of the entropy known as differential entropy, which for a random variable X with
density p is defined by
$$h(X) := -\int p(x)\log p(x)\,dx. \tag{2.1.4}$$
Note that the differential entropy may be negative—it is no longer directly a measure of the number
of bits required to describe a random variable X (on average), as was the case for the entropy. We
can similarly define the conditional entropy
$$h(X\mid Y) = -\int p(y)\int p(x\mid y)\log p(x\mid y)\,dx\,dy.$$
We remark that the conditional differential entropy of X given Y for Y with arbitrary distribution—
so long as X has a density—is
$$h(X\mid Y) = \mathbb{E}\left[-\int p(x\mid Y)\log p(x\mid Y)\,dx\right],$$
where p(x | y) denotes the conditional density of X when Y = y. The KL divergence between
distributions P and Q with densities p and q becomes
$$D_{\mathrm{kl}}(P\|Q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx,$$
and similarly, we have the analogues of mutual information as
$$I(X;Y) = \int p(x,y)\log\frac{p(x,y)}{p(x)p(y)}\,dx\,dy = h(X) - h(X\mid Y) = h(Y) - h(Y\mid X).$$
As we show in the next subsection, we can define the KL-divergence between arbitrary distributions
(and mutual information between arbitrary random variables) more generally without requiring
discrete or continuous distributions. Before investigating these issues, however, we present a few
examples. We also see immediately that for X uniform on a set [a, b], we have h(X) = log(b − a).
Example 2.6 (Entropy of normal random variables): The differential entropy (2.1.4) of a
normal random variable is straightforward to compute. Indeed, for $X \sim \mathsf{N}(\mu,\sigma^2)$ we have $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{1}{2\sigma^2}(x-\mu)^2)$, so that
$$h(X) = -\int p(x)\left[\log\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}(x-\mu)^2\right]dx = \frac{1}{2}\log(2\pi\sigma^2) + \frac{\mathbb{E}[(X-\mu)^2]}{2\sigma^2} = \frac{1}{2}\log(2\pi e\sigma^2).$$
For a general multivariate Gaussian, where $X\sim\mathsf{N}(\mu,\Sigma)$ for a vector $\mu\in\mathbb{R}^n$ and $\Sigma\succ 0$ with density $p(x) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(\Sigma)}}\exp(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu))$, we similarly have
$$h(X) = \frac{1}{2}\mathbb{E}\left[n\log(2\pi) + \log\det(\Sigma) + (X-\mu)^\top\Sigma^{-1}(X-\mu)\right] = \frac{n}{2}\log(2\pi) + \frac{1}{2}\log\det(\Sigma) + \frac{1}{2}\operatorname{tr}(\Sigma^{-1}\Sigma) = \frac{n}{2}\log(2\pi e) + \frac{1}{2}\log\det(\Sigma).$$
♦
Continuing our examples with normal distributions, we may compute the divergence between
two multivariate Gaussian distributions:
Example 2.7 (Divergence between Gaussian distributions): Let P be the multivariate normal
N(µ1 , Σ), and Q be the multivariate normal distribution with mean µ2 and identical covariance
$\Sigma \succ 0$. Then we have that
$$D_{\mathrm{kl}}(P\|Q) = \frac{1}{2}(\mu_1-\mu_2)^\top\Sigma^{-1}(\mu_1-\mu_2). \tag{2.1.5}$$
We leave the computation of the identity (2.1.5) to the reader. ♦
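As a sanity check on (2.1.5), whose derivation the notes leave to the reader, one can estimate $\mathbb{E}_P[\log(p(X)/q(X))]$ by Monte Carlo; the dimension and parameters below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 3, 200_000                              # dimension, Monte Carlo size
mu1, mu2 = rng.normal(size=n), rng.normal(size=n)
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)                # a positive definite covariance
Sinv = np.linalg.inv(Sigma)

closed_form = 0.5 * (mu1 - mu2) @ Sinv @ (mu1 - mu2)

# For equal covariances, log(p(x)/q(x)) is linear in x plus a constant:
X = rng.multivariate_normal(mu1, Sigma, size=N)
log_ratio = X @ Sinv @ (mu1 - mu2) - 0.5 * (mu1 @ Sinv @ mu1 - mu2 @ Sinv @ mu2)
print(closed_form, log_ratio.mean())           # agree up to Monte Carlo error
```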
An interesting consequence of Example 2.7 is that if a random vector X has a given covari-
ance Σ ∈ Rn×n , then the multivariate Gaussian with identical covariance has larger differential
entropy. Put another way, differential entropy for random variables with second moments is always
maximized by the Gaussian distribution.
Proposition 2.8. Let X be a random vector on Rn with a density, and assume that Cov(X) = Σ.
Then for Z ∼ N(0, Σ), we have
h(X) ≤ h(Z).
Proof Without loss of generality, we assume that X has mean 0. Let P be the distribution of
X with density p, and let Q be multivariate normal with mean 0 and covariance Σ; let Z be this
random variable. Then
$$D_{\mathrm{kl}}(P\|Q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx = -h(X) + \int p(x)\left[\frac{n}{2}\log(2\pi) + \frac{1}{2}\log\det(\Sigma) + \frac{1}{2}x^\top\Sigma^{-1}x\right]dx = -h(X) + h(Z),$$
because Z has the same covariance as X. As $0 \leq D_{\mathrm{kl}}(P\|Q)$, we have $h(Z) \geq h(X)$ as desired.
We remark in passing that the fact that Gaussian random variables have the largest entropy has
been used to prove stronger variants of the central limit theorem; see the original results of Barron
[18], as well as later quantitative results on the increase of entropy of normalized sums by Artstein
et al. [9] and Madiman and Barron [110].
Chain rules for information and divergence: As another immediate corollary to the chain
rule for entropy, we see that mutual information also obeys a chain rule:
$$I(X;Y_1^n) = \sum_{i=1}^n I(X;Y_i\mid Y_1^{i-1}).$$
Indeed, we have
$$I(X;Y_1^n) = H(Y_1^n) - H(Y_1^n\mid X) = \sum_{i=1}^n\left[H(Y_i\mid Y_1^{i-1}) - H(Y_i\mid X, Y_1^{i-1})\right] = \sum_{i=1}^n I(X;Y_i\mid Y_1^{i-1}).$$
The KL-divergence obeys similar chain rules, making mutual information and KL-divergence mea-
sures useful tools for evaluation of distances and relationships between groups of random variables.
As a second example, suppose that the distribution P = P1 ×P2 ×· · ·×Pn , and Q = Q1 ×· · ·×Qn ,
that is, that P and Q are product distributions over independent random variables Xi ∼ Pi or
Xi ∼ Qi . Then we immediately have the tensorization identity
$$D_{\mathrm{kl}}(P\|Q) = D_{\mathrm{kl}}(P_1\times\cdots\times P_n\,\|\,Q_1\times\cdots\times Q_n) = \sum_{i=1}^n D_{\mathrm{kl}}(P_i\|Q_i).$$
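A brute-force check of this tensorization identity for products of Bernoulli distributions (the parameters are drawn at random for illustration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n = 4
p = rng.uniform(0.1, 0.9, n)   # Bernoulli parameters for P_1, ..., P_n
q = rng.uniform(0.1, 0.9, n)   # Bernoulli parameters for Q_1, ..., Q_n

def bern_kl(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

# Brute-force KL over all 2^n outcomes of the product distributions:
D_joint = 0.0
for x in itertools.product([0, 1], repeat=n):
    x = np.array(x)
    P = np.prod(np.where(x == 1, p, 1 - p))
    Q = np.prod(np.where(x == 1, q, 1 - q))
    D_joint += P * np.log(P / Q)

print(D_joint, sum(bern_kl(p[i], q[i]) for i in range(n)))   # equal
```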
We remark in passing that these two identities hold for arbitrary distributions $P_i$ and $Q_i$ or random variables X, Y. As a final tensorization identity, we consider a more general chain rule for KL-divergences, which will frequently be useful. We abuse notation temporarily, and for random variables X and Y with distributions P and Q, respectively, we denote
$$D_{\mathrm{kl}}(X\|Y) := D_{\mathrm{kl}}(P\|Q).$$
In analogy to the entropy, we can also define the conditional KL divergence. Let X and Y have
distributions PX|z and PY |z conditioned on Z = z, respectively. Then we define
$$D_{\mathrm{kl}}(X\|Y\mid Z) = \mathbb{E}_Z\left[D_{\mathrm{kl}}(P_{X|Z}\,\|\,P_{Y|Z})\right],$$
so that if Z is discrete we have $D_{\mathrm{kl}}(X\|Y\mid Z) = \sum_z p(z)\,D_{\mathrm{kl}}(P_{X|z}\|P_{Y|z})$. With this notation, we have the chain rule
have the chain rule
$$D_{\mathrm{kl}}(X_1,\ldots,X_n\,\|\,Y_1,\ldots,Y_n) = \sum_{i=1}^n D_{\mathrm{kl}}\left(X_i\,\|\,Y_i\mid X_1^{i-1}\right), \tag{2.1.6}$$
because (in the discrete case, which—as we discuss presently—is fully general for this purpose) for
distributions PXY and QXY we have
$$D_{\mathrm{kl}}(P_{XY}\|Q_{XY}) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{q(x,y)} = \sum_{x,y} p(x)p(y\mid x)\left[\log\frac{p(y\mid x)}{q(y\mid x)} + \log\frac{p(x)}{q(x)}\right] = \sum_x p(x)\log\frac{p(x)}{q(x)} + \sum_x p(x)\sum_y p(y\mid x)\log\frac{p(y\mid x)}{q(y\mid x)},$$
where the final equality uses that $\sum_y p(y\mid x) = 1$ for all x.
Expanding upon this, we give several tensorization identities, showing how to transform ques-
tions about the joint distribution of many random variables to simpler questions about their
marginals. As a first example, as a consequence of the fact that conditioning decreases entropy, we see that for any sequence of (discrete or continuous, as appropriate) random variables, we have
$$H(X_1,\ldots,X_n) \leq \sum_{i=1}^n H(X_i) \quad\text{and}\quad h(X_1,\ldots,X_n) \leq \sum_{i=1}^n h(X_i).$$
Both inequalities hold with equality if and only if $X_1,\ldots,X_n$ are mutually independent. (The only if follows because I(X;Y) > 0 whenever X and Y are not independent, by Jensen's inequality and the fact that $D_{\mathrm{kl}}(P\|Q) > 0$ unless P = Q.)
We return to information and divergence now. Suppose that random variables $Y_i$ are independent conditional on X, meaning that
$$P(Y_1 = y_1,\ldots,Y_n = y_n\mid X) = \prod_{i=1}^n P(Y_i = y_i\mid X).$$
Such scenarios are common—as we shall see—when we make multiple observations from a fixed
distribution parameterized by some X. Then we have the inequality
$$I(X;Y_1,\ldots,Y_n) = \sum_{i=1}^n\left[H(Y_i\mid Y_1^{i-1}) - H(Y_i\mid X, Y_1^{i-1})\right] = \sum_{i=1}^n\left[H(Y_i\mid Y_1^{i-1}) - H(Y_i\mid X)\right] \leq \sum_{i=1}^n\left[H(Y_i) - H(Y_i\mid X)\right] = \sum_{i=1}^n I(X;Y_i), \tag{2.1.7}$$
where the second equality uses the conditional independence of the $Y_i$ given X, and the inequality uses that conditioning reduces entropy. Finally, we consider a Markov chain
$$X \to Y \to Z.$$
Proposition 2.9. With the above Markov chain, we have I(X; Z) ≤ I(X; Y ).
Proof We expand the mutual information in two ways via the chain rule: $I(X;Y,Z) = I(X;Z) + I(X;Y\mid Z) = I(X;Y) + I(X;Z\mid Y)$, where we note that the final term $I(X;Z\mid Y) = 0$ because X is independent of Z given Y. As $I(X;Y\mid Z) \geq 0$, we conclude $I(X;Z) \leq I(X;Y)$.
There are related data processing inequalities for the KL-divergence—which we generalize in
the next section—as well. In this case, we may consider a simple Markov chain X → Z. If we let $P_1$ and $P_2$ be distributions on X and $Q_1$ and $Q_2$ be the induced distributions on Z, that is, $Q_i(A) = \int P(Z\in A\mid x)\,dP_i(x)$, then we have
$$D_{\mathrm{kl}}(Q_1\|Q_2) \leq D_{\mathrm{kl}}(P_1\|P_2),$$
the basic KL-divergence data processing inequality. A consequence of this is that, for any function
f and random variables X and Y on the same space, we have
Dkl (f (X)||f (Y )) ≤ Dkl (X||Y ) .
We explore these data processing inequalities more when we generalize KL-divergences in the next
section and in the exercises.
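A small numerical illustration of the KL data processing inequality, pushing two distributions on an 8-point space through a many-to-one map (both distributions and the map are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 8
p1 = rng.dirichlet(np.ones(m))   # two arbitrary distributions on 8 points
p2 = rng.dirichlet(np.ones(m))
f = np.array([0, 0, 0, 1, 1, 1, 2, 2])   # a many-to-one map into {0, 1, 2}

def kl(p, q):
    return np.sum(p * np.log(p / q))

q1 = np.array([p1[f == z].sum() for z in range(3)])   # induced distributions
q2 = np.array([p2[f == z].sum() for z in range(3)])
print(kl(q1, q2), "<=", kl(p1, p2))   # processing can only shrink Dkl
```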
Given a partition $A_1,\ldots,A_m$ of $\mathcal{X}$, the induced quantizer $\mathsf{q}_{\mathcal{A}}$ is defined by $\mathsf{q}_{\mathcal{A}}(x) = i$ when $x \in A_i$.
2.2.2 KL-divergence
In this section, we present the general definition of a KL-divergence, which holds for any pair of
distributions. Let P and Q be distributions on a space X . Now, let A be a finite algebra on X
(as in the previous section, this is equivalent to picking a partition of X and then constructing the
associated algebra), with its atoms denoted atoms(A). The KL-divergence between P and Q conditioned on A is
$$D_{\mathrm{kl}}(P\|Q\mid\mathcal{A}) := \sum_{A\in\mathrm{atoms}(\mathcal{A})} P(A)\log\frac{P(A)}{Q(A)}.$$
That is, we simply sum over the partition of X . Another way to write this is as follows. Let
q : X → {1, . . . , m} be a quantizer, and define the sets $A_i = \mathsf{q}^{-1}(\{i\})$ to be the pre-images of each i (i.e., the different quantization regions, or the partition of X that q induces). Then the quantized KL-divergence between P and Q is
$$D_{\mathrm{kl}}(P\|Q\mid\mathsf{q}) := \sum_{i=1}^m P(A_i)\log\frac{P(A_i)}{Q(A_i)}.$$
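The next sketch illustrates the quantized divergence: partitioning R into the quantiles of P = N(0, 1) and refining the partition, the quantized KL between N(0, 1) and N(1, 1) increases toward the analytic value 1/2 from (2.1.5). It assumes scipy is available for the normal c.d.f. and quantile functions.

```python
import numpy as np
from scipy.stats import norm

# Quantize R by the m-quantiles of P = N(0, 1), so each atom has P-mass 1/m.
for m in [2, 4, 16, 64, 256]:
    edges = norm.ppf(np.linspace(0, 1, m + 1))   # includes -inf and +inf
    P = np.full(m, 1.0 / m)
    Q = np.diff(norm.cdf(edges, loc=1, scale=1))
    print(m, np.sum(P * np.log(P / Q)))
# the values increase with the refinement, approaching Dkl = 1/2
```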
We may now give the fully general definition of KL-divergence: the KL-divergence between P and Q is defined as
$$D_{\mathrm{kl}}(P\|Q) := \sup_{\mathsf{q}} D_{\mathrm{kl}}(P\|Q\mid\mathsf{q}), \tag{2.2.1}$$
where the supremum is taken over all quantizers q of X.
This also gives a rigorous definition of mutual information. Indeed, if X and Y are random variables
with joint distribution $P_{XY}$ and marginal distributions $P_X$ and $P_Y$, we simply define
$$I(X;Y) := D_{\mathrm{kl}}(P_{XY}\,\|\,P_X\times P_Y).$$
If P and Q both have probability mass functions p and q, then—as we see in Exercise 2.6—the definition (2.2.1) is equivalent to
$$D_{\mathrm{kl}}(P\|Q) = \sum_x p(x)\log\frac{p(x)}{q(x)},$$
the discrete definition given previously.
Measure-theoretic definition of KL-divergence If you have never seen measure theory be-
fore, skim this section; while the notation may be somewhat intimidating, it is fine to always
consider only continuous or fully discrete distributions. We will describe an interpretation meaning that, for our purposes, one never needs to think carefully about measure-theoretic issues.
The general definition (2.2.1) of KL-divergence is equivalent to the following. Let µ be a measure
on X , and assume that P and Q are absolutely continuous with respect to µ, with densities p and
q, respectively. (For example, take µ = P + Q.) Then
$$D_{\mathrm{kl}}(P\|Q) = \int_{\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}\,d\mu(x). \tag{2.2.2}$$
The proof of this fact is somewhat involved, requiring the technology of Lebesgue integration. (See
Gray [77, Chapter 5].)
For those who have not seen measure theory, the interpretation of the equality (2.2.2) should be as follows. When integrating a function f(x), replace $\int f(x)\,d\mu(x)$ with one of two pairs of symbols: one may simply think of $d\mu(x)$ as $dx$, so that we are performing standard integration $\int f(x)\,dx$, or one should think of the integral operation $\int f(x)\,d\mu(x)$ as summing the argument of the integral, so that $\int f(x)\,d\mu(x) = \sum_x f(x)$. (This corresponds to $\mu$ being "counting measure" on $\mathcal{X}$.)
2.2.3 f -divergences
A more general notion of divergence is the so-called f -divergence, or Ali-Silvey divergence [4, 49]
(see also the alternate interpretations in the article by Liese and Vajda [108]). Here, the definition
is as follows. Let P and Q be probability distributions on the set X , and let f : R+ → R be a
convex function satisfying f(1) = 0. If X is a discrete set, then the f-divergence between P and Q is
$$D_f(P\|Q) := \sum_x q(x)\,f\!\left(\frac{p(x)}{q(x)}\right).$$
More generally, for any set X and a quantizer q : X → {1, . . . , m}, letting Ai = q−1 ({i}) = {x ∈
X | q(x) = i} be the partition the quantizer induces, we can define the quantized divergence
$$D_f(P\|Q\mid\mathsf{q}) = \sum_{i=1}^m Q(A_i)\,f\!\left(\frac{P(A_i)}{Q(A_i)}\right),$$
and the general definition of an f divergence is (in analogy with the definition (2.2.1) of general
KL divergences)
$$D_f(P\|Q) := \sup_{\mathsf{q}} D_f(P\|Q\mid\mathsf{q}). \tag{2.2.3}$$
The definition (2.2.3) shows that, any time we have computations involving f -divergences—such
as KL-divergence or mutual information—it is no loss of generality, when performing the compu-
tations, to assume that all distributions have finite discrete support. There is a measure-theoretic
version of the definition (2.2.3) which is frequently easier to use. Assume w.l.o.g. that P and Q are
absolutely continuous with respect to the base measure µ. The f divergence between P and Q is
then
$$D_f(P\|Q) := \int_{\mathcal{X}} q(x)\,f\!\left(\frac{p(x)}{q(x)}\right)d\mu(x). \tag{2.2.4}$$
This definition, it turns out, is not quite as general as we would like—in particular, it is unclear
how we should define the integral for points x such that q(x) = 0. With that in mind, we recall
that the perspective transform (see Appendices A.1.1 and A.2.3) of a function f : R → R is defined
by pers(f )(t, u) = uf (t/u) if u > 0 and by +∞ if u ≤ 0. This function is convex in its arguments
(Proposition A.20). In fact, this is not quite enough for the fully correct definition. The closure of a convex function f is $\operatorname{cl} f(x) = \sup\{\ell(x)\mid\ell\leq f,\ \ell\text{ linear}\}$, the supremum over all linear functions that globally lower bound f. Then [87, Proposition IV.2.2.2] the closure of pers(f) is defined, for any $t_0\in\operatorname{int}\operatorname{dom} f$, by
$$\operatorname{cl}\operatorname{pers}(f)(t,u) = \begin{cases} u f(t/u) & \text{if } u > 0, \\ \lim_{\alpha\downarrow 0}\alpha f(t_0 + t/\alpha) & \text{if } u = 0, \\ +\infty & \text{if } u < 0. \end{cases}$$
(The choice of $t_0$ does not affect the definition.) Then the fully general formula expressing the f-divergence is
$$D_f(P\|Q) = \int_{\mathcal{X}} \operatorname{cl}\operatorname{pers}(f)(p(x), q(x))\,d\mu(x). \tag{2.2.5}$$
This is what we mean by equation (2.2.4), which we use without comment.
In the exercises, we explore several properties of f -divergences, including the quantized repre-
sentation (2.2.3), showing different data processing inequalities and orderings of quantizers based
on the fineness of their induced partitions. Broadly, f -divergences satisfy essentially the same prop-
erties as KL-divergence, such as data-processing inequalities, and they provide a generalization of
mutual information. We explore f -divergences from a non-standard perspective later—they are
important both for optimality in estimation and for consistency and prediction problems, as
we discuss in Chapter 18.
Examples We give several examples of f-divergences here; in Section 7.2.2 we provide a few examples of their uses, as well as a few natural inequalities between them.
1. KL-divergence: by taking f (t) = t log t, which is convex and satisfies f (1) = 0, we obtain
Df (P ||Q) = Dkl (P ||Q).
2. KL-divergence, reversed: by taking f (t) = − log t, we obtain Df (P ||Q) = Dkl (Q||P ).
3. The total variation distance between probability distributions P and Q defined on a set X is defined as the maximum difference between the probabilities they assign to subsets of X:
$$\|P-Q\|_{\mathrm{TV}} := \sup_{A\subset\mathcal{X}} |P(A) - Q(A)|. \tag{2.2.6}$$
Note that (by considering complements, $P(A^c) = 1 - P(A)$) the absolute value on the right hand side is unnecessary. The total variation distance, as we shall see later in the course, is very important for verifying the optimality of different tests, and appears in the measurement of the difficulty of solving hypothesis testing problems. An important inequality, known as Pinsker's inequality, is that
$$\|P-Q\|_{\mathrm{TV}}^2 \leq \frac{1}{2} D_{\mathrm{kl}}(P\|Q). \tag{2.2.7}$$
By taking $f(t) = \frac{1}{2}|t-1|$, we obtain the total variation distance. Indeed, we have
$$D_f(P\|Q) = \frac{1}{2}\int\left|\frac{p(x)}{q(x)}-1\right| q(x)\,d\mu(x) = \frac{1}{2}\int|p(x)-q(x)|\,d\mu(x) = \frac{1}{2}\int_{x:p(x)>q(x)}[p(x)-q(x)]\,d\mu(x) + \frac{1}{2}\int_{x:q(x)>p(x)}[q(x)-p(x)]\,d\mu(x) = \frac{1}{2}\sup_{A\subset\mathcal{X}}[P(A)-Q(A)] + \frac{1}{2}\sup_{A\subset\mathcal{X}}[Q(A)-P(A)] = \|P-Q\|_{\mathrm{TV}}.$$
(We check these identities numerically in a sketch following the examples below.)
4. The Hellinger distance is generated by taking $f(t) = (\sqrt{t}-1)^2$, giving
$$d_{\mathrm{hel}}(P,Q)^2 := \int\left(\sqrt{p(x)}-\sqrt{q(x)}\right)^2 d\mu(x). \tag{2.2.8}$$

5. The χ²-divergence is generated by taking $f(t) = \frac{1}{2}(t-1)^2$, and between distributions P and Q is given by
$$D_{\chi^2}(P\|Q) = \frac{1}{2}\int\left(\frac{p(x)}{q(x)}-1\right)^2 q(x)\,d\mu(x). \tag{2.2.9}$$
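As promised above, the following sketch computes $\|P-Q\|_{\mathrm{TV}}$ on a small discrete space both as half the L1 distance and as the supremum over subsets (random distributions, brute-force enumeration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
m = 5
p = rng.dirichlet(np.ones(m))   # two random p.m.f.s on 5 points
q = rng.dirichlet(np.ones(m))

tv_l1 = 0.5 * np.abs(p - q).sum()                       # half-L1 form
tv_sup = max(abs(p[list(A)].sum() - q[list(A)].sum())   # sup over all subsets
             for r in range(m + 1)
             for A in itertools.combinations(range(m), r))
print(tv_l1, tv_sup)   # identical
```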
There are a variety of inequalities relating different f-divergences, which are often convenient for analyzing the properties of product distributions (as will become apparent in Chapter 7). We enumerate a few of the most important here, relating the variation distance to the others.
Proposition 2.10. The total variation distance satisfies the following relationships:
(a) For the Hellinger distance,
$$\frac{1}{2} d_{\mathrm{hel}}(P,Q)^2 \leq \|P-Q\|_{\mathrm{TV}} \leq d_{\mathrm{hel}}(P,Q)\sqrt{1 - \frac{d_{\mathrm{hel}}(P,Q)^2}{4}}.$$
We provide the proof of Proposition 2.10 in Section 2.4.1. We also have the following bounds on
the KL-divergence in terms of the χ2 -divergence.
Proposition 2.11. For any distributions P, Q,
It is also possible to relate mutual information between distributions to f -divergences, and even
to bound the mutual information above and below by the Hellinger distance for certain problems. In
this case, we consider the following situation: let V ∈ {0, 1} be uniform at random, and conditional on V = v, draw $X \sim P_v$ for some distribution $P_v$ on a space X. Then we have that
$$I(X;V) = \frac{1}{2} D_{\mathrm{kl}}(P_0\,\|\,\bar{P}) + \frac{1}{2} D_{\mathrm{kl}}(P_1\,\|\,\bar{P}),$$
where $\bar{P} = \frac{1}{2}P_0 + \frac{1}{2}P_1$. As a consequence, we also have
$$I(X;V) = \frac{1}{2} D_f(P_0\|P_1) + \frac{1}{2} D_f(P_1\|P_0),$$
where $f(t) = -t\log\left(\frac{1}{2t} + \frac{1}{2}\right) = t\log\frac{2t}{t+1}$, so that the mutual information is a particular f-divergence.
This form—as we see in the later chapters—is frequently convenient because it gives an object
with similar tensorization properties to KL-divergence while enjoying the boundedness properties
of Hellinger and variation distances. The following proposition captures the latter properties.
Proposition 2.12. Let (X, V ) be distributed as above. Then
Proposition 2.13. Let f be convex with f(1) = 0, let $P_1, P_2, Q_1, Q_2$ be probability distributions on X, and let λ ∈ [0, 1]. Then
$$D_f(\lambda P_1 + (1-\lambda)P_2\,\|\,\lambda Q_1 + (1-\lambda)Q_2) \leq \lambda D_f(P_1\|Q_1) + (1-\lambda)D_f(P_2\|Q_2).$$
We leave the proof of this proposition as an exercise (Q. 2.11), treating it as a consequence
of the more general “log-sum” like inequalities of Question 2.8. It is, however, an immediate
consequence of the fully specified definition (2.2.5) of an f -divergence, because pers(f ) is jointly
convex. As an immediate corollary, we see that the same result is true for KL-divergence as well.
Corollary 2.14. The KL-divergence Dkl (P ||Q) is jointly convex in its arguments P and Q.
We can also provide more general data processing inequalities for f -divergences, paralleling
those for the KL-divergence. In this case, we consider random variables X and Z on spaces X
and Z, respectively, and a Markov transition kernel K giving the Markov chain X → Z. That
is, K(· | x) is a probability distribution on Z for each x ∈ X , and conditioned on X = x, Z has
distribution K(· | x) so that K(A | x) = P(Z ∈ A | X = x). Certainly, this includes the situation
when Z = φ(X) for some function φ, and more generally when Z = φ(X, U ) for a function φ and
some additional randomness U . For a distribution P on X, we then define the marginals
$$KP(A) := \int_{\mathcal{X}} K(A\mid x)\,dP(x).$$
Proposition 2.15. Let P and Q be distributions on X and let K be any Markov kernel. Then
$$D_f(KP\,\|\,KQ) \leq D_f(P\|Q).$$
Thus, further processing of random variables can only bring them “closer” in the space of distribu-
tions; downstream processing of signals cannot make them further apart as distributions.
decide which distribution $P_i$ is the true $P^\star$ if all the distributions are similar—the divergence between the $P_i$ is small, or the information between X and $P^\star$ is negligible—and easy if the distances
between the distributions Pi are large. With this outline in mind, we present two inequalities, and
first examples of their application, to make concrete these connections to the notions of information
and divergence defined in this section.
$$\inf_\Psi\left\{P_1(\Psi\neq 1) + P_2(\Psi\neq 2)\right\} = \inf_{A\subset\mathcal{X}}\left\{1 - (P_1(A) - P_2(A))\right\} = 1 - \sup_{A\subset\mathcal{X}}(P_1(A) - P_2(A)) = 1 - \|P_1 - P_2\|_{\mathrm{TV}},$$
where the infimum is over tests $\Psi : \mathcal{X}\to\{1,2\}$.
In the two-hypothesis case, we also know that the optimal test, by the Neyman-Pearson lemma,
is a likelihood ratio test. That is, assuming that P1 and P2 have densities p1 and p2 , the optimal
test is of the form
$$\Psi(X) = \begin{cases} 1 & \text{if } \frac{p_1(X)}{p_2(X)} \geq t, \\ 2 & \text{if } \frac{p_1(X)}{p_2(X)} < t, \end{cases}$$
for some threshold t ≥ 0. In the case that the prior probabilities on $P_1$ and $P_2$ are each $\frac{1}{2}$, the threshold t = 1 is optimal.
We give one example application of Proposition 2.17 to the problem of testing a normal mean.
Example 2.18 (Testing a normal mean): Suppose we observe $X_1,\ldots,X_n \stackrel{\mathrm{iid}}{\sim} P$ for $P = P_1$ or $P = P_2$, where $P_v$ is the normal distribution $\mathsf{N}(\mu_v,\sigma^2)$ and $\mu_1 \neq \mu_2$. We would like to understand the sample size n necessary to guarantee that no test can have small error, that is, say, that
$$\inf_\Psi\left\{P_1(\Psi(X_1,\ldots,X_n)\neq 1) + P_2(\Psi(X_1,\ldots,X_n)\neq 2)\right\} \geq \frac{1}{2}.$$
By Proposition 2.17, we have that
$$\inf_\Psi\left\{P_1(\Psi(X_1,\ldots,X_n)\neq 1) + P_2(\Psi(X_1,\ldots,X_n)\neq 2)\right\} \geq 1 - \|P_1^n - P_2^n\|_{\mathrm{TV}},$$
where $P_v^n$ denotes the n-fold product of $P_v$, that is, the distribution of $X_1,\ldots,X_n \stackrel{\mathrm{iid}}{\sim} P_v$.
The interaction between total variation distance and product distributions is somewhat subtle,
so it is often advisable to use a divergence measure more attuned to the i.i.d. nature of the sam-
pling scheme. Two such measures are the KL-divergence and Hellinger distance, both of which
we explore in the coming chapters. With that in mind, we apply Pinsker’s inequality (2.2.7)
to see that $\|P_1^n - P_2^n\|_{\mathrm{TV}}^2 \leq \frac{1}{2} D_{\mathrm{kl}}(P_1^n\|P_2^n) = \frac{n}{2} D_{\mathrm{kl}}(P_1\|P_2)$, which implies that
$$1 - \|P_1^n - P_2^n\|_{\mathrm{TV}} \geq 1 - \sqrt{\frac{n}{2} D_{\mathrm{kl}}(P_1\|P_2)} = 1 - \sqrt{\frac{n}{2}\cdot\frac{(\mu_1-\mu_2)^2}{2\sigma^2}} = 1 - \frac{\sqrt{n}\,|\mu_1-\mu_2|}{2\sigma}.$$
In particular, if $n \leq \frac{\sigma^2}{(\mu_1-\mu_2)^2}$, then we have our desired lower bound of $\frac{1}{2}$.
Conversely, a calculation yields that $n \geq \frac{C\sigma^2}{(\mu_1-\mu_2)^2}$, for some numerical constant C ≥ 1, implies small probability of error. We leave this calculation to the reader. ♦
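A numerical companion to Example 2.18, assuming scipy for the normal c.d.f.: for equal variances and equal priors, the likelihood ratio test with t = 1 reduces to thresholding the sample mean at the midpoint $(\mu_1+\mu_2)/2$, so its summed error is $2\Phi(-\sqrt{n}\,|\mu_1-\mu_2|/(2\sigma))$, which we can compare with the Pinsker-based lower bound derived above. The parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import norm

mu1, mu2, sigma = 0.0, 0.5, 1.0   # arbitrary parameter choices
gap = abs(mu1 - mu2)
for n in [1, 2, 4, 8, 16]:
    # Summed error of the optimal midpoint test on the sample mean:
    exact = 2 * norm.cdf(-np.sqrt(n) * gap / (2 * sigma))
    # Pinsker-based lower bound from the display above:
    lower = max(0.0, 1 - np.sqrt(n) * gap / (2 * sigma))
    print(n, exact, lower)
# for n <= sigma^2 / gap^2 = 4, the bound certifies summed error >= 1/2
```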
where the zero follows because when there is no error, X has no variability given $\hat{X}$. Expanding the entropy by the chain rule in a different order, we have
$$H(X, E\mid\hat{X}) = H(X\mid\hat{X}) + \underbrace{H(E\mid\hat{X}, X)}_{=0},$$
so that
$$H(X\mid\hat{X}) = H(X, E\mid\hat{X}) = P(E=1)\,H(X\mid E=1,\hat{X}) + H(E\mid\hat{X}).$$
Noting that $H(E\mid\hat{X}) \leq H(E) = h_2(P(E=1))$, as conditioning reduces entropy, and that $H(X\mid E=1,\hat{X}) \leq \log(|\mathcal{X}|-1)$, as X can take on at most $|\mathcal{X}|-1$ values when there is an error, completes the proof.
Corollary 2.20. Let X be uniform on $\mathcal{X}$ and assume the Markov chain $X\to Y\to\hat{X}$. Then
$$P(\hat{X}\neq X) \geq 1 - \frac{I(X;Y) + \log 2}{\log(|\mathcal{X}|)}. \tag{2.3.2}$$
Proof Let $P_{\mathrm{error}} = P(X\neq\hat{X})$ denote the probability of error. Noting that $h_2(p) \leq \log 2$ for any p ∈ [0, 1] (recall inequality (2.1.2), that is, that uniform random variables maximize entropy), we have
$$\log 2 + P_{\mathrm{error}}\log(|\mathcal{X}|) \geq h_2(P_{\mathrm{error}}) + P_{\mathrm{error}}\log(|\mathcal{X}|-1) \stackrel{(i)}{\geq} H(X\mid\hat{X}) \stackrel{(ii)}{=} H(X) - I(X;\hat{X}).$$
Here step (i) uses Proposition 2.19 and step (ii) uses the definition of mutual information, that is, $I(X;\hat{X}) = H(X) - H(X\mid\hat{X})$. The data processing inequality implies that $I(X;\hat{X}) \leq I(X;Y)$, and using $H(X) = \log(|\mathcal{X}|)$ completes the proof.
In particular, Corollary 2.20 shows that when X is chosen uniformly at random and we observe
Y , we have
$$\inf_\Psi P(\Psi(Y)\neq X) \geq 1 - \frac{I(X;Y) + \log 2}{\log|\mathcal{X}|},$$
where the infimum is taken over all testing procedures Ψ. Some interpretation of this quantity is helpful. If we think roughly of the number of bits it takes to describe a variable X uniformly chosen from $\mathcal{X}$, then we expect that $\log_2|\mathcal{X}|$ bits are necessary (and sufficient). Thus, until we collect enough information that $I(X;Y)\approx\log|\mathcal{X}|$, so that $I(X;Y)/\log|\mathcal{X}|\approx 1$, we are unable to identify the variable X with any substantial probability. So we must collect enough bits to actually discover X.
Example 2.21 (20 questions game): In the 20 questions game—a standard children’s game—
there are two players, the “chooser” and the “guesser,” and an agreed upon universe X . The
chooser picks an element x ∈ X , and the guesser’s goal is to find x by using a series of yes/no
questions about x. We consider optimal strategies for each player in this game, assuming that
X is finite and letting m = |X | be the universe size for shorthand.
For the guesser, it is clear that at most $\lceil\log_2 m\rceil$ questions suffice to find the item X that the chooser has picked—at each round of the game, the guesser asks a question that eliminates half of the remaining possible items. Indeed, let us assume that $m = 2^l$ for some $l\in\mathbb{N}$; if not, the guesser can always make her task more difficult by increasing the size of X until it is a power of 2. Thus, after k rounds, there are $m2^{-k}$ items left, and we have
$$m\left(\frac{1}{2}\right)^k \leq 1 \quad\text{if and only if}\quad k \geq \log_2 m.$$
Conversely, suppose that the chooser picks X uniformly at random from $\mathcal{X}$, and let $Y_1,\ldots,Y_k$ denote the answers to the guesser's first k questions. Then Corollary 2.20 implies that any guess $\hat{X}$ based on these answers satisfies
$$P(\hat{X}\neq X) \geq 1 - \frac{I(X;Y_1,\ldots,Y_k) + \log 2}{\log m}.$$
By the chain rule for mutual information, we have
$$I(X;Y_1,\ldots,Y_k) = \sum_{i=1}^k I(X;Y_i\mid Y_{1:i-1}) = \sum_{i=1}^k\left[H(Y_i\mid Y_{1:i-1}) - H(Y_i\mid Y_{1:i-1}, X)\right] \leq \sum_{i=1}^k H(Y_i).$$
As the answers Yi are yes/no, we have H(Yi ) ≤ log 2, so that I(X; Y1:k ) ≤ k log 2. Thus we
find
$$P(\hat{X}\neq X) \geq 1 - \frac{(k+1)\log 2}{\log m} = \frac{\log_2 m - 1 - k}{\log_2 m},$$
so that the guesser must have $k \geq \log_2(m/2)$ to be guaranteed that she will make no mistakes. ♦
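The guesser's halving strategy is just binary search; here is a tiny sketch confirming that $\lceil\log_2 m\rceil$ questions always suffice (m = 1000 is an arbitrary example):

```python
import math

def guess(x, m):
    """Binary search for x in {0, ..., m-1}; returns (guess, #questions)."""
    lo, hi, questions = 0, m, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        questions += 1                    # ask: "is x >= mid?"
        lo, hi = (mid, hi) if x >= mid else (lo, mid)
    return lo, questions

m = 1000
worst = max(guess(x, m)[1] for x in range(m))
print(worst, math.ceil(math.log2(m)))     # both are 10
```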
But of course, we have $d_{\mathrm{hel}}(P,Q)^2 = 2 - 2\int\sqrt{p(x)q(x)}\,d\mu(x)$, so this implies
$$\int|p(x)-q(x)|\,d\mu(x) \leq d_{\mathrm{hel}}(P,Q)\left(4 - d_{\mathrm{hel}}(P,Q)^2\right)^{1/2}.$$
Dividing both sides by 2 gives the upper bound on $\|P-Q\|_{\mathrm{TV}}$. For the lower bound on total variation, note that for any $a, b\in\mathbb{R}_+$, we have $a + b - 2\sqrt{ab} \leq |a-b|$ (check the cases a > b and a < b separately); thus
$$d_{\mathrm{hel}}(P,Q)^2 = \int\left[p(x) + q(x) - 2\sqrt{p(x)q(x)}\right]d\mu(x) \leq \int|p(x)-q(x)|\,d\mu(x).$$
For part (b) we present a proof based on the Cauchy-Schwarz inequality, which differs from standard arguments [48, 136]. Using the definition (2.2.3) (or (2.2.1)), we may assume without loss of generality that P and Q are finitely supported, say with p.m.f.s $p_1,\ldots,p_m$ and $q_1,\ldots,q_m$. Define the function $h(p) = \sum_{i=1}^m p_i\log p_i$. Then showing that $D_{\mathrm{kl}}(P\|Q) \geq 2\|P-Q\|_{\mathrm{TV}}^2 = \frac{1}{2}\|p-q\|_1^2$ is equivalent to showing that
$$h(p) - h(q) - \langle\nabla h(q),\,p-q\rangle \geq \frac{1}{2}\|p-q\|_1^2. \tag{2.4.1}$$
By Taylor’s theorem, there is some p̃ = (1 − t)p + tq, where t ∈ [0, 1], such that
1
h(p) = h(q) + h∇h(q), p − qi + hp − q, ∇2 h(p̃)(p − q)i.
2
But looking at the final quadratic, we have for any vector v and any $\tilde{p}\geq 0$ satisfying $\sum_i\tilde{p}_i = 1$,
$$\langle v, \nabla^2 h(\tilde{p}) v\rangle = \sum_{i=1}^m\frac{v_i^2}{\tilde{p}_i} = \|\tilde{p}\|_1\sum_{i=1}^m\frac{v_i^2}{\tilde{p}_i} \geq \left(\sum_{i=1}^m\sqrt{\tilde{p}_i}\,\frac{|v_i|}{\sqrt{\tilde{p}_i}}\right)^2 = \|v\|_1^2,$$
where the inequality follows from Cauchy-Schwarz applied to the vectors $[\sqrt{\tilde{p}_i}]_i$ and $[|v_i|/\sqrt{\tilde{p}_i}]_i$. Thus inequality (2.4.1) holds.
2.5 Bibliography
The material in this section of the lecture notes is more or less standard. For all of our treatment of
mutual information, entropy, and KL-divergence in the discrete case, Cover and Thomas provide an
essentially complete treatment in Chapter 2 of their book [48]. Gray [77] provides a more advanced
(measure-theoretic) version of these results, with Chapter 5 covering most of our results (or Chapter 7 in the newer edition of the same book).
The f -divergence was independently discovered by Ali and Silvey [4] and Csiszár [49], and is
consequently sometimes called an Ali-Silvey divergence or Csiszár divergence. Liese and Vajda [108]
provide a survey of f -divergences and their relationships with different statistical concepts (taking
a Bayesian point of view), and various authors have extended the pairwise divergence measures
to divergence measures between multiple distributions [81], making connections to experimental
design and classification [73, 61], which we investigate later in the lectures. For a proof that
equality (2.2.4) is equivalent to the definition (2.2.3) with the appropriate closure operations, see
the paper [61, Proposition 1].
2.6 Exercises
Our first few questions investigate properties of a divergence between distributions that is weaker
than the KL-divergence, but is intimately related to optimal testing. Let P1 and P2 be arbitrary
distributions on a space X. The total variation distance between $P_1$ and $P_2$ is defined as
$$\|P_1 - P_2\|_{\mathrm{TV}} := \sup_{A\subset\mathcal{X}}|P_1(A) - P_2(A)|.$$
Question 2.1: Prove the following identities about total variation. Throughout, let P1 and P2
have densities p1 and p2 on a (common) set X .
(a) $2\|P_1-P_2\|_{\mathrm{TV}} = \int|p_1(x)-p_2(x)|\,dx$.

(b) For functions $f : \mathcal{X}\to\mathbb{R}$, define the supremum norm $\|f\|_\infty = \sup_{x\in\mathcal{X}}|f(x)|$. Show that $2\|P_1-P_2\|_{\mathrm{TV}} = \sup_{\|f\|_\infty\leq 1}\int_{\mathcal{X}} f(x)(p_1(x)-p_2(x))\,dx$.

(c) $\|P_1-P_2\|_{\mathrm{TV}} = \int\max\{p_1(x),p_2(x)\}\,dx - 1$.

(d) $\|P_1-P_2\|_{\mathrm{TV}} = 1 - \int\min\{p_1(x),p_2(x)\}\,dx$.
Question 2.2 (Divergence between multivariate normal distributions): Let $P_1$ be $\mathsf{N}(\theta_1,\Sigma)$ and $P_2$ be $\mathsf{N}(\theta_2,\Sigma)$, where $\Sigma\succ 0$ is a positive definite matrix. What is $D_{\mathrm{kl}}(P_1\|P_2)$?
Question 2.3 (The optimal test between distributions): Prove Le-Cam’s inequality: for any
function ψ with dom ψ ⊃ X and any distributions $P_1, P_2$,
$$P_1(\psi(X)\neq 1) + P_2(\psi(X)\neq 2) \geq 1 - \|P_1 - P_2\|_{\mathrm{TV}}.$$
Thus, the sum of the probabilities of error in a hypothesis testing problem, where based on a sample X we must decide whether $P_1$ or $P_2$ is more likely, has value at least $1 - \|P_1 - P_2\|_{\mathrm{TV}}$. Given $P_1$ and $P_2$, is this risk attainable?
Question 2.4: A random variable X has Laplace(λ, µ) distribution if it has density $p(x) = \frac{\lambda}{2}\exp(-\lambda|x-\mu|)$. Consider the hypothesis test of $P_1$ versus $P_2$, where X has distribution Laplace(λ, µ₁) under $P_1$ and distribution Laplace(λ, µ₂) under $P_2$, where µ₁ < µ₂. Show that the minimal value over all tests ψ of $P_1$ versus $P_2$ is
$$\inf_\psi\left\{P_1(\psi(X)\neq 1) + P_2(\psi(X)\neq 2)\right\} = \exp\left(-\frac{\lambda}{2}|\mu_1-\mu_2|\right).$$
Question 2.6: Given quantizers $g_1$ and $g_2$, we say that $g_1$ is a finer quantizer than $g_2$ under the following condition: assume that $g_1$ induces the partition $A_1,\ldots,A_n$ and $g_2$ induces the partition $B_1,\ldots,B_m$; then for each of the sets $B_i$, there exist some k and sets $A_{i_1},\ldots,A_{i_k}$ such that $B_i = \cup_{j=1}^k A_{i_j}$. We let $g_1\prec g_2$ denote that $g_1$ is a finer quantizer than $g_2$. Prove that
$$D_{\mathrm{kl}}(P\|Q\mid g_2) \leq D_{\mathrm{kl}}(P\|Q\mid g_1).$$
Question 2.7 (f -divergences generalize standard divergences): Show the following properties of
f -divergences:
(c) If f (t) = t log t − log t, then Df (P ||Q) = Dkl (P ||Q) + Dkl (Q||P ).
(d) For any convex f satisfying f (1) = 0, Df (P ||Q) ≥ 0. (Hint: use Jensen’s inequality.)
(b) Generalizing the preceding result, let a : X → R+ and b : X → R+ , and let µ be a finite
measure on X . Show that
$$\left(\int a(x)\,d\mu(x)\right) f\!\left(\frac{\int b(x)\,d\mu(x)}{\int a(x)\,d\mu(x)}\right) \leq \int a(x)\,f\!\left(\frac{b(x)}{a(x)}\right)d\mu(x).$$
If you are unfamiliar with measure theory, prove the following essentially equivalent result: let $u : \mathcal{X}\to\mathbb{R}_+$ satisfy $\int u(x)\,dx < \infty$. Show that
$$\left(\int a(x)u(x)\,dx\right) f\!\left(\frac{\int b(x)u(x)\,dx}{\int a(x)u(x)\,dx}\right) \leq \int a(x)\,f\!\left(\frac{b(x)}{a(x)}\right)u(x)\,dx.$$
(Hint: use the fact that the perspective of a function f , defined by h(x, t) = tf (x/t) for t > 0, is
jointly convex in x and t [e.g. 33, Chapter 3.2.6].)
Question 2.9 (Data processing and f -divergences I): As with the KL-divergence, given a quan-
tizer g of the set X , where g induces a partition A1 , . . . , Am of X , we define the f -divergence
between P and Q conditioned on g as
$$D_f(P\|Q\mid g) := \sum_{i=1}^m Q(A_i)\,f\!\left(\frac{P(A_i)}{Q(A_i)}\right) = \sum_{i=1}^m Q(g^{-1}(\{i\}))\,f\!\left(\frac{P(g^{-1}(\{i\}))}{Q(g^{-1}(\{i\}))}\right).$$
Given quantizers g1 and g2 , we say that g1 is a finer quantizer than g2 under the following condition:
assume that $g_1$ induces the partition $A_1,\ldots,A_n$ and $g_2$ induces the partition $B_1,\ldots,B_m$; then for each of the sets $B_i$, there exist some k and sets $A_{i_1},\ldots,A_{i_k}$ such that $B_i = \cup_{j=1}^k A_{i_j}$. We let
g1 ≺ g2 denote that g1 is a finer quantizer than g2 .
(a) Let g1 and g2 be quantizers of the set X , and let g1 ≺ g2 , meaning that g1 is a finer quantization
than g2 . Prove that
Df (P ||Q | g2 ) ≤ Df (P ||Q | g1 ) .
Equivalently, show that whenever A and B are collections of sets partitioning X , but A is a
finer partition of X than B, that
$$\sum_{B\in\mathcal{B}} Q(B)\,f\!\left(\frac{P(B)}{Q(B)}\right) \leq \sum_{A\in\mathcal{A}} Q(A)\,f\!\left(\frac{P(A)}{Q(A)}\right),$$
where on the left we are using the partition definition (2.2.3); you should show that the partition
into discrete parts of X achieves the supremum. You may assume that X is finite. (Though
feel free to prove the result in the case that X is infinite.)
Question 2.10 (General data processing inequalities): Let f be a convex function satisfying
f (1) = 0. Let K be a Markov transition kernel from X to Z, that is, K(·, x) is a probability
distribution on Z for each x ∈ X . (Written differently, we have X → Z, and conditioned on X = x,
Z has distribution K(·, x), so that K(A, x) is the probability that Z ∈ A given X = x.)
(a) Define the marginals $K_P(A) = \int K(A, x)\,p(x)\,dx$ and $K_Q(A) = \int K(A, x)\,q(x)\,dx$. Show that
$$D_f(K_P\,\|\,K_Q) \leq D_f(P\|Q).$$
Hint: by equation (2.2.3), w.l.o.g. we may assume that Z is finite and Z = {1, . . . , m}; also
recall Question 2.8.
(b) Let X and Y be random variables with joint distribution PXY and marginals PX and PY .
Define the f -information between X and Y as
If (X; Y ) := Df (PXY ||PX × PY ) .
Use part (a) to show the following general data processing inequality: if we have the Markov
chain X → Y → Z, then
If (X; Z) ≤ If (X; Y ).
Question 2.11 (Convexity of f -divergences): Prove Proposition 2.13. Hint: Use Question 2.8.
Question 2.12 (Variational forms of KL divergence): Let P and Q be arbitrary distributions on a
common space X . Prove the following variational representation, known as the Donsker-Varadhan
theorem, of the KL divergence:
$$D_{\mathrm{kl}}(P\|Q) = \sup_{f : \mathbb{E}_Q[e^{f(X)}] < \infty}\left\{\mathbb{E}_P[f(X)] - \log\mathbb{E}_Q[\exp(f(X))]\right\}.$$
Question 2.14: Let $P_1$ be $\mathsf{N}(\theta_1,\Sigma_1)$ and $P_2$ be $\mathsf{N}(\theta_2,\Sigma_2)$, where $\Sigma_i\succ 0$ are positive definite matrices. Give $D_{\mathrm{kl}}(P_1\|P_2)$.
Question 2.15: Let {Pv }v∈V be an arbitrary collection of distributions on a space X and µ be a
probability measure on V. Show that if V ∼ µ and conditional on V = v, we draw X ∼ Pv , then
(a) $I(X;V) = \int D_{\mathrm{kl}}(P_v\,\|\,\bar{P})\,d\mu(v)$, where $\bar{P} = \int P_v\,d\mu(v)$ is the (weighted) average of the $P_v$. You may assume that V is discrete if you like.

(b) For any distribution Q on X, $I(X;V) = \int D_{\mathrm{kl}}(P_v\|Q)\,d\mu(v) - D_{\mathrm{kl}}(\bar{P}\|Q)$. Conclude that $I(X;V) \leq \int D_{\mathrm{kl}}(P_v\|Q)\,d\mu(v)$, or, equivalently, that $\bar{P}$ minimizes $\int D_{\mathrm{kl}}(P_v\|Q)\,d\mu(v)$ over all probabilities Q.
Question 2.16 (The triangle inequality for variation distance): Let P and Q be distributions on $X_1^n = (X_1,\ldots,X_n)\in\mathcal{X}^n$, and let $P_i(\cdot\mid x_1^{i-1})$ be the conditional distribution of $X_i$ given $X_1^{i-1} = x_1^{i-1}$ (and similarly for $Q_i$). Show that
$$\|P-Q\|_{\mathrm{TV}} \leq \sum_{i=1}^n\mathbb{E}_P\left[\big\|P_i(\cdot\mid X_1^{i-1}) - Q_i(\cdot\mid X_1^{i-1})\big\|_{\mathrm{TV}}\right].$$
Part I
Chapter 3
Concentration Inequalities
In many scenarios, it is useful to understand how a random variable X behaves by giving bounds
on the probability that it deviates far from its mean or median. This allows us to prove that estimation and learning procedures achieve a given performance, and that different decoding and encoding schemes work with high probability, among other results. In this chapter, we give several
tools for proving bounds on the probability that random variables are far from their typical values.
We conclude the section with a discussion of basic uniform laws of large numbers and applications
to empirical risk minimization and statistical learning, though we focus on the relatively simple
cases we can treat with our tools.
3.1 Basic tail inequalities

The most basic tail question is: given a random variable X, how can we bound the probability
$$P(X \geq t)?$$
We begin with the three most classical inequalities for this purpose: the Markov, Chebyshev, and Chernoff bounds, which are all instances of the same technique.
The basic inequality off of which all else builds is Markov’s inequality.
Proposition 3.1 (Markov’s inequality). Let X be a nonnegative random variable, meaning that
X ≥ 0 with probability 1. Then
$$P(X\geq t) \leq \frac{\mathbb{E}[X]}{t}.$$
Proof For any random variable, P(X ≥ t) = E[1 {X ≥ t}] ≤ E[(X/t)1 {X ≥ t}] ≤ E[X]/t, as
X/t ≥ 1 whenever X ≥ t.
When we know more about a random variable than that its expectation is finite, we can give
somewhat more powerful bounds on the probability that the random variable deviates from its
typical values. The first step in this direction, Chebyshev’s inequality, requires two moments, and
when we have exponential moments, we can give even stronger results. As we shall see, each of
these results is but an application of Proposition 3.1.
Proposition 3.2 (Chebyshev’s inequality). Let X be a random variable with Var(X) < ∞. Then
$$P(X-\mathbb{E}[X]\geq t) \leq \frac{\mathrm{Var}(X)}{t^2} \quad\text{and}\quad P(X-\mathbb{E}[X]\leq -t) \leq \frac{\mathrm{Var}(X)}{t^2}$$
for all t ≥ 0.
Proof We prove only the upper tail result, as the lower tail is identical. We first note that
X − E[X] ≥ t implies that (X − E[X])2 ≥ t2 . But of course, the random variable Z = (X − E[X])2
is nonnegative, so Markov’s inequality gives P(X − E[X] ≥ t) ≤ P(Z ≥ t2 ) ≤ E[Z]/t2 , and
E[Z] = E[(X − E[X])2 ] = Var(X).
In particular, taking the infimum over all λ ≥ 0 in Proposition 3.3 gives the more standard Chernoff (large deviation) bound
$$P(X\geq t) \leq \exp\left(\inf_{\lambda\geq 0}\left\{\log\varphi_X(\lambda) - \lambda t\right\}\right).$$
Example 3.4 (Gaussian random variables): When X is a mean-zero Gaussian variable with variance σ², we have
$$\varphi_X(\lambda) = \mathbb{E}[\exp(\lambda X)] = \exp\left(\frac{\lambda^2\sigma^2}{2}\right). \tag{3.1.1}$$
To see this, we compute the integral; we have
$$\mathbb{E}[\exp(\lambda X)] = \int_{-\infty}^\infty\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\lambda x - \frac{1}{2\sigma^2}x^2\right)dx = e^{\frac{\lambda^2\sigma^2}{2}}\underbrace{\int_{-\infty}^\infty\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x-\lambda\sigma^2)^2\right)dx}_{=1},$$
because this is simply the integral of the Gaussian density.
As a consequence of the equality (3.1.1) and the Chernoff bound technique (Proposition 3.3), we see that for X Gaussian with variance σ², we have
$$P(X\geq\mathbb{E}[X]+t) \leq \exp\left(-\frac{t^2}{2\sigma^2}\right) \quad\text{and}\quad P(X\leq\mathbb{E}[X]-t) \leq \exp\left(-\frac{t^2}{2\sigma^2}\right)$$
for all t ≥ 0. Indeed, we have $\log\varphi_{X-\mathbb{E}[X]}(\lambda) = \frac{\lambda^2\sigma^2}{2}$, and $\inf_\lambda\{\frac{\lambda^2\sigma^2}{2} - \lambda t\} = -\frac{t^2}{2\sigma^2}$, which is attained by $\lambda = \frac{t}{\sigma^2}$. ♦
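To see how the three bounds of this section compare in practice, the sketch below evaluates them for X ∼ N(0, 1); Markov's inequality is applied to |X|, since X itself is not nonnegative, and scipy supplies the exact tail.

```python
import numpy as np
from scipy.stats import norm

sigma = 1.0
for t in [1.0, 2.0, 3.0]:
    exact = norm.sf(t)                         # true P(X >= t)
    markov = np.sqrt(2 / np.pi) * sigma / t    # E|X| / t, Markov applied to |X|
    chebyshev = sigma**2 / t**2                # Var(X) / t^2
    chernoff = np.exp(-t**2 / (2 * sigma**2))  # sub-Gaussian Chernoff bound
    print(t, exact, markov, chebyshev, chernoff)
```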
Of course, Gaussian random variables satisfy Definition 3.1 with equality. This would be un-
interesting if only Gaussian random variables satisfied this property; happily, that is not the case,
and we detail several examples.
Example 3.5 (Random signs (Rademacher variables)): The random variable X taking values
{−1, 1} with equal probability is 1-sub-Gaussian. Indeed, we have
$$\mathbb{E}[\exp(\lambda X)] = \frac{1}{2}e^{\lambda} + \frac{1}{2}e^{-\lambda} = \frac{1}{2}\sum_{k=0}^\infty\frac{\lambda^k}{k!} + \frac{1}{2}\sum_{k=0}^\infty\frac{(-\lambda)^k}{k!} = \sum_{k=0}^\infty\frac{\lambda^{2k}}{(2k)!} \leq \sum_{k=0}^\infty\frac{(\lambda^2)^k}{2^k k!} = \exp\left(\frac{\lambda^2}{2}\right),$$
as claimed, where the inequality uses that $(2k)! \geq 2^k k!$. ♦
Bounded random variables are also sub-Gaussian; indeed, we have the following example.
Example 3.6 (Bounded random variables): Suppose that X is bounded, say X ∈ [a, b]. Then Hoeffding's lemma states that
$$\mathbb{E}[e^{\lambda(X-\mathbb{E}[X])}] \leq \exp\left(\frac{\lambda^2(b-a)^2}{8}\right) \quad\text{for all }\lambda\in\mathbb{R},$$
so that X is $\frac{(b-a)^2}{4}$-sub-Gaussian. We prove a weaker statement here, with the constant 1/8 replaced by 1/2, via a symmetrization argument. Let X′ be an independent copy of X and ε an independent Rademacher sign. Then
$$\mathbb{E}\left[\exp(\lambda(X-\mathbb{E}[X]))\right] \leq \mathbb{E}\left[\exp(\lambda(X-X'))\right] = \mathbb{E}\left[\exp(\lambda\varepsilon(X-X'))\right],$$
where the inequality follows from Jensen's inequality and the last equality is a consequence of the fact that X − X′ is symmetric about 0. Using the result of Example 3.5,
$$\mathbb{E}\left[\exp(\lambda\varepsilon(X-X'))\right] \leq \mathbb{E}\left[\exp\left(\frac{\lambda^2(X-X')^2}{2}\right)\right] \leq \exp\left(\frac{\lambda^2(b-a)^2}{2}\right),$$
so that X is $(b-a)^2$-sub-Gaussian. ♦
Chernoff bounds for sub-Gaussian random variables are immediate; indeed, they have the same
concentration properties as Gaussian random variables, a consequence of the nice analytical prop-
erties of their moment generating functions (that their logarithms are at most quadratic). Thus,
using the technique of Example 3.4, we obtain the following proposition.
Proposition 3.7. Let X be σ²-sub-Gaussian. Then for all t ≥ 0,
$$P(X-\mathbb{E}[X]\geq t) \vee P(X-\mathbb{E}[X]\leq -t) \leq \exp\left(-\frac{t^2}{2\sigma^2}\right).$$
Chernoff bounds extend naturally to sums of independent random variables, because moment
generating functions of sums of independent random variables become products of moment gener-
ating functions.
Proposition 3.8. Let $X_i$ be independent $\sigma_i^2$-sub-Gaussian random variables. Then $\sum_{i=1}^n X_i$ is $\sum_{i=1}^n\sigma_i^2$-sub-Gaussian.

Proof We assume w.l.o.g. that the $X_i$ are mean zero. By independence and sub-Gaussianity, we have
$$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n X_i\right)\right] = \mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^{n-1}X_i\right)\right]\mathbb{E}[\exp(\lambda X_n)] \leq \exp\left(\frac{\lambda^2\sigma_n^2}{2}\right)\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^{n-1}X_i\right)\right],$$
and applying this bound recursively gives the result.
Two immediate corollaries to Propositions 3.7 and 3.8 show that sums of sub-Gaussian random
variables concentrate around their expectations. We begin with a general concentration inequality.
Corollary 3.9. Let $X_i$ be independent $\sigma_i^2$-sub-Gaussian random variables. Then for all t ≥ 0,
$$\max\left\{P\left(\sum_{i=1}^n(X_i-\mathbb{E}[X_i])\geq t\right),\,P\left(\sum_{i=1}^n(X_i-\mathbb{E}[X_i])\leq -t\right)\right\} \leq \exp\left(-\frac{t^2}{2\sum_{i=1}^n\sigma_i^2}\right).$$
Additionally, the classical Hoeffding bound follows when we couple Example 3.6 with Corollary 3.9: if $X_i\in[a_i,b_i]$, then
$$P\left(\sum_{i=1}^n(X_i-\mathbb{E}[X_i])\geq t\right) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^n(b_i-a_i)^2}\right).$$
To give another interpretation of these inequalities, let us assume that the $X_i$ are independent and σ²-sub-Gaussian. Then we have that
$$P\left(\frac{1}{n}\sum_{i=1}^n(X_i-\mathbb{E}[X_i])\geq t\right) \leq \exp\left(-\frac{nt^2}{2\sigma^2}\right),$$
or, for δ ∈ (0, 1), setting $\exp(-\frac{nt^2}{2\sigma^2}) = \delta$, i.e. $t = \sqrt{\frac{2\sigma^2\log\frac{1}{\delta}}{n}}$, we have that
$$\frac{1}{n}\sum_{i=1}^n(X_i-\mathbb{E}[X_i]) \leq \sqrt{\frac{2\sigma^2\log\frac{1}{\delta}}{n}} \quad\text{with probability at least } 1-\delta.$$
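An empirical check of this high-probability bound for Rademacher signs (which are 1-sub-Gaussian by Example 3.5); the sample sizes here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
n, delta, trials = 1000, 0.05, 20_000
t = np.sqrt(2 * np.log(1 / delta) / n)     # deviation level for sigma = 1
X = rng.choice([-1, 1], size=(trials, n))  # Rademacher samples
freq = np.mean(X.mean(axis=1) >= t)        # empirical exceedance rate
print(freq, "<=", delta)                   # holds comfortably
```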
There are a variety of other conditions equivalent to sub-Gaussianity, which we relate by defining
the sub-Gaussian norm of a random variable. In particular, we define the sub-Gaussian norm
(sometimes known as the ψ₂-Orlicz norm in the literature) as
$$\|X\|_{\psi_2} := \sup_{k\geq 1}\frac{1}{\sqrt{k}}\,\mathbb{E}[|X|^k]^{1/k}. \tag{3.1.2}$$
Theorem 3.10. Let X be a mean-zero random variable and σ 2 ≥ 0 be a constant. The following
statements are all equivalent, meaning that there are numerical constant factors Kj such that if one
statement (i) holds with parameter Ki , then statement (j) holds with parameter Kj ≤ CKi , where
C is a numerical constant.
(1) Sub-Gaussian tails: $P(|X|\geq t) \leq 2\exp(-\frac{t^2}{K_1\sigma^2})$ for all t ≥ 0.

(2) Sub-Gaussian moments: $\mathbb{E}[|X|^k]^{1/k} \leq K_2\sigma\sqrt{k}$ for all k.
Particularly, (1) implies (2) with $K_1 = 1$ and $K_2 \leq e^{1/e}$; (2) implies (3) with $K_2 = 1$ and $K_3 = e\sqrt{\frac{2}{e-1}} < 3$; (3) implies (4) with $K_3 = 1$ and $K_4 \leq \frac{4}{3}$; and (4) implies (1) with $K_4 = \frac{1}{2}$ and $K_1 \leq 2$.
This result is standard in the literature on concentration and random variables; our proof is based
on Vershynin [138]. See Appendix 3.4.1 for a proof of this theorem. We note in passing that in
each of the statements of Theorem 3.10, we may take σ = kXkψ2 , and (in general) these are the
sharpest possible results except for numerical constants.
For completeness, we can give a tighter result than part (3) of the preceding theorem, giving a
concrete upper bound on squares of sub-Gaussian random variables. The technique used in the ex-
ample, to introduce an independent random variable for auxiliary randomization, is a common and
useful technique in probabilistic arguments (similar to our use of symmetrization in Example 3.6).
To see this result, we focus on the Gaussian case first and assume (for this case) without loss of generality (by scaling) that σ² = 1. Assuming that λ < 1/2, we have
$$\mathbb{E}[\exp(\lambda Z^2)] = \int\frac{1}{\sqrt{2\pi}}e^{-(\frac{1}{2}-\lambda)z^2}\,dz = \frac{1}{\sqrt{2\pi}}\cdot\frac{\sqrt{2\pi}}{\sqrt{1-2\lambda}} = \frac{1}{\sqrt{1-2\lambda}},$$
the final equality a consequence of the fact that (as we know for normal random variables) $\int e^{-\frac{z^2}{2\sigma^2}}\,dz = \sqrt{2\pi\sigma^2}$. When λ ≥ 1/2, the above integrals are all infinite, giving the equality in expression (3.1.3).
For the more general inequality, we recall that if Z is an independent N(0, 1) random variable, then $\mathbb{E}[\exp(tZ)] = \exp(\frac{t^2}{2})$, and so
$$\mathbb{E}[\exp(\lambda X^2)] = \mathbb{E}[\exp(\sqrt{2\lambda}\,XZ)] \stackrel{(i)}{\leq} \mathbb{E}[\exp(\lambda\sigma^2 Z^2)] \stackrel{(ii)}{=} \frac{1}{[1-2\sigma^2\lambda]_+^{1/2}},$$
where inequality (i) follows because X is sub-Gaussian, and equality (ii) holds because Z ∼ N(0, 1). ♦
Definition 3.2. A random variable X is sub-exponential with parameters (τ 2 , b) if for all λ such
that |λ| ≤ 1/b,
$$\mathbb{E}[e^{\lambda(X-\mathbb{E}[X])}] \leq \exp\left(\frac{\lambda^2\tau^2}{2}\right).$$
where inequality (i) holds for $\lambda \leq \frac{1}{4}$, because $-\log(1-2\lambda) \leq 2\lambda + 4\lambda^2$ for $\lambda \leq \frac{1}{4}$. ♦
As a second example, we can show that bounded random variables are sub-exponential. It is
clear that this is the case as they are also sub-Gaussian; however, in many cases, it is possible to
show that their parameters yield much tighter control over deviations than is possible using only
sub-Gaussian techniques.
Example 3.13 (Bounded random variables are sub-exponential): Suppose that X is a mean-zero random variable taking values in [−b, b] with variance σ² = E[X²] (note that we are guaranteed that σ² ≤ b² in this case). We claim that
E[exp(λX)] ≤ exp(3λ²σ²/5) for |λ| ≤ 1/(2b).    (3.1.5)
To see this, note first that for k ≥ 2 we have E[|X|^k] ≤ E[X² b^{k−2}] = σ²b^{k−2}. Then by an expansion of the exponential, we find
E[exp(λX)] = 1 + E[λX] + λ²E[X²]/2 + Σ_{k=3}^∞ λ^k E[X^k]/k! ≤ 1 + λ²σ²/2 + Σ_{k=3}^∞ λ^k σ² b^{k−2}/k!
= 1 + λ²σ²/2 + λ²σ² Σ_{k=1}^∞ (λb)^k/(k + 2)! ≤(i) 1 + λ²σ²/2 + λ²σ²/10,
inequality (i) holding for λ ≤ 1/(2b). Using that 1 + x ≤ e^x gives the result.
It is possible to give a slightly tighter result for λ ≥ 0. In this case, we have the bound
E[exp(λX)] ≤ 1 + λ²σ²/2 + λ²σ² Σ_{k=3}^∞ λ^{k−2}b^{k−2}/k! = 1 + (σ²/b²)(e^{λb} − 1 − λb).
Then using that 1 + x ≤ e^x, we obtain Bennett's moment generating inequality, which is that
E[e^{λX}] ≤ exp((σ²/b²)(e^{λb} − 1 − λb)) for λ ≥ 0.    (3.1.6)
Inequality (3.1.6) always holds, and for λb near 0, we have e^{λb} − 1 − λb ≈ λ²b²/2. 3
In particular, if the variance σ² ≪ b², the absolute bound on X, inequality (3.1.5) gives much tighter control on the moment generating function of X than typical sub-Gaussian bounds based only on the fact that X ∈ [−b, b] allow.
We can give a broader characterization, as with sub-Gaussian random variables in Theorem 3.10. First, we define the sub-exponential norm (in the literature, there is an equivalent norm often called the Orlicz ψ1-norm)
‖X‖ψ1 := sup_{k≥1} k^{−1} E[|X|^k]^{1/k}.
For any sub-Gaussian random variable—whether it has mean zero or not—we have that sub-exponential is sub-Gaussian squared:
‖X‖²ψ2 ≤ ‖X²‖ψ1 ≤ 2‖X‖²ψ2,    (3.1.7)
which is immediate from the definitions. More broadly, we can show a result similar to Theorem 3.10.
Theorem 3.14. Let X be a random variable and σ ≥ 0. Then—in the sense of Theorem 3.10—the following statements are all equivalent for suitable numerical constants K1, . . . , K4.
(1) Sub-exponential tails: P(|X| ≥ t) ≤ 2 exp(−t/(K1σ)) for all t ≥ 0.
(2) Sub-exponential moments: E[|X|^k]^{1/k} ≤ K2σk for all k.
(3) Existence of the moment generating function: E[exp(X/(K3σ))] ≤ e.
(4) If, in addition, E[X] = 0, then E[exp(λX)] ≤ exp(K4λ²σ²) for all |λ| ≤ K4′/σ.
In particular, if (2) holds with K2 = 1, then (4) holds with K4 = 2e² and K4′ = 1/(2e).
The proof, which is similar to that for Theorem 3.10, is presented in Section 3.4.2.
While the concentration properties of sub-exponential random variables are not quite so nice as
those for sub-Gaussian random variables (recall Hoeffding’s inequality, Corollary 3.9), we can give
sharp tail bounds for sub-exponential random variables. We first give a simple bound on deviation
probabilities.
Proposition 3.15. Let X be a mean-zero (τ², b)-sub-exponential random variable. Then for all t ≥ 0,
P(X ≥ t) ∨ P(X ≤ −t) ≤ exp(−(1/2) min{t²/τ², t/b}).
Proof The proof is an application of the Chernoff bound technique; we prove only the upper tail, as the lower tail is similar. We have
P(X ≥ t) ≤ E[e^{λX}]/e^{λt} ≤(i) exp(λ²τ²/2 − λt),
inequality (i) holding for |λ| ≤ 1/b. To minimize the last term in λ, we take λ = min{t/τ², 1/b}, which gives the result.
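The bound of Proposition 3.15 is simple enough to transcribe directly; the snippet below is a minimal illustration (a direct evaluation of the displayed bound, with hypothetical parameter values), highlighting the Gaussian-like regime for small t and the exponential regime for large t.

```python
# Sub-exponential tail bound of Proposition 3.15 (illustrative transcription).
import math

def sub_exponential_tail(t: float, tau2: float, b: float) -> float:
    """Upper bound on P(X >= t) for a mean-zero (tau2, b)-sub-exponential X."""
    return math.exp(-0.5 * min(t * t / tau2, t / b))

# Gaussian-like decay for small t, only exponential decay for large t.
for t in [0.5, 1.0, 5.0, 20.0]:
    print(t, sub_exponential_tail(t, tau2=1.0, b=1.0))
```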
Comparing with sub-Gaussian random variables, which have b = 0, we see that Proposition 3.15 gives a similar result for small t—essentially the same concentration that sub-Gaussian random variables enjoy—while for large t, the tails decrease only exponentially in t.
We can also give a tensorization identity similar to Proposition 3.8.
Proposition 3.16. Let X1, . . . , Xn be independent mean-zero sub-exponential random variables, where Xi is (σi², bi)-sub-exponential. Then for any vector a ∈ Rⁿ, we have
E[exp(λ Σ_{i=1}^n ai Xi)] ≤ exp(λ² Σ_{i=1}^n ai²σi²/2) for |λ| ≤ 1/b∗,
where b∗ = max_i bi|ai|. That is, ⟨a, X⟩ is (Σ_{i=1}^n ai²σi², max_i bi|ai|)-sub-exponential.
Proof We apply an inductive technique similar to that used in the proof of Proposition 3.8. First, for any fixed i, we know that if |λ| ≤ 1/(bi|ai|), then |aiλ| ≤ 1/bi and so
E[exp(λai Xi)] ≤ exp(λ²ai²σi²/2).
Now, we inductively apply the preceding inequality, which applies so long as |λ| ≤ 1/(bi|ai|) for all i. We have
E[exp(λ Σ_{i=1}^n ai Xi)] = Π_{i=1}^n E[exp(λai Xi)] ≤ Π_{i=1}^n exp(λ²ai²σi²/2),
which is the desired result.

Combining Propositions 3.15 and 3.16 immediately yields the following corollary.

Corollary 3.17. Let the conditions of Proposition 3.16 hold and b∗ = max_i bi|ai|. Then for all t ≥ 0,
P(Σ_{i=1}^n ai Xi ≥ t) ∨ P(Σ_{i=1}^n ai Xi ≤ −t) ≤ exp(−(1/2) min{t²/Σ_{i=1}^n ai²σi², t/b∗}).
It is instructive to study the structure of the bound of Corollary 3.17. Notably, the bound is similar to the Hoeffding-type bound of Corollary 3.9 (holding for σ²-sub-Gaussian random variables) that
P(Σ_{i=1}^n ai Xi ≥ t) ≤ exp(−t²/(2‖a‖₂²σ²)),
so that for small t, Corollary 3.17 gives sub-Gaussian tail behavior. For large t, the bound is weaker.
However, in many cases, Corollary 3.17 can give finer control than naive sub-Gaussian bounds. Indeed, suppose that the random variables Xi are i.i.d., mean zero, and satisfy Xi ∈ [−b, b] with probability 1, but have variance σ² = E[Xi²] ≤ b² as in Example 3.13. Then Corollary 3.17 implies that
P(Σ_{i=1}^n ai Xi ≥ t) ≤ exp(−(1/2) min{(5/6) t²/(σ²‖a‖₂²), t/(2b‖a‖∞)}).    (3.1.8)
When applied to a standard mean (and with a minor simplification that 5/12 < 1/3) with ai = 1/n, we obtain the bound that (1/n) Σ_{i=1}^n Xi ≤ t with probability at least 1 − exp(−n min{t²/(3σ²), t/(4b)}). Written differently, we take t = max{σ√(3 log(1/δ)/n), 4b log(1/δ)/n} to obtain
(1/n) Σ_{i=1}^n Xi ≤ max{σ√(3 log(1/δ))/√n, 4b log(1/δ)/n} with probability at least 1 − δ.
The sharpest such bound possible via more naive Hoeffding-type bounds is b√(2 log(1/δ))/√n, which has substantially worse scaling whenever σ ≪ b.
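The following snippet makes the comparison numerically (an illustration with hypothetical values σ = 0.1 and b = 1, chosen so that the variance is much smaller than the bound).

```python
# Comparing the Bernstein-type and Hoeffding deviation levels above
# (illustrative): with sigma << b the former is far smaller.
import math

def bernstein_level(sigma, b, n, delta):
    return max(sigma * math.sqrt(3 * math.log(1 / delta) / n),
               4 * b * math.log(1 / delta) / n)

def hoeffding_level(b, n, delta):
    return b * math.sqrt(2 * math.log(1 / delta) / n)

sigma, b, n, delta = 0.1, 1.0, 10_000, 0.01
print("Bernstein-type:", bernstein_level(sigma, b, n, delta))
print("Hoeffding     :", hoeffding_level(b, n, delta))
```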
A related condition is the classical Bernstein condition: we say X satisfies the Bernstein condition with parameter b if
|E[(X − µ)^k]| ≤ (k!/2) σ² b^{k−2} for k = 3, 4, . . . ,    (3.1.9)
where µ = E[X] and σ² = Var(X) = E[X²] − µ². In this case, the following lemma controls the moment generating function of X. This result is essentially present in Theorem 3.14, but it provides somewhat tighter control with precise constants.
Lemma 3.18. Let X be a random variable satisfying the Bernstein condition (3.1.9). Then
E[e^{λ(X−µ)}] ≤ exp(λ²σ²/(2(1 − b|λ|))) for |λ| < 1/b.
Said differently, a random variable satisfying Condition (3.1.9) is (2σ², 2b)-sub-exponential.
Proof Without loss of generality we assume µ = 0. We expand the moment generating function by noting that
E[e^{λX}] = 1 + λ²σ²/2 + Σ_{k=3}^∞ λ^k E[X^k]/k! ≤(i) 1 + λ²σ²/2 + (λ²σ²/2) Σ_{k=3}^∞ |λb|^{k−2} = 1 + (λ²σ²/2) · 1/[1 − b|λ|]₊,
where inequality (i) used the Bernstein condition (3.1.9). Noting that 1 + x ≤ e^x gives the result.
As one final example, we return to Bennett’s inequality (3.1.6) from Example 3.13.
Proposition 3.19 (Bennett’s inequality). Let Xi be independent mean-zero random
Pn variables with
2 2 2
Var(Xi ) = σi and |Xi | ≤ b. Then for h(t) := (1 + t) log(1 + t) − t and σ := i=1 σi , we have
n
! 2
X σ bt
P Xi ≥ t ≤ exp − 2 h .
b σ2
i=1
Proof Using the standard Chernoff bound argument coupled with inequality (3.1.6), we see that
P(Σ_{i=1}^n Xi ≥ t) ≤ exp(Σ_{i=1}^n (σi²/b²)(e^{λb} − 1 − λb) − λt).
Letting h(t) = (1 + t) log(1 + t) − t as in the statement of the proposition and σ² = Σ_{i=1}^n σi², we minimize over λ ≥ 0, setting λ = (1/b) log(1 + bt/σ²). Substituting into our Chernoff bound application gives the proposition.
A slightly more intuitive writing of Bennett's inequality is to use averages, in which case for σ̄² = (1/n) Σ_{i=1}^n σi² the average of the variances,
P((1/n) Σ_{i=1}^n Xi ≥ t) ≤ exp(−(nσ̄²/b²) h(bt/σ̄²)).
It is possible to show that
(nσ̄²/b²) h(bt/σ̄²) ≥ nt²/(2σ̄² + (2/3)bt),
which gives rise to the classical Bernstein inequality that
P((1/n) Σ_{i=1}^n Xi ≥ t) ≤ exp(−nt²/(2σ̄² + (2/3)bt)).
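The two bounds are easily compared numerically; the sketch below evaluates both for the sample mean under hypothetical values of σ̄², b, and n (Bennett's bound is never worse, and the gap widens for large t).

```python
# Bennett and Bernstein tail bounds for the sample mean (illustrative
# transcription of the displays above), with h(t) = (1+t) log(1+t) - t.
import math

def bennett_tail(t, sigma2, b, n):
    u = b * t / sigma2
    h = (1 + u) * math.log(1 + u) - u
    return math.exp(-n * sigma2 / b**2 * h)

def bernstein_tail(t, sigma2, b, n):
    return math.exp(-n * t**2 / (2 * sigma2 + 2 * b * t / 3))

for t in [0.01, 0.1, 0.5]:
    print(t, bennett_tail(t, 0.01, 1.0, 1000), bernstein_tail(t, 0.01, 1.0, 1000))
```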
Depending on the norm chosen, this task may be impossible; for the Euclidean (ℓ2) norm, however, such an embedding is easy to construct using Gaussian random variables and with m = O(ε^{−2} log n). This embedding is known as the Johnson-Lindenstrauss embedding. Note that this size m is independent of the dimension d, only depending on the number of points n.
Indeed, let the matrix Φ ∈ R^{m×d} have i.i.d. N(0, 1/m) entries, and let Φi ∈ R^d denote the ith row of this matrix. We claim that
m ≥ (8/ε²)(2 log n + log(1/δ)) implies ‖Φui − Φuj‖₂² ∈ (1 ± ε)‖ui − uj‖₂²
for all pairs ui, uj with probability at least 1 − δ. In particular, m ≳ (log n)/ε² is sufficient to achieve accurate dimension reduction with high probability.
To see this, note that for any fixed vector u,
⟨Φi, u⟩/‖u‖₂ ∼ N(0, 1/m), and ‖Φu‖₂²/‖u‖₂² = Σ_{i=1}^m ⟨Φi, u/‖u‖₂⟩²
is a sum of independent scaled χ²-random variables. In particular, we have E[‖Φu/‖u‖₂‖₂²] = 1, and using the χ²-concentration result of Example 3.12 yields
P(|‖Φu‖₂²/‖u‖₂² − 1| ≥ ε) = P(m|‖Φu‖₂²/‖u‖₂² − 1| ≥ mε) ≤ 2 inf_{|λ|≤1/4} exp(2mλ² − λmε) = 2 exp(−mε²/8),
the last inequality holding for ε ∈ [0, 1]. Now, using the union bound applied to each of the n(n − 1)/2 pairs (ui, uj) in the sample, we have
P(there exist i ≠ j s.t. |‖Φ(ui − uj)‖₂² − ‖ui − uj‖₂²| ≥ ε‖ui − uj‖₂²) ≤ n² exp(−mε²/8).
Taking m ≥ (8/ε²) log(n²/δ) = (16/ε²) log n + (8/ε²) log(1/δ) yields that with probability at least 1 − δ, we have ‖Φui − Φuj‖₂² ∈ (1 ± ε)‖ui − uj‖₂². 3
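A simulation makes the claim tangible; the following sketch (an illustration with synthetic points, assuming NumPy) draws Φ with N(0, 1/m) entries at the size m the example prescribes and reports the worst pairwise distortion.

```python
# Simulation of the Johnson-Lindenstrauss embedding above (illustrative).
import numpy as np

rng = np.random.default_rng(1)
n, d, eps, delta = 50, 10_000, 0.5, 0.1
m = int(np.ceil(8 / eps**2 * (2 * np.log(n) + np.log(1 / delta))))

U = rng.normal(size=(n, d))                    # n arbitrary points in R^d
Phi = rng.normal(scale=1 / np.sqrt(m), size=(m, d))
V = U @ Phi.T                                  # embedded points in R^m

worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        ratio = np.sum((V[i] - V[j])**2) / np.sum((U[i] - U[j])**2)
        worst = max(worst, abs(ratio - 1))
print("m =", m, "worst pairwise distortion:", worst)  # below eps w.h.p.
```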
Recall the channel coding problem: we communicate a message i ∈ {1, . . . , m} by sending a codeword xi ∈ {0, 1}^d over a binary symmetric channel Q that flips each bit independently with probability ε < 1/2; upon receiving the signal Z, decoding to the nearest codeword in ℓ1-distance is then the maximum likelihood decoder. We now investigate how to choose a collection {x1, . . . , xm} of such codewords and give finite sample bounds on its probability of error. In fact, by using concentration inequalities, we can show that a randomly drawn codebook of fairly small dimension is likely to enjoy good performance.
Intuitively, if our codebook {x1, . . . , xm} ⊂ {0, 1}^d is well-separated, meaning that each pair of words xi, xk satisfies ‖xi − xk‖₁ ≥ cd for some numerical constant c > 0, we should be unlikely to make a mistake. Let us make this precise. We mistake word i for word k only if the received signal Z satisfies ‖Z − xi‖₁ ≥ ‖Z − xk‖₁, and letting J = {j ∈ [d] : xij ≠ xkj} denote the set of at least c · d indices where xi and xk differ, we have
‖Z − xi‖₁ ≥ ‖Z − xk‖₁ if and only if Σ_{j∈J} (|Zj − xij| − |Zj − xkj|) ≥ 0.
If xi is the word being sent and xi and xk differ in position j, then |Zj − xij| − |Zj − xkj| ∈ {−1, 1}, and is equal to −1 with probability (1 − ε) and 1 with probability ε. That is, we have ‖Z − xi‖₁ ≥ ‖Z − xk‖₁ if and only if
Σ_{j∈J} (|Zj − xij| − |Zj − xkj| + (1 − 2ε)) ≥ |J|(1 − 2ε) ≥ cd(1 − 2ε),
and the expectation EQ[|Zj − xij| − |Zj − xkj| | xi] = −(1 − 2ε) when xij ≠ xkj. Using the Hoeffding bound, then, we have
P(mistake word i for word k) ≤ exp(−2(|J|(1 − 2ε))²/(4|J|)) = exp(−|J|(1 − 2ε)²/2) ≤ exp(−(1/2)cd(1 − 2ε)²),
where we have used that there are at least |J| ≥ cd indices differing between xi and xk. The probability of making a mistake at all is thus at most m exp(−(1/2)cd(1 − 2ε)²) if our codebook has separation c · d.
For low error decoding to occur with extremely high probability, it is thus sufficient to choose a set of code words {x1, . . . , xm} that is well separated. To that end, we state a simple lemma.

Lemma 3.20. Let X1, . . . , Xm be drawn independently and uniformly on the hypercube {0, 1}^d. Then for any t ≥ 0,
P(∃ i ≠ j s.t. ‖Xi − Xj‖₁ < d/2 − dt) ≤ (m(m − 1)/2) exp(−2dt²) ≤ (m²/2) exp(−2dt²).
Proof First, let us consider two independent draws X and X′ uniform on the hypercube. Let
Z = Σ_{j=1}^d 1{Xj ≠ X′j} = d_ham(X, X′) = ‖X − X′‖₁.
Then E[Z] = d/2. Moreover, Z is an i.i.d. sum of Bernoulli(1/2) random variables, so that by our concentration bounds of Corollary 3.9, we have
P(‖X − X′‖₁ ≤ d/2 − t) ≤ exp(−2t²/d).
Substituting t ↦ dt and applying a union bound over all m(m − 1)/2 pairs gives the lemma.
Two consequences of the lemma deserve note.
(i) Taking t = 1/4, a codebook of m words drawn uniformly at random from the hypercube has minimum separation min_{i≠j} ‖Xi − Xj‖₁ ≥ d/4 with probability at least 1 − (m²/2)e^{−d/8}.
(ii) By taking m ≤ exp(d/32), or d ≥ 32 log m, and δ = e^{−d/32}, then with probability at least 1 − e^{−d/32}—exponentially large in d—a randomly drawn codebook has all its entries separated by at least ‖xi − xj‖₁ ≥ d/4.
In particular, suppose that we draw a codebook of m words uniformly at random from the hypercube with
d ≥ max{32 log m, 8 log(m/δ)/(1 − 2ε)²}.
Then with probability at least 1 − 1/m over the draw of the codebook, the probability we make a mistake in transmission of any given symbol i over the channel Q is at most δ.
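Random codebook generation is also simple to simulate; the sketch below (an illustration with hypothetical m and d, assuming NumPy) draws a codebook and verifies the d/4 separation the discussion guarantees with high probability.

```python
# Random codebook generation (illustrative): draw m codewords uniformly from
# {0,1}^d with d >= 32 log m, and check the minimum pairwise separation.
import numpy as np

rng = np.random.default_rng(2)
m, d = 64, 192                      # d comfortably above 32 log m ~ 133
codebook = rng.integers(0, 2, size=(m, d))

# Minimum pairwise Hamming distance ||x_i - x_k||_1 over all pairs.
dists = [np.sum(codebook[i] != codebook[k])
         for i in range(m) for k in range(i + 1, m)]
print("min separation:", min(dists), "target d/4 =", d / 4)
```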
Definition 3.4. Let D1, D2, . . . be a sequence of random variables adapted to Z1, Z2, . . ., meaning that each Di is a function of Z1^i. Then {Di} is a martingale difference sequence if E[Dn | Z1^{n−1}] = 0 for all n ∈ N.
There are numerous examples of martingale sequences. The classical one is the symmetric random walk.
Example 3.22: Let Dn ∈ {±1} be uniform and independent. Then the Dn form a martingale difference sequence adapted to themselves (that is, we may take Zn = Dn), and Mn = Σ_{i=1}^n Di is a martingale. 3
A more sophisticated example, to which we will frequently return and that suggests the potential
usefulness of martingale constructions, is the Doob martingale associated with a function f .
Example 3.23 (Doob martingales): Let f : X^n → R be an otherwise arbitrary function, and let X1, . . . , Xn be arbitrary random variables. The Doob martingale is defined by the difference sequence
Di := E[f(X1^n) | X1^i] − E[f(X1^n) | X1^{i−1}].
By inspection, the Di are functions of X1^i, and we have
E[Di | X1^{i−1}] = E[E[f(X1^n) | X1^i] | X1^{i−1}] − E[f(X1^n) | X1^{i−1}] = 0
by the tower property of expectations. Thus, the Di satisfy Definition 3.4 of a martingale difference sequence, and moreover, we have
Σ_{i=1}^n Di = f(X1^n) − E[f(X1^n)],
and so the Doob martingale captures exactly the difference between f and its expectation. 3
Definition 3.5. A martingale difference sequence Di adapted to Z1, Z2, . . . is σi²-sub-Gaussian if
E[e^{λDi} | Z1^{i−1}] ≤ exp(λ²σi²/2) for all λ ∈ R.

Theorem 3.24 (Azuma-Hoeffding). Let Di be a σi²-sub-Gaussian martingale difference sequence and Mn = Σ_{i=1}^n Di. Then Mn is Σ_{i=1}^n σi²-sub-Gaussian, and for all t ≥ 0,
P(Mn ≥ t) ∨ P(Mn ≤ −t) ≤ exp(−t²/(2 Σ_{i=1}^n σi²)).

Proof For the first claim, we compute
E[exp(λMn)] = E[exp(λM_{n−1}) E[e^{λDn} | Z1^{n−1}]],
because D1, . . . , Dn−1 are functions of Z1^{n−1}. Then we use Definition 3.5, which implies that E[e^{λDn} | Z1^{n−1}] ≤ e^{λ²σn²/2}, and we obtain
E[exp(λMn)] ≤ E[Π_{i=1}^{n−1} e^{λDi}] exp(λ²σn²/2).
Applying the same argument inductively to M_{n−1}, . . . , M1 yields E[exp(λMn)] ≤ exp(λ² Σ_{i=1}^n σi²/2), as desired.
The second claims are simply applications of Chernoff bounds via Proposition 3.7 and that
E[Mn ] = 0.
As an immediate corollary, we recover Proposition 3.8, as sums of independent random variables form martingales via Mn = Σ_{i=1}^n (Xi − E[Xi]). A second corollary gives what is typically termed the Azuma inequality:
Corollary 3.25. Let Di be a bounded martingale difference sequence, meaning that |Di| ≤ c. Then Mn = Σ_{i=1}^n Di satisfies
P(n^{−1/2} Mn ≥ t) ∨ P(n^{−1/2} Mn ≤ −t) ≤ exp(−t²/(2c²)) for t ≥ 0.
Thus, bounded random walks are (with high probability) within ±O(√n) of their expectations after n steps.
There exist extensions of these inequalities to the cases where we control the variance of the
martingales; see Freedman [71].
Definition 3.6 (Bounded differences). Let f : X^n → R for some space X. Then f satisfies bounded differences with constants ci if for each i ∈ {1, . . . , n}, all x1^n ∈ X^n, and x′i ∈ X we have
|f(x1^{i−1}, xi, x_{i+1}^n) − f(x1^{i−1}, x′i, x_{i+1}^n)| ≤ ci.
The classical inequality relating bounded differences and concentration is McDiarmid's inequality, or the bounded differences inequality.

Proposition 3.26 (Bounded differences inequality). Let f satisfy bounded differences with constants ci, and let X1, . . . , Xn be independent. Then
P(f(X1^n) − E[f(X1^n)] ≥ t) ∨ P(f(X1^n) − E[f(X1^n)] ≤ −t) ≤ exp(−2t²/Σ_{i=1}^n ci²).
Proof The basic idea is to show that the Doob martingale (Example 3.23) associated with f is ci²/4-sub-Gaussian, and then to simply apply the Azuma-Hoeffding inequality. To that end, define Di = E[f(X1^n) | X1^i] − E[f(X1^n) | X1^{i−1}] as before, and note that Σ_{i=1}^n Di = f(X1^n) − E[f(X1^n)]. The random variables Di satisfy, conditionally on X1^{i−1},
sup_x E[f(X1^n) | X1^{i−1}, Xi = x] − inf_{x′} E[f(X1^n) | X1^{i−1}, Xi = x′] ≤ ci,
where we have used the independence of the Xi and Definition 3.6 of bounded differences. Consequently, we have by Hoeffding's Lemma (Example 3.6) that E[e^{λDi} | X1^{i−1}] ≤ exp(λ²ci²/8), that is, the Doob martingale is ci²/4-sub-Gaussian.
The remainder of the proof is simply Theorem 3.24.
A number of quantities satisfy the conditions of Proposition 3.26, and we give two examples here; we will revisit them later.
Example 3.27 (Bounded random vectors): Let B be a Banach space—a complete normed vector space—with norm ‖·‖. Let Xi be independent bounded random vectors in B satisfying E[Xi] = 0 and ‖Xi‖ ≤ c. We claim that the quantity
f(X1^n) := ‖(1/n) Σ_{i=1}^n Xi‖
satisfies bounded differences with constants ci = 2c/n, because by the triangle inequality
|f(x1^{i−1}, x, x_{i+1}^n) − f(x1^{i−1}, x′, x_{i+1}^n)| ≤ (1/n)‖x − x′‖ ≤ 2c/n.
Consequently, if the Xi are independent, we have
P(|‖(1/n) Σ_{i=1}^n Xi‖ − E[‖(1/n) Σ_{i=1}^n Xi‖]| ≥ t) ≤ 2 exp(−nt²/(2c²))    (3.2.1)
for all t ≥ 0. That is, the norm of (bounded) random vectors in an essentially arbitrary vector space concentrates extremely quickly about its expectation.
The challenge becomes to control the expectation term in the concentration bound (3.2.1), which can be a bit challenging. In certain cases—for example, when we have a Euclidean structure on the vectors Xi—it can be easier. Indeed, let us specialize to the case that Xi ∈ H, a (real) Hilbert space, so that there is an inner product ⟨·, ·⟩ and the norm satisfies ‖x‖² = ⟨x, x⟩ for x ∈ H. Then Cauchy-Schwarz implies that
E[‖Σ_{i=1}^n Xi‖]² ≤ E[‖Σ_{i=1}^n Xi‖²] = Σ_{i,j} E[⟨Xi, Xj⟩] = Σ_{i=1}^n E[‖Xi‖²],
the final equality by independence and mean-zero-ness of the Xi. That is, assuming the Xi are independent and E[‖Xi‖²] ≤ σ², we have E[‖X̄n‖] ≤ σ/√n for X̄n = (1/n) Σ_{i=1}^n Xi, and inequality (3.2.1) implies
P(‖X̄n‖ ≥ σ/√n + t) ≤ 2 exp(−nt²/(2c²)). 3
We can specialize Example 3.27 to a situation that is very important for treatments of concen-
tration, sums of random vectors, and generalization bounds in machine learning.
Example 3.28 (Rademacher complexities): This example is actually a special case of Example 3.27, but its frequent uses justify a more specialized treatment and consideration. Let X be some space, and let F be some collection of functions f : X → R. Let εi ∈ {−1, 1} be a collection of independent random signs. Then the empirical Rademacher complexity of F is
Rn(F | x1^n) := E[sup_{f∈F} (1/n) Σ_{i=1}^n εi f(xi)],
where the expectation is over only the random signs εi. (In some cases, depending on context and convenience, one takes the absolute value |Σ_i εi f(xi)|.) The Rademacher complexity of F is
Rn(F) := E[Rn(F | X1^n)],
the expectation of the empirical Rademacher complexities.
If f : X → [b0, b1] for all f ∈ F, then the empirical Rademacher complexity satisfies bounded differences, because for any two sequences x1^n and z1^n differing in only element j, we have
n|Rn(F | x1^n) − Rn(F | z1^n)| ≤ E[sup_{f∈F} Σ_{i=1}^n εi(f(xi) − f(zi))] = E[sup_{f∈F} εj(f(xj) − f(zj))] ≤ b1 − b0.
Consequently, the empirical Rademacher complexity satisfies that Rn(F | X1^n) − Rn(F) is (b1 − b0)²/(4n)-sub-Gaussian by Theorem 3.24. 3
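For finite classes, the empirical Rademacher complexity is easy to estimate by Monte Carlo; the sketch below is an illustration with a synthetic class (the matrix of function values is hypothetical), averaging the supremum over many draws of the signs.

```python
# Monte Carlo estimate of the empirical Rademacher complexity (illustrative):
# a finite class F is represented by values[f, i] = f(x_i).
import numpy as np

rng = np.random.default_rng(3)
n, num_functions, trials = 100, 20, 5000

# Synthetic class with values in [-1, 1] (hypothetical data).
values = rng.uniform(-1, 1, size=(num_functions, n))

eps = rng.choice([-1.0, 1.0], size=(trials, n))
# For each draw of signs, take sup over f of (1/n) sum_i eps_i f(x_i).
sups = (eps @ values.T / n).max(axis=1)
print("estimated R_n(F | x_1^n):", sups.mean())
```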
These examples warrant more discussion, and it is possible to argue that many variants of these random variables are well-concentrated. For example, instead of functions we may simply consider an arbitrary set A ⊂ R^n and define the random variable
Z(A) := sup_{a∈A} ⟨a, ε⟩ = sup_{a∈A} Σ_{i=1}^n ai εi.
As a function of the random signs εi, we may write Z(A) = f(ε), and this is then a function satisfying |f(ε) − f(ε′)| ≤ sup_{a∈A} |⟨a, ε − ε′⟩|, so that if ε and ε′ differ in index i, we have |f(ε) − f(ε′)| ≤ 2 sup_{a∈A} |ai|. That is, Z(A) − E[Z(A)] is Σ_{i=1}^n sup_{a∈A} ai²-sub-Gaussian.
Indeed, we may define the symmetrized empirical measure Pn⁰ := (1/n) Σ_{i=1}^n εi 1_{Xi}, viewing the point masses 1_{Xi} as elements of this vector space L. (Here we have used 1_{Xi} to denote the point mass at Xi.) Then the Rademacher complexity is nothing more than the expected norm of Pn⁰, a random vector, as in Example 3.27. This view is somewhat sophisticated, but it shows that any general results we may prove about random vectors, as in Example 3.27, will carry over immediately to versions of the Rademacher complexity. 3
Let
Pn := (1/n) Σ_{i=1}^n 1_{Xi}
denote the empirical distribution on {Xi}_{i=1}^n, where 1_{Xi} denotes the point mass at Xi. Then for functions f : X → R (or more generally, any function f defined on X), we let
Pn f := E_{Pn}[f(X)] = (1/n) Σ_{i=1}^n f(Xi)
denote the empirical expectation of f evaluated on the sample, and we also let
P f := E_P[f(X)] = ∫ f(x) dP(x)
denote general expectations under a measure P. With this notation, we study uniform laws of large numbers, which consist of proving results of the form
‖Pn − P‖_F → 0,
where convergence is in probability, expectation, almost surely, or with rates of convergence, and where we view Pn and P as (infinite-dimensional) vectors on the space of maps from F → R, defining the (semi)norm ‖·‖_F for any L : F → R by ‖L‖_F := sup_{f∈F} |L f|.
Thus, roughly, we are simply asking questions about when random vectors converge to their expectations.
The starting point of this investigation considers bounded random functions, that is, F consists of functions f : X → [a, b] for some −∞ < a ≤ b < ∞. In this case, the bounded differences inequality (Proposition 3.26) immediately implies that expectations of ‖Pn − P‖_F provide strong guarantees on concentration of ‖Pn − P‖_F.

Proposition 3.30. Let F consist of functions f : X → [a, b] and let X1, . . . , Xn be independent. Then
P(‖Pn − P‖_F ≥ E[‖Pn − P‖_F] + t) ≤ exp(−2nt²/(b − a)²) for t ≥ 0.

Proof Let Pn and Pn′ be two empirical distributions, differing only in observation i (with Xi and Xi′). We observe that
‖Pn − P‖_F − ‖Pn′ − P‖_F ≤ ‖Pn − Pn′‖_F = (1/n) sup_{f∈F} |f(Xi) − f(Xi′)| ≤ (b − a)/n
by the triangle inequality. An entirely parallel argument gives the converse lower bound of −(b − a)/n, and thus Proposition 3.26 gives the result.
Proposition 3.30 shows that, to provide control over high-probability concentration of kPn − P kF ,
it is (at least in cases where F is bounded) sufficient to control the expectation E[kPn − P kF ]. We
take this approach through the remainder of this section, developing tools to simplify bounding
this quantity.
Our starting points consist of a few inequalities relating expectations to symmetrized quantities,
which are frequently easier to control than their non-symmetrized parts. This symmetrization
technique is widely used in probability theory, theoretical statistics, and machine learning. The key
is that for centered random variables, symmetrized quantities have, to within numerical constants,
similar expectations to their non-symmetrized counterparts. Thus, in many cases, it is equivalent
to analyze the symmetrized quantity and the initial quantity.
Proposition 3.31. Let Xi be independent random vectors on a (Banach) space with norm ‖·‖ and let εi ∈ {−1, 1} be independent random signs. Then for any p ≥ 1,
2^{−p} E[‖Σ_{i=1}^n εi(Xi − E[Xi])‖^p] ≤ E[‖Σ_{i=1}^n (Xi − E[Xi])‖^p] ≤ 2^p E[‖Σ_{i=1}^n εi Xi‖^p].
In the proof of the upper bound, we could also show the bound
E[‖Σ_{i=1}^n (Xi − E[Xi])‖^p] ≤ 2^p E[‖Σ_{i=1}^n εi(Xi − E[Xi])‖^p],
which is sometimes more convenient.
Proof We prove the right bound first. We introduce independent copies of the Xi and use these to symmetrize the quantity. Indeed, let Xi′ be an independent copy of Xi, and use Jensen's inequality and the convexity of ‖·‖^p to observe that
E[‖Σ_{i=1}^n (Xi − E[Xi])‖^p] = E[‖Σ_{i=1}^n (Xi − E[Xi′])‖^p] ≤ E[‖Σ_{i=1}^n (Xi − Xi′)‖^p].
Now, note that the distribution of Xi − Xi′ is symmetric, so that Xi − Xi′ =_dist εi(Xi − Xi′), and thus
E[‖Σ_{i=1}^n (Xi − E[Xi])‖^p] ≤ E[‖Σ_{i=1}^n εi(Xi − Xi′)‖^p] ≤ 2^{p−1}(E[‖Σ_{i=1}^n εi Xi‖^p] + E[‖Σ_{i=1}^n εi Xi′‖^p]) = 2^p E[‖Σ_{i=1}^n εi Xi‖^p],
the second inequality following from the triangle inequality and the convexity of t ↦ t^p, as desired.
For the left bound in the proposition, let Yi = Xi − E[Xi] be the centered version of the random variables. We break the sum over random variables into two parts, conditional on whether εi = ±1, using repeated conditioning. We have
E[‖Σ_{i=1}^n εi Yi‖^p] = E[‖Σ_{i:εi=1} Yi − Σ_{i:εi=−1} Yi‖^p]
≤ E[2^{p−1} E[‖Σ_{i:εi=1} Yi‖^p | ε] + 2^{p−1} E[‖Σ_{i:εi=−1} Yi‖^p | ε]]
= 2^{p−1} E[E[‖Σ_{i:εi=1} Yi + Σ_{i:εi=−1} E[Yi]‖^p | ε] + E[‖Σ_{i:εi=−1} Yi + Σ_{i:εi=1} E[Yi]‖^p | ε]]
≤ 2^{p−1} E[E[‖Σ_{i:εi=1} Yi + Σ_{i:εi=−1} Yi‖^p | ε] + E[‖Σ_{i:εi=−1} Yi + Σ_{i:εi=1} Yi‖^p | ε]]
= 2^p E[‖Σ_{i=1}^n Yi‖^p],
where the first inequality is the triangle inequality coupled with convexity of t ↦ t^p, the subsequent equality uses that E[Yi] = 0, and the final inequality is Jensen's inequality.
Applying Proposition 3.31 with p = 1 to the point masses 1_{Xi} (which have "expectation" P) gives the following corollary.

Corollary 3.32. Let Pn⁰ := (1/n) Σ_{i=1}^n εi 1_{Xi} denote the symmetrized empirical measure. Then
E[‖Pn − P‖_F] ≤ 2 E[‖Pn⁰‖_F].

From Corollary 3.32, it is evident that by controlling the expectation of the symmetrized process E[‖Pn⁰‖_F] we can derive concentration inequalities and uniform laws of large numbers. For example, combining it with Proposition 3.30, we immediately obtain that
P(‖Pn − P‖_F ≥ 2E[‖Pn⁰‖_F] + t) ≤ exp(−2nt²/(b − a)²).
Proposition 3.33 (Massart’s finite class bound). Let F be any collection of functions with f :
X → R, and assume that σn2 := n−1 E[maxf ∈F ni=1 f (Xi )2 ] < ∞. Then
P
p
2σn2 log |F|
Rn (F) ≤ √ .
n
Proof For each fixed x1^n, the random variable Σ_{i=1}^n εi f(xi) is Σ_{i=1}^n f(xi)²-sub-Gaussian. Now, define σ²(x1^n) := n^{−1} max_{f∈F} Σ_{i=1}^n f(xi)². Using the results of Exercise 3.7, that is, that E[max_{j≤n} Zj] ≤ √(2σ² log n) if the Zj are each σ²-sub-Gaussian, we see that
Rn(F | x1^n) ≤ √(2σ²(x1^n) log |F|)/√n.
Jensen's inequality, that E[√·] ≤ √(E[·]), gives the result.
A refinement of Massart’s finite class bound applies when the classes are infinite but, on a
collection X1 , . . . , Xn , the functions f ∈ F may take on only a (smaller) number of values. In this
case, we define the empirical shatter coefficient of a collection of points x1 , . . . , xn by SF (xn1 ) :=
card{(f (x1 ), . . . , f (xn )) | f ∈ F }, the number of distinct vectors of values (f (x1 ), . . . , f (xn )) the
functions f ∈ F may take. The shatter coefficient is the maximum of the empirical shatter coeffi-
cients over xn1 ∈ X n , that is, SF (n) := supxn1 SF (xn1 ). It is clear that SF (n) ≤ |F| always, but by
only counting distinct values, we have the following corollary.
p
2σn2 log SF (n)
Rn (F) ≤ √ .
n
Typical classes with small shatter coefficients include Vapnik-Chervonenkis classes of functions; we
do not discuss these further here, instead referring to one of the many books in machine learning
and empirical process theory in statistics.
The most important of the calculus rules we use are the comparison inequalities for Rademacher sums, which allow us to consider compositions of function classes and maintain small complexity measures. We state the rule here; the proof is complex, so we defer it to Section 3.4.3.

Theorem 3.35 (Contraction comparison inequality). Let φ : R → R be L-Lipschitz with φ(0) = 0. Then
Rn(φ ◦ F | x1^n) ≤ 2L Rn(F | x1^n).

As an immediate consequence, we have the following corollary for Lipschitz φ that need not satisfy φ(0) = 0.

Corollary 3.36. Let φ : R → R be L-Lipschitz. Then
Rn(φ ◦ F | x1^n) ≤ 2L Rn(F | x1^n) + |φ(0)|/√n.

Proof The result is an almost immediate consequence of Theorem 3.35; we simply recenter our functions. Indeed, we have
Rn(φ ◦ F | x1^n) = E[sup_{f∈F} ((1/n) Σ_{i=1}^n εi(φ(f(xi)) − φ(0)) + (1/n) Σ_{i=1}^n εi φ(0))]
≤ E[sup_{f∈F} (1/n) Σ_{i=1}^n εi(φ(f(xi)) − φ(0))] + E[|(1/n) Σ_{i=1}^n εi φ(0)|]
≤ 2L Rn(F | x1^n) + |φ(0)|/√n,
where the final inequality applies Theorem 3.35 to t ↦ φ(t) − φ(0) and uses that E[|Σ_{i=1}^n εi|] ≤ √n.
In general, however, we only have access to the risk L(f) := E[ℓ(f, Z)] of the formulation (3.3.2) via the empirical distribution of the Zi, and we often choose f by minimizing the empirical risk
L̂n(f) := (1/n) Σ_{i=1}^n ℓ(f, Zi).    (3.3.3)
As written, this formulation is quite abstract, so we provide a few examples to make it somewhat more concrete.
Example 3.37 (Binary classification problems): One standard problem—still abstract—that motivates the formulation (3.3.2) is the binary classification problem. Here the data Zi come in pairs (X, Y), where X ∈ X is some set of covariates (independent variables) and Y ∈ {−1, 1} is the label of example X. The function class F consists of functions f : X → R, and the goal is to find a function f such that
P(sign(f(X)) ≠ Y)
is small, that is, minimizing the risk E[ℓ(f, Z)] where the loss is the 0-1 loss, ℓ(f, (x, y)) = 1{f(x)y ≤ 0}. 3
Example 3.39 (Binary classification with linear functions): In the standard statistical learning setting, the data x belong to R^d, and we assume that our function class F is indexed by a set Θ ⊂ R^d, so that F = {fθ : fθ(x) = θ^T x, θ ∈ Θ}. In this case, we may use the zero-one loss ℓzo(fθ, (x, y)) := 1{yθ^T x ≤ 0}, and the convex losses
ℓhinge(fθ, (x, y)) = [1 − yx^T θ]₊ and ℓlogit(fθ, (x, y)) = log(1 + exp(−yx^T θ)),
the hinge loss and logistic loss, respectively. The hinge and logistic losses, as they are convex, are substantially computationally easier to work with, and they are common choices in applications. 3
The main motivating question that we ask is the following: given a sample Z1, . . . , Zn, if we choose some f̂n ∈ F based on this sample, can we guarantee that it generalizes to unseen data? In particular, can we guarantee that (with high probability) we have the risk bound
L(f̂n) ≤ L̂n(f̂n) + ε = (1/n) Σ_{i=1}^n ℓ(f̂n, Zi) + ε    (3.3.4)
for some small ε? If we allow f̂n to be arbitrary, then this becomes clearly impossible: consider the classification example 3.37, and set f̂n to be the "hash" function that sets f̂n(x) = y if the pair (x, y) was in the sample, and otherwise f̂n(x) = −1. Then clearly L̂n(f̂n) = 0, while there is no useful bound on the true risk L(f̂n).
One way to circumvent this difficulty is to ask for uniform guarantees, that is, bounds of the form
sup_{f∈F} |L̂n(f) − L(f)| ≤ ε with high probability.
Such uniform bounds are certainly sufficient to guarantee that the empirical risk is a good proxy for the true risk L, even when f̂n is chosen based on the data.
Now, recalling that our set of functions or predictors F is finite or countable, let us suppose that for each f ∈ F, we have a complexity measure c(f)—a penalty—such that
Σ_{f∈F} e^{−c(f)} ≤ 1.    (3.3.6)
This inequality should look familiar: it is the Kraft inequality—which we will see in the coming chapters—from coding theory. As soon as we have such a penalty function, however, we have the following result.
Theorem 3.40. Let the loss ℓ, distribution P on Z, and function class F be such that ℓ(f, Z) is σ²-sub-Gaussian for each f ∈ F, and assume that the complexity inequality (3.3.6) holds. Then with probability at least 1 − δ over the sample Z1:n,
L(f) ≤ L̂n(f) + √(2σ² (log(1/δ) + c(f))/n) for all f ∈ F.
Proof First, we note that by the usual sub-Gaussian concentration inequality (Corollary 3.9) we have for any t ≥ 0 and any f ∈ F that
P(L(f) ≥ L̂n(f) + t) ≤ exp(−nt²/(2σ²)).
Now, if we replace t by √(t² + 2σ²c(f)/n), we obtain
P(L(f) ≥ L̂n(f) + √(t² + 2σ²c(f)/n)) ≤ exp(−nt²/(2σ²) − c(f)).
Then using a union bound, we have
P(∃ f ∈ F s.t. L(f) ≥ L̂n(f) + √(t² + 2σ²c(f)/n)) ≤ Σ_{f∈F} exp(−nt²/(2σ²) − c(f)) = exp(−nt²/(2σ²)) Σ_{f∈F} exp(−c(f)) ≤ exp(−nt²/(2σ²)),
the final inequality using the complexity bound (3.3.6). Setting t² = 2σ² log(1/δ)/n gives the result.
As one classical example of this setting, suppose that we have a finite class of functions F. Then we can set c(f) = log |F|, in which case we clearly have the summation guarantee (3.3.6), and we obtain
L(f) ≤ L̂n(f) + √(2σ² (log(1/δ) + log |F|)/n) uniformly for f ∈ F
with probability at least 1 − δ. To make this even more concrete, consider the following example.
Example 3.41 (Floating point classifiers): We implement a linear binary classifier using double-precision floating point values, that is, we have fθ(x) = θ^T x for all θ ∈ R^d that may be represented using d double-precision floating point numbers. Then for each coordinate of θ, there are at most 2^64 representable numbers; in total, we must thus have |F| ≤ 2^{64d}. Thus, for the zero-one loss ℓzo(fθ, (x, y)) = 1{θ^T xy ≤ 0}, we have
L(fθ) ≤ L̂n(fθ) + √((log(1/δ) + 45d)/(2n))
for all representable classifiers simultaneously, with probability at least 1 − δ, as the zero-one loss is 1/4-sub-Gaussian. (Here we have used that 64 log 2 < 45.) 3
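The displayed generalization gap is trivial to evaluate; the following illustration (with hypothetical n, d, and δ) shows the 1/√n decay and the √d dependence.

```python
# Evaluating the uniform gap in Example 3.41 (illustrative):
# sqrt((log(1/delta) + 45 d) / (2 n)) for d-dimensional linear classifiers.
import math

def fp_classifier_gap(n: int, d: int, delta: float) -> float:
    return math.sqrt((math.log(1 / delta) + 45 * d) / (2 * n))

for n in [10_000, 100_000, 1_000_000]:
    print(n, fp_classifier_gap(n, d=100, delta=0.05))
```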
We also note in passing that by replacing δ with δ/2 in the bounds of Theorem 3.40, a union bound yields the following two-sided corollary.

Corollary 3.42. Let the conditions of Theorem 3.40 hold. Then with probability at least 1 − δ,
|L(f) − L̂n(f)| ≤ √(2σ² (log(2/δ) + c(f))/n) for all f ∈ F.
Large classes
When the collection of functions is (uncountably) infinite, it can be more challenging to obtain
strong generalization bounds. There still exist numerous tools for these ideas, however, and we
present a few of the more basic and common of them. We return in the next chapter to alterna-
tive approaches based on randomization and divergence measures, which provide guarantees with
somewhat similar structure to those we present here.
Let us begin by considering a few examples, after which we provide examples showing how to
derive explicit bounds using Rademacher complexities.
Example 3.43 (Rademacher complexity of the ℓ2-ball): Let Θ = {θ ∈ R^d | ‖θ‖₂ ≤ r}, and consider the class of linear functionals F := {fθ(x) = θ^T x, θ ∈ Θ}. Then
Rn(F | x1^n) ≤ (r/n) √(Σ_{i=1}^n ‖xi‖₂²),
because we have
Rn(F | x1^n) = (r/n) E[‖Σ_{i=1}^n εi xi‖₂] ≤ (r/n) √(E[‖Σ_{i=1}^n εi xi‖₂²]) = (r/n) √(Σ_{i=1}^n ‖xi‖₂²)
(the first equality because sup_{‖θ‖₂≤r} ⟨θ, v⟩ = r‖v‖₂, the inequality by Jensen), as desired. 3
Example 3.44 (Rademacher complexity of the ℓ1-ball): Let Θ = {θ ∈ R^d | ‖θ‖₁ ≤ r} and F := {fθ(x) = θ^T x, θ ∈ Θ}, so that Rn(F | x1^n) = (r/n) E[‖Σ_{i=1}^n εi xi‖∞]. Now, each coordinate j of Σ_{i=1}^n εi xi is Σ_{i=1}^n xij²-sub-Gaussian, and thus using that E[max_{j≤d} Zj] ≤ √(2σ² log d) for arbitrary σ²-sub-Gaussian Zj (see Exercise 3.7), we have
Rn(F | x1^n) ≤ (r/n) √(2 log(2d) max_j Σ_{i=1}^n xij²).
To facilitate comparison with Example 3.43, suppose that the vectors xi all satisfy ‖xi‖∞ ≤ b. In this case, the preceding inequality implies that Rn(F | x1^n) ≤ rb√(2 log(2d))/√n. In contrast, the ℓ2-norm of such xi may satisfy ‖xi‖₂ = b√d, so that the bounds of Example 3.43 scale instead as rb√d/√n, which can be exponentially larger. 3
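The dimension dependence of the two bounds is striking; the short illustration below (hypothetical r, b, n) tabulates both as d grows.

```python
# Comparing the l1-ball and l2-ball Rademacher bounds (illustrative):
# r b sqrt(2 log(2d) / n) versus r b sqrt(d / n) when ||x_i||_inf <= b.
import math

r, b, n = 1.0, 1.0, 10_000
for d in [10, 1_000, 100_000]:
    l1_bound = r * b * math.sqrt(2 * math.log(2 * d) / n)
    l2_bound = r * b * math.sqrt(d / n)
    print(d, l1_bound, l2_bound)
# The l1 bound grows only logarithmically in the dimension d.
```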
These examples are sufficient to derive a few sophisticated risk bounds. We focus on the case
where we have a loss function applied to some class with reasonable Rademacher complexity, in
which case it is possible to recenter the loss class and achieve reasonable complexity bounds. The
coming proposition does precisely this in the case of margin-based binary classification. Consider
points (x, y) ∈ X × {±1}, and let F be an arbitrary class of functions f : X → R and L = {(x, y) ↦ ℓ(yf(x))}_{f∈F} be the induced collection of losses. As a typical example, we might have ℓ(t) = [1 − t]₊, ℓ(t) = e^{−t}, or ℓ(t) = log(1 + e^{−t}). We have the following proposition.
Proposition 3.45. Let F and X be such that sup_{x∈X} |f(x)| ≤ M for f ∈ F and assume that ℓ is L-Lipschitz. Define the empirical and population risks L̂n(f) := Pn ℓ(Y f(X)) and L(f) := P ℓ(Y f(X)). Then
P(sup_{f∈F} |L̂n(f) − L(f)| ≥ 4L Rn(F) + t) ≤ 2 exp(−nt²/(2L²M²)) for t ≥ 0.
Proof We may recenter the class L, that is, replace ℓ(·) with ℓ(·) − ℓ(0), without changing L̂n(f) − L(f). Call this class L0, so that ‖Pn − P‖_L = ‖Pn − P‖_{L0}. This recentered class satisfies bounded differences with constant 2ML, as |ℓ(yf(x)) − ℓ(y′f(x′))| ≤ L|yf(x) − y′f(x′)| ≤ 2LM, as in the proof of Proposition 3.30. Applying Proposition 3.30 and then Corollary 3.32 gives that
P(sup_{f∈F} |L̂n(f) − L(f)| ≥ 2Rn(L0) + t) ≤ 2 exp(−nt²/(2M²L²)) for t ≥ 0.
Then applying the contraction inequality (Theorem 3.35) yields Rn(L0) ≤ 2L Rn(F), giving the result.
Example 3.46 (Support vector machines and hinge losses): In the support vector machine problem, we receive data (Xi, Yi) ∈ R^d × {±1}, and we seek to minimize the average of the losses ℓ(θ; (x, y)) = [1 − yθ^T x]₊. We assume that the space X has ‖x‖₂ ≤ b for x ∈ X and that Θ = {θ | ‖θ‖₂ ≤ r}. As the hinge loss is 1-Lipschitz and sup_{x∈X} |fθ(x)| ≤ rb for θ ∈ Θ, Proposition 3.45 implies
P(sup_{θ∈Θ} |Pn ℓ(θ; (X, Y)) − P ℓ(θ; (X, Y))| ≥ 4Rn(FΘ) + t) ≤ exp(−nt²/(2r²b²)),
where FΘ = {fθ(x) = θ^T x}_{θ∈Θ}. Now, we apply Example 3.43, which implies that
Rn(φ ◦ FΘ) ≤ 2Rn(FΘ) ≤ 2rb/√n.
That is, we have
P(sup_{θ∈Θ} |Pn ℓ(θ; (X, Y)) − P ℓ(θ; (X, Y))| ≥ 4rb/√n + t) ≤ exp(−nt²/(2(rb)²)),
so that Pn and P become close at rate roughly rb/√n in this case. 3
To understand the tradeoffs at play, define the Bayes risk
L* = inf_f L(f),
where the preceding infimum is taken across all (measurable) functions. Then we have the decomposition
L(f̂) − L* = [L(f̂) − inf_{f∈F} L(f)] + [inf_{f∈F} L(f) − L*],
the first bracketed term the estimation error (the error due to fitting f̂ from a finite sample) and the second the approximation error of the class F.
There is often a tradeoff between these two, analogous to the bias/variance tradeoff in classical
statistics; if the approximation error is very small, then it is likely hard to guarantee that the esti-
mation error converges quickly to zero, while certainly a constant function will have low estimation
error, but may have substantial approximation error. With that in mind, we would like to develop
procedures that, rather than simply attaining good performance for the class F, are guaranteed
to trade-off in an appropriate way between the two types of error. This leads us to the idea of
structural risk minimization.
In this scenario, we assume we have a sequence of classes of functions, F1, F2, . . ., of increasing complexity, meaning that F1 ⊂ F2 ⊂ · · ·. For example, in a linear classification setting with vectors x ∈ R^d, we might take a sequence of classes allowing increasing numbers of non-zeros in the classification vector θ:
F1 := {fθ(x) = θ^T x such that ‖θ‖₀ ≤ 1}, F2 := {fθ(x) = θ^T x such that ‖θ‖₀ ≤ 2}, . . . .
More broadly, let {Fk}_{k∈N} be a (possibly infinite) increasing sequence of function classes. We assume that for each Fk and each n ∈ N, there exists a constant Cn,k(δ) such that we have the uniform generalization guarantee
P(sup_{f∈Fk} |L̂n(f) − L(f)| ≥ Cn,k(δ)) ≤ δ · 2^{−k}.
(We will see in subsequent sections of the course how to obtain other more general guarantees.)
We consider the following structural risk minimization procedure. First, given the empirical risk L̂n, we find the model collection k̂ minimizing the penalized risk
k̂ := argmin_{k∈N} inf_{f∈Fk} {L̂n(f) + Cn,k(δ)}.    (3.3.8a)
We then choose f̂ to minimize the risk over the estimated "best" class F_k̂, that is, set
f̂ := argmin_{f∈F_k̂} L̂n(f).    (3.3.8b)
Theorem 3.47. Let f̂ be chosen according to the procedure (3.3.8a)–(3.3.8b). Then with probability at least 1 − δ, we have
L(f̂) ≤ inf_{k∈N} inf_{f∈Fk} {L(f) + 2Cn,k(δ)}.

Proof By a union bound, the assumed guarantee implies P(∃ k : sup_{f∈Fk} |L̂n(f) − L(f)| ≥ Cn,k(δ)) ≤ δ Σ_k 2^{−k} ≤ δ. On the event that sup_{f∈Fk} |L̂n(f) − L(f)| < Cn,k(δ) for all k, which thus occurs with probability at least 1 − δ, we have
L(f̂) ≤ L̂n(f̂) + C_{n,k̂}(δ) = inf_{k∈N} inf_{f∈Fk} {L̂n(f) + Cn,k(δ)} ≤ inf_{k∈N} inf_{f∈Fk} {L(f) + 2Cn,k(δ)},
which is the claim.
We conclude with a final example, using our earlier floating point bound from Example 3.41, coupled with Corollary 3.42 and Theorem 3.47.

Example 3.48 (Structural risk minimization with floating point classifiers): Consider again our floating point example, and let the function class Fk consist of functions defined by at most k double-precision floating point values, so that log |Fk| ≤ 45k. Then by taking
Cn,k(δ) = √((log(1/δ) + 65k log 2)/(2n))
we have that |L̂n(f) − L(f)| ≤ Cn,k(δ) simultaneously for all f ∈ Fk and all Fk, with probability at least 1 − δ. Then the structural risk minimization procedure (3.3.8) guarantees that
L(f̂) ≤ inf_{k∈N} inf_{f∈Fk} {L(f) + √((2 log(1/δ) + 91k)/n)}.
Roughly, we trade between small risk L(f)—as the risk inf_{f∈Fk} L(f) must be decreasing in k—and the estimation error penalty, which scales as √((k + log(1/δ))/n). 3
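The selection rule (3.3.8a) is a one-line computation once the per-class empirical risks are in hand; the sketch below illustrates it under the penalty of Example 3.48 (the empirical risks listed are hypothetical inputs, not computed from data).

```python
# Sketch of the structural risk minimization rule (3.3.8), assuming the
# penalty C_{n,k}(delta) of Example 3.48 and hypothetical empirical risks.
import math

def penalty(n: int, k: int, delta: float) -> float:
    return math.sqrt((math.log(1 / delta) + 65 * k * math.log(2)) / (2 * n))

def srm_select(empirical_risks, n: int, delta: float) -> int:
    """Return k-hat (1-based) minimizing L_n-hat(f_k) + C_{n,k}(delta)."""
    scores = [risk + penalty(n, k + 1, delta)
              for k, risk in enumerate(empirical_risks)]
    return min(range(len(scores)), key=scores.__getitem__) + 1

# Risk decreases with class complexity k while the penalty increases.
risks = [0.30, 0.22, 0.18, 0.17, 0.168, 0.167]
print("selected k:", srm_select(risks, n=2_000, delta=0.05))
```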
3.4 Technical proofs

3.4.1 Proof of Theorem 3.10

(1) implies (2) Assume that (1) holds with K1 = 1. Integrating the tail bound, we have
E[|X|^k] = ∫₀^∞ P(|X| ≥ t) k t^{k−1} dt ≤ 2k ∫₀^∞ t^{k−1} e^{−t²/σ²} dt = kσ^k ∫₀^∞ u^{k/2−1} e^{−u} du,
where for the last step we made the substitution u = t²/σ². Noting that this final integral is Γ(k/2), we have E[|X|^k] ≤ kσ^k Γ(k/2). Because Γ(s) ≤ s^s for s ≥ 1, we obtain
E[|X|^k]^{1/k} ≤ k^{1/k} σ √(k/2) ≤ e^{1/e} σ √k.
Thus (2) holds with K2 = e^{1/e}.
(2) implies (3) Let σ = ‖X‖ψ2 = sup_{k≥1} k^{−1/2} E[|X|^k]^{1/k}, so that K2 = 1 and E[|X|^k] ≤ k^{k/2} σ^k for all k. For K3 ∈ R₊, we thus have
E[exp(X²/(K3²σ²))] = Σ_{k=0}^∞ E[X^{2k}]/(k! K3^{2k} σ^{2k}) ≤ Σ_{k=0}^∞ (2k)^k σ^{2k}/(k! K3^{2k} σ^{2k}) ≤(i) Σ_{k=0}^∞ (2e/K3²)^k,
where inequality (i) follows because k! ≥ (k/e)^k, or 1/k! ≤ (e/k)^k. Noting that Σ_{k=0}^∞ α^k = 1/(1 − α), we obtain (3) by taking K3 = e√(2/(e − 1)) ≈ 2.933.
(3) implies (4) Let us take K3 = 1. We claim that (4) holds with K4 = 3/4. We prove this result for both small and large λ. First, note the (highly non-standard, but true!) inequality that
e^x ≤ x + e^{9x²/16} for all x.
Then we have
E[exp(λX)] ≤ E[λX] + E[exp(9λ²X²/16)] = E[exp(9λ²X²/16)],
as E[λX] = 0. Now note that for |λ| ≤ 4/(3σ) we have 9λ²σ²/16 ≤ 1, and so by Jensen's inequality,
E[exp(9λ²X²/16)] = E[exp(X²/σ²)^{9λ²σ²/16}] ≤ e^{9λ²σ²/16}.
For large λ, we use the simpler Fenchel-Young inequality, that is, that λx ≤ λ²/(2c) + cx²/2, valid for all c ≥ 0. Then we have for any 0 ≤ c ≤ 2 that
E[exp(λX)] ≤ e^{λ²σ²/(2c)} E[exp(cX²/(2σ²))] ≤ e^{λ²σ²/(2c)} e^{c/2},
where the final inequality follows from Jensen's inequality. If |λ| ≥ 4/(3σ), then 1/2 ≤ (9/32)λ²σ², and we have
E[exp(λX)] ≤ inf_{c∈[0,2]} e^{(1/(2c) + 9c/32)λ²σ²} = exp(3λ²σ²/4).

(4) implies (1) This is the content of Proposition 3.7, with K4 = 1/2 and K1 = 2.
3.4.2 Proof of Theorem 3.14

The argument parallels the proof of Theorem 3.10. For instance, to see that (2) implies (3), one expands E[exp(|X|/(K3σ))] = Σ_{k=0}^∞ E[|X|^k]/(k! K3^k σ^k) ≤ Σ_{k=0}^∞ k^k/(k! K3^k) ≤(i) Σ_{k=0}^∞ (e/K3)^k, where inequality (i) used that k! ≥ (k/e)^k. Taking K3 = e²/(e − 1) < 5 gives the result.
(2) if and only if (4) Up to constant numerical factors, the definition ‖X‖ψ1 = sup_{k≥1} k^{−1} E[|X|^k]^{1/k} makes the moment statement (2) automatic with σ = ‖X‖ψ1. Now, let us assume that (2) holds with K2 = 1, so that σ = ‖X‖ψ1, and that E[X] = 0. Then we have E[X^k] ≤ k^k ‖X‖ψ1^k, and
E[exp(λX)] = 1 + Σ_{k=2}^∞ λ^k E[X^k]/k! ≤ 1 + Σ_{k=2}^∞ λ^k ‖X‖ψ1^k k^k/k! ≤ 1 + Σ_{k=2}^∞ λ^k ‖X‖ψ1^k e^k,
the final inequality following because k! ≥ (k/e)^k. Now, if |λ| ≤ 1/(2e‖X‖ψ1), then we have
E[exp(λX)] ≤ 1 + λ²e²‖X‖²ψ1 Σ_{k=0}^∞ (λ ‖X‖ψ1 e)^k ≤ 1 + 2e² ‖X‖²ψ1 λ²,
as the final sum is at most Σ_{k=0}^∞ 2^{−k} = 2. Using 1 + x ≤ e^x gives that (2) implies (4). For the opposite direction, we may simply use that if (4) holds with K4 = 1 and K4′ = 1, then E[exp(X/σ)] ≤ exp(1), so that (3) holds.
3.5 Bibliography
A few references on concentration, random matrices, and entropies include Vershynin’s extraor-
dinarily readable lecture notes [138], the comprehensive book of Boucheron, Lugosi, and Massart
[32], and the more advanced material in Buldygin and Kozachenko [38]. Many of our arguments
are based on those of Vershynin and Boucheron et al.
3.6 Exercises
Question 3.1 (Concentration of bounded random variables): Let X be a random variable taking values in [a, b], where −∞ < a ≤ b < ∞. In this question, we show Hoeffding's Lemma, that is, that X is sub-Gaussian: for all λ ∈ R, we have
E[exp(λ(X − E[X]))] ≤ exp(λ²(b − a)²/8).

(a) Show that Var(X) ≤ ((b − a)/2)² = (b − a)²/4 for any random variable X taking values in [a, b].

(b) Let
ϕ(λ) = log E[exp(λ(X − E[X]))].
Assuming that E[X] = 0 (convince yourself that this is no loss of generality), show that
ϕ′′(λ) = E[X² e^{λX}]/E[e^{λX}] − (E[X e^{λX}]/E[e^{λX}])².
(You may assume that derivatives and expectations commute, which they do in this case.)

(c) Construct a random variable Yt, defined for t ∈ R, such that Yt ∈ [a, b] and Var(Yt) = ϕ′′(t), and conclude via parts (a) and (b) and Taylor's theorem that ϕ(λ) ≤ λ²(b − a)²/8.
Question 3.2: In this question, we show how to use Bernstein-type (sub-exponential) inequalities to give sharp convergence guarantees. Recall (Example 3.13, Corollary 3.17, and inequality (3.1.8)) that if Xi are independent bounded random variables with |Xi − E[X]| ≤ b for all i and Var(Xi) ≤ σ², then
max{P((1/n) Σ_{i=1}^n Xi ≥ E[X] + t), P((1/n) Σ_{i=1}^n Xi ≤ E[X] − t)} ≤ exp(−(n/2) min{(5/6) t²/σ², t/(2b)}).
We consider minimization of loss functions ℓ over finite function classes F with ℓ ∈ [0, 1], so that if L(f) = E[ℓ(f, Z)] then |ℓ(f, Z) − L(f)| ≤ 1. Throughout this question, we let L* := min_{f∈F} L(f), f* ∈ argmin_{f∈F} L(f), L̂(f) := (1/n) Σ_{i=1}^n ℓ(f, Zi), and f̂n ∈ argmin_{f∈F} L̂(f). We will show that, roughly, a procedure based on picking an empirical risk minimizer is unlikely to choose a function f ∈ F with bad performance, so that we obtain faster concentration guarantees.

(a) Show that for any fixed f ∈ F,
P(L̂(f) ≥ L(f) + t) ∨ P(L̂(f) ≤ L(f) − t) ≤ exp(−(n/2) min{(5/6) t²/(L(f)(1 − L(f))), t/2}).
(b) Define the set of "bad" prediction functions F_ε^bad := {f ∈ F : L(f) ≥ L* + ε}. Show that for any fixed ε ≥ 0 and any f ∈ F_ε^bad, we have
P(L̂(f) ≤ L* + ε) ≤ exp(−(n/2) min{(5/6) ε²/(L*(1 − L*) + ε(1 − ε)), ε/2}).
(d) Using the result of part (c), argue that with probability at least 1 − δ,
L(f̂n) ≤ L(f*) + 4 log(|F|/δ)/n + (12/5) √(L*(1 − L*) log(|F|/δ))/√n.
Why is this better than an inequality based purely on the boundedness of the loss ℓ, such as Theorem 3.40 or Corollary 3.42? What happens when there is a perfect risk minimizer f*?
Question 3.3 (Likelihood ratio bounds and concentration): Consider a data release problem, where given a sample x, we release a sequence of data Z1, Z2, . . . , Zn belonging to a discrete set Z, where Zi may depend on Z1^{i−1} and x. We assume that the data has limited information about x in the sense that for any two samples x, x′, we have the likelihood ratio bound
p(zi | x, z1^{i−1}) / p(zi | x′, z1^{i−1}) ≤ e^ε.
Let us control the amount of "information" (in the form of an updated log-likelihood ratio) released by this sequential mechanism. Fix x, x′, and define
L(z1, . . . , zn) := log [p(z1, . . . , zn | x) / p(z1, . . . , zn | x′)].

(a) Show that
P(L(Z1, . . . , Zn) ≥ nε(e^ε − 1) + t) ≤ exp(−t²/(2nε²)).
(b) Let γ ∈ (0, 1). Give the largest value of ε you can that is sufficient to guarantee that for any test Ψ : Z^n → {x, x′}, we have
Px(Ψ(Z1^n) ≠ x) + Px′(Ψ(Z1^n) ≠ x′) ≥ 1 − γ,
where Px and Px′ denote the sampling distribution of Z1^n under x and x′, respectively.
Question 3.4 (Moment bounds for sums): Let X1, . . . , Xn be independent mean-zero random variables. Show that for any p ≥ 1,
E[|Σ_{i=1}^n Xi|^p] ≤ Cp E[(Σ_{i=1}^n Xi²)^{p/2}],
where Cp is a constant (that depends on p). As a corollary, derive that if E[|Xi|^p] ≤ σ^p and p ≥ 2, then
E[|(1/n) Σ_{i=1}^n Xi|^p] ≤ Cp σ^p/n^{p/2}.
That is, sample means converge quickly to zero in higher moments. Hint: For any fixed x ∈ R^n, if εi are i.i.d. uniform signs εi ∈ {±1}, then ε^T x is sub-Gaussian.
Question 3.5 (Small balls and anti-concentration): Let X be a nonnegative random variable satisfying P(X ≤ ε) ≤ cε for some c < ∞ and all ε > 0. Argue that if Xi are i.i.d. copies of X, then
P((1/n) Σ_{i=1}^n Xi ≥ t) ≥ 1 − exp(−2n [1/2 − 2ct]₊²)
for all t.
Question 3.6 (Lipschitz functions remain sub-Gaussian): Let X be σ 2 -sub-Gaussian and f :
R → R be L-Lipschitz, meaning that |f (x) − f (y)| ≤ L|x − y| for all x, y. Prove that there exists a
numerical constant C < ∞ such that f (X) is CL2 σ 2 -sub-Gaussian.
Question 3.7 (Sub-Gaussian maxima): Let X1, . . . , Xn be σ²-sub-Gaussian (not necessarily independent) random variables. Show that
(a) E[max_i Xi] ≤ √(2σ² log n).
(b) There exists a numerical constant C < ∞ such that E[max_i |Xi|^p] ≤ (Cpσ² log n)^{p/2}.
Question 3.8: Consider a binary classification problem with logistic loss ℓ(θ; (x, y)) = log(1 + exp(−yθ^T x)), where θ ∈ Θ := {θ ∈ R^d | ‖θ‖₁ ≤ r} and y ∈ {±1}. Assume additionally that the space X ⊂ {x ∈ R^d | ‖x‖∞ ≤ b}. Define the empirical and population risks L̂n(θ) := Pn ℓ(θ; (X, Y)) and L(θ) := P ℓ(θ; (X, Y)), and let θ̂n = argmin_{θ∈Θ} L̂n(θ). Show that with probability at least 1 − δ over (Xi, Yi) drawn i.i.d. from P,
L(θ̂n) ≤ inf_{θ∈Θ} L(θ) + C rb √(log(d/δ))/√n,
where C < ∞ is a numerical constant (you need not specify this).
Chapter 4
so that Pn f = (1/n) Σ_{i=1}^n f(Xi) denotes the empirical expectation when Pn is the empirical measure on the sample {X1, . . . , Xn}.
We begin with the Donsker-Varadhan change-of-measure theorem, a variational representation of the KL-divergence.

Theorem 4.1 (Donsker-Varadhan). Let P and Q be distributions on a space X. Then
Dkl(P||Q) = sup_g {E_P[g(X)] − log E_Q[e^{g(X)}]},
where the supremum is taken over measurable functions g : X → R such that E_Q[e^{g(X)}] < ∞.
Proof We may assume that P is absolutely continuous with respect to Q, as otherwise both sides are infinite by inspection. Thus, we assume without loss of generality that P and Q have densities p and q.
Attainment in the equality is easy: we simply take g(x) = log(p(x)/q(x)), so that E_Q[e^{g(X)}] = 1. To show that the right hand side is never larger than Dkl(P||Q) requires a bit more work. To that end, let g be any function such that E_Q[e^{g(X)}] < ∞, and define the random variable Zg(x) = e^{g(x)}/E_Q[e^{g(X)}], so that E_Q[Zg] = 1. Then using the absolute continuity of P w.r.t. Q, we have
E_P[log Zg] = E_P[log(p(X)/q(X)) + log((q(X)/p(X)) Zg(X))] = Dkl(P||Q) + E_P[log(Zg dQ/dP)]
≤ Dkl(P||Q) + log E_P[Zg dQ/dP] = Dkl(P||Q) + log E_Q[Zg],
by Jensen's inequality. As E_Q[Zg] = 1, using that E_P[log Zg] = E_P[g(X)] − log E_Q[e^{g(X)}] gives the result.
The Donsker-Varadhan representation already gives a hint that we can use some information-theoretic techniques to control the difference between an empirical sample and its expectation, at least in an average sense. In particular, we see that for any function g, we have
E_P[g(X)] ≤ Dkl(P||Q) + log E_Q[e^{g(X)}]
for any random variable X. Now, turning this on its head a bit, suppose that we consider a collection of functions F and put two probability measures π and π0 on F, and consider Pn f − P f, where we consider f a random variable f ∼ π or f ∼ π0. Then a consequence of the Donsker-Varadhan theorem is that
∫(Pn f − P f) dπ(f) ≤ Dkl(π||π0) + log ∫ exp(Pn f − P f) dπ0(f)
for any π, π0. While this inequality is a bit naive—bounding a difference by an exponent seems wasteful—as we shall see, it has substantial applications when we can upper bound the KL-divergence Dkl(π||π0).
In the statement and proof of the coming theorem, we use the calculation of expression (3.1.3): if Z is mean-zero and σ²-sub-Gaussian, then E[exp(λZ²)] ≤ [1 − 2λσ²]₊^{−1/2} (Lemma 4.2).

Theorem 4.3. Let F be a collection of functions such that f(X) − P f is σ²-sub-Gaussian for each f ∈ F, let X1, . . . , Xn be drawn i.i.d. according to P, and let π0 be any fixed (prior) distribution on F. Then with probability at least 1 − δ, simultaneously for all distributions π on F (we denote this collection by Π),
∫(Pn f − P f)² dπ(f) ≤ (8σ²/(3n)) [Dkl(π||π0) + log(2/δ)].

Proof Without loss of generality, we assume that P f = 0 for all f ∈ F, and recall that Pn f = (1/n) Σ_{i=1}^n f(Xi) is the empirical mean of f. Then we know that Pn f is σ²/n-sub-Gaussian, and Lemma 4.2 implies that E[exp(λ(Pn f)²)] ≤ [1 − 2λσ²/n]₊^{−1/2} for any f, and thus for any "prior" π0 on f we have
E[∫ exp(λ(Pn f)²) dπ0(f)] ≤ [1 − 2λσ²/n]₊^{−1/2}.
Consequently, taking λ = λn := 3n/(8σ²), we obtain
E[∫ exp(λn(Pn f)²) dπ0(f)] = E[∫ exp((3n/(8σ²))(Pn f)²) dπ0(f)] ≤ 2.
This holds without any probabilistic qualifications, so using the application (4.2.1) of Markov's inequality with λ = λn, we thus see that with probability at least 1 − δ over X1, . . . , Xn, simultaneously for all distributions π,
λn ∫(Pn f − P f)² dπ(f) ≤ Dkl(π||π0) + log(2/δ),
which gives the theorem upon dividing by λn.
By Jensen’s inequality (or Cauchy-Schwarz), it is immediate from Theorem 4.3 that we also
have
s
8σ 2 Dkl (π||π0 ) + log 2δ
Z
|Pn f − P f |dπ(f ) ≤ simultaneously for all π ∈ Π (4.2.2)
3 n
√
with probability at least 1 − δ, so that Eπ [|Pn f − P f |] is with high probability of order 1/ n. The
inequality (4.2.2) is the original form of the PAC-Bayes bound due to McAllester, with slightly
sharper constants and improved logarithmic dependence. The key is that stability, in the form of a
prior π0 and posterior π closeness, allow us to achieve reasonably tight control over the deviations
of random variables and functions with high probability.
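The bound (4.2.2) is directly computable once the KL-divergence is known; the sketch below (an illustration with hypothetical parameters) evaluates it for the common choice of Gaussian prior N(0, τ²I) and posterior N(θ̂, τ²I), for which Dkl(π||π0) = ‖θ̂‖₂²/(2τ²).

```python
# Evaluating the PAC-Bayes bound (4.2.2) (illustrative) with Gaussian
# prior/posterior, so D_kl(pi || pi_0) = ||theta_hat||^2 / (2 tau^2).
import math

def pac_bayes_gap(kl: float, sigma2: float, n: int, delta: float) -> float:
    return math.sqrt(8 * sigma2 / 3 * (kl + math.log(2 / delta)) / n)

norm_theta, tau2, sigma2, n, delta = 5.0, 1.0, 0.25, 50_000, 0.05
kl = norm_theta**2 / (2 * tau2)
print("KL:", kl, "bound on E_pi |P_n f - P f|:",
      pac_bayes_gap(kl, sigma2, n, delta))
```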
Let us give an example, which is similar to many of our approaches in Section 3.3.2, to illustrate
some of the approaches this allows. The basic idea is that by appropriate choice of prior π0 and
“posterior” π, whenever we have appropriately smooth classes of functions we achieve certain
generalization guarantees.
Example 4.4 (A uniform law for Lipschitz functions): Consider a case as in Section 3.3.2, where we let L(θ) = P ℓ(θ, Z) for some function ℓ : Θ × Z → R. Let B₂^d = {v ∈ R^d | ‖v‖₂ ≤ 1} be the ℓ2-ball in R^d, and let us assume that Θ ⊂ rB₂^d and additionally that θ ↦ ℓ(θ, z) is L-Lipschitz for all z ∈ Z. For simplicity, we assume that ℓ(θ, z) ∈ [0, Lr] for all θ ∈ Θ (though it is possible to avoid this by relativizing our bounds by replacing ℓ by ℓ(·, z) − inf_{θ∈Θ} ℓ(θ, z)). If L̂n(θ) = Pn ℓ(θ, Z), then Theorem 4.3 implies that
∫|L̂n(θ) − L(θ)| dπ(θ) ≤ √((2L²r²/(3n)) (Dkl(π||π0) + log(2/δ)))
for all π with probability at least 1 − δ. Now, let θ0 ∈ Θ be arbitrary, and for ε > 0 (to be chosen later) take π0 to be uniform on (r + ε)B₂^d and π to be uniform on θ0 + εB₂^d. Then we immediately see that Dkl(π||π0) = d log(1 + r/ε). Moreover, we have ∫L̂n(θ) dπ(θ) ∈ L̂n(θ0) ± Lε, and similarly for L(θ), by the L-Lipschitz continuity of ℓ. For any fixed ε > 0, we thus have
|L̂n(θ0) − L(θ0)| ≤ 2Lε + √((2L²r²/(3n)) (d log(1 + r/ε) + log(2/δ)))
simultaneously for all θ0 ∈ Θ, with probability at least 1 − δ. By choosing ε = rd/n we obtain that with probability at least 1 − δ,
sup_{θ∈Θ} |L̂n(θ) − L(θ)| ≤ 2Lrd/n + √((2L²r²/(3n)) (d log(1 + n/d) + log(2/δ))).
Thus, roughly, with high probability we have |L̂n(θ) − L(θ)| ≤ O(1) Lr √((d/n) log(n/d)) for all θ. 3
On the one hand, the result in Example 4.4 is satisfying: it applies to any Lipschitz function
and provides a uniform bound. On the other hand, when we compare to the results achievable for
specially structured linear function classes, then applying Rademacher complexity bounds—such
as Proposition 3.45 and Example 3.46—we have somewhat weaker results, in that they depend on
the dimension explicitly, while the Rademacher bounds do not exhibit this explicit dependence.
This means they can potentially apply in infinite dimensional spaces that Example 4.4 cannot. We
will give an example presently showing how to address some of these issues.
make this precise, let F be a collection of functions f : X → R, and let σ²(f) := Var(f(X)) be the variance of functions in F. We assume the class satisfies the Bernstein condition (3.1.9) with parameter b, that is,
E[(f(X) − P f)^k] ≤ (k!/2) σ²(f) b^{k−2} for k = 3, 4, . . . .    (4.2.3)
This says that the second moment of functions f ∈ F bounds—with the additional boundedness-type constant b—the higher moments of functions in F. We then have the following result.
Proposition 4.5. Let F be a collection of functions f : X → R satisfying the Bernstein condition (4.2.3). Then for any |λ| ≤ 1/(2b), with probability at least 1 − δ,
λ ∫P f dπ(f) − λ² ∫σ²(f) dπ(f) ≤ λ ∫Pn f dπ(f) + (1/n)[Dkl(π||π0) + log(1/δ)]
simultaneously for all π ∈ Π.
Proof We begin with an inequality on the moment generating function of random variables satisfying the Bernstein condition (3.1.9), that is, that |E[(X − µ)^k]| ≤ (k!/2) σ² b^{k−2} for k ≥ 2. In this case, Lemma 3.18 implies that
E[e^{λ(X−µ)}] ≤ exp(λ²σ²)
for |λ| ≤ 1/(2b). As a consequence, for any f in our collection F, we see that if we define
∆n(f, λ) := λ(Pn f − P f − λσ²(f)),
we have that
E[exp(n∆n(f, λ))] = E[exp(λ(f(X) − P f) − λ²σ²(f))]^n ≤ 1
for all n, f ∈ F, and |λ| ≤ 1/(2b). Then, for any fixed measure π0 on F, Markov's inequality implies that
P(∫ exp(n∆n(f, λ)) dπ0(f) ≥ 1/δ) ≤ δ.    (4.2.4)
Now, as in the proof of Theorem 4.3, we use the Donsker-Varadhan Theorem 4.1 (change of measure), which implies that
n ∫∆n(f, λ) dπ(f) ≤ Dkl(π||π0) + log ∫ exp(n∆n(f, λ)) dπ0(f)
for all distributions π. Using inequality (4.2.4), we obtain that with probability at least 1 − δ,
∫∆n(f, λ) dπ(f) ≤ (1/n)[Dkl(π||π0) + log(1/δ)]
for all π. As this holds for any fixed |λ| ≤ 1/(2b), this gives the desired result by rearranging.
We would like to optimize over the bound in Proposition 4.5 by choosing the "best" λ. If we could choose the optimal λ, by rearranging Proposition 4.5 we would obtain the bound
Eπ[P f] ≤ Eπ[Pn f] + inf_{λ>0} {λ Eπ[σ²(f)] + (1/(nλ))[Dkl(π||π0) + log(1/δ)]}
= Eπ[Pn f] + 2 √(Eπ[σ²(f)] [Dkl(π||π0) + log(1/δ)]/n)
simultaneously for all π, with probability at least 1−δ. The problem with this approach is two-fold:
first, we cannot arbitrarily choose λ in Proposition 4.5, and second, the bound above depends on
the unknown population variance σ 2 (f ). It is thus of interest to understand situations in which
we can obtain similar guarantees, but where we can replace unknown population quantities on the
right side of the bound with known quantities.
To that end, let us consider the following condition, a type of relative error condition related to the Bernstein condition (3.1.9): for each f ∈ F,
σ²(f) ≤ b P f.    (4.2.5)
This condition is most natural when each of the functions f take nonnegative values—for example,
when f (X) = `(θ, X) for some loss function ` and parameter θ of a model. If the functions f are
nonnegative and upper bounded by b, then we certainly have σ 2 (f ) ≤ E[f (X)2 ] ≤ bE[f (X)] = bP f ,
so that Condition (4.2.5) holds. Revisiting Proposition 4.5, we rearrange to obtain the following
theorem.
Theorem 4.6. Let F be a collection of functions satisfying the Bernstein condition (4.2.3) as in Proposition 4.5, and in addition, assume the variance-bounding condition (4.2.5). Then for any 0 < λ ≤ 1/(2b), with probability at least 1 − δ,
Eπ[P f] ≤ Eπ[Pn f] + (λb/(1 − λb)) Eπ[Pn f] + (1/(λ(1 − λb) n))[Dkl(π||π0) + log(1/δ)]
for all π.
Proof We use condition (4.2.5) to see that
λ Eπ[P f] − λ²b Eπ[P f] ≤ λ Eπ[P f] − λ² Eπ[σ²(f)],
apply Proposition 4.5, and divide both sides of the resulting inequality by λ(1 − λb).
To make this uniform in λ, thus achieving a tighter bound (so that we need not pre-select λ), we choose multiple values of λ and apply a union bound. To that end, let 1 + η = 1/(1 − λb), or η = λb/(1 − λb) and 1/(λb(1 − λb)) = (1 + η)²/η, so that the inequality in Theorem 4.6 is equivalent to
Eπ[P f] ≤ Eπ[Pn f] + η Eπ[Pn f] + ((1 + η)²/η)(b/n)[Dkl(π||π0) + log(1/δ)].
Using that our choice of η ∈ (0, 1], this implies
Eπ[P f] ≤ Eπ[Pn f] + η Eπ[Pn f] + (b/(ηn))[Dkl(π||π0) + log(1/δ)] + (3b/n)[Dkl(π||π0) + log(1/δ)].
Now, take η1 = 1/n, η2 = 2/n, . . . , ηn = 1. Then by optimizing over η ∈ {η1, . . . , ηn} (which is equivalent, to within a 1/n factor, to optimizing over 0 < η ≤ 1) and applying a union bound, we obtain

Corollary 4.7. Let the conditions of Theorem 4.6 hold. Then with probability at least 1 − δ,
Eπ[P f] ≤ Eπ[Pn f] + 2 √(b Eπ[Pn f] [Dkl(π||π0) + log(n/δ)]/n) + (1/n) Eπ[Pn f] + (5b/n)[Dkl(π||π0) + log(n/δ)],
simultaneously for all π on F.
Proof Applying Theorem 4.6 with δ/n in place of δ and a union bound over the values of λ corresponding to each η, we have
Eπ[P f] ≤ Eπ[Pn f] + η Eπ[Pn f] + (b/(ηn))[Dkl(π||π0) + log(n/δ)] + (3b/n)[Dkl(π||π0) + log(n/δ)]
for each η ∈ {1/n, . . . , 1}. We consider two cases. In the first, assume that Eπ[Pn f] ≤ (b/n)(Dkl(π||π0) + log(n/δ)). Then taking η = 1 above evidently gives the result. In the second, we have Eπ[Pn f] > (b/n)(Dkl(π||π0) + log(n/δ)), and we can set
η⋆ = √((b/n)(Dkl(π||π0) + log(n/δ))/Eπ[Pn f]) ∈ (0, 1).
Choosing η to be the smallest value ηk in {η1, . . . , ηn} with ηk ≥ η⋆, so that η⋆ ≤ η ≤ η⋆ + 1/n, then implies the claim in the corollary.
Let us revisit the loss minimization approaches central to Section 3.3.2 and Example 4.4 in the
context of Corollary 4.7. We will investigate an approach to achieve convergence guarantees that
are (nearly) independent of dimension, focusing on 0-1 losses in a binary classification problem.
Example 4.8 (Large margins and PAC-Bayes): Consider a binary classification problem with data (x, y) ∈ Rd × {±1}, where we make predictions ⟨θ, x⟩ (or its sign), and for a margin penalty γ ≥ 0 we define the loss

    ℓγ(θ; (x, y)) := 1{⟨θ, x⟩y ≤ γ}.

We call the quantity ⟨θ, x⟩y the margin of θ on the pair (x, y), noting that when the margin is large, ⟨θ, x⟩ has the same sign as y and is “confident” (i.e. far from zero). For shorthand, let us define the expected and empirical losses at margin γ by

    Lγ(θ) := P ℓγ(θ; (X, Y)) and L̂γ(θ) := Pn ℓγ(θ; (X, Y)).
Consider the following scenario: the data x lie in a ball of radius b, so that ‖x‖2 ≤ b; note that the losses ℓγ and ℓ0 satisfy the Bernstein (4.2.3) and self-bounding (4.2.5) conditions with constant 1, as they take values in {0, 1}. Let π0 be N(0, τ²I) for some τ > 0 to be chosen, and let π be N(θ̂, τ²I) for some θ̂ ∈ Rd satisfying ‖θ̂‖2 ≤ r. Then Corollary 4.7 implies that

    Eπ[Lγ(θ)] ≤ Eπ[L̂γ(θ)] + 2 √( (Eπ[L̂γ(θ)]/n) [Dkl(π||π0) + log(n/δ)] ) + (1/n) Eπ[L̂γ(θ)] + (C/n) [Dkl(π||π0) + log(n/δ)]
              ≤ Eπ[L̂γ(θ)] + 2 √( (Eπ[L̂γ(θ)]/n) [r²/(2τ²) + log(n/δ)] ) + (1/n) Eπ[L̂γ(θ)] + (C/n) [r²/(2τ²) + log(n/δ)],

because Dkl(N(θ̂, τ²I)||N(0, τ²I)) = ‖θ̂‖2²/(2τ²) ≤ r²/(2τ²).
Let us use the margin assumption. Note that if Z ∼ N(0, τ²I), then for any fixed θ0, x, y we have

    ℓ0(θ0; (x, y)) − P(Z^T x ≥ γ) ≤ E[ℓγ(θ0 + Z; (x, y))] ≤ ℓ2γ(θ0; (x, y)) + P(Z^T x ≥ γ),

where the middle expectation is over Z ∼ N(0, τ²I). Using the τ²‖x‖2²-sub-Gaussianity of Z^T x, we obtain immediately that if ‖x‖2 ≤ b, we have

    ℓ0(θ0; (x, y)) − exp(−γ²/(2τ²b²)) ≤ E[ℓγ(θ0 + Z; (x, y))] ≤ ℓ2γ(θ0; (x, y)) + exp(−γ²/(2τ²b²)).
Returning to our earlier bound, we evidently have that if ‖x‖2 ≤ b for all x ∈ X, then with probability at least 1 − δ, simultaneously for all θ ∈ Rd with ‖θ‖2 ≤ r,

    L0(θ) ≤ L̂2γ(θ) + 2 exp(−γ²/(2τ²b²)) + 2 √( ((L̂2γ(θ) + exp(−γ²/(2τ²b²)))/n) [r²/(2τ²) + log(n/δ)] )
            + (1/n) [L̂2γ(θ) + exp(−γ²/(2τ²b²))] + (C/n) [r²/(2τ²) + log(n/δ)].

Setting τ² = γ²/(2b² log n), so that exp(−γ²/(2τ²b²)) = 1/n and r²/(2τ²) = r²b² log n/γ², we immediately see that for any choice of margin γ > 0, we have with probability at least 1 − δ that

    L0(θ) ≤ L̂2γ(θ) + 2/n + 2 √( ((L̂2γ(θ) + 1/n)/n) [r²b² log n/γ² + log(n/δ)] )
            + (1/n) L̂2γ(θ) + 1/n + (C/n) [r²b² log n/γ² + log(n/δ)]
for all ‖θ‖2 ≤ r. Rewriting (replacing 2γ with γ) and ignoring lower-order terms, we have (roughly) that there exists a constant C < ∞ such that for any fixed margin γ > 0, with high probability

    sup_{θ∈Θ} { P(⟨θ, X⟩Y ≤ 0) − Pn(⟨θ, X⟩Y ≤ γ) − C (rb√(log n))/(γ√n) √(Pn(⟨θ, X⟩Y ≤ γ)) } ≤ 0.    (4.2.6)
Thus, if we choose θ to encourage large empirical margins, say by minimizing a surrogate loss Pn φ(⟨θ, X⟩Y) for some decreasing convex φ : R → R+, e.g. φ(t) = [1 − t]+ or φ(t) = log(1 + e^{−t})—then we get strong generalization performance guarantees relative to the empirical margin γ. 3
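To get a feel for the behavior of the bound (4.2.6), the following small simulation (an illustration that is not part of the notes, in Python with numpy, with the unspecified constant C set to 1 purely for display) draws a synthetic distribution on the sphere of radius b and compares the population zero-one loss with the empirical margin loss plus the margin penalty:

import numpy as np

rng = np.random.default_rng(0)
d, n, b, r = 20, 2000, 1.0, 1.0

# A fixed predictor theta with ||theta||_2 <= r; labels are noisy signs,
# so the data are nearly (but not perfectly) linearly separable.
theta = rng.normal(size=d)
theta *= r / np.linalg.norm(theta)

def sample(m):
    x = rng.normal(size=(m, d))
    x *= b / np.linalg.norm(x, axis=1, keepdims=True)  # ||x||_2 = b
    y = np.sign(x @ theta + 0.1 * rng.normal(size=m))
    return x, y

x_train, y_train = sample(n)
x_test, y_test = sample(200_000)

margins = (x_train @ theta) * y_train                # <theta, x> y
zero_one = np.mean((x_test @ theta) * y_test <= 0)   # ~ P(<theta,X>Y <= 0)

C = 1.0  # stand-in for the unspecified numerical constant in (4.2.6)
for gamma in [0.05, 0.1, 0.2]:
    emp = np.mean(margins <= gamma)                  # P_n(<theta,X>Y <= gamma)
    slack = C * r * b * np.sqrt(np.log(n)) / (gamma * np.sqrt(n)) * np.sqrt(emp)
    print(f"gamma={gamma:.2f}: P(err)={zero_one:.4f}, "
          f"P_n(<=gamma)={emp:.4f}, bound={emp + slack:.4f}")

Larger margins γ shrink the penalty term but increase the empirical margin loss, exactly the tradeoff the analysis above suggests.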
by Corollary 3.9 (sub-Gaussian concentration) and a union bound. Thus, so long as |Φ| is not
exponential in the sample size n, we expect uniformly high accuracy.
Example 4.9 (Risk minimization via statistical queries): Suppose that we are in the loss-minimization setting (3.3.3), where the losses ℓ(θ, Xi) are convex and differentiable in θ. Then gradient descent applied to L̂n(θ) = Pn ℓ(θ, X) will converge to a minimizing value of L̂n. We can evidently implement gradient descent by a sequence of statistical queries φ(x) = ∇θ ℓ(θ, x), iterating

    θ^(k+1) = θ^(k) − αk Pn φ^(k),    (4.3.2)

where φ^(k)(x) = ∇θ ℓ(θ^(k), x) and αk is a stepsize. 3
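To make the reduction concrete, here is a minimal sketch (my own illustration in Python with numpy, not code from the notes) of the iteration (4.3.2) for the logistic loss, in which the only access to the data is through empirical averages Pn φ of the gradient queries φ(x) = ∇θ ℓ(θ, x):

import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = 2.0 * (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ theta_star))) - 1

def statistical_query(phi):
    """The only access to the data: the empirical average P_n phi."""
    return np.mean([phi(xi, yi) for xi, yi in zip(X, y)], axis=0)

def grad_logistic(theta):
    # phi(x, y) = gradient of log(1 + exp(-y <theta, x>)) at theta
    return lambda xi, yi: -yi * xi / (1 + np.exp(yi * (xi @ theta)))

theta = np.zeros(d)
for k in range(200):
    theta = theta - 0.5 * statistical_query(grad_logistic(theta))  # (4.3.2)

print("final empirical logistic loss:",
      np.mean(np.log(1 + np.exp(-y * (X @ theta)))))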
One issue with Example 4.9 is that we are interacting with the dataset, because each sequential query φ^(k) depends on the previous k − 1 queries. (Our results on uniform convergence
of empirical functionals and related ideas address many of these challenges, so that the result of
the process (4.3.2) will be well-behaved regardless of the interactivity.)
We consider an interactive version of the statistical query estimation problem. In this version,
there are two parties: an analyst (or statistician or learner), who issues queries φ : X → R, and
a mechanism that answers the queries to the analyst. We index our functionals φ by t ∈ T for a
(possibly infinite) set T , so we have a collection {φt }t∈T . In this context, we thus have the following
scheme:
Input: Sample X1n drawn i.i.d. P, collection {φt}t∈T of possible queries
Repeat: for k = 1, 2, . . .
  i. Analyst chooses an index Tk ∈ T and issues the query φTk
  ii. Mechanism responds with an approximation Ak to EP[φTk(X)]

Figure 4.1: An interactive statistical-query scheme.

Of interest in the scheme of Figure 4.1 is that we interactively choose T1, T2, . . . , Tk, where the choice Ti
may depend on our approximations of EP [φTj (X)] for j < i, that is, on the results of our previous
queries. Even more broadly, the analyst may be able to choose the index Tk in alternative ways
depending on the sample X1n , and our goal is to still be able to accurately compute expectations
P φT = EP [φT (X)] when the index T may depend on X1n . The setting in Figure 4.1 clearly breaks
with the classical statistical setting in which an analysis is pre-specified before collecting data, but
more closely captures modern data exploration practices.
Theorem 4.10. Let {φt}t∈T be a collection of σ²-sub-Gaussian functions φt : X → R. Then for any random variable T and any λ > 0,

    E[(Pn φT − P φT)²] ≤ (1/λ) [ I(X1n; T) − (1/2) log(1 − 2λσ²/n)₊ ]

and

    |E[Pn φT] − E[P φT]| ≤ √( (2σ²/n) I(X1n; T) ),

where the expectations are taken over T and the sample X1n.
Proof  The proof is similar to that of our first basic PAC-Bayes result in Theorem 4.3. Let us assume w.l.o.g. that P φt = 0 for all t ∈ T, noting that then Pn φt is σ²/n-sub-Gaussian. We prove the first result first. Lemma 4.2 implies that E[exp(λ(Pn φt)²)] ≤ (1 − 2λσ²/n)₊^{−1/2} for each t ∈ T. As a consequence, we obtain via the Donsker-Varadhan equality (Theorem 4.1) that

    λ E[ ∫ (Pn φt)² dπ(t) ] ≤(i) E[Dkl(π||π0)] + E[ log ∫ exp(λ(Pn φt)²) dπ0(t) ]
                            ≤(ii) E[Dkl(π||π0)] + log ∫ E[exp(λ(Pn φt)²)] dπ0(t)
                            ≤(iii) E[Dkl(π||π0)] − (1/2) log(1 − 2λσ²/n)₊

for all distributions π on T, which may depend on Pn, where the expectation E is taken over the sample X1n ∼iid P. (Here inequality (i) is Theorem 4.1, inequality (ii) is Jensen's inequality, and inequality (iii) is Lemma 4.2.) Now, let π0 be the marginal distribution of T (marginally over all observations X1n), and let π denote the posterior of T conditional on the sample X1n. Then E[Dkl(π||π0)] = I(X1n; T) by definition of the mutual information, giving the bound on the squared error.

For the second result, note that the Donsker-Varadhan equality implies

    λ E[ ∫ Pn φt dπ(t) ] ≤ E[Dkl(π||π0)] + log ∫ E[exp(λ Pn φt)] dπ0(t) ≤ I(X1n; T) + λ²σ²/(2n).

Dividing both sides by λ and minimizing over λ > 0 gives E[Pn φT] ≤ √(2σ² I(X1n; T)/n), and performing the same analysis with −φT gives the second result of the theorem.
The key in the theorem is that if the mutual information—the Shannon information—I(X1n; T) between the sample X1n and T is small, then the expected squared error can be small. To make this a bit clearer, let us choose values for λ in the theorem; taking λ = n/(2eσ²) gives the following corollary.

Corollary 4.11. Let the conditions of Theorem 4.10 hold. Then

    E[(Pn φT − P φT)²] ≤ (2eσ²/n) I(X1n; T) + 5σ²/(4n).

Consequently, if we can limit the amount of information any particular query T (i.e., φT) contains about the actual sample X1n, then we guarantee reasonably high accuracy in the second moment errors (Pn φT − P φT)².
Example 4.12 (A stylized correlation analysis): Consider the following stylized genetics
experiment. We observe vectors X ∈ {−1, 1}k , where Xj = 1 if gene j is expressed and −1
otherwise. We also observe phenotypes Y ∈ {−1, 1}, where Y = 1 indicates appearance of the
phenotype. In our setting, we will assume that the vectors X are uniform on {−1, 1}k and
independent of Y , but an experimentalist friend of ours wishes to know if there exists a vector
v with kvk2 = 1 such that the correlation between v T X and Y is high, meaning that v T X
is associated with Y. In our notation here, we have index set T = {v ∈ Rk | ‖v‖2 = 1}, and by Example 3.6, Hoeffding's lemma, and the independence of the coordinates of X we have that v^T XY is ‖v‖2²/4 = 1/4-sub-Gaussian. Now, we recall the fact that if Zj, j = 1, . . . , k, are σ²-sub-Gaussian, then for any p ≥ 1, we have

    E[max_j |Zj|^p] ≤ (C p σ² log k)^{p/2}

for a numerical constant C. That is, powers of sub-Gaussian maxima grow at most logarithmically. Indeed, by Theorem 3.10, we have for any q ≥ 1 by Hölder's inequality that

    E[max_j |Zj|^p] ≤ E[ (Σ_j |Zj|^{pq})^{1/q} ] ≤ k^{1/q} (C p q σ²)^{p/2},

and setting q = log k gives the inequality. Thus, we see that for any a priori fixed v1, . . . , vk, vk+1, we have

    E[max_j (vj^T (Pn Y X))²] ≤ O(1) (log k)/n.
If instead we allow a single interaction, the problem is different. We issue queries associated with v = e1, . . . , ek, the k standard basis vectors; then we simply set Vk+1 = Pn Y X/‖Pn Y X‖2. Then evidently

    E[(Vk+1^T (Pn Y X))²] = E[‖Pn Y X‖2²] = k/n,
which is exponentially larger than in the non-interactive case. That is, if an analyst is allowed
to interact with the dataset, he or she may be able to discover very large correlations that are
certainly false in the population, which in this case has P XY = 0. 3
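The gap between the non-interactive (log k)/n scaling and the interactive k/n scaling is easy to see empirically. The following simulation (an illustration in Python with numpy, not from the notes) issues the k coordinate queries e1, . . . , ek and then the single adaptive query Vk+1 = Pn Y X/‖Pn Y X‖2:

import numpy as np

rng = np.random.default_rng(2)
k, n, trials = 256, 100, 200

fixed, adaptive = [], []
for _ in range(trials):
    X = rng.choice([-1.0, 1.0], size=(n, k))
    Y = rng.choice([-1.0, 1.0], size=n)        # independent of X
    g = np.mean(Y[:, None] * X, axis=0)        # P_n Y X
    fixed.append(np.max(g ** 2))               # best of the k fixed queries
    v = g / np.linalg.norm(g)                  # one adaptively chosen query
    adaptive.append(np.dot(v, g) ** 2)         # equals ||P_n Y X||_2^2

print(f"fixed queries:  max_j (e_j^T P_n YX)^2 ~ {np.mean(fixed):.4f}"
      f"  (log k / n = {np.log(k) / n:.4f})")
print(f"adaptive query: (V^T P_n YX)^2        ~ {np.mean(adaptive):.4f}"
      f"  (k / n = {k / n:.4f})")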
Example 4.12 shows that, without being a little careful, substantial issues may arise in interac-
tive data analysis scenarios. When we consider our goal more broadly, which is to be able to provide
accurate approximations to P φ for queries φ chosen adaptively for any population distribution P
and φ : X → [−1, 1], it is possible to construct quite perverse situations, where if we compute
sample expectations Pn φ exactly, one round of interaction is sufficient to find a query φ for which
Pn φ − P φ ≥ 1.
Example 4.13 (Exact query answering allows arbitrary corruption): Suppose we draw a sample X1n of size n on a sample space X = [m] with Xi ∼iid Uniform([m]), where m ≥ 2n. Let Φ be the collection of all functions φ : [m] → [−1, 1], so that P(|Pn φ − P φ| ≥ t) ≤ exp(−nt²/2) for any fixed φ. Suppose that in the interactive scheme in Fig. 4.1, we simply release answers A = Pn φ. Consider the following query:
because of the output of the first in arbitrary ways, they should remain jointly stable. Second, our
notion should bound the mutual information I(X1n ; T ) between the sample X1n and T . Lastly, we
remark that this control on the mutual information has an additional benefit: by the data process-
ing inequality, any downstream analysis we perform that depends only on T necessarily satisfies the
same stability and information guarantees as T , because if we have the Markov chain X1n → T → V
then I(X1n ; V ) ≤ I(X1n ; T ).
We consider randomized algorithms A : X n → A, taking values in our index set A, where
A(X1n ) ∈ A is a random variable that depends on the sample X1n . For simplicity in derivation,
we abuse notation in this section, and for random variables X and Y with distributions P and Q
respectively, we denote
Dkl (X||Y ) := Dkl (P ||Q) .
We make the following definition.

Definition 4.1 (KL-stability). Let ε ≥ 0. A randomized algorithm A : X n → A is ε-KL-stable if for each i ∈ {1, . . . , n} there exists a randomized algorithm Ai : X n−1 → A such that

    (1/n) Σ_{i=1}^n Dkl( A(x1n) || Ai(x\i) ) ≤ ε for all x1n ∈ X n.

Example 4.14 (KL-stability in mean estimation: Gaussian noise addition): Suppose that the Xi take values in [−1, 1], and consider the mechanism A(x1n) = (1/n) Σ_{i=1}^n xi + Z for Z ∼ N(0, σ²). Taking Ai(x\i) = (1/n) Σ_{j≠i} xj + Z, the means of the two Gaussians A(x1n) and Ai(x\i) differ by xi/n, so that

    (1/n) Σ_{i=1}^n Dkl( A(x1n) || Ai(x\i) ) = (1/n) Σ_{i=1}^n (1/(2σ²n²)) xi² ≤ 1/(2σ²n²),

so that the sample mean of a bounded random variable perturbed with Gaussian noise is ε = 1/(2σ²n²)-KL-stable. 3
Example 4.15 (KL-stability in mean estimation: Laplace noise addition): Let the conditions of Example 4.14 hold, but suppose instead of Gaussian noise we add scaled Laplace noise, that is, A(x1n) = (1/n) Σ_{i=1}^n xi + Z for Z with density p(z) = (1/(2σ)) exp(−|z|/σ), where σ > 0. Then using that if Lµ,σ denotes the Laplace distribution with shape σ and mean µ, with density p(z) = (1/(2σ)) exp(−|z − µ|/σ), we have

    Dkl(Lµ0,σ||Lµ1,σ) = (1/σ²) ∫_0^{|µ1−µ0|} exp(−z/σ)(|µ1 − µ0| − z) dz
                      = exp(−|µ1 − µ0|/σ) − 1 + |µ1 − µ0|/σ ≤ |µ1 − µ0|²/(2σ²),

we see that in this case the sample mean of a bounded random variable perturbed with Laplace noise is ε = 1/(2σ²n²)-KL-stable, where σ is the shape parameter. 3
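The closed form for the Laplace divergence is easy to check numerically. The following sketch (an illustration in Python with numpy, not from the notes) compares a Riemann-sum evaluation of Dkl(Lµ0,σ||Lµ1,σ) against the formula above:

import numpy as np

def laplace_kl_closed_form(mu0, mu1, sigma):
    delta = abs(mu1 - mu0) / sigma
    return np.exp(-delta) - 1 + delta

def laplace_kl_numeric(mu0, mu1, sigma):
    # Riemann sum of p log(p/q) on a wide grid
    z = np.linspace(-40 * sigma, 40 * sigma, 1_000_001)
    p = np.exp(-np.abs(z - mu0) / sigma) / (2 * sigma)
    q = np.exp(-np.abs(z - mu1) / sigma) / (2 * sigma)
    return np.sum(p * (np.log(p) - np.log(q))) * (z[1] - z[0])

for mu1 in [0.1, 0.5, 2.0]:
    print(mu1, laplace_kl_closed_form(0.0, mu1, 1.0),
          laplace_kl_numeric(0.0, mu1, 1.0))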
The two key facts are that KL-stable algorithms compose adaptively and that they bound
mutual information in independent samples.
Lemma 4.16. Let A : X n → A0 and A′ : A0 × X n → A1 be ε- and ε′-KL-stable algorithms, respectively. Then the (randomized) composition A′ ◦ A(x1n) = A′(A(x1n), x1n) is ε + ε′-KL-stable. Moreover, the pair (A′ ◦ A(x1n), A(x1n)) is ε + ε′-KL-stable.

Proof  Let Ai and A′i be the promised sub-algorithms in Definition 4.1. We apply the data processing inequality, which implies for each i that

    Dkl( A′(A(x1n), x1n) || A′i(Ai(x\i), x\i) ) ≤ Dkl( (A′(A(x1n), x1n), A(x1n)) || (A′i(Ai(x\i), x\i), Ai(x\i)) ).

We require a bit of notational trickery now. Fixing i, let PA,A′ be the joint distribution of A′(A(x1n), x1n) and A(x1n) and QA,A′ the joint distribution of A′i(Ai(x\i), x\i) and Ai(x\i), so that they are both distributions over A1 × A0. Let PA′|a be the distribution of A′(a, x1n) and similarly QA′|a the distribution of A′i(a, x\i). Note that A′ and A′i both “observe” the first coordinate a, so that using the chain rule (2.1.6) for KL-divergences, we have

    Dkl(PA,A′ || QA,A′) = Dkl(PA || QA) + ∫ Dkl(PA′|a || QA′|a) dPA(a);

averaging over i, the first terms contribute at most ε by the KL-stability of A, and the second at most ε′ by that of A′, as desired.
The second key result is that KL-stable algorithms also bound the mutual information of a
random function.
Lemma 4.17. Let the Xi be independent. Then for any random variable A,

    I(A; X1n) ≤ Σ_{i=1}^n I(A; Xi | X\i) = Σ_{i=1}^n ∫ Dkl( A(x1n) || Ā(x\i) ) dP(x1n),

where Ā(x\i) denotes the distribution of A conditional on X\i = x\i.

Proof  Without loss of generality, we assume A and X are both discrete. In this case, we have

    I(A; X1n) = Σ_{i=1}^n I(A; Xi | X1^{i−1}) = Σ_{i=1}^n [ H(Xi | X1^{i−1}) − H(Xi | A, X1^{i−1}) ].

Now, because the Xi follow a product distribution, H(Xi | X1^{i−1}) = H(Xi), while H(Xi | A, X1^{i−1}) ≥ H(Xi | A, X\i) because conditioning reduces entropy. Consequently, we have

    I(A; X1n) ≤ Σ_{i=1}^n [ H(Xi) − H(Xi | A, X\i) ] = Σ_{i=1}^n I(A; Xi | X\i).
Combining Lemmas 4.16 and 4.17, we see (nearly) immediately that KL stability implies a mu-
tual information bound, and consequently even interactive KL-stable algorithms maintain bounds
on mutual information.
Fix an index j and, for shorthand, let A = Aj and A′ = (A1, . . . , Aj−1) be the first j − 1 procedures.
Then expanding the final mutual information term and letting ν denote the distribution of A′, we have

    I(A; Xi | X\i, A′) = ∫ Dkl( A(a′, x1n) || Ā(a′, x\i) ) dP(xi | A′ = a′, x\i) dP^{n−1}(x\i) dν(a′ | x\i),

where A(a′, x1n) is the (random) procedure A on inputs x1n and a′, while Ā(a′, x\i) denotes the (random) procedure A on input a′, x\i, Xi, where the ith example Xi follows its distribution conditional on A′ = a′ and X\i = x\i, as in Lemma 4.17. We then recognize that for each i, we have

    ∫ Dkl( A(a′, x1n) || Ā(a′, x\i) ) dP(xi | a′, x\i) ≤ ∫ Dkl( A(a′, x1n) || Ã(a′, x\i) ) dP(xi | a′, x\i)

for any randomized function Ã, as the marginal Ā in the lemma minimizes the average KL-divergence (recall Exercise 2.15). Now, sum over i and apply the definition of KL-stability as in Lemma 4.16.
Input: Sample X1n ∈ X n drawn i.i.d. P, collection {φt}t∈T of possible queries φt : X → [−1, 1]
Repeat: for k = 1, 2, . . .
    E[(Pn φT − P φT)²] ≤ (2e/n) I(X1n; Tk+1) + 5/(4n) ≤ 2ekε + 5/(4n) = ek/(σ²n²) + 5/(4n).

Now, we simply consider the independent noise addition, noting that (a + b)² ≤ 2a² + 2b² for any a, b ∈ R, so that

    E[ max_{j≤k} (Aj − P φTj)² ] ≤ 2 E[(Pn φT − P φT)²] + 2 E[ max_{j≤k} Zj² ]
                                  ≤ 2ek/(σ²n²) + 10/(4n) + 4σ²(log k + 1),    (4.3.3)

where inequality (4.3.3) is the desired result and follows by the following lemma.
Lemma 4.20. Let Wj, j = 1, . . . , k, be independent N(0, 1). Then E[max_j Wj²] ≤ 2(log k + 1).

Proof  We assume that k ≥ 3, as the result is trivial otherwise. Using the standard tail bound for Gaussians (tighter than the standard sub-Gaussian bound) that P(W ≥ t) ≤ (1/(√(2π) t)) e^{−t²/2} for t ≥ 0, and that E[Z] = ∫_0^∞ P(Z ≥ t) dt for a nonnegative random variable Z, we obtain that for any t0,

    E[max_j Wj²] = ∫_0^∞ P(max_j Wj² ≥ t) dt ≤ t0 + ∫_{t0}^∞ P(max_j Wj² ≥ t) dt
                 ≤ t0 + 2k ∫_{t0}^∞ P(W1 ≥ √t) dt ≤ t0 + (2k/√(2π)) ∫_{t0}^∞ e^{−t/2} dt = t0 + (4k/√(2π)) e^{−t0/2}.

Setting t0 = 2 log(4k/√(2π)) gives E[max_j Wj²] ≤ 2 log k + 2 log(4/√(2π)) + 1 ≤ 2(log k + 1).
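A quick Monte Carlo check of the lemma (an illustration in Python with numpy, not from the notes):

import numpy as np

rng = np.random.default_rng(3)
for k in [10, 100, 1000]:
    W = rng.standard_normal(size=(5000, k))
    est = np.mean(np.max(W ** 2, axis=1))          # ~ E[max_j W_j^2]
    print(f"k={k:5d}: E[max_j W_j^2] ~ {est:.2f}, "
          f"2(log k + 1) = {2 * (np.log(k) + 1):.2f}")

The bound is loose by roughly an additive constant; the leading 2 log k term is sharp.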
4.4 Bibliography
For PAC-Bayes: the original papers are David McAllester’s [111, 112, 113], and the tutorial [114].
Our approach is also similar to Catoni's [40]. Our proofs are a simplified version of McAllester's PAC-Bayesian Stochastic Model Selection.
Interactive data analysis: [69, 67, 68] and [22, 23].
4.5 Exercises
Question 4.1 (Large-margin PAC-Bayes bounds for multiclass problems): Consider the following multiclass prediction scenario. Data comes in pairs (x, y) ∈ bB2^d × [k], where B2^d = {v ∈ Rd | ‖v‖2 ≤ 1} denotes the ℓ2-ball and [k] = {1, . . . , k}. We make predictions using predictors θ1, . . . , θk ∈ Rd, where the prediction of y on an example x is

    ŷ(x) := argmax_{i∈[k]} ⟨θi, x⟩.

We suffer an error whenever ŷ(x) ≠ y, and the margin of our classifier on the pair (x, y) is

    ⟨θy, x⟩ − max_{i≠y} ⟨θi, x⟩.

If ⟨θy, x⟩ > ⟨θi, x⟩ for all i ≠ y, the margin is then positive (and the prediction is correct).

(a) Develop an analogue of the bounds in Example 4.8 in this k-class multiclass setting. To do so, you should (i) define the analogue of the margin-based loss ℓγ, (ii) show how Gaussian perturbations leave it similar, and (iii) prove an analogue of the bound in Example 4.8. You should assume one of the two conditions

    (C1) ‖θi‖2 ≤ r for all i    or    (C2) Σ_{i=1}^k ‖θi‖2² ≤ kr².
Question 4.2 (A variance-based information bound): Let Φ = {φt}t∈T be a collection of functions φt : X → R, where each φt satisfies the Bernstein condition (3.1.9) with parameters σ²(φt) and b, that is, |E[(φt(X) − P φt)^k]| ≤ (k!/2) σ²(φt) b^{k−2} for all k ≥ 3, and Var(φt(X)) = σ²(φt). Let T ∈ T be any random variable, which may depend on an observed sample X1n. Show that for all C > 0 and |λ| ≤ C/(2b),

    E[ (Pn φT − P φT) / max{C, σ(φT)} ] ≤ (1/(n|λ|)) I(T; X1n) + |λ|.
Question 4.3 (An information bound on variance): Let Φ = {φt}t∈T be a collection of functions φt : X → [−1, 1]. Let σ²(φt) = Var(φt(X)), and let sn²(φ) = Pn φ² − (Pn φ)² be the sample variance of φ. Show that for all C > 0 and 0 ≤ λ ≤ C/4,

    E[ sn²(φT) / max{C, σ²(φT)} ] ≤ (1/(nλ)) I(T; X1n) + 2.

The max{C, σ²(φT)} term is there to help avoid division by 0. Hint: If 0 ≤ x ≤ 1, then e^x ≤ 1 + 2x, and if X ∈ [0, 1], then E[e^X] ≤ 1 + 2E[X] ≤ e^{2E[X]}. Use this to argue that E[exp(λ n Pn(φ − P φ)² / max{C, σ²})] ≤ e^{2λn} for any φ : X → [−1, 1] with Var(φ) ≤ σ², then apply the Donsker-Varadhan theorem.
Question 4.4: Consider the following scenario: let φ : X → [−1, 1] and let α > 0, τ > 0. Let µ = Pn φ and s² = Pn φ² − µ². Define σ² = max{αs², τ²}, and assume that τ² ≥ 5α/n.
(b) Show that if α² ≤ C′τ² for a numerical constant C′ < ∞, then we can take ε ≤ O(1)·1/(n²α). Hint: Use Exercise 2.14, and consider the “alternative” mechanisms of sampling from

    N(µ−i, σ−i²), where σ−i² = max{α s−i², τ²}

for

    µ−i = (1/(n−1)) Σ_{j≠i} φ(Xj) and s−i² = (1/(n−1)) Σ_{j≠i} φ(Xj)² − µ−i².
Input: Sample X1n ∈ X n drawn i.i.d. P, collection {φt}t∈T of possible queries φt : X → [−1, 1], parameters α > 0 and τ > 0
Repeat: for k = 1, 2, . . .
  iii. Mechanism draws independent Zk ∼ N(0, σk²) and responds with answer

      Ak := Pn φ + Zk = (1/n) Σ_{i=1}^n φ(Xi) + Zk.
Question 4.5 (A general variance-dependent bound on interactive queries): Consider the algo-
rithm in Fig. 4.3. Let σ 2 (φt ) = Var(φt (X)) be the variance of φt .
(a) Show that for b > 0 and for all 0 ≤ λ ≤ 1/(2b),

    E[ max_{j≤k} |Aj − P φTj| / max{b, σ(φTj)} ] ≤ (1/(nλ)) I(X1n; T1k) + λ + √(2 log(ke)) √( (4α/(nb²)) I(X1n; T1k) + 2α + τ²/b² ).

(If you do not have quite the right constants, that's fine.)
(b) Using the result of Question 4.4, show that with appropriate choices for the parameters α, b, τ², λ, we have for a numerical constant C < ∞ that

    E[ max_{j≤k} |Aj − P φTj| / max{(k log k)^{1/4}/√n, σ(φTj)} ] ≤ C (k log k)^{1/4}/√n.

You may assume that k, n are large if necessary.
(c) Interpret the result from part (b). How does this improve over Theorem 4.19?
Chapter 5
defined whenever Z ≥ 0 with probability 1. As our particular focus throughout this chapter, we consider the moment generating function and associated transformation X ↦ e^{λX}. If we know the moment generating function ϕX(λ) := E[e^{λX}], then ϕ′X(λ) = E[X e^{λX}], and so

    H(e^{λX}) = E[λX e^{λX}] − E[e^{λX}] log E[e^{λX}] = λ ϕ′X(λ) − ϕX(λ) log ϕX(λ).

This suggests—in a somewhat roundabout way we make precise—that control of the entropy H(e^{λX}) should be sufficient for controlling the moment generating function of X.
The Herbst argument makes this rigorous.
Proposition 5.2. Let X be a random variable and assume that there exists a constant σ² < ∞ such that

    H(e^{λX}) ≤ (λ²σ²/2) ϕX(λ)    (5.1.3)

for all λ ∈ R (respectively, λ ∈ R+), where ϕX(λ) = E[e^{λX}] denotes the moment generating function of X. Then

    E[exp(λ(X − E[X]))] ≤ exp(λ²σ²/2)

for all λ ∈ R (respectively, λ ∈ R+).
Proof  Let ϕ = ϕX for shorthand. The proof proceeds by an integration argument, where we show that log ϕ(λ) ≤ λE[X] + λ²σ²/2. First, note that ϕ′(λ) = E[X e^{λX}], so that

    λϕ′(λ) − ϕ(λ) log ϕ(λ) = H(e^{λX}) ≤ (λ²σ²/2) ϕ(λ),

and dividing both sides by λ²ϕ(λ) yields the equivalent statement

    ϕ′(λ)/(λϕ(λ)) − (1/λ²) log ϕ(λ) ≤ σ²/2.

The left side is exactly the derivative

    ∂/∂λ [ (1/λ) log ϕ(λ) ] = ϕ′(λ)/(λϕ(λ)) − (1/λ²) log ϕ(λ),

and since lim_{λ↓0} (1/λ) log ϕ(λ) = E[X], integrating the inequality from 0 to λ gives (1/λ) log ϕ(λ) ≤ E[X] + λσ²/2, that is, log ϕ(λ) ≤ λE[X] + λ²σ²/2, which is the claim.
It is possible to give a similar argument for sub-exponential random variables, which allows us
to derive Bernstein-type bounds, of the form of Corollary 3.17, but using the entropy method. In
particular, in the exercises, we show the following result.
Proposition 5.3. Assume that there exist positive constants b and σ such that
Let X\i = (X1, . . . , Xi−1, Xi+1, . . . , Xn) be the collection of all variables except Xi. Our first result is a consequence of the chain rule for entropy and is known as Han's inequality: for discrete random variables X1, . . . , Xn,

    H(X1n) ≤ (1/(n − 1)) Σ_{i=1}^n H(X\i).
Proof  The proof is a consequence of the chain rule for entropy and that conditioning reduces entropy. We have H(X1n) = H(X\i) + H(Xi | X\i) ≤ H(X\i) + H(Xi | X1^{i−1}) for each i; summing over i = 1, . . . , n and using the chain rule H(X1n) = Σ_{i=1}^n H(Xi | X1^{i−1}) gives n H(X1n) ≤ Σ_{i=1}^n H(X\i) + H(X1n), which rearranges to the claim.
We also require a divergence version of Han's inequality, which will allow us to relate the entropy H of a random variable to divergences and other information-theoretic quantities. Let X be an arbitrary space, let Q be a distribution over X n, and let P = P1 × · · · × Pn be a product distribution on the same space. For A ⊂ X n−1, define the marginal distributions

    Q(i)(A) := Q(X\i ∈ A) and P(i)(A) := P(X\i ∈ A).

Proposition 5.4 (Han's inequality for relative entropies). With the definitions above,

    Dkl(Q||P) ≤ Σ_{i=1}^n [ Dkl(Q||P) − Dkl(Q(i)||P(i)) ].

Proof  We have seen earlier in the notes (recall the definition (2.2.1) of the KL divergence as a supremum over all quantizers and the surrounding discussion) that it is no loss of generality to assume that X is discrete. Thus, noting that the probability mass functions of the marginals are

    q(i)(x\i) = Σ_x q(x1^{i−1}, x, x_{i+1}^n) and p(i)(x\i) = Π_{j≠i} pj(xj),

Han's inequality applied to the distribution Q is equivalent to

    (n − 1) Σ_{x1^n} q(x1^n) log q(x1^n) ≥ Σ_{i=1}^n Σ_{x\i} q(i)(x\i) log q(i)(x\i).

Now, by subtracting (n − 1) Σ_{x1^n} q(x1^n) log p(x1^n) from both sides of the preceding display, we obtain

    (n − 1) Dkl(Q||P) = (n − 1) Σ_{x1^n} q(x1^n) log q(x1^n) − (n − 1) Σ_{x1^n} q(x1^n) log p(x1^n)
                      ≥ Σ_{i=1}^n Σ_{x\i} q(i)(x\i) log q(i)(x\i) − (n − 1) Σ_{x1^n} q(x1^n) log p(x1^n).
We expand the final term. Indeed, by the product nature of the distribution p, we have

    (n − 1) Σ_{x1^n} q(x1^n) log p(x1^n) = (n − 1) Σ_{x1^n} q(x1^n) Σ_{i=1}^n log pi(xi)
        = Σ_{i=1}^n Σ_{x1^n} q(x1^n) Σ_{j≠i} log pj(xj) = Σ_{i=1}^n Σ_{x\i} q(i)(x\i) log p(i)(x\i),

where we use that Σ_{j≠i} log pj(xj) = log p(i)(x\i). Noting that

    Σ_{x\i} q(i)(x\i) log q(i)(x\i) − Σ_{x\i} q(i)(x\i) log p(i)(x\i) = Dkl(Q(i)||P(i))

and rearranging gives the result.
Finally, we will prove the main result of this subsection: a tensorization identity for the entropy
H(Y ) for an arbitrary random variable Y that is a function of n independent random variables.
For this result, we use a technique known as tilting, in combination with the two variants of Han’s
inequality we have shown, to obtain the result. The tilting technique is one used to transform
problems of random variables into one of distributions, allowing us to bring the tools of information
and entropy to bear more directly. This technique is a common one, used frequently in large deviations theory and in statistics for heavy-tailed data, among other areas. More concretely, let Y = f(X1, . . . , Xn) for some non-negative function f. Then we may always define a tilted density

    q(x1, . . . , xn) := f(x1, . . . , xn) p(x1, . . . , xn) / EP[f(X1, . . . , Xn)],    (5.1.5)

which, by inspection, satisfies ∫ q(x1n) = 1 and q ≥ 0. In our context, if f ≈ constant under the distribution P, then we should have f(x1n) p(x1n) ≈ c p(x1n), and so Dkl(Q||P) should be small; we can make this rigorous via the following tensorization theorem.

Theorem 5.6. Let X1, . . . , Xn be independent random variables and let Y = f(X1n) for a non-negative function f. Then

    H(Y) ≤ Σ_{i=1}^n E[ H(Y | X\i) ].    (5.1.6)
Proof  Inequality (5.1.6) holds for Y if and only if it holds identically for cY for any c > 0, so we assume without loss of generality that EP[Y] = 1. We thus obtain that H(Y) = E[Y log Y] = E[φ(Y)], where we set φ(t) = t log t. Let P have density p with respect to a base measure µ. Then by defining the tilted distribution (density) q(x1^n) = f(x1^n) p(x1^n), we have Q(X n) = 1, and moreover,

    Dkl(Q||P) = ∫ q(x1^n) log( q(x1^n)/p(x1^n) ) dµ(x1^n) = ∫ f(x1^n) p(x1^n) log f(x1^n) dµ(x1^n) = EP[Y log Y] = H(Y).
Similarly, by the tower property,

    E[φ(Y)] − E[φ(E[Y | X\i])] = E[ E[φ(Y) | X\i] − φ(E[Y | X\i]) ] = E[H(Y | X\i)].

Using Han's inequality for relative entropies (Proposition 5.4) then immediately gives

    H(Y) = Dkl(Q||P) ≤ Σ_{i=1}^n [ Dkl(Q||P) − Dkl(Q(i)||P(i)) ] = Σ_{i=1}^n E[H(Y | X\i)],

which is the claim.
Theorem 5.6 shows that if we can show that individually the conditional entropies H(Y | X\i )
are not too large, then the Herbst argument (Proposition 5.2 or its variant Proposition 5.3) allows
us to provide strong concentration inequalities for general random variables Y .
Suppose that f : X n → R satisfies the bounded differences condition

    sup_{x∈X, x′∈X} | f(x1, . . . , xi−1, x, xi+1, . . . , xn) − f(x1, . . . , xi−1, x′, xi+1, . . . , xn) | ≤ ci for all x\i.    (5.1.7)
Then we have the following result.
Proposition 5.7 (Bounded differences). Assume that f satisfies the bounded differences condition (5.1.7), where (1/4) Σ_{i=1}^n ci² ≤ σ². Let the Xi be independent. Then Y = f(X1, . . . , Xn) is σ²-sub-Gaussian.
Proof  We use a similar integration argument to the Herbst argument of Proposition 5.2, and we apply the tensorization inequality (5.1.6). First, let U be an arbitrary random variable taking values in [a, b]. We claim that if ϕU(λ) = E[e^{λU}] and ψ(λ) = log ϕU(λ) is its cumulant generating function, then

    H(e^{λU}) / E[e^{λU}] ≤ λ²(b − a)²/8.    (5.1.8)
    ≤ exp( −2 [t − E‖Σ_{i=1}^n Xi‖2]₊² / Σ_{i=1}^n ci² ).

Noting that E[‖Σ_{i=1}^n Xi‖2] ≤ √( E[‖Σ_{i=1}^n Xi‖2²] ) = √( Σ_{i=1}^n E[‖Xi‖2²] ) gives the result. 3
Theorem 5.9. Let X1, . . . , Xn be independent random variables with Xi ∈ [a, b] for all i. Assume that f : Rn → R is separately convex and L-Lipschitz with respect to the ‖·‖2 norm. Then for all t ≥ 0,

    P( f(X1n) ≥ E[f(X1n)] + t ) ≤ exp( −t²/(4L²(b − a)²) ).
We defer the proof of the theorem temporarily, giving two example applications. The first is to
the matrix concentration problem that motivates the beginning of this section.
Example 5.10: Let X ∈ Rm×n be a matrix with independent entries, where Xij ∈ [−1, 1] for all i, j, and let |||·||| denote the operator norm on matrices, that is, |||A||| = sup_{u,v} {u^T A v : ‖u‖2 ≤ 1, ‖v‖2 ≤ 1}. The operator norm is convex, as a supremum of linear functions, and satisfies ||||A||| − |||B|||| ≤ |||A − B||| ≤ ‖A − B‖Fr, where ‖·‖Fr denotes the Frobenius norm of a matrix. Thus the matrix operator norm is 1-Lipschitz, and we have by Theorem 5.9 and the Chernoff bound technique that

    P( |||X||| ≥ E[|||X|||] + t ) ≤ exp(−t²/16) for all t ≥ 0. 3
As a second example, we consider Rademacher complexity. These types of results are important
for giving generalization bounds in a variety of statistical algorithms, and form the basis of a variety
of concentration and convergence results. We defer further motivation of these ideas to subsequent
chapters, just mentioning here that we can provide strong concentration guarantees for Rademacher
complexity or Rademacher chaos.
Example 5.11: Let A ⊂ Rn be any collection of vectors. The Rademacher complexity of the class A is

    Rn(A) := E[ sup_{a∈A} Σ_{i=1}^n ai εi ],    (5.1.9)

where the εi are independent uniform signs, εi ∈ {±1}. Letting R̂n(A) := sup_{a∈A} Σ_{i=1}^n ai εi denote the unaveraged quantity, we have

    P( R̂n(A) ≥ Rn(A) + t ) ≤ exp( −t²/(16 diam(A)²) ),

where diam(A) := sup_{a∈A} ‖a‖2. Indeed, we have that ε ↦ sup_{a∈A} a^T ε is a convex function, as it is the maximum of a family of linear functions. Moreover, it is Lipschitz, with Lipschitz constant bounded by sup_{a∈A} ‖a‖2. Applying Theorem 5.9 as in Example 5.10 gives the result. 3
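The concentration in Example 5.11 is easy to visualize by simulation. The sketch below (an illustration in Python with numpy, not from the notes) takes A to be a finite set of unit vectors, so that diam(A) = 1, and compares the empirical upper tail of R̂n(A) with exp(−t²/16):

import numpy as np

rng = np.random.default_rng(4)
n, m, trials = 50, 200, 20_000

A = rng.normal(size=(m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit vectors: diam(A) = 1
eps = rng.choice([-1.0, 1.0], size=(trials, n))
sups = np.max(eps @ A.T, axis=1)                # hat{R}_n(A) = sup_a <a, eps>
Rn = np.mean(sups)                              # Monte Carlo estimate of R_n(A)

for t in [1.0, 2.0, 3.0]:
    print(f"t={t}: P(sup >= R_n + t) ~ {np.mean(sups >= Rn + t):.5f}, "
          f"exp(-t^2/16) = {np.exp(-t ** 2 / 16):.5f}")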
Proof of Theorem 5.9 The proof relies on our earlier tensorization identity and a symmetriza-
tion lemma.
Lemma 5.12. Let X, Y ∼iid P be independent. Then for any function g : R → R, we have

    H(e^{λg(X)}) ≤ λ² E[ (g(X) − g(Y))² e^{λg(X)} 1{g(X) ≥ g(Y)} ].

Proof  We use the convexity of the exponential in an essential way. In particular, we have

    H(e^{λg(X)}) = E[λg(X) e^{λg(X)}] − E[e^{λg(X)}] log E[e^{λg(X)}] ≤ E[ λ(g(X) − g(Y)) e^{λg(X)} ],

because log is concave (so that log E[e^{λg(Y)}] ≥ E[λg(Y)]) and e^x ≥ 0. Using symmetry, that is, that g(X) − g(Y) has the same distribution as g(Y) − g(X), we then find

    H(e^{λg(X)}) ≤ (1/2) E[λ(g(X) − g(Y))(e^{λg(X)} − e^{λg(Y)})] = E[λ(g(X) − g(Y))(e^{λg(X)} − e^{λg(Y)}) 1{g(X) ≥ g(Y)}].

Now we use the classical first-order convexity inequality—that a convex function f satisfies f(t) ≥ f(s) + f′(s)(t − s) for all t and s, Theorem A.14 in the appendices—which gives that e^t ≥ e^s + e^s(t − s) for all s and t. Rewriting, we have e^s − e^t ≤ e^s(s − t), and whenever s ≥ t, we have (s − t)(e^s − e^t) ≤ e^s(s − t)². Replacing s and t with λg(X) and λg(Y), respectively, we obtain

    λ(g(X) − g(Y))(e^{λg(X)} − e^{λg(Y)}) 1{g(X) ≥ g(Y)} ≤ λ²(g(X) − g(Y))² e^{λg(X)} 1{g(X) ≥ g(Y)},

and taking expectations gives the lemma.
Returning to the main thread of the proof, we note that the separate convexity of f, Lemma 5.12 applied coordinate-wise, and the tensorization identity of Theorem 5.6 imply

    H(e^{λf(X1:n)}) ≤ E[ Σ_{i=1}^n H(e^{λf(X1:n)} | X\i) ] ≤ Σ_{i=1}^n E[ λ² E[ (Xi − Yi)² (∂f(X1:n)/∂xi)² e^{λf(X1:n)} | X\i ] ],

where the Yi are independent copies of the Xi. Now, we use that (Xi − Yi)² ≤ (b − a)² and the definition of the partial derivative to obtain

    H(e^{λf(X)}) ≤ λ²(b − a)² E[ ‖∇f(X)‖2² e^{λf(X)} ].

Noting that ‖∇f(X)‖2² ≤ L², and applying the Herbst argument, gives the result.
so that we “project out” the jth coordinate, and define the projected sets.
Chapter 6
JCD Comment: Add a bit of commentary on local privacy versus centralized privacy
In this chapter, we continue to build off of our ideas on stability in different scenarios, ranging
from model fitting and concentration to interactive data analyses. Here, we show how stability ideas
allow us to provide a new type of protection: the privacy of participants in studies. The major challenge in this direction had, until the mid-2000s with the introduction of differential privacy—a type of stability in likelihood ratios—been to give a satisfactory definition of privacy, because the collection of side information often results in unforeseen compromises of private information. Consequently, in
this chapter we focus on privacy notions based on differential privacy and its cousins, developing the
information-theoretic stability ideas helpful to understand the protections it is possible to provide.
now we have “compromised” the privacy of everyone who smokes who did not participate in the
study: we know they are more likely to get cancer.
In each of these cases, the biggest challenge is one of side information: how can we be sure
that, when releasing a particular statistic, dataset, or other quantity that no adversary will be able
to infer sensitive data about participants in our study? We articulate three desiderata that—we
believe—suffice for satisfactory definitions of privacy. In discussion of private releases of data, we
require a bit of vocabulary. We term a (randomized) algorithm releasing data either a privacy
mechanism, consistent with much of the literature in privacy, or a channel, mapping from the input
sample to some output space, in keeping with our statistical and information-theoretic focus. In
no particular order, we wish our privacy mechanism, which takes as input a sample X1n ∈ X n and
releases some Z to satisfy the following.
i. Given the output Z, even an adversary knowing everyone in the study (excepting one person)
should not be able to test whether you belong to the study.
ii. If you participate in multiple “private” studies, there should be some graceful degradation
in the privacy protections, rather than a catastrophic failure. As part of this, any definition
should guarantee that further processing of the output Z of a private mechanism X1n → Z, in
the form of the Markov chain X1n → Z → Y , should not allow further compromise of privacy
(that is, a data-processing inequality). Additional participation in “private” studies should
continue to provide little additional information.
iii. The mechanism X1n → Z should be resilient to side information: even if someone knows
something about you, he should learn little about you if you belong to X1n , and this should
remain true even if the adversary later gleans more information about you.
The third desideratum is perhaps most elegantly phrased via a Bayesian perspective, where an
adversary has some prior beliefs π on the membership of a dataset (these prior beliefs can then
capture any side information the adversary has). Perhaps the strongest adversary might have a
prior supported on two samples {x1 , . . . , xn } and {x01 , . . . , x0n } differing in only a single element; a
private mechanism would then guarantee his posterior beliefs (after the release X1n → Z) should
not change significantly.
The challenges of side information motivate the definition of differential privacy, due to Dwork
et al. [65]. The key in differential privacy is that the noisy channel releasing statistics provides
guarantees of bounded likelihood ratios between neighboring samples, that is, samples differing in
only a single entry.
Definition 6.1 (Differential privacy). Let Q be a Markov kernel from X n to an output space Z.
Then Q is ε-differentially private if for all (measurable) sets S ⊂ Z and all samples xn1 ∈ X n and
y1n ∈ X n differing in at most a single entry,
    Q(Z ∈ S | x1n) / Q(Z ∈ S | y1n) ≤ e^ε.    (6.1.1)
The intuition and original motivation for this definition are that an individual has little incentive
to participate (or not participate) in a study, as the individual’s data has limited effect on the
outcome.
The model (6.1.1) of differential privacy presumes that there is a trusted curator, such as a
hospital, researcher, or corporation, who can collect all the data into one centralized location, and
it is consequently known as the centralized model. A stronger model of privacy is the local model, in which data providers trust no one, not even the data collector, and privatize their individual data before the collector even sees it.

Definition 6.2 (Local differential privacy). Let Q be a Markov kernel from X to an output space Z. Then Q is ε-locally differentially private if for all (measurable) sets S ⊂ Z and all x, x′ ∈ X,

    Q(Z ∈ S | x) / Q(Z ∈ S | x′) ≤ e^ε.    (6.1.2)
It is clear that Definition 6.2 and the condition (6.1.2) are stronger than Definition 6.1: when samples {x1, . . . , xn} and {x′1, . . . , x′n} differ in at most one observation and each individual privatizes his or her own datum, the densities satisfy

    dQ(Z1n | {xi}) / dQ(Z1n | {x′i}) = Π_{i=1}^n dQ(Zi | xi)/dQ(Zi | x′i) ≤ e^ε,

where the inequality follows because only a single ratio may have xi ≠ x′i.
In the remainder of this introductory section, we provide a few of the basic mechanisms in use
in differential privacy, then discuss its “semantics,” that is, its connections to the three desiderata
we outline above. In the coming sections, we revisit a few more advanced topics, in particular, the
composition of multiple private mechanisms and a few weakenings of differential privacy, as well as
more sophisticated examples.
Example 6.1 (Randomized response): The classical locally private mechanism is randomized response. An individual holds a sensitive bit x ∈ {0, 1} and, rather than answering a yes/no question about x truthfully, answers Yes with probability e^ε/(1 + e^ε) when x = 1 and with probability 1/(1 + e^ε) when x = 0 (answering No otherwise). Then

    Q(Yes | x = 0)/Q(Yes | x = 1) = e^{−ε} and Q(No | x = 0)/Q(No | x = 1) = e^ε,

so that Q(Z = z | x)/Q(Z = z | x′) ∈ [e^{−ε}, e^ε] for all x, z. That is, the randomized response channel provides ε-local privacy. 3
The interesting question is, of course, whether we can still use this channel to estimate the
proportion of the population with the sensitive characteristic. Indeed, we can. We can provide
a somewhat more general analysis, however, which we now do so that we can give a complete
example.
Example 6.2 (Randomized response, continued): Suppose that we have an attribute of interest, x, taking the values x ∈ {1, . . . , k}. Then we consider the channel (of Z drawn conditional on x)

    Z = x with probability e^ε/(k − 1 + e^ε), and Z ∼ Uniform([k] \ {x}) with probability (k − 1)/(k − 1 + e^ε).

This (generalized) randomized response mechanism is evidently ε-locally private, satisfying Definition 6.2.

Let p ∈ R^k_+, p^T 1 = 1, indicate the true probabilities pi = P(X = i). Then by inspection, we have

    P(Z = i) = pi e^ε/(k − 1 + e^ε) + (1 − pi)/(k − 1 + e^ε) = pi (e^ε − 1)/(e^ε + k − 1) + 1/(e^ε + k − 1).

Thus, letting ĉn ∈ R^k_+ denote the empirical proportions of the Z observations in a sample of size n, we see that

    p̂n := ((e^ε + k − 1)/(e^ε − 1)) ( ĉn − (1/(e^ε + k − 1)) 1 )

satisfies E[p̂n] = p, and we also have

    E[‖p̂n − p‖2²] = ((e^ε + k − 1)/(e^ε − 1))² E[‖ĉn − E[ĉn]‖2²] = (1/n) ((e^ε + k − 1)/(e^ε − 1))² Σ_{j=1}^k P(Z = j)(1 − P(Z = j)).

As Σ_j P(Z = j) = 1, we always have the bound E[‖p̂n − p‖2²] ≤ (1/n)((e^ε + k − 1)/(e^ε − 1))².

We may consider two regimes for simplicity: when ε ≤ 1 and when ε ≥ log k. In the former case—the high privacy regime—we have P(Z = i) ≍ 1/k, so that the mean ℓ2 squared error scales as (1/n)(k²/ε²). When ε ≥ log k is large, by contrast, we see that the error scales at worst as 1/n, which is the “non-private” mean squared error. 3
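A direct implementation of the k-ary randomized response channel, together with its debiasing step, makes the ε-dependence tangible; the following sketch (an illustration in Python with numpy, not from the notes) reproduces the estimator p̂n and its error bound:

import numpy as np

rng = np.random.default_rng(5)
k, n, eps = 10, 100_000, 1.0
e = np.exp(eps)

p = rng.dirichlet(np.ones(k))                    # true distribution
x = rng.choice(k, size=n, p=p)                   # sensitive data

# Report the truth w.p. e^eps/(k-1+e^eps), otherwise a uniformly random
# *other* symbol; (x + U) mod k with U uniform on {1,...,k-1} is
# uniform on [k] \ {x}.
keep = rng.uniform(size=n) < e / (k - 1 + e)
z = np.where(keep, x, (x + rng.integers(1, k, size=n)) % k)

c_hat = np.bincount(z, minlength=k) / n          # empirical proportions of Z
p_hat = (e + k - 1) / (e - 1) * (c_hat - 1 / (e + k - 1))  # debiased estimate

print("||p_hat - p||_2^2         =", np.sum((p_hat - p) ** 2))
print("bound ((e+k-1)/(e-1))^2/n =", ((e + k - 1) / (e - 1)) ** 2 / n)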
While randomized response is essentially the standard mechanism in locally private settings, in
centralized privacy, the “standard” mechanism is Laplace noise addition because of its exponential
tails. In this case, we require a few additional definitions. Suppose that we wish to release some
d-dimensional function f (X1n ) of the sample X1n , where f takes values in Rd . In the case that
f is Lipschitz with respect to the Hamming metric—that is, the counting metric on X n —it is
relatively straightforward to develop private mechanisms. For easier use in our future development,
for p ∈ [1, ∞] and some distance-like function dist taking values in R+ , we define the Lipschitz
constant Lipp,dist by
    Lip_{p,dist}(f) := sup_{x,x′} { ‖f(x) − f(x′)‖p / dist(x, x′) | dist(x, x′) > 0 }.
The appropriate notion of distance in the case of (centralized) differential privacy is the Hamming metric

    dham({x1, . . . , xn}, {x′1, . . . , x′n}) = Σ_{i=1}^n 1{xi ≠ x′i},
which counts the number of differences between samples x and x0 . Differentially private mechanisms
(Definition 6.1) are most convenient to define for functions that are Lipschitz with respect to the
Hamming metric, because they allow simple noise addition strategies. In the privacy literature, the
Lipschitz constant of a function is often called the sensitivity.
Example 6.3 (The Laplace mechanism): Suppose that the function f : X n → Rd is L-Lipschitz with respect to the Hamming metric, that is, ‖f(x) − f(x′)‖1 ≤ L whenever dham(x, x′) ≤ 1 (you should convince yourself that this is an equivalent definition of the Lipschitz constant for the Hamming metric). Then if we consider the mechanism defined by the addition of W ∈ Rd with independent Laplace(L/ε) coordinates,

    Z := f(X1n) + W, Wj ∼iid Laplace(L/ε),    (6.1.3)

the induced density q of Z satisfies

    q(z | x)/q(z | x′) = exp( −(ε/L)‖f(x) − z‖1 + (ε/L)‖f(x′) − z‖1 ) ≤ exp( (ε/L)‖f(x) − f(x′)‖1 ) ≤ exp(ε)

by the triangle inequality and the fact that f is L-Lipschitz with respect to the Hamming metric. Thus Z is ε-differentially private. Moreover, we have

    E[‖Z − f(x1n)‖2²] = 2dL²/ε²,

so that if L is small, we may report the value of f accurately. 3
The most common instances and applications of the Laplace mechanism are in estimation of
means and histograms. Let us demonstrate more carefully worked examples in these two cases.
Example 6.4 (Private one-dimensional mean estimation): Suppose that we have variables Xi taking values in [−b, b] for some b < ∞, and wish to estimate E[X]. A natural function to release is then f(X1n) = X̄n = (1/n) Σ_{i=1}^n Xi. This has Lipschitz constant 2b/n with respect to the Hamming metric, because for any two samples x, x′ ∈ [−b, b]^n differing in only entry i, we have

    |f(x) − f(x′)| = (1/n)|xi − x′i| ≤ 2b/n

because xi, x′i ∈ [−b, b]. Thus the Laplace mechanism (6.1.3) with the choice W ∼ Laplace(2b/(nε)) yields

    E[(Z − E[X])²] = E[(X̄n − E[X])²] + E[(Z − X̄n)²] = (1/n) Var(X) + 8b²/(n²ε²) ≤ b²/n + 8b²/(n²ε²).

We can privately release means with little penalty so long as ε ≫ n^{−1/2}. 3
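In code, the whole mechanism of Example 6.4 is one line of noise addition; the following sketch (an illustration in Python with numpy, not from the notes) averages the squared error over many trials to compare against the bound:

import numpy as np

rng = np.random.default_rng(6)
n, b, eps, trials = 10_000, 1.0, 0.5, 2000

errs = []
for _ in range(trials):
    x = rng.uniform(-b, b, size=n)                         # E[X] = 0
    z = np.mean(x) + rng.laplace(scale=2 * b / (n * eps))  # Laplace(2b/(n eps))
    errs.append(z ** 2)

print("E[(Z - E[X])^2] ~", np.mean(errs))
print("bound b^2/n + 8b^2/(n^2 eps^2) =",
      b ** 2 / n + 8 * b ** 2 / (n ** 2 * eps ** 2))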
Example 6.5 (Private histogram (multinomial) release): Suppose that we wish to estimate a multinomial distribution, or put differently, a histogram. That is, we have observations X ∈ {1, . . . , k}, where k may be large, and wish to estimate pj := P(X = j) for j = 1, . . . , k. For a given sample x1n, the empirical count vector p̂n with coordinates p̂n,j = (1/n) Σ_{i=1}^n 1{Xi = j} satisfies

    Lip_{1,dham}(p̂n) ≤ 2/n,

because swapping a single example xi for x′i may change the counts for at most two coordinates j, j′ by 1. Consequently, the Laplace noise addition mechanism

    Z = p̂n + W, Wj ∼iid Laplace(2/(nε))

satisfies

    E[‖Z − p̂n‖2²] = 8k/(n²ε²),

and consequently

    E[‖Z − p‖2²] = 8k/(n²ε²) + (1/n) Σ_{j=1}^k pj(1 − pj) ≤ 8k/(n²ε²) + 1/n.

This example shows one of the challenges of differentially private mechanisms: even in the case where the quantity of interest is quite stable (insensitive to changes in the underlying sample, or has small Lipschitz constant), it may be the case that the resulting mechanism adds noise that introduces some dimension-dependent scaling. In this case, the conditions on privacy levels acceptable for good estimation—in that the rate of convergence is no different from the non-private case, which achieves E[‖p̂n − p‖2²] = (1/n) Σ_{j=1}^k pj(1 − pj) ≤ 1/n—are that ε ≳ √(k/n). Thus, in the case that the histogram has a large number of bins, the naive noise addition strategy cannot provide as much protection without sacrificing efficiency.

If instead of ℓ2-error we consider ℓ∞ error, it is possible to provide somewhat more satisfying results in this case. Indeed, we know that P(‖W‖∞ ≥ t) ≤ k exp(−t/b) for Wj ∼iid Laplace(b), so that in the mechanism above we have

    P(‖Z − p̂n‖∞ ≥ t) ≤ k exp(−tnε/2) for all t ≥ 0,

so using that each coordinate of p̂n is n^{−1}-sub-Gaussian, we have

    E[‖Z − p‖∞] ≤ E[‖p̂n − p‖∞] + E[‖W‖∞] ≤ √(2 log k/n) + inf_{t≥0} { t + (2k/(nε)) exp(−tnε/2) }
                ≤ √(2 log k/n) + (2 log k)/(nε) + 2/(nε).

In this case, then, whenever ε ≳ (n/log k)^{−1/2}, we obtain a rate of convergence of order √(2 log k/n), which is a bit loose (as we have not controlled the variance of p̂n), but somewhat more satisfying than the k-dependent penalty above. 3
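The ℓ∞ calculation above is also easy to check by simulation; the sketch below (an illustration in Python with numpy, not from the notes) releases a private histogram and records the ℓ∞ error:

import numpy as np

rng = np.random.default_rng(7)
k, n, eps, trials = 50, 100_000, 1.0, 200

p = rng.dirichlet(np.ones(k))
linf = []
for _ in range(trials):
    x = rng.choice(k, size=n, p=p)
    p_hat = np.bincount(x, minlength=k) / n
    z = p_hat + rng.laplace(scale=2 / (n * eps), size=k)  # sensitivity 2/n
    linf.append(np.max(np.abs(z - p)))

print("E||Z - p||_inf ~", np.mean(linf))
print("sqrt(2 log k/n) + (2 log k + 2)/(n eps) =",
      np.sqrt(2 * np.log(k) / n) + (2 * np.log(k) + 2) / (n * eps))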
If Z is ε-differentially private and Y is any (randomized) function of Z, so that X1n → Z → Y forms a Markov chain, then Y remains ε-differentially private: letting q(· | x) denote the density of Z,

    P(Y ∈ A | x) / P(Y ∈ A | x′) = ∫ P(Y ∈ A | z) q(z | x) dµ(z) / ∫ P(Y ∈ A | z) q(z | x′) dµ(z)
                                 ≤ e^ε ∫ P(Y ∈ A | z) q(z | x′) dµ(z) / ∫ P(Y ∈ A | z) q(z | x′) dµ(z) = e^ε.
so that the difference between the samples under H0 and H1 is only in the ith observation Xi ∈ {xi, x′i}. Now, for a channel taking inputs from X n and outputting Z ∈ Z, we define ε-conditional hypothesis testing privacy by saying that for any test Ψ : Z → {0, 1} with acceptance region B = {z | Ψ(z) = 1},

    Q(B | H0, Z ∈ A) + Q(B^c | H1, Z ∈ A) ≥ 1 − ε

for all sets A ⊂ Z satisfying Q(A | H0) > 0 and Q(A | H1) > 0. That is, roughly, no matter what value Z takes on, the probability of error in a test of whether H0 or H1 is true—even with knowledge of xj, j ≠ i—is high. We then have the following proposition.
Proposition 6.6. Assume the channel Q is ε-differentially private. Then Q is also ε̄ = 1 − e−2ε ≤
2ε-conditional hypothesis testing private.
Proof  Let Ψ be any test of H0 versus H1, and let B = {z | Ψ(z) = 1} be the acceptance region of the test. Then

    Q(B | H0, Z ∈ A) + Q(B^c | H1, Z ∈ A) = Q(A, B | H0)/Q(A | H0) + Q(A, B^c | H1)/Q(A | H1)
        ≥ e^{−2ε} Q(A, B | H1)/Q(A | H1) + Q(A, B^c | H1)/Q(A | H1)
        ≥ e^{−2ε} [ Q(A, B | H1) + Q(A, B^c | H1) ]/Q(A | H1),

where the first inequality uses ε-differential privacy (once in the numerator and once in the denominator). Then we simply note that Q(A, B | H1) + Q(A, B^c | H1) = Q(A | H1).
So we see that (roughly), even conditional on the output of the channel, we still cannot test whether
the initial dataset was x or x0 whenever x, x0 differ in only a single observation.
An alternative perspective is to consider a Bayesian one, which allows us to more carefully
consider side information. In this case, we consider the following thought experiment. An adversary
has a set of prior beliefs π on X n , and wishes to test whether a particular value x belongs to a given
sample, which we denote by S for notational convenience. Now, consider the posterior distribution
π(· | Z) induced by observing an output of the channel Z ∼ Q(· | S). We will show that, under
a few mild conditions on the types of priors allowed, that differential privacy guarantees that the
posterior beliefs of the adversary about who belongs to the sample cannot change much. There is
some annoyance in this calculation in that the order of the sample may be important, but it at
least gets toward some semantic interpretation of differential privacy.
We consider the adversary's beliefs on whether a particular value x belongs to the sample, but more precisely, we consider whether Xi = x. We assume that the prior density π on X n satisfies

    π(x1n) = π\i(x\i) πi(xi),    (6.1.5)

that is, the adversary's beliefs about the value Xi are independent of his beliefs about the other members of the dataset. (We assume that π is a density with respect to a measure µ on X n−1 × X, where dµ(s, x) = dµ(s)dµ(x).) Under the condition (6.1.5), we have the following proposition.
Proposition 6.7. Let Q be an ε-differentially private channel and let π be any prior distribution satisfying condition (6.1.5). Then for any z, the posterior density πi on Xi satisfies

    e^{−ε} πi(x) ≤ πi(x | Z = z) ≤ e^ε πi(x).
Proof  We abuse notation, and for a sample s ∈ X n−1, where s = (x1^{i−1}, x_{i+1}^n), we let s ⊕i x = (x1^{i−1}, x, x_{i+1}^n). Letting µ be the base measure on X n−1 × X with respect to which π is a density and q(· | x1n) be the density of the channel Q, we have

    πi(x | Z = z) = ∫_{s∈X^{n−1}} q(z | s ⊕i x) π(s ⊕i x) dµ(s) / ∫_{s∈X^{n−1}} ∫_{x′∈X} q(z | s ⊕i x′) π(s ⊕i x′) dµ(s, x′)
                  ≤(⋆) e^ε ∫_{s∈X^{n−1}} q(z | s ⊕i x) π(s ⊕i x) dµ(s) / ∫_{s∈X^{n−1}} ∫_{x′∈X} q(z | s ⊕i x) π(s ⊕i x′) dµ(s) dµ(x′)
                  = e^ε ∫_{s∈X^{n−1}} q(z | s ⊕i x) π\i(s) dµ(s) πi(x) / ( ∫_{s∈X^{n−1}} q(z | s ⊕i x) π\i(s) dµ(s) ∫_{x′∈X} πi(x′) dµ(x′) )
                  = e^ε πi(x),

where inequality (⋆) follows from ε-differential privacy and the equalities use the independence condition (6.1.5). The lower bound is similar.
There are other versions of prior and posterior beliefs that differential privacy may protect
against. If the channel is invariant to permutations, so that Q(· | xn1 ) = Q(· | (xσ(1) , . . . , xσ(n) )) for
any permutation σ of {1, . . . , n}, then we may change Proposition 6.7 to reflect a semantics more
in line with the question of whether a particular value x belongs to a sample X1n at all, so long
as the adversary’s prior beliefs follow a product distribution that is also appropriately invariant
to permutations. The conditioning and ordering gymnastics necessary for this are a bit tedious,
however, so we omit the development. Roughly, however, we see that Proposition 6.7 captures the
idea that even if an adversary has substantial prior knowledge—in the form of a prior distribution
π on the ith value Xi and everything else in the sample—the posterior cannot change much.
We may devise an alternative view by considering Bayes factors, which measure how much prior
and posterior distributions differ after observations. In this case, we have the following immediate
result.
Proposition 6.8. A channel Q from X n → Z is ε-differentially private if and only if for any prior distribution π on X n, any neighboring samples x, x′ ∈ X n, and any observation z ∈ Z, the posterior odds satisfy

    π(x | z)/π(x′ | z) ≤ e^ε π(x)/π(x′).

Proof  We have π(x | z) = q(z | x)π(x)/q(z), where q is the density (conditional or marginal) of Z ∈ Z. Then

    π(x | z)/π(x′ | z) = ( q(z | x)/q(z | x′) ) ( π(x)/π(x′) ) ≤ e^ε π(x)/π(x′)

for all z, x, x′ if and only if Q is ε-differentially private.
Thus we see that private channels mean that prior and posterior odds between two neighboring
samples cannot change substantially, no matter what the observation Z actually is.
Definition 6.3. Let ε, δ ≥ 0. A channel Q from X n to output space Z is (ε, δ)-differentially private if for all (measurable) sets S ⊂ Z and all neighboring samples x1n ∈ X n and y1n ∈ X n,

    Q(Z ∈ S | x1n) ≤ e^ε Q(Z ∈ S | y1n) + δ.

One typically thinks of δ in the definition above as satisfying δ = δn, where δn ≪ n^{−k} for any k ∈ N. (That is, δ decays super-polynomially to zero.) Some practitioners contend that all real-world differentially private algorithms are in fact (ε, δ)-differentially private: while one may use cryptographically secure random number generators, there is some probability (call this δ) that a cryptographic key may leak, or an encoding may be broken, in the future, making any mechanism (ε, δ)-private at best for some δ > 0.
An alternative definition of privacy is based on Rényi divergences between distributions. These
are essentially simply monotonically transformed f divergences (recall Chapter 2.2), though their
structure is somewhat more amenable to analysis, especially in our contexts. With that in mind,
we define
Definition 6.4. Let P and Q be distributions on a space X with densities p and q (with respect to a measure µ). For α ∈ [1, ∞], the Rényi-α-divergence between P and Q is

    Dα(P||Q) := (1/(α − 1)) log ∫ (p(x)/q(x))^α q(x) dµ(x).

Here, the values α ∈ {1, ∞} are defined in terms of their respective limits.

Rényi divergences satisfy exp((α − 1)Dα(P||Q)) = 1 + Df(P||Q) for the f-divergence defined by f(t) = t^α − 1, so that they inherit a number of the properties of such divergences. We enumerate a few here for later reference.

Proposition 6.9 (Basic facts on Rényi divergence). Rényi divergences satisfy the following properties.

i. The map α ↦ Dα(P||Q) is non-decreasing in α.
ii. limα↓1 Dα (P ||Q) = Dkl (P ||Q) and limα↑∞ Dα (P ||Q) = supx {p(x)/q(x) | q(x) > 0}.
iii. Let K(· | x) be a Markov kernel from X → Z as in Proposition 2.15, and let KP and KQ be
the induced marginals of P and Q under K, respectively. Then Dα (KP ||KQ ) ≤ Dα (P ||Q).
Each of these properties we leave as an exercise to the reader, noting that property i is a conse-
quence of Hölder’s inequality, property ii is by L’Hopital’s rule, and property iii is an immediate
consequence of Proposition 2.15. Rényi divergences also tensorize nicely—generalizing the ten-
sorization properties of KL-divergence and information of Chapter 2 (recall the chain rule (2.1.6)
for KL-divergence)—and we return to this later. As a preview, however, these tensorization proper-
ties allow us to prove that the composition of multiple private data releases remains appropriately
private.
With these preliminaries in place, we can then provide
Definition 6.5 (Rényi differential privacy). Let ε ≥ 0 and α ∈ [1, ∞]. A channel Q from X n to output space Z is (ε, α)-Rényi private if for all neighboring samples x1n, y1n ∈ X n,

    Dα( Q(· | x1n) || Q(· | y1n) ) ≤ ε.

Clearly, any ε-differentially private channel is also (ε, α)-Rényi private for any α ≥ 1; as we soon see, we can provide tighter guarantees than this.
Example 6.10 (Rényi divergence between Gaussian distributions): Consider normal distributions N(µ0, Σ) and N(µ1, Σ). Then

    Dα( N(µ0, Σ) || N(µ1, Σ) ) = (α/2) (µ0 − µ1)^T Σ^{−1} (µ0 − µ1).    (6.2.3)

To see this equality, we compute the appropriate integral of the densities. Let p and q be the densities of N(µ0, Σ) and N(µ1, Σ), respectively. Then letting Eµ1 denote expectation over X ∼ N(µ1, Σ), we have

    ∫ (p(x)/q(x))^α q(x) dx = Eµ1[ exp( −(α/2)(X − µ0)^T Σ^{−1} (X − µ0) + (α/2)(X − µ1)^T Σ^{−1} (X − µ1) ) ]
        =(i) Eµ1[ exp( −(α/2)(µ0 − µ1)^T Σ^{−1} (µ0 − µ1) + α(µ0 − µ1)^T Σ^{−1} (X − µ1) ) ]
        =(ii) exp( −(α/2)(µ0 − µ1)^T Σ^{−1} (µ0 − µ1) + (α²/2)(µ0 − µ1)^T Σ^{−1} (µ0 − µ1) ),

where equality (i) is simply using that (x − a)² − (x − b)² = (a − b)² + 2(b − a)(x − b) and equality (ii) follows because (µ0 − µ1)^T Σ^{−1} (X − µ1) ∼ N(0, (µ1 − µ0)^T Σ^{−1} (µ1 − µ0)) under X ∼ N(µ1, Σ). Noting that −α + α² = α(α − 1) and taking logarithms gives the result. 3
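The identity (6.2.3) can be verified by Monte Carlo; the following sketch (an illustration in Python with numpy, not from the notes) estimates (1/(α − 1)) log Eµ1[(p/q)^α(X)] by sampling and compares it with the closed form:

import numpy as np

rng = np.random.default_rng(8)
d, alpha = 3, 2.5
mu0, mu1 = np.zeros(d), 0.3 * np.ones(d)
Sigma = np.diag([1.0, 2.0, 0.5])
Sinv = np.linalg.inv(Sigma)

X = rng.multivariate_normal(mu1, Sigma, size=1_000_000)
# log(p/q)(x): the Gaussian normalizers cancel since the covariances match
logr = 0.5 * (np.einsum('ij,jk,ik->i', X - mu1, Sinv, X - mu1)
              - np.einsum('ij,jk,ik->i', X - mu0, Sinv, X - mu0))
mc = np.log(np.mean(np.exp(alpha * logr))) / (alpha - 1)

closed = alpha / 2 * (mu0 - mu1) @ Sinv @ (mu0 - mu1)
print("Monte Carlo:", mc, " closed form (6.2.3):", closed)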
Example 6.10 is the key to developing different privacy-preserving schemes under Rényi privacy. Let us reconsider Example 6.3, except that instead of assuming the function f of interest is smooth with respect to the ℓ1 norm, we use the ℓ2-norm.

Example 6.11 (Gaussian mechanisms): Suppose that f : X n → Rd has Lipschitz constant L with respect to the ℓ2-norm (for the Hamming metric dham), that is, ‖f(x) − f(x′)‖2 ≤ L whenever dham(x, x′) ≤ 1. Then the mechanism

    Z = f(X1n) + W, W ∼ N(0, σ²I)

satisfies

    Dα( N(f(x), σ²I) || N(f(x′), σ²I) ) = (α/(2σ²)) ‖f(x) − f(x′)‖2² ≤ (α/(2σ²)) L²

whenever dham(x, x′) ≤ 1. Thus, if we have Lipschitz constant L and desire (ε, α)-Rényi privacy, we may take σ² = L²α/(2ε), and then the mechanism

    Z = f(X1n) + W, W ∼ N(0, (L²α/(2ε)) I)    (6.2.4)

satisfies (ε, α)-Rényi privacy. 3
Certain special cases can make this more concrete. Indeed, suppose we wish to estimate a mean E[X], where Xi ∼iid P for some distribution P such that ‖Xi‖2 ≤ r with probability 1 for some radius r.

Example 6.12 (Bounded mean estimation with Gaussian mechanisms): Letting f(X1n) = X̄n be the sample mean, where the Xi satisfy ‖Xi‖2 ≤ r as above, we see that

    ‖f(x) − f(x′)‖2 ≤ 2r/n

whenever dham(x, x′) ≤ 1. In this case, the Gaussian mechanism (6.2.4) with L = 2r/n yields

    E[‖Z − f(X1n)‖2²] = E[‖W‖2²] = 2dr²α/(n²ε).

Then we have

    E[‖Z − E[X]‖2²] = E[‖f(X1n) − E[X]‖2²] + E[‖Z − f(X1n)‖2²] ≤ r²/n + 2dr²α/(n²ε).

It is not immediately apparent how to compare this quantity to the case for the Laplace mechanism in Example 6.3, but we will return to this shortly once we have developed connections between the various privacy notions we have developed. 3
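Calibrating the Gaussian mechanism (6.2.4) for mean estimation then takes only a few lines; in the sketch below (an illustration in Python with numpy, not from the notes), the noise scale follows σ² = L²α/(2ε) with L = 2r/n:

import numpy as np

rng = np.random.default_rng(9)
n, d, r = 10_000, 20, 1.0
eps, alpha = 0.5, 10.0

x = rng.normal(size=(n, d))
x *= r / np.linalg.norm(x, axis=1, keepdims=True)   # ||X_i||_2 = r

L = 2 * r / n                                       # l2-sensitivity of the mean
sigma2 = L ** 2 * alpha / (2 * eps)                 # calibration (6.2.4)
z = np.mean(x, axis=0) + rng.normal(scale=np.sqrt(sigma2), size=d)

print("added noise variance (total):     ", d * sigma2)
print("predicted 2 d r^2 alpha/(n^2 eps):", 2 * d * r ** 2 * alpha / (n ** 2 * eps))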
Before proving the proposition, let us see its implications for Example 6.12 versus estimation under ε-differential privacy. Let ε ≤ 1, so that roughly to have “similar” privacy, we require that our Rényi private channels satisfy Dα(Q(· | x)||Q(· | x′)) ≤ ε². The ℓ1-sensitivity of the mean satisfies ‖x̄n − x̄′n‖1 ≤ √d ‖x̄n − x̄′n‖2 ≤ √d r/n for neighboring x, x′. Then the Laplace mechanism (6.1.3) satisfies

    E[‖ZLaplace − E[X]‖2²] = E[‖X̄n − E[X]‖2²] + (2r²/(n²ε²)) · d²,

while the Gaussian mechanism under (ε², α)-Rényi privacy will yield

    E[‖ZGauss − E[X]‖2²] = E[‖X̄n − E[X]‖2²] + (2r²/(n²ε²)) · dα.

This is evidently better than the Laplace mechanism whenever α < d.
Proof of Proposition 6.13  We assume that P and Q have densities p and q with respect to a base measure µ, which is no loss of generality, whence the ratio condition implies that e^{−ε} ≤ p/q ≤ e^ε and Dα(P||Q) = (1/(α − 1)) log ∫ (p/q)^α q dµ. We prove the result assuming that α ∈ (1, ∞), as continuity gives the result for α ∈ {1, ∞}.

First, it is clear that Dα(P||Q) ≤ ε always. For the other term in the minimum, let us assume that α ≤ 1 + 1/ε and ε ≤ 1. If either of these fails, the result is trivial, because for α > 1 + 1/ε we have (3/2)αε² ≥ (3/2)ε ≥ ε, and similarly ε ≥ 1 implies (3/2)αε² ≥ ε.

Now we perform a Taylor approximation of t ↦ (1 + t)^α. By Taylor's theorem, we have for any t > −1 that

    (1 + t)^α = 1 + αt + (α(α − 1)/2) (1 + t̃)^{α−2} t²

for some t̃ ∈ [0, t] (or [t, 0] if t < 0). In particular, if 1 + t ≤ c, then (1 + t)^α ≤ 1 + αt + (α(α − 1)/2) max{1, c^{α−2}} t². Now, we compute the divergence: we have

    exp((α − 1) Dα(P||Q)) = ∫ (p(z)/q(z))^α q(z) dµ(z) = ∫ (1 + (p(z)/q(z) − 1))^α q(z) dµ(z)
        ≤ 1 + α ∫ (p(z)/q(z) − 1) q(z) dµ(z) + (α(α − 1)/2) max{1, exp(ε(α − 2))} ∫ (p(z)/q(z) − 1)² q(z) dµ(z)
        ≤ 1 + (α(α − 1)/2) e^{ε[α−2]₊} (e^ε − 1)²,

where the final inequality uses that ∫ (p/q − 1) q dµ = 0 and |p/q − 1| ≤ e^ε − 1. Now, we know that α − 2 ≤ 1/ε − 1 by assumption, so using that log(1 + x) ≤ x, we obtain

    Dα(P||Q) ≤ (α/2) (e^ε − 1)² exp([1 − ε]₊).

Finally, a numerical calculation yields that this quantity is at most (3/2)αε² for ε ≤ 1.
We can also provide connections from (ε, α)-Rényi privacy to (ε, δ)-differential privacy, and then from there to ε-differential privacy. We begin by showing how to develop (ε, δ)-differential privacy out of Rényi privacy. Another way to think about this proposition is that whenever two distributions P and Q are close in Rényi divergence, then there is some limited “amplification” of probabilities that is possible in moving from one to the other.

Proposition 6.15. Let P and Q satisfy Dα(P||Q) ≤ ε. Then for any set A,

    P(A) ≤ exp( ε(α − 1)/α ) Q(A)^{(α−1)/α}.

Consequently, for any δ ∈ (0, 1),

    P(A) ≤ max{ exp( ε + (1/(α − 1)) log(1/δ) ) Q(A), δ }.
Before turning to the proof of the proposition, we show how it can provide prototypical (ε, δ)-
private mechanisms via Gaussian noise addition.
In particular, since the Gaussian mechanism with ℓ_2-sensitivity L and noise variance σ² is
(αL²/(2σ²), α)-Rényi private for every α, Proposition 6.15 shows that the channel is
(αL²/(2σ²) + (1/(α−1)) log(1/δ), δ)-differentially private for any δ > 0 and α > 1. Optimizing first over α by
taking α = 1 + √(2σ² log(1/δ)/L²), we see that the channel is

(L²/(2σ²) + √(2L² log(1/δ)/σ²), δ)-differentially private.

Thus we have that the Gaussian mechanism

Z = f(X_1^n) + W,  W ∼ N(0, σ²I) for σ² = L² max{ 8 log(1/δ)/ε², 1/ε },   (6.2.5)

is (ε, δ)-differentially private.
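In code, the calibration (6.2.5) is a one-liner; the following sketch (with hypothetical function names) assumes only numpy.

    import numpy as np

    def gaussian_sigma(L, eps, delta):
        # Noise scale from (6.2.5): sigma^2 = L^2 * max(8 log(1/delta)/eps^2, 1/eps).
        return L * np.sqrt(max(8.0 * np.log(1.0 / delta) / eps ** 2, 1.0 / eps))

    def gaussian_mechanism(value, L, eps, delta, rng=np.random.default_rng(0)):
        # Release value + N(0, sigma^2 I); (eps, delta)-differentially private
        # when L upper bounds the l2-sensitivity of the released statistic.
        sigma = gaussian_sigma(L, eps, delta)
        return value + sigma * rng.normal(size=np.shape(value))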
To continue with our ℓ_2-bounded mean-estimation in Example 6.12, let us assume that
ε < 8 log(1/δ), in which case the Gaussian mechanism (6.2.5) with L² = r²/n² achieves (ε, δ)-
differential privacy, and we have

E[||Z_Gauss − E[X]||_2^2] = E[||X̄_n − E[X]||_2^2] + O(1) (r²/(n²ε²)) · d log(1/δ).

Comparing to the previous cases, we see an improvement over the Laplace mechanism whenever
log(1/δ) ≪ d, or that δ ≫ e^{−d}. 3
Proof of Proposition 6.15   We use the data processing inequality of Proposition 6.9.iii, which
shows that

ε ≥ D_α(P||Q) ≥ (1/(α−1)) log[ (P(A)/Q(A))^α Q(A) ].

Rearranging and taking exponentials, we immediately obtain the first claim of the proposition.
For the second, we require a bit more work. First, let us assume that Q(A) > e^{−ε} δ^{α/(α−1)}. Then
we have by the first claim of the proposition that

P(A) ≤ exp( (α−1)ε/α + (1/α) log(1/Q(A)) ) Q(A)
  ≤ exp( (α−1)ε/α + (1/α)[ε + (α/(α−1)) log(1/δ)] ) Q(A) = exp( ε + (1/(α−1)) log(1/δ) ) Q(A).
On the other hand, when Q(A) ≤ e^{−ε} δ^{α/(α−1)}, then again using the first result of the proposition,

P(A) ≤ exp( ((α−1)/α)(ε + log Q(A)) )
  ≤ exp( ((α−1)/α)[ε − ε + (α/(α−1)) log δ] ) = δ.

This gives the second claim of the proposition.
Finally, we develop our last set of connections, which show how we may relate (ε, δ)-private
channels with ε-private channels. To provide this definition, we require one additional weakened
notion of divergence, which relates (ε, δ)-differential privacy to Rényi-α-divergence with α = ∞.
We define

D_∞^δ(P||Q) := sup_{S⊂X} { log[(P(S) − δ)/Q(S)] : P(S) > δ },

where the supremum is over measurable sets. Evidently equivalent to this definition is that
D_∞^δ(P||Q) ≤ ε if and only if P(S) ≤ e^ε Q(S) + δ for all measurable S ⊂ X.
Lemma 6.18. Let ε > 0 and δ ∈ (0, 1), and let P and Q be distributions on a space X.

(i) We have D_∞^δ(P||Q) ≤ ε if and only if there exists a probability distribution R on X such that
||P − R||_TV ≤ δ and D_∞(R||Q) ≤ ε.

(ii) We have D_∞^δ(P||Q) ≤ ε and D_∞^δ(Q||P) ≤ ε if and only if there exist distributions P_0 and Q_0
such that

||P − P_0||_TV ≤ δ/(1 + e^ε),  ||Q − Q_0||_TV ≤ δ/(1 + e^ε),

and

D_∞(P_0||Q_0) ≤ ε and D_∞(Q_0||P_0) ≤ ε.
The proof of the lemma is technical, so we defer it to Section 6.5.1. The key application of the
lemma—which we shall see presently—is that (ε, δ)-differentially private algorithms compose in
elegant ways.
Proposition 6.19. Let Q be an (ε, δ)-differentially private channel from X^n → Z. Then for any
neighboring x_0, x, x′ ∈ X^n, we have with probability at least 1 − δ over the draw of Z ∼ Q(· | x_0)
that the posterior odds satisfy

π(x | z)/π(x′ | z) ≤ e^{3ε} π(x)/π(x′).
Deferring the proof momentarily, this result shows that as long as two samples x, x′ are neighboring,
an adversary is extremely unlikely to be able to glean substantially distinguishing information
between the samples. This is suggestive of a heuristic in differential privacy that if n is the sample
size, then one should take δ ≪ 1/n to limit the probability of disclosure: by a union bound, we see
that for each individual i ∈ {1, . . . , n}, we can simultaneously guarantee that the posterior odds
for swapping individual i's data do not change much (with high probability).
Unsurprisingly at this point, we can also give posterior update bounds for Rényi differential
privacy. Here, instead of giving high-probability bounds—though it is possible—we can show that
moments of the odds ratio do not change significantly. Indeed, we have the following proposition:
Proposition 6.20. Let Q be an (ε, α)-Rényi private channel from X^n → Z, where α ∈ (1, ∞).
Then for any neighboring x_0, x, x′ ∈ X^n, we have

E_0[ ( π(x | Z)/π(x′ | Z) )^{α−1} ]^{1/(α−1)} ≤ e^ε π(x)/π(x′),

where E_0 denotes expectation over Z ∼ Q(· | x_0).
Proposition 6.20 communicates a similar message to our previous results in this vein: even if we get
information from the output of the private mechanism on some sample x_0 ∈ X^n near the samples
(datasets) of interest x, x′ that an adversary wishes to distinguish, it is impossible to update beliefs
by much. The parameter α then controls the degree of difficulty of this "impossible" claim, which
one can see by (for example) applying a Chebyshev-type bound to the posterior and prior odds
ratios.
We now turn to the promised proofs of Propositions 6.19 and 6.20. To prove the former, we
require a definition.
Definition 6.6. Distributions P and Q on a space X are (ε, δ)-close if for all measurable A,

P(A) ≤ e^ε Q(A) + δ and Q(A) ≤ e^ε P(A) + δ.

Letting p and q denote their densities (with respect to any shared base measure), they are (ε, δ)-
pointwise close if the set

A := {x ∈ X : e^{−ε} q(x) ≤ p(x) ≤ e^ε q(x)} = {x ∈ X : e^{−ε} p(x) ≤ q(x) ≤ e^ε p(x)}

satisfies P(A) ≥ 1 − δ and Q(A) ≥ 1 − δ.
The following lemma shows that closeness and approximate differential privacy are strongly
related.
Lemma 6.21. If P and Q are (ε, δ)-close, then for any β > 0, the sets

A_+ := {x : p(x) > e^{(1+β)ε} q(x)} and A_− := {x : p(x) ≤ e^{−(1+β)ε} q(x)}

satisfy

max{P(A_+), Q(A_−)} ≤ e^{βε} δ/(e^{βε} − 1),  max{P(A_−), Q(A_+)} ≤ e^{−ε} δ/(e^{βε} − 1).

Conversely, if P and Q are (ε, δ)-pointwise close, then

P(A) ≤ e^ε Q(A) + δ and Q(A) ≤ e^ε P(A) + δ

for all sets A.
Proof   Let A = A_+ = {x : p(x) > e^{(1+β)ε} q(x)}. Then

P(A) ≤ e^ε Q(A) + δ ≤ e^{−βε} P(A) + δ,

so that P(A) ≤ δ/(1 − e^{−βε}). Similarly,
Lemma 6.22. Let P_0, P_1, P_2 be distributions on a space X, each pair (ε, δ)-close. Then for any i, j, k
with j ≠ k, the set

A_jk := { x ∈ X : log(p_j(x)/p_k(x)) > 3ε }  satisfies  P_i(A_jk) ≤ Cδ max{ε^{−1}, 1}.

Taking everything to the 1/(α − 1) power then gives the result.
Theorem 6.23. Let the conditions above hold, ε_i < ∞ for i = 1, . . . , n, and α ∈ [1, ∞]. Assume
that conditional on z_1^{i−1}, we have D_α(P_i(· | z_1^{i−1})||Q_i(· | z_1^{i−1})) ≤ ε_i. Then

D_α(P_1^n||Q_1^n) ≤ Σ_{i=1}^n ε_i.
Proof   We assume without loss of generality that the conditional distributions P_i(· | z_1^{i−1}) and
Q_i(· | z_1^{i−1}) are absolutely continuous with respect to a base measure µ on Z. Then we have

D_α(P_1^n||Q_1^n) = (1/(α−1)) log ∫ ∏_{i=1}^n ( p_i(z_i | z_1^{i−1}) / q_i(z_i | z_1^{i−1}) )^α q_i(z_i | z_1^{i−1}) dµ^n(z_1^n)

  = (1/(α−1)) log ∫_{Z^{n−1}} [ ∫ ( p_n(z_n | z_1^{n−1}) / q_n(z_n | z_1^{n−1}) )^α q_n(z_n | z_1^{n−1}) dµ(z_n) ] ∏_{i=1}^{n−1} (p_i/q_i)^α q_i dµ^{n−1}

  ≤ (1/(α−1)) log [ exp((α−1)ε_n) ∫_{Z^{n−1}} ∏_{i=1}^{n−1} ( p_i(z_i | z_1^{i−1}) / q_i(z_i | z_1^{i−1}) )^α q_i(z_i | z_1^{i−1}) dµ^{n−1}(z_1^{n−1}) ]

  = ε_n + D_α(P_1^{n−1}||Q_1^{n−1}).

Applying the same argument inductively gives the result.
Figure 6.1. The privacy game. Step i: the adversary chooses an arbitrary space X, n ∈ N, and two
datasets x^{(0)}, x^{(1)} ∈ X^n with d_ham(x^{(0)}, x^{(1)}) ≤ 1 (remaining steps omitted). In this game, the
adversary may not directly observe the private b ∈ {0, 1}.
Definition 6.7. Let δ > 0. Then a collection Q of channels satisfies (ε, δ)-differential privacy under
k-fold adaptive composition if D_∞^δ(Q^{(0)}||Q^{(1)}) ≤ ε and D_∞^δ(Q^{(1)}||Q^{(0)}) ≤ ε.
By considering a special case centered around a particular individual in the game of Fig. 6.1, we can gain
some intuition for the definition. Indeed, suppose that an individual has some data x0 ; in each
round of the game the adversary generates two datasets, one containing x0 and the other identical
except that x0 is removed. Then satisfying Definition 6.7 captures the intuition that an individual’s
privacy remains protected, even in the face of multiple (private) accesses of the individual’s data.
As an immediate corollary to Theorem 6.23, we then have the following.
Corollary 6.24. Assume that each channel in the game in Fig. 6.1 is (ε_i, α)-Rényi private. Then
the arbitrary composition of k such channels remains (Σ_{i=1}^k ε_i, α)-Rényi private.
More sophisticated corollaries are possible once we start to use the connections between privacy
measures we outline in Section 6.2.2. In this case, we can develop so-called advanced composition
rules, which sometimes suggest that privacy degrades more slowly than might be expected under
adaptive composition.
Corollary 6.25. Assume that each channel in the game in Fig. 6.1 is ε-differentially private. Then
the composition of k such channels is kε-differentially private. Additionally, the composition of k
such channels is

( (3k/2) ε² + √(6k log(1/δ)) · ε,  δ )

-differentially private for all δ > 0.
Proof   The first claim is immediate: for Q^{(0)}, Q^{(1)} as in Definition 6.7, we know that
D_α(Q^{(0)}||Q^{(1)}) ≤ kε for all α ∈ [1, ∞] by Theorem 6.23 coupled with Proposition 6.13 (or Corollary 6.14).
For the second claim, we require a bit more work. Here, we use the bound (3α/2)ε² in the Rényi
privacy bound of Corollary 6.14. Then we have for any α ≥ 1 that

D_α(Q^{(0)}||Q^{(1)}) ≤ (3kα/2) ε²

by Theorem 6.23. Now we apply Proposition 6.15 and Corollary 6.16, which allow us to conclude
(ε, δ)-differential privacy from Rényi privacy. Indeed, by the preceding display, setting α = 1 + η,
we have that the composition is ((3k/2)ε² + (3kη/2)ε² + (1/η) log(1/δ), δ)-differentially private for all η > 0 and
δ > 0. Optimizing over η gives the second result.
Corollary 6.26. Assume that each channel in the game in Fig. 6.1 is (ε, δ)-differentially private.
Then the composition of k such channels is (kε, kδ)-differentially private. Additionally, the composition is

( (3k/2) ε² + √(6k log(1/δ_0)) · ε,  δ_0 + kδ/(1 + e^ε) )

-differentially private for all δ_0 > 0.
Proof   Consider the channels Q_i in Fig. 6.1. As each satisfies D_∞^δ(Q_i(· | x^{(0)})||Q_i(· | x^{(1)})) ≤ ε
and D_∞^δ(Q_i(· | x^{(1)})||Q_i(· | x^{(0)})) ≤ ε, Lemma 6.18 guarantees the existence (at each sequential step,
which may depend on the preceding i − 1 outputs) of probability measures Q_i^{(0)} and Q_i^{(1)} such that

D_∞(Q_i^{(1−b)}||Q_i^{(b)}) ≤ ε,  ||Q_i^{(b)} − Q_i(· | x^{(b)})||_TV ≤ δ/(1 + e^ε) for b ∈ {0, 1}.

Note that by construction (and Theorem 6.23) we have

D_α( Q_1^{(b)} ··· Q_k^{(b)} || Q_1^{(1−b)} ··· Q_k^{(1−b)} ) ≤ min{ (3kα/2) ε², kε },

where Q^{(b)} denotes the joint distribution on Z_1, . . . , Z_k under bit b. We also have
by the triangle inequality that ||Q_1^{(b)} ··· Q_k^{(b)} − Q^{(b)}||_TV ≤ kδ/(1 + e^ε) for b ∈ {0, 1}. (See Exer-
cise 2.16.) As a consequence, we see as in the proof of Corollary 6.25 that the composition is
( (3k/2)ε² + (3kη/2)ε² + (1/η) log(1/δ_0),  δ_0 + kδ/(1 + e^ε) )-differentially private for all η > 0 and δ_0 > 0. Optimizing
gives the result.
As a consequence of these results, we see that whenever the privacy parameter ε < 1, it is
possible to compose multiple privacy mechanisms together and have privacy penalty scaling only
as the worse of √k ε and kε², which is substantially better than the "naive" bound of kε. Of course,
a challenge here—relatively infrequently discussed in the privacy literature—is that when ε ≥ 1,
which is a frequent case for practical deployments of privacy, all of these bounds are much worse
than the naive bound that the k-fold composition of ε-differentially private algorithms is kε-differentially
private.
and µ is a finite measure, making the last assumption trivial.) That is, we release Z with probability
proportional to exp(−(ε/L) ℓ(x, z)). It is evident that the mechanism (6.4.1) is 2ε-differentially private:
for any x, x′ with d_ham(x, x′) ≤ 1, we have

Q(A | x)/Q(A | x′) = [ ∫ exp(−(ε/L)ℓ(x′,z)) dµ(z) · ∫_A exp(−(ε/L)ℓ(x,z)) dµ(z) ] / [ ∫ exp(−(ε/L)ℓ(x,z)) dµ(z) · ∫_A exp(−(ε/L)ℓ(x′,z)) dµ(z) ]
  ≤ sup_{z∈Z} exp( (ε/L)[ℓ(x,z) − ℓ(x′,z)] ) · sup_{z∈A} exp( (ε/L)[ℓ(x′,z) − ℓ(x,z)] ) ≤ exp(2ε).
As a first (somewhat trivial) example, we can recover the Laplace mechanism:

Example 6.27 (The Laplace mechanism): We can recover Example 6.3 through the exponen-
tial mechanism. Indeed, suppose that we wish to release f : X^n → R^d, where f has sensitivity
L. Then taking z ∈ R^d, ℓ(x, z) = ||f(x_1^n) − z||_1, and µ to be the usual Lebesgue measure on
R^d, the exponential mechanism simply uses density

q(z | x) ∝ exp( −(ε/L) ||f(x_1^n) − z||_1 ),

which is the Laplace mechanism. 3
One challenge with the exponential mechanism (6.4.1) is that it is somewhat abstract and is
often hard to compute, as it requires evaluating an often high-dimensional integral to sample from.
Yet it provides a nice abstract mechanism with strong privacy guarantees and, as we shall see, good
utility guarantees. For the moment, we defer further examples and provide utility guarantees when
µ(Z) is finite, giving bounds based on the measure of "bad" solutions. For notational convenience,
we define the optimal value

ℓ*(x) = inf_{z∈Z} ℓ(x, z)

and the sublevel sets S_t := {z ∈ Z : ℓ(x, z) ≤ ℓ*(x) + t}.

Proposition 6.28. Let Z be drawn from the exponential mechanism (6.4.1). Then

ℓ(x, Z) ≤ ℓ*(x) + 2t

with probability at least 1 − exp( −εt/L + log(µ(Z)/µ(S_t)) ).
Proof   Assume without loss of generality (by scaling) that the global Lipschitzian (sensitivity)
constant of ℓ is L = 1. Then for Z ∼ Q(· | x), we have

P(ℓ(x,Z) ≥ ℓ*(x) + 2t) = ∫_{S_{2t}^c} exp(−εℓ(x,z)) dµ(z) / ∫ exp(−εℓ(x,z)) dµ(z)
  = ∫_{S_{2t}^c} exp(−ε(ℓ(x,z) − ℓ*(x))) dµ(z) / ∫ exp(−ε(ℓ(x,z) − ℓ*(x))) dµ(z)
  ≤ ∫_{S_{2t}^c} exp(−2εt) dµ(z) / ∫_{S_t} exp(−ε(ℓ(x,z) − ℓ*(x))) dµ(z)
  ≤ exp(−εt) µ(S_{2t}^c)/µ(S_t),

which gives the result since µ(S_{2t}^c) ≤ µ(Z).
We can provide a few simplifications of this result in different special cases. For example, if Z
is finite with cardinality card(Z), then Proposition 6.28, applied with µ the counting measure on Z,
yields the following corollary.
Corollary 6.29. In addition to the conditions in Proposition 6.28, assume that card(Z) is finite.
Then for any u ∈ (0, 1), with probability at least 1 − u,

ℓ(x, Z) ≤ ℓ*(x) + (2L/ε) log( card(Z)/u ).

That is, with extremely high probability, the excess loss of Z from the exponential mechanism is at most
logarithmic in card(Z) and grows only linearly with the global sensitivity L.
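When Z is finite, sampling from the exponential mechanism is straightforward; the sketch below (function name ours) normalizes the weights directly, shifting by the minimum loss for numerical stability.

    import numpy as np

    def exponential_mechanism(losses, L, eps, rng=np.random.default_rng(0)):
        # Sample index z with P(z) proportional to exp(-(eps/L) * losses[z]),
        # i.e. the mechanism (6.4.1) with mu the counting measure; the release
        # is 2*eps-differentially private when L bounds the loss sensitivity.
        losses = np.asarray(losses, dtype=float)
        logits = -(eps / L) * (losses - losses.min())   # shift for stability
        weights = np.exp(logits)
        return int(rng.choice(len(losses), p=weights / weights.sum()))

By Corollary 6.29, the sampled index has loss within (2L/ε) log(card(Z)/u) of the minimum with probability at least 1 − u.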
A second corollary allows us to bound the expected loss of the exponential mechanism, assuming
we have some control over the measure of the sublevel sets S_t.

Corollary 6.30. Let t ≥ 0 be the smallest scalar such that t ≥ (2L/ε) log(µ(Z)/µ(S_t)) and t ≥ L/ε. Then Z
drawn from the exponential mechanism (6.4.1) satisfies

E[ℓ(x, Z)] ≤ ℓ*(x) + t + 2L/ε ≤ ℓ*(x) + 3t ≤ ℓ*(x) + O(1) (L/ε) log( 1 + µ(Z)/µ(S_t) ).
Proof   For any t_0 ≥ 0, we have

E[ℓ(x,Z) − ℓ*(x)] ≤ t_0 + ∫_{t_0}^∞ P(ℓ(x,Z) − ℓ*(x) ≥ t) dt
  = t_0 + 2 ∫_{t_0/2}^∞ P(ℓ(x,Z) − ℓ*(x) ≥ 2t) dt
  ≤ t_0 + 2 ∫_{t_0/2}^∞ exp( −εt/L + log(µ(Z)/µ(S_t)) ) dt
  ≤ t_0 + 2e^ρ ∫_{t_0/2}^∞ exp(−εt/L) dt = t_0 + (2L/ε) exp( ρ − εt_0/(2L) ),

where ρ := log(µ(Z)/µ(S_{t_0/2})) upper bounds log(µ(Z)/µ(S_t)) for t ≥ t_0/2. An appropriate choice
of t_0 (of order t) then gives the corollary.
Corollary 6.30 may seem a bit circular: we require the ratio µ(Z)/µ(St ) to be controlled—but it
is relatively straightforward to use it (and Proposition 6.28) with a bit of care and standard bounds
on volumes.
Example 6.31 (Empirical risk minimization via the exponential mechanism): We consider
the empirical risk minimization problem, where we have losses ℓ : Θ × X → R_+, where Θ ⊂ R^d
is a parameter space of interest, and we wish to choose

θ̂_n ∈ argmin_{θ∈Θ} { L_n(θ, x) := (1/n) Σ_{i=1}^n ℓ(θ, x_i) }.

We make a few standard assumptions: first, for simplicity, that n is large enough that n ≥ d/ε.
We also assume that Θ ⊂ R^d is an ℓ_2-ball of radius R, that θ ↦ ℓ(θ, x_i) is M-Lipschitz for all
x_i, and that ℓ(θ, x_i) ∈ [0, 2MR] for all θ ∈ Θ. (Note that this last is no loss of generality, as
ℓ(θ, x_i) − inf_{θ∈Θ} ℓ(θ, x_i) ≤ M sup_{θ,θ′∈Θ} ||θ − θ′||_2 ≤ 2MR.)
Take the empirical loss L_n(θ, x) as our criterion function for the exponential mechanism, which
evidently satisfies |L_n(θ, x) − L_n(θ, x′)| ≤ 2MR/n whenever d_ham(x, x′) ≤ 1, so that we release θ
with density

q(θ | x) ∝ exp( −(nε/(2MR)) L_n(θ, x) ).

Let θ̂_n be the empirical minimizer as above; then by the Lipschitz continuity of ℓ, the sublevel
set S_t evidently satisfies

S_t ⊃ { θ ∈ Θ : ||θ − θ̂_n||_2 ≤ t/M }.

Then a volume calculation (with the factor of 2 necessary because we may have θ̂_n on the
boundary of Θ) yields that for µ the Lebesgue measure,

µ(S_t)/µ(Z) ≥ ( t/(2MR) )^d.
As a consequence, by Corollary 6.30, whenever t ≥ O(1) (MR/(nε)) · d log(2MR/t), we have
E[L_n(θ, x) | x] ≤ L_n(θ̂_n, x) + 3t. Evidently the choice t = O(1) (MRd/(nε)) log(nε/d) suffices whenever
d/(nε) ≤ 1, so we obtain

E[L_n(θ, x)] ≤ L_n(θ̂_n, x) + O(1) (MRd/(nε)) log(nε/d),

whenever d/(nε) ≤ 1. Notably, standard empirical risk minimization (recall Chapter 3.3.2) typ-
ically achieves rates of convergence roughly of MR/√n, so that the gap of the exponential
mechanism is of lower order whenever d/(√n ε) ≤ 1. 3
where ℓ(·; x) is convex for each x ∈ X and E_P denotes expectation taken over X ∼ P. A standard
approach to solving problems of the form (6.4.2) is to use the stochastic gradient method, which
iterates as follows for k = 1, 2, . . ., beginning from some θ_0 ∈ Θ:

i. Draw X_k ∼ P, i.i.d.
ii. Compute the stochastic gradient g_k = ∇_θ ℓ(θ_k; X_k).
iii. Update

θ_{k+1} = Proj_Θ(θ_k − η_k g_k),   (6.4.3)

where η_k > 0 is a non-increasing sequence of stepsizes and Proj_Θ denotes Euclidean projection onto
Θ, that is,

Proj_Θ(θ′) = argmin_{θ∈Θ} ||θ − θ′||_2^2.
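In code, the iteration (6.4.3) is only a few lines; the sketch below assumes Θ is an ℓ_2-ball and that grad and sample are user-supplied callables (our notation, chosen for illustration).

    import numpy as np

    def project_l2(theta, radius):
        # Euclidean projection onto Theta = {theta : ||theta||_2 <= radius}.
        nrm = np.linalg.norm(theta)
        return theta if nrm <= radius else theta * (radius / nrm)

    def projected_sgd(grad, sample, theta0, radius, eta0=1.0, beta=0.75, T=1000):
        # Iteration (6.4.3) with stepsizes eta_k = eta0 * k^{-beta}; returns the
        # averaged iterate, as analyzed in Proposition 6.32.
        theta = np.array(theta0, dtype=float)
        avg = np.zeros_like(theta)
        for k in range(1, T + 1):
            g = grad(theta, sample())            # unbiased stochastic gradient
            theta = project_l2(theta - eta0 * k ** (-beta) * g, radius)
            avg += (theta - avg) / k             # running average of the iterates
        return avg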
The analysis of such stochastic gradient procedures constitutes an entire field on its own. The
important fact is that in the iteration (6.4.3), it is unimportant that g_k = ∇_θ ℓ(θ_k; X_k) precisely;
all that is required is that we have unbiased gradient estimates, E[g_k | θ_k] = ∇L(θ_k). To keep
matters simple, we present one typical type of result, which we do not prove. In the theorem, we
assume that the stochastic gradients take the form g = g(θ, X, W) for some random variable W independent of
X and θ.
Proposition 6.32 (Bach and Moulines [14], Theorem 3). Assume that θ* = argmin_θ L(θ) belongs
to the interior of Θ, is unique, and that ∇²L(θ*) ≻ 0. Let the stepsizes η_k = η_0 k^{−β} for some
β ∈ (1/2, 1). Additionally assume that θ ↦ ∇ℓ(θ; x) is L(x)-Lipschitz on Θ with E[L(X)²] < ∞
and that θ ↦ ∇²ℓ(θ; x) is Lipschitz on Θ. Then

E[ ||θ̄_n − θ*||_2^2 ]^{1/2} = √( tr(∇²L(θ*)^{−1} Σ ∇²L(θ*)^{−1}) / n ) + O( 1/n^{1−β/2} ),

where θ̄_n = (1/n) Σ_{k=1}^n θ_k and Σ = Cov(g(θ*, X, W)).
This result is in fact optimal—no method can achieve better convergence in n in the leading term
when g(θ*, X, W) = ∇ℓ(θ*; X)—and shows that the convergence rate slows as the covariance
Σ of the stochastic gradients taken at θ* grows.
The importance of this result is that we can develop locally private procedures for fitting large-
scale models using the iteration (6.4.3) by adding noise to, or appropriately limiting, the stochastic
gradients ∇ℓ(θ; x). Indeed, a natural strategy is, at each iteration (6.4.3), to perturb the stochastic
gradients via some (conditionally) mean-zero noise sufficient to guarantee some type of privacy. We
consider a specialized version of this problem, where we assume the stochastic gradient vectors
belong to the ℓ_2-ball of radius M, so that ||∇ℓ(θ; x)||_2 ≤ M for all θ ∈ Θ and x ∈ X. We wish to
develop a scheme providing ε-local differential privacy for individual contributors of data points x.
In this case, the first idea might be to add independent Laplacian noise, but (as we have seen in
Example 6.3) this may add noise of too large a magnitude. Instead, we develop a new mechanism
based on uniform sampling on the sphere S^{d−1} = {u ∈ R^d : ||u||_2 = 1} ⊂ R^d.
We begin with a vector v ∈ R^d. Then the mechanism proceeds as follows: set

T = 1 with probability e^ε/(e^ε + 1), and T = 0 otherwise;

conditional on T = 1, draw W uniformly from the hemisphere {u ∈ S^{d−1} : ⟨u, v⟩ ≥ 0} in the
direction of v, while conditional on T = 0, draw W uniformly from the opposite hemisphere (this is
the mechanism (6.4.4)). For U uniform on S^{d−1}, define the constant

C_d := E[U_1 | U_1 ≥ 0] = (2 Γ(d/2 + 1)) / (√π d Γ((d−1)/2 + 1)) ≳ 1/√d,

where the inequality is a consequence of Stirling's approximation to the gamma function. (The
first coordinate U_1 has the same distribution as 2B − 1, where B ∼ Beta((d−1)/2, (d−1)/2).)
With this derivation, we see how we may define a channel that preserves ε-local privacy and
computes unbiased stochastic gradients. At iteration k, we let g_k = ∇ℓ(θ_k; X_k) as in the itera-
tion (6.4.3). Then we scale g_k, which satisfies ||g_k||_2 ≤ M, so that it lies on the surface of the ball:
we set

g̃_k = +g_k/||g_k||_2 with probability 1/2 + ||g_k||_2/(2M), and g̃_k = −g_k/||g_k||_2 with probability 1/2 − ||g_k||_2/(2M),

so that E[g̃_k | g_k] = g_k/M. Then, given this vector, we draw W_k according to the mechanism (6.4.4)
with v = g̃_k, and set

Z_k = (M/C_d) ((e^ε + 1)/(e^ε − 1)) W_k = M ((e^ε + 1)/(e^ε − 1)) (√π d Γ((d−1)/2 + 1))/(2 Γ(d/2 + 1)) W_k,   (6.4.5)

which then satisfies E[Z_k | g_k] = g_k, so that it is a valid stochastic gradient.
Figure 6.2. Local ε-differentially private sampling of a vector v on the surface of the ℓ_2-ball. With
probability e^ε/(1 + e^ε), draw W uniformly from the hemisphere in the direction of v; with probability
1/(1 + e^ε), draw W uniformly from the opposite hemisphere.
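The following is a minimal sketch of the full privatization step (6.4.5): randomized rounding of g_k to the sphere, hemisphere sampling as in Figure 6.2, and debiasing by the constant C_d. The helper name is ours, and corner cases (e.g. g = 0) are handled in the simplest way.

    import math
    import numpy as np

    def private_gradient(g, M, eps, rng=np.random.default_rng(0)):
        # Locally eps-DP, unbiased release of a gradient with ||g||_2 <= M,
        # following (6.4.4)-(6.4.5).
        d = g.shape[0]
        # 1. Randomized rounding to the sphere: E[gtilde | g] = g / M.
        nrm = np.linalg.norm(g)
        if nrm == 0:
            gtilde = rng.standard_normal(d)
            gtilde /= np.linalg.norm(gtilde)     # arbitrary direction; unbiased by symmetry
        else:
            sign = 1.0 if rng.random() < 0.5 + nrm / (2 * M) else -1.0
            gtilde = sign * g / nrm
        # 2. Hemisphere sampling: uniform toward gtilde w.p. e^eps / (1 + e^eps).
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        toward = rng.random() < math.exp(eps) / (1.0 + math.exp(eps))
        if (w @ gtilde < 0) == toward:
            w = -w
        # 3. Debias by C_d = 2 Gamma(d/2+1) / (sqrt(pi) d Gamma((d-1)/2+1)).
        log_Cd = (math.log(2.0) + math.lgamma(d / 2 + 1)
                  - 0.5 * math.log(math.pi) - math.log(d) - math.lgamma((d - 1) / 2 + 1))
        return (M * (math.exp(eps) + 1) / (math.exp(eps) - 1) / math.exp(log_Cd)) * w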
Combining the mechanism (6.4.5) with the stochastic gradient iteration (6.4.3), where we replace
g_k with Z_k via

θ_{k+1} = Proj_Θ(θ_k − η_k Z_k),   (6.4.6)

we have the following corollary to Proposition 6.32.

Corollary 6.33. Let the conditions of Proposition 6.32 hold. Then the private stochastic gradient
iteration (6.4.6) satisfies

E[ ||θ̄_n − θ*||_2^2 ]^{1/2} ≤ c √d M ((e^ε + 1)/(e^ε − 1)) √( tr(∇²L(θ*)^{−2}) / n ) + O( 1/n^{1−β/2} )

for a numerical constant c ≤ 2. There exist problems for which this inequality is sharp to within a
numerical constant.
where Σ* = E[ ∇ℓ(θ*; X) ∇ℓ(θ*; X)^T / ||∇ℓ(θ*; X)||_2^2 ] satisfies tr(Σ*) = 1 and Σ* ⪯ I. Consequently,
we have

Σ := E[Z_∞ Z_∞^T] ⪯ c ( M (e^ε + 1)/(e^ε − 1) )² d · I,

where c ≤ 4 is a numerical constant. Substituting this in Proposition 6.32 gives the corollary.
The sharpness of the result comes from considering estimation of the mean of vectors X drawn
uniformly from the unit sphere S^{d−1} with loss ℓ(θ; x) = (1/2)||θ − x||_2^2.
Let us inspect and understand this quantity a bit, considering the time (or sample size) it
takes to solve the problem (6.4.2) with and without privacy. Assume for simplicity that Σ =
Cov(∇ℓ(θ*; X)) ≈ (M²/d) I, which is the natural scaling for vectors satisfying ||∇ℓ(θ*; X)||_2 ≤ M
that are (roughly) isometric, that is, have approximately scaled identity covariance. Then for a fixed
accuracy γ = E[||θ̄_n − θ*||_2^2]^{1/2}, if N(γ) denotes the sample size necessary to solve problem (6.4.2)
to accuracy γ in the non-private case, so that we have roughly

N(γ) ≈ (M²/d) tr(∇²L(θ*)^{−2}) / γ²,

then in the locally private case the necessary sample size for our scheme (6.4.6) is

N_priv(γ) ≳ (M² tr(∇²L(θ*)^{−2}) / γ²) ((e^ε + 1)/(e^ε − 1))² = d ((e^ε + 1)/(e^ε − 1))² · N(γ).

That is, for ε ≲ 1, there is a degradation in sample complexity of N(γ) ↦ dN(γ)/ε². As we shall
see later in the lecture notes, this degradation is essentially unavoidable.
Sparse vectors
6.5.1 Proof of Lemma 6.18

P(S) ≤ R(S) + δ ≤ e^ε Q(S) + δ,  i.e.  log[ (P(S) − δ)/Q(S) ] ≤ ε,

which is equivalent to D_∞^δ(P||Q) ≤ ε. Now, let us assume that D_∞^δ(P||Q) ≤ ε, whence we must
construct the distribution R.
We assume w.l.o.g. that P and Q have densities p, q, and define the sets
On these sets, we have 0 ≤ P (S) − eε Q(S) ≤ δ by assumption, and we then define a distribution R
with density that we partially specify via
In particular, when x ∈ T , we may take the density r so that p(x) ≤ r(x) ≤ q(x), as
by the inequalities (6.5.1), and so that R(X ) = 1. With this, we evidently have r(x) ≤ eε q(x) by
construction, and because S ⊂ T c , we have
by assumption.
Now, we turn to the second statement of the lemma. We start with the easy direction, where
we assume that P_0 and Q_0 satisfy D_∞(P_0||Q_0) ≤ ε and D_∞(Q_0||P_0) ≤ ε, as well as ||P − P_0||_TV ≤ δ/(1 + e^ε)
and ||Q − Q_0||_TV ≤ δ/(1 + e^ε). Then for any set S we have

P(S) ≤ P_0(S) + δ/(1 + e^ε) ≤ e^ε Q_0(S) + δ/(1 + e^ε) ≤ e^ε Q(S) + e^ε δ/(1 + e^ε) + δ/(1 + e^ε) = e^ε Q(S) + δ,
α := P(S) − P_1(S) = P(S) − (e^ε/(1 + e^ε))(P(S) + Q(S)) = (P(S) − e^ε Q(S))/(1 + e^ε) ≤ δ/(1 + e^ε),

and similarly

α′ := Q(S′) − Q_1(S′) = (Q(S′) − e^ε P(S′))/(1 + e^ε) ≤ δ/(1 + e^ε).

Note also that we have P(S) − P_1(S) = Q_1(S) − Q(S) and Q(S′) − Q_1(S′) = P_1(S′) − P(S′) by
construction.
We assume w.l.o.g. that α ≥ α′, so that if β = α − α′ ≥ 0, we have β ≤ δ/(1 + e^ε), and we have the
sandwiching

where

P_0(S ∪ T) = P_1(S ∪ T) + β and Q_0(S ∪ T) = Q_1(S ∪ T) − β.   (6.5.2)

With these choices, we evidently obtain Q_0(X) = P_0(X) = 1 and that D_∞(P_0||Q_0) ≤ ε and
D_∞(Q_0||P_0) ≤ ε by construction. It remains to consider the variation distances. As p_0 = p on T′,
we have

||P − P_0||_TV = (1/2) ∫_S |p − p_0| + (1/2) ∫_{S′} |p − p_0| + (1/2) ∫_T |p − p_0|
  = (1/2)(P(S) − P_0(S)) + (1/2)(P_0(S′) − P(S′)) + (1/2)(P_0(T) − P(T))
  ≤ (1/2)(P(S) − P_1(S)) + (1/2)(P_0(S′) − P(S′)) + (1/2)(P_0(T) − P(T)),

where the three terms in the last line are α, α′, and at most β, respectively. The claim that
P_0(T) − P(T) ≤ β follows because p_1(x) = p(x) on T and, by the increasing
construction yielding equality (6.5.2), we have P_0(T) − P(T) = P_0(T) − P_1(T) = β + P_1(S) −
P_0(S) ≤ β. In particular, we have ||P − P_0||_TV ≤ (α + α′)/2 + β/2 = α ≤ δ/(1 + e^ε). The argument that
||Q − Q_0||_TV ≤ δ/(1 + e^ε) is similar.
6.6 Bibliography
Given the broad focus of this book, our treatment of privacy is necessarily somewhat brief, and
there is substantial depth to the subject that we do not cover.
The initial development of randomized response began with Warner [141], who proposed ran-
domized response in survey sampling as a way to collect sensitive data. This elegant idea remained
in use for many years, and a generalization to data release mechanisms with bounded likelihood
ratios—essentially, the local differential privacy definition 6.2—is due to Evfimievski et al. [70] in
2003 in the databases community. Dwork, McSherry, Nissim, and Smith [65] and the subsequent
work of Dwork et al. [64] defined differential privacy and its (ε, δ)-approximate relaxation. A small
industry of research has built out of these papers, with numerous extensions and developments.
The book of Dwork and Roth [63] surveys much of the field, from the perspective of computer
science, as of 2014. Lemma 6.18 is due to Dwork et al. [66], and our proof is based on theirs.
6.7 Exercises
Question 6.1 (Laplace mechanisms versus randomized response): In this question, you will
investigate using Laplace and randomized response mechanisms, as in Examples 6.3 and 6.1–6.2,
to perform locally private estimation of a mean, comparing the two approaches.
We consider the following scenario: we have data Xi ∈ [0, 1], drawn i.i.d., and wish to estimate
the mean E[X] under local ε-differential privacy.
(a) The Laplace mechanism simply sets Z_i = X_i + W_i for W_i ∼ Laplace(b), i.i.d., for some b. What choice
of b guarantees ε-local differential privacy?
(c) A randomized response mechanism for this case is the following: first, we randomly round X_i
to {0, 1}, by setting

X̃_i = 1 with probability X_i, and X̃_i = 0 otherwise.

Conditional on X̃_i = x, we then set

Z_i = x with probability e^ε/(1 + e^ε), and Z_i = 1 − x with probability 1/(1 + e^ε).

What is E[Z_i]?
(d) For the randomized response Z_i above, give constants a and b so that aZ_i − b is unbiased
for E[X], that is, E[aZ_i − b] = E[X]. Let θ̂_n = (1/n) Σ_{i=1}^n (aZ_i − b) be your mean estimator.
What is E[(θ̂_n − E[X])²]? Does this converge to the mean-squared error of the sample mean,
E[(X̄_n − E[X])²] = Var(X)/n, as ε ↑ ∞?
(e) Let us consider a more sophisticated randomized response scheme. Define the quantized values

b_0 = 0, b_1 = 1/k, . . . , b_{k−1} = (k−1)/k, b_k = 1.   (6.7.1)

Now consider a randomized response estimator that, when X ∈ [b_j, b_{j+1}], first rounds X ran-
domly to X̃ ∈ {b_j, b_{j+1}} so that E[X̃ | X] = X. Conditional on X̃ = b_j, we then set

Z = j with probability e^ε/(k + e^ε), and Z ∼ Uniform({0, . . . , k} \ {j}) with probability k/(k + e^ε).
Question 6.2 (Subsampling via divergence measures (Balle et al. [15])): The hockey stick di-
vergence functional, defined for α ≥ 1, is generated by φ_α(t) = [t − α]_+. It is straightforward to relate this to
(ε, δ)-differential privacy via Definition 6.6: two distributions P and Q are (ε, δ)-close if and only
if their φ_{e^ε}-divergences are less than δ, i.e., if and only if

D_{φ_{e^ε}}(P||Q) ≤ δ and D_{φ_{e^ε}}(Q||P) ≤ δ.

(In your answer to this question, feel free to use D_α(P||Q) as a shorthand for D_{φ_α}(P||Q).)
(a) Let P_0, P_1, Q_1 be any three distributions, and for some q ∈ [0, 1] and α ≥ 1, define P =
(1 − q)P_0 + qP_1 and Q = (1 − q)P_0 + qQ_1. Let α′ = 1 + q(α − 1) = (1 − q) + qα and
θ = α′/α ≤ 1. Show that

(b) Let ε > 0 and define ε(q) = log(1 + q(e^ε − 1)). Show that
Question 6.3 (Subsampling and privacy amplification (Balle et al. [15])): Consider the follow-
ing subsampling approach to privacy. Assume that we have a private (randomized) algorithm,
represented by A, that acts on samples of size m and guarantees (ε, δ)-differential privacy. The
subsampling mechanism is then defined as follows: given a sample X_1^n of size n > m, choose a
subsample X_sub of size m uniformly at random from X_1^n, and then release Z = A(X_sub).

(a) Use the results of parts (a) and (b) in Question 6.2 to show that Z is (ε(q), δq)-differentially
private, where q = m/n and ε(q) = log(1 + q(e^ε − 1)).

(b) Show that if ε ≤ 1, then Z is ((e − 1)qε, qδ)-differentially private, and if ε ≤ 1/2, then Z is
(2(√e − 1)qε, qδ)-differentially private. Hint: argue that for any T > 0, one has e^t − 1 ≤
(e^T − 1) t/T for all t ∈ [0, T].
Question 6.4 (Subsampling and privacy): We would like to estimate the mean E[X] of X ∼ P,
where X ∈ B = {x ∈ R^d : ||x||_2 ≤ 1}, the ℓ_2-ball in R^d. We investigate the extent to which
subsampling of a dataset can improve privacy by providing some additional anonymity. Consider
the following mechanism for estimating (scaled) multiples of this mean: for a dataset {X_1, . . . , X_n},
we let S_i ∈ {0, 1} be i.i.d. Bernoulli(q), that is, E[S_i] = q, and then consider the algorithm

Z = Σ_{i=1}^n X_i S_i + σW,  W ∼ N(0, I_d).   (6.7.2)

(a) Let Q(· | X) and Q(· | X′) denote the channels for the mechanism (6.7.2) with data matrices
X = [x_1 ··· x_{n−1} x] and X′ = [x_1 ··· x_{n−1}] ∈ R^{d×n}. Let P_µ denote the normal distribution
N(µ, σ²I) with mean µ and covariance σ²I on R^d. Show that for any α ∈ (1, ∞),

D_α( Q(· | X)||Q(· | X′) ) ≤ D_α( qP_x + (1 − q)P_0 || P_0 )

and

D_α( Q(· | X′)||Q(· | X) ) ≤ D_α( P_0 || qP_x + (1 − q)P_0 ).
Consider two mechanisms for computing a sample mean X̄_n of vectors, where ||x_i||_2 ≤ b for all i.
The first is to repeat the following T times: for t = 1, 2, . . . , T,

i. draw S ∈ {0, 1}^n with S_i ∼ Bernoulli(q), i.i.d.;
ii. set Z_t = (1/(nq))(XS + σ_sub W_t), where W_t ∼ N(0, I), i.i.d., as in (6.7.2);

then set Z_sub = (1/T) Σ_{t=1}^T Z_t. The other mechanism is to simply set Z_Gauss = X̄_n + σ_Gauss W for
W ∼ N(0, I).
(c) What level of privacy does Zsub have? That is, Zsub is (ε, 2)-Rényi private (against single
removals (6.7.3)). Give a tight upper bound on ε.
(e) Fix ε > 0, and assume that the mechanisms Z_sub and Z_Gauss have parameters chosen so that
each is (ε, 2)-Rényi private. Optimize over T, q, n, σ_sub in the subsampling mechanism and
σ_Gauss in the Gaussian mechanism, and provide the sharpest bound you can on

E[||Z_sub − X̄_n||_2^2] and E[||Z_Gauss − X̄_n||_2^2].

You may assume ||x_i||_2 = b for all i. (In your derivation, to avoid annoying constants, you
should replace log(1 + t) with its upper bound, log(1 + t) ≤ t, which is fairly sharp for t ≈ 0.)
Question 6.5 (Privacy and stochastic gradient methods): In this question, we develop tools for
private (and locally private) estimation in statistical risk minimization, focusing on problems of the
form (6.4.2).
Consider a stochastic gradient method using privacy (Eqs. (6.4.3) and (6.4.6)), where instead
of using the careful ℓ_2-sampling scheme of Fig. 6.2, we add Gaussian noise and subsample a random
fraction q of the dataset. We are given a sample X_1^n of size n, and at each iteration k we draw a
sample S_k ⊂ {1, . . . , n}, where indices are chosen independently and P(i ∈ S_k) = q, then set

g_k := (1/(nq)) Σ_{i∈S_k} ∇ℓ(θ_k; X_i) + σ_sub W_k,   (6.7.4)

where W_k ∼ N(0, I). We then update via the projection (6.4.3), i.e.

θ_{k+1} = Proj_Θ(θ_k − η_k g_k),

where η_k = η_0 k^{−β} for some η_0 > 0 and β ∈ (1/2, 1). We assume that ||∇ℓ(θ; x)||_2 ≤ M for all
x ∈ X, θ ∈ Θ.
(a) What level ε of (ε, 2)-Rényi privacy does one noisy gradient calculation (6.7.4) provide? (To
simplify your answer, you may assume that σ_sub ≥ M and q < 1 − 1/e.)

Now we consider the application of the results for the stochastic gradient method in Proposition 6.32
in the context of the stochastic gradients (6.7.4). Let the empirical loss be L_n(θ) = (1/n) Σ_{i=1}^n ℓ(θ; X_i).
You may assume all the conditions of Proposition 6.32, additionally assuming that ||∇ℓ(θ; x)||_2 ≤ M
for all θ, x.

(b) Choose q ∈ (0, 1), σ_sub, and a number of iterations T to perform the stochastic gradient iteration
with gradients (6.7.4). Prove that for your choices, the resulting average θ̄_T = (1/T) Σ_{k=1}^T θ_k is
(ε, 2)-Rényi private. (You may assume that ε ≤ 1.)

(c) Using your choices of q, σ_sub, and T from part (b), give the tightest upper bound you can on
the root mean squared error

E[ ||θ̄_T − θ̂_n||_2^2 ]^{1/2}

in terms of the sample size n, privacy level ε, the bound M on ||∇ℓ(θ; x)||_2, ∇²L_n(θ̂_n), and Σ_n :=
Cov_n(∇ℓ(θ̂_n, X)), where Cov_n denotes empirical covariance and θ̂_n minimizes L_n(θ) over θ ∈ Θ.
(You may have unspecified numerical constants, and you may assume that θ̂_n ∈ int Θ.)

(d) Assume that if θ(X_1^n) is any function of the data satisfying n E[||θ(X_1^n) − θ̂_n||_2^2] → 0 as n → ∞,
then E[||θ(X_1^n) − θ*||_2^2] satisfies the exact bound of Proposition 6.32. What does this say about
your estimator from part (c)?
(e) An implementation for solving logistic regression. Construct a dataset as follows: for d = 25
and n = 2000, draw {X_i}_{i=1}^n i.i.d. and uniform on S^{d−1}, and draw θ* ∈ S^{d−1} uniformly as well.
Then for each i = 1, . . . , n and y ∈ {±1}, set

Y_i = y with probability 1/(1 + exp(−y⟨θ*, X_i⟩)),

that is, following the binary logistic regression model.
Now, for the loss ℓ(θ; (x, y)) = log(1 + exp(−y⟨x, θ⟩)), implement

i. the non-private stochastic gradient method;
ii. the sampling scheme from parts (a–c) of this problem;
iii. the ℓ_2-locally private sampling approach in Eqs. (6.4.5)–(6.4.6).

Initialize each method at θ_0 = 0, use stepsizes η_k = k^{−2/3}, and set the privacy level ε = 1 for
each problem. Use Θ = R^d so that there are no projections.
Repeat these experiments at least 10 times each, and then plot your errors ||θ − θ*||_2 (in what-
ever format you like) for each of the non-private, (centralized) Rényi private, and locally private
approaches. Explain (briefly) your plots.
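As a starting point for part (e), the data generation and logistic-loss gradient might look as follows; this is a sketch assuming numpy (the private variants then plug the sampler sketched after Figure 6.2 or the noisy subsampled gradient (6.7.4) into the same loop).

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 25, 2000

    def sphere(num, d, rng):
        u = rng.standard_normal((num, d))
        return u / np.linalg.norm(u, axis=1, keepdims=True)

    X = sphere(n, d, rng)
    theta_star = sphere(1, d, rng)[0]
    # Logistic model: P(Y_i = y | X_i) = 1 / (1 + exp(-y <theta*, X_i>)).
    Y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_star)), 1.0, -1.0)

    def grad(theta, i):
        # Gradient of log(1 + exp(-y <x, theta>)) at example i.
        return -Y[i] * X[i] / (1.0 + np.exp(Y[i] * (X[i] @ theta)))

    theta = np.zeros(d)                 # non-private SGD baseline
    for k in range(1, n + 1):
        theta -= k ** (-2.0 / 3.0) * grad(theta, rng.integers(n))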
Question 6.6 (Concentration and privacy composition): In this question, we give an alternative
to the privacy composition approaches we exploit in Section 6.3.2. Consider an identical scenario to
that in Fig. 6.1, and begin by assuming that each channel Q_i is ε-differentially private with density
q_i, and let Q^{(b)} be shorthand for Q(· | x^{(b)}). Define the log-likelihood ratio

L^{(b)}(Z_1^k) := Σ_{i=1}^k log( q_i^{(b)}(Z_i) / q_i^{(1−b)}(Z_i) ).
(a) Let P, Q be any two distributions satisfying D_∞(P||Q) ≤ ε and D_∞(Q||P) ≤ ε, i.e., such that
log(P(A)/Q(A)) ∈ [−ε, ε] for all sets A. Show that

(b) Let Q^{(b)} denote the joint distribution of Z_1, . . . , Z_k when bit b holds in the privacy game in
Fig. 6.1. Show that

E_b[L^{(b)}(Z_1^k)] ≤ kε(e^ε − 1),

where E_b denotes expectation under Q^{(b)}, and that for all t ≥ 0,

Q^{(b)}( L^{(b)}(Z_1^k) ≥ kε(e^ε − 1) + t ) ≤ exp( −t²/(2kε²) ).

Conclude that for any δ ∈ (0, 1), with probability at least 1 − δ over Z_1^k ∼ Q^{(b)},

L^{(b)}(Z_1^k) ≤ kε(e^ε − 1) + √(2k log(1/δ)) · ε.
(d) Conclude the following tighter variant of Corollary 6.25: if each channel in Fig. 6.1 is ε-
differentially private, then the composition of k such channels is

( kε(e^ε − 1) + √(2k log(1/δ)) · ε,  δ )

-differentially private. As an aside, a completely similar derivation yields the following tighter analogue of Corollary 6.26:
if each channel is (ε, δ)-differentially private, then their composition is

( kε(e^ε − 1) + √(2k log(1/δ_0)) · ε,  δ_0 + kδ/(1 + e^ε) )

-differentially private for all δ_0 > 0.
Part II
i. Minimax lower bounds (both local and global) using Le Cam’s, Fano’s, and Assouad’s methods.
ii. Strong data processing inequalities, along with some bounds on them
iii. Constrained risk inequalities and functionals for lower bounds perhaps
Chapter 7
Understanding the fundamental limits of estimation and optimization procedures is important for
a multitude of reasons. Indeed, developing bounds on the performance of procedures can give
complementary insights. By exhibiting fundamental limits of performance (perhaps over restricted
classes of estimators), it is possible to guarantee that an algorithm we have developed is optimal, so
that searching for estimators with better statistical performance will have limited returns, though
searching for estimators with better performance in other metrics may be interesting. Moreover,
exhibiting refined lower bounds on the performance of estimators can also suggest avenues for de-
veloping alternative, new optimal estimators; lower bounds need not be a fully pessimistic exercise.
In this set of notes, we define and then discuss techniques for lower-bounding the minimax risk,
giving three standard techniques for deriving minimax lower bounds that have proven fruitful in a
variety of estimation problems [143]. In addition to reviewing these standard techniques—the Le
Cam, Fano, and Assouad methods—we present a few simplifications and extensions that may make
them more “user friendly.”
density of P.¹ In this case, θ does not parameterize P, so we take a slightly broader viewpoint of
estimating functions of distributions in these notes.
The space Θ in which the parameter θ(P) takes values depends on the underlying statistical
problem; as an example, if the goal is to estimate the univariate mean θ(P) = E_P[X], we have
Θ ⊂ R. To evaluate the quality of an estimator θ̂, we let ρ : Θ × Θ → R_+ denote a (semi)metric
on the space Θ, which we use to measure the error of an estimator for the parameter θ, and we let
Φ : R_+ → R_+ be a non-decreasing function with Φ(0) = 0 (for example, Φ(t) = t²).
For a distribution P ∈ P, we assume we receive i.i.d. observations X_i drawn according to some
P, and based on these {X_i}, the goal is to estimate the unknown parameter θ(P) ∈ Θ. For a
given estimator θ̂—a measurable function θ̂ : X^n → Θ—we assess the quality of the estimate
θ̂(X_1, . . . , X_n) in terms of the risk

E_P[ Φ( ρ(θ̂(X_1, . . . , X_n), θ(P)) ) ].
For instance, for a univariate mean problem with ρ(θ, θ′) = |θ − θ′| and Φ(t) = t², this risk is the
mean-squared error. As the distribution P is varied, we obtain the risk functional for the problem,
which gives the risk of any estimator θ̂ for the family P.
For any fixed distribution P, there is always a trivial estimator of θ(P): simply return θ(P),
which will have minimal risk. Of course, this "estimator" is unlikely to be good in any real sense,
and it is thus important to consider the risk functional not in a pointwise sense (as a function of
individual P) but to take a more global view. One approach to this is Bayesian: we place a prior
π on the set of possible distributions P, viewing θ(P) as a random variable, and evaluate the risk
of an estimator θ̂ taken in expectation with respect to this prior on P. Another approach, first
suggested by Wald [140], is to choose the estimator θ̂ minimizing the maximum risk

sup_{P∈P} E_P[ Φ( ρ(θ̂(X_1, . . . , X_n), θ(P)) ) ].
An optimal estimator for this metric then gives the minimax risk, which is defined as

M_n(θ(P), Φ ∘ ρ) := inf_{θ̂} sup_{P∈P} E_P[ Φ( ρ(θ̂(X_1, . . . , X_n), θ(P)) ) ],   (7.1.1)

where we take the supremum (worst case) over distributions P ∈ P, and the infimum is taken over
all estimators θ̂. Here the notation θ(P) indicates that we consider parameters θ(P) for P ∈ P and
distributions in P.
In some scenarios, we study a specialized notion of risk appropriate for optimization problems
(and statistical problems in which all we care about is prediction). In these settings, we assume
there exists some loss function ℓ : Θ × X → R, where for an observation x ∈ X, the value ℓ(θ; x)
measures the instantaneous loss associated with using θ as a predictor. In this case, we define the
risk

L_P(θ) := E_P[ℓ(θ; X)] = ∫_X ℓ(θ; x) dP(x)   (7.1.2)

as the expected loss of the vector θ. (See, e.g., Chapter 5 of the lectures by Shapiro, Dentcheva,
and Ruszczyński [131], or work on stochastic approximation by Nemirovski et al. [118].)
¹Such problems arise, for example, in estimating the uniformity of the distribution of a species over an area (large
θ(P) indicates an irregular distribution).
Example 7.1 (Support vector machines): In linear classification problems, we observe pairs
z = (x, y), where y ∈ {−1, 1} and x ∈ R^d, and the goal is to find a parameter θ ∈ R^d so
that sign(⟨θ, x⟩) = y. A convex loss surrogate for this problem is the hinge loss ℓ(θ; z) =
[1 − y⟨θ, x⟩]_+; minimizing the associated risk functional (7.1.2) over a set Θ = {θ ∈ R^d :
||θ||_2 ≤ r} gives the support vector machine [46]. 3
Example 7.2 (Two-stage stochastic programming): In operations research, one often wishes
to allocate resources to a set of locations {1, . . . , m} before seeing demand for the resources.
Suppose that the (unobserved) sample x consists of the pair x = (C, v), where C ∈ R^{m×m}
corresponds to the prices of shipping a unit of material, so c_ij ≥ 0 gives the cost of shipping
from location i to j, and v ∈ R^m denotes the value (price paid for the good) at each location.
Letting θ ∈ R^m_+ denote the amount of resources allocated to each location, we formulate the
loss as

ℓ(θ; x) := inf_{r∈R^m, T∈R^{m×m}} { Σ_{i,j} c_ij T_ij − Σ_{i=1}^m v_i r_i : r_i = θ_i + Σ_{j=1}^m T_ji − Σ_{j=1}^m T_ij, T_ij ≥ 0, Σ_{j=1}^m T_ij ≤ θ_i }.

Here the variables T correspond to the goods transported to and from each location (so T_ij is
the amount of goods shipped from i to j), and we wish to minimize the cost of our shipping and maximize
the profit. By minimizing the risk (7.1.2) over a set Θ = {θ ∈ R^m_+ : Σ_i θ_i ≤ b}, we maximize
our expected reward given a budget constraint b on the amount of allocated resources. 3
where the expectation is taken over X_i and any randomness in the procedure θ̂. This expression
captures the difference between the (expected) risk performance of the procedure θ̂ and the best
possible risk, available if the distribution P were known ahead of time. The minimax excess risk,
defined with respect to the loss ℓ, domain Θ, and family P of distributions, is then defined by the
best possible maximum excess risk,

M_n(Θ, P, ℓ) := inf_{θ̂} sup_{P∈P} E_P[ L_P(θ̂(X_1, . . . , X_n)) − inf_{θ∈Θ} L_P(θ) ],   (7.1.3)

where the infimum is taken over all estimators θ̂ : X^n → Θ and the risk L_P is implicitly defined in
terms of the loss ℓ. The techniques for providing lower bounds for the minimax risk (7.1.1) or the
excess risk (7.1.3) are essentially identical; we focus for the remainder of this section on techniques
for providing lower bounds on the minimax risk.
where the final inequality follows because Φ is non-decreasing. Now, let us define θ_v = θ(P_v), so
that ρ(θ_v, θ_{v′}) ≥ 2δ for v ≠ v′. By defining the testing function

Ψ(θ̂) := argmin_{v∈V} ρ(θ̂, θ_v),
Figure 7.1. Example of a 2δ-packing of a set. The estimate θ̂ is contained in at most one of the
δ-balls around the points θ_v.
breaking ties arbitrarily, we have that ρ(θ̂, θ_v) < δ implies that Ψ(θ̂) = v because of the triangle
inequality and 2δ-separation of the set {θ_v}_{v∈V}. Indeed, assume that ρ(θ̂, θ_v) < δ; then for any
v′ ≠ v, we have

ρ(θ̂, θ_{v′}) ≥ ρ(θ_v, θ_{v′}) − ρ(θ̂, θ_v) > 2δ − δ = δ.

The test must thus return v as claimed. Equivalently, for v ∈ V, the inequality Ψ(θ̂) ≠ v implies
ρ(θ̂, θ_v) ≥ δ. (See Figure 7.1.) By averaging over V, we find that

sup_P P( ρ(θ̂, θ(P)) ≥ δ ) ≥ (1/|V|) Σ_{v∈V} P( ρ(θ̂, θ(P_v)) ≥ δ | V = v ) ≥ (1/|V|) Σ_{v∈V} P( Ψ(θ̂) ≠ v | V = v ).
The remaining challenge is to lower bound the probability of error in the underlying multi-way
hypothesis testing problem, which we do by choosing the separation δ to trade off between the loss
Φ(δ) (large δ increases the loss) and the probability of error (small δ, and hence separation, makes
the hypothesis test harder). Usually, one attempts to choose the largest separation δ that guarantees
a constant probability of error. There are a variety of techniques for this, and we present three:
Le Cam’s method, Fano’s method, and Assouad’s method, including extensions of the latter two
to enhance their applicability. Before continuing, however, we review some inequalities between
divergence measures defined on probabilities, which will be essential for our development, and
concepts related to packing sets (metric entropy, covering numbers, and packing).
of f-divergences (recall Section 2.2.3). We first recall the definitions of the three when applied to
distributions P, Q on a set X, which we assume have densities p, q with respect to a base measure
µ. First, the total variation distance (2.2.6) is

||P − Q||_TV := sup_{A⊂X} |P(A) − Q(A)| = (1/2) ∫ |p(x) − q(x)| dµ(x),

which is the f-divergence D_f(P||Q) generated by f(t) = (1/2)|t − 1|. The Hellinger distance (2.2.8) is

d_hel(P, Q)² := ∫ ( √p(x) − √q(x) )² dµ(x),

which is the f-divergence generated by f(t) = (√t − 1)². We also recall the Kullback-
Leibler (KL) divergence

D_kl(P||Q) := ∫ p(x) log( p(x)/q(x) ) dµ(x),   (7.2.2)

which is the f-divergence generated by f(t) = t log t. As noted in Section 2.2.3, Propo-
sition 2.10, these divergences have the following relationships.
Proposition (Proposition 2.10, restated). The total variation distance satisfies the following rela-
tionships:

(a) (1/2) d_hel(P, Q)² ≤ ||P − Q||_TV ≤ d_hel(P, Q) √(1 − d_hel(P, Q)²/4) ≤ d_hel(P, Q);

(b) (Pinsker's inequality) ||P − Q||²_TV ≤ (1/2) D_kl(P||Q).
We now show how Proposition 2.10 is useful: KL-divergence and Hellinger distance
are both easier to manipulate on product distributions than is total variation. Specifically, consider
the product distributions P = P_1 × ··· × P_n and Q = Q_1 × ··· × Q_n. Then the KL-divergence
satisfies the decoupling equality

D_kl(P||Q) = Σ_{i=1}^n D_kl(P_i||Q_i).   (7.2.3)
In particular, we see that for product distributions P^n and Q^n, Proposition 2.10 implies that

||P^n − Q^n||²_TV ≤ (1/2) D_kl(P^n||Q^n) = (n/2) D_kl(P||Q)

and

||P^n − Q^n||_TV ≤ d_hel(P^n, Q^n) ≤ √( 2 − 2(1 − d_hel(P, Q)²)^n ).

As a consequence, if we can guarantee that D_kl(P||Q) ≤ 1/n or d_hel(P, Q) ≤ 1/√n, then we
guarantee the strict inequality ||P^n − Q^n||_TV ≤ 1 − c for a fixed constant c > 0, for any n. We
will see how this type of guarantee can be used to prove minimax lower bounds in the following
sections.
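These inequalities are easy to check numerically. For a pair of Bernoulli distributions, the total variation between the n-fold products reduces (by sufficiency of the count) to a total variation between Binomial distributions, giving the following sketch; the function names and parameter values are ours.

    import numpy as np
    from math import comb

    def kl_bernoulli(p, q):
        # KL divergence between Bernoulli(p) and Bernoulli(q).
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def tv_products(n, p, q):
        # ||P^n - Q^n||_TV for P = Bernoulli(p), Q = Bernoulli(q): the count
        # is sufficient, so this equals TV(Binomial(n,p), Binomial(n,q)).
        return 0.5 * sum(abs(comb(n, k) * (p**k * (1-p)**(n-k) - q**k * (1-q)**(n-k)))
                         for k in range(n + 1))

    n, p = 100, 0.5
    q = p + 0.5 / np.sqrt(n)                    # so D_kl(P||Q) is of order 1/n
    print(tv_products(n, p, q))                 # stays bounded away from 1
    print(np.sqrt(n * kl_bernoulli(p, q) / 2))  # Pinsker + tensorization bound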
Proof   We use the proof of Guntuboyina [80]. Consider a maximal subset V of H_d = {−1, 1}^d
satisfying

||v − v′||_1 ≥ d/2 for all distinct v, v′ ∈ V.   (7.2.5)

That is, the addition of any vector w ∈ H_d, w ∉ V, to V will break the constraint (7.2.5). This
means that if we construct the closed balls B(v, d/2) := {w ∈ H_d : ||v − w||_1 ≤ d/2}, we must have

∪_{v∈V} B(v, d/2) = H_d,  so  |V| |B(0, d/2)| = Σ_{v∈V} |B(v, d/2)| ≥ 2^d.   (7.2.6)

We now upper bound the cardinality of B(v, d/2) using the probabilistic method, which will imply
the desired result. Let S_i, i = 1, . . . , d, be i.i.d. Bernoulli {0, 1}-valued random variables. Then by
their uniformity, for any v ∈ H_d,

2^{−d} |B(v, d/2)| = P( Σ_{i=1}^d S_i ≥ 3d/4 ) ≤ E[exp(λ Σ_i S_i)] exp(−3λd/4)

for any λ > 0, by Markov's inequality (or the Chernoff bound). Since E[exp(λS_1)] = (1/2)(1 + e^λ), we
obtain

2^{−d} |B(v, d/2)| ≤ inf_{λ≥0} { 2^{−d} (1 + e^λ)^d exp(−3λd/4) } ≤ 2^{−d} 4^d 3^{−3d/4},

the final inequality by taking λ = log 3. Combining this bound with inequality (7.2.6), we find

|V| 3^{−3d/4} 4^d ≥ |V| |B(v, d/2)| ≥ 2^d,  or  |V| ≥ 3^{3d/4}/2^d = exp( d( (3/4) log 3 − log 2 ) ) ≥ exp(d/8),

as claimed.
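The maximality argument in the proof is constructive: a greedy pass over the hypercube already produces a packing of size at least exp(d/8), as the following brute-force sketch (ours, feasible only for small d) illustrates.

    import itertools
    import numpy as np

    def greedy_packing(d):
        # Greedily collect v in {-1,1}^d with pairwise ||v - v'||_1 >= d/2,
        # as in the maximal packing (7.2.5); exhaustive over 2^d points.
        V = []
        for w in itertools.product((-1, 1), repeat=d):
            w = np.array(w)
            if all(np.abs(w - v).sum() >= d / 2 for v in V):
                V.append(w)
        return V

    for d in (4, 8, 12):
        print(d, len(greedy_packing(d)), np.exp(d / 8))  # |V| exceeds exp(d/8)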
Given the relationships between packing, covering, and size of sets Θ, we would expect there
to be relationships between volume, packing, and covering numbers. This is indeed the case, as we
now demonstrate for arbitrary norm balls in finite dimensions.
As a consequence of Lemma 7.6, we see that for any δ < 1, there is a packing V of B such that
||v − v′|| ≥ δ for all distinct v, v′ ∈ V and |V| ≥ (1/δ)^d, because we know M(δ, B, ||·||) ≥ N(δ, B, ||·||)
as in Lemma 7.4. In particular, the lemma shows that any norm ball has a 1/2-packing in its own
norm with cardinality at least 2^d. We can also construct exponentially large packings of arbitrary
norm balls (in finite dimensions) where points are of constant distance apart.
Proof   We prove the lemma via a volumetric argument. For the lower bound, note that if the
points v_1, . . . , v_N are a δ-cover of B, then

Vol(B) ≤ Σ_{i=1}^N Vol(δB + v_i) = N Vol(δB) = N Vol(B) δ^d.

In particular, N ≥ δ^{−d}. For the upper bound on N(δ, B, ||·||), let V be a δ-packing of B with
maximal cardinality, so that |V| = M(δ, B, ||·||) ≥ N(δ, B, ||·||) (recall Lemma 7.4). Notably, the
collection of δ-balls {δB + v_i}_{i=1}^M covers the ball B (as otherwise, we could put an additional element
in the packing V), and moreover, the balls {(δ/2)B + v_i} are all disjoint by definition of a packing.
Consequently, we find that

M (δ/2)^d Vol(B) = M Vol( (δ/2)B ) ≤ Vol( B + (δ/2)B ) = (1 + δ/2)^d Vol(B).

Rewriting, we obtain

M(δ, B, ||·||) ≤ ( (1 + δ/2)^d / (δ/2)^d ) = (1 + 2/δ)^d,
Example 7.7 (Bernoulli mean estimation): Consider the problem of estimating the mean
θ ∈ [−1, 1] of a {±1}-valued Bernoulli distribution under the squared error loss (θ̂ − θ)², where
X_i ∈ {−1, 1}. In this case, by fixing some δ > 0, we set V = {−1, 1}, and we define P_v so that

P_v(X = 1) = (1 + vδ)/2 and P_v(X = −1) = (1 − vδ)/2,

whence we see that the mean θ(P_v) = δv. Using the metric ρ(θ, θ′) = |θ − θ′| and loss Φ(δ) = δ²,
we have separation 2δ of θ(P_{−1}) and θ(P_1). Thus, via Le Cam's method (7.3.3), we have

M_n(Bernoulli([−1, 1]), (·)²) ≥ (δ²/2) ( 1 − ||P_{−1}^n − P_1^n||_TV ).

We would thus like to upper bound ||P_{−1}^n − P_1^n||_TV as a function of the separation δ and
sample size n; here we use Pinsker's inequality (Proposition 2.10(b)) and the tensorization
identity (7.2.3) that makes KL-divergence so useful. Indeed, we have

D_kl(P_{−1}||P_1) = δ log( (1 + δ)/(1 − δ) ) ≤ 3δ² for 0 ≤ δ ≤ 1/2,

so that ||P_{−1}^n − P_1^n||²_TV ≤ (n/2) D_kl(P_{−1}||P_1) ≤ (3n/2)δ². Choosing δ² = 1/(6n) thus gives
||P_{−1}^n − P_1^n||_TV ≤ 1/2 and the lower bound

M_n(Bernoulli([−1, 1]), (·)²) ≥ 1/(24n).

While the factor 1/24 is smaller than necessary, this bound is optimal to within constant
factors; the sample mean (1/n) Σ_{i=1}^n X_i achieves mean-squared error (1 − θ²)/n.
As an alternative proof, we may use the Hellinger distance and its associated decoupling
identity (7.2.4). We sketch the idea, ignoring lower-order terms when convenient. In this case,
Proposition 2.10(a) implies

||P_1^n − P_2^n||_TV ≤ d_hel(P_1^n, P_2^n) = √( 2 − 2(1 − d_hel(P_1, P_2)²)^n ).

Noting that

d_hel(P_1, P_2)² = ( √((1+δ)/2) − √((1−δ)/2) )² = 1 − 2√((1−δ²)/4) = 1 − √(1−δ²) ≈ δ²/2,

and noting that (1 − δ²/2)^n ≈ e^{−δ²n/2}, we have (up to lower-order terms in δ) that ||P_1^n − P_2^n||_TV ≤
√(2 − 2exp(−δ²n/2)). Choosing δ² = 1/(4n), we have √(2 − 2exp(−δ²n/2)) ≤ 1/2, thus giving
the lower bound

M_n(Bernoulli([−1, 1]), (·)²) "≥" (δ²/2)(1 − 1/2) = 1/(16n),

where the quotations indicate we have been fast and loose in the derivation. 3
This example shows the "usual" rate of convergence in parametric estimation problems, that is,
that we can estimate a parameter θ at a rate (in squared error) scaling as 1/n. The mean estimator
above is, in some sense, the prototypical example of such regular problems. In some "irregular"
scenarios—including estimating the support of a uniform random variable, which we study in the
homework—faster rates are possible.
We also note in passing that there are substantially more complex versions of Le Cam's method
that can yield sharp results for a wider variety of problems, including some in nonparametric
estimation [104, 143]. For our purposes, the simpler two-point perspective provided in this section
will be sufficient.
JCD Comment: Talk about Euclidean structure with KL space and information geometry a
bit here to suggest the KL approach later.
with cardinality |V| ≥ 2. If we let the function h_2(p) = −p log p − (1 − p) log(1 − p) denote the
entropy of the Bernoulli random variable with parameter p, Fano's inequality (Proposition 2.19
from Chapter 2) takes the following form [e.g. 48, Chapter 2]:

h_2( P(V̂ ≠ V) ) + P(V̂ ≠ V) log(|V| − 1) ≥ H(V | V̂).   (7.4.1)

Restating the results in Chapter 2, we also have the following convenient rewriting of Fano's
inequality when V is uniform in V (recall Corollary 2.20):

P(V̂ ≠ V) ≥ 1 − (I(V; X) + log 2)/log(|V|).   (7.4.2)

Corollary 7.9. Let V be uniform on V. Then

inf_Ψ P(Ψ(X) ≠ V) ≥ 1 − (I(V; X) + log 2)/log |V|,

where the infimum is taken over all testing procedures Ψ. By combining Corollary 7.9 with the
reduction from estimation to testing in Proposition 7.3, we obtain the following result.
Proposition 7.10. Let {θ(P_v)}_{v∈V} be a 2δ-packing in the ρ-semimetric. Assume that V is uniform
on the set V, and that conditional on V = v, we draw a sample X ∼ P_v. Then the minimax risk has
lower bound

M(θ(P); Φ ∘ ρ) ≥ Φ(δ) ( 1 − (I(V; X) + log 2)/log |V| ).
To gain some intuition for Proposition 7.10, we think of the lower bound as a function of the
separation δ > 0. Roughly, as δ ↓ 0, the separation condition between the distributions P_v is
relaxed and we expect the distributions P_v to be closer to one another. In this case—as will be
made more explicit presently—the hypothesis testing problem of distinguishing the P_v becomes
more challenging, and the information I(V; X) shrinks. Thus, what we roughly attempt to do
is to choose our packing θ(P_v) as a function of δ, and find the largest δ > 0 making the mutual
information small enough that

(I(V; X) + log 2)/log |V| ≤ 1/2.   (7.4.3)

In this case, the minimax lower bound is at least Φ(δ)/2. We now explore techniques for achieving
such results.
where P_X P_V denotes the distribution of (X, V) when the random variables are independent. By
manipulating this definition, we can rewrite it in a way that is a bit more convenient for our
purposes.
Indeed, focusing on our setting of testing, let us assume that V is drawn from a prior distribution
π (this may be a discrete or arbitrary distribution, though for simplicity we focus on the case when
π is discrete). Let P_v denote the distribution of X conditional on V = v, as in Proposition 7.10.
Then marginally, we know that X is drawn from the mixture distribution

P̄ := Σ_v π(v) P_v.

With this definition of the mixture distribution, via algebraic manipulations, we have

I(V; X) = Σ_v π(v) D_kl( P_v || P̄ ),   (7.4.4)

a representation that plays an important role in our subsequent derivations. To see equality (7.4.4),
let µ be a base measure over X (assume w.l.o.g. that X has density p(· | v) = p_v(·) conditional on
V = v), and note that

I(V; X) = Σ_v ∫_X p(x | v) π(v) log( p(x | v) / Σ_{v′} p(x | v′) π(v′) ) dµ(x) = Σ_v π(v) ∫_X p(x | v) log( p(x | v)/p̄(x) ) dµ(x).
Representation (7.4.4) makes it clear that if the distributions of the sample X conditional
on V are all similar, then there is little information content. Returning to the discussion after
Proposition 7.10, we have in this uniform setting that

P̄ = (1/|V|) Σ_{v∈V} P_v  and  I(V; X) = (1/|V|) Σ_{v∈V} D_kl( P_v || P̄ ).

The mutual information is small if the typical conditional distribution P_v is difficult to distinguish—
has small KL-divergence—from P̄.
In the local Fano method approach, we construct a local packing. This local packing approach
is based on constructing a family of distributions P_v for v ∈ V defining a 2δ-packing (recall Sec-
tion 7.2.1), meaning that ρ(θ(P_v), θ(P_{v′})) ≥ 2δ for all v ≠ v′, but which additionally satisfy the
uniform upper bound

D_kl(P_v||P_{v′}) ≤ κ²δ² for all v, v′ ∈ V,   (7.4.6)
where κ > 0 is a fixed problem-dependent constant. If we have the inequality (7.4.6), then
I(V; X) ≤ max_{v,v′} D_kl(P_v||P_{v′}) ≤ κ²δ², so that so long as we can find a local packing V such that

log |V| ≥ 2(κ²δ² + log 2),

we are guaranteed the testing error condition (7.4.3), and hence the minimax lower bound

M(θ(P), Φ ∘ ρ) ≥ (1/2) Φ(δ).

The difficulty in this approach is constructing the packing set V that allows δ to be chosen to obtain
sharp lower bounds, and we often require careful choices of the packing sets V. (We will see how
to reduce such difficulties in subsequent sections.)
Constructing local packings As mentioned above, the main difficulty in using Fano’s method
is in the construction of so-called “local” packings. In these problems, the idea is to construct a
packing V of a fixed set (in a vector space, say Rd ) with constant radius and constant distance.
Then we scale elements of the packing by δ > 0, which leaves the cardinality |V| identical, but
allows us to scale δ in the separation in the packing and the uniform divergence bound (7.4.6). In
particular, Lemmas 7.5 and 7.6 show that we can construct exponentially large packings of certain
sets with balls of a fixed radius.
We now illustrate these techniques via two examples.
Example 7.11 (Normal mean estimation): Consider the d-dimensional normal location
family Nd = {N(θ, σ²Id×d) | θ ∈ Rd}; we wish to estimate the mean θ = θ(P) of a given distribution P ∈ Nd in mean-squared error, that is, with loss ‖θ̂ − θ‖₂². Let V be a 1/2-packing of the unit ℓ2-ball with cardinality at least 2^d, as guaranteed by Lemma 7.6. (We assume for
simplicity that d ≥ 2.)
Now we construct our local packing. Fix δ > 0, and for each v ∈ V, set θv = δv ∈ Rd . Then
we have

    ‖θv − θv′‖₂ = δ‖v − v′‖₂ ≥ δ/2

for each distinct pair v, v′ ∈ V, and moreover, we note that ‖θv − θv′‖₂ ≤ δ for such pairs as
well. By applying the Fano minimax bound of Proposition 7.10, we see that (given n normal observations Xi ∼iid P)

    Mn(θ(Nd), ‖·‖₂²) ≥ ((1/2)·(δ/2))² (1 − (I(V; X1^n) + log 2)/log |V|) = (δ²/16) (1 − (I(V; X1^n) + log 2)/(d log 2)).
Now note that for any pair v, v′, if Pv is the normal distribution N(θv, σ²Id×d), we have

    Dkl(Pv^n‖Pv′^n) = n · Dkl(N(δv, σ²Id×d)‖N(δv′, σ²Id×d)) = n · (δ²/(2σ²)) ‖v − v′‖₂²,
as the KL-divergence between two normal distributions with identical covariance is

    Dkl(N(θ1, Σ)‖N(θ2, Σ)) = (1/2)(θ1 − θ2)⊤Σ^{−1}(θ1 − θ2),

as in Example 2.7. As ‖v − v′‖₂ ≤ 1, we have the KL-divergence bound (7.4.6) with κ² = n/(2σ²).
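A short numerical sketch completes the example under the choice δ² = (d − 2)σ² log 2/n, our own choice making condition (7.4.3) hold with the bound I(V; X1^n) ≤ κ²δ²; the resulting lower bound scales as σ²d/n:

import numpy as np

# A sketch of the local Fano calculation in Example 7.11. The packing has
# |V| >= 2^d, the KL bound (7.4.6) holds with kappa^2 = n / (2 sigma^2),
# and we pick the largest delta satisfying condition (7.4.3).
def fano_lower_bound(d, sigma2, n):
    # I(V; X_1^n) <= kappa^2 delta^2 = n delta^2 / (2 sigma^2); require
    # (n delta^2 / (2 sigma^2) + log 2) / (d log 2) <= 1/2.
    delta2 = (d - 2) * sigma2 * np.log(2) / n   # assumes d > 2
    return delta2 / 16 * (1 - 0.5)              # M_n >= (delta^2/16) * (1/2)

d, sigma2 = 32, 1.0
for n in [100, 1000, 10000]:
    # compare against the sharp scaling sigma^2 d / n
    print(n, fano_lower_bound(d, sigma2, n), sigma2 * d / n)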
Example 7.12 (Linear regression): In this example, we show how local packings can give
(up to some constant factors) sharp minimax rates for standard linear regression problems. In
particular, for a fixed matrix X ∈ Rn×d, we observe

    Y = Xθ + ε,

where ε ∼ N(0, σ²In×n). Taking V ⊂ {−1, 1}d to be a packing of the hypercube with log |V| ≥ d/8 (via the Gilbert–Varshamov bound, Lemma 7.5) and setting θv = δv for δ > 0, the divergences satisfy Dkl(Pv‖Pv′) = ‖X(θv − θv′)‖₂²/(2σ²) ≤ 2dγ²max(X)δ²/σ², so that

    M(θ(P), ‖·‖₂²) ≥ (dδ²/2) (1 − (I(V; Y) + log 2)/log |V|) ≥ (dδ²/2) (1 − (2dγ²max(X)δ²/σ² + log 2)/(d/8)).
Now, if we choose

    δ² = σ²/(64γ²max(X)),   then   1 − (16dγ²max(X)δ²)/(dσ²) − (8 log 2)/d ≥ 1 − 1/4 − 1/4 = 1/2

(for d ≥ 32 log 2),
so that

    M(θ(P), ‖·‖₂²) ≥ (1/256) · σ²d/γ²max(X) = (1/256) · (σ²d/n) · 1/γ²max(X/√n),

for a convergence rate (roughly) of σ²d/n after rescaling the singular values of X by 1/√n.
This bound is sharp in terms of the dimension, dependence on n, and the variance σ 2 , but
it does not fully capture the dependence on X, as it depends only on the maximum singular
value. Indeed, in this case, an exact calculation (cf. [107]) shows that the minimax value of
the problem is exactly σ 2 tr((X > X)−1 ). Letting λj (A) be the jth eigenvalue of a matrix A,
we have

    σ² tr((X⊤X)^{−1}) = (σ²/n) tr((n^{−1}X⊤X)^{−1}) = (σ²/n) Σ_{j=1}^d 1/λj(n^{−1}X⊤X)
                      ≥ (σ²d/n) · min_j 1/λj(n^{−1}X⊤X) = (σ²d/n) · 1/γ²max(n^{−1/2}X).
Thus, the local Fano method captures most—but not all—of the difficulty of the problem. 3
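The following sketch (with a random design of our own choosing) compares the local Fano lower bound above with the exact minimax risk σ² tr((X⊤X)^{−1}):

import numpy as np

# Compare the local Fano lower bound for linear regression with the exact
# minimax risk sigma^2 tr((X^T X)^{-1}) on a random design (illustration only).
rng = np.random.default_rng(0)
n, d, sigma2 = 200, 10, 1.0
X = rng.normal(size=(n, d))

gamma_max = np.linalg.norm(X, ord=2)            # maximum singular value of X
fano_bound = sigma2 * d / (256 * gamma_max**2)
exact = sigma2 * np.trace(np.linalg.inv(X.T @ X))

print(fano_bound, exact)   # Fano captures the d/n scaling but not all of X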
Proposition 7.13. Let V → X → V̂ be a Markov chain with V taking values in the finite set V, let ρV be a metric on V, and for t ≥ 0 define Pt := P(ρV(V̂, V) > t) along with the neighborhood sizes Nt^max := max_{v∈V} |{v′ ∈ V : ρV(v, v′) ≤ t}| and Nt^min := min_{v∈V} |{v′ ∈ V : ρV(v, v′) ≤ t}|. Then

    h2(Pt) + Pt log((|V| − Nt^min)/Nt^max) + log Nt^max ≥ H(V | V̂).    (7.4.9)
Before proving the proposition, which we do in Section 7.5.1, it is informative to note that it
reduces to the standard form of Fano’s inequality (7.4.1) in a special case. Suppose that we take
ρV to be the 0–1 metric, meaning that ρV(v, v′) = 0 if v = v′ and 1 otherwise. Setting t = 0 in Proposition 7.13, we have P0 = P[V̂ ≠ V] and N0^min = N0^max = 1, whence inequality (7.4.9) reduces
to inequality (7.4.1). Other weakenings allow somewhat clearer statements (see Section 7.5.2 for a
proof):
Corollary 7.14. If V is uniform on V and |V| − Nt^min > Nt^max, then

    P(ρV(V̂, V) > t) ≥ 1 − (I(V; X) + log 2)/log(|V|/Nt^max).    (7.4.10)
Inequality (7.4.10) is the natural analogue of the classical mutual-information based form of
Fano’s inequality (7.4.2), and it provides a qualitatively similar bound. The main difference is
that the usual cardinality |V| is replaced by the ratio |V|/Ntmax . This quantity serves as a rough
measure of the number of possible “regions” in the space V that are distinguishable—that is, the
number of subsets of V for which ρV(v, v′) > t when v and v′ belong to different regions. While
this construction is similar in spirit to the usual construction of packing sets in the standard
reduction from testing to estimation (cf. Section 7.2.1), our bound allows us to skip the packing set
construction. We can directly compute I(V ; X) where V takes values over the full space, as opposed
to computing the mutual information I(V 0 ; X) for a random variable V 0 uniformly distributed over
a packing set contained within V. In some cases, the former calculation can be much simpler, as
illustrated in examples and chapters to follow.
We now turn to providing a few consequences of Proposition 7.13 and Corollary 7.14, showing
how they can be used to derive lower bounds on the minimax risk. Proposition 7.13 is a generaliza-
tion of the classical Fano inequality (7.4.1), so it leads naturally to a generalization of the classical
Fano lower bound on minimax risk, which we describe here. This reduction from estimation to
testing is somewhat more general than the classical reductions, since we do not map the original
estimation problem to a strict test, but rather a test that allows errors. Consider as in the standard
reduction of estimation to testing in Section 7.2.1 a family of distributions {Pv }v∈V ⊂ P indexed by
a finite set V. This family induces an associated collection of parameters {θv := θ(Pv )}v∈V . Given
a function ρV : V × V → R and a scalar t, we define the separation δ(t) of this set relative to the
metric ρ on Θ via
    δ(t) := sup{δ | ρ(θv, θv′) ≥ δ for all v, v′ ∈ V such that ρV(v, v′) > t}.    (7.4.11)
As a special case, when t = 0 and ρV is the discrete metric, this definition reduces to that of a
packing set: we are guaranteed that ρ(θv, θv′) ≥ δ(0) for all distinct pairs v ≠ v′, as in the classical
approach to minimax lower bounds. On the other hand, allowing for t > 0 lends greater flexibility
to the construction, since only certain pairs θv and θv′ are required to be well-separated.
Given a set V and associated separation function (7.4.11), we assume the canonical estimation
setting: nature chooses V ∈ V uniformly at random, and conditioned on this choice V = v, a sample
X is drawn from the distribution Pv . We then have the following corollary of Proposition 7.13,
whose argument is completely identical to that for inequality (7.2.1):
Corollary 7.15. Given V uniformly distributed over V with separation function δ(t), we have

    Mn(θ(P), Φ ◦ ρ) ≥ Φ(δ(t)/2) (1 − (I(X; V) + log 2)/log(|V|/Nt^max))   for all t.    (7.4.12)
Notably, using the discrete metric ρV(v, v′) = 1{v ≠ v′} and taking t = 0 in the lower bound (7.4.12)
gives the classical Fano lower bound on the minimax risk based on constructing a packing [91, 143,
142]. We now turn to an example illustrating the use of Corollary 7.15 in providing a minimax
lower bound on the performance of regression estimators.
Example: Normal regression model Consider the d-dimensional linear regression model Y =
Xθ + ε, where ε ∈ Rn has i.i.d. N(0, σ²) entries and X ∈ Rn×d is known, but θ is not. In this case, our
family of distributions is
    PX := {N(Xθ, σ²In×n) | θ ∈ Rd} = {Y = Xθ + ε | ε ∼ N(0, σ²In×n), θ ∈ Rd}.
We then obtain the following lower bound on the minimax error in squared ℓ2-norm: there is a universal (numerical) constant c > 0 such that

    Mn(θ(PX), ‖·‖₂²) ≥ c σ²d²/‖X‖²Fr ≥ (c/γ²max(X/√n)) · σ²d/n,    (7.4.13)
where γmax denotes the maximum singular value. Notably, this inequality is nearly the sharpest
known bound proved via Fano inequality-based methods [39], but our technique is essentially direct
and straightforward.
To see inequality (7.4.13), let the set V = {−1, 1}d be the d-dimensional hypercube, and define
θv = δv for some fixed δ > 0. Then letting ρV be the Hamming metric on V and ρ be the usual ℓ2-norm, the associated separation function (7.4.11) satisfies δ(t) > max{√t, 1}δ. Now, for any t ≤ ⌈d/3⌉, the neighborhood size satisfies

    Nt^max = Σ_{τ=0}^t (d choose τ) ≤ 2 (d choose t) ≤ 2 (de/t)^t.
where the final term vanishes since E is (V, V̂)-measurable. On the other hand, we also have

    H(E | V̂) ≤ H(E) = h2(Pt),

using the fact that conditioning reduces entropy. Applying the definition of conditional entropy yields

    H(V | E, V̂) = P(E = 0) H(V | V̂, E = 0) + P(E = 1) H(V | V̂, E = 1),

and we upper bound each of these terms separately. For the first term, we have

    H(V | V̂, E = 0) ≤ log(|V| − Nt^min),

since conditioned on the event E = 0, the random variable V may take values in a set of size at most |V| − Nt^min. For the second, we have

    H(V | V̂, E = 1) ≤ log Nt^max,

since conditioned on E = 1, or equivalently on the event that ρV(V̂, V) ≤ t, we are guaranteed that V belongs to a set of cardinality at most Nt^max. Combining the pieces and noting P(E = 0) = Pt, we have proved that

    H(V | V̂) ≤ h2(Pt) + Pt log(|V| − Nt^min) + (1 − Pt) log Nt^max.

Combining this inequality with our earlier equality (7.5.1), we see that

    H(V | X) − log Nt^max ≤ H(V | V̂) − log Nt^max ≤ P(ρV(V̂, V) > t) log((|V| − Nt^min)/Nt^max) + log 2.
Rearranging the preceding equations yields

    P(ρV(V̂, V) > t) ≥ (H(V | X) − log Nt^max − log 2) / log((|V| − Nt^min)/Nt^max).    (7.5.2)
Note that this bound holds without any assumptions on the distribution of V.
By definition, we have I(V ; X) = H(V ) − H(V | X). When V is uniform on V, we have
H(V ) = log |V|, and hence H(V | X) = log |V| − I(V ; X). Substituting this relation into the
bound (7.5.2) yields the inequality
    P(ρV(V̂, V) > t) ≥ log(|V|/Nt^max)/log((|V| − Nt^min)/Nt^max) − (I(V; X) + log 2)/log((|V| − Nt^min)/Nt^max)
                    ≥ 1 − (I(V; X) + log 2)/log(|V|/Nt^max).
7.6 Exercises
Question 7.1 (A generalized version of Fano's inequality; cf. Proposition 7.13): Let V and V̂ be arbitrary sets, and suppose that π is a (prior) probability measure on V, where V is distributed according to π. Let V → X → V̂ be a Markov chain, where V takes values in V and V̂ takes values in V̂. Let N ⊂ V × V̂ denote a measurable subset of V × V̂ (a collection of neighborhoods), and for any v̂ ∈ V̂, denote the slice

    Nv̂ := {v ∈ V : (v, v̂) ∈ N}.    (7.6.1)

That is, N denotes the neighborhoods of points v for which we do not consider a prediction v̂ for v to be an error, and the slices (7.6.1) index the neighborhoods. Define the "volume" constants pmax := sup_{v̂∈V̂} π(Nv̂) and pmin := inf_{v̂∈V̂} π(Nv̂), the error probability Perror = P[(V, V̂) ∉ N], and the entropy h2(p) = −p log p − (1 − p) log(1 − p).

(a) Show that

    h2(Perror) + Perror log((1 − pmin)/pmax) ≥ log(1/pmax) − I(V; V̂).    (7.6.2)

(b) Conclude that

    P[(V, V̂) ∉ N] ≥ 1 − (I(V; X) + log 2)/(inf_{v̂} log(1/π(Nv̂))).
(c) Now we give a version explicitly using distances. Let V ⊂ Rd and define N = {(v, v′) : ‖v − v′‖ ≤ δ} to be the points within δ of one another. Let Bv denote the ‖·‖-ball of radius 1 centered at v. Conclude for any prior π on Rd that

    P(‖V − V̂‖ ≥ δ) ≥ 1 − (I(V; X) + log 2)/log(1/sup_v π(δBv)).
Question 7.2: In this question, we will show that the minimax rate of estimation for the parameter of a uniform distribution (in squared error) scales as 1/n². In particular, assume that Xi ∼iid Uniform(θ, θ + 1), meaning that the Xi have densities p(x) = 1{x ∈ [θ, θ + 1]}. Let X(1) = mini{Xi} denote the first order statistic.
(b) Using Le Cam’s two-point method, show that the minimax rate for estimation of θ ∈ R for the
uniform family U = {Uniform(θ, θ + 1) : θ ∈ R} in squared error has lower bound c/n2 , where
c is a numerical constant.
Question 7.3 (Sign identification in sparse linear regression): In sparse linear regression, we have n observations Yi = ⟨Xi, θ*⟩ + εi, where the Xi ∈ Rd are known (fixed) vectors, the vector θ* has a small number k ≪ d of non-zero indices, and εi ∼iid N(0, σ²). In this problem, we investigate the problem of sign recovery, that is, identifying the vector of signs sign(θj*) for j = 1, . . . , d, where sign(0) = 0.
Assume we have the following process: fix a signal threshold θmin > 0. First, a vector S ∈ {−1, 0, 1}d is chosen uniformly at random from the set of vectors Sk := {s ∈ {−1, 0, 1}d : ‖s‖₁ = k}.
Then we define vectors θ^s so that θj^s = θmin sj, and conditional on S = s, we observe

    Yi = ⟨Xi, θ^s⟩ + εi,   i = 1, . . . , n.
where c is a numerical constant. You may assume that k ≥ 4 or log k ≥ 4 log 2.
(b) Assume that X ∈ {−1, 1}n×d. Give a lower bound on how large n must be for sign recovery. Give a one-sentence interpretation of σ²/θ²min.
Question 7.4 (General minimax lower bounds): In this exercise, we outline a more general
approach to minimax risk than that afforded by studying losses applied to parameter error. In
particular, we may instead consider losses of the form

    L : Θ × P → R₊,

where P is a collection of distributions and Θ is a parameter space, and where additionally the losses satisfy the condition

    inf_{θ∈Θ} L(θ, P) = 0   for all P ∈ P.
(a) Consider a statistical risk minimization problem, where we have a distribution P on random
variable X ∈ X , loss function f : Θ × X → R, and for P ∈ P define the population risk
FP (θ) := EP [f (θ, X)]. Show that
(b) For distributions P0, P1, define the separation between them (for the loss L) by

    sepL(P0, P1; Θ) := sup{δ ≥ 0 : for any θ ∈ Θ, L(θ, P0) ≤ δ implies L(θ, P1) ≥ δ, and L(θ, P1) ≤ δ implies L(θ, P0) ≥ δ}.    (7.6.3)

That is, having small loss on P0 implies large loss on P1 and vice versa.
We say a collection of distributions {Pv}v∈V indexed by V is δ-separated if sepL(Pv, Pv′; Θ) ≥ δ for all v ≠ v′. Show that if {Pv}v∈V is δ-separated, then for any estimator θ̂

    (1/|V|) Σ_{v∈V} E_{Pv}[L(θ̂, Pv)] ≥ δ inf_{v̂} P(v̂ ≠ V),

where P is the joint distribution over the random index V, chosen uniformly, and then X sampled according to X ∼ Pv conditional on V = v.
Question 7.5 (Optimality in stochastic optimization): In this question, we prove minimax lower
bounds on the convergence rates in stochastic optimization problems based on the size of the
domain over which we optimize and certain Lipschitz conditions of the functions themselves. You
may assume the dimension d in the problems we consider is as large as you wish.
The setting is as follows: we have a domain Θ ⊂ Rd and a function f : Θ × X → R, convex in its first argument, with population risks FP(θ) := EP[f(θ, X)], where the expectation is taken over X ∼ P. For any two functions F0, F1, let θv ∈ argmin_{θ∈Θ} Fv(θ), and define the optimization distance between F0 and F1 by

    dopt(F0, F1; Θ) := inf_{θ∈Θ} {F0(θ) + F1(θ) − F0(θ0) − F1(θ1)}.
(d) Let V ⊂ {±1}d be a d/2-packing in ℓ1-distance of cardinality at least exp(d/8) (by Gilbert–Varshamov, Lemma 7.5). Assume that Θ ⊃ [−1, 1]d. Show that dopt(Fv, Fv′) ≥ δ‖v − v′‖₁/d for all distinct v, v′ ∈ V.
(e) For our loss L(θ, P) = FP(θ) − inf_{θ∈Θ} FP(θ), show that the minimax loss gap

    Mn(P, Θ, L) := inf_{θ̂n} sup_{P∈P} EP[L(θ̂n(X1^n), P)] = inf_{θ̂n} sup_{P∈P} EP[FP(θ̂n(X1^n)) − FP⋆]

(where FP⋆ = inf_{θ∈Θ} FP(θ) and X1^n ∼iid P) satisfies

    Mn(P, Θ, L) ≥ c min{√d/√n, 1},

where c > 0 is a constant. You may assume d ≥ 8 (or any other large constant) for simplicity.
(f) Show how to modify this construction so that for constants L, R > 0, if Θ ⊃ [−R, R]d, there are functions f that are L-Lipschitz with respect to the ℓ∞ norm, meaning that

    |f(θ, x) − f(θ′, x)| ≤ L‖θ − θ′‖∞   for all θ, θ′ ∈ Θ and x ∈ X,

such that for this domain Θ, loss f (and induced L), and the same family of distributions P as above,

    Mn(P, Θ, L) ≥ cLR min{√d/√n, 1}.
(g) Suppose that instead we have Θ ⊃ {θ ∈ Rd | ‖θ‖₂ ≤ R₂}, the ℓ2-ball of radius R₂, and allow f to be L₂-Lipschitz with respect to the ℓ2-norm (instead of ℓ∞). Show that

    Mn(P, Θ, L) ≥ c L₂R₂/√n.
That is, X ∼ Pv has independent random sign coordinates except in coordinate j when v = ej, where P±ej(Xj = ±1) = (1 ± δ)/2. Let
(b) Using the optimization distance dopt(F0, F1; Θ) = inf_{θ∈Θ} {F0(θ) + F1(θ) − F0⋆ − F1⋆}, where Fv⋆ = inf_{θ∈Θ} Fv(θ), defined in Question 7.5, show the separation
(c) Let the loss L(θ, P ) = FP (θ) − inf θ∈Θ FP (θ) as in Question 7.5, let P be the collection of
distributions supported on [−1, 1]d , and define the minimax loss gap
    Mn(P, Θ, L) := inf_{θ̂n} sup_{P∈P} EP[FP(θ̂n(X1^n)) − FP⋆],
where X1^n ∼iid P. Show that there exists a numerical constant c > 0 such that
    Mn(P, Θ, L) ≥ c √(log(2d))/√n.
(You may assume d ≥ 2 to avoid trivial cases.) Hint. Use the result of Question 7.4 part (c).
Question 7.7 (Optimal algorithms for memory access): In a modern CPU, memory is
organized in a hierarchy, so that data upon which computations are being actively performed lies
in a very small memory close to the logic units of the processor for which access is extraordinarily
fast, while data not being actively used lies in slower memory slightly farther from the processor.
(Modern processor memory is generally organized into the registers—a small number of 4- or 8-byte memory locations on the processor—the level 1, 2 (and sometimes 3 or more) caches, which hold small amounts of data at increasing access times, and RAM (random access memory).) Moving
data—communicating—between levels of the memory hierarchy is both power intensive and very
slow relative to computation on the data itself, so that in many algorithms the bulk of the time of
the algorithm is in moving data from one place to another to be computed upon. Thus, developing
very fast algorithms for numerical (and other) tasks on modern computers requires careful tracking
of memory access and communication, and careful control of these quantities can often yield orders
of magnitude speed improvements in execution. In this problem, you will prove a lower bound on
the number of communication steps that a variety of numerical-type methods must perform, giving
a concrete (attainable) inequality that allows one to certify optimality of specific algorithms.
In particular, we consider matrix multiplication, as it is a proxy for a class of cubic algorithms
that are well behaved. Let A, B ∈ Rn×n be matrices, and assume we wish to compute C = AB,
via the simple algorithm that for all i, j sets

    Cij = Σ_{l=1}^n Ail Blj.
Abstractly, each update in the computation takes the form

    Mem(Cij) ← F(Mem(Ail), Mem(Blj), Mem(Cij)),

where F is some function—that may depend on i, j, l—and Mem(·) indicates that we access the memory associated with the argument. (In our case, we have Cij = Cij + Ail · Blj.) We assume that executing F requires that Mem(Ail), Mem(Blj), and Mem(Cij) belong to fast memory, and that each is distinct (stored in a separate place in slow and fast memory). We assume that the order of the computations does not matter, so we may re-order them in any way. We call Mem(Ail) (respectively B or C) an operand in our computation. We let M denote the size of fast/local
memory, and we would like to lower bound the number of times we must communicate an operand
into or out of the fast local memory as a function of n, the matrix size, and M , the fast memory
size, when all we may do is re-order the computation being executed. We let NStore denote the number of times we write something from fast memory out to slow memory and NLoad denote the number of times we load something from slow memory to fast memory. Let N be the total number
of operations we execute (for simple matrix multiplication, we have N = n3 , though with sparse
matrices, this can be smaller).
We analyze the procedure by breaking the computation into a number of segments, where each
segment contains precisely M load or store (communication-causing) instructions.
(a) Let Nseg be an upper bound on the number of evaluations of the function F(·) in any given segment (you will upper bound this in a later part of the problem). Justify that

    NStore + NLoad ≥ M (N/Nseg − 1).
(b) Within a segment, all operands involved must be in fast memory at least once to be computed
with. Assume that memory locations Mem(Ail ), Mem(Blj ), and Mem(Cij ) do not overlap.
For any operand involved in a memory operation in one of the segments, the operand (1) was
already in fast memory at the beginning of the segment, (2) was read from slow memory, (3)
is still in fast memory at the end of the segment, or (4) is written to slow memory at the end
of the segment. (There are also operands potentially created during execution that are simply
discarded; we do not bound those.) Justify the following: within a segment, for each type of
operand Mem(Aij ), Mem(Bij ), or Mem(Cij ), there are at most c · M such operands (i.e. there
are at most cM operands of type Mem(Aij ), independent of the others, and so on), where c is
a numerical constant. What value of c can you attain?
(c) Using the result of question 5.1, argue that Nseg ≤ c′√(M³) for a numerical constant c′. What value of c′ do you get?
(d) Using the result of part (c), argue that the number of loads and stores satisfies

    NStore + NLoad ≥ c″ N/√M − M

for a numerical constant c″. What is your constant?
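The lower bound of part (d) is attained, up to constants, by the classical blocked (tiled) algorithm; the following Python sketch (the block size rule and the counting conventions are our own simplifications) multiplies by b × b tiles with b ≈ √(M/3) and counts the words moved:

import numpy as np

# A sketch of blocked (tiled) matrix multiplication, which attains the
# communication lower bound of part (d) up to constants: with fast memory of
# size M, tiles of side b ~ sqrt(M/3) give ~ n^3 / sqrt(M) words transferred.
def blocked_matmul(A, B, M):
    n = A.shape[0]
    b = max(1, int(np.sqrt(M / 3)))      # three b x b tiles must fit in M
    C = np.zeros((n, n))
    loads = stores = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            loads += 1                   # tile of C stays resident in the loop
            for k in range(0, n, b):
                loads += 2               # load one tile each of A and B
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
            stores += 1                  # write the C tile back out
    return C, (loads + stores) * b * b   # total words moved

A = np.random.rand(64, 64); B = np.random.rand(64, 64)
C, words = blocked_matmul(A, B, M=3 * 16**2)
print(np.allclose(C, A @ B), words, 64**3 / np.sqrt(3 * 16**2))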
Chapter 8
Assouad’s method
Assouad’s method provides a somewhat different technique for proving lower bounds. Instead of
reducing the estimation problem to a multiple hypothesis test or simpler estimation problem, as
with Le Cam’s method and Fano’s method from the preceding lectures, here we transform the
original estimation problem into multiple binary hypothesis testing problems, using the structure
of the problem in an essential way. Assouad's method applies only to problems where the loss we care about is naturally related to identification of individual points on a hypercube. That is, we can take the parameter θ and test the individual indices via v̂.
While Lemma 8.2 requires conditions on the loss Φ and metric ρ for the separation condi-
tion (8.1.1) to hold, it is sometimes easier to apply than Fano’s method. Moreover, while we will
not address this in class, several researchers [7, 59] have noted that it appears to allow easier ap-
plication in so-called “interactive” settings—those for which the sampling of the Xi may not be
precisely i.i.d. It is closely related to Le Cam's method, discussed previously, as we see that if we define P+j = 2^{1−d} Σ_{v:vj=1} Pv (and similarly for P−j), Lemma 8.2 is equivalent to

    M(θ(P), Φ ◦ ρ) ≥ δ Σ_{j=1}^d (1 − ‖P+j − P−j‖TV).    (8.1.2)
There are standard weakenings of the lower bound (8.1.2) (and Lemma 8.2). We give one such weakening. First, we note that the total variation distance is convex, so that if we define Pv,+j to be the distribution Pv where coordinate j takes the value vj = 1 (and similarly for Pv,−j), we have

    P+j = (1/2^d) Σ_{v∈{−1,1}^d} Pv,+j   and   P−j = (1/2^d) Σ_{v∈{−1,1}^d} Pv,−j.
Then as long as the loss satisfies the per-coordinate separation (8.1.1), we obtain the following:

    M(θ(P), Φ ◦ ρ) ≥ dδ (1 − max_{v,j} ‖Pv,+j − Pv,−j‖TV).    (8.1.3)
and consequently we have a not quite so terribly weak version of inequality (8.1.2):

    M(θ(P), Φ ◦ ρ) ≥ δd (1 − [(1/d) Σ_{j=1}^d (1/2^d) Σ_{v∈{−1,1}^d} ‖Pv,+j − Pv,−j‖²TV]^{1/2}).    (8.1.4)
Regardless of whether we use the sharper version (8.1.2) or weakened versions (8.1.3) or (8.1.4),
the technique is essentially the same. We simply seek a setting of the distributions Pv so that the
probability of making a mistake in the hypothesis test of Lemma 8.2 is high enough—say 1/2—
or the variation distance is small enough—such as kP+j − P−j kTV ≤ 1/2 for all j. Once this is
satisfied, we obtain a minimax lower bound of the form

    M(θ(P), Φ ◦ ρ) ≥ δ Σ_{j=1}^d (1 − 1/2) = dδ/2.
    Φ(ρ(θ, θ(Pv))) ≥ 2δ Σ_{j=1}^d 1{[v̂(θ)]j ≠ vj},
as the average is smaller than the maximum of a set and using the separation assumption (8.1.1).
Recalling the definition of the mixtures P±j as the joint distribution of V and X conditional on Vj = ±1, we swap the summation orders to see that

    (1/|V|) Σ_{v∈V} Pv([v̂(θ̂)]j ≠ vj) = (1/|V|) Σ_{v:vj=1} Pv([v̂(θ̂)]j ≠ vj) + (1/|V|) Σ_{v:vj=−1} Pv([v̂(θ̂)]j ≠ vj)
                                      = (1/2) P+j([v̂(θ̂)]j ≠ vj) + (1/2) P−j([v̂(θ̂)]j ≠ vj).
This gives the statement claimed in the lemma, while taking an infimum over all testing procedures
Ψ : X → {−1, +1} gives the claim (8.1.2).
Example 8.3 (Normal mean estimation): For some σ² > 0 and d ∈ N, we consider estimation of the mean parameter for the normal location family

    N := {N(θ, σ²Id×d) : θ ∈ Rd}

in squared Euclidean distance. We now show how, for this family, the sharp version of Assouad's method implies the lower bound

    Mn(θ(N), ‖·‖₂²) ≥ dσ²/(8n).    (8.2.1)
Up to constant factors, this bound is sharp; the sample mean has mean squared error dσ 2 /n.
We proceed in (essentially) the usual way we have set up. Fix some δ > 0 and define θv = δv, taking Pv = N(θv, σ²Id×d) to be the normal distribution with mean θv. In this case, we see that the hypercube structure is natural, as our loss function decomposes on coordinates: we have ‖θ − θv‖₂² ≥ δ² Σ_{j=1}^d 1{sign(θj) ≠ vj}. The family Pv thus induces a δ²-Hamming separation for the loss ‖·‖₂², and by Assouad's method (8.1.2), we have
    Mn(θ(N), ‖·‖₂²) ≥ (δ²/2) Σ_{j=1}^d [1 − ‖P+j^n − P−j^n‖TV],

where P±j^n = 2^{1−d} Σ_{v:vj=±1} Pv^n. It remains to provide upper bounds on ‖P+j^n − P−j^n‖TV. By
the convexity of ‖·‖²TV and Pinsker's inequality, we have

    ‖P+j^n − P−j^n‖²TV ≤ max_{dham(v,v′)≤1} ‖Pv^n − Pv′^n‖²TV ≤ (1/2) max_{dham(v,v′)≤1} Dkl(Pv^n‖Pv′^n).
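The example finishes by evaluating this KL-divergence and choosing δ; the following sketch carries out that final step numerically with the choice δ = σ/(2√n) (our own choice), which recovers the bound (8.2.1) up to the constant:

import numpy as np

# Completing Example 8.3 numerically (a sketch; delta = sigma/(2 sqrt(n)) is
# our choice). For d_ham(v, v') <= 1, ||theta_v - theta_v'||^2 = 4 delta^2, so
# D_kl(P_v^n || P_v'^n) = 2 n delta^2 / sigma^2 and Pinsker gives
# ||P_{+j}^n - P_{-j}^n||_TV <= delta sqrt(n) / sigma.
def assouad_bound_normal_mean(d, sigma2, n):
    delta = np.sqrt(sigma2) / (2 * np.sqrt(n))   # makes the TV bound equal 1/2
    tv = delta * np.sqrt(n / sigma2)
    return delta**2 / 2 * d * (1 - tv)           # = d sigma^2 / (16 n)

d, sigma2 = 10, 1.0
for n in [100, 1000]:
    print(n, assouad_bound_normal_mean(d, sigma2, n), d * sigma2 / (8 * n))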
Example 8.4 (Logistic regression): In this example, consider the logistic regression model,
where we have known (fixed) regressors Xi ∈ Rd and an unknown parameter θ ∈ Rd ; the goal
is to infer θ after observing a sequence of Yi ∈ {−1, 1}, where for y ∈ {−1, 1} we have

    P(Yi = y | Xi, θ) = 1/(1 + exp(−y Xi⊤θ)).
Denote this family by Plog, and for P ∈ Plog, let θ(P) be the predictor vector θ. We would like to estimate the vector θ in squared ℓ2 error. As in Example 8.3, if we choose some δ > 0 and for each v ∈ {−1, 1}d we set θv = δv, then we have the δ²-separation in Hamming metric ‖θ − θv‖₂² ≥ δ² Σ_{j=1}^d 1{sign(θj) ≠ vj}. Let Pv^n denote the distribution of the n independent
observations Yi when θ = θv. Then we have by Assouad's lemma (and the weakening (8.1.4)) that

    Mn(θ(Plog), ‖·‖₂²) ≥ (δ²/2) Σ_{j=1}^d [1 − ‖P+j^n − P−j^n‖TV]
                       ≥ (dδ²/2) [1 − ((1/d) Σ_{j=1}^d (1/2^d) Σ_{v∈{−1,1}^d} ‖Pv,+j^n − Pv,−j^n‖²TV)^{1/2}].    (8.2.2)
Let us upper bound the final total variation distances via KL-divergences. Let pa = 1/(1 + e^a) and pb = 1/(1 + e^b). We claim that

    Dkl(pa‖pb) + Dkl(pb‖pa) ≤ (a − b)².    (8.2.3)
Deferring the proof of claim (8.2.3), we immediately see (via Pinsker's inequality) that

    ‖Pv^n − Pv′^n‖²TV ≤ (δ²/4) Σ_{i=1}^n (Xi⊤(v − v′))².
Now we recall inequality (8.2.2) for motivation, and we see that the preceding display implies

    (1/d) Σ_{j=1}^d (1/2^d) Σ_{v∈{−1,1}^d} ‖Pv,+j^n − Pv,−j^n‖²TV ≤ (δ²/(4d)) Σ_{j=1}^d Σ_{i=1}^n (2Xij)² = (δ²/d) Σ_{i=1}^n Σ_{j=1}^d Xij².
Replacing the final double sum with ‖X‖²Fr, where X is the matrix of the Xi, we have

    Mn(θ(Plog), ‖·‖₂²) ≥ (dδ²/2) [1 − (δ²‖X‖²Fr/d)^{1/2}];

choosing δ² = d/(4‖X‖²Fr) then gives Mn(θ(Plog), ‖·‖₂²) ≥ d²/(16‖X‖²Fr).
That is, we have a minimax lower bound scaling roughly as d/n for logistic regression (when ‖X‖²Fr ≈ nd), where "large" Xi (in ℓ2-norm) suggest that we may obtain better performance in estimation. This is intuitive, as a larger Xi gives a better signal-to-noise ratio.
We now return to prove the claim (8.2.3). Indeed, by a straightforward expansion, we have

    Dkl(pa‖pb) + Dkl(pb‖pa) = pa log(pa/pb) + (1 − pa) log((1 − pa)/(1 − pb)) + pb log(pb/pa) + (1 − pb) log((1 − pb)/(1 − pa))
                            = (pa − pb) log(pa/pb) + (pb − pa) log((1 − pa)/(1 − pb)) = (pa − pb) log(pa(1 − pb)/((1 − pa)pb)).

Noting that pa/(1 − pa) = e^{−a} and (1 − pb)/pb = e^{b}, the final expression equals (pa − pb)(b − a); since the map a ↦ 1/(1 + e^a) is 1/4-Lipschitz, |pa − pb| ≤ |a − b|/4, and the claim (8.2.3) follows.
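A quick numerical check of the claim (8.2.3) (our own sanity test over random (a, b), not part of the proof):

import numpy as np

# Numerical sanity check of claim (8.2.3): for Bernoulli parameters
# p_a = 1/(1+e^a), p_b = 1/(1+e^b), the symmetrized KL is at most (a-b)^2.
def kl_bern(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

rng = np.random.default_rng(1)
for _ in range(5):
    a, b = rng.uniform(-4, 4, size=2)
    pa, pb = 1 / (1 + np.exp(a)), 1 / (1 + np.exp(b))
    sym_kl = kl_bern(pa, pb) + kl_bern(pb, pa)
    print(sym_kl <= (a - b)**2, sym_kl, (a - b)**2)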
8.3 Exercises
Question 8.1: In this question, we study the question of whether adaptivity can give better
estimation performance for linear regression problems. That is, for i = 1, . . . , n, assume that we
observe variables Yi in the usual linear regression setup,

    Yi = ⟨Xi, θ⟩ + εi,   εi ∼iid N(0, σ²),    (8.3.1)
where θ ∈ Rd is unknown. But now, based on observing Y1^{i−1} = {Y1, . . . , Yi−1}, we allow an adaptive choice of the next predictor variables Xi ∈ Rd. Let Lnada(F²) denote the family of linear regression problems under this adaptive setting (with n observations), where we constrain the Frobenius norm of the data matrix X⊤ = [X1 · · · Xn], X ∈ Rn×d, to have bound ‖X‖²Fr = Σ_{i=1}^n ‖Xi‖₂² ≤ F². We use Assouad's method to show that the minimax mean-squared error satisfies the following bound:

    M(Lnada(F²), ‖·‖₂²) := inf_{θ̂} sup_{θ∈Rd} E[‖θ̂ − θ‖₂²] ≥ (dσ²/16) · (d/F²) = d²σ²/(16F²).    (8.3.2)
Here the infimum is taken over all adaptive procedures satisfying ‖X‖²Fr ≤ F².
In general, when we choose Xi based on the observations Y1^{i−1}, we are taking Xi = Fi(Y1^{i−1}, U1^i), where Ui is a random variable independent of εi and Y1^{i−1}, and Fi is some function. Justify the following steps in the proof of inequality (8.3.2):

(i) Assume that nature chooses v ∈ V = {−1, 1}d uniformly at random and, conditionally on v, let θ = θv. Justify

    M(Lnada(F²), ‖·‖₂²) ≥ inf_{θ̂} (1/|V|) Σ_{v∈V} E_{θv}[‖θ̂ − θv‖₂²].

Argue it is no loss of generality to assume that the choices for Xi are deterministic based on the Y1^{i−1}. Thus, throughout we assume that Xi = Fi(Y1^{i−1}, u1^i), where u1^n is a fixed sequence, or, for simplicity, that Xi is a function of Y1^{i−1}.
(ii) Fix δ > 0. Let v ∈ {−1, 1}d, and for each such v define θv = δv. Also let Pv^n denote the joint distribution (over all adaptively chosen Xi) of the observed variables Y1, . . . , Yn, and define P+j^n = 2^{1−d} Σ_{v:vj=1} Pv^n and P−j^n = 2^{1−d} Σ_{v:vj=−1} Pv^n, so that P±j^n denotes the distribution of the Yi when v ∈ {−1, 1}d is chosen uniformly at random but conditioned on vj = ±1. Then

    inf_{θ̂} (1/|V|) Σ_{v∈V} E_{θv}[‖θ̂ − θv‖₂²] ≥ (δ²/2) Σ_{j=1}^d [1 − ‖P+j^n − P−j^n‖TV].
(iii) We have

    (δ²/2) Σ_{j=1}^d [1 − ‖P+j^n − P−j^n‖TV] ≥ (δ²d/2) [1 − ((1/d) Σ_{j=1}^d ‖P+j^n − P−j^n‖²TV)^{1/2}].
(iv) Let P+j^{(i)} be the distribution of the random variable Yi conditioned on vj = +1 (with the other coordinates of v chosen uniformly at random), and let P+j^{(i)}(· | y1^{i−1}, xi) denote the distribution of Yi conditioned on vj = +1, Y1^{i−1} = y1^{i−1}, and xi. Justify

    ‖P+j^n − P−j^n‖²TV ≤ (1/2) Dkl(P+j^n‖P−j^n)
                       ≤ (1/2) Σ_{i=1}^n ∫ Dkl(P+j^{(i)}(· | y1^{i−1}, xi)‖P−j^{(i)}(· | y1^{i−1}, xi)) dP+j(y1^{i−1}, xi).
(vi) We have

    Σ_{j=1}^d ‖P+j^n − P−j^n‖²TV ≤ (δ²/σ²) E[‖X‖²Fr],
where the final expectation is over V drawn uniformly in {−1, 1}d and all Yi , Xi .
(vii) Show how to choose δ appropriately to conclude the minimax bound (8.3.2).
Question 8.2: Suppose under the setting of Question 8.1 that we may no longer be adaptive, meaning that the matrix X ∈ Rn×d must be chosen ahead of time (without seeing any data). Assuming n ≥ d, is it possible to attain (within a constant factor) the risk (8.3.2)? If so, give an example construction; if not, explain why not.
Chapter 9
9.1 Introduction
We consider one of the two most classical non-parametric problems in this example: estimating a regression function on a subset of the real line (the most classical problem being estimation of a
density). In non-parametric regression, we assume there is an unknown function f : R → R, where
f belongs to a pre-determined class of functions F; usually this class is parameterized by some
type of smoothness guarantee. To make our problems concrete, we will assume that the unknown
function f is L-Lipschitz and defined on [0, 1]. Let F denote this class. (For a fuller technical
introduction into nonparametric estimation, see the book by Tsybakov [136].)
Figure 9.1. Observations in a non-parametric regression problem, with function f plotted. (Here f(x) = sin(2x + cos²(3x)).)
Yi = f (Xi ) + εi (9.1.1)
where εi are independent, mean zero conditional on Xi , and E[ε2i ] ≤ σ 2 . See Figure 9.1 for an
example. We also assume that we fix the locations of the Xi as Xi = i/n ∈ [0, 1], that is, the Xi
are evenly spaced in [0, 1]. Given n observations Yi , we ask two questions: (1) how can we estimate
f ? and (2) what are the optimal rates at which it is possible to estimate f ?
We consider kernel estimators based on a kernel K : R → [0, 1] satisfying the minorization condition

    K(x) ≥ λ0 1{|x| ≤ 1/2},    (9.2.1)

where λ0 > 0 (this says the kernel has some width to it). A natural example is the "tent" function given by Ktent(x) = [1 − |x|]₊, which satisfies inequality (9.2.1) with λ0 = 1/2. See Fig. 9.2 for two examples, one the tent function and the other the function

    K(x) = 1{|x| < 1} exp(−1/(x − 1)²) exp(−1/(x + 1)²),

which is infinitely differentiable and supported on [−1, 1].
Figure 9.2: Left: “tent” kernel. Right: infinitely differentiable compactly supported kernel.
Now we consider a natural estimator of the function f based on the observations (9.1.1), known as the Nadaraya–Watson estimator. Fix a bandwidth h > 0, which we will see controls the smoothness of the estimated function. For all x, define weights

    Wni(x) := K((Xi − x)/h) / Σ_{j=1}^n K((Xj − x)/h)

and the estimator f̂n(x) := Σ_{i=1}^n Yi Wni(x).
The intuition here is that we have a locally weighted regression function, where points Xi in the neighborhood of x are given higher weight than farther points. Using this function f̂n as our estimator, it is possible to provide a guarantee on the bias and variance of the estimated function at each point x ∈ [0, 1].
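To make the estimator concrete, here is a minimal Python sketch (the regression function, noise level, and sample size are our own illustrative choices; we use the tent kernel):

import numpy as np

# A minimal implementation of the Nadaraya-Watson estimator with the tent
# kernel K(x) = max(1 - |x|, 0); design, f, and sigma are illustrative.
def nw_estimate(x, X, Y, h):
    W = np.maximum(1 - np.abs((X - x) / h), 0.0)   # tent kernel weights
    return np.sum(W * Y) / np.sum(W)

n, sigma = 500, 0.3
X = np.arange(1, n + 1) / n                        # X_i = i/n, evenly spaced
f = lambda x: np.sin(2 * x + np.cos(3 * x) ** 2)
Y = f(X) + sigma * np.random.default_rng(2).normal(size=n)

h = n ** (-1 / 3)                                  # bandwidth scaling of Theorem 9.2
grid = np.linspace(0.05, 0.95, 10)
fhat = np.array([nw_estimate(x, X, Y, h) for x in grid])
print(np.max(np.abs(fhat - f(grid))))              # pointwise error ~ n^{-1/3}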
Proposition 9.1. Let the observation model (9.1.1) hold and assume condition (9.2.1). In addition, assume the bandwidth is suitably large that h ≥ 2/n and that the Xi are evenly spaced on [0, 1]. Then for any x ∈ [0, 1], we have

    |E[f̂n(x)] − f(x)| ≤ Lh   and   Var(f̂n(x)) ≤ 2σ²/(λ0 nh).
Proof   To bound the bias, we note that (conditioning implicitly on the Xi)

    E[f̂n(x)] = Σ_{i=1}^n E[Yi Wni(x)] = Σ_{i=1}^n E[f(Xi)Wni(x) + εi Wni(x)] = Σ_{i=1}^n f(Xi)Wni(x),
and because there are at least nh/2 indices j satisfying |Xj − x| ≤ h/2 (so that K((Xj − x)/h) ≥ λ0), we obtain the claim (9.2.2).
Using the claim, we have

    Var(f̂n(x)) = E[(Σ_{i=1}^n (Yi − f(Xi))Wni(x))²] = E[(Σ_{i=1}^n εi Wni(x))²]
               = Σ_{i=1}^n Wni(x)² E[εi²] ≤ σ² Σ_{i=1}^n Wni(x)².
Noting that Wni(x) ≤ 2/(λ0 nh) and Σ_{i=1}^n Wni(x) = 1, we have

    σ² Σ_{i=1}^n Wni(x)² ≤ σ² max_i Wni(x) · Σ_{i=1}^n Wni(x) ≤ 2σ²/(λ0 nh),

which gives the variance bound.
With the proposition in place, we can then provide a theorem bounding the worst-case pointwise mean squared error for estimation of a function f ∈ F.

Theorem 9.2. Under the conditions of Proposition 9.1, choose h = (σ²/(L²λ0))^{1/3} n^{−1/3}. Then there exists a universal (numerical) constant C < ∞ such that for any f ∈ F,

    sup_{x∈[0,1]} E[(f̂n(x) − f(x))²] ≤ C (Lσ²/λ0)^{2/3} n^{−2/3}.
By integrating the result in Theorem 9.2 over the interval [0, 1], we immediately obtain the following
corollary.
Corollary 9.3. Under the conditions of Theorem 9.2, if we use the tent kernel Ktent, we have

    sup_{f∈F} Ef[‖f̂n − f‖₂²] ≤ C (Lσ²/n)^{2/3}.
In Proposition 9.1, it is possible to show that a more clever choice of kernels—ones that are not always positive—can attain bias E[f̂n(x)] − f(x) = O(h^β) if f has Lipschitz (β − 1)th derivative. In this case, we immediately obtain that the rate can be improved to

    sup_x E[(f̂n(x) − f(x))²] ≤ C n^{−2β/(2β+1)},

and every additional degree of smoothness gives a corresponding improvement in the convergence rate. We also remark that rates of this form, which are much larger than n^{−1}, are characteristic of non-parametric problems; essentially, we must adaptively choose a dimension that balances the sample size, so that rates of 1/n are difficult or impossible to achieve.
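As a rough empirical sanity check of these rates (our own simulation, not part of the development above), one can verify that quadrupling n cuts the pointwise mean squared error by roughly 4^{2/3} ≈ 2.5 when h ∝ n^{−1/3}:

import numpy as np

# Rough empirical check (illustrative) that the pointwise MSE of the
# Nadaraya-Watson estimator with h ~ n^{-1/3} decays at the n^{-2/3} rate.
rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * x + np.cos(3 * x) ** 2)

def mse_at_half(n, sigma=0.3, reps=200):
    X = np.arange(1, n + 1) / n
    h = n ** (-1 / 3)
    W = np.maximum(1 - np.abs((X - 0.5) / h), 0.0)   # tent kernel weights at x = 1/2
    W = W / W.sum()
    errs = [(W @ (f(X) + sigma * rng.normal(size=n)) - f(0.5)) ** 2
            for _ in range(reps)]
    return np.mean(errs)

for n in [250, 1000, 4000]:   # quadrupling n should cut MSE by ~ 4^{2/3}
    print(n, mse_at_half(n))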
which we see is 1-Lipschitz. Now, consider any function f : [0, 1] → R, and let Ej be shorthand for the intervals Ej = [(j − 1)/k, j/k] for j = 1, . . . , k. We must find a mapping identifying a function f with points in the hypercube {−1, 1}^k. To that end, we may define a vector v̂(f) ∈ {−1, 1}^k by

    v̂j(f) = argmin_{s∈{−1,1}} ∫_{Ej} (f(t) − s gj(t))² dt.

Indeed, on the set Ej, we have vj gj(t) = fv(t), and thus ∫_{Ej} gj(t)² dt = ∫_{Ej} fv(t)² dt. Then by the
Bounding the binary testing error   Let Pv^n denote the distribution of the n observations Yi = fv(Xi) + εi when fv is the true regression function. Then inequality (9.3.2) implies via Assouad's lemma that

    Mn(F, ‖·‖₂²) ≥ (c/k³) Σ_{j=1}^k [1 − ‖P+j^n − P−j^n‖TV].    (9.3.3)
j=1
175
Stanford Statistics 311/Electrical Engineering 377 John Duchi
For any two functions fv and fv0 , we have that the observations Yi are independent and normal
with means fv (Xi ) or fv0 (Xi ), respectively. Thus
n
X
Dkl (Pvn ||Pvn0 ) = Dkl N(fv (Xi ), σ 2 )||N(fv0 (Xi ), σ 2 )
i=1
n
X 1
= (fv (Xi ) − fv0 (Xi ))2 . (9.3.4)
2σ 2
i=1
Now we must show that the expression (9.3.4) scales more slowly than n, which we will see must
be the case as whenever dham (v, v 0 ) ≤ 1. Intuitively, most of the observations have the same
distribution by our construction of the fv as bump functions; let us make this rigorous.
We may assume without loss of generality that vj = v 0 j for j > 1. As the Xi = i/n, we thus
have that only Xi for i near 1 can have non-zero values in the tensorization (9.3.4). In particular,
i 2 2n
fv (i/n) = fv0 (i/n) for all i s.t. ≥ , i.e. i ≥ .
n k k
Rewriting expression (9.3.4), then, and noting that fv(x) ∈ [−1/k, 1/k] for all x by construction, we have

    Σ_{i=1}^n (1/(2σ²))(fv(Xi) − fv′(Xi))² ≤ Σ_{i=1}^{2n/k} (1/(2σ²))(fv(Xi) − fv′(Xi))² ≤ (1/(2σ²)) · (2n/k) · (1/k²) = n/(k³σ²).
Combining this with inequality (9.3.4) and the minimax bound (9.3.3), we obtain

    ‖P+j^n − P−j^n‖TV ≤ √(n/(2k³σ²)),
so

    Mn(F, ‖·‖₂²) ≥ (c/k³) Σ_{j=1}^k [1 − √(n/(2k³σ²))],
and, choosing k = (2n/σ²)^{1/3} so that n/(2k³σ²) = 1/4 and each summand is at least 1/2, we arrive at

    Mn(F, ‖·‖₂²) ≥ (c/k³) Σ_{j=1}^k (1/2) = c/(2k²) ≥ c′ (σ²/n)^{2/3}.
Chapter 10
In this chapter, we extend the techniques of Chapter 7.4 on Fano’s method (the local Fano method)
to a more global construction. In particular, we show that, rather than constructing a local packing,
choosing a scaling δ > 0, and then optimizing over this δ, it is actually, in many cases, possible to
prove lower bounds on minimax error directly using packing and covering numbers (metric entropy
and packing entropy). The material in this chapter is based on a paper of Yang and Barron [142].
Now, we show how to connect this mutual information quantity to a covering number of a set of
distributions.
Assume that for all v we have Pv ∈ P, where P is a collection of distributions. In analogy with Definition 7.1, we say that the collection of distributions {Qi}_{i=1}^N forms an ε-cover of P in KL-divergence if for all P ∈ P there exists some i such that Dkl(P‖Qi) ≤ ε². With this, we may define the KL-covering number of the set P as

    Nkl(ε, P) := inf{N ∈ N | ∃ Q1, . . . , QN such that sup_{P∈P} min_i Dkl(P‖Qi) ≤ ε²},    (10.1.2)

where Nkl(ε, P) = +∞ if no such cover exists. With definition (10.1.2) in place, we have the following proposition.
so that inequality (10.1.4) holds. By carefully choosing the distribution Q in the upper bound (10.1.4),
we obtain the proposition.
Now, assume that the distributions Qi, i = 1, . . . , N, form an ε-cover of the family P, meaning that

    min_{i∈[N]} Dkl(P‖Qi) ≤ ε²   for all P ∈ P.
Let pv and qi denote the densities of Pv and Qi with respect to some fixed base measure on X (the choice of base measure does not matter). Then defining the distribution Q̄ = (1/N) Σ_{i=1}^N Qi, we obtain for any v that in expectation over X ∼ Pv,

    Dkl(Pv‖Q̄) = E_{Pv}[log(pv(X)/q̄(X))] = E_{Pv}[log(pv(X)/(N^{−1} Σ_{i=1}^N qi(X)))]
              = log N + E_{Pv}[log(pv(X)/Σ_{i=1}^N qi(X))] ≤ log N + E_{Pv}[log(pv(X)/max_i qi(X))]
              ≤ log N + min_i E_{Pv}[log(pv(X)/qi(X))] = log N + min_i Dkl(Pv‖Qi).
By our assumption that the Qi form a cover, this gives the desired result, as ε ≥ 0 was arbitrary, as was our choice of the cover.
Corollary 10.2. Assume that X1, . . . , Xn are drawn i.i.d. from Pv conditional on V = v. Let Nkl(ε, P) denote the KL-covering number of a collection P containing the distributions (over a single observation) Pv for all v ∈ V. Then

    I(V; X1^n) ≤ inf_{ε>0} {nε² + log Nkl(ε, P)}.
With Corollary 10.2 and Proposition 10.1 in place, we thus see that the global covering numbers
in KL-divergence govern the behavior of information.
We remark in passing that the quantity (10.1.3), and its i.i.d. analogue in Corollary 10.2, is
known as the index of resolvability, and it controls estimation rates and redundancy of coding
schemes for unknown distributions in a variety of scenarios; see, for example, Barron [19] and
Barron and Cover [20]. It is also similar to notions of complexity in Dudley’s entropy integral
(cf. Dudley [62]) in empirical process theory, where the fluctuations of an empirical process are
governed by a tradeoff between covering number and approximation of individual terms in the
process.
(i) Bound the packing entropy. Give a lower bound on the packing number of the set Θ with
2δ-separation (call this lower bound M (δ)).
(ii) Bound the metric entropy. Give an upper bound on the KL-metric entropy of the class P of distributions containing all the distributions Pv, that is, an upper bound on log Nkl(ε, P).
(iii) Find the critical radius. Noting as in Corollary 10.2 that with n i.i.d. observations we have I(V; X1^n) ≤ nε² + log Nkl(ε, P) for all ε > 0, we now balance the information I(V; X1^n) and the packing entropy log M(δ). To that end, we choose εn and δn > 0 at the critical radius, defined as follows: choose any εn with nε²n ≥ log Nkl(εn, P), and then δn such that

    log M(δn) ≥ 4nε²n + 2 log 2 ≥ 2 log Nkl(εn, P) + 2nε²n + 2 log 2 ≥ 2 (I(V; X1^n) + log 2).

(We could have chosen the εn attaining the infimum in the mutual information bound, but this way we need only an upper bound on log Nkl(ε, P).)
(iv) Apply the Fano minimax bound. Having chosen δn and εn as above, we immediately obtain that for the Markov chain V → X1^n → V̂,

    P(V ≠ V̂) ≥ 1 − (I(V; X1, . . . , Xn) + log 2)/log M(δn) ≥ 1 − 1/2 = 1/2,

and thus, applying the Fano minimax bound in Proposition 7.10, we obtain

    Mn(θ(P); Φ ◦ ρ) ≥ (1/2) Φ(δn).
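To illustrate the recipe numerically, the following sketch carries out step (iii) for a stylized problem; the entropy growth rates log M(δ) ≈ c_p/δ and log Nkl(ε, P) ≈ c_k/(σε)—roughly what one obtains for 1-Lipschitz regression—and the constants c_p, c_k are assumptions for illustration only:

import numpy as np

# Step (iii) of the Yang-Barron recipe for assumed entropy growth rates
# log M(delta) ~ c_p / delta and log N_kl(eps, P) ~ c_k / (sigma eps).
def critical_radii(n, sigma, c_p=1.0, c_k=1.0):
    # choose eps_n solving n eps^2 = c_k / (sigma eps)
    eps_n = (c_k / (sigma * n)) ** (1.0 / 3)
    # then delta_n solving c_p / delta = 4 n eps_n^2 + 2 log 2
    delta_n = c_p / (4 * n * eps_n**2 + 2 * np.log(2))
    return eps_n, delta_n

for n in [100, 1000, 10000]:
    eps_n, delta_n = critical_radii(n, sigma=1.0)
    print(n, delta_n, delta_n / n ** (-1 / 3))   # delta_n tracks n^{-1/3}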
The rate in Proposition 10.3 is sharp to within factors logarithmic in n; a more precise analysis of the upper and lower bounds on the minimax rate yields

    Mn(F, ‖·‖∞) := inf_{f̂n} sup_{f∈F} Ef[‖f̂n − f‖∞] ≍ (σ² log n/n)^{1/3}.
For a single observation at Xi, we have

    Dkl(Pf‖Pg) = (1/(2σ²))(f(Xi) − g(Xi))² ≤ ‖f − g‖²∞/(2σ²),

and if Pf^n is the distribution of the n observations f(Xi) + εi, i = 1, . . . , n, we also have

    Dkl(Pf^n‖Pg^n) = Σ_{i=1}^n (1/(2σ²))(f(Xi) − g(Xi))² ≤ (n/(2σ²))‖f − g‖²∞,
as desired.
Chapter 11
In this chapter, we revisit our minimax bounds in the context of what we term constrained risk
inequalities. While the minimax risk of previous chapters provides a first approach for providing
fundamental limits on procedures, its reliance on the collection of all measurable functions as
its class of potential estimators is somewhat limiting. Indeed, in most statistical and statistical
learning problems, we have some type of constraint on our procedures: they must be efficiently
computable, they must work with data arriving in a sequential stream, they must be robust, or they
must protect the privacy of the providers of the data. In modern computational hardware, where
physical limits prevent increasing clock speeds, we may like to use as much parallel computation
as possible, though there are potential tradeoffs between “sequentialness” of procedures and their
parallelism.
With this as context, we replace the minimax risk of Chapter 7.1 with the constrained mini-
max risk, which, given a collection C of possible procedures—private, communication limited, or
otherwise—defines
    M(θ(P), Φ ◦ ρ, C) := inf_{θ̂∈C} sup_{P∈P} EP[Φ(ρ(θ̂(X), θ(P)))],    (11.0.1)
θ∈C
where as in the original defining equation (7.1.1) of the minimax risk, Φ : R+ → R+ is a nondecreas-
ing loss, ρ is a semimetric on the space Θ, and the expectation is taken over the sample X ∼ P .
In this chapter, we study the quantity (11.0.1) via a few examples, highlighting possibilities and
challenges with its analysis. We will focus on a restricted class of examples—many procedures do
not fall in the framework we consider—that assumes, given a sample X1 , . . . , Xn , we can represent
the class C of estimators under consideration as acting on some view or processed version Zi of Xi .
In particular, this allows us to study communication complexity, memory complexity, and certain
private estimators.
Definition 11.1 (Strong data processing inequalities). Let f : R₊ → R ∪ {+∞} be convex and satisfy f(1) = 0. For distributions P0, P1 on X and a channel Q from X to a space Z, define the marginal distributions Mv(A) := ∫ Q(A | x)dPv(x). The channel Q satisfies a strong data processing inequality with constant α ≤ 1 for the given f-divergence if

    Df(M0‖M1) ≤ α Df(P0‖P1)

for any choice of P0, P1 on X. For any such f, we define the f-strong data processing constant

    αf(Q) := sup_{P0≠P1} Df(M0‖M1)/Df(P0‖P1).
These types of inequalities are common throughout information and probability theory. Perhaps their most frequent use is in the development of conditions for the fast mixing of Markov chains. Indeed, suppose the Markov kernel Q satisfies a strong data processing inequality with constant α with respect to variation distance. If π denotes the stationary distribution of the Markov kernel Q and we use the operator ◦ to denote one step of the Markov kernel,

    Q ◦ P := ∫ Q(· | x)dP(x),

then

    ‖Q ◦ P − π‖TV = ‖Q ◦ P − Q ◦ π‖TV ≤ α ‖P − π‖TV,

because Q ◦ π = π by definition of the stationary distribution. Thus, the Markov chain enjoys geometric mixing.
To that end, a common quantity of interest is the Dobrushin coefficient, which immediately implies mixing rates: the Dobrushin coefficient of the channel Q is

    αTV(Q) := sup_{x,x′∈X} ‖Q(· | x) − Q(· | x′)‖TV.

The Dobrushin coefficient satisfies many properties, some of which we discuss in the exercises and others of which we enumerate here. The first is the following.

Proposition 11.1. The Dobrushin coefficient is the variation distance's strong data processing constant, that is,

    αTV(Q) = sup_{P0≠P1} ‖Q ◦ P0 − Q ◦ P1‖TV / ‖P0 − P1‖TV.
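The following minimal sketch (our own illustration, with a randomly chosen finite-state kernel) computes the Dobrushin coefficient and checks the contraction property of Proposition 11.1 on a random pair of distributions:

import numpy as np

# Dobrushin coefficient of a finite Markov kernel Q (rows are Q(. | x)) and a
# numerical check of the contraction in Proposition 11.1.
def dobrushin(Q):
    n = Q.shape[0]
    return max(0.5 * np.abs(Q[i] - Q[j]).sum()
               for i in range(n) for j in range(n))

rng = np.random.default_rng(4)
Q = rng.dirichlet(np.ones(4), size=4)       # random 4-state channel
alpha = dobrushin(Q)

P0, P1 = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
lhs = 0.5 * np.abs(P0 @ Q - P1 @ Q).sum()   # ||Q o P0 - Q o P1||_TV
rhs = alpha * 0.5 * np.abs(P0 - P1).sum()
print(alpha, lhs <= rhs + 1e-12)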
A more substantial fact is that the Dobrushin coefficient upper bounds every other strong data processing constant.

Theorem 11.2. Let f : R₊ → R ∪ {∞} satisfy f(1) = 0. Then for any channel Q,

    αf(Q) ≤ αTV(Q).
The theorem is roughly a consequence of a few facts. First, Proposition 11.1 holds. Second, without loss of generality we may assume that f ≥ 0; indeed, replacing f(t) with h(t) = f(t) − f′(1)t for any f′(1) ∈ ∂f(1), we have h ≥ 0, as 0 ∈ ∂h(1) and Dh = Df. Third, any f ≥ 0 with 0 ∈ ∂f(1) can be approximated arbitrarily accurately by functions of the form h(t) = Σ_{i=1}^k ai[t − ci]₊ + Σ_{i=1}^k bi[di − t]₊, where ci ≥ 1 and di ≤ 1. For such h, an argument shows that

    αh(Q) ≤ αTV(Q),

which follows from the similarities between variation distance, with f(t) = (1/2)|t|, and the positive part functions [·]₊. For a formal proof, see the papers of Del Moral et al. [55] and Cohen et al. [45].
In our context, that of (constrained) minimax lower bounds, such data processing inequalities immediately imply somewhat sharper lower bounds than the (unconstrained) applications in previous chapters. Indeed, let us revisit the situation present in the local Fano bound, where the KL divergence has a Euclidean structure as in the bound (7.4.6), meaning that Dkl(P0‖P1) ≤ κ²δ² when our parameters of interest θv = θ(Pv) satisfy ρ(θ0, θ1) ≤ δ. We assume that the constraints C impose that the data Xi is passed through a channel Q with KL-data processing constant αKL(Q) ≤ 1. In this case, in the basic Le Cam's method (7.3.2), an application of Pinsker's inequality yields that whenever ρ(θ0, θ1) ≥ 2δ then

    Mn(θ(P), Φ ◦ ρ, C) ≥ (Φ(δ)/2) [1 − √(n Dkl(M0‖M1)/2)] ≥ (Φ(δ)/2) [1 − √(nκ²αKL(Q)δ²/2)],
and the "standard" choice of δ to make the probability of error constant results in δ² = (2nκ²αKL(Q))^{−1}, or the minimax lower bound

    Mn(θ(P), Φ ◦ ρ, C) ≥ (1/4) Φ(1/√(2nκ²αKL(Q))),
which suggests an effective sample size degradation of n ↦ nαKL(Q). Similarly, in the local Fano method in Chapter 7.4.1, we see identical behavior and an effective sample size degradation of n ↦ nαKL(Q); that is, if without constraints a sample size of n(ε) is required to achieve some desired accuracy ε, with the constraint a sample size of at least n(ε)/αKL(Q) is necessary.
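As a concrete illustration of this degradation (a sketch under our own choice of channel), consider binary randomized response, the standard ε-locally-private channel that flips its input bit with probability 1/(1 + e^ε); its Dobrushin coefficient—which by Theorem 11.2 also bounds αKL—is (e^ε − 1)/(e^ε + 1):

import numpy as np

# Effective sample-size degradation n -> n * alpha under a channel constraint,
# for binary randomized response with privacy parameter eps.
def randomized_response_alpha(eps):
    return (np.exp(eps) - 1) / (np.exp(eps) + 1)

n = 10000
for eps in [0.1, 0.5, 1.0, 2.0]:
    alpha = randomized_response_alpha(eps)
    print(eps, alpha, n * alpha)   # n samples are "worth" n * alpha unconstrained ones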
(a) Suppose Q is an ε-differentially private channel. We also allow sequential interactivity, meaning that the ith private variable Zi may depend on both Xi and Z1^{i−1}. Under local differential privacy, we have

    Q(Zi ∈ A | Xi = x, z1^{i−1}) / Q(Zi ∈ A | Xi = x′, z1^{i−1}) ≤ e^ε.

(b) Show that

    Dkl(M0‖M1) + Dkl(M1‖M0) ≤ 4(e^ε − 1)² ‖P0 − P1‖²TV.
(c) In the case where we have an interactive (multi-sample) setting, i.e. Zi ∼ Q(· | Xi, Z1^{i−1}), and we define Mv^n = ∫ Q(· | x1^n)dPv(x1^n) to be the marginal distribution over all the Z1^n, we then have:

Corollary 11.4. Assume that each channel Q(· | Xi, Z1^{i−1}) is εi-differentially private. Then

    Dkl(M0^n‖M1^n) ≤ 4 Σ_{i=1}^n (e^{εi} − 1)² ‖P0 − P1‖²TV.
(d) Examples:
Here we have used the notation Z<i := (Z1, . . . , Zi−1), and we will use Z≤i := (Z1, . . . , Zi) and similarly for superscripts throughout. We will also use the notation Z^{(t)}_{→i} = (B^{(1)}, . . . , B^{(t−1)}, Z^{(t)}_{<i}) to denote all the messages coming into the communication of Z^{(t)}_i. In Figure 11.1 we illustrate two rounds of this communication scheme.
It turns out that we can provide lower bounds on the minimax risk of communication-constrained
estimators by extending the data processing inequality approach we have developed. Our approach
to the lower bounds, which we provide in Sections 11.3.1 and 11.3.2 to follow, is roughly as follows.
First, we develop what is known in the communication complexity literature as a direct sum bound,
meaning that the difficulty of solving a d-dimensional problem is roughly d-times that of solving
a 1-dimensional version of the problem; thus, any lower bounds on the error in 1-dimensional
problems imply lower bounds for d-dimensional problems. Second, we provide an extension of
the data processing inequalities we have developed thus far to apply to particular communication
scenarios.
The key to our reductions is that we consider families of distributions where the coordinates of
X are independent, which dovetails with Assouad’s method. We thus index our distributions by
v ∈ {−1, 1}d , and in proving our lower bounds, we assume the typical Markov structure
V → (X1 , . . . , Xm ) → Π,
Figure 11.1. Left: single round of communication of variables, writing to public blackboard B (1) .
Right: two rounds of communication of variables, writing to public blackboards B (1) and B (2) .
where V is chosen uniformly at random from {−1, 1}d , and Π denotes the transcript of the entire
communication—in this context, the transcript
Π = (B (1) , . . . , B (T ) ),
so that it consists of all the blackboards (and the order in which things appeared on them). We
assume that X follows a d-dimensional product distribution, so that conditional on V = v we have
With the generation strategy (11.3.1), conditional on the jth coordinate Vj = vj , the coordinates
Xi,j are i.i.d. and independent of V\j = (V1, . . . , Vj−1, Vj+1, . . . , Vd), as well as independent of Xi′,j for data points i′ ≠ i.
    Σ_{j=1}^d P(V̂j(Π) ≠ Vj) = Σ_{j=1}^d P(V̂j(Π̃j) ≠ Vj).
Now, let X≤n,j = (Xi,j )ni=1 be the jth coordinate of the data, and let X≤n,\j denote the remaining
d − 1 coordinates across all i = 1, . . . , n. By construction of the simulator (Fig. 11.2), we have the
Markov structure

    Vj → X≤n,j → Π̃j ← X̃≤n,\j ← Ṽ\j,   so that in particular   Vj → X≤n,j → Π̃j.    (11.3.2)
Now, define M±j to be the marginal distributions over the total communicated private variables Π̃j conditional on Vj = ±1. Then Le Cam's inequalities (Proposition 2.17 and Proposition 2.10(a)) imply that

    2 Σ_{j=1}^d P(V̂j(Π) ≠ Vj) ≥ Σ_{j=1}^d (1 − ‖M−j − M+j‖TV)
                              ≥ Σ_{j=1}^d (1 − √2 dhel(M−j, M+j))
                              ≥ d (1 − √((2/d) Σ_{j=1}^d d²hel(M−j, M+j))).    (11.3.3)
Recalling Assouad’s method of Chapter 8.1, we see that any time we have a problem with separation
with respect to the Hamming metric (8.1.1), we have a lower bound on its error in estimation
problems.
define M̃l to be the marginal distribution over the protocol Π generated from (X1, . . . , Xm), except that

    Xi ∼ P0 if i ≠ l,   and   Xi ∼ P1 if i = l.    (11.3.5)

Because M0 should be close to M̃l, we hope for some type of tensorization behavior, where we can relate M0 and M1 via one-step changes from M0 to M̃l. Indeed, we have

Lemma 11.7. Let M0, M1, and M̃l be as above. Then

    d²hel(M0, M1) ≤ 7 Σ_{l=1}^m d²hel(M0, M̃l).    (11.3.6)
Proof The proof is somewhat complex and relies on Euclidean structures that the Hellinger
distance induces, as well as the specific structure that the probability distributions M and P have.
We assume without loss of generality that Π is discrete, as the Hellinger distance is an f -divergence
and so can be arbitrarily approximated by discrete random variables.
First, we introduce a particular tensorization property—the so-called “cut and paste” prop-
erty in communication complexity [16, 35]—which will allow us to develop the decomposition
bound (11.3.6). First, we note that for any X1^m = x1^m, we may write

    Q(Π = π | x1^m) = Π_{i=1}^m f_{i,π}(xi)    (11.3.7)
for some functions f_{i,π} that may depend on π. Indeed, for π = {z^{(t)}_i}_{i≤m, t≤T} we have

    Q(π | x1^m) = Π_{i,t} Q(z^{(t)}_i | x1^m, z^{(t)}_{→i}) = Π_{i=1}^m Π_{t=1}^T Q(z^{(t)}_i | xi, z^{(t)}_{→i}) =: Π_{i=1}^m f_{i,π}(xi),

where we have used that the message z^{(t)}_i depends only on xi and z^{(t)}_{→i}. Now we introduce yet a bit
more notation. For a bit vector b ∈ {0, 1}^m, we let Mb denote the marginal distribution over the transcript Π conditional on drawing

    Xi | b ∼ P_{bi}.

Then we can write Mb(Π = π) as a product using Eq. (11.3.7): integrating over independent Xi ∼ P_{bi}, we have

    Mb(Π = π) = ∫ Q(π | x1^m) dP_{b1}(x1) · · · dP_{bm}(xm) = Π_{i=1}^m ∫ f_{i,π}(xi) dP_{bi}(xi) =: Π_{i=1}^m g_{i,π}(bi).
    d²hel(P0, Pb) Π_{l=1}^k (1 − 2^{−l}) ≤ Σ_{i=1}^m d²hel(P0, P_{b^{(i)}}),

where the second inequality again follows from Lemma 11.9, as b^{(i)} = ej or ej + ej′ for some basis vectors ej, ej′. This gives the result.
W → X′l → Π′ ← X′\l,

so that

    I(W; Π′) ≤ β I(X′l; Π′).

It remains to relate I(X′l; Π′) to I(Xl; Π | V = 0). Here we use the lower bound of P0 by P1. Indeed, we have

    P0 ≥ (1/c) P1,   so   (c + 1)P0 ≥ P0 + P1,   or   P0 ≥ (2/(c + 1)) · (P0 + P1)/2.
As a consequence, we have

    I(Xl; Π | V = 0) = ∫ Dkl(Q(· | Xl = x)‖M0) dP0(x)
                     ≥ (2/(c + 1)) ∫ Dkl(Q(· | Xl = x)‖M0) (dP0(x) + dP1(x))/2
                     ≥ (2/(c + 1)) ∫ Dkl(Q(· | Xl = x)‖M̄) (dP0(x) + dP1(x))/2
                     = (2/(c + 1)) I(X′l; Π′),

where the second inequality uses that M̄ = ∫ Q(· | Xl = x) (dP0(x) + dP1(x))/2 minimizes the integrated KL-divergence (recall inequality (10.1.4)). Returning to inequality (11.3.8), we evidently have the result of the lemma.
Now, we note that as the Xi are independent conditional on V (and w.l.o.g. for the purposes of
mutual information, we may assume they are discrete), for any v ∈ {0, 1} we have
    Σ_{i=1}^m I(Xi; Π | V = v) = Σ_{i=1}^m [H(Xi | V = v) − H(Xi | Π, V = v)]
                               = Σ_{i=1}^m [H(Xi | X1^{i−1}, V = v) − H(Xi | Π, V = v)]
                               ≤ Σ_{i=1}^m [H(Xi | X1^{i−1}, V = v) − H(Xi | X1^{i−1}, Π, V = v)]
                               = Σ_{i=1}^m I(Xi; Π | X1^{i−1}, V = v) = I(X1, . . . , Xm; Π | V = v),
where the inequality used that conditioning decreases entropy. We thus obtain

    d²hel(M0, M1) ≤ (7/2)(c + 1) β min_{v∈{0,1}} I(X1, . . . , Xm; Π | V = v),

as desired.
Theorem 11.11. Let Π be the transcript of the entire communication protocol in Figure 11.1, let V ∈ {−1, 1}d be uniform, and generate Xi ∼iid Pv, i = 1, . . . , m, with independent coordinates as in
Eq. (11.3.1). Assume additionally that for each coordinate j = 1, . . . , d, the coordinate distributions P−1 and P1 satisfy (1/c)P−1 ≤ P1 ≤ cP−1 for some 1 ≤ c < ∞, and that they satisfy the mutual information data processing inequality (Def. 11.3) with constant β(P−1, P1) ≤ β ≤ 1. Then for any estimator V̂,
    Σ_{j=1}^d P(V̂j(Π) ≠ Vj) ≥ (d/2) (1 − √(7(c + 1) · (β/d) · I(X1, . . . , Xm; Π | V))).
Proof Under the given conditions, Proposition 11.5 and Theorem 11.6 immediately combine to
give
v
d u d
X d u β X
Π) 6= Vj ) ≥ 1 − t7(c + 1)
P(Vbj (Π min I(X1,j , . . . , Xm,j ; Π | Vj = v) .
2 d v∈{−1,1}
j=1 j=1
where equality (i) used the independence of Xi,j from V\j and Xi,j 0 for j 0 6= j given Vj , and the
inequality that conditioning reduces entropy. This gives the theorem.
192
Stanford Statistics 311/Electrical Engineering 377 John Duchi
dPv
Lemma 11.12. Let V → X → Z, where X ∼ Pv conditional on V = v. If | log dP | ≤ α for all
v0
0
v, v , then h i
I(V ; Z) ≤ 4(eα − 1)2 EZ kPX (· | Z) − PX k2TV ≤ 2(eα − 1)2 I(X; Z).
(t)
on the message from Xi in round t. (This is a weaker condition that H(Zi ) ≤ Bi,t for each i, t.)
With this bound, we can provide minimax lower bounds on communication-constrained estimator.
For our first collection, we consider estimating the parameters of d independent Bernoulli dis-
tributions in squared error. Let Pd be the family of d-dimensional Bernoulli distributions, where
we let the parameter θ ∈ [0, 1]d be such that Pθ (Xj = 1) = θj . Then we have the following result.
Proposition 11.14. Let Mm (θ(Pd ), k·k22 , {Bi }m i=1 ) denote the minimax mean-square error for es-
timation of a d-dimensional Bernoulli under the information constraint (11.4.1). Then
( )
d d
Mm (θ(Pd ), k·k22 , {Bi }m
i=1 ) ≥ c min 1 Pm ,d ,
mm i=1 Bi
193
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Now, we note that for any Markov chain V → X → Z, we have I(X; Z | V ) = H(Z | V ) − H(Z |
X, V ) = H(Z | V ) − H(Z | X) ≤ H(Z) − H(Z | X) = I(X; Z). Thus we obtain
I(X1 , . . . , Xm ; Π | V ) ≤ I(X1 , . . . , Xm ; Π )
m X
T
(t) (t)
X
= I(X1 , . . . , Xm ; Zi | Z→i ).
i=1 t=1
(t) (t) P (t) (t)
As Zi ⊥ X\i | Z→i , Xi , we have that this final quantity is equal to i,t I(Xi ; Zi | Z→i ). But of
(t) (t) (t) (t)
course I(Xi ; Zi | Z→i ) ≤ H(Zi | Z→i ) ≤ Bi,t , and thus we have
v
u 2X
δ
Mm (θ(Pd ), k·k22 , {Bi }m 2
i=1 ) ≥ cδ d 1 − tC
u
Bi,t .
d
i,t
d
Choosing δ = min{1/5, 2C P
i Bi } gives the result.
194
Stanford Statistics 311/Electrical Engineering 377 John Duchi
where the inequality follows from (11.5.2) and the equality by the assumed cut-and-paste property.
Now, we apply Baranyai’s theorem, which says that we may decompose any complete graph KN ,
where N is even, into N − 1 perfect matchings Mi with N/2 edges—necessarily, as they form a
perfect matching—where each Mi is edge disjoint. Identifying the pairs i < j with the complete
graph, we thus obtain
X N
X −1 X
d2hel (P0 , Pb(i) +b(j) ) = d2hel (P0 , Pb(i) +b(j) ). (11.5.4)
1≤i<j≤N l=1 (i,j)∈Ml
0 0
Now fix n ∈ {1, . . . , N −1} and a matching Mn . By assumption we have hb(i) +b(j) , b(i ) +b(j ) i =
0 0
0 for any distinct pairs (i, j), (i , j ) ∈ Mn , and moreover, (i,j)∈Mn (b(i) + b(j) ) = b. Thus, our
P
induction hypothesis gives that for any l ∈ {1, . . . , N − 1} and any of our matchings Mn , we have
X k−1
Y
d2hel (P0 , Pb(i) +b(j) ) ≥ d2hel (P0 , Pb ) (1 − 2−l ).
(i,j)∈Mn l=1
Substituting this lower bound into inequality (11.5.4) and using inequality (11.5.3), we obtain
N k−1 k
X 1 Y Y
d2hel (P0 , Pb(i) ) ≥ · (N − 1)d2hel (P0 , Pb ) (1 − 2−l ) = d2hel (P0 , Pb ) (1 − 2−l ),
N
i=1 l=1 l=1
11.6 Bibliography
Strong data processing inequalities are all over. Raginsky [121] provides a nice survey. Dobrushin
[56] originally proposed the Dobrushin coefficient and used it to demonstrate sufficient mixing in
Markov chains to achieve central limit theorems; Cohen et al. [45] first proved Theorem 11.2 for
finite state spaces using careful linear algebraic techniques, and later Del Moral et al. [55] proved
the result with the approach we outline below the theorem.
Communication complexity is huge. Standard book is Kushilevitz and Nisan [102]. Our ap-
proach follows from Zhang, Duchi, Jordan, and Wainwright [144], our direct sum simulation argu-
ment is due to Garg, Ma, and Nguyen [74], and the strong data processing communication results
we adapt from Braverman, Garg, Ma, Nguyen, and Woodruff [35].
11.7 Exercises
Question 11.1: For k ∈ [1, ∞], we consider the collection of distributions
Pk := {P : EP [|X|k ]1/k ≤ 1},
that is, distributions P supported on R with kth moment bounded by 1. We consider minimax
estimation of the mean E[X] for these families under ε-local differential privacy, meaning that for
each observation Xi , we observe a private realization Zi (which may depend on Z1i−1 ) where Zi
is an ε-differentially private view of Xi . Let Qε denote the collection of all ε-differentially private
channels, and define the (locally) private minimax risk
Mn (θ(P), (·)2 , ε) := inf inf sup EP,Q [(θbn (Z1n ) − θ(P ))2 ].
θbn Q∈Qε P ∈P
195
Stanford Statistics 311/Electrical Engineering 377 John Duchi
(a) Assume that ε ≤ 1. For k ∈ [1, ∞], show that there exists a constant c > 0 such that
k−1
2 1 k
Mn (θ(Pk ), (·) , ε) ≥ c .
nε2
(b) Give an ε-locally differentially private estimator achieving the minimax rate in part (a).
Question 11.2: In this question, we apply the results of Question 11.1 to a problem of estimation
of drug use. Assume we interview a series of individuals i = 1, . . . , n, asking each whether he
or she takes illicit drugs. Let Xi ∈ {0, 1} be 1 if person i uses drugs, 0 otherwise, and define
θ∗ = E[X] = E[Xi ] = P (X = 1). To avoid answer bias, each answer Xi is perturbed by some
channel Q, where Q is ε-differentially private. That is, we observe independent Zi where conditional
on Xi , we have
Zi | Xi = x ∼ Q(· | Xi = x).
To make sure everyone feels suitably private, we assume ε < 1/2 (so that (eε − 1)2 ≤ 2ε2 ). In the
questions, let Qε denote the family of all ε-differentially private channels, and let P denote the
Bernoulli distributions with parameter θ(P ) = P (Xi = 1) ∈ [0, 1] for P ∈ P.
(a) Use Le Cam’s method and the strong data processing inequality in Theorem 11.3 to show that
the minimax rate for estimation of the proportion θ∗ in absolute value satisfies
b 1 , . . . , Zn ) − θ(P )| ≥ c √ 1 ,
h i
Mn (θ(P), | · |, ε) := inf inf sup E |θ(Z
Q∈Qε θb P ∈P nε2
where c > 0 is a universal constant. Here the infimum is over channels Q and estimators θ,
b and
the expectation is taken with respect to both the Xi (according to P ) and the Zi (according
to Q(· | Xi )).
(c) Let Pk , for k ≥ 2, denote the family of distributions on R such that EP |X|k ≤ 1 for P ∈ Pk
(note that X is no longer restricted to have support {0, 1}). Show, similarly to part (a), that
for θ(P ) = EP [X]
h i 1
Mn (θ(Pk ), | · |, ε) := inf inf sup E |θ(Z
b 1 , . . . , Zn ) − θ(P )| ≥ c
k−1 .
Q∈Qε θb P ∈P (nε2 ) 2k
196
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Question 11.3 (Lower bounds for private logistic regression): This question is (likely) challenging.
Consider the logistic regression model for y ∈ {±1}, x ∈ Rd , that
1
pθ (y | x) = .
1 + exp(−yhθ, xi)
For a distribution P on (X, Y ) ∈ Rd × {±1}, where Y | X = x has logistic distribution, define the
excess risk
L(θ, P ) := EP [`(θ; X, Y )] − inf EP [`(θ; X, Y )]
θ
where `(θ; x, y) = log(1 + exp(−yhx, θi)) is the logistic loss. Let P be the collection of such
distributions, where X is supported on {−1, 1}d . Following the notation of Question 7.4, for a
channel Q mapping (X, Y ) → Z, define
b n ), P )],
Mn (P, L, Q) := inf sup EP,Q [L(θ(Z 1
θb P ∈P
where the expectation is taken over Zi ∼ Q(· | Xi , Z1i−1 ). Assume that the channel releases are all
(locally) ε-differentially private.
d d
Mn (P, L, Q) ≥ c · ·
n ε ∧ ε2
for some (numerical) constant c > 0.
(b) Suppose we allow additional passes through the dataset (i.e. multiple rounds of communication),
but still require that all data Zi released from Xi be ε-differentially private. That is, assume
we have the (sequential and interactive) release schemes of Fig. 11.1, and we guarantee that
(t) (t) (t)
Zi ∼ Q(· | Xi , B (1) , . . . , B (t) , Z1 , . . . , Zi−1 )
P
is εi,t -differentially private, where t εi,t ≤ ε for all i. Does the lower bound of part (a) change?
197
Chapter 12
Estimation of functionals
To be written.
dPt = (1 + tg)dP0 .
R√
Define the Battacharya affinity ρhel (P, Q) = dP dQ.
iid
(so that the Xi ∼ P0 ). Then expanding the product, we see that the term Tn is
X X
Tn = 2 + 2 tk g(Xi1 ) · · · g(Xik ),
k∈2N,2≤k≤n i1 <···<ik
where we have used that all terms with an odd number of factors cancel. In particular, for
X X
Y = tk g(Xi1 ) · · · g(Xik ),
k∈2N,2≤k≤n i1 <···<ik
198
Stanford Statistics 311/Electrical Engineering 377 John Duchi
p √
we may write Tn /2 = 1 + Y , where (as Tn ≥ 0) we must have 1 + Y ≥ 0, and so using the
√
elementary estimate that 1 + y ≥ 1 + y/2 − y 2 /2 whenever y ≥ −1, we have
1 1
ρhel (P n , P0n ) ≥ 1 + E0 [Y ] − E0 [Y 2 ]
2 2
1 X X 1 X n 2k 2k
=1− t2k E0 [g 2 (Xi1 ) · · · g 2 (Xik )] = 1 − t σ ,
2 2 k
k∈2N,2≤k≤n i1 <···<ik k∈2N,2≤k≤n
Note: Birgé and Massart achieve a tighter factor than claimed here.
For shorthand, define the function ψ(s, n) = (1 + s)n + (1 − s)n − 2, so that Lemma 12.1 is
equivalent to the statement that ρhel (P n , P0n ) ≥ 1 − 12 ψ(t2 σ 2 , n).
Definition 12.1. Let k ∈ N and the functions gj : X → [−b, b]. Then the functions are an
admissible partition with variances σj2 of X with respect to a probability distribution P0 if
(i) The supports Ej = supp gj of each of the functions are disjoint.
(ii) Each function has P0 mean 0, i.e., EP0 [gj (X)] = 0 for each j.
(iii) Function j has variance σj2 = EP0 [gj2 (X)] = gj2 (x)dP0 (x).
R
JCD Comment: Put some commentary about how the key is to use mixtures to prove lower
bounds.
Also, a more direct proof by Taylor expansions might be “easier” if you assume you can ig-
nore all higher-order terms. Perhaps sketch that out, and then go back and prove the full theorem?
Theorem 12.2. Let the functions {gj }kj=1 be an admissible partition of X with variances σj2 with
respect to P0 (Definition 12.1), V = {−1, 1}k , and for t ∈ R define the distributions Ptv by
k
Y
dPtv (x) = (1 + tvj gj (x))dP0 (x) = (1 + thv, g(x)i) dP0 (x)
j=1
2
for v ∈ V. Assume that nt2 σj2 ≤ e for each j. Then
k
X
d2hel (Ptn , P0n ) ≤ n2 t4 σj4 .
j=1
A sharper result is possible: Birgé and Massart [30, Thm. 1] show that under the same setting,
k k
X 1 X
d2hel (Ptn , P0n ) ≤ C(t, b, σ)n2 t4 σj4 ≤ n2 t4 σj4 ,
3
j=1 j=1
199
Stanford Statistics 311/Electrical Engineering 377 John Duchi
where the constant C = C(t, b, σ) < 13 but can be smaller for certain values of t, σ, and the bound
b on the admissible partition in Def. 12.1.
Proof The essential idea is to reduce the problem to a series of simpler divergence calculations,
where we may leverage Lemma 12.1. To do so, we first look abstractly at the affinity between
distributions with a certain fixed number of samples from distinct product distributions, then treat
the original problem as a multinomial mixture of such products.
Abstract fixed sample size affinities. Fix an integer vector m = (m1 , . . . , mk ) ∈ Nk , and let
v ∈ V. For each j ∈ {1, . . . , k}, let Pj,0 be a distribution on X and define the densities
eliding the dependence on t for now. Then define the product distributions and averages
k k k k
Y m
Y m
Y m 1 X m Y m
Pvm := P+jj P−jj = Pvj jj , Pm = k Pv , P0m := Pj,0j .
2
j:vj =1 j:vj =−1 j=1 v∈V j=1
The affinity between P m and P0m is relatively straightforward to express as a product of affinities
between averages of just two distributions, which allows us to leverage Lemma 12.1:
Lemma 12.3. For the distributions defined above, let τj2 = gj2 (x)dPj,0 (x). Then
R
k
mj 1 mj mj
Y
ρhel (P m , P0m ) = ρhel Pj,0 , (P+j + P−j )
2
j=1
(12.1.1)
k
1 X n n
1 + t2 τj2 + 1 − t2 τj2 − 2 .
≥1−
2
j=1
m
dP±jj
Proof Denoting the likelihood ratio L±j := m , we therefore obtain that
dPj,0j
k
dPvm Y
= Lvj j ,
dP0m
j=1
where step (i) uses that L±j is independent of L±j 0 for j 6= j 0 , and equality (ii) is the induction.
In particular, we can write the ratio as the product
k m m
dP m Y dP+jj + dP−jj
= m .
dP0m 2dPj,0j
j=1
200
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Substituting this product into the definition of the affinity ρhel gives the equality in (12.1.1).
To obtain the inequality in (12.1.1), we apply Lemma 12.1 to obtain
mj 1 mj mj 1
(1 + t2 τj2 )n + (1 − t2 τj2 )n − 2 ,
ρhel Pj,0 , (P+j + P−j ) ≥ 1 −
2 2
then take the product and use that j=1 (1 − aj ) ≥ 1 − kj=1 aj for any aj ∈ [0, 1].
Qk P
Multinomial mixtures and conditional affinities. With Lemma 12.3 in hand, we now pro-
ceed conditionally, by treating the sampling of the Xi as a multinomial and conditioning on the
number of Xi in the regions that the gj partition. Indeed, consider the original affinity:
!1/2
n
XY dPtv
ρhel (Ptn , P0n ) = E0 2−k (Xi ) .
dP0
v∈V i=1
iid
For Xi ∼ P0 , if we let M = (M 0 , M 1 , . . . , M k ) be a multinomial that counts the Xi falling in
each set Ej = supp gj (where we let E0 = X \ ∪j≥1 Ej be the subset of X distinct from the supports
of the gj ), then conditional on M = m, we have
n mj
k Y
Y dPtv dist
Y
(Xi ) | (M = m) = (1 + tvj gj (Yij )),
dP0
i=1 j=1 i=1
iid
where the Yij are i.i.d. from the conditional distribution of Xi falling within Ej , i.e., Yij ∼ P0 (· |
X ∈ Ej ). Lemma 12.3 then implies that if we define Pj,0 = P0 (· | Xi ∈ Ej ) to be the law of Yij
(i.e., X conditional on X ∈ Ej ) and dP±j = (1 ± tgj )dPj,0 , then by construction of Ptv ,
!1/2
n k
−k
XY dPtv Y mj 1 mj mj
E0 2 (Xi ) |M =m = ρhel Pj,0 , (P+j + P−j ) .
dP0 2
v∈V i=1 j=1
If qj = P0 (Xi ∈ Ej ), then we have gj2 (x)dPj,0 (x) = gj2 (x)P0 (x)/P0 (X ∈ Ej ) = σj2 /qj , and so
R R
201
Stanford Statistics 311/Electrical Engineering 377 John Duchi
We now use binomial identities (again) to control the final sum. Let aj = t2 σj2 for shorthand,
noting that naj ≤ 1 by assumption. We have
k k n
1X (i) X n X n
[(1 + aj )n + (1 − aj )n − 2] = 1 + a2 + al − 1
2 2 j l j
j=1 j=1 l∈2N,l≥4
k bn/2c
X
n X n n−1 2(l−1)
= a2j 1 + aj ,
2 2l 2
j=1 l=2
n
n−1 1 n−2
where step (i) uses that the odd powers cancel. Noting that 2l 2 = l(2l−1) 2(l−1) , we obtain
k X k bn/2c
1X n X 1 n − 2
[(1 + aj )n + (1 − aj )n − 2] ≤ a2j 1 + a2l
j
2 2 2l2 2l
j=1 j=1 l=1
k bn/2c k
(ii) n X X 1 neaj 2l (iii) n X
2
≤ aj 1 + ≤ 2 a2j ,
2 2l2 2l 2
j=1 l=1 j=1
k
X
d2hel (Ptn , P0n ) = 1 − ρhel (Ptn , P0n ) ≤ n2 t4 σj4 ,
j=1
202
Part III
203
Chapter 13
In this chapter, we explore the basic results in source coding—that is, given a sequence of random
variables X1 , X2 , . . . distributed according to some known distribution P , how much storage is
required for us to encode the random variables? The material in this chapter is covered in a variety
of sources; standard references include Cover and Thomas [48] and Csiszár and Körner [50].
While Definition 13.1 is natural, generally speaking, we wish to transmit or encode a variety of
codewords simultaneously, that is, we wish to encode a sequence X1 , X2 , . . . using the natural exten-
sion of the code C as the string C(X1 )C(X2 )C(X3 ) · · · , where C(x1 )C(x2 ) denotes the concatenation
of the strings C(x1 ) and C(x2 ). In this case, we require that the code be uniquely decodable:
Definition 13.2. A d-ary code C : X → {0, . . . , d − 1}∗ is uniquely decodable if for all sequences
x1 , . . . , xn ∈ X and x01 , . . . , x0n ∈ X we have
C(x1 )C(x2 ) · · · C(xn ) = C(x01 )C(x02 ) · · · C(x0n ) if and only if x1 = x01 , . . . , xn = x0n .
While more useful (generally) than simply non-singular codes, uniquely decodable codes may require
inspection of an entire string before recovering the first element. With that in mind, we now consider
the easiest to use codes, which can always be decoded instantaneously.
204
Stanford Statistics 311/Electrical Engineering 377 John Duchi
As is hopefully apparent from the definitions, all prefix/instantaneous codes are uniquely decodable,
which are in turn non-singular. The converse is not true, though we will see a sense in which—as
long as we care only about encoding sequences—using prefix instead of uniquely decodable codes
has negligible consequences.
For example, written English, with periods (.) and spaces ( ) included at the ends of words
(among other punctuation) is an instantaneous encoding of English into the symbols of the alphabet
and punctuation, as punctuation symbols enforce that no “codeword” is a prefix of any other. A
few more concrete examples may make things more clear.
Example 13.1 (Encoding strategies): Consider the encoding schemes below, which encode
the letters a, b, c, and d.
By inspection, it is clear that C1 is non-singular but certainly not uniquely decodable (does
the sequence 0000 correspond to aaaa, bb, aab, aba, baa, ca, ac, or d?), while C3 is a prefix
code. We leave showing that C2 is uniquely decodable is an exercise for the interested reader.
3
Theorem 13.2. Let X be a finite or countable set, and let ` : X → N be a function. If `(x) is the
length of the encoding of the symbol x in a uniquely decodable d-ary code, then
X
d−`(x) ≤ 1. (13.2.1)
x∈X
Conversely, given any function ` : X → N satisfying inequality (13.2.1), there is a prefix code whose
codewords have length `(x) for each x ∈ X .
Proof We prove the first statement of the theorem first by a counting and asymptotic argument.
We begin by assuming that X is finite; we eliminate this assumption subsequently. As a
consequence, there is some maximum length `max such that `(x) ≤ `max for all x ∈ X . ForP a sequence
x1 , . . . , xn ∈ X , we have by the definition of our encoding strategy that `(x1 , . . . , xn ) = ni=1 `(xi ).
In addition, for each m we let
205
Stanford Statistics 311/Electrical Engineering 377 John Duchi
0 2
1
x1
2 0 2
0 1 1
x2 x3 x5 x6 x7
Figure 13.1. Prefix-tree encoding of a set of symbols. The encoding for x1 is 0, for x2 is 10, for
x3 is 11, for x4 is 12, for x5 is 20, for x6 is 21, and nothing is encoded as 1, 2, or 22.
denote the symbols x encoded with codewords of length m in our code, then as the code is uniquely
decodable we certainly have card(En (m)) ≤ dm P for all n and m. Moreover, for all x1:n ∈ X n we
have `(x1:n ) ≤ n`max . We thus re-index the sum x d−`(x) and compute
X n`
X max
−`(x1 ,...,xn )
d = card(En (m))d−m
x1 ,...,xn ∈X n m=1
n`
X max
≤ dm−m = n`max .
m=1
as each subset {xP ∈ X : `(x) ≤ k} is uniquely decodable, we have Dk ≤ 1 for all k. Then
1 ≥ limk→∞ Dk = x∈X d−`(x) .
The achievability of such a code is straightforward by a pictorial argument (recall Figure 13.1),
so we sketch the result non-rigorously. Indeed, let Td be an (infinite) d-ary tree. Then, at each
206
Stanford Statistics 311/Electrical Engineering 377 John Duchi
level m of the tree, assign one of the nodes at that level to each symbol x ∈ X such that `(x) = m.
Eliminate the subtree below that node, and repeat with the remaining symbols. The codeword
corresponding to symbol x is then the path to the symbol in the tree.
With the Kraft-McMillan theorem in place, we we may directly relate the entropy of a random
variable to the length of possible encodings for the variable; in particular, we show that the entropy
is essentially the best possible code length of a uniquely decodable source code. In this theorem,
we use the shorthand X
Hd (X) := − p(x) logd p(x).
x∈X
Theorem 13.3. Let X ∈ X be a discrete random variable distributed according to P and let `C be
the length function associated with a d-ary encoding C : X → {0, . . . , d − 1}∗ . In addition, let C be
the set of all uniquely decodable d-ary codes for X . Then
Proof The lower bound is an argument by convex optimization, while for the upper bound
we give an explicit length function and (implicit) prefix code attaining the bound. For the lower
bound, we assume for simplicity that X is finite, and we identify X = {1, . . . , |X |} (let m = |X | for
shorthand). Then as C consists of uniquely decodable codebooks, all the associated length functions
must satisfy the Kraft-McMillan inequality (13.2.1). Letting `i = `(i), the minimal encoding length
is at least (m m
)
X X
infm pi `i : d−`i ≤ 1 .
`∈R
i=1 i=1
By introducing the Lagrange multiplier λ ≥ 0 for the inequality constraint, we may write the
Lagrangian for the preceding minimization problem as
n
!
X h im
L(`, λ) = p> ` + λ d−`i − 1 with ∇` L(`, λ) = p − λ d−`i log d .
i=1
i=1
θ
θ Pm − logd
In particular, the optimal ` satisfies `i = logd pi for some constant θ, and solving i=1 d
pi
=1
gives θ = 1 and `(i) = logd p1i .
l m
1
To attain the result, simply set our encoding to be `(x) = logd P (X=x) , which satisfies the
Kraft-McMillan inequality and thus yields a valid prefix code with
X 1 X
EP [`(X)] = p(x) logd ≤− p(x) logd p(x) + 1 = Hd (X) + 1
p(x)
x∈X x∈X
as desired.
207
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Finally, we present a result showing that it is possible to achieve average code length of at most
the entropy rate, which for stationary sequences is smaller than the entropy of any single random
variable Xi . To do so, we require the use of a block code, which (while it may be prefix code) treats
sets of random variables (X1 , . . . , Xm ) ∈ X m as a single symbol to be jointly encoded.
1
Indeed, let > 0 and take N such that n ≥ N implies that |ai − a| < . Then for n ≥ N , we have
n n
1X N (cN − a) 1 X N (cN − a)
cn − a = (ai − a) = + (ai − a) ∈ ± .
n i=1 n n i=N +1 n
Taking n → ∞ yields that the term N (cN − a)/n → 0, which gives that cn − a ∈ [−, ] eventually for any > 0,
which is our desired result.
208
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Proposition 13.5. Let the sequence of random variables X1 , X2 , . . . be stationary. Then for any
> 0, there exists an m ∈ N and a d-ary (prefix) block encoder C : X m → {0, . . . , d − 1}∗ such that
1
lim EP [`C (X1:n )] ≤ H({Xi }) + = lim H(Xn | X1 , . . . , Xn−1 ) + .
n n n
2n + dlog2 me n
EP [`C (X1:n )] ≤ + H(X1 , . . . , Xm ).
m m
Dividing by n and letting n → ∞ gives the result, as we can always choose m large.
209
Chapter 14
In this set of notes, we give a very brief introduction to exponential family models, which are a broad
class of distributions that have been extensively studied in the statistics literature [36, 5, 17, 139].
There are deep connections between exponential families, convex analysis [139], and information
geometry and the geometry of probability measures [5], and we will only touch briefly on a few of
those here.
whenever A is finite.
In some settings, it is convenient to define a base function h : X → R+ and define
though we can always simply include h in the base measure µ. In some scenarios, it may be convient
to re-parameterize the problem in terms of some function η(θ) instead of θ itself; we will not worry
about such issues and simply use the formulae that are most convenient.
We now give a few examples of exponential family models.
210
Stanford Statistics 311/Electrical Engineering 377 John Duchi
denote the kth order tensor, or multilinear operator, that for v1 , . . . , vk ∈ Rd satisfies
k
Y
⊗k
x (v1 , . . . , vk ) := hx, v1 i · · · hx, vk i = hx, vi i.
i=1
With this notation, our first key result regards the differentiability of A, where we can compute
(all) derivatives of eA(θ) by formally interchanging integration and differentiation.
211
Stanford Statistics 311/Electrical Engineering 377 John Duchi
The proof of the proposition is involved and requires complex analysis, so we defer it to Sec. 14.5.1.
As particular consequences of Proposition 14.4, we have the familiar (and important) expecta-
tion and covariance relationships that
Z Z
1 hθ,φ(x)i
∇A(θ) = R ∇e dµ(x) = φ(x)pθ (x)dµ(x) = Eθ [φ(X)]
ehθ,φ(x)i dµ(x)
and
( φ(x)ehθ,φ(x)i dµ(x))⊗2
Z R
2 1 ⊗2 hθ,φ(x)i
∇ A(θ) = R φ(x) e dµ(x) − R
ehθ,φ(x)i dµ(x) ( ehθ,φ(x)i dµ(x))2
= Eθ [φ(X)φ(X)> ] − Eθ [φ(X)]Eθ [φ(X)]>
= Covθ (φ(X)).
Another important result is the convexity of A. While this is immediate from the differentiation
identity that ∇2 A(θ) = Covθ (φ(X)), one can also provide a direct argument without appealing to
differentiation.
Proposition 14.5. The cumulant-generating function θ 7→ A(θ) is convex, and it is strictly convex
if and only if Covθ (φ(X)) is positive definite for all θ ∈ dom A.
Proof Let θλ = λθ1 + (1 − λ)θ2 , where θ1 , θ2 ∈ Θ. Then 1/λ ≥ 1 and 1/(1 − λ) ≥ 1, and Hölder’s
inequality implies
Z Z
log exp(hθλ , φ(x)i)dµ(x) = log exp(hθ1 , φ(x)i)λ exp(hθ2 , φ(x)i)1−λ dµ(x)
Z λ Z 1−λ
λ 1−λ
≤ log exp(hθ1 , φ(x)i) λ dµ(x) exp(hθ2 , φ(x)i) 1−λ dµ(x)
Z Z
= λ log exp(hθ1 , φ(x)i)dµ(x) + (1 − λ) log exp(hθ2 , φ(x)i)dµ(x),
As a final remark, we note that this convexity makes estimation in exponential families equiva-
lent to moment matching, and, assuming that computing A(θ) is tractable, computationally easy.
212
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Indeed, given a sample X1 , . . . , Xn , assume that we estimate θ by maximizing the likelihood (equiv-
alently, minimizing the log-loss):
n n
X 1 X
minimize log = [−hθ, φ(Xi )i + A(θ)] ,
θ pθ (Xi )
i=1 i=1
which is thus convex in θ. This means there are no non-global local minima, and the maximum
likelihood estimate is any vector θbn such that
n
1X
∇A(θbn ) = Eθbn [φ(X)] = φ(Xi ).
n
i=1
This θbn is unique whenever Covθ (φ(X)) 0 for all θ, that is, when the covariance of φ is full rank in
the exponential family model. Later we will explore some properties of these types of minimization
and log-loss problems.
Definition 14.2. Let µ be a base measure on X and assume P has density p with respect to µ.
Then the Shannon entropy of P is
Z
H(P ) = − p(x) log p(x)dµ(x).
P
Notably, if X is a discrete set and µ is counting measure, then H(P ) = − x p(x) log p(x) is
simply the standard entropy. However, for other base measures the calculation is different. For
example, if we take µ to be Lebesgue measure, meaning that dµ(x) = dx and giving rise to the
usual integral on R (or Rd ), then we obtain differential entropy [48, Chapter 8].
Example 14.6: Let P be the uniform distribution on [0, a]. Then the differential entropy
H(P ) = − log(1/a) = log a. 3
Example 14.7: Let P be the normal distribution N(µ, Σ) and µ be Lebesgue measure. Then
Z " #
1 1 > −1
H(P ) = − p(x) log p − (x − µ) Σ (x − µ) dx
2π det(Σ) 2
1 1
= log(2π det(Σ)) + E[(X − µ)> Σ−1 (X − µ)]
2 2
1 d
= log(2π det(Σ)) + .
2 2
3
213
Stanford Statistics 311/Electrical Engineering 377 John Duchi
over all distributions P having densities with respect to the base measure µ, that is, we have the
(equivalent) absolute continuity condition P µ. Rewriting problem (14.4.1), we see that it is
equivalent to
Z
maximize − p(x) log p(x)dµ(x)
Z Z
subject to p(x)φi (x)dµ(x) = αi , p(x) ≥ 0 for x ∈ X , p(x)dµ(x) = 1.
Let
Pαlin := {P µ : EP [φ(X)] = α}
be distributions with densities w.r.t. µ satisfying the expectation (linear) constraint E[φ(X)] = α.
We then obtain the following theorem.
with respect to the measure µ. If EPθ [φ(X)] = α, then Pθ maximizes H(P ) over Pαlin ; moreover,
the distribution Pθ is unique.
Proof We first give a heuristic derivation—which is not completely rigorous—and then check to
verify that our result is exact. First, we write a Lagrangian for the problem (14.4.1). Introducing
Lagrange multipliers λ(x) ≥ 0 for the constraint p(x) ≥ 0, θ0 ∈ R for the normalization constraint
214
Stanford Statistics 311/Electrical Engineering 377 John Duchi
that P (X ) = 1, and θi for the constraints that EP [φi (X)] = αi , we obtain the following Lagrangian:
Z d
X Z
L(p, θ, θ0 , λ) = p(x) log p(x)dµ(x) + θi αi − p(x)φi (x)dµ(x)
i=1
Z Z
+ θ0 p(x)dµ(x) − 1 − λ(x)p(x)dµ(x).
Now, heuristically treating the density p = [p(x)]x∈X as a finite-dimensional vector (in the case
that X is finite, this is completely rigorous), we take derivatives and obtain
d
∂ X
L(p, θ, θ0 , λ) = 1 + log p(x) − θi φi (x) + θ0 − λ(x) = 1 + log p(x) − hθ, φ(x)i + θ0 − λ(x).
∂p(x)
i=1
To find the minimizing p for the Lagrangian (the function is convex in p), we set this equal to zero
to find that
p(x) = exp (hθ, φ(x)i − 1 − θ0 − λ(x)) .
Now, we note that with this setting, we always have p(x) > 0, so that the constraint p(x) ≥ 0
is unnecessary and (by complementary
R slackness) we have λ(x) = 0. In particular, by taking
θ0 = −1+A(θ) = −1+log exp(hθ, φ(x)i)dµ(x), we have that (according to our heuristic derivation)
the optimal density p should have the form
EP [X 2 ] = 1.
215
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Example 14.9: Assume that the base measure µ is counting measure on the support {−1, 1},
so that µ({−1}) = µ({1}) = 1. Then the maximum entropy distribution is given by P (X =
x) = 21 for x ∈ {−1, 1}. 3
Example 14.10: Assume that the base measure µ is Lebesgue measure on X = R, so that
µ([a, b]) = b − a for b ≥ a. Then by Theorem 14.8, we have that the maximum entropy
distribution has the form pθ (x) ∝ exp(−θx2 ); recognizing the normal, we see that the optimal
distribution is simply N(0, 1). 3
Example 14.11: Assume that the base measure µ is counting measure on the integers Z =
{. . . , −2, −1, 0, 1, . . .}. Then Theorem 14.8 shows that the optimal distribution is a discrete
version of the normal: we have pθ (x) P ∝ exp(−θx2 ) for x ∈ Z. That is, we choose θ > 0 so that
the distribution pθ (x) = exp(−θx )/ ∞ 2
j=−∞ exp(−θj ) has variance 1. 3
2
ii. The semidefinite cone. Take C = {X ∈ Rd×d : X = X > , X 0}, where a matrix X 0 means
that a> Xa ≥ 0 for all vectors a. Then we have that C is convex and closed under positive
scaling as well.
Given a convex cone C, we associate a cone ordering with the cone and say that for two
elements x, y ∈ C, we have x y if x − y 0, that is, x − y ∈ C. In the orthant case, this simply
means that x is component-wise larger than y. For a given inner product h·, ·i, we define the dual
cone
C ∗ := {y : hy, xi ≥ 0 for all x ∈ C} .
For the standard (Euclidean) inner product, the positive orthant is thus self-dual, and similarly the
semidefinite cone is also self-dual. For a vector y, we write y ∗ 0 if y ∈ C ∗ is in the dual cone.
With this generality in mind, we may consider the following linearly constrained maximum
entropy problem, which is predicated on a particular cone C with associated cone ordering and
a function ψ mapping into the ambient space in which C lies:
where the base measure µ is implicit. We denote the family of distributions (with densities w.r.t.
216
Stanford Statistics 311/Electrical Engineering 377 John Duchi
We make a few remarks in passing before proving the theorem. First, we note that we must assume
both equalities are attained for the theorem to hold. We may also present an example.
Example 14.13 (Normal distributions maximize entropy subject to covariance constraints): Sup-
pose that the cone C is the positive semidefinite cone in Rd×d , that α = 0, that we use the
Lebesgue measure as our base measure, and that ψ(x) = xx> ∈ Rd×d . Let us fix β = Σ for
some positive definite matrix Σ. This gives us the problem
Z
maximize − p(x) log p(x)dx subject to EP [XX > ] Σ
Then we have by Theorem 14.12 that if we can find a density pK (x) ∝ exp(−hK, xx> i) =
exp(−x> Kx) satisfying E[XX > ] = Σ, this distribution maximizes the entropy. But this is not
hard: simply take the normal distribution N(0, Σ), which gives K = 21 Σ−1 . 3
217
Stanford Statistics 311/Electrical Engineering 377 John Duchi
R R
Now, we note that pθ,K (x)φ(x)dµ(x) = α and pθ,K (x)ψ(x)dµ(x) = β by assumption. Then we
have
Proof Let B = {u ∈ Rd | kuk ≤ 1} be the unit ball in Rd . For any > 0, there exists a K = K()
such that kxkk ≤ Kekxk for all x ∈ Rd . As C ⊂ int Conv(Θ0 ), there exists an > 0 such that for
Pm
all θ0 ∈ C, θ0 + 2B ⊂ Θ0 , and by construction, for any u ∈ B we can write θ0 + 2u = j=1 λj θj
for some λ ∈ Rm >
+ with 1 λ = 1. We therefore have
But using the convexity of t 7→ exp(t) and that θ0 + 2u ∈ Θ0 , the last quantity has upper bound
1
For complex functions, Osgood’s lemma shows that if A is continuous and holomorphic in each variable individ-
ually, it is holomorphic. For a treatment of such ideas in an engineering context, see, e.g. [76, Ch. 1].
218
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Proof We write
exp(hθ, xi) − exp(hθ0 , xi) exp(hθ − θ0 , xi) − 1
= exp(hθ0 , xi)
kθ − θ0 k kθ − θ0 k
so that the lemma is equivalent to showing that
|ehθ−θ0 ,xi − 1|
≤ K max exp(hθj − θ0 , xi).
kθ − θ0 k j≤m
From this, we can assume without loss of generality that θ0 = 0 (by shifting). Now note that
by convexity e−a ≥ 1 − a for all a ∈ R, so 1 − ea ≤ |a| when a ≤ 0. Conversely, if a > 0, then
d
aea ≥ ea − 1 (note that da (aea ) = aea + ea ≥ ea ), so dividing by kxk, we see that
|ehθ,xi − 1| |ehθ,xi − 1| max{hθ, xiehθ,xi , |hθ, xi|}
≤ ≤ ≤ ehθ,xi + 1.
kθk kxk |hθ, xi| |hθ, xi|
As θ ∈ C, Lemma 14.14 then implies that
|ehθ,xi − 1|
≤ kxk ehθ,xi + 1 ≤ K max ehθj ,xi ,
kθk j
as desired.
With the lemmas in hand, we can demonstrate a dominating function for the derivatives. Indeed,
fix θ0 ∈ int Θ and for θ ∈ Θ, define
exp(hθ, xi) − exp(hθ0 , xi) − exp(hθ0 , xi)hx, θ − θ0 i ehθ,xi − ehθ0 ,xi − h∇ehθ0 ,xi , θ − θ0 i
g(θ, x) = = .
kθ − θ0 k kθ − θ0 k
Then limθ→θ0 g(θ, x) = 0 by the differentiability of t 7→ et . Lemmas 14.14 and 14.15 show that if
we take any collection {θj }m j=1 ⊂ Θ for which θ ∈ int Conv{θj }, then for C ⊂ int Conv{θj }, there
exists a constant K such that
| exp(hθ, xi) − exp(hθ0 , xi)|
|g(θ, x)| ≤ + kxk exp(hθ0 , xi) ≤ K max exp(hθj , xi)
kθ − θ0 k j
m
for all θ ∈ C. As maxj ehθj ,xi dµ(x) ≤ hθj ,xi dµ(x) < ∞, the dominated convergence
R P R
j=1 e
theorem thus implies that Z
lim g(θ, x)dµ(x) = 0,
θ→θ0
and so M (θ) = exp(A(θ)) is differentiable in θ, as
Z
hθ0 ,xi
M (θ) = M (θ0 ) + xe dµ(x), θ − θ0 + o(kθ − θ0 k).
219
Stanford Statistics 311/Electrical Engineering 377 John Duchi
√
Analyticity Over the subset ΘC := {θ + iz | θ ∈ Θ, z ∈ Rd } (where i = −1 is the imaginary
unit), we can extend the preceding results to demonstrate that A is analytic on ΘC . Indeed, we
first simply note that for a, b ∈ R, exp(a + ib) = exp(a) exp(ib) and | exp(a + ib)| = exp(a), i.e.
|ez | = e z for z ∈ C, and so Lemmas 14.14 and 14.15 follow mutatis-mutandis as in the real case.
These are enough for the application of the dominated convergence theorem above, and we use that
exp(·) is analytic to conclude that θ 7→ M (θ) is analytic on ΘC .
14.6 Exercises
Question 14.1: Prove that the log determinant function is concave over the positive semidefinite
matrices. That is, show that for X, Y ∈ Rd×d satisfying X 0 and Y 0, we have
where σij are specified only for indices i, j ∈ S (but we know that σij = σji and (i, i) ∈ S for all i).
Let Σ∗ denote the solution to this problem, assuming there is a positive definite matrix Σ satisfying
Σij = σij for i, j ∈ S. Show that for each unobserved pair (i, j) 6∈ S, the (i, j) Rentry [Σ∗−1 ]ij of
the inverse Σ∗−1 is 0. Hint: The distribution maximizing the entropy H(X)P = − p(x) log p(x)dx
subject to E[Xi Xj ] = σij has Gaussian density of the form p(x) = exp( (i,j)∈S λij xi xj − Λ0 ).
Question 14.3: Something about moment generating functions and log-barriers.
220
Chapter 15
In this lecture, we continue our study of exponential families, but now we investigate their properties
in somewhat more depth, showing how exponential family models provide a natural robustness
against model mis-specification, enjoy natural projection properties, and arise in other settings.
then the distribution with density pθ (x) = p(x) exp(hθ, φ(x)i − A(θ)) uniquely maximized the
(Shannon) entropy over the family Pαlin if we could find any θ satisfying EPθ [φ(X)] = α. (Recall
Theorem 14.8.) Now, of course, we must ask: does this actually happen? For if it does not, then
all of this work is for naught.
Luckily for us, the answer is that we often find ourselves in the case that such results occur.
Indeed, it is possible to show that, except for pathological cases, we are essentially always able to
find such a solution. To that end, define the mean space
n Z o
d
Mφ := α ∈ R : ∃Q s.t. q(x) = f (x)p(x), f ≥ 0, and q(x)φ(x)dx = α
Then we have the following result, which is well-known in the literature on exponential family
modeling; we refer to Wainwright and Jordan [139, Proposition 3.2 and Theorem 3.3] for the proof.
In the statement of the theorem, we recall that Rthe domain dom A of the log partition function is
defined as those points θ for which the integral p(x) exp(hθ, φ(x)i)dx < ∞.
221
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Theorem 15.1. Assume that there exists some point θ0 ∈ int dom A, where dom A := {θ ∈
Rd : A(θ) < ∞}. Then for any α in the interior of Mφ , there exists some θ = θ(α) such that
EPθ [φ(X)] = α.
Using tools from convex analysis, it is possible to extend this result to the case that dom A has no
interior but only a relative interior, and similarly for Mφ (see Hiriart-Urruty and Lemaréchal [87]
or Rockafellar [126] for discussions of interior and relative interior). Moreover, it is also possible to
show that for any α ∈ Mφ (not necessarily the interior), there exists a sequence θ1 , θ2 , . . . satisfying
the limiting guarantee limn EPθn [φ(X)] = α. Regardless, we have our desired result: if P lin is not
empty, maximum entropy distributions exist and exponential family models attain these maximum
entropy solutions.
222
Stanford Statistics 311/Electrical Engineering 377 John Duchi
where as usual we assume the exponential family model pθ (x) = p(x) exp(hθ, φ(x)i − A(θ)). We
have the following result.
Proof The proof follows immediately upon taking derivatives. We define the empirical negative
log likelihood (the empirical risk) as
n n n
bn (θ) := − 1 1X 1X
X
R log pθ (Xi ) = − hθ, φ(Xi )i + A(θ) − log p(Xi ),
n n n
i=1 i=1 i=1
which is convex as θ 7→ A(θ) is convex (recall Proposition 14.4). Taking derivatives, we have
n
bn (θ) = − 1
X
∇θ R φ(Xi ) + ∇A(θ)
n
i=1
n Z
1 X 1
=− φ(Xi ) + R φ(x)p(x) exp(hθ, φ(x)i)dx
n p(x) exp(hθ, φ(x)i)dx
i=1
n
1 X
=− φ(Xi ) + EPθ [φ(X)].
n
i=1
In particular, finding any θ such that ∇A(θ) = EPbn [φ(X)] gives the result.
223
Stanford Statistics 311/Electrical Engineering 377 John Duchi
As a consequence of the result, we have the following rough equivalences tying together the
preceding material. In short, maximum entropy subject to (linear) empirical moment constraints
(Theorem 14.8) is equivalent to maximum likelihood estimation in exponential families (Proposi-
tion 15.3), which is equivalent to I-projection of a fixed base distribution onto a linearly constrained
family of distributions (Proposition 15.2).
To motivate this abstract setting we give two examples, the first abstract and the second somewhat
more concrete.
iid
Example 15.4: Suppose that receive n random variables Xi ∼ P ; in this case, we have the
sequential prediction loss
n
X 1
EP [− log q(X1n )] = EP log ,
i=1
q(Xi | X1i−1 )
which corresponds to predicting Xi given X1i−1 as well as possible, when the Xi follow an
(unknown or adversarially chosen) distribution P . 3
Example 15.5 (Coding): Expanding on the preceding example, suppose that the set X is
finite, and we wish to encode X into {0, 1}-valued sequences using as few bits as possible. In
this case, the Kraft inequality (recall Theorem 13.2) tells us that if C : X → {0, 1}∗ is an
uniquely decodable code, and `C (x) denotes the length of the encoding for the symbol x ∈ X ,
then X
2−`C (x) ≤ 1.
x
224
Stanford Statistics 311/Electrical Engineering 377 John Duchi
In particular, we have a coding game where we attempt to choose a distribution Q (or sequential
coding scheme C) that has as small an expected length as possible, uniformly over distributions
P . (The field of universal coding studies such questions in depth; see Tsachy Weissman’s course
EE376b.) 3
We now show how the minimax game (15.3.1) naturally gives rise to exponential family models,
so that exponential family distributions are so-called robust Bayes procedures (cf. Grünwald and
Dawid [79]). Specifically, we say that Q is a robust Bayes procedure for the class P of distributions
if it minimizes the supremum risk (15.3.1) taken over the family P; that is, it is uniformly good for
all distributions P ∈ P. If we restrict our class P to be a linearly constrained family of distributions,
then we see that the exponential family distributions are natural robust Bayes procedures: they
uniquely solve the minimax game. More concretely, assume that P = Pαlin and that Pθ denotes the
exponential family distribution with density pθ (x) = p(x) exp(hθ, φ(x)i − A(θ)), where p denotes
the base density. We have the following.
inf sup EP [− log q(X)] = sup EP [− log pθ (X)] = sup inf EP [− log q(X)].
Q P ∈P lin lin
P ∈Pα lin Q
P ∈Pα
α
Proof This is a standard saddle-point argument (cf. [126, 87, 33]). First, note that
where H denotes the Shannon entropy, for any distribution P ∈ Pαlin . Moreover, for any Q 6= Pθ ,
we have
sup EP [− log q(X)] ≥ EPθ [− log q(X)] > EPθ [− log pθ (X)] = H(Pθ ),
P
But we know from our standard maximum entropy results (Theorem 14.8) that Pθ maximizes the
entropy over Pαlin , that is, supP ∈Pαlin H(P ) = H(Pθ ).
In short: maximum entropy is equivalent to robust prediction procedures for linear families of
distributions Pαlin , which is equivalent to maximum likelihood in exponential families, which in turn
is equivalent to I-projection.
225
Chapter 16
Fisher Information
Having explored the definitions associated with exponential families and their robustness properties,
we now turn to a study of somewhat more general parameterized distributions, developing connec-
tions between divergence measures and other geometric ideas such as the Fisher information. After
this, we illustrate a few consequences of Fisher information for optimal estimators, which gives a
small taste of the deep connections between information geometry, Fisher information, exponential
family models. In the coming chapters, we show how Fisher information measures come to play a
central role in sequential (universal) prediction problems.
where the score function `˙θ = ∇θ log pθ (x) is the gradient of the log likelihood at θ (implicitly
depending on X) and the expectation Eθ denotes expectation taken with respect to Pθ . Intuitively,
the Fisher information captures the variability of the gradient ∇ log pθ ; in a family of distributions
for which the score function `˙θ has high variability, we intuitively expect estimation of the parameter
θ to be easier—different θ change the behavior of `˙θ —though the log-likelihood functional θ 7→
Eθ0 [log pθ (X)] varies more in θ.
Under suitable smoothness conditions on the densities pθ (roughly, that derivatives pass through
expectations; see Remark 16.1 at the end of this chapter), there are a variety of alternate definitions
of Fisher information. These smoothness conditions hold for exponential families, so at least in
the exponential family case, everything in this chapter is rigorous. (We note in passing that there
are more general definitions of Fisher information for more general families under quadratic mean
differentiability; see, for example, van der Vaart [137].) First, we note that the score function has
226
Stanford Statistics 311/Electrical Engineering 377 John Duchi
where in equality (?) we have assumed that integration and derivation may be exchanged. Under
similar conditions, we thus attain an alternate definition of Fisher information as the negative
expected hessian of log pθ (X). Indeed,
Z
2 2
= −E[∇ log pθ (x)] + ∇ pθ (x)dx = −E[∇2 log pθ (x)]. (16.1.2)
| {z }
=1
This representation also makes clear the additional fact that, if we have n i.i.d. observations
Pn from the
n
model Pθ , then the information content similarly grows linearly, as log pθ (X1 ) = i=1 log pθ (Xi ).
We now give two examples of Fisher information, the first somewhat abstract and the second
more concrete.
Iθ = ∇2 A(θ).
227
Stanford Statistics 311/Electrical Engineering 377 John Duchi
where Ip is the Fisher information for the single observation Bernoulli(p) family as in Example 16.2.
In fact, this inverse dependence on Fisher information is unavoidable, as made clear by the Cramér
Rao Bound, which provides lower bounds on the mean squared error of all unbiased estimators.
Proposition 16.3 (Cramér Rao Bound). Let φ : Rd → R be an arbitrary differentiable function
and assume that the random function (estimator) T is unbiased for φ(θ) under Pθ . Then
As an immediate corollary to Proposition 16.3, we may take φ(θ) = hλ, θi for λ ∈ Rd . Then
varying λ over all of Rd , and we obtain that for any unbiased estimator T for the parameter θ ∈ Rd ,
we have Var(hλ, T i) ≥ λ> Iθ−1 λ. That is, we have
Corollary 16.4. Let T be unbiased for the parameter θ under the distribution Pθ . Then the
covariance of T has lower bound
Cov(T ) Iθ−1 .
In fact, the Cramér-Rao bound and Corollary 16.4 hold, in an asymptotic sense, for substantially
more general settings (without the unbiasedness requirement). For example, see the books of
van der Vaart [137] or Le Cam and Yang [105, Chapters 6 & 7], which show that under appropriate
conditions (known variously as quadratic mean differentiability and local asymptotic normality)
that no estimator can have smaller mean squared error than Fisher information in any uniform
sense.
We now prove the proposition, where, as usual, we assume that it is possible to exchange
differentiation and integration.
Proof Throughout this proof, all expectations and variances are computed with respect to Pθ .
The idea of the proof is to choose λ ∈ Rd to minimize the variance
228
Stanford Statistics 311/Electrical Engineering 377 John Duchi
where in the final step we used that T is unbiased for φ(θ). Using the preceding equality,
Var(T − hλ, `˙θ i) = Var(T ) + λ> Iθ λ − 2E[(T − φ(θ))hλ, `˙θ i] = Var(T ) + λ> Iθ λ − 2hλ, ∇φ(θ)i.
Taking λ = Iθ−1 ∇φ(θ) gives 0 ≤ Var(T − hλ, `˙θ i) = Var(T ) − ∇φ(θ)> Iθ−1 ∇φ(θ), and rearranging
gives the result.
Example 16.5 (Divergences in exponential families): Consider the exponential family density
pθ (x) = h(x) exp(hθ, φ(x)i − A(θ)). Then a straightforward calculation implies that for any θ1
and θ2 , the KL-divergence between distributions Pθ1 and Pθ2 is
That is, the divergence is simply the difference between A(θ2 ) and its first order expansion
around θ1 . This suggests that we may approximate the KL-divergence via the quadratic re-
mainder in the first order expansion. Indeed, as A is infinitely differentiable (it is an exponential
family model), the Taylor expansion becomes
1
Dkl (Pθ1 ||Pθ2 ) = hθ1 − θ2 , ∇2 A(θ1 )(θ1 − θ2 )i + O(kθ1 − θ2 k3 )
2
1
= hθ1 − θ2 , Iθ1 (θ1 − θ2 )i + O(kθ1 − θ2 k3 ).
2
3
In particular, KL-divergence is roughly quadratic for exponential family models, where the
quadratic form is given by the Fisher information matrix. We also remark in passing that for a
convex function f , the Bregman divergence (associated with f ) between points x and y is given
by Bf (x, y) = f (x) − f (y) − h∇f (y), x − yi; such divergences are common in convex analysis,
optimization, and differential geometry. Making such connections deeper and more rigorous is the
goal of the field of information geometry (see the book of Amari and Nagaoka [5] for more).
We can generalize this example substantially under appropriate smoothness conditions. Indeed,
we have
Proposition 16.6. For appropriately smooth families of distributions {Pθ }θ∈Θ ,
1
Dkl (Pθ1 ||Pθ2 ) = hθ1 − θ2 , Iθ1 (θ1 − θ2 )i + o(kθ1 − θ2 k2 ). (16.3.1)
2
229
Stanford Statistics 311/Electrical Engineering 377 John Duchi
We only sketch the proof, as making it fully rigorous requires measure-theoretic arguments and
Lebesgue’s dominated convergence theorem.
Sketch of Proof By a Taylor expansion of the log density log pθ2 (x) about θ1 , we have
log pθ2 (x) = log pθ1 (x) + h∇ log pθ1 (x), θ1 − θ2 i
1
+ (θ1 − θ2 )> ∇2 log pθ1 (x)(θ1 − θ2 ) + R(θ1 , θ2 , x),
2
where R(θ1 , θ2 , x) = Ox (kθ1 − θ2 k3 ) is the remainder term, where Ox denotes a hidden dependence
on x. Taking expectations and assuming that we can interchange differentiation and expectation
appropriately, we have
Eθ1 [log pθ2 (X)] = Eθ1 [log pθ1 (X)] + hEθ1 [`˙θ1 ], θ1 − θ2 i
1
+ (θ1 − θ2 )> Eθ1 [∇2 log pθ1 (X)](θ1 − θ2 ) + Eθ1 [R(θ1 , θ2 , X)]
2
1
= Eθ1 [log pθ1 (X)] − (θ1 − θ2 )> Iθ1 (θ1 − θ2 ) + o(kθ1 − θ2 k2 ),
2
where we have assumed that the O(kθ1 − θ2 k3 ) remainder is uniform enough in X that E[R] =
o(kθ1 − θ2 k2 ) and used that the score function `˙θ is mean zero under Pθ .
We may use Proposition 16.6 to give a somewhat more general version of the Cramér-Rao
bound (Proposition 16.3) that applies to more general (sufficiently smooth) estimation problems.
Indeed, we will show that Le Cam’s method (recall Chapter 7.3) is (roughly) performing a type of
discrete second-order approximation to the KL-divergence, then using this to provide lower bounds.
More concretely, suppose we are attempting to estimate a parameter θ parameterizing the family
P = {Pθ }θ∈Θ , and assume that Θ ⊂ Rd and θ0 ∈ int Θ. Consider the minimax rate of estimation
of θ0 in a neighborhood around θ0 ; that is, consider
inf sup b n ) − θk2 ],
Eθ [kθ(X1
θb θ=θ0 +v∈Θ
where the observations Xi are drawn i.i.d. Pθ . Fixing v ∈ Rd and setting θ = θ0 + δv for some
δ > 0, Le Cam’s method (7.3.3) then implies that
2 2
b n ) − θk2 ] ≥ δ kvk 1 −
P n − P n
inf max Eθ [kθ(X1 θ0 θ0 +δv TV .
θb θ∈{θ0 ,θ+δv} 8
Using Pinsker’s inequality that 2 kP − Qk2TV ≤ Dkl (P ||Q) and the asymptotic quadratic approxi-
mation (16.3.1), we have
r √ 1
≤ n Dkl (Pθ ||Pθ +δv ) = n δ 2 v > Iθ v + o(δ 2 kvk2 ) 2 .
n
P − P n
θ0 θ0 +δv TV 0 0 0
2 2
By taking δ 2 = (nv > Iθ0 v)−1 , for large enough v and n we know that θ0 + δv ∈ int Θ (so that the
distribution Pθ0 +δv exists), and for large n, the remainder term o(δ 2 kvk2 ) becomes negligible. Thus
we obtain
2 2 2
inf max Eθ [kθ(X b n ) − θk2 ] & δ kvk = 1 kvk . (16.3.2)
1
θb θ∈{θ0 ,θ+δv} 16 16 nv > Iθ0 v
In particular, in one-dimension, inequality (16.3.2) implies a result generalizing the Cramér-Rao
bound. We have the following asymptotic local minimax result:
230
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Corollary 16.7. Let P = {Pθ }θ∈Θ , where Θ ⊂ R, be a family of distributions satisfying the
quadratic approximation condition of Proposition 16.6. Then there exists a constant c > 0 such
that h i 1
lim lim inf sup √ Eθ (θbn (X1n ) − θ)2 ≥ c Iθ−1 .
v→∞ n→∞ θb θ:|θ−θ |≤v/ n
n 0
n 0
Written differently (and with minor extension), Corollary 16.7 gives a lower bound based on a
local modulus of continuity of the loss function with respect to the metric induced by the Fisher
information. Indeed, suppose we wish to estimate a parameter θ in the neighborhood of θ0 (where
√
the neighborhood size decreases as 1/ n) according to some loss function ` : Θ × Θ → R. Then if
we define the modulus of continuity of ` with respect to the Fisher information metric as
`(θ0 , θ0 + δv)
ω` (δ, θ0 ) := sup ,
v:kvk≤1 δ 2 v > Iθ0 v
the combination of Corollary 16.7 and inequality (16.3.2) shows that the local minimax rate of
estimating Eθ [`(θbn , θ)] for θ near θ0 must be at least ω` (n−1/2 , θ0 ). For more on connections between
moduli of continuity and estimation, see, for example, Donoho and Liu [57].
Remark 16.1: In order to make all of our exchanges of differentiation and expectation rigorous,
we must have some conditions on the densities we consider. One simple condition sufficient to make
this work is via Lebesgue’s dominated convergence theorem. Let f : X × Θ → R be a differentiable
function. For a fixed base measure µ assume there exists a function g such that g(x) ≥ k∇θ f (x, θ)k
for all θ, where Z
g(x)dµ(x) < ∞.
X
R R
Then in this case, we have ∇θ f (x, θ)dµ(x) = ∇θ f (x, θ)dµ(x) by the mean-value theorem and
definition of a derivative. (Note that for all θ0 we have supv:kvk2 ≤δ k∇θ f (x, θ)k2 θ=θ0 +v ≤ g(x).)
More generally, this type of argument can handle absolutely continuous functions, which are dif-
ferentiable almost everywhere. 3
231
Chapter 17
232
Stanford Statistics 311/Electrical Engineering 377 John Duchi
3. Note the minimizer of `: we have α∗ (η) = sign(η − 1/2), and f ∗ (X) = sign(η(X) − 1/2)
minimizes risk R(f ) over all f
4. Minimizing f can be achieved pointwise, and we have
(b) Example 17.1 (Exponential loss): Consider the exponential loss, used in AdaBoost
(among other settings), which sets φ(α) = e−α . In this case, we have
1 η ∂
argmin `φ (α, η) = log because `φ (α, η) = −ηe−α + (1 − η)eα .
α 2 1−η ∂α
η(x)
Thus fφ∗ (x) = 1
2 log 1−η(x) , and this is Fisher consistent. 3
(c) Classification calibration
1. Consider pointwise versions of risk (all that is necessary, turns out)
2. Define the infimal conditional φ-risks as
Definition 17.2. The margin-based loss φ is classification calibrated if H(δ) > 0 for
all δ > 0. Equivalently, for any η 6= 21 , we have `∗φ (η) < `wrong
φ (η).
5. Example (Example 17.1 continued): For the exponential loss, we have
`wrong
−α
ηe + (1 − η)eα = e0 = 1
φ (η) = inf
α(2η−1)≤0
233
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Proof Let A ⊂ Rd × R denote all the pairs (a, b) minorizing f , that is, those pairs
such that f (t) ≥ ha, ti − b for all t. Then we have
234
Stanford Statistics 311/Electrical Engineering 377 John Duchi
ψ(δn ) → 0 ⇔ δn → 0.
1. Some insights from theorem. Recall examples 17.1 and 17.2. For both of these, we
have that ψ(δ) = H(δ), as H is convex. For the hinge loss, φ(α) = [1 − α]+ , we obtain
for any f that
P(Y f (X) ≤ 0) − inf P(Y f (X) ≤ 0) ≤ E [1 − Y f (X)]+ − inf E [1 − Y f (X)]+ .
f f
235
Stanford Statistics 311/Electrical Engineering 377 John Duchi
`wrong
φ (1/2) := inf `φ (α, 1/2) = inf `φ (α, 1/2) = `∗φ (1/2),
α(1−1)≤0 α
so H(0) = `∗φ (1/2) − `∗φ (1/2) = 0. (It is clear that the sub-optimality gap H ≥ 0 by
construction.)
2. We begin with the first statement of the theorem, inequality (17.0.2). Consider first
the gap (for a fixed margin α) in conditional 0-1 risk,
Now we use expression (17.0.3) to get an upper bound on R(f ) − R∗ via the φ-risk.
Indeed, consider the ψ-transform (17.0.1). By Jensen’s inequality, we have that
Now we use the special structure of the suboptimality function we have constructed.
Note that ψ ≤ H, and moreover, we have for any α ∈ R that
∗
1 {sign(α) 6= sign(2η − 1)} H(|2η − 1|) = 1 {sign(α) 6= sign(2η − 1)} inf `φ (α, η) − `φ (η)
α(2η−1)≤0
= Rφ (f ) − Rφ∗ ,
236
Stanford Statistics 311/Electrical Engineering 377 John Duchi
ψ(δ) → 0. It remains to show that ψ(δ) → 0 implies that δ → 0. But this is clear
because we know that ψ(0) = 0 andψ(δ) > 0 whenever δ > 0, and the convexity of ψ
implies that ψ is increasing.
To obtain IV(b)3 from IV(b)2, note that by inequality (17.0.2), we have
satisfies `0φ (0, η) = (2η − 1)φ0 (0), and if φ0 (0) < 0, this quantity is negative for η > 1/2.
Thus the minimizing α(η) ∈ (0, ∞]. (Proof by picture, but formalize in full notes.)
For the other direction assume that φ is classification calibrated. Recall the definition of a
subgradient gα of the function φ at α ∈ R is any gα such that φ(t) ≥ φ(α) + gα (t − α) for
all t ∈ R. (Picture.) Let g1 , g2 be such that `(α) ≥ `(0) + g1 α and `(α) ≥ `(0) + g2 α, which
exist by convexity. We show that both g1 , g2 < 0 and g1 = g2 . By convexity we have
and under the assumption that g1 > g2 we obtain `∗φ (η) = inf α≥0 `φ (α, η) > `wrongφ (η),
which is a contradiction to classification calibration. We thus obtain g1 = g2 , so that the
function φ has a unique subderivative at α = 0 and is thus differentiable.
237
Stanford Statistics 311/Electrical Engineering 377 John Duchi
If φ0 (0) ≥ 0, then for α ≥ 0 and η > 1/2 we must have the right hand side is at least
φ(0), which contradicts classification calibration, because we know that `∗φ (η) < `wrong
φ (η)
exactly as in the preceding argument.
and
1 β
f (x0 ) ≤ βf (b) + (1 − β)f (x), or f (x0 ) ≤ f (x) + f (b).
1−β 1−β
Taking α, β → 0, we obtain
as desired.
238
Stanford Statistics 311/Electrical Engineering 377 John Duchi
assumption we have h(z/2) = b > 0, whence we have h(1) ≥ b > 0. In particular, the piecewise
linear function defined by (
0 if t ≤ z/2
g(t) = b
1−z/2 (t − z/2) if t > z/2
is closed, convex, and satisfies g ≤ h. But g(z) > 0 = h∗∗ (z), a contradiction to the fact that h∗∗
is the largest (closed) convex function below h.
17.2 Exercises
Question 17.1: Find the suboptimality function Hφ and ψ-transform for the binary classification
problem with the following losses.
(a) Logistic loss. That is,
φ(α) = log(1 + e−α )
(b) Squared error (ordinary regression). The surrogate loss in this case for the pair (x, y) is 12 (f (x)−
y)2 . Show that for y ∈ {−1, 1}, this can be written as a margin-based loss, and compute the
associated suboptimality function Hφ and ψ-transform. Is the squared error classification
calibrated?
Question 17.2: Suppose we have a regression problem with data (independent variables) x ∈ X
and y ∈ R. We wish to find a predictor f : X → R minimizing the probability of being far away
from the true y, that is, for some c > 0, our loss is of the form
Show that no loss of the form ϕ(α, y) = |α − y|p , where p ≥ 1, is Fisher consistent for the loss L,
even if the distribution of Y conditioned on X = x is symmetric about its mean E[Y | X]. That is,
show there exists a distribution on pairs X, Y such that the set of minimizers of the surrogate
Rϕ (f ) := E[ϕ(f (X), Y )]
is not included in the set of minimizers of the true risk, R(f ) = P(|Y − f (X)| ≥ c), even if the
distribution of Y (conditional on X) is symmetric.
Question 17.3 (Empirics of classification calibration): In this problem you will compare the
performance of hinge loss minimization and an ordinary linear regression in terms of classifica-
tion performance. Specifically, we compare the performance of the hinge surrogate loss with the
regression surrogate when the data is generated according to the model
239
Stanford Statistics 311/Electrical Engineering 377 John Duchi
(ii) Set
n
1X
θbhinge = argmin [1 − yi hxi , θi]+
θ:kθk2 ≤R n i=1
and
n
1 X
θbreg = argmin (yi − hxi , θi)2 = argmin kXθ − yk22 .
θ 2n θ
i=1
(iii) Evaluate the 0-1 error rate of the vectors θbhinge and θbreg on the held-out data points {(xtest test n
i , yi )}i=1 .
Perform the preceding steps (i)–(iii), using any n ≥ 100 and d ≥ 10 and a radius R = 5, for
different standard deviations σ = {0, 1, . . . , 10}; perform the experiment a number of times. Give
a plot or table exhibiting the performance of the classifiers learned on the held-out data. How do
the two compare? Given that for the hinge loss we know Hφ (δ) = δ (as presented in class), what
would you expect based on the answer to Question 17.1?
I have implemented (in the julia language; see http://julialang.org/) methods for solving
the hinge loss minimization problem with stochastic gradient descent so that you do not need to.
The file is available at this link. The code should (hopefully) be interpretable enough that if julia
is not your language of choice, you can re-implement the method in an alternative language.
Question 17.4: In this question, we generalize our results on classification calibration and surro-
gate risk consistency to a much broader supervised learning setting. Consider the following general
supervised learning problem, where we assume that we have data in pairs (X, Y ) ∈ X × Y, where
X and Y are general spaces.
Let L : Rm × Y → R+ be a loss function we wish to minimize, so that the loss of a prediction
function f : X → Rm for the pair (x, y) is L(f (x), y). Let ϕ : Rm × Y → R be an arbitrary
surrogate, where ϕ(f (x), y) is the surrogate loss. Define the risk and ϕ-risk
Let PY denote the space of all probability distributions on Y, and define the conditional (pointwise)
risks ` : Rm × PY → R and `ϕ : Rm × PY → R by
Z Z
`(α, P ) = L(α, y)p(y)dy and `ϕ (α, P ) = `(α, y)p(y)dy.
Y Y
(Here for simplicity we simply write integration against dy; you may make this fully general if you
wish.) Let `∗ (P ) = inf α `(α, P ) denote the minimal conditional risk, and similarly for `∗ϕ (P ), when
Y has distribution P . If Px denotes the distribution of Y conditioned on X = x, then we may
rewrite the risk functionals as
We will show that the same machinery we developed for classification calibration extends to this
general supervised learning setting.
For ≥ 0, define the suboptimality gap function
240
Stanford Statistics 311/Electrical Engineering 377 John Duchi
which measures the gap between achievable (pointwise) risk and the best surrogate risk when we
enforce that the true loss is not minimized. Also define the uniform suboptimality function
(Compare this with the definition of ∆ for the classification case to gain intuition.)
Prove that ∆ϕ () > 0 for all > 0 implies if Rϕ (fn ) → Rϕ∗ , then R(fn ) → R∗ .
(b) We say that the loss ϕ is uniformly calibrated if ∆ϕ () > 0 for all > 0. Show that, in the
margin-based binary classification case with loss φ : R → R, uniform calibration as defined
here is equivalent to classification-calibration as defined in class. You may assume that the
margin-based loss φ is continuous.
(c) A non-uniform result: assume that for all distributions P ∈ PY on the set Y, we have
∆ϕ (, P ) > 0
if > 0. (We call this calibration.) Assume that there exists an upper bound function B : X →
R+ such that E[B(X)] < ∞ and `(α, Px ) ≤ `∗ (Px ) + B(x) for all x and α ∈ Rm . For example, if
the loss L is bounded, this holds. Show that if the sequence of functions fn : X → Rm satisfies
Equivalently, show that for any distribution P on X × Y, for all > 0 there exists a δ > 0 such
that
Rϕ (f ) ≤ Rϕ∗ + δ implies R(f ) ≤ R∗ + .
(You may ignore any measurability issues that come up.)
241
Chapter 18
c. Examples of 0-1 loss and its friends: have X ∈ X and Y ∈ {−1, 1}.
242
Stanford Statistics 311/Electrical Engineering 377 John Duchi
1. Example 18.1 (Binary classification with 0-1 loss): What is Bayes risk of binary
classifier? Let
P (Y = 1 | X = x)p(x)
p+1 (x) = p(x | Y = 1) =
P (Y = 1)
be the density of X conditional on Y = 1 and similarly for p−1 (x), and assume that
each class occurs with probability 1/2. Then
Z
∗
R = inf [1 {γ(x) ≤ 0} P (Y = 1 | X = x) + 1 {γ(x) ≥ 0} P (Y = −1 | X = x)] p(x)dx
γ
Z Z
1 1
= inf [1 {γ(x) ≤ 0} p+1 (x) + 1 {γ(x) ≥ 0} p−1 (x)] dx = min{p+1 (x), p−1 (x)}dx.
2 γ 2
Similarly, we may compute the minimal prior risk, which is simply 12 by defini-
tion (18.0.2). Looking at the gap between the two, we obtain
Z Z
∗ ∗ 1 1 1 1
Rprior −R = − min{p+1 (x), p−1 (x)}dx = [p1 − p−1 ]+ = kP1 − P−1 kTV .
2 2 2 2
That is, the difference is half the variation distance between P1 and P−1 , the dis-
tributions of x conditional on the label Y . 3
2. Example 18.2 (Binary classification with hinge loss): We now repeat precisely
the same calculations as in Example 18.1, but using as our loss the hinge loss (recall
Example 17.2). In this case, the minimal φ-risk is
Z
Rφ∗ = inf [1 − α]+ P (Y = 1 | X = x) + [1 + α]+ P (Y = −1 | X = x) p(x)dx
α
Z Z
1
= inf [1 − α]+ p1 (x) + [1 + α]+ p−1 (x) dx = min{p1 (x), p−1 (x)}dx.
2 α
∗
We can similarly compute the prior risk as Rφ,prior = 1. Now, when we calculate
the improvement available via observing X = x, we find that
Z
∗
Rφ,prior − Rφ∗ = 1 − min{p1 (x), p−1 (x)}dx = kP1 − P−1 kTV ,
a. Statistical information
1. Suppose we have a classification problem with data X ∈ X and labels Y ∈ {−1, 1}. A
natural notion of information that X carries about Y is the gap
∗
Rprior − R∗ , (18.0.5)
that between the prior risk and the risk attainable after viewing x ∈ X .
243
Stanford Statistics 311/Electrical Engineering 377 John Duchi
2. Didn’t present this. True definition of statistical information: suppose class 1 has
prior probability π and class −1 has prior 1 − π, and let P1 and P−1 be the distributions
of X ∈ X given Y = 1 and Y = −1, respectively. The Bayes risk associated with the
problem is then
Z
Bπ (P1 , P−1 ) := inf [1 {γ(x) ≤ 0} p1 (x)π + 1 {γ(x) ≥ 0} p−1 (x)(1 − π)] dx (18.0.6)
γ
Z
= p1 (x)π ∧ p−1 (x)(1 − π)dx
244
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Proof First, consider the integrated Bayes risk. Recalling the definition of the condi-
tional distribution η(x) = P (Y = 1 | X = x), we have
Z
∗
`φ (π) − `∗φ (η(x)) p(x)dx
Bφ,π − Bφ,π (P1 , P−1 ) =
Z
= sup `∗φ (π) − φ(α)P (Y = 1 | x) − φ(−α)P (Y = −1 | x) p(x)dx
α
p−1 (x)(1 − π)
Z
∗ p1 (x)π
= sup `φ (π) − φ(α) − φ(−α) p(x)dx,
α p(x) p(x)
where we have used Bayes rule as in (18.0.9). Let us now divide all appearances of the
density p1 by p−1 , which yields
Bφ,π − Bφ,π (P1 , P−1 )
Z φ(α) pp−1
1 (x)
(x) π + φ(−α)(1 − π)
p1 (x)
∗
= sup `φ (π) −
p1 (x)
π + (1 − π) p−1 (x)dx.
p−1 (x)
p−1 (x) π + (1 − π)
α
(18.0.11)
By inspection, the representation (18.0.11) gives the result of the theorem if we can argue
that the function fπ is convex, where we substitute p1 (x)/p−1 (x) for t in fπ (t).
To see that the function fπ is convex, consider the intermediate function
sπ (u) := sup {−πφ(α)u − (1 − π)φ(−α)} .
α
245
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Rφ (γ | q) = E[φ(Y γ(q(X)))]
d. Result unifying quantization and learning: we say that loss functions φ1 and φ2 are univer-
sally equivalent if they induce the same f divergence (18.0.10), that is, there is a constant
c > 0 and a, b ∈ R such that
Theorem 18.4. Let φ1 and φ2 be equivalent margin-based surrogate loss functions. Then
for any quantizers q1 and q2 ,
Rφ∗ 1 (q1 ) ≤ Rφ∗ 1 (q2 ) if and only if Rφ∗ 2 (q1 ) ≤ cRφ∗ 2 (q2 ).
Proof The proof follows straightforwardly via the representation (18.0.12). If φ1 and φ2
are equivalent, then we have that
Rφ∗ 1 ,prior − Rφ∗ 1 (q) = Dfπ,φ1 (P−1 ||P1 | q) = cDfπ,φ2 (P−1 ||P1 | q) + a + b
= c Rφ∗ 2 ,prior − Rφ∗ 2 (q) + a + b
246
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Rφ∗ 1 (q1 ) ≤ Rφ∗ 1 (q2 ) if and only if Rφ∗ 1 ,prior − Rφ∗ 1 (q1 ) ≥ Rφ∗ 1 ,prior − Rφ∗ 1 (q2 )
if and only if Dfπ,φ1 (P−1 ||P1 | q1 ) ≥ Dfπ,φ1 (P−1 ||P1 | q2 )
if and only if Dfπ,φ2 (P−1 ||P1 | q1 ) ≥ Dfπ,φ2 (P−1 ||P1 | q2 )
if and only if Rφ∗ 2 ,prior − Rφ∗ 2 (q1 ) ≥ Rφ∗ 2 ,prior − Rφ∗ 2 (q2 ).
Subtracting Rφ∗ 2 ,prior from both sides gives our desired result.
e. Some comments:
1. We have an interesting thing: if we wish to learn a quantizer and a classifier jointly,
then this is possible by using any loss equivalent to the true loss we care about.
2. Example: hinge loss and 0-1 loss are equivalent.
3. Turns out that the condition that the losses φ1 and φ2 be equivalent is (essentially)
necessary and sufficient for two quantizers to induce the same ordering [119]. This
equivalence is necessary and sufficient for the ordering conclusion of Theorem 18.4.
18.1 Exercises
Question 18.1 (Bayes risk gaps): Consider a general binary classification problem with (X, Y ) ∈
X × {−1, 1}. Let φ(α) = log(1 + e−α ), so that we use the logistic loss. Show that the surrogate
risk gap
∗
Rφ,prior − Rφ∗ = I(X; Y ),
where I is the mutual information.
247
Part IV
248
Chapter 19
In this chapter, we explore sequential game playing and online probabilistic prediction schemes.
These have applications in coding when the true distribution of the data is unknown, biological
algorithms (encoding genomic data, for example), control, and a variety of other areas. The field of
universal prediction is broad; in addition to this chapter touching briefly on a few of the techniques
therein and their relationships with statistical modeling and inference procedures, relevant reading
includes the survey by Merhav and Feder [116], the more recent book of Grünwald [78], and Tsachy
Weissman’s EE376c course at Stanford.
where we have written it as the sum over q(xi | xi−11 ) to emphasize the sequential nature of the
game. Associated with the regret of the sequence xn1 is the adversarial regret (usually simply called
the regret) of Q with respect to the family P of distributions, which is
RX
n (Q, P) := sup Reg(Q, P, xn1 ). (19.1.2)
P ∈P,xn
1 ∈X
n
249
Stanford Statistics 311/Electrical Engineering 377 John Duchi
In more generality, we may which to use a loss function L different than the log loss; that is, we
might wish to measure a loss-based version the regret as
n
X
L(xi , Q(· | x1i−1 )) − L(xi , P (· | xi−1
1 )),
i=1
where L(xi , P ) indicates the loss suffered on the point xi when the distribution P over Xi is played,
and P (· | xi−1
1 ) denotes the conditional distribution of Xi given x1
i−1
according to P . We defer
discussion of such extensions later, focusing on the log loss for now because of its natural connections
with maximum likelihood and coding.
A less adversarial problem is to minimize the redundancy, which is the expected regret under a
distribution P . In this case, we define the redunancy of Q with respect to P as the expected regret
of Q with respect to P under the distribution P , that is,
1 1
Redn (Q, P ) := EP log − log = Dkl (P ||Q) , (19.1.3)
q(X1n ) p(X1n )
where the dependence on n is implicit in the KL-divergence. The worst-case redundancy with
respect to a class P is then
Rn (Q, P) := sup Redn (Q, P ). (19.1.4)
P ∈P
Example 19.1 (Example 15.5 on coding, continued): We noted in Example 15.5 that for
any p.m.f.s p and q on the set X , it is possible to define coding schemes Cp and Cq with code
lengths
1 1
`Cp (x) = log and `Cq (x) = log .
p(x) q(x)
Conversely, given (uniquely decodable) encoding schemes Cp and Cq : X → {0, 1}∗ , the func-
tions pCp (x) = 2−`Cp (x) and qCq (x) = 2−`Cq (x) satisfy x pCp (x) ≤ 1 and x qCq (x) ≤ 1. Thus,
P P
the redundancy of Q with respect to P is the additional number of bits required to encode
variables distributed according to P when we assume they have distribution Q:
n
X 1 1
Redn (Q, P ) = EP log i−1
− log
i=1
q(Xi | X1 ) p(Xi | X1i−1 )
Xn
= EP [`Cq (Xi )] − EP [`Cp (Xi )],
i=1
where `C (x) denotes the number of bits C uses to encode x. Note that, as in Chapter 13, the
code d− log p(x)e is (essentially) optimal. 3
As another example, we may consider a filtering or prediction problem for a linear system.
Example 19.2 (Prediction in a linear system): Suppose we believe that a sequence of random
variables Xi ∈ Rd are Markovian, where Xi given Xi−1 is normally distributed with mean
AXi−1 + g, where A is an unknown matrix and g ∈ Rd is a constant drift term. Concretely, we
assume Xi ∼ N(AXi−1 + g, σ 2 Id×d ), where we assume σ 2 is fixed and known. For our class of
250
Stanford Statistics 311/Electrical Engineering 377 John Duchi
predicting distributions Q, we may look at those that at iteration i predict Xi ∼ N(µi , σ 2 I).
In this case, the regret is given by
n
X 1 1
Reg(Q, P, xn1 ) = 2
kµi − xi k22 − 2 kAxi−1 + g − xi k22 ,
2σ 2σ
i=1
Moreover, if Compn (Θ) < +∞, then the normalized maximum likelihood distribution (also known
as the Shtarkov distribution) Q, defined with density
supθ∈Θ pθ (xn1 )
q(xn1 ) = R ,
supθ pθ (xn1 )dxn1
The proposition completely characterizes the minimax regret in the adversarial setting, and it
gives the unique distribution achieving the regret. Unfortunately, in most cases it is challenging
to compute the minimax optimal distribution Q, so we must make approximations of some type.
One approach is to make Bayesian approximations to Q, as we do in the sequel when we consider
redundancy rather than adversarial regret. See also the book of Grünwald [78] for more discussion
of this and other issues.
251
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Proof We begin by proving the result in the case that Compn < +∞. First, note that the
normalized maximum likelihood distribution Q has constant regret:
1 1
RX
n (Q, P) = sup log − log
xn
1 ∈X
n q(xn1 ) supθ pθ (xn1 )
supθ pθ (xn1 )dxn1
R
1
= sup log − log = Compn (P).
xn1
supθ pθ (xn1 ) supθ pθ (xn1 )
We now give an example where (up to constant factor terms) we can explicitly calculate the
minimax regret in the adversarial setting. In this case, we compete with the family of i.i.d. Bernoulli
distributions.
252
Stanford Statistics 311/Electrical Engineering 377 John Duchi
we have Pθ (x) = θx (1 − θ)1−x . For a sequence xn1 ∈ {0, 1}n with m non-zeros, we thus have
for θb = m/n that
where h2 (p) = −p log p − (1 − p) log(1 − p) is the binary entropy. Using this representation, we
find that the complexity of the Bernoulli family is
n
X n −nh2 ( m )
Compn ([0, 1]) = log e n .
m
m=0
Rather than explicitly compute with this, we now use Stirling’s approximation (cf. Cover and
Thomas [48, Chapter 17]): for any p ∈ (0, 1) with np ∈ N, we have
" #
n 1 1 1
∈√ p ,p exp(nh2 (p)).
np n 8p(1 − p) πp(1 − p)
the notedR 1asymptote occuring as n → ∞ by the fact that this sum is a Riemann sum for the
integral 0 θ−1/2 (1 − θ)−1/2 dθ. In particular, we have that as n → ∞,
Z 1 !
1
inf RX
n (Q, P) = Compn ([0, 1]) = log 2 + [8
−1/2
, π −1/2 ]n1/2 p dθ + o(1)
Q 0 θ(1 − θ)
Z 1
1 1
= log n + log p dθ + O(1).
2 0 θ(1 − θ)
R1√
We remark in passing that this is equal to 12 log n + log 0 Iθ dθ, where Iθ denotes the Fisher
information of the Bernoulli family (recall Example 16.2). We will see that this holds in more
generality, at least for redundancy, in the sequel. 3
253
Stanford Statistics 311/Electrical Engineering 377 John Duchi
P = {Pθ }θ∈Θ . In the simplest case—upon which we focus—the data X1n are then generated i.i.d.
according to Pθ , and we suffer expected regret (or redundancy)
1 1
Redn (Q, Pθ ) = Eθ log − Eθ log = Dkl (Pθn ||Qn ) , (19.3.1)
q(X1n ) pθ (X1n )
where we use Qn to denote that Q is applied on all n data points (in a sequential fashion, as
Q(· | X1i−1 )). In this expression, q and p denote the densities of Q and P , respectively. In a slightly
more general setting, we may consider the expected regret of Q with respect to a distribution Pθ
even under model mis-specification, meaning that the data is generated according to an alternate
distribution P . In this case, the (more general) redundancy becomes
1 1
EP log − log . (19.3.2)
q(X1n ) pθ (X1n )
In both cases (19.3.1) and (19.3.2), we would like to be able to guarantee that the redundancy
grows more slowly than n as n → ∞. That is, we would like to find distributions Q such that, for
1 n
any θ0 ∈ Θ, we have n Dkl Pθ0 ||Qn → 0 as n → ∞. Assuming we could actually obtain such a
distribution in general, this is interesting because (even in the i.i.d. case) for any fixed distribution
Pθ 6= Pθ0 , we must have Dkl Pθ0 ||Pθ = nDkl (Pθ0 ||Pθn ) = Ω(n). A standard approach to attaining
n n
such guarantees is the mixture approach, which is based on choosing Q as a convex combination
(mixture) of all the possible source distributions Pθ for θ ∈ Θ.
In particular, given a prior distribution π (weighting function integrating to 1) over Θ, we define
the mixture distribution Z
π
Qn (A) = π(θ)Pθ (A)dθ for A ⊂ X n . (19.3.3)
Θ
Rewriting this in terms of densities pθ , we have
Z
π n
qn (x1 ) = π(θ)pθ (xn1 )dθ.
Θ
Conceptually, this gives a simple prediction scheme, where at iteration i we play the density
q π (xi1 )
q π (xi | xi−1
1 )= ,
q π (xi−1
1 )
where we have emphasized that this strategy exhibits an exponential weighting approach, where
distribution weights are scaled exponentially by their previous loss performance of log 1/pθ (xi−1
1 ).
254
Stanford Statistics 311/Electrical Engineering 377 John Duchi
This mixture construction (19.3.3), with the weighting scheme (19.3.4), enjoys very good per-
formance. In fact, we say that so long as the prior π puts non-zero mass over all of Θ, under some
appropriate smoothness conditions, the scheme Qπ is universal, meaning that Dkl (Pθn ||Qπn ) = o(n).
We have the following theorem illustrating this effect. In the theorem, we let π be a density on Θ,
and we assume the Fisher information Iθ for the family P = {Pθ }θ∈Θ exists in a neighborhood of
θ0 ∈ int Θ, and that the distributions Pθ are sufficiently regular that differentiation and integration
can be interchanged. (See Clarke and Barron [44] for precise conditions.) We have
Theorem 19.5 (Clarke and Barron [44]). Under the above conditions, if Qπn = Pθn π(θ)dθ is the
R
While we do not rigorously prove the theorem, we give a sketch showing the main components
of the result based on asymptotic normality arguments for the maximum likelihood estimator in
Section 19.4. See Clarke and Barron [44] for a full proof.
Example 19.6 (Bernoulli distributions with a Beta prior): Consider the class of binary (i.i.d.
or memoryless) Bernoulli sources, that is, the Xi are i.i.d Bernoulli(θ), where θ = Pθ (X = 1) ∈
[0, 1]. The Beta(α, β)-distribution prior on θ is the mixture π with density
Γ(α + β) α−1
π(θ) = θ (1 − θ)β−1
Γ(α)Γ(β)
R∞
on [0, 1], where Γ(a) = 0 ta−1 e−t dt denotes the gamma function. We remark that that under
α
the Beta(α, β) distribution, we have Eπ [θ] = α+β . (See any undergraduate probability text for
such results.)
If we play via a mixture of Bernoulli distributions under such a Beta-prior for θ, by Theo-
rem 19.5 we have a universal prediction scheme. We may also explicitly calculate the predictive
i
Pi Q. To do so, we first compute the posterior π(θ | X1 ) as in expression (19.3.4).
distribution
Let Si = j=1 Xj be partial sum of the Xs up to iteration i. Then
pθ (xi1 )π(θ)
π(θ | xi1 ) = i
∝ θSi (1 − θ)i−Si θα−1 θβ−1 = θα+Si −1 (1 − θ)β+i−Si −1 ,
q(x1 )
where we have ignored the denominator as we must simply normalize the above quantity in
θ. But by inspection, the posterior density of θ | X1i is a Beta(α + Si , β + i − Si ) distribution.
Thus to compute the predictive distribution, we note that Eθ [Xi ] = θ, so we have
Si + α
Q(Xi = 1 | X1i ) = Eπ [θ | X1i ] = .
i+α+β
Moreover, Theorem 19.5 shows that when we play the prediction game with a Beta(α, β)-prior,
we have redundancy scaling as
n π
1 n Γ(α)Γ(β) 1 1 1
Dkl Pθ0 ||Qn = log + log + log + o(1)
2 2πe Γ(α + β) θ0α−1 (1 − θ0 )β−1 2 θ0 (1 − θ0 )
255
Stanford Statistics 311/Electrical Engineering 377 John Duchi
As one additional interesting result, we show that mixture models are actually quite robust,
even under model mis-specification, that is, when the true distribution generating the data does not
belong to the class P = {Pθ }θ∈Θ . That is, mixtures can give good performance for the generalized
redundancy quantity (19.3.2).R For this next result, we as usual define the mixture distribution Qπ
over the set X via Qπ (A) = Θ Pθ (A)dπ(θ). We may also restrict this mixture distribution to a
subset Θ0 ⊂ Θ by defining Z
π 1
QΘ0 (A) = Pθ (A)dπ(θ).
π(Θ0 ) Θ0
Then we obtain the following robustness result.
Proposition 19.7. Assume that Pθ have densities pθ over X , let P be any distribution having
density p over X , and let q π be the density associated with Qπ . Then for any Θ0 ⊂ Θ,
1 1 1
+ Dkl P ||QπΘ0 − Dkl (P ||Pθ ) .
EP log π − log ≤ log
q (X) pθ (X) π(Θ0 )
In particular, Proposition 19.7 shows that so long as the mixture distributions QπΘ0 can closely
approximate Pθ , then we attain a convergence guarantee nearly as good as any in the family P =
{Pθ }θ∈Θ . (This result is similar in flavor to the mutual information bound (10.1.3), Corollary 10.2,
and the index of resolvability quantity.)
Proof Fix any Θ0 ⊂ Θ. Then we have q π (x) = Θ pθ (x)dπ(θ) ≥ Θ0 pθ (x)dπ(θ). Thus we have
R R
" #
p(X) p(X)
EP log π ≤ EP inf log R
q (X) Θ0 ⊂Θ
Θ0 pθ (x)dπ(θ)
" # " #
p(X)π(Θ0 ) p(X)
= EP inf log R = EP inf log π (X) .
Θ0 π(Θ0 ) Θ0 pθ (x)dπ(θ) Θ0 π(Θ0 )qΘ0
This is certainly smaller than the same quantity with the infimum outside the expectation, and
noting that
1 1 p(X) p(X)
EP log π − log = EP log π − EP log
q (X) pθ (X) q (X) pθ (X)
gives the result.
256
Stanford Statistics 311/Electrical Engineering 377 John Duchi
a random variable distributed according to π, and conditional on T = θ assume that the Xi are
drawn according to Pθ , we have that the mutual information between T and X1n is
Z Z
Iπ (T ; X1n ) = π(θ)Dkl (Pθn ||Qπn ) dθ = inf π(θ)Dkl (Pθn ||Q) dθ. (19.3.6)
Q
With Theorem 19.5 in hand, we can give a somewhat more nuanced picture of this mutual
information quantity. As a first consequence of Theorem 19.5, we have that
Z √
n d n det Iθ
Iπ (T ; X1 ) = log + log π(θ)dθ + o(1), (19.3.7)
2 2πe π(θ)
where Iθ denotes the Fisher information matrix for the family {Pθ }θ∈Θ . One strand of Bayesian
statistics—we will not delve too deeply into this now, instead referring to the survey by Bernardo
[26]—known as reference analysis, advocates that in performing a Bayesian analysis, we should
choose the prior π that maximizes the mutual information between the parameters θ about which
we wish to make inferences and any observations X1n available. Moreover, in this set of strategies,
one allows n to tend to ∞, as we wish to take advantage of any data we might actually see. The
asymptotic formula (19.3.7) allows us to choose such a prior.
In a different vein, Jeffreys [96] proposed that if the square root of the determinant of the Fisher
information was integrable, then one should take π as
√
det Iθ
πjeffreys (θ) = R √
Θ det Iθ dθ
known as the Jeffreys prior. Jeffreys originally proposed this for invariance reasons, as the infer-
ences made on the parameter θ under the prior πjeffreys are identical to those made on a trans-
formed parameter φ(θ) under the appropriately transformed Jeffreys prior. The asymptotic ex-
pression (19.3.7), however, shows that the Jeffreys prior is the asymptotic reference prior. Indeed,
computing the integral in (19.3.7), we have
Z √ Z Z p
det Iθ πjeffreys (θ)
π(θ) log dθ = π(θ) log dθ + log det Iθ dθ
Θ π(θ) Θ π(θ)
Z p
= −Dkl (π||πjeffreys ) + log det Iθ dθ,
whenever the Jeffreys prior exists. Moreover, we see that in an asymptotic sense, the worst-case
prior distribution π for nature to play is given by the Jeffreys prior, as otherwise the −Dkl (π||πjeffreys )
term in the expected (Bayesian) redundancy is negative.
Example 19.8 (Jeffreys priors and the exponential distribution): Let us now assume that
our source distributions Pθ are exponential distributions, meaning that θ ∈ (0, ∞) and we have
density pθ (x) = exp(−θx − log 1θ ) for x ∈ [0, ∞). This is clearly an exponential family model,
∂2 1 2
and the Fisher information is easy to compute as Iθ = ∂θ 2 log θ = 1/θ (cf. Example 16.1).
√
In this case, the Jeffreys prior is πjeffreys (θ) ∝ I = 1/θ, but this “density” does not integrate
over [0, ∞). One approach to this difficulty, advocated by Bernardo [26, Definition 3] (among
others) is to just proceed formally and notice that after observing a single datapoint, the
257
Stanford Statistics 311/Electrical Engineering 377 John Duchi
“posterior” distribution π(θ | X)P is well-defined. Following this idea, note that after seeing
some data X1 , . . . , Xi , with Si = ij=1 Xj as the partial sum, we have
i
X 1
π(θ | xi1 ) ∝ pθ (xi1 )πjeffreys (θ) = θ exp i
−θ xj = θi−1 exp(−θSi ).
θ
j=1
Pi
Integrating, we have for si = j=1 xj
Z ∞ Z ∞ Z ∞
−θx i−1 −θsi 1
q(x | xi1 ) = pθ (x)π(θ | xi1 )dθ ∝ θe θ e dθ = ui e−u du,
0 0 (si + x)i+1 0
where we made the change of variables u = θ(si + x). This is at least a distribution that
normalizes, so often one simply assumes the existence of a piece of fake data. For example, by
saying we “observe” x0 = 1, we have prior proportional to π(θ) = e−θ , which yields redundancy
1 n 1
Dkl Pθn0 ||Qπn = log + θ0 + log + o(1).
2 2πe θ0
The difference is that, in this case, the redundancy bound is no longer uniform in θ0 , as it
would be for the true reference (or Jeffreys, if it exists) prior. 3
A natural question that arises from this expression is the following: if nature chooses a worst-case
prior, can we swap the order of maximization and minimization? That is, do we ever have the
equality
sup Iπ (T ; X1n ) = inf sup Dkl (Pθn ||Q) ,
π Q θ
so that the worst-case Bayesian redundancy is actually the minimax redundancy? It is clear that
if nature can choose the worst case Pθ after we choose Q, the redundancy must be at least as bad
as the Bayesian redundancy, so
sup Iπ (T ; X1n ) ≤ inf sup Dkl (Pθn ||Q) = inf Rn (Q, P).
π Q θ Q
∗
Indeed, if this inequality were an equality, then for the worst-case prior π ∗ , the mixture Qπn would
be minimax optimal.
In fact, the redundancy-capacity theorem, first proved by Gallager [72], and extended by Haus-
sler [84] (among others) allows us to do just that. That is, if we must choose a distribution Q and
then nature chooses Pθ adversarially, we can guarantee to worse redundancy than in the (worst-case)
Bayesian setting. We state a simpler version of the result that holds when the random variables
X take values in finite spaces; Haussler’s more general version shows that the next theorem holds
whenever X ∈ X and X is a complete separable metric space.
258
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Theorem 19.9 (Gallager [72]). Let X be a random variable taking on a finite number of values
and Θ be a measurable space. Then
Z
sup inf Dkl (Pθ ||Q) dπ(θ) = sup Iπ (T ; X) = inf sup Dkl (Pθ ||Q) .
π Q π Q θ∈Θ
Moreover, the infimum on the right isRuniquely attained by some distribution Q∗ , and if π ∗ attains
the supremum on the left, then Q∗ = Pθ dπ ∗ (θ).
where R is a remainder term. Assuming that θb → θ0 at any reasonable rate (this can be made
rigorous), this remainder is negligible asymptotically.
Rearranging this equality, we obtain
259
Stanford Statistics 311/Electrical Engineering 377 John Duchi
where we have used that the Fisher information Iθ = −Eθ [∇2 log pθ (X)] and the law of large
numbers. By the (multivariate) central limit theorem, we then obtain the asymptotic normality
result
n
√ 1 X d
n(θb − θ0 ) ≈ √ Iθ−1 ∇ log pθ0 (Xi ) → N(0, Iθ−1 ),
n 0 0
i=1
d
where → denotes convergence in distribution, with asymptotic variance
Iθ−1
0
Eθ0 [∇ log pθ0 (X)∇ log pθ0 (X)> ]Iθ−1
0
= Iθ−1
0
Iθ0 Iθ−1
0
= Iθ−1
0
.
Completely heuristically, we also write
θb “ ∼ ” N(θ0 , (nIθ0 )−1 ). (19.4.1)
by the asymptotic normality result, π(θ) b = π(θ0 ) + O(1/√n) again by the asymptotic normality
result, and
X n >
n n
log pθb(X1 ) ≈ log pθ0 (X1 ) + ∇ log pθ0 (Xi ) (θb − θ0 )
i=1
n
X > Xn
−1 1
≈ log pθ0 (X1n ) + ∇ log pθ0 (Xi ) Iθ0 ∇ log pθ0 (Xi ) .
n
i=1 i=1
Substituting these three into the redundancy expression (19.4.2), we obtain
pθ0 (X1n )
1 1 b > −1 b
Eθ0 log ≈ log + Eθ0 − (θ − θ0 ) (nIθ0 ) (θ − θ0 )
qn (X1n ) (2π)d/2 det(nIθ0 )−1/2 2
" n > Xn #
1 X 1
+ log − Eθ0 ∇ log pθ0 (Xi ) Iθ−1 ∇ log pθ0 (Xi )
π(θ0 ) 0 n
i=1 i=1
d n 1 1
= log + log det(Iθ0 ) + log − d + R,
2 2π 2 π(θ0 )
where R is a remainder term. This gives the major terms in the asymptotic result in Theorem 19.5.
260
Stanford Statistics 311/Electrical Engineering 377 John Duchi
where we have assumed that X belongs to a finite set, so that Q(X) is simply the probability of
X. For a given prior distribution π on θ, we define the expected redundancy as
Z
Red(Q, π) = Dkl (Pθ ||Q) dπ(θ).
Our goal is to show that the max-min value of the prediction game is the same as the min-max
value of the game, that is,
Proof We know that the max-min risk (worst-case Bayes risk) of the game is supπ Iπ (T ; X);
it remains to show that this is the min-max risk. To that end, define the capacity of the family
{Pθ }θ∈Θ as
C := sup Iπ (T ; X). (19.5.1)
π
Notably, this constant is finite (because Iπ (T ; X) ≤ log |X |), and there exists a sequence πn of prior
probabilities such that Iπn (T ; X) → C. Now, let Q̄ be any cluster point of the sequence of mixtures
π
R
Q = Pθ dπn (θ); such a point exists because the space of probability distributions on the finite
n
and we claim this is sufficient for the theorem. Indeed, suppose that inequality (19.5.2) holds. Then
in this case, we have
inf sup Red(Q, θ) ≤ sup Red(Q̄, θ) = sup Dkl Pθ ||Q̄ ≤ C,
Q θ∈Θ θ∈Θ θ∈Θ
sup inf Red(Q, θ) ≤ inf sup Red(Q, π) = inf sup Red(Q, θ).
π Q Q π Q θ∈Θ
For the sake of contradiction, let us assume that there exists some θ ∈ Θ such that inequal-
ity (19.5.2) fails, call it θ∗ . We will then show that suitable mixtures (1 − λ)π + λδθ∗ , where δθ∗ is
the point mass on θ∗ , could increase the capacity (19.5.1). To that end, for shorthand define the
mixtures
πn,λ = (1 − λ)πn + λδθ∗ and Qπn ,λ = (1 − λ)Qπn + λPθ∗
for λ ∈ [0, 1]. Let us also use the notation Hw (X | T ) to denote the conditionaly entropy of
the random variable X on T (when T is distributed as w), and we abuse notation by writing
H(X) = H(P ) when X is distributed as P . In this case, it is clear that we have
261
Stanford Statistics 311/Electrical Engineering 377 John Duchi
To demonstrate our contradiction, we will show two things: first, that at λ = 0 the limits of both
sides of the preceding display are equal to the capacity C, and second, that the derivative of the
right hand side is positive. This will contradict the definition (19.5.1) of the capacity.
To that end, note that
It is clear that at λ = 0, both sides are equal to the capacity C, while taking derivatives with
respect to λ we have
∂ X
H((1 − λ)Q̄ + λPθ∗ ) = − (Pθ∗ (x) − Q̄(x)) log (1 − λ)Q̄(x) + λPθ∗ (x) .
∂λ x
∂
In particular, if inequality (19.5.2) fails to hold, then ∂λ limn Iπn ,λ (T ; X)|λ=0 > 0, contradicting the
definition (19.5.1) of the channel capacity.
The uniqueness of the result follows from the strict convexity of the mutual information I in
the mixture channel Q̄.
19.6 Exercises
Question 19.1 (Minimax redundancy and different loss functions): In this question, we consider
iid
expected losses under the Bernoulli distribution. Assume that Xi ∼ Bernoulli(p), meaning that
Xi = 1 with probability p and Xi = 0 with probability 1 − p. We consider four different loss
functions, and their associated expected regret, for measuring the accuracy of our predictions of
such Xi . For each of the four choices below, we prove expected regret bounds on
n
X n
X
Redn (θ,
b P, L) := b i−1 ), Xi )] − inf
EP [L(θ(X EP [L(θ, Xi )], (19.6.1)
1
θ
i=1 i=1
262
Stanford Statistics 311/Electrical Engineering 377 John Duchi
where θb is a predictor based on X1 , . . . , Xi−1 at time i. Define Si = ij=1 Xj to be the partial sum
P
up to time i. For each of parts (a)–(c), at time i use the predictor
1
b i−1 ) = Si−1 + 2 .
θbi = θ(X1
i
(a) Loss function: L(θ, x) = 12 (x − θ)2 . Show that Redn (θ,
b P, L) ≤ C · log n where C is a constant.
1
(b) Loss function: L(θ, x) = x log 1θ + (1 − x) log 1−θ , the usual log loss for predicting probabilities.
b P, L) ≤ C · log n whenever the true probability p ∈ (0, 1), where C is a
Show that Redn (θ,
constant. Hint: Note that there exists a prior π for which θb is a Bayes strategy. What is this
prior?
(d) Extra credit: Show that there is a numerical constant c > 0 such that for any procedure θ, b
√
b Bernoulli(p), L) ≥ c n for the absolute loss L in
the worst-case redundancy supp∈[0,1] Redn (θ,
part (c). Give a strategy attaining this redundancy.
Question 19.2 (Strong versions of redundancy): Assume that for a given θ ∈ Θ we draw
X1n ∼ Pθ . We define the Bayes redundancy for a family of distributions P = {Pθ }θ∈Θ as
Z
Cn := inf Dkl (Pθ ||Q) dπ(θ) = Iπ (T ; X1n ),
π
Q
(b) Assume that π attains the supremum in the definition of Cn∗ . Show that
Hint: Introduce the random variable Z to be 1 if the random variable T ∈ B and 0 otherwise, then
use that Z → T → X1n forms a Markov chain, and expand the mutual information. For part (b),
the inequality 1−x 1
x log 1−x ≤ 1 for all x ∈ [0, 1] may be useful.
263
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Question 19.3 (Mixtures are as good as point distributions): Let P be a Laplace(λ) distribution
on R, meaning that X ∼ P has density
λ
p(x) = exp(−λ|x|).
2
iid
Assume that X1 , . . . , Xn ∼ P , and let P n denote the n-fold product of P . In this problem, we
compare the predictive performance of distributions from the normal location family P = {N(θ, σ 2 ) :
θ ∈ R} with the mixture distribution Qπ over P defined by the normal prior distribution N(µ, τ 2 ),
that is, π(θ) = (2πτ 2 )−1/2 exp(−(θ − µ)2 /2τ 2 ).
(a) Let Pθ,Σ be the multivariate normal distribution with mean θ ∈ Rn and covariance Σ ∈ Rn×n .
What is Dkl (P n ||Pθ,Σ )?
(b) Show that inf θ∈Rn Dkl (P n ||Pθ,Σ ) = Dkl (P n ||P0,Σ ), that is, the mean-zero normal distribution
has the smallest KL-divergence from the Laplace distribution.
(c) Let Qπn be the mixture of the n-fold products in P, that is, Qπn has density
Z ∞
qnπ (xn1 ) = π(θ)pθ (x1 ) · · · pθ (xn )dθ,
−∞
(d) Show that the redundancy of Qπn under the distribution P is asymptotically nearly as good
as the redundancy of any Pθ ∈ P, the normal location family (so Pθ has density pθ (x) =
(2πσ 2 )−1/2 exp(−(x − θ)2 /2σ 2 )). That is, show that
1 1
sup EP log π n − log = O(log n)
θ∈R qn (X1 ) pθ (X1n )
for any prior variance τ 2 > 0 and any prior mean µ ∈ R, where the big-Oh hides terms
dependent on τ 2 , σ 2 , µ2 .
(e) Extra credit: Can you give an interesting condition under which such redundancy guarantees
hold more generally? That is, using Proposition 19.7 in the notes, give a general condition
under which
1 1
EP log π n − log = o(n)
q (X1 ) pθ (X1n )
as n → ∞, for all θ ∈ Θ.
264
Chapter 20
Thus far, in our discussion of universal prediction and related ideas, we have focused (essentially)
exclusively on making predictions with the logarithmic loss, so that we play a full distribution over
the set X as our prediction at each time step in the procedure. This is natural in settings, such as
coding (recall examples 15.5 and 19.1), in which the log loss corresponds to a quantity we directly
care about, or when we do not necessarily know much about the task at hand but rather wish
to simply model a process. (We will see this more shortly.) In many cases, however, we have a
natural task-specific loss. The natural question that follows, then, is to what extent it is possible
to extend the results of Chapter 19 to different settings in which we do not necessarily care about
prediction of an entire distribution. (Relevant references include the paper of Cesa-Bianchi and
Lugosi [41], which shows how complexity measures known as Rademacher complexity govern the
regret in online prediction games; the book by the same authors [42], which gives results covering a
wide variety of online learning, prediction, and other games; the survey by Merhav and Feder [116];
and the study of consequences of the choice of loss for universal prediction problems by Haussler
et al. [85].)
where Xbi are the predictions of the procedure we use and P is the distribution generating the data
n
X1 . In this case, if the distribution P is known, it is clear that the optimal strategy is to play the
Bayes-optimal prediction
Z
∗ i−1
Xi ∈ argmin EP [L(x, Xi ) | X1 ] = argmin L(x, xi )dP (xi | X1i−1 ). (20.1.1)
x∈Xb x∈Xb X
265
Stanford Statistics 311/Electrical Engineering 377 John Duchi
In many cases, however, we do not know the distribution P , and so our goal (as in the previous chap-
ter) is to simultaneously minimize the cumulative loss simultaneously for all source distributions in
a family P.
where X bi chosen according to Q(· | X i−1 ) as in expression (20.1.2). The natural question now, of
1
course, is whether the strategy (20.1.2) has redundancy growing more slowly than n.
It turns out that in some situations, this is the case: we have the following theorem [116, Section
III.A.2], which only requires that the usual redundancy (19.1.3) (with log loss) is sub-linear and the
loss is suitably bounded. In the theorem, we assume that the class of distributions P = {Pθ }θ∈Θ is
indexed by θ ∈ Θ.
x, x)−L(x∗ , x)| ≤ L
Theorem 20.1. Assume that the redundancy Redn (Q, Pθ ) ≤ Rn (θ) and that |L(b
b, x∗ . Then we have
for all x and predictions x
r
1 2
Redn (Q, Pθ , L) ≤ L Rn (θ).
n n
To attain vanishing expected regret under the loss L, then, Theorem 20.1 requires only that we
play a Bayes’ strategy (20.1.2) with a distribution Q for which the average (over n) of the usual
redundancy (19.1.3) tends to zero, so long as the loss is (roughly) bounded. We give two examples of
bounded losses. First, we might consider the 0-1 loss, which clearly satisfies |L(b x, x)−L(x∗ , x)| ≤ 1.
Second, the absolute value loss (which is used for robust estimation of location parameters [120, 90]),
x, x) = |x − x
given by L(b x, x) − L(x∗ , x)| ≤ |b
b|, satisfies |L(b x − x∗ |. If the distribution Pθ has median
θ and Θ is compact, then E[|b x − X|] is minimized by its median, and |b x − x∗ | is bounded by the
diameter of Θ.
Proof The theorem is essentially a consequence of Pinsker’s inequality (Proposition 2.10). By
266
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Xn Z Z h i
= pθ (xi−1 (pθ (xi | x1i−1 ) − q(xi | xi−1 bi , xi ) − L(X ∗ , xi ) dxi dxi−1
1 ) 1 )) L( X i 1
i=1 X i−1 X
n Z
X
∗
+ pθ (xi−1 i−1 i−1
1 ) EQ [L(Xi , Xi ) − L(Xi , Xi ) | x1 ] dx1 ,
b (20.1.4)
i=1 X i−1 | {z }
≤0
bi , Xi ) − L(X ∗ , Xi ) | X i−1 ]
EQ [L(X i 1
where we have used the definition of total variation distance. Combining this inequality with (20.1.4),
we obtain
n Z
X
pθ (xi−1 i−1 i−1
i−1
Redn (Q, Pθ , L) ≤ 2L 1 ) Pθ (· | x1 ) − Q(· | x1 ) TV dx1
i−1
i=1 X
(?) n Z 1 Z 1
X 2
2 2
pθ (xi−1 i−1
pθ (xi−1 xi−1 x1i−1 )
TV
≤ 2L 1 )dx1 1 ) Pθ (·
| 1 ) − Q(· |
X i−1 X i−1
i=1
n Z 1
X
2 2
pθ (xi−1 x1i−1 ) xi−1
= 2L 1 ) Pθ (·
| − Q(· | 1 ) TV
,
i=1 X i−1
√
where the inequality (?) follows by the Cauchy-Schwarz inequality applied to the integrands pθ
√
and pθ kP − QkTV . Applying the Cauchy-Schwarz inequality to the final sum, we have
n Z 1
√
X
2 2
pθ (x1i−1 )
Pθ (· x1i−1 ) xi−1
Redn (Q, Pθ , L) ≤ 2L n | − Q(· | 1 ) TV
i=1 X i−1
n Z 1
(??) √ 1X i−1 i−1 i−1
i−1 2
≤ 2L n pθ (x1 )Dkl Pθ (· | x1 )||Q(· | x1 ) dx1
2 i−1
i=1 X
√ q
= L 2n Dkl Pθn ||Q ,
267
Stanford Statistics 311/Electrical Engineering 377 John Duchi
where inequality (??) is an application of Pinsker’s inequality. But of course, we know by that
Redn (Q, Pθ ) = Dkl (Pθn ||Q) by definition (19.1.3) of the redundancy.
Before proceding to examples, we note that in a variety of cases the bounds of Theorem 20.1 are
loose. For example, under mean-squared error, universal linear predictors [53, 124] have redundancy
√
O(log n), while Theorem 20.1 gives at best a bound of O( n).
TODO: Add material on redundancy/capacity (Theorem 19.9) analogue in general loss case,
which allows playing mixture distributions based on mixture of {Pθ }θ∈Θ .
20.1.2 Examples
We now give an example application of Theorem 20.1 with an application to a classification problem
with side information. In particular, let us consider the 0-1 loss `0−1 (ŷ, y) = 1 {ŷ · y ≤ 0}, and
assume that we wish to predict y based on a vector x ∈ Rd of regressors that are fixed ahead of
time. In addition, we assume that the “true” distribution (or competitor) Pθ is that given x and
θ, Y has normal distribution with mean hθ, xi and variance σ 2 , that is,
iid
Yi = hθ, xi i + εi , εi ∼ N(0, σ 2 ).
Now, we consider playing according to a mixture distribution (19.3.3), and for our prior π we choose
θ ∼ N(0, τ 2 Id×d ), where τ > 0 is some parameter we choose.
Let us first consider the case in which we observe Y1 , . . . , Yn directly (rather than simply whether
we classify correctly) and consider the prediction scheme this generates. First, we recall as in the
posterior calculation (19.3.4) that we must calculate the posterior on θ given Y1 , . . . , Yi at step i+1.
Assuming we have computed this posterior, we play
Ybi := argmin EQπ [`0−1 (y, Yi ) | Y1i−1 ] = argmin Qπ (sign(Yi ) 6= sign(y) | Y1i−1 )
y∈R y∈R
Z ∞
= argmin Pθ (sign(Yi ) 6= sign(y))π(θ | Y1i−1 )dθ. (20.1.5)
y∈R −∞
Lemma 20.2. Assume that θ has prior N(0, τ 2 Id×d ). Then conditional on Y1i = y1i and the first i
vectors xi1 = (x1 , . . . , xi ) ⊂ Rd , we have
i i
X 1 1 X
θ | y1i , xi1 ∼ N Ki−1 xj yj , Ki−1 , where Ki = 2 Id×d + 2 xj x>
j .
τ σ
j=1 j=1
Deferring the proof of Lemma 20.2 temporarily, we note that under the distribution Qπ , as by
assumption we have Yi = hθ, xi i + εi , the posterior distribution (under the prior π for θ) on Yi+1
conditional on Y1i = yii and x1 , . . . , xi+1 is
D i
X E
−1 > −1
Yi+1 = hθ, xi+1 i + εi+1 | y1i , x1i ∼ N xi+1 , Ki 2
xj yj , xi+1 Ki xi+1 + σ .
j=1
268
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Consequently, if we let θbi+1 be the posterior mean of θ | y1i , xii (as given by Lemma 20.2), the optimal
prediction (20.1.5) is to choose any Ybi+1 satisfying sign(Ybi+1 ) = sign(hxi+1 , θbi+1 i). Another option
is to simply play
X i
−1
Ybi+1 = x> K
i+1 i y j j ,
x (20.1.6)
j=1
269
Stanford Statistics 311/Electrical Engineering 377 John Duchi
inf sup Redn (Q, Pθ ) = inf sup Dkl (Pθn ||Q) = sup Iπ (T ; X1n ) ≤ log |Θ|,
Q θ∈Θ Q θ∈Θ π
where T ∼ π and conditioned on T = θ we draw X1n ∼ Pθ . (Here we have used that I(T ; X1n ) =
H(T )−H(T | X1n ) ≤ H(T ) ≤ log |Θ|, by definition (2.1.3) of the mutual information.) In particular,
the redundancy is constant for any n.
Now we come to our question: is this possible in a purely sequential case? More precisely,
suppose we wish to predict a sequence of variables yi ∈ {−1, 1}, we have access to a finite collection
of strategies, and we would like to guarantee that we perform as well in prediction as any single
member of this class. Then, while it is not possible to achieve constant regret, it is possible to have
regret that grows only logarithmically in the number of comparison strategies. To establish the
setting, let us denote our collection of strategies, henceforth called “experts”, by {xi,j }dj=1 , where
i ranges in 1, . . . , n. Then at iteration i of the prediction game, we measure the loss of expert j by
L(xi,j , y).
We begin by considering a mixture strategy that would be natural under the logarithmic loss,
we assume the experts play points xi,j ∈ [0, 1], where xi,j = P (Yi = 1) according to expert j.
(We remark in passing that while the notation is perhaps not completely explicit about this, the
experts may adapt to the sequence Y1n .) In this case, the loss we suffer is the usual log loss,
L(xi,j , y) = y log x1i,j + (1 − y) log 1−x1
i,j
. Now, if we assume we begin with the uniform prior
distribution π(j) = 1/d for all j, then the posterior distribution, denoted by πji = π(j | Y1i−1 ), is
i i !
1 1
xyl,jl (1 − xl,j )1−yl
Y X
πji ∝ π(j) = π(j) exp − yl log + (1 − yl ) log
xl,j 1 − xl,j
l=1 l=1
i
!
X
= π(j) exp − L(xl,j , yl ) .
l=1
This strategy suggests what is known variously as the multiplicative weights strategy [8], exponenti-
ated gradient descent method [98], or (after some massaging) a method known since the late 1970s as
the mirror descent or non-Euclidean gradient descent method (entropic gradient descent) [117, 24].
In particular, we consider an algorithm for general losses where fix a stepsize η > 0 (as we cannot
be as aggressive as in the probabilistic setting), and we then weight each of the experts j by expo-
nentially decaying the weight assigned to the expert for the losses it has suffered. For the algorithm
to work, unfortunately, we need a technical condition on the loss function and experts xi,j . This
loss function is analogous to a weakened version of exp-concavity, which is a common assumption
in online game playing scenarios (see the logarithmic regret algorithms developed by Hazan et al.
[86], as well as earlier work, for example, that by Kivinen and Warmuth [99] studying regression
270
Stanford Statistics 311/Electrical Engineering 377 John Duchi
problems for which the loss is strongly convex in one variable but not simultaneously in all). In
particular, exp-concavity is the assumption that
x 7→ exp(−L(x, y))
is a concave function. Because the exponent of the logarithm is linear, the log loss is obviously
exp-concave, but for alternate losses, we make a slightly weaker assumption. In particular, we
assume there are constants c, η such that for any vector π in the d-simplex (i.e. π ∈ Rd+ satisfies
Pd
j=1 πj = 1) there is some way to choose y
b so that for any y (that can be played in the game)
Xd d
1 X
exp − L(b
y , y) ≥ πj exp(−ηL(xi,j , y)) or y , y) ≤ −c log
L(b πj exp(−ηL(xi,j , y)) .
c
j=1 j=1
(20.2.1)
By
Pd inspection, inequality (20.2.1) holds for the log loss with c = η = 1 and the choice yb =
j=1 πj xi,j , because of the exp-concavity condition; any exp-concave loss also satisfies inequal-
ity (20.2.1) with c = η = 1 and the choice of the posterior mean yb = dj=1 πj xi,j . The idea in
P
this case is that losses satisfying inequality (20.2.1) behave enough like the logarithmic loss that a
Bayesian updating of the experts works. (Condition (20.2.1) originates with the work of Haussler
et al. [85], where they name such losses (c, η)-realizable.)
Example 20.3 (Squared error and exp-concavity): Consider the squared error loss L(b y , y) =
1 2
y − y) , where yb, y ∈ R. We claim that if xj ∈ [0, 1] for each j, π is in the simplex, meaning
2 (b
P
j πj = 1 and πj ≥ 0, and y ∈ [0, 1], then the squared error π 7→ L(hπ, xi, y) is exp-concave,
that is, inequality (20.2.1) holds with c = η = 1 and yb = hπ, xi. Indeed, computing the Hessian
of the exponent, we have
1 1
∇2π exp − (hπ, xi − y)2 = ∇π − exp − (hπ, xi − y)2 (hπ, xi − y)x
2 2
1 2
(hπ, xi − y)2 − 1 xx> .
= exp − (hπ, xi − y)
2
We can also show that the 0-1 loss satisfies the weakened version of exp-concavity in inequal-
ity (20.2.1), but we have to take the constant c to be larger (or η to be smaller).
Example 20.4 (Zero-one loss and weak exp-concavity): Now suppose that we use the 0-1
y , y) = 1 {y · yb ≤ 0}. We claim P
loss, that is, `0−1 (b that if we take a weighted majority vote
under the distribution π, meaning that we set yb = dj=1 πj sign(xj ) for a vector x ∈ Rd , then
inequality (20.2.1) holds with any c large enough that
2
c−1 ≤ log . (20.2.2)
1 + e−η
271
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Thus, to attain
d
X
−η`0−1 (xj ,y)
y , y) = 1 ≤ −c log
`0−1 (b πj e
j=1
it is sufficient that
d
1 + e−η
X
−η`0−1 (xj ,y) −1 2
1 ≤ −c log ≤ −c log πj e , or c ≤ log .
2 1 + e−η
j=1
3. Choose ybi satisfying (20.2.1) for the weighting π = π i and expert values {xi,j }dj=1
4. Observe yi and suffer loss L(b
yi , yi )
With the scheme above, we have the following regret bound.
Theorem 20.5 (Haussler et al. [85]). Assume condition (20.2.1) holds and that ybi is chosen by the
above scheme. Then for any j ∈ {1, . . . , d} and any sequence y1n ∈ Rn ,
n
X n
X
yi , yi ) ≤ c log d + cη
L(b L(xi,j , yi ).
i=1 i=1
Proof This is an argument based on potentials. At each iteration, any loss we suffer implies that
the potential W i must decrease, but it cannot decrease too quickly (as otherwise the individual
predictors xi,j would suffer too much loss). Beginning with condition (20.2.1), we observe that
d i+1
X
i W
L(byi , yi ) ≤ −c log πj exp(−ηL(xi,j , yi )) = −c log
Wi
j=1
272
Stanford Statistics 311/Electrical Engineering 377 John Duchi
where the inequality uses that exp(·) is increasing. As log exp(a) = a, this is the desired result.
Example (Example 20.4 continued): By substituting the choice c−1 = log 1+e2−η into the
regret guarantee of Theorem 20.5 (which satisfies inequality (20.2.1) by our guarantee (20.2.2)
from Example 20.4), we obtain
P
2 n
Xn
log d η − log 1+e −η i=1 `0−1 (xi,j , yi )
yi , yi ) − `0−1 (xi,j , yi ) ≤
`0−1 (b + .
i=1
log 1+e2−η log 1+e2−η
Now, we make an asymptotic expansion to give the basic flavor of the result (this can be made
rigorous, but it is sufficient). First, we note that
2 η η2
log ≈ − ,
1 + e−η 2 8
and substituting this into the previous display, we have regret guarantee
n n
X log d X
yi , yi ) − `0−1 (xi,j , yi ) .
`0−1 (b +η `0−1 (xi,j , yi ). (20.2.3)
η
i=1 i=1
p
By making the choice η ≈ log d/n and noting that `0−1 ≤ 1, we obtain
n
X p
yi , yi ) − `0−1 (xi,j , yi ) .
`0−1 (b n log d
i=1
We make a few remarks on the preceding example to close the chapter. First, ideally we would
like to attain adaptive regret guarantees, meaning that the regret scales with the performance of
the bestPpredictor in inequality (20.2.3). In particular, we might expect that a good expert would
satisfy ni=1 `0−1 (xi,j , yi ) n, which—if we could choose
1
log d 2
η≈ Pn ,
i=1 `0−1 (xi,j ∗ , yi )
273
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Pn
where j ∗ = argminj i=1 `0−1 (xi,j , yi )—then we would attain regret bound
v
u n
X p
u
tlog d · `0−1 (xi,j ∗ , yi ) n log d.
i=1
For results of this form, see, for example, Cesa-Bianchi et al. [43] or the more recent work on mirror
descent of Steinhardt and Liang [133].
Secondly, we note that it is actually possible to give a regret bound of the form (20.2.3) without
relying on the near exp-concavity condition (20.2.1). In particular, performing mirror descent on
the convex losses defined by
d
X
π 7→
sign(xi,j )πj − sign(yi ),
j=1
√
which is convex, will give a regret bound of n log d for the zero-one loss as well. We leave this
exploration to the interested reader.
274
Chapter 21
A related notion to the universal prediction problem with alternate losses is that of online learning
and online convex optimization, where we modify the requirements of Chapter 20 further. In the
current setting, we essentially do away with distributional assumptions at all, including prediction
with a distribution, and we consider the following two player sequential game: we have a space W
in which we—the learner or first player—can play points w1 , w2 , . . ., while nature plays a sequence
of loss functions Lt : W → R. The goal is to guarantee that the regret
n
X
Lt (wt ) − Lt (w? )
(21.0.1)
t=1
grows at most sub-linearly with n, for any w? ∈ W (often, we desire this guarantee to be uniform).
As stated, this goal is too broad, so in this chapter we focus on a few natural restrictions, namely,
that the sequence of losses Lt are convex, and W is a convex subset of Rd . In this setting, the
problem (21.0.1) is known as online convex programming.
λw + (1 − λ)w0 ∈ W.
for all λ ∈ [0, 1] and w, w0 . The subgradient set, or subdifferential, of a convex function f at the
point w is defined to be
and we say that any vector g ∈ Rd satisfying f (v) ≥ f (w) + hg, v − wi for all v is a subgradient. For
convex functions, the subdifferential set ∂f (w) is essentially always non-empty for any w ∈ dom f .1
1
Rigorously, we are guaranteed that ∂f (w) 6= ∅ at all points w in the relative interior of the domain of f .
275
Stanford Statistics 311/Electrical Engineering 377 John Duchi
We now give several examples of convex functions, losses, and corresponding subgradients. The
first two examples are for classification problems, in which we receive data points x ∈ Rd and wish
to predict associated labels y ∈ {−1, 1}.
Example 21.1 (Support vector machines): In the support vector machine problem, we
receive data in pairs (xt , yt ) ∈ Rd × {−1, 1}, and the loss function
which is convex because it is the maximum of two linear functions. Moreover, the subgradient
set is
−yt xt
if yt hw, xt i < 1
∂Lt (w) = −λ · yt xt for λ ∈ [0, 1] if yt hw, xt i = 1
0 otherwise.
Example 21.2 (Logistic regression): As in the support vector machine, we receive data in
pairs (xt , yt ) ∈ Rd × {−1, 1}, and the loss function is
To see that this loss is convex, note that if h(t) = log(1 + et ), then h0 (t) = 1
1+e−t
and h00 (t) =
e−t
(1+e−t )2
≥ 0, and Lt is the composition of a linear transformation with h. In this case,
1
∂Lt (w) = ∇Lt (w) = − yt x t .
1+ eyt hxt ,wi
3
where we have defined the vector gt = [`0−1 (xt,j , yt )]dj=1 ∈ {0, 1}d . Notably, the expected zero-
one loss is convex (even linear), so that its online minimization falls into the online convex
programming framework. 3
As we see in the sequel, online convex programming approaches are often quite simple, and, in
fact, are often provably optimal in a variety of scenarios outside of online convex optimization. This
motivates our study, and we will see that online convex programming approaches have a number of
similarities to our regret minimization approaches in previous chapters on universal coding, regret,
and redundancy.
276
Stanford Statistics 311/Electrical Engineering 377 John Duchi
277
Stanford Statistics 311/Electrical Engineering 377 John Duchi
(w, ψ(w))
Bψ (w, v)
(v, ψ(v))
Example 21.5 (KL divergence as a Bregman divergence): Take ψ(w) = dj=1 wj log wj .
P
Then ψ is convex over the positive orthant Rd+ (the second derivative of w log w is 1/w), and
for w, v ∈ ∆d = {u ∈ Rd+ : h1, ui = 1}, we have
X X X X wj
Bψ (w, v) = wj log wj − vj log vj − (1 + log vj )(wj − vj ) = wj log = Dkl (w||v) ,
vj
j j j j
where in the final equality we treat w and v as probability distributions on {1, . . . , d}. 3
With these examples in mind, we now present the mirror descent algorithm, which is the natural
generalization of online gradient descent.
Before providing the analysis of Algorithm 21.3, we give a few examples of its implementation.
First, by taking W = Rd and ψ(w) = 12 kwk22 , we note that the mirror descent procedure simply
corresponds to the gradient update wt+1 = wt −ηt gt . We can also recover the exponentiated gradient
algorithm, also known as entropic mirror descent.
278
Stanford Statistics 311/Electrical Engineering 377 John Duchi
For example, a straightforward calculation shows that the dual to the `∞ -norm is the `1 -norm,
and the Euclidean norm k·k2 is self-dual (by the Cauchy-Schwarz inequality). Lastly, we require a
definition of functions of suitable curvature for use in mirror descent methods.
Definition 21.2. A convex function f : Rd → R is strongly convex with respect to the norm k·k
over the set W if for all w, v ∈ W and g ∈ ∂f (w) we have
1
f (v) ≥ f (w) + hg, v − wi + kw − vk2 .
2
That is, the function f is strongly convex if it grows at least quadratically fast at every point in its
domain. It is immediate from the definition of the Bregman divergence that ψ is strongly convex
if and only if
1
Bψ (w, v) ≥ kw − vk2 .
2
As two examples, we consider Euclidean distance and entropy. For the Euclidean distance, which
uses ψ(w) = 12 kwk22 , we have ∇ψ(w) = w, and
1 1 1 1
kvk22 = kw + v − wk22 = kwk22 + hw, v − wi + kw − vk22
2 2 2 2
by a calculation, so that ψ is strongly convex with respect to the Euclidean norm. We also have
the following observation.
279
Stanford Statistics 311/Electrical Engineering 377 John Duchi
Observation 21.7. Let $\psi(w) = \sum_j w_j\log w_j$ be the negative entropy. Then $\psi$ is strongly convex with respect to the $\ell_1$-norm, that is,
$$B_\psi(w, v) = D_{\mathrm{kl}}(w\|v) \ge \frac{1}{2}\|w - v\|_1^2.$$
Proof The result is an immediate consequence of Pinsker’s inequality, Proposition 2.10.
With these examples in place, we present the main theorem of this section.

Theorem 21.8 (Regret of mirror descent). Let $L_t$ be an arbitrary sequence of convex functions, and let $w_t$ be generated according to the mirror descent algorithm 21.3. Assume that the proximal function $\psi$ is strongly convex with respect to the norm $\|\cdot\|$, which has dual norm $\|\cdot\|_*$. Then for any $w^\star \in \mathcal{W}$: (a) with the fixed stepsize $\eta_t \equiv \eta$,
$$\sum_{t=1}^n [L_t(w_t) - L_t(w^\star)] \le \frac{1}{\eta} B_\psi(w^\star, w_1) + \frac{\eta}{2}\sum_{t=1}^n \|g_t\|_*^2,$$
and (b) with any non-increasing stepsize sequence $\eta_t$,
$$\sum_{t=1}^n [L_t(w_t) - L_t(w^\star)] \le \frac{1}{\eta_n}\max_{t\le n} B_\psi(w^\star, w_t) + \sum_{t=1}^n \frac{\eta_t}{2}\|g_t\|_*^2.$$
Before proving the theorem, we provide a few comments to exhibit its power. First, we consider the Euclidean case, where $\psi(w) = \frac{1}{2}\|w\|_2^2$, and we assume that the loss functions $L_t$ are all $L$-Lipschitz, meaning that $|L_t(w) - L_t(v)| \le L\|w - v\|_2$, which is equivalent to $\|g_t\|_2 \le L$ for all $g_t \in \partial L_t(w)$. In this case, the two regret bounds above become
$$\frac{1}{2\eta}\|w^\star - w_1\|_2^2 + \frac{\eta}{2}\, n L^2 \qquad\text{and}\qquad \frac{1}{2\eta_n}\, R^2 + \sum_{t=1}^n \frac{\eta_t}{2}\, L^2,$$
respectively, where in the second case we assumed that $\|w^\star - w_t\|_2 \le R$ for all $t$. In the former case, we take $\eta = \frac{R}{L\sqrt{n}}$, while in the second, we take $\eta_t = \frac{R}{L\sqrt{t}}$, which does not require knowledge of $n$ ahead of time. Focusing on the latter case, we have the following corollary.
Corollary 21.9. Assume that $\mathcal{W} \subset \{w \in \mathbb{R}^d : \|w\|_2 \le R\}$ and that the loss functions $L_t$ are $L$-Lipschitz with respect to the Euclidean norm. Take $\eta_t = \frac{R}{L\sqrt{t}}$. Then for all $w^\star \in \mathcal{W}$,
$$\sum_{t=1}^n [L_t(w_t) - L_t(w^\star)] \le 3RL\sqrt{n}.$$
Proof For any $w, w^\star \in \mathcal{W}$, we have $\|w - w^\star\|_2 \le 2R$, so that $B_\psi(w^\star, w) \le 2R^2$. Using that
$$\sum_{t=1}^n t^{-\frac{1}{2}} \le \int_0^n t^{-\frac{1}{2}}\, dt = 2\sqrt{n},$$
substituting these bounds into part (b) of Theorem 21.8 with $\eta_t = \frac{R}{L\sqrt{t}}$ gives $2RL\sqrt{n} + RL\sqrt{n} = 3RL\sqrt{n}$.
Now that we have presented the Euclidean variant of online convex optimization, we turn to an
example that achieves better performance in high dimensional settings, as long as the domain is
the probability simplex. (Recall Example 21.3 for motivation.) In this case, we have the following
corollary to Theorem 21.8.
Corollary 21.10. Assume that $\mathcal{W} = \Delta_d = \{w \in \mathbb{R}^d_+ : \langle 1, w\rangle = 1\}$ and take the proximal function $\psi(w) = \sum_j w_j\log w_j$ to be the negative entropy in the mirror descent procedure 21.3. Then with the fixed stepsize $\eta$ and initial point as the uniform distribution $w_1 = 1/d$, we have for any sequence of convex losses $L_t$
$$\sum_{t=1}^n [L_t(w_t) - L_t(w^\star)] \le \frac{\log d}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|g_t\|_\infty^2.$$
Proof Using Pinsker's inequality in the form of Observation 21.7, we have that $\psi$ is strongly convex with respect to $\|\cdot\|_1$. Consequently, taking the dual norm to be the $\ell_\infty$-norm, part (a) of Theorem 21.8 shows that
$$\sum_{t=1}^n [L_t(w_t) - L_t(w^\star)] \le \frac{1}{\eta}\sum_{j=1}^d w_j^\star\log\frac{w_j^\star}{w_{1,j}} + \frac{\eta}{2}\sum_{t=1}^n \|g_t\|_\infty^2.$$
Noting that with $w_1 = 1/d$, we have $B_\psi(w^\star, w_1) \le \log d$ for any $w^\star \in \mathcal{W}$ gives the result.
Corollary 21.10 yields somewhat sharper results than Corollary 21.9, though in the restricted setting that $\mathcal{W}$ is the probability simplex in $\mathbb{R}^d$. Indeed, let us assume that the subgradients $g_t \in [-1, 1]^d$, the hypercube in $\mathbb{R}^d$. In this case, the tightest possible bound on their $\ell_2$-norm is $\|g_t\|_2 \le \sqrt{d}$, while $\|g_t\|_\infty \le 1$ always. Similarly, if $\mathcal{W} = \Delta_d$ and $w_1 = 1/d$, we are only guaranteed that $\|w^\star - w_1\|_2 \le 1$. Thus, the best regret guaranteed by the Euclidean case (Corollary 21.9) is
$$\frac{1}{2\eta}\|w^\star - w_1\|_2^2 + \frac{\eta}{2}\, nd \le \sqrt{nd} \quad\text{with the choice } \eta = \frac{1}{\sqrt{nd}},$$
while the entropic mirror descent procedure (Alg. 21.3 with $\psi(w) = \sum_j w_j\log w_j$) guarantees
$$\frac{\log d}{\eta} + \frac{\eta}{2}\, n \le \sqrt{2n\log d} \quad\text{with the choice } \eta = \frac{\sqrt{2\log d}}{\sqrt{n}}. \tag{21.2.5}$$
The latter guarantee is exponentially better in the dimension. Moreover, the key insight is that
we essentially maintain a “prior,” and then perform “Bayesian”-like updating of the posterior
distribution wt at each time step, exactly as in the setting of redundancy minimization.
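To put the gap in perspective, a quick calculation: with $d = 10^6$ and any fixed $n$,
$$\sqrt{nd} = 1000\sqrt{n}, \qquad\text{while}\qquad \sqrt{2n\log d} = \sqrt{2n\cdot 6\log 10} \approx 5.3\sqrt{n},$$
nearly a 200-fold improvement, and the contrast only grows with the dimension.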
Lemma 21.12. Let $L_t : \mathcal{W} \to \mathbb{R}$ be any sequence of convex loss functions and $\eta_t$ be a non-increasing sequence, where $\eta_0 = \infty$. Then with the mirror descent strategy (21.2.4), for any $w^\star \in \mathcal{W}$ we have
$$\sum_{t=1}^n L_t(w_t) - L_t(w^\star) \le \sum_{t=1}^n \left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) B_\psi(w^\star, w_t) + \sum_{t=1}^n \left[-\frac{1}{\eta_t} B_\psi(w_{t+1}, w_t) + \langle g_t, w_t - w_{t+1}\rangle\right].$$
Proof Our proof follows by the application of a few key identities. First, we note that by convexity, we have for any $g_t \in \partial L_t(w_t)$ that
$$L_t(w_t) - L_t(w^\star) \le \langle g_t, w_t - w^\star\rangle. \tag{21.2.6}$$
Secondly, we have that because $w_{t+1}$ minimizes
$$\langle g_t, w\rangle + \frac{1}{\eta_t} B_\psi(w, w_t)$$
over $w \in \mathcal{W}$, Lemma 21.11 implies
$$\langle \eta_t g_t + \nabla\psi(w_{t+1}) - \nabla\psi(w_t),\, w - w_{t+1}\rangle \ge 0 \quad\text{for all } w \in \mathcal{W}. \tag{21.2.7}$$
Taking $w = w^\star$ in inequality (21.2.7) and making a substitution in inequality (21.2.6), we have
$$\begin{aligned} L_t(w_t) - L_t(w^\star) &\le \langle g_t, w_t - w^\star\rangle = \langle g_t, w_{t+1} - w^\star\rangle + \langle g_t, w_t - w_{t+1}\rangle \\ &\le \frac{1}{\eta_t}\langle \nabla\psi(w_{t+1}) - \nabla\psi(w_t),\, w^\star - w_{t+1}\rangle + \langle g_t, w_t - w_{t+1}\rangle \\ &= \frac{1}{\eta_t}\left[B_\psi(w^\star, w_t) - B_\psi(w^\star, w_{t+1}) - B_\psi(w_{t+1}, w_t)\right] + \langle g_t, w_t - w_{t+1}\rangle \end{aligned} \tag{21.2.8}$$
where the final equality (21.2.8) follows from algebraic manipulations of $B_\psi(w, w')$. Summing inequality (21.2.8) gives
$$\begin{aligned} \sum_{t=1}^n L_t(w_t) - L_t(w^\star) &\le \sum_{t=1}^n \frac{1}{\eta_t}\left[B_\psi(w^\star, w_t) - B_\psi(w^\star, w_{t+1}) - B_\psi(w_{t+1}, w_t)\right] + \sum_{t=1}^n \langle g_t, w_t - w_{t+1}\rangle \\ &= \sum_{t=2}^n \left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) B_\psi(w^\star, w_t) + \frac{1}{\eta_1} B_\psi(w^\star, w_1) - \frac{1}{\eta_n} B_\psi(w^\star, w_{n+1}) \\ &\qquad + \sum_{t=1}^n \left[-\frac{1}{\eta_t} B_\psi(w_{t+1}, w_t) + \langle g_t, w_t - w_{t+1}\rangle\right] \end{aligned}$$
as desired.
It remains to use the negative terms $-B_\psi(w_{t+1}, w_t)$ to cancel the gradient terms $\langle g_t, w_t - w_{t+1}\rangle$. To that end, we recall Definition 21.1 of the dual norm $\|\cdot\|_*$ and the strong convexity assumption on $\psi$. Using the Fenchel-Young inequality, we have
$$\langle g_t, w_t - w_{t+1}\rangle \le \|g_t\|_*\|w_t - w_{t+1}\| \le \frac{\eta_t}{2}\|g_t\|_*^2 + \frac{1}{2\eta_t}\|w_t - w_{t+1}\|^2.$$
Now, we use the strong convexity condition, which gives
$$-\frac{1}{\eta_t} B_\psi(w_{t+1}, w_t) \le -\frac{1}{2\eta_t}\|w_t - w_{t+1}\|^2.$$
Combining the preceding two displays in Lemma 21.12 gives the result of Theorem 21.8.
As stated, the bound of the proposition may not look substantially more powerful than Corollary 21.10, but a few remarks will exhibit its consequences. We prove the proposition in Section 21.4.1 to come.
First, we note that because $w_t \in \Delta_d$, we will always have $\sum_j w_{t,j} g_{t,j}^2 \le \|g_t\|_\infty^2$. So certainly the bound of Proposition 21.13 is never worse than that of Corollary 21.10. Sometimes this can be made tighter, however, as exhibited by the next corollary, which applies (for example) to the experts setting of Example 21.3. More specifically, we have $d$ experts, each suffering losses in $[0,1]$, and we seek to predict with the best of the $d$ experts.
Corollary 21.14. Consider the linear online convex optimization setting, that is, where $L_t(w_t) = \langle g_t, w_t\rangle$ for vectors $g_t$, and assume that $g_t \in \mathbb{R}^d_+$ with $\|g_t\|_\infty \le 1$. In addition, assume that we know an upper bound $L_n^\star$ on $\sum_{t=1}^n L_t(w^\star)$. Then taking the stepsize $\eta = \min\{1, \sqrt{\log d / L_n^\star}\}$, we have
$$\sum_{t=1}^n [L_t(w_t) - L_t(w^\star)] \le 3\max\left\{\log d,\ \sqrt{L_n^\star\log d}\right\}.$$
Note that when $L_t(w^\star) = 0$ for all $t$, which corresponds to a perfect expert in Example 21.3, the upper bound becomes constant in $n$, yielding $3\log d$ as a bound on the regret. Unfortunately, in our bound of Corollary 21.14, we had to assume that we knew ahead of time a bound on the loss of the best predictor $w^\star$, which is unrealistic in practice. There are a number of techniques for dealing with such issues, including a standard one in the online learning literature known as the doubling trick. We explore some in the exercises.
Proof First, we note that $\sum_j w_j g_{t,j}^2 \le \langle w, g_t\rangle$ for any nonnegative vector $w$, as $g_{t,j} \in [0,1]$. Thus, Proposition 21.13 gives
$$\sum_{t=1}^n [L_t(w_t) - L_t(w^\star)] \le \frac{\log d}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \langle w_t, g_t\rangle = \frac{\log d}{\eta} + \frac{\eta}{2}\sum_{t=1}^n L_t(w_t).$$
Lemma 21.15. Let $\psi(w) = \sum_j w_j\log w_j$ be the negative entropy, let $x \in \Delta_d$ and $g \in \mathbb{R}^d_+$, and define $y \in \Delta_d$ by
$$y_i = \frac{x_i\exp(-\eta g_i)}{\sum_j x_j\exp(-\eta g_j)}.$$
Then
$$-\frac{1}{\eta} B_\psi(y, x) + \langle g, x - y\rangle \le \frac{\eta}{2}\sum_{i=1}^d g_i^2 x_i.$$
Deferring the proof of the lemma, we note that it precisely applies to the setting of Lemma 21.12. Indeed, with a fixed stepsize $\eta$, we have
$$\sum_{t=1}^n L_t(w_t) - L_t(w^\star) \le \frac{1}{\eta} B_\psi(w^\star, w_1) + \sum_{t=1}^n \left[-\frac{1}{\eta} B_\psi(w_{t+1}, w_t) + \langle g_t, w_t - w_{t+1}\rangle\right].$$
Earlier, we used the strong convexity of $\psi$ to eliminate the gradient terms $\langle g_t, w_t - w_{t+1}\rangle$ using the Bregman divergence $B_\psi$. This time, we use Lemma 21.15: setting $y = w_{t+1}$ and $x = w_t$ yields the bound
$$\sum_{t=1}^n L_t(w_t) - L_t(w^\star) \le \frac{1}{\eta} B_\psi(w^\star, w_1) + \frac{\eta}{2}\sum_{t=1}^n\sum_{i=1}^d g_{t,i}^2 w_{t,i}$$
as desired.
Proof of Lemma 21.15 We begin by noting that a direct calculation yields $B_\psi(y, x) = D_{\mathrm{kl}}(y\|x) = \sum_i y_i\log\frac{y_i}{x_i}$. Substituting the values for $x$ and $y$ into this expression, we have
$$\sum_i y_i\log\frac{y_i}{x_i} = \sum_i y_i\log\frac{x_i\exp(-\eta g_i)}{x_i\left(\sum_j\exp(-\eta g_j)x_j\right)} = -\eta\langle g, y\rangle - \sum_i y_i\log\Big(\sum_j x_j e^{-\eta g_j}\Big).$$
Now we use a Taylor expansion of the function $g \mapsto \log(\sum_j x_j e^{-\eta g_j})$ around the point $0$. If we define $p(g)$ to be the probability vector with coordinates $p_j(g) = x_j e^{-\eta g_j}/\sum_k x_k e^{-\eta g_k}$, then Taylor's theorem yields
$$\log\sum_j x_j e^{-\eta g_j} = \log(\langle 1, x\rangle) - \eta\langle p(0), g\rangle + \frac{\eta^2}{2}\, g^\top\big(\mathrm{diag}(p(\widetilde g)) - p(\widetilde g)p(\widetilde g)^\top\big)g,$$
where $\widetilde g = \lambda g$ for some $\lambda \in [0,1]$. Noting that $p(0) = x$ and $\langle 1, x\rangle = \langle 1, y\rangle = 1$, we obtain
$$B_\psi(y, x) = -\eta\langle g, y\rangle + \log(1) + \eta\langle g, x\rangle - \frac{\eta^2}{2}\, g^\top\big(\mathrm{diag}(p(\widetilde g)) - p(\widetilde g)p(\widetilde g)^\top\big)g,$$
whence
$$-\frac{1}{\eta} B_\psi(y, x) + \langle g, x - y\rangle \le \frac{\eta}{2}\sum_{i=1}^d g_i^2\, p_i(\widetilde g). \tag{21.4.1}$$
Now define $s(\lambda) := \sum_{i=1}^d g_i^2\, p_i(\lambda g)$, so that the right side of inequality (21.4.1) is $\frac{\eta}{2} s(\lambda)$ for some $\lambda \in [0,1]$. Using the Fenchel-Young inequality, we have $ab \le \frac{1}{3}|a|^3 + \frac{2}{3}|b|^{3/2}$ for any $a, b$, so $g_i g_j^2 \le \frac{1}{3} g_i^3 + \frac{2}{3} g_j^3$. This implies that the numerator in our expression for $s'(\lambda)$ is non-positive. Thus we have $s(\lambda) \le s(0) = \sum_{i=1}^d g_i^2 x_i$, which gives the result when combined with inequality (21.4.1).
Chapter 22
Consider the following problem: we have a possible treatment for a population with a disease, but we do not know whether the treatment will have a positive effect or not. We wish to evaluate the treatment to decide whether it is better to apply it or not, and we wish to optimally allocate our resources to attain the best outcome possible. There are challenges here, however, because for each patient, we may only observe the patient's behavior and disease status in one of two possible states—under treatment or under control—and we wish to allocate as few patients as possible to the group with worse outcomes (be they control or treatment). This balancing act between exploration—observing the effects of treatment or non-treatment—and exploitation—giving treatment or not as we decide which has better palliative outcomes—underpins and is the paradigmatic aspect of the multi-armed bandit problem.¹
Our main focus in this chapter is a fairly simple variant of the $K$-armed bandit problem, though we note that there is a substantial literature in statistics, operations research, economics, game theory, and computer science on variants of the problems we consider. In particular, we consider the following sequential decision making scenario. We assume that there are $K$ distributions $P_1, \ldots, P_K$ on $\mathbb{R}$, which we identify (with no loss of generality) with $K$ random variables $Y_1, \ldots, Y_K$. Each random variable $Y_i$ has mean $\mu_i$ and is $\sigma^2$-sub-Gaussian, meaning that
$$\mathbb{E}\left[\exp\left(\lambda(Y_i - \mu_i)\right)\right] \le \exp\Big(\frac{\lambda^2\sigma^2}{2}\Big). \tag{22.0.1}$$
The goal is to find the index $i$ with the maximal mean $\mu_i$ without evaluating sub-optimal “arms” (or random variables $Y_i$) too often. At each iteration $t$ of the process, the player takes an action $A_t \in \{1, \ldots, K\}$, then, conditional on $i = A_t$, observes a reward $Y_i(t)$ drawn independently from the distribution $P_i$. The goal is then to minimize the regret after $n$ steps, which is
$$\mathrm{Reg}_n := \sum_{t=1}^n \left(\mu_{i^\star} - \mu_{A_t}\right), \tag{22.0.2}$$
¹The problem is called the bandit problem in the literature because we imagine a player in a casino, choosing between $K$ different slot machines (hence a $K$-armed bandit, as this is a casino and the player will surely lose eventually), each with a different unknown reward distribution. The player wishes to put as much of his money as possible into the machine with the greatest expected reward.
where $i^\star \in \mathrm{argmax}_i\, \mu_i$, so $\mu_{i^\star} = \max_i \mu_i$. The regret $\mathrm{Reg}_n$ as defined is a random quantity, so we generally seek to give bounds on its expectation or high-probability guarantees on its value. In this chapter, we generally focus for simplicity on the expected regret,
$$\mathrm{Reg}_n := \mathbb{E}\bigg[\sum_{t=1}^n \left(\mu_{i^\star} - \mu_{A_t}\right)\bigg], \tag{22.0.3}$$
where the expectation is taken over any randomness in the player's actions $A_t$ and in the repeated observations of the random variables $Y_1, \ldots, Y_K$.
Letting $T_i(t) := \sum_{\tau\le t}\mathbf{1}\{A_\tau = i\}$ denote the number of times arm $i$ has been pulled by time $t$, and defining $\hat\mu_i(t) = \frac{1}{T_i(t)}\sum_{\tau\le t : A_\tau = i} Y_i(\tau)$ to be the running average of the rewards of arm $i$ at time $t$ (computed only on those instances in which arm $i$ was selected), we claim that for all $i$ and all $t$,
$$P\Bigg(\hat\mu_i(t) \ge \mu_i + \sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{T_i(t)}}\Bigg) \vee P\Bigg(\hat\mu_i(t) \le \mu_i - \sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{T_i(t)}}\Bigg) \le \delta. \tag{22.1.1}$$
That is, so long as we pull the arms sufficiently many times, we are unlikely to pull the wrong arm. We prove the claim (22.1.1) in the appendix to this chapter.
Here then is the UCB (upper confidence bound) procedure of Figure 22.1: at each round $t$, play the arm maximizing the optimistic index $\hat\mu_i(t) + \sqrt{\sigma^2\log(1/\delta_t)/T_i(t)}$; a sketch follows.
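In code, the procedure is a few lines; the following is a minimal sketch under naming conventions of our own choosing (not the notes' Figure 22.1 verbatim), assuming numpy and a list of reward-sampling callables:

```python
import numpy as np

def ucb(arms, n, sigma2, delta):
    """UCB sketch: pull each arm once, then play the arm maximizing mean + confidence width.

    arms:   list of K callables; arms[i]() draws a reward from P_i.
    delta:  callable t -> delta_t, e.g. lambda t: 1.0 / t**2.
    """
    K = len(arms)
    counts = np.zeros(K)   # T_i(t), the number of pulls of arm i
    totals = np.zeros(K)   # running sums of observed rewards
    for i in range(K):     # initialization: one pull per arm
        totals[i] += arms[i]()
        counts[i] += 1
    for t in range(K + 1, n + 1):
        width = np.sqrt(sigma2 * np.log(1.0 / delta(t)) / counts)
        i = int(np.argmax(totals / counts + width))
        totals[i] += arms[i]()
        counts[i] += 1
    return counts
```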
If we define
$$\Delta_i := \mu_{i^\star} - \mu_i$$
to be the gap in means between the optimal arm and any sub-optimal arm, we then obtain the following guarantee on the expected number of pulls of any sub-optimal arm $i$ after $n$ steps.

Proposition 22.1. Assume that each of the $K$ arms is $\sigma^2$-sub-Gaussian and let the sequence $\delta_1 \ge \delta_2 \ge \cdots$ be non-increasing and positive. Then for any $n$ and any arm $i \ne i^\star$,
$$\mathbb{E}[T_i(n)] \le \bigg\lceil\frac{4\sigma^2\log\frac{1}{\delta_n}}{\Delta_i^2}\bigg\rceil + 2\sum_{t=2}^n \delta_t.$$
Proof Without loss of generality, we assume arm 1 satisfies $\mu_1 = \max_i \mu_i$, and let arm $i$ be any sub-optimal arm. The key insight is to carefully consider what occurs if we play arm $i$ in the UCB procedure of Figure 22.1. In particular, if we play arm $i$ at time $t$, then we certainly have
$$\hat\mu_i(t) + \sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{T_i(t)}} \ge \hat\mu_1(t) + \sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{T_1(t)}}.$$
For this to occur, at least one of the following three events must occur (we suppress the dependence on $i$ for each of them):
$$\mathcal{E}_1(t) := \Bigg\{\hat\mu_i(t) \ge \mu_i + \sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{T_i(t)}}\Bigg\}, \qquad \mathcal{E}_2(t) := \Bigg\{\hat\mu_1(t) \le \mu_1 - \sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{T_1(t)}}\Bigg\},$$
$$\mathcal{E}_3(t) := \Bigg\{\Delta_i \le 2\sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{T_i(t)}}\Bigg\}.$$
Indeed, suppose that none of the events $\mathcal{E}_1$, $\mathcal{E}_2$, $\mathcal{E}_3$ occur at time $t$. Then we have
$$\hat\mu_i(t) + \sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{T_i(t)}} < \mu_i + 2\sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{T_i(t)}} < \mu_i + \Delta_i = \mu_1 < \hat\mu_1(t) + \sqrt{\frac{\sigma^2\log\frac{1}{\delta_t}}{T_1(t)}},$$
the inequalities following by the failure of $\mathcal{E}_1$, $\mathcal{E}_3$, and $\mathcal{E}_2$, respectively, so that arm $i$ could not have been played.
Now, for any $l \in \{1, \ldots, n\}$, we see that
$$\mathbb{E}[T_i(n)] = \sum_{t=1}^n \mathbb{E}[\mathbf{1}\{A_t = i\}] = \sum_{t=1}^n \mathbb{E}\left[\mathbf{1}\{A_t = i, T_i(t) > l\} + \mathbf{1}\{A_t = i, T_i(t) \le l\}\right] \le l + \sum_{t=l+1}^n P(A_t = i, T_i(t) > l).$$
Naturally, the number of times arm $i$ is selected in the sequential game is related to the regret of a procedure; indeed, we have
$$\mathrm{Reg}_n = \sum_{t=1}^n \left(\mu_{i^\star} - \mu_{A_t}\right) = \sum_{i=1}^K \left(\mu_{i^\star} - \mu_i\right) T_i(n) = \sum_{i=1}^K \Delta_i\, T_i(n).$$
Using this identity, we immediately obtain two theorems on the (expected) regret of the UCB
algorithm.
Theorem 22.2. Let $\delta_t = \delta/t^2$ for all $t$. Then for any $n \in \mathbb{N}$ the UCB algorithm attains
$$\mathrm{Reg}_n \le \sum_{i\ne i^\star}\frac{4\sigma^2\left[2\log n - \log\delta\right]}{\Delta_i} + \Big(\frac{\pi^2}{3} - 2\Big)\delta\sum_{i=1}^K \Delta_i + \sum_{i=1}^K \Delta_i.$$
Proof With $\delta_t = \delta/t^2$ we have $\log\frac{1}{\delta_n} = 2\log n - \log\delta$, so that
$$\mathbb{E}[T_i(n)] \le \frac{4\sigma^2\left[2\log n - \log\delta\right]}{\Delta_i^2} + 1 + 2\delta\sum_{t=2}^n t^{-2}$$
by Proposition 22.1. Summing over $i \ne i^\star$ and noting that $\sum_{t\ge 2} t^{-2} = \pi^2/6 - 1$ gives the result.
Let us unpack the bound of Theorem 22.2 slightly. First, we make the simplifying assumption that $\delta_t = 1/t^2$ for all $t$, and let $\Delta = \min_{i\ne i^\star}\Delta_i$. In this case, we have expected regret bounded by
$$\mathrm{Reg}_n \le 8\frac{K\sigma^2\log n}{\Delta} + \frac{\pi^2 + 1}{3}\sum_{i=1}^K \Delta_i.$$
So we see that the asymptotic regret with this choice of $\delta$ scales as $(K\sigma^2/\Delta)\log n$, roughly linear in the number of arms, logarithmic in $n$, and inversely proportional to the gap in means. As a concrete example, if we know that the rewards for each arm $Y_i$ belong to the interval $[0,1]$, then Hoeffding's lemma (recall Example 3.6) states that we may take $\sigma^2 = 1/4$. Thus the mean regret becomes at most $\sum_{i : \Delta_i > 0}\frac{2\log n}{\Delta_i}(1 + o(1))$, where the $o(1)$ term tends to zero as $n \to \infty$.
If we knew a bit more about our problem, then by optimizing over $\delta$ and choosing $\delta = \sigma^2/\Delta$, we obtain the upper bound
$$\mathrm{Reg}_n \le O(1)\bigg[\frac{K\sigma^2}{\Delta}\log\frac{n\Delta}{\sigma^2} + K\frac{\max_i\Delta_i}{\min_i\Delta_i}\bigg], \tag{22.1.2}$$
that is, the expected regret scales asymptotically as $(K\sigma^2/\Delta)\log(\frac{n\Delta}{\sigma^2})$—linearly in the number of arms, logarithmically in $n$, and inversely proportional to the gap between the largest and other means.
If any of the gaps $\Delta_i \to 0$ in the bound of Theorem 22.2, the bound becomes vacuous—it simply says that the regret is upper bounded by infinity. Intuitively, however, pulling a slightly sub-optimal arm should be insignificant for the regret. With that in mind, we present a slight variant of the above bounds, which has a worse scaling with $n$—the bound scales as $\sqrt{n}$ rather than $\log n$—but is independent of the gaps $\Delta_i$.
Theorem 22.3. Let $\delta_t = 1/t^2$. Then for any $n$, the UCB algorithm attains
$$\mathrm{Reg}_n \le \sqrt{8K\sigma^2\, n\log n} + 4\sum_{i=1}^K \Delta_i.$$
Proof Fix any $\gamma > 0$. Then we may write the regret with the standard identity
$$\mathrm{Reg}_n = \sum_{i\ne i^\star}\Delta_i\, T_i(n) = \sum_{i:\Delta_i\ge\gamma}\Delta_i\, T_i(n) + \sum_{i:\Delta_i<\gamma}\Delta_i\, T_i(n) \le \sum_{i:\Delta_i\ge\gamma}\Delta_i\, T_i(n) + n\gamma.$$
Combining the above two theorems, we see that the UCB algorithm with parameters $\delta_t = 1/t^2$ automatically achieves the expected regret guarantee
$$\mathrm{Reg}_n \le C\cdot\min\Bigg\{\sum_{i:\Delta_i>0}\frac{\sigma^2\log n}{\Delta_i},\ \sqrt{K\sigma^2\, n\log n}\Bigg\}. \tag{22.1.3}$$
That is, UCB enjoys some adaptive behavior. It is not, however, optimal; there are algorithms, including Audibert and Bubeck's MOSS (Minimax Optimal in the Stochastic Case) bandit procedure [11], which achieve regret
$$\mathrm{Reg}_n \le C\cdot\min\bigg\{\sqrt{Kn},\ \frac{K}{\Delta}\log\frac{n\Delta^2}{K}\bigg\},$$
which is essentially the bound specified by inequality (22.1.2) (which required knowledge of the $\Delta_i$s) and an improvement by $\log n$ over the analysis of Theorem 22.3. It is also possible to provide a high-probability guarantee for the UCB algorithms, which follows essentially immediately from the proof techniques of Proposition 22.1, but we leave this to the interested reader.
collection of distributions $\mathcal{P} = \{P_\theta\}_{\theta\in\Theta}$ parameterized by a set $\Theta$ (often, this is some subset of $\mathbb{R}^K$ when we look at $K$-armed bandit problems with $\mathrm{card}(\mathcal{A}) = K$, but we stay in this abstract setting temporarily). We also have a loss function $L : \mathcal{A}\times\Theta \to \mathbb{R}$ that measures the quality of an action $a \in \mathcal{A}$ for the parameter $\theta$.
Example 22.4 (Classical Bernoulli bandit problem): The classical bandit problem, as in the UCB case of the previous section, has actions (arms) $\mathcal{A} = \{1, \ldots, K\}$ and parameter space $\Theta = [0,1]^K$, and we have that $P_\theta$ is a distribution on $Y \in \{0,1\}^K$, where $Y$ has independent coordinates $1, \ldots, K$ with $P(Y_j = 1) = \theta_j$, that is, $Y_j \sim \mathrm{Bernoulli}(\theta_j)$. The goal is to find the arm with highest mean reward, that is, $\mathrm{argmax}_j\, \theta_j$, and thus possible loss functions include $L(a,\theta) = -\theta_a$ or, if we wish the loss to be positive, $L(a,\theta) = 1 - \theta_a \in [0,1]$. 3
Lastly, in this Bayesian setting, we require a prior distribution $\pi$ on the space $\Theta$, where $\pi(\Theta) = 1$. We then define the Bayesian regret as
$$\mathrm{Reg}_n(\mathsf{A}, L, \pi) = \mathbb{E}_\pi\bigg[\sum_{t=1}^n L(A_t, \theta) - L(A^\star, \theta)\bigg], \tag{22.2.1}$$
where $A^\star \in \mathrm{argmin}_{a\in\mathcal{A}} L(a,\theta)$ is the minimizer of the loss, and $A_t \in \mathcal{A}$ is the action the player takes at time $t$ of the process. The expectation (22.2.1) is taken both over the randomness in $\theta$ according to the prior $\pi$ and any randomness in the player's strategy for choosing the actions $A_t$ at each time.
Our approaches in this section build off of those in Chapter 19, except that we no longer fully
observe the desired observations Y —we may only observe YAt (t) at time t, which may provide less
information. The broad algorithmic framework for this section is as follows. We now give several
concrete instantiations of this broad procedure, as well as tools (both information-theoretic and
otherwise) for its analysis.
the distribution on $\theta$ conditional on $\mathcal{H}_{t-1}$. This procedure was originally proposed by Thompson [135] in 1933, in the first paper on bandit problems. There are several analyses of Thompson (and related Bayesian) procedures possible; our first analysis proceeds by using confidence bounds, while our later analyses are more information-theoretic.
First, we provide a more concrete specification of Algorithm 22.2 for Thompson (posterior) sampling in the case of Bernoulli rewards.
Example 22.5 (Thompson sampling with Bernoulli penalties): Let us suppose that the vector $\theta \in [0,1]^K$, and we draw each coordinate $\theta_i \sim \mathrm{Beta}(1,1)$ independently, which corresponds to the uniform distribution on $[0,1]^K$. The actions available are simply to select one of the coordinates, $a \in \mathcal{A} = \{1, \ldots, K\}$, and we observe $Y_a \sim \mathrm{Bernoulli}(\theta_a)$, that is, $P(Y_a = 1 \mid \theta) = \theta_a$. That is, $L(a,\theta) = \theta_a$. Let $T_a^1(t) = \mathrm{card}\{\tau \le t : A_\tau = a, Y_a(\tau) = 1\}$ be the number of times arm $a$ is pulled and results in a loss of 1 by time $t$, and similarly let $T_a^0(t) = \mathrm{card}\{\tau \le t : A_\tau = a, Y_a(\tau) = 0\}$. Then, recalling Example 19.6 on Beta-Bernoulli distributions, Thompson sampling proceeds as follows:
(1) For each arm $a \in \mathcal{A} = \{1, \ldots, K\}$, draw $\theta_a(t) \sim \mathrm{Beta}(1 + T_a^1(t), 1 + T_a^0(t))$.
(2) Play the action $A_t = \mathrm{argmin}_a\, \theta_a(t)$.
(3) Observe the loss $Y_{A_t}(t) \in \{0,1\}$, and increment the appropriate count.
Thompson sampling is simple in this case, and it is implementable with just a few counters; a sketch follows. 3
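A minimal sketch of these three steps, under assumptions and names of our own choosing (numpy, losses given as callables), might look as follows:

```python
import numpy as np

def thompson_bernoulli(loss_fns, n, seed=0):
    """Beta-Bernoulli Thompson sampling; loss_fns[a]() returns a loss Y_a in {0, 1}."""
    rng = np.random.default_rng(seed)
    K = len(loss_fns)
    ones = np.zeros(K)    # T_a^1(t): pulls of arm a resulting in loss 1
    zeros = np.zeros(K)   # T_a^0(t): pulls of arm a resulting in loss 0
    for _ in range(n):
        theta = rng.beta(1 + ones, 1 + zeros)  # posterior draw for every arm at once
        a = int(np.argmin(theta))              # play the arm with smallest sampled loss
        y = loss_fns[a]()
        ones[a] += y
        zeros[a] += 1 - y
    return ones, zeros
```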
We may extend Example 22.5 to the case in which the losses come from any distribution with mean $\theta_i$, so long as the distribution is supported on $[0,1]$. In particular, we have the following example.
Example 22.6 (Thompson sampling with bounded random losses): Let us again consider the setting of Example 22.5, except that the observed losses $Y_a(t) \in [0,1]$ with $\mathbb{E}[Y_a \mid \theta] = \theta_a$. The following modification allows us to perform Thompson sampling in this case, even without knowing the distribution of $Y_a \mid \theta$: instead of observing a loss $Y_a \in \{0,1\}$, we construct a random observation $\widetilde Y_a \in \{0,1\}$ with the property that $P(\widetilde Y_a = 1 \mid Y_a) = Y_a$. Then the losses $L(a,\theta) = \theta_a$ are identical, and the posterior distribution over $\theta$ is still a Beta distribution. We simply redefine
$$T_a^0(t) := \mathrm{card}\{\tau \le t : A_\tau = a, \widetilde Y_a(\tau) = 0\} \quad\text{and}\quad T_a^1(t) := \mathrm{card}\{\tau \le t : A_\tau = a, \widetilde Y_a(\tau) = 1\}. \qquad 3$$
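The randomized binarization is one line in code (a sketch, assuming a numpy generator `rng` and an observed loss `y` in $[0,1]$):

```python
# Construct ytilde in {0, 1} with P(ytilde = 1 | y) = y, so E[ytilde | y] = y
# and the Beta-Bernoulli posterior updates of Example 22.5 still apply:
ytilde = int(rng.random() < y)
```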
Our first analysis shows that Thompson sampling can guarantee performance similar to (or in some cases, better than) confidence-based procedures, which we do by using a sequence of (potential) lower and upper bounds on the losses of actions. (Recall that we wish to minimize our losses, so that we would optimistically play those arms with the lowest estimated loss.) This analysis is based on that of Russo and Van Roy [127]. Let $L_t : \mathcal{A}\to\mathbb{R}$ and $U_t : \mathcal{A}\to\mathbb{R}$ be arbitrary sequences of (random) functions that are measurable with respect to $\mathcal{H}_{t-1}$, that is, they are constructed based only on $\{A_1, Y_{A_1}(1), \ldots, A_{t-1}, Y_{A_{t-1}}(t-1)\}$. Then we can decompose the
where the final inequality uses that $(a+b)^2 \ge a^2 + b^2$ for $ab \ge 0$. We have an identical bound for $P(L_t(a) \ge L(a,\theta) + \epsilon \mid T_a(t))$.
We may now bound the final two sums in the regret expansion (22.2.2) using inequality (22.2.3). First, however, we make the observation that for any nonnegative random variable $Z$, we have $\mathbb{E}[Z] = \int_0^\infty P(Z \ge \epsilon)\, d\epsilon$. Using this, we have
$$\begin{aligned} \sum_{t=1}^n \mathbb{E}_\pi\left[L(A_t,\theta) - U_t(A_t)\right] &\le \sum_{t=1}^n\sum_{a\in\mathcal{A}}\mathbb{E}_\pi\left[L(a,\theta) - U_t(a)\right]_+ \\ &= \sum_{t=1}^n\sum_{a\in\mathcal{A}}\mathbb{E}_\pi\bigg[\int_0^\infty P(U_t(a) \ge L(a,\theta) + \epsilon \mid T_a(t))\, d\epsilon\bigg] \\ &\stackrel{(i)}{\le} \delta\sum_{t=1}^n\sum_{a\in\mathcal{A}}\mathbb{E}_\pi\bigg[\int_0^\infty\exp\Big(-\frac{T_a(t)\epsilon^2}{2\sigma^2}\Big)d\epsilon\bigg] \stackrel{(ii)}{=} \delta\sum_{t=1}^n\sum_{a\in\mathcal{A}}\mathbb{E}_\pi\bigg[\sqrt{\frac{\pi\sigma^2}{2T_a(t)}}\bigg], \end{aligned}$$
where inequality (i) uses the bound (22.2.3) and equality (ii) uses that this is the integral of half of a normal density. Substituting this bound, as well as the identical one for the terms involving $L_t(A_t^\star)$, into the decomposition (22.2.2) yields
$$\mathrm{Reg}_n(\mathsf{A}, L, \pi) \le \sum_{t=1}^n \mathbb{E}_\pi\left[U_t(A_t) - L_t(A_t)\right] + \delta\sum_{t=1}^n\sum_{a\in\mathcal{A}}\mathbb{E}_\pi\bigg[\sqrt{\frac{2\pi\sigma^2}{T_a(t)}}\bigg].$$
Using that $T_a(t) \ge 1$ for each action $a$, we have $\sum_{a\in\mathcal{A}}\mathbb{E}_\pi\left[\sqrt{2\pi\sigma^2/T_a(t)}\right] < 3\sigma|\mathcal{A}|$. Lastly, we use that
$$U_t(A_t) - L_t(A_t) = 2\sqrt{\frac{2\sigma^2\log\frac{1}{\delta}}{T_{A_t}(t)}}.$$
Thus we have
$$\sum_{t=1}^n \mathbb{E}_\pi\left[U_t(A_t) - L_t(A_t)\right] = 2\sqrt{2\sigma^2\log\frac{1}{\delta}}\;\sum_{a\in\mathcal{A}}\mathbb{E}_\pi\bigg[\sum_{t : A_t = a}\frac{1}{\sqrt{T_a(t)}}\bigg].$$
Once we see that $\sum_{t=1}^T t^{-\frac{1}{2}} \le \int_0^T t^{-\frac{1}{2}}\, dt = 2\sqrt{T}$, we have the upper bound
$$\mathrm{Reg}_n(\mathsf{A}, L, \pi) \le 4\sqrt{2\sigma^2\log\frac{1}{\delta}}\;\sum_{a\in\mathcal{A}}\mathbb{E}_\pi\big[\sqrt{T_a(n)}\big] + 3n\delta\sigma|\mathcal{A}|.$$
As $\sum_{a\in\mathcal{A}} T_a(n) = n$, the Cauchy-Schwarz inequality implies $\sum_{a\in\mathcal{A}}\sqrt{T_a(n)} \le \sqrt{|\mathcal{A}|n}$, which gives the result.
An immediate corollary of Theorem 22.7 is the following result, which applies in the case of bounded losses $Y_a$ as in Examples 22.5 and 22.6.

Corollary 22.8. Let the losses $Y_a \in [0,1]$ with $\mathbb{E}[Y_a \mid \theta] = \theta_a$, where $\theta_i \stackrel{\mathrm{iid}}{\sim} \mathrm{Beta}(1,1)$ for $i = 1, \ldots, K$. Then Thompson sampling satisfies
$$\mathrm{Reg}_n(\mathsf{A}, L, \pi) \le 3\sqrt{Kn\log n} + \frac{3}{2}K.$$
Lemma 22.9. Let $Y \in \mathbb{R}^K$, let $w \in \Delta_K$ have strictly positive coordinates, and given $A \sim w$, define $\widetilde Y \in \mathbb{R}^K$ by $\widetilde Y_j = \frac{Y_j}{w_j}\mathbf{1}\{A = j\}$. Then $\mathbb{E}[\widetilde Y \mid Y] = Y$.
Proof The proof is immediate: for each coordinate $j$ of $\widetilde Y$, we have $\mathbb{E}[\widetilde Y_j \mid Y] = w_j\cdot\frac{Y_j}{w_j} = Y_j$.
Lemma 22.9 suggests the following procedure, which gives rise to (a variant of) Auer et al.'s EXP3 (Exponentiated gradient for Exploration and Exploitation) algorithm [13]; a sketch appears below. We can prove the following bound on the expected regret of the EXP3 Algorithm 22.3 by leveraging our refined analysis of exponentiated gradients in Proposition 21.13.
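Concretely, here is a minimal sketch of such a variant (our own hypothetical implementation, combining the exponentiated gradient update with the importance-weighted estimates of Lemma 22.9, assuming numpy):

```python
import numpy as np

def exp3(loss_fns, n, eta, seed=0):
    """EXP3 sketch for loss minimization; loss_fns[a]() returns a nonnegative loss draw."""
    rng = np.random.default_rng(seed)
    K = len(loss_fns)
    w = np.full(K, 1.0 / K)
    for _ in range(n):
        a = rng.choice(K, p=w)             # sample the arm A_t ~ w_t
        g = np.zeros(K)
        g[a] = loss_fns[a]() / w[a]        # unbiased estimate: E[g | w] = E[Y] (Lemma 22.9)
        w = w * np.exp(-eta * g)           # exponentiated gradient step on the estimate
        w = w / np.sum(w)
    return w
```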
Proposition 22.10. Assume that for each $j$ we have $\mathbb{E}[Y_j^2] \le \sigma^2$ and the observed losses $Y_j \ge 0$. Then Alg. 22.3 attains expected regret (we are minimizing)
$$\mathrm{Reg}_n = \sum_{t=1}^n \mathbb{E}[\mu_{A_t} - \mu_{i^\star}] \le \frac{\log K}{\eta} + \frac{\eta}{2}\sigma^2 Kn.$$
In particular, choosing $\eta = \sqrt{\log K/(K\sigma^2 n)}$ gives
$$\mathrm{Reg}_n = \sum_{t=1}^n \mathbb{E}[\mu_{A_t} - \mu_{i^\star}] \le \frac{3}{2}\sigma\sqrt{Kn\log K}.$$
Recall the exponentiated gradient update
$$w_{t+1,i} = \frac{w_{t,i}\exp(-\eta g_{t,i})}{\sum_j w_{t,j}\exp(-\eta g_{t,j})}.$$
Proof With Lemma 22.9 in place, we recall the refined regret bound of Proposition 21.13. We have that for $w^\star \in \Delta_K$ and any sequence of vectors $g_1, g_2, \ldots$ with $g_t \in \mathbb{R}^K_+$, exponentiated gradient descent achieves
$$\sum_{t=1}^n \langle g_t, w_t - w^\star\rangle \le \frac{\log K}{\eta} + \frac{\eta}{2}\sum_{t=1}^n\sum_{j=1}^K w_{t,j}\, g_{t,j}^2.$$
Moreover, by Lemma 22.9 the importance-weighted loss estimates satisfy $\mathbb{E}[g_t \mid w_t] = \mathbb{E}[Y] = \mu$. This careful normalizing, allowed by Proposition 21.13, is essential to our analysis (and fails for more naive applications of online convex optimization bounds). In particular, we have
$$\mathrm{Reg}_n = \sum_{t=1}^n \mathbb{E}[\langle\mu, w_t - w^\star\rangle] = \sum_{t=1}^n \mathbb{E}[\langle g_t, w_t - w^\star\rangle] \le \frac{\log K}{\eta} + \frac{\eta}{2}\, n\,\mathbb{E}\left[\|Y\|_2^2\right].$$
When the random observed losses $Y_a(t)$ are bounded in $[0,1]$, we have the mean regret bound $\frac{3}{2}\sqrt{Kn\log K}$, which is as sharp as any of our other bounds.
For more recent work in machine learning there are far too many references to list; the books of Cesa-Bianchi and Lugosi [42] and Bubeck and Cesa-Bianchi [37] are good references. The papers of Auer et al. [12] and Auer et al. [13] introduced UCB and EXP3, respectively.
Our approach to Bayesian bandits follows Russo and Van Roy [127, 128, 129]. More advanced techniques allow Thompson sampling to apply even when the prior is unknown (e.g. Agrawal and Goyal [2]).
where $\hat\mu_i'(t) = \frac{1}{T_i(t)}\sum_{\tau : A_\tau = i} Y_i'(\tau)$ is the empirical mean of the copies $Y_i'(\tau)$ for those steps when arm $i$ is selected. To see this, we use the standard fact that the characteristic function of a random variable completely characterizes the random variable. Let $\varphi_{Y_i}(\lambda) = \mathbb{E}[e^{\iota\lambda Y_i}]$, where $\iota = \sqrt{-1}$ is the imaginary unit, denote the characteristic function of $Y_i$, noting that by construction we have $\varphi_{Y_i} = \varphi_{Y_i'}$. Then writing the joint characteristic function of $T_i(t)\hat\mu_i(t)$ and $T_i(t)$, we obtain
$$\begin{aligned} \mathbb{E}\bigg[\exp\bigg(\iota\lambda_1\sum_{\tau=1}^t\mathbf{1}\{A_\tau = i\}Y_i(\tau) + \iota\lambda_2 T_i(t)\bigg)\bigg] &\stackrel{(i)}{=} \mathbb{E}\bigg[\prod_{\tau=1}^t\mathbb{E}\left[\exp\left(\iota\lambda_1\mathbf{1}\{A_\tau = i\}Y_i(\tau) + \iota\lambda_2\mathbf{1}\{A_\tau = i\}\right)\mid\mathcal{H}_{\tau-1}\right]\bigg] \\ &\stackrel{(ii)}{=} \mathbb{E}\bigg[\prod_{\tau=1}^t\Big(\mathbf{1}\{A_\tau = i\}\, e^{\iota\lambda_2}\,\mathbb{E}[\exp(\iota\lambda_1 Y_i(\tau))\mid\mathcal{H}_{\tau-1}] + \mathbf{1}\{A_\tau \ne i\}\Big)\bigg] \\ &\stackrel{(iii)}{=} \mathbb{E}\bigg[\prod_{\tau=1}^t\Big(\mathbf{1}\{A_\tau = i\}\, e^{\iota\lambda_2}\,\varphi_{Y_i}(\lambda_1) + \mathbf{1}\{A_\tau \ne i\}\Big)\bigg] \\ &\stackrel{(iv)}{=} \mathbb{E}\bigg[\prod_{\tau=1}^t\Big(\mathbf{1}\{A_\tau = i\}\, e^{\iota\lambda_2}\,\varphi_{Y_i'}(\lambda_1) + \mathbf{1}\{A_\tau \ne i\}\Big)\bigg] \\ &= \mathbb{E}\bigg[\exp\bigg(\iota\lambda_1\sum_{\tau=1}^t\mathbf{1}\{A_\tau = i\}Y_i'(\tau) + \iota\lambda_2 T_i(t)\bigg)\bigg], \end{aligned}$$
where equality (i) is the usual tower property of conditional expectations, where $\mathcal{H}_{\tau-1}$ denotes the history to time $\tau-1$; equality (ii) holds because $A_\tau$ is a function of the history $\mathcal{H}_{\tau-1}$; equality (iii) follows because $Y_i(\tau)$ is independent of $\mathcal{H}_{\tau-1}$; and equality (iv) follows because $Y_i'$ and $Y_i$ have identical distributions. The final step is simply reversing the steps.
With the distributional equality (22.A.1) in place, we see that for any $\delta \in [0,1]$, we have
$$\begin{aligned} P\Bigg(\hat\mu_i(t) \ge \mu_i + \sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{T_i(t)}}\Bigg) &= P\Bigg(\hat\mu_i'(t) \ge \mu_i + \sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{T_i(t)}}\Bigg) \\ &= \sum_{s=1}^t P\Bigg(\hat\mu_i'(t) \ge \mu_i + \sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{s}}\ \Bigg|\ T_i(t) = s\Bigg) P(T_i(t) = s) \\ &\le \sum_{s=1}^t \delta\, P(T_i(t) = s) = \delta. \end{aligned}$$
Appendix A
In this appendix, we review several results in convex analysis that are useful for our purposes. We
give only a cursory study here, identifying the basic results and those that will be of most use to
us; the field of convex analysis as a whole is vast. The study of convex analysis and optimization
has become very important practically in the last forty to fifty years for a few reasons, the most
important of which is probably that convex optimization problems—those optimization problems
in which the objective and constraints are convex—are tractable, while many others are not. We do
not focus on optimization ideas here, however, building only some analytic tools that we will find
useful. We borrow most of our results from Hiriart-Urruty and Lemaréchal [87], focusing mostly on
the finite-dimensional case (though we present results that apply in infinite dimensional cases with
proofs that extend straightforwardly, and we do not specify the domains of our functions unless
necessary), as we require no results from infinite-dimensional analysis.
In addition, we abuse notation and assume that the range of any function is the extended real
line, meaning that if f : C → R we mean that f (x) ∈ R ∪ {−∞, +∞}, where −∞ and +∞ are
infinite and satisfy a + ∞ = +∞ and a − ∞ = −∞ for any a ∈ R. However, we assume that
our functions are proper, meaning that f (x) > −∞ for all x, as this allows us to avoid annoying
pathologies.
Definition A.1. A set C is convex if for all λ ∈ [0, 1] and all x, y ∈ C, we have
λx + (1 − λ)y ∈ C.
An important restriction of convex sets is to closed convex sets, those convex sets that are, well,
closed.
TODO: Picture
We now consider two operations that extend sets, convexifying them in nice ways.
Definition A.2. The affine hull of a set $C$ is the smallest affine set containing $C$. That is,
$$\mathrm{aff}(C) := \bigg\{\sum_{i=1}^k \lambda_i x_i : k\in\mathbb{N},\ x_i\in C,\ \lambda\in\mathbb{R}^k,\ \sum_{i=1}^k\lambda_i = 1\bigg\}.$$
Proof Call $T$ the set on the right hand side of the equality in the proposition. Then $T \supset C$ is clear, as we may simply take $\lambda_1 = 1$ and vary $x \in C$. Moreover, the set $T \subset \mathrm{Conv}(C)$, as any convex set containing $C$ must contain all convex combinations of its elements; similarly, any convex set $S \supset C$ must have $S \supset T$.
Thus if we show that $T$ is convex, then we are done. Take any two points $x, y \in T$. Then $x = \sum_{i=1}^k \alpha_i x_i$ and $y = \sum_{i=1}^l \beta_i y_i$ for $x_i, y_i \in C$. Fix $\lambda \in [0,1]$. Then $(1-\lambda)\beta_i \ge 0$ and $\lambda\alpha_i \ge 0$ for all $i$,
$$\lambda\sum_{i=1}^k\alpha_i + (1-\lambda)\sum_{i=1}^l\beta_i = \lambda + (1-\lambda) = 1,$$
and $\lambda x + (1-\lambda)y$ is a convex combination of the points $x_i$ and $y_i$ weighted by $\lambda\alpha_i$ and $(1-\lambda)\beta_i$, respectively. So $\lambda x + (1-\lambda)y \in T$ and $T$ is convex.
We also give one more definition, which is useful for dealing with some pathological cases in convex analysis, as it allows us to assume many sets are full-dimensional.
Definition A.4. The relative interior of a set $C$ is the interior of $C$ relative to its affine hull, that is,
$$\mathrm{relint}(C) := \left\{x\in C : B(x,\epsilon)\cap\mathrm{aff}(C)\subset C \text{ for some } \epsilon > 0\right\},$$
where $B(x,\epsilon) := \{y : \|y - x\| < \epsilon\}$ denotes the open ball of radius $\epsilon$ centered at $x$.
An example may make Definition A.4 clearer.
Example A.2 (Relative interior of a disc): Consider the (convex) set
$$C = \left\{x\in\mathbb{R}^d : x_1^2 + x_2^2 \le 1,\ x_j = 0 \text{ for } j\in\{3,\ldots,d\}\right\}.$$
The affine hull $\mathrm{aff}(C) = \mathbb{R}^2\times\{0\} = \{(x_1, x_2, 0, \ldots, 0) : x_1, x_2\in\mathbb{R}\}$ is simply the $(x_1,x_2)$-plane in $\mathbb{R}^d$, while the relative interior $\mathrm{relint}(C) = \{x\in\mathbb{R}^d : x_1^2 + x_2^2 < 1\}\cap\mathrm{aff}(C)$ is the “interior” of the 2-dimensional disc in $\mathbb{R}^d$. 3
In finite dimensions, we may actually restrict the definition of the convex hull of a set C to convex
combinations of a bounded number (the dimension plus one) of the points in C, rather than arbi-
trary convex combinations as required by Proposition A.1. This result is known as Carathéodory’s
theorem.
Theorem A.3. Let $C\subset\mathbb{R}^d$. Then $x\in\mathrm{Conv}(C)$ if and only if there exist points $x_1,\ldots,x_{d+1}\in C$ and $\lambda\in\mathbb{R}^{d+1}_+$ with $\sum_{i=1}^{d+1}\lambda_i = 1$ such that
$$x = \sum_{i=1}^{d+1}\lambda_i x_i.$$
Proof It is clear that if $x$ can be represented as such a sum, then $x\in\mathrm{Conv}(C)$. Conversely, Proposition A.1 implies that for any $x\in\mathrm{Conv}(C)$ we have
$$x = \sum_{i=1}^k\lambda_i x_i, \qquad \lambda_i\ge 0,\quad \sum_{i=1}^k\lambda_i = 1,\quad x_i\in C$$
for some $\lambda_i, x_i$. Assume that $k > d+1$ and $\lambda_i > 0$ for each $i$, as otherwise, there is nothing to prove. Then we know that the points $x_i - x_1$ are certainly linearly dependent (as there are $k - 1 > d$ of them), and we can find (not identically zero) values $\alpha_2,\ldots,\alpha_k$ such that $\sum_{i=2}^k\alpha_i(x_i - x_1) = 0$. Let $\alpha_1 = -\sum_{i=2}^k\alpha_i$ to obtain that we have both
$$\sum_{i=1}^k\alpha_i x_i = 0 \qquad\text{and}\qquad \sum_{i=1}^k\alpha_i = 0. \tag{A.1.1}$$
Notably, the equalities (A.1.1) imply that at least one $\alpha_i > 0$, and if we define $\lambda^* = \min_{i:\alpha_i>0}\frac{\lambda_i}{\alpha_i} > 0$, then setting $\lambda_i' = \lambda_i - \lambda^*\alpha_i$ we have
$$\lambda_i' \ge 0 \text{ for all } i, \qquad \sum_{i=1}^k\lambda_i' = \sum_{i=1}^k\lambda_i - \lambda^*\sum_{i=1}^k\alpha_i = 1, \qquad\text{and}\qquad \sum_{i=1}^k\lambda_i' x_i = \sum_{i=1}^k\lambda_i x_i - \lambda^*\sum_{i=1}^k\alpha_i x_i = x.$$
But we know that at least one of the $\lambda_i' = 0$, so that we could write $x$ as a convex combination of $k-1$ elements. Repeating this strategy until $k = d+1$ gives the theorem.
Observation A.4 is clear, as we have $C\subset\mathrm{Conv}(C)$, while any other convex $S\supset C$ clearly satisfies $S\supset\mathrm{Conv}(C)$. Secondly, we note that intersections preserve convexity.
Observation A.5. Let $\{C_\alpha\}_{\alpha\in\mathcal{A}}$ be an arbitrary collection of convex sets. Then
$$C = \bigcap_{\alpha\in\mathcal{A}} C_\alpha$$
is convex, and if each $C_\alpha$ is closed, then $C$ is closed.
The convexity property follows because if x1 ∈ C and x2 ∈ C, then clearly x1 , x2 ∈ Cα for all
α ∈ A, and moreover λx1 + (1 − λ)x2 ∈ Cα for all α and any λ ∈ [0, 1]. The closure property is
standard. In addition, we note that closing a convex set maintains convexity.
Observation A.6. Let C be convex. Then cl(C) is convex.
To see this, we note that if x, y ∈ cl(C) and xn → x and yn → y (where xn , yn ∈ C), then for any
λ ∈ [0, 1], we have λxn + (1 − λ)yn ∈ C and λxn + (1 − λ)yn → λx + (1 − λ)y. Thus we have
λx + (1 − λ)y ∈ cl(C) as desired.
Observation A.6 also implies the following result.
Observation A.7. Let $D$ be an arbitrary set. Then
$$\bigcap\{C : C\supset D,\ C \text{ is closed convex}\} = \mathrm{cl}\,\mathrm{Conv}(D).$$
Proof Let $T$ denote the leftmost set. It is clear that $T\subset\mathrm{cl}\,\mathrm{Conv}(D)$, as $\mathrm{cl}\,\mathrm{Conv}(D)$ is a closed convex set (by Observation A.6) containing $D$. On the other hand, if $C\supset D$ is a closed convex set, then $C\supset\mathrm{Conv}(D)$, while the closedness of $C$ implies it also contains the closure of $\mathrm{Conv}(D)$. Thus $T\supset\mathrm{cl}\,\mathrm{Conv}(D)$ as well.
TODO: picture
As our last consideration of operations that preserve convexity, we consider what is known as the perspective of a set. To define this set, we need to define the perspective function, which, given a point $(x,t)\in\mathbb{R}^d\times\mathbb{R}_{++}$ (here $\mathbb{R}_{++} = \{t : t > 0\}$ denotes the strictly positive reals), is defined as
$$\mathrm{pers}(x,t) = \frac{x}{t}.$$
We have the following definition.
Definition A.5. Let $C\subset\mathbb{R}^d\times\mathbb{R}_+$ be a set. The perspective transform of $C$, denoted by $\mathrm{pers}(C)$, is
$$\mathrm{pers}(C) := \Big\{\frac{x}{t} : (x,t)\in C \text{ and } t > 0\Big\}.$$
This corresponds to taking all the points $z\in C$, normalizing them so their last coordinate is 1, and then removing the last coordinate. (For more on perspective functions, see Boyd and Vandenberghe [33, Chapter 2.3.3].)
It is interesting to note that the perspective of a convex set is convex. First, we note the following.
Lemma A.8. Let $C\subset\mathbb{R}^{d+1}$ be a compact line segment, meaning that $C = \{\lambda x + (1-\lambda)y : \lambda\in[0,1]\}$, where $x_{d+1} > 0$ and $y_{d+1} > 0$. Then $\mathrm{pers}(C) = \{\lambda\,\mathrm{pers}(x) + (1-\lambda)\,\mathrm{pers}(y) : \lambda\in[0,1]\}$.
Proof Let $\lambda\in[0,1]$. Then
$$\begin{aligned} \mathrm{pers}(\lambda x + (1-\lambda)y) &= \frac{\lambda x_{1:d} + (1-\lambda)y_{1:d}}{\lambda x_{d+1} + (1-\lambda)y_{d+1}} \\ &= \frac{\lambda x_{d+1}}{\lambda x_{d+1} + (1-\lambda)y_{d+1}}\cdot\frac{x_{1:d}}{x_{d+1}} + \frac{(1-\lambda)y_{d+1}}{\lambda x_{d+1} + (1-\lambda)y_{d+1}}\cdot\frac{y_{1:d}}{y_{d+1}} \\ &= \theta\,\mathrm{pers}(x) + (1-\theta)\,\mathrm{pers}(y), \end{aligned}$$
where $x_{1:d}$ and $y_{1:d}$ denote the vectors of the first $d$ components of $x$ and $y$, respectively, and
$$\theta = \frac{\lambda x_{d+1}}{\lambda x_{d+1} + (1-\lambda)y_{d+1}}\in[0,1].$$
As $\lambda$ traverses $[0,1]$, $\theta$ traverses $[0,1]$ as well, giving the lemma.
Proposition A.9. Let $C\subset\mathbb{R}^d\times\mathbb{R}_{++}$ be convex. Then $\mathrm{pers}(C)$ is convex.
Proof Let $x, y\in C$ and define $L = \{\lambda x + (1-\lambda)y : \lambda\in[0,1]\}$ to be the line segment between them. By Lemma A.8, $\mathrm{pers}(L) = \{\lambda\,\mathrm{pers}(x) + (1-\lambda)\,\mathrm{pers}(y) : \lambda\in[0,1]\}$ is also a (convex) line segment, and we have $\mathrm{pers}(L)\subset\mathrm{pers}(C)$ as necessary.
Theorem A.10 (Projections). Let $C$ be a closed convex set. Then for any $x$, there exists a unique point $\pi_C(x)$ minimizing $\|y - x\|$ over $y\in C$. Moreover, this point is characterized by the inequality
$$\langle \pi_C(x) - x,\ y - \pi_C(x)\rangle \ge 0 \quad\text{for all } y\in C. \tag{A.1.2}$$
Proof The existence and uniqueness of the projection follow from the parallelogram identity, that is, that for any $x, y$ we have $\|x-y\|^2 + \|x+y\|^2 = 2(\|x\|^2 + \|y\|^2)$, which follows by noting that $\|x+y\|^2 = \|x\|^2 + \|y\|^2 + 2\langle x, y\rangle$. Indeed, let $\{y_n\}\subset C$ be a sequence such that
$$\|y_n - x\| \to \inf_{y\in C}\|y - x\| =: p^\star$$
as $n\to\infty$, where $p^\star$ is the infimal value. We show that $y_n$ is Cauchy, so that there exists a (unique) limit point of the sequence. Fix $\epsilon > 0$ and let $N$ be such that $n\ge N$ implies $\|y_n - x\|^2 \le (p^\star)^2 + \epsilon^2$. Let $m, n\ge N$. Then by the parallelogram identity,
$$\|y_n - y_m\|^2 = \|(x - y_n) - (x - y_m)\|^2 = 2\left[\|x - y_n\|^2 + \|x - y_m\|^2\right] - \|(x - y_n) + (x - y_m)\|^2.$$
Noting that
$$(x - y_n) + (x - y_m) = 2\Big(x - \frac{y_n + y_m}{2}\Big) \qquad\text{and}\qquad \frac{y_n + y_m}{2}\in C \text{ (by convexity of } C\text{)},$$
we have
$$\|x - y_n\|^2 \le (p^\star)^2 + \epsilon^2, \qquad \|x - y_m\|^2 \le (p^\star)^2 + \epsilon^2, \qquad\text{and}\qquad \|(x - y_n) + (x - y_m)\|^2 = 4\Big\|x - \frac{y_n + y_m}{2}\Big\|^2 \ge 4(p^\star)^2.$$
In particular, we have
$$\|y_n - y_m\|^2 \le 2\left[(p^\star)^2 + \epsilon^2\right] + 2\left[(p^\star)^2 + \epsilon^2\right] - 4(p^\star)^2 = 4\epsilon^2.$$
As $\epsilon > 0$ was arbitrary, this completes the proof of the first statement of the theorem.
To see the second result, assume that $z$ is a point satisfying inequality (A.1.2), that is, such that
$$\langle z - x,\ y - z\rangle \ge 0 \quad\text{for all } y\in C.$$
Then we have
$$\|z - x\|^2 = \langle z - x, z - x\rangle = \underbrace{\langle z - x, z - y\rangle}_{\le 0} + \langle z - x, y - x\rangle \le \|z - x\|\|y - x\|,$$
so that $\|z - x\| \le \|y - x\|$ for all $y\in C$; that is, $z = \pi_C(x)$. Conversely, for any $y\in C$ and $t\in(0,1)$, the definition of the projection gives
$$\|\pi_C(x) - x\|^2 \le \|(1-t)\pi_C(x) + ty - x\|^2 = \|\pi_C(x) - x + t(y - \pi_C(x))\|^2 = \|\pi_C(x) - x\|^2 + 2t\langle\pi_C(x) - x,\ y - \pi_C(x)\rangle + t^2\|y - \pi_C(x)\|^2.$$
Subtracting the projection value $\|\pi_C(x) - x\|^2$ from both sides and dividing by $t > 0$, we have
$$0 \le 2\langle\pi_C(x) - x,\ y - \pi_C(x)\rangle + t\|y - \pi_C(x)\|^2,$$
and taking $t\downarrow 0$ gives inequality (A.1.2).
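As a concrete instance (our own example, not in the notes), the projection onto the Euclidean ball $\{y : \|y\|\le R\}$ is $\pi_C(x) = x\min\{1, R/\|x\|\}$, and one can check the characterization (A.1.2) numerically, assuming numpy:

```python
import numpy as np

def project_ball(x, R=1.0):
    """Euclidean projection onto the ball {y : ||y|| <= R}."""
    norm = np.linalg.norm(x)
    return x if norm <= R else (R / norm) * x

rng = np.random.default_rng(1)
x = 3.0 * rng.normal(size=5)                               # a point (a.s.) outside the unit ball
p = project_ball(x)
ys = [project_ball(y) for y in rng.normal(size=(100, 5))]  # assorted points of C
# inequality (A.1.2): <pi_C(x) - x, y - pi_C(x)> >= 0 for every y in C
assert all(np.dot(p - x, y - p) >= -1e-9 for y in ys)
```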
Corollary A.11. Let $C$ be closed convex and $x\not\in C$. Then there is a vector $v$ strictly separating $x$ from $C$, that is,
$$\langle v, x\rangle > \sup_{y\in C}\langle v, y\rangle.$$
In addition, we can show the existence of supporting hyperplanes, that is, hyperplanes “sepa-
rating” the boundary of a convex set from itself.
Theorem A.12. Let $C$ be a convex set and $x\in\mathrm{bd}(C)$, where $\mathrm{bd}(C) = \mathrm{cl}(C)\setminus\mathrm{int}\, C$. Then there exists a non-zero vector $v$ such that $\langle v, x\rangle \ge \sup_{y\in C}\langle v, y\rangle$.
Proof Let $D = \mathrm{cl}(C)$ be the closure of $C$ and let $x_n\not\in D$ be a sequence of points such that $x_n\to x$. Let us define the sequence of separating vectors $s_n = x_n - \pi_D(x_n)$ and the normalized versions $v_n = s_n/\|s_n\|$. Notably, we have $\langle v_n, x_n\rangle > \sup_{y\in C}\langle v_n, y\rangle$ for all $n$. Now, the sequence $\{v_n\}\subset\{v : \|v\| = 1\}$ belongs to a compact set.¹ Passing to a subsequence if necessary, let us assume w.l.o.g. that $v_n\to v$ with $\|v\| = 1$. Then by a standard limiting argument as $x_n\to x$, we have
$$\langle v, x\rangle \ge \langle v, y\rangle \quad\text{for all } y\in C,$$
which was our desired result.
A closed half-space is a set of the form
$$H := \{x : \langle v, x\rangle \le r\}$$
for a non-zero vector $v$ and scalar $r$. Before stating the theorem, we remark that by Observation A.6, the intersection of all the closed convex sets containing a set $D$ is equal to the closure of the convex hull of $D$.
Theorem A.13. Let $D$ be an arbitrary set. Then
$$\mathrm{cl}\,\mathrm{Conv}(D) = \bigcap_{H\supset D} H, \tag{A.1.3}$$
where $H$ denotes a closed half-space containing $D$. Moreover, for any closed convex set $C$,
$$C = \bigcap_{x\in\mathrm{bd}(C)} H_x, \tag{A.1.4}$$
where $H_x$ denotes a closed half-space supporting $C$ at $x$.
Proof We begin with the proof of the second result (A.1.4). Indeed, by Theorem A.12, we know that at each point $x$ on the boundary of $C$, there exists a non-zero supporting vector $v$, so that the half-space
$$H_{x,v} := \{y : \langle v, y\rangle \le \langle v, x\rangle\} \supset C$$
is closed, convex, and contains $C$. We clearly have the containment $C\subset\cap_{x\in\mathrm{bd}(C)} H_x$. Now let $x_0\not\in C$; we show that $x_0\not\in\cap_{x\in\mathrm{bd}(C)} H_x$. As $x_0\not\in C$, the projection $\pi_C(x_0)$ of $x_0$ onto $C$ satisfies $\langle x_0 - \pi_C(x_0), x_0\rangle > \sup_{y\in C}\langle x_0 - \pi_C(x_0), y\rangle$ by Corollary A.11. Moreover, letting $v = x_0 - \pi_C(x_0)$, the hyperplane
$$h_{x_0,v} := \{y : \langle y, v\rangle = \langle\pi_C(x_0), v\rangle\}$$
¹In infinite dimensions, this may not be the case. But we can apply the Banach-Alaoglu theorem, which states that, as the $v_n$ are linear operators, the sequence is weak-* compact, so that there is a vector $v$ with $\|v\|\le 1$ and a subsequence $m(n)\subset\mathbb{N}$ such that $\langle v_{m(n)}, x\rangle\to\langle v, x\rangle$ for all $x$.
is clearly supporting to $C$ at the point $\pi_C(x_0)$. The half-space $\{y : \langle y, v\rangle \le \langle\pi_C(x_0), v\rangle\}$ thus contains $C$ and does not contain $x_0$, implying that $x_0\not\in\cap_{x\in\mathrm{bd}(C)} H_x$.
Now we show the first result (A.1.3). Let $C$ be the closed convex hull of $D$ and $T = \cap_{H\supset D} H$. By a trivial extension of the representation (A.1.4), we have that $C = \cap_{H\supset C} H$, where $H$ denotes any halfspace containing $C$. As $C\supset D$, we have that $H\supset C$ implies $H\supset D$, so that
$$T = \bigcap_{H\supset D} H \subset \bigcap_{H\supset C} H = C.$$
On the other hand, as $C = \mathrm{cl}\,\mathrm{Conv}(D)$, Observation A.7 implies that any closed convex set containing $D$ contains $C$. As a closed halfspace is convex and closed, we have that $H\supset D$ implies $H\supset C$, and thus $T = C$ as desired.
[Figure A.1: the epigraph $\mathrm{epi}\, f$ of a function $f$.]
We now build off of the definitions of convex sets to define convex functions. As we will see, convex functions have several nice properties that follow from the geometric (separation) properties of convex sets. First, we have
Definition A.6. A function $f$ is convex if for all $\lambda\in[0,1]$ and $x, y\in\mathrm{dom}\, f$,
$$f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y). \tag{A.2.1}$$
We define the domain $\mathrm{dom}\, f$ of a convex function to be those points $x$ such that $f(x) < +\infty$. Note that Definition A.6 implies that the domain of $f$ must be convex.
An equivalent definition of convexity follows by considering a natural convex set attached to the function $f$, known as its epigraph:
$$\mathrm{epi}\, f := \{(x, t)\in\mathbb{R}^d\times\mathbb{R} : f(x) \le t\}.$$
That is, the epigraph of a function f is the set of points on or above the graph of the function itself,
as depicted in Figure A.1. It is immediate from the definition of the epigraph that f is convex if
and only if epi f is convex. Thus, we see that any convex set C ⊂ Rd+1 that is unbounded “above,”
meaning that C = C + {0} × R+ , defines a convex function, and conversely, any convex function
defines such a set C. This duality in the relationship between a convex function and its epigraph
is central to many of the properties we exploit.
and multiplying the former by $(1-\lambda)$ and the latter by $\lambda$ and adding the two inequalities yields
$$\lambda f(x) + (1-\lambda)f(y) \ge f(\lambda x + (1-\lambda)y),$$
as desired. In Theorem A.14 to come, we see that the converse to inequality (A.2.2) holds as well, that is, differentiable convex functions satisfy inequality (A.2.2).
We may also give the standard second-order characterization: if $f : \mathbb{R}\to\mathbb{R}$ is twice differentiable and $f''(x) \ge 0$ for all $x$, then $f$ is convex. To see this, note that
$$f(y) = f(x) + f'(x)(y - x) + \frac{1}{2}f''(tx + (1-t)y)(x - y)^2$$
for some $t\in[0,1]$ by Taylor's theorem, so that $f(y) \ge f(x) + f'(x)(y - x)$ for all $x, y$ because $f''(tx + (1-t)y) \ge 0$. As a consequence, we obtain inequality (A.2.2), which implies that $f$ is convex.
As convexity is a property that depends only on properties of functions on lines—one-dimensional projections—we can straightforwardly extend the preceding results to functions $f : \mathbb{R}^d\to\mathbb{R}$. Indeed, noting that if $h(t) = f(x + ty)$ then $h'(0) = \langle\nabla f(x), y\rangle$ and $h''(0) = y^\top\nabla^2 f(x)\, y$, we have that a differentiable function $f : \mathbb{R}^d\to\mathbb{R}$ is convex if and only if
$$f(y) \ge f(x) + \langle\nabla f(x),\ y - x\rangle \quad\text{for all } x, y.$$
See Figure A.2 for an illustration of the affine minorizing function given by the subgradient of a convex function at a particular point.
[Figure A.2: the tangent (affine) function $f(x_0) + \langle g, x - x_0\rangle$ to the function $f$ generated by a subgradient $g$ at the point $x_0$.]
Interestingly, convex functions have subgradients (at least, nearly everywhere). This is perhaps intuitively obvious by viewing a function in conjunction with its epigraph $\mathrm{epi}\, f$ and noting that $\mathrm{epi}\, f$ has supporting hyperplanes, but here we state a result that will have further use.
Theorem A.14. Let $f$ be convex. Then there is an affine function minorizing $f$. More precisely, for any $x_0\in\mathrm{relint}\,\mathrm{dom}\, f$, there exists a vector $g$ such that
$$f(x) \ge f(x_0) + \langle g,\ x - x_0\rangle \quad\text{for all } x.$$
Proof If $\mathrm{relint}\,\mathrm{dom}\, f = \emptyset$, then it is clear that $f$ is either identically $+\infty$ or its domain is a single point $\{x_0\}$, in which case the constant function $f(x_0)$ minorizes $f$. Now, we assume that $\mathrm{int}\,\mathrm{dom}\, f \ne \emptyset$, as we can simply always change basis to work in the affine hull of $\mathrm{dom}\, f$.
We use Theorem A.12 on the existence of supporting hyperplanes to construct a subgradient. Indeed, we note that $(x_0, f(x_0))\in\mathrm{bd}\,\mathrm{epi}\, f$, as for any open set $O$ we have that $(x_0, f(x_0)) + O$ contains points both inside and outside of $\mathrm{epi}\, f$. Thus, Theorem A.12 guarantees the existence of a vector $v$ and $a\in\mathbb{R}$, not both simultaneously zero, such that
$$\langle v, x_0\rangle + a f(x_0) \le \langle v, x\rangle + a t \quad\text{for all } (x, t)\in\mathrm{epi}\, f. \tag{A.2.4}$$
Inequality (A.2.4) implies that $a \ge 0$, as for any $x$ we may take $t\to+\infty$ while satisfying $(x,t)\in\mathrm{epi}\, f$. Now we argue that $a > 0$ strictly. To see this, note that for suitably small $\delta > 0$, we have $x = x_0 - \delta v\in\mathrm{dom}\, f$. Then we find by inequality (A.2.4) that
$$\langle v, x_0\rangle + a f(x_0) \le \langle v, x_0\rangle - \delta\|v\|^2 + a f(x_0 - \delta v), \qquad\text{or}\qquad a\left[f(x_0) - f(x_0 - \delta v)\right] \le -\delta\|v\|^2.$$
So if $v = 0$, then Theorem A.12 already guarantees $a\ne 0$, while if $v\ne 0$, then $\|v\|^2 > 0$ and we must have $a\ne 0$ and $f(x_0)\ne f(x_0 - \delta v)$. As we showed already that $a \ge 0$, we must have $a > 0$. Then by setting $t = f(x_0)$ and dividing both sides of inequality (A.2.4) by $a$, we obtain
$$\frac{1}{a}\langle v,\ x_0 - x\rangle + f(x_0) \le f(x) \quad\text{for all } x\in\mathrm{dom}\, f.$$
Setting $g = -v/a$ gives the result of the theorem, as we have $f(x) = +\infty$ for $x\not\in\mathrm{dom}\, f$.
Convex functions generally have quite nice behavior. Indeed, they enjoy some quite remarkable continuity properties just by virtue of the defining convexity inequality (A.2.1). In particular, the following theorem shows that convex functions are continuous on the relative interiors of their domains. Even more, convex functions are Lipschitz continuous on any compact subset contained in the (relative) interior of their domains. (See Figure A.3 for an illustration of this fact.)
Theorem A.15. Let $f : \mathbb{R}^d\to\mathbb{R}$ be convex and $C\subset\mathrm{relint}\,\mathrm{dom}\, f$ be compact. Then there exists an $L = L(C) \ge 0$ such that
$$|f(x) - f(x')| \le L\|x - x'\| \quad\text{for all } x, x'\in C.$$
Lemma A.16. Let $f : \mathbb{R}^d\to\mathbb{R}$ be convex and suppose that there are $x_0$, $\delta > 0$, $m$, and $M$ such that
$$m \le f(x) \le M \quad\text{for } x\in B(x_0, 2\delta) := \{x : \|x - x_0\| < 2\delta\}.$$
Then $f$ is Lipschitz on $B(x_0,\delta)$, and moreover,
$$|f(y) - f(y')| \le \frac{M - m}{\delta}\|y - y'\| \quad\text{for } y, y'\in B(x_0,\delta).$$
Proof Let $y, y'\in B(x_0,\delta)$, and define $y'' = y' + \delta\frac{y' - y}{\|y' - y\|}\in B(x_0, 2\delta)$. Then we can write $y'$ as a convex combination of $y$ and $y''$, specifically,
$$y' = \frac{\|y' - y\|}{\delta + \|y' - y\|}\, y'' + \frac{\delta}{\delta + \|y' - y\|}\, y.$$
Figure A.3. Left: discontinuities in $\mathrm{int}\,\mathrm{dom}\, f$ are impossible while maintaining convexity (Theorem A.15). Right: at the edge of $\mathrm{dom}\, f$, there may be points of discontinuity.
Using convexity, we find
$$f(y') - f(y) \le \frac{\|y' - y\|}{\delta + \|y' - y\|}f(y'') + \frac{\delta}{\delta + \|y' - y\|}f(y) - f(y) = \frac{\|y - y'\|}{\delta + \|y - y'\|}\left[f(y'') - f(y)\right] \le \frac{M - m}{\delta + \|y - y'\|}\|y - y'\|.$$
Here we have used the bounds on $f$ assumed in the lemma. Swapping the assignments of $y$ and $y'$ gives the same lower bound, thus giving the desired Lipschitz continuity.
With Lemma A.16 in place, we proceed to the proof proper. We assume without loss of generality that $\mathrm{dom}\, f$ has an interior; otherwise we prove the theorem restricting ourselves to the affine hull of $\mathrm{dom}\, f$. The proof follows a standard compactification argument. Suppose that for each $x\in C$, we could construct an open ball $B_x = B(x,\delta_x)$ with $\delta_x > 0$ such that
$$|f(y) - f(y')| \le L_x\|y - y'\| \quad\text{for } y, y'\in B_x. \tag{A.2.5}$$
As the $B_x$ cover the compact set $C$, we can extract a finite number of them, call them $B_{x_1},\ldots,B_{x_k}$, covering $C$, and then within each (overlapping) ball $f$ is $\max_k L_{x_k}$-Lipschitz. As a consequence, we find that
$$|f(y) - f(y')| \le \max_k L_{x_k}\|y - y'\| \quad\text{for any } y, y'\in C.$$
We thus must derive inequality (A.2.5), for which we use the boundedness Lemma A.16. We must demonstrate that $f$ is bounded in a neighborhood of each $x\in C$. To that end, fix $x\in\mathrm{int}\,\mathrm{dom}\, f$, and let the points $x_0,\ldots,x_d$ be affinely independent and such that
$$\Delta := \mathrm{Conv}\{x_0,\ldots,x_d\}\subset\mathrm{dom}\, f$$
with $x\in\mathrm{int}\,\Delta$, so that there is some $\delta > 0$ with $B(x, 2\delta)\subset\Delta$. Writing any $y\in B(x,2\delta)$ as a convex combination $y = \sum_{i=0}^d\lambda_i x_i$, convexity thus gives
$$f(y) \le \sum_{i=0}^d\lambda_i f(x_i) \le \max_{i\in\{0,\ldots,d\}} f(x_i) =: M.$$
Moreover, Theorem A.14 implies that there is some affine function $h$ minorizing $f$; let $h(x) = a + \langle v, x\rangle$ denote this function. Then
$$m := \inf_{y\in B(x,2\delta)} h(y)$$
exists and is finite, so that in the ball $B(x, 2\delta)$ constructed above, we have $f(y)\in[m, M]$ as required by Lemma A.16. This guarantees the existence of a ball $B_x$ required by inequality (A.2.5).
[Figure A.4: a closed convex function $f$ and its epigraph $\mathrm{epi}\, f$.]
Our final discussion of continuity properties of convex functions revolves around the most common and analytically convenient type of convex function, the so-called closed convex functions: we say a convex function $f$ is closed if
$$\liminf_{x\to x_0} f(x) \ge f(x_0) \tag{A.2.6}$$
for all $x_0$ and any sequence of points tending toward $x_0$. See Figure A.4 for an example of such a function and its associated epigraph.
Interestingly, in the one-dimensional case, closed convexity implies continuity. Indeed, we have the following observation (compare Figures A.4 and A.3 previously):
Observation A.17. Let $f : \mathbb{R}\to\mathbb{R}$ be closed convex. Then $f$ is continuous on $\mathrm{dom}\, f$.
Proof By Theorem A.15, we need only consider the endpoints of the domain of $f$ (the result is obvious by Theorem A.15 if $\mathrm{dom}\, f = \mathbb{R}$); let $x_0\in\mathrm{bd}\,\mathrm{dom}\, f$. Let $y\in\mathrm{dom}\, f$ be an otherwise arbitrary point, and define $x = \lambda y + (1-\lambda)x_0$. Then taking $\lambda\to 0$, we have
$$f(x) \le \lambda f(y) + (1-\lambda)f(x_0) \to f(x_0),$$
so that $\limsup_{x\to x_0} f(x) \le f(x_0)$. By the closedness assumption (A.2.6), we have $\liminf_{x\to x_0} f(x) \ge f(x_0)$, and continuity follows.
Proposition A.18. Let $\{f_\alpha\}_{\alpha\in\mathcal{A}}$ be an arbitrary collection of convex functions indexed by $\mathcal{A}$. Then
$$f(x) := \sup_{\alpha\in\mathcal{A}} f_\alpha(x)$$
is convex. Moreover, if for each $\alpha\in\mathcal{A}$ the function $f_\alpha$ is closed convex, $f$ is closed convex.
Proof The proof is immediate once we consider the epigraph $\mathrm{epi}\, f$. We have that
$$\mathrm{epi}\, f = \bigcap_{\alpha\in\mathcal{A}}\mathrm{epi}\, f_\alpha,$$
which is convex whenever $\mathrm{epi}\, f_\alpha$ is convex for all $\alpha$ and closed whenever $\mathrm{epi}\, f_\alpha$ is closed for all $\alpha$ (recall Observation A.5).
Another immediate result is that composition of a convex function with an affine transformation
preserves convexity:
Proposition A.19. Let A ∈ Rd×n and b ∈ Rd , and let f : Rd → R be convex. Then the function
g(y) = f (Ay + b) is convex.
Lastly, we consider the functional analogue of the perspective transform. Given a function $f : \mathbb{R}^d\to\mathbb{R}$, the perspective transform of $f$ is defined as
$$\mathrm{pers}(f)(x, t) := \begin{cases} t f\big(\frac{x}{t}\big) & \text{if } t > 0 \text{ and } \frac{x}{t}\in\mathrm{dom}\, f \\ +\infty & \text{otherwise.}\end{cases} \tag{A.2.7}$$
In analogue with the perspective transform of a convex set, the perspective transform of a function is (jointly) convex.
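As a worked instance of the definition (A.2.7), take $f(u) = u\log u$ on $\mathbb{R}_+$: then
$$\mathrm{pers}(f)(x, t) = t\cdot\frac{x}{t}\log\frac{x}{t} = x\log\frac{x}{t},$$
so the joint convexity of $(x,t)\mapsto x\log(x/t)$—and hence of each term $w_j\log(w_j/v_j)$ in the KL divergence—follows from the convexity of $u\log u$.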
[Figure A.5: the maximum of two convex functions $f_1$ and $f_2$ is convex, as its epigraph is the intersection of the two epigraphs.]
Proof The result follows if we can show that $\mathrm{epi}\,\mathrm{pers}(f)$ is a convex set. With that in mind, note that
$$\mathbb{R}^d\times\mathbb{R}_{++}\times\mathbb{R}\ni(x,t,r)\in\mathrm{epi}\,\mathrm{pers}(f) \quad\text{if and only if}\quad f\Big(\frac{x}{t}\Big)\le\frac{r}{t}.$$
Rewriting this, we have
$$\mathrm{epi}\,\mathrm{pers}(f) = \Big\{(x,t,r)\in\mathbb{R}^d\times\mathbb{R}_{++}\times\mathbb{R} : f\Big(\frac{x}{t}\Big)\le\frac{r}{t}\Big\} = \left\{t\,(x', 1, r') : x'\in\mathbb{R}^d,\ t\in\mathbb{R}_{++},\ r'\in\mathbb{R},\ f(x')\le r'\right\}.$$
c. Fenchel biconjugate
Further reading
There are a variety of references on the topic, beginning with the foundational book by Rockafellar
[126], which initiated the study of convex functions and optimization in earnest. Since then, a
variety of authors have written (perhaps more easily approachable) books on convex functions,
optimization, and their related calculus. Hiriart-Urruty and Lemaréchal [87] have written two
volumes explaining in great detail finite-dimensional convex analysis, and provide a treatment of
some first-order algorithms for solving convex problems. Borwein and Lewis [31] and Luenberger
[109] give general treatments that include infinite-dimensional convex analysis, and Bertsekas [27]
gives a variety of theoretical results on duality and optimization theory.
There are, of course, books that combine theoretical treatment with questions of convex mod-
eling and procedures for solving convex optimization problems (problems for which the objective
and constraint sets are all convex). Boyd and Vandenberghe [33] gives a very readable treatment
for those who wish to use convex optimization techniques and modeling, as well as the basic results
in convex analytic background and duality theory. Ben-Tal and Nemirovski [25], as well as Nemirovski's various lecture notes, give a theory of the tractability of computing solutions to convex optimization problems as well as methods for solving them.
Bibliography
[2] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit prob-
lem. In Proceedings of the Twenty Fifth Annual Conference on Computational Learning
Theory, 2012.
[3] N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest
neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.
[4] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution
from another. Journal of the Royal Statistical Society, Series B, 28:131–142, 1966.
[5] S. Amari and H. Nagaoka. Methods of Information Geometry, volume 191 of Translations of
Mathematical Monographs. American Mathematical Society, 2000.
[8] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta algorithm
and applications. Theory of Computing, 8(1):121–164, 2012.
[9] S. Artstein, K. Ball, F. Barthe, and A. Naor. Solution of Shannon’s problem on the mono-
tonicity of entropy. Journal of the American Mathematical Society, 17(4):975–982, 2004.
[10] P. Assouad. Deux remarques sur l’estimation. Comptes Rendus des Séances de l’Académie
des Sciences, Série I, 296(23):1021–1024, 1983.
[11] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, pages 2635–2686, 2010.
[12] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit
problem. Machine Learning, 47(2-3):235–256, 2002.
[14] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for
machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459,
2011.
[15] B. Balle, G. Barthe, and M. Gaboardi. Privacy amplification by subsampling: Tight analyses
via couplings and divergences. In Advances in Neural Information Processing Systems 31,
pages 6277–6287, 2018.
[18] A. Barron. Entropy and the central limit theorem. Annals of Probability, 14(1):336–342,
1986.
[20] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions
on Information Theory, 37:1034–1054, 1991.
[21] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds.
Journal of the American Statistical Association, 101:138–156, 2006.
[22] R. Bassily, A. Smith, T. Steinke, and J. Ullman. More general queries and less generalization
error in adaptive data analysis. arXiv:1503.04843 [cs.LG], 2015.
[23] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman. Algorithmic stability
for adaptive data analysis. In Proceedings of the Forty-Eighth Annual ACM Symposium on
the Theory of Computing, pages 1046–1059, 2016.
[24] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for
convex optimization. Operations Research Letters, 31:167–175, 2003.
[25] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. SIAM, 2001.
[26] J. M. Bernardo. Reference analysis. In D. Day and C. R. Rao, editors, Bayesian Thinking,
Modeling and Computation, volume 25 of Handbook of Statistics, chapter 2, pages 17–90.
Elsevier, 2005.
[28] L. Birgé. Approximation dans les espaces métriques et théorie de l’estimation. Zeitschrift für
Wahrscheinlichkeitstheorie und verwebte Gebiet, 65:181–238, 1983.
[29] L. Birgé. A new lower bound for multiple hypothesis testing. IEEE Transactions on Infor-
mation Theory, 51(4):1611–1614, 2005.
[30] L. Birgé and P. Massart. Estimation of integral functionals of a density. Annals of Statistics,
23(1):11–29, 1995.
[31] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
[33] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[34] S. Boyd, J. Duchi, and L. Vandenberghe. Subgradients. Course notes for Stanford Course
EE364b, 2015. URL http://web.stanford.edu/class/ee364b/lectures/subgradients_
notes.pdf.
[37] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed
bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[38] V. Buldygin and Y. Kozachenko. Metric Characterization of Random Variables and Random
Processes, volume 188 of Translations of Mathematical Monographs. American Mathematical
Society, 2000.
[39] E. J. Candès and M. A. Davenport. How well can we estimate a sparse vector? Applied and
Computational Harmonic Analysis, 34(2):317–323, 2013.
[42] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University
Press, 2006.
[43] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction
with expert advice. Machine Learning, 66(2–3):321–352, 2007.
[45] J. E. Cohen, Y. Iwasa, G. Rautu, M. B. Ruskai, E. Seneta, and G. Zbaganu. Relative entropy
under mappings by stochastic matrices. Linear Algebra and its Applications, 179:211–235,
1993.
[47] J. Couzin. Whole-genome data not anonymous, challenging assumptions. Science, 321(5894):
1278, 2008.
[48] T. M. Cover and J. A. Thomas. Elements of Information Theory, Second Edition. Wiley,
2006.
[50] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless
Systems. Academic Press, 1981.
[51] T. Dalenius. Towards a methodology for statistical disclosure control. Statistisk Tidskrift,
15:429–444, 1977.
[52] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss.
Random Structures and Algorithms, 22(1):60–65, 2003.
[53] L. D. Davisson. The prediction error of stationary Gaussian time series of unknown covariance.
IEEE Transactions on Information Theory, 11:527–532, 1965.
[55] P. Del Moral, M. Ledoux, and L. Miclo. On contraction properties of Markov kernels. Prob-
ability Theory and Related Fields, 126:395–420, 2003.
[56] R. L. Dobrushin. Central limit theorem for nonstationary Markov chains. I. Theory of
Probability and Its Applications, 1(1):65–80, 1956.
[57] D. L. Donoho and R. C. Liu. Geometrizing rates of convergence I. Technical Report 137,
University of California, Berkeley, Department of Statistics, 1987.
[58] J. C. Duchi and M. J. Wainwright. Distance-based and continuum Fano inequalities with
applications to statistical estimation. arXiv:1311.2669 [cs.IT], 2013.
[59] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy, data processing inequalities,
and minimax rates. arXiv:1302.3203 [math.ST], 2013. URL http://arxiv.org/abs/1302.3203.
[60] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates.
In 54th Annual Symposium on Foundations of Computer Science, pages 429–438, 2013.
[62] R. M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 1999.
[63] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and
Trends in Theoretical Computer Science, 9(3 & 4):211–407, 2014.
[64] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves:
Privacy via distributed noise generation. In Advances in Cryptology (EUROCRYPT 2006),
2006.
[65] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private
data analysis. In Proceedings of the Third Theory of Cryptography Conference, pages 265–284,
2006.
[66] C. Dwork, G. N. Rothblum, and S. P. Vadhan. Boosting and differential privacy. In 51st
Annual Symposium on Foundations of Computer Science, pages 51–60, 2010.
[67] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Preserving statistical
validity in adaptive data analysis. arXiv:1411.2664v2 [cs.LG], 2014.
[68] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Preserving statis-
tical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM
Symposium on the Theory of Computing, 2015.
[69] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. The reusable holdout:
Preserving statistical validity in adaptive data analysis. Science, 349(6248):636–638, 2015.
[70] A. V. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving
data mining. In Proceedings of the Twenty-Second Symposium on Principles of Database
Systems, pages 211–222, 2003.
[71] D. A. Freedman. On tail probabilities for martingales. Annals of Probability, 3(1):100–118,
1975.
[72] R. Gallager. Source coding with side information and universal coding. Technical Report
LIDS-P-937, MIT Laboratory for Information and Decision Systems, 1979.
[73] D. García-García and R. C. Williamson. Divergences and risks for multiclass experiments.
In Proceedings of the Twenty Fifth Annual Conference on Computational Learning Theory,
2012.
[74] A. Garg, T. Ma, and H. L. Nguyen. On communication cost of distributed statistical estima-
tion and dimensionality. In Advances in Neural Information Processing Systems 27, 2014.
[75] A. Gelman and E. Loken. The garden of forking paths: Why multiple comparisons can
be a problem, even when there is no “fishing expedition” or “p-hacking” and the research
hypothesis was posited ahead of time. Technical report, Columbia University, 2013.
[76] R. P. Gilbert. Function Theoretic Methods in Partial Differential Equations. Academic Press,
1969.
[78] P. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[79] P. Grünwald and A. P. Dawid. Game theory, maximum entropy, minimum discrepancy, and
robust Bayesian decision theory. Annals of Statistics, 32(4):1367–1433, 2004.
[80] A. Guntuboyina. Lower bounds for the minimax risk using f-divergences, and applications.
IEEE Transactions on Information Theory, 57(4):2386–2399, 2011.
[81] L. Györfi and T. Nemetz. f-dissimilarity: A generalization of the affinity of several
distributions. Annals of the Institute of Statistical Mathematics, 30:105–113, 1978.
[83] R. Z. Has’minskii. A lower bound on the risks of nonparametric estimates of densities in the
uniform metric. Theory of Probability and Its Applications, 23:794–798, 1978.
[84] D. Haussler. A general minimax result for relative entropy. IEEE Transactions on Information
Theory, 43(4):1276–1280, 1997.
[86] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online
convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational
Learning Theory, 2006.
[87] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II.
Springer, New York, 1993.
[88] J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer, 2001.
[90] P. J. Huber. Robust Statistics. John Wiley and Sons, New York, 1981.
[92] P. Indyk. Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Com-
putational Geometry. CRC Press, 2004.
[93] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of
dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of
Computing, 1998.
[95] T. S. Jayram. Hellinger strikes back: a note on the multi-party information complexity of
AND. In Proceedings of APPROX and RANDOM 2009, volume 5687 of Lecture Notes in
Computer Science, pages 562–573. Springer, 2009.
[96] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings
of the Royal Society of London, Series A: Mathematical and Physical Sciences, 186:453–461,
1946.
[97] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space.
Contemporary Mathematics, 26:189–206, 1984.
[98] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear
predictors. Information and Computation, 132(1):1–64, 1997.
[99] J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems.
Machine Learning, 45(3):301–329, 2001.
[100] A. Kolmogorov and V. Tikhomirov. ε-entropy and ε-capacity of sets in functional spaces.
Uspekhi Matematicheskikh Nauk, 14(2):3–86, 1959.
[101] V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery
Problems, volume 2033 of Lecture Notes in Mathematics. Springer-Verlag, 2011.
[102] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.
[103] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in
Applied Mathematics, 6:4–22, 1985.
[105] L. Le Cam and G. L. Yang. Asymptotics in Statistics: Some Basic Concepts. Springer, 2000.
[107] E. L. Lehmann and G. Casella. Theory of Point Estimation, Second Edition. Springer, 1998.
[108] F. Liese and I. Vajda. On divergences and informations in statistics and information theory.
IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
[110] M. Madiman and A. Barron. Generalized entropy power inequalities and monotonicity prop-
erties of information. IEEE Transactions on Information Theory, 53(7):2317–2329, 2007.
[115] F. McSherry and K. Talwar. Mechanism design via differential privacy. In 48th Annual
Symposium on Foundations of Computer Science, 2007.
[116] N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory,
44(6):2124–2147, 1998.
[117] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization.
Wiley, 1983.
[118] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation ap-
proach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[120] B. T. Polyak and J. Tsypkin. Robust identification. Automatica, 16:53–63, 1980. doi:
10.1016/0005-1098(80)90086-2.
[121] M. Raginsky. Strong data processing inequalities and φ-Sobolev inequalities for discrete
channels. IEEE Transactions on Information Theory, 62(6):3355–3389, 2016.
[122] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional
linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994,
2011.
[123] M. Reid and R. Williamson. Information, divergence, and risk for binary experiments. Journal
of Machine Learning Research, 12:731–817, 2011.
[124] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions
on Information Theory, 30:629–636, 1984.
[125] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American
Mathematical Society, 58:527–535, 1952.
[127] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. Journal
of Machine Learning Research, to appear, 2014.
[128] D. Russo and B. Van Roy. Learning to optimize via information-directed sampling. In
Advances in Neural Information Processing Systems 27, 2014.
[129] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of
Operations Research, 39(4):1221–1243, 2014.
[130] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal,
27:379–423, 623–656, 1948.
[132] A. Slavkovic and F. Yu. Genomics and privacy. Chance, 28(2):37–39, 2015.
[133] J. Steinhardt and P. Liang. Adaptivity and optimism: An improved exponentiated gradient
algorithm. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[135] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view
of the evidence of two samples. Biometrika, 25(3–4):285–294, 1933.
[137] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic
Mathematics. Cambridge University Press, 1998.
[139] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational
inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
[140] A. Wald. Contributions to the theory of statistical estimation and testing hypotheses. Annals
of Mathematical Statistics, 10(4):299–326, 1939.
[141] S. Warner. Randomized response: a survey technique for eliminating evasive answer bias.
Journal of the American Statistical Association, 60(309):63–69, 1965.
[143] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435.
Springer-Verlag, 1997.