Theory of Probability
Lecture Notes
by Gordan Žitković
Contents

Part I: Theory of Probability I

1 Measurable spaces
   1.1 Families of Sets
   1.2 Measurable mappings
   1.3 Products of measurable spaces
   1.4 Real-valued measurable functions
   1.5 Additional Problems

2 Measures
   2.1 Measure spaces
   2.2 Extensions of measures and the coin-toss space
   2.3 The Lebesgue measure
   2.4 Signed measures
   2.5 Additional Problems

3 Lebesgue Integration
   3.1 The construction of the integral
   3.2 First properties of the integral
   3.3 Null sets
   3.4 Additional Problems

9 Conditional Expectation
   9.1 The definition and existence of conditional expectation
   9.2 Properties
   9.3 Regular conditional distributions
   9.4 Additional Problems

10 Discrete Martingales
   10.1 Discrete-time filtrations and stochastic processes
   10.2 Martingales
   10.3 Predictability and martingale transforms
   10.4 Stopping times
   10.5 Convergence of martingales
   10.6 Additional problems

11 Uniform Integrability
   11.1 Uniform integrability
   11.2 First properties of uniformly-integrable martingales
   11.3 Backward martingales
   11.4 Applications of backward martingales
   11.5 Exchangeability and de Finetti's theorem (*)
   11.6 Additional Problems

Index
Preface
These notes were written (and are still being heavily edited) to help students with the graduate
courses Theory of Probability I and II offered by the Department of Mathematics, University of
Texas at Austin.
Statements, proofs, or entire sections marked by an asterisk (*) are not a part of the syllabus and can be skipped when preparing for midterm, final and prelim exams.

Gordan Žitković
Austin, TX
December 2010.
Part I
Theory of Probability I
Chapter 1
Measurable spaces
Before we delve into measure theory, let us fix some notation and terminology.
- ⊆ denotes a subset (not necessarily proper).
- A set A is said to be countable if there exists an injection (one-to-one mapping) from A into ℕ. Note that finite sets are also countable. Sets which are not countable are called uncountable.
- For two functions f : B → C, g : A → B, the composition f ∘ g : A → C of f and g is given by (f ∘ g)(x) = f(g(x)), for all x ∈ A.
- {An}n∈ℕ denotes a sequence. More generally, (Aγ)γ∈Γ denotes a collection indexed by the set Γ.
1.1 Families of Sets
Definition 1.1 (Order properties) A (countable) family {An}n∈ℕ of subsets of a non-empty set S is said to be
1. increasing if An ⊆ An+1 for all n ∈ ℕ,
2. decreasing if An ⊇ An+1 for all n ∈ ℕ,
3. pairwise disjoint if An ∩ Am = ∅ for m ≠ n,
4. a partition of S if {An}n∈ℕ is pairwise disjoint and ∪n An = S.

We use the notation An ↗ A to denote that the sequence {An}n∈ℕ is increasing and A = ∪n An. Similarly, An ↘ A means that {An}n∈ℕ is decreasing and A = ∩n An.
Definition 1.4 (Generated σ-algebras) For a family A of subsets of a non-empty set S, the intersection of all σ-algebras on S that contain A is denoted by σ(A) and is called the σ-algebra generated by A.

Remark 1.8 Almost all topologies in these notes will be generated by a metric, i.e., a set A ⊆ S will be open if and only if for each x ∈ A there exists ε > 0 such that {y ∈ S : d(x, y) < ε} ⊆ A. The prime example is ℝ, where a set is declared open if it can be represented as a union of open intervals.

Definition 1.9 (Borel σ-algebras) If (S, τ) is a topological space, then the σ-algebra σ(τ), generated by all open sets, is called the Borel σ-algebra on (S, τ).

Remark 1.10 We often abuse terminology and call S itself a topological space, if the topology τ on it is clear from the context. In the same vein, we often speak of the Borel σ-algebra on a set S.
Example 1.11 Some important σ-algebras. Let S be a non-empty set:
1. The set S = 2^S (also denoted by P(S)), consisting of all subsets of S, is a σ-algebra.
2. At the other extreme, the family S = {∅, S} is the smallest σ-algebra on S. It is called the trivial σ-algebra on S.
3. The set S of all subsets of S which are either countable or whose complements are countable is a σ-algebra. It is called the countable-cocountable σ-algebra and is the smallest σ-algebra on S which contains all singletons, i.e., for which {x} ∈ S for all x ∈ S.
4. The Borel σ-algebra on ℝ (generated by all open sets, as defined by the Euclidean metric on ℝ) is denoted by B(ℝ).
Problem 1.12 Show that B(ℝ) = σ(A), for any of the following choices of the family A:
1. A = {all open subsets of ℝ},
2. A = {all closed subsets of ℝ},
1.2 Measurable mappings
Definition 1.13 (Measurable spaces) A pair (S, S) consisting of a non-empty set S and a σ-algebra S of its subsets is called a measurable space.

If (S, S) is a measurable space, and A ∈ S, we often say that A is measurable in S.

Definition 1.14 (Pull-backs and push-forwards) For a function f : S → T and subsets A ⊆ S, B ⊆ T, we define
1. the push-forward f(A) of A ⊆ S as
f(A) = {f(x) : x ∈ A} ⊆ T,
2. the pull-back f⁻¹(B) of B ⊆ T as
f⁻¹(B) = {x ∈ S : f(x) ∈ B} ⊆ S.

It is often the case that the notation is abused and the pull-back of B under f is denoted simply by {f ∈ B}. This notation presupposes, however, that the domain of f is clear from the context.
Problem 1.15 Show that the pull-back operation preserves the elementary set operations, i.e., for f : S → T and B, {Bn}n∈ℕ ⊆ T:
1. f⁻¹(T) = S, f⁻¹(∅) = ∅,
2. f⁻¹(∪n Bn) = ∪n f⁻¹(Bn),
3. f⁻¹(∩n Bn) = ∩n f⁻¹(Bn), and
4. f⁻¹(B^c) = [f⁻¹(B)]^c.
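These identities are easy to sanity-check on small finite sets. The sketch below uses a made-up map f(x) = x² on toy sets S and T (none of these sets come from the text); the helper `preimage` plays the role of the pull-back f⁻¹.

```python
# Finite sanity check of Problem 1.15: pull-backs commute with unions,
# intersections, and complements (hypothetical example, not from the text).

def preimage(f, B, S):
    # f^{-1}(B) = {x in S : f(x) in B}
    return {x for x in S if f(x) in B}

def f(x):
    return x * x                     # f : S -> T

S = set(range(-4, 5))                # S = {-4, ..., 4}
T = set(range(0, 17))                # contains all squares of elements of S

B1, B2 = {0, 1, 4}, {4, 9, 16}
assert preimage(f, B1 | B2, S) == preimage(f, B1, S) | preimage(f, B2, S)
assert preimage(f, B1 & B2, S) == preimage(f, B1, S) & preimage(f, B2, S)
assert preimage(f, T - B1, S) == S - preimage(f, B1, S)
assert preimage(f, T, S) == S and preimage(f, set(), S) == set()
print("all four identities hold")
```

Note that the corresponding statements for push-forwards fail in general, which is why measurability is defined through pull-backs.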
Definition 1.16 (Measurability) A mapping f : S → T, where (S, S) and (T, T) are measurable spaces, is said to be (S, T)-measurable if f⁻¹(B) ∈ S for each B ∈ T.

Remark 1.17 When T = ℝ, we tacitly assume that the Borel σ-algebra is defined on T, and we simply call f measurable. In particular, a function f : ℝ → ℝ which is measurable with respect to the pair of Borel σ-algebras is often called a Borel function.

Proposition 1.18 (A measurability criterion) Let (S, S) and (T, T) be two measurable spaces, and let C be a subset of T such that T = σ(C). If f : S → T is a mapping with the property that f⁻¹(C) ∈ S, for any C ∈ C, then f is (S, T)-measurable.
PROOF Let D be the family of subsets of T defined by
D = {B ⊆ T : f⁻¹(B) ∈ S}.
By the assumptions of the proposition, we have C ⊆ D. On the other hand, by Problem 1.15, the family D has the structure of a σ-algebra, i.e., D is a σ-algebra that contains C. Remembering that T = σ(C) is the smallest σ-algebra that contains C, we conclude that T ⊆ D. Consequently, f⁻¹(B) ∈ S for all B ∈ T.
Problem 1.19 Let (S, S) and (T, T) be measurable spaces.
1. Suppose that S and T are topological spaces, and that S and T are the corresponding Borel σ-algebras. Show that each continuous function f : S → T is (S, T)-measurable. (Hint: Remember that the function f is continuous if the pull-backs of open sets are open.)
2. Let f : S → ℝ be a function. Show that f is measurable if and only if
{x ∈ S : f(x) ≤ q} ∈ S, for all rational q.
3. Find an example of (S, S), (T, T) and a measurable function f : S → T such that f(A) = {f(x) : x ∈ A} ∉ T for all nonempty A ∈ S.
Proposition 1.20 (Compositions of measurable maps) Let (S, S), (T, T) and (U, U) be measurable spaces, and let f : S → T and g : T → U be measurable functions. Then the composition h = g ∘ f : S → U, given by h(x) = g(f(x)), is (S, U)-measurable.
Definition 1.22 (Generation by a function) Let f : S → T be a map from the set S into a measurable space (T, T). The σ-algebra generated by f, denoted by σ(f), is the intersection of all σ-algebras S on S which make f (S, T)-measurable.

Equivalently (why?),
σ(f) = {f⁻¹(B) : B ∈ T}.   (1.2)

1.3 Products of measurable spaces

The letter Γ will typically be used to denote an abstract index set; we only assume that it is nonempty, but make no other assumptions about its cardinality.

For two measurable spaces (S1, S1) and (S2, S2), the product σ-algebra S1 ⊗ S2 on S1 × S2 is generated by the pull-backs of the coordinate projections; note that π2⁻¹(B2) = S1 × B2, for B2 ∈ S2, so that
S1 ⊗ S2 = σ({B1 × S2, S1 × B2 : B1 ∈ S1, B2 ∈ S2}).
Equivalently (why?),
S1 ⊗ S2 = σ({B1 × B2 : B1 ∈ S1, B2 ∈ S2}).
In a completely analogous fashion, we can show that, for finitely many measurable spaces (S1, S1), . . . , (Sn, Sn), we have
⊗_{i=1}^{n} Si = σ({B1 × B2 × ··· × Bn : B1 ∈ S1, B2 ∈ S2, . . . , Bn ∈ Sn}).
The same goes for countable products. Uncountable products, however, behave very differently.
Definition 1.32 (Cylinder sets) Let {(Sγ, Sγ)}γ∈Γ be a family of measurable spaces, and let (∏γ∈Γ Sγ, ⊗γ∈Γ Sγ) be its product. A subset C ⊆ ∏γ∈Γ Sγ is called a cylinder set if there exist a finite subset {γ1, . . . , γn} of Γ, as well as a measurable set B ∈ Sγ1 ⊗ Sγ2 ⊗ ··· ⊗ Sγn, such that
C = {s ∈ ∏γ∈Γ Sγ : (s(γ1), . . . , s(γn)) ∈ B}.
A cylinder set for which the set B can be chosen of the form B = B1 × ··· × Bn, for some B1 ∈ Sγ1, . . . , Bn ∈ Sγn, is called a product cylinder set. In that case
C = {s ∈ ∏γ∈Γ Sγ : s(γ1) ∈ B1, s(γ2) ∈ B2, . . . , s(γn) ∈ Bn}.
Problem 1.33
1. Show that the family of product cylinder sets generates the product σ-algebra.
2. Show that (not-necessarily-product) cylinders are measurable in the product σ-algebra.
3. Which of the 4 families of sets from Definition 1.2 does the collection of all product cylinders belong to in general? How about (not-necessarily-product) cylinders?
Example 1.34 The following example will play a major role in probability theory; hence the name coin-toss space. Here Γ = ℕ and, for i ∈ ℕ, (Si, Si) is the discrete two-element space with Si = {-1, 1} and Si = 2^{Si}. The product ∏_{i∈ℕ} Si = {-1, 1}^ℕ can be identified with the set of all sequences s = (s1, s2, . . . ), where si ∈ {-1, 1}, i ∈ ℕ. For each cylinder set C, there exists (why?) n ∈ ℕ and a subset B of {-1, 1}^n such that
C = {s = (s1, . . . , sn, sn+1, . . . ) ∈ {-1, 1}^ℕ : (s1, . . . , sn) ∈ B}.
The product cylinders are even simpler - they are always of the form C = {-1, 1}^ℕ or C = C_{n1,...,nk; b1,...,bk}, where
C_{n1,...,nk; b1,...,bk} = {s = (s1, s2, . . . ) ∈ {-1, 1}^ℕ : s_{n1} = b1, . . . , s_{nk} = bk},   (1.3)
for some k ∈ ℕ, 1 ≤ n1 < n2 < ··· < nk ∈ ℕ and b1, b2, . . . , bk ∈ {-1, 1}.
We know that the σ-algebra S = ⊗_{i∈ℕ} Si is generated by all projections πi : {-1, 1}^ℕ → {-1, 1}, i ∈ ℕ, where πi(s) = si. Equivalently, by Problem 1.33, S is generated by the collection of all cylinder sets.

Problem 1.35 One can obtain the product σ-algebra S on {-1, 1}^ℕ as the Borel σ-algebra corresponding to a particular topology which makes {-1, 1}^ℕ compact. Here is how. Start by defining a mapping d : {-1, 1}^ℕ × {-1, 1}^ℕ → [0, ∞) by
d(s¹, s²) = 2^{-i(s¹,s²)},   (1.4)
where i(s¹, s²) denotes the smallest index i ∈ ℕ such that s¹_i ≠ s²_i (with d(s¹, s²) = 0 when s¹ = s²).
1.4 Real-valued measurable functions

Let L⁰(S, S; ℝ) (or, simply, L⁰(S; ℝ), L⁰(ℝ) or L⁰, when the domain (S, S) or the co-domain ℝ are clear from the context) be the set of all S-measurable functions f : S → ℝ. The set of non-negative measurable functions is denoted by L⁰₊ or L⁰([0, ∞)).
Proposition 1.36 (Measurable functions form a vector space) L⁰ is a vector space, i.e.,
αf + βg ∈ L⁰, whenever α, β ∈ ℝ, f, g ∈ L⁰.

PROOF Let us define a mapping F : S → ℝ² by F(x) = (f(x), g(x)). By Problem 1.30, the Borel σ-algebra on ℝ² is the same as the product σ-algebra when we interpret ℝ² as a product of two copies of ℝ. Therefore, since its compositions with the coordinate projections are precisely the functions f and g, Problem 1.31 implies that F is (S, B(ℝ²))-measurable.

Consider the function φ : ℝ² → ℝ given by φ(x, y) = αx + βy. It is linear and, therefore, continuous. By Corollary 1.21, the composition φ ∘ F : S → ℝ is (S, B(ℝ))-measurable, and it only remains to note that
(φ ∘ F)(x) = φ(F(x)) = αf(x) + βg(x), i.e., φ ∘ F = αf + βg.

In a similar manner (the functions (x, y) ↦ max(x, y) and (x, y) ↦ xy are continuous from ℝ² to ℝ - why?) one can prove the following proposition.
One can define convergence (and topology) on the extended real line ℝ̄ using such a metric. For example, a sequence {xn}n∈ℕ in ℝ̄ converges to +∞ if
a) it contains only a finite number of terms equal to -∞, and
b) every subsequence of {xn}n∈ℕ whose elements are in ℝ converges to +∞ (in the usual sense).

We define the notions of limit superior and limit inferior on ℝ̄ for a sequence {xn}n∈ℕ in the following manner:
lim sup_n xn = inf_n Sn, where Sn = sup_{k≥n} xk,
and
lim inf_n xn = sup_n In, where In = inf_{k≥n} xk.
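For intuition, the inf-of-suprema and sup-of-infima formulas can be evaluated on a long finite prefix of a sequence. The following is a numerical sketch only: the sequence, truncation length, and scanning horizon are arbitrary choices, and the result is an approximation, not a proof.

```python
# Approximate lim sup / lim inf via S_n = sup_{k>=n} x_k and I_n = inf_{k>=n} x_k,
# scanning tails of a truncated sequence only up to a finite horizon.

def lim_sup(xs, horizon):
    # inf over n < horizon of S_n = max of the tail xs[n:]
    return min(max(xs[n:]) for n in range(horizon))

def lim_inf(xs, horizon):
    # sup over n < horizon of I_n = min of the tail xs[n:]
    return max(min(xs[n:]) for n in range(horizon))

# x_n = (-1)^n (1 + 1/n): even terms approach +1, odd terms approach -1.
xs = [(-1) ** n * (1 + 1 / n) for n in range(1, 2001)]
print(round(lim_sup(xs, 1000), 2))   # 1.0
print(round(lim_inf(xs, 1000), 2))   # -1.0
```

The horizon must stay well below the truncation length; otherwise the last few tails dominate and distort both values.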
If you have forgotten how to manipulate limits inferior and superior, here is an exercise to remind you:

Problem 1.41 Let {xn}n∈ℕ be a sequence in ℝ̄. Prove the following statements:
1. a ∈ ℝ̄ satisfies a ≥ lim sup_n xn if and only if for any ε ∈ (0, ∞) there exists nε ∈ ℕ such that xn ≤ a + ε for n ≥ nε.
2. lim inf_n xn ≤ lim sup_n xn.
3. Define A as the set of all accumulation points of {xn}n∈ℕ, i.e., of all limits of its convergent (in ℝ̄) subsequences. Show that
{lim inf_n xn, lim sup_n xn} ⊆ A ⊆ [lim inf_n xn, lim sup_n xn].
We define B(ℝ̄) and the notion of measurability for functions mapping a measurable space (S, S) into ℝ̄ analogously.

Problem 1.42 Show that a subset A ⊆ ℝ̄ is in B(ℝ̄) if and only if A \ {-∞, ∞} is Borel in ℝ.

For a sequence {fn}n∈ℕ of ℝ̄-valued functions, the limits superior and inferior are defined pointwise:
(lim sup_n fn)(x) = lim sup_n fn(x) = inf_n ( sup_{k≥n} fk(x) )
and
(lim inf_n fn)(x) = lim inf_n fn(x) = sup_n ( inf_{k≥n} fk(x) ).
Then, we have the following result, where the supremum and infimum of a sequence of functions are defined pointwise (just like the limits superior and inferior).

Proposition 1.43 (Limiting operations preserve measurability) Let {fn}n∈ℕ be a sequence in L⁰(ℝ̄). Then:
1. sup_n fn and inf_n fn are in L⁰(ℝ̄),
2. lim sup_n fn and lim inf_n fn are in L⁰(ℝ̄),
3. if f(x) = lim_n fn(x) exists for all x ∈ S, then f ∈ L⁰(ℝ̄),
4. the set A = {x ∈ S : lim_n fn(x) exists in ℝ̄} belongs to S.

PROOF
1. The claim follows from the identity {x ∈ S : sup_n fn(x) ≤ c} = ∩_n {x ∈ S : fn(x) ≤ c}, valid for each c ∈ ℝ̄, the measurability of fn for each n ∈ ℕ, and the fact that σ-algebras are closed with respect to countable intersections. The infimum is handled analogously.
2. Define gn = sup_{k≥n} fk and use part 1. above to conclude that gn ∈ L⁰(ℝ̄) for each n ∈ ℕ. Another appeal to part 1. yields that lim sup_n fn = inf_n gn is in L⁰(ℝ̄). The statement about the limit inferior follows in the same manner.
3. If the limit f(x) = lim_n fn(x) exists for all x ∈ S, then f = lim inf_n fn, which is measurable by part 2. above.
4. The statement follows from the fact that A = f⁻¹({0}), where
f(x) = arctan( lim sup_n fn(x) ) - arctan( lim inf_n fn(x) ).
(Note: The unexpected use of the function arctan is really nothing to be puzzled by. The only property needed is its measurability (it is continuous) and its monotonicity+bijectivity from ℝ̄ to [-π/2, π/2].)
1.5 Additional Problems

4. Show that
a_{n+1} = ∑_{k=0}^{n} C(n, k) a_k,
where C(n, k) denotes the binomial coefficient.
5. Show that the exponential generating function for the sequence {a_n}_{n∈ℕ} is f(x) = e^{e^x - 1}, i.e., that
∑_{n=0}^{∞} a_n x^n / n! = e^{e^x - 1} or, equivalently, a_n = (d^n/dx^n) e^{e^x - 1} |_{x=0}.
Problem 1.46 Let (S, S) be a measurable space. For f, g ∈ L⁰ show that the sets {f = g} = {x ∈ S : f(x) = g(x)} and {f < g} = {x ∈ S : f(x) < g(x)} are in S.

Problem 1.47 Show that all
1. monotone,
2. convex
functions f : ℝ → ℝ are measurable.
Chapter 2
Measures

2.1 Measure spaces
Definition 2.1 (Measure) Let (S, S) be a measurable space. A mapping μ : S → [0, ∞] is called a (positive) measure if
1. μ(∅) = 0, and
2. μ(∪n An) = ∑_{n∈ℕ} μ(An), for all pairwise disjoint sequences {An}n∈ℕ in S.
Remark 2.2
1. A mapping whose domain is some nonempty set A of subsets of some set S is sometimes called a set function.
2. If requirement 2. in the definition of the measure is weakened so that it is only required that μ(A1 ∪ ··· ∪ An) = μ(A1) + ··· + μ(An), for n ∈ ℕ and pairwise disjoint A1, . . . , An, we say that the mapping μ is a finitely-additive measure. If we want to stress that a mapping μ satisfies the original requirement 2. for sequences of sets, we say that μ is σ-additive (countably additive).
When S is countable and S = 2^S, every measure μ on (S, S) takes the form
μ(A) = ∑_{x∈A} p(x), A ⊆ S,
for some function p : S → [0, ∞] (why?). In particular, for a finite set S with N elements, if p(x) = 1/N then μ is a probability measure called the uniform measure on S. It has the property that μ(A) = #A/#S, where # denotes the cardinality (number of elements).
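A minimal sketch of the uniform measure on a finite set; the six-point set S below (standing in for a die roll) is a made-up illustration, not an example from the text.

```python
# Uniform measure mu(A) = #A / #S on a hypothetical six-element set.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}

def mu(A):
    # defined only for subsets A of S; exact arithmetic via Fraction
    assert A <= S
    return Fraction(len(A), len(S))

evens, odds = {2, 4, 6}, {1, 3, 5}
print(mu(evens))      # 1/2
print(mu(set()))      # 0, as required of any measure
# finite additivity on the disjoint sets evens and odds:
assert mu(evens | odds) == mu(evens) + mu(odds) == 1
```

Since S is finite, σ-additivity reduces to finite additivity here, which the last line checks on one disjoint pair.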
(Note: It is possible to construct very simple-looking finitely-additive measures which are not σ-additive. For example, there exist {0, 1}-valued finitely-additive measures on all subsets of ℕ which are not σ-additive. Such objects are called ultrafilters and their existence is equivalent to a certain version of the Axiom of Choice.)
Proposition 2.6 (First properties of measures) Let (S, S, μ) be a measure space, and let A, B and A1, A2, . . . be elements of S.
1. If A1, . . . , An are pairwise disjoint, then
∑_{i=1}^{n} μ(Ai) = μ(∪_{i=1}^{n} Ai).   (Finite additivity)
2. If A ⊆ B, then
μ(A) ≤ μ(B).   (Monotonicity of measures)
3. If {An}n∈ℕ in S is increasing, then
μ(∪n An) = lim_n μ(An) = sup_n μ(An).   (Continuity with respect to increasing sequences)
4. If {An}n∈ℕ in S is decreasing and μ(A1) < ∞, then
μ(∩n An) = lim_n μ(An) = inf_n μ(An).   (Continuity with respect to decreasing sequences)
5. For any sequence {An}n∈ℕ in S,
μ(∪n An) ≤ ∑_{n∈ℕ} μ(An).   (Subadditivity)

PROOF

1. Note that the sequence A1, A2, . . . , An, ∅, ∅, . . . is pairwise disjoint, and so, by σ-additivity,
μ(∪_{i=1}^{n} Ai) = μ(∪_{i∈ℕ} Ai) = ∑_{i∈ℕ} μ(Ai) = ∑_{i=1}^{n} μ(Ai) + ∑_{i=n+1}^{∞} μ(∅) = ∑_{i=1}^{n} μ(Ai).

2. Write B as a disjoint union A ∪ (B \ A) of elements of S. By 1. above,
μ(B) = μ(A) + μ(B \ A) ≥ μ(A).

3. Define B1 = A1 and Bn = An \ A_{n-1} for n > 1. Then {Bn}n∈ℕ is a pairwise disjoint sequence in S with ∪_{k=1}^{n} Bk = An for each n ∈ ℕ (why?). By σ-additivity we have
μ(∪n An) = μ(∪n Bn) = ∑_{n∈ℕ} μ(Bn) = lim_n ∑_{k=1}^{n} μ(Bk) = lim_n μ(An) = sup_n μ(An).

4. Apply part 3. to the increasing sequence {A1 \ An}n∈ℕ; since μ(A1) < ∞, we may subtract to get μ(An) = μ(A1) - μ(A1 \ An), and the claim follows.

5. For two sets A1, A2 ∈ S, we have
μ(A1) + μ(A2) = (μ(A1 \ A2) + μ(A1 ∩ A2)) + (μ(A2 \ A1) + μ(A1 ∩ A2))
= μ(A1 \ A2) + μ(A2 \ A1) + 2μ(A1 ∩ A2),
and so
μ(A1) + μ(A2) - μ(A1 ∪ A2) = μ(A1 ∩ A2) ≥ 0.
By induction,
μ(A1 ∪ ··· ∪ An) ≤ ∑_{k=1}^{n} μ(Ak).
The sequence {Bn}n∈ℕ given by Bn = ∪_{k=1}^{n} Ak is increasing, so the continuity of measures with respect to increasing sequences implies that
μ(∪n An) = μ(∪n Bn) = lim_n μ(Bn) = lim_n μ(A1 ∪ ··· ∪ An) ≤ ∑_{n∈ℕ} μ(An).
Remark 2.7 The condition μ(A1) < ∞ in part 4. of Proposition 2.6 cannot be significantly relaxed. Indeed, let μ be the counting measure on ℕ, and let An = {n, n + 1, . . . }. Then μ(An) = ∞ and, so, lim_n μ(An) = ∞. On the other hand, ∩n An = ∅, so μ(∩n An) = 0.

In addition to unions and intersections, one can produce other important new sets from sequences of old ones. More specifically, let {An}n∈ℕ be a sequence of subsets of S. The subset lim inf_n An of S, defined by
lim inf_n An = ∪n Bn, where Bn = ∩_{k≥n} Ak,
is called the limit inferior of the sequence {An}n∈ℕ. It is also denoted by lim_n An or {An, ev.} (ev. stands for eventually). The reason for this last notation is the following: lim inf_n An is the set of all x ∈ S which belong to An for all but finitely many values of the index n.

Similarly, the subset lim sup_n An of S, defined by
lim sup_n An = ∩n Bn, where Bn = ∪_{k≥n} Ak,
is called the limit superior of the sequence {An}n∈ℕ. It is also denoted by lim_n An or {An, i.o.} (i.o. stands for infinitely often). In words, lim sup_n An is the set of all x ∈ S which belong to An for infinitely many values of n. Clearly, we have
lim inf_n An ⊆ lim sup_n An.
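On finite truncations these definitions can be explored directly. The sketch below is only an approximation (the tail scan is cut off at half the prefix length, an arbitrary choice), and the alternating sequence of sets is a made-up example, not one from the text.

```python
# Finite-horizon sketch of lim sup / lim inf of a sequence of sets:
# x is in lim sup A_n iff it lies in infinitely many A_n, and in
# lim inf A_n iff it lies in all but finitely many A_n.

def lim_sup_sets(As):
    # intersection over n of B_n = union of A_k for k >= n
    result = set.union(*As)
    for n in range(len(As) // 2):
        result &= set.union(*As[n:])
    return result

def lim_inf_sets(As):
    # union over n of B_n = intersection of A_k for k >= n
    result = set()
    for n in range(len(As) // 2):
        result |= set.intersection(*As[n:])
    return result

# A_n alternates between {1} and {2}: each point appears infinitely often,
# but neither appears in all but finitely many A_n.
As = [{1} if n % 2 == 0 else {2} for n in range(100)]
print(lim_sup_sets(As))   # {1, 2}
print(lim_inf_sets(As))   # set()
```

The output illustrates the strict inclusion lim inf An ⊆ lim sup An: here the limit inferior is empty while the limit superior is everything.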
In this notation, the (first) Borel-Cantelli Lemma states that μ(lim sup_n An) = 0 whenever ∑_{n∈ℕ} μ(An) < ∞.

PROOF Set Bn = ∪_{k≥n} Ak, so that {Bn}n∈ℕ is a decreasing sequence of sets in S with lim sup_n An = ∩n Bn. Using the subadditivity of measures of Proposition 2.6, part 5., we get
μ(Bn) ≤ ∑_{k=n}^{∞} μ(Ak).   (2.1)
Since ∑_{n∈ℕ} μ(An) converges, the right-hand side of (2.1) can be made arbitrarily small by choosing large enough n ∈ ℕ. Hence μ(lim sup_n An) = 0.
2.2 Extensions of measures and the coin-toss space

Example 1.34 of Chapter 1 has introduced a measurable space ({-1, 1}^ℕ, S), where S is the product σ-algebra on {-1, 1}^ℕ. The purpose of the present section is to turn ({-1, 1}^ℕ, S) into a measure space, i.e., to define a suitable measure on it. It is easy to construct just any measure on {-1, 1}^ℕ, but the one we are after is the one which will justify the name coin-toss space.

The intuition we have about tossing a fair coin infinitely many times should help us start with the definition of the coin-toss measure - denoted by μC - on cylinders. Since the coordinate spaces {-1, 1} are particularly simple, each product cylinder is of the form C = {-1, 1}^ℕ or C = C_{n1,...,nk; b1,...,bk}, as given by (1.3), for a choice 1 ≤ n1 < n2 < ··· < nk ∈ ℕ of coordinates and the corresponding values b1, . . . , bk ∈ {-1, 1}. In the language of elementary probability, each such cylinder corresponds to the event in which the outcome of the ni-th coin toss is bi ∈ {-1, 1}, for i = 1, . . . , k. The measure (probability) of this event can only be given by
μC(C_{n1,...,nk; b1,...,bk}) = 1/2 · 1/2 ··· 1/2 (k times) = 2^{-k}.   (2.2)
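The cylinder formula (2.2) can be cross-checked by brute-force counting over the first n coins; the constraint dictionaries below are made-up examples, and the function name is ours, not the text's.

```python
# Coin-toss measure of a product cylinder: fixing k coordinates to values
# in {-1, +1} yields measure 2^(-k), regardless of which coordinates.
from fractions import Fraction
from itertools import product

def mu_C_cylinder(constraints):
    # constraints: dict {coordinate index (1-based): required value}
    assert all(b in (-1, 1) for b in constraints.values())
    return Fraction(1, 2) ** len(constraints)

print(mu_C_cylinder({1: 1}))           # one fixed coin  -> 1/2
print(mu_C_cylinder({3: -1, 7: 1}))    # two fixed coins -> 1/4

# Consistency check by counting: among the 2^n sign patterns of the first
# n coins, the fraction satisfying the constraints equals 2^(-k).
n, cons = 8, {3: -1, 7: 1}
hits = sum(all(p[i - 1] == b for i, b in cons.items())
           for p in product((-1, 1), repeat=n))
assert Fraction(hits, 2 ** n) == mu_C_cylinder(cons)
```

The counting check is exactly the "fair coin" intuition: each of the 2^n equally likely patterns contributes 2^{-n}, and 2^{n-k} of them satisfy the k constraints.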
The hard part is to extend this definition to all elements of S, and not only cylinders. For example, in order to state the law of large numbers later on, we will need to be able to compute the measure of the set
{s ∈ {-1, 1}^ℕ : lim_n (1/n) ∑_{k=1}^{n} sk = 0}.
Theorem 2.10 (Caratheodory's Extension Theorem) Let S be a non-empty set, let A be an algebra of its subsets and let μ : A → [0, ∞] be a set-function with the following properties:
1. μ(∅) = 0, and
2. μ(A) = ∑_{n=1}^{∞} μ(An), if {An}n∈ℕ is a pairwise-disjoint family in A and A = ∪n An ∈ A.
Then there exists a measure μ̂ on (S, σ(A)) with the property that μ̂(A) = μ(A), for all A ∈ A.
A finitely-additive measure μ on an algebra A which, additionally, satisfies
lim_n μ(An) = 0, whenever {An}n∈ℕ is a decreasing sequence in A with ∩n An = ∅,   (2.3)
is σ-additive on A, i.e., it satisfies the conditions of Theorem 2.10.

The part about finite additivity is easy (but messy) and we leave it to the reader:

Problem 2.13 Show that the set-function μC, defined by (2.2) on the algebra A of cylinders, is finitely additive.

Lemma 2.14 (Conditions of Caratheodory's theorem) The set-function μC, defined by (2.2) on the algebra A of cylinders, has the property (2.3).

PROOF By Problem 1.35, cylinders are closed sets, and so {An}n∈ℕ is a sequence of closed sets whose intersection is empty. The same problem states that {-1, 1}^ℕ is compact, so, by the finite-intersection property¹, we have A_{n1} ∩ ··· ∩ A_{nk} = ∅, for some finite collection n1, . . . , nk of indices. Since {An}n∈ℕ is decreasing, we must have An = ∅, for all n ≥ nk, and, consequently, lim_n μC(An) = 0.

Proposition 2.15 (Existence of the coin-toss measure) There exists a measure μC on ({-1, 1}^ℕ, S) with the property that (2.2) holds for all product cylinders.
Theorem 2.16 (Dynkin's π-λ Theorem) Let P be a π-system on a non-empty set S, and let Λ be a λ-system which contains P. Then Λ also contains the σ-algebra σ(P) generated by P.

PROOF Using the result of part 4. of Problem 1.3, we only need to prove that λ(P) (where λ(P) denotes the λ-system generated by P) is a π-system. For A ⊆ S, let G_A denote the family of all subsets of S whose intersections with A are in λ(P):
G_A = {C ⊆ S : C ∩ A ∈ λ(P)}.
Claim 1: G_A is a λ-system for A ∈ λ(P).
- Since A ∈ λ(P), clearly S ∈ G_A.
- For an increasing family {Cn}n∈ℕ in G_A we have (∪n Cn) ∩ A = ∪n (Cn ∩ A). Each Cn ∩ A is in λ(P), and the family {Cn ∩ A}n∈ℕ is increasing, so (∪n Cn) ∩ A ∈ λ(P).
¹ The finite-intersection property refers to the following fact, familiar from real analysis: if a family of closed subsets of a compact topological space has empty intersection, then it admits a finite subfamily with an empty intersection.
- Finally, for C1, C2 ∈ G_A with C1 ⊆ C2, we have
(C2 \ C1) ∩ A = (C2 ∩ A) \ (C1 ∩ A) ∈ λ(P),
because C1 ∩ A ⊆ C2 ∩ A.

Since P is a π-system, for any A ∈ P we have P ⊆ G_A. Therefore, λ(P) ⊆ G_A, because G_A is a λ-system. In other words, for A ∈ P and B ∈ λ(P), we have A ∩ B ∈ λ(P). That means, however, that P ⊆ G_B, for any B ∈ λ(P). Using the fact that G_B is a λ-system, we must also have λ(P) ⊆ G_B, for any B ∈ λ(P), i.e., A ∩ B ∈ λ(P), for all A, B ∈ λ(P), which shows that λ(P) is a π-system.
Proposition 2.17 (Measures which agree on a π-system) Let (S, S) be a measurable space, and let P be a π-system which generates S. Suppose that μ1 and μ2 are two measures on S with the property that μ1(S) = μ2(S) < ∞ and
μ1(A) = μ2(A), for all A ∈ P.
Then μ1 = μ2, i.e., μ1(A) = μ2(A), for all A ∈ S.

PROOF Let L be the family of all subsets A of S for which μ1(A) = μ2(A). Clearly P ⊆ L, but L is, potentially, bigger. In fact, it follows easily from the elementary properties of measures (see Proposition 2.6) and the fact that μ1(S) = μ2(S) < ∞ that L necessarily has the structure of a λ-system. By Theorem 2.16 (the π-λ Theorem), L contains the σ-algebra generated by P, i.e., S ⊆ L. On the other hand, by definition, L ⊆ S and so μ1 = μ2.

Remark 2.18 It seems that the structure of a λ-system is defined so that it exactly describes the structure of the family of all sets on which two measures (with the same total mass) agree. The structure of the π-system corresponds to the minimal assumption that allows Proposition 2.17 to hold.
Proposition 2.19 (Uniqueness of the coin-toss measure) The measure μC is the unique measure on ({-1, 1}^ℕ, S) with the property that (2.2) holds for all product cylinders.

PROOF The existence is the content of Proposition 2.15. To prove uniqueness, it suffices to note that algebras are π-systems and use Proposition 2.17.

Problem 2.20 Define D1, D2 ⊆ {-1, 1}^ℕ by
1. D1 = {s ∈ {-1, 1}^ℕ : lim sup_n sn = 1},
2. D2 = {s ∈ {-1, 1}^ℕ : there exists N ∈ ℕ such that sN = sN+1 = sN+2}.
Show that D1, D2 ∈ S and compute μC(D1), μC(D2).
Our next task is to probe the structure of the σ-algebra S on {-1, 1}^ℕ a little bit more and show that S ≠ 2^{{-1,1}^ℕ}. It is interesting that such a result (which deals exclusively with the structure of S) requires the use of a measure in its proof.

Example 2.21 ((*) A non-measurable subset of {-1, 1}^ℕ) Since σ-algebras are closed under countable set operations, and since the product σ-algebra S for the coin-toss space {-1, 1}^ℕ is generated by sets obtained by restricting finite collections of coordinates, one is tempted to think that S contains all subsets of {-1, 1}^ℕ. That is not the case. We will use the axiom of choice, together with the fact that the measure μC can be defined on the whole of S, to construct an example of a non-measurable set.

Let us start by constructing a relation ∼ on {-1, 1}^ℕ in the following way: we set s¹ ∼ s² if and only if there exists n ∈ ℕ such that s¹_k = s²_k, for k ≥ n (here, as always, sⁱ = (sⁱ_1, sⁱ_2, . . . ), i = 1, 2). In words, s¹ and s² are related if they only differ in a finite number of coordinates. It is easy to check that ∼ is an equivalence relation and that it splits {-1, 1}^ℕ into disjoint equivalence classes. One of the many equivalent forms of the axiom of choice states that there exists a subset N of {-1, 1}^ℕ which contains exactly one element from each of the equivalence classes.

Let us suppose that N is an element of S and see if we can reach a contradiction. Let F denote the set of all finite subsets of ℕ. For each nonempty n = {n1, . . . , nk} ∈ F, let us define the mapping Tn : {-1, 1}^ℕ → {-1, 1}^ℕ in the following manner:
(Tn(s))_l = -s_l, if l ∈ n, and (Tn(s))_l = s_l, if l ∉ n.
In words, Tn flips the signs of the elements of its argument on the positions corresponding to n. We define T∅ = Id, i.e., T∅(s) = s.

Since n is finite, Tn preserves the ∼-equivalence class of each element. Consequently (and using the fact that N contains exactly one element from each equivalence class), the sets N and Tn(N) = {Tn(s) : s ∈ N} are disjoint for n ≠ ∅. Similarly and more generally, the sets Tn(N) and Tn′(N) are also disjoint whenever n ≠ n′. On the other hand, each s ∈ {-1, 1}^ℕ is equivalent to some ŝ ∈ N, i.e., it can be obtained from ŝ by flipping a finite number of coordinates. Therefore, the family
𝒩 = {Tn(N) : n ∈ F}
covers all of {-1, 1}^ℕ.

² Actually, we say that a map f from a measure space (S, S, μS) to a measure space (T, T, μT) is measure preserving if it is measurable and μS(f⁻¹(A)) = μT(A), for all A ∈ T. The involutivity of the map Tn implies that this general definition agrees with our usage in this example.
For n ∈ F, define the set-function μn : S → [0, 1] by
μn(A) = μC(Tn(A)), for A ∈ S.
It is a simple matter to show that μn is, in fact, a measure on (S, S) with μn(S) = 1. Moreover, thanks to the simple form (2.2) of the action of the measure μC on cylinders, it is clear that μn = μC on the algebra of all cylinders. It suffices to invoke Proposition 2.17 to conclude that μn = μC on the entire S, i.e., that Tn preserves μC.

The above properties of the maps Tn, n ∈ F, imply the following: 𝒩 is a partition of {-1, 1}^ℕ into countably many measurable subsets of equal measure. Such a partition {N1, N2, . . . } cannot exist, however. Indeed, if it did, one of the following two cases would occur:
1. μC(N1) = 0. In that case μC(S) = μC(∪k Nk) = ∑_k μC(Nk) = ∑_k 0 = 0 ≠ 1 = μC(S).
2. μC(N1) = α > 0. In that case μC(S) = μC(∪k Nk) = ∑_k μC(Nk) = ∑_k α = ∞ ≠ 1 = μC(S).
Therefore, the set N cannot be measurable in S.

(Note: Somewhat heavier set-theoretic machinery can be used to prove that most of the subsets of {-1, 1}^ℕ are not in S, in the sense that the cardinality of the σ-algebra S is strictly smaller than the cardinality of the family of all subsets of {-1, 1}^ℕ.)
2.3 The Lebesgue measure

As we shall see, the coin-toss space can be used as a sort of universal measure space in probability theory. We use it here to construct the Lebesgue measure on [0, 1]. We start with a notion somewhat dual to the already-introduced notion of the pull-back from Definition 1.14. We leave it as an exercise for the reader to show that the set function f∗(μ) from Definition 2.22 is indeed a measure.
Definition 2.22 (Push-forwards) Let (S, S, μ) be a measure space, let (T, T) be a measurable space, and let f : S → T be a measurable mapping. The measure f∗(μ) on (T, T), defined by
f∗(μ)(B) = μ(f⁻¹(B)), for B ∈ T,
is called the push-forward of the measure μ by f.
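On a finite space, Definition 2.22 can be sketched directly with point weights; the five-point measure and the map f = abs below are made-up illustrations, not examples from the text.

```python
# Push-forward on a finite space: f_*(mu)(B) = mu(f^{-1}(B)),
# with mu given by point weights p(x) (hypothetical data).
from fractions import Fraction

mu = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}   # uniform on S
f = abs                                                # f : S -> T = {0, 1, 2}

def push_forward(mu, f, B):
    # sum the weights of all x with f(x) in B, i.e. mu(f^{-1}(B))
    return sum((w for x, w in mu.items() if f(x) in B), Fraction(0))

print(push_forward(mu, f, {1}))        # mu({-1, 1}) = 2/5
print(push_forward(mu, f, {0, 1, 2}))  # total mass is preserved: 1
```

The second line illustrates why push-forwards of probability measures are again probability measures: f⁻¹(T) = S, so the total mass carries over.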
Define the mapping f : {-1, 1}^ℕ → [0, 1] by
f(s) = ∑_{k=1}^{∞} ((1 + sk)/2) 2^{-k}, for s ∈ {-1, 1}^ℕ.
The idea is to use f to establish a correspondence between all real numbers in [0, 1] and their expansions in the binary system, with the coding -1 ↦ 0 and 1 ↦ 1. It is interesting to note that f is not one-to-one³, as it, for example, maps s¹ = (-1, 1, 1, . . . ) and s² = (1, -1, -1, . . . ) into the same value - namely 1/2.

Let us show, first, that the map f is continuous in the metric d defined by (1.4) of Problem 1.35. Indeed, we pick s¹ and s² in {-1, 1}^ℕ and remember that for d(s¹, s²) = 2^{-n}, the first n - 1 coordinates of s¹ and s² coincide. Therefore,
|f(s¹) - f(s²)| ≤ ∑_{k=n}^{∞} 2^{-k} = 2^{-n+1} = 2 d(s¹, s²).

³ The reason for this is, poetically speaking, that [0, 1] is not the Cantor set.
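The coding map and its failure of injectivity can be checked on finite prefixes, treating the unlisted tail as all -1 (i.e., all binary zeros). This is a sketch of partial sums only, not the full infinite series.

```python
# Partial sums of f(s) = sum_k ((1 + s_k)/2) * 2^(-k) over a finite prefix s,
# with the coding -1 -> digit 0 and 1 -> digit 1.
from fractions import Fraction

def f(s):
    return sum(Fraction(1 + sk, 2) * Fraction(1, 2) ** k
               for k, sk in enumerate(s, start=1))

s2 = [1, -1, -1, -1, -1, -1]   # 0.100000... in binary
s1 = [-1, 1, 1, 1, 1, 1]       # 0.011111... in binary (truncated)
print(f(s2))                    # exactly 1/2
print(f(s1))                    # 31/64, approaching 1/2 as the prefix grows
```

Extending the prefix of s1 pushes its value up toward 1/2 geometrically, which is the finite shadow of the two-expansions phenomenon noted above.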
Proposition 2.23 (Intuitive properties of the Lebesgue measure) The Lebesgue measure on
([0, 1], B([0, 1])) satisfies
(2.4)
Proof

1. Consider a, b of the form a = k/2ⁿ and b = (k+1)/2ⁿ, for n ∈ ℕ and k < 2ⁿ. For such a, b we have f⁻¹([a, b)) = C_{1,...,n; c₁,c₂,...,cₙ}, where c₁c₂...cₙ is the base-2 expansion of k (after the recoding 0 ↦ −1, 1 ↦ 1). By the very definition of λ and the form (2.2) of the action of the coin-toss measure μ_C on cylinders, we have

λ([a, b)) = μ_C(f⁻¹([a, b))) = μ_C(C_{1,...,n; c₁,c₂,...,cₙ}) = 2^{−n} = (k+1)/2ⁿ − k/2ⁿ.

Therefore, (2.4) holds for a, b of the form a = k/2ⁿ and b = l/2ⁿ, for n ∈ ℕ, k < 2ⁿ and l = k + 1. Using (finite) additivity of λ, we immediately conclude that (2.4) holds for all k < l, i.e., that it holds for all dyadic rationals. A general a ∈ (0, 1] can be approximated by an increasing sequence {qₙ}_{n∈ℕ} of dyadic rationals from the left, and the continuity of measures with respect to decreasing sequences implies that

λ([a, p)) = λ(∩ₙ [qₙ, p)) = limₙ λ([qₙ, p)) = limₙ (p − qₙ) = p − a,

whenever a ∈ (0, 1] and p > a is a dyadic rational. In order to remove the dyadicity requirement from the right endpoint, we approximate b from the left by a sequence {pₙ}_{n∈ℕ} of dyadic rationals with pₙ > a, and use the continuity with respect to increasing sequences to get, for a < b ∈ (0, 1],

λ([a, b)) = λ(∪ₙ [a, pₙ)) = limₙ λ([a, pₙ)) = limₙ (pₙ − a) = b − a.
Problem 2.24 (Translation invariance mod 1) For B ∈ B([0, 1]) and x ∈ [0, 1], show that

1. B +₁ x = {b +₁ x : b ∈ B} is in B([0, 1]), and
2. λ(B +₁ x) = λ(B),

where, for a, x ∈ [0, 1],

a +₁ x = a + x, if a + x ≤ 1, and a +₁ x = a + x − 1, if a + x > 1.

(Geometrically, the set B +₁ x is obtained from B by translating it to the right by x and then shifting the part that is sticking out by 1 to the left.) (Hint: Use Proposition 2.17 for the second part.)
Finally, the notion of the Lebesgue measure is just as useful on the entire ℝ as on its compact subset [0, 1]. For a general B ∈ B(ℝ), we can define the Lebesgue measure of B by measuring its intersections with all intervals of the form [n, n + 1), and adding them together, i.e.,

λ(B) = Σ_{n=−∞}^{∞} λ( (B ∩ [n, n + 1)) − n ).

Note how we are overloading the notation and using the letter λ for both the Lebesgue measure on [0, 1] and the Lebesgue measure on ℝ.
It is quite tedious, but does not require any new tools, to show that many of the properties of λ on [0, 1] transfer to λ on ℝ:
Problem 2.25 Let λ be the Lebesgue measure on (ℝ, B(ℝ)). Show that

1. λ([a, b)) = b − a and λ({a}) = 0, for a < b,
2. λ is σ-finite but not finite,
3. λ(B + x) = λ(B), for all B ∈ B(ℝ) and x ∈ ℝ, where B + x = {b + x : b ∈ B}.
Remark 2.26 The existence of the Lebesgue measure allows us to show quickly that the converse of the implication in the Borel-Cantelli lemma does not hold without additional conditions, even if μ is a probability measure. Indeed, let μ = λ be the Lebesgue measure on [0, 1], and set Aₙ = (0, 1/n], for n ∈ ℕ, so that Σₙ λ(Aₙ) = Σₙ 1/n = ∞, while

lim supₙ Aₙ = ∩ₙ ∪_{k≥n} Aₖ = ∩ₙ Aₙ = ∅,

and therefore λ(lim supₙ Aₙ) = 0. We will see later that the converse does hold if the family of sets {Aₙ}_{n∈ℕ} satisfies the additional condition of independence.
2.4

Signed measures

In addition to (positive) measures, it is sometimes useful to know a few things about measure-like set functions which take values in (−∞, ∞] (and not only in [0, ∞]).
A signed measure is required to be countably additive in the following sense: for each pairwise-disjoint sequence {Aₙ}_{n∈ℕ} in S, the sequence {μ(Aₙ)}_{n∈ℕ} is summable and μ(∪ₙ Aₙ) = Σₙ μ(Aₙ). The notion of convergence here is applied to sequences that may take the value ∞, so we need to be precise about how it is defined. Remember that a⁺ = max(a, 0) and a⁻ = max(−a, 0).

Definition 2.28 (Summability for sequences) A sequence {aₙ}_{n∈ℕ} in (−∞, ∞] is said to be summable if Σ_{n∈ℕ} aₙ⁻ < ∞. In that case, the sum of the series Σ_{n∈ℕ} aₙ is the (well-defined) element Σ_{n∈ℕ} aₙ⁺ − Σ_{n∈ℕ} aₙ⁻ of (−∞, ∞].
Definition 2.31 (Total variation) For A ∈ S and a signed measure μ on S, we define the number |μ|(A) ∈ [0, ∞], called the total variation of μ on A, by

|μ|(A) = sup Σ_{k=1}^n |μ(Dₖ)|,

where the supremum is taken over all finite measurable partitions D₁, ..., Dₙ, n ∈ ℕ, of A. The number |μ|(S) ∈ [0, ∞] is called the total variation (norm) of μ.
The central result about signed measures is the following:
Theorem 2.32 (Hahn-Jordan decomposition) Let (S, S) be a measurable space, and let μ be a signed measure on S. Then there exist two (positive) measures μ⁺ and μ⁻ such that

1. μ⁻ is finite,
2. μ(A) = μ⁺(A) − μ⁻(A), for all A ∈ S,
3. |μ|(A) = μ⁺(A) + μ⁻(A), for all A ∈ S.

Measures μ⁺ and μ⁻ with the above properties are unique. Moreover, there exists a set D ∈ S such that μ⁺(A) = μ(A ∩ Dᶜ) and μ⁻(A) = −μ(A ∩ D) for all A ∈ S.
Proof (*) Call a set B ∈ S negative if μ(C) ≤ 0, for all C ∈ S, C ⊆ B. Let P denote the collection of all negative sets; it is nonempty because ∅ ∈ P. Set

β = inf{μ(B) : B ∈ P},

and let {Bₙ}_{n∈ℕ} be a sequence of negative sets with μ(Bₙ) → β. We define D = ∪ₙ Bₙ and note that D is a negative set with μ(D) = β (why?). In particular, β > −∞.

Our first order of business is to show that Dᶜ is a positive set, i.e., that μ(E) ≥ 0 for all E ⊆ Dᶜ. Suppose, to the contrary, that μ(B) < 0, for some B ∈ S, B ⊆ Dᶜ. The set B cannot be a negative set; otherwise D ∪ B would be a negative set with μ(D ∪ B) = μ(D) + μ(B) = β + μ(B) < β. Therefore, there exists a measurable subset E of B with μ(E) > 0, i.e., the family

𝓔₁ = {E ⊆ B : E ∈ S, μ(E) > 0}

is non-empty. Pick the smallest k₁ ∈ ℕ such that 1/k₁ ≤ sup{μ(E) : E ∈ 𝓔₁}, and then a set E₁ ∈ 𝓔₁ with μ(E₁) > 1/(k₁ + 1). Continuing inductively, given E₁, ..., Eₙ₋₁, pick the smallest kₙ ∈ ℕ such that 1/kₙ ≤ sup{μ(E) : E ∈ S, E ⊆ B \ ∪_{i<n} Eᵢ, μ(E) > 0}, and then a set Eₙ ⊆ B \ ∪_{i<n} Eᵢ with μ(Eₙ) > 1/(kₙ + 1).
Given that μ(B) < 0, B cannot have subsets of measure ∞. Therefore, μ(∪ₙ Eₙ) < ∞ and

Σ_{n∈ℕ} 1/(kₙ + 1) < Σ_{n∈ℕ} μ(Eₙ) = μ(∪ₙ Eₙ) < ∞,

and so 1/kₙ → 0, as n → ∞.

Let F ∈ S be a subset of B \ ∪ₙ Eₙ. Then F is a subset of B \ ∪_{i<n} Eᵢ for each n, and, therefore, by construction, μ(F) ≤ 1/kₙ. The fact that 1/kₙ → 0 now implies that μ(F) ≤ 0, which, in turn, implies that B \ ∪ₙ Eₙ is a negative set. The set D is, however, the maximal negative set, and so μ(B \ ∪ₙ Eₙ) = 0. On the other hand,

μ(B \ ∪ₙ Eₙ) = μ(B) − Σₙ μ(Eₙ) ≤ μ(B) < 0,

a contradiction. Therefore, Dᶜ is a positive set.
Having split S into a disjoint union of a positive and a negative set, we define

μ⁺(A) = μ(A ∩ Dᶜ) and μ⁻(A) = −μ(A ∩ D).

Both are (positive) measures, and properties 1. and 2. of the statement follow immediately. For 3., note that for any finite measurable partition B₁, ..., Bₙ of A,

Σ_{k=1}^n |μ(Bₖ)| = Σ_{k=1}^n |μ⁺(Bₖ) − μ⁻(Bₖ)| ≤ Σ_{k=1}^n (μ⁺(Bₖ) + μ⁻(Bₖ)) = μ⁺(A) + μ⁻(A).

To show that the obtained upper bound is tight, we consider the partition {A ∩ D, A ∩ Dᶜ} of A, for which we have

|μ(A ∩ D)| + |μ(A ∩ Dᶜ)| = μ⁻(A ∩ D) + μ⁺(A ∩ Dᶜ) = μ⁺(A) + μ⁻(A).
2.5
Additional Problems
Problem 2.33 (Local separation by constants) Let (S, S, μ) be a measure space and let the functions f, g ∈ L⁰(S, S, μ) satisfy μ({x ∈ S : f(x) < g(x)}) > 0. Prove or construct a counterexample for the following statement:

"There exist constants a, b ∈ ℝ such that μ({x ∈ S : f(x) ≤ a < b ≤ g(x)}) > 0."
Problem 2.34 (A pseudometric on sets) Let (S, S, μ) be a finite measure space. For A, B ∈ S define

d(A, B) = μ(A △ B),

where △ denotes the symmetric difference: A △ B = (A \ B) ∪ (B \ A). Show that d is a pseudometric on S, and for A ∈ S describe the set of all B ∈ S with d(A, B) = 0.
Problem 2.35 (Complete measure spaces) A measure space (S, S, μ) is called complete if all subsets of null sets are themselves in S. For a (possibly incomplete) measure space (S, S, μ) we define the completion (S, S*, μ*) in the following way:

S* = {A ∪ N : A ∈ S and N ⊆ M for some M ∈ S with μ(M) = 0}.

For B ∈ S* with representation B = A ∪ N we set μ*(B) = μ(A).

1. Show that S* is a σ-algebra.
2. Show that the definition μ*(B) = μ(A) above does not depend on the choice of the decomposition B = A ∪ N, i.e., that μ(A) = μ(Ã) if B = Ã ∪ Ñ is another decomposition of B into a set Ã in S and a subset Ñ of a null set in S.
3. Show that μ* is a measure on (S, S*) and that (S, S*, μ*) is a complete measure space with the property that μ*(A) = μ(A), for A ∈ S.
Problem 2.36 (The Cantor set) The Cantor set is defined as the collection of all real numbers x in [0, 1] with the representation

x = Σ_{n=1}^∞ cₙ 3^{−n}, where cₙ ∈ {0, 2} for all n ∈ ℕ.

Problem 2.37 (The uniform measure on the unit circle) Let S¹ = {(x, y) ∈ ℝ² : x² + y² = 1} be the unit circle, and let f : [0, 1) → S¹ be given by f(x) = (cos(2πx), sin(2πx)).
1. Show that the map f is (B([0, 1)), S¹)-measurable, where S¹ also denotes the Borel σ-algebra on S¹ (with the topology inherited from ℝ²).
2. For α ∈ (0, 2π), let Rα denote the (counter-clockwise) rotation of ℝ² with center (0, 0) and angle α. Show that Rα(A) = {Rα(x) : x ∈ A} is in S¹ if and only if A ∈ S¹.
3. Let μ₁ be the push-forward of the Lebesgue measure λ by the map f. Show that μ₁ is rotation-invariant, i.e., that μ₁(A) = μ₁(Rα(A)).

(Note: The measure μ₁ is called the uniform measure (or the uniform distribution) on S¹.)
Problem 2.38 (Asymptotic densities) We say that a subset A of ℕ admits asymptotic density if the limit

d(A) = limₙ #(A ∩ {1, 2, ..., n}) / n

exists (remember that # denotes the number of elements of a set). Let D be the collection of all subsets of ℕ which admit asymptotic density.

1. Is D an algebra? A σ-algebra?
2. Is the map A ↦ d(A) finitely additive on D? A measure?
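Asymptotic density is easy to experiment with. A minimal sketch, computing the empirical densities #(A ∩ {1, ..., n})/n for two familiar sets (the function names are ad hoc):

```python
import math

def density(indicator, n):
    """Empirical density #(A ∩ {1,...,n}) / n for a set A of positive
    integers given by its indicator function."""
    return sum(1 for k in range(1, n + 1) if indicator(k)) / n

evens = lambda k: k % 2 == 0                 # d(evens) = 1/2
squares = lambda k: math.isqrt(k) ** 2 == k  # d(squares) = 0

print(density(evens, 10**5))    # exactly 0.5
print(density(squares, 10**5))  # small: about 316 / 100000
```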
Problem 2.39 (A subset of the coin-toss space) An element in {−1, 1}^ℕ (i.e., a sequence s = (s₁, s₂, ...) with sₙ ∈ {−1, 1} for all n ∈ ℕ) is said to be eventually periodic if there exist N₀, K ∈ ℕ such that sₙ = s_{n+K} for all n ≥ N₀. Let P ⊆ {−1, 1}^ℕ be the collection of all eventually-periodic sequences. Show that P is measurable in the product σ-algebra S and compute μ_C(P).
Problem 2.40 (Regular measures) A measure space (S, S, μ), where (S, d) is a metric space and S is a σ-algebra on S which contains the Borel σ-algebra B(d) on S, is called regular if for each A ∈ S and each ε > 0 there exist a closed set C and an open set O such that C ⊆ A ⊆ O and μ(O \ C) < ε.

1. Suppose that (S, S, μ) is a regular measure space, and let (S, B(d), μ|_{B(d)}) be the measure space obtained from (S, S, μ) by restricting the measure μ onto the σ-algebra of Borel sets. Show that S ⊆ B(d)*, where (S, B(d)*, (μ|_{B(d)})*) is the completion of (S, B(d), μ|_{B(d)}) (in the sense of Problem 2.35).

2. Suppose that (S, d) is a metric space and that μ is a finite measure on B(d). Show that (S, B(d), μ) is a regular measure space.

(Hint: Consider the collection A of subsets A of S such that for each ε > 0 there exist a closed set C and an open set O with C ⊆ A ⊆ O and μ(O \ C) < ε. Argue that A is a σ-algebra. Then show that each closed set can be written as an intersection of open sets; use (but prove, first) the fact that the map

x ↦ d(x, C) = inf{d(x, y) : y ∈ C}

is continuous on S for any nonempty C ⊆ S.)

3. Show that (S, B(d), μ) is regular if μ is not necessarily finite, but has the property that μ(A) < ∞ whenever A ∈ B(d) is bounded, i.e., when sup{d(x, y) : x, y ∈ A} < ∞. (Hint: Pick a point x₀ ∈ S and, for n ∈ ℕ, define the family {Rₙ}_{n∈ℕ} of subsets of S as follows:

R₁ = {x ∈ S : d(x, x₀) < 2}, and
Chapter 3

Lebesgue Integration

3.1

The construction of the integral
Unless expressly specified otherwise, we pick and fix a measure space (S, S, ) and assume that
all functions under consideration are defined there.
Definition 3.1 (Simple functions) A function f ∈ L⁰(S, S, μ) is said to be simple if it takes only a finite number of values.

The collection of all simple functions is denoted by L^{Simp,0} (more precisely by L^{Simp,0}(S, S, μ)) and the family of non-negative simple functions by L^{Simp,0}₊. Clearly, a simple function f : S → ℝ
admits a (not necessarily unique) representation
(3.1)  f = Σ_{k=1}^n αₖ 1_{Aₖ},

for some n ∈ ℕ, αₖ ∈ ℝ and Aₖ ∈ S, k = 1, ..., n. For f ∈ L^{Simp,0}₊ the Lebesgue integral of f is defined by

∫ f dμ = Σ_{k=1}^n αₖ μ(Aₖ) ∈ [0, ∞],

where f = Σ_{k=1}^n αₖ 1_{Aₖ} is a simple-function representation of f with αₖ ≥ 0; one checks that the value does not depend on the representation chosen.
2. One can think of the (simple) Lebesgue integral as a generalization of the notion of (finite) additivity of measures. Indeed, if the simple-function representation of f is given by f = Σ_{k=1}^n 1_{Aₖ}, for pairwise disjoint A₁, ..., Aₙ, then the equality of the values of the integrals for the two representations f = 1_{∪_{k=1}^n Aₖ} and f = Σ_{k=1}^n 1_{Aₖ} is a simple restatement of finite additivity. When A₁, ..., Aₙ are not disjoint, finite additivity gives way to finite subadditivity,

μ(∪_{k=1}^n Aₖ) ≤ Σ_{k=1}^n μ(Aₖ),

but the integral ∫ f dμ takes into account those x which are covered by more than one Aₖ, k = 1, ..., n. Take, for example, n = 2 and A₁ ∩ A₂ = C. Then

f = 1_{A₁} + 1_{A₂} = 1_{A₁\C} + 2 · 1_C + 1_{A₂\C},

and so

∫ f dμ = μ(A₁ \ C) + 2μ(C) + μ(A₂ \ C) = μ(A₁) + μ(A₂).
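The representation-independence and the overlap computation above can be checked on a toy finite measure space; a minimal sketch (the particular measure, point names and helper name are made up for illustration):

```python
def simple_integral(weights, mu):
    """Integral of the simple function f = sum_k alpha_k * 1_{A_k}
    over a finite space: sum_k alpha_k * mu(A_k).  `weights` is a list
    of (alpha_k, A_k) pairs and `mu` maps points to their masses."""
    return sum(alpha * sum(mu[x] for x in A) for alpha, A in weights)

mu = {'a': 0.5, 'b': 0.25, 'c': 0.25}   # a toy finite measure
A1, A2 = {'a', 'b'}, {'b', 'c'}         # overlap C = {'b'}
rep1 = [(1, A1), (1, A2)]               # f = 1_{A1} + 1_{A2}
rep2 = [(1, A1 - A2), (2, A1 & A2), (1, A2 - A1)]  # disjoint version
print(simple_integral(rep1, mu), simple_integral(rep2, mu))  # both 1.25
```

Both representations give the same integral, even though the sets in the first one overlap.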
Definition 3.6 (Lebesgue integral for nonnegative functions) For a function f ∈ L⁰₊([0, ∞]), we define the Lebesgue integral ∫ f dμ of f by

∫ f dμ = sup { ∫ g dμ : g ∈ L^{Simp,0}₊, g(x) ≤ f(x), ∀ x ∈ S } ∈ [0, ∞].
Remark 3.7 While there is no question that the expression above uniquely defines the number ∫ f dμ, one can wonder whether it matches the previously given definition of the Lebesgue integral for simple functions. A simple argument based on the monotonicity property of part 1. of Problem 3.5 can be used to show that this is, indeed, the case.

Problem 3.8 Show that ∫ f dμ = ∞ if there exists a measurable set A with μ(A) > 0 such that f(x) = ∞ for x ∈ A. On the other hand, show that ∫ f dμ = 0 for f of the form

f(x) = ∞ · 1_A(x) = ∞, if x ∈ A, and 0, if x ∉ A,

whenever μ(A) = 0. (Note: Relate this to our convention that ∞ · 0 = 0 · ∞ = 0.)
Finally, we are ready to define the integral for general measurable functions. Each f ∈ L⁰ can be written as a difference of two functions in L⁰₊ in many ways. There exists a decomposition which is, in a sense, minimal. We define

f⁺ = max(f, 0), f⁻ = max(−f, 0),

so that f = f⁺ − f⁻ (and both f⁺ and f⁻ are measurable). The minimality we mentioned above is reflected in the fact that for each x ∈ S, at most one of f⁺ and f⁻ is non-zero.
Definition 3.9 (Integrable functions) A function f ∈ L⁰ is said to be integrable if

∫ f⁺ dμ < ∞ and ∫ f⁻ dμ < ∞.

The collection of all integrable functions in L⁰ is denoted by L¹. The family of integrable functions is tailor-made for the following definition:
Definition 3.10 (The Lebesgue integral) For f ∈ L¹, we define the Lebesgue integral ∫ f dμ of f by

∫ f dμ = ∫ f⁺ dμ − ∫ f⁻ dμ.
Remark 3.11

1. We have seen so far two cases in which an integral for a function f ∈ L⁰ can be defined: when f ≥ 0 or when f ∈ L¹. It is possible to combine the two and define the Lebesgue integral for all functions f ∈ L⁰ with f⁻ ∈ L¹. The set of all such functions is denoted by L^{0−1} and we set

∫ f dμ = ∫ f⁺ dμ − ∫ f⁻ dμ ∈ (−∞, ∞], for f ∈ L^{0−1}.

Note that no problems of the form ∞ − ∞ arise here, and also note that, like L⁰₊, L^{0−1} is only a convex cone, and not a vector space. While the notation L⁰ and L¹ is quite standard, the one we use for L^{0−1} is not.

2. For A ∈ S and f ∈ L^{0−1} we usually write ∫_A f dμ for ∫ f 1_A dμ.

Problem 3.12 Show that the Lebesgue integral remains a monotone operation in L^{0−1}. More precisely, show that if f ∈ L^{0−1} and g ∈ L⁰ are such that g(x) ≥ f(x), for all x ∈ S, then g ∈ L^{0−1} and ∫ g dμ ≥ ∫ f dμ.
3.2

First properties of the integral

The wider the generality to which a definition applies, the harder it is to prove theorems about it. Linearity of the integral is a trivial matter for functions in L^{Simp,0}₊, but you will see how much we need to work to get it for L⁰₊. In fact, it seems that the easiest route towards linearity is through two important results: an approximation theorem and a convergence theorem. Before that, we need to pick some low-hanging fruit:
Problem 3.13 Show that for f₁, f₂ ∈ L⁰₊([0, ∞]) and α ∈ [0, ∞] we have

1. if f₁(x) ≤ f₂(x) for all x ∈ S, then ∫ f₁ dμ ≤ ∫ f₂ dμ, and
2. ∫ α f₁ dμ = α ∫ f₁ dμ.
Theorem 3.14 (Monotone convergence theorem) Let {fₙ}_{n∈ℕ} be a sequence in L⁰₊([0, ∞]) with the property that

f₁(x) ≤ f₂(x) ≤ ... for all x ∈ S.

Then

limₙ ∫ fₙ dμ = ∫ f dμ,

where f(x) = limₙ fₙ(x) ∈ L⁰₊([0, ∞]), for x ∈ S.
Proof (sketch) By monotonicity, limₙ ∫ fₙ dμ exists (in [0, ∞]) and is at most ∫ f dμ, so it suffices to show that supₙ ∫ fₙ dμ ≥ c ∫ g dμ for each c ∈ (0, 1) and each simple g ≤ f. Let g = Σ_{i=1}^k αᵢ 1_{Bᵢ} be such a simple function, and set Aₙ = {fₙ ≥ c g}, so that the sets Aₙ increase to S and

∫ fₙ dμ ≥ c ∫ g 1_{Aₙ} dμ.

Since

∫ g 1_{Aₙ} dμ = ∫ Σ_{i=1}^k αᵢ 1_{Bᵢ ∩ Aₙ} dμ = Σ_{i=1}^k αᵢ μ(Bᵢ ∩ Aₙ),

continuity of measure with respect to increasing sequences gives

supₙ ∫ fₙ dμ ≥ c supₙ ∫ g 1_{Aₙ} dμ = c Σ_{i=1}^k αᵢ μ(Bᵢ) = c ∫ g dμ.

Letting c ↗ 1 and taking the supremum over all simple g ≤ f completes the proof.
Remark 3.15

1. The monotone convergence theorem is a testament to the incredible robustness of the Lebesgue integral. This stability with respect to limiting operations is one of the reasons why it is a de-facto industry standard.

2. The monotonicity condition in the monotone convergence theorem cannot be dropped. Take, for example, S = [0, 1], S = B([0, 1]), and μ = λ (the Lebesgue measure), and define

fₙ = n 1_{(0, 1/n]}, for n ∈ ℕ.

Then fₙ(0) = 0 for all n ∈ ℕ and fₙ(x) = 0 for n > 1/x and x > 0. In either case fₙ(x) → 0. On the other hand,

∫ fₙ dμ = n λ((0, 1/n]) = 1,

so that

limₙ ∫ fₙ dμ = 1 > 0 = ∫ limₙ fₙ dμ.

We will see later that while the equality of the limit of the integrals and the integral of the limit will not hold in general, they will always be ordered in a specific way if the functions {fₙ}_{n∈ℕ} are non-negative (that will be the content of Fatou's lemma below).
Proposition 3.16 (Approximation by simple functions) For each f ∈ L⁰₊([0, ∞]) there exists a sequence {gₙ}_{n∈ℕ} in L^{Simp,0}₊ such that

1. gₙ(x) ≤ gₙ₊₁(x), for all n ∈ ℕ and all x ∈ S,
2. gₙ(x) ≤ f(x), for all n ∈ ℕ and all x ∈ S,
3. f(x) = limₙ gₙ(x), for all x ∈ S, and
4. the convergence gₙ → f is uniform on each set of the form {f ≤ M}, M > 0, and, in particular, on the whole S if f is bounded.
Proof For n ∈ ℕ, define the sets

Aⁿₖ = { (k−1)/2ⁿ ≤ f < k/2ⁿ }, k = 1, ..., n 2ⁿ.

Note that the sets Aⁿₖ, k = 1, ..., n 2ⁿ, are disjoint and that the measurability of f implies that Aⁿₖ ∈ S for k = 1, ..., n 2ⁿ. Define the function gₙ ∈ L^{Simp,0}₊ by

gₙ = Σ_{k=1}^{n 2ⁿ} ((k−1)/2ⁿ) 1_{Aⁿₖ} + n 1_{{f ≥ n}}.

The statements 2., 3., and 4. follow immediately from the following three simple observations:

gₙ(x) ≤ f(x) for all x ∈ S,
gₙ(x) = n if f(x) = ∞, and
gₙ(x) > f(x) − 2^{−n} when f(x) < n.

Finally, we leave it to the reader to check the simple fact that {gₙ}_{n∈ℕ} is non-decreasing.
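The dyadic construction of gₙ translates directly into code; a minimal sketch (g_n is an ad-hoc name, and a closed-form floor expression replaces the explicit sum over the sets Aⁿₖ, which gives the same values):

```python
import math

def g_n(f, n):
    """Dyadic approximation from Proposition 3.16:
    g_n = sum_k ((k-1)/2^n) 1_{ f in [(k-1)/2^n, k/2^n) } + n 1_{f >= n}.
    For y = f(x) < n this equals floor(y * 2^n) / 2^n."""
    def g(x):
        y = f(x)
        if y >= n:
            return n
        return math.floor(y * 2 ** n) / 2 ** n
    return g

f = lambda x: x * x          # an example function in L^0_+
for n in (1, 2, 4, 8, 16):
    print(n, [g_n(f, n)(x) for x in (0.1, 0.7, 1.3, 5.0)])
    # the values increase with n and approach f from below
```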
Problem 3.17 Show, by means of an example, that the sequence {gₙ}_{n∈ℕ} would not necessarily be monotone if we defined it in the following way:

gₙ = Σ_{k=1}^{n²} ((k−1)/n) 1_{{f ∈ [(k−1)/n, k/n)}} + n 1_{{f ≥ n}}.
Corollary 3.19 (Countable additivity of the integral) Let {fₙ}_{n∈ℕ} be a sequence in L⁰₊([0, ∞]). Then

∫ Σ_{n∈ℕ} fₙ dμ = Σ_{n∈ℕ} ∫ fₙ dμ.

Proof Apply the monotone convergence theorem to the partial sums gₙ = f₁ + ··· + fₙ, and use linearity of integration.
Once we have established a battery of properties for non-negative functions, an extension to L¹ is not hard. We leave it to the reader to prove all the statements in the following problem:

Problem 3.20 The family L¹ of integrable functions has the following properties:

1. f ∈ L¹ if and only if ∫ |f| dμ < ∞,
2. L¹ is a vector space,
3. |∫ f dμ| ≤ ∫ |f| dμ, for f ∈ L¹,
4. ∫ |f + g| dμ ≤ ∫ |f| dμ + ∫ |g| dμ, for all f, g ∈ L¹.
Theorem 3.21 (Fatou's lemma) Let {fₙ}_{n∈ℕ} be a sequence in L⁰₊([0, ∞]). Then

∫ lim infₙ fₙ dμ ≤ lim infₙ ∫ fₙ dμ.

Proof Set gₙ(x) = inf_{k≥n} fₖ(x), so that gₙ ∈ L⁰₊([0, ∞]) and {gₙ(x)}_{n∈ℕ} is a non-decreasing sequence for each x ∈ S. The monotone convergence theorem and the fact that lim infₙ fₙ(x) = supₙ gₙ(x) = limₙ gₙ(x), for all x ∈ S, imply that

∫ gₙ dμ → ∫ lim infₙ fₙ dμ.

On the other hand, gₙ ≤ fₖ for all k ≥ n, and therefore

limₙ ∫ gₙ dμ ≤ limₙ inf_{k≥n} ∫ fₖ dμ = lim infₙ ∫ fₙ dμ.
Remark 3.22

1. The inequality in Fatou's lemma does not have to be an equality, even if the limit limₙ fₙ(x) exists for all x ∈ S. You can use the sequence {fₙ}_{n∈ℕ} of Remark 3.15 to see that.

2. Like the monotone convergence theorem, Fatou's lemma requires that all functions {fₙ}_{n∈ℕ} be non-negative. This requirement is necessary; to see that, simply consider the sequence {−fₙ}_{n∈ℕ}, where {fₙ}_{n∈ℕ} is the sequence of Remark 3.15 above.

3. The strength of Fatou's lemma comes from the fact that, apart from non-negativity, it requires no special properties of the sequence {fₙ}_{n∈ℕ}. Its conclusion is not as strong as that of the monotone convergence theorem, but it proves to be very useful in various settings because it gives an upper bound (namely lim infₙ ∫ fₙ dμ) on the integral of the non-negative function lim infₙ fₙ.
Theorem 3.23 (Dominated convergence theorem) Let {fₙ}_{n∈ℕ} be a sequence in L⁰ with the property that there exists g ∈ L¹ such that |fₙ(x)| ≤ g(x), for all x ∈ S and all n ∈ ℕ. If f(x) = limₙ fₙ(x) for all x ∈ S, then f ∈ L¹ and

∫ f dμ = limₙ ∫ fₙ dμ.

Proof (sketch) The functions g + fₙ and g − fₙ are non-negative, so Fatou's lemma applies to each of them. Using the finiteness of ∫ g dμ, the first application yields

∫ f dμ ≤ lim infₙ ∫ fₙ dμ,

the second yields lim supₙ ∫ fₙ dμ ≤ ∫ f dμ, and, consequently,

∫ f dμ = limₙ ∫ fₙ dμ.
Remark 3.24 The dominated convergence theorem combines the lack of monotonicity requirements of Fatou's lemma and the strong conclusion of the monotone convergence theorem. The price to be paid is the uniform domination requirement |fₙ| ≤ g ∈ L¹. There is a way to relax this requirement a little bit (using the concept of uniform integrability), but not too much. Still, it is an unexpectedly useful theorem.
3.3

Null sets

An important property of the Lebesgue integral, inherited directly from the underlying measure, is that it is blind to sets of measure zero. To make this statement precise, we need to introduce some language: a set N ∈ S with μ(N) = 0 is called a null set, and a property is said to hold almost everywhere (a.e.) if it holds for all x outside of some null set.

Remark 3.26

1. In addition to almost-everywhere equality, one can talk about the almost-everywhere version of any relation between functions which can be defined pointwise. For example, we write f ≤ g, a.e., if f(x) ≤ g(x) for all x ∈ S, except, maybe, for x in some null set N.
Proposition 3.28 (The blindness property of the Lebesgue integral) Suppose that f = g, a.e., for some f, g ∈ L⁰₊. Then

∫ f dμ = ∫ g dμ.

Proof Let N be an exceptional set for f = g, a.e., i.e., f = g on Nᶜ and μ(N) = 0. Then f 1_{Nᶜ} = g 1_{Nᶜ}, and so ∫ f 1_{Nᶜ} dμ = ∫ g 1_{Nᶜ} dμ. On the other hand, f 1_N ≤ ∞ · 1_N and ∫ ∞ · 1_N dμ = 0, so, by monotonicity, ∫ f 1_N dμ = 0. Similarly, ∫ g 1_N dμ = 0. It remains to use the additivity of integration to conclude that

∫ f dμ = ∫ f 1_{Nᶜ} dμ + ∫ f 1_N dμ = ∫ g 1_{Nᶜ} dμ + ∫ g 1_N dμ = ∫ g dμ.
A similar argument upgrades the monotone convergence theorem to its almost-everywhere version: if fₙ ≤ fₙ₊₁, a.e., for each n ∈ ℕ, and fₙ → f, a.e., then

limₙ ∫ fₙ dμ = ∫ f dμ.

Proof There are countably many a.e.-statements we need to deal with: one for each n ∈ ℕ in fₙ ≤ fₙ₊₁, a.e., and an extra one when we assume that fₙ → f, a.e. Each of them comes with an exceptional set; more precisely, let {Aₙ}_{n∈ℕ} be such that fₙ(x) ≤ fₙ₊₁(x) for x ∈ Aₙᶜ, and let B be such that fₙ(x) → f(x) for x ∈ Bᶜ. Define A ∈ S by A = (∪ₙ Aₙ) ∪ B and note that A is a null set. Moreover, consider the functions f̃, {f̃ₙ}_{n∈ℕ} defined by f̃ = f 1_{Aᶜ}, f̃ₙ = fₙ 1_{Aᶜ}. Thanks to the definition of the set A, f̃ₙ(x) ≤ f̃ₙ₊₁(x), for all n ∈ ℕ and x ∈ S; hence f̃ₙ → f̃ everywhere. Therefore, the monotone convergence theorem applies to {f̃ₙ}_{n∈ℕ}, and the blindness property (Proposition 3.28) transfers its conclusion back to {fₙ}_{n∈ℕ} and f.
Remark 3.34 There is a subtlety that needs to be pointed out. If a sequence {fₙ}_{n∈ℕ} of measurable functions converges to the function f everywhere, then f is necessarily a measurable function (see Proposition 1.43). However, if fₙ → f only almost everywhere, there is no guarantee that f is measurable. There is, however, always a measurable function which is equal to f almost everywhere; you can take lim infₙ fₙ, for example.
3.4

Additional Problems

Problem 3.35 (The monotone-class theorem) Prove the following result, known as the monotone-class theorem (remember that aₙ ↗ a means that {aₙ}_{n∈ℕ} is a non-decreasing sequence and aₙ → a):

Let H be a class of bounded functions from S into ℝ satisfying the following conditions:

1. H is a vector space,
2. the constant function 1 is in H, and
3. if {fₙ}_{n∈ℕ} is a sequence of non-negative functions in H such that fₙ(x) ↗ f(x), for all x ∈ S, and f is bounded, then f ∈ H.

Then, if H contains the indicator 1_A of every set A in some π-system P, then H necessarily contains every bounded σ(P)-measurable function on S.
Problem 3.37 (Sums as integrals) Consider the measurable space (ℕ, 2^ℕ, μ), where μ is the counting measure.

1. For a function f : ℕ → [0, ∞], show that

∫ f dμ = Σ_{n=1}^∞ f(n).

2. Use the monotone convergence theorem to show the following special case of Fubini's theorem:

Σ_{k=1}^∞ Σ_{n=1}^∞ aₖₙ = Σ_{n=1}^∞ Σ_{k=1}^∞ aₖₙ,

whenever aₖₙ ≥ 0, for all k, n ∈ ℕ.

3. Show that f : ℕ → ℝ is in L¹ if and only if the series Σ_{n=1}^∞ f(n) converges absolutely.
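Part 1. can be seen in action: against the counting measure, integration is summation, and the partial sums increase to the integral by the monotone convergence theorem. A small sketch (the truncation level is an arbitrary choice):

```python
import math

def counting_integral(f, terms=10**6):
    """Partial sum over {1, ..., terms} approximating the integral of a
    nonnegative f against the counting measure on N."""
    return sum(f(n) for n in range(1, terms + 1))

# For f(n) = 1/n^2 the integral is the full series, pi^2 / 6.
print(counting_integral(lambda n: 1 / n ** 2))  # approaches pi^2 / 6
```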
Problem 3.38 (A criterion for integrability) Let (S, S, μ) be a finite measure space. For f ∈ L⁰₊, show that f ∈ L¹ if and only if

Σ_{n∈ℕ} μ({f ≥ n}) < ∞.
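A Monte-Carlo sanity check of this criterion on a probability space, using the elementary sandwich Σₙ μ({f ≥ n}) ≤ ∫ f dμ ≤ 1 + Σₙ μ({f ≥ n}); the exponential distribution and the sample size are arbitrary choices made for illustration:

```python
import random

random.seed(0)
samples = [random.expovariate(1.0) for _ in range(10**5)]  # f ~ Exp(1)

# Empirical integral of f and empirical tail sum sum_n mu({f >= n}).
mean = sum(samples) / len(samples)
tail_sum = sum(sum(1 for s in samples if s >= n) / len(samples)
               for n in range(1, 50))   # terms beyond 50 are negligible

print(mean, tail_sum)   # tail_sum <= mean <= 1 + tail_sum
```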
Problem 3.39 (A limit of integrals) Let (S, S, μ) be a measure space, and suppose f ∈ L¹₊ is such that ∫ f dμ = c > 0. Show that the limit

limₙ ∫ n log(1 + f/n) dμ

exists, and compute its value.
Problem 3.40 (Integrals converge but the functions don't ...) Construct a sequence {fₙ}_{n∈ℕ} of continuous functions fₙ : [0, 1] → [0, 1] such that ∫ fₙ dλ → 0, but the sequence {fₙ(x)}_{n∈ℕ} is divergent for each x ∈ [0, 1].

Problem 3.41 (... or they do, but are not dominated) Construct a sequence {fₙ}_{n∈ℕ} of continuous functions fₙ : [0, 1] → [0, ∞) such that ∫ fₙ dλ → 0, and fₙ(x) → 0 for all x ∈ [0, 1], but f ∉ L¹, where f(x) = supₙ fₙ(x).
For a partition Δ = {a = t₀ < t₁ < ··· < tₙ = b} of [a, b], define

U(f, Δ) = Σ_{k=1}^n (tₖ − tₖ₋₁) sup_{t∈(tₖ₋₁,tₖ]} f(t) and L(f, Δ) = Σ_{k=1}^n (tₖ − tₖ₋₁) inf_{t∈(tₖ₋₁,tₖ]} f(t).

A bounded function f : [a, b] → ℝ is said to be Riemann-integrable if

sup_{Δ∈P([a,b])} L(f, Δ) = inf_{Δ∈P([a,b])} U(f, Δ),

where P([a, b]) denotes the collection of all partitions of [a, b]. In that case the common value of the supremum and the infimum above is called the Riemann integral of the function f, denoted by (R)∫ₐᵇ f(x) dx.

1. Suppose that a bounded Borel-measurable function f : [a, b] → ℝ is Riemann-integrable. Show that

∫_{[a,b]} f dλ = (R)∫ₐᵇ f(x) dx.
Chapter 4

Lebesgue spaces

We have seen how the family of all functions f ∈ L¹ forms a vector space and how the map f ↦ ||f||_{L¹} from L¹ to [0, ∞), defined by ||f||_{L¹} = ∫ |f| dμ, has, among others, the following property:

1. f = 0 implies ||f||_{L¹} = 0, for f ∈ L¹.

More generally, for p ∈ [1, ∞), the space Lᵖ consists of all f ∈ L⁰ with ∫ |f|ᵖ dμ < ∞, and we set

||f||_{Lᵖ} = ( ∫ |f|ᵖ dμ )^{1/p}, f ∈ Lᵖ.
4.2

Inequalities

Definition 4.11 (Conjugate exponents) We say that p, q ∈ [1, ∞] are conjugate exponents if 1/p + 1/q = 1 (with the convention 1/∞ = 0).
Lemma 4.12 (Young's inequality) For all x, y ≥ 0 and conjugate exponents p, q ∈ (1, ∞) we have

(4.1)  xy ≤ xᵖ/p + y^q/q.

Proof If x = 0 or y = 0, the inequality trivially holds, so we assume that x > 0 and y > 0. The function log is strictly concave on (0, ∞) and 1/p + 1/q = 1, so

log(α/p + β/q) ≥ (1/p) log(α) + (1/q) log(β),

for all α, β > 0, with equality if and only if α = β. If we substitute α = xᵖ and β = y^q, and exponentiate both sides, we get

xᵖ/p + y^q/q ≥ xy,

with equality if and only if xᵖ = y^q.
Proof We assume that 1 < p, q < ∞ and leave the (easier) extreme cases to the reader. Clearly, we can also assume that ||f||_{Lᵖ} > 0 and ||g||_{L^q} > 0; otherwise, the inequality is trivially satisfied. We define f̃ = |f| / ||f||_{Lᵖ} and g̃ = |g| / ||g||_{L^q}, so that ||f̃||_{Lᵖ} = ||g̃||_{L^q} = 1. Plugging f̃ for x and g̃ for y in Young's inequality (Lemma 4.12 above) and integrating, we get

(4.3)  ∫ f̃ g̃ dμ ≤ (1/p) ∫ f̃ᵖ dμ + (1/q) ∫ g̃^q dμ,

so that

(4.4)  ∫ f̃ g̃ dμ ≤ 1,

because ∫ f̃ᵖ dμ = ||f̃||ᵖ_{Lᵖ} = 1, ∫ g̃^q dμ = ||g̃||^q_{L^q} = 1 and 1/p + 1/q = 1. Hölder's inequality (4.2) now follows by multiplying both sides of (4.4) by ||f||_{Lᵖ} ||g||_{L^q}.

If equality in (4.2) holds, then it also holds μ-a.e. in Young's inequality (4.3). Therefore, equality holds if and only if ||g||^q_{L^q} |f|ᵖ = ||f||ᵖ_{Lᵖ} |g|^q, a.e. The reader will check that if a pair of constants α, β as in the statement exists, then (||g||^q_{L^q}, ||f||ᵖ_{Lᵖ}) must be proportional to it.
For p = q = 2 we get the following well-known special case, the Cauchy-Schwarz inequality:

∫ |f g| dμ ≤ ||f||_{L²} ||g||_{L²}, for f, g ∈ L².
Proof As above, we assume p < ∞ and leave the case p = ∞ to the reader. Moreover, we assume that ||f + g||_{Lᵖ} > 0; otherwise, the inequality trivially holds. Note, first, that for conjugate exponents p, q we have q(p − 1) = p. Therefore, Hölder's inequality implies that

∫ |f| |f + g|^{p−1} dμ ≤ ||f||_{Lᵖ} ( ∫ |f + g|^{q(p−1)} dμ )^{1/q} = ||f||_{Lᵖ} ||f + g||^{p/q}_{Lᵖ}.

Therefore,

||f + g||ᵖ_{Lᵖ} = ∫ |f + g|ᵖ dμ ≤ ∫ |f| |f + g|^{p−1} dμ + ∫ |g| |f + g|^{p−1} dμ ≤ ( ||f||_{Lᵖ} + ||g||_{Lᵖ} ) ||f + g||^{p−1}_{Lᵖ},

and the claim follows upon dividing through by ||f + g||^{p−1}_{Lᵖ} (note that p/q = p − 1).
Corollary 4.17 (Lᵖ is pseudo-normed) (Lᵖ, || · ||_{Lᵖ}) is a pseudo-normed space, for each p ∈ [1, ∞].

A pseudo-metric space (X, d) is said to be complete if each Cauchy sequence converges. A sequence {xₙ}_{n∈ℕ} is called a Cauchy sequence if

∀ ε > 0, ∃ N ∈ ℕ, ∀ m, n ≥ N, d(xₙ, xₘ) < ε.

A pseudo-normed space (V, || · ||) is called a pseudo-Banach space if it is complete for the metric induced by || · ||. If || · || is, additionally, a norm, (V, || · ||) is said to be a Banach space.

Problem 4.18 Let {xₙ}_{n∈ℕ} be a Cauchy sequence in a pseudo-metric space (X, d), and let {x_{nₖ}}_{k∈ℕ} be a subsequence of {xₙ}_{n∈ℕ} which converges to x ∈ X. Show that xₙ → x.
Proposition 4.19 (Lᵖ is pseudo-Banach) Lᵖ is a pseudo-Banach space, for p ∈ [1, ∞].

Proof We assume p ∈ [1, ∞) and leave the case p = ∞ to the reader. Let {fₙ}_{n∈ℕ} be a Cauchy sequence in Lᵖ. Thanks to the Cauchy property, there exists a subsequence {f_{nₖ}}_{k∈ℕ} such that

||f_{n_{k+1}} − f_{nₖ}||_{Lᵖ} ≤ 2^{−k}, for all k ∈ ℕ.

We define the sequence {gₖ}_{k∈ℕ} in L⁰₊ by gₖ = |f_{n₁}| + Σ_{i=1}^{k−1} |f_{n_{i+1}} − f_{n_i}|, as well as the function g = limₖ gₖ ∈ L⁰([0, ∞]). The monotone convergence theorem implies that

∫ gᵖ dμ = limₖ ∫ gₖᵖ dμ,

while Minkowski's inequality gives

||gₖ||_{Lᵖ} ≤ ||f_{n₁}||_{Lᵖ} + Σ_{i=1}^{k−1} 2^{−i} ≤ ||f_{n₁}||_{Lᵖ} + 1.

Therefore, ∫ gᵖ dμ ≤ (1 + ||f_{n₁}||_{Lᵖ})ᵖ < ∞, and, in particular, g ∈ Lᵖ and g < ∞, a.e. It follows immediately from the absolute convergence of the series Σₖ (f_{n_{k+1}} − f_{nₖ}) that its sequence of partial sums, which is exactly {f_{nₖ}(x)}_{k∈ℕ}, converges in ℝ, for almost all x ∈ S. Hence, the function f = lim infₖ f_{nₖ} is in Lᵖ, since |f| ≤ g, a.e. Since |f| ≤ g and |f_{nₖ}| ≤ g, for all k ∈ ℕ, we have |f − f_{nₖ}|ᵖ ≤ 2ᵖ |g|ᵖ ∈ L¹, so the dominated convergence theorem implies that ∫ |f_{nₖ} − f|ᵖ dμ → 0, i.e., f_{nₖ} → f in Lᵖ. Finally, we invoke the result of Problem 4.18 to conclude that fₙ → f in Lᵖ.

The following result is a simple consequence of the (proof of) Proposition 4.19.
For a convex function φ : ℝ → ℝ, we define the left and right derivatives ∂₋φ(x) and ∂₊φ(x) of φ at x by

∂₋φ(x) = sup_{δ>0} (1/δ)(φ(x) − φ(x − δ)) and ∂₊φ(x) = inf_{δ>0} (1/δ)(φ(x + δ) − φ(x)).

Convexity implies that, for each δ > 0,

(1/δ)(φ(x) − φ(x − δ)) ≤ ∂₋φ(x) ≤ ∂₊φ(x) ≤ (1/δ)(φ(x + δ) − φ(x)).
Proof (Proof of Proposition 4.21) Let us first show that (φ(f))⁻ ∈ L¹. By Lemma 4.22, there exist sequences {aₙ}_{n∈ℕ} and {bₙ}_{n∈ℕ} such that φ(x) = supₙ (aₙ x + bₙ). In particular, φ(f(x)) ≥ a₁ f(x) + b₁, for all x ∈ S. Therefore,

(φ(f(x)))⁻ ≤ (a₁ f(x) + b₁)⁻ ≤ |a₁| |f(x)| + |b₁| ∈ L¹.

Next, we have ∫ φ(f) dμ ≥ ∫ (aₙ f + bₙ) dμ = aₙ ∫ f dμ + bₙ, for all n ∈ ℕ. Therefore,

∫ φ(f) dμ ≥ supₙ ( aₙ ∫ f dμ + bₙ ) = φ( ∫ f dμ ).
Problem 4.23 State and prove a generalization of Jensen's inequality for the case when φ is defined only on an interval I of ℝ, but μ({f ∉ I}) = 0.

Problem 4.24 Use Jensen's inequality on an appropriately chosen measure space to prove the arithmetic-geometric inequality

(a₁ + ··· + aₙ)/n ≥ (a₁ ··· aₙ)^{1/n}, for a₁, ..., aₙ ≥ 0.
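The arithmetic-geometric inequality of Problem 4.24 amounts to Jensen's inequality for the convex function −log on the uniform measure on n points; a quick numerical check (the function name is ad hoc):

```python
import math
import random

def amgm_gap(a):
    """Arithmetic mean minus geometric mean of positive numbers;
    nonnegative by Jensen's inequality for -log."""
    n = len(a)
    arith = sum(a) / n
    geom = math.exp(sum(math.log(x) for x in a) / n)
    return arith - geom

random.seed(1)
for _ in range(100):
    a = [random.uniform(0.01, 10) for _ in range(5)]
    assert amgm_gap(a) >= -1e-12

print(amgm_gap([2, 2, 2]))   # essentially 0: equality iff all entries equal
print(amgm_gap([1, 4]))      # 2.5 - 2 = 0.5
```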
The following inequality is known as Markov's inequality in probability theory, but not much more widely than that. In analysis it is known as Chebyshev's inequality.
Its proof rests on integrating the pointwise inequality ε 1_{{f ≥ ε}}(x) ≤ f(x), where

ε 1_{{f ≥ ε}}(x) = ε, if f(x) ∈ [ε, ∞), and 0, if f(x) ∈ [0, ε),

which yields ε μ({f ≥ ε}) ≤ ∫ f dμ, for f ∈ L⁰₊ and ε > 0.
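Markov's inequality, in the form ε μ({f ≥ ε}) ≤ ∫ f dμ, is easy to check by simulation; the sketch below uses f(x) = x² on ([0, 1], B([0, 1]), λ) with Monte-Carlo sampling (the choice of f and the sample size are arbitrary):

```python
import random

random.seed(2)
xs = [random.random() for _ in range(10**5)]   # uniform samples from [0, 1]
f = lambda x: x * x

for eps in (0.1, 0.25, 0.5):
    lhs = sum(1 for x in xs if f(x) >= eps) / len(xs)       # mu({f >= eps})
    rhs = sum(f(x) for x in xs) / len(xs) / eps             # (1/eps) int f
    print(eps, lhs, rhs)
    assert lhs <= rhs
```

The bound is far from tight here (for ε = 0.1 it gives about 3.3 as an upper bound for a probability), which is typical of Markov's inequality.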
4.3

Additional problems

Problem 4.26 (Projections onto a convex set) A subset K of a vector space is said to be convex if αx + (1 − α)y ∈ K, whenever x, y ∈ K and α ∈ [0, 1]. Let K be a closed and convex subset of L², and let g be an element of its complement L² \ K. Prove that

1. there exists an element f* ∈ K such that ||g − f*||_{L²} ≤ ||g − f||_{L²}, for all f ∈ K, and
2. ∫ (f − f*)(g − f*) dμ ≤ 0, for all f ∈ K.

(Hint: Pick a sequence {fₙ}_{n∈ℕ} in K with ||fₙ − g||_{L²} → inf_{f∈K} ||f − g||_{L²} and show that it is Cauchy. Use (but prove first) the parallelogram identity 2||h||²_{L²} + 2||k||²_{L²} = ||h + k||²_{L²} + ||h − k||²_{L²}, for h, k ∈ L².)

Problem 4.27 (Egorov's theorem) Suppose that μ is a finite measure, and let {fₙ}_{n∈ℕ} be a sequence in L⁰ which converges a.e. to f ∈ L⁰. Prove that for each ε > 0 there exists E ∈ S with μ(Eᶜ) < ε such that

limₙ esssup |fₙ 1_E − f 1_E| = 0.

(Hint: Define Aₙ,ₖ = ∪_{m≥n} {|fₘ − f| ≥ 1/k}, show that for each k ∈ ℕ there exists nₖ ∈ ℕ such that μ(A_{nₖ,k}) < ε/2ᵏ, and set E = ∩ₖ Aᶜ_{nₖ,k}.)
Problem 4.28 (Relationships between different Lᵖ spaces)

1. Show that for p, q ∈ [1, ∞) with p ≤ q, we have

||f||_{Lᵖ} ≤ ||f||_{L^q} μ(S)^r,

where r = 1/p − 1/q. Conclude that L^q ⊆ Lᵖ, for p ≤ q, if μ(S) < ∞.

2. For p₀ ∈ [1, ∞), construct an example of a measure space (S, S, μ) and a function f ∈ L⁰ such that f ∈ Lᵖ if and only if p = p₀.

3. Suppose that f ∈ Lʳ ∩ L^∞, for some r ∈ [1, ∞). Show that f ∈ Lᵖ for all p ∈ [r, ∞) and

||f||_{L^∞} = lim_{p→∞} ||f||_{Lᵖ}.
Problem 4.29 (Convergence in measure) A sequence {fₙ}_{n∈ℕ} in L⁰ is said to converge in measure toward f ∈ L⁰ if

∀ ε > 0, μ({|fₙ − f| ≥ ε}) → 0 as n → ∞.

Assume that μ(S) < ∞ (parts marked by (*) are true without this assumption).

1. Show that the mapping

d(f, g) = ∫ |f − g| / (1 + |f − g|) dμ, f, g ∈ L⁰,
Chapter 5

Products of measure spaces

We have seen in Chapter 2 that it is possible to define products of arbitrary collections of measurable spaces; one generates the σ-algebra on the product by all finite-dimensional cylinders. The purpose of the present section is to extend that construction to products of measure spaces, i.e., to define products of measures.
Let us first consider the case of two measure spaces (S, S, μ_S) and (T, T, μ_T). If the measures are stripped, the product S × T is endowed with the product σ-algebra S ⊗ T = σ({A × B : A ∈ S, B ∈ T}). The family P = {A × B : A ∈ S, B ∈ T} serves as a good starting point towards the creation of the product measure μ_S × μ_T. Indeed, if we interpret the elements of P as rectangles of sorts, it is natural to define

(μ_S × μ_T)(A × B) = μ_S(A) μ_T(B).

The family P is a π-system (why?), but not necessarily an algebra, so we cannot use Theorem 2.10 (Caratheodory's extension theorem) to define an extension of μ_S × μ_T to the whole S ⊗ T. It is not hard, however, to enlarge P a little bit, so that the resulting set is an algebra, but so that the measure μ_S × μ_T can still be defined there in a natural way. Indeed, consider the smallest algebra that contains P. It is easy to see that it must contain the family A defined by

A = { ∪_{k=1}^n (Aₖ × Bₖ) : n ∈ ℕ, Aₖ ∈ S, Bₖ ∈ T, k = 1, ..., n }.

Problem 5.1 Show that A is, in fact, an algebra and that each element C ∈ A can be written in the form

C = ∪_{k=1}^n (Aₖ × Bₖ),

for n ∈ ℕ, Aₖ ∈ S, Bₖ ∈ T, k = 1, ..., n, such that A₁ × B₁, ..., Aₙ × Bₙ are pairwise disjoint.

The problem above allows us to extend the definition of the set function μ_S × μ_T to the entire A by

(μ_S × μ_T)(C) = Σ_{k=1}^n μ_S(Aₖ) μ_T(Bₖ),
Lemma 5.2 (Sections of measurable sets are measurable) Let C be an S ⊗ T-measurable subset of S × T. For each x ∈ S the section Cₓ = {y ∈ T : (x, y) ∈ C} is measurable in T.

Proof In the spirit of most of the measurability arguments seen so far in these notes, let C denote the family of all C ∈ S ⊗ T such that Cₓ is T-measurable for each x ∈ S. The rectangles A × B, A ∈ S, B ∈ T, are in C because their sections are equal to either ∅ or B, for each x ∈ S. Remember that the set of all rectangles generates S ⊗ T. The proof of the lemma will, therefore, be complete once it is established that C is a σ-algebra. This easy exercise is left to the reader.

Problem 5.3 Show that an analogous result holds for measurable functions, i.e., show that if f : S × T → ℝ is an S ⊗ T-measurable function, then the function x ↦ f(x, y₀) is S-measurable for each y₀ ∈ T, and the function y ↦ f(x₀, y) is T-measurable for each x₀ ∈ S.
Proposition 5.4 (A simple Cavallieris principle) Let S and T be finite measures. For C
S T , define the functions C : T [0, ) and C : S [0, ) by
C (y) = S (Cy ), C (x) = T (Cx ).
Then,
1. C L0+ (T ),
2. C L0+ (S),
R
R
3. C dT = C dS .
PROOF Note that, by Problem 5.3, the function x ↦ 1_C(x, y) is S-measurable for each y ∈ T. Therefore,

(5.1)  ∫ 1_C(·, y) dμS = μS(C^y) = φC(y),

and the function φC is well-defined.
Let C denote the family of all sets in S ⊗ T such that (1), (2) and (3) in the statement of the proposition hold. First, observe that C contains all rectangles A × B, A ∈ S, B ∈ T, i.e., it contains a π-system which generates S ⊗ T. So, by the π-λ Theorem (Theorem 2.16), it will be enough to show that C is a λ-system. We leave the details to the reader. (Hint: Use representation (5.1) and the monotone convergence theorem. Where is the finiteness of the measures used?)
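The simple Cavalieri principle can be sanity-checked numerically. The sketch below is not part of the notes: it picks an arbitrary set C (the triangle {x + y ≤ 1} in the unit square, with both measures Lebesgue) and an arbitrary grid size, and verifies that integrating the section measures in either order gives the same value.

```python
# Numerical sketch of Proposition 5.4 for C = {(x, y) in [0,1]^2 : x + y <= 1}:
# the section measures are phi_C(y) = lambda({x : x + y <= 1}) = 1 - y and
# psi_C(x) = 1 - x, and their integrals agree (both equal the area 1/2).
N = 10_000          # grid size: an arbitrary choice for this illustration
h = 1.0 / N
mids = [(i + 0.5) * h for i in range(N)]

int_over_y = sum((1.0 - y) * h for y in mids)   # midpoint rule for ∫ phi_C dλ
int_over_x = sum((1.0 - x) * h for x in mids)   # midpoint rule for ∫ psi_C dλ

assert abs(int_over_y - int_over_x) < 1e-12
assert abs(int_over_y - 0.5) < 1e-6             # area of the triangle
```

The midpoint rule is exact for affine integrands, so both sums recover the area 1/2 up to floating-point error.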
Proposition 5.5 (Simple Cavalieri holds for σ-finite measures) The conclusion of Proposition 5.4 continues to hold if we assume that μS and μT are only σ-finite.

PROOF (*) Thanks to σ-finiteness, there exist pairwise disjoint sequences {An}n∈N and {Bm}m∈N in S and T, respectively, such that ∪n An = S, ∪m Bm = T and μS(An) < ∞ and μT(Bm) < ∞, for all m, n ∈ N.
For m, n ∈ N, define the set-functions μ^n_S and μ^m_T on S and T, respectively, by

μ^n_S(A) = μS(An ∩ A),  μ^m_T(B) = μT(Bm ∩ B).

It is easy to check that all μ^n_S and μ^m_T, m, n ∈ N, are finite measures on S and T, respectively. Moreover, μS(A) = Σ_{n=1}^∞ μ^n_S(A) and μT(B) = Σ_{m=1}^∞ μ^m_T(B). In particular, if we set φ^n_C(y) = μ^n_S(C^y) and ψ^m_C(x) = μ^m_T(Cx), for all x ∈ S and y ∈ T, we have

φC(y) = μS(C^y) = Σ_{n=1}^∞ μ^n_S(C^y) = Σ_{n=1}^∞ φ^n_C(y), and
ψC(x) = μT(Cx) = Σ_{m=1}^∞ μ^m_T(Cx) = Σ_{m=1}^∞ ψ^m_C(x),

for all x ∈ S, y ∈ T.
We can apply the conclusion of Proposition 5.4 to all pairs (S, S, μ^n_S) and (T, T, μ^m_T), m, n ∈ N, of finite measure spaces to conclude that all elements of the sums above are measurable functions, and that so are φC and ψC.
The sequences of non-negative functions Σ_{i=1}^n φ^i_C(y) and Σ_{i=1}^m ψ^i_C(x) are non-decreasing and converge to φC and ψC. Therefore, by the monotone convergence theorem,

∫ φC dμT = lim_n Σ_{i=1}^n ∫ φ^i_C dμT, and ∫ ψC dμS = lim_n Σ_{i=1}^n ∫ ψ^i_C dμS,

where the last equality follows from the fact (see Problem 5.7 below) that

∫ f dμS = Σ_{n∈N} ∫ f dμ^n_S,

for all f ∈ L0+. Another summation - this time over m ∈ N - completes the proof.

Remark 5.6 The argument of the proof above uncovers the fact that integration is a bilinear operation, i.e., that the mapping

(f, μ) ↦ ∫ f dμ

is additive in each of its arguments.
Proposition 5.8 (Finite products of measure spaces) Let (Si, Si, μi), i = 1, …, n, be σ-finite measure spaces. There exists a unique measure - denoted by μ1 × ⋯ × μn - on the product space (S1 × ⋯ × Sn, S1 ⊗ ⋯ ⊗ Sn) with the property that

(μ1 × ⋯ × μn)(A1 × ⋯ × An) = μ1(A1) ⋯ μn(An),

for all Ai ∈ Si, i = 1, …, n. Such a measure is necessarily σ-finite.

PROOF To simplify the notation, we assume that n = 2 - the general case is very similar. For C ∈ S1 ⊗ S2, we define

(μ1 × μ2)(C) = ∫_{S2} φC dμ2, where φC(y) = μ1(C^y) and C^y = {x ∈ S1 : (x, y) ∈ C}.

It follows from Proposition 5.5 that μ1 × μ2 is well-defined as a map from S1 ⊗ S2 to [0, ∞]. Also, it is clear that (μ1 × μ2)(A × B) = μ1(A)μ2(B), for all A ∈ S1, B ∈ S2. It remains to show that μ1 × μ2 is a measure. We start with a pairwise disjoint sequence {Cn}n∈N in S1 ⊗ S2. For y ∈ S2, the sequence {(Cn)^y}n∈N is also pairwise disjoint, and so, with C = ∪n Cn, we have

φC(y) = μ1(C^y) = Σ_{n∈N} μ1((Cn)^y) = Σ_{n∈N} φ_{Cn}(y), y ∈ S2.

Therefore, by the monotone convergence theorem (see Problem 3.37 for details) we have

(μ1 × μ2)(C) = ∫_{S2} φC dμ2 = Σ_{n∈N} ∫_{S2} φ_{Cn} dμ2 = Σ_{n∈N} (μ1 × μ2)(Cn).

Finally, let {An}n∈N, {Bn}n∈N be sequences in S1 and S2 (respectively) such that μ1(An) < ∞ and μ2(Bn) < ∞ for all n ∈ N and ∪n An = S1, ∪n Bn = S2. Define {Cn}n∈N as an enumeration of the countable family {Ai × Bj : i, j ∈ N} in S1 ⊗ S2. Then (μ1 × μ2)(Cn) < ∞ for all n ∈ N and ∪n Cn = S1 × S2. Therefore, μ1 × μ2 is σ-finite.

The measure μ1 × ⋯ × μn is called the product measure, and the measure space (S1 × ⋯ × Sn, S1 ⊗ ⋯ ⊗ Sn, μ1 × ⋯ × μn) the product (measure space) of the measure spaces (S1, S1, μ1), …, (Sn, Sn, μn).
Now that we know that product measures exist, we can state and prove the important theorem which, when applied to integrable functions, bears the name of Fubini and, when applied to non-negative functions, of Tonelli. We state it for both cases simultaneously (i.e., on L^{0-1} = L0+ ∪ L1) in the case of a product of two measure spaces. An analogous theorem for finite products can be readily derived from it. When the variable or the underlying measure space of integration needs to be specified, we write ∫_S f(x) μ(dx) for the Lebesgue integral ∫ f dμ.
Theorem 5.9 (Fubini, Tonelli) Let (S, S, μS) and (T, T, μT) be two σ-finite measure spaces. For f ∈ L^{0-1}(S ⊗ T) we have

(5.2)  ∫_S (∫_T f(x, y) μT(dy)) μS(dx) = ∫_T (∫_S f(x, y) μS(dx)) μT(dy) = ∫ f d(μS × μT).

PROOF All the hard work has already been done. We simply need to crank the Standard Machine. Let H denote the family of all functions in L0+(S ⊗ T) with the property that (5.2) holds. Proposition 5.5 implies that H contains the indicators of all elements of S ⊗ T. Linearity of all components of (5.2) implies that H contains all simple functions in L0+, and the approximation theorem 3.16 implies that the whole L0+ is in H. Finally, the extension to L^{0-1} follows by additivity.
Since |f| is always in L^{0-1}, we have the following corollary:

Corollary 5.10 (An integrability criterion) For f ∈ L0(S ⊗ T), we have

f ∈ L1(S ⊗ T) if and only if ∫_S (∫_T |f(x, y)| μT(dy)) μS(dx) < ∞.
Example 5.11 (σ-finiteness cannot be left out . . . ) The assumption of σ-finiteness cannot be left out of the statement of Theorem 5.9. Indeed, let (S, S, μ) = ([0, 1], B([0, 1]), λ) and (T, T, γ) = ([0, 1], 2^{[0,1]}, γ), where γ is the counting measure on 2^{[0,1]}, so that (T, T, γ) fails the σ-finite property. Define f ∈ L0(S ⊗ T) (why is it product-measurable?) by

f(x, y) = 1, if x = y, and f(x, y) = 0, if x ≠ y.

Then ∫_S f(x, y) λ(dx) = 0 for each y, and so

∫_T (∫_S f(x, y) λ(dx)) γ(dy) = ∫_T 0 γ(dy) = 0,

while ∫_T f(x, y) γ(dy) = 1 for each x, and so

∫_S (∫_T f(x, y) γ(dy)) λ(dx) = ∫_S 1 λ(dx) = 1.

Neither can integrability be left out: take S = T = N, let γ denote the counting measure on 2^N, and define f : N × N → R by

f(n, m) = 1, if m = n; f(n, m) = −1, if m = n + 1; and f(n, m) = 0, otherwise.

Then, for each n ∈ N,

∫_T f(n, m) γ(dm) = Σ_{m∈N} f(n, m) = 0 + ⋯ + 0 + 1 + (−1) + 0 + ⋯ = 0,

and so ∫_S (∫_T f(n, m) γ(dm)) γ(dn) = 0. On the other hand,

∫_S f(n, m) γ(dn) = Σ_{n∈N} f(n, m) = 1, if m = 1, and = 0 + ⋯ + 0 + (−1) + 1 + 0 + ⋯ = 0, if m > 1,

i.e., ∫_S f(n, m) γ(dn) = 1_{{m=1}}. Therefore,

∫_T (∫_S f(n, m) γ(dn)) γ(dm) = ∫_T 1_{{m=1}} γ(dm) = 1.

If you think that using the counting measure is cheating, convince yourself that it is not hard to transfer this example to the setup where (S, S, μ) = (T, T, ν) = ([0, 1], B([0, 1]), λ).
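The second counterexample above is easy to verify by direct computation. The sketch below truncates the double array at an arbitrary level N (large enough that every inner sum it touches is exact, since f vanishes away from the diagonal) and checks that the two iterated sums disagree.

```python
# The function from Example 5.11 on N x N: +1 on the diagonal,
# -1 just above it, 0 elsewhere.
def f(n, m):
    if m == n:
        return 1
    if m == n + 1:
        return -1
    return 0

N = 500  # truncation level (arbitrary); rows below include every nonzero term

# Sum over m first: every row contributes 1 + (-1) = 0.
rows_first = sum(sum(f(n, m) for m in range(1, N + 2)) for n in range(1, N + 1))

# Sum over n first: column m = 1 contributes 1, every other column 1 + (-1) = 0.
cols_first = sum(sum(f(n, m) for n in range(1, N + 1)) for m in range(1, N + 1))

assert rows_first == 0
assert cols_first == 1
```

The discrepancy 0 ≠ 1 is exactly the failure of (5.2): the double series is not absolutely convergent, so the order of summation matters.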
The existence of the product measure gives us easy access to the Lebesgue measure on higher-dimensional Euclidean spaces. Just as λ on R measures the length of sets, the Lebesgue measure on R² will measure area, the one on R³ volume, etc. Its properties are collected in the following problem:

Problem 5.13 For n ∈ N, show the following statements:

1. There exists a unique measure λ (note the notation overload) on B(Rⁿ) with the property that

λ([a1, b1) × ⋯ × [an, bn)) = (b1 − a1) ⋯ (bn − an),

for all a1 < b1, …, an < bn in R.

An isometry of Rⁿ is a map f : Rⁿ → Rⁿ with the property that d(x, y) = d(f(x), f(y)) for all x, y ∈ Rⁿ. It can be shown that the isometries of R³ are precisely translations, rotations, reflections and compositions thereof.
5.2
Definition 5.15 (Absolute continuity, etc.) Let μ, ν be measures on the measurable space (S, S). We say that

1. ν is absolutely continuous with respect to μ - denoted by ν ≪ μ - if ν(A) = 0, whenever μ(A) = 0, A ∈ S.
2. μ and ν are equivalent if ν ≪ μ and μ ≪ ν, i.e., if μ(A) = 0 ⇔ ν(A) = 0, for all A ∈ S,
3. μ and ν are (mutually) singular - denoted by μ ⊥ ν - if there exists D ∈ S such that μ(D) = 0 and ν(D^c) = 0.

Problem 5.16 Let μ and ν be measures with ν finite and ν ≪ μ. Show that for each ε > 0 there exists δ > 0 such that for each A ∈ S, we have μ(A) ≤ δ ⇒ ν(A) ≤ ε. Show that the assumption that ν is finite is necessary.
Problem 5.14 states that the prescription (5.3) defines a measure on S which is absolutely continuous with respect to μ. What is surprising is that the converse also holds under the assumption of σ-finiteness: all measures on S absolutely continuous with respect to μ are of that form. That statement (and more) is the topic of this section. Since there is more than one measure in circulation, we use the convention that "a.e." always refers to the notion of null set defined by the measure μ.
Theorem 5.17 (The Lebesgue decomposition) Let (S, S) be a measurable space and let μ and ν be two σ-finite measures on S. Then there exists a unique decomposition ν = νa + νs, where

1. νa ≪ μ,
2. νs ⊥ μ.

Furthermore, there exists an a.e.-unique function f ∈ L0+ such that

νa(A) = ∫_A f dμ.
PROOF (*)
- Uniqueness. Suppose that νa1 + νs1 = ν = νa2 + νs2 are two decompositions satisfying (1) and (2) in the statement. Let D1 and D2 be as in the definition of mutual singularity applied to the pairs μ, νs1 and μ, νs2, respectively. Set D = D1 ∪ D2, and note that μ(D) = 0 and νs1(D^c) = νs2(D^c) = 0. For any A ∈ S, we have μ(A ∩ D) = 0 and so, thanks to absolute continuity,

νa1(A ∩ D) = νa2(A ∩ D) = 0 and, consequently, νs1(A ∩ D) = νs2(A ∩ D) = ν(A ∩ D).

By singularity,

νs1(A ∩ D^c) = νs2(A ∩ D^c) = 0 and, consequently, νa1(A ∩ D^c) = νa2(A ∩ D^c) = ν(A ∩ D^c).

Finally,

νa1(A) = νa1(A ∩ D) + νa1(A ∩ D^c) = νa2(A ∩ D) + νa2(A ∩ D^c) = νa2(A),

and, similarly, νs1 = νs2.
To establish the uniqueness of the function f with the property that νa(A) = ∫_A f dμ for all A ∈ S, we assume that there are two such functions, f1 and f2, say. Define the sequence {Bn}n∈N by

Bn = {f1 ≥ f2} ∩ Cn,

where {Cn}n∈N is a pairwise-disjoint sequence in S with the property that ν(Cn) < ∞, for all n ∈ N, and ∪n Cn = S. Then, with gn = f1 1_{Bn} − f2 1_{Bn} ∈ L1+, we have

∫ gn dμ = ∫_{Bn} f1 dμ − ∫_{Bn} f2 dμ = νa(Bn) − νa(Bn) = 0.

By Problem 3.29, we have gn = 0, a.e., i.e., f1 = f2, a.e., on Bn, for all n ∈ N, and so f1 = f2, a.e., on {f1 ≥ f2}. A similar argument can be used to show that f1 = f2, a.e., on {f1 < f2}, as well.
Therefore, fn ∈ R and, thanks to the maximal property (5.4) of f, we conclude that fn = f, a.e., i.e., μ(Dn) = 0 and, immediately, μ(D) = 0, as required.
Corollary 5.18 (Radon-Nikodym) Let μ and ν be σ-finite measures on (S, S) with ν ≪ μ. Then there exists f ∈ L0+ such that

(5.6)  ν(A) = ∫_A f dμ, for all A ∈ S.

For any other g ∈ L0+ with the same property, we have f = g, a.e.

Any function f for which (5.6) holds is called the Radon-Nikodym derivative of ν with respect to μ and is denoted by f = dν/dμ, a.e. The Radon-Nikodym derivative f = dν/dμ is defined only up to a.e.-equivalence, and there is no canonical way of picking a representative defined for all x ∈ S. For that reason, we usually say that a function f ∈ L0+ is a version of the Radon-Nikodym derivative of ν with respect to μ if (5.6) holds. Moreover, to stress the fact that we are talking about a whole class of functions instead of just one, we usually write

dν/dμ ⊆ L0+ and not dν/dμ ∈ L0+.
Problem 5.19 Let μ, ν, ν1, ν2 and λ be σ-finite measures on (S, S). Show that

1. If ν1 ≪ μ and ν2 ≪ μ, then ν1 + ν2 ≪ μ and

d(ν1 + ν2)/dμ = dν1/dμ + dν2/dμ.

2. If ν ≪ μ and f ∈ L0+, then

∫ f dν = ∫ g dμ, where g = f (dν/dμ).

3. If ν ≪ μ ≪ λ, then

dν/dλ = (dν/dμ)(dμ/dλ),

and, if μ and ν are equivalent, then dμ/dν = (dν/dμ)^{-1}.
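On a finite measure space the change-of-measure rule in part (2) reduces to a weighted sum, which makes it easy to check by hand. The sketch below is not from the notes: the three-point space, the particular weights, and the test function are arbitrary choices; exact rational arithmetic avoids rounding questions.

```python
from fractions import Fraction as F

# Hypothetical finite measure space S = {0, 1, 2}; a measure is a list of
# point weights. nu << mu holds because mu charges every point.
mu = [F(1), F(2), F(3)]
nu = [F(2), F(1), F(6)]

# The Radon-Nikodym derivative on a finite space is the pointwise ratio.
dnu_dmu = [n / m for n, m in zip(nu, mu)]

def integral(f, weights):
    return sum(f(x) * w for x, w in zip(range(3), weights))

f = lambda x: F(x * x + 1)   # arbitrary non-negative test function

# Problem 5.19 (2): ∫ f dν = ∫ f·(dν/dμ) dμ
lhs = integral(f, nu)
rhs = integral(lambda x: f(x) * dnu_dmu[x], mu)
assert lhs == rhs
```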
Problem 5.20 Let μ1, μ2, ν1, ν2 be σ-finite measures, with μ1 and ν1, as well as μ2 and ν2, defined on the same measurable space. If μ1 ≪ ν1 and μ2 ≪ ν2, show that μ1 × μ2 ≪ ν1 × ν2.
Example 5.21 Just like in the statement of Fubini's theorem, the assumption of σ-finiteness cannot be omitted. Indeed, take (S, S) = ([0, 1], B([0, 1])) and consider the Lebesgue measure λ and the counting measure γ on (S, S). Clearly, λ ≪ γ, but there is no f ∈ L0+ such that λ(A) = ∫_A f dγ. Indeed, suppose that such an f exists and set Dn = {x ∈ S : f(x) > 1/n}, for n ∈ N, so that Dn ↗ {f > 0} = {f ≠ 0}. Then

1 ≥ λ(Dn) = ∫_{Dn} f dγ ≥ ∫_{Dn} (1/n) dγ = (1/n) #Dn,

and so #Dn ≤ n. Consequently, the set {f > 0} = ∪n Dn is countable. This leads to a contradiction since the Lebesgue measure does not charge countable sets, and so

1 = λ([0, 1]) = ∫ f dγ = ∫_{{f > 0}} f dγ = λ({f > 0}) = 0.
5.3
Additional Problems
Problem 5.22 (Area under the graph of a function) For f ∈ L0+, let H = {(x, r) ∈ S × [0, ∞) : f(x) ≥ r} be the region under the graph of f. Show that ∫ f dμ = (μ × λ)(H).
(Note: This equality is consistent with our intuition that the value of the integral ∫ f dμ corresponds to the area of the region under the graph of f.)
Problem 5.23 (A layered representation) Let ν be a measure on B([0, ∞)) such that N(u) = ν([0, u)) < ∞, for all u ∈ R. Let (S, S, μ) be a σ-finite measure space. For f ∈ L0+(S), show that

1. ∫ (N ◦ f) dμ = ∫_{[0,∞)} μ({f > u}) ν(du).
2. for p > 0, we have ∫ f^p dμ = p ∫_{[0,∞)} u^{p−1} μ({f > u}) λ(du), where λ is the Lebesgue measure.
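Part (2) can be verified directly on a finite measure space, where μ({f > u}) is a step function of u and the layered integral can be computed exactly interval by interval. The weights and function values below are arbitrary test data, not part of the problem.

```python
# Check of Problem 5.23 (2) with p = 2 on a three-point measure space:
# mu puts weight w on each point and f takes value v there.
weights = [0.5, 1.5, 2.0]
values  = [1.0, 3.0, 2.0]

lhs = sum(w * v**2 for w, v in zip(weights, values))   # ∫ f² dμ

# mu({f > u}) is constant between consecutive values of f, so we can
# integrate 2u·mu({f > u}) du exactly over each interval (a, b).
cuts = sorted(set([0.0] + values))
rhs = 0.0
for a, b in zip(cuts, cuts[1:]):
    tail = sum(w for w, v in zip(weights, values) if v > a)  # mu({f > u}) on (a, b)
    rhs += tail * (b**2 - a**2)                              # ∫_a^b 2u du

assert abs(lhs - rhs) < 1e-9
```

Both sides evaluate to 22 here; the layered integral simply re-groups the mass of f² by level sets.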
Problem 5.24 (A useful integral)

1. Show that ∫_0^∞ |sin(x)/x| dx = ∞. (Hint: Find a function below |sin(x)/x| which is easier to integrate.)
2. For a > 0, let f : R² → R be given by

f(x, y) = e^{−xy} sin(x), if 0 ≤ x ≤ a and 0 ≤ y, and f(x, y) = 0, otherwise.

Show that f ∈ L1(R², B(R²), λ), where λ denotes the Lebesgue measure on R².
3. Establish the equality

∫_0^a (sin(x)/x) dx = π/2 − cos(a) ∫_0^∞ e^{−ay}/(1 + y²) dy − sin(a) ∫_0^∞ y e^{−ay}/(1 + y²) dy.

4. Conclude that |∫_0^a (sin(x)/x) dx − π/2| ≤ 2/a, so that lim_{a→∞} ∫_0^a (sin(x)/x) dx = π/2.
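The limit in part (4) is easy to confirm numerically. The cutoff and step size in the sketch below are arbitrary; the midpoint rule conveniently avoids the removable singularity at x = 0, and the bound 2/a from part (4) tells us how close to π/2 the truncated integral must be.

```python
import math

# Numerical check that ∫_0^a sin(x)/x dx approaches π/2 (Problem 5.24 (4)).
a, h = 200.0, 1e-3            # arbitrary cutoff and step size
n = int(a / h)
total = 0.0
for i in range(n):
    x = (i + 0.5) * h         # midpoint rule: never evaluates at x = 0
    total += math.sin(x) / x * h

# Part (4) guarantees |total - pi/2| <= 2/a = 0.01 up to quadrature error.
assert abs(total - math.pi / 2) < 0.02
```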
Problem 5.25 (The Cantor measure) Let ({−1, 1}^N, B({−1, 1}^N), μC) be the coin-toss space. Define the mapping f : {−1, 1}^N → [0, 1] by

f(s) = Σ_{n∈N} (1 + s_n) 3^{−n}, for s = (s1, s2, …),

and let μ denote the push-forward of μC by the map f (the Cantor measure).

3. For a measure μ on the σ-algebra of Borel sets of a topological space X, the support of μ is the collection of all x ∈ X with the property that μ(O) > 0 for each open set O with x ∈ O. Describe the support of μ. (Hint: Guess what it is and prove that your guess is correct. Use the result in (1).)
4. Prove that μ ⊥ λ.
(Note: The Cantor measure is an example of a singular measure. It has no atoms, but it is still singular with respect to the Lebesgue measure.)
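The geometry behind parts (3) and (4) can be seen directly from the map f: each digit (1 + s_n) is 0 or 2, so f(s) has a ternary expansion avoiding the digit 1 and lands in the Cantor set. The sketch below (not part of the problem) enumerates all sign sequences truncated at an arbitrary depth and checks the first middle-third exclusion.

```python
from itertools import product

# f from Problem 5.25, truncated to finitely many coordinates:
# f(s) = sum (1 + s_n) * 3^{-n}, with each digit (1 + s_n) in {0, 2}.
def f(signs):
    return sum((1 + s) * 3**-(n + 1) for n, s in enumerate(signs))

depth = 8   # arbitrary truncation depth for the illustration
points = [f(signs) for signs in product((-1, 1), repeat=depth)]

assert all(0.0 <= x <= 1.0 for x in points)
# No image point falls strictly inside the removed middle third (1/3, 2/3):
# the first digit is 0 (value <= 1/3) or 2 (value >= 2/3).
assert all(not (1/3 < x < 2/3) for x in points)
```

The same digit argument, pushed to every level of the construction, shows that μ is carried by the Lebesgue-null Cantor set, which is the content of part (4).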
Chapter 6

Probability spaces
Definition 6.1 (Probability space) A probability space is a triple (Ω, F, P), where Ω is a non-empty set, F is a σ-algebra on Ω and P is a probability measure on F.

In many (but certainly not all) aspects, probability theory is a part of measure theory. For historical reasons and because of a different interpretation, some of the terminology/notation changes when one talks about measure-theoretic concepts in probability. Here is a list of what is different, and what stays the same:

1. We will always assume - often without explicit mention - that a probability space (Ω, F, P) is given and fixed.
2. Continuity of measure is called continuity of probability and, unlike the general case, does not require any additional assumptions in the case of a decreasing sequence (that is, of course, because P[Ω] = 1 < ∞).
3. A measurable function f : Ω → R is called a random variable. Typically, the sample space Ω is too large and clumsy for analysis, so we often focus our attention on real-valued functions X on Ω (random variables are usually denoted by capital letters such as X, Y, Z, etc.).
4. We use the measure-theoretic notation L0, L0+, L0(R̄), etc. to denote the set of all random variables, non-negative random variables, extended-valued random variables, etc.
5. Let (S, S) be a measurable space. An (F, S)-measurable map X : Ω → S is called a random element (of S). Random variables are random elements, but there are other important examples. If (S, S) = (Rⁿ, B(Rⁿ)), we talk about random vectors. More generally, if S = R^N and S = ⊗_n B(R), the map X : Ω → S is called a discrete-time stochastic process. Sometimes, the object of interest is a set (the area covered by a wildfire, e.g.) and then S is a collection of subsets of Rⁿ. There are many more examples.
6. The class of null-sets in F still plays the same role as it did in measure theory, but now we use the acronym a.s. (which stands for almost surely) instead of the measure-theoretic a.e.
7. The Lebesgue integral with respect to the probability P is now called expectation and is denoted by E, so that we write

E[X] instead of ∫ X dP, or ∫ X(ω) P[dω].

For p ∈ [1, ∞], the L^p spaces are defined just like before, and have the property that L^q ⊆ L^p when p ≤ q.
8. The notion of a.e.-convergence is now re-baptized as a.s.-convergence, while convergence in measure is now called convergence in probability. We write Xn → X, a.s., if the sequence {Xn}n∈N converges to X almost surely, and Xn → X in probability if it converges to X in probability.
9. Since the constant random variable X(ω) = 1, for ω ∈ Ω, is integrable, a special case of the dominated convergence theorem, known as the bounded convergence theorem, holds in probability spaces:

Theorem 6.2 (Bounded convergence) Let {Xn}n∈N be a sequence of random variables such that there exists M ≥ 0 such that |Xn| ≤ M, a.s., and Xn → X, a.s. Then

E[Xn] → E[X].
where 1 ≤ p ≤ q < ∞ and an arrow A → B means that A implies B, but that B does not imply A in general.
6.2
As we have already mentioned, typically too big to be of direct use. Luckily, if we are only
interested in a single random variable, all the useful probabilistic information about it is contained
in the probabilities of the form P[X B], for B B(R). Btw, it is standard to write P[X B]
instead of the more precise P[{X B}] or P[{ : X() B}]. Similarly, we will write
P[Xn Bn , i.o] instead of P[{Xn Bn } i.o.] and P[Xn Bn , ev.] instead of P[{Xn Bn } ev.]
The map B 7 P[X B] is, however, nothing but the push-forward of the measure P by the
map X onto B(R):
Definition 6.3 (Distribution of a random variable) The distribution of the random variable X
is the probability measure X on B(R), defined by
(B) = P[X 1 (B)],
that is the push-forward of the measure P by the map X.
In addition to being able to recover the information about various probabilities related to X from μX, one can evaluate any possible integral involving a function of X by integrating that function against μX (compare the statement to Problem 5.23):

Problem 6.4 Let g : R → R be a Borel function. Then g ◦ X ∈ L^{0-1}(Ω, F, P) if and only if g ∈ L^{0-1}(R, B(R), μX) and, in that case,

E[g(X)] = ∫ g dμX. In particular, E[X] = ∫_R x μX(dx).

Taken in isolation from everything else, two random variables X and Y for which μX = μY are the same from the probabilistic point of view. In that case we say that X and Y are equally distributed random variables and write X =(d) Y. On the other hand, if we are interested in their relationship with a third random variable Z, it can happen that X and Y have the same distribution, but that
On the other hand, it is impossible for X and Z to take the same value at the same time. In fact, there are only two values that the pair (X, Z) can take - (0, 1) and (1, 0). They happen with probability 1/2 each, so

μ(X,Z) = (1/2) δ(0,1) + (1/2) δ(1,0).

We will see later that the difference between (X, Y) and (X, Z) is best understood if we analyze the way the component random variables depend on each other. In the first case, even if the value of X is revealed, Y can still take the values 0 or 1 with equal probabilities. In the second case, as soon as we know X, we know Z.
More generally, if X : Ω → S is a random element with values in the measurable space (S, S), the distribution of X is the measure μX on S, defined by μX(B) = P[X ∈ B] = P[X^{−1}(B)], for B ∈ S.
Sometimes it is easier to work with a real-valued function FX defined by

FX(x) = P[X ≤ x],

which we call the (cumulative) distribution function (cdf, for short) of the random variable X. The following properties of FX are easily derived by using continuity of probability from above and from below:

Proposition 6.6 (Properties of the cdf) Let X be a random variable, and let FX be its distribution function. Then,

1. FX is non-decreasing and takes values in [0, 1],
2. FX is right-continuous,
3. lim_{x→∞} FX(x) = 1 and lim_{x→−∞} FX(x) = 0.
Remark 6.7 A notion of a (cumulative) distribution function can be defined for random vectors, too, but it is not used as often as in the single-component case, so we do not write about it here.

The case when μX is absolutely continuous with respect to the Lebesgue measure is especially important:

Definition 6.8 (Absolute continuity and pdfs) A random variable X with the property that μX ≪ λ, where λ is the Lebesgue measure on B(R), is said to be absolutely continuous. In that case, any Radon-Nikodym derivative dμX/dλ is called the probability density function (pdf) of X, and is denoted by fX. Similarly, a random vector X = (X1, …, Xn) is said to be absolutely continuous if μX ≪ λ, where λ is the Lebesgue measure on B(Rⁿ), and the Radon-Nikodym derivative dμX/dλ, denoted by fX, is called the probability density function (pdf) of X.
Problem 6.9

1. Let X = (X1, …, Xn) be an absolutely-continuous random vector. Show that Xk is also absolutely continuous, and that its pdf is given by

fXk(x) = ∫_R ⋯ ∫_R fX(ξ1, …, ξk−1, x, ξk+1, …, ξn) dξ1 … dξk−1 dξk+1 … dξn  (n − 1 integrals).

(Note: fXk(x) is defined only for almost all x ∈ R; that is because fX is defined only up to null sets in B(Rⁿ).)
2. Let X be an absolutely-continuous random variable. Show that the random vector (X, X) is not absolutely continuous, even though both of its components are.
Problem 6.10 Let X = (X1, …, Xn) be an absolutely-continuous random vector with density fX, and let g : Rⁿ → R be a Borel-measurable function with g fX ∈ L^{0-1}(Rⁿ, B(Rⁿ), λ). Show that g(X) ∈ L^{0-1}(Ω, F, P) and that

E[g(X)] = ∫ g fX dλ = ∫_R ⋯ ∫_R g(ξ1, …, ξn) fX(ξ1, …, ξn) dξ1 … dξn.

Definition 6.11 (Discrete random variables) A random variable X is said to be discrete if there exists a countable set B ∈ B(R) such that μX(B) = 1.
Definition 6.13 (Singular distributions) A distribution which has no atoms and is singular with respect to the Lebesgue measure is called singular.

Example 6.14 (A measure which is neither absolutely continuous nor discrete) By Problem 5.25, there exists a measure μ on [0, 1] with the following properties:

1. μ has no atoms, i.e., μ({x}) = 0, for all x ∈ [0, 1].

[Figure: the cdf of the measure μ from Example 6.14.]
6.3

Independence

The point at which probability departs from measure theory is when independence is introduced. As seen in Example 6.5, two random variables can depend on each other in different ways. One extreme (the case of X and Y) corresponds to the case when the dependence is very weak - the distribution of Y stays the same when the value of X is revealed:

Definition 6.15 (Independence of two random variables) Two random variables X and Y are said to be independent if

P[{X ∈ A} ∩ {Y ∈ B}] = P[X ∈ A] P[Y ∈ B], for all A, B ∈ B(R).

It turns out that independence of random variables is a special case of the more general notion of independence between families of sets.
Problem 6.17

1. Show, by means of an example, that the notion of independence would change if we asked for the product condition (6.1) to hold only for k = n and i1 = 1, …, ik = n.
2. Show that, however, if Ω ∈ Ai, for all i = 1, …, n, then it is enough to test (6.1) for k = n and i1 = 1, …, ik = n to conclude independence of Ai, i = 1, …, n.

Problem 6.18 Show that random variables X and Y are independent if and only if the σ-algebras σ(X) and σ(Y) are independent.

When only two families of sets are compared, there is no difference between pairwise independence and independence. For 3 or more, the difference is non-trivial:
Example 6.20 (Pairwise independence without independence) Let X1, X2 and X3 be independent random variables, each with the coin-toss distribution, i.e., P[Xi = 1] = P[Xi = −1] = 1/2, for i = 1, 2, 3. It is not hard to construct a probability space where such random variables may be defined explicitly: let Ω = {1, 2, 3, 4, 5, 6, 7, 8}, F = 2^Ω, and let P be characterized by P[{ω}] = 1/8, for all ω ∈ Ω. Define

Xi(ω) = 1, if ω ∈ Ωi, and Xi(ω) = −1, otherwise,

where Ω1 = {1, 3, 5, 7}, Ω2 = {2, 3, 6, 7} and Ω3 = {5, 6, 7, 8}. It is easy to check that X1, X2 and X3 are independent (Xi is the i-th bit in the binary representation of ω).
We don't need to check the other possibilities, such as Y1 = 1, Y2 = −1, to conclude that Y1 and Y2 are independent (see Problem 6.21 below).
On the other hand, Y1, Y2 and Y3 are not independent:

P[Y1 = 1, Y2 = 1, Y3 = 1] = P[X2 = X3, X1 = X3, X1 = X2] = P[X1 = X2 = X3] = 1/4 ≠ 1/8.
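The whole example is small enough to check exhaustively by machine. The sketch below assumes (as the displayed computation suggests) that the lost definition is Y1 = X2·X3, Y2 = X1·X3, Y3 = X1·X2, and verifies pairwise independence together with the failure of joint independence, using exact rational probabilities.

```python
from itertools import product
from fractions import Fraction as F

# Uniform coin-toss triple (X1, X2, X3): eight outcomes, P = 1/8 each.
outcomes = list(product((-1, 1), repeat=3))

def prob(event):
    return F(sum(1 for w in outcomes if event(w)), len(outcomes))

# Assumed construction: Y_i is the product of the other two coordinates.
Y = [lambda w: w[1] * w[2], lambda w: w[0] * w[2], lambda w: w[0] * w[1]]

# Pairwise independent: joint probabilities factor for every pair.
for i in range(3):
    for j in range(i + 1, 3):
        joint = prob(lambda w: Y[i](w) == 1 and Y[j](w) == 1)
        assert joint == prob(lambda w: Y[i](w) == 1) * prob(lambda w: Y[j](w) == 1)

# ...but not independent as a triple: 1/4 instead of 1/8.
p3 = prob(lambda w: all(y(w) == 1 for y in Y))
assert p3 == F(1, 4) and p3 != F(1, 8)
```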
Problem 6.21 Show that if A1, …, An are independent, then so are the families {Ai, Ai^c}, i = 1, …, n.

A more general statement is also true (and very useful):

1. Show that random variables X1, …, Xn are independent if and only if

Π_{i=1}^n E[fi(Xi)] = E[Π_{i=1}^n fi(Xi)],

for all n-tuples (f1, …, fn) of bounded continuous real functions. (Hint: Approximate!)
2. Let {X^i_n}n∈N, i = 1, …, m, be sequences of random variables such that X^1_n, …, X^m_n are independent for each n ∈ N. If X^i_n → X^i, a.s., i = 1, …, m, for some X^1, …, X^m ∈ L0, show that X^1, …, X^m are independent.

The idea "independent means multiply" applies not only to probabilities, but also to random variables:
Proposition 6.28 (Expectation of a function of independent components) Let X, Y be independent random variables, and let h : R² → [0, ∞) be a measurable function. Then

E[h(X, Y)] = ∫_R ∫_R h(x, y) μX(dx) μY(dy).

PROOF By independence and part (1) of Problem 6.24, the distribution of the random vector (X, Y) is given by μX × μY, where μX is the distribution of X and μY is the distribution of Y. Using Fubini's theorem, we get

E[h(X, Y)] = ∫ h dμ(X,Y) = ∫_R ∫_R h(x, y) μX(dx) μY(dy).
The product formula 2. remains true if we assume that Xi ∈ L0+ (instead of L1), for i = 1, …, n.

PROOF Using the fact that X1 and X2 ⋯ Xn are independent random variables (use part (2) of Problem 6.26), we can assume without loss of generality that n = 2.
Focusing first on the case X1, X2 ∈ L0+, we apply Proposition 6.28 with h(x, y) = xy to conclude that

E[X1 X2] = ∫_R ∫_R x1 x2 μX1(dx1) μX2(dx2) = ∫_R x2 E[X1] μX2(dx2) = E[X1] E[X2].

For the case X1, X2 ∈ L1, we split X1 X2 = X1+ X2+ − X1+ X2− − X1− X2+ + X1− X2− and apply the above conclusion to the 4 pairs X1+ X2+, X1+ X2−, X1− X2+ and X1− X2−.
Problem 6.30 (Conditions for independent-means-multiply) Proposition 6.29 states that for independent X and Y, we have

(6.2)  E[XY] = E[X] E[Y],

whenever X, Y ∈ L1 or X, Y ∈ L0+. Give an example which shows that (6.2) is no longer necessarily true if X ∈ L0+ and Y ∈ L1. (Hint: Build your example so that E[(XY)+] = E[(XY)−] = ∞. Use ([0, 1], B([0, 1]), λ) and take Y(ω) = 1_{[0,1/2]}(ω) − 1_{(1/2,1]}(ω). Then show that any random variable X with the property that X(ω) = X(1 − ω) is independent of Y.)
6.4

Proposition 6.32 (Convolution as the distribution of a sum) Let X and Y be independent random variables, and let Z = X + Y be their sum. Then the distribution μZ of Z has the following representation:

μZ(B) = ∫_R μX(B − y) μY(dy), where B − y = {b − y : b ∈ B}.

PROOF We can view Z as the function f(x, y) = x + y applied to the random vector (X, Y), and so we have E[g(Z)] = E[h(X, Y)], where h(x, y) = g(x + y). In particular, for g(z) = 1_B(z), Proposition 6.28 implies that

μZ(B) = E[g(Z)] = ∫_R ∫_R 1_{{x+y∈B}} μX(dx) μY(dy) = ∫_R ∫_R 1_{{x∈B−y}} μX(dx) μY(dy) = ∫_R μX(B − y) μY(dy).
It is customary to write

∫_R f(x) dF(x)

as notation for the integral ∫ f dμ, where F(x) = μ((−∞, x]). The reason for this is that such integrals - called Lebesgue-Stieltjes integrals - have a theory parallel to that of the Riemann integral, and the correspondence between dF(x) and dμ is parallel to the correspondence between dx and dλ.
Corollary 6.33 (Cdf of a sum as a convolution) Let X, Y be independent random variables, and let Z be their sum. Then

FZ(z) = ∫_R FX(z − y) dFY(y).
Definition 6.34 (Convolution of probability measures) Let μ1 and μ2 be two probability measures on B(R). The convolution of μ1 and μ2 is the probability measure μ1 ∗ μ2 on B(R), given by

(μ1 ∗ μ2)(B) = ∫_R μ1(B − x) μ2(dx),

where B − x = {b − x : b ∈ B} ∈ B(R).
Problem 6.35 Show that ∗ is a commutative and associative operation on the set of all probability measures on B(R). (Hint: Use Proposition 6.32.)
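For measures supported on the integers, the convolution of Definition 6.34 reduces to the familiar discrete convolution of weight functions. The sketch below (two fair dice, an arbitrary but classical example) computes μ ∗ μ exactly and checks commutativity on a small case.

```python
from fractions import Fraction as F

# Discrete convolution of two probability measures on the integers,
# represented as {point: weight} dictionaries.
def convolve(mu1, mu2):
    out = {}
    for x, p in mu1.items():
        for y, q in mu2.items():
            out[x + y] = out.get(x + y, F(0)) + p * q
    return out

die = {k: F(1, 6) for k in range(1, 7)}   # one fair six-sided die
two = convolve(die, die)                  # distribution of the sum of two dice

assert two[7] == F(1, 6)                        # 7 is the most likely total
assert sum(two.values()) == 1                   # still a probability measure
assert convolve(die, two) == convolve(two, die) # commutativity (Problem 6.35)
```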
It is interesting to see how convolution mixes with absolute continuity. To simplify the notation, we write ∫_A f(x) dx instead of the more precise ∫_A f(x) λ(dx) for the (Lebesgue) integral with respect to the Lebesgue measure λ on R. When A = [a, b] ⊆ R, we write ∫_a^b f(x) dx.
Proposition 6.36 (Convolution inherits absolute continuity from either component) Let X and Y be independent random variables, and suppose that X is absolutely continuous. Then their sum Z = X + Y is also absolutely continuous and its density fZ is given by

fZ(z) = ∫_R fX(z − y) μY(dy).

PROOF Define f(z) = ∫_R fX(z − y) μY(dy), for some density fX of X (remember, the density function is defined only λ-a.e.). The function f is measurable (why?), so it will be enough (why?) to show that

(6.3)  P[Z ∈ [a, b]] = ∫_{[a,b]} f(z) dz, for all −∞ < a < b < ∞.

We start with the right-hand side of (6.3) and use Fubini's theorem to get

∫_{[a,b]} f(z) dz = ∫_R 1_{[a,b]}(z) (∫_R fX(z − y) μY(dy)) dz = ∫_R (∫_R 1_{[a,b]}(z) fX(z − y) dz) μY(dy)
Problem 6.38

1. Use the reasoning from the proof of Proposition 6.36 to show that the convolution is a well-defined operation on L1(R).
2. Show that if X and Y are independent absolutely-continuous random variables, then X + Y is also absolutely continuous with density which is the convolution of the densities of X and Y.
6.5

We leave the most basic of the questions about independence for last: do independent random variables exist? We need a definition and two auxiliary results first.

Definition 6.39 (Uniform distribution on (a, b)) A random variable X is said to be uniformly distributed on (a, b), for a < b ∈ R, if it is absolutely continuous with density

fX(x) = (1/(b − a)) 1_{(a,b)}(x).

Our first result states that a uniform random variable on (0, 1) can be transformed deterministically into a random variable with any prescribed distribution.
Remark 6.41 Proposition 6.40 is the basis for a technique used to simulate random variables. There are efficient algorithms for producing simulated values which resemble the uniform distribution on (0, 1) (so-called pseudo-random numbers). If a simulated value drawn from the distribution μ is needed, one can simply apply the function H to a pseudo-random number.
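The technique of Remark 6.41 is easy to demonstrate for a distribution whose cdf can be inverted in closed form. The sketch below uses the exponential distribution (so that H(u) = −ln(1 − u)/λ is the exact inverse of the cdf); the rate, seed and sample size are arbitrary choices for the illustration.

```python
import math
import random

# Inverse-transform sampling: push uniform pseudo-random numbers through
# the inverse H of the target cdf. Target here: Exp(rate).
rate = 2.0
H = lambda u: -math.log(1.0 - u) / rate

rng = random.Random(0)                    # fixed seed keeps the sketch reproducible
sample = [H(rng.random()) for _ in range(100_000)]

# Empirical check of the cdf F(x) = 1 - exp(-rate*x) at x = 1.
empirical = sum(1 for x in sample if x <= 1.0) / len(sample)
assert abs(empirical - (1.0 - math.exp(-rate))) < 0.01
```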
Our next auxiliary result tells us how to construct a sequence of independent uniforms:

Proposition 6.42 (A large-enough probability space exists) There exists a probability space (Ω, F, P), and on it a sequence {Xn}n∈N of random variables, such that

1. Xn has the uniform distribution on (0, 1), for each n ∈ N, and
2. the sequence {Xn}n∈N is independent.

PROOF Set (Ω, F, P) = ({−1, 1}^N, S, μC) - the coin-toss space with the product σ-algebra and the coin-toss measure. Let a : N × N → N be a bijection, i.e., (a_{ij})_{i,j∈N} is an arrangement of all natural numbers into a double array. For i, j ∈ N, we define the map π^{ij} : Ω → {−1, 1} by

π^{ij}(s) = s_{a_{ij}},

i.e., π^{ij} is the natural projection onto the a_{ij}-th coordinate. It is straightforward to show that, under P, the collection (π^{ij})_{i,j∈N} is independent; indeed, it is enough to check the equality

P[π^{i1 j1} = 1, …, π^{in jn} = 1] = P[π^{i1 j1} = 1] ⋯ P[π^{in jn} = 1],

for all n ∈ N and all different (i1, j1), …, (in, jn) ∈ N × N.
At this point, we use the construction from Section 2.3 of Lecture 2 to construct an independent copy of a uniformly-distributed random variable from each row of (π^{ij})_{i,j∈N}. We set

(6.4)  Xi = Σ_{j∈N} (1 + π^{ij}) 2^{−(j+1)}, i ∈ N.

By the second parts of Problems 6.26 and 6.27, we conclude that the sequence {Xi}i∈N is independent. Moreover, thanks to (6.4), Xi is uniform on (0, 1), for each i ∈ N.
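The map (6.4) from one row of coin tosses to a number in [0, 1) is just the binary expansion with digits b_j = (1 + s_j)/2 ∈ {0, 1}. A minimal sketch, truncated to finitely many tosses:

```python
# Truncated version of (6.4): signs in {-1, 1} become binary digits.
def to_unit(signs):
    return sum((1 + s) / 2 * 2.0**-(j + 1) for j, s in enumerate(signs))

# (1, -1, 1, -1) encodes binary 0.1010 = 1/2 + 1/8.
assert to_unit([1, -1, 1, -1]) == 0.625
# All tails gives 0; all heads approaches (but never reaches) 1.
assert to_unit([-1] * 16) == 0.0
assert to_unit([1] * 16) < 1.0
```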
Proposition 6.43 (Arbitrary independent sequences exist) Let {μn}n∈N be a sequence of probability measures on B(R). Then there exists a probability space (Ω, F, P), and a sequence {Xn}n∈N of random variables defined there, such that

1. μXn = μn, and
2. {Xn}n∈N is independent.

Definition 6.44 (Independent and identically distributed sequences) A sequence {Xn}n∈N of random variables is said to be independent and identically distributed (iid) if {Xn}n∈N is independent and all Xn have the same distribution.

Corollary 6.45 (Iid sequences exist) Given a probability measure μ on R, there exists a probability space supporting an iid sequence {Xn}n∈N such that μXn = μ.
6.6
Additional Problems
Problem 6.46 (The standard normal distribution) An absolutely continuous random variable X
is said to have the standard normal distribution - denoted by X N (0, 1) - if it admits a density
of the form
1
f (x) = exp(x2 /2), x R
2
For a r.v. with such a distribution we write X N (0, 1).
R
R
1. Show that R f (x) dx = 1. (Hint: Consider the double integral R2 f (x)f (y) dx dy and pass
to polar coordinates. )
2. For X ∼ N(0, 1), show that E[|X|ⁿ] < ∞ for all n ∈ ℕ. Then compute the n-th moment E[Xⁿ], for n ∈ ℕ.
3. A random variable with the same distribution as X², where X ∼ N(0, 1), is said to have the χ²-distribution. Find an explicit expression for the density of the χ²-distribution.
4. Let Y have the χ²-distribution. Show that there exists a constant α₀ > 0 such that E[exp(αY)] < ∞ for α < α₀ and E[exp(αY)] = +∞ for α ≥ α₀. (Note: For a random variable Y ∈ L⁰₊, the quantity E[exp(αY)] is called the exponential moment of order α.)
5. Let α₀ > 0 be a fixed, but arbitrary, constant. Find an example of a random variable X ≥ 0 with the property that E[X^α] < ∞ for α ≤ α₀ and E[X^α] = +∞ for α > α₀. (Hint: This is not the same situation as in (4) - this time the critical case α₀ is included in a different alternative. Try X = exp(Y), where P[Y ∈ ℕ] = 1.)
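Parts 1 and 2 can be sanity-checked numerically. The following sketch (plain Python with a simple trapezoid rule; the helper names are ours, not part of the text) verifies the total mass and the first even moments of N(0, 1):

```python
import math

def normal_pdf(x):
    # density of the standard normal distribution
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def integrate(g, a, b, n=200_000):
    # trapezoid rule on [a, b]; the tails beyond |x| = 10 are negligible
    h = (b - a) / n
    total = 0.5 * (g(a) + g(b))
    for i in range(1, n):
        total += g(a + i * h)
    return total * h

total_mass = integrate(normal_pdf, -10, 10)
second_moment = integrate(lambda x: x * x * normal_pdf(x), -10, 10)
fourth_moment = integrate(lambda x: x ** 4 * normal_pdf(x), -10, 10)

print(total_mass, second_moment, fourth_moment)  # ≈ 1, 1, 3
```

The values 1, 1 and 3 agree with the even-moment formula E[X^(2k)] = (2k − 1)!! asked for in part 2.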
Problem 6.47 (The memory-less property of the exponential distribution) A random variable X is said to have the exponential distribution with parameter λ > 0 - denoted by X ∼ Exp(λ) - if its distribution function FX is given by
FX(x) = 0 for x < 0, and FX(x) = 1 − exp(−λx), for x ≥ 0.
1. Show that Γ(1/2) = √π, where Γ is the Gamma function.
2. Remember that the conditional probability P[A|B] of A, given B, for A, B ∈ F with P[B] > 0, is given by
P[A|B] = P[A ∩ B]/P[B].
Compute P[X ≥ x₂ | X ≥ x₁], for x₂ > x₁ > 0, and compare it to P[X ≥ x₂ − x₁].
(Note: This can be interpreted as follows: the knowledge that the bulb stayed functional until x₁ does not change the probability that it will not explode in the next x₂ − x₁ units of time; bulbs have no memory.)
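The memory-less property is easy to observe in a simulation. The sketch below (the parameter value, thresholds and sample size are arbitrary choices of ours) estimates both sides of the comparison in part 2:

```python
import random

random.seed(0)
lam = 1.5                     # rate parameter of Exp(lam)
n = 200_000
samples = [random.expovariate(lam) for _ in range(n)]

x1, x2 = 0.5, 1.2
# estimate of the conditional probability P[X >= x2 | X >= x1]
survivors = [x for x in samples if x >= x1]
cond = sum(1 for x in survivors if x >= x2) / len(survivors)
# estimate of the unconditional probability P[X >= x2 - x1]
uncond = sum(1 for x in samples if x >= x2 - x1) / n

print(cond, uncond)  # both ≈ exp(-lam * (x2 - x1)) ≈ 0.35
```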
Conversely, suppose that Y is a random variable with the property that P[Y > 0] = 1 and
P[Y > y] > 0 for all y > 0. Assume further that
(6.5)
This is often known as the second Borel-Cantelli lemma. (Hint: Use the inequality 1 − x ≤ e^{−x}, x ∈ ℝ.)
3. Let {Xn}n∈ℕ be an iid (independent and identically distributed) sequence of coin tosses, i.e., independent random variables with P[Xn = T] = P[Xn = H] = 1/2 for all n ∈ ℕ (if you are uncomfortable with T and H, feel free to replace them with −1 and 1). A tail-run of size k is a finite sequence of at least k consecutive Ts starting from some index n ∈ ℕ. Show that for almost every ω ∈ Ω (i.e., almost surely) the sequence {Xn(ω)}n∈ℕ will contain infinitely many tail-runs of size k. Conclude that, for almost every ω, the sequence {Xn(ω)}n∈ℕ will contain infinitely many tail-runs of every length.
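A quick simulation makes the abundance of tail-runs plausible (the rigorous proof, of course, goes through the Borel-Cantelli argument above); the counting scheme below is ours:

```python
import random

random.seed(1)
n, k = 100_000, 10
flips = [random.choice("HT") for _ in range(n)]

# count runs of tails that reach length k
runs, current = 0, 0
for c in flips:
    current = current + 1 if c == "T" else 0
    if current == k:      # a run has just reached length k
        runs += 1

print(runs)  # on average roughly n * 2**-(k + 1), i.e. about 49 here
```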
Chapter 7
Weak convergence
Definition 7.1 (Weak convergence of probability measures) Let {μn}n∈ℕ be a sequence of probability measures on (S, S). We say that μn converges weakly to a probability measure μ on (S, S) - and write μn ⇒ μ - if
∫ f dμn → ∫ f dμ,
for all f ∈ Cb(S), where Cb(S) denotes the set of all continuous and bounded functions f : S → ℝ.
Remark 7.2 It would be more in tune with standard mathematical terminology to use the term weak-* convergence instead of weak convergence. For historical reasons, however, we omit the *.
Definition 7.3 (Convergence in distribution) A sequence {Xn}n∈ℕ of random variables is said to converge in distribution to a random variable X, denoted by Xn →D X, if μ_{Xn} ⇒ μ_X, i.e., if
E[f(Xn)] → E[f(X)],
for all f ∈ Cb(S). Let F be a closed set, and let {fk}k∈ℕ be as in Problem 7.4, with fk = f_{F;1/k} corresponding to ε = 1/k. If we set Fk = {x ∈ S : d(x, F) ≤ 1/k}, then Fk is a closed set (why?) and we have 1_F ≤ fk ≤ 1_{Fk}. By (7.1), we have
μ(F) ≤ ∫ fk dμ = ∫ fk dν ≤ ν(Fk),
and, similarly, ν(F) ≤ μ(Fk), for all k ∈ ℕ. Since Fk ↘ F (why?), we have μ(Fk) → μ(F) and ν(Fk) → ν(F), and it follows that μ(F) = ν(F).
It remains to note that the family of all closed sets is a π-system which generates the σ-algebra S to conclude that μ = ν.
We have seen in the proof of Proposition 7.5 that an operational characterization of weak convergence is needed. Here is a useful one. We start with a lemma; remember that ∂A denotes the topological boundary ∂A = Cl A \ Int A of a set A ⊆ S.
Problem 7.6 Let (Fα)α∈A be a partition of S into (possibly uncountably many) measurable subsets. Show that, for any probability measure μ on S, μ(Fα) = 0 for all but countably many α. (Hint: For n ∈ ℕ, define An = {α : μ(Fα) ≥ 1/n}. Argue that An has at most n elements.)
Definition 7.7 (μ-continuity sets) A set A ∈ S with the property that μ(∂A) = 0 is called a μ-continuity set.
Theorem 7.8 (Portmanteau Theorem) Let μ, {μn}n∈ℕ be probability measures on S. Then, the following are equivalent:
1. μn ⇒ μ,
2. ∫ f dμn → ∫ f dμ, for all bounded, Lipschitz continuous f : S → ℝ,
(2) ⇒ (3): Given a closed set F, let Fk = {x ∈ S : d(x, F) ≤ 1/k} and fk = f_{F;1/k}, k ∈ ℕ, be as in the proof of Proposition 7.5. Since 1_F ≤ fk ≤ 1_{Fk} and the functions fk are Lipschitz continuous, we have
lim sup_n μn(F) = lim sup_n ∫ 1_F dμn ≤ lim_n ∫ fk dμn = ∫ fk dμ ≤ μ(Fk),
(4) ⇒ (1): Pick f ∈ Cb(S) and (possibly after applying a linear transformation to it) assume that 0 < f(x) < 1, for all x ∈ S. Then, by Problem 5.23, we have ∫ f dν = ∫₀¹ ν(f > t) dt, for any probability measure ν on B(ℝ). The set {f > t} ⊆ S is open, so by (4), lim inf_n μn(f > t) ≥ μ(f > t), for all t. Therefore, by Fatou's lemma,
lim inf_n ∫ f dμn = lim inf_n ∫₀¹ μn(f > t) dt ≥ ∫₀¹ lim inf_n μn(f > t) dt ≥ ∫₀¹ μ(f > t) dt = ∫ f dμ.
We get the other inequality, lim sup_n ∫ f dμn ≤ ∫ f dμ, by applying the same argument to −f.
(3), (4) ⇒ (5): Let A be a μ-continuity set, let Int A be its interior and Cl A its closure. Then, since Int A is open and Cl A is closed, we have
μ(Int A) ≤ lim inf_n μn(Int A) ≤ lim inf_n μn(A) ≤ lim sup_n μn(A)
The above claim implies that there exists a sequence rk ∈ [0, ∞) \ R such that rk ↘ 0. By (5) and the Claim above, we have μn(B_F(rk)) → μ(B_F(rk)) for all k ∈ ℕ. Hence, for k ∈ ℕ,
μ(B_F(rk)) = lim_n μn(B_F(rk)) ≥ lim sup_n μn(F).
Proposition 7.10 (Weak-convergence test families) Let I be a collection of open subsets of S such that
1. I is a π-system,
2. each open set in S can be represented as a finite or countable union of elements of I.
Corollary 7.12 (Weak convergence using cdfs) Suppose that S = ℝ, and let μ, {μn}n∈ℕ be probability measures on B(ℝ). Let F(x) = μ((−∞, x]) and Fn(x) = μn((−∞, x]), x ∈ ℝ, be the corresponding cdfs. Then, the following two statements are equivalent:
1. Fn(x) → F(x) for all x such that F is continuous at x, and
2. μn ⇒ μ.
PROOF (2) ⇒ (1): Let C be the set of all x such that F is continuous at x; equivalently, C = {x ∈ ℝ : μ({x}) = 0}. The sets (−∞, x] are μ-continuity sets for x ∈ C, so the Portmanteau theorem (Theorem 7.8) implies that Fn(x) = μn((−∞, x]) → μ((−∞, x]) = F(x), for all x ∈ C.
(1) ⇒ (2): The set C^c is at most countable (why?), and so the family
I = {(a, b) : a < b, a, b ∈ C}
satisfies the conditions of Proposition 7.10. To show that μn ⇒ μ, it will be enough to show that μn(I) → μ(I), for all I = (a, b) ∈ I. Since μ((a, b)) = F(b−) − F(a), where F(b−) = lim_{x↗b} F(x), it will be enough to show that
Fn(x−) → F(x−),
for all x ∈ C. Since Fn(x−) ≤ Fn(x), we have lim sup_n Fn(x−) ≤ lim_n Fn(x) = F(x) = F(x−). To prove the other inequality, we pick ε > 0 and, using the continuity of F at x, find δ > 0 such that x − δ ∈ C and F(x − δ) > F(x) − ε. Since Fn(x − δ) → F(x − δ), there exists n₀ ∈ ℕ such that Fn(x − δ) > F(x) − 2ε for n ≥ n₀, and, since Fn is non-decreasing, Fn(x−) ≥ Fn(x − δ) > F(x) − 2ε, for n ≥ n₀. Consequently, lim inf_n Fn(x−) ≥ F(x) − 2ε, and the statement follows.
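The role of the continuity points in Corollary 7.12 can be seen in the simplest possible example: the point masses μn = δ_{1/n} converge weakly to μ = δ₀, but the cdfs fail to converge at the discontinuity x = 0. A minimal sketch:

```python
def F_n(x, n):
    # cdf of the point mass at 1/n
    return 1.0 if x >= 1.0 / n else 0.0

def F(x):
    # cdf of the point mass at 0 (the weak limit)
    return 1.0 if x >= 0 else 0.0

# at a continuity point of F (x = 0.3), F_n(x) -> F(x) = 1
print([F_n(0.3, n) for n in (1, 2, 5, 10)])   # [0.0, 0.0, 1.0, 1.0]
# at the discontinuity x = 0, convergence fails: F_n(0) = 0, F(0) = 1
print([F_n(0.0, n) for n in (1, 10, 100)])    # [0.0, 0.0, 0.0]
```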
One of the (many) reasons why weak convergence is so important is the fact that it possesses nice compactness properties. The central result here is the theorem of Prohorov, which is, in a sense, an analogue of the Arzelà-Ascoli compactness theorem for families of measures. The statement we give here is not the most general possible, but it will serve all our purposes.
2. relatively (sequentially) weakly compact if any sequence {μn}n∈ℕ in M admits a weakly-convergent subsequence {μnk}k∈ℕ.
Theorem 7.14 (Prohorov) Suppose that the metric space (S, d) is complete and separable, and let M be a set of probability measures on S. Then M is relatively weakly compact if and only if it is tight.
PROOF (Note: In addition to the fact that the stated version of the theorem is not the most general available, we only give the proof of the so-called Helly's selection theorem, i.e., the special case S = ℝ. The general case is technically more involved, but the key ideas are similar.)
(Tight ⇒ relatively weakly compact): Suppose that M is tight, and let {μn}n∈ℕ be a sequence in M. Let Q be a countable and dense subset of ℝ, and let {qk}k∈ℕ be an enumeration of Q. Since all {μn}n∈ℕ are probability measures, the sequence {Fn(q₁)}n∈ℕ, where Fn(x) = μn((−∞, x]), is bounded. Consequently, it admits a convergent subsequence; we denote its indices by n_{1,k}, k ∈ ℕ. The sequence {F_{n_{1,k}}(q₂)}k∈ℕ is also bounded, so we can extract a further subsequence - let us denote it by n_{2,k}, k ∈ ℕ - so that F_{n_{2,k}}(q₂) converges as k → ∞. Repeating this procedure for each element of Q, we arrive at a sequence of increasing sequences of integers n_{i,k}, k ∈ ℕ, i ∈ ℕ, with the property that n_{i+1,k}, k ∈ ℕ, is a subsequence of n_{i,k}, k ∈ ℕ, and that F_{n_{i,k}}(q_j) converges for each j ≤ i. Therefore, the diagonal sequence m_k = n_{k,k} is a subsequence of each n_{i,k}, k ∈ ℕ, i ∈ ℕ, and we can define a function F̃ : Q → [0, 1] by
F̃(q) = lim_k F_{m_k}(q).
We extend F̃ to a function F on ℝ by
F(x) = inf{F̃(q) : q ∈ Q, q > x},
so that lim_{x→∞} F(x) ≥ 1 − ε and lim_{x→−∞} F(x) ≤ ε. The claim follows from the arbitrariness of ε > 0.
(Relatively weakly compact ⇒ tight): Suppose, to the contrary, that M is relatively weakly compact but not tight. Then there exists ε > 0 such that for each n ∈ ℕ there exists μn ∈ M such that μn([−n, n]) < 1 − ε, and, consequently,
(7.2) μn([−M, M]) < 1 − ε for n ≥ M.
The sequence {μn}n∈ℕ admits a weakly-convergent subsequence {μnk}k∈ℕ, with limit μ, say. By (7.2), we have
lim sup_k μnk([−M, M]) ≤ 1 − ε, for each M > 0,
so that μ([−M, M]) ≤ 1 − ε for all M > 0. Continuity of probability implies that μ(ℝ) ≤ 1 − ε - a contradiction with the fact that μ is a probability measure on B(ℝ).
The following problem casts tightness in more operational terms:
Problem 7.15 Let M be a non-empty set of probability measures on ℝ. Show that M is tight if and only if there exists a non-decreasing function φ : [0, ∞) → [0, ∞) such that
1. φ(x) → ∞ as x → ∞, and
2. sup_{μ∈M} ∫ φ(|x|) μ(dx) < ∞.
Prohorov's theorem goes well with the following problem (it will be used soon):
Problem 7.16 Let μ be a probability measure on B(ℝ) and let {μn}n∈ℕ be a sequence of probability measures on B(ℝ) with the property that every subsequence {μnk}k∈ℕ of {μn}n∈ℕ has a (further) subsequence {μnkl}l∈ℕ which converges weakly towards μ. Show that {μn}n∈ℕ converges weakly to μ. (Hint: If μn ⇏ μ, then there exist f ∈ Cb and a subsequence {μnk}k∈ℕ of {μn}n∈ℕ such that ∫ f dμnk converges, but not to ∫ f dμ.)
We conclude with a comparison between convergence in distribution and convergence in probability.
Proposition 7.17 (Relation between convergence in probability and in distribution) Let {Xn}n∈ℕ be a sequence of random variables.
PROOF Assume that Xn →P X. To show that Xn →D X, the Portmanteau theorem guarantees that it will be enough to prove that lim sup_n P[Xn ∈ F] ≤ P[X ∈ F], for all closed sets F. For F ⊆ ℝ and ε > 0, we define Fε = {x ∈ ℝ : d(x, F) ≤ ε}. Therefore, for a closed set F, we have
P[Xn ∈ F] = P[Xn ∈ F, |X − Xn| > ε] + P[Xn ∈ F, |X − Xn| ≤ ε]
≤ P[|X − Xn| > ε] + P[X ∈ Fε].
Remark 7.18 It is not true that Xn →D X implies Xn →P X in general. Here is a simple example: take Ω = {1, 2} with uniform probability, and define Xn(1) = 1 and Xn(2) = 2 for n odd, and Xn(1) = 2 and Xn(2) = 1 for n even. Then all Xn have the same distribution, so we have Xn →D X₁. On the other hand, P[|Xn − X₁| ≥ 1/2] = 1 for n even. In fact, it is not hard to see that Xn ⇏P X for any random variable X.
7.2
Characteristic functions
A characteristic function is simply the Fourier transform, in probabilistic language. Since we will be integrating complex-valued functions, we define (both integrals on the right need to exist)
∫ f dμ = ∫ Re f dμ + i ∫ Im f dμ,
where Re f and Im f denote the real and the imaginary part of a function f : ℝ → ℂ. The reader will easily figure out which properties of the integral transfer from the real case.
Proposition 7.20 (First properties of characteristic functions) Let X, Y and {Xn}n∈ℕ be random variables.
1. φX(0) = 1 and |φX(t)| ≤ 1, for all t.
2. φX(−t) = φX(t)‾, where the bar denotes complex conjugation.
3. φX is uniformly continuous.
4. If X and Y are independent, then φ_{X+Y} = φX φY.
5. For all t₁ < t₂ < ... < tn, the matrix A = (a_{jk})_{1≤j,k≤n} given by
a_{jk} = φX(tj − tk)
is Hermitian and positive semi-definite, i.e., A* = A and ξ*Aξ ≥ 0, for any ξ ∈ ℂⁿ.
5. The matrix A is Hermitian by (2). To see that it is positive semi-definite, note that a_{jk} = E[e^{i tj X} e^{−i tk X}], and so
Σ_{j=1}^n Σ_{k=1}^n ξj ξ̄k a_{jk} = E[(Σ_{j=1}^n ξj e^{i tj X})(Σ_{k=1}^n ξ̄k e^{−i tk X})] = E[|Σ_{j=1}^n ξj e^{i tj X}|²] ≥ 0.
6. For f ∈ Cb(ℝ), we have f(Xn) → f(X), a.s., and so, by the dominated convergence theorem applied to the cases f(x) = cos(tx) and f(x) = sin(tx), we have
φX(t) = E[exp(itX)] = E[lim_n exp(itXn)] = lim_n E[exp(itXn)] = lim_n φXn(t).
Remark 7.21 We do not prove (or use) it in these notes, but it can be shown that a function φ : ℝ → ℂ, continuous at the origin with φ(0) = 1, is a characteristic function of some probability measure μ on B(ℝ) if and only if it satisfies (5). This is known as Bochner's theorem.
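Several items of Proposition 7.20 can be checked against the empirical characteristic function of a sample. A sketch (the sample sizes and the value of t are arbitrary choices of ours; the Monte Carlo error is of order 1/√n):

```python
import cmath
import random

random.seed(2)
n = 50_000
xs = [random.gauss(0, 1) for _ in range(n)]       # X ~ N(0, 1)
ys = [random.expovariate(1.0) for _ in range(n)]  # Y ~ Exp(1), independent

def ecf(sample, t):
    # empirical characteristic function: average of exp(i t X_k)
    return sum(cmath.exp(1j * t * x) for x in sample) / len(sample)

t = 0.7
phi_x, phi_y = ecf(xs, t), ecf(ys, t)
phi_sum = ecf([x + y for x, y in zip(xs, ys)], t)

print(abs(ecf(xs, 0.0)))              # phi(0) = 1
print(abs(phi_x))                     # |phi(t)| <= 1
print(abs(phi_sum - phi_x * phi_y))   # small: phi_{X+Y} = phi_X phi_Y
print(abs(phi_x - cmath.exp(-t * t / 2)))  # close to the N(0,1) formula
```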
Our next result shows how μ can be recovered from its characteristic function φ = φμ:
Theorem 7.23 (Inversion theorem) Let μ be a probability measure on B(ℝ), and let φ = φμ be its characteristic function. Then, for a < b ∈ ℝ, we have
(7.3) μ((a, b)) + ½ μ({a}) + ½ μ({b}) = lim_{T→∞} (1/2π) ∫_{−T}^{T} [(e^{−ita} − e^{−itb})/(it)] φ(t) dt.
PROOF Set
F(a, b, T) = (1/2π) ∫_{−T}^{T} [(e^{−ita} − e^{−itb})/(it)] φ(t) dt,
where (e^{−ita} − e^{−itb})/(it) = ∫_a^b e^{−ity} dy. Another use of Fubini's theorem yields:
F(a, b, T) = (1/2π) ∫_{[−T,T]×[a,b]×ℝ} exp(it(x − y)) μ(dx) dy dt
= (1/2π) ∫_ℝ [∫_{−T}^{T} (e^{it(x−a)} − e^{it(x−b)})/(it) dt] μ(dx)
= ∫_ℝ f(a, b, T; x) μ(dx).
Set
K(T; c) = ∫₀^T [sin(ct)/t] dt,
and note that, since cos is an even and sin an odd function, we have
f(a, b, T; x) = (1/π) [K(T; b − x) − K(T; a − x)].
The substitution s = ct shows that ∫₀^T sin(ct)/t dt = ∫₀^{cT} sin(s)/s ds, so that
(7.4) K(T; c) = K(cT; 1) for c > 0, K(T; c) = 0 for c = 0, and K(T; c) = −K(|c|T; 1) for c < 0.
Problem 5.24 implies that
lim_{T→∞} K(T; c) = π/2 for c > 0, 0 for c = 0, and −π/2 for c < 0,
and so
lim_{T→∞} f(a, b, T; x) = 0 for x ∈ [a, b]^c, ½ for x = a or x = b, and 1 for a < x < b.
Observe first that the function T ↦ K(T; 1) is continuous on [0, ∞) and has a finite limit as T → ∞, so that sup_{T≥0} |K(T; 1)| < ∞. Furthermore, (7.4) implies that |K(T; c)| ≤ sup_{T≥0} |K(T; 1)| for any c ∈ ℝ and T ≥ 0, so that
sup{|f(a, b, T; x)| : x ∈ ℝ, T ≥ 0} < ∞.
1. Absolutely continuous distributions.

Name | Parameters | Density fX(x) | Char. function φX(t)
Uniform | a < b | (1/(b − a)) 1_[a,b](x) | (e^{itb} − e^{ita})/(it(b − a))
Normal | μ ∈ ℝ, σ > 0 | (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)) | exp(iμt − ½σ²t²)
Exponential | λ > 0 | λ exp(−λx) 1_[0,∞)(x) | λ/(λ − it)
Double Exponential | λ > 0 | (λ/2) exp(−λ|x|) | λ²/(λ² + t²)
Cauchy | μ ∈ ℝ, γ > 0 | γ/(π(γ² + (x − μ)²)) | exp(iμt − γ|t|)

2. Discrete distributions.

Name | Parameters | pn = P[X = n], n ∈ ℤ | Char. function φX(t)
Dirac | m ∈ ℕ₀ | 1_{n=m} | exp(itm)
Coin-toss | p ∈ (0, 1) | p₁ = p, p₋₁ = 1 − p | p e^{it} + (1 − p) e^{−it} (= cos(t) for p = ½)
Geometric | p ∈ (0, 1) | pn = pⁿ(1 − p), n ∈ ℕ₀ | (1 − p)/(1 − e^{it} p)
Poisson | λ > 0 | pn = e^{−λ} λⁿ/n!, n ∈ ℕ₀ | exp(λ(e^{it} − 1))

3. A singular distribution.

Name | Char. function φX(t)
Cantor | e^{it/2} Π_{k=1}^∞ cos(t/3^k)
7.3
Tail behavior
We continue by describing several methods one can use to extract useful information about the
tails of the underlying probability distribution from a characteristic function.
(dⁿ/dtⁿ) φX(t) = E[e^{itX} (iX)ⁿ].
In particular,
E[Xⁿ] = (−i)ⁿ (dⁿ/dtⁿ) φX(0).
PROOF We give the proof in the case n = 1 and leave the general case to the reader:
lim_{h→0} (φ(h) − φ(0))/h = lim_{h→0} ∫ℝ (e^{ihx} − 1)/h μ(dx) = ∫ℝ lim_{h→0} (e^{ihx} − 1)/h μ(dx) = ∫ℝ ix μ(dx),
where the passage of the limit under the integral sign is justified by the dominated convergence theorem which, in turn, can be used since
|(e^{ihx} − 1)/h| ≤ |x|, and ∫ℝ |x| μ(dx) = E[|X|] < ∞.
Remark 7.28 The converse fails in general: the existence of the derivative (dⁿ/dtⁿ) φX(0) does not imply the finiteness of the n-th moment of X (see Problem 7.40 below for the case n = 1).
Finer estimates of the tails of a probability distribution can be obtained by a finer analysis of the behavior of φ around 0:
Proposition 7.29 (A tail estimate) Let μ be a probability measure on B(ℝ) and let φ = φμ be its characteristic function. Then, for ε > 0 we have
μ([−2/ε, 2/ε]^c) ≤ (1/ε) ∫_{−ε}^{ε} (1 − φ(t)) dt.
PROOF Let X be a random variable with distribution μ. We start by using Fubini's theorem to get
(1/2ε) ∫_{−ε}^{ε} (1 − φ(t)) dt = E[(1/2ε) ∫_{−ε}^{ε} (1 − e^{itX}) dt] = E[(1/2ε) ∫_{−ε}^{ε} (1 − cos(tX)) dt] = E[1 − sin(εX)/(εX)].
It remains to observe that 1 − sin(x)/x ≥ 0 and 1 − sin(x)/x ≥ 1 − 1/|x| for all x. Therefore, if we use the first inequality on [−2, 2] and the second one on [−2, 2]^c, we get
1 − sin(x)/x ≥ ½ 1_{|x|>2}, so that (1/2ε) ∫_{−ε}^{ε} (1 − φ(t)) dt ≥ ½ P[|εX| > 2] = ½ μ([−2/ε, 2/ε]^c).
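For a concrete distribution, the estimate of Proposition 7.29 can be checked numerically. The sketch below uses φ(t) = e^{−t²/2}, the characteristic function of N(0, 1), and a trapezoid rule for the right-hand side (the step counts and test values of ε are ours):

```python
import math

def phi(t):
    # characteristic function of N(0, 1)
    return math.exp(-t * t / 2)

def right_side(eps, steps=100_000):
    # (1/eps) * integral of (1 - phi(t)) over [-eps, eps]
    h = 2 * eps / steps
    total = 0.5 * ((1 - phi(-eps)) + (1 - phi(eps)))
    for i in range(1, steps):
        total += 1 - phi(-eps + i * h)
    return total * h / eps

def left_side(eps):
    # mu([-2/eps, 2/eps]^c) = P[|X| > 2/eps] for X ~ N(0, 1)
    return 2 * (1 - 0.5 * (1 + math.erf((2 / eps) / math.sqrt(2))))

ok = all(left_side(e) <= right_side(e) for e in (0.5, 1.0, 2.0))
print(ok)  # True: the tail estimate holds for each tested eps
```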
(Hint: Use (and prove) the fact that f ∈ L¹₊(ℝ) can be approximated in L¹(ℝ) by a function of the form Σ_{k=1}^n αk 1_{[ak,bk]}.)
7.4
which, together with the arbitrariness of ε > 0, implies that {μn}n∈ℕ is tight.
Let {μnk}k∈ℕ be a convergent subsequence of {μn}n∈ℕ, and let μ* be its limit. Since φμnk → φ, we conclude that φ is the characteristic function of μ*. It remains to show that the whole sequence converges to μ* weakly. This follows, however, directly from Problem 7.16, since any convergent subsequence {μnk}k∈ℕ has the same limit μ*.
Problem 7.33 Let φ be a characteristic function of some probability measure μ on B(ℝ). Show that ψ(t) = e^{φ(t)−1} is also a characteristic function of some probability measure on B(ℝ).
7.5
Additional Problems
Problem 7.34 (Scheffé's Theorem) Let {Xn}n∈ℕ be absolutely-continuous random variables with densities fXn, such that fXn(x) → f(x), λ-a.e., where f is the density of the absolutely-continuous random variable X. Show that Xn →D X. (Hint: Show that ∫ℝ |fXn − f| dλ → 0 by writing the integrand in terms of (f − fXn)⁺ ∧ f.)
Problem 7.35 (Convergence of moments) Let {Xn}n∈ℕ and X be random variables with a common uniform bound, i.e., such that there exists M > 0 with
|Xn| ≤ M and |X| ≤ M, a.s., for all n ∈ ℕ.
Show that the following two statements are equivalent:
1. Xn →D X, and
2. E[Xn^k] → E[X^k], for all k ∈ ℕ.
Problem 7.36 (Convergence in variation) A sequence {μn}n∈ℕ of probability measures on B(ℝ) is said to converge in variation to μ if sup_{A∈B(ℝ)} |μn(A) − μ(A)| → 0. Compare convergence in variation to weak convergence: if one implies the other, prove it. Give counterexamples if they are not equivalent.
Problem 7.37 (Convergence of Maxima) Let {Xn}n∈ℕ be an iid sequence of standard normal (N(0, 1)) random variables. Define the sequence of up-to-date maxima {Mn}n∈ℕ by
Mn = max(X1, ..., Xn).
Show that:
1. Show that lim_{x→∞} P[X1 > x] / ((1/(x√(2π))) exp(−½x²)) = 1, by establishing the bounds
(7.6) (1/x − 1/x³) φ(x) ≤ P[X1 > x] ≤ (1/x) φ(x), x > 0,
where φ(x) = (1/√(2π)) exp(−½x²) (integrate by parts).
2. Show that, for ξ > 0, lim_{x→∞} P[X1 > x + ξ/x] / P[X1 > x] = exp(−ξ).
3. Let {bn}n∈ℕ be a sequence of real numbers with the property that P[X1 > bn] = 1/n. Show that P[bn(Mn − bn) ≤ x] → exp(−e^{−x}).
4. Show that lim_n bn/√(2 log n) = 1.
5. Conclude that Mn/√(2 log n) → 1, in probability.
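Part 3 (the Gumbel limit) can be explored by simulation. The convergence is known to be slow for normal maxima, so the agreement is only rough at moderate n; all numerical choices below are ours:

```python
import math
import random

random.seed(3)
n, trials = 2_000, 500

def bn(n):
    # solve P[X1 > b] = 1/n for the standard normal tail, by bisection
    lo, hi = 0.0, 10.0
    for _ in range(80):
        mid = (lo + hi) / 2
        tail = 0.5 * (1 - math.erf(mid / math.sqrt(2)))
        if tail > 1.0 / n:
            lo = mid
        else:
            hi = mid
    return lo

b = bn(n)
x = 1.0
hits = 0
for _ in range(trials):
    m = max(random.gauss(0, 1) for _ in range(n))
    if b * (m - b) <= x:
        hits += 1

print(hits / trials, math.exp(-math.exp(-x)))  # roughly comparable
```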
Problem 7.38 Check that the expressions for the characteristic functions in Example 7.26 are correct. (Hint: Not much computing is needed. Use the inversion theorem. For 2., start with the case μ = 0, σ = 1 and derive a first-order differential equation for φ.)
Problem 7.39 (Atoms from the characteristic function) Let μ be a probability measure on B(ℝ), and let φ = φμ be its characteristic function.
1. Show that μ({a}) = lim_{T→∞} (1/2T) ∫_{−T}^{T} e^{−ita} φ(t) dt.
2. Show that if lim_{t→∞} |φ(t)| = lim_{t→−∞} |φ(t)| = 0, then μ has no atoms.
3. Show that the converse of (2) is false. (Hint: Prove that |φ(tn)| = 1 along a suitably chosen sequence tn → ∞, where φ is the characteristic function of the Cantor distribution.)
Problem 7.40 (Existence of φX′(0) does not imply that X ∈ L¹) Let X be a random variable which takes values in ℤ \ {−2, −1, 0, 1, 2} with
P[X = k] = P[X = −k] = C/(k² log(k)), for k = 3, 4, ...,
where C = ½ (Σ_{k≥3} 1/(k² log(k)))^{−1} ∈ (0, ∞). Show that φX′(0) = 0, but X ∉ L¹.
(Hint: Argue that, in order to establish that φX′(0) = 0, it is enough to show that
lim_{h→0} (1/h) Σ_{k≥3} (cos(hk) − 1)/(k² log(k)) = 0.
Then split the sum at k close to 2/h and use (and prove) the inequality |cos(x) − 1| ≤ min(x²/2, x). Bounding sums by integrals may help, too.)
Problem 7.41 (Multivariate characteristic functions) Let X = (X1, ..., Xn) be a random vector. The characteristic function φ = φX : ℝⁿ → ℂ is given by
φ(t1, t2, ..., tn) = E[exp(i Σ_{k=1}^n tk Xk)].
We will also use the shortcut t for (t1, ..., tn) and t · X for the random variable Σ_{k=1}^n tk Xk. Take for granted the following statement (the proof of which is similar to the proof of the 1-dimensional case):
Fact. Suppose that X¹ and X² are random vectors with φX¹(t) = φX²(t) for all t ∈ ℝⁿ. Then X¹ and X² have the same distribution, i.e., μX¹ = μX².
Prove the following statements:
1. Random variables X and Y are independent if and only if φ(X,Y)(t1, t2) = φX(t1) φY(t2) for all t1, t2 ∈ ℝ.
2. Random vectors X¹ and X² have the same distribution if and only if the random variables t · X¹ and t · X² have the same distribution for all t ∈ ℝⁿ. (This fact is known as Wald's device.)
1. It is not necessarily true that Xn + Yn →D X + Y. For that matter, we do not necessarily have (Xn, Yn) →D (X, Y) (where the pairs are considered as random elements in the metric space ℝ²).
2. If, in addition to (11.13), there exists a constant c ∈ ℝ such that P[Y = c] = 1, show that g(Xn, Yn) →D g(X, c), for any continuous function g : ℝ² → ℝ. (Hint: It is enough to show
Chapter 8
We start with a definitive form of the weak law of large numbers. We need two lemmas first:
Lemma 8.1 Let u1, ..., un and w1, ..., wn be complex numbers with |uk| ≤ M and |wk| ≤ M, for all k = 1, ..., n. Then
(8.1) |Π_{k=1}^n uk − Π_{k=1}^n wk| ≤ M^{n−1} Σ_{k=1}^n |uk − wk|.
PROOF We proceed by induction. For n = 1, the claim is trivial. Suppose that (8.1) holds for n. Then
|Π_{k=1}^{n+1} uk − Π_{k=1}^{n+1} wk| ≤ |Π_{k=1}^n uk| |u_{n+1} − w_{n+1}| + |w_{n+1}| |Π_{k=1}^n uk − Π_{k=1}^n wk|
≤ Mⁿ |u_{n+1} − w_{n+1}| + M · M^{n−1} Σ_{k=1}^n |uk − wk|
= M^{(n+1)−1} Σ_{k=1}^{n+1} |uk − wk|.
Lemma 8.2 (Convergence to the exponential) Let {zn}n∈ℕ be a sequence of complex numbers with zn → z ∈ ℂ. Then (1 + zn/n)ⁿ → e^z.
PROOF Since zn → z, it is enough to show that (1 + zn/n)ⁿ − e^{zn} → 0. Lemma 8.1, applied to uk = 1 + zn/n and wk = e^{zn/n}, k = 1, ..., n, yields
(8.2) |(1 + zn/n)ⁿ − e^{zn}| ≤ Mn^{n−1} n |1 + zn/n − e^{zn/n}|, where Mn = max(|1 + zn/n|, e^{|zn|/n}).
The convergent sequence {zn}n∈ℕ is bounded, and |1 + zn/n| ≤ e^{|zn|/n}, so that
(8.3) L := sup_n Mn^{n−1} ≤ sup_n e^{|zn|} < ∞.
To estimate the last term in (8.2), we start with the Taylor expansion e^b = 1 + b + Σ_{k≥2} b^k/k!, which converges absolutely for all b ∈ ℂ. Then, we use the fact that 1/k! ≤ 2^{−k+1} to obtain
(8.4) |e^b − 1 − b| ≤ Σ_{k≥2} |b|^k/k! ≤ |b|² Σ_{k≥2} 2^{−k+1} |b|^{k−2} ≤ |b|², for |b| ≤ 1.
Since |zn|/n ≤ 1 for large-enough n, it follows from (8.2), (8.4) and (8.3) that
lim sup_n |(1 + zn/n)ⁿ − e^{zn}| ≤ lim sup_n n L |zn/n|² = 0.
Theorem 8.3 (Weak law of large numbers) Let {Xn}n∈ℕ be an iid sequence of random variables with the (common) distribution μ and the characteristic function φ = φμ such that φ′(0) exists. Then c := −i φ′(0) ∈ ℝ and
(1/n) Σ_{k=1}^n Xk → c in probability.
PROOF The existence of φ′(0) means that
lim_{s↓0} (φ(s) − 1)/s = lim_{s↑0} (φ(s) − 1)/s = lim_{s→0} (φ(s) − 1)/s = φ′(0).
Since φ(−s) = φ(s)‾, comparing the two one-sided limits gives φ′(0) = −φ′(0)‾, so that φ′(0) is purely imaginary. Therefore, c = −i φ′(0) ∈ ℝ.
Let Sn = Σ_{k=1}^n Xk. According to Proposition 7.17, it will be enough to show that (1/n)Sn →D c. Moreover, by Theorem 7.32, all we need to do is show that
φ_{(1/n)Sn}(t) → e^{itc} = e^{t φ′(0)}, for all t ∈ ℝ.
The iid property of {Xn}n∈ℕ and the fact that φ_{αX}(t) = φX(αt) imply that
φ_{(1/n)Sn}(t) = (φ(t/n))ⁿ = (1 + zn/n)ⁿ, where zn = n(φ(t/n) − 1).
For t ≠ 0, the existence of φ′(0) gives zn = t (φ(t/n) − 1)/(t/n) → t φ′(0), and Lemma 8.2 implies that φ_{(1/n)Sn}(t) → e^{t φ′(0)}; the case t = 0 is trivial.
Remark 8.4
1. It can be shown that the converse of Theorem 8.3 is true in the following sense: if (1/n)Sn → c ∈ ℝ in probability, then φ′(0) exists and φ′(0) = ic. That is why we call the result of Theorem 8.3 definitive.
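The role of the condition on φ′(0) is well illustrated by the standard Cauchy distribution, whose characteristic function e^{−|t|} is not differentiable at 0 and whose sample means do not settle down. A small sketch (the sample sizes are arbitrary choices of ours):

```python
import math
import random

random.seed(4)

def running_mean(draw, n=100_000):
    # average of n independent draws
    s = 0.0
    for _ in range(n):
        s += draw()
    return s / n

# Exponential(1) has a mean, so the weak law applies: averages approach 1
exp_mean = running_mean(lambda: random.expovariate(1.0))

# standard Cauchy via inverse-cdf sampling; (1/n) S_n is again Cauchy,
# so the sample means stay spread out no matter how large n is
cauchy_means = [running_mean(lambda: math.tan(math.pi * (random.random() - 0.5)),
                             n=10_000) for _ in range(20)]

print(exp_mean)                            # ≈ 1
print(max(abs(m) for m in cauchy_means))   # typically not small
```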
8.2
We continue with a central limit theorem for iid sequences. Unlike in the case of the (weak) law of large numbers, the existence of the first moment will not be enough - we will need to assume that the second moment is finite, too. We will see how this assumption can be relaxed when we state and prove the Lindeberg-Feller theorem. We start with an estimate of the error term in the Taylor expansion of the exponential function of an imaginary argument:
Lemma 8.6 (A tight error estimate for the exponential) For ξ ∈ ℝ we have
|e^{iξ} − Σ_{k=0}^n (iξ)^k/k!| ≤ min(|ξ|^{n+1}/(n+1)!, 2|ξ|ⁿ/n!).
PROOF If we write the remainder in the Taylor formula in the integral form (derived easily using integration by parts), we get
Rn(ξ) := e^{iξ} − Σ_{k=0}^n (iξ)^k/k! = (i^{n+1}/n!) ∫₀^ξ e^{iu} (ξ − u)ⁿ du.
The first bound follows from
|Rn(ξ)| ≤ (1/n!) ∫₀^{|ξ|} (|ξ| − u)ⁿ du = |ξ|^{n+1}/(n+1)!.
For the second bound, we integrate by parts once more; using ∫₀^ξ (ξ − u)^{n−1} du = ξⁿ/n, we obtain
(i^{n+1}/n!) ∫₀^ξ e^{iu} (ξ − u)ⁿ du = (iⁿ/(n−1)!) ∫₀^ξ (e^{iu} − 1)(ξ − u)^{n−1} du,
so that, since |e^{iu} − 1| ≤ 2,
|Rn(ξ)| ≤ (1/(n−1)!) ∫₀^{|ξ|} (|ξ| − u)^{n−1} |e^{iu} − 1| du ≤ (2/n!) ∫₀^{|ξ|} n (|ξ| − u)^{n−1} du = 2|ξ|ⁿ/n!.
Corollary 8.7 For a random variable X ∈ L² and t ∈ ℝ, we have
(8.5) |φX(t) − (1 + it E[X] − ½ t² E[X²])| ≤ E[min(|tX|³/6, |tX|²)].
Theorem 8.8 (Central Limit Theorem - iid version) Let {Xn}n∈ℕ be an iid sequence of random variables with μ = E[X1] and 0 < σ² = Var[X1] < ∞. Then
(Σ_{k=1}^n (Xk − μ)) / √(σ²n) →D χ, where χ ∼ N(0, 1).
PROOF By considering the sequence {(Xn − μ)/√(σ²)}n∈ℕ instead of {Xn}n∈ℕ, we may assume that μ = 0 and σ = 1. Let φ be the characteristic function of the common distribution of {Xn}n∈ℕ and set Sn = Σ_{k=1}^n Xk, so that
φ_{(1/√n)Sn}(t) = (φ(t/√n))ⁿ.
By Theorem 7.32, the problem reduces to whether the following statement holds:
φ_{(1/√n)Sn}(t) → e^{−t²/2}, for all t ∈ ℝ.
Corollary 8.7 (with E[X1] = 0 and E[X1²] = 1) implies that
(8.6) |φ(t) − (1 − ½t²)| ≤ t² r(t), for all t ∈ ℝ,
where r is a bounded function with r(t) → 0 as t → 0. In particular,
|φ(t/√n) − (1 − t²/(2n))| ≤ (t²/n) r(t/√n).
Lemma 8.1 with u1 = ... = un = φ(t/√n) and w1 = ... = wn = 1 − t²/(2n) yields
|(φ(t/√n))ⁿ − (1 − t²/(2n))ⁿ| ≤ t² r(t/√n) → 0, for n ≥ t²/2 (so that |1 − t²/(2n)| ≤ 1).
Finally, Lemma 8.2 implies that (1 − t²/(2n))ⁿ → e^{−t²/2}, for all t.
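The statement of Theorem 8.8 is easy to see empirically: the cdf of the standardized sum is already very close to the standard normal cdf for moderate n. A sketch with uniform summands (all numerical choices are ours):

```python
import math
import random

random.seed(5)
n, trials = 400, 5_000

def standardized_sum():
    # X_k uniform on [0, 1]: mu = 1/2, sigma^2 = 1/12
    s = sum(random.random() for _ in range(n))
    return (s - n * 0.5) / math.sqrt(n / 12)

zs = [standardized_sum() for _ in range(trials)]

def Phi(x):
    # standard normal cdf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

errs = []
for x in (-1.0, 0.0, 1.0):
    empirical = sum(1 for z in zs if z <= x) / trials
    errs.append(abs(empirical - Phi(x)))
    print(x, empirical, Phi(x))  # empirical cdf ≈ normal cdf
```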
8.3
Unlike Theorem 8.8, the Lindeberg-Feller theorem does not require the summands to be equally distributed - it only asks that no single term dominate the sum. As usual, we start with a technical lemma:
Lemma 8.9 ((*) Convergence to the exponential for triangular arrays) Let (c_{n,m}), n ∈ ℕ, m = 1, ..., n, be a (triangular) array of real numbers such that
1. Σ_{m=1}^n c_{n,m} → c ∈ ℝ, and Σ_{m=1}^n |c_{n,m}| is a bounded sequence,
2. mn → 0, as n → ∞, where mn = max_{1≤m≤n} |c_{n,m}|.
Then
Π_{m=1}^n (1 + c_{n,m}) → e^c as n → ∞.
PROOF Without loss of generality we assume that mn < ½ for all n, and note that the statement is equivalent to Σ_{m=1}^n log(1 + c_{n,m}) → c, as n → ∞. Since Σ_{m=1}^n c_{n,m} → c, this is also equivalent to
(8.7) Σ_{m=1}^n |log(1 + c_{n,m}) − c_{n,m}| → 0.
Using the elementary inequality |log(1 + x) − x| ≤ x², valid for |x| ≤ ½, we obtain
Σ_{m=1}^n |log(1 + c_{n,m}) − c_{n,m}| ≤ Σ_{m=1}^n c_{n,m}² ≤ mn Σ_{m=1}^n |c_{n,m}| → 0,
because Σ_{m=1}^n |c_{n,m}| is bounded and mn → 0.
Theorem 8.10 (Lindeberg-Feller) Let X_{n,m}, n ∈ ℕ, m = 1, ..., n, be a (triangular) array of random variables such that
1. E[X_{n,m}] = 0, for all n ∈ ℕ, m = 1, ..., n,
2. X_{n,1}, ..., X_{n,n} are independent,
3. Σ_{m=1}^n E[X_{n,m}²] → σ² > 0, as n → ∞, and
4. for each ε > 0, sn(ε) → 0 as n → ∞, where sn(ε) = Σ_{m=1}^n E[X_{n,m}² 1_{|X_{n,m}|≥ε}].
Then
X_{n,1} + ... + X_{n,n} →D σχ, as n → ∞,
where χ ∼ N(0, 1).
PROOF (*) Set φ_{n,m} = φ_{X_{n,m}} and σ_{n,m}² = E[X_{n,m}²]. Just like in the proofs of Theorems 8.3 and 8.8, it will be enough to show that
Π_{m=1}^n φ_{n,m}(t) → e^{−σ²t²/2}, for all t ∈ ℝ.
We fix t ≠ 0 and use Lemma 8.1 with u_{n,m} = φ_{n,m}(t) and w_{n,m} = 1 − ½ σ_{n,m}² t² to conclude that
Dn(t) ≤ Mn^{n−1} Σ_{m=1}^n |φ_{n,m}(t) − (1 − ½ σ_{n,m}² t²)|,
where
Dn(t) = |Π_{m=1}^n φ_{n,m}(t) − Π_{m=1}^n (1 − ½ σ_{n,m}² t²)|
and Mn = max(1, max_{1≤m≤n} |1 − ½ t² σ_{n,m}²|). Assumption (4) in the statement implies that
σ_{n,m}² = E[X_{n,m}² 1_{|X_{n,m}|≥ε}] + E[X_{n,m}² 1_{|X_{n,m}|<ε}] ≤ sn(ε) + ε²,
and so sup_{1≤m≤n} σ_{n,m}² → 0, as n → ∞. Therefore, for n large enough, we have ½ t² σ_{n,m}² ≤ 2 and Mn = 1.
According to Corollary 8.7, we now have (for large-enough n)
Dn(t) ≤ t² Σ_{m=1}^n E[X_{n,m}² min(t|X_{n,m}|, 1)]
≤ t² Σ_{m=1}^n (E[X_{n,m}² 1_{|X_{n,m}|≥ε}] + E[t|X_{n,m}|³ 1_{|X_{n,m}|<ε}])
≤ t² sn(ε) + t³ ε Σ_{m=1}^n E[X_{n,m}² 1_{|X_{n,m}|<ε}] ≤ t² sn(ε) + 2t³εσ², for large-enough n.
Since ε > 0 was arbitrary, Dn(t) → 0. On the other hand, Lemma 8.9, applied to c_{n,m} = −½ σ_{n,m}² t², yields
Π_{m=1}^n (1 − ½ σ_{n,m}² t²) → e^{−σ²t²/2},
which completes the proof.
Problem 8.11 Show how the iid central limit theorem follows from the Lindeberg-Feller theorem.
Example 8.12 (Cycles in a random permutation) Let π : Ω → Sn be a random element taking values in the set Sn of all permutations of the set {1, ..., n}, i.e., the set of all bijections σ : {1, ..., n} → {1, ..., n}. One usually considers the probability measure on Ω such that π is uniformly distributed over Sn, i.e., P[π = σ] = 1/n!, for each σ ∈ Sn. A random element in Sn whose distribution is uniform over Sn is called a random permutation.
Remember that each permutation σ ∈ Sn can be decomposed into cycles; a cycle is a collection (i1 i2 ... ik) in {1, ..., n} such that σ(il) = i_{l+1} for l = 1, ..., k − 1 and σ(ik) = i1. For example, the permutation σ : {1, 2, 3, 4} → {1, 2, 3, 4}, given by σ(1) = 3, σ(2) = 1, σ(3) = 2, σ(4) = 4, has two cycles: (132) and (4). More precisely, start from i1 = 1 and follow the sequence i_{k+1} = σ(ik), until the first time you return to 1. Write these numbers in order (i1 i2 ... ik) and pick j1 ∈ {1, 2, ..., n} \ {i1, ..., ik}. If no such j1 exists, σ consists of a single cycle. If it does, we repeat the same procedure starting from j1 to obtain another cycle (j1 j2 ... jl), etc. In the end, we arrive at the decomposition
σ = (i1 i2 ... ik)(j1 j2 ... jl) ...
of σ into cycles.
Let us first answer the following warm-up question: what is the probability p(n, m) that 1 is a member of a cycle of length m? Equivalently, we can ask for the number c(n, m) of permutations in which 1 is a member of a cycle of length m. The easiest way to count these is to note that each such permutation corresponds to a choice of m − 1 distinct numbers from {2, 3, ..., n} - these will serve as the remaining elements of the cycle containing 1. This can be done in (n−1 choose m−1) ways. Furthermore, the m − 1 elements to be in the same cycle with 1 can be ordered in (m − 1)! ways. Also, the remaining n − m elements give rise to (n − m)! distinct permutations. Therefore,
c(n, m) = (n−1 choose m−1) (m − 1)! (n − m)! = (n − 1)!, and so p(n, m) = 1/n.
This is a remarkable result - all cycle lengths are equally likely. Note, also, that 1 is not special in any way.
Our next goal is to say something about the number of cycles - a more difficult task. We start by describing a procedure for producing a random permutation by building it from cycles. The reader will easily convince him- or herself that the outcome is uniformly distributed over all permutations. We start with n − 1 independent random variables ξ2, ..., ξn such that ξi is uniformly distributed over the set {0, 1, 2, ..., n − i + 1}. Let the first cycle start from X1 = 1. If ξ2 = 0, then we declare (1) to be a full cycle and start building the next cycle from 2. If ξ2 ≠ 0, we pick the ξ2-th smallest element - let us call it X2 - from the set of remaining n − 1 numbers to be the second element in the first cycle. After that, we close the cycle if ξ3 = 0, or append the ξ3-th smallest element - let us call it X3 - of {1, 2, ..., n} \ {X1, X2} to the cycle. Once the cycle (X1 X2 ... Xk) is closed, we pick the smallest element of {1, 2, ..., n} \ {X1, X2, ..., Xk} - let us call it X_{k+1} - and repeat the procedure starting from X_{k+1} and using ξ_{k+1}, ..., ξn as sources of randomness.
E[Cn] = Σ_{k=1}^n P[Y_{n,k} = 1] = Σ_{k=1}^n 1/(n − k + 1) = 1 + 1/2 + ... + 1/n = log(n) + γ + o(1),
where γ ≈ 0.58 is the Euler-Mascheroni constant, and an = bn + o(1) means that |bn − an| → 0, as n → ∞.
If we want to know more about the variability of Cn, we can also compute its variance:
Var[Cn] = Σ_{k=1}^n Var[Y_{n,k}] = Σ_{k=1}^n (1/(n − k + 1) − 1/(n − k + 1)²) = log(n) + γ − π²/6 + o(1).
The Lindeberg-Feller theorem will give us the precise asymptotic behavior of Cn. For m = 1, ..., n, we define
X_{n,m} = (Y_{n,m} − E[Y_{n,m}]) / √(log(n)).
Then
lim_n Σ_{m=1}^n E[X_{n,m}²] = lim_n (log(n) + γ − π²/6 + o(1)) / log(n) = 1.
Finally, for ε > 0 and log(n) > 2/ε, we have P[|X_{n,m}| > ε] = 0, so
Σ_{m=1}^n E[X_{n,m}² 1_{|X_{n,m}|≥ε}] = 0.
Having checked that all the assumptions of the Lindeberg-Feller theorem are satisfied, we conclude that
(Cn − log(n)) / √(log(n)) →D χ, where χ ∼ N(0, 1).
It follows that (if we believe that the approximation is good) the number of cycles in a random permutation with n = 8100 is at most 18 with probability 99%.
How about variability? [Figure: histogram of the number of cycles from 1000 simulations for n = 8100, together with the appropriately-scaled density of the normal distribution with mean log(8100) and standard deviation √(log(8100)).] The quality of the approximation leaves something to be desired.
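The asymptotics above are easy to reproduce: sample random permutations, count cycles, and compare the average with log(n) + γ. A sketch (the number of trials is an arbitrary choice of ours):

```python
import math
import random

random.seed(6)

def cycle_count(n):
    # number of cycles of a uniformly random permutation of {0, ..., n-1}
    perm = list(range(n))
    random.shuffle(perm)
    seen, cycles = [False] * n, 0
    for i in range(n):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return cycles

n, trials = 8_100, 300
counts = [cycle_count(n) for _ in range(trials)]
mean = sum(counts) / trials

print(mean, math.log(n) + 0.5772)  # the two should be close
```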
8.4
Additional Problems
Problem 8.13 (Lyapunov's theorem) Let {Xn}n∈ℕ be an independent sequence, let Sn = Σ_{m=1}^n Xm, and let αn = √(Var[Sn]). Suppose that αn > 0 for all n ∈ ℕ and that there exists a constant δ > 0 such that
lim_n αn^{−(2+δ)} Σ_{m=1}^n E[|Xm − E[Xm]|^{2+δ}] = 0.
Show that
(Sn − E[Sn]) / αn →D χ, where χ ∼ N(0, 1).
Problem 8.14 (Self-normalized sums) Let {Xn}n∈ℕ be iid random variables with E[X1] = 0, σ = √(E[X1²]) > 0 and P[X1 = 0] = 0. Show that the sequence {Yn}n∈ℕ given by
Yn = (Σ_{k=1}^n Xk) / √(Σ_{k=1}^n Xk²)
converges in distribution to a standard normal random variable.
Chapter 9
Conditional Expectation
9.1
P[A|B] = P[A ∩ B] / P[B].
In our specific case, if we know that Y = 2, then ω = a or ω = b, and the expected value of X, given that Y = 2, is ½X(a) + ½X(b) = 2. Similarly, this average equals 4 for Y = 1, and 6 for Y = 7. Let us show that the random variable ξ defined by this average, i.e.,
ξ(a) = ξ(b) = 2, ξ(c) = ξ(d) = 4, ξ(e) = ξ(f) = 6,
satisfies the definition of E[X|σ(Y)], as given above. The integrability is not an issue (we are on a finite probability space), and it is clear that ξ is measurable with respect to σ(Y). Indeed, the atoms of σ(Y) are {a, b}, {c, d} and {e, f}, and ξ is constant over each one of those. Finally, we need to check that
E[ξ 1A] = E[X 1A], for all A ∈ σ(Y),
which for an atom A translates into
ξ(ω) = (1/P[A]) E[X 1A], for ω ∈ A.
The moral of the story is that when A is an atom, part (3) of Definition 9.1 translates into a requirement that ξ be constant on A with value equal to the expectation of X over A with respect to the conditional probability P[·|A]. In the general case, when there are no atoms, (3) still makes sense and conveys the same message.
By the way, since the atoms of σ(Z) are {a, b, c, d} and {e, f}, it is clear that
E[X|σ(Z)](ω) = 3 for ω ∈ {a, b, c, d}, and 6 for ω ∈ {e, f}.
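On a finite probability space, the conditional expectation is literally an averaging over atoms, which is easy to code. The values of X below are hypothetical (chosen so that the averages come out to 2, 4 and 6, as in the example, since the original table of values is not reproduced here); the values of Y follow the text:

```python
from itertools import groupby

# uniform probability on Omega = {a, ..., f}; X is a hypothetical choice
X = {"a": 1, "b": 3, "c": 3, "d": 5, "e": 5, "f": 7}
# Y = 2 on {a, b}, Y = 1 on {c, d}, Y = 7 on {e, f}, as in the text
Y = {"a": 2, "b": 2, "c": 1, "d": 1, "e": 7, "f": 7}

def cond_exp(X, Y):
    # on each atom {Y = y} of sigma(Y), E[X | sigma(Y)] is the average of X
    omegas = sorted(X, key=lambda w: Y[w])
    out = {}
    for _, grp in groupby(omegas, key=lambda w: Y[w]):
        atom = list(grp)
        avg = sum(X[w] for w in atom) / len(atom)  # uniform P on the atom
        for w in atom:
            out[w] = avg
    return out

xi = cond_exp(X, Y)
print(xi)  # xi = 2.0 on {a, b}, 4.0 on {c, d}, 6.0 on {e, f}
```

One can then check the defining property E[ξ 1_A] = E[X 1_A] directly for each atom A of σ(Y).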
Proposition 9.3 (Conditional expectation - existence and a.s.-uniqueness) Let G be a sub-σ-algebra of F. Then
1. there exists a conditional expectation E[X|G] for any X ∈ L¹, and
2. any two conditional expectations of X ∈ L¹ are equal P-a.s.
PROOF (Uniqueness): Suppose that ξ and ξ′ both satisfy (1), (2) and (3) of Definition 9.1. Then
E[ξ 1A] = E[ξ′ 1A], for all A ∈ G.
For An = {ξ − ξ′ ≥ 1/n}, we have An ∈ G and so
E[ξ′ 1_{An}] = E[ξ 1_{An}] ≥ E[(ξ′ + 1/n) 1_{An}] = E[ξ′ 1_{An}] + (1/n) P[An].
Consequently, P[An] = 0, for all n ∈ ℕ, so that P[ξ > ξ′] = 0. By a symmetric argument, we also have P[ξ < ξ′] = 0.
(Existence): By linearity, it will be enough to prove that the conditional expectation exists for X ∈ L¹₊.
1. A Radon-Nikodym argument. Suppose, first, that X ≥ 0 and E[X] = 1, as the general case follows by additivity and scaling. Then the prescription
Q[A] = E[X 1A]
defines a probability measure Q on (Ω, F), which is absolutely continuous with respect to P. Let Q_G be the restriction of Q to G; it is trivially absolutely continuous with respect to the restriction P_G of P to G. The Radon-Nikodym theorem - applied to the measure space (Ω, G, P_G) and the measure Q_G ≪ P_G - guarantees the existence of the Radon-Nikodym derivative
ξ = dQ_G/dP_G ∈ L¹₊(Ω, G, P_G).
for some subsequence {nk }k∈N of {n}n∈N . Set ξ' = lim inf k ξnk ∈ L0 ([−∞, ∞], G) and ξ =
ξ'1{|ξ'|<∞} , so that ξ = ξ', a.s., and ξ is G-measurable.
We still need to remove the restriction X ∈ L2+ . We start with a general X ∈ L1+ and define
Xn = min(X, n) ∈ L2+ ⊆ L1+ . Let ξn = E[Xn |G], and note that E[ξn+1 1A ] = E[Xn+1 1A ] ≥
E[Xn 1A ] = E[ξn 1A ]. It follows (just like in the proof of uniqueness above) that ξn ≤ ξn+1 , a.s.
We define ξ = supn ξn , so that ξn ↗ ξ, a.s. Then, for A ∈ G, the monotone-convergence theorem
implies that
E[X1A ] = limn E[Xn 1A ] = limn E[ξn 1A ] = E[ξ1A ],
and so ξ ∈ L1 (G)
is a version of E[X|G].
Remark 9.4 There is no canonical way to choose "the" version of the conditional expectation. We
follow the convention started with Radon-Nikodym derivatives, and interpret a statement such
as ξ ≤ E[X|G], a.s., to mean that ξ ≤ η, a.s., for any version η of the conditional expectation of X
with respect to G.
If we use the symbol 𝕃1 to denote the set of all a.s.-equivalence classes of random variables in
L1 , we can write:
E[·|G] : L1 (F) → 𝕃1 (G),
but 𝕃1 (G) cannot be replaced by L1 (G) in a natural way. Since X = X', a.s., implies that E[X|G] =
E[X'|G], a.s. (why?), we can also consider conditional expectation as a map between equivalence classes,
E[·|G] : 𝕃1 (F) → 𝕃1 (G).
9.2
Properties
Conditional expectation inherits many of the properties of the ordinary expectation. Here
are some familiar ones, and some new ones:
Proposition 9.5 (Properties of the conditional expectation) Let X, Y, {Xn }n∈N be random variables in L1 , and let G and H be sub-σ-algebras of F. Then
The statement (9.1) will follow from it by taking Z = Y 1A . For Z = Σ_{k=1}^n αk 1Ak , (9.2) is
a consequence of the definition of conditional expectation and linearity. Let us assume that
both Z and X are nonnegative and ZX ∈ L1 . In that case we can find a non-decreasing
sequence {Zn }n∈N of non-negative simple random variables with Zn ↗ Z. Then Zn X ∈ L1
for all n ∈ N and the monotone convergence theorem implies that
E[ZX] = limn E[Zn X] = limn E[Zn E[X|G]] = E[Z E[X|G]].
Our next task is to relax the assumption X ∈ L1+ to the original one X ∈ L1 . In that case, the
Lp -nonexpansivity for p = 1 implies that
|E[X|G]| ≤ E[|X| |G], a.s., and so |Zn E[X|G]| ≤ Zn E[|X| |G] ≤ Z E[|X| |G].
We know from the previous case that
E[Z E[|X| |G]] = E[Z |X|], so that Z E[|X| |G] ∈ L1 .
We can, therefore, use the dominated convergence theorem to conclude that
E[Z E[X|G]] = limn E[Zn E[X|G]] = limn E[Zn X] = E[ZX].
It suffices to show that
E[E[X|G]1A ] = E[X1A ], for all A ∈ σ(G, H). (9.3)
Let L be the collection of all A ∈ σ(G, H) such that (9.3) holds. It is straightforward that L is
a λ-system, so it will be enough to establish (9.3) for some π-system that generates σ(G, H).
One possibility is P = {G ∩ H : G ∈ G, H ∈ H}, and for G ∩ H ∈ P we use independence
of 1H and E[X|G]1G , as well as the independence of 1H and X1G , to get
E[E[X|G]1G∩H ] = E[E[X|G]1G 1H ] = E[E[X|G]1G ]E[1H ] = E[X1G ]E[1H ]
= E[X1G∩H ].
10. (conditional monotone-convergence theorem) By monotonicity, E[Xn |G] ↗ ξ, a.s., for some
ξ ∈ L0+ (G). The monotone convergence theorem implies that
E[ξ1A ] = limn E[1A E[Xn |G]] = limn E[1A Xn ] = E[1A X], for all A ∈ G.
11. (conditional Fatou's lemma) Set Yn = inf_{k≥n} Xk , so that Yn ↗ Y = lim inf k Xk . By monotonicity,
E[Yn |G] ≤ inf_{k≥n} E[Xk |G], a.s.
9.3
Once we have the notion of conditional expectation defined and analyzed, we can use it to define
other, related, conditional quantities. The most important of those is the conditional probability:
Definition 9.7 (Conditional probability) Let G be a sub-σ-algebra of F. The conditional probability of A ∈ F, given G - denoted by P[A|G] - is defined by
P[A|G] = E[1A |G].
It is clear (from the conditional version of the monotone-convergence theorem) that
P[∪n∈N An |G] = Σ_{n∈N} P[An |G], a.s., (9.4)
for each pairwise-disjoint sequence {An }n∈N in F.
We can, therefore, think of the conditional probability as a countably-additive map from events
to (equivalence classes of) random variables A ↦ P[A|G]. In fact, this map has the structure of a
vector measure:
Definition 9.8 (Vector Measures) Let (B, ||·||) be a Banach space, and let (S, S) be a measurable
space. A map μ : S → B is called a vector measure if
1. μ(∅) = 0, and
2. for each pairwise disjoint sequence {An }n∈N in S, μ(∪n An ) = Σ_{n∈N} μ(An ) (where the series
in B converges absolutely).
P ROOF Clearly P[∅|G] = 0, a.s. Let {An }n∈N be a pairwise-disjoint sequence in F. Then
||P[An |G]||L1 = E[|E[1An |G]|] = E[1An ] = P[An ],
and so
Σ_{n∈N} ||P[An |G]||L1 = Σ_{n∈N} P[An ] ≤ 1 < ∞,
i.e., the series Σ_{n∈N} P[An |G] converges absolutely in L1 . Moreover,
|| Σ_{n=1}^N P[An |G] − P[∪n∈N An |G] ||L1 = || E[ Σ_{n=N+1}^∞ 1An |G] ||L1 = P[∪_{n=N+1}^∞ An ] → 0, as N → ∞.
It is tempting to try to interpret the map A ↦ P[A|G](ω) as a probability measure for a fixed ω.
It will not work in general; the reason is that P[A|G] is defined only a.s., and the exceptional
sets pile up when uncountable families of events A are considered. Even if we fixed versions
Definition 9.10 (Measurable kernels) Let (R, R) and (S, S) be measurable spaces. A map ν :
R × S → [0, ∞] is called a (measurable) kernel if
1. x ↦ ν(x, B) is R-measurable for each B ∈ S, and
2. B ↦ ν(x, B) is a measure on S for each x ∈ R.
Remark 9.12
1. When (S, S) = (Ω, F), and e(ω) = ω, the regular conditional distribution of e (if it exists) is
called the regular conditional probability. Indeed, in this case, μe|G (ω, B) = P[e ∈ B|G] =
P[B|G], a.s.
2. It can be shown that regular conditional distributions need not exist in general if S is too
large.
When (S, S) is small enough, however, regular conditional distributions can be constructed.
Here is what we mean by small enough:
Definition 9.13 (Borel spaces) A measurable space (S, S) is said to be a Borel space (or a nice
space) if it is isomorphic to a Borel subset of R, i.e., if there exists a one-to-one map f : S → R such that both
f and f⁻¹ are measurable.
Problem 9.14 Show that Rn , n ∈ N (together with their Borel σ-algebras) are Borel spaces. (Hint:
Show, first, that there is a measurable bijection f : [0, 1] → [0, 1] × [0, 1] such that f⁻¹ is also
measurable. Use binary (or decimal, or . . . ) expansions.)
Remark 9.15 It can be shown that any Borel subset of any complete and separable metric space is
a Borel space. In particular, the coin-toss space is a Borel space.
P ROOF (*) Let us, first, deal with the case S = R, so that e = X is a random variable. Let Q be
a countable dense set in R. For q ∈ Q, consider the random variable P q , defined as an arbitrary
version of
P q = P[X ≤ q|G].
By redefining each P q on a null set (and aggregating the countably many null sets - one for
each q ∈ Q), we may suppose that P q (ω) ≤ P r (ω), for q ≤ r, q, r ∈ Q, for all ω ∈ Ω, and that
limq→∞ P q (ω) = 1 and limq→−∞ P q (ω) = 0, for all ω ∈ Ω. For x ∈ R, we set
F (ω, x) = inf_{q∈Q, q>x} P q (ω).
P ROOF When g = 1B , for B ∈ B(Rn ), the statement follows by the very definition of the regular
conditional distribution. For the general case, we simply use the standard machine.
Just like we sometimes express the distribution of a random variable or a vector in terms of its
density, cdf or characteristic function, we can talk about the conditional density, conditional cdf or
the conditional characteristic function. All of those will correspond to the case covered in Proposition 9.16 and all conditional distributions will be assumed to be regular. For x = (x1 , . . . , xn )
and y = (y1 , . . . , yn ), y ≤n x means y1 ≤ x1 , . . . , yn ≤ xn .
Definition 9.19 (Other regular conditional quantities) Let X : Ω → Rn be a random vector, let
G be a sub-σ-algebra of F, and let μX|G : Ω × B(Rn ) → [0, 1] be the regular conditional distribution
of X given G.
1. The (regular) conditional cdf of X, given G, is the map F : Ω × Rn → [0, 1], given by
F (ω, x) = μX|G (ω, {y ∈ Rn : y ≤n x}), for x ∈ Rn .
2. A map fX|G : Ω × Rn → [0, ∞) is called the conditional density of X with respect to G if
a) fX|G (ω, ·) is Borel measurable for all ω ∈ Ω,
To illustrate the utility of the above concepts, here is a versatile result (see Example 9.23 below):
P ROOF (1) ⇒ (2). By Proposition 9.18, we have φX|G (ω, t) = E[e^{it·X} |G](ω), a.s. If we replace φX|G by
φ, multiply both sides by a bounded G-measurable random variable Y and take expectations,
we get
φ(t)E[Y ] = E[Y e^{it·X} ].
In particular, for Y = 1 we get φ(t) = E[e^{it·X} ], so that
E[Y ]E[e^{it·X} ] = E[Y e^{it·X} ], (9.5)
for all G-measurable and bounded Y , and all t ∈ Rn . For Y of the form Y = e^{is·Z} , where Z is
a G-measurable random variable, relation (9.5) and (a minimal extension of) part (1) of Problem
7.41 imply that X and Z are independent. Since Z is arbitrary and G-measurable, X and
G are independent.
(2) ⇒ (1). If σ(X) is independent of G, so is e^{it·X} , and so, the irrelevance-of-independent-information property of conditional expectation implies that
φ(t) = E[e^{it·X} ] = E[e^{it·X} |G] = φX|G (ω, t), a.s.
One of the most important cases used in practice is when a random vector (X1 , . . . , Xn ) admits a
density and we condition on the σ-algebra generated by several of its components. To make the
notation more intuitive, we denote the first d components (X1 , . . . , Xd ) by X o (for "observed") and
the remaining n − d components (Xd+1 , . . . , Xn ) by X u (for "unobserved").
fX u |G (ω, x u ) =
fX (X o (ω), x u ) / ∫_{Rn−d} fX (X o (ω), y) dy, if ∫_{Rn−d} fX (X o (ω), y) dy > 0,
f0 (x u ), otherwise,
where f0 is an arbitrary density on Rn−d . The defining property is then checked on product sets: for
Ao ∈ B(Rd ) and Au ∈ B(Rn−d ), the candidate, integrated over Au and against 1{X o ∈Ao } , produces
P[X o ∈ Ao , X u ∈ Au ].
Σuo = E[X̃ u (X̃ o )T ] ∈ R(n−d)×d , Σuu = E[X̃ u (X̃ u )T ] ∈ R(n−d)×(n−d) ,
where X̃ = X − E[X] denotes the centered vector, μo = E[X o ] and μu = E[X u ].
We assume that Σoo is invertible. Otherwise, we can find a subset of components of X o whose
variance-covariance matrix is invertible and which generate the same σ-algebra (why?). The matrix A = Σuo Σoo⁻¹ has the property that E[(X̃ u − AX̃ o )(X̃ o )T ] = 0, i.e., that the random vectors
X̃ u − AX̃ o and X̃ o are uncorrelated. We know, however, that X̃ = (X̃ o , X̃ u ) is a Gaussian random
vector, so, by Problem 7.41, part (3), X̃ u − AX̃ o is independent of X̃ o . It follows from Proposition
9.20 that the conditional characteristic function of X̃ u − AX̃ o , given G = σ(X o ), is deterministic
and given by
E[e^{it·(X̃ u − AX̃ o )} |G] = φ_{X̃ u −AX̃ o} (t), for t ∈ Rn−d .
Since AX̃ o is G-measurable, we have
φX u |G (ω, t) = e^{it·(μu + A(X o (ω) − μo ))} e^{−(1/2) tT Σ t} , for t ∈ Rn−d ,
where Σ = E[(X̃ u − AX̃ o )(X̃ u − AX̃ o )T ]. A simple calculation yields that, conditionally on G,
X u is multivariate normal with mean μX u |G and variance-covariance matrix ΣX u |G given by
μX u |G = μu + A(X o − μo ), ΣX u |G = Σuu − Σuo Σoo⁻¹ Σou .
Note how the mean gets corrected by a multiple of the difference between the observed value
X o and its (unconditional) expected value. Similarly, the variance-covariance matrix also gets
corrected, by −Σuo Σoo⁻¹ Σou , but this quantity does not depend on the observation X o .
Problem 9.24 Let (X1 , X2 ) be a bivariate normal vector with Var[X1 ] > 0. Work out the exact
form of the conditional distribution of X2 , given X1 , in terms of μi = E[Xi ], σi2 = Var[Xi ], i = 1, 2,
and the correlation coefficient ρ = corr(X1 , X2 ).
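The conditioning formulas can be checked by simulation. The sketch below (all parameters are our own choices) conditions a large bivariate normal sample on X1 lying in a thin slab around a point x, and compares the empirical mean and variance of X2 with the d = 1 case of the formulas above: mean μ2 + ρ(σ2/σ1)(x − μ1) and variance σ2²(1 − ρ²).

```python
import numpy as np

# Monte-Carlo check of Gaussian conditioning (parameters ours):
# X2 | X1 = x  ~  N( mu2 + rho*(s2/s1)*(x - mu1),  s2^2*(1 - rho^2) ).
rng = np.random.default_rng(0)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 3.0, 0.6

cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
X = rng.multivariate_normal([mu1, mu2], cov, size=1_000_000)

# Condition on X1 falling in a thin slab around x = 2.
x = 2.0
slab = X[np.abs(X[:, 0] - x) < 0.05, 1]

cond_mean = mu2 + rho * (s2 / s1) * (x - mu1)   # theoretical conditional mean
cond_var = s2**2 * (1 - rho**2)                 # theoretical conditional variance
print(slab.mean(), cond_mean)                   # the two means are close
print(slab.var(), cond_var)                     # and so are the variances
```

Note how the empirical variance does not depend on the chosen slab location x, just as the formula predicts.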
9.4
Additional Problems
Problem 9.25 (Conditional expectation for non-negative random variables) A parallel definition
of conditional expectation can be given for random variables in L0+ . For X ∈ L0+ , we say that Y is
a conditional expectation of X with respect to G - and denote it by E[X|G] - if
(a) Y is G-measurable and [0, ∞]-valued, and
(b) E[Y 1A ] = E[X1A ] ∈ [0, ∞], for all A ∈ G.
be sub-σ-algebras of F. Show that Xn → 0 in probability if E[Xn |Fn ] → 0 in probability. Does the converse hold? (Hint:
Prove that for Xn ∈ L0+ , we have Xn → 0 in probability if and only if E[min(Xn , 1)] → 0.)
Chapter
10
Discrete Martingales
10.1
One of the uses of σ-algebras is to single out the subsets of Ω to which probability can be assigned.
This is the role of F. Another use, as we have seen when discussing conditional expectation, is
to encode information. The arrow of time, as we perceive it, points from less information to more
information. A useful mathematical formalism is the one of a filtration.
Definition 10.1 (Filtered probability spaces) A filtration is a sequence {Fn }n∈N0 , where N0 =
N ∪ {0}, of sub-σ-algebras of F such that Fn ⊆ Fn+1 , for all n ∈ N0 . A probability space with a
filtration - (Ω, F, {Fn }n∈N0 , P) - is called a filtered probability space.
We think of n N0 as the time-index and of Fn as the information available at time n.
Definition 10.2 (Discrete-time stochastic process) A (discrete-time) stochastic process is a
sequence {Xn }nN0 of random variables.
A stochastic process is a generalization of a random vector; in fact, we can think of a stochastic process as an infinite-dimensional random vector. More precisely, a stochastic process is a
random element in the space RN0 of real sequences. In the context of stochastic processes, the
sequence (X0 (ω), X1 (ω), . . . ) is called a trajectory of the stochastic process {Xn }n∈N0 . This dual
view of stochastic processes - as random trajectories (sequences) or as sequences of random variables - can be supplemented by another interpretation: a stochastic process is also a map from the
product space Ω × N0 into R.
Definition 10.3 (Adapted processes) A stochastic process {Xn }nN0 is said to be adapted with
respect to the filtration {Fn }nN0 if Xn is Fn -measurable for each n N0 .
is called the filtration generated by {Xn }nN0 . Clearly, X is always adapted to the filtration generated by X.
10.2
Martingales
Definition 10.4 ((Sub-, super-) martingales) Let {Fn }nN0 be a filtration. A stochastic process
{Xn }nN0 is called an {Fn }nN0 -supermartingale if
1. {Xn }nN0 is {Fn }nN0 -adapted,
2. Xn L1 , for all n N0 , and
3. E[Xn+1 |Fn ] Xn , a.s., for all n N0 .
A process {Xn }n∈N0 is called a submartingale if {−Xn }n∈N0 is a supermartingale. A martingale
is a process which is both a supermartingale and a submartingale at the same time, i.e., for which
equality holds in (3).
Remark 10.5 Very often, the filtration {Fn }n∈N0 is not explicitly mentioned. Either it is clear
from the context, i.e., the existence of an underlying filtration {Fn }n∈N0 is assumed throughout,
or, if no filtration is pre-specified, the filtration {FnX }n∈N0 generated by {Xn }n∈N0 is
used. It is important to remember, however, that the notion of a (super-, sub-) martingale only
makes sense in relation to a filtration.
The fundamental examples of martingales are (additive or multiplicative) random walks:
Example 10.6
1. An additive random walk. Let {ξn }n∈N be a sequence of iid random variables with ξn ∈ L1
and E[ξn ] = 0, for all n ∈ N. We define
X0 = 0, Xn = Σ_{k=1}^n ξk , for n ∈ N.
The process {Xn }n∈N0 is a martingale with respect to the filtration {FnX }n∈N0 generated by
it (for which FnX = σ(ξ1 , . . . , ξn )). Indeed, Xn ∈ L1 (FnX ) for all n ∈ N and
E[Xn+1 |Fn ] = E[ξn+1 + Xn |Fn ] = Xn + E[ξn+1 |Fn ] = Xn + E[ξn+1 ] = Xn , a.s.,
where we used the irrelevance-of-independent-information property of conditional expectation (in this case ξn+1 is independent of FnX = σ(X0 , . . . , Xn ) = σ(ξ1 , . . . , ξn )).
It is easy to see that if {ξn }n∈N are still iid, but E[ξn ] > 0, then {Xn }n∈N0 is a submartingale.
When E[ξn ] < 0, we get a supermartingale.
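The martingale property of the additive random walk can be seen numerically: averaging Xn+1 over all simulated paths that share a fixed value of Xn should reproduce that value. A short Python sketch (path counts and step distribution are our own choices):

```python
import numpy as np

# Empirical check (ours): for a fair random walk, E[X_6 | X_5 = x] = x.
# We estimate the conditional expectation by averaging over many paths.
rng = np.random.default_rng(1)
n_paths, n_steps = 200_000, 10
xi = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))   # fair coin increments
X = np.cumsum(xi, axis=1)                               # X_1, ..., X_10

# Group paths by the value of X_5 and average X_6 within each group.
for x in [-3, -1, 1, 3]:
    group = X[X[:, 4] == x, 5]
    print(x, group.mean())   # each average is close to x
```

The same experiment with a biased coin (E[ξ] > 0) would show each group average sitting strictly above x, the submartingale case.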
The process {Xn }n∈N0 is a martingale with respect to the filtration {FnX }n∈N0 generated by it.
Indeed, Xn ∈ L1 (FnX ) for all n ∈ N and
E[Xn+1 |Fn ] = E[ξn+1 Xn |Fn ] = Xn E[ξn+1 |Fn ] = Xn E[ξn+1 ] = Xn , a.s.
3. If {ξn }n∈N are iid with characteristic function φ(t) = E[e^{itξ1} ], and t ∈ R is such that
φ(t) ≠ 0, then the process
Xn = Π_{k=1}^n ( e^{itξk} / φ(t) ), for n ∈ N,
is a martingale. Actually, it is complex-valued, so it would be better to say that its real and
imaginary parts are both martingales. This martingale will be important in the study of
hitting times of random walks.
4. Lévy martingales. For X ∈ L1 , we define
Xn = E[X|Fn ], for n ∈ N0 .
The tower property of conditional expectation implies that {Xn }n∈N0 is a martingale.
5. An urn scheme. An urn contains b black and w white balls on day n = 0. On each subsequent
day, a ball is chosen (each ball in the urn has the same probability of being picked) and then
put back together with another ball of the same color. Therefore, at the end of day n, there
are n + b + w balls in the urn. Let Bn denote the number of black balls in the urn at day n,
and let us define the process {Xn }n∈N0 by
Xn = Bn / (b + w + n), n ∈ N0 ,
to be the proportion of black balls in the urn at time n. Let {Fn }n∈N0 denote the filtration
generated by {Xn }n∈N0 . The conditional probability - given Fn - of picking a black ball at
time n is Xn , i.e.,
P[Bn+1 = Bn + 1|Fn ] = Xn and P[Bn+1 = Bn |Fn ] = 1 − Xn .
Therefore,
E[Xn+1 |Fn ] = E[Xn+1 1{Bn+1 =Bn } |Fn ] + E[Xn+1 1{Bn+1 =Bn +1} |Fn ]
= E[ (Bn /(b + w + n + 1)) 1{Bn+1 =Bn } |Fn ] + E[ ((Bn + 1)/(b + w + n + 1)) 1{Bn+1 =Bn +1} |Fn ]
= (Bn /(b + w + n + 1)) (1 − Xn ) + ((Bn + 1)/(b + w + n + 1)) Xn
= Xn .
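Since {Xn }n∈N0 is a martingale, E[Xn ] = X0 = b/(b + w) for every n, even though individual runs of the urn drift toward very different limiting proportions. A quick simulation (the initial counts and run lengths below are our own choices):

```python
import numpy as np

# Polya-type urn simulation (parameters ours): the proportion of black balls
# is a martingale, so its mean stays at b/(b+w) across all days.
rng = np.random.default_rng(2)
b, w, days, n_runs = 2, 3, 200, 100_000

B = np.full(n_runs, float(b))                 # black balls in each parallel run
for n in range(days):
    X = B / (b + w + n)                       # current proportion of black
    B += (rng.random(n_runs) < X)             # draw a ball, add one of its color
X_final = B / (b + w + days)

print(X_final.mean())   # close to b/(b+w) = 0.4, though individual runs spread out
print(X_final.std())    # large spread: single runs do not concentrate at 0.4
```

The large standard deviation across runs illustrates that the martingale property constrains only the mean, not the trajectories.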
10.3
Definition 10.8 (Predictable processes) A process {Hn }n∈N is said to be predictable with respect
to the filtration {Fn }n∈N0 if Hn is Fn−1 -measurable for each n ∈ N.
A process is predictable if you can predict its tomorrow's value today. We often think of predictable processes as strategies: let {ξn }n∈N be a sequence of random variables which we interpret
as gambles. At time n we can place a bet of Hn dollars, thus realizing a gain/loss of Hn ξn . Note
that a negative Hn is allowed - in that case the player wins money if ξn < 0 and loses if ξn > 0.
If {Fn }n∈N0 is the filtration generated by the gambles, i.e., F0 = {∅, Ω} and Fn = σ(ξ1 , . . . , ξn ), for
n ∈ N, then Hn is Fn−1 -measurable, so that it does not use any information about ξn : we are allowed to adjust
our bet according to the outcomes of previous gambles, but we don't know the outcome of ξn until
after the bet is placed. Therefore, the sequence {Hn }n∈N is a predictable sequence with respect to
{Fn }n∈N0 .
Problem 10.9 Characterize predictable submartingales and predictable martingales. (Note: To
comply with the setting in which the definition of predictability is given (processes defined on N
and not on N0 ), simply discard the value at 0.)
Definition 10.10 (Martingale transforms) Let {Fn }n∈N0 be a filtration, let {Xn }n∈N0 be a process adapted to {Fn }n∈N0 , and let {Hn }n∈N be predictable. The stochastic process {(H · X)n }n∈N0 , defined by
(H · X)0 = 0, (H · X)n = Σ_{k=1}^n Hk (Xk − Xk−1 ),
is called the martingale transform of X by H.
Remark 10.11
1. The process {(H X)n }nN0 is called the martingale transform of X, even if neither H nor X
is a martingale. It is most often applied to a martingale X, though - hence the name.
2. In terms of the gambling interpretation given above, X plays the role of the cumulative gain
(loss) when a $1-bet is placed each time:
X0 = 0, Xn = Σ_{k=1}^n ξk , for n ∈ N.
If we insist that {ξn }n∈N is a sequence of fair bets, i.e., that there are no expected gains/losses
in the n-th bet, even after we had the opportunity to learn from the previous n − 1 bets, we
arrive at the condition
E[ξn |Fn−1 ] = 0, i.e., that {Xn }n∈N0 is a martingale.
The following proposition states that no matter how well you choose your bets, you cannot make
(or lose) money by betting on a sequence of fair games. A part of the result is stated for submartingales; this is for convenience only. The reader should observe that almost any statement about
submartingales can be turned into a statement about supermartingales by a simple change of sign.
Proposition 10.12 (Stability of martingales under martingale transforms) Let {Xn }n∈N0 be
adapted, and let {Hn }n∈N be predictable. Then, the martingale transform H · X of X by H is
1. a martingale, provided that {Xn }n∈N0 is a martingale and Hn (Xn − Xn−1 ) ∈ L1 , for all n ∈ N,
2. a submartingale, provided that {Xn }n∈N0 is a submartingale, Hn ≥ 0, a.s., and Hn (Xn −
Xn−1 ) ∈ L1 , for all n ∈ N.
P ROOF Just check the definition and use properties of conditional expectation.
Remark 10.13 The martingale transform is the discrete-time analogue of the stochastic integral.
Note that it is crucial that H be predictable if we want a martingale transform of a martingale to
be a martingale. Otherwise, we could just take Hn = sgn(Xn − Xn−1 ) - which is Fn -measurable,
but typically not Fn−1 -measurable - and obtain a process which
is not a martingale unless X is constant. This corresponds to a player who knows the outcome of
the game before the bet is placed and places the bet of $1 which is guaranteed to win.
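The contrast between a predictable strategy and the "clairvoyant" one above is easy to see in simulation (the strategy choices below are our own illustrations):

```python
import numpy as np

# Martingale transforms of a fair random walk (illustration, setup ours):
# a predictable H keeps the expected gain at 0, while the clairvoyant
# H_n = sgn(X_n - X_{n-1}) wins every single bet.
rng = np.random.default_rng(3)
n_paths, n_steps = 100_000, 20
dX = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))   # martingale increments
X = np.cumsum(dX, axis=1)

# Predictable bet: H_1 = 1 and, for n >= 2, H_n = 1 if X_{n-1} < 0 (past only).
H_pred = np.hstack([np.ones((n_paths, 1)), (X[:, :-1] < 0).astype(float)])
gains_pred = (H_pred * dX).sum(axis=1)

# Clairvoyant bet: H_n = sgn(X_n - X_{n-1}) peeks at the current increment.
gains_peek = (np.sign(dX) * dX).sum(axis=1)

print(gains_pred.mean())   # approximately 0: the transform is still a martingale
print(gains_peek.mean())   # exactly n_steps = 20: a sure profit of $1 per bet
```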
10.4
Stopping times
Definition 10.14 (Random and stopping times) A random variable T with values in N0 ∪ {∞}
is called a random time. A random time is said to be a stopping time with respect to the filtration
{Fn }n∈N0 if
{T ≤ n} ∈ Fn , for all n ∈ N0 .
Remark 10.15
1. Stopping times are simply random instances with the property that at every instant you
can answer the question "Has T already happened?" using only the currently-available
information.
2. The additional element +∞ is used as a placeholder for the case when T does not happen.
Example 10.16
1. Constant (deterministic) times T = m, m ∈ N0 ∪ {∞}, are obviously stopping times. The set
of all stopping times can be thought of as an enlargement of the set of time-instances. The
meaning of "when the Red Sox win the World Series again" is clear, but it does not correspond
to a deterministic time.
2. Let {Xn }n∈N0 be a stochastic process adapted to the filtration {Fn }n∈N0 . For a set B ∈
B(R), we define the random time TB by
TB = min{n ∈ N0 : Xn ∈ B} (with min ∅ = +∞).
TB is called the hitting time of the set B and is a stopping time. Indeed,
{TB ≤ n} = {X0 ∈ B} ∪ {X1 ∈ B} ∪ · · · ∪ {Xn ∈ B} ∈ Fn .
3. Let {ξn }n∈N be an iid sequence of coin tosses, i.e., P[ξi = 1] = P[ξi = −1] = 1/2, and let
Xn = Σ_{k=1}^n ξk be the corresponding random walk. For N ∈ N, let S be the random time
defined by
S = max{n ≤ N : Xn = 0}.
S is called the last visit time to 0 before time N . Intuitively, S is not a stopping time
since, in order to know whether S has already happened at time m < N , we need to know
that Xk ≠ 0, for k = m + 1, . . . , N , and, for that, we need information which is not
contained in Fm . We leave it to the reader to make this comment rigorous.
Stopping times have good stability properties, as the following proposition shows. All stopping times are with respect to an arbitrary, but fixed filtration {Fn }nN0 .
P ROOF
1. Immediate.
2. Let us show that S + T is a stopping time and leave the other two to the reader:
{S + T ≤ n} = ∪_{k=0}^n ({S ≤ k} ∩ {T ≤ n − k}) ∈ Fn .
3. For m ∈ N0 , we have {T ≤ m} = ∩_{n∈N} {Tn ≤ m} ∈ Fm .
4. For m ∈ N0 , we have {Tn ≥ m} = {Tn < m}c = {Tn ≤ m − 1}c ∈ Fm−1 . Therefore,
{T ≤ m} = {T < m + 1} = {T ≥ m + 1}c = ( ∩_{n∈N} {Tn ≥ m + 1} )c ∈ Fm .
Stopping times are often used to produce new processes from old ones. The most common
construction runs the process X until time T and after that keeps it constant and equal to its value
at time T . More precisely:
Definition 10.18 (Stopped processes) Let {Xn }n∈N0 be a stochastic process, and let T be a stopping
time. The process {Xn }n∈N0 stopped at T , denoted by {XnT }n∈N0 , is defined by
XnT (ω) = X_{T (ω)∧n} (ω) = Xn (ω)1{n≤T (ω)} + X_{T (ω)} (ω)1{n>T (ω)} .
The (sub)martingale property is stable under stopping:
Proposition 10.19 (Stability under stopping) Let {Xn }n∈N0 be a (sub)martingale, and let T be a
stopping time. Then the stopped process {XnT }n∈N0 is also a (sub)martingale.
P ROOF Let {Xn }n∈N0 be a (sub)martingale. We note that the process Kn = 1{n≤T } is predictable,
non-negative and bounded, so its martingale transform (K · X) is a (sub)martingale. Moreover,
(K · X)n = X_{T ∧n} − X0 = XnT − X0 ,
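Numerically, stopping preserves the martingale property: E[X_{T ∧n} ] stays at X0 for every n. A sketch (the hitting level and time horizon below are our own choices):

```python
import numpy as np

# Stopping a martingale (illustration, parameters ours): stop a fair walk at
# the hitting time T of {-5, 5}; the stopped value still has mean X_0 = 0.
rng = np.random.default_rng(4)
n_paths, n_steps = 100_000, 50
dX = rng.choice([-1, 1], size=(n_paths, n_steps))
X = np.cumsum(dX, axis=1)

# T = first n with |X_n| >= 5 (use the final time if the walk never gets there).
hit = np.abs(X) >= 5
T = np.where(hit.any(axis=1), hit.argmax(axis=1), n_steps - 1)

X_stopped = X[np.arange(n_paths), T]
print(X_stopped.mean())             # approximately 0, as the proposition predicts
print(np.abs(X_stopped).max())      # never exceeds 5: steps of size 1 cannot overshoot
```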
10.5
Convergence of martingales
A judicious use of predictable processes in a martingale transform yields the following important result:
Theorem 10.20 (Martingale convergence) Let {Xn }n∈N0 be a martingale such that
sup_{n∈N0} E[|Xn |] < ∞.
Then there exists a random variable X∞ ∈ L1 (F) such that Xn → X∞ , a.s.
P ROOF Pick two real numbers a < b and define the stopping times (with inf ∅ = +∞)
S1 = inf{n ∈ N0 : Xn ≤ a}, T1 = inf{n ≥ S1 : Xn ≥ b},
S2 = inf{n ≥ T1 : Xn ≤ a}, T2 = inf{n ≥ S2 : Xn ≥ b}, . . .
In words, let S1 be the first time X falls under a. Then, T1 is the first time after S1 when X exceeds
b, etc. We leave it to the reader to check that {Tn }nN and {Sn }nN are stopping times. These
two sequences of stopping times allow us to construct a predictable process {Hn }nN which takes
values in {0, 1}. Simply, we buy low and sell high:
Hn = Σ_{k∈N} 1{Sk <n≤Tk } =
1, if Sk < n ≤ Tk for some k ∈ N,
0, otherwise.
Let Una,b be the number of completed upcrossings by time n, i.e.,
Una,b = sup{k ∈ N : Tk ≤ n}, with sup ∅ = 0.
A bit of accounting yields:
(H · X)n ≥ (b − a)Una,b − (Xn − a)− . (10.1)
Indeed, the total gains from the strategy H can be split into two components. First, every time a
passage from below a to above b is completed, we pocket at least (b − a). After that, if X never
falls below a again, H remains 0 and our total gains exceed (b − a)Una,b , which, in turn, trivially
dominates (b − a)Una,b − (Xn − a)− . The other possibility is that after the last completed upcrossing the
process does reach a value below a at a certain point. The last upcrossing already happened, so
the process never hits a value above b after that; it may very well happen that we lose on this last
transaction. The loss is, however, overestimated by (Xn − a)− .
Then the inequality (10.1) and the fact that the martingale transform (by a bounded process) of a
martingale is a martingale yield
E[Una,b ] ≤ (1/(b − a)) E[(H · X)n ] + (1/(b − a)) E[(Xn − a)− ] ≤ (E[|Xn |] + |a|)/(b − a) ≤ (|a| + sup_{n∈N0} E[|Xn |])/(b − a).
Letting n → ∞, the counts Una,b increase to the total number of upcrossings U∞a,b .
The assumption that sup_{n∈N0} E[|Xn |] < ∞, together with the monotone convergence theorem, implies that E[U∞a,b ] < ∞ and so P[U∞a,b < ∞] = 1. In
words, the number of upcrossings is almost surely finite (otherwise, we would be able to make
money by betting on a fair game).
It remains to use the fact that U∞a,b < ∞, a.s., to deduce that {Xn }n∈N0 converges. First of all,
by passing to rational numbers and taking countable intersections of probability-one sets, we can
assert that
P[U∞a,b < ∞, for all rational a < b] = 1.
Then, we assume, contrary to the statement, that {Xn }n∈N0 does not converge in [−∞, ∞], so that
P[lim inf n Xn < lim supn Xn ] > 0.
In that case, there exist rational numbers a < b with P[lim inf n Xn < a < b < lim supn Xn ] > 0,
which is, however, a contradiction since, on the event {lim inf n Xn < a < b < lim supn Xn }, the
process X completes infinitely many upcrossings of [a, b].
We conclude that there exists an [−∞, ∞]-valued random variable X∞ such that Xn → X∞ , a.s.
In particular, we have |Xn | → |X∞ |, a.s., and Fatou's lemma yields
E[|X∞ |] ≤ lim inf n E[|Xn |] ≤ supn E[|Xn |] < ∞.
Some, but certainly not all, results about martingales can be transferred to submartingales (supermartingales) using the following proposition:
Proposition 10.21 (Doob-Meyer decomposition) Let {Xn }n∈N0 be a submartingale. Then, there
exists a martingale {Mn }n∈N0 and a predictable process {An }n∈N (with A0 = 0 adjoined) such that
An ∈ L1 , An ≤ An+1 , a.s., for all n ∈ N0 , and
Xn = Mn + An , for all n ∈ N0 .
P ROOF Define
An = Σ_{k=1}^n E[Xk − Xk−1 |Fk−1 ], n ∈ N.
Then {An }n∈N is clearly predictable and An+1 ≥ An , a.s., thanks to the submartingale property of
X. Finally, set Mn = Xn − An , so that
E[Mn − Mn−1 |Fn−1 ] = E[Xn − Xn−1 − (An − An−1 )|Fn−1 ] = E[Xn − Xn−1 |Fn−1 ] − (An − An−1 ) = 0,
i.e., {Mn }n∈N0 is a martingale.
Corollary 10.22 (Submartingale convergence) Let {Xn }n∈N0 be a submartingale such that
sup_{n∈N0} E[Xn+ ] < ∞.
Then
Xn → X∞ = M∞ + A∞ , a.s.
It remains to show that X∞ ∈ L1 , and for that, it suffices to show that E[A∞ ] < ∞. Since E[An ] =
E[Xn ] − E[Mn ] ≤ C = supn E[Xn+ ] + E[|M0 |] < ∞, the monotone convergence theorem yields E[A∞ ]
≤ C < ∞.
Remark 10.23 Corollary 10.22 - or the simple observation that E[|Xn |] = 2E[Xn+ ] − E[X0 ] when
E[Xn ] = E[X0 ] - implies that it is enough to assume supn E[Xn+ ] < ∞ in the original martingale-convergence theorem (Theorem 10.20).
Corollary 10.24 (Convergence of non-negative supermartingales) Let {Xn }n∈N0 be a non-negative supermartingale (or a non-positive submartingale). Then there exists a random variable
X ∈ L1 (F) such that Xn → X, a.s.
To convince yourself that things can go wrong if the boundedness assumptions are not met, here
is a problem:
Problem 10.25 Give an example of a submartingale {Xn }n∈N0 with the property that Xn → −∞,
a.s., and E[Xn ] → ∞. (Hint: Use the Borel-Cantelli lemma.)
10.6
Additional problems
Problem 10.26 (Combining (super)martingales) In 1., 2. and 3. below, {Xn }nN0 and {Yn }nN0
are martingales. In 4., they are only supermartingales.
1. Show that the process {Zn }n∈N0 given by Zn = Xn ∨ Yn = max{Xn , Yn } is a submartingale.
2. Give an example of {Xn }nN0 and {Yn }nN0 such that {Zn }nN0 (defined above) is not a
martingale.
Xn = Σ_{k=1}^n ξk , n ∈ N,
where {ξn }n∈N are iid with P[ξ1 = 1] = p and P[ξ1 = −1] = 1 − p, for some p ∈ (0, 1). Under
the assumption that p ≥ 1/2, show that X hits any nonnegative level with probability 1, i.e.,
that for a ∈ N, we have P[τa < ∞] = 1, where τa = inf{n ∈ N : Xn = a}.
Problem 10.29 (An application to gambling) Let {ξn }n∈N be an iid sequence with P[ξn = 1] =
1 − P[ξn = −1] = p ∈ (1/2, 1). We interpret {ξn }n∈N as outcomes of a series of gambles. A gambler
starts with Z0 > 0 dollars, and in each play wagers a certain portion of her wealth. More precisely,
the wealth of the gambler at time n ∈ N is given by
Zn = Z0 + Σ_{k=1}^n Ck ξk ,
where {Cn }n∈N is the predictable process of bet sizes.
1. Show that, for T ∈ N,
E[log(ZT )] ≤ log(Z0 ) + λT, where λ = H(1/2) − H(p),
for any choice of {Cn }n∈N .
2. Show that the upper bound above is attained for some strategy {Cn }n∈N .
(Note: The quantity H(p) = −p log p − (1 − p) log(1 − p) is called the entropy of the distribution of ξ1 . This problem shows how
it appears naturally in a gambling-theoretic context: the optimal rate of return equals the excess
entropy H(1/2) − H(p).)
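For a fixed-fraction strategy Cn = c·Zn−1 (the "Kelly" choice; notation ours), the wealth is multiplicative and the growth rate per play is g(c) = p log(1 + c) + (1 − p) log(1 − c). The sketch below maximizes g and checks that the optimum equals the excess entropy from the Note:

```python
import numpy as np

# Optimal growth rate for fixed-fraction betting (sketch; names are ours).
# Maximizing g(c) = p*log(1+c) + (1-p)*log(1-c) over c gives c* = 2p - 1,
# and g(c*) = H(1/2) - H(p), with entropy H computed in nats.
p = 0.6
H = lambda q: -q * np.log(q) - (1 - q) * np.log(1 - q)   # entropy of a coin

c_star = 2 * p - 1                                       # optimal fraction of wealth
g_star = p * np.log(1 + c_star) + (1 - p) * np.log(1 - c_star)

print(c_star)                      # 0.2
print(g_star, H(0.5) - H(p))       # the two numbers agree
```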
Problem 10.30 (An application to analysis) Let Ω = [0, 1), F = B([0, 1)), and P = λ, where λ denotes the Lebesgue measure on [0, 1). For n ∈ N and k ∈ {0, 1, . . . , 2n − 1}, we define
Ik,n = [k2−n , (k + 1)2−n ), Fn = σ(I0,n , I1,n , . . . , I2n −1,n ).
In words, Fn is generated by the n-th dyadic partition of [0, 1). For x ∈ [0, 1), let kn (x) be the
unique number in {0, 1, . . . , 2n − 1} such that x ∈ Ikn (x),n . For a function f : [0, 1) → R we define
the process {Xnf }n∈N0 by
Xnf (x) = 2n ( f ((kn (x) + 1) 2−n ) − f (kn (x) 2−n ) ), x ∈ [0, 1).
1. Show that {Xnf }n∈N0 is a martingale.
2. Assume that the function f is Lipschitz, i.e., that there exists K > 0 such that |f (y) − f (x)| ≤
K |y − x|, for all x, y ∈ [0, 1). Show that the limit X∞f = limn Xnf exists a.s.
3. Show that, for f Lipschitz, X∞f has the property that
f (y) − f (x) = ∫_x^y X∞f (ω) dω, for all 0 ≤ x < y < 1.
(Note: This problem gives an alternative proof of the fact that Lipschitz functions are absolutely
continuous.)
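For a concrete Lipschitz f, the process Xnf is just a dyadic difference quotient, and part 3 says it recovers the derivative. A sketch with f(x) = x² (our own choice of f):

```python
import numpy as np

# The dyadic difference-quotient martingale X^f_n for f(x) = x^2 (example
# ours).  For large n it approximates the derivative f'(x) = 2x on [0, 1).
def X_f(f, n, x):
    """X^f_n(x) = 2^n * ( f((k+1)/2^n) - f(k/2^n) ), with k = floor(x * 2^n)."""
    k = np.floor(x * 2**n)
    return 2**n * (f((k + 1) / 2**n) - f(k / 2**n))

f = lambda t: t**2
x = np.linspace(0, 0.999, 1000)
err = np.max(np.abs(X_f(f, 20, x) - 2 * x))
print(err)   # small: X^f_n is uniformly close to f' = 2x
```

For f(x) = x² one can even compute X_f exactly: it equals 2·(k/2^n) + 2^{−n}, which differs from 2x by at most 3·2^{−n}.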
Chapter
11
Uniform Integrability
11.1
Uniform integrability
Uniform integrability is a compactness-type concept for families of random variables, not unlike
that of tightness.
Remark 11.2 It follows from the dominated convergence theorem (prove it!) that
lim_{K→∞} E[|X| 1{|X|≥K} ] = 0 if and only if X ∈ L1 ,
i.e., that for integrable random variables, far tails contribute little to the expectation. Uniformly
integrable families are simply those for which the size of this contribution can be controlled uniformly over all elements.
We start with a characterization and a few basic properties of uniform-integrable families:
142
sup E[|X|] K + 1,
XX
1
K E[|X|]
Definition 11.6 (Test functions of uniform integrability) A Borel function φ : [0, ∞) → [0, ∞)
is called a test function of uniform integrability if
lim_{x→∞} φ(x)/x = ∞.
Proposition 11.7 A nonempty family X ⊆ L1 is uniformly integrable if and only if there exists a
test function of uniform integrability φ such that
sup_{X∈X} E[φ(|X|)] < ∞. (11.1)
Moreover, if it exists, the function φ can be chosen in the class of non-decreasing convex functions.
P ROOF Suppose, first, that (11.1) holds for some test function of uniform integrability φ, and that
the value of the supremum is 0 < M < ∞. For n > 0, there exists Cn ∈ R such that φ(x) ≥ nM x,
for x ≥ Cn . Therefore,
M ≥ E[φ(|X|)] ≥ E[φ(|X|)1{|X|≥Cn } ] ≥ nM E[|X| 1{|X|≥Cn } ], for all X ∈ X .
Hence, sup_{X∈X} E[|X| 1{|X|≥Cn } ] ≤ 1/n, and the uniform integrability of X follows.
Conversely, suppose that X is uniformly integrable. By definition, there exists a sequence
{Cn }n∈N (which can always be chosen so that 0 < Cn < Cn+1 for n ∈ N, Cn → ∞) such that
sup_{X∈X} E[|X| 1{|X|≥Cn } ] ≤ 1/n3 .
Let the function φ : [0, ∞) → [0, ∞) be continuous and piecewise affine with φ(x) = 0 for x ∈
[0, C1 ], and the derivative equal to n on (Cn , Cn+1 ), so that
lim_{x→∞} φ(x)/x = lim_{x→∞} φ'(x) = ∞.
Then,
E[φ(|X|)] = E[ ∫_0^{|X|} φ'(ξ) dξ ] = ∫_{C1}^∞ E[φ'(ξ) 1{|X|≥ξ} ] dξ = Σ_{n=1}^∞ n ∫_{Cn}^{Cn+1} E[1{|X|≥ξ} ] dξ.
Clearly,
∫_{Cn}^{Cn+1} E[1{|X|≥ξ} ] dξ = E[|X| ∧ Cn+1 ] − E[|X| ∧ Cn ]
= E[(|X| − Cn )1{Cn ≤|X|<Cn+1 } ] + (Cn+1 − Cn )E[1{|X|≥Cn+1 } ]
≤ E[|X| 1{|X|≥Cn } ] + E[Cn+1 1{|X|≥Cn+1 } ] ≤ 2/n3 ,
so that
E[φ(|X|)] ≤ Σ_{n∈N} n (2/n3 ) = Σ_{n∈N} 2/n2 < ∞.
Corollary 11.8 (Lp -boundedness, p > 1, implies UI) For p > 1, let X be a nonempty family of
random variables bounded in Lp , i.e., such that sup_{X∈X} ||X||Lp < ∞. Then X is uniformly integrable.
is uniformly integrable. (Hint: Argue that it follows directly from Proposition 11.7 that E[φ(|X|)] < ∞
for some test function φ of uniform integrability. Then, show that the same φ can be used to prove
that X is UI.)
11.2
When it is known that the martingale {Xn }n∈N0 is uniformly integrable, a lot can be said about its
structure. We start with a definitive version of the dominated convergence theorem:
2. Xn → X in Lp , and
3. ||Xn ||Lp → ||X||Lp < ∞.
P ROOF (1) ⇒ (2): Since there exists a subsequence {Xnk }k∈N such that Xnk → X, a.s., Fatou's lemma
implies that
E[|X|p ] = E[lim inf k |Xnk |p ] ≤ lim inf k E[|Xnk |p ] ≤ supn E[|Xn |p ] < ∞,
where the last inequality follows from the fact that uniformly-integrable families are bounded in
L1 .
Now that we know that X ∈ Lp , uniform integrability of {|Xn |p }n∈N implies that the family
{|Xn − X|p }n∈N is UI (use Problem 11.5, (2)). Since Xn → X in probability if and only if Xn − X → 0 in probability, we
can assume without loss of generality that X = 0, a.s., and, consequently, we need to show that
E[|Xn |p ] → 0. We fix an ε > 0, and start with the following estimate:
E[|Xn |p ] = E[|Xn |p 1{|Xn |p ≤ε/2} ] + E[|Xn |p 1{|Xn |p >ε/2} ] ≤ ε/2 + E[|Xn |p 1{|Xn |p >ε/2} ]. (11.2)
By uniform integrability there exists δ > 0 such that supn∈N E[|Xn |p 1A ] < ε/2, whenever P[A] ≤ δ.
Convergence in probability now implies that there exists n0 ∈ N such that for n ≥ n0 , we have
P[|Xn |p > ε/2] ≤ δ. It follows directly from (11.2) that for n ≥ n0 , we have E[|Xn |p ] ≤ ε.
(2) ⇒ (3): | ||Xn ||Lp − ||X||Lp | ≤ ||Xn − X||Lp → 0.
Ψ_M(x) =
  x,  for x ∈ [0, M−1],
  0,  for x ∈ [M, ∞),
  interpolated linearly, for x ∈ (M−1, M).

For a given ε > 0, the dominated convergence theorem guarantees the existence of a constant M > 0 (which we fix throughout) such that
(11.3)
Proposition 11.15 (Structure of UI martingales) Let {X_n}_{n∈N₀} be a martingale. Then, the following are equivalent:

1. {X_n}_{n∈N₀} is a Lévy martingale, i.e., it admits a representation of the form X_n = E[X|F_n], a.s., for some X ∈ L¹(F),

2. {X_n}_{n∈N₀} is uniformly integrable,

3. {X_n}_{n∈N₀} converges in L¹.

In that case, the convergence also holds a.s., and the limit is given by E[X|F_∞], where F_∞ = σ(∪_{n∈N₀} F_n).
P ROOF (1) ⇒ (2): The representation X_n = E[X|F_n], a.s., and Problem 11.10 imply that {X_n}_{n∈N₀} is uniformly integrable.

(2) ⇒ (3): Corollary 11.14.

(3) ⇒ (2): Corollary 11.13.

(2) ⇒ (1): Corollary 11.14 implies that there exists a random variable Y ∈ L¹(F) such that X_n → Y, a.s., and in L¹. For m ∈ N and A ∈ F_m, we have |E[X_n 1_A] − E[Y 1_A]| ≤ E[|X_n − Y|] → 0, so E[X_n 1_A] → E[Y 1_A]. Since E[X_n 1_A] = E[E[X|F_n] 1_A] = E[X 1_A], for n ≥ m, we have

E[Y 1_A] = E[X 1_A], for all A ∈ ∪_n F_n.

The family ∪_n F_n is a π-system which generates the σ-algebra F_∞ = σ(∪_n F_n), and the family of all A ∈ F_∞ such that E[Y 1_A] = E[X 1_A] is a λ-system. Therefore, by the π-λ Theorem, we have

E[Y 1_A] = E[X 1_A], for all A ∈ F_∞.

Therefore, since Y is F_∞-measurable, we conclude that Y = E[X|F_∞], a.s.
Example 11.16 There exists a non-negative (and therefore a.s.-convergent) martingale which is not uniformly integrable (and therefore not L¹-convergent). Let {X_n}_{n∈N₀} be a simple random walk starting from 1, i.e., X₀ = 1 and X_n = 1 + Σ_{k=1}^n ξ_k, where {ξ_n}_{n∈N} is an iid sequence with P[ξ_n = 1] = P[ξ_n = −1] = 1/2, n ∈ N. Clearly, {X_n}_{n∈N₀} is a martingale, and so is {Y_n}_{n∈N₀}, where Y_n = X_{n∧T} and T = inf{n ∈ N : X_n = 0}. By convention, inf ∅ = +∞. It is well known that a simple symmetric random walk hits any level eventually, with probability 1 (we will prove this rigorously later), so P[T < ∞] = 1, and, since Y_n = 0, for n ≥ T, we have Y_n → 0, a.s., as n → ∞. On the other hand, {Y_n}_{n∈N₀} is a martingale, so E[Y_n] = E[Y₀] = 1, for n ∈ N. Therefore, E[Y_n] ̸→ E[lim_n Y_n] = 0, which can happen only if {Y_n}_{n∈N₀} is not uniformly integrable.
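A short simulation of this example (the number of paths and the horizon are arbitrary choices): the stopped walk keeps mean 1 at every fixed time, even though the proportion of paths already absorbed at 0 tends to one.

```python
import numpy as np

rng = np.random.default_rng(1)
paths, horizon = 20_000, 400
steps = rng.choice([-1, 1], size=(paths, horizon))
X = 1 + np.cumsum(steps, axis=1)          # simple random walk started from 1
hit_zero = (X == 0).any(axis=1)           # paths for which T <= horizon
Y = np.where(hit_zero, 0, X[:, -1])       # Y_n = X_{n ∧ T} at n = horizon
print(Y.mean())         # close to E[Y_n] = 1 (martingale property)
print((Y == 0).mean())  # close to 1: Y_n -> 0, a.s.
```

The surviving paths are rare but large, which is exactly the failure of uniform integrability.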
11.3
Backward martingales
If, instead of N₀, we use −N₀ = {..., −2, −1, 0} as the time set, the notion of a filtration is readily extended: it is still a family of sub-σ-algebras of F, parametrized by −N₀, such that F_{n−1} ⊆ F_n, for n ∈ −N₀.
One of the most important facts about backward submartingales is that they (almost) always
converge a.s., and in L1 .
Proposition 11.18 (Backward submartingale convergence) Suppose that {X_n}_{n∈−N₀} is a backward submartingale such that

lim_n E[X_n] > −∞.

Then {X_n}_{n∈−N₀} is uniformly integrable and there exists a random variable X_{−∞} ∈ L¹(∩_n F_n) such that

(11.4)  X_n → X_{−∞} a.s. and in L¹,

and

(11.5)  E[X_{−∞} 1_A] ≤ E[X_m 1_A], for any A ∈ ∩_n F_n, and any m ∈ −N₀.

We first note that since X_n ≤ E[X_m|F_n], for n ≤ m ≤ 0, we have

E[X_n 1_A] ≤ E[E[X_m|F_n] 1_A] = E[X_m 1_A],

for any A ∈ ∩_n F_n. It remains to use the fact that the L¹-convergence of {X_n}_{n∈−N₀} implies that E[X_n 1_A] → E[X_{−∞} 1_A], for all A ∈ F.
Remark 11.19 Even if lim_n E[X_n] = −∞, the convergence X_n → X_{−∞} still holds a.s., but not in L¹, and X_{−∞} may take the value −∞ with positive probability.
Corollary 11.20 (Backward martingale convergence) If {X_n}_{n∈−N₀} is a backward martingale, then X_n → X_{−∞} = E[X₀ | ∩_n F_n], a.s., and in L¹.
11.4
We can use the results about the convergence of backward martingales to give a non-classical
proof of the strong law of large numbers. Before that, we need a useful classical result.
Proposition 11.21 (Kolmogorov's 0-1 law) Let {ξ_n}_{n∈N} be a sequence of independent random variables, and let the tail σ-algebra F_{−∞} be defined by

F_{−∞} = ∩_{n∈N} F_n, where F_n = σ(ξ_n, ξ_{n+1}, ...).
Since σ(S_n, S_{n+1}, ...) = σ(σ(S_n), σ(ξ_{n+1}, ξ_{n+2}, ...)), and σ(ξ_{n+1}, ξ_{n+2}, ...) is independent of ξ₁, for n ∈ N, we have

X_{−n} = E[ξ₁|F_{−n}] = E[ξ₁|σ(S_n)] = (1/n) S_n,

where the last equality follows from Problem 9.29. Backward martingales converge a.s., and in L¹, so for the random variable X_{−∞} = lim_n (1/n) S_n we have

E[X_{−∞}] = lim_n E[(1/n) S_n] = E[ξ₁].

On the other hand, since lim_n (1/n) S_k = 0, for all fixed k ∈ N, we have X_{−∞} = lim_n (1/n)(ξ_{k+1} + · · · + ξ_n), for any k ∈ N, and so X_{−∞} ∈ σ(ξ_{k+1}, ξ_{k+2}, ...). By Proposition 11.21, X_{−∞} is measurable in a P-trivial σ-algebra, and is, thus, constant a.s. (why?). Since E[X_{−∞}] = E[ξ₁], we must have X_{−∞} = E[ξ₁], a.s.
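The conclusion (1/n) S_n → E[ξ₁], a.s., is easy to visualize with a simulation; the exponential distribution below is an arbitrary choice with E[ξ₁] = 2:

```python
import numpy as np

rng = np.random.default_rng(2)
xi = rng.exponential(scale=2.0, size=1_000_000)          # iid, E[xi_1] = 2
S_over_n = np.cumsum(xi) / np.arange(1, xi.size + 1)     # (1/n) S_n
print(S_over_n[[99, 9_999, 999_999]])                    # settles near 2
```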
11.5
We continue with a useful generalization of Kolmogorov's 0-1 law, where we extend the ideas about the use of symmetry in the proof of Theorem 11.22 above.
Problem 11.25

1. Show that a function f : Rⁿ → R is symmetric if and only if f = f_n^sym.
Definition 11.26 (Exchangeable σ-algebra) For n ∈ N, let E_n be the σ-algebra generated by X_{n+1}, X_{n+2}, ..., in addition to all random variables of the form f(X₁, ..., X_n), where f : Rⁿ → R is a symmetric Borel function. The exchangeable σ-algebra E is defined by

E = ∩_{n∈N} E_n.
Remark 11.27 The exchangeable σ-algebra clearly contains the tail σ-algebra, and we can interpret it as the collection of all events whose occurrence is not affected by a permutation of the order of X₁, X₂, ... .
Example 11.28 Consider the event

A = {ω ∈ Ω : lim sup_k Σ_{j=1}^k X_j(ω) ≥ 0}.

This event is not generally in the tail σ-algebra (why?), but it is always in the exchangeable σ-algebra. Indeed, A can be written as

A = {ω ∈ Ω : lim sup_k Σ_{j=n+1}^k X_j(ω) ≥ −(X₁(ω) + · · · + X_n(ω))},

and this event belongs to E_n, for each n ∈ N.
Lemma 11.29 (Symmetrization as conditional expectation) Let {X_n}_{n∈N} be an iid sequence and let f : Rᵏ → R, k ∈ N, be a bounded Borel function. Then

(11.7)  E[f(X₁, ..., X_k)|E_n] = f_n^sym(X₁, ..., X_n), a.s., for n ≥ k.

Moreover,

(11.8)  f_n^sym(X₁, ..., X_n) → E[f(X₁, ..., X_k)|E], a.s. and in L¹.
By symmetry and the definition of E_n, we expect E[f(X_{π(1)}, ..., X_{π(k)})|E_n] not to depend on π. To prove this in a rigorous way, we must show that

E[f(X_{π(1)}, ..., X_{π(k)}) − f(X₁, ..., X_k)|E_n] = 0, a.s.,

for each π ∈ S_n^k. For that, in turn, it will be enough to pick π ∈ S_n^k and show that

E[g(X₁, ..., X_n)(f(X_{π(1)}, ..., X_{π(k)}) − f(X₁, ..., X_k))] = 0,

for any bounded symmetric function g : Rⁿ → R. Notice that the iid property implies that for any permutation σ ∈ S_n, and any bounded Borel function h : Rⁿ → R, we have

E[h(X₁, ..., X_n)] = E[h(X_{σ(1)}, ..., X_{σ(n)})].

In particular, for the function h : Rⁿ → R given by

h(x₁, ..., x_n) = g(x₁, ..., x_n) f(x_{π(1)}, ..., x_{π(k)}),

and the permutation σ ∈ S_n with σ(π(i)) = i, for i = 1, ..., k (it exists since π is an injection), we have

E[g(X₁, ..., X_n) f(X_{π(1)}, ..., X_{π(k)})] = E[h(X₁, ..., X_n)]
  = E[h(X_{σ(1)}, ..., X_{σ(n)})]
  = E[g(X_{σ(1)}, ..., X_{σ(n)}) f(X₁, ..., X_k)]
  = E[g(X₁, ..., X_n) f(X₁, ..., X_k)],

where the last equality follows from the fact that g is symmetric.
Finally, to prove (11.8), we simply combine (11.7), the definition E = ∩_n E_n of the exchangeable σ-algebra, and the backward martingale convergence theorem (Proposition 11.18).
Problem 11.30 Let G be a sub-σ-algebra of F, and let X be a random variable with E[X²] < ∞ such that E[X|G] is independent of X. Show that X = E[X], a.s.

(Hint: Use the fact that E[X|G] is the L²-projection of X onto L²(G), and, consequently, that E[(X − E[X|G]) E[X|G]] = 0.)
Proposition 11.31 (Hewitt-Savage 0-1 Law) The exchangeable σ-algebra of an iid sequence is trivial, i.e., P[A] ∈ {0, 1}, for A ∈ E.

P ROOF We pick a Borel function f : Rᵏ → R such that |f(x)| ≤ C, for x ∈ Rᵏ. The idea of the proof is to improve the conclusion (11.8) of Lemma 11.29 to f_n^sym → E[f(X₁, ..., X_k)]. We start by observing that for n > k, out of the n!/(n−k)! terms in f_n^sym, exactly

n!/(n−k)! − (n−1)!/(n−1−k)! = (k/n) · n!/(n−k)!
Corollary 11.32 (A strong law for symmetric functions) Let {X_n}_{n∈N} be an iid sequence, and let f : Rᵏ → R, k ∈ N, be a Borel function with f(X₁, ..., X_k) ∈ L¹. Then

f_n^sym(X₁, ..., X_n) → E[f(X₁, ..., X_k)].
Remark 11.33 The random variables of the form f_n^sym(X₁, ..., X_n), for some Borel function f : Rᵏ → R, are sometimes called U-statistics, and are used as estimators in statistics. Corollary 11.32 can be interpreted as a consistency statement for U-statistics.
Example 11.34 For f(x₁, x₂) = (x₁ − x₂)², we have

f_n^sym(x₁, ..., x_n) = (2/(n(n−1))) Σ_{1≤i<j≤n} (x_i − x_j)²,

and so, by Corollary 11.32, we have that for an iid sequence {X_n}_{n∈N} with σ² = Var[X₁] < ∞, we have

(2/(n(n−1))) Σ_{1≤i<j≤n} (X_i − X_j)² → E[(X₁ − X₂)²] = 2σ², a.s. and in L¹.
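A numerical illustration of Example 11.34 (the normal distribution and the sample size are arbitrary choices): the average of (X_i − X_j)² over all pairs approaches 2σ².

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
X = rng.normal(loc=1.0, scale=3.0, size=2_000)   # Var[X_1] = sigma^2 = 9
pair_avg = np.mean([(a - b) ** 2 for a, b in combinations(X, 2)])
print(pair_avg)   # close to 2 * sigma^2 = 18
```

Since the pair average equals twice the unbiased sample variance, this is the usual consistency of the sample variance in disguise.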
Definition 11.35 (Exchangeable sequences) A sequence {X_n}_{n∈N} of random variables is said to be exchangeable if

(X_{σ(1)}, ..., X_{σ(n)}) (d)= (X₁, ..., X_n), for all n ∈ N and all permutations σ ∈ S_n.

E[X₁ X₂] = P[Y₁ ≤ α, Y₂ ≤ α] = ∫₀¹ P[Y₁ ≤ x, Y₂ ≤ x] dx = ∫₀¹ x² dx = 1/3,

as well as

E[X₁] E[X₂] = P[Y₁ ≤ α] P[Y₂ ≤ α] = (∫₀¹ x dx)² = 1/4,

so X₁ and X₂ are not independent.
To show that {X_n}_{n∈N} is exchangeable, we need to compare the distributions of (X₁, ..., X_n) and (X_{σ(1)}, ..., X_{σ(n)}) for σ ∈ S_n, n ∈ N. For a choice (b₁, ..., b_n) ∈ {0, 1}ⁿ, we have

P[(X_{σ(1)}, ..., X_{σ(n)}) = (b₁, ..., b_n)]
  = E[E[1_{(X_{σ(1)}, ..., X_{σ(n)}) = (b₁, ..., b_n)} | σ(α)]]
  = ∫₀¹ P[1_{Y_{σ(1)} ≤ x} = b₁, ..., 1_{Y_{σ(n)} ≤ x} = b_n] dx
  = ∫₀¹ ∏_{i=1}^n P[1_{Y_{σ(i)} ≤ x} = b_i] dx
  = ∫₀¹ ∏_{i=1}^n P[1_{Y₁ ≤ x} = b_i] dx,

where the last equality follows from the fact that {Y_n}_{n∈N} are iid. Since the final expression above does not depend on σ, the sequence {X_n}_{n∈N} is indeed exchangeable.
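The construction suggested by the computations above — a uniform α on (0,1) and, independently of it, iid uniforms {Y_n}, with X_n = 1_{Y_n ≤ α} — can be checked by simulation (sample size arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
trials = 400_000
alpha = rng.uniform(size=trials)             # alpha ~ U(0,1), one per realization
Y = rng.uniform(size=(trials, 2))            # Y_1, Y_2 iid U(0,1), independent of alpha
X = (Y <= alpha[:, None]).astype(float)      # X_n = 1_{Y_n <= alpha}
print(np.mean(X[:, 0] * X[:, 1]))            # E[X_1 X_2] = 1/3
print(np.mean(X[:, 0]) * np.mean(X[:, 1]))   # E[X_1] E[X_2] = 1/4
```

The gap between 1/3 and 1/4 is the positive correlation induced by the common randomness α.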
Problem 11.37 (A consequence of exchangeability) Let {X_n}_{n∈N} be an exchangeable sequence with E[X₁²] < ∞. Show that

E[X₁ X₂] ≥ 0.

(Hint: Expand the inequality E[(X₁ − E[X₁] + · · · + X_n − E[X_n])²] ≥ 0 and use exchangeability.)
Definition 11.38 (Conditional iid) A sequence {X_n}_{n∈N} is said to be conditionally iid with respect to a sub-σ-algebra G of F, if

P[X₁ ∈ A₁, X₂ ∈ A₂, ..., X_n ∈ A_n | G] = P[X₁ ∈ A₁|G] · · · P[X₁ ∈ A_n|G], a.s.,

for all n ∈ N, and all A₁, ..., A_n ∈ B(R).
Problem 11.39 Let the sequence {X_n}_{n∈N} be as in Example 11.36. Show that

1. {X_n}_{n∈N} is conditionally iid with respect to σ(α), and

2. σ(α) is the exchangeable σ-algebra for {X_n}_{n∈N}. (Hint: Consider the limit lim_n (1/n) Σ_{k=1}^n X_k.)
The result of Problem 11.39 is not a coincidence. In a sense, all exchangeable sequences have a structure similar to that of {X_n}_{n∈N} above:
Theorem 11.40 (de Finetti) Let {X_n}_{n∈N} be an exchangeable sequence, and let E be the corresponding exchangeable σ-algebra. Then, conditionally on E, {X_n}_{n∈N} are iid.
P ROOF Let f be of the form f = g ⊗ h, where g : R^{k−1} → R and h : R → R are Borel functions with |g(x)| ≤ C_g, for all x ∈ R^{k−1}, and |h(x)| ≤ C_h, for all x ∈ R. The product P_n = n(n−1) · · · (n−k+2) g_n^sym(X₁, ..., X_n) · n h_n^sym(X₁, ..., X_n) can be expanded into

P_n = Σ_{π ∈ S_n^{k−1}} g(X_{π(1)}, ..., X_{π(k−1)}) Σ_{i=1}^{n} h(X_i)
    = Σ_{π ∈ S_n^{k}} g(X_{π(1)}, ..., X_{π(k−1)}) h(X_{π(k)}) + Σ_{π ∈ S_n^{k−1}} g(X_{π(1)}, ..., X_{π(k−1)}) Σ_{j=1}^{k−1} h(X_{π(j)}).

A bit of simple algebra, where f^j(x₁, ..., x_{k−1}) = g(x₁, ..., x_{k−1}) h(x_j), for j = 1, ..., k−1, yields

f_n^sym = (n/(n−k+1)) g_n^sym h_n^sym − (1/(n−k+1)) Σ_{j=1}^{k−1} f_n^{j,sym}.

The sum Σ_{j=1}^{k−1} f_n^{j,sym} is bounded, uniformly in n, by (k−1) C_g C_h, so the last term above converges to 0, while n/(n−k+1) → 1. Therefore, the relation (11.8) of Lemma 11.29 applied to f_n^sym, g_n^sym and h_n^sym implies that

E[f(X₁, ..., X_k)|E] = E[g(X₁, ..., X_{k−1})|E] · E[h(X_k)|E].
11.6
Additional Problems
Let Y : N → [1, ∞) be a random variable such that E[Y] < ∞ and E[Y K] = ∞, where K(k) = k, for k ∈ N.

1. (3pts) Find an explicit example of a random variable Y with the above properties.

2. (5pts) Find an expression for X_n = E[Y|F_n] in terms of the values Y(k), k ∈ N.

3. (12pts) Using the fact that X̄(k) := sup_{n∈N} |X_n(k)| ≥ X_k(k) for k ∈ N, show that {X_n}_{n∈N} is a uniformly integrable martingale which is not in H¹. (Note: A martingale {X_n}_{n∈N} is said to be in H¹ if X̄ ∈ L¹.)
Problem 11.42 (Scheffé's lemma) Let {X_n}_{n∈N₀} be a sequence of random variables in L¹₊ such that X_n → X, a.s., for some X ∈ L¹₊. Show that E[X_n] → E[X] if and only if the sequence {X_n}_{n∈N₀} is UI.
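A standard illustration of the "only if" direction (the specific sequence is our choice, not from the text): X_n = n · 1_{U < 1/n} with U uniform on (0,1) satisfies X_n → 0, a.s., yet E[X_n] = 1 for every n; the family is not UI, and E[X_n] ̸→ 0 = E[lim_n X_n].

```python
import numpy as np

rng = np.random.default_rng(5)
U = rng.uniform(size=2_000_000)
for n in (10, 100, 1000):
    Xn = n * (U < 1.0 / n)    # X_n = n on {U < 1/n}, else 0; E[X_n] = 1
    print(n, Xn.mean())       # stays near 1, does not go to E[lim X_n] = 0
```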
Problem 11.43 (Hunt's lemma) Let {F_n}_{n∈N₀} be a filtration, and let {X_n}_{n∈N₀} be a sequence in L⁰ such that X_n → X, for some X ∈ L⁰, both in L¹ and a.s.

1. (Hunt's lemma) Assume that |X_n| ≤ Y, a.s., for all n ∈ N and some Y ∈ L¹₊. Prove that

(11.9)  E[X_n|F_n] → E[X|F_∞], a.s., where F_∞ = σ(∪_n F_n).

(Hint: Define Z_n = sup_{m≥n} |X_m − X|, and show that Z_n → 0, a.s., and in L¹.)
2. Find an example of a sequence {X_n}_{n∈N} in L¹ such that X_n → 0, a.s., and in L¹, but E[X_n|G] does not converge to 0, a.s., for some G ⊆ F. (Hint: Look for X_n of the form X_n = ξ_n 1_{A_n}/P[A_n] and G = σ(ξ_n ; n ∈ N).)

(Note: The existence of such a sequence proves that (11.9) is not true without an additional assumption, such as the one of uniform domination in (1). It provides an example of a property which does not generalize from the unconditional to the conditional case.)
(Hint: Consider lim_n E[X⁺_{m+n}|F_m], for m ∈ N₀.)
Problem 11.45 (Branching processes) Let μ be a probability measure on B(R) with μ(N₀) = 1, which we call the offspring distribution. A population starting from one individual (Z₀ = 1) evolves as follows. The initial member leaves a random number Z₁ of children and dies. After that, each of the Z₁ children of the initial member produces a random number of children and dies. The total number of all children of the Z₁ members of generation 1 is denoted by Z₂. Each of the Z₂ members of generation 2 produces a random number of children, etc. Whenever an individual procreates, the number of children has the distribution μ, and is independent of the sizes of all the previous generations including the present one, as well as of the numbers of children of other members of the present generation.
1. Suppose that a probability space and an iid sequence {ξ_n}_{n∈N} of random variables with the distribution μ are given. Show how you would construct a sequence {Z_n}_{n∈N₀} with the above properties. (Hint: Z_{n+1} is a sum of iid random variables with the number of summands equal to Z_n.)

2. For a distribution μ on N₀, we define the generating function P_μ : [0, 1] → [0, 1] of μ by

P_μ(x) = Σ_{k∈N₀} μ({k}) x^k.

Show that each P_μ is continuous, non-decreasing and convex on [0, 1], and continuously differentiable on (0, 1).
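A simulation sketch of parts 1 and 2; the offspring distribution μ below, supported on {0, 1, 2}, is an arbitrary illustrative choice. Classically, the extinction probability of the process is the smallest fixed point of P_μ on [0, 1]; here P_μ(x) = 1/4 + x/4 + x²/2, whose smallest fixed point is 1/2:

```python
import numpy as np

rng = np.random.default_rng(6)
support, probs = np.array([0, 1, 2]), np.array([0.25, 0.25, 0.5])  # arbitrary mu

def P_mu(x):
    # generating function P_mu(x) = sum_k mu({k}) x^k
    return float(np.sum(probs * x ** support))

def simulate(generations):
    Z = 1                                       # Z_0 = 1
    for _ in range(generations):
        if Z == 0:
            return 0                            # extinction is absorbing
        # Z_{n+1}: sum of Z_n iid offspring counts drawn from mu
        Z = int(rng.choice(support, p=probs, size=Z).sum())
    return Z

extinct = np.mean([simulate(30) == 0 for _ in range(2_000)])
print(P_mu(0.5))   # 0.5 is a fixed point of P_mu
print(extinct)     # close to the extinction probability 1/2
```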
Index
probability, 115, 122
conjugate exponents, 53
continuity of probability, 72
convergence
almost surely (a.s.), 73
almost-everywhere, 45
everywhere, 46
in L1 , 51
in Lp , 52
in distribution, 89
in measure, 58
in probability, 73
in total variation, 103
weak, 89
convex cone, 37
convolution
of L1 -functions, 84
of probability measures, 83
countable set, 5
countable-cocountable σ-algebra, 7
cycle, 112
λ-system, 6
π-system, 6
σ-algebra, 6
-trivial, 45
countable-cocountable, 7
exchangeable, 151
generated by a family of maps, 10
generated by a map, 10
tail, 149
cylinder set, 12
algebra, 6
almost surely (a.s.), 73
almost-everywhere equality, 44
asymptotic density, 34
atom of a measure, 20
axiom of choice, 21
Bell numbers, 17
Bochner's theorem, 97
Borel σ-algebra, 7
Borel function, 9
Darboux sums, 48
Dirac function, 20
distribution
χ², 86
coin-toss, 78
cumulative (cdf), 75
exponential, 86
joint, 75
marginal, 75
of a random element, 75
of a random variable, 74
of a random vector, 75
regular conditional, 123
Cantor set, 34
characteristic function, 96
choice functions, 10
coin-toss space, 12
completion of a measure, 34
composition of functions, 5
conditional
probability, regular, 123
characteristic function, 126
density, 125
expectation, 115
iid property, 155
singular, 77
standard normal, 86
uniform on (a, b), 84
uniform on S¹, 34
elementary outcome, 72
entropy of a distribution, 141
equally distributed random variables, 74
essential supremum, 52
essentially bounded from above, 52
events, 72
certain, 72
mutually-exclusive, 72
eventually-periodic sequence, 35
exchangeable sequence, 154
expectation, 73
exponential moment, 86
extended set of real numbers, 14
extinction probability, 157
family of sets, 5
decreasing, 5
increasing, 5
pairwise disjoint, 5
filtration, 130
generated by a process, 131
function
n-symmetrization of, 150
a version of, 68
Carathéodory function, 71
characteristic, 96
cumulative distribution (cdf), 75
essentially-bounded, 52
generating function, 157
integrable, 38
measure-preserving, 27
null, 44
probability-density (pdf), 76
Riemann-integrable, 48
simple, 36
symmetric, 150
positive, 19
probability measure, 19
product, 63
real, 31
regular, 35
signed, 31
singular, 66
support of, 70
translation-invariant, 29
uniform, 20
vector measure, 122
measure space, 19
complete, 34
product, 63
measure-preserving map, 27
sample space, 72
section of a set, 61
set
-continuity, 90
convex, 58
countable, 5
exceptional, 45
null, 20, 44
set function, 19
simple-function representation, 36
space
Banach, 55
Borel, 123
complete metric, 55
nice, 123
normed, 50
pseudo-Banach, 55
pseudo-normed, 50
topological, 7
Standard Machine, 36
stochastic process
adapted, 130
discrete-time, 73, 130
predictable, 133
stopped at T , 136
stopping time, 135
submartingale, 131
backward, 148
summability, 31
supermartingale, 131
natural projections, 11
norm, 50
offspring distribution, 157
parallelogram identity, 58
Parseval's identity, 98
partition, 17
finite measurable, 31
of a set, 5
of an interval, 48
point mass, 20
probability space, 72
filtered, 130
product
of measurable spaces, 11
of measure spaces, 63
of sets, 10
product cylinder set, 12
pseudo metric, 33
pseudo norm, 50
pseudo-random numbers, 85
pull-back, 8
push-forward, 8, 28
theorem
π-λ, 25
Carathéodory's extension theorem, 24
continuity, 102
de Finetti's, 155
dominated-convergence theorem, 43
Fubini-Tonelli, 64
Hahn-Jordan, 32
inversion, 98
monotone-convergence theorem, 39
Radon-Nikodym derivative, 68
random element, 73
random permutation, 112
random time, 135
random variable, 72
absolutely-continuous, 76
extended-valued, 73
random variables