Prob MIT Lectures
Contents
1. Probabilistic experiments
2. Sample space
3. Discrete probability spaces
4. σ-fields
5. Probability measures
6. Continuity of probabilities
1 PROBABILISTIC EXPERIMENTS

A probabilistic experiment is described by a triple (Ω, F, P), called a probability space, where:
(a) Ω is the sample space, the set of possible outcomes of the experiment.
(b) F is a σ-field, a collection of subsets of Ω.
(c) P is a probability measure, a function that assigns a nonnegative probability to every set in the σ-field F.
2 SAMPLE SPACE

The sample space is a set Ω comprised of all the possible outcomes of the experiment. Typical elements of Ω are often denoted by ω, and are called elementary outcomes, or simply outcomes. The sample space can be finite, e.g., Ω = {ω1, . . . , ωn}, countable, e.g., Ω = N, or uncountable, e.g., Ω = R or Ω = {0, 1}^∞.
As a practical matter, the elements of Ω must be mutually exclusive and collectively exhaustive, in the sense that once the experiment is carried out, there is exactly one element of Ω that occurs.
Examples
(a) If the experiment consists of a single roll of an ordinary die, the natural sample
space is the set Ω = {1, 2, . . . , 6}, consisting of 6 elements. The outcome ω = 2
indicates that the result of the roll was 2.
(b) If the experiment consists of five consecutive rolls of an ordinary die, the natural sample space is the set Ω = {1, 2, . . . , 6}^5. The element ω = (3, 1, 1, 2, 5) is an example of a possible outcome.
(c) If the experiment consists of an infinite number of consecutive rolls of an ordinary die, the natural sample space is the set Ω = {1, 2, . . . , 6}^∞. In this case, an elementary outcome is an infinite sequence, e.g., ω = (3, 1, 1, 5, . . .). Such a sample space would be appropriate if we intend to roll a die indefinitely and we are interested in studying, say, the number of rolls until a 4 is obtained for the first time.
(d) If the experiment consists of measuring the velocity of a vehicle with infinite precision, a natural sample space is the set R of real numbers.
3 DISCRETE PROBABILITY SPACES

Probability spaces with a countable (possibly finite) sample space are the simplest to deal with. In this case, we take F = 2^Ω, the collection of all subsets of Ω, and the probability measure is completely determined by the probabilities of the elementary outcomes: the numbers P({ω}) must be nonnegative and satisfy
Σ_{ω∈Ω} P({ω}) = 1.
For simplicity, we will usually employ the notation P(ω) instead of P({ω}), and we will often denote P(ωi) by pi.
The following are some examples of discrete probability spaces. Note that typically we do not provide an explicit expression for P(A) for every A ⊂ Ω. It suffices to specify the probability of elementary outcomes, from which P(A) is readily obtained for any A.
Examples.
(a) Consider a single toss of a coin. If we believe that heads (H) and tails (T) are equally likely, the following is an appropriate model. We set Ω = {ω1, ω2}, where ω1 = H and ω2 = T, and let p1 = p2 = 1/2. Here, F = {Ø, {H}, {T}, {H, T}}, and P(Ø) = 0, P(H) = P(T) = 1/2, P({H, T}) = 1.
(b) Consider a single roll of a die. If we believe that all six outcomes are equally likely, the following is an appropriate model. We set Ω = {1, 2, . . . , 6} and p1 = · · · = p6 = 1/6.
(c) This example is not necessarily motivated by a meaningful experiment, yet it is a
legitimate discrete probability space. Let Ω = {1, 2, 5, a, v, aaa, ∗}, and P(1) = .1,
P(2) = .1, P(5) = .3, P(a) = .15, P(v) = .15, P(aaa) = .2, P(∗) = 0.
(d) Let Ω = N, and pk = (1/2)^k, for k = 1, 2, . . . . More generally, given a parameter p ∈ [0, 1), we can define pk = (1 − p)p^{k−1}, for k = 1, 2, . . . . This results in a legitimate probability space because Σ_{k=1}^∞ (1 − p)p^{k−1} = 1.
(e) Let Ω = {0, 1, 2, . . .}. We fix a parameter λ > 0, and let pk = e^{−λ} λ^k / k!, for every k ∈ Ω. This results in a legitimate probability space because Σ_{k=0}^∞ e^{−λ} λ^k / k! = 1.
(f) We toss an unbiased coin n times. We let Ω = {0, 1}^n, and if we believe that all sequences of heads and tails are equally likely, we let P(ω) = 1/2^n for every ω ∈ Ω.
(g) We roll a die n times. We let Ω = {1, 2, . . . , 6}^n, and if we believe that all elementary outcomes (n-long sequences) are equally likely, we let P(ω) = 1/6^n for every ω ∈ Ω.
Given the probabilities pi, the problem of determining P(A) for some subset of Ω is conceptually straightforward. However, the calculations involved in determining the value of the sum Σ_{ω∈A} P(ω) can range from straightforward to daunting. Various methods that can simplify such calculations will be explored in future lectures.
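As a concrete illustration, here is a minimal Python sketch of this computation, using the space of Example (c) above; the helper name and the chosen event are ours.

```python
# Sketch: computing P(A) in a discrete probability space by summing the
# probabilities of elementary outcomes (the space of Example (c) above).
p = {1: 0.1, 2: 0.1, 5: 0.3, 'a': 0.15, 'v': 0.15, 'aaa': 0.2, '*': 0.0}

def prob(A):
    """Return P(A) = sum of P(omega) over omega in A."""
    return sum(p[omega] for omega in A)

assert abs(prob(p.keys()) - 1.0) < 1e-12  # P(Omega) = 1
print(prob({1, 2, 5}))                    # 0.5
```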
4 σ-FIELDS
When the sample space Ω is uncountable, the idea of defining the probability of a general subset of Ω in terms of the probabilities of elementary outcomes runs into difficulties. Suppose, for example, that the experiment consists of drawing a number from the interval [0, 1], and that we wish to model a situation where all elementary outcomes are “equally likely.” If we were to assign a probability of zero to every ω, this alone would not be of much help in determining the probability of a subset such as [1/2, 3/4]. If we were to assign the same positive value to every ω, we would obtain P({1, 1/2, 1/3, . . .}) = ∞, which is undesirable. A way out of this difficulty is to work directly with the probabilities of more general subsets of Ω (not just subsets consisting of a single element).
Ideally, we would like to specify the probability P(A) of every subset of Ω. However, if we wish our probabilities to have certain intuitive mathematical properties, we run into some insurmountable mathematical difficulties. A solution is provided by the following compromise: assign probabilities to only a partial collection of subsets of Ω. The sets in this collection are to be thought of as the “nice” subsets of Ω, or, alternatively, as the subsets of Ω of interest. Mathematically, we will require this collection to be a σ-field, a term that we define next.
Definition 2. Given a sample space Ω, a σ-field is a collection F of subsets of Ω, with the following properties:
(a) Ø ∈ F.
(b) If A ∈ F, then A^c ∈ F.
(c) If Ai ∈ F for every i ∈ N, then ∪_{i=1}^∞ Ai ∈ F.
The following are some examples of σ-fields. (Check that this is indeed the
case.)
Examples.
(a) The trivial σ-field, F = {Ø, Ω}.
(b) The collection F = {Ø, A, A^c, Ω}, where A is a fixed subset of Ω.
(c) The set of all subsets of Ω: F = 2^Ω = {A | A ⊂ Ω}.
(d) Let Ω = {1, 2, . . . , 6}^n, the sample space associated with n rolls of a die. Let A = {ω = (ω1, . . . , ωn) | ω1 ≤ 2}, B = {ω = (ω1, . . . , ωn) | 3 ≤ ω1 ≤ 4}, and C = {ω = (ω1, . . . , ωn) | ω1 ≥ 5}, and F = {Ø, A, B, C, A∪B, A∪C, B∪C, Ω}.
Example (d) above can be thought of as follows. We start with a number
of subsets of Ω that we wish to have included in a σ-field. We then include
more subsets, as needed, until a σ-field is constructed. More generally, given a
collection of subsets of Ω, we can contemplate forming complements, countable
unions, and countable intersections of these subsets, to form a new collection.
We continue this process until no more sets are included in the collection, at
which point we obtain a σ-field. This process is hard to formalize in a rigorous
manner. An alternative way of defining this σ-field is provided below. We will
need the following fact.
Proposition 1. Let S be an index set (possibly infinite, or even uncountable), and suppose that for every s ∈ S we have a σ-field Fs of subsets of the same sample space. Let F = ∩_{s∈S} Fs, i.e., a set A belongs to F if and only if A ∈ Fs for every s ∈ S. Then F is a σ-field.
Proof. We need to verify that F has the three required properties. Since each Fs is a σ-field, we have Ø ∈ Fs, for every s, which implies that Ø ∈ F. To establish the second property, suppose that A ∈ F. Then, A ∈ Fs, for every s. Since each Fs is a σ-field, we have A^c ∈ Fs, for every s. Therefore, A^c ∈ F, as desired. Finally, to establish the third property, consider a sequence {Ai} of elements of F. In particular, for a given s ∈ S, every set Ai belongs to Fs. Since Fs is a σ-field, it follows that ∪_{i=1}^∞ Ai ∈ Fs. Since this is true for every s ∈ S, it follows that ∪_{i=1}^∞ Ai ∈ F. This verifies the third property and establishes that F is indeed a σ-field.
Suppose now that we start with a collection C of subsets of Ω, which is not
necessarily a σ-field. We wish to form a σ-field that contains C. This is always
possible, a simple choice being to just let F = 2Ω . However, for technical
reasons, we may wish the σ-field to contain no more sets than necessary. This
leads us to define F as the intersection of all σ-fields that contain C. Note that
if H is any other σ-field that contains C, then F ⊂ H. (This is because F was
defined as the intersection of various σ-fields, one of which is H.) In this sense,
F is the smallest σ-field containing C. The σ-field F constructed in this manner
is called the σ-field generated by C, and is often denoted by σ(C).
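When Ω is finite, the informal closure process described above can be carried out directly. The following Python sketch (a toy illustration; the function name is ours) repeatedly adds complements and unions until nothing new appears, which for finite Ω terminates with σ(C).

```python
# Sketch: the sigma-field generated by a collection C of subsets of a
# *finite* Omega, obtained by closing under complements and unions.
# (For finite Omega, closure under pairwise unions suffices.)
def generate_sigma_field(omega, collection):
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(c) for c in collection}
    while True:
        new = {omega - a for a in sets}              # complements
        new |= {a | b for a in sets for b in sets}   # unions
        if new <= sets:
            return sets
        sets |= new

# Example (d) above, with n = 1 (events determined by a single roll):
A, B, C = {1, 2}, {3, 4}, {5, 6}
F = generate_sigma_field({1, 2, 3, 4, 5, 6}, [A, B, C])
print(len(F))  # 8 sets: Ø, A, B, C, A∪B, A∪C, B∪C, Omega
```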
Example. Let Ω = [0, 1]. The smallest σ-field that includes every interval [a, b] ⊂ [0, 1]
is hard to describe explicitly (it includes fairly complicated sets), but is still well-defined,
by the above discussion. It is called the Borel σ-field, and is denoted by B. A set
A ⊂ [0, 1] that belongs to this σ-field is called a Borel set.
5 PROBABILITY MEASURES
Definition 3. Let (Ω, F) be a measurable space. A measure is a function µ : F → [0, ∞], which assigns a nonnegative extended real number µ(A) to every set A in F, and which satisfies the following two conditions:
(a) µ(Ø) = 0;
(b) (Countable additivity) If {Ai} is a sequence of disjoint sets that belong to F, then µ(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ µ(Ai).
A probability measure is a measure P with the additional property P(Ω) = 1. In this case, the triple (Ω, F, P) is called a probability space.
For any A ∈ F, P(A) is called the probability of the event A. The assignment of unit probability to the event Ω expresses our certainty that the outcome of the experiment, no matter what it is, will be an element of Ω. Similarly, the outcome cannot be an element of the empty set; thus, the empty set cannot occur and is assigned zero probability. If an event A ∈ F satisfies P(A) = 1, we say that A occurs almost surely. Note, however, that A happening almost surely is not the same as the condition A = Ω. For a trivial example, let Ω = {1, 2, 3}, p1 = .5, p2 = .5, p3 = 0. Then the event A = {1, 2} occurs almost surely, since P(A) = .5 + .5 = 1, but A ≠ Ω. The outcome 3 has zero probability, but is still possible.
The countable additivity property is very important. Its intuitive meaning
is the following. If we have several events A1 , A2 , . . ., out of which at most
one can occur, then the probability that “one of them will occur” is equal to
the sum of their individual probabilities. In this sense, probabilities (and more
generally, measures) behave like the familiar notions of area or volume: the area
or volume of a countable union of disjoint sets is the sum of their individual areas
or volumes. Indeed, a measure is to be understood as some generalized notion
of a volume. In this light, allowing the measure µ(A) of a set to be infinite is
natural, since one can easily think of sets with infinite volume.
The properties of probability measures that are required by Definition 3 are
often called the axioms of probability theory. Starting from these axioms, many
other properties can be derived, as in the next proposition.
Proposition 2. Probability measures have the following properties.
(a) (Finite additivity) If the events A1, . . . , An are disjoint, then P(∪_{i=1}^n Ai) = Σ_{i=1}^n P(Ai).
(b) For any event A, we have P(A^c) = 1 − P(A).
(c) If the events A and B satisfy A ⊂ B, then P(A) ≤ P(B).
(d) (Union bound) For any sequence {Ai} of events, we have
P(∪_{i=1}^∞ Ai) ≤ Σ_{i=1}^∞ P(Ai).
(e) (Inclusion-exclusion formula) For any n and any events A1, . . . , An,
P(∪_{i=1}^n Ai) = Σ_i P(Ai) − Σ_{(i,j): i<j} P(Ai ∩ Aj) + Σ_{(i,j,k): i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n−1} P(A1 ∩ · · · ∩ An).
Proof.
(a) This property is almost identical to condition (b) in the definition of a measure, except that it deals with a finite instead of a countably infinite collection of events. Given a finite collection of disjoint events A1, . . . , An, let us define Ak = Ø for k > n, to obtain an infinite sequence of disjoint events. Then,
P(∪_{i=1}^n Ai) = P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai) = Σ_{i=1}^n P(Ai).
Countable additivity was used in the second equality, and the fact P(Ø) = 0 was used in the last equality.
(b) The events A and A^c are disjoint. Using part (a), we have P(A ∪ A^c) = P(A) + P(A^c). But A ∪ A^c = Ω, whose measure is equal to one, and the result follows.
(c) The events A and B \ A are disjoint. Also, A ∪ (B \ A) = B. Therefore, using also part (a), we obtain P(A) ≤ P(A) + P(B \ A) = P(B).
(d) Left as an exercise.
(e) Left as an exercise; a simple proof will be provided later, using random
variables.
Part (e) of Proposition 2 admits a rather simple proof that relies on random variables and their expectations, a topic to be visited later on. For the special case where n = 2, the formula simplifies to
P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2).
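This special case is easy to check numerically; a small Python sketch, using a fair-die example of our own choosing:

```python
# Sketch: verifying P(A ∪ B) = P(A) + P(B) - P(A ∩ B) for the uniform
# measure on a fair six-sided die.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(E):
    return Fraction(len(E & omega), len(omega))

A, B = {1, 2, 3}, {3, 4}
assert P(A | B) == P(A) + P(B) - P(A & B)  # 2/3 == 1/2 + 1/3 - 1/6
print(P(A | B))  # 2/3
```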
Let us note that properties (a), (c), and (d) in Proposition 2 are also valid for general measures (the proof is the same). Let us also note that for a probability measure, the property P(Ø) = 0 need not be assumed, but can be derived from the other properties. Indeed, consider a sequence of sets Ai, each of which is equal to the empty set. These sets are disjoint, since Ø ∩ Ø = Ø. Applying the countable additivity property, we obtain Σ_{i=1}^∞ P(Ø) = P(Ø) ≤ P(Ω) = 1, which can only hold if P(Ø) = 0.
Finite Additivity
Our definitions of σ-fields and of probability measures involve countable unions and a countable additivity property. A different mathematical structure is obtained if we replace countable unions and sums by finite ones. This leads us to the following definitions. A field is a collection F of subsets of Ω with the following properties:
(i) Ø ∈ F.
(ii) If A ∈ F, then A^c ∈ F.
(iii) If A ∈ F and B ∈ F, then A ∪ B ∈ F.
Similarly, a function P defined on a field F is said to be finitely additive if, whenever the events A, B ∈ F are disjoint, we have P(A ∪ B) = P(A) + P(B).
We note that finite additivity, for the case of two events, easily implies finite additivity for a general finite number n of events, namely, the property in part (a) of Proposition 2. To see this, note that finite additivity for n = 2 allows us to write, for the case of three disjoint events,
P(A1 ∪ A2 ∪ A3) = P(A1 ∪ A2) + P(A3) = P(A1) + P(A2) + P(A3),
and the general case follows by induction.
6 CONTINUITY OF PROBABILITIES

Theorem 1. Let F be a σ-field, and suppose that P : F → [0, 1] satisfies P(Ω) = 1 and is finitely additive. Then, the following are equivalent:
(a) P is countably additive (and hence a probability measure).
(b) If {Ai} is an increasing sequence of sets in F (i.e., Ai ⊂ Ai+1 for all i), and A = ∪_{i=1}^∞ Ai, then lim_{n→∞} P(An) = P(A).
(c) If {Ai} is a decreasing sequence of sets in F (i.e., Ai+1 ⊂ Ai for all i), and A = ∩_{i=1}^∞ Ai, then lim_{n→∞} P(An) = P(A).
(d) If {Ai} is a decreasing sequence of sets in F and ∩_{i=1}^∞ Ai is empty, then lim_{n→∞} P(An) = 0.

Proof. We first assume that (a) holds and establish (b). Observe that A = A1 ∪ (A2 \ A1) ∪ (A3 \ A2) ∪ . . ., and that the events A1, (A2 \ A1), (A3 \ A2), . . . are disjoint (check this). Therefore, using countable additivity,
P(A) = P(A1) + Σ_{i=2}^∞ P(Ai \ Ai−1)
     = P(A1) + lim_{n→∞} Σ_{i=2}^n P(Ai \ Ai−1)
     = P(A1) + lim_{n→∞} Σ_{i=2}^n (P(Ai) − P(Ai−1))
     = P(A1) + lim_{n→∞} (P(An) − P(A1))
     = lim_{n→∞} P(An).
Suppose now that property (b) holds, let {Ai} be a decreasing sequence of sets, and let A = ∩_{i=1}^∞ Ai. Then, the sequence {Ai^c} is increasing, and De Morgan's law, together with property (b), imply that
P(A^c) = P((∩_{i=1}^∞ Ai)^c) = P(∪_{i=1}^∞ Ai^c) = lim_{n→∞} P(An^c),
and therefore
P(A) = 1 − P(A^c) = 1 − lim_{n→∞} P(An^c) = lim_{n→∞} P(An),
which establishes property (c).
Property (d) follows from property (c), because (d) is just the special case of
(c) in which the set A is empty.
To complete the proof, we now assume that property (d) holds and establish that property (a) holds as well. Let Bi ∈ F be disjoint events. Let An = ∪_{i=n}^∞ Bi. Note that {An} is a decreasing sequence of events. We claim that ∩_{n=1}^∞ An = Ø. Indeed, if ω ∈ A1, then ω ∈ Bn for some n, which implies that ω ∉ ∪_{i=n+1}^∞ Bi = An+1. Therefore, no element of A1 can belong to all of the sets An, which means that the intersection of the sets An is empty. Property (d) then implies that lim_{n→∞} P(An) = 0.
Applying finite additivity to the n disjoint sets B1, B2, . . . , Bn−1, ∪_{i=n}^∞ Bi, we have
P(∪_{i=1}^∞ Bi) = Σ_{i=1}^{n−1} P(Bi) + P(∪_{i=n}^∞ Bi).
Since P(∪_{i=n}^∞ Bi) = P(An) → 0 as n → ∞, taking the limit yields
P(∪_{i=1}^∞ Bi) = Σ_{i=1}^∞ P(Bi),
which establishes countable additivity.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 2 9/8/2008
The following are two fundamental probabilistic models that can serve as
building blocks for more complex models:
(a) The uniform distribution on [0, 1], which assigns probability b − a to every
interval [a, b] ⊂ [0, 1].
(b) A model of an infinite sequence of fair coin tosses that assigns equal probability, 1/2^n, to every possible sequence of length n.
These two models are often encountered in elementary probability and used
without further discussion. Strictly speaking, however, we need to make sure
that these two models are well-posed, that is, consistent with the axioms of
probability. To this effect, we need to define appropriate σ-fields and probability
measures on the corresponding sample spaces. In what follows, we describe the
required construction, while omitting the proofs of the more technical steps.
1 CARATHÉODORY'S EXTENSION THEOREM
The general outline of the construction we will use is as follows. We are interested in defining a probability measure with certain properties on a given measurable space (Ω, F). We consider a smaller collection, F0 ⊂ F, of subsets of Ω, which is a field, and on which the desired probabilities are easy to define.
(Recall that a field is a collection of subsets of the sample space that includes the empty set, and which is closed under taking complements and under finite unions.) Furthermore, we make sure that F0 is rich enough, so that the σ-field it generates is the same as the desired σ-field F. We then extend the definition of the probability measure from F0 to the entire σ-field F. This is possible, under a few conditions, by virtue of the following fundamental result from measure theory.

Theorem 1 (Carathéodory's extension theorem). Let F0 be a field of subsets of a sample space Ω, and let F = σ(F0) be the σ-field that it generates. Suppose that P0 is a mapping from F0 to [0, 1] that satisfies P0(Ω) = 1, as well as countable additivity on F0. Then, P0 can be extended uniquely to a probability measure on (Ω, F); that is, there exists a unique probability measure P on (Ω, F) such that P(A) = P0(A) for all A ∈ F0.
Remarks:
(a) The proof of the extension theorem is fairly long and technical; see, e.g., Appendix A of [Williams].
(b) The main hurdle in applying the extension theorem is the verification of the countable additivity property of P0 on F0; that is, one needs to show that if {Ai} is a sequence of disjoint sets in F0, and if ∪_{i=1}^∞ Ai ∈ F0, then P0(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P0(Ai). Alternatively, in the spirit of Theorem 1 from Lecture 1, it suffices to verify that if {Ai} is a decreasing sequence of sets in F0 and if ∩_{i=1}^∞ Ai is empty, then lim_{n→∞} P0(An) = 0. Indeed, while Theorem 1 of Lecture 1 was stated for the case where F is a σ-field, an inspection of its proof indicates that it remains valid even if F is replaced by a field F0.
In the next two sections, we consider the two models of interest. We define
appropriate fields, define probabilities for the events in those fields, and then use
the extension theorem to obtain a probability measure.
2 LEBESGUE MEASURE ON [0, 1]

In this section, we construct the uniform probability measure on [0, 1], also known as Lebesgue measure. Under the Lebesgue measure, the measure assigned to any subset of [0, 1] is meant to be equal to its length. While the definition of length is immediate for simple sets (e.g., the set [a, b] has length b − a), more general sets present more of a challenge.
We start by considering the sample space Ω = (0, 1], which is slightly more convenient than the sample space [0, 1]. Let C be the collection of all closed intervals [a, b] ⊂ (0, 1], and let B = σ(C) be the Borel σ-field on (0, 1]. Let also F0 consist of the empty set and of all finite unions of intervals of the form (a, b] ⊂ (0, 1].

Proposition 1. We have σ(F0) = B = σ(C).
Proof. We have already argued that every interval of the form (a, b] is a Borel set. Hence, a typical element of F0 (a finite union of such intervals) is also a Borel set. Therefore, F0 ⊂ B, which implies that σ(F0) ⊂ σ(B) = B. (The last equality holds because B is already a σ-field and is therefore equal to the smallest σ-field that contains B.)
Note that for a > 0, we have [a, b] = ∩_{n=1}^∞ (a − 1/n, b]. Since (a − 1/n, b] ∈ F0 ⊂ σ(F0), it follows that [a, b] ∈ σ(F0). Thus, C ⊂ σ(F0), which implies that
B = σ(C) ⊂ σ(σ(F0)) = σ(F0) ⊂ B.
(The second equality holds because the smallest σ-field containing σ(F0) is σ(F0) itself.) The first equality in the statement of the proposition follows. Finally, the equality σ(C) = B is just the definition of B.
Lemma 2. F0 is a field, but not a σ-field.
Proof.
(a) By definition, Ø ∈ F0. Note that Ø^c = (0, 1] ∈ F0. More generally, if A is of the form A = (a1, b1] ∪ · · · ∪ (an, bn], with the intervals disjoint and ordered, its complement is (0, a1] ∪ (b1, a2] ∪ · · · ∪ (bn, 1], which is also in F0. Furthermore, the union of two sets that are unions of finitely many intervals of the form (a, b] is also a union of finitely many such intervals. For example, if A = (1/8, 2/8] ∪ (4/8, 7/8] and B = (3/8, 5/8], then A ∪ B = (1/8, 2/8] ∪ (3/8, 7/8].
(b) To see that F0 is not a σ-field, note that (0, n/(n + 1)] ∈ F0 , for every
n ∈ N, but the union of these sets, which is (0, 1), does not belong to
F0 .
For a set A ∈ F0 of the form
A = (a1, b1] ∪ · · · ∪ (an, bn],
with the intervals (ai, bi] disjoint, we define P0(A) = Σ_{i=1}^n (bi − ai). It can be verified that P0 is countably additive on F0; this is the more technical part of the construction, and we omit the proof. Carathéodory's extension theorem then implies that there exists a unique probability measure P, defined on the entire Borel σ-field B, that agrees with P0 on F0. In particular, P((a, b]) = b − a, for every interval (a, b] ⊂ (0, 1].
By augmenting the sample space Ω to include 0, and assigning zero probability to it, we obtain a new probability model with sample space Ω = [0, 1]. (Exercise: define formally the σ-field on [0, 1], starting from the σ-field on (0, 1].)
Exercise 1. Let A be the set of irrational numbers in [0, 1]. Show that P(A) = 1.
Example. Let A be the set of points in [0, 1] whose decimal representation contains
only odd digits. (We disallow decimal representations that end with an infinite string of
nines. Under this condition, every number has a unique decimal representation.) What
is the Lebesgue measure of this set?
Observe that A = ∩_{n=1}^∞ An, where An is the set of points whose first n digits are all odd. It can be checked that An is a union of 5^n intervals, each with length 1/10^n. Thus, P(An) = 5^n/10^n = 1/2^n. Since A ⊂ An, we obtain P(A) ≤ P(An) = 1/2^n. Since this is true for every n, we conclude that P(A) = 0.
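The value P(An) = 1/2^n can also be checked by simulation. A small Monte Carlo sketch (illustrative only; the estimate is random, not exact):

```python
# Sketch: Monte Carlo estimate of P(A_n), where A_n is the set of points
# in [0, 1] whose first n decimal digits are all odd.
import random

def first_n_digits_odd(x, n):
    for _ in range(n):
        x *= 10
        digit = int(x)
        if digit % 2 == 0:   # an even digit disqualifies x
            return False
        x -= digit
    return True

random.seed(0)
n, trials = 3, 200_000
hits = sum(first_n_digits_odd(random.random(), n) for _ in range(trials))
print(hits / trials, 0.5 ** n)  # both close to 1/8
```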
Exercise 2. Let A be the set of points in [0, 1] whose decimal representation contains
at least one digit equal to 9. Find the Lebesgue measure of that set.
Note that there is nothing special about the interval (0, 1]. For example, if we let Ω = (c, d], where c < d, and if (a, b] ⊂ (c, d], we can define P0((a, b]) = (b − a)/(d − c) and proceed as above to obtain a uniform probability measure on the set (c, d], as well as on the set [c, d].
On the other hand, a “uniform” probability measure on the entire real line,
R, that assigns equal probability to intervals of equal length, is incompatible
with the requirement P(Ω) = 1. What we obtain instead, in the next section, is
a notion of length which becomes infinite for certain sets.
The Borel σ-field B of the real line R can be defined in several equivalent ways:
(a) Let C be the collection of all intervals of the form [a, b], and let B = σ(C) be the σ-field that it generates.
(b) Let D be the collection of all intervals of the form (−∞, b], and let B = σ(D) be the σ-field that it generates.
(c) For any n, we define the Borel σ-field of (n, n + 1] as the σ-field generated by sets of the form [a, b] ⊂ (n, n + 1]. We then say that A is a Borel subset of R if A ∩ (n, n + 1] is a Borel subset of (n, n + 1], for every n.
For every integer n, let Pn be the uniform probability measure on (n, n + 1], constructed as above, and for every Borel set A ⊂ R define µ(A) = Σ_{n=−∞}^{∞} Pn(A ∩ (n, n + 1]). It turns out that µ is a measure on (R, B), called again Lebesgue measure. However, it is not a probability measure because µ(R) = ∞.
Exercise 4. Show that µ is a measure on (R, B). Hint: Use the countable additivity of the measures Pn to establish the countable additivity of µ. You can also use the fact that if the numbers aij are nonnegative, then Σ_{i=1}^∞ Σ_{j=1}^∞ aij = Σ_{j=1}^∞ Σ_{i=1}^∞ aij.
3 THE INFINITE COIN-TOSS MODEL

Consider now the sample space Ω = {0, 1}^∞ of infinite sequences of coin tosses. For a given n, and a given B ⊂ {0, 1}^n, we can express the event A ⊂ {0, 1}^∞ that the first n tosses form an element of B in the form A = B × {0, 1}^∞. (This is simply saying that any sequence in A can be viewed as a pair consisting of an n-long sequence that belongs to B, followed by an arbitrary infinite sequence.) Let Fn be the collection of all events of this form, as B ranges over the subsets of {0, 1}^n. It is easily verified that Fn is a σ-field.
Exercise 5. Provide a formal proof that Fn is a σ-field.
The σ-field Fn, for any fixed n, is too small; it can only serve to model the first n coin tosses. We are interested instead in sets that belong to Fn for arbitrary n, and this leads us to define F0 = ∪_{n=1}^∞ Fn, the collection of sets that belong to Fn for some n. Intuitively, A ∈ F0 if the occurrence or nonoccurrence of A can be decided after a fixed number of coin tosses.
Example. Let An = {ω | ωn = 1}, the event that the nth toss results in a “1”. Note that An ∈ Fn. Let A = ∪_{n=1}^∞ An, which is the event that there is at least one “1” in the infinite toss sequence. The event A does not belong to Fn, for any n. (Intuitively, having observed a sequence of n zeroes does not allow us to decide whether there will be a subsequent “1” or not.) Consider also the complement of A, which is the event that the outcome of the experiment is an infinite string of zeroes. Once more, we see that A^c does not belong to F0.
The preceding example shows that F0 is not a σ-field. On the other hand, it
can be verified that F0 is a field.
Exercise 6. Prove that F0 is a field.
On F0, we define P0(A) = |B|/2^n whenever A = B × {0, 1}^∞ for some B ⊂ {0, 1}^n (this assignment is consistent across the different possible choices of n), so that every particular sequence of n tosses has probability 1/2^n. After verifying that P0 is countably additive on F0, the extension theorem yields a probability measure P on the σ-field F = σ(F0) generated by F0.
Similar to the case of Borel sets in [0, 1], there exist subsets of {0, 1}^∞ that do not belong to F. In fact, the similarities between the models of Sections 2 and 3 are much deeper; the two models are essentially equivalent, although we will not elaborate on the meaning of this. Let us only say that the equivalence relies on the one-to-one correspondence of the sets [0, 1] and {0, 1}^∞ obtained through the binary representation of real numbers. Intuitively, generating a real number at random, according to the uniform distribution (Lebesgue measure) on [0, 1], is probabilistically equivalent to generating each bit in its binary expansion at random.
4 COMPLETION OF A PROBABILITY SPACE

Starting with a field F0 and a countably additive function P0 on that field, the Extension Theorem leads to a measure on the smallest σ-field containing F0.
Can we extend the measure further, to a larger σ-field? If so, is the extension
unique, or will there have to be some arbitrary choices? We describe here a
generic extension that assigns probabilities to certain additional sets A for which
there is little choice.
Consider a probability space (Ω, F, P). Suppose that B ∈ F, and P(B) =
0. Any set B with this property is called a null set. (Note that in this context,
“null” is not the same as “empty.”) Suppose now that A ⊂ B. If the set A is not
in F, it is not assigned a probability; were it to be assigned one, the only choice
that would not lead to a contradiction is a value of zero.
The first step is to augment the σ-field F so that it includes all subsets of
null sets. This is accomplished as follows:
(a) Let N be the collection of all subsets of null sets;
(b) Define F ∗ = σ(F ∪ N ), the smallest σ-field that contains F as well as all
subsets of null sets.
(c) Extend P in a natural manner to obtain a new probability measure P∗ on
(Ω, F ∗ ). In particular, we let P∗ (A) = 0 for every subset A ⊂ B of every
null set B ∈ F. It turns out that such an extension is always possible and
unique.
The resulting probability space is said to be complete. It has the property that
all subsets of null sets are included in the σ-field and are also null sets.
When Ω = [0, 1] (or Ω = R), F is the Borel σ-field, and P is Lebesgue
measure, we obtain an augmented σ-field F ∗ and a measure P∗ . The sets in F ∗
are called Lebesgue measurable sets. The new measure P∗ is referred to by the
same name as the measure P (“Lebesgue measure”).
5 FURTHER REMARKS
We record here a few interesting facts related to Borel σ-fields and the Lebesgue
measure. Their proofs tend to be fairly involved.
(a) There exist sets that are Lebesgue measurable but not Borel measurable,
i.e., F is a proper subset of F ∗ .
(b) There are as many Borel measurable sets as there are points on the real line (this is the “cardinality of the continuum”), but there are as many Lebesgue measurable sets as there are subsets of the real line (which is a higher cardinality) [Billingsley].
(c) There exist subsets of [0, 1] that are not Lebesgue measurable; see Section
6 and [Williams, p. 192].
(d) It is not possible to construct a probability space in which the σ-field includes all subsets of [0, 1], with the property that P({x}) = 0 for every x ∈ (0, 1] [Billingsley, pp. 45-46].
6 APPENDIX: A NONMEASURABLE SET

In this appendix, we provide some evidence that not every subset of (0, 1] is Lebesgue measurable, and, furthermore, that Lebesgue measure cannot be extended to a measure defined for all subsets of (0, 1].
Let “+” stand for addition modulo 1 in (0, 1]. For example, 0.5 + 0.7 = 0.2,
instead of 1.2. You may want to visualize (0, 1] as a circle that wraps around so
that after 1, one starts again at 0. If A ⊂ (0, 1], and x is a number, then A + x
stands for the set of all numbers of the form y + x where y ∈ A.
Define x and y to be equivalent if x + r = y for some rational number r.
Then, (0, 1] can be partitioned into equivalence classes. (That is, all elements
in the same equivalence class are equivalent, elements belonging to different equivalence classes are not equivalent, and every x ∈ (0, 1] belongs to exactly
one equivalence class.) Let us pick exactly one element from each equivalence
class, and let H be the set of the elements picked this way. (The fact that a set H can be legitimately formed this way involves the Axiom of Choice, a generally accepted axiom of set theory.) We will now consider the sets of the form H + r,
where r ranges over the rational numbers in (0, 1]. Note that there are countably
many such sets.
The sets H + r are disjoint. (Indeed, suppose that r1 ≠ r2, and that the two sets H + r1, H + r2 share the point h1 + r1 = h2 + r2, with h1, h2 ∈ H. Then h1 and h2 differ by a rational number and are equivalent. If h1 ≠ h2, this contradicts the construction of H, which contains exactly one element from each equivalence class. If h1 = h2, then r1 = r2, which is again a contradiction.) Therefore,
(0, 1] is the union of the countably many disjoint sets H + r.
The sets H + r, for different r, are “translations” of each other (they are
all formed by starting from the set H and adding a number, modulo 1). Let us
say that a measure is translation-invariant if it has the following property: if A
and A + x are measurable sets, then P(A) = P(A + x). Suppose that P is a translation-invariant probability measure, defined on all subsets of (0, 1]. Then,
1 = P((0, 1]) = Σ_r P(H + r) = Σ_r P(H),
where the sum is taken over all rational numbers r in (0, 1]. But this is impossible: the sum is zero if P(H) = 0, and infinite if P(H) > 0. We conclude that a translation-invariant measure, defined on all subsets of (0, 1], does not exist.
On the other hand, it can be verified that the Lebesgue measure is translation-
invariant on the Borel σ-field, as well as its extension, the Lebesgue σ-field. This
implies that the Lebesgue σ-field does not include all subsets of (0, 1].
An even stronger, and more counterintuitive, example is the following. It indicates that the ordinary notion of area or volume cannot be applied to arbitrary sets.
The Banach-Tarski Paradox. Let S be the two-dimensional surface of the unit sphere in three dimensions. There exists a subset F of S such that for any k ≥ 3,
S = (τ1 F) ∪ · · · ∪ (τk F),
where each τi is a rigid rotation and the sets τi F are disjoint. For example, S can be made up of three rotated copies of F (suggesting probability equal to 1/3), but also of four rotated copies of F (suggesting probability equal to 1/4).
Ordinary geometric intuition clearly fails when dealing with arbitrary sets.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 3 9/10/2008
Most of the material in this lecture is covered in [Bertsekas & Tsitsiklis] Sections 1.3-1.5 and Problem 48 (or problem 43, in the 1st edition), available at http://athenasc.com/Prob-2nd-Ch1.pdf. Solutions to the end of chapter problems are available at http://athenasc.com/prob-solved_2ndedition.pdf. These lecture notes provide some additional details and twists.
Contents
1. Conditional probability
2. Independence
3. The Borel-Cantelli lemma
1 CONDITIONAL PROBABILITY

Definition 1. Let (Ω, F, P) be a probability space, and let B be an event with P(B) > 0. For every event A ∈ F, the conditional probability of A given B is defined by
P(A | B) = P(A ∩ B) / P(B).
Conditional probabilities have the following properties.
(a) If B is an event with P(B) > 0, then P(Ω | B) = 1, and for any sequence {Ai} of disjoint events, we have
P(∪_{i=1}^∞ Ai | B) = Σ_{i=1}^∞ P(Ai | B).
(b) Suppose that B is an event with P(B) > 0. For every A ∈ F, define
PB (A) = P(A | B). Then, PB is a probability measure on (Ω, F).
(c) Let A be an event. If the events Bi, i ∈ N, form a partition of Ω, and P(Bi) > 0 for every i, then
P(A) = Σ_{i=1}^∞ P(A | Bi)P(Bi).
(d) (Bayes’ rule) Let A be an event with P(A) > 0. If the events Bi, i ∈ N, form a partition of Ω, and P(Bi) > 0 for every i, then, for every i,
P(Bi | A) = P(A | Bi)P(Bi) / P(A) = P(A | Bi)P(Bi) / Σ_{j=1}^∞ P(A | Bj)P(Bj).
Proof.
(a) We have
P(∪_{i=1}^∞ Ai | B) = P(B ∩ (∪_{i=1}^∞ Ai)) / P(B) = P(∪_{i=1}^∞ (B ∩ Ai)) / P(B).
Since the sets B ∩ Ai, i ∈ N, are disjoint, countable additivity, applied to the right-hand side, yields
P(∪_{i=1}^∞ Ai | B) = Σ_{i=1}^∞ P(B ∩ Ai) / P(B) = Σ_{i=1}^∞ P(Ai | B),
as claimed.
(b) This is immediate from part (a).
(c) We have
P(A) = P(A ∩ Ω) = P(A ∩ (∪_{i=1}^∞ Bi)) = P(∪_{i=1}^∞ (A ∩ Bi)) = Σ_{i=1}^∞ P(A ∩ Bi) = Σ_{i=1}^∞ P(A | Bi)P(Bi).
In the second equality, we used the fact that the sets Bi form a partition of
Ω. In the next to last equality, we used the fact that the sets Bi are disjoint
and countable additivity.
(d) This follows from the fact that P(Bi | A)P(A) = P(A ∩ Bi) = P(A | Bi)P(Bi), together with part (c), which expresses P(A) as Σ_{j=1}^∞ P(A | Bj)P(Bj).
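As a numerical illustration of parts (c) and (d), here is a short Python sketch; the two-set partition and the numbers are made-up values for a noisy diagnostic test.

```python
# Sketch: total probability (part (c)) and Bayes' rule (part (d)) for a
# partition {B1, B2}; hypothetical rates for a diagnostic test.
P_B = [0.01, 0.99]           # P(B1), P(B2)
P_A_given_B = [0.95, 0.05]   # P(A | B1), P(A | B2)

P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))  # part (c)
P_B1_given_A = P_A_given_B[0] * P_B[0] / P_A            # part (d)
print(P_A)           # 0.059
print(P_B1_given_A)  # approximately 0.161
```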
2 INDEPENDENCE
Definition 2. Let (Ω, F, P) be a probability space.
(a) Two events, A and B, are said to be independent if P(A ∩ B) = P(A)P(B).
(b) Let S be an index set (possibly infinite, or even uncountable), and let {As | s ∈ S} be a family (set) of events. The events in this family are said to be independent if for every finite subset S0 of S, we have
P(∩_{s∈S0} As) = Π_{s∈S0} P(As).
(c) Two σ-fields, F1 and F2, contained in F, are said to be independent if any two events A1 ∈ F1 and A2 ∈ F2 are independent.
(d) More generally, let S be an index set, and for every s ∈ S, let Fs be a σ-field contained in F. We say that the σ-fields Fs are independent if the following holds. If we pick one event As from each Fs, the events in the resulting family {As | s ∈ S} are independent.
Example. Consider an infinite sequence of fair coin tosses, under the model constructed
in the Lecture 2 notes. The following statements are intuitively obvious (although a
formal proof would require a few steps).
(a) Let Ai be the event that the ith toss resulted in a “1”. If i ≠ j, the events Ai and Aj are independent.
(b) The events in the (infinite) family {Ai | i ∈ N} are independent. This statement
captures the intuitive idea of “independent” coin tosses.
(c) Let F1 (respectively, F2) be the collection of all events whose occurrence can be decided by looking at the results of the coin tosses at odd (respectively, even) times n. More formally, let Hi be the event that the ith toss resulted in a 1, let C be the collection of events C = {Hi | i is odd}, and finally let F1 = σ(C), so that F1 is the smallest σ-field that contains all the events Hi, for odd i. We define F2 similarly, using even times instead of odd times. Then, the two σ-fields F1 and F2 turn out to be independent. This statement captures the intuitive idea that knowing the results of the tosses at odd times provides no information on the results of the tosses at even times.
(d) Let Fn be the collection of all events whose occurrence can be decided by looking
at the results of tosses 2n and 2n + 1. (Note that each Fn is a σ-field comprised of
finitely many events.) Then, the families Fn , n ∈ N, are independent.
Remark: How can one establish that two σ-fields (e.g., as in the above coin
tossing example) are independent? It turns out that one only needs to check
independence for smaller collections of sets; see the theorem below (the proof
is omitted and can be found in p. 39 of [W]).
Theorem: If G1 and G2 are two collections of measurable sets that are closed under intersection (that is, if A, B ∈ Gi, then A ∩ B ∈ Gi), if Fi = σ(Gi), i = 1, 2, and if P(A ∩ B) = P(A)P(B) for every A ∈ G1 and B ∈ G2, then F1 and F2 are independent.
3 THE BOREL-CANTELLI LEMMA

The Borel-Cantelli lemma is a tool that is often used to establish that a certain event has probability zero or one. Given a sequence of events An, n ∈ N, recall that {An i.o.} (read as “An occurs infinitely often”) is the event consisting of all ω ∈ Ω that belong to infinitely many An, and that
{An i.o.} = lim sup_{n→∞} An = ∩_{n=1}^∞ ∪_{i=n}^∞ Ai.
Theorem 2 (Borel-Cantelli lemma). Let {An} be a sequence of events, and let A = {An i.o.}.
(a) If Σ_{n=1}^∞ P(An) < ∞, then P(A) = 0.
(b) If Σ_{n=1}^∞ P(An) = ∞ and the events An, n ∈ N, are independent, then P(A) = 1.
Remark: The result in part (b) is not true without the independence assumption. Indeed, consider an arbitrary event C such that 0 < P(C) < 1 and let An = C for all n. Then P({An i.o.}) = P(C) < 1, even though Σ_n P(An) = ∞.
The following lemma is useful here and in many other contexts.

Lemma 1. Suppose that 0 ≤ pi ≤ 1 for every i ∈ N, and that Σ_{i=1}^∞ pi = ∞. Then, Π_{i=1}^∞ (1 − pi) = 0.
Proof. Note that log(1 − x) is a concave function of its argument, and its derivative at x = 0 is −1. It follows that log(1 − x) ≤ −x, for x ∈ [0, 1]. We then have, for every k,
log Π_{i=1}^∞ (1 − pi) = log lim_{n→∞} Π_{i=1}^n (1 − pi)
                       ≤ log Π_{i=1}^k (1 − pi)
                       = Σ_{i=1}^k log(1 − pi)
                       ≤ − Σ_{i=1}^k pi.
(The inequality in the second line holds because each factor 1 − pi lies in [0, 1], so the partial products are nonincreasing in n.) This is true for every k. By taking the limit as k → ∞, we obtain log Π_{i=1}^∞ (1 − pi) = −∞, and Π_{i=1}^∞ (1 − pi) = 0.
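Lemma 1 is easy to see numerically. With pi = 1/(i + 1), for which Σ pi diverges, the partial products telescope to 1/(n + 1); a short illustrative sketch:

```python
# Sketch: partial products of (1 - p_i) with p_i = 1/(i+1); here
# sum p_i = infinity and (1/2)(2/3)...(n/(n+1)) = 1/(n+1) -> 0.
prod = 1.0
for i in range(1, 10_001):
    prod *= 1.0 - 1.0 / (i + 1)
print(prod)  # 1/10001, approximately 1e-4
```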
Proof of Theorem 2.
(a) The assumption Σ_{n=1}^∞ P(An) < ∞ implies that lim_{n→∞} Σ_{i=n}^∞ P(Ai) = 0. Note that for every n, we have A ⊂ ∪_{i=n}^∞ Ai. Then, the union bound implies that
P(A) ≤ P(∪_{i=n}^∞ Ai) ≤ Σ_{i=n}^∞ P(Ai).
We take the limit of both sides as n → ∞. Since the right-hand side converges to zero, P(A) must be equal to zero.
(b) Let Bn = ∪_{i=n}^∞ Ai, and note that A = ∩_{n=1}^∞ Bn. We claim that P(Bn^c) = 0, for every n. This will imply the desired result because
P(A^c) = P(∪_{n=1}^∞ Bn^c) ≤ Σ_{n=1}^∞ P(Bn^c) = 0.
To establish the claim, note that
P(Bn^c) = P(∩_{i=n}^∞ Ai^c) = lim_{k→∞} P(∩_{i=n}^k Ai^c) = lim_{k→∞} Π_{i=n}^k (1 − P(Ai)) = Π_{i=n}^∞ (1 − P(Ai)) = 0,
where the second equality made use of the continuity property of probability measures, the third of the independence of the events Ai (and hence of their complements), and the last of Lemma 1, together with the assumption Σ_{i=1}^∞ P(Ai) = ∞.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 4 9/15/2008
1 COUNTING
Readings: [Bertsekas & Tsitsiklis], Section 1.6, and solved problems 57-58 (in
1st edition) or problems 61-62 (in 2nd edition). These notes only cover the part
of the lecture that is not covered in [BT].
Example (the two-pocket problem). A mathematician carries two pockets, each initially containing n matches. Every time a match is needed, a pocket is chosen at random, with equal probability, independently of the past. We find the probability that, at the moment the mathematician first reaches into a pocket and finds it empty, the other pocket contains exactly k matches. This event can happen in two ways:
(a) In the first 2n − k times, the mathematician reached n times into the right pocket, n − k times into the left pocket, and then, at time 2n − k + 1, into the right pocket.
(b) In the first 2n − k times, the mathematician reached n times into the left pocket, n − k times into the right pocket, and then, at time 2n − k + 1, into the left pocket.
Scenario (a) has probability C(2n − k, n) · (1/2)^{2n−k} · (1/2), where C(2n − k, n) is the binomial coefficient counting which n of the first 2n − k reaches go into the right pocket. Scenario (b) has the same probability. Thus, the overall probability is
C(2n − k, n) · (1/2)^{2n−k}.
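This formula can be sanity-checked by simulating the pocket-selection process directly; a sketch (the function name and parameters are ours):

```python
# Sketch: simulating the two-pocket process and comparing the empirical
# frequency with C(2n - k, n) / 2^(2n - k).
import random
from math import comb

def matches_left_in_other_pocket(n):
    pockets = [n, n]
    while True:
        i = random.randrange(2)      # pick a pocket uniformly at random
        if pockets[i] == 0:          # first time a pocket is found empty
            return pockets[1 - i]
        pockets[i] -= 1

random.seed(1)
n, k, trials = 5, 2, 200_000
freq = sum(matches_left_in_other_pocket(n) == k
           for _ in range(trials)) / trials
exact = comb(2 * n - k, n) / 2 ** (2 * n - k)
print(freq, exact)  # both approximately 0.21875
```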
2 MULTINOMIAL PROBABILITIES
Suppose that we run n independent trials of an experiment, where each trial results in one of r possible outcomes a1, . . . , ar, with respective probabilities p1, . . . , pr. What is the probability that in n trials there were exactly n1 results equal to a1, n2 results equal to a2, etc., where the ni are given nonnegative integers that add to n?
Solution: Note that every possible outcome (n-long sequence of results) that involves ni results equal to ai, for all i, has the same probability, p1^{n1} · · · pr^{nr}. How many such sequences are there? Any such sequence corresponds to a partition of the set {1, . . . , n} of trials into subsets of sizes n1, . . . , nr: the ith subset, of size ni, indicates the trials at which the result was equal to ai. Thus, using the formula for the number of partitions, the desired probability is equal to
(n! / (n1! · · · nr!)) · p1^{n1} · · · pr^{nr}.
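The formula translates directly into code; a short sketch using only the standard library (the counts and probabilities below are arbitrary examples):

```python
# Sketch: multinomial probability n!/(n1! ... nr!) * p1^n1 * ... * pr^nr.
from math import factorial, prod

def multinomial_prob(counts, probs):
    n = sum(counts)
    coeff = factorial(n)
    for ni in counts:
        coeff //= factorial(ni)   # exact integer division at each step
    return coeff * prod(p ** ni for p, ni in zip(probs, counts))

# n = 10 trials with r = 3 outcomes of probabilities 0.2, 0.3, 0.5:
print(multinomial_prob([2, 3, 5], [0.2, 0.3, 0.5]))  # about 0.085
```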
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 5 9/17/2008
RANDOM VARIABLES
Informally, a random variable on a probability space (Ω, F, P) is a function X : Ω → R such that every set of the form {ω | X(ω) ≤ c} belongs to the σ-field F; a precise definition is given in Section 1.3 below. As an example, consider n tosses of a coin, with Ω = {0, 1}^n, where ωi = 1 when the ith toss results in heads. The number of heads is the random variable
X(ω1, . . . , ωn) = ω1 + · · · + ωn.
With this definition, the set {ω | X(ω) < 4} is just the event that there were fewer than 4 heads overall, belongs to the σ-field F, and therefore has a well-defined probability.
Consider the real line, and let B be the associated Borel σ-field. Sometimes, we will also allow random variables that take values in the extended real line, R̄ = R ∪ {−∞, ∞}. We define the Borel σ-field on R̄, also denoted by B, as the smallest σ-field that contains all Borel subsets of R and the sets {−∞} and {∞}.
Because the collection of intervals of the form (−∞, c] generates the Borel σ-field in R, it can be shown that if X is a random variable, then for any Borel set B, the set X^{−1}(B) is F-measurable. It follows that the probability P(X^{−1}(B)) = P({ω | X(ω) ∈ B}) is well-defined. It is often denoted by P(X ∈ B).
(a) For every Borel subset B of the real line (i.e., B ∈ B), we define PX (B) =
P(X ∈ B).
(b) The resulting function PX : B → [0, 1] is called the probability law of X.
The probability law PX of a random variable X is itself a probability measure, on the measurable space (R, B).
Proof: Clearly, PX(B) ≥ 0, for every Borel set B. Also, PX(R) = P(X ∈ R) = P(Ω) = 1. We now verify countable additivity. Let {Bi} be a countable sequence of disjoint Borel subsets of R. Note that the sets X^{−1}(Bi) are also disjoint, and that
X^{−1}(∪_{i=1}^∞ Bi) = ∪_{i=1}^∞ X^{−1}(Bi),
or, in different notation,
{X ∈ ∪_{i=1}^∞ Bi} = ∪_{i=1}^∞ {X ∈ Bi}.
Therefore, using the countable additivity of P, we obtain PX(∪_{i=1}^∞ Bi) = Σ_{i=1}^∞ PX(Bi).
1.3 Technical digression: measurable functions
The following generalizes the definition of a random variable.
According to the above definition, given a probability space (Ω, F, P), and taking into account the discussion in Section 1.1, a random variable on a probability space is a function X : Ω → R that is (F, B)-measurable.
As a general rule, functions constructed from other measurable functions
using certain simple operations are measurable. We collect, without proof, a
number of relevant facts below.
Another way that we can form a random variable is by taking the limit of
a sequence of random variables. Let us first introduce some terminology. Let
each fn be a function from some set Ω into R. Consider a new function f = inf_n fn, defined by f(ω) = inf_n fn(ω), for every ω ∈ Ω. The functions sup_n fn, lim inf_{n→∞} fn, and lim sup_{n→∞} fn are defined similarly. (Note that even if the fn are everywhere finite, the above defined functions may turn out to be extended-valued.) If the limit lim_{n→∞} fn(ω) exists for every ω, we say that the sequence of functions {fn} converges pointwise, and define its pointwise limit to be the function f defined by f(ω) = lim_{n→∞} fn(ω). For example, suppose that Ω = [0, 1] and that fn(ω) = ω^n. Then, the pointwise limit f = lim_{n→∞} fn exists, and satisfies f(1) = 1, and f(ω) = 0 for ω ∈ [0, 1).
The cumulative distribution function (CDF) of a random variable X is the function FX : R → [0, 1] defined by FX(x) = P(X ≤ x).

Example 4. Let X be the number of heads in two independent tosses of a fair coin. In particular, P(X = 0) = P(X = 2) = 1/4, and P(X = 1) = 1/2. Then,
FX(x) = 0, if x < 0;  1/4, if 0 ≤ x < 1;  3/4, if 1 ≤ x < 2;  1, if x ≥ 2.
Example 5. (A uniform random variable and its square) Consider a probability space (Ω, B, P), where Ω = [0, 1], B is the Borel σ-field, and P is the Lebesgue measure. The random variable U defined by U(ω) = ω is said to be uniformly distributed. Its CDF is given by
FU(x) = 0, if x < 0;  x, if 0 ≤ x < 1;  1, if x ≥ 1.
Consider now the random variable X = U². For x ∈ [0, 1), we have
FX(x) = P(U² ≤ x) = P(U ≤ √x) = √x.
Thus,
FX(x) = 0, if x < 0;  √x, if 0 ≤ x < 1;  1, if x ≥ 1.
In general, every CDF FX is nondecreasing, satisfies lim_{x→−∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1, and is right-continuous.
Proof:
(a) Suppose that x ≤ y. Then, {X ≤ x} ⊂ {X ≤ y}, which implies that FX(x) ≤ FX(y), so FX is nondecreasing.
(b) Since FX(x) is monotonic in x and bounded below by zero, it converges as x → −∞, and the limit is the same for every sequence {xn} converging to −∞. So, let xn = −n, and note that the events {X ≤ −n} decrease to the empty set. Using the continuity of probabilities, we obtain lim_{n→∞} FX(−n) = lim_{n→∞} P(X ≤ −n) = P(Ø) = 0. The limit as x → ∞ is handled similarly, using the fact that the events {X ≤ n} increase to Ω.
(c) For right-continuity, fix x and consider a decreasing sequence {xn} that converges to x. The events {X ≤ xn} decrease to {X ≤ x}, and the continuity of probabilities yields lim_{n→∞} FX(xn) = FX(x). Since this is true for every such sequence {xn}, we conclude that lim_{y↓x} FX(y) = FX(x).
Conversely, given a CDF F, we can construct a random variable X whose CDF is F. Suppose, for simplicity, that F is continuous and strictly increasing, so that the range of F is the entire interval (0, 1). Furthermore, F is invertible: for every y ∈ (0, 1), there exists a unique x, denoted F^{−1}(y), such that F(x) = y. Consider the probability space ((0, 1), B, P), where P is the Lebesgue measure. We define U(ω) = ω and X(ω) = F^{−1}(ω), for every ω ∈ (0, 1), so that X = F^{−1}(U). Note that F(F^{−1}(ω)) = ω for every ω ∈ (0, 1), so that F(X) = U. Since F is strictly increasing, we have X ≤ x if and only if F(X) ≤ F(x), or U ≤ F(x). (Note that this also establishes that the event {X ≤ x} is measurable, so that X is indeed a random variable.) Thus, for every x ∈ R, we have
FX(x) = P(X ≤ x) = P(F(X) ≤ F(x)) = P(U ≤ F(x)) = F(x),
as desired.
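This construction is also the basis of a standard simulation technique, often called inverse transform sampling. A minimal sketch, using the CDF F(x) = 1 − e^{−x} (an exponential distribution, chosen here only as an illustration):

```python
# Sketch: inverse transform sampling. If U is uniform on (0, 1), then
# X = F^{-1}(U) has CDF F; here F(x) = 1 - exp(-x), so F^{-1}(u) = -ln(1-u).
import math
import random

def sample_via_inverse_cdf():
    u = random.random()        # uniform on [0, 1)
    return -math.log(1.0 - u)  # F^{-1}(u)

random.seed(0)
xs = [sample_via_inverse_cdf() for _ in range(100_000)]
print(sum(xs) / len(xs))       # close to 1, the mean of this distribution
```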
Note that the probability law of X assigns probabilities to all Borel sets, whereas the CDF only specifies the probabilities of certain intervals. Nevertheless, the CDF contains enough information to recover the law of X: two random variables with the same CDF have the same probability law.
Proof: (Outline) Let F0 be the collection of all subsets of the real line that are unions of finitely many intervals of the form (a, b]. Then, F0 is a field. Note that the CDF FX can be used to completely determine the probability PX(A) of a set A ∈ F0. Indeed, this is done using relations such as
PX((a, b]) = FX(b) − FX(a).
Since F0 generates the Borel σ-field B, the uniqueness part of the extension theorem implies that PX is completely determined on B.
Discrete random variables take values in a countable set. We need some notation. Given a function f : Ω → R, its range is the set
f(Ω) = {x ∈ R | x = f(ω) for some ω ∈ Ω}.
Definition 5. (Discrete random variables and PMFs)
(a) A random variable X, defined on a probability space (Ω, F, P), is said to be
discrete if its range X(Ω) is countable.
(b) If X is a discrete random variable, the function pX : R → [0, 1] defined by
pX (x) = P(X = x), for every x, is called the (probability) mass function
of X, or PMF for short.
A random variable that takes only integer values is discrete. For instance, the
random variable in Example 4 (number of heads in two coin tosses) is discrete.
Also, every simple random variable is discrete, since it takes a finite number of
values. However, more complicated discrete random variables are also possible.
Example 6. Let the sample space be the set N of natural numbers, and consider a measure that satisfies P(n) = 1/2^n, for every n ∈ N. The random variable X defined by X(n) = n is discrete.
Suppose now that the rational numbers have been arranged in a sequence, and that
xn is the nth rational number, according to this sequence. Consider the random variable
Y defined by Y (n) = xn . The range of this random variable is countable, so Y is a
discrete random variable. Its range is the set of rational numbers, every rational number
has positive probability, and the set of irrational numbers has zero probability.
We close by noting that discrete random variables can be represented in terms of indicator functions. Indeed, given a discrete random variable X, with range {x1, x2, . . .}, we define An = {X = xn}, for every n ∈ N. Observe that each set An is measurable (why?). Furthermore, the sets An, n ∈ N, form a partition of the sample space. Using indicator functions, we can write
X(ω) = Σ_{n=1}^∞ xn I_{An}(ω).
Conversely, suppose we are given a sequence {An } of disjoint events, and a
real sequence {xn }. Define X : Ω → R by letting X(ω) = xn if and only if
ω ∈ An . Then X is a discrete random variable, and P(X = xn ) = P(An ), for
every n.
A random variable X is called continuous if its CDF can be written in the form
FX(x) = ∫_{−∞}^{x} f(t) dt,   for every x ∈ R,
for some nonnegative measurable function f. The function f is called a (probability) density function (or PDF, for short) for X, and must satisfy
∫_{−∞}^{∞} f(t) dt = 1.    (2)
Any nonnegative measurable function that satisfies Eq. (2) is called a density function. Conversely, given a density function f, we can define F(x) = ∫_{−∞}^{x} f(t) dt, and verify that F is a distribution function. It follows that given a density function, there always exists a random variable whose PDF is the given density.
If a CDF FX is differentiable at some x, the corresponding value fX (x) can
be found by taking the derivative of FX at that point. However, CDFs need not
be differentiable, so this will not always work. Let us also note that a PDF of
a continuous random variable is not uniquely defined. We can always change
the PDF at a finite set of points, without affecting its integral, hence multiple
PDFs can be associated to the same CDF. However, this nonuniqueness rarely
becomes an issue. In the sequel, we will often refer to “the PDF” of X, ignoring
the fact that it is nonunique.
Example 7. For a uniform random variable, we have FX (x) = P(X ≤ x) = x, for
every x ∈ (0, 1). By differentiating, we find fX (x) = 1, for x ∈ (0, 1). For x < 0
we have FX (x) = 0, and for x > 1 we have FX (x) = 1; in both cases, we obtain
fX (x) = 0. At x = 0, the CDF is not differentiable. We are free to define fX (0) to be
0, or 1, or in fact any real number; the value of the integral of fX will remain unaffected.
Using the PDF of a continuous random variable, we can calculate the probability of various subsets of the real line. For example, we have P(X = x) = 0, for all x, and if a < b,
P(a < X < b) = P(a ≤ X ≤ b) = ∫_{a}^{b} fX(t) dt.
(a) Consider a function f : R^m → R, and fix some x ∈ R^m. We say that f(y) converges to a value c, as y tends to x, if we have lim_{n→∞} f(xn) = c, for every sequence {xn} of elements of R^m such that xn ≠ x for all n, and lim_{n→∞} xn = x. In this case, we write lim_{y→x} f(y) = c.
Every number x ∈ [0, 1] admits a ternary expansion of the form
x = Σ_{i=1}^∞ xi 3^{−i},   xi ∈ {0, 1, 2}.    (4)
This expansion is not unique. For example, 1/3 admits two expansions, namely .10000 · · · and .022222 · · · . Nonuniqueness occurs only for those x that admit an expansion ending with an infinite sequence of 2s. The set of such unusual x is countable, and therefore has Lebesgue measure zero.
The Cantor set C is defined as the set of all x ∈ [0, 1] that have a ternary expansion that uses only 0s and 2s (no 1s allowed). The set C can be constructed as follows. Start with the interval [0, 1] and remove the “middle third” (1/3, 2/3). Then, from each of the remaining closed intervals, [0, 1/3] and [2/3, 1], remove their middle thirds, (1/9, 2/9) and (7/9, 8/9), resulting in four closed intervals, and continue this process indefinitely. Note that C is measurable, since it is constructed by removing a countable sequence of intervals. Also, the length (Lebesgue measure) of C is 0, since at each stage its length is multiplied by a factor of 2/3. On the other hand, the set C has the same cardinality as the set {0, 2}^∞, and is uncountable.
Consider now an infinite sequence of independent rolls of a 3-sided die, whose faces are labeled 0, 1, and 2. Assume that at each roll, each of the three possible results has the same probability, 1/3. If we use the sequence of these rolls to form a number x, then the probability law of the resulting random variable is the Lebesgue measure (i.e., picking a ternary expansion “at random” leads to a uniform random variable).
The Cantor set can be identified with the event consisting of all roll sequences in which a 1 never occurs. (This event has zero probability, which is consistent with the fact that C has zero Lebesgue measure.)
Consider now an infinite sequence of independent tosses of a fair coin. If the ith toss results in tails, record xi = 0; if it results in heads, record xi = 2. Use the xi's to form a number x, using Eq. (4). This defines a random variable X on ([0, 1], B), whose range is the set C. The probability law of this random variable is therefore concentrated on the “zero-length” set C. At the same time, P(X = x) = 0 for every x, because any particular sequence of heads and tails has zero probability. A measure with this property is called singular.
The random variable X that we have constructed here is neither discrete nor
continuous. Moreover, the CDF of X cannot be written as a mixture of the kind
considered in Eq. (3).
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 6 9/24/2008
Recall that a random variable X : Ω → R is called discrete if its range (i.e., the
set of values that it can take) is a countable set. The PMF of X is a function
pX : R → [0, 1], defined by pX (x) = P(X = x), and completely determines
the probability law of X.
The following are some important PMFs.
(a) Discrete uniform with parameters a and b, where a and b are integers with a < b. Here,
pX(k) = 1/(b − a + 1),   k = a, a + 1, . . . , b.
(b) Bernoulli with parameter p ∈ [0, 1]. Here,
pX(1) = p,   pX(0) = 1 − p.
(c) Binomial with parameters n ∈ N and p ∈ [0, 1]. Here,
pX(k) = C(n, k) p^k (1 − p)^{n−k},   k = 0, 1, . . . , n,
where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient. A binomial random variable with parameters n and p represents the number of heads observed in n independent tosses of a coin if the probability of heads at each toss is p.
(d) Geometric with parameter p, where 0 < p ≤ 1. Here,
pX(k) = (1 − p)^{k−1} p,   k = 1, 2, . . . .
(e) Poisson with parameter λ > 0. Here,
pX(k) = e^{−λ} λ^k / k!,   k = 0, 1, . . . .
As will be seen shortly, a Poisson random variable can be thought of as a limiting case of a binomial random variable. Note that this is a legitimate PMF (i.e., it sums to one), because of the series expansion of the exponential function, e^λ = Σ_{k=0}^∞ λ^k/k!.
Notation: Let us use the abbreviations dU(a, b), Ber(p), Bin(n, p), Geo(p), Pois(λ), and Pow(α) to refer to the above defined PMFs. We will use notation such as X =_d dU(a, b) or X ∼ dU(a, b) as a shorthand for the statement that X is a discrete random variable whose PMF is uniform on (a, b), and similarly for the other PMFs we defined. We will also use the notation X =_d Y to indicate that two random variables have the same PMF.
1.1 Poisson distribution as a limit of the binomial
To get a feel for the Poisson random variable, think of a binomial random variable with very small p and very large n. For example, consider the number of typos in a book with a total of n words, when the probability p that any one word is misspelled is very small (associate a word with a coin toss that results in a head when the word is misspelled), or the number of cars involved in accidents in a city on a given day (associate a car with a coin toss that results in a head when the car has an accident). Such random variables can be well modeled with a Poisson PMF.
More precisely, the Poisson PMF with parameter λ is a good approximation for a binomial PMF with parameters n and p, i.e.,
e^{−λ} λ^k / k! ≈ (n! / (k!(n − k)!)) p^k (1 − p)^{n−k},   k = 0, 1, . . . , n,
provided λ = np, n is large, and p is small. In this case, using the Poisson PMF may result in simpler models and calculations. For example, let n = 100 and p = 0.01. Then the probability of k = 5 successes in n = 100 trials is calculated using the binomial PMF as
(100! / (95! 5!)) · 0.01^5 · (1 − 0.01)^95 = 0.00290.
Using the Poisson PMF with λ = np = 100 · 0.01 = 1, this probability is approximated by
e^{−1} · (1/5!) = 0.00306.
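The two numbers above can be reproduced, up to rounding, with a few lines of Python (a sketch using only the standard library):

```python
# Sketch: binomial probability vs. its Poisson approximation for
# n = 100, p = 0.01, k = 5 (the numbers used in the text).
from math import comb, exp, factorial

n, p, k = 100, 0.01, 5
lam = n * p
binomial = comb(n, k) * p ** k * (1 - p) ** (n - k)
poisson = exp(-lam) * lam ** k / factorial(k)
print(binomial)  # approximately 0.00290
print(poisson)   # approximately 0.00307
```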
More formally, we have the following limit result. Let Xn be a binomial random variable with parameters n and p = λ/n, and let X be a Poisson random variable with parameter λ. Then, for every fixed k, P(Xn = k) converges to P(X = k), as n → ∞.
Proof: We have
P(Xn = k) = (n(n − 1) · · · (n − k + 1) / n^k) · (λ^k / k!) · (1 − λ/n)^{n−k}.
Fix k and let n → ∞. We have, for j = 1, . . . , k,
(n − k + j)/n → 1,   (1 − λ/n)^{−k} → 1,   (1 − λ/n)^n → e^{−λ}.
Thus, for any fixed k, we obtain
lim_{n→∞} P(Xn = k) = e^{−λ} λ^k / k! = P(X = k),
as claimed.
In most applications, one typically deals with several random variables at once.
In this section, we introduce a few concepts that are useful in such a context.
Let X and Y be discrete random variables defined on the same probability space. The function pX,Y : R² → [0, 1], defined by
pX,Y(x, y) = P(X = x, Y = y),
is called the joint PMF of X and Y. Here and in the sequel, we will use the abbreviated notation P(X = x, Y = y) instead of the more precise notations P({X = x} ∩ {Y = y}) or P(X = x and Y = y). More generally, the joint PMF of finitely many discrete random variables X1, . . . , Xn on the same probability space is defined by
pX1,...,Xn(x1, . . . , xn) = P(X1 = x1, . . . , Xn = xn).
The joint PMF of X and Y determines the probability of any event that can be specified in terms of the random variables X and Y. For example, if A is the set of all pairs (x, y) that have a certain property, then
P((X, Y) ∈ A) = Σ_{(x,y)∈A} pX,Y(x, y).
In fact, we can calculate the marginal PMFs of X and Y by using the formulas
pX(x) = Σ_y pX,Y(x, y),   pY(y) = Σ_x pX,Y(x, y).
To verify the first formula, note that pX(x) = P(X = x) = Σ_y P(X = x, Y = y), where the second equality follows by noting that the event {X = x} is the union of the countably many disjoint events {X = x, Y = y}, as y ranges over all the different values of Y. The formula for pY(y) is verified similarly.
Discrete random variables X1, . . . , Xn, defined on the same probability space, are said to be independent if the events {X1 ∈ B1}, . . . , {Xn ∈ Bn} are independent, for every choice of Borel sets B1, . . . , Bn.
Verifying the independence of random variables using the above definition (which involves arbitrary Borel sets) is rather difficult. It turns out that one only needs to examine Borel sets of the form (−∞, x]: the random variables X1, . . . , Xn are independent if and only if
P(X1 ≤ x1, . . . , Xn ≤ xn) = P(X1 ≤ x1) · · · P(Xn ≤ xn),   for all x1, . . . , xn.
We omit the proof of the above proposition. However, we remark that ultimately, the proof relies on the fact that the collection of events of the form (−∞, x] generates the Borel σ-field.
Let us define the joint CDF of the random variables X1, . . . , Xn by
FX1,...,Xn(x1, . . . , xn) = P(X1 ≤ x1, . . . , Xn ≤ xn).
Theorem 1. Let X and Y be discrete random variables defined on the same probability space. The following are equivalent:
(a) X and Y are independent.
(b) The events {X = x} and {Y = y} are independent, for every x and y.
(c) pX,Y(x, y) = pX(x)pY(y), for every x and y.
(d) For every y with pY(y) > 0, and for every x, P(X = x | Y = y) = pX(x).

Proof: The fact that (a) implies (b) is immediate from the definition of independence. The equivalence of (b), (c), and (d) is also an immediate consequence of our definitions. We complete the proof by verifying that (c) implies (a).
Suppose that property (c) holds, and let A, B, be two Borel subsets of the real line. We then have
P(X ∈ A, Y ∈ B) = Σ_{x∈A, y∈B} P(X = x, Y = y)
                = Σ_{x∈A, y∈B} pX,Y(x, y)
                = Σ_{x∈A, y∈B} pX(x)pY(y)
                = (Σ_{x∈A} pX(x)) (Σ_{y∈B} pY(y))
                = P(X ∈ A) P(Y ∈ B).
Since this is true for any Borel sets A and B, we conclude that X and Y are independent.
We note that Theorem 1 generalizes to the case of multiple, but finitely many, random variables. The generalization of conditions (a)-(c) should be obvious. As for condition (d), it can be generalized to a few different forms, one of which is the following: given any subset S0 of the random variables under consideration, the conditional joint PMF of the random variables Xs, s ∈ S0, given the values of the remaining random variables, is the same as the unconditional joint PMF of the random variables Xs, s ∈ S0, as long as we are conditioning on an event with positive probability.
We finally note that functions g(X) and h(Y ) of two independent random
variables X and Y must themselves be independent. This should be expected
on intuitive grounds: If X is independent from Y , then the information provided
by the value of g(X) should not affect the distribution of Y , and consequently
should not affect the distribution of h(Y ). Observe that when X and Y are
discrete, then g(X) and h(Y ) are random variables (the required measurability
conditions are satisfied) even if the functions g and h are not measurable (why?).
3.3 Examples
4 EXPECTED VALUES
(c) If S+ < ∞ and S− = ∞, the sum Σ_{s∈S} a_s is not absolutely convergent; we define it to be equal to −∞. In this case, for every possible arrangement of the elements of S in a sequence {s_n}, we have

lim_{n→∞} Σ_{i=1}^n a_{s_i} = −∞.

(d) If S+ = ∞ and S− = ∞, the sum Σ_{s∈S} a_s is left undefined. In fact, in this case, different arrangements of the elements of S in a sequence {s_n} will result in different or even nonexistent values of the limit

lim_{n→∞} Σ_{i=1}^n a_{s_i}.
More important, we stress that the first equality need not hold in the absence of
conditions (i) or (ii) above.
Suppose that you spin the wheel k times, and that ki is the number of times
that the outcome is mi . Then, the total amount received is m1 k1 + m2 k2 + · · · +
mn kn. The amount received per spin is

M = (m1 k1 + m2 k2 + · · · + mn kn) / k.

If the number of spins k is very large, and if we are willing to interpret probabilities as relative frequencies, it is reasonable to anticipate that mi comes up a fraction of times that is roughly equal to pi:

ki/k ≈ pi,   i = 1, . . . , n.

Thus, the amount of money per spin that you “expect” to receive is

M = (m1 k1 + m2 k2 + · · · + mn kn) / k ≈ m1 p1 + m2 p2 + · · · + mn pn.
Motivated by this example, we introduce the following definition.
The expected value of X is defined as

E[X] = Σ_x x pX(x),

whenever the sum is well defined, and where the sum is taken over the countable set of values in the range of X.

Example. Using this formula, it is easy to give an example of a random variable for which the expected value is infinite. Consider a random variable X distributed according to Pow(α), where α ≤ 1. Then, it can be verified, using the fact Σ_{n=1}^∞ 1/n = ∞, that E[X] = Σ_{n≥1} 1/n^α = ∞. On the other hand, when α > 1, the sum converges and E[X] is finite.
Here is another useful fact, whose proof is again left as an exercise.
Proposition 4. Let X and Y be discrete random variables defined on the same probability space. Then, for any scalars a and b,

E[aX + bY] = aE[X] + bE[Y],

and, if the random variables X1, . . . , Xn are independent,

var(Σ_{i=1}^n Xi) = Σ_{i=1}^n var(Xi).

For the linearity property, we have

E[aX + bY] = Σ_{x,y} (ax + by) pX,Y(x, y)
           = Σ_x (ax Σ_y pX,Y(x, y)) + Σ_y (by Σ_x pX,Y(x, y))
           = a Σ_x x pX(x) + b Σ_y y pY(y)
           = aE[X] + bE[Y].
For part (d), we have var(X + Y) = var(X) + var(Y) + 2(E[XY] − E[X]E[Y]). Using the equality E[XY] = E[X]E[Y], the above expression becomes var(X) + var(Y). The proof of part (g) is similar and is omitted.
Remark: The equalities in part (f) need not hold in the absence of independence. For example, consider a random variable X that takes either value 1 or −1, with probability 1/2 each. Then, E[X] = 0, but E[X^2] = 1. If we let Y = X, we see that E[XY] = E[X^2] = 1 ≠ 0 = E[X] E[Y]. Furthermore, var(X + Y) = var(2X) = 4 var(X) = 4, while var(X) + var(Y) = 2.
Exercise 2. Show that var(X) = 0 if and only if there exists a constant c such that
P(X = c) = 1.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 7 9/29/2008
EXPECTATIONS
Contents
2 EXPECTED VALUES OF SOME COMMON RANDOM VARIABLES

(a) Bernoulli(p). Let X be a Bernoulli random variable with parameter p. Then,

E[X] = 1 · p + 0 · (1 − p) = p,
var(X) = E[X^2] − (E[X])^2 = 1^2 · p + 0^2 · (1 − p) − p^2 = p(1 − p).
(d) Poisson(λ). Let X be a Poisson random variable with parameter λ. A direct
calculation yields
E[X] = e^{−λ} Σ_{n=0}^∞ n λ^n/n!
     = e^{−λ} Σ_{n=1}^∞ n λ^n/n!
     = e^{−λ} Σ_{n=1}^∞ λ^n/(n − 1)!
     = λ e^{−λ} Σ_{n=0}^∞ λ^n/n!
     = λ.
The variance of X turns out to satisfy var(X) = λ, but we defer the deriva
tion to a later section. We note, however, that the mean and the variance of a
Poisson random variable are exactly what one would expect, on the basis of
the formulae for the mean and variance of a binomial random variable, and
taking the limit as n → ∞, p → 0, while keeping np fixed at λ.
(e) Power(α). Let X be a random variable with a power law distribution with
parameter α. We have
E[X] = Σ_{k=0}^∞ P(X > k) = Σ_{k=0}^∞ 1/(k + 1)^α.
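As a numerical sanity check on the Poisson computations above (a sketch of ours, not part of the original notes), one can accumulate the Poisson PMF iteratively and confirm that both the mean and the variance come out to λ:

```python
# Our sketch: accumulate the Poisson(lam) PMF iteratively (avoiding large
# factorials) and check that the mean and variance are both lam.
from math import exp

lam = 3.0
prob = exp(-lam)          # P(X = 0)
mean = second = 0.0
for n in range(1, 200):   # the tail beyond n = 200 is negligible here
    prob *= lam / n       # now prob = P(X = n)
    mean += n * prob
    second += n * n * prob

print(mean)               # ~3.0 = lam
print(second - mean**2)   # ~3.0, consistent with var(X) = lam
```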
3.1 Covariance
The covariance of two square integrable random variables, X and Y , is denoted
by cov(X, Y ), and is defined by
cov(X, Y) = E[(X − E[X])(Y − E[Y])].
Note that, under the square integrability assumption, the covariance is al
ways well-defined and finite. This is a consequence of the fact that |XY | ≤
(X 2 + Y 2 )/2, which implies that XY , as well as (X − E[X])(Y − E[Y ]), are
integrable.
Roughly speaking, a positive or negative covariance indicates that the values
of X − E[X] and Y − E[Y ] obtained in a single experiment “tend” to have the
same or the opposite sign, respectively. Thus, the sign of the covariance provides
an important qualitative indicator of the relation between X and Y .
We record a few properties of the covariance, which are immediate conse
quences of its definition:
and, more generally,

var(Σ_{i=1}^n Xi) = Σ_{i=1}^n var(Xi) + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n cov(Xi, Xj).

This can be seen from the following calculation, where for brevity, we denote X̃i = Xi − E[Xi]:

var(Σ_{i=1}^n Xi) = E[(Σ_{i=1}^n X̃i)^2]
                 = E[Σ_{i=1}^n Σ_{j=1}^n X̃i X̃j]
                 = Σ_{i=1}^n Σ_{j=1}^n E[X̃i X̃j]
                 = Σ_{i=1}^n E[X̃i^2] + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n E[X̃i X̃j]
                 = Σ_{i=1}^n var(Xi) + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n cov(Xi, Xj).
The correlation coefficient of two random variables X and Y with positive variances is defined as

ρ(X, Y) = cov(X, Y) / √(var(X) var(Y)).

(The simpler notation ρ will also be used when X and Y are clear from the context.) It may be viewed as a normalized version of the covariance cov(X, Y).

Theorem 1.

(a) We have −1 ≤ ρ ≤ 1.

(b) We have ρ = 1 (respectively, ρ = −1) if and only if there exists a positive (respectively, negative) constant a such that Y − E[Y] = a(X − E[X]), with probability 1.
Proof of Theorem 1:

(a) By the Cauchy-Schwarz inequality,

|E[X̃Ỹ]| ≤ √(E[X̃^2]) √(E[Ỹ^2]),

and hence |ρ(X, Y)| ≤ 1.

(b) One direction is straightforward. If Ỹ = aX̃, then

ρ(X, Y) = E[X̃ · aX̃] / (√(E[X̃^2]) √(E[(aX̃)^2])) = a/|a|.

For the converse, suppose that ρ(X, Y) = 1, and consider the random variable

X̃/√(E[X̃^2]) − Ỹ/√(E[Ỹ^2]).

Its second moment equals 2 − 2ρ(X, Y) = 0, so that X̃/√(E[X̃^2]) = Ỹ/√(E[Ỹ^2]), with probability 1; the case ρ(X, Y) = −1 is handled similarly, with a plus sign in place of the minus sign. Note that the sign of the constant ratio of X̃ and Ỹ is determined by the sign of ρ(X, Y), as claimed.
4 INDICATOR VARIABLES AND THE INCLUSION-EXCLUSION FORMULA
Indicator functions are special discrete random variables that can be useful in
simplifying certain derivations or proofs. In this section, we develop the inclusion-
exclusion formula and apply it to a matching problem.
Recall that with every event A, we can associate its indicator function,
which is a discrete random variable IA : Ω → {0, 1}, defined by IA (ω) = 1 if
ω ∈ A, and IA (ω) = 0 otherwise. Note that IAc = 1 − IA and that E[IA ] =
P(A). These simple observations, together with the linearity of expectations
turn out to be quite useful.
We begin with the easily verifiable fact that for any real numbers a1, . . . , an, we have

Π_{j=1}^n (1 − aj) = 1 − Σ_{1≤j≤n} aj + Σ_{1≤i<j≤n} ai aj − Σ_{1≤i<j<k≤n} ai aj ak + · · · + (−1)^n a1 · · · an.

Applying this identity with aj = 1_{Aj}(ω), and noting that Π_j (1 − 1_{Aj}) = 1 − 1_B for B = A1 ∪ · · · ∪ An, we obtain, after taking expectations, the inclusion-exclusion formula:

P(B) = Σ_{1≤j≤n} P(Aj) − Σ_{1≤i<j≤n} P(Ai ∩ Aj) + Σ_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(A1 ∩ · · · ∩ An).
For i ≠ j, we have

cov(Xi, Xj) = E[(Xi − E[Xi])(Xj − E[Xj])]
            = E[Xi Xj] − E[Xi] E[Xj]
            = P(Xi = 1 and Xj = 1) − P(Xi = 1) P(Xj = 1)
            = P(Xi = 1) P(Xj = 1 | Xi = 1) − P(Xi = 1) P(Xj = 1)
            = (1/n) · (1/(n − 1)) − 1/n^2
            = 1/(n^2 (n − 1)).
Therefore,

var(X) = var(Σ_{i=1}^n Xi)
       = Σ_{i=1}^n var(Xi) + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n cov(Xi, Xj)
       = n · (1/n)(1 − 1/n) + 2 · (n(n − 1)/2) · 1/(n^2 (n − 1))
       = 1.
Finding the PMF of X is a little harder. Let us first dispense with some
easy cases. We have P(X = n) = 1/n!, because there is only one (out of the n! possible) permutation under which every person receives their own hat.
Furthermore, the event X = n − 1 is impossible: if n − 1 persons have received
their own hat, the remaining person must also have received their own hat.
Let us continue by finding the probability that X = 0. Let Ai be the event
that the ith person gets their own hat, i.e., Xi = 1. Note that the event X = 0
is the same as the event ∩i Aci . Thus, P(X = 0) = 1 − P(∪ni=1 Ai ). Using the
inclusion-exclusion formula, we have
P(∪_{i=1}^n Ai) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · .

For any k distinct indices i1, . . . , ik, we have

P(Ai1 ∩ Ai2 ∩ · · · ∩ Aik) = (1/n) · (1/(n − 1)) · · · (1/(n − k + 1)) = (n − k)!/n!.    (1)
Thus,

P(∪_{i=1}^n Ai) = n · (1/n) − (n choose 2) (n − 2)!/n! + (n choose 3) (n − 3)!/n! − · · · + (−1)^{n+1} (n choose n) (n − n)!/n!
              = 1 − 1/2! + 1/3! − · · · + (−1)^{n+1} 1/n!.

We conclude that

P(X = 0) = 1/2! − 1/3! + · · · + (−1)^n 1/n!.    (2)
Note that P(X = 0) → e−1 , as n → ∞.
To conclude, let us now fix some integer r, with 0 < r ≤ n − 2, and calculate P(X = r). The event {X = r} can only occur as follows: for some subset S of {1, . . . , n}, of cardinality r, the following two events, BS and CS, occur:

(a) BS: every one of the r persons in S receives their own hat;
(b) CS: none of the remaining n − r persons receives their own hat.

We then have

{X = r} = ∪_{S : |S|=r} (BS ∩ CS).
Note that

P(BS) = (n − r)!/n!,
by the same argument as in Eq. (1). Conditioned on the event that the r persons
in the set S have received their own hats, the event CS will materialize if and
only if none of the remaining n − r persons receive their own hat. But this is
the same situation as the one analyzed when we calculated the probability that
X = 0, except that n needs to be replaced by n − r. We conclude that
P(CS | BS) = 1/2! − 1/3! + · · · + (−1)^{n−r} 1/(n − r)!.
Putting everything together, we conclude that
P(X = r) = (n choose r) · ((n − r)!/n!) · (1/2! − 1/3! + · · · + (−1)^{n−r} 1/(n − r)!)
         = (1/r!) (1/2! − 1/3! + · · · + (−1)^{n−r} 1/(n − r)!).
Note that for each fixed r, the probability P(X = r) converges to e−1 /r!,
as n → ∞, which corresponds to a Poisson distribution with parameter 1. An
intuitive justification is as follows. The random variables Xi are not independent
(in particular, their covariance is nonzero). On the other hand, as n → ∞, they
are “approximately independent”. Furthermore, the success probability for each
person is 1/n, and the situation is similar to the one in our earlier proof that the
binomial PMF approaches the Poisson PMF.
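A quick Monte Carlo experiment (our sketch, using only the Python standard library) confirms all three conclusions: E[X] = 1, var(X) = 1, and P(X = r) ≈ e^{−1}/r! for moderately large n.

```python
# Our Monte Carlo sketch of the matching (hat) problem: X is the number
# of fixed points of a uniformly random permutation of n items.
import random
from math import exp, factorial

n, trials = 20, 200_000
counts = [0] * (n + 1)
for _ in range(trials):
    perm = list(range(n))
    random.shuffle(perm)
    counts[sum(1 for i, j in enumerate(perm) if i == j)] += 1

mean = sum(r * c for r, c in enumerate(counts)) / trials
var = sum(r * r * c for r, c in enumerate(counts)) / trials - mean**2
print(mean, var)  # both close to 1, as computed above
for r in range(4):
    # empirical P(X = r) vs. the Poisson(1) limit e^{-1}/r!
    print(r, counts[r] / trials, exp(-1) / factorial(r))
```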
5 CONDITIONAL EXPECTATIONS
Definition 1. Given an event A, such that P(A) > 0, and a discrete random
variable X, the conditional expectation of X given A is defined as
E[X | A] = Σ_x x pX|A(x),

where pX|A is the conditional PMF of X given the event A.
Note that the preceding also provides a definition for a conditional expecta
tion of the form E[X | Y = y], for any y such that pY (y) > 0: just let A be the
event {Y = y}, which yields
E[X | Y = y] = Σ_x x pX|Y(x | y).
We note that the conditional expectation is always well defined when either
the random variable X is nonnegative, or when the random variable X is inte
grable. In particular, whenever E[|X|] < ∞, we also have E[|X| | Y = y] < ∞,
for every y such that pY (y) > 0. To verify the latter assertion, note that for
every y such that pY (y) > 0, we have
Σ_x |x| pX|Y(x | y) = Σ_x |x| pX,Y(x, y)/pY(y) ≤ (1/pY(y)) Σ_x |x| pX(x) = E[|X|]/pY(y).
The converse, however, is not true: it is possible that E[|X| | Y = y] is finite for
every y that has positive probability, while E[|X|] = ∞.
The conditional expectation is essentially the same as an ordinary expecta
tion, except that the original PMF is replaced by the conditional PMF. As such,
the conditional expectation inherits all the properties of ordinary expectations
(cf. Proposition 4 in the notes for Lecture 6).
Example. (The mean of the geometric.) Let X be a geometric random variable with parameter p, so that pX(k) = (1 − p)^{k−1} p, for k ∈ N. We first observe that the geometric distribution is memoryless: for k ∈ N, we have

P(X − 1 = k | X > 1) = P(X = k + 1, X > 1)/P(X > 1)
                     = P(X = k + 1)/P(X > 1)
                     = (1 − p)^k p / (1 − p)
                     = (1 − p)^{k−1} p
                     = P(X = k).
In words, in a sequence of repeated i.i.d. trials, given that the first trial was a failure,
the distribution of the remaining trials, X − 1, until the first success is the same as the
unconditional distribution of the number of trials, X, until the first success. In particular,
E[X − 1 | X > 1] = E[X].
Using the total expectation theorem, we can write
E[X] = E[X | X > 1]P(X > 1)+E[X | X = 1]P(X = 1) = (1+E[X])(1−p)+1·p.
We solve for E[X], and find that E[X] = 1/p.
Similarly,
E[X 2 ] = E[X 2 | X > 1]P(X > 1) + E[X 2 | X = 1]P(X = 1).
Note that
E[X 2 | X > 1] = E[(X −1)2 | X > 1]+E[2(X −1)+1 | X > 1] = E[X 2 ]+(2/p)+1.
Thus,
E[X 2 ] = (1 − p)(E[X 2 ] + (2/p) + 1) + p,
which yields
E[X^2] = 2/p^2 − 1/p.

We conclude that

var(X) = E[X^2] − (E[X])^2 = 2/p^2 − 1/p − 1/p^2 = (1 − p)/p^2.
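These two formulas are easy to confirm by simulation; the following sketch (ours) samples the geometric distribution directly as the number of Bernoulli trials up to and including the first success:

```python
# Our simulation sketch for the geometric distribution: checks
# E[X] = 1/p and var(X) = (1 - p)/p^2.
import random

def geometric(p):
    k = 1
    while random.random() >= p:
        k += 1
    return k

p, trials = 0.3, 200_000
samples = [geometric(p) for _ in range(trials)]
mean = sum(samples) / trials
var = sum(x * x for x in samples) / trials - mean**2

print(mean, 1 / p)           # both ~3.33
print(var, (1 - p) / p**2)   # both ~7.78
```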
Example. Let N be a Poisson random variable with parameter λ, interpreted as a number of independent coin tosses, where each toss results in heads with probability p; let X be the resulting number of heads and let Y = N − X be the number of tails. (This is the setting of Theorem 3 in the notes for Lecture 6.) Conditioned on N = n, E[X | N = n] is just the expected number of heads in n tosses, so that E[X | N = n] = np.
Let us now calculate E[N | X = m]. We have
E[N | X = m] = Σ_{n=0}^∞ n P(N = n | X = m)
             = Σ_{n=m}^∞ n P(N = n, X = m)/P(X = m)
             = Σ_{n=m}^∞ n P(X = m | N = n) P(N = n)/P(X = m)
             = Σ_{n=m}^∞ n (n choose m) p^m (1 − p)^{n−m} (λ^n/n!) e^{−λ} / P(X = m).
Recall that X is Poisson with parameter λp, so that P(X = m) = e^{−λp} (λp)^m/m!. Thus, after some cancellations, we obtain

E[N | X = m] = Σ_{n=m}^∞ n (1 − p)^{n−m} λ^{n−m} e^{−λ(1−p)}/(n − m)!
             = Σ_{n=m}^∞ (n − m) (1 − p)^{n−m} λ^{n−m} e^{−λ(1−p)}/(n − m)!
               + m Σ_{n=m}^∞ (1 − p)^{n−m} λ^{n−m} e^{−λ(1−p)}/(n − m)!
             = λ(1 − p) + m.
A faster way of obtaining this result is as follows. From Theorem 3 in the notes for
Lecture 6, we have that X and Y are independent, and that Y is Poisson with parameter
λ(1 − p). Therefore, E[N | X] = λ(1 − p) + X, and

E[E[N | X]] = λ(1 − p) + E[X] = λ(1 − p) + λp = λ = E[N].
This is not a coincidence; the equality E[E[X | Y ]] = E[X] is always true, as we shall
now see. In fact, this is just the total expectation theorem, written in more abstract
notation.
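The identity E[N | X = m] = λ(1 − p) + m can also be checked by simulation. The sketch below is ours (the parameter values are arbitrary); it samples N as a Poisson count via unit-rate exponential interarrivals and conditions on the event X = m.

```python
# Our sketch: N ~ Poisson(lam); given N, X counts heads among N tosses
# of a coin with bias p. The empirical mean of N over runs with X = m
# should be close to lam*(1 - p) + m.
import random

def poisson(lam):
    # number of unit-rate exponential interarrivals that fit in [0, lam]
    n, t = 0, random.expovariate(1.0)
    while t <= lam:
        n += 1
        t += random.expovariate(1.0)
    return n

lam, p, m = 10.0, 0.4, 3
vals = []
while len(vals) < 20_000:
    N = poisson(lam)
    X = sum(random.random() < p for _ in range(N))
    if X == m:
        vals.append(N)

print(sum(vals) / len(vals))   # empirical E[N | X = m]
print(lam * (1 - p) + m)       # exact value: 9.0
```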
Theorem 2. Let g : R → R be a measurable function such that Xg(Y ) is
either nonnegative or integrable. Then,
E[E[X | Y] g(Y)] = E[X g(Y)].
Proof: We have

E[E[X | Y] g(Y)] = Σ_y E[X | Y = y] g(y) pY(y)
                 = Σ_y Σ_x x pX|Y(x | y) g(y) pY(y)
                 = Σ_{x,y} x g(y) pX,Y(x, y) = E[X g(Y)].
Theorem 2 suggests a more abstract definition: E[X | Y] is a random variable of the form φ(Y), where φ : R → R is a measurable function that satisfies

E[φ(Y) g(Y)] = E[X g(Y)],

for every measurable function g. The merit of this definition is that it can be used for all kinds of random variables (discrete, continuous, mixed, etc.). However, for this definition to be sound, there are two facts that need to be verified:
(a) Existence: It turns out that as long as X is integrable, a function φ with the
above properties is guaranteed to exist. We already know that this is the
case for discrete random variables: the conditional expectation as defined in
the beginning of this section does have the desired properties. For general
random variables, this is a nontrivial and deep result. It will be revisited
later in this course.
(b) Uniqueness: It turns out that there is essentially only one function φ with
the above properties. More precisely, any two functions with the above
properties are equal with probability 1.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 8 10/1/2008
Contents
Readings: For a less technical version of this material, but with more discussion
and examples, see Sections 3.1-3.5 of [BT] and Sections 4.1-4.5 of [GS].
A random variable X is said to be continuous if its CDF can be written in the form

FX(x) = ∫_{−∞}^x f(t) dt,

for some nonnegative measurable function f : R → [0, ∞), which is called the PDF of X. We then have, for any Borel set B,

P(X ∈ B) = ∫_B fX(x) dx.
Technical remark: Since we have not yet defined the notion of an integral of
a measurable function, the discussion in these notes will be rigorous only when
we deal with integrals that can be interpreted in the usual sense of calculus,
1
The reader should revisit Section 4 of the notes for Lecture 5.
1
namely, Riemann integrals. For now, let us just concentrate on functions that are
piecewise continuous, with a finite number of discontinuities.
We note that fX should be more appropriately called “a” (as opposed to
“the”) PDF of X, because it is not unique. For example, if we modify fX at a
finite number of points, its integral is unaffected, so multiple densities can corre
spond to the same CDF. It turns out, however, that any two densities associated
with the same CDF are equal except on a set of Lebesgue measure zero.
A PDF is in some ways similar to a PMF, except that the value fX (x) cannot
be interpreted as a probability. In particular, the value of fX (x) is allowed to
be greater than one for some x. Instead, the proper intuitive interpretation is the
fact that if fX is continuous over a small interval [x, x + δ], then
P(x ≤ X ≤ x + δ) ≈ fX (x)δ.
Remark: The fact that a random variable X is continuous has no bearing on the
continuity of X as a function from Ω into R. In fact, we have not even defined
what it means for a function on Ω to be continuous. But even in the special case
where Ω = R, we can have a discontinuous function X : R → R which is a
continuous random variable. Here is an example. Let the underlying probability
measure on Ω be the Lebesgue measure on the unit interval. Let
X(ω) = { ω,      0 ≤ ω ≤ 1/2,
         1 + ω,  1/2 < ω ≤ 1.
The function X is discontinuous. The random variable X takes values in the set
[0, 1/2] ∪ (3/2, 2]. Furthermore, it is not hard to check that X is a continuous
random variable with PDF given by
fX(x) = { 1,  x ∈ [0, 1/2] ∪ (3/2, 2],
          0,  otherwise.
2 EXAMPLES
2.1 Uniform
This is perhaps the simplest continuous random variable. Consider an interval
[a, b], and let
FX(x) = { 0,                x ≤ a,
          (x − a)/(b − a),  a < x ≤ b,
          1,                x > b.
2.2 Exponential
Fix some λ > 0. Let FX (x) = 1 − e−λx , for x ≥ 0, and FX (x) = 0, for
x < 0. It is easy to check that FX satisfies the required properties of CDFs.
A corresponding PDF is fX (x) = λe−λx , for x ≥ 0, and fX (x) = 0, for
x < 0. We denote this distribution by Exp(λ). The exponential distribution
can be viewed as a “limit” of a geometric distribution. Indeed, if we fix some δ
and consider the values of FX (kδ), for k = 1, 2, . . ., these values agree with the
values of a geometric CDF. Intuitively, the exponential distribution corresponds
to a limit of a situation where every δ time units, we toss a coin whose success
probability is λδ, and let X be the time elapsed until the first success.
The distribution Exp(λ) has the following very important memorylessness property: for every s, t ≥ 0, we have

P(X > s + t | X > s) = P(X > t).

In words, no matter how long we have already waited, the exponential distribution remains a valid model of the remaining time until an arrival. For example, suppose that the
time until the next bus arrival is an exponential random variable with parameter
λ = 1/5 (in minutes). Thus, there is probability e−1 that you will have to wait
for at least 5 minutes. Suppose that you have already waited for 10 minutes. The
probability that you will have to wait for at least another five minutes is still the
same, e−1 .
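Since P(X > x) = e^{−λx}, the memorylessness property amounts to the identity e^{−λ(s+t)}/e^{−λs} = e^{−λt}; the toy computation below (our sketch) checks it with the numbers of the bus example.

```python
# Our sketch: numerical check of memorylessness for Exp(lam) with the
# bus-example numbers (lam = 1/5, already waited s = 10 minutes,
# probability of waiting at least t = 5 more minutes).
from math import exp

lam, s, t = 1 / 5, 10.0, 5.0
surv = lambda x: exp(-lam * x)    # P(X > x)

print(surv(s + t) / surv(s))      # P(X > s + t | X > s) = e^{-1} ~ 0.3679
print(surv(t))                    # P(X > t), the same number
```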
2.3 Normal

Fix some µ ∈ R and σ > 0, and let

fX(x) = (1/(σ√(2π))) e^{−(x−µ)^2/(2σ^2)}.
It can be checked that this is a legitimate PDF, i.e., that it integrates to one. Note
also that this PDF is symmetric around x = µ. We use the notation N (µ, σ 2 ) to
denote the normal distribution with parameters µ, σ. The distribution N (0, 1) is
referred to as the standard normal distribution; a corresponding random vari
able is also said to be standard normal.
There is no closed form formula for the corresponding CDF, but numerical
tables are available. These tables can also be used to find probabilities associated
with general normal variables. This is because of the fact (to be verified later)
that if X ∼ N (µ, σ 2 ), then (X − µ)/σ ∼ N (0, 1). Thus,
P(X ≤ c) = P((X − µ)/σ ≤ (c − µ)/σ) = Φ((c − µ)/σ),
where Φ is the CDF of the standard normal, available from the normal tables.
fX(t) = (dFX/dt)(t) = α c^α / t^{α+1}.
3 EXPECTED VALUES
Similar to the discrete case, given a continuous random variable X with PDF
fX , we define
E[X] = ∫_{−∞}^∞ x fX(x) dx.

This integral is well defined and finite if ∫_{−∞}^∞ |x| fX(x) dx < ∞, in which case we say that the random variable X is integrable. The integral is also well defined, but infinite, if one, but not both, of the integrals ∫_{−∞}^0 x fX(x) dx and ∫_0^∞ x fX(x) dx is infinite. If both of these integrals are infinite, the expected value is not defined.
Practically all of the results developed for discrete random variables carry
over to the continuous case. Many of them, e.g., E[X + Y ] = E[X] + E[Y ],
have the exact same form. We list below two results in which sums need to be
replaced by integrals.
Proposition 1. If X is a nonnegative continuous random variable with PDF fX, then

E[X] = ∫_0^∞ P(X > t) dt = ∫_0^∞ (1 − FX(t)) dt.

Proof: We have

∫_0^∞ (1 − FX(t)) dt = ∫_0^∞ P(X > t) dt = ∫_0^∞ ∫_t^∞ fX(x) dx dt
                     = ∫_0^∞ ∫_0^x fX(x) dt dx = ∫_0^∞ x fX(x) dx = E[X].

(The interchange of the order of integration turns out to be justified because the integrand is nonnegative.)
Proposition 2. Let X be a continuous random variable with density fX , and
suppose that g : R → R is a (Borel) measurable function such that g(X) is
integrable. Then,
E[g(X)] = ∫_{−∞}^∞ g(t) fX(t) dt.
Proof: Let us express the function g as the difference of two nonnegative func
tions,
g(x) = g + (x) − g − (x),
where g + (x) = max{g(x), 0}, and g − (x) = max{−g(x), 0}. In particular, for
any t ≥ 0, we have g(x) > t if and only if g + (x) > t.
We will use the result

E[g(X)] = ∫_0^∞ P(g(X) > t) dt − ∫_0^∞ P(g(X) < −t) dt

from Proposition 1. The first term in the right-hand side is equal to

∫_0^∞ ∫_{{x | g(x)>t}} fX(x) dx dt = ∫_{−∞}^∞ ∫_{{t | 0≤t<g(x)}} fX(x) dt dx = ∫_{−∞}^∞ g+(x) fX(x) dx.

The second term is handled similarly, and the result follows.
Note that for this result to hold, the random variable g(X) need not be con
tinuous. The proof is similar to the one for Proposition 1, and involves an in
terchange of the order of integration; see [GS] for a proof for the special case
where g ≥ 0.
4 JOINT DISTRIBUTIONS
We say that two random variables X and Y, defined on the same probability space, are jointly continuous if there exists a nonnegative measurable function fX,Y : R^2 → [0, ∞) such that their joint CDF satisfies

FX,Y(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y fX,Y(u, v) dv du.

Furthermore, at every point (x, y) where fX,Y is continuous, we have

(∂^2 F/∂x∂y)(x, y) = fX,Y(x, y).
Similar to what was mentioned for the case of a single random variable, for
any Borel subset B of R2 , we have
P((X, Y) ∈ B) = ∫∫_B fX,Y(x, y) dx dy = ∫∫_{R^2} 1_B(x, y) fX,Y(x, y) dx dy.
We have just argued that if X and Y are jointly continuous, then X (and, similarly, Y) is a continuous random variable. The converse is not true. For a trivial counterexample, let X be a continuous random variable, and let Y = X. Then the set {(x, y) ∈ R^2 | x = y} has zero area (zero Lebesgue measure), but unit probability, which is impossible for jointly continuous random variables. In particular, the corresponding probability law on R^2 is neither discrete nor continuous, hence qualifies as “singular.”
Proposition 2 has a natural extension to the case of multiple random vari
ables.
3
The Lebesgue measure on R2 is the unique measure µ defined on the Borel subsets of R2
that satisfies µ([a, b] × [c, d]) = (b − a)(d − c), i.e., agrees with the elementary notion of “area”
on rectangular sets. Existence and uniqueness of such a measure is obtained from the Extension
Theorem, in a manner similar to the one used in our construction of the Lebesgue measure on R.
Proposition 3. Let X and Y be jointly continuous with PDF fX,Y, and suppose that g : R^2 → R is a (Borel) measurable function such that g(X, Y) is integrable. Then,

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(u, v) fX,Y(u, v) du dv.
5 INDEPENDENCE
Recall that two random variables, X and Y , are said to be independent if for any
two Borel subsets, B1 and B2 , of the real line, we have P(X ∈ B1 , Y ∈ B2 ) =
P(X ∈ B1 )P(Y ∈ B2 ).
Similar to the discrete case (cf. Proposition 1 and Theorem 1 in Section 3 of
Lecture 6), simpler criteria for independence are available.
The proof parallels the proofs in Lecture 6, except for the last condition,
for which the argument is simple when the densities are continuous functions
(simply differentiate the CDF), but requires more care otherwise.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 9 10/6/2008
Contents
Recall that two random variables X and Y are said to be jointly continuous if
there exists a nonnegative measurable function fX,Y such that
P(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y fX,Y(u, v) dv du.
Once we have in our hands a general definition of integrals, this can be used to
establish that for every Borel subset of R2 , we have
P((X, Y) ∈ B) = ∫∫_B fX,Y(u, v) du dv.
The quantities E[X^m] and E[(X − E[X])^m] are called the mth moment and the mth central moment, respectively, of X. In particular, var(X) = E[(X − E[X])^2] is the variance of X.
The properties of expectations developed for discrete random variables in
Lecture 6 (such as linearity) apply to the continuous case as well. The sub
sequent development, e.g., for the covariance and correlation, also applies to
the continuous case, practically without any changes. The same is true for the
Cauchy-Schwarz inequality.
Finally, we note that all of the definitions and formulas have obvious exten
sions to the case of more than two random variables.
2 CONDITIONAL PDFS
For the case of discrete random variables, the conditional CDF is defined by
FX|Y (x | y) = P(X ≤ x | Y = y), for any y such that P(Y = y) > 0. However,
this definition cannot be extended to the continuous case because P(Y = y) = 0,
for every y. Instead, we should think of FX|Y (x | y) as a limit of P(X ≤ x | y ≤
Y ≤ y + δ), as δ decreases to zero. Note that
FX|Y(x | y) ≈ P(X ≤ x | y ≤ Y ≤ y + δ)
           = P(X ≤ x, y ≤ Y ≤ y + δ)/P(y ≤ Y ≤ y + δ)
           ≈ (∫_{−∞}^x ∫_y^{y+δ} fX,Y(u, v) dv du) / (δ fY(y))
           ≈ (δ ∫_{−∞}^x fX,Y(u, y) du) / (δ fY(y))
           = (∫_{−∞}^x fX,Y(u, y) du) / fY(y).

This motivates the following definitions.

(a) The conditional CDF of X given Y is defined by

FX|Y(x | y) = ∫_{−∞}^x (fX,Y(u, y)/fY(y)) du,

for every y such that fY(y) > 0, where fY is the marginal PDF of Y.

(b) The conditional PDF of X given Y is defined by

fX|Y(x | y) = fX,Y(x, y)/fY(y),

for every y such that fY(y) > 0.
It can be checked that FX | Y is indeed a CDF (it satisfies the required prop
erties such as monotonicity, right-continuity, etc.) For example, observe that
lim_{x→∞} FX|Y(x | y) = ∫_{−∞}^∞ (fX,Y(u, y)/fY(y)) du = 1,
We also have

P(X ∈ A | Y = y) = ∫_A fX|Y(x | y) dx,   and   P(X ∈ A) = ∫ P(X ∈ A | Y = y) fY(y) dy.
These two relations are established as in the discrete case, by just replacing sum
mations with integrals. They can be rigorously justified if the random variable
X is nonnegative or integrable.
Let us fix some ρ ∈ (−1, 1) and consider the function, called the standard
bivariate normal PDF,
f(x, y) = (1/(2π√(1 − ρ^2))) exp(−(x^2 − 2ρxy + y^2)/(2(1 − ρ^2))).
Let X and Y be two jointly continuous random variables, defined on the same
probability space, whose joint PDF is f .
Proof: We will use repeatedly the fact that (1/(√(2π)σ)) exp(−(x − µ)^2/(2σ^2)) is a PDF (namely, the PDF of the N(µ, σ^2) distribution), and thus integrates to one.
which is the standard normal PDF. Since ∫_{−∞}^∞ fY(y) dy = 1, we conclude that f(x, y) integrates to one, and is a legitimate joint PDF. Furthermore, we have verified that the marginal PDF of Y (and by symmetry, also the marginal PDF of X) is the standard normal PDF, N(0, 1).
But

(1/√(2π(1 − ρ^2))) ∫_{−∞}^∞ x exp(−(x − ρy)^2/(2(1 − ρ^2))) dx = ρy,

since this is the expected value for the N(ρy, 1 − ρ^2) distribution. Thus,

E[XY] = ∫∫ xy f(x, y) dx dy = ∫ y · ρy · fY(y) dy = ρ ∫ y^2 fY(y) dy = ρ,
since the integral is the second moment of the standard normal, which
is equal to one. We have established that Cov(X, Y ) = ρ. Since the
variances of X and Y are equal to unity, we obtain ρ(X, Y ) = ρ. If X and
Y are independent, then ρ(X, Y ) = 0, implying that ρ = 0. Conversely,
if ρ = 0, then
f(x, y) = (1/(2π)) exp(−(x^2 + y^2)/2) = fX(x) fY(y),
and therefore X and Y are independent. Note that the condition ρ(X, Y ) =
0 implies independence, for the special case of the bivariate normal, whereas
this implication is not always true, for general random variables.
(d) Let us now compute the conditional PDF. Using the expression for fY (y),
we have
fX|Y(x | y) = f(x, y)/fY(y)
            = (1/(2π√(1 − ρ^2))) exp(−(x^2 − 2ρxy + y^2)/(2(1 − ρ^2))) · √(2π) exp(y^2/2)
            = (1/√(2π(1 − ρ^2))) exp(−(x^2 − 2ρxy + ρ^2 y^2)/(2(1 − ρ^2)))
            = (1/√(2π(1 − ρ^2))) exp(−(x − ρy)^2/(2(1 − ρ^2))),

which we recognize as the N(ρy, 1 − ρ^2) PDF.
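The conclusion of part (d) suggests a simulation check (our sketch): if Y is standard normal and X = ρY + √(1 − ρ^2) W with W standard normal independent of Y, then (X, Y) has the joint PDF f above, and the empirical mean of X over samples with Y near y0 should approach ρ y0.

```python
# Our sketch: sample the standard bivariate normal via
# X = rho*Y + sqrt(1 - rho^2)*W, with Y, W independent N(0, 1);
# the average of X over samples with Y near y0 approaches rho*y0.
import random
from math import sqrt

rho, y0, eps = 0.6, 1.0, 0.05
vals = []
while len(vals) < 20_000:
    y = random.gauss(0.0, 1.0)
    if abs(y - y0) < eps:
        x = rho * y + sqrt(1 - rho**2) * random.gauss(0.0, 1.0)
        vals.append(x)

print(sum(vals) / len(vals))   # ~rho * y0 = 0.6
```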
Similar to the discrete case, we define E[X | Y ] as a random variable that takes
the value E[X | Y = y], whenever Y = y, where fY (y) > 0. Formally, E[X | Y ]
is a function ψ : Ω → R that satisfies
ψ(ω) = ∫ x fX|Y(x | y) dx,
for every ω such that Y (ω) = y, where fY (y) > 0. Note that nothing is said
about the value of ψ(ω) for those ω that result in a y at which fY (y) = 0.
However, the set of such ω has zero probability measure. Because the value of ψ(ω) is completely determined by the value of Y(ω), we also have ψ(ω) =
φ(Y (ω)), for some function φ : R → R. It turns out that both functions ψ and
φ can be taken to be measurable.
One might expect that when X and Y are jointly continuous, then E[X | Y ]
is a continuous random variable, but this is not the case. To see this, suppose
that X and Y are independent, in which case E[X | Y = y] = E[X], which also
implies that E[X | Y ] = E[X]. Thus, E[X | Y ] takes a constant value, and is
therefore a trivial case of a discrete random variable.
Similar to the discrete case, for every measurable function g, we have

E[E[X | Y] g(Y)] = E[X g(Y)].    (1)
Example. We have a stick of unit length [0, 1], and break it at X, where X is uniformly
distributed on [0, 1]. Given the value x of X, we let Y be uniformly distributed on [0, x],
and let Z be uniformly distributed on [0, 1 − x]. We assume that conditioned on X = x,
the random variables Y and Z are independent. We are interested in the distribution of
Y and Z, their expected values, and the expected value of their product.
It is clear from symmetry that Y and Z have the same marginal distribution, so we
focus on Y . Let us first find the joint distribution of Y and X. We have fX (x) = 1, for
x ∈ [0, 1], and fY |X (y | x) = 1/x, for y ∈ [0, x]. Thus, the joint PDF is
fX,Y(x, y) = fY|X(y | x) fX(x) = (1/x) · 1 = 1/x,   0 ≤ y ≤ x ≤ 1.
We can now find the PDF of Y :
fY(y) = ∫_y^1 fX,Y(x, y) dx = ∫_y^1 (1/x) dx = log x |_y^1 = log(1/y)
(check that this indeed integrates to unity). Integrating by parts, we then obtain
E[Y] = ∫_0^1 y fY(y) dy = ∫_0^1 y log(1/y) dy = 1/4.
The above calculation is more involved than necessary. For a simpler argument,
simply observe that E[Y | X = x] = x/2, since Y conditioned on X = x is uniform on
[0, x]. In particular, E[Y | X] = X/2. It follows that E[Y ] = E[E[Y | X]] = E[X/2] =
1/4.
For an alternative version of this argument, consider the random variable Y /X.
Conditioned on the event X = x, this random variable takes values in the range [0, 1],
is uniformly distributed on that range, and has mean 1/2. Thus, the conditional PDF of
Y /X is not affected by the value x of X. This implies that Y /X is independent of X,
and we have
E[Y] = E[(Y/X) · X] = E[Y/X] · E[X] = (1/2) · (1/2) = 1/4.
To find E[Y Z] we use the fact that, conditional on X = x, Y and Z are independent,
and obtain
E[YZ] = E[E[YZ | X]] = E[E[Y | X] · E[Z | X]]
      = E[(X/2) · ((1 − X)/2)] = ∫_0^1 (x(1 − x)/4) dx = 1/24.
Exercise 1. Find the joint PDF of Y and Z. Find the probability P(Y + Z ≤ 1/3).
Find E[X|Y ], E[X|Z], and ρ(Y, Z).
Theorem 1. Suppose that E[X^2] < ∞. Then, for any measurable function g : R → R, we have

E[(X − E[X | Y])^2] ≤ E[(X − g(Y))^2].

Proof: We have

E[(X − g(Y))^2] = E[(X − E[X | Y])^2] + E[(E[X | Y] − g(Y))^2]
                + 2E[(X − E[X | Y])(E[X | Y] − g(Y))]
               ≥ E[(X − E[X | Y])^2].

The inequality above is obtained by noticing that the term E[(E[X | Y] − g(Y))^2] is always nonnegative, and that the term E[(X − E[X | Y])(E[X | Y] − g(Y))] is of the form E[(X − E[X | Y]) ψ(Y)] for ψ(Y) = E[X | Y] − g(Y), and is therefore equal to zero, by Eq. (1).
Notice that the preceding proof only relies on the property (1). As we have
discussed, we can view this as the defining property of conditional expectations,
for general random variables. It follows that the preceding theorem is true for
all kinds of random variables.
When X and Y are both discrete, Bayes' rule takes the form

pX|Y(x | y) = pX(x) pY|X(y | x) / pY(y) = pX(x) pY|X(y | x) / Σ_{x'} pX(x') pY|X(y | x').
When X and Y are both continuous, Bayes' rule takes a similar form,

fX|Y(x | y) = fX(x) fY|X(y | x) / fY(y) = fX(x) fY|X(y | x) / ∫ fX(x') fY|X(y | x') dx',
Consider now a pair (K, Z), where K is a discrete and Z is a continuous random variable, described by a function fK,Z such that, for every k and every Borel set B, P(K = k, Z ∈ B) = ∫_B fK,Z(k, t) dt. We then have

pK(k) = P(K = k) = ∫_{−∞}^∞ fK,Z(k, t) dt,

and¹

FZ(z) = P(Z ≤ z) = Σ_k ∫_{−∞}^z fK,Z(k, t) dt = ∫_{−∞}^z (Σ_k fK,Z(k, t)) dt,

so that fZ(z) = Σ_k fK,Z(k, z) is the PDF of Z.
Note that if P(K = k) > 0, then

P(Z ≤ z | K = k) = ∫_{−∞}^z (fK,Z(k, t)/pK(k)) dt,

so that fZ|K(z | k) = fK,Z(k, z)/pK(k) serves as the conditional PDF of Z given K = k.
Finally, for z such that fZ (z) > 0, we define pK|Z (k | z) = fK,Z (k, z)/fZ (z),
and interpret it as the conditional probability of the event K = k, given that
Z = z. (Note that we are conditioning on a zero probability event; a more accu
rate interpretation is obtained by conditioning on the event z ≤ Z ≤ z + δ, and
letting δ → 0.) With these definitions, we have

pK(k) fZ|K(z | k) = fZ(z) pK|Z(k | z) = fK,Z(k, z),

for every (k, z) for which fK,Z(k, z) > 0. By rearranging, we obtain two more
versions of the Bayes’ rule:
fZ|K(z | k) = fZ(z) pK|Z(k | z) / pK(k) = fZ(z) pK|Z(k | z) / ∫ fZ(z') pK|Z(k | z') dz',

and

pK|Z(k | z) = pK(k) fZ|K(z | k) / fZ(z) = pK(k) fZ|K(z | k) / Σ_{k'} pK(k') fZ|K(z | k').
Note that all four versions of Bayes’ rule take the exact same form; the only
difference is that we use PMFs and summations for discrete random variables,
as opposed to PDFs and integrals for continuous random variables.
1
The interchange of the summation and the integration can be rigorously justified, because the
terms inside are nonnegative.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 10 10/8/2008
DERIVED DISTRIBUTIONS
Contents
The principal method for deriving the PDF of g(X) is the following two-step
approach.
Calculation of the PDF of a Function Y = g(X) of a Continuous Random Variable X:

1. Calculate the CDF of Y, using the formula FY(y) = P(g(X) ≤ y).
2. Differentiate, to obtain fY(y) = (dFY/dy)(y).
For example, if Y = X^2, then, for y ≥ 0,

FY(y) = P(Y ≤ y) = P(X^2 ≤ y) = P(−√y ≤ X ≤ √y) = FX(√y) − FX(−√y),

and the PDF is then obtained by differentiating.
Suppose now that g is strictly monotone increasing on a set A with P(X ∈ A) = 1. Let B be the set of values of g(x), as x ranges over A. Let g^{−1} be the inverse function of g, so that g(g^{−1}(y)) = y, for y ∈ B. Then, for y ∈ B, and using the chain rule in the last step, we have

fY(y) = (d/dy) P(g(X) ≤ y) = (d/dy) P(X ≤ g^{−1}(y)) = fX(g^{−1}(y)) · (dg^{−1}/dy)(y).

Recall from calculus that the derivative of an inverse function satisfies

(dg^{−1}/dy)(y) = 1/g'(g^{−1}(y)),

where g' is the derivative of g. Therefore,

fY(y) = fX(g^{−1}(y)) · (1/g'(g^{−1}(y))).

When g is strictly monotone decreasing, the only change is a minus sign in front of g'(g^{−1}(y)). Thus, the two cases can be summarized in the single formula:

fY(y) = fX(g^{−1}(y)) · (1/|g'(g^{−1}(y))|),   y ∈ B.    (1)
Example. (Linear functions.) Suppose that Y = aX + b, where a ≠ 0. Then,

fY(y) = (1/|a|) fX((y − b)/a).
In particular, if X is standard normal, then Y = aX + b is N(b, a^2). More generally, if X is N(µ, σ^2), then the same argument shows that Y = aX + b is N(aµ + b, a^2 σ^2). We conclude that a linear (more precisely, affine) function of a normal random variable is normal.
2 MULTIVARIATE TRANSFORMATIONS
where o(δ n ) stands for a function such that limδ↓0 o(δ n )/δ n = 0, and where the
symbol ≈ indicates that the difference between the two sides is o(δ n ). Thus,
fX (x) · δ n ≈ P(X ∈ C)
= P(g(X) ∈ g(C))
= P(Y ∈ D)
≈ fY (y) · vol(D)
= fY (y) · |M | · δ n .
Canceling the factors of δ^n, we conclude that

fY(y) = fX(M^{−1} y)/|M| = fX(M^{−1} y) · |M^{−1}|.
More generally, if g is differentiable and invertible, with Jacobian matrix M(x) at x, a similar argument yields

fY(y) = fX(g^{−1}(y))/|M(g^{−1}(y))| = fX(g^{−1}(y)) · |M^{−1}(g^{−1}(y))|.
We note a useful fact from calculus that sometimes simplifies the application
of the above formula. If we define J(y) as the Jacobian (the matrix of partial
derivatives) of the mapping g −1 (y), and if some particular x and y are related
by y = g(x), then J(y) = M^{−1}(x). Therefore,

fY(y) = fX(g^{−1}(y)) · |J(y)|.    (2)
3 A SINGLE FUNCTION OF MULTIPLE RANDOM VARIABLES

Suppose that X = (X1, . . . , Xn) has joint PDF fX, and that we are interested in the PDF of Y1 = g1(X). A standard device is to augment Y1 with the auxiliary variables Yi = Xi, i = 2, . . . , n, so that the resulting map g is invertible, apply formula (2), and then integrate out y2, . . . , yn. If, for fixed y2, . . . , yn, the equation y1 = g1(x1, y2, . . . , yn) has the solution x1 = h(y), then

g^{−1}(y) = (h(y), y2, . . . , yn),

and the Jacobian of g^{−1} is of the form

J(y) = [ ∂h/∂y1  ∂h/∂y2  · · ·  ∂h/∂yn ]
       [   0        1     · · ·    0   ]
       [   :        :     · ·     :   ]
       [   0        0     · · ·    1   ]

It follows that |J(y)| = |(∂h/∂y1)(y)|, and

fY(y) = fX(h(y), y2, . . . , yn) · |(∂h/∂y1)(y)|.

Integrating, we obtain

fY1(y1) = ∫ fX(h(y), y2, . . . , yn) · |(∂h/∂y1)(y)| dy2 · · · dyn.
Example. Let X1 and X2 be positive, jointly continuous, random variables, and sup
pose that we wish to derive the PDF of Y1 = g(X1 , X2 ) = X1 X2 . We define Y2 = X2 .
From the relation x1 = y1/x2, we see that h(y1, y2) = y1/y2. The partial derivative ∂h/∂y1 is 1/y2. We obtain

fY1(y1) = ∫ fX(y1/y2, y2) (1/y2) dy2 = ∫ fX(y1/x2, x2) (1/x2) dx2.
For a special case, suppose that X1 and X2 are independent and uniform on (0, 1). Their common PDF is fXi(xi) = 1, for xi ∈ [0, 1]. Note that fY1(y1) = 0 for y1 ∉ (0, 1). Furthermore, fX1(y1/x2) is positive (and equal to 1) only in the range x2 ≥ y1. Also, fX2(x2) is positive, and equal to 1, iff x2 ∈ (0, 1). In particular,

fX(y1/x2, x2) = fX1(y1/x2) fX2(x2) = 1,   for x2 ≥ y1.
We then obtain

fY1(y1) = ∫_{y1}^1 (1/x2) dx2 = −log y1,   y1 ∈ (0, 1).
The direct approach to this problem would first involve the calculation of FY1 (y1 ) =
P(X1 X2 ≤ y1 ). It is actually easier to calculate
1 − FY1(y1) = P(X1 X2 ≥ y1) = ∫_{y1}^1 ∫_{y1/x1}^1 dx2 dx1
            = ∫_{y1}^1 (1 − y1/x1) dx1
            = (x1 − y1 log x1) |_{y1}^1 = (1 − y1) + y1 log y1.

Thus, FY1(y1) = y1 − y1 log y1. Differentiating, we find that fY1(y1) = −log y1.
An even easier solution for this particular problem (along the lines of the stick
example in Lecture 9) is to realize that conditioned on X1 = x1 , the random variable
Y1 = X1 X2 is uniform on [0, x1 ], and using the total probability theorem,
fY1(y1) = ∫_0^1 fX1(x1) fY1|X1(y1 | x1) dx1 = ∫_{y1}^1 (1/x1) dx1 = −log y1.
For the minimum, we have

F_{min_j Xj}(x) = P(min_j Xj ≤ x) = 1 − P(X1, . . . , Xn > x)
              = 1 − (1 − FX1(x)) · · · (1 − FXn(x)).

Let us consider the special case where the X1, . . . , Xn are i.i.d., with common CDF F and PDF f. For simplicity, assume that F is differentiable everywhere. Then,

F_{max_j Xj}(x) = F^n(x),   F_{min_j Xj}(x) = 1 − (1 − F(x))^n,

implying that

f_{max_j Xj}(x) = n F^{n−1}(x) f(x),   f_{min_j Xj}(x) = n(1 − F(x))^{n−1} f(x).
Exercise: Show that the joint PDF of the order statistics X(1), . . . , X(n) of n i.i.d. random variables with common PDF f is

fX(1),...,X(n)(x1, . . . , xn) = n! f(x1) · · · f(xn),   x1 < x2 < · · · < xn,

and fX(1),...,X(n)(x1, . . . , xn) = 0, otherwise. Use this to derive the densities for max_j Xj and min_j Xj.
A first derivation involves plain calculus. Let fX,Y be the joint PDF of X and Y. Then,

P(X + Y ≤ z) = ∫∫_{{x,y | x+y≤z}} fX,Y(x, y) dx dy = ∫_{−∞}^∞ ∫_{−∞}^{z−x} fX,Y(x, y) dy dx.

Differentiating with respect to z, this gives

fX+Y(z) = ∫_{−∞}^∞ fX,Y(x, z − x) dx.

In the special case where X and Y are independent, we have fX,Y(x, y) = fX(x) fY(y), resulting in the convolution formula

fX+Y(z) = ∫_{−∞}^∞ fX(x) fY(z − x) dx.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 11 10/15/2008
ABSTRACT INTEGRATION — I
Contents
1. Preliminaries
2. The main result
3. The Riemann integral
4. The integral of a nonnegative simple function
5. The integral of a nonnegative function
6. The general case
The material in these notes can be found in practically every textbook that
includes basic measure theory, although the order with which various properties
are proved can be somewhat different in different sources.
1 PRELIMINARIES
The objective of these notes is to define the integral ∫ g dµ [sometimes also denoted ∫ g(ω) dµ(ω)] of a measurable function g : Ω → R, defined on a measure space (Ω, F, µ).
Special cases:

(a) If µ = P is a probability measure and X is a random variable, then ∫ X dP serves as the definition of the expectation E[X].

(b) If µ is the Lebesgue measure on R, then ∫ g dµ corresponds to the ordinary integral ∫ g(x) dx of calculus, whenever the latter is defined.
The program: We will define the integral ∫ g dµ for progressively more general classes of measurable functions g:

(a) Finite nonnegative functions g that take finitely many values (“simple functions”). In this case, the integral is just a suitably weighted sum of the values of g.

(b) Nonnegative functions g. Here, the integral will be defined by approximating g from below by a sequence of simple functions.

(c) General functions g. This is done by decomposing g in the form g = g+ − g−, where g+ and g− are nonnegative functions, and letting ∫ g dµ = ∫ g+ dµ − ∫ g− dµ.
We will be focusing on the integral over the entire set Ω. The integral over a
(measurable) subset B of Ω will be defined by letting
∫_B g dµ = ∫ (1_B g) dµ.
Once the construction is carried out, integrals of nonnegative functions will al
ways be well-defined. For general functions, integrals will be left undefined
only when an expression of the form ∞ − ∞ is encountered.
The following properties will turn out to be true, whenever the integrals or
expectations involved are well-defined. On the left, we show the general version;
on the right, we show the same property, specialized to the case of probability
measures. In property 8, the convention 0 · ∞ will be in effect when needed.
1. ∫ 1_B dµ = µ(B)                                  E[1_B] = P(B)
2. g ≥ 0 ⇒ ∫ g dµ ≥ 0                               X ≥ 0 ⇒ E[X] ≥ 0
3. g = 0, a.e. ⇒ ∫ g dµ = 0                          X = 0, a.s. ⇒ E[X] = 0
4. g ≤ h ⇒ ∫ g dµ ≤ ∫ h dµ                           X ≤ Y ⇒ E[X] ≤ E[Y]
4'. g ≤ h, a.e. ⇒ ∫ g dµ ≤ ∫ h dµ                    X ≤ Y, a.s. ⇒ E[X] ≤ E[Y]
5. g = h, a.e. ⇒ ∫ g dµ = ∫ h dµ                     X = Y, a.s. ⇒ E[X] = E[Y]
6. [g ≥ 0, a.e., and ∫ g dµ = 0] ⇒ g = 0, a.e.       [X ≥ 0, a.s., and E[X] = 0] ⇒ X = 0, a.s.
7. ∫ (g + h) dµ = ∫ g dµ + ∫ h dµ                    E[X + Y] = E[X] + E[Y]
8. ∫ (ag) dµ = a ∫ g dµ                              E[aX] = aE[X]
9. 0 ≤ gn ↑ g ⇒ ∫ gn dµ ↑ ∫ g dµ                     0 ≤ Xn ↑ X ⇒ E[Xn] ↑ E[X]
9'. 0 ≤ gn ↑ g, a.e. ⇒ ∫ gn dµ ↑ ∫ g dµ              0 ≤ Xn ↑ X, a.s. ⇒ E[Xn] ↑ E[X]
10. g ≥ 0 ⇒ ν(B) = ∫_B g dµ is a measure             [f ≥ 0 and ∫ f dP = 1] ⇒ ν(B) = ∫_B f dP is a probability measure
3 THE RIEMANN INTEGRAL

Let g : [a, b] → R be a bounded function. We recall the definition of the (Riemann) integral ∫_a^b g(x) dx. We subdivide the interval [a, b] using a finite sequence σ = (x1, x2, . . . , xn) of points satisfying a = x1 < x2 < · · · < xn = b, and define

U(σ) = Σ_{i=1}^{n−1} max_{xi ≤ x < xi+1} g(x) · (xi+1 − xi),
L(σ) = Σ_{i=1}^{n−1} min_{xi ≤ x < xi+1} g(x) · (xi+1 − xi).
Thus, U(σ) and L(σ) are approximations of the “area under the curve g,” from above and from below, respectively. We say that the integral ∫_a^b g(x) dx is well-defined, and equal to a constant c, if

sup_σ L(σ) = inf_σ U(σ) = c.

In this case, we also say that g is Riemann-integrable over [a, b]. Intuitively, we want the upper and lower approximants U(σ) and L(σ) to agree, in the limit of very fine subdivisions of the interval [a, b].
It is known that if g : R → R is Riemann-integrable over every inter
val [a, b], then g is continuous almost everywhere (i.e., there exists a set S of
Lebesgue measure zero, such that g is continuous at every x ∈ / S). This is a
severe limitation on the class of Riemann-integrable functions.
Example. Let Q be the set of rational numbers in [0, 1]. Let g = 1_Q. For any σ = (x1, x2, . . . , xn), and every i, the interval [xi, xi+1) contains a rational number, and also an irrational number. Thus, max_{xi ≤ x < xi+1} g(x) = 1 and min_{xi ≤ x < xi+1} g(x) = 0. It follows that U(σ) = 1 and L(σ) = 0, for all σ, and sup_σ L(σ) ≠ inf_σ U(σ). Therefore, 1_Q is not Riemann integrable. On the other hand, if we consider a uniform distribution over [0, 1], and the binary random variable 1_Q, we have P(1_Q = 1) = 0, and we would like to be able to say that E[1_Q] = ∫_{[0,1]} 1_Q(x) dx = 0. This indicates that a different definition is in order.
4 THE INTEGRAL OF A NONNEGATIVE SIMPLE FUNCTION

Definition 1. Let g be a nonnegative simple function, with canonical representation

g = Σ_{i=1}^k ai 1_{Ai},    (1)

where the ai are distinct and the Ai are disjoint events. Then, ∫ g dµ = Σ_{i=1}^k ai µ(Ai). When µ = P is a probability measure, this yields E[g] = Σ_i ai P(Ai), which agrees with the elementary definition of E[X] for discrete random variables.
Note, for future reference, that the sum or difference of two simple functions
is also simple.
We note a few immediate consequences of the definition. For any B ∈ F, the function 1_B is simple and ∫ 1_B dµ = µ(B), which verifies property 1. In particular, when Q is the set of rational numbers and µ is the Lebesgue measure, we have ∫ 1_Q dµ = µ(Q) = 0, as desired. Note that a nonnegative simple function has a representation of the form (1) with all ai positive. It follows that ∫ g dµ ≥ 0, which verifies property S-2.

Suppose now that a simple function satisfies g = 0, a.e. Then, it has a canonical representation of the form g = Σ_{i=1}^k ai 1_{Ai}, where µ(Ai) = 0, for every i. Definition 1 implies that ∫ g dµ = 0, which verifies property S-3.
Let us now verify the linearity property S-7. Let g and h be nonnegative
simple functions. Using canonical representations, we can write
g = Σ_{i=1}^k ai 1_{Ai},   h = Σ_{j=1}^m bj 1_{Bj},

where the sets Ai are disjoint, and the sets Bj are also disjoint. Then, the sets Ai ∩ Bj are disjoint, and

g + h = Σ_{i=1}^k Σ_{j=1}^m (ai + bj) 1_{Ai ∩ Bj}.

Therefore,

∫ (g + h) dµ = Σ_{i=1}^k Σ_{j=1}^m (ai + bj) µ(Ai ∩ Bj)
            = Σ_{i=1}^k ai Σ_{j=1}^m µ(Ai ∩ Bj) + Σ_{j=1}^m bj Σ_{i=1}^k µ(Ai ∩ Bj)
            = Σ_{i=1}^k ai µ(Ai) + Σ_{j=1}^m bj µ(Bj)
            = ∫ g dµ + ∫ h dµ.
(The first and fourth equalities follow from Definition 1. The third equality made
use of finite additivity for µ.)
Property S-8 is an immediate consequence of Definition 1. We only need to be careful for the case where ∫ g dµ = ∞ and a = 0. Using the convention 0 · ∞ = 0, we see that ag = 0, so that ∫ (ag) dµ = 0 = 0 · ∞ = a ∫ g dµ, and the property holds in this case as well.
We will now verify that with this definition, properties N-2 to N-10 are all
satisfied. This is easy for some (e.g., property N-2). Most of our effort will
be devoted to establishing properties N-7 (linearity) and N-9 (monotone conver
gence theorem).
The arguments that follow will make occasional use of the following conti
nuity property for monotonic sequences of measurable sets Bi : If Bi ↑ B, then
µ(Bi ) ↑ µ(B). This property was established in the notes for Lecture 1, for the
special case where µ is a probability measure, but the same proof applies to the
general case.
Property N-4': Suppose that g ≤ h, a.e. Then, there exists a function g' such that g' ≤ h and g = g', a.e. Property N-5 yields ∫ g dµ = ∫ g' dµ, and property N-4 yields ∫ g' dµ ≤ ∫ h dµ. These imply that ∫ g dµ ≤ ∫ h dµ.
Property N-6: Suppose that g ≥ 0, but the relation g = 0, a.e., is not true. We will show that ∫ g dµ > 0. Let B = {ω | g(ω) > 0}. Then, µ(B) > 0. Let Bn = {ω | g(ω) > 1/n}. Then, Bn ↑ B and, therefore, µ(Bn) ↑ µ(B) > 0. This shows that for some n we have µ(Bn) > 0. Note that g ≥ (1/n) 1_{Bn}. Then, properties S-4, S-8, and 1 yield

∫ g dµ ≥ ∫ (1/n) 1_{Bn} dµ = (1/n) ∫ 1_{Bn} dµ = (1/n) µ(Bn) > 0.
Property N-8, when a ≥ 0: If a = 0, the result is immediate. Assume that a > 0. It is not hard to see that q ∈ S(g) if and only if aq ∈ S(ag). Thus,

∫ (ag) dµ = sup_{q ∈ S(ag)} ∫ q dµ = sup_{q ∈ S(g)} ∫ (aq) dµ = sup_{q ∈ S(g)} a ∫ q dµ = a ∫ g dµ.
(i) Suppose first that ∫ q dµ = ∞, so that µ(Ai) = ∞ for some i with ai > 0. Let Bn = {ω ∈ Ai | gn(ω) > ai/2}. For every ω ∈ Ai, there exists some n such that gn(ω) > ai/2. Therefore, Bn ↑ Ai. From the continuity of measures, we obtain µ(Bn) ↑ ∞. Now, note that gn ≥ (ai/2) 1_{Bn}. Then, using property N-4, we have

∫ gn dµ ≥ (ai/2) µ(Bn) ↑ ∞ = ∫ q dµ.

(ii) Suppose now that ∫ q dµ < ∞. Then, µ(Ai) < ∞, for all i ∈ S. Let

Bn = {ω ∈ A | gn(ω) ≥ q(ω) − (1/r)}.

For ω ∈ Bn, we have gn(ω) + (1/r) ≥ q(ω). Thus, gn + (1/r) 1_{Bn} ≥ 1_{Bn} q. Using properties N-4 and S-7, together with Eq. (2), we have

∫ gn dµ + (1/r) ∫ 1_{Bn} dµ ≥ ∫ 1_{Bn} q dµ = ∫ q dµ − ∫ 1_{A\Bn} q dµ ≥ ∫ q dµ − a µ(A \ Bn).
0 ≤ min{gn, q} ↑ min{g, q} = q.

Therefore,

lim_{n→∞} ∫ gn dµ ≥ lim_{n→∞} ∫ min{gn, q} dµ = ∫ q dµ.

(The inequality above uses property N-4; the equality relies on the fact that we already proved the MCT for the case where the limit function is simple.) By taking the supremum over q ∈ S(g), we obtain

lim_{n→∞} ∫ gn dµ ≥ sup_{q ∈ S(g)} ∫ q dµ = ∫ g dµ.

On the other hand, we have gn ≤ g, so that ∫ gn dµ ≤ ∫ g dµ. Therefore, lim_{n→∞} ∫ gn dµ ≤ ∫ g dµ, which concludes the proof of property N-9.
To prove property 9', suppose that gn ↑ g, a.e. Then, there exist functions gn' and g', such that gn = gn', a.e., g = g', a.e., and gn' ↑ g'. By combining property N-5 with the version of the MCT already proved, we obtain ∫ gn dµ = ∫ gn' dµ ↑ ∫ g' dµ = ∫ g dµ.
gr(ω) = { r,       if g(ω) ≥ r,
          i/2^r,   if i/2^r ≤ g(ω) < (i + 1)/2^r,  i = 0, 1, . . . , r·2^r − 1.
In words, the function gr is a quantized version of g. For every ω, the value of
g(ω) is first capped at r, and then rounded down to the nearest multiple of 2−r .
We note a few properties of gr that are direct consequences of its definition.
(a) For every r, the function gr is simple (and, in particular, measurable).
(b) We have 0 ≤ gr ↑ g; that is, for every ω, we have gr (ω) ↑ g(ω).
(c) If g is bounded above by c and r ≥ c, then |gr (ω) − g(ω)| ≤ 1/2r , for
every ω.
Statement (b) above gives us a transparent characterization of the set of measurable functions. Namely, a nonnegative function is measurable if and only if it is the monotonic and pointwise limit of simple functions. Furthermore, the MCT indicates that ∫ gr dµ ↑ ∫ g dµ, for this particular choice of simple functions gr. (In an alternative line of development of the subject, some texts start by defining ∫ g dµ as the limit of ∫ gr dµ.)
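The quantized approximation gr is simple enough to implement directly; the sketch below (ours) shows the cap-and-round-down construction and the pointwise monotone convergence gr ↑ g for an arbitrary nonnegative function.

```python
# Our sketch of the quantized approximation g_r: cap g at r, then round
# down to the nearest multiple of 2^{-r}. For each fixed point, the
# values increase with r toward g(w), illustrating 0 <= g_r up to g.
from math import floor

def quantize(g, r):
    def g_r(w):
        v = g(w)
        return float(r) if v >= r else floor(v * 2**r) / 2**r
    return g_r

g = lambda w: w * w    # an arbitrary nonnegative function, for illustration
for r in (1, 2, 4, 8):
    g_r = quantize(g, r)
    print(r, [g_r(w) for w in (0.3, 1.7, 2.9)])
# the values climb toward g(w) = 0.09, 2.89, 8.41 (the last capped at r)
```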
5.4 Linearity
We now prove linearity (property N-7). Let gr and hr be the approximants of
g and h, respectively, defined in Section 5.3. Since gr ↑ g and hr ↑ h, we have
(gr + hr ) ↑ (g + h). Therefore, using the MCT and property S-7 (linearity for
simple functions),
� �
(g + h) dµ = lim (gr + hr ) dµ
r→∞
�� � �
= lim gr dµ + hr dµ
r→∞
� �
= lim gr dµ + lim hr dµ
r→∞ r→∞
� �
= g dµ + h dµ.
6 THE GENERAL CASE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 12 10/20/2008
ABSTRACT INTEGRATION — II
Contents
1. Borel-Cantelli revisited
2. Connections between abstract integration and elementary definitions of
integrals and expectations
3. Fatou’s lemma
4. Dominated convergence theorem
1 BOREL-CANTELLI REVISITED

2 CONNECTIONS BETWEEN ABSTRACT INTEGRATION AND ELEMENTARY DEFINITIONS

Abstract integration would not be a useful theory if it were inconsistent with the more elementary notions of integration. For discrete random variables taking values in a finite range, this consistency is automatic because of the definition of an integral of a simple function. We will now verify some additional aspects of this consistency.

2.1 Connection with Riemann integration
We state here, without proof, the following reassuring result. Let Ω = R, endowed with the usual Borel σ-field. Let λ be the Lebesgue measure. Suppose that g is a measurable function which is Riemann integrable on some interval [a, b]. Then, the Riemann integral ∫_a^b g(x) dx is equal to ∫_{[a,b]} g dλ = ∫ 1_{[a,b]} g dλ.
Theorem 1. Let X be a random variable on a probability space (Ω, F, P), let g : R → R be measurable, and let Y = g(X). We have

∫ Y dP = ∫ g dPX = ∫ y dPY(y),

in the sense that if one of the integrals is well defined, then so are the others, and they are all equal.
Proof: We follow the “standard program”: first establish the result for simple
functions, then take the limit to deal with nonnegative functions, and finally
generalize.
Let g be a simple function, which takes values in a finite set {y1, . . . , yk}. Using the definition of the integral of a simple function, we have

∫ Y dP = Σ_{yi} yi P({ω | Y(ω) = yi}) = Σ_{yi} yi P({ω | g(X(ω)) = yi}).

Similarly,

∫ g dPX = Σ_{yi} yi PX({x | g(x) = yi}).

However, from the definition of PX, we obtain

P({ω | g(X(ω)) = yi}) = PX({x | g(x) = yi}),

and the first equality in the theorem follows, for simple functions.
Let now g be a nonnegative function, and let {gn} be an increasing sequence
of nonnegative simple functions that converges to g. Note that gn (X) converges
monotonically to g(X). We then have
∫ Y dP = ∫ g(X) dP = lim_{n→∞} ∫ gn(X) dP = lim_{n→∞} ∫ gn dPX = ∫ g dPX.
(The second equality is the MCT; the third is the result that we already proved
for simple functions; the last equality is once more the MCT.)
The case of general (not just nonnegative) functions follows easily from the
above – the details are omitted. This proves the first equality in the theorem.
For the second equality, note that by considering the special case g(x) = x, we obtain Y = X and ∫ X dP = ∫ x dPX. By a change of notation, we have also established that ∫ Y dP = ∫ y dPY.
When f is Riemann integrable and the set A is an interval, we can also write PX(A) = ∫_A f(x) dx, where the latter integral is an ordinary Riemann integral.
Proof: The first equality holds by definition. The second was shown in Theorem 1. So, let us concentrate on the third. Following the usual program, let us first consider the case where g is a simple function, of the form g = Σ_{i=1}^k ai 1_{Ai}, for some measurable disjoint subsets Ai of the real line. We have

∫ g dPX = Σ_{i=1}^k ai PX(Ai)
        = Σ_{i=1}^k ai ∫_{Ai} f dλ
        = Σ_{i=1}^k ai ∫ 1_{Ai} f dλ
        = ∫ (Σ_{i=1}^k ai 1_{Ai}) f dλ
        = ∫ (gf) dλ.
The first equality is the definition of the integral for simple functions. The sec
ond uses Eq. (1). The fourth uses linearity of integrals. The fifth uses the defini
tion of g.
Suppose now that g is a nonnegative function, and let {gn} be an increasing sequence of nonnegative simple functions that converges to g, pointwise. Since f is nonnegative, note that gn f also increases monotonically and converges to gf. Then,

∫ g dPX = lim_{n→∞} ∫ gn dPX = lim_{n→∞} ∫ (gn f) dλ = ∫ (gf) dλ.
The first and the third equality above is the MCT. The middle equality is the
result we already proved, for the case of a simple function gn .
Finally, if g is not nonnegative, the result is proved by considering separately
the positive and negative parts of g.
When g and f are “nice” functions, e.g., piecewise continuous, Theorem 2 yields the familiar formula

E[g(X)] = ∫ g(x) f(x) dx,

with the integral interpreted in the ordinary (Riemann) sense.
3 FATOU’S LEMMA
Note that for any two random variables, we have min{X, Y } ≤ X and min{X, Y } ≤
Y . Taking expectations, we obtain E[min{X, Y }] ≤ min{E[X], E[Y ]}. Fa
tou’s lemma is in the same spirit, except that infinitely many random variables
are involved, as well as a limiting operation, so some additional technical con
ditions are needed.
Fatou's lemma. Let {Xn} be a sequence of random variables, and let Y be a random variable with E[|Y|] < ∞.

(a) If Y ≤ Xn, for all n, then E[lim inf_{n→∞} Xn] ≤ lim inf_{n→∞} E[Xn].

(b) If Xn ≤ Y, for all n, then E[lim sup_{n→∞} Xn] ≥ lim sup_{n→∞} E[Xn].
By considering the case where Y = 0, we see that parts (a) and (b) ap
ply in particular to the case of nonnegative (respectively, nonpositive) random
variables.
Proof: Let us only prove the first part; the second part follows from a symmetrical argument, or by simply applying the first part to the random variables −Y and −Xn.

Fix some n. We have

inf_{k≥n} Xk − Y ≤ Xm − Y,   ∀ m ≥ n.

Taking expectations, we obtain

E[inf_{k≥n} Xk − Y] ≤ E[Xm − Y],   ∀ m ≥ n,

and, taking the infimum over m ≥ n,

E[inf_{k≥n} Xk − Y] ≤ inf_{m≥n} E[Xm − Y].

Note that the sequence inf_{k≥n} Xk − Y is nonnegative and nondecreasing with n, and converges to lim inf_{n→∞} Xn − Y. Taking the limit of both sides of the preceding inequality, and using the MCT for the left-hand side term, we obtain

E[lim inf_{n→∞} Xn − Y] ≤ lim inf_{n→∞} E[Xn − Y].

Note that E[Xn − Y] = E[Xn] − E[Y]. (This step makes use of the assumption that E[|Y|] < ∞.) For similar reasons, the term −E[Y] can also be removed from the left-hand side. After canceling the terms E[Y] from the two sides, we obtain the desired result.
We note that Fatou’s lemma remains valid for the case of general measures
and integrals.
Corollary 1. Let {Zn} be a sequence of random variables such that Σ_{n=1}^∞ E[|Zn|] < ∞. Then, E[Σ_{n=1}^∞ Zn] = Σ_{n=1}^∞ E[Zn].

Proof: By the monotone convergence theorem, applied to Yn = Σ_{k=1}^n |Zk|, we have

E[Σ_{n=1}^∞ |Zn|] = Σ_{n=1}^∞ E[|Zn|] < ∞.

Let Xn = Σ_{i=1}^n Zi and note that lim_{n→∞} Xn = Σ_{i=1}^∞ Zi. We observe that |Xn| ≤ Σ_{i=1}^∞ |Zi|, which has finite expectation, as shown earlier. The result follows from the dominated convergence theorem.
Exercise: Can you prove Corollary 1 directly from the monotone convergence theorem, without appealing to the DCT or Fatou's lemma?
Remark: We note that the MCT, DCT, and Fatou's lemma are also valid for general measures, not just for probability measures (the proofs are the same); just replace expressions such as E[X] by integrals ∫ g dµ.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 13 10/22/2008
Contents
1. Product measure
2. Fubini’s theorem
1 PRODUCT MEASURE
Consider two probability spaces (Ω1, F1, P1) and (Ω2, F2, P2), and let Ω = Ω1 × Ω2. For A1 ∈ F1, let

A1 × Ω2 = {(ω1, ω2) | ω1 ∈ A1, ω2 ∈ Ω2}.

Thus, we would like our σ-field on Ω to include all sets of the form A1 × Ω2 (with A1 ∈ F1) and, by symmetry, all sets of the form Ω1 × A2 (with A2 ∈ F2). This leads us to the following definition.

Note that the notation F1 × F2 is misleading: this is not the Cartesian product of F1 and F2!

Since σ-fields are closed under intersection, we observe that if Ai ∈ Fi, then A1 × A2 = (A1 × Ω2) ∩ (Ω1 × A2) ∈ F1 × F2. It turns out (and is not hard to show) that F1 × F2 can also be defined as the smallest σ-field containing all sets of the form A1 × A2, where Ai ∈ Fi.
The most relevant example of a σ-finite measure is the Lebesgue measure on the real line. Indeed, the real line can be broken into a countable sequence of intervals (n, n + 1], each of which has finite Lebesgue measure.

More generally, for any “nice” set of the form encountered in calculus, e.g., sets of the form A = {(x, y) | f(x, y) ≤ c}, where f is a continuous function, λ2(A) coincides with the usual notion of the area of A.
Remark for those of you who know a little bit of topology – otherwise ignore it. We could define the Borel σ-field on R^2 as the σ-field generated by the collection of open subsets of R^2. (This is the standard way of defining Borel sets in topological spaces.) It turns out that this definition results in the same σ-field as the method of Section 1.2.
2 FUBINI’S THEOREM
(b) indicator functions of measurable sets are measurable;
(c) combining measurable functions in the usual ways (e.g., adding them, multiplying them, taking limits, etc.) results in measurable functions.
Fubini’s theorem holds under two different sets of conditions: (a) nonnega
tive functions g (compare with the MCT); (b) functions g whose absolute value
has a finite integral (compare with the DCT). We state the two versions sepa
rately, because of some subtle differences.
The two statements below are taken verbatim from the text by Adams &
Guillemin, with minor changes to conform to our notation.
(e) We have

∫_{Ω1} (∫_{Ω2} g(ω1, ω2) dP2) dP1 = ∫_{Ω2} (∫_{Ω1} g(ω1, ω2) dP1) dP2 = ∫_{Ω1×Ω2} g dP.

Note that some of the integrals above may be infinite, but this is not a problem; since everything is nonnegative, expressions of the form ∞ − ∞ do not arise.
Recall now that a function is said to be integrable if it is measurable and the
integral of its absolute value is finite.
(e) We have
∫_{Ω1} ( ∫_{Ω2} g(ω1, ω2) dP2 ) dP1 = ∫_{Ω2} ( ∫_{Ω1} g(ω1, ω2) dP1 ) dP2 = ∫_{Ω1×Ω2} g(ω1, ω2) dP,
where P = P1 × P2.
We repeat that all of these results remain valid when dealing with σ-finite
measures, such as the Lebesgue measure on R². This provides us with conditions
for the familiar calculus formula
∫ ( ∫ g(x, y) dx ) dy = ∫ ( ∫ g(x, y) dy ) dx.
so all we need is to work with the right hand side, and integrate one variable at
a time, possibly also using some bounds on the way.
Finally, let us note that all the hard work goes into proving Theorem 2.
Theorem 3 is relatively easy to derive once Theorem 2 is available: Given a
function g, decompose it into its positive and negative parts, apply Theorem 2
to each part, and in the process make sure that you do not encounter expressions
of the form ∞ − ∞.
∫_A g dP1 = Σ_{a∈A} g(a),   ∫_B h dP2 = Σ_{b∈B} h(b),   ∫_C f d(P1 × P2) = Σ_{c∈C} f(c).
Consider now the function f on N × N whose values f(n, m) are given by the
following array:
1 −1 0 0 ···
0 1 −1 0 · · ·
0 0 1 −1 · · ·
0 0 0 1 ···
.. .. .. .. ..
. . . . .
So
∫_{Ω1} ( ∫_{Ω2} f dP2 ) dP1 = Σ_n ( Σ_m f(n, m) ) = 0 ≠ 1 = Σ_m ( Σ_n f(n, m) ) = ∫_{Ω2} ( ∫_{Ω1} f dP1 ) dP2.
The problem is that the function we are integrating is neither nonnegative nor
integrable.
3.2 σ-finiteness
Let Ω1 = (0, 1), let F1 be the Borel sets, and let P1 be the Lebesgue measure.
Let Ω2 = (0, 1), let F2 be the set of all subsets of (0, 1), and let P2 be the
counting measure.
Define f(x, y) = 1 if x = y, and 0 otherwise. Then,
∫_{Ω1} ( ∫_{Ω2} f(x, y) dP2(y) ) dP1(x) = ∫_{Ω1} 1 dP1(x) = 1,
but
∫_{Ω2} ( ∫_{Ω1} f(x, y) dP1(x) ) dP2(y) = ∫_{Ω2} 0 dP2(y) = 0.
The problem is that the counting measure on (0, 1) is not σ-finite.
4 AN APPLICATION
Proof: Define A = {(ω, x) | 0 ≤ x ≤ X(ω)}. Intuitively, if Ω = R, then A
would be the region under the curve X(ω). We argue that
E[X] = ∫_Ω X(ω) dP = ∫_Ω ( ∫_0^∞ 1_A(ω, x) dx ) dP,
and now let's postpone the technical issues for a moment and interchange the
integrals to get
E[X] = ∫_0^∞ ( ∫_Ω 1_A(ω, x) dP ) dx = ∫_0^∞ P(X ≥ x) dx.
Now let's consider the technical details necessary to make the above argument
work. The integral interchange can be justified on account of the function 1_A
being nonnegative, so we just need to show that all the functions we deal with
are measurable. In particular, we need to show that:
1. P(X ≥ x) is a measurable function of x.
2. For fixed ω, 1_A(ω, x) is the indicator function of the interval [0, X(ω)],
so it is Lebesgue measurable.
3. To show that 1_A is measurable, we argue that A is measurable. Indeed, the
function g : Ω × R → R defined by g(ω, x) = X(ω) is measurable,
since for any Borel set B, g^{−1}(B) = X^{−1}(B) × (−∞, +∞). Similarly,
h : Ω × R → R defined as h(ω, x) = x is measurable for the same reason.
Since
A = {g ≥ h} ∩ {h ≥ 0},
it follows that A is measurable.
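As a sanity check of the identity E[X] = ∫_0^∞ P(X ≥ x) dx, here is a small Monte Carlo sketch (our own illustration, not from the notes), using X distributed as Exp(1), so that E[X] = 1; the sample size, grid, and truncation point are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(0)
    samples = rng.exponential(scale=1.0, size=200_000)

    xs = np.linspace(0.0, 20.0, 2001)   # truncate the integral at x = 20
    dx = xs[1] - xs[0]
    tail = np.array([(samples >= x).mean() for x in xs])
    integral = tail.sum() * dx          # Riemann sum of P(X >= x)

    print(samples.mean(), integral)     # both should be close to E[X] = 1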
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 14 10/27/2008
Contents
1. Moment generating functions
1 MOMENT GENERATING FUNCTIONS
1.1 Definition
MX (s) = E[esX ].
Note that this is essentially the same as the definition of the Laplace transform
of a function fX , except that we are using s instead of −s in the exponent.
Exercise 2. Suppose that
lim sup_{x→∞} (log P(X > x))/x ≤ −ν < 0.
(a) Suppose that MX (s) is finite for all s in an interval of the form [−a, a],
where a is a positive number. Then, MX determines uniquely the CDF of
the random variable X.
(b) If MX (s) = MY (s) < ∞, for all s ∈ [−a, a], where a is a positive number,
then the random variables X and Y have the same CDF.
There are explicit formulas that allow us to recover the PMF or PDF of a ran
dom variable starting from the associated transform, but they are quite difficult
to use (e.g., involving “contour integrals”). In practice, transforms are usually
inverted by “pattern matching,” based on tables of known distribution-transform
pairs.
the order of integration and differentiation, we obtain
dMX(s)/ds |_{s=0} = d E[e^{sX}]/ds |_{s=0} = E[X e^{sX}] |_{s=0} = E[X],
and, more generally,
d^m MX(s)/ds^m |_{s=0} = d^m E[e^{sX}]/ds^m |_{s=0} = E[X^m e^{sX}] |_{s=0} = E[X^m].
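The moment-extraction rule above is easy to check symbolically. The sketch below (our own illustration, not from the notes) differentiates the MGF of an exponential random variable, derived in the examples that follow, and recovers the moments E[X^m] = m!/λ^m:

    import sympy as sp

    s, lam = sp.symbols('s lam', positive=True)
    M = lam / (lam - s)                  # MGF of Exp(lam), finite for s < lam

    for m in range(1, 4):
        moment = sp.diff(M, s, m).subs(s, 0)   # m-th derivative at s = 0
        print(m, sp.simplify(moment))          # 1/lam, 2/lam**2, 6/lam**3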
1.6 Examples
Example: X =d Exp(λ). Then,
MX(s) = ∫_0^∞ e^{sx} λe^{−λx} dx = λ/(λ − s), for s < λ, and MX(s) = ∞, otherwise.
Example: X =d Ge(p). Then,
MX(s) = Σ_{m=1}^∞ e^{sm} p(1 − p)^{m−1} = pe^s/(1 − (1 − p)e^s), for e^s < 1/(1 − p), and MX(s) = ∞, otherwise.
In this case, we also find gX(s) = ps/(1 − (1 − p)s), for s < 1/(1 − p), and
gX(s) = ∞, otherwise.
Example: X =d N(0, 1). Then,
MX(s) = (1/√(2π)) ∫_{−∞}^∞ exp(sx) exp(−x²/2) dx
= exp(s²/2) · (1/√(2π)) ∫_{−∞}^∞ exp(−(x² − 2sx + s²)/2) dx
= exp(s²/2).
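These closed forms are easy to test by simulation. The following sketch (our own illustration; the parameter values are arbitrary) compares empirical averages of e^{sX} with the formulas just derived:

    import numpy as np

    rng = np.random.default_rng(1)
    lam, s = 2.0, 0.7                    # any s < lam works for the exponential

    x_exp = rng.exponential(scale=1/lam, size=500_000)
    x_norm = rng.standard_normal(500_000)

    print(np.exp(s * x_exp).mean(), lam / (lam - s))    # MGF of Exp(lam)
    print(np.exp(s * x_norm).mean(), np.exp(s**2 / 2))  # MGF of N(0, 1)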
Theorem 2.
For part (b), we have
For part (c), by conditioning on the random choice between X and Y , we have
implying that
MY(s) = Σ_{n=1}^∞ exp(n log MX(s)) P(N = n) = MN(log MX(s)).
The reader is encouraged to take the derivative of the above expression, and
evaluate it at s = 0, to recover the formula E[Y ] = E[N ]E[X].
Example: Suppose that each Xi is exponentially distributed, with parameter λ, and
that N is geometrically distributed, with parameter p ∈ (0, 1). We find that
MY(s) = MN(log MX(s)) = pλ/(pλ − s), for s < pλ; that is, Y is exponentially
distributed with parameter pλ.
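A simulation sketch of this example (our own illustration; the parameter values are arbitrary): the random sum Y = X1 + · · · + XN should behave like an Exp(pλ) random variable, with mean 1/(pλ) and variance 1/(pλ)²:

    import numpy as np

    rng = np.random.default_rng(2)
    lam, p, trials = 3.0, 0.25, 100_000

    n = rng.geometric(p, size=trials)    # N takes values 1, 2, ...
    y = np.array([rng.exponential(1/lam, size=k).sum() for k in n])

    print(y.mean(), 1 / (p * lam))       # both ~ 1.333
    print(y.var(), 1 / (p * lam) ** 2)   # both ~ 1.778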
If two random variables X and Y are described by some joint distribution (e.g.,
a joint PDF), then each one is associated with a transform MX (s) or MY (s).
These are the transforms of the marginal distributions and do not convey infor
mation on the dependence between the two random variables. Such information
is contained in a multivariate transform, which we now define.
Consider n random variables X1, . . . , Xn related to the same experiment.
Let s1, . . . , sn be real parameters. The associated multivariate transform is a
function of these n parameters, defined by
M_{X1,...,Xn}(s1, . . . , sn) = E[exp(s1X1 + · · · + snXn)].
Example:
(a) Consider two random variables X and Y. Their joint transform is
M_{X,Y}(s1, s2) = E[e^{s1X + s2Y}].
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 15 10/29/2008
Contents
(a) We say that A is positive definite, and write A > 0, if xT Ax > 0, for every
nonzero x ∈ Rn .
(b) We say that A is nonnegative definite, and write A ≥ 0, if xT Ax ≥ 0, for
every x ∈ Rn .
(d) To each eigenvalue of a symmetric matrix, we can associate a real eigen
vector. Eigenvectors associated with distinct eigenvalues are orthogonal;
eigenvectors associated with repeated eigenvalues can always be taken to be
orthogonal. Without loss of generality, all these eigenvectors can be nor
malized so that they have unit length, resulting in an orthonormal basis.
(e) The above essentially states that a symmetric matrix becomes diagonal
after a suitable orthogonal change of basis.
Our interest in positive definite matrices stems from the following. When A is
positive definite, the quadratic form q(x) = x^T Ax goes to infinity as ‖x‖ → ∞,
so that e^{−q(x)} decays to zero, as ‖x‖ → ∞, and therefore can be used to define
a multivariate PDF.
There are multiple ways of defining multivariate normal distributions. We
will present three, and will eventually show that they are consistent with each
other.
The first generalizes our definition of the bivariate normal. It is the most
explicit and transparent; on the downside it can lead to unpleasant algebraic
manipulations. Recall that |V | stands for the absolute value of the determinant
of a square matrix V .
fX(x) = (1/√((2π)^n |V|)) exp( −(x − µ)^T V^{−1} (x − µ)/2 ),
X = DW + µ,
for some matrix D and some real vector µ, where W is a random vector
whose components are independent N (0, 1) random variables.
The last definition is possibly the hardest to penetrate, but in the eyes of
some, it is the most elegant.
Cov(X, Y) = E[(X − E[X])(Y − E[Y])^T].
4 KEY PROPERTIES OF THE MULTIVARIATE NORMAL
The theorem below includes almost everything useful there is to know about
multivariate normals. We will state and prove the theorem, working mostly
with Definition 3. The proof of equivalence of the three definitions will be
completed in the next lecture, together with some additional observations.
Proof:
(a) Under definition 3, Xi is a linear function of independent normal random
variables, hence normal. Since E[W] = 0, we have E[Xi ] = µi .
(b) For simplicity, let us just consider the zero mean case. We have
Cov(X, X) = E[XXT ] = E[DWWT DT ] = DE[WWT ]DT = DDT ,
where the last equality follows because the components of W are in
dependent (hence the covariance matrix is diagonal), with unit variance
(hence the diagonal entries are all equal to 1).
(d) This is an exercise in derived distributions. Let us again just consider the
case of µ = 0. We already know (Lecture 10) that when X = DW, with
D invertible, then
fX(x) = fW(D^{−1}x) / |det D|.
In our case, since the Wi are i.i.d. N(0,1), we have
fW(w) = (1/√((2π)^n)) exp( −w^T w/2 ),
leading to
fX(x) = (1/√((2π)^n |DD^T|)) exp( −x^T (D^{−1})^T D^{−1} x/2 ).
(e) Using part (d), the joint PDF of X is completely determined by the matrix
V , which happens to be equal to Cov(X, X), together with the vector µ.
The degenerate case is a little harder, because of the absence of a conve
nient closed form formula. One could think of a limiting argument that
involves injecting a tiny bit of noise in all directions, to make the distri
bution nondegenerate, and then taking the limit. This type of argument
can be made to work, but will involve tedious technicalities. Instead, we
will take a shortcut, based on the inversion property of transforms. This
argument is simpler, but relies on the heavy machinery behind the proof
of the inversion property.
Let us find the multivariate transform MX(s) = E[e^{s^T X}]. We note that
s^T X is normal with mean s^T µ. Letting X̃ = X − µ, the variance of s^T X
is
var(s^T X) = E[(s^T X̃)²] = s^T V s.
Using the formula for the transform of a single normal random variable
(s^T X in this case), we have
MX(s) = E[e^{s^T X}] = M_{s^T X}(1) = e^{s^T µ} e^{s^T V s/2}.
(g) Once more, to simplify notation, let us just deal with the zero-mean case.
Let us define
X̂ = V_{XY} V_{YY}^{−1} Y.
We then have
For part (iii), note that X̃ is independent of Y, which implies that Cov(X̃, X̃ |
Y) = Cov(X̃, X̃). Finally,
Cov(X̃, X̃) = E[X̃ X̃^T] = E[(X − X̂)(X − X̂)^T] = E[(X − X̂)X^T]
= V_{XX} − E[V_{XY} V_{YY}^{−1} Y X^T] = V_{XX} − V_{XY} V_{YY}^{−1} E[Y X^T]
= V_{XX} − V_{XY} V_{YY}^{−1} V_{YX}.
Note that in the case of the bivariate normal, we have Cov(X, Y) = ρσXσY,
V_{XX} = σX², V_{YY} = σY². Then, part (g) of the preceding theorem, for the zero-
mean case, reduces to
E[X | Y] = ρ (σX/σY) Y,   var(X̃) = σX² (1 − ρ²),
which agrees with the formula we derived through elementary means in Lecture 9,
for the special case of unit variances.
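Part (g) can be checked numerically. Here is a simulation sketch (our own illustration, with arbitrary parameters) for a zero-mean bivariate normal pair: regressing X on Y should recover the slope ρσX/σY, and the residual variance should be σX²(1 − ρ²):

    import numpy as np

    rng = np.random.default_rng(3)
    rho, sx, sy, n = 0.6, 2.0, 3.0, 500_000

    # construct (X, Y) with the prescribed variances and correlation
    y = sy * rng.standard_normal(n)
    x = rho * (sx / sy) * y + sx * np.sqrt(1 - rho**2) * rng.standard_normal(n)

    slope = np.dot(x, y) / np.dot(y, y)       # least-squares slope through 0
    resid = x - slope * y

    print(slope, rho * sx / sy)               # both ~ 0.4
    print(resid.var(), sx**2 * (1 - rho**2))  # both ~ 2.56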
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 16 11/3/2008
CHARACTERISTIC FUNCTIONS
Contents
fX(x) = (1/√((2π)^n |V|)) exp( −(x − µ)^T V^{−1} (x − µ)/2 ),
X = DW + µ,
for some matrix D and some real vector µ, where W is a random vector
whose components are independent N (0, 1) random variables.
2 PROOF OF EQUIVALENCE
In the course of the proof of Theorem 1 in the previous lecture, we argued that
if X is multivariate normal, in the sense of Definition 2, then:
Theorem 1.
Proof:
we see that D is nonsingular, and therefore invertible. Let
W = D−1 (X − µ).
We have shown thus far that the Wi are normal and uncorrelated. We now
proceed to show that they are independent. Using the formula for the PDF
of X and the change of variables formula, we find that the PDF of W is of
the form
for some normalizing constant c, which is the joint PDF of a vector of in
dependent normal random variables. It follows that X = DW + µ is a
multivariate normal in the sense of Definition 2.
(b) Suppose that X satisfies Definition 3, i.e., any linear function aT X is nor
mal. Let V = Cov(X, X), and let D be a symmetric matrix such that
D2 = V . We first give the proof for the easier case where V (and therefore
D) is invertible.
Let W = D^{−1}(X − µ). As before, E[W] = 0, and Cov(W, W) = I. Fix
a vector s. Then, s^T W is a linear function of W, and is therefore normal.
Note that
(b′) Suppose now that V is singular (as opposed to positive definite). For sim-
plicity, we will assume that the mean of X is zero. Then, there exists some
a ≠ 0, such that V a = 0, and a^T V a = 0. Note that
a^T V a = E[(a^T X)²].
The last part of the proof in the previous section provides some interesting intu
ition. Given a multivariate normal vector X, we can always perform a change of
coordinates, and obtain a representation of that vector in terms of independent
normal random variables. Our process of going from X to W involved factoring
the covariance matrix V of X in the form V = D2 , where D was a symmetric
square root of V . However, other factorizations are also possible. The most
useful one is described below.
Let
W1 = X1,
W2 = X2 − E[X2 | X1],
W3 = X3 − E[X3 | X1, X2],
...
Wn = Xn − E[Xn | X1, . . . , Xn−1].
(b) When we deal with multivariate normals, conditional expectations are linear
functions of the conditioning variables. Thus, the Wi are linear functions of
the Xi . Furthermore, we have W = LX, where L is a lower triangular ma
trix (all entries above the diagonal are zero). The diagonal entries of L are
all equal to 1, so L is invertible. The inverse of L is also lower triangular.
This means that the transformation from X to W is causal (Wi can be de
termined from X1 , . . . , Xi ) and causally invertible (Xi can be determined
from W1 , . . . , Wi ). Engineers sometimes call this a “causal and causally
invertible whitening filter.”
(c) The Wi are independent of each other. This is a consequence of the general
fact E[(X − E[X | Y])Y] = 0, which shows that Wi is uncorrelated
with X1, . . . , Xi−1, hence uncorrelated with W1, . . . , Wi−1. For multivari-
ate normals, we know that zero correlation implies independence. As long
as the Wi have nonzero variance, we can also normalize them so that their
variance is equal to 1.
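In matrix terms, such a whitening transformation can be computed from a Cholesky factorization V = LL^T: then L^{-1} is lower triangular, so W = L^{-1}X is a causal function of X. A numerical sketch (our own illustration; the covariance matrix is an arbitrary positive definite choice, not from the lecture):

    import numpy as np

    rng = np.random.default_rng(4)
    V = np.array([[4.0, 2.0, 1.0],
                  [2.0, 3.0, 0.5],
                  [1.0, 0.5, 2.0]])    # arbitrary positive definite covariance

    L = np.linalg.cholesky(V)          # lower triangular, V = L @ L.T
    X = rng.multivariate_normal(np.zeros(3), V, size=500_000)
    W = np.linalg.solve(L, X.T).T      # W = L^{-1} X, one row per sample

    print(np.cov(W.T).round(2))        # approximately the identity matrix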
We have defined the moment generating function MX (s), for real values of
s, and noted that it may be infinite for some values of s. In particular, if
MX(s) = ∞ for every s ≠ 0, then the moment generating function does not
provide enough information to determine the distribution of X. This motivates
the characteristic function, defined by
φX(t) = E[e^{itX}],
which is very similar to the Fourier transform of f (except for the absence of a
minus sign in the exponent). Thus, the relation between moment generating
functions and characteristic functions is of the same kind as the relation between
Laplace and Fourier transforms.
Note that eitX is a complex-valued random variable, a new concept for us.
However, using the relation eitX = cos(tX)+i sin(tX), defining its expectation
is straightforward:
φX (t) = E[cos(tX)] + iE[sin(tX)].
We make a few key observations:
(a) Because |e^{itX}| ≤ 1 for every t, the expectation φX(t) is well defined and
finite for every t ∈ R. In fact, |φX(t)| ≤ 1, for every t.
(b) The key properties of moment generating functions (cf. Lecture 14) are also
valid for characteristic functions (same proof).
Theorem 2.
(c) Inversion theorem: If two random variables have the same characteristic
function, then their distributions are the same. (The proof of this is nontriv
ial.)
(d) The above inversion theorem remains valid for multivariate characteristic
functions, defined by φX(t) = E[e^{it^T X}].
(e) For the univariate case, if X is a continuous random variable with PDF fX,
there is an explicit inversion formula, namely
fX(x) = (1/2π) lim_{T→∞} ∫_{−T}^{T} e^{−itx} φX(t) dt,
for every x at which fX is differentiable. (Note the similarity with inversion
formulas for Fourier transforms.)
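The inversion formula can be tested numerically. In the sketch below (our own illustration, not from the notes), we invert the characteristic function φX(t) = e^{-t²/2} of a standard normal; the truncation level T and the grid are arbitrary choices:

    import numpy as np

    def inverted_pdf(x, T=40.0, steps=200_001):
        t = np.linspace(-T, T, steps)
        integrand = np.exp(-1j * t * x) * np.exp(-t**2 / 2)   # e^{-itx} phi(t)
        return (integrand.sum() * (t[1] - t[0])).real / (2 * np.pi)

    for x in (0.0, 1.0, 2.0):
        exact = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)        # N(0,1) density
        print(x, inverted_pdf(x), exact)                      # should agree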
The DCT applies here, because the random variables |eitXn | are bounded
by 1.
d^k φX(t)/dt^k |_{t=0} = i^k E[X^k].
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 17 11/5/2008
Contents
1. Definitions
2. Convergence in distribution
3. The hierarchy of convergence concepts
1 DEFINITIONS
Note that for a.s. convergence to be relevant, all random variables need to
be defined on the same probability space (one experiment). Furthermore, the
different random variables Xn are generally highly dependent.
Two common cases where a.s. convergence arises are the following.
(a) The probabilistic experiment runs over time. To each time n, we associate
a nonnegative random variable Zn (e.g., income on day n). Let Xn =
Σ_{k=1}^n Zk be the income in the first n days. Let X = Σ_{k=1}^∞ Zk be the
lifetime income. Note that X is well defined (as an extended real number)
for every ω ∈ Ω, because of our assumption that Zk ≥ 0, and Xn →a.s. X.
(b) The various random variables are defined as different functions of a sin-
gle underlying random variable. More precisely, suppose that Y is a ran-
dom variable, and let gn : R → R be measurable functions. Let Xn =
gn(Y) [which really means, Xn(ω) = gn(Y(ω)), for all ω]. Suppose
that lim_{n→∞} gn(y) = g(y) for every y. Then, Xn →a.s. g(Y). For exam-
ple, let gn(y) = y + y²/n, which converges to g(y) = y. We then have
Y + Y²/n →a.s. Y.
When Xn →a.s. X, the convergence
E[Xn] → E[X]
is not guaranteed; sufficient conditions are provided by the monotone and dom-
inated convergence theorems. For an example where convergence of expecta-
tions fails to hold, consider a random variable U which is uniform on [0, 1], and
let:
Xn = n, if U ≤ 1/n;   Xn = 0, if U > 1/n.   (1)
We have
lim_{n→∞} E[Xn] = lim_{n→∞} nP(U ≤ 1/n) = 1.
On the other hand, for any outcome ω for which U(ω) > 0 (which happens with
probability one), Xn(ω) converges to zero. Thus, Xn →a.s. 0, but E[Xn] does not
converge to zero.
(a) Recall that CDFs have discontinuities (“jumps”) only at the points that have
positive probability mass. More precisely, F is continuous at x if and only
if P(X = x) = 0.
(b) Let Xn = 1/n, and X = 0, with probability 1. Note that F_{Xn}(0) =
P(Xn ≤ 0) = 0, for every n, but FX(0) = 1. Still, because of the excep-
tion in the above definition, we have Xn →d X. More generally, if Xn = an
and X = a, with probability 1, and an → a, then Xn →d X. Thus, conver-
gence in distribution is consistent with the definition of convergence of real
numbers. This would not have been the case if the definition required the
condition lim_{n→∞} Fn(x) = F(x) to hold at every x.
(c) Note that this definition just involves the marginal distributions of the ran
dom variables involved. These random variables may even be defined on
different probability spaces.
(d) Let Y be a continuous random variable whose PDF is symmetric around 0.
Let Xn = (−1)n Y . Then, every Xn has the same distribution, so, trivially,
Xn converges to Y in distribution. However, for almost all ω, the sequence
Xn (ω) does not converge.
(e) If we are dealing with random variables whose distribution is in a parametric
class (e.g., if every Xn is exponential with parameter λn), and the parame-
ters converge (e.g., if λn → λ > 0 and X is exponential with parameter λ),
then we usually have convergence of Xn to X, in distribution.
(f) It is possible for a sequence of discrete random variables to converge in dis
tribution to a continuous one. For example, if Yn is uniform on {1, . . . , n}
and Xn = Yn /n, then Xn converges in distribution to a random variable
which is uniform on [0, 1].
(g) Similarly, it is possible for a sequence of continuous random variables to
converge in distribution to a discrete one. For example if Xn is uniform
on [0, 1/n], then Xn converges in distribution to a discrete random variable
which is identically equal to zero.
(h) If X and all Xn are continuous, convergence in distribution does not imply
convergence of the corresponding PDFs. (Find an example, by emulating
the example in (f).)
(i) If X and all Xn are integer-valued, convergence in distribution turns out
to be equivalent to convergence of the corresponding PMFs: pXn (k) →
pX (k), for all k.
(b) Suppose that X and Xn, n ∈ N, are all defined on the same probability
space. We say that the sequence Xn converges to X, in probability, and
write Xn →i.p. X, if Xn − X converges to zero, in probability, i.e., if for
every ε > 0,
lim_{n→∞} P(|Xn − X| ≥ ε) = 0.
(a) When X in part (b) of the definition is deterministic, say equal to some
constant c, then the two parts of the above definition are consistent with
each other.
(b) The intuitive content of the statement Xn →i.p. c is that in the limit as n in-
creases, almost all of the probability mass becomes concentrated in a small
interval around c, no matter how small this interval is. On the other hand,
for any fixed n, there can be a small probability mass outside this interval,
with a slowly decaying tail. Such a tail can have a strong impact on expected
values. For this reason, convergence in probability does not have any im-
plications on expected values. See for instance the example in Eq. (1). We
have Xn →i.p. X, but E[Xn] does not converge to E[X].
(c) If Xn →i.p. X and Yn →i.p. Y, and all random variables are defined on the same
probability space, then (Xn + Yn) →i.p. (X + Y). (Can you prove it?)
2 CONVERGENCE IN DISTRIBUTION
The following result provides insights into the meaning of convergence in dis
tribution, by showing a close relation with almost sure convergence.
Theorem 1. Suppose that Xn →d X. Then, there exists a probability space
and random variables Y, Yn defined on that space with the following proper-
ties:
(a) For every n, the random variables Xn and Yn have the same CDF; similarly,
X and Y have the same CDF.
(b) Yn →a.s. Y.
Theorem 2. We have
[Xn →a.s. X] ⇒ [Xn →i.p. X] ⇒ [Xn →d X] ⇒ [φ_{Xn}(t) → φX(t), ∀ t].
(The first two implications assume that all random variables are defined on
the same probability space.)
Proof:
(a) [Xn →a.s. X] ⇒ [Xn →i.p. X]:
We give a short proof, based on the DCT, but more elementary proofs are
also possible. Fix some ε > 0. Let
Yn = ε I_{{|Xn−X| ≥ ε}}.
If Xn →a.s. X, then Yn →a.s. 0. By the DCT, E[Yn] → 0. On the other hand,
E[Yn] = ε P(|Xn − X| ≥ ε).
This implies that P(|Xn − X| ≥ ε) → 0, and therefore, Xn →i.p. X.
(b) [Xn →i.p. X] ⇒ [Xn →d X]:
The proof is omitted; see, e.g., [GS].
(c) [Xn →d X] ⇒ [φ_{Xn}(t) → φX(t), ∀ t]:
Suppose that Xn →d X. Let Yn and Y be as in Theorem 1, so that Yn →a.s. Y.
Then, for any t ∈ R,
lim_{n→∞} φ_{Xn}(t) = lim_{n→∞} φ_{Yn}(t) = lim_{n→∞} E[e^{itYn}] = E[e^{itY}] = φY(t) = φX(t),
where we have made use of the facts Yn →a.s. Y, e^{itYn} →a.s. e^{itY}, and the
DCT.
3.2 Convergence in probability versus in distribution
The converse turns out to be false in general, but true when the limit is deter
ministic.
[Xn →d X] does not imply [Xn →i.p. X]:
Let the random variables X, Xn be i.i.d. and nonconstant random variables, in
which case we have (trivially) Xn →d X. Fix some ε > 0. Then, P(|Xn − X| ≥ ε)
is positive and the same for all n, which shows that Xn does not converge to X,
in probability.
[Xn →d c] implies [Xn →i.p. c]:
The proof is omitted; see, e.g., [GS].
The preceding theorem involves two separate conditions: (i) the sequence
of characteristic functions φXn converges (pointwise), and (ii) the limit is the
characteristic function associated with some other random variable. If we are
only given the first condition (pointwise convergence), how can we tell if the
limit is indeed a legitimate characteristic function associated with some random
variable? One way is to check for various properties that every legitimate char
acteristic function must possess. One such property is continuity: if t → t∗ ,
then (using dominated convergence),
lim_{t→t*} φX(t) = lim_{t→t*} E[e^{itX}] = E[e^{it*X}] = φX(t*).
Theorem 4. Continuity of inverse transforms: Let Xn be random vari
ables with characteristic functions φXn , and suppose that the limit φ(t) =
limn→∞ φXn (t) exists for every t. Then, either
(i) The function φ is discontinuous at zero (in this case Xn does not converge
in distribution); or
(ii) There exists a random variable X whose characteristic function is φ, and
Xn →d X.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 18 11/12/2008
Contents
1. Useful inequalities
2. The weak law of large numbers
3. The central limit theorem
1 USEFUL INEQUALITIES
Proof: Apply the Markov inequality to the random variable |X − E[X]|², with
a = ε².
This is called the "weak law" in order to distinguish it from the "strong
law" of large numbers, which asserts, under the same assumptions, that Sn/n →a.s.
E[X1]. Of course, since almost sure convergence implies convergence in proba-
bility, the strong law implies the weak law. On the other hand, the weak law can
be easier to prove, especially in the presence of additional assumptions. Indeed,
in the special case where the Xi have mean µ and finite variance, Chebyshev's
inequality yields, for every ε > 0,
P(|Sn/n − µ| ≥ ε) ≤ var(Sn/n)/ε² = var(X1)/(nε²),
which converges to zero, as n → ∞, thus establishing convergence in probabil-
ity.
Before we proceed to the proof for the general case, we note two important
facts that we will use.
(a) First-order Taylor series expansion. Let g : R → R be a function that has
a derivative at zero, denoted by d. Let h be a function that represents the
error in a first-order Taylor series approximation:
g(t) = g(0) + td + h(t), where lim_{t→0} h(t)/t = 0.
(b) If {an} is a sequence of complex numbers that converges to a, then
lim_{n→∞} (1 + an/n)^n = e^a.
Suppose that X1, X2, . . . are i.i.d. with common (and finite) mean µ and variance
σ². Let Sn = X1 + · · · + Xn. The central limit theorem (CLT) asserts that
(Sn − nµ)/(σ√n)
converges in distribution to a standard normal random variable.
Proof of the CLT: For simplicity, suppose that the random variables Xi have
zero mean and unit variance. Finiteness of the first two moments of X1 implies
that φX1 (t) is twice differentiable at zero. The first derivative is the mean (as
sumed zero), and the second derivative is −E[X 2 ] (assumed equal to one), and
we can write
φX (t) = 1 − t2 /2 + o(t2 ),
where o(t²) indicates a function such that o(t²)/t² → 0, as t → 0. The charac-
teristic function of Sn/√n is of the form
(φX(t/√n))^n = (1 − t²/(2n) + o(t²/n))^n.
For any fixed t, the limit as n → ∞ is e^{−t²/2}, which is the characteristic function
φZ of a standard normal random variable Z. Since φ_{Sn/√n}(t) → φZ(t) for
every t, we conclude that Sn/√n converges to Z, in distribution.
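A simulation sketch of the CLT (our own illustration, with uniform summands, so µ = 1/2 and σ² = 1/12, and arbitrary n): the standardized sums should match standard normal quantiles:

    import numpy as np

    rng = np.random.default_rng(7)
    n, trials = 50, 200_000

    x = rng.uniform(size=(trials, n))
    z = (x.sum(axis=1) - n / 2) / np.sqrt(n / 12)   # (S_n - n mu)/(sigma sqrt(n))

    for q in (-1.96, 0.0, 1.96):
        print(q, (z <= q).mean())   # ~ 0.025, 0.5, 0.975 for a standard normal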
The central limit theorem, as stated above, does not give any information on
the PDF or PMF of Sn . However, some further refinements are possible, under
some additional assumptions. We state, without proof, two such results.
(a) Suppose that ∫ |φ_{X1}(t)|^r dt < ∞, for some positive integer r. Then, Sn is
a continuous random variable for n large enough, and the PDF of Sn/√n
converges to the standard normal PDF:
lim_{n→∞} sup_z | f_{Sn/√n}(z) − (1/√(2π)) e^{−z²/2} | = 0.
(b) Suppose that Xi is a discrete random variable that takes values of the form
a + kh, where a and h are constants, and k ranges over the integers. Suppose
furthermore that X has zero mean and unit variance. Then, for any z of the
form z = (na + kh)/√n (these are the possible values of Sn/√n), we have
lim_{n→∞} (√n/h) P(Sn/√n = z) = (1/√(2π)) e^{−z²/2}.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 19 11/17/2008
Contents
While the weak law of large numbers establishes convergence of the sample
mean, in probability, the strong law establishes almost sure convergence.
Before we proceed, we point out two common methods for proving almost
sure convergence.
Proof. (i) By the monotone convergence theorem, we obtain E[Σ_{n=1}^∞ |Xn|^s] <
∞, which implies that the random variable Σ_{n=1}^∞ |Xn|^s is finite, with probabil-
ity 1. Therefore, |Xn|^s →a.s. 0, which also implies that Xn →a.s. 0.
(ii) Setting ε = 1/k, for any positive integer k, the Borel-Cantelli Lemma shows
that the event {|Xn| > 1/k} occurs only a finite number of times, with prob-
ability 1. Thus, P(lim sup_{n→∞} |Xn| > 1/k) = 0, for every positive integer k.
Note that the sequence of events {lim sup_{n→∞} |Xn| > 1/k} is monotone and
converges to the event {lim sup_{n→∞} |Xn| > 0}. The continuity of probabil-
ity measures implies that P(lim sup_{n→∞} |Xn| > 0) = 0. This establishes that
Xn →a.s. 0.
Theorem 1: Let X, X1 , X2 , . . . be i.i.d. random variables, and assume that
E[|X|] < ∞. Let Sn = X1 + · · · + Xn . Then, Sn /n converges almost surely
to E[X].
Proof, assuming finite fourth moments. Let us make the additional assump-
tion that E[X⁴] < ∞, and, for now, that E[X] = 0 (the general case is discussed
at the end). Note that E[X⁴] < ∞ implies E[|X|] < ∞. Indeed, using the
inequality |x| ≤ 1 + x⁴, we have E[|X|] ≤ 1 + E[X⁴] < ∞.
We have
E[(X1 + · · · + Xn)⁴/n⁴] = (1/n⁴) Σ_{i1=1}^n Σ_{i2=1}^n Σ_{i3=1}^n Σ_{i4=1}^n E[X_{i1} X_{i2} X_{i3} X_{i4}].
Let us consider the various terms in this sum. If one of the indices is differ-
ent from all of the other indices, the corresponding term is equal to zero. For
example, if i1 is different from i2, i3, and i4, the assumption E[Xi] = 0 yields
E[X_{i1} X_{i2} X_{i3} X_{i4}] = E[X_{i1}] E[X_{i2} X_{i3} X_{i4}] = 0.
Therefore, the nonzero terms in the above sum are either of the form E[Xi⁴]
(there are n such terms), or of the form E[Xi² Xj²], with i ≠ j. Let us count the
number of terms of the second type. Such terms are obtained in three different
ways: by setting i1 = i2 ≠ i3 = i4, or by setting i1 = i3 ≠ i2 = i4, or by
setting i1 = i4 ≠ i2 = i3. For each one of these three ways, we have n choices
for the first pair of indices, and n − 1 choices for the second pair. We conclude
that there are 3n(n − 1) terms of this type. Thus,
E[(X1 + · · · + Xn)⁴] = nE[X1⁴] + 3n(n − 1)E[X1² X2²].
Using the inequality xy ≤ (x² + y²)/2, we obtain E[X1² X2²] ≤ E[X1⁴], and
E[(X1 + · · · + Xn)⁴/n⁴] ≤ (n + 3n(n − 1)) E[X1⁴]/n⁴ ≤ 3n² E[X1⁴]/n⁴ = 3E[X1⁴]/n².
It follows that
E[Σ_{n=1}^∞ (X1 + · · · + Xn)⁴/n⁴] = Σ_{n=1}^∞ (1/n⁴) E[(X1 + · · · + Xn)⁴] ≤ Σ_{n=1}^∞ 3E[X1⁴]/n² < ∞,
where the last step uses the well known property Σ_{n=1}^∞ n^{−2} < ∞.
that (X1 + · · · + Xn )4 /n4 converges to zero with probability 1, and therefore,
(X1 + · · · +Xn )/n also converges to zero with probability 1, which is the strong
law of large numbers.
For the more general case where the mean of the random variables Xi is
nonzero, the preceding argument establishes that (X1 + · · · + Xn − nE[X1])/n
converges to zero, which is the same as (X1 + · · · + Xn)/n converging to E[X1],
with probability 1.
Proof, assuming finite second moments. We now consider the case where we
only assume that E[X²] < ∞. We have
E[(Sn/n − µ)²] = var(X)/n.
If we only consider values of n that are perfect squares, we obtain
Σ_{i=1}^∞ E[(S_{i²}/i² − µ)²] = Σ_{i=1}^∞ var(X)/i² < ∞,
which implies that (S_{i²}/i²) − µ converges to zero, with probability 1.
Suppose now, for simplicity, that the random variables Xi are nonnegative, so
that Sn is nondecreasing in n. For every n, let i be the integer with i² ≤ n <
(i + 1)². Then,
S_{i²}/(i + 1)² ≤ Sn/n ≤ S_{(i+1)²}/i²,
or
(i²/(i + 1)²) · (S_{i²}/i²) ≤ Sn/n ≤ ((i + 1)²/i²) · (S_{(i+1)²}/(i + 1)²).
As n → ∞, we also have i → ∞. Since i/(i + 1) → 1, and since S_{i²}/i²
converges to E[X], with probability 1, we see that for almost all sample points,
Sn/n is sandwiched between two sequences that converge to E[X]. This proves
that Sn/n → E[X], with probability 1.
For a general random variable X, we write it in the form X = X⁺ − X⁻,
where X⁺ and X⁻ are nonnegative. The strong law, applied to X⁺ and X⁻
separately, implies the strong law for X as well.
The proof for the most general case (finite mean, but possibly infinite vari
ance) is omitted. It involves truncating the distribution of X, so that its moments
are all finite, and then verifying that the “errors” due to such truncation are not
significant in the limit.
Theorem 2. (Chernoff upper bound) Suppose that E[e^{sX}] < ∞ for some
s > 0, and that a > 0. Then,
P(Sn ≥ na) ≤ e^{−nφ(a)},
where
φ(a) = sup_{s≥0} ( sa − log M(s) ).
For s = 0, we have
sa − log M (s) = 0 − log 1 = 0,
where we have used the generic property M (0) = 1. Furthermore,
d/ds ( sa − log M(s) ) |_{s=0} = a − (1/M(s)) · dM(s)/ds |_{s=0} = a − E[X] > 0.
Since the function sa − log M (s) is zero and has a positive derivative at s = 0,
it must be positive when s is positive and small. It follows that the supremum
φ(a) of the function sa − log M (s) over all s ≥ 0 is also positive. In particular,
for any fixed a > 0, the probability P(Sn ≥ na) decays at least exponentially
fast with n.
Example: For a standard normal random variable X, we have M(s) = e^{s²/2}.
Therefore, sa − log M(s) = sa − s²/2. To maximize this expression over all
s ≥ 0, we form the derivative, which is a − s, and set it to zero, resulting in
s = a. Thus, φ(a) = a²/2, which leads to the bound
P(X ≥ a) ≤ e^{−a²/2}.
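The following sketch (our own illustration, not from the notes) compares this bound with the actual tail probability, estimated by Monte Carlo; the bound is loose but has the right exponential decay:

    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.standard_normal(2_000_000)

    for a in (1.0, 2.0, 3.0):
        print(a, (x >= a).mean(), np.exp(-a**2 / 2))   # tail <= Chernoff bound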
Assumption 1.
(i) M (s) = E[esX ] < ∞, for all s ∈ R.
(ii) The random variable X is continuous, with PDF fX .
(iii) The random variable X does not admit finite upper and lower bounds.
(Formally, 0 < FX (x) < 1, for all x ∈ R.)
log M(s) − sa is minimized over all s ≥ 0. Taking derivatives, we see that such
an s* satisfies a = M′(s*)/M(s*), where M′ stands for the derivative of M. In
particular,
φ(a) = s*a − log M(s*).   (2)
Let us introduce a new PDF
fY(x) = (e^{s*x}/M(s*)) fX(x).
This is a legitimate PDF because
∫ fY(x) dx = (1/M(s*)) ∫ e^{s*x} fX(x) dx = (1/M(s*)) · M(s*) = 1.
The moment generating function associated with the new PDF is
MY(s) = (1/M(s*)) ∫ e^{sx} e^{s*x} fX(x) dx = M(s + s*)/M(s*).
Thus,
E[Y] = (1/M(s*)) · d M(s + s*)/ds |_{s=0} = M′(s*)/M(s*) = a,
where the last equality follows from our definition of s*. The distribution of Y
is called a "tilted" version of the distribution of X.
Let Y1 , . . . , Yn be i.i.d. random variables with PDF fY . Because of the
close relation between fX and fY , approximate probabilities of events involving
Y1 , . . . , Yn can be used to obtain approximate probabilities of events involving
X1 , . . . , Xn .
We keep assuming that a > 0, and fix some δ > 0. Let
B = { (x1, . . . , xn) | a − δ ≤ (1/n) Σ_{i=1}^n xi ≤ a + δ } ⊂ R^n.
The second inequality above was obtained because for every (x1, . . . , xn) ∈ B,
we have x1 + · · · + xn ≤ n(a + δ), so that e^{−s*x1} · · · e^{−s*xn} ≥ e^{−ns*(a+δ)}.
By the weak law of large numbers, we have
P(Tn ∈ B) = P( Y1 + · · · + Yn ∈ [na − nδ, na + nδ] ) → 1,
as n → ∞. Taking logarithms, dividing by n, and then taking the limit of the
two sides of Eq. (3), and finally using Eq. (2), we obtain
lim inf_{n→∞} (1/n) log P(Sn > na) ≥ log M(s*) − s*a − s*δ = −φ(a) − s*δ.
This inequality is true for every δ > 0, which establishes the lower bound in
Eq. (1).
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 20 11/19/2008
Contents
We now turn to the study of some simple classes of stochastic processes. Exam
ples and a more leisurely discussion of this material can be found in the corre
sponding chapter of [BT].
A discrete-time stochastic process is a sequence of random variables {Xn} defined
on a common probability space (Ω, F, P). In more detail, a stochastic process is
a function X of two variables n and ω. For every n, the function ω ↦ Xn(ω) is a
random variable (a measurable function).
random variable (a measurable function). An alternative perspective is provided
by fixing some ω ∈ Ω and viewing Xn (ω) as a function of n (a “time function,”
or “sample path,” or “trajectory”).
A continuous-time stochastic process is defined similarly, as a collection of
random variables {Xt}, indexed by t ∈ [0, ∞), defined on a common probability
space (Ω, F, P).
In the Bernoulli process, the random variables Xn are i.i.d. Bernoulli, with com
mon parameter p ∈ (0, 1). The natural sample space in this case is Ω = {0, 1}∞ .
Let Sn = X1 + · · · +Xn (the number of “successes” or “arrivals” in n steps).
The random variable Sn is binomial, with parameters n and p, so that
pSn(k) = (n choose k) p^k (1 − p)^{n−k},  k = 0, 1, . . . , n,
Let T1 be the time of the first success. Formally, T1 = min{n | Xn = 1}.
We already know that T1 is geometric:
pT1(k) = (1 − p)^{k−1} p,  k = 1, 2, . . . ;   E[T1] = 1/p.
X1 , . . . , Xn . Even more formally, for every n, there exists a function hn such
that
I{N =n} = hn (X1 , . . . , Xn ).
We are now in a position to state a stronger version of the memorylessness
property. If N is a stopping time, then for all n, we have
In words, the process seen if we start watching right after a stopping time is also
Bernoulli with the same parameter p.
an arrival at time n if and only if one or both of the original processes record an
arrival. Formally,
Zn = max{Xn , Yn }.
The random variables Zn are i.i.d. Bernoulli, with parameter 1 − (1 − p)(1 − q)
(assuming the two original processes are independent, with parameters p and q).
Conversely, a Bernoulli process {Zn} with parameter p can be split in two: let
{Un} be i.i.d. Bernoulli random variables with parameter q, independent of {Zn},
and let
Xn = Zn · Un,   Yn = Zn · (1 − Un).
Note that the random variables Xn are i.i.d. Bernoulli, with parameter pq, so that
{Xn} is a Bernoulli process with parameter pq. Similarly, {Yn} is a Bernoulli
process with parameter p(1 − q). Note however that the two processes are de-
pendent. In particular, P(Xn = 1 | Yn = 1) = 0 ≠ pq = P(Xn = 1).
We also let
P (k; t) = P(N (t) = k).
The Poisson process, with parameter λ > 0, is defined implicitly by the
following properties:
(a) The numbers of arrivals in disjoint intervals are independent. Formally,
if 0 < t1 < t2 < · · · < tk , then the random variables N (t1 ), N (t2 ) −
N (t1 ), . . . , N (tk ) − N (tk−1 ) are independent. This is an analog of the
independence of trials in the Bernoulli process.
(b) As δ → 0,
P(0; δ) = 1 − λδ + o1(δ),
P(1; δ) = λδ + o2(δ),
Σ_{k=2}^∞ P(k; δ) = o3(δ),
for some functions o1, o2, o3 that satisfy oi(δ)/δ → 0, as δ → 0.
We fix k and define the following events:
A: exactly k arrivals occur in (0, t];
B: exactly k slots have one or more arrivals;
C: at least one of the slots has two or more arrivals.
The events A and B coincide unless event C occurs. We have
B ⊂ A ∪ C,   A ⊂ B ∪ C,
and, therefore,
P(B) − P(C) ≤ P(A) ≤ P(B) + P(C).
Note that
P(C) ≤ n · o3(δ) = (t/δ) · o3(δ),
which converges to zero, as n → ∞ or, equivalently, δ → 0. Thus, P(A), which
is the same as P(k; t), is equal to the limit of P(B), as we let n → ∞.
The number of slots that record an arrival is binomial, with parameters n
and p = λt/n + o(1/n). Thus, using the binomial probabilities,
P(B) = (n choose k) (λt/n + o(1/n))^k (1 − λt/n + o(1/n))^{n−k}.
When we let n → ∞, essentially the same calculation as the one carried out in
Lecture 6 shows that the right-hand side converges to the Poisson PMF, and
P(k; t) = ((λt)^k / k!) e^{−λt}.
This establishes that N (t) is a Poisson random variable with parameter λt, and
E[N (t)] = var(N (t)) = λt.
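The Bernoulli-to-Poisson limit can be simulated directly (our own sketch, with arbitrary λ, t, and slot count): the number of slots with an arrival is binomial and should match the Poisson PMF:

    import numpy as np
    from math import exp, factorial

    rng = np.random.default_rng(9)
    lam, t, n, trials = 2.0, 3.0, 10_000, 100_000

    counts = rng.binomial(n, lam * t / n, size=trials)  # arrivals in (0, t]

    print(counts.mean(), counts.var(), lam * t)         # all ~ 6.0
    for k in (4, 6, 8):
        pmf = exp(-lam * t) * (lam * t) ** k / factorial(k)
        print(k, (counts == k).mean(), pmf)             # empirical vs. Poisson PMF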
We recognize this as an exponential CDF. Thus, T1 is exponentially distributed,
with parameter λ.
Let us now find the joint PDF of the first two interarrival times. We give a
heuristic argument, in which we ignore the probability of two or more arrivals
during a small interval and any o(δ) terms. Let t1 > 0, t2 > 0, and let δ be a
small positive number, with δ < t2. We have
P(t1 ≤ T1 ≤ t1 + δ, t2 ≤ T2 ≤ t2 + δ)
≈ P(0; t1) · P(1; δ) · P(0; t2 − δ) · P(1; δ)
= e^{−λt1} λδ e^{−λ(t2−δ)} λδ.
This shows that T2 is independent of T1 , and has the same exponential distribu
tion. This argument is easily generalized to argue that the random variables Tk
are i.i.d. exponential, with common parameter λ.
Differentiating (and taking λ = 1, for simplicity, consistent with the formulas
below), we obtain
fY1,Y2(s, t) = ∂² P(Y1 ≤ s, Y2 ≤ t)/∂t∂s = e^{−t},  0 ≤ s ≤ t.
We point out an interesting consequence: conditioned on Y2 = t, Y1 is
uniform on (0, t); that is, given the time of the second arrival, all possible times
of the first arrival are "equally likely."
We now use the linear relations
T1 = Y1 , T2 = Y2 − Y1 .
The determinant of the matrix involved in this linear transformation is equal to 1.
Thus, the Jacobian formula yields
fT1 ,T2 (t1 , t2 ) = fY1 ,Y2 (t1 , t1 + t2 ) = e−t1 e−t2 ,
confirming our earlier independence conclusion. Once more, this approach can
be generalized to deal with more than two interarrival times, although the calcula-
tions become more complicated.
fYk(y) = d FYk(y)/dy = λ^k y^{k−1} e^{−λy} / (k − 1)!.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.436J/15.085J Fall 2008
Lecture 21 11/24/2008
Readings:
Posted excerpt from Chapter 6 of [BT, 2nd edition], as well as starred exercises
therein.
1. In a more formal definition, we first define the σ-field Fs generated by all events of the
form {N(t) = k}, as t ranges over [0, s] and k ranges over the integers. We then require that
{S ≤ s} ∈ Fs, for all s ≥ 0.
Fundamentals of probability.
6.436/15.085
LECTURE 23
Markov chains
23.1. Introduction
Recall a model we considered earlier: the random walk. We have Xn =d Be(p), i.i.d., and Sn =
Σ_{1≤j≤n} Xj was defined to be a simple random walk. One of its key properties is that the distribu-
tion of Sn+1, conditioned on the state Sn = x at time n, is independent of the past history, namely
Sm, m ≤ n − 1. To see this formally, note
P(Sn+1 = y | Sn = x, Sn−1 = z1, . . . , S1 = zn−1)
= P(Xn+1 = y − x, Sn = x, Sn−1 = z1, . . . , S1 = zn−1) / P(Sn = x, Sn−1 = z1, . . . , S1 = zn−1)
= [P(Xn+1 = y − x) P(Sn = x, Sn−1 = z1, . . . , S1 = zn−1)] / P(Sn = x, Sn−1 = z1, . . . , S1 = zn−1)
= P(Xn+1 = y − x),
where the second equality follows from the independence assumption for the sequence Xn , n ≥ 1.
A similar derivation gives P(Sn+1 = y|Sn = x) = P(Xn+1 = y − x) and we get the required
equality: P(Sn+1 = y|Sn = x, Sn−1 = z1 , . . . , S1 = zn−1 ) = P(Sn+1 = y|Sn = x).
Definition 23.1. A discrete time stochastic process (Xn , n ≥ 1) is defined to be a Markov chain
if it takes values in some countable set X , and for every x1 , x2 , . . . , xn ∈ X it satisfies the property
P(Xn = xn |Xn−1 = xn−1 , Xn−2 = xn−2 , . . . , X1 = x1 ) = P(Xn = xn |Xn−1 = xn−1 )
The elements of X are called states. We say that the Markov chain is in state s ∈ X at time
n if Xn = s. Mostly we will consider the case when X is finite. In this case we call Xn a finite
state Markov chain and, without loss of generality, we will assume that X = {1, 2, . . . , N}.
Let us establish some properties of Markov chains.
Proposition 1. Given a Markov chain Xn , n ≥ 1.
(a) For every collection of states s, x1 , x2 , . . . , xn−1 and every m
P(Xn+m = s|Xn−1 = xn−1 , . . . , X1 = x1 ) = P(Xn+m = s|Xn−1 = xn−1 ).
Proof. Exercise. □
23.2. Examples
We already have an example of a Markov chain - random walk.
Consider now the following example (Exercise 2, Section 6.1 [2]). Suppose we roll a die
repeatedly and Xn is the number of 6s we have seen so far. Then Xn is a Markov chain with
P(Xn = x + 1 | Xn−1 = x) = 1/6, P(Xn = x | Xn−1 = x) = 5/6, and P(Xn = y | Xn−1 = x) = 0 for all
y ≠ x, x + 1. Note that we can think of Xn as a random walk, where the transition to the right
occurs with probability 1/6 and the transition to the same state with probability 5/6.
Also, let Xn be the largest outcome seen so far. Then Xn is again a Markov chain. What are
its transition probabilities?
Now consider the following model of an inventory process. The inventory can hold finished
goods up to capacity C ∈ N. Every month n there is some current inventory level In, and a
certain fixed amount of product x ∈ N is produced, as long as the limit is not reached, namely
In + x ≤ C. If In + x > C, then just enough, C − In, is produced to reach the capacity. Every
month there is a random demand Dn, n ≥ 1, which we assume is i.i.d. If the current inventory
level is at least as large as the demand, then the full demand is satisfied. Otherwise, as much of
the demand is satisfied as possible, bringing the inventory level down to zero.
Let In be the inventory level in month n. Then In is a Markov chain: note that
In+1 = max{min{In + x, C} − Dn, 0}.
Specifically, the probability distribution of In+1 given In = i is independent of the values
Im, m ≤ n − 1. In is a Markov chain taking values in 0, 1, . . . , C.
Observe that
p(2)_{i,j} := P(Xn+2 = j | Xn = i) = Σ_{1≤k≤N} P(Xn+2 = j | Xn+1 = k, Xn = i) P(Xn+1 = k | Xn = i)
= Σ_{1≤k≤N} P(Xn+2 = j | Xn+1 = k) P(Xn+1 = k | Xn = i)
= Σ_{1≤k≤N} p_{i,k} p_{k,j}.
This means that the matrix P² gives the two-step transition probabilities of the underlying
Markov chain. Namely, the (i, j)-th entry of P², which we denote by p(2)_{i,j}, is precisely
P(Xn+2 = j | Xn = i). This is not hard to extend to the general case: for every r ≥ 1, P^r is the
transition matrix of r steps of the Markov chain. One of our goals is understanding the long-term
dynamics of P^r as r → ∞. We will see that for a broad class of Markov chains the following
property happens: the limit lim_{r→∞} p(r)_{i,j} exists and depends on j only. Namely, the starting
state i is irrelevant, as far as the limit is concerned. This property is called mixing and is a very
important property of Markov chains.
Now, we use ej to denote the j-th N -dimensional column vector. Namely ej has j-th co
ordinate equal to one, and all the other coordinates equal to zero. We also let e denote the
N -dimensional column vector consisting of ones. Suppose X0 = i, for some state i ∈ {1, . . . , N }.
Then the probability vector of Xn can be written as eTi P n in vector form. Suppose at time
zero, the state of the chain is random and is given by some probability vector µ. Namely
P(X0 = i) = µi , i = 1, 2, . . . , N . Then the probability vector of Xn is precisely µT P n in vector
form.
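Here is a small numeric sketch (our own example; the two-state transition matrix is arbitrary, not from the lecture) of the vector form µ^T P^n, which also previews the mixing property: the rows of P^n converge to a common vector:

    import numpy as np

    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])           # an arbitrary two-state chain
    mu = np.array([1.0, 0.0])            # X_0 = state 1 with probability 1

    for n in (1, 5, 50):
        print(n, mu @ np.linalg.matrix_power(P, n))   # distribution of X_n

    print(np.linalg.matrix_power(P, 50))  # both rows ~ (0.8, 0.2)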
One of the fundamental properties of finite state Markov chains is that a stationary distri-
bution always exists.
Theorem 23.4. Given a finite state Markov chain with transition matrix P, there exists at least
one stationary distribution π. Namely, the system of equations (23.3) has at least one solution
satisfying π ≥ 0, Σ_i πi = 1.
Proof. There are many proofs of this fundamental result. One possibility is to use Brouwer's
Fixed Point Theorem. Later on we give a probabilistic proof which provides important intuition
about the meaning of πi. For now let us give a quick proof, but one that relies on linear
programming (LP). If you are not familiar with linear programming theory, you can simply
ignore this proof.
Consider the following LP problem in variables π1, . . . , πN:
max Σ_{1≤i≤N} πi
subject to:
P^T π − π = 0,
π ≥ 0.
Note that a stationary vector π exists iff this LP has an unbounded objective value. Indeed,
if π is a stationary vector, then it clearly is a feasible solution to this LP. Note that απ is also
a solution for every α > 0. Since α Σ_{1≤i≤N} πi = α, we can obtain feasible solutions with
objective value as large as we want. On the other hand, suppose this LP has an unbounded
objective value. In particular, there exists a solution x satisfying Σ_i xi > 0. Taking
πi = xi / Σ_i xi we obtain a stationary distribution.
Now, using LP duality theory, this LP has an unbounded objective value iff the dual problem
is infeasible. The dual problem is
min Σ_{1≤i≤N} 0 · yi
subject to:
P y − y ≥ e.
Let us show that indeed this dual LP problem is infeasible. Take any y and find k* such that
y_{k*} = max_i yi. Observe that Σ_i p_{k*,i} yi ≤ Σ_i p_{k*,i} y_{k*} = y_{k*} < 1 + y_{k*}, since the rows
of P sum to one. Thus the constraint P y − y ≥ e is violated in the k*-th row. We conclude that the
dual problem is indeed infeasible. Thus the primal LP problem is unbounded and the stationary
distribution exists. □
As we mentioned, the stationary distribution π is not necessarily unique, but quite often it is.
In this case it can be obtained as the unique solution to the system of equations π^T = π^T P,
Σ_j πj = 1, πj ≥ 0.
Example: [Example 6.6 from [1]] An absent-minded professor has two umbrellas, used when
commuting from home to work and back. If it rains and an umbrella is available, the professor takes
it. If an umbrella is not available, the professor gets wet. If it does not rain, the professor does not
take the umbrella. It rains on a given commute with probability p, independently for all days.
What is the steady-state probability that the professor will get wet on a given day?
We model the process as a Markov chain with states j = 0, 1, 2. The state j means that the
location where the professor currently is has j umbrellas. Then the corresponding transition
probabilities are p0,2 = 1, p2,1 = p, p1,2 = p, p1,1 = 1 − p, p2,0 = 1 − p. The corresponding equations
for πj, j = 0, 1, 2, are then π0 = π2(1 − p), π1 = (1 − p)π1 + pπ2, π2 = π0 + pπ1. From the second
equation, π1 = π2. Combining with the first equation and with the fact π0 + π1 + π2 = 1, we
obtain π1 = π2 = 1/(3 − p), π0 = (1 − p)/(3 − p). The steady-state probability that the professor gets wet is
the probability of being in state zero times the probability that it rains on this day. Namely, it is
P(wet) = (1 − p)p/(3 − p).
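A numeric check of this example (our own sketch, not from the text): we compute the stationary distribution of the umbrella chain as the left eigenvector of P for eigenvalue 1 and compare with the closed forms above:

    import numpy as np

    p = 0.3
    P = np.array([[0.0,   0.0,   1.0],    # state 0: no umbrella at this location
                  [0.0, 1 - p,   p  ],    # state 1
                  [1 - p,   p,   0.0]])   # state 2

    w, v = np.linalg.eig(P.T)             # left eigenvectors of P
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    pi = pi / pi.sum()                    # normalize to a distribution

    print(pi, [(1 - p) / (3 - p), 1 / (3 - p), 1 / (3 - p)])
    print(pi[0] * p, (1 - p) * p / (3 - p))   # steady-state P(wet)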
23.6. References
• Sections 6.1-6.4 [2]
• Chapter 6 [1]
Fundamentals of probability.
6.436/15.085
LECTURE 24
Markov chains II. Mean recurrence times.
Since k is recurrent, by Exercise 3, µk < ∞, implying ρi(k) < ∞. We let ρ(k) denote the
vector with components ρi(k).
Lemma 24.6. ρ(k) satisfies ρ^T(k) = ρ^T(k) P. In particular, for every recurrent state k,
πi = ρi(k)/µk, 1 ≤ i ≤ N, defines a stationary distribution.
Proof. The second part follows from (24.5) and the fact that µk < ∞. Now we prove the first
part. We have, for every n ≥ 2,
(24.7) P(Xn = i, Tk ≥ n | X0 = k) = Σ_{j≠k} P(Xn = i, Xn−1 = j, Tk ≥ n | X0 = k)
(24.8) = Σ_{j≠k} P(Xn−1 = j, Tk ≥ n − 1 | X0 = k) p_{j,i}.
Observe that P(X1 = i, Tk ≥ 1 | X0 = k) = p_{k,i}. We now sum (24.7) over n and apply it to
(24.4) to obtain
ρi(k) = p_{k,i} + Σ_{j≠k} Σ_{n≥2} P(Xn−1 = j, Tk ≥ n − 1 | X0 = k) p_{j,i}.
We recognize Σ_{n≥2} P(Xn−1 = j, Tk ≥ n − 1 | X0 = k) as ρj(k). Using ρk(k) = 1 we obtain
ρi(k) = ρk(k) p_{k,i} + Σ_{j≠k} ρj(k) p_{j,i} = Σ_j ρj(k) p_{j,i}.
But by continuity of probabilities, lim_n an = P(Xn ≠ k, ∀n). By Exercise 3, the state k, being
recurrent, is visited infinitely often with probability one. We conclude that lim_n an = 0, which
gives µk πk = 1, implying that πk is uniquely defined as 1/µk. □
also πi = 0, then we have established the required equality for the case when i is a transient
state.
Suppose now i is a recurrent state. Let T1, T2, T3, . . . denote the times of successive visits to
i. Then the sequence Tn, n ≥ 2, is i.i.d. Also T1 is independent from the rest of the sequence,
although its distribution is different from that of Tm, m ≥ 2, since we have started the chain
from k, which is in general different from i. By the definition of Ni(t) we have
Σ_{1≤m≤Ni(t)} Tm ≤ t < Σ_{1≤m≤Ni(t)+1} Tm
almost surely.
Since i is a recurrent state, by Exercise 3, Ni(t) → ∞ almost surely as t → ∞. Combining
the preceding identity with (24.10) we obtain
lim_{t→∞} t/Ni(t) = E[T2] = µi.
24.5. References
• Sections 6.3-6.4 [1].
Fundamentals of probability.
6.436/15.085
LECTURE 25
Markov chains III. Periodicity, Mixing, Absorption
25.1. Periodicity
Previously we showed that when a finite state M.c. has only one recurrent class and π is the
corresponding stationary distribution, then E[Ni(t) | X0 = k]/t → πi as t → ∞, irrespective of the
starting state k. Since Ni(t) = Σ_{n=1}^t 1_{{Xn=i}} is the number of times state i is visited up till time
t, we have shown that (1/t) Σ_{n=1}^t P(Xn = i | X0 = k) → πi for every state k, i.e., p(n)_{ki} converges to πi
in the Cesaro sense. However, p(n)_{ki} need not converge, as the following example shows. Consider
a 2-state Markov chain with states {1, 2} and p12 = 1 = p21. Then p(n)_{12} = 1 when n is odd and
0 when n is even.
Let x be a recurrent state and consider all the times when x is accessible from itself, i.e., the
times in the set Ix = {n ≥ 1 : p(n)_{xx} > 0} (note that this set is non-empty since x is a recurrent
state). One property of Ix we will make use of is that it is closed under addition, i.e., if m, n ∈ Ix,
then m + n ∈ Ix. This is easily seen by observing that p(m+n)_{xx} ≥ p(m)_{xx} p(n)_{xx} > 0. Let dx be the
greatest common divisor of the numbers in Ix. We call dx the period of x. We now show that all
states in the same recurrent class have the same period.
Lemma 25.1. If x and y are in the same recurrent class, then dx = dy .
Proof. Let m and n be such that p(m)_{xy}, p(n)_{yx} > 0. Then p(m+n)_{yy} ≥ p(m)_{xy} p(n)_{yx} > 0. So dy divides
m + n. Let l be such that p(l)_{xx} > 0; then p(m+n+l)_{yy} ≥ p(n)_{yx} p(l)_{xx} p(m)_{xy} > 0. Therefore dy divides m + n + l,
hence it divides l. This implies that dy divides dx. A similar argument shows that dx divides dy,
so dx = dy. □
A recurrent class is said to be periodic if the period d is greater than 1, and aperiodic if d = 1.
The 2-state Markov chain in the example above has a period of 2 since p(n)_{11} > 0 iff n is even.
A recurrent class with period d can be divided into d subsets, so that all transitions from one
subset lead to the next subset.
Why is periodicity of interest to us? It is because periodicity is exactly what prevents the
convergence of p(n)_{xy} to πy. Suppose y is a recurrent state with period d > 1. Then p(n)_{yy} = 0 unless
n is a multiple of d, but πy > 0. However, if d = 1, we have positive probability of returning to
y for all time steps n sufficiently large.
Lemma 25.2. If dy = 1, then there exists some N ≥ 1 such that p(n)_{yy} > 0 for all n ≥ N.
Proof. We first show that Iy = {n ≥ 1 : p(n)_{yy} > 0} contains two consecutive integers. Let n
and n + k be elements of Iy. If k = 1, then we are done. If not, then since dy = 1, we can
find an n1 ∈ Iy such that k is not a divisor of n1. Let n1 = mk + r, where 0 < r < k. Consider
(m + 1)(n + k) and (m + 1)n + n1, which are both in Iy since Iy is closed under addition. We
have
(m + 1)(n + k) − ((m + 1)n + n1) = k − r < k.
So by repeating the above argument at most k times, we eventually obtain a pair of consecutive
integers m, m + 1 ∈ Iy. If N = m², then for all n ≥ N, we have n − N = km + r, where
0 ≤ r < m. Then n = m² + km + r = r(1 + m) + (m − r + k)m ∈ Iy. □
When a Markov chain is irreducible (has only one recurrent class) and aperiodic, the steady-
state behavior is given by the stationary distribution. This is also known as mixing.
Theorem 25.3. Consider an irreducible, aperiodic Markov chain. Then for all states x, y,
lim_{n→∞} p(n)_{xy} = πy.
For the case of periodic chains, there is a similar statement regarding convergence of p(n)_{xy}, but
now the convergence holds only for certain subsequences of the time index n. See [1] for further
details.
There are at least two generic ways to prove this theorem. One is based on the Perron-
Frobenius Theorem, which characterizes eigenvalues and eigenvectors of non-negative matrices.
Specifically, the largest eigenvalue of P is equal to unity and all other eigenvalues are strictly
smaller than unity in absolute value. The P-F Theorem is especially useful in the special case
of so-called reversible M.c. These are irreducible M.c. for which the unique stationary distri-
bution satisfies πx pxy = πy pyx for all states x, y. Then the following important refinement of
Theorem 25.3 is known.
Theorem 25.4. Consider an irreducible aperiodic Markov chain which is reversible. Then there
exists a constant C such that for all states x, y, |p(n)_{xy} − πy| ≤ C|λ2|^n, where λ2 is the second
largest (in absolute value) eigenvalue of P.
Since by the P-F Theorem |λ2| < 1, this theorem is indeed a refinement of Theorem 25.3, as it
gives a concrete rate of convergence to the steady-state.
i.e., the probability that state i is eventually reached, starting from state k. Note that aii = 1
and aji = 0 for all absorbing j =� i. For k a transient state, we have
$$a_{ki} = P(\exists n : X_n = i \mid X_0 = k) = \sum_{j=1}^{N} P(\exists n : X_n = i \mid X_1 = j)\, p_{kj} = \sum_{j=1}^{N} a_{ji}\, p_{kj}.$$
So we can find the absorption probabilities by solving the above system of linear equations.
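Restricted to the transient states, this system reads $(I - Q)a = b$, where $Q$ collects the transition probabilities among transient states and $b$ the one-step probabilities into $i$. A minimal numpy sketch (function and variable names are my own, not from the notes):

```python
import numpy as np

def absorption_probs(P, target, absorbing):
    """Solve a_k = P(reach target | X0 = k) from the linear system
    a_k = p_{k,target} + sum over transient j of p_{kj} a_j."""
    n = len(P)
    transient = [s for s in range(n) if s not in absorbing]
    Q = P[np.ix_(transient, transient)]  # transitions among transient states
    b = P[transient, target]             # one-step jumps straight into target
    a = np.linalg.solve(np.eye(len(Q)) - Q, b)
    out = np.zeros(n)
    out[target] = 1.0
    out[transient] = a
    return out

# Example: states 0 and 3 absorbing, 1 and 2 transient.
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])
print(absorption_probs(P, target=0, absorbing={0, 3}))  # [1, 2/3, 1/3, 0]
```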
Example: Gambler’s Ruin A gambler wins 1 dollar at each round, with probability p, and
loses a dollar with probability 1 − p. Different rounds are independent. The gambler plays
continuously until he either accumulates a target amount m or loses all his money. What is the
probability of losing his fortune?
We construct a Markov chain with state space {0, 1, . . . , m}, where the state i is the amount
of money the gambler has. So state i = 0 corresponds to losing his entire fortune, and state
m corresponds to accumulating the target amount. The states 0 and m are absorbing states.
We have the transition probabilities pi,i+1 = p, pi,i−1 = 1 − p for i = 1, 2, . . . , m − 1, and
p00 = pmm = 1. To find the absorption probabilities for state 0, we have
$$a_{00} = 1, \qquad a_{m0} = 0, \qquad a_{i0} = (1-p)\,a_{i-1,0} + p\,a_{i+1,0}, \quad i = 1, \dots, m-1.$$
Setting $b_i = a_{i0} - a_{i+1,0}$ and $\rho = (1-p)/p$, the recurrence rearranges to $b_i = \rho\, b_{i-1}$, so we obtain $b_i = \rho^i b_0$. Note that $b_0 + b_1 + \cdots + b_{m-1} = a_{00} - a_{m0} = 1$, hence $(1 + \rho + \cdots + \rho^{m-1}) b_0 = 1$, which gives us
$$b_i = \begin{cases} \dfrac{\rho^i (1-\rho)}{1 - \rho^m}, & \text{if } \rho \ne 1, \\[6pt] \dfrac{1}{m}, & \text{otherwise.} \end{cases}$$
Summing, $a_{i0} = \sum_{j=i}^{m-1} b_j$, so for $\rho \ne 1$,
$$a_{i0} = \frac{\rho^i - \rho^m}{1 - \rho^m},$$
and for $\rho = 1$,
$$a_{i0} = \frac{m-i}{m}.$$
This shows that, for any fixed $i$, if $\rho > 1$, i.e., $p < 1/2$, the probability of losing goes to 1 as $m \to \infty$. In other words, if the gambler aims for a large target while playing under unfavorable odds, financial ruin is almost certain.
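The closed form is easy to sanity-check against a direct solution of the linear system. A brief sketch ($p$ and $m$ are arbitrary choices):

```python
import numpy as np

p, m = 0.45, 10
rho = (1 - p) / p

# Closed form: probability of ruin starting from i dollars.
i = np.arange(m + 1)
ruin = (rho**i - rho**m) / (1 - rho**m)

# Independent check: solve a_i = (1-p) a_{i-1} + p a_{i+1} directly.
A = np.eye(m + 1)
b = np.zeros(m + 1)
b[0] = 1.0
for k in range(1, m):
    A[k, k - 1] -= 1 - p
    A[k, k + 1] -= p
print(np.allclose(np.linalg.solve(A, b), ruin))  # True
```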
The expected time to absorption $\mu_k$, when starting in a transient state $k$, is defined as $\mu_k = E[\min\{n \ge 1 : X_n \text{ is recurrent}\} \mid X_0 = k]$. A similar analysis, conditioning on the first step of the Markov chain, shows that the expected times to absorption can be found by solving
$$\mu_k = 0 \quad \text{for all recurrent states } k, \qquad \mu_k = 1 + \sum_{j=1}^{N} p_{kj}\,\mu_j \quad \text{for all transient states } k.$$
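The same restriction to the transient states turns these equations into $(I - Q)\mu = \mathbf{1}$, again a single linear solve. A sketch, reusing the conventions of the earlier snippet (names are my own):

```python
import numpy as np

def absorption_times(P, absorbing):
    """Expected number of steps until a recurrent/absorbing state is hit."""
    n = len(P)
    transient = [s for s in range(n) if s not in absorbing]
    Q = P[np.ix_(transient, transient)]
    mu = np.linalg.solve(np.eye(len(Q)) - Q, np.ones(len(Q)))
    out = np.zeros(n)  # mu_k = 0 at recurrent states
    out[transient] = mu
    return out

# Gambler's ruin with p = 1/2 and m = 3: expected duration from i is i*(m-i).
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])
print(absorption_times(P, absorbing={0, 3}))  # [0, 2, 2, 0]
```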
25.3. References
• Sections 6.4, 6.6 [2].
• Section 5.5 [1].
BIBLIOGRAPHY
1. R. Durrett, Probability: theory and examples, Duxbury Press, second edition, 1996.
2. G. R. Grimmett and D. R. Stirzaker, Probability and random processes, Oxford University
Press, 2005.
Fundamentals of probability.
6.436/15.085
LECTURE 26
Infinite Markov chains. Continuous time Markov chains.
26.1. Introduction
In this lecture we cover variants of Markov chains not covered in earlier lectures. We first discuss infinite state Markov chains. Then we consider finite and infinite state M.c. where the transitions between states happen after random time intervals, as opposed to unit time steps. Most of the time we state the results without proofs, and our treatment of this material is very brief. A more in-depth analysis of these concepts is given in the course 6.262 - Discrete Stochastic Processes.
As a result, if the M.c. $X_n$ has the property $X_0 \stackrel{d}{=} \pi$, then $X_n \stackrel{d}{=} \pi$ for all $n$.
Now let us consider the following M.c. on Z+ . A parameter p is fixed. For every i > 0,
pi,i+1 = p, pi,i−1 = 1 − p and p0,1 = p, p0,0 = 1 − p. This M.c. is called random walk with reflection
at zero. Let us try to find a stationary distribution π of this M.c. It must satisfy
πi = πi−1 pi−1,i + πi+1 pi+1,i = πi−1 p + πi+1 (1 − p), i ≥ 1.
π0 = π0 (1 − p) + π1 (1 − p).
From this we obtain $\pi_1 = \frac{p}{1-p}\,\pi_0$, and iterating,
$$\pi_{i+1} = \frac{p}{1-p}\,\pi_i. \tag{26.1}$$
This gives $\pi_i = (p/(1-p))^i\,\pi_0$. Now if $p > 1/2$ then $\pi_i \to \infty$ and we cannot possibly have $\sum_i \pi_i = 1$. Thus no stationary distribution exists. Note however that all pairs of states $i, j$ communicate, as we can get from $i$ to $j > i$ in $j - i$ steps with probability $p^{j-i} > 0$, and from $j$ to $i$ in $j - i$ steps with probability $(1-p)^{j-i}$.
We conclude that an infinite state M.c. does not necessarily have a stationary distribution
even if all states communicate. Recall that in the case of finite state M.c. if i is a recurrent state,
then its recurrence time Ti has finite expected value (as it has geometrically decreasing tails). It
turns out that the difficulty is the fact that while every state $i$ communicates with every other state $j$, it is possible that the chain starting from $i$ wanders off to “infinity” forever, without ever returning to $i$. Furthermore, it is possible that even if the chain returns to $i$ infinitely often
with probability one, the expected return time from i to i is infinite. Recall, that the return time
is defined to be Ti = min{n ≥ 1 : Xn = i}, when the M.c. starts at i at time 0.
Definition 26.2. Given an infinite M.c. Xn , n ≥ 1, the state i is defined to be transient if the
probability of never returning to $i$ is positive. Namely,
$$P(X_n \ne i,\ \forall n \ge 1 \mid X_0 = i) > 0.$$
Otherwise the state is defined to be recurrent. It is defined to be positive recurrent if E[Ti ] < ∞
and null-recurrent if E[Ti ] = ∞.
Thus, unlike the finite state case, a state is transient if there is a positive probability of no return, as opposed to the existence of a reachable state from which the return to the starting state has probability zero. It is an exercise to check that the definition above, when applied to the finite state case, is consistent with the earlier definition. Also, observe that there is no notion of a null-recurrent
state in the finite state case.
The following theorem holds, the proof of which we skip.
Theorem 26.3. Given an infinite M.c. $X_n$, $n \ge 1$, suppose all the states communicate. Then
there exists a stationary distribution π iff there exists at least one positive recurrent state i. In
this case in fact all the states are positive recurrent and the stationary distribution π is unique.
It is given as πj = 1/E[Tj ] > 0 for every state j.
We see that in the case when all the states communicate, all states have the same status:
positive recurrent, null recurrent or transient. In this case we will say the M.c. itself is positive
recurrent, null recurrent, or transient. There is an extension of this theorem to the cases when
not all states communicate, but we skip the discussion of those. The main difference is that if the state $i$ is such that for some $j$, $i \to j$ and $j \not\to i$, then the steady state probability of $i$ is zero, just as in the case of finite state M.c. Similarly, if there are several communicating classes, then there exists at least one stationary distribution for every class which contains at least one positive recurrent state (in which case all states of that class are positive recurrent).
Theorem 26.4. A random walk with reflection Xn on Z+ is positive recurrent if p < 1/2,
null-recurrent if p = 1/2 and transient if p > 1/2.
Proof. The case $p < 1/2$ will be resolved by exhibiting explicitly at least one steady state distribution $\pi$. Since all the states communicate, by Theorem 26.3 we then know that the stationary distribution is unique and $E[T_i] = 1/\pi_i < \infty$ for all $i$. Thus the chain is positive recurrent. To construct a stationary distribution, look again at the recurrence (26.1), which suggests $\pi_i = (p/(1-p))^i\,\pi_0$. From this we obtain
$$\pi_0\Big(1 + \sum_{i>0} \big(p/(1-p)\big)^i\Big) = 1.$$
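The positive recurrent case is easy to check by simulation: for $p < 1/2$ the empirical occupation frequencies of the reflected walk should approach the geometric distribution $\pi_i = (1-\rho)\rho^i$ with $\rho = p/(1-p)$. A simulation sketch ($p$ and the run length are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p, steps = 0.3, 200_000

x, counts = 0, {}
for _ in range(steps):
    # From any state, step up with probability p; otherwise step down,
    # except at 0, where the walk stays put (reflection at zero).
    if rng.random() < p:
        x += 1
    elif x > 0:
        x -= 1
    counts[x] = counts.get(x, 0) + 1

rho = p / (1 - p)
pi0 = 1 - rho  # normalization: pi0 * sum of rho^i over i >= 0 equals 1
for i in range(5):
    print(i, counts.get(i, 0) / steps, pi0 * rho**i)
```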
If we observe the process only at the times of transitions, denoted say by $t_1 < t_2 < \cdots$, then we obtain an embedded discrete time process $Y_n = X(t_n)$. It is an exercise to show that $Y_n$ is in fact a homogeneous Markov chain. Denote the transition probabilities of this Markov chain by $p_{i,j}$. The value $q_{i,j} = \mu_i p_{i,j}$ is called the “transition rate” from state $i$ to state $j$. Note that the values $p_{i,j}$ were introduced only for $j \ne i$, as they were derived from the M.c. changing its state. Define $q_{i,i} = -\sum_{j \ne i} q_{i,j}$. The matrix $G = (q_{i,j})$, $i, j \in X$, is defined to be the generator of the M.c. $X(t)$ and plays an important role, specifically for the discussion of a stationary distribution.
A stationary distribution π of a continuous M.c. is defined in the same way as for the discrete
time case: it is the distribution which is time invariant. The following fact can be established.
Proposition 2. A vector $(\pi_i)$, $i \in X$, is a stationary distribution iff $\pi_i \ge 0$, $\sum_i \pi_i = 1$ and $\sum_j \pi_j q_{j,i} = 0$ for every state $i$. In vector form, $\pi G = 0$.
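Numerically, $\pi$ can be found by appending the normalization $\sum_i \pi_i = 1$ to the equations $\pi G = 0$ and solving the stacked system in the least-squares sense. A sketch with an arbitrary 3-state generator:

```python
import numpy as np

# An arbitrary generator: nonnegative off-diagonal rates, rows sum to zero.
G = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -1.5,  0.5],
              [ 2.0,  2.0, -4.0]])

# pi G = 0 plus sum(pi) = 1: stack G^T with a row of ones.
A = np.vstack([G.T, np.ones(3)])
b = np.concatenate([np.zeros(3), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi, pi @ G)  # pi @ G should be (numerically) zero
```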
As for the discrete time case, the theory of continuous time M.c. has a lot of special structure when the state space is finite. We now summarize, without proofs, some of the basic results. First, there exists a stationary distribution. The conditions for uniqueness of the stationary distribution are the same - a single recurrence class, with communication between the states defined similarly. A nice “advantage” of continuous M.c. is the lack of periodicity: there is no notion of a period of a state. Moreover, and most importantly, suppose the chain has a unique recurrence class. Then, for $\pi$ the corresponding unique stationary distribution, the mixing property
$$\lim_{t \to \infty} p_{i,j}^{(t)} = \pi_j$$
holds for every starting state $i$. For modeling purposes, it is sometimes useful to consider a continuous as opposed to a discrete M.c.
There is an alternative way to describe a continuous M.c. and the embedded discrete time M.c. Assume that to each pair of states $i, j$ we associate an exponential “clock” - an exponentially distributed r.v. $U_{i,j}$ with parameter $\mu_{i,j}$. Each time the process jumps into $i$, all of the clocks are turned on simultaneously. Then at time $U_i \triangleq \min_j U_{i,j}$ the process jumps into the state $j^* = \arg\min_j U_{i,j}$. It is not hard to establish the following: the resulting process is a continuous time finite state M.c. The embedded discrete time M.c. then has transition probabilities
$$P(X(t_{n+1}) = j \mid X(t_n) = i) = \frac{\mu_{i,j}}{\sum_k \mu_{i,k}},$$
as the probability that $U_{i,j} = \min_k U_{i,k}$ is given by this expression when the $U_{i,j}$ are exponentially distributed with parameters $\mu_{i,j}$. The holding time then has the distribution $\mathrm{Exp}(\mu_i)$, where $\mu_i = \sum_k \mu_{i,k}$. Thus we obtain an alternative description of a M.c. The transition rates of this M.c. are $q_{i,j} = \mu_i p_{i,j} = \mu_{i,j}$. In other words, we have described the M.c. via the rates $q_{i,j}$ as given.
This description extends to the infinite M.c., when the notion of holding times is well defined
(see the comments above).
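The clock description translates directly into a simulation: in state $i$, sample all $U_{i,j} \sim \mathrm{Exp}(\mu_{i,j})$, advance time by the minimum, and jump to its argmin. A sketch with made-up rates:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up transition rates mu[i][j] between 3 states (diagonal unused).
mu = np.array([[0.0, 2.0, 1.0],
               [0.5, 0.0, 0.5],
               [1.0, 3.0, 0.0]])

def simulate(x0, t_end):
    t, x, path = 0.0, x0, [(0.0, x0)]
    while t < t_end:
        # One exponential clock per possible destination state.
        clocks = [rng.exponential(1.0 / mu[x, j]) if mu[x, j] > 0 else np.inf
                  for j in range(len(mu))]
        j = int(np.argmin(clocks))
        t += clocks[j]  # holding time: the minimum clock, Exp(sum of rates)
        x = j           # jump to the state whose clock rang first
        path.append((t, x))
    return path

print(simulate(0, 5.0)[:5])
```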
26.4. References
• Sections 6.2, 6.3, 6.9 [1].
Fundamentals of probability.
6.436/15.085
LECTURE 27
Birth-death processes
A birth-death process is a continuous time Markov chain $X(t)$ on $\mathbb{Z}_+$ whose transition probabilities satisfy
$$P(X(t+h) = n + m \mid X(t) = n) = \begin{cases} \lambda_n h + o(h), & \text{if } m = 1; \\ \mu_n h + o(h), & \text{if } m = -1; \\ 1 - \lambda_n h - \mu_n h + o(h), & \text{if } m = 0; \\ o(h), & \text{if } |m| > 1. \end{cases}$$
Fix an arbitrary state $k_0$ and let $p_n(t) = P(X(t) = n \mid X(0) = k_0)$. We have $p_n(t+h) = \sum_j p_j(t) P(X(t+h) = n \mid X(t) = j)$. Assume $n > 0$. Then we can rewrite the expression above as
Note the analogy with the reflected random walk (essentially the same model, except that it is discrete time) and the condition $1 - p < p$ for existence of a steady state. The parameter $\rho$ is called the traffic intensity and plays a very important role in the theory of queueing systems. For one thing, notice that in steady state $p_0 = P(L = 0) = 1 - \rho$. On the other hand, from the general theory of discrete and continuous time M.c. we know that $p_0$ is the average fraction of time the system spends in this state. Thus $1 - \rho$ is the average fraction of time the server is idle; equivalently, $\rho$ is the average fraction of time the system is busy. Hence the term “traffic intensity”.
Consider now the following variant of a queueing system, known as the M/M/∞ system. Here we have infinitely many servers. Each service time is again $\stackrel{d}{=} \mathrm{Exp}(\mu)$, and the arrivals occur according to a Poisson process with rate $\lambda$. The difference is that there is no queue any more: due to the infinity of servers, every customer is served instantly upon arrival. Let $L(t)$ be the number of customers being served at time $t$. It is not hard to see that this corresponds to a birth-death process with parameters $\lambda_n = \lambda$, $\mu_n = \mu n$. The arrival parameter is explained as before. The service rate $\mu n$ is explained as follows: when there are $n$ customers being served, the next transition occurs at a time which is the minimum of the arrival time of the next customer and the smallest of the $n$ residual service times. The former is $\stackrel{d}{=} \mathrm{Exp}(\lambda)$, the latter $\stackrel{d}{=} \mathrm{Exp}(n\mu)$, hence $\mu_n = \mu n$.
Let $\rho = \lambda/\mu$. In this case we find the stationary distribution as
$$p_0 = \Big(\sum_{n \ge 0} \frac{\rho^n}{n!}\Big)^{-1} = e^{-\rho}, \qquad p_n = \frac{\rho^n}{n!}\, e^{-\rho}.$$
In particular, the distribution is the familiar $\mathrm{Pois}(\rho)$. The system always has a steady state distribution, irrespective of the values of $\lambda, \mu$. This is explained by the fact that we have infinitely many servers, so the queue disappears.
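The Poisson form also drops out of the detailed balance relations for birth-death chains, $\lambda_n p_n = \mu_{n+1} p_{n+1}$, which for the M/M/∞ system read $\lambda p_n = \mu(n+1) p_{n+1}$. A truncated numerical check (rates and truncation level are arbitrary):

```python
import numpy as np
from math import factorial

lam, mu, N = 2.0, 1.0, 30  # arrival rate, per-server rate, truncation level
rho = lam / mu

# Detailed balance: lambda * p_n = mu * (n+1) * p_{n+1}, since mu_n = n * mu.
p = np.zeros(N + 1)
p[0] = 1.0
for n in range(N):
    p[n + 1] = p[n] * lam / (mu * (n + 1))
p /= p.sum()

poisson = np.array([np.exp(-rho) * rho**n / factorial(n) for n in range(N + 1)])
print(np.allclose(p, poisson, atol=1e-9))  # True, up to truncation error
```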
27.4. References
• Section 6.11 [1].