4b_ProbabilityNotes
Daniel Arrieta
Contents
1 Introduction to Probability
1.1 Sample Space and Events
1.2 Probabilities on Events
1.3 Conditional Probability
1.4 Counting
2 Random Variables
2.1 Probability Distributions
2.2 Expectation, Variance and Quantiles
2.3 Moment Generating Function
2.4 Characteristic Function
3 Codependence Structures
3.1 Joint Distributions
3.2 Independent Random Variables
3.3 Covariance and Correlation
3.4 Conditional Expectation
4 Convergence and Limit Theorems
4.1 The Central Limit Theorem
4.2 Laws of Large Numbers
5 Simulation
5.1 Monte Carlo Integration
5.2 Univariate Random Variables
5.3 Gaussian Random Variables
5.4 Stochastic Processes
5.4.1 Poisson Process
5.4.2 Brownian Motion
5.5 Variance Reduction
5.5.1 Antithetic Variables
5.5.2 Control Variates
References
1 Introduction to Probability
This chapter is based on Ross (2019), and Barron and Greco (2020).
Definition 1.1 (Sample space). The sample space is the set Ω of all possible outcomes of an experiment. Ω could be a finite set, a countably infinite set, or a continuum. Ω is also called the universal set.
Definition 1.2 (Event). An event A is any subset of Ω, A ⊂ Ω. The set Ω is also called the sure event. The empty set ∅ is called the impossible event. The class of all possible events is denoted by F = {A : A ⊂ Ω}. If Ω is a finite set with N elements, we write |Ω| = N, and then the number of possible events is 2^N.
Example 1.3. If we roll a die the sample space is Ω = {1, 2, 3, 4, 5, 6}. Rolling an even number is
the event A = {2, 4, 6}. If we want to count the number of customers coming to a bakery the sample
space is Ω = {0, 1, 2, . . .}, and the event we get between 2 and 7 customers is A = {2, 3, 4, 5, 6, 7}.
Example 1.4. If we throw a dart randomly at a circular board of 1 meter radius, the sample space is the set of all possible positions of the dart, Ω = {(x, y) : x² + y² ≤ 1}. The event that the dart lands in a given region A ⊂ Ω is illustrated in Figure 1.1.

Figure 1.1: Random throws of a dart at a circular board of 1 meter radius, and the set A.
1.2 Probabilities on Events
Eventually we want to find the probability that an event will occur. We say that an event A
occurs if any outcome in the set A actually occurs when the experiment is performed.
Combinations of events
Let A, B ∈ F be any two events. From these events we may describe the following events:

(a) A ∪ B is the event that A occurs or B occurs (or both).

(b) A ∩ B, also written as AB, is the event that A and B both occur.

(c) A^c = Ω − A is the event that A does not occur. This is the set of all outcomes in Ω that are not in A.

(d) A ⊂ B means that if A occurs, then B must also occur.

(e) A ∩ B = ∅ means the two events cannot occur together, i.e., they are mutually exclusive. We also say that A and B are disjoint.

(f) A ∪ A^c = Ω means that no matter what event A we pick, either A occurs or A^c occurs, and not both, since A and A^c are mutually exclusive.
Many more such relations hold if we have three or more events. It is useful to recall the following
set relationships:
• A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C), and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
Definition 1.5 (Probability function). A probability function is a function P : F → R satisfying:

i. P(A) ≥ 0, ∀A ∈ F,

ii. P(Ω) = 1,

iii. for any countable sequence of mutually exclusive events A_1, A_2, . . ., i.e., events for which A_i ∩ A_j = ∅ whenever i ≠ j,

P( ∪_{n=1}^∞ A_n ) = Σ_{n=1}^∞ P(A_n).
From i. it is clear that probabilities cannot be negative, ii. states that the probability of the sure event is 1, and iii. implies that

P(A ∪ B) = P(A) + P(B)

for all events A, B ∈ F such that A ∩ B = ∅; this is called the disjoint event sum rule.
Whenever we write P we will always assume it is a probability function.
Remark 1.6. Immediately from Definition 1.5 we can see that P(∅) = 0. In fact, Ω ∪ ∅ = Ω and Ω ∩ ∅ = ∅, so the disjoint sum rule and P(Ω) = 1 give 1 = P(Ω) = P(Ω) + P(∅), hence P(∅) = 0. Similarly, since A ∪ A^c = Ω with A ∩ A^c = ∅,

P(A^c) = 1 − P(A),

for any event A ∈ F. It is also true that no matter what event A ∈ F we take, 0 ≤ P(A) ≤ 1. In fact, by definition P(A) ≥ 0, and since P(A^c) = 1 − P(A) ≥ 0 we get P(A) ≤ 1, so it must be that

0 ≤ P(A) ≤ 1.
Remark 1.7. One of the most important and useful rules is the Law of Total Probability:

P(A) = P(A ∩ B) + P(A ∩ B^c), ∀ A, B ∈ F. (1.1)

To see why this is true, we use some basic set theory to decompose A:

A = A ∩ Ω = A ∩ (B ∪ B^c) = (A ∩ B) ∪ (A ∩ B^c).

Since A ∩ B and A ∩ B^c are disjoint, the disjoint sum rule gives

P(A) = P((A ∩ B) ∪ (A ∩ B^c)) = P(A ∩ B) + P(A ∩ B^c).

A main use of this Law is that we may find the probability of an event A if we know what happens when A ∩ B occurs and when A ∩ B^c occurs. A useful rearranged form is

P(A ∩ B^c) = P(A) − P(A ∩ B).
The next theorem gives us the sum rule when the events are not mutually exclusive.
Theorem 1.8 (General Sum Rule). If A, B are any two events, then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof. Write A ∪ B as a disjoint union:

A ∪ B = A ∪ (A^c ∩ B).

By the disjoint sum rule,

P(A ∪ B) = P(A) + P(A^c ∩ B),

and by the Law of Total Probability applied to B, P(A^c ∩ B) = P(B) − P(A ∩ B). Substituting gives the result. ■
The next example gives one of the most important probability functions for finite sample spaces.
Example 1.9. When the sample space is finite, say |Ω| = N, and all individual outcomes in Ω are equally likely, we may define the function

P(A) := n(A)/N,

where n(A) denotes the number of outcomes in A. To check that this is indeed a probability function, we have to verify the conditions of Definition 1.5. For instance, if we roll two dice, the sample space is

Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), (2, 2), . . . , (2, 6), . . . , (6, 1), (6, 2), . . . , (6, 6)},

so N = 36, and the event that the two dice sum to 7 is

A = {(1, 6), (6, 1), (2, 5), (5, 2), (3, 4), (4, 3)},

so P(A) = 6/36 = 1/6.
Example 1.10. Whenever the sample space can easily be written it is often the best way to find
probabilities. As an example, we roll two dice and we let D1 denote the number on the first die
and D2 the number on the second. Suppose we want to find P (D1 > D2 ). The easiest way to
solve this is to write down the sample space as done in previous example, and then use the fact
that each outcome is equally likely. We have
{D1 > D2 } = {(2, 1), (3, 2), (3, 1), (4, 3), (4, 2), (4, 1), (5, 4), (5, 3),
(5, 2), (5, 1), (6, 5), (6, 4), (6, 3), (6, 2), (6, 1)}.
This event has 15 outcomes, which means P(D1 > D2) = 15/36.
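Counts like this are easy to verify by brute force; a minimal Python sketch that enumerates the 36 equally likely outcomes:

    # enumerate the sample space of two dice and count outcomes with D1 > D2
    outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
    favorable = [pair for pair in outcomes if pair[0] > pair[1]]
    print(len(favorable), len(outcomes))    # 15 36
    print(len(favorable) / len(outcomes))   # 0.4166... = 15/36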
1.3 Conditional Probability

Definition 1.11. The conditional probability of event A, given that event B has occurred, is

P(A|B) = P(A ∩ B)/P(B), if P(B) > 0.
One justification for Definition 1.11 can be seen from the case when the sample space is finite with equally likely individual outcomes. In such a case, if |Ω| = N, then

n(A ∩ B)/n(B) = (n(A ∩ B)/N) / (n(B)/N) = P(A ∩ B)/P(B) = P(A|B).
The left-most side of the above equation is the fraction of the outcomes in B that are also in A ∩ B. In other words, it is the probability of A using the reduced sample space B. That is, if the outcomes in Ω are equally likely, P(A|B) is the proportion of outcomes in both A and B relative to the number of outcomes in B.
Example 1.12. In a controlled experiment to see if a drug is effective, 71 patients were given the drug (event D), while 75 were given a placebo (event D^c). A patient either records a response (event R) or not (event R^c). The following table, called a two-way or contingency table, summarizes the results.

                       Drug (D)   Placebo (D^c)   Subtotals
  Response (R)             26            13             39
  No response (R^c)        45            62            107
  Subtotals                71            75            146

The sample space consists of 146 outcomes of the type (Drug, Response), (Placebo, Response), (Drug, No Response), or (Placebo, No Response), assumed equally likely.
The numbers in the table are recorded after the experiment is performed and we estimate the
probability of each event. For instance,
P(D) = 71/146 = 0.486, P(R) = 39/146 = 0.267,
and so on. For example, P(R) is obtained because 39 of the 146 equally likely patients exhibited a response (whether to the drug or the placebo).
We can use the Law of Total Probability to also calculate these probabilities. If we want the chance
that a randomly chosen patient will record a response we use the fact that R = (R ∩ D) ∪ (R ∩ Dc ),
so
P(R) = P(R ∩ D) + P(R ∩ D^c) = 26/146 + 13/146 = 39/146 = 0.267,

and

P(D) = P(D ∩ R) + P(D ∩ R^c) = 26/146 + 45/146 = 71/146 = 0.486.
• If we choose at random a patient and we observe that this patient exhibited a response, what
is the chance this patient took the drug? This is
P(D|R) = 26/39 = P(D ∩ R)/P(R) = (26/146)/(39/146).
Using the reduced sample space R is how we got the first equality.
• If we choose a patient at random and we observe that this patient took the drug, what
is the chance this patient exhibited a response? This is P (R|D) = 26/71. Notice that
P(R|D) ≠ P(D|R).
• Find P (Rc |D) = 45/71. Observe that since P (D) = P (R ∩ D) + P (Rc ∩ D), we have
P(R^c|D) = P(R^c ∩ D)/P(D) = (P(D) − P(R ∩ D))/P(D) = 1 − P(R|D).
Using the Law of Total Probability we get an important formula and tool for calculating probabilities of events.

Theorem 1.13. Let B be an event with 0 < P(B) < 1. Then, for any event A ∈ F,

P(A) = P(A|B)P(B) + P(A|B^c)P(B^c).

Proof. The Law of Total Probability combined with the multiplication rule P(A ∩ B) = P(A|B)P(B) says

P(A) = P(A ∩ B) + P(A ∩ B^c) = P(A|B)P(B) + P(A|B^c)P(B^c). ■
Frequently, we want to find the conditional probability of some event and we have yet another
event we want to take into account. The next corollary tells us how to do that.
Corollary 1.14. Let B, C be events with P(B ∩ C) > 0 and P(B ∩ C^c) > 0. Then, for any event A,

P(A ∩ B) = P(A|B ∩ C)P(B ∩ C) + P(A|B ∩ C^c)P(B ∩ C^c).

Proof. Simply write out each term and use the theorem:

P(A ∩ B) = P(A ∩ B ∩ C) + P(A ∩ B ∩ C^c)
= [P(A ∩ B ∩ C)/P(B ∩ C)] P(B ∩ C) + [P(A ∩ B ∩ C^c)/P(B ∩ C^c)] P(B ∩ C^c)
= P(A|B ∩ C)P(B ∩ C) + P(A|B ∩ C^c)P(B ∩ C^c). ■
Another very useful fact is that conditional probabilities are actually probabilities and therefore
all rules for probabilities apply to conditional probabilities as long as the given information
remains the same.
Corollary 1.15. Let B be an event with P (B) > 0, then
Q(A) = P (A|B), ∀ A ∈ F,
is a probability function.
Proof. We have to verify that Q(·) satisfies the axioms of Definition 1.5. Clearly, Q(A) ≥ 0 for
any event A, and
Q(Ω) = P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1.

Finally, let A1 ∩ A2 = ∅. Then

P(A1 ∪ A2|B) = P((A1 ∪ A2) ∩ B)/P(B)
= P((A1 ∩ B) ∪ (A2 ∩ B))/P(B)
= [P(A1 ∩ B) + P(A2 ∩ B)]/P(B)
= P(A1|B) + P(A2|B). ■
Conditional probability naturally leads us to what it means when information about B doesn’t
help with the probability of A. This is an important concept and will be very helpful throughout
probability and statistics.
Definition 1.16. Two events A, B are said to be independent if the knowledge that one of the events occurred does not affect the probability that the other event occurs. That is,

P(A|B) = P(A) and P(B|A) = P(B).

Equivalently (and without requiring positive probabilities),

P(A ∩ B) = P(A)P(B).
Example 1.17. Suppose an experiment has two possible outcomes a, b, so the sample space is
S = {a, b}. Suppose P (a) = p and P (b) = 1 − p. If we perform this experiment n ≥ 1 times with
identical conditions from experiment to experiment, then the events of individual experiments are
independent. We may calculate P(n a's in a row) = p^n, and P(n a's and then b) = p^n(1 − p). In particular, the chance of getting five straight heads in five tosses of a fair coin is (1/2)^5 = 1/32.
When events are not independent we can frequently use the information about the occurrence of
one of the events to find the probability of the other. That is the basis of conditional probability.
The next concept allows us to calculate the probability of an event if the entire sample space is
split (or partitioned) into pieces and decomposing the event we are interested in into the parts
occurring in each piece. Here’s the idea.
If we have events B_1, . . . , B_n such that B_i ∩ B_j = ∅ for all i ≠ j and ∪_{i=1}^n B_i = S, then the collection {B_i}_{i=1}^n is called a partition of S. In this case, the Law of Total Probability says

P(A) = Σ_{i=1}^n P(A ∩ B_i), and P(A) = Σ_{i=1}^n P(A|B_i) P(B_i),
for any event A ∈ F. We can calculate the probability of A by using the pieces of A that intersect each B_i. It is always possible to partition Ω by taking any event B and the event B^c. Then, for any other event A, P(A) = P(A|B)P(B) + P(A|B^c)P(B^c).
Example 1.18. Suppose we draw the second card from the top of a well-shuffled deck. We want
to know the probability that this card is an Ace.
This seems to depend on what the first card is. Let B = {1st card is an Ace} and consider the partition {B, B^c}. Conditioning on what the first card is,

P(2nd card is an Ace) = P(2nd is an Ace|B)P(B) + P(2nd is an Ace|B^c)P(B^c)
= (3/51) × (4/52) + (4/51) × (48/52) = 4/52.
Amazingly, the chance the second card is an ace is the same as the chance the first card is an ace. This makes sense: if we don't know what the first card is, the second card should have the same chance as the first. In fact, the chance the 27th card is an ace is also 4/52, as long as we don't know any of the preceding 26 cards.
The next important theorem tells us how to find P (Bk | A) if we know how to find P (A | Bi ) for
each event Bi in the partition of S. It shows us how to find the probability that if A occurs, it
was due to Bk .
Theorem 1.19 (Bayes' Rule). Let {B_i}_{i=1}^n be a partition of S. Then for each k = 1, 2, . . . , n,

P(B_k|A) = P(B_k ∩ A)/P(A)
= P(A|B_k) P(B_k)/P(A)
= P(A|B_k) P(B_k) / [P(A|B_1) P(B_1) + · · · + P(A|B_n) P(B_n)].
Proof. The proof is in the statement of the theorem using the definition of conditional probability
and the Law of Total Probability. ■
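As a numerical illustration, a minimal Python sketch that applies Bayes' Rule to the drug data of Example 1.12, with the partition {D, D^c}:

    # priors and conditional response rates taken from the contingency table
    P_D, P_Dc = 71/146, 75/146
    P_R_given_D, P_R_given_Dc = 26/71, 13/75
    # denominator: Law of Total Probability
    P_R = P_R_given_D * P_D + P_R_given_Dc * P_Dc
    # Bayes' Rule
    P_D_given_R = P_R_given_D * P_D / P_R
    print(P_R)           # 0.2671... = 39/146
    print(P_D_given_R)   # 0.6666... = 26/39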
1.4 Counting

When we have a finite sample space |Ω| = N with equally likely outcomes, we calculate

P(A) = n(A)/N.

It is sometimes a very difficult task to calculate both N and n(A). In this section we give some basic counting principles to help with this.

Basic Counting Principle: If a task consists of two steps, and step one can be done in k different ways and step two in j different ways, then the task can be completed in k × j different ways.
Permutations

The number of ordered arrangements (permutations) of k objects chosen from n distinct objects is

P_{n,k} = n(n − 1) · · · (n − k + 1) = n!/(n − k)!.
For instance, if we have 3 distinct objects {a, b, c}, there are 6 = 3 · 2 ways to pick 2 objects out of the 3, since there are 3 ways to pick the first object and then 2 ways to pick the second. They are (a, b), (a, c), (b, a), (b, c), (c, a), (c, b).
Combinations
The number of ways to choose k objects out of n when we don't care about the order of the objects is

C_{n,k} = n!/(k!(n − k)!).

For example, in the paragraph on permutations, the choices (a, b) and (b, a) are different permutations but they are the same combination, and so should not be counted separately. The way to get the number of combinations is to first figure out the number of permutations, namely P_{n,k} = n!/(n − k)!, and then divide out the number of ways to arrange the selection of k objects, namely k!. In other words,

P_{n,k} = C_{n,k} · k! ⟹ C_{n,k} = n!/((n − k)!k!).
Example 1.20 (Poker Hands). We will calculate the probability of obtaining some of the common
5 card poker hands to illustrate the counting principles. A standard 52-card deck has 4 suits
(Hearts, Clubs, Spades, Diamonds) with each suit consisting of 13 cards labeled
2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A.
Five cards from the deck are chosen at random (without replacement). We now want to find the
probabilities of various poker hands.
The sample space Ω is all possible 5-card hands where the order of the cards does not matter. These are combinations of 5 cards from the 52, and there are N := |Ω| = C_{52,5} = 2,598,960 possible hands.

Probability of a Royal Flush, which is A, K, Q, J, 10, all of the same suit. Let A = {royal flush}. How many royal flushes are there? It should be obvious there are exactly 4, one for each suit. Therefore, P(A) = 4/C_{52,5} = 0.00000153908, an extremely rare event.
Probability of a Full House, which is 3 of a kind and a pair. Let A = {full house}. To get a full house we break this down into steps:

(a) Pick a face value for the 3 of a kind. There are 13 values one could choose.
(b) Choose 3 cards of the value chosen in the first step. There are C_{4,3} ways to do that.
(c) Choose another face value distinct from the first. There are 12 ways to do that.
(d) Choose 2 cards of the value chosen in the previous step. There are C_{4,2} ways to do that.

By the counting principle, n(A) = 13 · C_{4,3} · 12 · C_{4,2} = 3744, so P(A) = 3744/2,598,960 ≈ 0.00144.
Probability of 3 of a Kind. This is a hand of the form aaabc where b, c are cards neither of which has the same face value as a. Let A be the event that we get 3 of a kind. The number of hands in A is calculated using the multiplication rule with these steps: pick the face value of the triple (13 ways), choose 3 of its 4 suits (C_{4,3} ways), choose the 2 remaining distinct face values (C_{12,2} ways), and choose one of 4 suits for each of them (4 · 4 ways). This gives n(A) = 13 · C_{4,3} · C_{12,2} · 16 = 54,912, so P(A) ≈ 0.0211.
Probability of a Pair. This is a hand like aabcd where b, c, d are distinct cards with face values different from each other and from a. Let A be the event that we get (exactly) a pair. To get one pair and make sure the other 3 cards don't match the pair is a bit tricky. These are the steps: pick the face value of the pair (13 ways), choose 2 of its 4 suits (C_{4,2} ways), choose the 3 remaining distinct face values (C_{12,3} ways), and choose a suit for each of them (4³ ways). This gives n(A) = 13 · C_{4,2} · C_{12,3} · 64 = 1,098,240, so P(A) ≈ 0.4226, as shown in the sketch below.
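All of these counts are quick to reproduce with Python's math.comb; a minimal sketch of the hands computed above:

    from math import comb

    N = comb(52, 5)                                      # 2,598,960 hands
    royal = 4
    full_house = 13 * comb(4, 3) * 12 * comb(4, 2)       # 3,744
    three_kind = 13 * comb(4, 3) * comb(12, 2) * 4 * 4   # 54,912
    pair = 13 * comb(4, 2) * comb(12, 3) * 4 ** 3        # 1,098,240
    for n_A in (royal, full_house, three_kind, pair):
        print(n_A, n_A / N)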
2 Random Variables
In this chapter we study the main properties of functions whose domain is the sample space of an experiment with random outcomes. Such functions are called random variables. This chapter is based on Ross (2019), Barron and Greco (2020), and Capiński and Kopp (2004).
2.1 Probability Distributions

Definition 2.1. A random variable X is a function X : Ω → R such that

E = {ω ∈ Ω : X(ω) ≤ a} ∈ F, ∀a ∈ R,

i.e., the set E is an event for every a. The range of X, denoted R(X), is the set of values X can take. If R(X) is a finite or countable set, we say X is discrete. If R(X) contains an interval, then we say X is not discrete, but either continuous or mixed.
Definition 2.2. If X is a discrete random variable with range R(X) = {x_1, x_2, . . .}, the probability mass function (pmf) of X is

p(x_i) = P(X = x_i), i = 1, 2, . . .
Distribution 2.4. A random variable X which takes on only two values, a, b, with P(X = a) = p, P(X = b) = 1 − p, is said to have a Bernoulli(a, b, p) distribution, or to be a Bernoulli(a, b, p) random variable, and we write X ∼ Bernoulli(a, b, p). In particular, if we have an experiment with two outcomes, success or failure, we may set a = 1, b = 0 to represent these, and p is the probability of success.
An experiment like this is called a Bernoulli(p) trial. The pmf of a Bernoulli(a, b, p) random variable is

p(x) := P(X = x) = { p, if x = a;  1 − p, if x = b }.
A random variable X which counts the number of successes in an independent set of n Bernoulli
trials is called a Binomial (n, p) random variable, denoted X ∼ Binom (n, p). The range of X is
R(X) = {0, 1, 2, . . . , n}. The pmf of X is

p(x) := P(X = x) = C_{n,x} p^x (1 − p)^{n−x}, x = 0, 1, . . . , n.
Remark 2.5. Here's where this comes from. If we have a particular sequence of n Bernoulli trials with x successes, say 10011101 . . . 1, then x 1's must be in this sequence and n − x 0's must also be in there. By independence of the trials, the probability of any particular sequence of x 1's and n − x 0's is p^x (1 − p)^{n−x}. How many sequences with x 1's out of n are there? That number is

C_{n,x} = n!/(x!(n − x)!).
Hence a Binomial (n, p) random variable X is a sum of n (independent) Bernoulli (p) random
variables, X = X1 + X2 + · · · + Xn . Independent random variables will be discussed later.
Example 2.6. A bet on red for a standard roulette wheel has an 18/38 chance of winning. Suppose a gambler will bet $5 on red each time for 100 plays. Let X be the total amount won or lost as a result of these 100 plays. X will be a discrete random variable with range R(X) = {0, ±10, ±20, . . . , ±500}. In fact, if M denotes the number of games won (which is also a random variable, with values from 0 to 100), then our net amount won or lost is X = 10M − 500. The random variable M is an example of a Binomial(100, 18/38) random variable, so, for instance,

P(M = 50) = C_{100,50} (18/38)^{50} (20/38)^{50} = 0.0693.
It is important to note that a pdf does not have to satisfy f(x) ≤ 1 in general; a pdf only needs to be nonnegative and integrate to 1.
The cumulative distribution function (cdf) of a rv X is F_X(x) := P(X ≤ x). Every cdf satisfies:

i. F_X is nondecreasing;
ii. lim_{x→−∞} F_X(x) = 0 and lim_{x→+∞} F_X(x) = 1;
iii. lim_{y→x+0} F_X(y) = F_X(x) for all x ∈ R. This says a cdf is continuous at every point from the right.
Using the cdf F_X we can compute interval probabilities.

Proposition 2.9. For a < b, P(a < X ≤ b) = F_X(b) − F_X(a).
Proof. For a < b define the events A = {X ≤ a} ⊂ B = {X ≤ b}. By the Law of Total Probability,

P(B ∩ A^c) = P(B) − P(A ∩ B) = P(B) − P(A).

As a consequence,

P(a < X ≤ b) = P(B ∩ A^c) = P(X ≤ b) − P(X ≤ a) = F_X(b) − F_X(a). ■
If X is continuous with density f(x), then

P(a < X ≤ b) = ∫_a^b f(x) dx,
Therefore, we can find the pdf if we know the cdf using the fundamental theorem of calculus, i.e.,

F_X′(x) = f(x).
Example 2.11. Suppose X is a random variable with values 1, 2, 3 with probabilities 1/6, 1/3,
and 1/2, respectively. The jumps are at x = 1, 2, 3. The size of the jump is P (X = x), x = 1, 2, 3
and at each jump the left endpoint is not included while the right endpoint is included because the
cdf is continuous from the right. Then we may calculate
P(X < 2) = P(X = 1) = 1/6,

but

P(X ≤ 2) = P(X = 1) + P(X = 2) = 1/6 + 1/3 = 1/2.
We begin with the pmfs of some of the most important discrete rvs we will use in this course.
Definition 2.12. The Discrete Uniform pmf and cdf are, respectively,

P(X = x) = 1/n, and F_X(x) = x/n,

for x = 1, 2, . . . , n. A discrete uniform rv picks one of n points at random.
A Poisson(λ) rv X has pmf

P(X = x) = e^{−λ} λ^x/x!, x = 0, 1, 2, . . . .

The parameter λ > 0 is given. A Poisson(λ) rv counts the number of events that occur at rate λ per time unit.
A Geometric(p) rv X has pmf

P(X = x) = (1 − p)^{x−1} p, x = 1, 2, . . . .

This rv is the number of independent Bernoulli trials until we get the first success.
Definition 2.16. If a rv X follows a NegBinomial(r, p) distribution, its pmf is given by

P(X = x) = C_{x−1, r−1} p^r (1 − p)^{x−r}, x = r, r + 1, r + 2, . . . ,

where x represents the number of Bernoulli trials until we get r successes.
Definition 2.17. A (uniform) X ∼ U(a, b) rv models choosing a random number from a to b. The pdf is

f(x) = { 1/(b − a), if a < x < b;  0, otherwise }.

And the cdf is

F_X(x) = { 0, if x < a;  (x − a)/(b − a), if a ≤ x < b;  1, if x ≥ b }.
Next is the normal distribution which we have already discussed but we record it here again for
convenience.
Definition 2.18. A Normal(µ, σ) rv X ∼ N(µ, σ) has density

f(x) = (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²}, −∞ < x < ∞.

It is not possible to get an explicit expression for the cdf, so we simply write

Ncdf(x, µ, σ) := F_X(x) = (1/(σ√(2π))) ∫_{−∞}^x e^{−(1/2)((y−µ)/σ)²} dy.
Definition 2.19. An Exponential(λ) rv, denoted X ∼ Exp(λ), λ > 0, has pdf

f(x) = { λe^{−λx}, x ≥ 0;  0, x < 0 }.

The cdf is

F_X(x) = ∫_0^x λe^{−λy} dy = { 1 − e^{−λx}, if x ≥ 0;  0, if x < 0 }.
An exponential random variable represents processes that do not “remember”. For example, if X
represents the time between arrivals of customers to a store, a reasonable model is Exponential (λ)
where λ represents the average rate at which customers arrive.
2.2 Expectation, Variance and Quantiles

Definition 2.20. The expected value of a function g of a rv X is

E[g(X)] = ∫_{−∞}^∞ g(x) f(x) dx, if X is continuous with pdf f,

and

E[g(X)] = Σ_x g(x) P(X = x), if X is discrete.

In particular, the mean of X is µ := E[X].
With this definition, you can see why it is frequently useful to write E[g(X)] = ∫ g(x) P(X = x) dx even when X is a continuous rv. This abuses notation a lot, and you have to keep in mind that f(x) ≠ P(X = x), which is zero when X is continuous.
From calculus we know that if we have a one-dimensional object with density f(x) at each point, then

∫ x f(x) dx = E[X]

gives the center of gravity of the object. If X is discrete, the expected value is an average of the values of X, weighted by the probability it takes on each value.
For example, if X has values 1, 2, 3 with probabilities 1/8, 3/8, and 1/2, respectively, then

E[X] = 1 × 1/8 + 2 × 3/8 + 3 × 1/2 = 19/8.
On the other hand, the straight average of the 3 numbers is 2; the straight average corresponds to each value having equal probability. Now we have a definition of the expected value of any function of X. In particular,

E[X²] = ∫_{−∞}^∞ x² f(x) dx.

We need this if we want to see how the random variable spreads its values around the mean.
Definition 2.21. The variance of a rv X is Var[X] := E[(X − E[X])²]. The standard deviation, abbreviated SD, of X is SD(X) := √Var[X]. Another measure of the spread of a distribution is given by the median and the percentiles. Here's the definition.
Definition 2.22. The median m = med(X) of a random variable X is defined to be the real
number such that P (X ≤ m) = P (X ≥ m) = 12 . The median is also known as the 50th percentile.
Given a real number 0 < q < 1, the 100q th percentile of X is the number xq such that
P (X ≤ xq ) = q.
The interquartile range of a rv is IQR = Q3 − Q1 , i.e., the 75th percentile minus the 25th
percentile. Q1 is the first quartile, the 25th percentile, and Q3 is the third quartile, the 75th
percentile. The median is also known as Q2 , the second quartile.
In other words, 100q% of the values of X are below xq . Percentiles apply to any random variable
and give an idea of the shape of the density. Note that percentiles do not have to be unique, i.e.,
there may be several xq ’s resulting in the same q.
Example 2.23. Let Z ∼ N(0, 1). By symmetry of the density around 0, E[Z] = 0. Then we calculate

E[Z²] = ∫_{−∞}^∞ x² (1/√(2π)) e^{−x²/2} dx,

using integration by parts. We get E[Z²] = 1 and then

Var[Z] = E[Z²] − (E[Z])² = 1.
Example 2.24. Suppose we know that LSAT scores follow a normal distribution with mean 155
and SD=13. You take the test and score 162. What percent of people taking the test did worse
than you?
This is asking for P (X ≤ 162) knowing X ∼ N (155, 13). That’s easy since P (X ≤ 162) =
Ncdf (162, 155, 13) = 0.704. In other words, 162 is the 70.4 percentile of the scores. Suppose
instead someone told you that her score was in the 82nd percentile and you want to know her
actual score. To find that, we are looking to solve P (X ≤ x0.82 ) = 0.82.
Now here’s a proposition which says that the mean is the best estimate of a rv X in the mean
square sense, and the median is the best estimate in the mean absolute deviation sense.
(ii.) The mean of X, E[X] = µ, is the unique constant a which minimizes E[(X − a)²]. Then min_a E[(X − a)²] = E[(X − µ)²] = Var[X].

(iii.) A median med(X) is a constant which provides a minimum for E|X − a|. In other words, min_a E|X − a| = E|X − med(X)|.
The second statement says that the variance is the minimum of the mean squared distance of the
rv X to its mean. The third statement says that a median (which may not be unique) satisfies a
similar property for the absolute value of the distance.
Proof. First, writing out the variance with µ = E[X],

Var[X] = E[(X − µ)²] = E[X² − 2Xµ + µ²] = E[X²] − 2µE[X] + µ² = E[X²] − µ².
For ii. we will assume X is a continuous rv with pdf f. With G(a) = ∫_{−∞}^∞ (x − a)² f(x) dx,

G′(a) = (d/da) ∫_{−∞}^∞ (x − a)² f(x) dx = ∫_{−∞}^∞ −2(x − a) f(x) dx = 0

implies ∫ x f(x) dx = a ∫ f(x) dx = a. This assumes we can interchange the derivative and the integral. Furthermore, G″(a) = 2 ∫ f(x) dx = 2 > 0. Consequently, a = E[X] provides a minimum.
The last property is a little trickier since we can't take derivatives at first. We get rid of the absolute value signs first:

E|X − a| = ∫_{−∞}^∞ |x − a| f(x) dx = ∫_{−∞}^a −(x − a) f(x) dx + ∫_a^∞ (x − a) f(x) dx ≡ H(a).

Differentiating (again interchanging derivative and integral) gives H′(a) = ∫_{−∞}^a f(x) dx − ∫_a^∞ f(x) dx, and setting H′(a) = 0 we conclude that

P(X ≤ a) = ∫_{−∞}^a f(x) dx = ∫_a^∞ f(x) dx = P(X ≥ a) = 1/2,

and this says that a is a median of X. Furthermore, H″(a) = 2f(a) ≥ 0, so a = med(X) does provide a minimum (but note that H is not necessarily strictly concave up). ■
2.3 Moment Generating Function

Next we introduce a special function of X. This will give us a method of calculating means and variances, usually in a much simpler way than doing it directly.
Definition 2.26. The moment-generating function (mgf) of a rv X is M(t) := E[e^{tX}]. Explicitly, we define

M(t) = ∫_{−∞}^∞ e^{tx} f(x) dx, if X is continuous;
M(t) = Σ_x e^{tx} P(X = x), if X is discrete.
We assume the integral or sum exists for all t ∈ (−δ, δ) for some δ > 0. One reason the mgf is so useful is the following theorem. It says that if we know the mgf, we can find the moments E[X^n], n = 1, 2, . . ., by taking derivatives.
Theorem 2.27. If X has mgf M(t), then

E[X^n] = (d^n/dt^n) M(t) |_{t=0}.
Proof. The proof is easy if we assume that we can switch integrals and derivatives:

(d^n/dt^n) M(t) = ∫_{−∞}^∞ (d^n/dt^n) e^{tx} f(x) dx = ∫_{−∞}^∞ x^n e^{tx} f(x) dx.

Setting t = 0 then gives

(d^n/dt^n) M(t) |_{t=0} = ∫_{−∞}^∞ x^n f(x) dx = E[X^n]. ■
Example 2.28. Let's use the mgf to find the mean and variance of X ∼ Binomial(n, p):

M(t) = Σ_{x=0}^n e^{tx} P(X = x) = Σ_{x=0}^n C_{n,x} (pe^t)^x (1 − p)^{n−x} = (pe^t + (1 − p))^n,

where we used the binomial theorem in the last line. Now that we know the mgf we can find any moment by taking derivatives. Here are the first two:

M′(t) = npe^t (pe^t + (1 − p))^{n−1} ⟹ M′(0) = E[X] = np,

and

M″(t) = n(n − 1)p²e^{2t} (pe^t + (1 − p))^{n−2} + npe^t (pe^t + (1 − p))^{n−1} ⟹ M″(0) = E[X²] = n(n − 1)p² + np.

Therefore

Var[X] = E[X²] − (E[X])² = n(n − 1)p² + np − n²p² = np(1 − p).
Now we use the mgf to calculate the mean and variances of some of the important continuous
distributions.
For X ∼ U(a, b), the mgf is

M(t) = ∫_a^b e^{tx} (1/(b − a)) dx = (1/(b − a)) [e^{tx}/t]_a^b = (e^{tb} − e^{ta})/(t(b − a)).

Then

M′(t) = (e^{at}(at − 1) + e^{bt}(1 − bt))/((a − b)t²), and lim_{t→0} M′(t) = (a + b)/2.

We conclude E[X] = (a + b)/2. While we could find M″(0) = E[X²], it is actually easier to find this directly:

E[X²] = ∫_a^b x²/(b − a) dx = (b³ − a³)/(3(b − a)) ⟹ Var[X] = E[X²] − (E[X])² = (b − a)²/12.
(c) X ∼ N(0, 1), f(x) = (1/√(2π)) e^{−x²/2}, −∞ < x < ∞. The mgf for the standard normal distribution is

M(t) = ∫_{−∞}^∞ e^{tx} (1/√(2π)) e^{−x²/2} dx
= (1/√(2π)) ∫_{−∞}^∞ e^{tx − x²/2} dx
= (1/√(2π)) ∫_{−∞}^∞ e^{t²/2} e^{−(x−t)²/2} dx
= e^{t²/2} (1/√(2π)) ∫_{−∞}^∞ e^{−(x−t)²/2} dx
= e^{t²/2},
since

(1/√(2π)) ∫_{−∞}^∞ e^{−(x−t)²/2} dx = 1.
(d) X ∼ N(µ, σ). All we have to do is convert X to standard normal. Let Z = (X − µ)/σ. We know Z ∼ N(0, 1), and we may use the previous part to write M_Z(t) = e^{t²/2}. How do we get the mgf for X from that? Well, we know X = σZ + µ and so

M_X(t) = E[e^{tX}] = E[e^{(σZ+µ)t}] = e^{µt} E[e^{(tσ)Z}] = e^{µt} e^{(σt)²/2} = e^{µt + σ²t²/2}.

Then M′(t) = e^{σ²t²/2 + µt} (µ + σ²t), so that M′(0) = E[X] = µ. Next,

M″(t) = e^{σ²t²/2 + µt} (σ² + (µ + σ²t)²) ⟹ M″(0) = σ² + µ².

This gives us Var[X] = E[X²] − (E[X])² = σ² + µ² − µ² = σ².
We record here the mean and variance of some important discrete distributions:

  Distribution         Mean    Variance
  Bernoulli(1, 0, p)   p       p(1 − p)
  Binomial(n, p)       np      np(1 − p)
  Poisson(λ)           λ       λ
  Geometric(p)         1/p     (1 − p)/p²
  NegBinomial(r, p)    r/p     r(1 − p)/p²
2.4 Characteristic Function

The characteristic function of a rv X is defined as

φ_X(t) := E[e^{itX}], t ∈ R,

where i = √(−1).
Vice versa, it turns out that the probability distribution of X is uniquely determined by the characteristic function φ_X. The function φ_X has the advantage that it always exists, because the random variable e^{itX} is bounded (|e^{itX}| = 1).
3 Codependence Structures
In probability theory and statistics, the concept of codependence structure refers to the way in
which two or more random variables are related to each other.
Specifically, it describes the pattern of association or dependence among the variables. There are
different types of codependence structures that will be addressed next.
This chapter is based on Ross (2019) and Barron and Greco (2020).
Definition 3.1. (1) If X and Y are two random variables, the joint cdf is

F_{X,Y}(x, y) := P({X ≤ x} ∩ {Y ≤ y}).

The pair of rvs (X, Y) is continuous if there is a joint density function, and then

F_{X,Y}(x, y) = ∫_{−∞}^x ∫_{−∞}^y f_{X,Y}(u, v) dv du,

so that

f_{X,Y}(x, y) = ∂²F_{X,Y}(x, y)/(∂x∂y).
Knowing the joint distribution of X and Y means we have full knowledge of X and Y individually. For example, if we know F_{X,Y}(x, y), then

F_X(x) = F_{X,Y}(x, ∞), and F_Y(y) = F_{X,Y}(∞, y).
The resulting F_X and F_Y are called the marginal cumulative distribution functions. The marginal densities, when there is a joint density, are given by

f_X(x) = ∫_{−∞}^∞ f_{X,Y}(x, y) dy, and f_Y(y) = ∫_{−∞}^∞ f_{X,Y}(x, y) dx.
Example 3.2. The function

f(x, y) = { 8xy, 0 ≤ x < y ≤ 1;  0, otherwise }

is given. First we verify it is a joint density. Since f ≥ 0, all we need to check is that the double integral is one:

∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dx dy = ∫_0^1 ∫_0^y 8xy dx dy = ∫_0^1 8y (y²/2) dy = 1.
Next we find the marginals:

f_X(x) = ∫_{−∞}^∞ f(x, y) dy = ∫_x^1 8xy dy = 4x(1 − x²), for 0 ≤ x ≤ 1,

and

f_Y(y) = ∫_{−∞}^∞ f(x, y) dx = ∫_0^y 8xy dx = 4y³, for 0 ≤ y ≤ 1,

with both marginals equal to 0 otherwise.
If X and Y are discrete rvs, the joint pmf is p(x, y) = P(X = x, Y = y). The marginals are then given by p_X(x) = P(X = x) = Σ_y p(x, y) and p_Y(y) = P(Y = y) = Σ_x p(x, y).

In general, for any set C ⊂ R × R, the probability that the pair (X, Y) falls in C is

P((X, Y) ∈ C) = ∫∫_C f_{X,Y}(x, y) dx dy, if X, Y are continuous;
P((X, Y) ∈ C) = Σ_{(x,y)∈C} p_{X,Y}(x, y), if X, Y are discrete.
Definition 3.3. If (X, Y) have joint density f_{X,Y}(x, y), the expected value of a function g of the rvs is

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) f_{X,Y}(x, y) dx dy, if X, Y are continuous;
E[g(X, Y)] = Σ_{x,y} g(x, y) P(X = x, Y = y), if X, Y are discrete.
Example 3.4. We calculate E[X + Y] assuming we have the joint density f_{X,Y}(x, y) of (X, Y). By definition,

E[X + Y] = ∫∫ (x + y) f_{X,Y}(x, y) dx dy
= ∫∫ x f_{X,Y}(x, y) dx dy + ∫∫ y f_{X,Y}(x, y) dx dy
= ∫ x ( ∫ f_{X,Y}(x, y) dy ) dx + ∫ y ( ∫ f_{X,Y}(x, y) dx ) dy
= ∫ x f_X(x) dx + ∫ y f_Y(y) dy
= E[X] + E[Y].

Notice that the first E uses the joint density f_{X,Y}, while the second and third E's use f_X and f_Y, respectively.
Example 3.5. Suppose (X, Y ) have joint density f (x, y) = 1, 0 ≤ x, y ≤ 1, and f (x, y) = 0
otherwise. This models picking a random point (x, y) in the unit square. If we want to calculate
P (X < Y ), this uses the density.
P(X < Y) = ∫∫_{0≤x<y≤1} f(x, y) dx dy = ∫_0^1 ∫_0^y 1 dx dy = [y²/2]_0^1 = 1/2.

Similarly,

P(X² + Y² ≤ 1/4) = ∫∫_{0≤x²+y²≤1/4} 1 dx dy = π/16,

the area of the quarter disk of radius 1/2 inside the unit square.
Also,

E[X² + Y²] = ∫_0^1 ∫_0^1 (x² + y²) f(x, y) dx dy = 2/3.
In general, if we are given a set D ⊂ R², the density

f(x, y) = { 1/(area of D), (x, y) ∈ D;  0, otherwise }

is called a uniform density on D, and we write (X, Y) ∼ Unif(D). You see that E[XY] ≠ E[X] × E[Y] in general, but there are important cases when this is true. For that we need the notion of independent random variables.
3.2 Independent Random Variables

Definition 3.6. The rvs X and Y are independent if

P(X ≤ x, Y ≤ y) = P(X ≤ x) P(Y ≤ y), ∀ x, y ∈ R.

If (X, Y) has a joint density f_{X,Y}, X has density f_X, and Y has density f_Y, then independence means that the joint density factors into the individual densities:

f_{X,Y}(x, y) = f_X(x) f_Y(y), ∀ x, y ∈ R.
One of the main consequences of independence is the following fact. It says the expected value of
a product of rvs is the product of the expected value of each rv.
Proposition 3.7. If X, Y are independent, then

E[XY] = E[X] × E[Y].

Proof. In the continuous case,

E[XY] = ∫∫ xy f_X(x) f_Y(y) dx dy = ( ∫ x f_X(x) dx )( ∫ y f_Y(y) dy ) = E[X] × E[Y]. ■
Independence also allows us to find an explicit expression for the cumulative distribution of the sum of two random variables.

Proposition 3.8. Let X and Y be independent, with Y continuous with density f_Y. Then

F_{X+Y}(w) := P(X + Y ≤ w) = ∫_{−∞}^∞ F_X(w − y) f_Y(y) dy.

Proof. This is really another application of the Law of Total Probability. To see this, write (informally)

P(X + Y ≤ w) = ∫ P(X + Y ≤ w, Y = y) dy = ∫ P(X ≤ w − y) P(Y = y) dy = ∫ F_X(w − y) f_Y(y) dy.

The first equality uses the Law of Total Probability, whereas the second equality uses the independence. ■
Example 3.9. Suppose X and Y are independent Exp(λ) rvs. Then, for w ≥ 0,

P(X + Y ≤ w) = ∫_0^∞ F_X(w − y) f_Y(y) dy = ∫_0^w (1 − e^{−λ(w−y)}) λe^{−λy} dy = 1 − (λw + 1)e^{−λw} = F_{X+Y}(w).

If w < 0, F_{X+Y}(w) = 0. To find the density we take the derivative with respect to w to get

f_{X+Y}(w) = λ²w e^{−λw}, w ≥ 0.

It turns out that this is the pdf of a so-called Gamma(λ, 2) rv.
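A quick simulation sketch checking the formula for F_{X+Y} (the values of λ and w are arbitrary illustrative choices):

    import math, random

    lam, w, n = 2.0, 1.0, 200_000
    hits = sum(random.expovariate(lam) + random.expovariate(lam) <= w
               for _ in range(n))
    print(hits / n)                                 # Monte Carlo estimate of P(X + Y <= w)
    print(1 - (lam * w + 1) * math.exp(-lam * w))   # exact cdf value, approx 0.594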
3.3 Covariance and Correlation

Definition 3.10. The covariance of two rvs X and Y is

Cov[X, Y] := E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y],

and the correlation of X and Y is

ρ(X, Y) := Cov[X, Y]/√(Var[X] Var[Y]).
It looks like covariance measures how independent X and Y are. It is certainly true that if X, Y
are independent, then ρ(X, Y ) = 0, but the reverse is false.
Here's one of the more important implications of independence.

Proposition 3.11. If X and Y are uncorrelated, i.e., Cov[X, Y] = 0 (which holds in particular when they are independent), then

Var[X + Y] = Var[X] + Var[Y].

Proof.

Var[X + Y] = E[(X + Y)²] − (E[X + Y])²
= E[X² + 2XY + Y²] − (E[X])² − 2E[X]E[Y] − (E[Y])²
= Var[X] + Var[Y] + 2(E[XY] − E[X]E[Y])
= Var[X] + Var[Y] + 2Cov[X, Y],

and the covariance term vanishes. ■
Remark 3.12. This can be extended to n rvs X_1, . . . , X_n. If they are uncorrelated (which is true if they are independent), then Var[X_1 + · · · + X_n] = Var[X_1] + · · · + Var[X_n].
Let us consider now the special case where X and Y are indicator variables for whether or not the events A and B occur. That is, for events A and B, define

X = { 1, if A occurs;  0, otherwise },  and  Y = { 1, if B occurs;  0, otherwise }.

Then,

Cov[X, Y] = E[XY] − E[X]E[Y],

and, because XY will equal 1 or 0 depending on whether or not both X and Y equal 1, we see that

Cov[X, Y] = P(X = 1, Y = 1) − P(X = 1) P(Y = 1).
From this we see that

Cov[X, Y] > 0 ⟺ P(X = 1, Y = 1) > P(X = 1) P(Y = 1)
⟺ P(X = 1, Y = 1)/P(X = 1) > P(Y = 1)
⟺ P(Y = 1 | X = 1) > P(Y = 1).

That is, the covariance of X and Y is positive if the outcome X = 1 makes it more likely that Y = 1 (which, as is easily seen by symmetry, also implies the reverse). In general it can be shown that a positive value of Cov[X, Y] is an indication that Y tends to increase as X does, whereas a negative value indicates that Y tends to decrease as X increases.
Example 3.13. Suppose X and Y have joint density

f(x, y) = (1/y) e^{−(y + x/y)}, 0 < x, y < ∞,

and f(x, y) = 0 otherwise; we compute Cov[X, Y]. To show that f(x, y) is a joint density function we need to show it is nonnegative, which is immediate, and that ∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dy dx = 1. We prove the latter as follows:

∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dx dy = ∫_0^∞ ∫_0^∞ (1/y) e^{−(y+x/y)} dx dy = ∫_0^∞ e^{−y} ( ∫_0^∞ (1/y) e^{−x/y} dx ) dy = ∫_0^∞ e^{−y} dy = 1.
The marginal density of Y is

f_Y(y) = ∫_0^∞ (1/y) e^{−y} e^{−x/y} dx = e^{−y},

so Y ∼ Exp(1) and E[Y] = 1.
We compute E[X] and E[XY] as follows:

E[X] = ∫_{−∞}^∞ ∫_{−∞}^∞ x f(x, y) dx dy = ∫_0^∞ e^{−y} ( ∫_0^∞ (x/y) e^{−x/y} dx ) dy.
Now, ∫_0^∞ (x/y) e^{−x/y} dx is the expected value of an exponential random variable with parameter 1/y, and thus is equal to y. Consequently,

E[X] = ∫_0^∞ y e^{−y} dy = 1.
Also,

E[XY] = ∫_{−∞}^∞ ∫_{−∞}^∞ xy f(x, y) dx dy = ∫_0^∞ y e^{−y} ( ∫_0^∞ (x/y) e^{−x/y} dx ) dy = ∫_0^∞ y² e^{−y} dy.
Integration by parts gives

E[XY] = ∫_0^∞ y² e^{−y} dy = [−y² e^{−y}]_0^∞ + ∫_0^∞ 2y e^{−y} dy = 2E[Y] = 2.
Consequently,

Cov[X, Y] = E[XY] − E[X]E[Y] = 2 − 1 × 1 = 1.
Properties of Covariance

For any rvs X, Y, Z:

1. Cov[X, X] = Var[X],
2. Cov[X, Y] = Cov[Y, X],
3. Cov[cX, Y] = c Cov[X, Y] for any constant c,
4. Cov[X, Y + Z] = Cov[X, Y] + Cov[X, Z].

Whereas the first three properties are immediate, the final one is easily proven as follows:

Cov[X, Y + Z] = E[X(Y + Z)] − E[X]E[Y + Z]
= E[XY] − E[X]E[Y] + E[XZ] − E[X]E[Z]
= Cov[X, Y] + Cov[X, Z].
The fourth property listed easily generalizes to give the following result:

Cov[ Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j ] = Σ_{i=1}^n Σ_{j=1}^m Cov[X_i, Y_j]. (3.2)
A useful expression for the variance of the sum of random variables can be obtained from equation (3.2) as follows:

Var[ Σ_{i=1}^n X_i ] = Cov[ Σ_{i=1}^n X_i, Σ_{j=1}^n X_j ]
= Σ_{i=1}^n Σ_{j=1}^n Cov[X_i, X_j]
= Σ_{i=1}^n Cov[X_i, X_i] + Σ_{i=1}^n Σ_{j≠i} Cov[X_i, X_j]
= Σ_{i=1}^n Var[X_i] + 2 Σ_{i=1}^n Σ_{j<i} Cov[X_i, X_j]. (3.3)
3.4 Conditional Expectation

Proposition 3.14 (Law of Total Expectation). For any rvs X and Y,

E[X] = E[E[X|Y]]. (3.4)

If Y is discrete, then equation (3.4) states that

E[X] = Σ_y E[X|Y = y] P(Y = y), (3.5)

while if Y is continuous with density f_Y(y), then equation (3.4) says that

E[X] = ∫_{−∞}^∞ E[X|Y = y] f_Y(y) dy.
We now give a proof of Proposition 3.14 in the case where X and Y are both discrete random
variables.
Proof. Using the definition of conditional expectation,

Σ_y E[X|Y = y] P(Y = y) = Σ_y Σ_x x P(X = x|Y = y) P(Y = y)
= Σ_x x Σ_y P(X = x, Y = y)
= Σ_x x P(X = x)
= E[X]. ■
One way to understand equation (3.5) is to interpret it as follows: to calculate E[X] we may take a weighted average of the conditional expected values of X given that Y = y, each term E[X|Y = y] being weighted by the probability of the event on which it is conditioned.
Example 3.15. Sam will read either one chapter of his probability book or one chapter of his
history book. If the number of misprints in a chapter of his probability book is Poisson distributed
with mean 2 and if the number of misprints in his history chapter is Poisson distributed with
mean 5, then assuming Sam is equally likely to choose either book, what is the expected number of
misprints that Sam will come across?
Let X be the number of misprints. Because it would be easy to compute E[X] if we knew which book Sam chooses, let

Y = { 1, if Sam chooses his history book;  2, if Sam chooses his probability book }.

Conditioning on Y yields

E[X] = E[X|Y = 1] P(Y = 1) + E[X|Y = 2] P(Y = 2) = 5(1/2) + 2(1/2) = 7/2.
Example 3.16 (The Expectation of the Sum of a Random Number of Random Variables).
Suppose that the expected number of accidents per week at an industrial plant is four. Suppose
also that the numbers of workers injured in each accident are independent random variables
with a common mean of 2. Assume also that the number of workers injured in each accident is
independent of the number of accidents that occur. What is the expected number of injuries during
a week?
Letting N denote the number of accidents and X_i the number injured in the ith accident, i = 1, 2, . . ., the total number of injuries can be expressed as Σ_{i=1}^N X_i. Hence, we need to compute the expected value of the sum of a random number of random variables. Because it is easy to compute the expected value of the sum of a fixed number of random variables, let us try conditioning on N. This yields

E[ Σ_{i=1}^N X_i ] = E[ E[ Σ_{i=1}^N X_i | N ] ].
But

E[ Σ_{i=1}^N X_i | N = n ] = E[ Σ_{i=1}^n X_i | N = n ] = E[ Σ_{i=1}^n X_i ] (by the independence of the X_i and N) = nE[X],

which yields

E[ Σ_{i=1}^N X_i | N ] = N E[X],

and thus

E[ Σ_{i=1}^N X_i ] = E[N E[X]] = E[N] E[X].
Therefore, in our example, the expected number of injuries during a week equals 4 × 2 = 8.
The random variable Σ_{i=1}^N X_i, equal to the sum of a random number N of independent and identically distributed random variables that are also independent of N, is called a compound random variable.
As just shown in Example 3.16, the expected value of a compound random variable is E[N]E[X]. When there is no obvious random variable to condition on, it often turns out to be useful to condition on the first thing that occurs. This is illustrated in the following example.
Example 3.17 (The Mean of a Geometric Distribution). A coin, having probability p of coming
up heads, is to be successively flipped until the first head appears. What is the expected number of
flips required?
Let N denote the number of flips required, and let Y = 1 if the first flip lands heads and Y = 0 otherwise. Now,

E[N] = E[N|Y = 1] P(Y = 1) + E[N|Y = 0] P(Y = 0) = pE[N|Y = 1] + (1 − p)E[N|Y = 0]. (3.6)

However,

E[N|Y = 1] = 1, E[N|Y = 0] = 1 + E[N]. (3.7)
To see why equation (3.7) is true, consider E[N | Y = 1]. Since Y = 1, we know that the first flip
resulted in heads and so, clearly, the expected number of flips required is 1.
On the other hand if Y = 0, then the first flip resulted in tails. However, since the successive flips
are assumed independent, it follows that, after the first tail, the expected additional number of
flips until the first head is just E[N ]. Hence E[N | Y = 0] = 1 + E[N ].
Substituting equation (3.7) into equation (3.6) yields

E[N] = p + (1 − p)(1 + E[N]),

or

E[N] = 1/p.
Conditional Variance
Another way to use conditioning to obtain the variance of a random variable is to apply the conditional variance formula. The conditional variance of X given that Y = y is defined by

Var[X|Y = y] := E[ (X − E[X|Y = y])² | Y = y ].

That is, the conditional variance is defined in exactly the same manner as the ordinary variance, with the exception that all probabilities are determined conditional on the event that Y = y. Expanding the right side of the preceding and taking the expectation term by term yields

Var[X|Y = y] = E[X²|Y = y] − (E[X|Y = y])².
Letting Var[X|Y] denote the function of Y whose value when Y = y is Var[X|Y = y], we have the conditional variance formula:

Var[X] = E[Var[X|Y]] + Var[E[X|Y]].
Proof.

E[Var[X|Y]] = E[ E[X²|Y] − (E[X|Y])² ] = E[X²] − E[ (E[X|Y])² ],

and

Var[E[X|Y]] = E[ (E[X|Y])² ] − (E[E[X|Y]])² = E[ (E[X|Y])² ] − (E[X])².

Therefore,

E[Var[X|Y]] + Var[E[X|Y]] = E[X²] − (E[X])² = Var[X]. ■
As noted in Example 3.16, where its expected value was determined, the random variable S = Σ_{i=1}^N X_i is called a compound random variable. Let's find its variance.
Whereas we could obtain E[S²] by conditioning on N, let us instead use the conditional variance formula. Suppose the X_i are i.i.d. with mean µ and variance σ², and are independent of N. Now,

Var[S|N = n] = Var[ Σ_{i=1}^N X_i | N = n ] = Var[ Σ_{i=1}^n X_i | N = n ] = Var[ Σ_{i=1}^n X_i ] = nσ².

Therefore,

Var[S|N] = Nσ², E[S|N] = Nµ,
and the conditional variance formula gives

Var[S] = E[Nσ²] + Var[Nµ] = σ²E[N] + µ²Var[N].
If N is a Poisson random variable, then S = Σ_{i=1}^N X_i is called a compound Poisson random variable. Because the variance of a Poisson random variable is equal to its mean, it follows that for a compound Poisson random variable having E[N] = λ,

Var[S] = λσ² + λµ² = λE[X²],

where X has the same distribution as the X_i.
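A simulation sketch of the identity Var[S] = λE[X²]; here λ = 4 and X uniform on {1, 2, 3} are illustrative choices, so E[X²] = 14/3 and the target variance is 56/3 ≈ 18.67:

    import random
    from statistics import variance

    lam, xs = 4.0, [1, 2, 3]

    def compound_poisson():
        # N ~ Poisson(lam): count exponential interarrival times up to time 1
        n, t = 0, random.expovariate(lam)
        while t <= 1.0:
            n += 1
            t += random.expovariate(lam)
        # S = X_1 + ... + X_N, with the X_i i.i.d. and independent of N
        return sum(random.choice(xs) for _ in range(n))

    samples = [compound_poisson() for _ in range(100_000)]
    print(variance(samples))   # approx 18.67 = lam * E[X^2]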
4 Convergence and Limit Theorems

4.1 The Central Limit Theorem
Later we will need the following results. The first part says that if two rvs have the same mgf,
then they have the same distribution. The second part says that if the mgfs of a sequence of rvs
converges to an mgf, then the cdfs must also converge to the cdf of the limit rv.
Theorem 4.1. If X and Y are two rvs such that M_X(t) = M_Y(t) (for all t close to 0), then X and Y have the same cdfs.

On the other hand, if X_k, k = 1, 2, . . ., is a sequence of rvs with mgfs M_k(t) and cdfs F_k(x), respectively, and if lim_{k→∞} M_k(t) = M_X(t) where M_X(t) is an mgf, then there is a unique cdf F_X and lim_{k→∞} F_k(x) = F_X(x) at each point x of continuity of F_X.
Proposition 4.2. Let X_1, X_2, . . . , X_n be independent rvs with mgfs M_{X_i}(t), i = 1, 2, . . . , n. Let S_n = X_1 + · · · + X_n. Then M_{S_n}(t) = M_{X_1}(t) · M_{X_2}(t) · · · M_{X_n}(t).

Proof. This is directly from the definition and the independence:

M_{S_n}(t) = E[e^{t(X_1 + ··· + X_n)}] = E[e^{tX_1}] · · · E[e^{tX_n}] = M_{X_1}(t) · · · M_{X_n}(t). ■

In particular, if X_i ∼ N(µ_i, σ_i) are independent, then

M_{S_n}(t) = Π_{i=1}^n exp(tµ_i + σ_i²t²/2) = exp( t Σ_{i=1}^n µ_i + (t²/2) Σ_{i=1}^n σ_i² ).

Since mgfs determine a distribution uniquely according to Theorem 4.1, we see that

S_n ∼ N( Σ_{i=1}^n µ_i, √(Σ_{i=1}^n σ_i²) ).
Example 4.3. The sum of independent Geom(p) random variables is Negative Binomial. In particular, suppose X is the number of Bernoulli trials until we get r successes, with probability p of success on each trial. Then X = X_1 + X_2 + · · · + X_r, where X_i ∼ Geom(p), i = 1, 2, . . . , r, is the number of trials from one success until the next. This is true since once we have a success we simply start counting anew from the last success until we get another success. Now we have, by independence,

E[X] = Σ_{i=1}^r E[X_i] = r/p, and Var[X] = Σ_{i=1}^r Var[X_i] = r(1 − p)/p².

In addition, using the mgf of Geom(p), namely M_{X_i}(t) = pe^t/(1 − e^t(1 − p)), t < −ln(1 − p), we have

M_X(t) = Π_{i=1}^r M_{X_i}(t) = e^{rt} p^r / (1 − e^t(1 − p))^r, t < −ln(1 − p).
We have seen that the sum of independent normal rvs is exactly normal. The Central Limit
Theorem says that even if the Xi ’s are not normal, the sum is approximately normal if the
number of rvs is large.
Theorem 4.4 (Central Limit Theorem). Let X_1, X_2, . . . be a sequence of independent rvs all having the same distribution, with E[X_1] = µ and Var[X_1] = σ². Then for any a, b ∈ R,

lim_{n→∞} P( a ≤ (X_1 + · · · + X_n − nµ)/(σ√n) ≤ b ) = P(a ≤ Z ≤ b),

where Z ∼ N(0, 1).
Proof. We sketch the proof assuming the mgf M(t) of the X_i exists, and we may take µ = 0 (otherwise replace X_i by X_i − µ). Let Z_n = (X_1 + · · · + X_n)/(σ√n); by Proposition 4.2 its mgf is M_{Z_n}(t) = [M(t/(σ√n))]^n. If we can show M_{Z_n}(t) → e^{t²/2}, then by Theorem 4.1 we can conclude that the cdf of Z_n will converge to the cdf of the random variable that has mgf e^{t²/2}. But that random variable is Z ∼ N(0, 1). That will complete the proof. Therefore, all we need to do is to show that

lim_{n→∞} n ln M(t/(σ√n)) = t²/2.
To see this, change variables to x = t/(σ√n), so that

lim_{n→∞} n ln M(t/(σ√n)) = (t²/σ²) lim_{x→0} ln M(x)/x².
Since ln M(0) = 0 we may use L'Hopital's rule to evaluate the limit. We get

(t²/σ²) lim_{x→0} ln M(x)/x² = (t²/σ²) lim_{x→0} M′(x)/(2xM(x))
= (t²/(2σ²)) lim_{x→0} M″(x)/(xM′(x) + M(x)) (using L'Hopital again)
= (t²/(2σ²)) · M″(0)/(0 · M′(0) + M(0))
= (t²/(2σ²)) · σ²/(0 × 0 + 1)
= t²/2,

since M(0) = 1, M′(0) = E[X] = 0, and M″(0) = E[X²] = σ². This completes the proof.
■
The full proof of Theorem 4.4 involves the characteristic function, and can be seen for example in
Ross (2019).
Example 4.5. Suppose an elevator is designed to hold 2000 pounds. The mean weight of a person getting on the elevator is 175 pounds with standard deviation 15 pounds. How many people can board the elevator so that the chance it is overloaded is at most 1%? Let W = X_1 + · · · + X_n be the total weight of the n people who board the elevator. We don't know the distribution of the weights of individual people (which is probably not normal), but we do know E[X] = 175 and Var[X] = 15². By the central limit theorem, W ≈ N(175n, 15√n), and we want to find n so that P(W ≥ 2000) ≤ 0.01. If we standardize W we get
P(W ≥ 2000) = P( Z ≥ (2000 − 175n)/(15√n) ) ≤ 0.01 ⟺ (2000 − 175n)/(15√n) ≥ z_{0.99} = 2.326.

Checking values of n, the condition holds for n ≤ 10: for n = 10 the ratio is 250/(15√10) ≈ 5.27, while for n = 11 it is only 75/(15√11) ≈ 1.51 < 2.326.
The maximum number of people that can board the elevator and meet the criterion is therefore 10. Without knowing the distribution of the weights of people, there is no other way to do this problem.
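The inequality is easy to check numerically for each n; a minimal Python sketch:

    from statistics import NormalDist

    z99 = NormalDist().inv_cdf(0.99)   # approx 2.326
    for n in range(1, 15):
        ratio = (2000 - 175 * n) / (15 * n ** 0.5)
        print(n, round(ratio, 2), ratio >= z99)
    # the condition holds for n = 1, ..., 10 and fails from n = 11 on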
Theorem 4.6 (Chebychev's Inequality). Let X be a rv with mean µ and variance σ². Then

P(|X − µ| ≥ c) ≤ σ²/c², for any constant c > 0.

The larger c is, the smaller the bound on the probability.
Proof. The argument for Chebychev is simple. Assume X has pdf f. Then

σ² := E[|X − µ|²] = ∫_{|x−µ|≥c} |x − µ|² f(x) dx + ∫_{|x−µ|≤c} |x − µ|² f(x) dx
≥ ∫_{|x−µ|≥c} |x − µ|² f(x) dx
≥ c² ∫_{|x−µ|≥c} f(x) dx
= c² P(|X − µ| ≥ c). ■
Chebychev is used to give us the Weak Law of Large Numbers which tells us that the mean of a
random sample should converge to the population mean as the sample size goes to infinity.
Theorem 4.7 (Weak Law of Large Numbers). Let X_1, . . . , X_n be a random sample, i.e., independent rvs all having the same distribution as a rv X with finite mean E[X] = µ and finite variance σ² = Var[X]. Then, for any constant c > 0, with X̄ = (X_1 + · · · + X_n)/n,

lim_{n→∞} P(|X̄ − µ| ≥ c) = 0.
Proof. We know E[X̄] = µ and Var[X̄] = σ²/n. By Chebychev's inequality,

P(|X̄ − µ| ≥ c) ≤ Var[X̄]/c² = σ²/(nc²) → 0 as n → ∞. ■
4.2 Laws of Large Numbers
In this section the strong law of large numbers is presented. As the proof of the strong law makes
use of the Borel-Cantelli lemma, this result will be presented first.
Lemma 4.8 (Borel-Cantelli). For a sequence of events A_i, i ≥ 1, let N denote the number of these events that occur. If Σ_{i=1}^∞ P(A_i) < ∞, then P(N = ∞) = 0.

Proof. Suppose that Σ_{i=1}^∞ P(A_i) < ∞. Now, if N = ∞, then for every n < ∞ at least one of the events A_n, A_{n+1}, . . . will occur. That is, N = ∞ implies that ∪_{i=n}^∞ A_i occurs for every n. Thus, for every n,

P(N = ∞) ≤ P( ∪_{i=n}^∞ A_i ) ≤ Σ_{i=n}^∞ P(A_i),

where the final inequality follows from Boole's inequality. Because Σ_{i=1}^∞ P(A_i) < ∞ implies that Σ_{i=n}^∞ P(A_i) → 0 as n → ∞, we obtain from the preceding, upon letting n → ∞, that P(N = ∞) = 0, which proves the result. ■
Remark 4.9. The Borel-Cantelli lemma is actually quite intuitive, for if we define the indicator variable I_i to equal 1 if A_i occurs and to equal 0 otherwise, then N = Σ_{i=1}^∞ I_i, implying that

E[N] = Σ_{i=1}^∞ E[I_i] = Σ_{i=1}^∞ P(A_i).

Consequently, the Borel-Cantelli lemma states that if the expected number of events that occur is finite, then the probability that an infinite number of them occur is 0. This is intuitive, because if there were a positive probability that an infinite number of events could occur then E[N] would be infinite.
Theorem 4.10 (Strong Law of Large Numbers). Let X_1, X_2, . . . be a sequence of i.i.d. random variables, with E[X_i] = µ < ∞ and Var[X_i] = σ² < ∞. Then, with probability 1,

(X_1 + X_2 + · · · + X_n)/n → µ, as n → ∞.

Suppose that X_1, X_2, . . . are independent and identically distributed random variables with mean µ, and let X̄_n = (1/n) Σ_{i=1}^n X_i be the average of the first n of them. The strong law of large numbers states that P(lim_{n→∞} X̄_n = µ) = 1. That is, with probability 1, X̄_n converges to µ as n → ∞. We will give a proof of this result under the assumption that σ², the variance of X_i, is finite (which is equivalent to assuming that E[X_i²] < ∞). Because proving the strong law requires showing, for any ϵ > 0, that |X̄_n − µ| > ϵ for only a finite number of values of n, it is natural to attempt to prove it by utilizing the Borel-Cantelli lemma.
That is, the result would follow if we could show that Σ_{n=1}^∞ P(|X̄_n − µ| > ϵ) < ∞. However, Chebychev's inequality only yields

Σ_{n=1}^∞ P(|X̄_n − µ| > ϵ) ≤ Σ_{n=1}^∞ Var[X̄_n]/ϵ² = (σ²/ϵ²) Σ_{n=1}^∞ 1/n = ∞.
Thus, a straightforward use of Borel-Cantelli does not work. However, as we now show, a tweaking of the argument, where we first consider a subsequence of X̄_n, n ≥ 1, allows us to prove the strong law.
Proof. Suppose first that the X_i are nonnegative random variables. Fix α > 1, and let n_j be the smallest integer greater than or equal to α^j, j ≥ 1. From Chebyshev's inequality we see that

P(|X̄_{n_j} − µ| > ϵ) ≤ Var[X̄_{n_j}]/ϵ² = σ²/(n_j ϵ²).

Consequently,

Σ_{j=1}^∞ P(|X̄_{n_j} − µ| > ϵ) ≤ (σ²/ϵ²) Σ_{j=1}^∞ 1/n_j ≤ (σ²/ϵ²) Σ_{j=1}^∞ (1/α)^j < ∞.

Therefore, by the Borel-Cantelli lemma, it follows that, with probability 1, |X̄_{n_j} − µ| > ϵ for only a finite number of j. As this is true for any ϵ > 0, we see that, with probability 1,

lim_{j→∞} X̄_{n_j} = µ. (4.8)

Because n_j → ∞ as j → ∞, it follows that for any m > α there is an integer j(m) such that n_{j(m)} ≤ m < n_{j(m)+1}. The nonnegativity of the X_i yields that

Σ_{i=1}^{n_{j(m)}} X_i ≤ Σ_{i=1}^m X_i ≤ Σ_{i=1}^{n_{j(m)+1}} X_i
for all but a finite number of m. Consequently, from (4.8) and the preceding, it follows, with probability 1, that

µ/(α + ϵ) < X̄_m < (α + ϵ)µ

for all but a finite number of values of m. As this is true for any ϵ > 0, α > 1, it follows that with probability 1,

lim_{m→∞} X̄_m = µ.
Thus the result is proven when the X_i are nonnegative. In the general case, let

X_i⁺ = { X_i, if X_i ≥ 0;  0, if X_i < 0 },  and  X_i⁻ = { 0, if X_i ≥ 0;  −X_i, if X_i < 0 }.

X_i⁺ and X_i⁻ are called, respectively, the positive and negative parts of X_i. Noting that X_i = X_i⁺ − X_i⁻, it follows from the previous result for nonnegative random variables that, with probability 1,

lim_{m→∞} (1/m) Σ_{i=1}^m X_i⁺ = µ⁺, lim_{m→∞} (1/m) Σ_{i=1}^m X_i⁻ = µ⁻,

where µ⁺ = E[X_i⁺] and µ⁻ = E[X_i⁻]. Therefore, with probability 1,

lim_{m→∞} (1/m) Σ_{i=1}^m X_i = µ⁺ − µ⁻ = µ.
■
Example 4.11. Suppose that a sequence of independent trials is performed. Let E be a fixed event and denote by P(E) the probability that E occurs on any particular trial. Letting

X_i = { 1, if E occurs on the ith trial;  0, if E does not occur on the ith trial },

we have by the Strong Law of Large Numbers that, with probability 1,

(X_1 + · · · + X_n)/n → E[X] = P(E). (4.9)
Since X1 + · · · + Xn represents the number of times that the event E occurs in the first n trials,
we may interpret equation (4.9) as stating that, with probability 1, the limiting proportion of time
that the event E occurs is just P (E).
5 Simulation
In this section we describe the principal methods that are used to generate random variables, taking as given a good U(0, 1) random variable generator. We begin with Monte Carlo integration and then describe the main methods for random variable generation, including inverse-transform, composition and acceptance-rejection. We also describe the generation of normal random variables and multivariate normal random vectors via the Cholesky decomposition. We end with a discussion of how to generate (non-homogeneous) Poisson processes as well as (geometric) Brownian motions.
This chapter is based on Liu (2001), Glasserman (2004), and Ross (2012).
5.1 Monte Carlo Integration

Suppose we wish to compute θ := ∫₀¹ g(x) dx. If we cannot compute θ analytically, then we could use numerical methods. However, we can also use simulation, and this can be especially useful for high-dimensional integrals. The key observation is to note that θ = E[g(U)], where U ∼ U(0, 1). We can use this observation as follows:

i. Generate U₁, U₂, . . . , Uₙ i.i.d. U(0, 1).
ii. Estimate θ with θ̂ₙ := (g(U₁) + · · · + g(Uₙ))/n.

There are two reasons that explain why θ̂ₙ is a good estimator:

1. θ̂ₙ is unbiased, i.e., E[θ̂ₙ] = θ, and
2. θ̂ₙ is consistent, i.e., θ̂ₙ → θ as n → ∞ with probability 1. This follows immediately from the Strong Law of Large Numbers (SLLN), since g(U₁), g(U₂), . . . , g(Uₙ) are i.i.d. with mean θ.
Example 5.1. Suppose we wish to estimate ∫₀¹ x³ dx using simulation. We know the exact answer is 1/4, but we can also estimate this using simulation. In particular, if we generate n U(0, 1) independent variables, cube them and then take their average, we will have an unbiased estimate.
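A minimal Python sketch of this estimator:

    import random

    n = 100_000
    theta_hat = sum(random.random() ** 3 for _ in range(n)) / n
    print(theta_hat)   # close to the exact value 0.25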
Example 5.2. We wish to estimate θ = ∫₁³ (x² + x) dx, again using simulation. Once again we know the exact answer (it's 38/3), but we can also estimate it by noting that

θ := ∫₁³ 2 (x² + x) (1/2) dx = 2E[X² + X],
where X ∼ U(1, 3). So we can estimate θ by generating n i.i.d. U(0, 1) random variables, converting them to U(1, 3) variables, X₁, . . . , Xₙ, and then taking

θ̂ₙ := 2 ( Σ_{i=1}^n (X_i² + X_i) ) / n.
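A sketch; each U(0, 1) draw is converted to a U(1, 3) variable via X = 1 + 2U:

    import random

    n = 100_000
    xs = [1 + 2 * random.random() for _ in range(n)]   # X_i ~ U(1, 3)
    theta_hat = 2 * sum(x ** 2 + x for x in xs) / n
    print(theta_hat)   # close to the exact value 38/3 = 12.666...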
Suppose now that θ := ∫₀¹ ∫₀¹ g(u₁, u₂) du₁ du₂. Then we can write θ = E[g(U₁, U₂)], where U₁, U₂ are i.i.d. U(0, 1) random variables. Note that the joint PDF satisfies f_{U₁,U₂}(u₁, u₂) = f_{U₁}(u₁) f_{U₂}(u₂) = 1 on [0, 1] × [0, 1]. As before, we can estimate θ using simulation by performing the following steps:

i. Generate n independent bivariate vectors (U₁^(i), U₂^(i)) for i = 1, . . . , n, with all U_j^(i)'s i.i.d. U(0, 1).
ii. Estimate θ with θ̂ₙ := (g(U₁^(1), U₂^(1)) + · · · + g(U₁^(n), U₂^(n)))/n.

As before, the SLLN justifies this approach and guarantees that θ̂ₙ → θ w.p. 1 as n → ∞.
Example 5.3 (Computing a Multi-Dimensional Integral). We can use Monte Carlo to estimate
    θ := ∫₀¹ ∫₀¹ (4x²y + y²) dx dy = E[4X²Y + Y²],
where X, Y are i.i.d. U(0,1). (The true value of θ is easily calculated to be 1.) We can also apply Monte Carlo integration to more general problems. For example, if we want to estimate
    θ = ∫∫_A g(x, y) f(x, y) dx dy,
where f(x, y) is a density function on A, then we observe that θ = E[g(X, Y)] where X, Y have joint density f(x, y). To estimate θ using simulation we simply generate n random vectors (X, Y) with joint density f(x, y) and then estimate θ with
    θ̂n := (g(X1, Y1) + · · · + g(Xn, Yn))/n.
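As an illustration, a sketch of Example 5.3 under the same assumptions (Python/NumPy; the sample size is illustrative):

    import numpy as np

    rng = np.random.default_rng(1)

    n = 100_000
    x = rng.uniform(size=n)                     # X_i i.i.d. U(0,1)
    y = rng.uniform(size=n)                     # Y_i i.i.d. U(0,1), independent of X
    theta_hat = np.mean(4 * x**2 * y + y**2)    # estimates E[4 X^2 Y + Y^2]
    print(theta_hat)                            # true value is 1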
5.2 Univariate Random Variables
The Inverse Transform Method for Discrete Random Variables. Suppose X is a discrete random variable with probability mass function (PMF) P(X = xi) = pi for i = 1, 2, 3, where p1 + p2 + p3 = 1. To generate a sample of X, we first generate U ∼ U(0,1) and then set
    X = x1 if 0 ≤ U ≤ p1,  X = x2 if p1 < U ≤ p1 + p2,  and X = x3 if p1 + p2 < U ≤ 1.
More generally, suppose X can take on n distinct values, x1 < x2 < . . . < xn , with
P (X = xi ) = pi for i = 1, . . . , n.
i. Generate U.
ii. Set X = xj if p1 + · · · + p_{j−1} < U ≤ p1 + · · · + pj. That is, we set X = xj if F(x_{j−1}) < U ≤ F(xj).
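In code, the search in step ii is a lookup in the vector of cumulative sums of the pi's. A sketch in Python/NumPy, where the support and the probabilities are illustrative choices of ours:

    import numpy as np

    rng = np.random.default_rng(2)

    x_vals = np.array([1.0, 2.5, 4.0])   # x_1 < x_2 < x_3 (hypothetical support)
    p = np.array([0.3, 0.5, 0.2])        # p_i = P(X = x_i)
    cdf = np.cumsum(p)                   # F(x_1), F(x_2), F(x_3)

    def sample_discrete(size):
        u = rng.uniform(size=size)
        # smallest j with F(x_{j-1}) < U <= F(x_j)
        j = np.searchsorted(cdf, u, side="left")
        return x_vals[j]

    x = sample_discrete(100_000)
    print([np.mean(x == v) for v in x_vals])   # approximately [0.3, 0.5, 0.2]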
Example 5.4 (Generating a Geometric Random Variable). Suppose X is geometric with parameter p, so that P(X = n) = (1 − p)^{n−1} p. Then we can generate X as follows:
i. Generate U .
ii. Set X = j if
    ∑_{i=1}^{j−1} (1 − p)^{i−1} p < U ≤ ∑_{i=1}^{j} (1 − p)^{i−1} p.
In particular, we set
    X = int( log(U) / log(1 − p) ) + 1,
where int(y) denotes the integer part of y. You should convince yourself that this is correct! How
does this compare to the coin-tossing method for generating X?
Example 5.5 (Generating a Poisson Random Variable). Suppose that X is Poisson(λ), so that P(X = n) = exp(−λ)λⁿ/n!. We can generate X as follows:
i. Generate U.
ii. Search sequentially for the value of X:
    set j = 0, p = e^{−λ}, F = p
    while U > F
        set p = λp/(j + 1), F = F + p, j = j + 1
    set X = j
How much work does this take? What if λ is large? Can we find j more efficiently?
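A direct transcription of this sequential search, as a Python sketch (the value of λ is illustrative):

    import numpy as np

    rng = np.random.default_rng(3)

    def poisson_inverse_transform(lam):
        # Sequential search of the Poisson CDF, using P(X = j+1) = lam/(j+1) * P(X = j).
        u = rng.uniform()
        j, p = 0, np.exp(-lam)   # p = P(X = 0)
        F = p                    # F = P(X <= j)
        while u > F:
            p *= lam / (j + 1)
            F += p
            j += 1
        return j

    x = [poisson_inverse_transform(4.0) for _ in range(100_000)]
    print(np.mean(x))            # close to lam = 4.0

Since the loop terminates once F exceeds U, the expected number of iterations is roughly λ, which is the point of the efficiency question above.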
Suppose now that X is a continuous random variable and we want to generate a value of X. Recall that when X was discrete, we could generate a variate by first generating U and then setting X = xj if F(x_{j−1}) < U ≤ F(xj). This suggests that when X is continuous, we might generate X as follows:
i. Generate U.
ii. Set X = F_X^{−1}(U).
To check that this is correct, note that
    P(X ≤ x) = P(F_X^{−1}(U) ≤ x) = P(U ≤ F_X(x)) = F_X(x),
as desired. This argument assumes F_X^{−1} exists, but there is no problem even when F_X^{−1} does not exist. All we have to do is
i. Generate U.
ii. Set X = min{x : F_X(x) ≥ U}.
An advantage of the inverse transform method is that it is 1-to-1, i.e., one U(0,1) variable produces one X variable. As we will see, this property can be useful for some variance reduction techniques.
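For instance, if X ∼ Exp(λ) then F_X(x) = 1 − e^{−λx} and F_X^{−1}(u) = −log(1 − u)/λ. A sketch (Python/NumPy assumed; the parameter is illustrative):

    import numpy as np

    rng = np.random.default_rng(4)

    def exponential_inverse_transform(lam, size):
        u = rng.uniform(size=size)
        return -np.log(1.0 - u) / lam    # X = F_X^{-1}(U); 1 - U may be replaced by U

    x = exponential_inverse_transform(2.0, 100_000)
    print(x.mean())                      # close to 1/lam = 0.5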
Disadvantages of the Inverse Transform Method
The principal disadvantage of the inverse transform method is that F_X^{−1} may not always be computable. For example, suppose X ∼ N(0,1). Then
    F_X(x) = ∫_{−∞}^{x} (1/√(2π)) exp(−z²/2) dz,
so that we cannot even express F_X in closed form. Even if F_X is available in closed form, it may not be possible to find F_X^{−1} in closed form. For example, suppose F_X(x) = x⁵(1 + x)³/8 for 0 ≤ x ≤ 1. Then we cannot compute F_X^{−1}. One possible solution to these problems is to find F_X^{−1} numerically.
Composition Approach
Another method for generating random variables is the composition approach. Suppose again that X has CDF F_X and that we wish to simulate a value of X. We can often write
    F_X(x) = ∑_{j=1}^∞ pj Fj(x),
where the Fj's are also CDFs, pj ≥ 0 for all j, and ∑_j pj = 1. Equivalently, if the densities exist, then we can write
    f_X(x) = ∑_{j=1}^∞ pj fj(x),
where the fj's are density functions. If it's difficult to simulate X directly using the inverse transform method, then we could use the composition algorithm (see below) instead.
Proposition 5.8 (Composition Algorithm). Assume a random variable X has CDF
    F_X(x) = ∑_{j=1}^∞ pj Fj(x),
where the Fj's are also CDFs, pj ≥ 0 for all j, and ∑_j pj = 1. If we follow the steps:
i. Generate a random variable I with P(I = j) = pj for j ≥ 1;
ii. Given I = j, generate Yj with CDF Fj;
iii. Set X = Yj;
then X has the desired CDF F_X.
Proof. We have
    P(X ≤ x) = ∑_{j=1}^∞ P(X ≤ x | I = j) P(I = j)
             = ∑_{j=1}^∞ P(Yj ≤ x) P(I = j)
             = ∑_{j=1}^∞ Fj(x) pj
             = F_X(x). ■
The proof actually suggests that the composition approach might arise naturally from “sequential”
type experiments. Consider the following example.
Example 5.9 (A Sequential Experiment). Suppose we roll a die and let Y ∈ {1, 2, 3, 4, 5, 6} be the outcome. If Y = i, then we generate Zi from the distribution Fi and set X = Zi. What is the distribution of X? How do we simulate a value of X?
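To illustrate Proposition 5.8, here is a sketch that simulates from a two-component exponential mixture (a hyperexponential distribution); the weights and rates are illustrative choices of ours:

    import numpy as np

    rng = np.random.default_rng(5)

    p = np.array([0.3, 0.7])       # mixture weights p_j with sum 1
    rates = np.array([1.0, 5.0])   # F_j is the Exp(rates[j]) CDF

    def sample_mixture(size):
        i = rng.choice(len(p), size=size, p=p)        # step i: P(I = j) = p_j
        return rng.exponential(scale=1.0 / rates[i])  # steps ii-iii: X = Y_I

    x = sample_mixture(100_000)
    print(x.mean())   # compare with 0.3/1.0 + 0.7/5.0 = 0.44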
Acceptance-Rejection Algorithm
Let X be a random variable with density, f(·), and CDF, F_X(·). Suppose it's hard to simulate a value of X directly using either the inverse transform or composition algorithm. We might then wish to use the acceptance-rejection (A-R) algorithm. Suppose we can find a density g(·) from which it is easy to simulate, and a constant a such that
    f(x)/g(x) ≤ a, for all x.
Proposition 5.10 (Acceptance-Rejection Algorithm). If we follow the steps
    generate Y with density g
    generate U
    while U > f(Y)/(a g(Y))
        generate Y
        generate U
    set X = Y,
then X has the desired density f(·).
Proof. We define B to be the event that Y has been accepted in the while loop, i.e., U ≤ f(Y)/(a g(Y)). We need to show that P(X ≤ x) = F_X(x). First observe that
    P(X ≤ x) = P(Y ≤ x | B) = P((Y ≤ x) ∩ B)/P(B).
Conditioning on Y then shows that P(B) = 1/a and P((Y ≤ x) ∩ B) = F_X(x)/a, so that P(X ≤ x) = F_X(x), as required. ■
Example 5.11 (Generating a β(a, b) Random Variable). Recall that X has a β(a, b) distribution if f(x) = c x^{a−1}(1 − x)^{b−1} for 0 ≤ x ≤ 1, where c is a normalizing constant. Suppose now that we wish to simulate from the β(4, 3) distribution, so that
    f(x) = 60x³(1 − x)², for 0 ≤ x ≤ 1.
We could, for example, integrate f(·) to find F(·), and then try to use the inverse transform approach. However, it is hard to find F^{−1}(·). Instead, let's use the acceptance-rejection algorithm:
i. First choose g(y): let's take g(y) = 1 for y ∈ [0, 1], i.e., Y ∼ U(0,1).
ii. Then choose a so that f(y)/g(y) = f(y) ≤ a for all y ∈ [0, 1]. Take a = 3; it is easy to check that this value works.
We then have the following algorithm:
    generate Y ∼ U(0,1)
    generate U ∼ U(0,1)
    while U > f(Y)/3
        generate Y
        generate U
    set X = Y.
The number of iterations, N, required until a candidate is accepted is geometric with mean E[N] = a, so clearly we would like a to be as small as possible. Usually, this is just a matter of calculus.
Example 5.12 (Generating a β(a, b) Random Variable revisited). Recall the β(4,3) example with PDF f(x) = 60x³(1 − x)², for x ∈ [0, 1]. We chose g(y) = 1 for y ∈ [0, 1], so that Y ∼ U(0,1). The constant a had to satisfy
    f(x)/g(x) ≤ a, for all x ∈ [0, 1],
and we chose a = 3. We can do better by choosing
    a = max_{x∈[0,1]} f(x)/g(x) = f(3/5) ≈ 2.0736.
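A sketch of the resulting A-R algorithm for the β(4,3) density, using this improved constant (Python/NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(6)

    def f(x):
        return 60.0 * x**3 * (1.0 - x)**2     # beta(4,3) density on [0,1]

    a = f(3.0 / 5.0)                          # max of f/g with g = U(0,1) density, ~2.0736

    def sample_beta43():
        while True:
            y = rng.uniform()                 # Y ~ g
            u = rng.uniform()
            if u <= f(y) / a:                 # accept with probability f(Y)/(a g(Y))
                return y

    x = np.array([sample_beta43() for _ in range(50_000)])
    print(x.mean())                           # true mean of beta(4,3) is 4/7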
How Do We Choose g(·)? We would like to choose g(·) to minimize the computational load. This can be achieved by taking g(·) “close” to f(·). Then a will be close to 1, and so fewer iterations will be required in the A-R algorithm. There is a tradeoff, however: if g(·) is “close” to f(·), then it will probably also be hard to simulate from g(·). So we often need to find a balance between having a “nice” g(·) and a small value of a.
So far, we have expressed the A-R algorithm in terms of PDFs, thereby implicitly assuming that we are generating continuous random variables. However, the A-R algorithm also works for discrete random variables, where we simply replace PDFs with PMFs. Letting p denote the target PMF and q the proposal PMF, with py/qy ≤ a for all y, the algorithm is:
    generate Y with PMF q
    generate U
    while U > pY/(a qY)
        generate Y
        generate U
    set X = Y.
Generally, we would use this A-R algorithm when we can simulate Y efficiently.
Example 5.13 (Generating from a Uniform Distribution over a 2-D Region). Suppose (X, Y ) is
uniformly distributed over a 2-dimensional area, A. How would you simulate a sample of (X, Y )
? Note first that if X ∼ U (−1, 1), Y ∼ U (−1, 1) and X and Y are independent then (X, Y ) is
uniformly distributed over the region
A := {(x, y) : −1 ≤ x ≤ 1, −1 ≤ y ≤ 1}.
We can therefore (how?) simulate a sample of (X, Y ) when A is a square. Suppose now that A is
a circle of radius 1 centered at the origin. How do we simulate a sample of (X, Y ) in that case?
Remark 5.14. The A-R algorithm is an important algorithm for generating random variables. Moreover, it can be used to generate samples from distributions that are only known up to a constant. Nevertheless, it is very inefficient in high dimensions.
5.3 Gaussian Random Variables
Suppose now that we wish to generate X ∼ N(µ, σ²). Since we can write X = µ + σZ with Z ∼ N(0,1), we need only concern ourselves with generating N(0,1) random variables. One possibility for doing this is to use the inverse transform method. But we would then have to use numerical methods since we cannot find F_Z^{−1}(·) = Φ^{−1}(·) in closed form. Other approaches for generating N(0,1) random variables include the Box-Muller algorithm, the polar method and rational approximations, all described below.
There are many other methods such as the A-R algorithm that could also be used to generate
N (0, 1) random variables.
The Box-Muller algorithm uses two i.i.d. U (0, 1) random variables to produce two i.i.d. N (0, 1)
random variables. Its working is detailed next.
Proposition 5.15 (Box-Muller Algorithm). Let U1, U2 be two i.i.d. U(0,1) random variables, and set
    X = √(−2 log(U1)) cos(2πU2) and Y = √(−2 log(U1)) sin(2πU2).
Then X and Y are i.i.d. N(0,1) random variables.
Proof. The joint PDF of two i.i.d. N(0,1) random variables is
    f(x, y) = (1/√(2π)) exp(−x²/2) · (1/√(2π)) exp(−y²/2).
Write (X, Y) in polar coordinates, X = R cos(θ) and Y = R sin(θ), so that R and θ are the polar coordinates of (X, Y). Note that R = √(−2 log(U1)) and θ = 2πU2.
Since U1 and U2 are i.i.d., R and θ are independent. Clearly θ ∼ U(0, 2π), so
    fθ(θ) = 1/(2π), for 0 ≤ θ ≤ 2π,
and since P(R ≤ r) = P(−2 log(U1) ≤ r²) = P(U1 ≥ e^{−r²/2}) = 1 − e^{−r²/2}, we have
    fR(r) = r e^{−r²/2}, for r ≥ 0.
Therefore,
    f_{R,θ}(r, θ) = (1/(2π)) r e^{−r²/2}, for 0 ≤ θ ≤ 2π, r ≥ 0.
This implies
    P(X ≤ x1, Y ≤ y1) = P(R cos(θ) ≤ x1, R sin(θ) ≤ y1)
                      = ∫∫_A (1/(2π)) r e^{−r²/2} dr dθ,    (5.10)
where A := {(r, θ) : r cos(θ) ≤ x1, r sin(θ) ≤ y1}. We now transform back to (x, y) coordinates with x = r cos(θ) and y = r sin(θ), and note that dx dy = r dr dθ, i.e., the Jacobian of the transformation is r. We then use (5.10) to obtain
    P(X ≤ x1, Y ≤ y1) = (1/(2π)) ∫_{−∞}^{x1} ∫_{−∞}^{y1} exp(−(x² + y²)/2) dy dx
                      = (1/√(2π)) ∫_{−∞}^{x1} exp(−x²/2) dx · (1/√(2π)) ∫_{−∞}^{y1} exp(−y²/2) dy,
as required. ■
One disadvantage of the Box-Muller method is that computing sines and cosines is inefficient. We can get around this problem using the polar method, which is described in the algorithm below:
    set S = 2
    while S > 1
        generate U1, U2 i.i.d. U(0,1)
        set V1 = 2U1 − 1, V2 = 2U2 − 1, S = V1² + V2²
    set X = V1 √(−2 log(S)/S), Y = V2 √(−2 log(S)/S).
Then X and Y are i.i.d. N(0,1). See Chapter 5 of Simulation by Ross (2012) for further details about this algorithm.
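Returning to the Box-Muller transform of Proposition 5.15, a minimal sketch (Python/NumPy assumed; the sample size is illustrative):

    import numpy as np

    rng = np.random.default_rng(7)

    def box_muller(n):
        # n pairs of uniforms produce 2n i.i.d. N(0,1) samples
        u1 = rng.uniform(size=n)
        u2 = rng.uniform(size=n)
        r = np.sqrt(-2.0 * np.log(u1))    # R = sqrt(-2 log U1)
        theta = 2.0 * np.pi * u2          # theta = 2*pi*U2
        return np.concatenate([r * np.cos(theta), r * np.sin(theta)])

    z = box_muller(50_000)
    print(z.mean(), z.var())              # approximately 0 and 1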
Rational Approximations
Let X ∼ N(0,1) and recall that Φ(x) = P(X ≤ x) is the CDF of X. If U ∼ U(0,1), then the inverse transform method seeks xu = Φ^{−1}(U). Finding Φ^{−1} in closed form is not possible, but instead we can use rational approximations, which are very accurate and efficient methods for estimating xu. One such approximation is
    xu ≈ t − (a0 + a1 t)/(1 + b1 t + b2 t²),
where a0, a1, b1 and b2 are constants, and t = √(−2 log(1 − u)). The error is bounded in this case by 0.003. Even more accurate approximations are available, and since they are very fast, many packages (e.g. Python or Matlab) use them for generating normal random variables.
If the n-dimensional vector X is multivariate normal with mean vector µ and covariance matrix Σ, then we write
    X ∼ MNn(µ, Σ).
The standard multivariate normal has µ = 0 and Σ = In, the n × n identity matrix. The PDF of X is given by
    f(x) = (1/((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)ᵀ Σ^{−1} (x − µ)),
and its characteristic function is
    ϕ_X(t) := E[e^{i tᵀX}] = exp(i tᵀµ − (1/2) tᵀΣt).
Notice that it is possible to partition X into X1 = (X1, . . . , Xk)ᵀ and X2 = (X_{k+1}, . . . , Xn)ᵀ. If we extend this notation naturally so that
    µ = (µ1, µ2)ᵀ and Σ = [Σ11 Σ12; Σ21 Σ22],
then we obtain the following results regarding the marginal and conditional distributions of X.
Marginal Distribution
The marginal distribution of a multivariate normal random vector is itself multivariate normal.
In particular,
Xi ∼ MN (µi , Σii ) , for i = 1, 2.
Conditional Distribution
Assuming Σ is positive definite, the conditional distribution of a multivariate normal distribution is also a multivariate normal distribution. In particular,
    X2 | X1 = x1 ∼ MN(µ_{2.1}, Σ_{2.1}),
where µ_{2.1} = µ2 + Σ21 Σ11^{−1} (x1 − µ1) and Σ_{2.1} = Σ22 − Σ21 Σ11^{−1} Σ12.
Linear Combinations
Linear combinations of multivariate normal random vectors remain normally distributed with
mean vector and covariance matrix given by
E[AX + a] = AE[X] + a,
Cov(AX + a) = A Cov(X)A⊤ .
Suppose that we wish to generate X = (X1, . . . , Xn)ᵀ where X ∼ MNn(0, Σ). Note that it is then easy to handle the case where E[X] ≠ 0. Let Z = (Z1, . . . , Zn)ᵀ where the Zi's are i.i.d. N(0,1) for i = 1, . . . , n. If C is an (n × m) matrix, then it follows that
    CᵀZ ∼ MN(0, CᵀC).
Our problem therefore reduces to finding C such that CᵀC = Σ. We can use the Cholesky decomposition of Σ to find such a matrix, C.
Any symmetric positive-definite matrix Σ can be written as Σ = UᵀDU, where U is an upper triangular matrix and D is a diagonal matrix with positive diagonal elements. Since Σ is symmetric positive-definite, we can therefore write
    Σ = UᵀDU = (Uᵀ√D)(√D U) = (√D U)ᵀ(√D U).
The matrix C = √D U therefore satisfies CᵀC = Σ. It is said to be the Cholesky decomposition of Σ.
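A sketch of the resulting sampler (Python/NumPy assumed; µ and Σ are illustrative choices of ours). Note that numpy's cholesky returns a lower-triangular L with LLᵀ = Σ, so L plays the role of Cᵀ above:

    import numpy as np

    rng = np.random.default_rng(8)

    mu = np.array([1.0, -2.0])                # illustrative mean vector
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])            # illustrative SPD covariance matrix

    L = np.linalg.cholesky(Sigma)             # L @ L.T == Sigma

    def sample_mvn(size):
        z = rng.standard_normal(size=(size, len(mu)))  # rows of i.i.d. N(0,1)
        return mu + z @ L.T                            # each row ~ MN(mu, Sigma)

    x = sample_mvn(100_000)
    print(x.mean(axis=0))                     # approximately mu
    print(np.cov(x, rowvar=False))            # approximately Sigma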
5.4 Stochastic Processes
5.4.1 Poisson Process
Recall that a Poisson process, N(t), with intensity λ is a process such that
    P(N(t) = r) = (λt)^r e^{−λt} / r!.
For a Poisson process the numbers of arrivals in non-overlapping intervals are independent and
the distribution of the number of arrivals in an interval only depends on the length of the interval.
The Poisson process is good for modeling many phenomena including the emission of particles
from a radioactive source and the arrivals of customers to a queue. The ith inter-arrival time, Xi ,
is defined to be the interval between the (i − 1)th and ith arrivals of the Poisson process, and it is
easy to see that the Xi ’s are i.i.d. ∼ Exp(λ).
In particular, this means we can simulate a Poisson process with intensity λ by simply generating
the inter-arrival times, Xi , where Xi ∼ Exp(λ).
We have the following algorithm for simulating the first T time units of a Poisson process, where S(I) denotes the time of the Ith arrival:
    set t = 0, I = 0
    generate U
    set t = t − log(U)/λ
    while t < T
        set I = I + 1, S(I) = t
        generate U
        set t = t − log(U)/λ
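A sketch of this algorithm in Python (NumPy assumed; λ and T are illustrative), returning the arrival times in [0, T]:

    import numpy as np

    rng = np.random.default_rng(9)

    def poisson_process(lam, T):
        arrivals = []
        t = -np.log(rng.uniform()) / lam          # first inter-arrival ~ Exp(lam)
        while t < T:
            arrivals.append(t)
            t += -np.log(rng.uniform()) / lam     # next inter-arrival
        return np.array(arrivals)

    s = poisson_process(2.0, 10.0)
    print(len(s))                                 # on average lam * T = 20 arrivals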
In many applications the arrival rate varies with time, which leads us to non-homogeneous Poisson processes. More formally, if λ(t) ≥ 0 is the intensity of the process at time t, then we say that N(t) is a non-homogeneous Poisson process with intensity λ(t). Define the function m(t) by
    m(t) := ∫₀ᵗ λ(s) ds.
Then it can be shown that N(t + s) − N(t) is a Poisson random variable with parameter m(t + s) − m(t), i.e.,
    P(N(t + s) − N(t) = r) = exp(−(m(t + s) − m(t))) (m(t + s) − m(t))^r / r!.
Before we describe the thinning algorithm for simulating a non-homogeneous Poisson process, we
first need the following proposition.
Proposition 5.17. Let N (t) be a Poisson process with constant intensity λ. Suppose that an
arrival that occurs at time t is counted with probability p(t), independently of what has happened
beforehand. Then the process of counted arrivals is a non-homogeneous Poisson process with
intensity λ(t) = λp(t).
The proof can be found in Chapter 11 of Introduction to Probability Models by Ross (2019).
Suppose now N (t) is a non-homogeneous Poisson process with intensity λ(t) and that there
exists a λ such that λ(t) ≤ λ for all t ≤ T . Then we can use the following algorithm, based on
Proposition 5.17, to simulate N (t).
    set t = 0, I = 0
    generate U1
    set t = t − log(U1)/λ
    while t < T
        generate U2
        if U2 ≤ λ(t)/λ then set I = I + 1, S(I) = t
        generate U1
        set t = t − log(U1)/λ
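A sketch of the thinning algorithm in Python (NumPy assumed; the intensity λ(t) = 3 + 2 sin(t) with bound λ = 5 is an illustrative choice of ours):

    import numpy as np

    rng = np.random.default_rng(10)

    def thinned_poisson(lam_t, lam_bar, T):
        # lam_t(t) <= lam_bar must hold for all t <= T
        arrivals = []
        t = -np.log(rng.uniform()) / lam_bar
        while t < T:
            if rng.uniform() <= lam_t(t) / lam_bar:   # count arrival w.p. lam(t)/lam_bar
                arrivals.append(t)
            t += -np.log(rng.uniform()) / lam_bar
        return np.array(arrivals)

    s = thinned_poisson(lambda t: 3.0 + 2.0 * np.sin(t), 5.0, 10.0)
    print(len(s))        # on average m(10) = 30 + 2(1 - cos(10)) ~ 33.7 arrivals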
5.4.2 Brownian Motion
Definition 5.18. A stochastic process, {Xt : t ≥ 0}, is a Brownian motion with parameters (µ, σ) if it satisfies:
i. it has continuous sample paths;
ii. it has independent increments; and
iii. X_{t+s} − X_t ∼ N(µs, σ²s) for all t ≥ 0 and s > 0.
We say that X is a B(µ, σ) Brownian motion with drift, µ, and volatility, σ. When µ = 0 and
σ = 1 we have a standard Brownian motion (SBM). We will use Bt to denote a SBM and we will
always assume (unless otherwise stated) that B0 = 0. Note that if X ∼ B(µ, σ) and X0 = x then
we can write
Xt = x + µt + σBt ,
where B is a SBM. We will usually write a B(µ, σ) Brownian motion in this way.
Remark 5.19. Bachelier (1900) and Einstein (1905) were the first to explore Brownian motion
from a mathematical viewpoint whereas Wiener (1920) was the first to show that it actually exists
as a well-defined mathematical entity.
Questions: (i) What is E [Bt+s Bs ]?; (ii) What is E [Xt+s Xs ] where X ∼ B(µ, σ)?; and (iii) Let
B be a SBM and let Zt := |Bt |. What is the CDF of Zt for t fixed?
Simulating a Standard Brownian Motion
It is not possible to simulate an entire sample path of Brownian motion between 0 and T as this
would require an infinite number of random variables. This is not always a problem, however,
since we often only wish to simulate the value of Brownian motion at certain fixed points in time.
For example, we may wish to simulate Bti for t1 < t2 < . . . < tn , as opposed to simulating Bt for
every t ∈ [0, T ].
Sometimes, however, the quantity of interest, θ, that we are trying to estimate does indeed depend
on the entire sample path of Bt in [0, T ]. In this case, we can still estimate θ by again simulating
Bti for t1 < t2 < . . . < tn but where we now choose n to be very large. We might, for example,
choose n so that |ti+1 − ti | < ϵ for all i where ϵ > 0 is very small. By choosing ϵ to be sufficiently
small, we hope to minimize the numerical error (as opposed to the statistical error), in estimating
θ.
In either case, we need to be able to simulate Bti for t1 < t2 < . . . < tn and for a fixed n. We will now see how to do this. The first observation we make is that the increments
    B_{t1} − B_{t0}, B_{t2} − B_{t1}, . . . , B_{tn} − B_{t_{n−1}}
are mutually independent, and for s > 0, B_{t+s} − B_t ∼ N(0, s).
begin with t0 = 0 and Bt0 = 0. We then generate Bt1 which we can do since Bt1 ∼ N (0, t1 ).
We now generate Bt2 by first observing that Bt2 = Bt1 + (Bt2 − Bt1 ). Then since (Bt2 − Bt1 )
is independent of Bt1 , we can generate Bt2 by generating an N (0, t2 − t1 ) random variable and
simply adding it to Bt1 .
More generally, if we have already generated Bti then we can generate Bti+1 by generating an
N (0, ti+1 − ti ) random variable and adding it to Bti . We have the following algorithm:
    set t0 = 0, B_{t0} = 0
    for i = 1 to n
        generate X ∼ N(0, ti − t_{i−1})
        set B_{ti} = B_{t_{i−1}} + X
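A vectorized sketch of this algorithm (Python/NumPy assumed; the time grid is illustrative). The cumulative sum implements exactly the conditional construction above:

    import numpy as np

    rng = np.random.default_rng(11)

    def brownian_path(times):
        # times: increasing simulation times t_1 < ... < t_n, with t_0 = 0
        t = np.concatenate([[0.0], np.asarray(times)])
        dt = np.diff(t)                            # t_i - t_{i-1}
        increments = rng.normal(0.0, np.sqrt(dt))  # N(0, t_i - t_{i-1}) increments
        return np.cumsum(increments)               # B_{t_i} = B_{t_{i-1}} + increment

    b = brownian_path(np.linspace(0.01, 1.0, 100))
    print(b[-1])                                   # a sample of B_1 ~ N(0, 1)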
Remark 5.20. It is very important that when you generate Bti+1 , you do so conditional on the
value of Bti . If you generate Bti and Bti+1 independently of one another then you are effectively
simulating from different sample paths of the Brownian motion. This is not correct! In fact when
we generate (Bt1 , Bt2 , . . . , Btn ) we are actually generating a random vector that does not consist
of i.i.d. random variables.
Simulating a B(µ, σ) Brownian Motion
Suppose now that we want to simulate a B(µ, σ), X, at the times t1 , t2 , . . . , tn−1 , tn . Then
all we have to do is simulate an SBM, (Bt1 , Bt2 , . . . , Btn ), and use our earlier observation that
Xt = X0 + µt + σBt .
Definition 5.21. A stochastic process, {Xt : t ≥ 0}, is a (µ, σ) geometric Brownian motion (GBM) if log(X) ∼ B(µ − σ²/2, σ). We write X ∼ GBM(µ, σ). The following property of GBM follows immediately from the definition:
i. If Xt > 0, then X_{t+s} is always positive for any s > 0, so limited liability is not violated.
This suggests that GBM might be a reasonable model for stock prices. In fact, we will often model stock prices as GBMs in this course, and we will generally use the following notation: S0 is the known stock price at t = 0, St is the random stock price at time t, and
    St = S0 e^{(µ − σ²/2)t + σBt},
where B is a standard BM. The drift is µ, σ is the volatility, and S is therefore a GBM(µ, σ) process that begins at S0.
Suppose now that we wish to simulate S ∼ GBM(µ, σ). Then it is not hard to see that
    S_{t+∆t} = St e^{(µ − σ²/2)∆t + σ(B_{t+∆t} − Bt)},
so that we can simulate S_{t+∆t} conditional on St for any ∆t > 0 by simply simulating an N(0, ∆t) random variable.
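A sketch that simulates a GBM path on a fixed time grid by exponentiating a B(µ − σ²/2, σ) path (Python/NumPy assumed; all parameter values are illustrative):

    import numpy as np

    rng = np.random.default_rng(12)

    def gbm_path(s0, mu, sigma, times):
        t = np.concatenate([[0.0], np.asarray(times)])
        dt = np.diff(t)
        db = rng.normal(0.0, np.sqrt(dt))               # Brownian increments
        log_incr = (mu - 0.5 * sigma**2) * dt + sigma * db
        return s0 * np.exp(np.cumsum(log_incr))         # S_{t_i}, step-by-step conditional

    s = gbm_path(100.0, 0.05, 0.2, np.linspace(0.01, 1.0, 100))
    print(s[-1])        # a sample of S_1; note E[S_1] = 100 * exp(0.05)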
5.5 Variance Reduction
Suppose that we wish to estimate
    θ := E[g(X1, . . . , Xn)],
where g is some specified function. It is often the case that it is not possible to analytically compute the preceding, and when such is the case we can attempt to use simulation to estimate θ.
This is done as follows: Generate X1^(1), . . . , Xn^(1) having the same joint distribution as X1, . . . , Xn and set
    Y1 = g(X1^(1), . . . , Xn^(1)).
Now, simulate a second set of random variables (independent of the first set), X1^(2), . . . , Xn^(2), having the distribution of X1, . . . , Xn, and set
    Y2 = g(X1^(2), . . . , Xn^(2)).
Continue this until you have generated k (some predetermined number) sets, and so have also computed Y1, Y2, . . . , Yk. Now, Y1, . . . , Yk are independent and identically distributed random variables, each having the same distribution as g(X1, . . . , Xn).
Thus, if we let Ȳ denote the average of these k random variables, that is,
    Ȳ = (1/k) ∑_{i=1}^k Yi,
then
    E[Ȳ] = θ.
Hence, we can use Ȳ as an estimate of θ. As the expected square of the difference between Ȳ and θ is equal to the variance of Ȳ, we would like this quantity to be as small as possible. In the preceding situation, Var(Ȳ) = Var(Yi)/k, which is usually not known in advance but must be estimated from the generated values Y1, . . . , Yk.
Simulation Efficiency
Suppose as usual that we wish to estimate θ := E[h(X)]. Then the standard simulation algorithm is:
i. Generate X1, . . . , Xn.
ii. Estimate θ with θ̂n = (1/n) ∑_{j=1}^n Yj, where Yj := h(Xj).
iii. Approximate 100(1 − α)% confidence intervals (CIs) are then given by
    [θ̂n − z_{1−α/2} σ̂n/√n, θ̂n + z_{1−α/2} σ̂n/√n],
where σ̂n is the usual estimate of Var(Y) based on Y1, . . . , Yn.
One way to measure the quality of the estimator, θ̂n, is by the half-width, HW, of the confidence interval. For a fixed α, we have
    HW = z_{1−α/2} √(Var(Y)/n).
We would like HW to be small, but sometimes this is difficult to achieve. This may be because
Var(Y ) is too large, or too much computational effort is required to simulate each Yj so that n is
necessarily small, or some combination of the two.
Before proceeding to study techniques for variance reduction, we should first describe a measure
of simulation efficiency. Suppose there are two random variables, W and Y , such that E[W ] =
E[Y ] = θ. Then we could choose to either simulate W1 , . . . , Wn or Y1 , . . . , Yn in order to estimate
θ. Let Mw denote the method of estimating θ by simulating the Wi ’s. My is similarly defined.
Which method is more efficient, Mw or My ? To answer this, let nw and ny be the number of
samples of W and Y, respectively, that are needed to achieve a half-width, HW. Then we know that
    nw = (z_{1−α/2}/HW)² Var(W),
    ny = (z_{1−α/2}/HW)² Var(Y).
Let Ew and Ey denote the amount of computational effort required to produce one sample of W and Y, respectively. Then the total effort expended by Mw and My, respectively, to achieve a half-width HW are
    TEw = (z_{1−α/2}/HW)² Var(W) Ew,
    TEy = (z_{1−α/2}/HW)² Var(Y) Ey.
We then say that Mw is more efficient than My if TEw < TEy. Note that TEw < TEy if and only if
    Var(W) Ew < Var(Y) Ey.
We will use the quantity Var(W)Ew as a measure of the efficiency of the simulator, Mw. Note that the last inequality implies that we cannot conclude that one simulation algorithm, Mw, is better than another, My, simply because Var(W) < Var(Y); we also need to take Ew and Ey into consideration. However, it is often the case that we have two simulators available to us, Mw and My, where Ew ≈ Ey and Var(W) ≪ Var(Y). In such cases it is clear that using Mw provides a substantial improvement over using My.
As a result, it is often imperative to address the issue of simulation efficiency. There are a number of things we can do:
i. Develop a good simulation algorithm.
ii. Program carefully to minimize storage requirements. For example, we do not need to store all the Yj's: we only need to keep track of ∑_j Yj and ∑_j Yj² to compute θ̂n and approximate CIs.
iii. Program carefully to minimize execution time.
iv. Decrease the variability of the simulation output that we use to estimate θ. The techniques used to do this are usually called variance reduction techniques.
We will now study some of the simplest variance reduction techniques, and assume that we are doing items (i.) to (iii.) as well as possible.
5.5.1 Antithetic Variables
In the preceding situation, suppose that we have generated Y1 and Y2, identically distributed random variables having mean θ. Then
    Var((Y1 + Y2)/2) = (1/4)[Var(Y1) + Var(Y2) + 2 Cov(Y1, Y2)]
                     = Var(Y1)/2 + Cov(Y1, Y2)/2.
Hence, it would be advantageous (in the sense that the variance would be reduced) if Y1 and Y2
rather than being independent were negatively correlated. To see how we could arrange this, let
us suppose that the random variables X1 , . . . , Xn are independent and, in addition, that each is
simulated via the inverse transform technique.
That is, Xi is simulated from Fi^{−1}(Ui), where Ui is a random number and Fi is the CDF of Xi. Hence, Y1 can be expressed as
    Y1 = g(F1^{−1}(U1), . . . , Fn^{−1}(Un)).
Now, since 1 − U is also uniform over (0,1) whenever U is a random number (and is negatively correlated with U), it follows that Y2 defined by
    Y2 = g(F1^{−1}(1 − U1), . . . , Fn^{−1}(1 − Un))
will have the same distribution as Y1.
will have the same distribution as Y1 . Hence, if Y1 and Y2 were negatively correlated, then
generating Y2 by this means would lead to a smaller variance than if it were generated by a new
set of random numbers. In addition, there is a computational savings since rather than having to
generate n additional random numbers, we need only subtract each of the previous n from 1.
The following theorem will be the key to showing that this technique, known as the use of
antithetic variables, will lead to a reduction in variance whenever g is a monotone function.
Theorem 5.22. If X1 , . . . , Xn are independent, then, for any increasing functions f and g of n
variables,
E[f (X)g(X)] ≥ E[f (X)]E[g(X)], (5.12)
where X = (X1 , . . . , Xn ).
Proof. We use induction on n. Consider first the case n = 1, and let f and g be increasing functions of a single variable. Then, for any x and y,
    (f(x) − f(y))(g(x) − g(y)) ≥ 0,
since if x ≥ y (x ≤ y) then both factors are nonnegative (nonpositive). Hence, for any random variables X and Y,
    (f(X) − f(Y))(g(X) − g(Y)) ≥ 0,
implying that
    E[(f(X) − f(Y))(g(X) − g(Y))] ≥ 0,
or, equivalently,
    E[f(X)g(X)] + E[f(Y)g(Y)] ≥ E[f(X)g(Y)] + E[f(Y)g(X)].
If we suppose that X and Y are independent and identically distributed, as in this case, then
    E[f(X)g(X)] = E[f(Y)g(Y)] and E[f(X)g(Y)] = E[f(Y)g(X)] = E[f(X)]E[g(X)],
and so we obtain the result when n = 1. So assume that (5.12) holds for n − 1 variables, and now suppose that X1, . . . , Xn are independent and f and g are increasing functions. Then, applying the induction hypothesis to the n − 1 variables X1, . . . , X_{n−1},
    E[f(X)g(X) | Xn = xn] ≥ E[f(X) | Xn = xn] E[g(X) | Xn = xn].
Hence,
    E[f(X)g(X) | Xn] ≥ E[f(X) | Xn] E[g(X) | Xn],
and taking expectations of both sides yields
    E[f(X)g(X)] ≥ E[ E[f(X) | Xn] E[g(X) | Xn] ] ≥ E[f(X)]E[g(X)].
The last inequality follows because E[f(X) | Xn] and E[g(X) | Xn] are both increasing functions of Xn, and so, by the result for n = 1,
    E[ E[f(X) | Xn] E[g(X) | Xn] ] ≥ E[ E[f(X) | Xn] ] E[ E[g(X) | Xn] ] = E[f(X)]E[g(X)]. ■
We now discuss the circumstances under which a variance reduction can be guaranteed. Consider
first the case where U is a uniform random variable so that m = 1, U = U and θ = E[h(U )].
Suppose now that h(·) is a non-decreasing function of u over [0,1]. Then if U is large, h(U) will also tend to be large, while 1 − U and h(1 − U) will tend to be small. That is, Cov(h(U), h(1 − U)) < 0. We can similarly conclude that if h(·) is a non-increasing function of u then, once again, Cov(h(U), h(1 − U)) < 0. So for the case where m = 1, a sufficient condition to guarantee a variance reduction is for h(·) to be a monotonic function of u on [0,1].
Let us now consider the more general case where m > 1, U = (U1, . . . , Um) and θ = E[h(U)]. We say h(u1, . . . , um) is a monotonic function of each of its m arguments if, in each of its arguments, it is non-increasing or non-decreasing. We have the following result, which generalizes the m = 1 case above.
Corollary 5.23. If k(u1, . . . , un) is a monotonic function of each of its arguments then, for i.i.d. U(0,1) random variables U1, . . . , Un,
    Cov(k(U1, . . . , Un), k(1 − U1, . . . , 1 − Un)) ≤ 0.
This follows by applying Theorem 5.22 to the functions f(u1, . . . , un) = k(u1, . . . , un) and g(u1, . . . , un) = −k(1 − u1, . . . , 1 − un), both increasing when k is increasing in each argument, which yields
    Cov(k(U1, . . . , Un), −k(1 − U1, . . . , 1 − Un)) ≥ 0.
Since Fi^{−1}(Ui) is increasing in Ui (as Fi, being a distribution function, is increasing), it follows that g(F1^{−1}(U1), . . . , Fn^{−1}(Un)) is a monotone function of U1, . . . , Un whenever g is monotone. Hence, if g is monotone, the antithetic variable approach of twice using each set of random numbers U1, . . . , Un will reduce the variance of the estimate of E[g(X1, . . . , Xn)]: first compute g(F1^{−1}(U1), . . . , Fn^{−1}(Un)), and then g(F1^{−1}(1 − U1), . . . , Fn^{−1}(1 − Un)).
That is, rather than generating k sets of n random numbers, we should generate k/2 sets and use
each set twice.
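As a numerical illustration (a toy example of our own, not from the notes), take θ = E[e^U] = e − 1 with U ∼ U(0,1); since h(u) = e^u is monotone, a variance reduction is guaranteed. A Python/NumPy sketch comparing the two estimators at equal computational budget:

    import numpy as np

    rng = np.random.default_rng(13)
    n = 100_000

    # ordinary estimator: n i.i.d. uniforms
    u = rng.uniform(size=n)
    y_plain = np.exp(u)

    # antithetic estimator: n/2 uniforms, each used twice
    u_half = rng.uniform(size=n // 2)
    y_anti = 0.5 * (np.exp(u_half) + np.exp(1.0 - u_half))

    print(y_plain.mean(), y_anti.mean())                 # both estimate e - 1
    print(y_plain.var() / n, y_anti.var() / (n // 2))    # antithetic variance is smaller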
Now, we can simulate the Xi by generating uniform random numbers U1, . . . , Un and then setting
    Xi = 1 if Ui < pi, and Xi = 0 otherwise.
Written this way, each Xi is a monotone function of Ui, and so the antithetic variable approach of using U1, . . . , Un to generate both k(U1, . . . , Un) and k(1 − U1, . . . , 1 − Un) results in a smaller variance than if an independent set of random numbers was used to generate the second k.
Example 5.26 (Simulating a Queueing System). Consider a given queueing system, let Di denote the delay in queue of the ith arriving customer, and suppose we are interested in simulating the system so as to estimate
    θ := E[D1 + · · · + Dn].
Let X1, . . . , Xn denote the first n interarrival times and S1, . . . , Sn the first n service times of this system, and suppose these random variables are all independent. Now in most systems D1 + · · · + Dn will be a function of X1, . . . , Xn, S1, . . . , Sn — say,
    D1 + · · · + Dn = g(X1, . . . , Xn, S1, . . . , Sn),
where g is typically increasing in the service times and decreasing in the interarrival times. Writing Xi = Fi^{−1}(1 − Ui) and Si = Gi^{−1}(Ūi), where Fi and Gi are the respective CDFs, we obtain
    D1 + · · · + Dn = k(U1, . . . , Un, Ū1, . . . , Ūn),
where k is increasing in each of its arguments. Hence, the antithetic variable approach will reduce the variance of the estimator of θ. Thus, we would generate Ui, Ūi, i = 1, . . . , n and set Xi = Fi^{−1}(1 − Ui) and Si = Gi^{−1}(Ūi) for the first run, and Xi = Fi^{−1}(Ui) and Si = Gi^{−1}(1 − Ūi) for the second. As all the Ui and Ūi are independent, however, this is equivalent to setting Xi = Fi^{−1}(Ui), Si = Gi^{−1}(Ūi) in the first run and using 1 − Ui for Ui and 1 − Ūi for Ūi in the second.
More generally, suppose θ = E[h(X1, . . . , Xm)], where the Xi's are independent with CDFs F1, . . . , Fm, and that each Xi is generated via the inverse transform method, so that the simulation output is
    Z = h(F1^{−1}(U1), . . . , Fm^{−1}(Um)).
Since the CDF of any random variable is non-decreasing, it follows that the inverse CDFs, Fi^{−1}(·), are also non-decreasing. This means that if h(x1, . . . , xm) is a monotonic function of each of its arguments, then h(F1^{−1}(U1), . . . , Fm^{−1}(Um)) is also a monotonic function of the Ui's, and Corollary 5.23 then guarantees a variance reduction.
Example 5.27 (The Barbershop). Suppose that customers arrive at a barbershop according to a non-homogeneous Poisson process, N(t), with intensity λ(t), and that we wish to estimate some expected performance measure θ := E[Y] of a day's operation.
Assume also that the service times of customers are i.i.d. with CDF, F (·), and that they are also
independent of the arrival process, N (t). The usual simulation method for estimating θ would be
to simulate n days of operation in the barbershop, thereby obtaining n samples, Y1 , . . . , Yn , and
then setting
    θ̂n = (1/n) ∑_{j=1}^n Yj.
Suppose now that the barber wants to estimate the average total waiting time, θ, of the first 100 customers. Then
    θ = E[ ∑_{j=1}^{100} Wj ],
where Wj is the waiting time of the jth customer.
Now for each customer, j, there is an inter-arrival time, Ij, which is the time between the (j−1)th and jth arrivals. There is also a service time, Sj, which is the amount of time the barber spends cutting the jth customer's hair. It is therefore the case that Y := ∑_{j=1}^{100} Wj may be written as
    Y = h(I1, . . . , I100, S1, . . . , S100)
for some function h(·).
Now for many queueing systems, h(·) will be a monotonic function of its arguments since we
would typically expect Y to increase as service times increase, and decrease as inter-arrival times
increase. As a result, it might be advantageous to use antithetic variates to estimate θ. Notice
that we are implicitly assuming here that the Ij ’s and Sj ’s can be generated using the inverse
transform method.
Antithetic variates may also be used when generating normal random variables: if X ∼ N(µ, σ²), then X̃ := 2µ − X ∼ N(µ, σ²) as well. Clearly X and X̃ are negatively correlated. So if θ = E[h(X1, . . . , Xm)], where the Xi's are i.i.d. N(µ, σ²) and h(·) is monotonic in its arguments, then we can again achieve a variance reduction.
Example 5.28. Suppose we wish to estimate θ := E[X²], where X ∼ N(2, 1). Then it is easy to see that θ = 5, but we can also estimate it using antithetic variates. Is a variance reduction guaranteed?
5.5.2 Control Variates
Suppose again that we wish to estimate θ := E[Y], where Y = h(X) is the output of a simulation experiment. Suppose that Z is also an output of the simulation, or that we can easily output it if we wish. Finally, we assume that we know E[Z]. Then we can construct many unbiased estimators of θ: for any c ∈ ℝ,
    θ̂c := Y + c(Z − E[Z])
satisfies E[θ̂c] = θ. The variance of this estimator is
    Var(θ̂c) = Var(Y) + c² Var(Z) + 2c Cov(Y, Z),
and we can choose c to minimize this quantity. Simple calculus then implies the optimal value of c is given by
    c* = −Cov(Y, Z)/Var(Z),
and that the minimized variance satisfies
    Var(θ̂_{c*}) = Var(Y) − Cov(Y, Z)²/Var(Z).
In order to achieve a variance reduction it is therefore only necessary that Cov(Y, Z) ≠ 0. The resulting Monte Carlo algorithm proceeds by generating n samples of Y and Z and then setting
    θ̂_{c*} = (1/n) ∑_{i=1}^n (Yi + c*(Zi − E[Z])).
There is a problem with this, however, as we usually do not know Cov(Y, Z). We overcome this problem by doing p pilot simulations and setting
    Ĉov(Y, Z) = (1/(p − 1)) ∑_{j=1}^p (Yj − Ȳp)(Zj − E[Z]).
If it is also the case that Var(Z) is unknown, then we also estimate it with
    V̂ar(Z) = (1/(p − 1)) ∑_{j=1}^p (Zj − E[Z])²,
and then set ĉ* = −Ĉov(Y, Z)/V̂ar(Z).
Assuming we can find a control variate, our control variate simulation algorithm is as follows:
    /* Pilot simulation */
    for i = 1 to p
        generate (Yi, Zi)
    end for
    compute ĉ*
    /* Main simulation */
    for i = 1 to n
        generate (Yi, Zi)
        set Vi = Yi + ĉ*(Zi − E[Z])
    end for
    set θ̂_{ĉ*} = (1/n) ∑_{i=1}^n Vi, and σ̂²_{n,v} = (1/(n − 1)) ∑_{i=1}^n (Vi − θ̂_{ĉ*})²
    set 100(1 − α)% CI = [θ̂_{ĉ*} − z_{1−α/2} σ̂_{n,v}/√n, θ̂_{ĉ*} + z_{1−α/2} σ̂_{n,v}/√n]
Note that the Vi ’s are i.i.d., so we can compute approximate confidence intervals.
Example 5.29. Suppose we wish to estimate θ = E[e^{(U+W)²}], where U, W are i.i.d. U(0,1). In our notation we then have Y := e^{(U+W)²}. The usual approach is to generate n independent pairs (Ui, Wi), set Yi := e^{(Ui+Wi)²}, and estimate θ with Ȳn = (1/n) ∑_i Yi.
Now consider using the control variate technique. First we have to choose an appropriate control variate, Z. There are many possibilities, including
    Z1 := U + W,  Z2 := (U + W)²,  Z3 := e^{U+W}.
Note that we can easily compute E[Zi] for i = 1, 2, 3, and it is also clear that Cov(Y, Zi) ≠ 0. In a simple experiment we used Z3, estimating ĉ* on the basis of a pilot simulation with 100 samples. We reduced the variance by approximately a factor of 4. In general, a good rule of thumb is that we should not be satisfied unless we have a variance reduction on the order of a factor of 5 to 10, though often we will achieve much more.
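A sketch of this experiment (Python/NumPy assumed; the pilot and main sample sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(14)

    def y_and_z(size):
        u = rng.uniform(size=size)
        w = rng.uniform(size=size)
        return np.exp((u + w)**2), np.exp(u + w)   # Y and control variate Z_3

    ez = (np.e - 1.0)**2        # E[Z_3] = E[e^U] E[e^W] = (e - 1)^2

    # pilot simulation to estimate c*
    y_p, z_p = y_and_z(100)
    c_hat = -np.cov(y_p, z_p)[0, 1] / np.var(z_p, ddof=1)

    # main simulation
    y, z = y_and_z(100_000)
    v = y + c_hat * (z - ez)
    print(y.mean(), v.mean())              # both estimate theta
    print(y.var(ddof=1), v.var(ddof=1))    # Var(V) should be noticeably smaller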
Example 5.30 (The Barbershop revisited). Recall Example 5.27, where we assumed that customers arrive at the barbershop according to a non-homogeneous Poisson process, N(t), with intensity λ(t). Recall also that the service times of customers are i.i.d. with CDF, F(·), and that they are also independent of the arrival process, N(t). Then the quantity to be estimated is θ := E[Y], where
    Y := ∑_{j=1}^{N(T)} Wj,
and Wj denotes the waiting time of the jth customer.
Again, a method for estimating θ would be to simulate n days of operation in the barbershop, thereby obtaining n samples, Y1, . . . , Yn, and then setting
    θ̂n = (1/n) ∑_{j=1}^n Yj.
However, a better estimate could be obtained by using a control variate. In particular, let Z denote the total time customers on a given day spend in service, so that
    Z := ∑_{j=1}^{N(T)} Sj,
where Sj is the service time of the jth customer. Then, since service times are i.i.d. and independent of the arrival process, it is easy to see that
    E[Z] = E[N(T)] E[S1] = m(T) E[S1],
so that E[Z] is computable and Z can indeed serve as a control variate for Y.
References
Bachelier, Louis (1900). “Théorie de la spéculation”. In: Annales Scientifiques de l’École Normale
Supérieure 3, pp. 21–86.
Barron, E.N. and J.G. Del Greco (2020). Probability and Statistics for STEM: A Course in One
Semester. Springer.
Capiński, Marek and Peter Ekkehard Kopp (2004). Measure, Integral and Probability. Springer
Science & Business Media. isbn: 978-1-4471-1046-0.
Einstein, Albert (1905). “Über die von der molekularkinetischen Theorie der Wärme geforderte
Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen”. In: Annalen der Physik 322.8,
pp. 549–560.
Glasserman, Paul (2004). Monte Carlo Methods in Financial Engineering. Springer Science &
Business Media.
Liu, Jun S. (2001). Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media.
Ross, Sheldon M. (2012). Simulation. Academic Press.
Ross, Sheldon M. (2019). Introduction to Probability Models. Academic Press.
Wiener, Norbert (1920). “The generalized harmonic analysis”. In: Acta Mathematica 43.1,
pp. 203–239.