4b_ProbabilityNotes
Daniel Arrieta
Contents
1 Introduction to Probability
1.1 Sample Space and Events
1.2 Probabilities on Events
1.3 Conditional Probability
1.4 Counting
2 Random Variables
2.1 Probability Distributions
2.2 Expectation, Variance and Quantiles
2.3 Moment Generating Function
2.4 Characteristic Function
3 Codependence Structures
3.1 Joint Distributions
3.2 Independent Random Variables
3.3 Covariance and Correlation
3.4 Conditional Expectation
4 Convergence and Limit Theorems
4.1 The Central Limit Theorem
4.2 Laws of Large Numbers
5 Simulation
5.1 Monte Carlo Integration
5.2 Univariate Random Variables
5.3 Gaussian Random Variables
5.4 Stochastic Processes
5.4.1 Poisson Process
5.4.2 Brownian Motion
5.5 Variance Reduction
5.5.1 Antithetic Variables
5.5.2 Control Variates
References
1 Introduction to Probability
This chapter is based on Ross (2019), and Barron and Greco (2020).
Definition 1.1 (Sample space). The sample space is the set Ω of all possible outcomes of an experiment. Ω could be a finite set, a countably infinite set, or a continuum. Ω is also called the universal set.
Definition 1.2 (Event). An event A is any subset of Ω, A ⊂ Ω. The set Ω is also called the sure event. The empty set ∅ is called the impossible event. The class of all possible events is denoted by F = {A : A ⊂ Ω}. If Ω is a finite set with N elements, we write |Ω| = N, and then the number of possible events is 2^N.
Example 1.3. If we roll a die the sample space is Ω = {1, 2, 3, 4, 5, 6}. Rolling an even number is
the event A = {2, 4, 6}. If we want to count the number of customers coming to a bakery the sample
space is Ω = {0, 1, 2, . . .}, and the event we get between 2 and 7 customers is A = {2, 3, 4, 5, 6, 7}.
Example 1.4. If we throw a dart randomly at a circular board of 1 meter radius, the sample space is the set of all possible positions of the dart, Ω = {(x, y) : x² + y² ≤ 1}. The event that the dart lands in a given region A ⊂ Ω is illustrated in Figure 1.1.

Figure 1.1: Random throws of a dart at a circular board of 1 meter radius, and the set A.
1.2 Probabilities on Events
Eventually we want to find the probability that an event will occur. We say that an event A
occurs if any outcome in the set A actually occurs when the experiment is performed.
Combinations of events
Let A, B ∈ F be any two events. From these events we may describe the following events:

(a) A ∪ B is the event that A occurs or B occurs (or both).

(b) A ∩ B, also written as AB, is the event that A and B both occur.

(c) A^c = Ω − A is the event that A does not occur. This is the set of all outcomes in Ω that are not in A.

(d) A ⊂ B means that if A occurs, then B must also occur.

(e) A ∩ B = ∅ means the two events cannot occur together, i.e., they are mutually exclusive. We also say that A and B are disjoint.

(f) A ∪ A^c = Ω means that no matter what event A we pick, either A occurs or A^c occurs, and not both, since A and A^c are mutually exclusive.
Many more such relations hold if we have three or more events. It is useful to recall the following
set relationships:
• A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C), and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
Definition 1.5 (Probability function). A probability function is a function P : F → R satisfying:

i. P(A) ≥ 0, ∀A ∈ F,

ii. P(Ω) = 1,

iii. for any countable sequence of mutually exclusive events A_1, A_2, . . ., i.e., events for which A_i ∩ A_j = ∅ whenever i ≠ j,

P( ∪_{n=1}^∞ A_n ) = Σ_{n=1}^∞ P(A_n).
From i. it is clear that probabilities cannot be negative, ii. states that the probability of the sure event is 1, and iii. implies that

P(A ∪ B) = P(A) + P(B)

for all events A, B ∈ F such that A ∩ B = ∅; this is called the disjoint event sum rule.
Whenever we write P we will always assume it is a probability function.
Remark 1.6. Immediately from Definition 1.5 we can see that P(∅) = 0. In fact, Ω ∪ ∅ = Ω and Ω ∩ ∅ = ∅, so the disjoint sum rule and P(Ω) = 1 give 1 = P(Ω) = P(Ω) + P(∅), hence P(∅) = 0. Similarly, since A ∪ A^c = Ω with A ∩ A^c = ∅,

P(A^c) = 1 − P(A),

for any event A ∈ F. It is also true that no matter what event A ∈ F we take, 0 ≤ P(A) ≤ 1. In fact, by definition P(A) ≥ 0, and since P(A^c) = 1 − P(A) ≥ 0 we get P(A) ≤ 1, so it must be that

0 ≤ P(A) ≤ 1.
Remark 1.7. One of the most important and useful rules is the Law of Total Probability:

P(A) = P(A ∩ B) + P(A ∩ B^c), ∀ A, B ∈ F. (1.1)

To see why this is true, we use some basic set theory to decompose A:

A = A ∩ Ω = A ∩ (B ∪ B^c) = (A ∩ B) ∪ (A ∩ B^c).

Since A ∩ B and A ∩ B^c are disjoint, the disjoint sum rule gives

P(A) = P((A ∩ B) ∪ (A ∩ B^c)) = P(A ∩ B) + P(A ∩ B^c).

A main use of this Law is that we may find the probability of an event A if we know what happens when A ∩ B occurs and when A ∩ B^c occurs. A useful rearranged form is

P(A ∩ B^c) = P(A) − P(A ∩ B).
The next theorem gives us the sum rule when the events are not mutually exclusive.
Theorem 1.8 (General Sum Rule). If A, B are any two events, then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof. Write A ∪ B as a disjoint union:

A ∪ B = A ∪ (A^c ∩ B).

By the disjoint sum rule,

P(A ∪ B) = P(A) + P(A^c ∩ B),

and by the Law of Total Probability applied to B, P(A^c ∩ B) = P(B) − P(A ∩ B). Substituting gives the result. ■
The next example gives one of the most important probability functions for finite sample spaces.
Example 1.9. When the sample space is finite, say |Ω| = N, and all individual outcomes in Ω are equally likely, we may define the function

P(A) := n(A)/N,

where n(A) denotes the number of outcomes in A. To check that this is indeed a probability function, we have to verify the conditions of Definition 1.5. For instance, if we roll two dice, the sample space is

Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), (2, 2), . . . , (2, 6), . . . , (6, 1), (6, 2), . . . , (6, 6)},

so N = 36, and the event that the two dice sum to 7 is

A = {(1, 6), (6, 1), (2, 5), (5, 2), (3, 4), (4, 3)},

so P(A) = 6/36 = 1/6.
Example 1.10. Whenever the sample space can easily be written it is often the best way to find
probabilities. As an example, we roll two dice and we let D1 denote the number on the first die
and D2 the number on the second. Suppose we want to find P (D1 > D2 ). The easiest way to
solve this is to write down the sample space as done in previous example, and then use the fact
that each outcome is equally likely. We have
{D1 > D2 } = {(2, 1), (3, 2), (3, 1), (4, 3), (4, 2), (4, 1), (5, 4), (5, 3),
(5, 2), (5, 1), (6, 5), (6, 4), (6, 3), (6, 2), (6, 1)}.
This event has 15 outcomes, which means P(D1 > D2) = 15/36.
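Counts like this are easy to verify by brute force; a minimal Python sketch that enumerates the 36 equally likely outcomes:

    # enumerate the sample space of two dice and count outcomes with D1 > D2
    outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
    favorable = [pair for pair in outcomes if pair[0] > pair[1]]
    print(len(favorable), len(outcomes))    # 15 36
    print(len(favorable) / len(outcomes))   # 0.4166... = 15/36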
1.3 Conditional Probability

Definition 1.11. The conditional probability of event A, given that event B has occurred, is

P(A|B) = P(A ∩ B)/P(B), if P(B) > 0.
One justification for Definition 1.11 can be seen from the case when the sample space is finite with equally likely individual outcomes. In such a case, if |Ω| = N, then

n(A ∩ B)/n(B) = (n(A ∩ B)/N) / (n(B)/N) = P(A ∩ B)/P(B) = P(A|B).
The left-most side of the above equation is the fraction of the outcomes in B that are also in A ∩ B. In other words, it is the probability of A using the reduced sample space B. That is, if the outcomes in Ω are equally likely, P(A|B) is the proportion of outcomes in both A and B relative to the number of outcomes in B.
Example 1.12. In a controlled experiment to see if a drug is effective, 71 patients were given the drug (event D), while 75 were given a placebo (event D^c). A patient either records a response (event R) or not (event R^c). The following table, called a two-way or contingency table, summarizes the results.

                       Drug (D)   Placebo (D^c)   Subtotals
  Response (R)             26            13             39
  No response (R^c)        45            62            107
  Subtotals                71            75            146

The sample space consists of 146 outcomes of the type (Drug, Response), (Placebo, Response), (Drug, No Response), or (Placebo, No Response), assumed equally likely.
The numbers in the table are recorded after the experiment is performed and we estimate the
probability of each event. For instance,
P(D) = 71/146 = 0.486, P(R) = 39/146 = 0.267,
and so on. For example, P(R) is obtained because 39 of the 146 equally likely patients exhibited a response (whether to the drug or the placebo).
We can use the Law of Total Probability to also calculate these probabilities. If we want the chance
that a randomly chosen patient will record a response we use the fact that R = (R ∩ D) ∪ (R ∩ Dc ),
so
P(R) = P(R ∩ D) + P(R ∩ D^c) = 26/146 + 13/146 = 39/146 = 0.267,

and

P(D) = P(D ∩ R) + P(D ∩ R^c) = 26/146 + 45/146 = 71/146 = 0.486.
• If we choose at random a patient and we observe that this patient exhibited a response, what
is the chance this patient took the drug? This is
P(D|R) = 26/39 = P(D ∩ R)/P(R) = (26/146)/(39/146).
Using the reduced sample space R is how we got the first equality.
• If we choose a patient at random and we observe that this patient took the drug, what
is the chance this patient exhibited a response? This is P (R|D) = 26/71. Notice that
P(R|D) ≠ P(D|R).
• Find P (Rc |D) = 45/71. Observe that since P (D) = P (R ∩ D) + P (Rc ∩ D), we have
P(R^c|D) = P(R^c ∩ D)/P(D) = (P(D) − P(R ∩ D))/P(D) = 1 − P(R|D).
Using the Law of Total Probability we get an important formula and tool for calculating probabilities of events.

Theorem 1.13. Let B be an event with 0 < P(B) < 1. Then, for any event A ∈ F,

P(A) = P(A|B)P(B) + P(A|B^c)P(B^c).

Proof. The Law of Total Probability combined with the multiplication rule P(A ∩ B) = P(A|B)P(B) says

P(A) = P(A ∩ B) + P(A ∩ B^c) = P(A|B)P(B) + P(A|B^c)P(B^c). ■
Frequently, we want to find the conditional probability of some event and we have yet another
event we want to take into account. The next corollary tells us how to do that.
Corollary 1.14. Let B, C be events with P(B ∩ C) > 0 and P(B ∩ C^c) > 0. Then, for any event A,

P(A ∩ B) = P(A|B ∩ C)P(B ∩ C) + P(A|B ∩ C^c)P(B ∩ C^c).

Proof. Simply write out each term and use the theorem:

P(A ∩ B) = P(A ∩ B ∩ C) + P(A ∩ B ∩ C^c)
= [P(A ∩ B ∩ C)/P(B ∩ C)] P(B ∩ C) + [P(A ∩ B ∩ C^c)/P(B ∩ C^c)] P(B ∩ C^c)
= P(A|B ∩ C)P(B ∩ C) + P(A|B ∩ C^c)P(B ∩ C^c). ■
Another very useful fact is that conditional probabilities are actually probabilities and therefore
all rules for probabilities apply to conditional probabilities as long as the given information
remains the same.
Corollary 1.15. Let B be an event with P (B) > 0, then
Q(A) = P (A|B), ∀ A ∈ F,
is a probability function.
Proof. We have to verify that Q(·) satisfies the axioms of Definition 1.5. Clearly, Q(A) ≥ 0 for
any event A, and
Q(Ω) = P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1.

Finally, let A1 ∩ A2 = ∅. Then

P(A1 ∪ A2|B) = P((A1 ∪ A2) ∩ B)/P(B)
= P((A1 ∩ B) ∪ (A2 ∩ B))/P(B)
= [P(A1 ∩ B) + P(A2 ∩ B)]/P(B)
= P(A1|B) + P(A2|B). ■
Conditional probability naturally leads us to what it means when information about B doesn’t
help with the probability of A. This is an important concept and will be very helpful throughout
probability and statistics.
Definition 1.16. Two events A, B are said to be independent if the knowledge that one of the events occurred does not affect the probability that the other event occurs. That is,

P(A|B) = P(A) and P(B|A) = P(B).

Equivalently (and without requiring positive probabilities),

P(A ∩ B) = P(A)P(B).
Example 1.17. Suppose an experiment has two possible outcomes a, b, so the sample space is
S = {a, b}. Suppose P (a) = p and P (b) = 1 − p. If we perform this experiment n ≥ 1 times with
identical conditions from experiment to experiment, then the events of individual experiments are
independent. We may calculate P(n a's in a row) = p^n, and P(n a's and then b) = p^n(1 − p). In particular, the chance of getting five straight heads in five tosses of a fair coin is (1/2)^5 = 1/32.
When events are not independent we can frequently use the information about the occurrence of
one of the events to find the probability of the other. That is the basis of conditional probability.
The next concept allows us to calculate the probability of an event if the entire sample space is
split (or partitioned) into pieces and decomposing the event we are interested in into the parts
occurring in each piece. Here’s the idea.
If we have events B_1, . . . , B_n such that B_i ∩ B_j = ∅ for all i ≠ j and ∪_{i=1}^n B_i = S, then the collection {B_i}_{i=1}^n is called a partition of S. In this case, the Law of Total Probability says

P(A) = Σ_{i=1}^n P(A ∩ B_i), and P(A) = Σ_{i=1}^n P(A|B_i) P(B_i),
for any event A ∈ F. We can calculate the probability of A by using the pieces of A that intersect each B_i. It is always possible to partition Ω by taking any event B and the event B^c. Then, for any other event A, P(A) = P(A|B)P(B) + P(A|B^c)P(B^c).
Example 1.18. Suppose we draw the second card from the top of a well-shuffled deck. We want
to know the probability that this card is an Ace.
This seems to depend on what the first card is. Let B = {1st card is an Ace} and consider the partition {B, B^c}. Conditioning on what the first card is,

P(2nd card is an Ace) = P(2nd is an Ace|B)P(B) + P(2nd is an Ace|B^c)P(B^c)
= (3/51) × (4/52) + (4/51) × (48/52) = 4/52.
Amazingly, the chance the second card is an ace is the same as the chance the first card is an ace. This makes sense: if we don't know what the first card is, the second card should have the same chance as the first. In fact, the chance the 27th card is an ace is also 4/52, as long as we don't know any of the preceding 26 cards.
The next important theorem tells us how to find P (Bk | A) if we know how to find P (A | Bi ) for
each event Bi in the partition of S. It shows us how to find the probability that if A occurs, it
was due to Bk .
Theorem 1.19 (Bayes' Rule). Let {B_i}_{i=1}^n be a partition of S. Then for each k = 1, 2, . . . , n,

P(B_k|A) = P(B_k ∩ A)/P(A)
= P(A|B_k) P(B_k)/P(A)
= P(A|B_k) P(B_k) / [P(A|B_1) P(B_1) + · · · + P(A|B_n) P(B_n)].
Proof. The proof is in the statement of the theorem using the definition of conditional probability
and the Law of Total Probability. ■
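As a numerical illustration, a minimal Python sketch that applies Bayes' Rule to the drug data of Example 1.12, with the partition {D, D^c}:

    # priors and conditional response rates taken from the contingency table
    P_D, P_Dc = 71/146, 75/146
    P_R_given_D, P_R_given_Dc = 26/71, 13/75
    # denominator: Law of Total Probability
    P_R = P_R_given_D * P_D + P_R_given_Dc * P_Dc
    # Bayes' Rule
    P_D_given_R = P_R_given_D * P_D / P_R
    print(P_R)           # 0.2671... = 39/146
    print(P_D_given_R)   # 0.6666... = 26/39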
1.4 Counting

When we have a finite sample space |Ω| = N with equally likely outcomes, we calculate

P(A) = n(A)/N.

It is sometimes a very difficult task to calculate both N and n(A). In this section we give some basic counting principles to help with this.

Basic Counting Principle: If a task consists of two steps, and step one can be done in k different ways and step two in j different ways, then the task can be completed in k × j different ways.
Permutations

The number of ordered arrangements (permutations) of k objects chosen from n distinct objects is

P_{n,k} = n(n − 1) · · · (n − k + 1) = n!/(n − k)!.
For instance, if we have 3 distinct objects {a, b, c}, there are 6 = 3 · 2 ways to pick 2 objects out of the 3, since there are 3 ways to pick the first object and then 2 ways to pick the second. They are (a, b), (a, c), (b, a), (b, c), (c, a), (c, b).
Combinations
The number of ways to choose k objects out of n when we don't care about the order of the objects is

C_{n,k} = n!/(k!(n − k)!).

For example, in the paragraph on permutations, the choices (a, b) and (b, a) are different permutations but they are the same combination, and so should not be counted separately. The way to get the number of combinations is to first figure out the number of permutations, namely P_{n,k} = n!/(n − k)!, and then divide out the number of ways to arrange the selection of k objects, namely k!. In other words,

P_{n,k} = C_{n,k} · k! ⟹ C_{n,k} = n!/((n − k)!k!).
Example 1.20 (Poker Hands). We will calculate the probability of obtaining some of the common
5 card poker hands to illustrate the counting principles. A standard 52-card deck has 4 suits
(Hearts, Clubs, Spades, Diamonds) with each suit consisting of 13 cards labeled
2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A.
Five cards from the deck are chosen at random (without replacement). We now want to find the
probabilities of various poker hands.
The sample space Ω is all possible 5-card hands where the order of the cards does not matter. These are combinations of 5 cards from the 52, and there are N := |Ω| = C_{52,5} = 2,598,960 possible hands.

Probability of a Royal Flush, which is A, K, Q, J, 10, all of the same suit. Let A = {royal flush}. How many royal flushes are there? It should be obvious there are exactly 4, one for each suit. Therefore, P(A) = 4/C_{52,5} = 0.00000153908, an extremely rare event.
Probability of a Full House, which is 3 of a kind and a pair. Let A = {full house}. To get a full house we break this down into steps:

(a) Pick a face value for the 3 of a kind. There are 13 values one could choose.
(b) Choose 3 cards of the value chosen in the first step. There are C_{4,3} ways to do that.
(c) Choose another face value distinct from the first. There are 12 ways to do that.
(d) Choose 2 cards of the value chosen in the previous step. There are C_{4,2} ways to do that.

By the counting principle, n(A) = 13 · C_{4,3} · 12 · C_{4,2} = 3744, so P(A) = 3744/2,598,960 ≈ 0.00144.
Probability of 3 of a Kind. This is a hand of the form aaabc where b, c are cards neither of which has the same face value as a. Let A be the event that we get 3 of a kind. The number of hands in A is calculated using the multiplication rule with these steps: pick the face value of the triple (13 ways), choose 3 of its 4 suits (C_{4,3} ways), choose the 2 remaining distinct face values (C_{12,2} ways), and choose one of 4 suits for each of them (4 · 4 ways). This gives n(A) = 13 · C_{4,3} · C_{12,2} · 16 = 54,912, so P(A) ≈ 0.0211.
Probability of a Pair. This is a hand like aabcd where b, c, d are distinct cards with face values different from each other and from a. Let A be the event that we get (exactly) a pair. To get one pair and make sure the other 3 cards don't match the pair is a bit tricky. These are the steps: pick the face value of the pair (13 ways), choose 2 of its 4 suits (C_{4,2} ways), choose the 3 remaining distinct face values (C_{12,3} ways), and choose a suit for each of them (4³ ways). This gives n(A) = 13 · C_{4,2} · C_{12,3} · 64 = 1,098,240, so P(A) ≈ 0.4226, as shown in the sketch below.
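All of these counts are quick to reproduce with Python's math.comb; a minimal sketch of the hands computed above:

    from math import comb

    N = comb(52, 5)                                      # 2,598,960 hands
    royal = 4
    full_house = 13 * comb(4, 3) * 12 * comb(4, 2)       # 3,744
    three_kind = 13 * comb(4, 3) * comb(12, 2) * 4 * 4   # 54,912
    pair = 13 * comb(4, 2) * comb(12, 3) * 4 ** 3        # 1,098,240
    for n_A in (royal, full_house, three_kind, pair):
        print(n_A, n_A / N)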
2 Random Variables
In this chapter we study the main properties of functions whose domain is the sample space of an experiment with random outcomes. Such functions are called random variables. This chapter is based on Ross (2019), Barron and Greco (2020), and Capiński and Kopp (2004).
2.1 Probability Distributions

Definition 2.1. A random variable X is a function X : Ω → R such that

E = {ω ∈ Ω : X(ω) ≤ a} ∈ F, ∀a ∈ R,

i.e., the set E is an event for every a. The range of X, denoted R(X), is the set of values X can take. If R(X) is a finite or countable set, we say X is discrete. If R(X) contains an interval, then we say X is not discrete, but either continuous or mixed.
Definition 2.2. If X is a discrete random variable with range R(X) = {x_1, x_2, . . .}, the probability mass function (pmf) of X is

p(x_i) = P(X = x_i), i = 1, 2, . . .
Distribution 2.4. A random variable X which takes on only two values, a, b, with P(X = a) = p, P(X = b) = 1 − p, is said to have a Bernoulli(a, b, p) distribution, or to be a Bernoulli(a, b, p) random variable, and we write X ∼ Bernoulli(a, b, p). In particular, if we have an experiment with two outcomes, success or failure, we may set a = 1, b = 0 to represent these, and p is the probability of success.
An experiment like this is called a Bernoulli(p) trial. The pmf of a Bernoulli(a, b, p) random variable is

p(x) := P(X = x) = { p, if x = a;  1 − p, if x = b }.
A random variable X which counts the number of successes in an independent set of n Bernoulli
trials is called a Binomial (n, p) random variable, denoted X ∼ Binom (n, p). The range of X is
R(X) = {0, 1, 2, . . . , n}. The pmf of X is

p(x) := P(X = x) = C_{n,x} p^x (1 − p)^{n−x}, x = 0, 1, . . . , n.
Remark 2.5. Here's where this comes from. If we have a particular sequence of n Bernoulli trials with x successes, say 10011101 . . . 1, then x 1's must be in this sequence and n − x 0's must also be in there. By independence of the trials, the probability of any particular sequence of x 1's and n − x 0's is p^x (1 − p)^{n−x}. How many sequences with x 1's out of n are there? That number is

C_{n,x} = n!/(x!(n − x)!).
Hence a Binomial (n, p) random variable X is a sum of n (independent) Bernoulli (p) random
variables, X = X1 + X2 + · · · + Xn . Independent random variables will be discussed later.
Example 2.6. A bet on red for a standard roulette wheel has an 18/38 chance of winning. Suppose a gambler will bet $5 on red each time for 100 plays. Let X be the total amount won or lost as a result of these 100 plays. X will be a discrete random variable with range R(X) = {0, ±10, ±20, . . . , ±500}. In fact, if M denotes the number of games won (which is also a random variable, with values from 0 to 100), then our net amount won or lost is X = 10M − 500. The random variable M is an example of a Binomial(100, 18/38) random variable, so, for instance,

P(M = 50) = C_{100,50} (18/38)^{50} (20/38)^{50} = 0.0693.
It is important to note that a pdf does not have to satisfy f(x) ≤ 1 in general; a pdf only needs to be nonnegative and integrate to 1.
The cumulative distribution function (cdf) of a rv X is F_X(x) := P(X ≤ x). Every cdf satisfies:

i. F_X is nondecreasing;
ii. lim_{x→−∞} F_X(x) = 0 and lim_{x→+∞} F_X(x) = 1;
iii. lim_{y→x+0} F_X(y) = F_X(x) for all x ∈ R. This says a cdf is continuous at every point from the right.
Using the cdf F_X we can compute interval probabilities.

Proposition 2.9. For a < b, P(a < X ≤ b) = F_X(b) − F_X(a).
Proof. For a < b define the events A = {X ≤ a} ⊂ B = {X ≤ b}. By the Law of Total Probability,

P(B ∩ A^c) = P(B) − P(A ∩ B) = P(B) − P(A).

As a consequence,

P(a < X ≤ b) = P(B ∩ A^c) = P(X ≤ b) − P(X ≤ a) = F_X(b) − F_X(a). ■
If X is continuous with density f(x), then

P(a < X ≤ b) = ∫_a^b f(x) dx,
Therefore, we can find the pdf if we know the cdf using the fundamental theorem of calculus, i.e.,

F_X′(x) = f(x).
Example 2.11. Suppose X is a random variable with values 1, 2, 3 with probabilities 1/6, 1/3,
and 1/2, respectively. The jumps are at x = 1, 2, 3. The size of the jump is P (X = x), x = 1, 2, 3
and at each jump the left endpoint is not included while the right endpoint is included because the
cdf is continuous from the right. Then we may calculate
P(X < 2) = P(X = 1) = 1/6,

but

P(X ≤ 2) = P(X = 1) + P(X = 2) = 1/6 + 1/3 = 1/2.
We begin with the pmfs of some of the most important discrete rvs we will use in this course.
Definition 2.12. The Discrete Uniform pmf and cdf are, respectively,

P(X = x) = 1/n, and F_X(x) = x/n,

for x = 1, 2, . . . , n. A discrete uniform rv picks one of n points at random.
A Poisson(λ) rv X has pmf

P(X = x) = e^{−λ} λ^x/x!, x = 0, 1, 2, . . . .

The parameter λ > 0 is given. A Poisson(λ) rv counts the number of events that occur at rate λ per time unit.
A Geometric(p) rv X has pmf

P(X = x) = (1 − p)^{x−1} p, x = 1, 2, . . . .

This rv is the number of independent Bernoulli trials until we get the first success.
Definition 2.16. If a rv X follows a NegBinomial(r, p) distribution, its pmf is given by

P(X = x) = C_{x−1, r−1} p^r (1 − p)^{x−r}, x = r, r + 1, r + 2, . . . ,

where x represents the number of Bernoulli trials until we get r successes.
Definition 2.17. A (uniform) X ∼ U(a, b) rv models choosing a random number from a to b. The pdf is

f(x) = { 1/(b − a), if a < x < b;  0, otherwise }.

And the cdf is

F_X(x) = { 0, if x < a;  (x − a)/(b − a), if a ≤ x < b;  1, if x ≥ b }.
Next is the normal distribution which we have already discussed but we record it here again for
convenience.
Definition 2.18. A Normal(µ, σ) rv X ∼ N(µ, σ) has density

f(x) = (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²}, −∞ < x < ∞.

It is not possible to get an explicit expression for the cdf, so we simply write

Ncdf(x, µ, σ) := F_X(x) = (1/(σ√(2π))) ∫_{−∞}^x e^{−(1/2)((y−µ)/σ)²} dy.
Definition 2.19. An Exponential(λ) rv, denoted X ∼ Exp(λ), λ > 0, has pdf

f(x) = { λe^{−λx}, x ≥ 0;  0, x < 0 }.

The cdf is

F_X(x) = ∫_0^x λe^{−λy} dy = { 1 − e^{−λx}, if x ≥ 0;  0, if x < 0 }.
An exponential random variable represents processes that do not “remember”. For example, if X
represents the time between arrivals of customers to a store, a reasonable model is Exponential (λ)
where λ represents the average rate at which customers arrive.
2.2 Expectation, Variance and Quantiles

Definition 2.20. The expected value of a function g of a rv X is

E[g(X)] = ∫_{−∞}^∞ g(x) f(x) dx, if X is continuous with pdf f,

and

E[g(X)] = Σ_x g(x) P(X = x), if X is discrete.

In particular, the mean of X is µ := E[X].
With this definition, you can see why it is frequently useful to write E[g(X)] = ∫ g(x) P(X = x) dx even when X is a continuous rv. This abuses notation a lot, and you have to keep in mind that f(x) ≠ P(X = x), which is zero when X is continuous.
From calculus we know that if we have a one-dimensional object with density f(x) at each point, then

∫ x f(x) dx = E[X]

gives the center of gravity of the object. If X is discrete, the expected value is an average of the values of X, weighted by the probability it takes on each value.
For example, if X has values 1, 2, 3 with probabilities 1/8, 3/8, and 1/2, respectively, then

E[X] = 1 × 1/8 + 2 × 3/8 + 3 × 1/2 = 19/8.
On the other hand, the straight average of the 3 numbers is 2; the straight average corresponds to each value having equal probability. Now we have a definition of the expected value of any function of X. In particular,

E[X²] = ∫_{−∞}^∞ x² f(x) dx.

We need this if we want to see how the random variable spreads its values around the mean.
Definition 2.21. The variance of a rv X is Var[X] := E[(X − E[X])²]. The standard deviation, abbreviated SD, of X is SD(X) := √Var[X]. Another measure of the spread of a distribution is given by the median and the percentiles. Here's the definition.
Definition 2.22. The median m = med(X) of a random variable X is defined to be the real
number such that P (X ≤ m) = P (X ≥ m) = 12 . The median is also known as the 50th percentile.
Given a real number 0 < q < 1, the 100q th percentile of X is the number xq such that
P (X ≤ xq ) = q.
The interquartile range of a rv is IQR = Q3 − Q1 , i.e., the 75th percentile minus the 25th
percentile. Q1 is the first quartile, the 25th percentile, and Q3 is the third quartile, the 75th
percentile. The median is also known as Q2 , the second quartile.
In other words, 100q% of the values of X are below xq . Percentiles apply to any random variable
and give an idea of the shape of the density. Note that percentiles do not have to be unique, i.e.,
there may be several xq ’s resulting in the same q.
Example 2.23. Let Z ∼ N(0, 1). By symmetry of the density around 0, E[Z] = 0. Then we calculate

E[Z²] = ∫_{−∞}^∞ x² (1/√(2π)) e^{−x²/2} dx,

using integration by parts. We get E[Z²] = 1 and then

Var[Z] = E[Z²] − (E[Z])² = 1.
Example 2.24. Suppose we know that LSAT scores follow a normal distribution with mean 155
and SD=13. You take the test and score 162. What percent of people taking the test did worse
than you?
This is asking for P (X ≤ 162) knowing X ∼ N (155, 13). That’s easy since P (X ≤ 162) =
Ncdf (162, 155, 13) = 0.704. In other words, 162 is the 70.4 percentile of the scores. Suppose
instead someone told you that her score was in the 82nd percentile and you want to know her
actual score. To find that, we are looking to solve P (X ≤ x0.82 ) = 0.82.
Now here’s a proposition which says that the mean is the best estimate of a rv X in the mean
square sense, and the median is the best estimate in the mean absolute deviation sense.
(ii.) The mean of X, E[X] = µ, is the unique constant a which minimizes E[(X − a)²]. Then min_a E[(X − a)²] = E[(X − µ)²] = Var[X].

(iii.) A median med(X) is a constant which provides a minimum for E|X − a|. In other words, min_a E|X − a| = E|X − med(X)|.
The second statement says that the variance is the minimum of the mean squared distance of the
rv X to its mean. The third statement says that a median (which may not be unique) satisfies a
similar property for the absolute value of the distance.
Proof. First, writing out the variance with µ = E[X],

Var[X] = E[(X − µ)²] = E[X² − 2Xµ + µ²] = E[X²] − 2µE[X] + µ² = E[X²] − µ².
For ii. we will assume X is a continuous rv with pdf f. With G(a) = ∫_{−∞}^∞ (x − a)² f(x) dx,

G′(a) = (d/da) ∫_{−∞}^∞ (x − a)² f(x) dx = ∫_{−∞}^∞ −2(x − a) f(x) dx = 0

implies ∫ x f(x) dx = a ∫ f(x) dx = a. This assumes we can interchange the derivative and the integral. Furthermore, G″(a) = 2 ∫ f(x) dx = 2 > 0. Consequently, a = E[X] provides a minimum.
The last property is a little trickier since we can't take derivatives at first. We get rid of the absolute value signs first:

E|X − a| = ∫_{−∞}^∞ |x − a| f(x) dx = ∫_{−∞}^a −(x − a) f(x) dx + ∫_a^∞ (x − a) f(x) dx ≡ H(a).

Differentiating (again interchanging derivative and integral) gives H′(a) = ∫_{−∞}^a f(x) dx − ∫_a^∞ f(x) dx, and setting H′(a) = 0 we conclude that

P(X ≤ a) = ∫_{−∞}^a f(x) dx = ∫_a^∞ f(x) dx = P(X ≥ a) = 1/2,

and this says that a is a median of X. Furthermore, H″(a) = 2f(a) ≥ 0, so a = med(X) does provide a minimum (but note that H is not necessarily strictly concave up). ■
2.3 Moment Generating Function

Next we introduce a special function of X. This will give us a method of calculating means and variances, usually in a much simpler way than doing it directly.
Definition 2.26. The moment-generating function (mgf) of a rv X is M(t) := E[e^{tX}]. Explicitly, we define

M(t) = ∫_{−∞}^∞ e^{tx} f(x) dx, if X is continuous;
M(t) = Σ_x e^{tx} P(X = x), if X is discrete.
We assume the integral or sum exists for all t ∈ (−δ, δ) for some δ > 0. One reason the mgf is so useful is the following theorem. It says that if we know the mgf, we can find the moments E[X^n], n = 1, 2, . . ., by taking derivatives.
Theorem 2.27. If X has mgf M(t), then

E[X^n] = (d^n/dt^n) M(t) |_{t=0}.
Proof. The proof is easy if we assume that we can switch integrals and derivatives:

(d^n/dt^n) M(t) = ∫_{−∞}^∞ (d^n/dt^n) e^{tx} f(x) dx = ∫_{−∞}^∞ x^n e^{tx} f(x) dx.

Setting t = 0 then gives

(d^n/dt^n) M(t) |_{t=0} = ∫_{−∞}^∞ x^n f(x) dx = E[X^n]. ■
Example 2.28. Let's use the mgf to find the mean and variance of X ∼ Binomial(n, p):

M(t) = Σ_{x=0}^n e^{tx} P(X = x) = Σ_{x=0}^n C_{n,x} (pe^t)^x (1 − p)^{n−x} = (pe^t + (1 − p))^n,

where we used the binomial theorem in the last line. Now that we know the mgf we can find any moment by taking derivatives. Here are the first two:

M′(t) = npe^t (pe^t + (1 − p))^{n−1} ⟹ M′(0) = E[X] = np,

and

M″(t) = n(n − 1)p²e^{2t} (pe^t + (1 − p))^{n−2} + npe^t (pe^t + (1 − p))^{n−1} ⟹ M″(0) = E[X²] = n(n − 1)p² + np.

Therefore

Var[X] = E[X²] − (E[X])² = n(n − 1)p² + np − n²p² = np(1 − p).
Now we use the mgf to calculate the mean and variances of some of the important continuous
distributions.
For X ∼ U(a, b), the mgf is

M(t) = ∫_a^b e^{tx} (1/(b − a)) dx = (1/(b − a)) [e^{tx}/t]_a^b = (e^{tb} − e^{ta})/(t(b − a)).

Then

M′(t) = (e^{at}(at − 1) + e^{bt}(1 − bt))/((a − b)t²), and lim_{t→0} M′(t) = (a + b)/2.

We conclude E[X] = (a + b)/2. While we could find M″(0) = E[X²], it is actually easier to find this directly:

E[X²] = ∫_a^b x²/(b − a) dx = (b³ − a³)/(3(b − a)) ⟹ Var[X] = E[X²] − (E[X])² = (b − a)²/12.
(c) X ∼ N(0, 1), f(x) = (1/√(2π)) e^{−x²/2}, −∞ < x < ∞. The mgf for the standard normal distribution is

M(t) = ∫_{−∞}^∞ e^{tx} (1/√(2π)) e^{−x²/2} dx
= (1/√(2π)) ∫_{−∞}^∞ e^{tx − x²/2} dx
= (1/√(2π)) ∫_{−∞}^∞ e^{t²/2} e^{−(x−t)²/2} dx
= e^{t²/2} (1/√(2π)) ∫_{−∞}^∞ e^{−(x−t)²/2} dx
= e^{t²/2},
since

(1/√(2π)) ∫_{−∞}^∞ e^{−(x−t)²/2} dx = 1.
(d) X ∼ N(µ, σ). All we have to do is convert X to standard normal. Let Z = (X − µ)/σ. We know Z ∼ N(0, 1), and we may use the previous part to write M_Z(t) = e^{t²/2}. How do we get the mgf for X from that? Well, we know X = σZ + µ and so

M_X(t) = E[e^{tX}] = E[e^{(σZ+µ)t}] = e^{µt} E[e^{(tσ)Z}] = e^{µt} e^{(σt)²/2} = e^{µt + σ²t²/2}.

Then M′(t) = e^{σ²t²/2 + µt} (µ + σ²t), so that M′(0) = E[X] = µ. Next,

M″(t) = e^{σ²t²/2 + µt} (σ² + (µ + σ²t)²) ⟹ M″(0) = σ² + µ².

This gives us Var[X] = E[X²] − (E[X])² = σ² + µ² − µ² = σ².
We record here the mean and variance of some important discrete distributions:

  Distribution         Mean    Variance
  Bernoulli(1, 0, p)   p       p(1 − p)
  Binomial(n, p)       np      np(1 − p)
  Poisson(λ)           λ       λ
  Geometric(p)         1/p     (1 − p)/p²
  NegBinomial(r, p)    r/p     r(1 − p)/p²
2.4 Characteristic Function

The characteristic function of a rv X is defined as

φ_X(t) := E[e^{itX}], t ∈ R,

where i = √(−1).
Vice versa, it turns out that the probability distribution of X is uniquely determined by the characteristic function φ_X. The function φ_X has the advantage that it always exists, because the random variable e^{itX} is bounded (|e^{itX}| = 1).
3 Codependence Structures
In probability theory and statistics, the concept of codependence structure refers to the way in
which two or more random variables are related to each other.
Specifically, it describes the pattern of association or dependence among the variables. There are
different types of codependence structures that will be addressed next.
This chapter is based on Ross (2019) and Barron and Greco (2020).
Definition 3.1. (1) If X and Y are two random variables, the joint cdf is

F_{X,Y}(x, y) := P({X ≤ x} ∩ {Y ≤ y}).

The pair of rvs (X, Y) is continuous if there is a joint density function, and then

F_{X,Y}(x, y) = ∫_{−∞}^x ∫_{−∞}^y f_{X,Y}(u, v) dv du,

so that

f_{X,Y}(x, y) = ∂²F_{X,Y}(x, y)/(∂x∂y).
Knowing the joint distribution of X and Y means we have full knowledge of X and Y individually. For example, if we know F_{X,Y}(x, y), then

F_X(x) = F_{X,Y}(x, ∞), and F_Y(y) = F_{X,Y}(∞, y).
The resulting F_X and F_Y are called the marginal cumulative distribution functions. The marginal densities, when there is a joint density, are given by

f_X(x) = ∫_{−∞}^∞ f_{X,Y}(x, y) dy, and f_Y(y) = ∫_{−∞}^∞ f_{X,Y}(x, y) dx.
Example 3.2. The function

f(x, y) = { 8xy, 0 ≤ x < y ≤ 1;  0, otherwise }

is given. First we verify it is a joint density. Since f ≥ 0, all we need to check is that the double integral is one:

∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dx dy = ∫_0^1 ∫_0^y 8xy dx dy = ∫_0^1 8y (y²/2) dy = 1.
Next we find the marginals:

f_X(x) = ∫_{−∞}^∞ f(x, y) dy = ∫_x^1 8xy dy = 4x(1 − x²), for 0 ≤ x ≤ 1,

and

f_Y(y) = ∫_{−∞}^∞ f(x, y) dx = ∫_0^y 8xy dx = 4y³, for 0 ≤ y ≤ 1,

with both marginals equal to 0 otherwise.
If X and Y are discrete rvs, the joint pmf is p(x, y) = P(X = x, Y = y). The marginals are then given by p_X(x) = P(X = x) = Σ_y p(x, y) and p_Y(y) = P(Y = y) = Σ_x p(x, y).

In general, for any set C ⊂ R × R, the probability that the pair (X, Y) falls in C is

P((X, Y) ∈ C) = ∫∫_C f_{X,Y}(x, y) dx dy, if X, Y are continuous;
P((X, Y) ∈ C) = Σ_{(x,y)∈C} p_{X,Y}(x, y), if X, Y are discrete.
Definition 3.3. If (X, Y) have joint density f_{X,Y}(x, y), the expected value of a function g of the rvs is

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) f_{X,Y}(x, y) dx dy, if X, Y are continuous;
E[g(X, Y)] = Σ_{x,y} g(x, y) P(X = x, Y = y), if X, Y are discrete.
Example 3.4. We calculate E[X + Y] assuming we have the joint density f_{X,Y}(x, y) of (X, Y). By definition,

E[X + Y] = ∫∫ (x + y) f_{X,Y}(x, y) dx dy
= ∫∫ x f_{X,Y}(x, y) dx dy + ∫∫ y f_{X,Y}(x, y) dx dy
= ∫ x ( ∫ f_{X,Y}(x, y) dy ) dx + ∫ y ( ∫ f_{X,Y}(x, y) dx ) dy
= ∫ x f_X(x) dx + ∫ y f_Y(y) dy
= E[X] + E[Y].

Notice that the first E uses the joint density f_{X,Y}, while the second and third E's use f_X and f_Y, respectively.
Example 3.5. Suppose (X, Y ) have joint density f (x, y) = 1, 0 ≤ x, y ≤ 1, and f (x, y) = 0
otherwise. This models picking a random point (x, y) in the unit square. If we want to calculate
P (X < Y ), this uses the density.
P(X < Y) = ∫∫_{0≤x<y≤1} f(x, y) dx dy = ∫_0^1 ∫_0^y 1 dx dy = [y²/2]_0^1 = 1/2.

Similarly,

P(X² + Y² ≤ 1/4) = ∫∫_{0≤x²+y²≤1/4} 1 dx dy = π/16,

the area of the quarter disk of radius 1/2 inside the unit square.
Also,

E[X² + Y²] = ∫_0^1 ∫_0^1 (x² + y²) f(x, y) dx dy = 2/3.
In general, if we are given a set D ⊂ R², the density

f(x, y) = { 1/(area of D), (x, y) ∈ D;  0, otherwise }

is called a uniform density on D, and we write (X, Y) ∼ Unif(D). You see that E[XY] ≠ E[X] × E[Y] in general, but there are important cases when this is true. For that we need the notion of independent random variables.
3.2 Independent Random Variables

Definition 3.6. The rvs X and Y are independent if

P(X ≤ x, Y ≤ y) = P(X ≤ x) P(Y ≤ y), ∀ x, y ∈ R.

If (X, Y) has a joint density f_{X,Y}, X has density f_X, and Y has density f_Y, then independence means that the joint density factors into the individual densities:

f_{X,Y}(x, y) = f_X(x) f_Y(y), ∀ x, y ∈ R.
One of the main consequences of independence is the following fact. It says the expected value of
a product of rvs is the product of the expected value of each rv.
Proposition 3.7. If X, Y are independent, then

E[XY] = E[X] × E[Y].

Proof. In the continuous case,

E[XY] = ∫∫ xy f_X(x) f_Y(y) dx dy = ( ∫ x f_X(x) dx )( ∫ y f_Y(y) dy ) = E[X] × E[Y]. ■
Independence also allows us to find an explicit expression for the cumulative distribution of the sum of two random variables.

Proposition 3.8. Let X and Y be independent, with Y continuous with density f_Y. Then

F_{X+Y}(w) := P(X + Y ≤ w) = ∫_{−∞}^∞ F_X(w − y) f_Y(y) dy.

Proof. This is really another application of the Law of Total Probability. To see this, write (informally)

P(X + Y ≤ w) = ∫ P(X + Y ≤ w, Y = y) dy = ∫ P(X ≤ w − y) P(Y = y) dy = ∫ F_X(w − y) f_Y(y) dy.

The first equality uses the Law of Total Probability, whereas the second equality uses the independence. ■
Example 3.9. Suppose X and Y are independent Exp(λ) rvs. Then, for w ≥ 0,

P(X + Y ≤ w) = ∫_0^∞ F_X(w − y) f_Y(y) dy = ∫_0^w (1 − e^{−λ(w−y)}) λe^{−λy} dy = 1 − (λw + 1)e^{−λw} = F_{X+Y}(w).

If w < 0, F_{X+Y}(w) = 0. To find the density we take the derivative with respect to w to get

f_{X+Y}(w) = λ²w e^{−λw}, w ≥ 0.

It turns out that this is the pdf of a so-called Gamma(λ, 2) rv.
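A quick simulation sketch checking the formula for F_{X+Y} (the values of λ and w are arbitrary illustrative choices):

    import math, random

    lam, w, n = 2.0, 1.0, 200_000
    hits = sum(random.expovariate(lam) + random.expovariate(lam) <= w
               for _ in range(n))
    print(hits / n)                                 # Monte Carlo estimate of P(X + Y <= w)
    print(1 - (lam * w + 1) * math.exp(-lam * w))   # exact cdf value, approx 0.594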
3.3 Covariance and Correlation

Definition 3.10. The covariance of two rvs X and Y is

Cov[X, Y] := E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y],

and the correlation of X and Y is

ρ(X, Y) := Cov[X, Y]/√(Var[X] Var[Y]).
It looks like covariance measures how independent X and Y are. It is certainly true that if X, Y
are independent, then ρ(X, Y ) = 0, but the reverse is false.
Here's one of the more important implications of independence.

Proposition 3.11. If X and Y are uncorrelated, i.e., Cov[X, Y] = 0 (which holds in particular when they are independent), then

Var[X + Y] = Var[X] + Var[Y].

Proof.

Var[X + Y] = E[(X + Y)²] − (E[X + Y])²
= E[X² + 2XY + Y²] − (E[X])² − 2E[X]E[Y] − (E[Y])²
= Var[X] + Var[Y] + 2(E[XY] − E[X]E[Y])
= Var[X] + Var[Y] + 2Cov[X, Y],

and the covariance term vanishes. ■
Remark 3.12. This can be extended to n rvs X_1, . . . , X_n. If they are uncorrelated (which is true if they are independent), then Var[X_1 + · · · + X_n] = Var[X_1] + · · · + Var[X_n].
Let us consider now the special case where X and Y are indicator variables for whether or not the events A and B occur. That is, for events A and B, define

X = { 1, if A occurs;  0, otherwise },  and  Y = { 1, if B occurs;  0, otherwise }.

Then,

Cov[X, Y] = E[XY] − E[X]E[Y],

and, because XY will equal 1 or 0 depending on whether or not both X and Y equal 1, we see that

Cov[X, Y] = P(X = 1, Y = 1) − P(X = 1) P(Y = 1).
From this we see that

Cov[X, Y] > 0 ⟺ P(X = 1, Y = 1) > P(X = 1) P(Y = 1)
⟺ P(X = 1, Y = 1)/P(X = 1) > P(Y = 1)
⟺ P(Y = 1 | X = 1) > P(Y = 1).

That is, the covariance of X and Y is positive if the outcome X = 1 makes it more likely that Y = 1 (which, as is easily seen by symmetry, also implies the reverse). In general it can be shown that a positive value of Cov[X, Y] is an indication that Y tends to increase as X does, whereas a negative value indicates that Y tends to decrease as X increases.
Example 3.13. Suppose X and Y have joint density

f(x, y) = (1/y) e^{−(y + x/y)}, 0 < x, y < ∞,

and f(x, y) = 0 otherwise; we compute Cov[X, Y]. To show that f(x, y) is a joint density function we need to show it is nonnegative, which is immediate, and that ∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dy dx = 1. We prove the latter as follows:

∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dx dy = ∫_0^∞ ∫_0^∞ (1/y) e^{−(y+x/y)} dx dy = ∫_0^∞ e^{−y} ( ∫_0^∞ (1/y) e^{−x/y} dx ) dy = ∫_0^∞ e^{−y} dy = 1.
The marginal density of Y is

f_Y(y) = ∫_0^∞ (1/y) e^{−y} e^{−x/y} dx = e^{−y},

so Y ∼ Exp(1) and E[Y] = 1.
We compute E[X] and E[XY] as follows:

E[X] = ∫_{−∞}^∞ ∫_{−∞}^∞ x f(x, y) dx dy = ∫_0^∞ e^{−y} ( ∫_0^∞ (x/y) e^{−x/y} dx ) dy.
Now, ∫_0^∞ (x/y) e^{−x/y} dx is the expected value of an exponential random variable with parameter 1/y, and thus is equal to y. Consequently,

E[X] = ∫_0^∞ y e^{−y} dy = 1.
Also,

E[XY] = ∫_{−∞}^∞ ∫_{−∞}^∞ xy f(x, y) dx dy = ∫_0^∞ y e^{−y} ( ∫_0^∞ (x/y) e^{−x/y} dx ) dy = ∫_0^∞ y² e^{−y} dy.
Integration by parts gives

E[XY] = ∫_0^∞ y² e^{−y} dy = [−y² e^{−y}]_0^∞ + ∫_0^∞ 2y e^{−y} dy = 2E[Y] = 2.
Consequently,

Cov[X, Y] = E[XY] − E[X]E[Y] = 2 − 1 × 1 = 1.
Properties of Covariance

For any rvs X, Y, Z:

1. Cov[X, X] = Var[X],
2. Cov[X, Y] = Cov[Y, X],
3. Cov[cX, Y] = c Cov[X, Y] for any constant c,
4. Cov[X, Y + Z] = Cov[X, Y] + Cov[X, Z].

Whereas the first three properties are immediate, the final one is easily proven as follows:

Cov[X, Y + Z] = E[X(Y + Z)] − E[X]E[Y + Z]
= E[XY] − E[X]E[Y] + E[XZ] − E[X]E[Z]
= Cov[X, Y] + Cov[X, Z].
The fourth property listed easily generalizes to give the following result:

Cov[ Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j ] = Σ_{i=1}^n Σ_{j=1}^m Cov[X_i, Y_j]. (3.2)
A useful expression for the variance of the sum of random variables can be obtained from equation (3.2) as follows:

Var[ Σ_{i=1}^n X_i ] = Cov[ Σ_{i=1}^n X_i, Σ_{j=1}^n X_j ]
= Σ_{i=1}^n Σ_{j=1}^n Cov[X_i, X_j]
= Σ_{i=1}^n Cov[X_i, X_i] + Σ_{i=1}^n Σ_{j≠i} Cov[X_i, X_j]
= Σ_{i=1}^n Var[X_i] + 2 Σ_{i=1}^n Σ_{j<i} Cov[X_i, X_j]. (3.3)
3.4 Conditional Expectation

Proposition 3.14 (Law of Total Expectation). For any rvs X and Y,

E[X] = E[E[X|Y]]. (3.4)

If Y is discrete, then equation (3.4) states that

E[X] = Σ_y E[X|Y = y] P(Y = y), (3.5)

while if Y is continuous with density f_Y(y), then equation (3.4) says that

E[X] = ∫_{−∞}^∞ E[X|Y = y] f_Y(y) dy.
We now give a proof of Proposition 3.14 in the case where X and Y are both discrete random
variables.
Proof. Using the definition of conditional expectation,

Σ_y E[X|Y = y] P(Y = y) = Σ_y Σ_x x P(X = x|Y = y) P(Y = y)
= Σ_x x Σ_y P(X = x, Y = y)
= Σ_x x P(X = x)
= E[X]. ■
One way to understand equation (3.5) is to interpret it as follows: to calculate E[X] we may take a weighted average of the conditional expected values of X given that Y = y, each term E[X|Y = y] being weighted by the probability of the event on which it is conditioned.
Example 3.15. Sam will read either one chapter of his probability book or one chapter of his
history book. If the number of misprints in a chapter of his probability book is Poisson distributed
with mean 2 and if the number of misprints in his history chapter is Poisson distributed with
mean 5, then assuming Sam is equally likely to choose either book, what is the expected number of
misprints that Sam will come across?
Let X be the number of misprints. Because it would be easy to compute E[X] if we knew which book Sam chooses, let

Y = { 1, if Sam chooses his history book;  2, if Sam chooses his probability book }.

Conditioning on Y yields

E[X] = E[X|Y = 1] P(Y = 1) + E[X|Y = 2] P(Y = 2) = 5(1/2) + 2(1/2) = 7/2.
Example 3.16 (The Expectation of the Sum of a Random Number of Random Variables).
Suppose that the expected number of accidents per week at an industrial plant is four. Suppose
also that the numbers of workers injured in each accident are independent random variables
with a common mean of 2. Assume also that the number of workers injured in each accident is
independent of the number of accidents that occur. What is the expected number of injuries during
a week?
Letting N denote the number of accidents and X_i the number injured in the ith accident, i = 1, 2, . . ., the total number of injuries can be expressed as Σ_{i=1}^N X_i. Hence, we need to compute the expected value of the sum of a random number of random variables. Because it is easy to compute the expected value of the sum of a fixed number of random variables, let us try conditioning on N. This yields

E[ Σ_{i=1}^N X_i ] = E[ E[ Σ_{i=1}^N X_i | N ] ].
But

E[ Σ_{i=1}^N X_i | N = n ] = E[ Σ_{i=1}^n X_i | N = n ] = E[ Σ_{i=1}^n X_i ] (by the independence of the X_i and N) = nE[X],

which yields

E[ Σ_{i=1}^N X_i | N ] = N E[X],

and thus

E[ Σ_{i=1}^N X_i ] = E[N E[X]] = E[N] E[X].
Therefore, in our example, the expected number of injuries during a week equals 4 × 2 = 8.
The random variable Σ_{i=1}^N X_i, equal to the sum of a random number N of independent and identically distributed random variables that are also independent of N, is called a compound random variable.
As just shown in Example 3.16, the expected value of a compound random variable is E[N]E[X]. When there is no obvious random variable to condition on, it often turns out to be useful to condition on the first thing that occurs. This is illustrated in the following example.
Example 3.17 (The Mean of a Geometric Distribution). A coin, having probability p of coming
up heads, is to be successively flipped until the first head appears. What is the expected number of
flips required?
Let N denote the number of flips required, and let Y = 1 if the first flip lands heads and Y = 0 otherwise. Now,

E[N] = E[N|Y = 1] P(Y = 1) + E[N|Y = 0] P(Y = 0) = pE[N|Y = 1] + (1 − p)E[N|Y = 0]. (3.6)

However,

E[N|Y = 1] = 1, E[N|Y = 0] = 1 + E[N]. (3.7)
To see why equation (3.7) is true, consider E[N | Y = 1]. Since Y = 1, we know that the first flip
resulted in heads and so, clearly, the expected number of flips required is 1.
On the other hand if Y = 0, then the first flip resulted in tails. However, since the successive flips
are assumed independent, it follows that, after the first tail, the expected additional number of
flips until the first head is just E[N ]. Hence E[N | Y = 0] = 1 + E[N ].
Substituting equation (3.7) into equation (3.6) yields

E[N] = p + (1 − p)(1 + E[N]),

or

E[N] = 1/p.
Conditional Variance
Another way to use conditioning to obtain the variance of a random variable is to apply the conditional variance formula. The conditional variance of X given that Y = y is defined by

Var[X|Y = y] := E[ (X − E[X|Y = y])² | Y = y ].

That is, the conditional variance is defined in exactly the same manner as the ordinary variance, with the exception that all probabilities are determined conditional on the event that Y = y. Expanding the right side of the preceding and taking the expectation term by term yields

Var[X|Y = y] = E[X²|Y = y] − (E[X|Y = y])².
Letting Var[X|Y] denote the function of Y whose value when Y = y is Var[X|Y = y], we have the conditional variance formula:

Var[X] = E[Var[X|Y]] + Var[E[X|Y]].
Proof.

E[Var[X|Y]] = E[ E[X²|Y] − (E[X|Y])² ] = E[X²] − E[ (E[X|Y])² ],

and

Var[E[X|Y]] = E[ (E[X|Y])² ] − (E[E[X|Y]])² = E[ (E[X|Y])² ] − (E[X])².

Therefore,

E[Var[X|Y]] + Var[E[X|Y]] = E[X²] − (E[X])² = Var[X]. ■
As noted in Example 3.16, where its expected value was determined, the random variable S = Σ_{i=1}^N X_i is called a compound random variable. Let's find its variance.
Whereas we could obtain E[S²] by conditioning on N, let us instead use the conditional variance formula. Suppose the X_i are i.i.d. with mean µ and variance σ², and are independent of N. Now,

Var[S|N = n] = Var[ Σ_{i=1}^N X_i | N = n ] = Var[ Σ_{i=1}^n X_i | N = n ] = Var[ Σ_{i=1}^n X_i ] = nσ².

Therefore,

Var[S|N] = Nσ², E[S|N] = Nµ,
and the conditional variance formula gives

Var[S] = E[Nσ²] + Var[Nµ] = σ²E[N] + µ²Var[N].
If N is a Poisson random variable, then S = Σ_{i=1}^N X_i is called a compound Poisson random variable. Because the variance of a Poisson random variable is equal to its mean, it follows that for a compound Poisson random variable having E[N] = λ,

Var[S] = λσ² + λµ² = λE[X²],

where X has the same distribution as the X_i.
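A simulation sketch of the identity Var[S] = λE[X²]; here λ = 4 and X uniform on {1, 2, 3} are illustrative choices, so E[X²] = 14/3 and the target variance is 56/3 ≈ 18.67:

    import random
    from statistics import variance

    lam, xs = 4.0, [1, 2, 3]

    def compound_poisson():
        # N ~ Poisson(lam): count exponential interarrival times up to time 1
        n, t = 0, random.expovariate(lam)
        while t <= 1.0:
            n += 1
            t += random.expovariate(lam)
        # S = X_1 + ... + X_N, with the X_i i.i.d. and independent of N
        return sum(random.choice(xs) for _ in range(n))

    samples = [compound_poisson() for _ in range(100_000)]
    print(variance(samples))   # approx 18.67 = lam * E[X^2]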
4 Convergence and Limit Theorems

4.1 The Central Limit Theorem
Later we will need the following results. The first part says that if two rvs have the same mgf,
then they have the same distribution. The second part says that if the mgfs of a sequence of rvs
converges to an mgf, then the cdfs must also converge to the cdf of the limit rv.
Theorem 4.1. If X and Y are two rvs such that M_X(t) = M_Y(t) (for all t close to 0), then X and Y have the same cdfs.

On the other hand, if X_k, k = 1, 2, . . ., is a sequence of rvs with mgfs M_k(t) and cdfs F_k(x), respectively, and if lim_{k→∞} M_k(t) = M_X(t) where M_X(t) is an mgf, then there is a unique cdf F_X and lim_{k→∞} F_k(x) = F_X(x) at each point x of continuity of F_X.
Proposition 4.2. Let X_1, X_2, . . . , X_n be independent rvs with mgfs M_{X_i}(t), i = 1, 2, . . . , n. Let S_n = X_1 + · · · + X_n. Then M_{S_n}(t) = M_{X_1}(t) · M_{X_2}(t) · · · M_{X_n}(t).

Proof. This is directly from the definition and the independence:

M_{S_n}(t) = E[e^{t(X_1 + ··· + X_n)}] = E[e^{tX_1}] · · · E[e^{tX_n}] = M_{X_1}(t) · · · M_{X_n}(t). ■

In particular, if X_i ∼ N(µ_i, σ_i) are independent, then

M_{S_n}(t) = Π_{i=1}^n exp(tµ_i + σ_i²t²/2) = exp( t Σ_{i=1}^n µ_i + (t²/2) Σ_{i=1}^n σ_i² ).

Since mgfs determine a distribution uniquely according to Theorem 4.1, we see that

S_n ∼ N( Σ_{i=1}^n µ_i, √(Σ_{i=1}^n σ_i²) ).
Example 4.3. The sum of independent Geom(p) random variables is Negative Binomial. In particular, suppose X is the number of Bernoulli trials until we get r successes, with probability p of success on each trial. Then X = X_1 + X_2 + · · · + X_r, where X_i ∼ Geom(p), i = 1, 2, . . . , r, is the number of trials from one success until the next. This is true since once we have a success we simply start counting anew from the last success until we get another success. Now we have, by independence,

E[X] = Σ_{i=1}^r E[X_i] = r/p, and Var[X] = Σ_{i=1}^r Var[X_i] = r(1 − p)/p².

In addition, using the mgf of Geom(p), namely M_{X_i}(t) = pe^t/(1 − e^t(1 − p)), t < −ln(1 − p), we have

M_X(t) = Π_{i=1}^r M_{X_i}(t) = e^{rt} p^r / (1 − e^t(1 − p))^r, t < −ln(1 − p).
We have seen that the sum of independent normal rvs is exactly normal. The Central Limit
Theorem says that even if the Xi ’s are not normal, the sum is approximately normal if the
number of rvs is large.
Theorem 4.4 (Central Limit Theorem). Let X_1, X_2, . . . be a sequence of independent rvs all having the same distribution, with E[X_1] = µ and Var[X_1] = σ². Then for any a, b ∈ R,

lim_{n→∞} P( a ≤ (X_1 + · · · + X_n − nµ)/(σ√n) ≤ b ) = P(a ≤ Z ≤ b),

where Z ∼ N(0, 1).
Proof. We sketch the proof assuming the mgf M(t) of the X_i exists, and we may take µ = 0 (otherwise replace X_i by X_i − µ). Let Z_n = (X_1 + · · · + X_n)/(σ√n); by Proposition 4.2 its mgf is M_{Z_n}(t) = [M(t/(σ√n))]^n. If we can show M_{Z_n}(t) → e^{t²/2}, then by Theorem 4.1 we can conclude that the cdf of Z_n will converge to the cdf of the random variable that has mgf e^{t²/2}. But that random variable is Z ∼ N(0, 1). That will complete the proof. Therefore, all we need to do is to show that

lim_{n→∞} n ln M(t/(σ√n)) = t²/2.
To see this, change variables to x = t/(σ√n), so that

lim_{n→∞} n ln M(t/(σ√n)) = (t²/σ²) lim_{x→0} ln M(x)/x².
Since ln M(0) = 0 we may use L'Hopital's rule to evaluate the limit. We get

(t²/σ²) lim_{x→0} ln M(x)/x² = (t²/σ²) lim_{x→0} M′(x)/(2xM(x))
= (t²/(2σ²)) lim_{x→0} M″(x)/(xM′(x) + M(x)) (using L'Hopital again)
= (t²/(2σ²)) · M″(0)/(0 · M′(0) + M(0))
= (t²/(2σ²)) · σ²/(0 × 0 + 1)
= t²/2,

since M(0) = 1, M′(0) = E[X] = 0, and M″(0) = E[X²] = σ². This completes the proof.
■
The full proof of Theorem 4.4 involves the characteristic function, and can be seen for example in
Ross (2019).
Example 4.5. Suppose an elevator is designed to hold 2000 pounds. The mean weight of a person getting on the elevator is 175 pounds with standard deviation 15 pounds. How many people can board the elevator so that the chance it is overloaded is at most 1%? Let W = X_1 + · · · + X_n be the total weight of the n people who board the elevator. We don't know the distribution of the weights of individual people (which is probably not normal), but we do know E[X] = 175 and Var[X] = 15². By the central limit theorem, W ≈ N(175n, 15√n), and we want to find n so that P(W ≥ 2000) ≤ 0.01. If we standardize W we get
P(W ≥ 2000) = P( Z ≥ (2000 − 175n)/(15√n) ) ≤ 0.01 ⟺ (2000 − 175n)/(15√n) ≥ z_{0.99} = 2.326.

Checking values of n, the condition holds for n ≤ 10: for n = 10 the ratio is 250/(15√10) ≈ 5.27, while for n = 11 it is only 75/(15√11) ≈ 1.51 < 2.326.
The maximum number of people that can board the elevator and meet the criterion is therefore 10. Without knowing the distribution of the weights of people, there is no other way to do this problem.
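The inequality is easy to check numerically for each n; a minimal Python sketch:

    from statistics import NormalDist

    z99 = NormalDist().inv_cdf(0.99)   # approx 2.326
    for n in range(1, 15):
        ratio = (2000 - 175 * n) / (15 * n ** 0.5)
        print(n, round(ratio, 2), ratio >= z99)
    # the condition holds for n = 1, ..., 10 and fails from n = 11 on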
Theorem 4.6 (Chebychev's Inequality). Let X be a rv with mean µ and variance σ². Then

P(|X − µ| ≥ c) ≤ σ²/c², for any constant c > 0.

The larger c is, the smaller the bound on the probability.
Proof. The argument for Chebychev is simple. Assume X has pdf f. Then

σ² := E[|X − µ|²] = ∫_{|x−µ|≥c} |x − µ|² f(x) dx + ∫_{|x−µ|≤c} |x − µ|² f(x) dx
≥ ∫_{|x−µ|≥c} |x − µ|² f(x) dx
≥ c² ∫_{|x−µ|≥c} f(x) dx
= c² P(|X − µ| ≥ c). ■
Chebychev is used to give us the Weak Law of Large Numbers which tells us that the mean of a
random sample should converge to the population mean as the sample size goes to infinity.
Theorem 4.7 (Weak Law of Large Numbers). Let X_1, . . . , X_n be a random sample, i.e., independent rvs all having the same distribution as a rv X with finite mean E[X] = µ and finite variance σ² = Var[X]. Then, for any constant c > 0, with X̄ = (X_1 + · · · + X_n)/n,

lim_{n→∞} P(|X̄ − µ| ≥ c) = 0.
Proof. We know E[X̄] = µ and Var[X̄] = σ²/n. By Chebychev's inequality,

P(|X̄ − µ| ≥ c) ≤ Var[X̄]/c² = σ²/(nc²) → 0 as n → ∞. ■
4.2 Laws of Large Numbers
In this section the strong law of large numbers is presented. As the proof of the strong law makes
use of the Borel-Cantelli lemma, this result will be presented first.
Lemma 4.8 (Borel-Cantelli). For a sequence of events A_i, i ≥ 1, let N denote the number of these events that occur. If Σ_{i=1}^∞ P(A_i) < ∞, then P(N = ∞) = 0.

Proof. Suppose that Σ_{i=1}^∞ P(A_i) < ∞. Now, if N = ∞, then for every n < ∞ at least one of the events A_n, A_{n+1}, . . . will occur. That is, N = ∞ implies that ∪_{i=n}^∞ A_i occurs for every n. Thus, for every n,

P(N = ∞) ≤ P( ∪_{i=n}^∞ A_i ) ≤ Σ_{i=n}^∞ P(A_i),

where the final inequality follows from Boole's inequality. Because Σ_{i=1}^∞ P(A_i) < ∞ implies that Σ_{i=n}^∞ P(A_i) → 0 as n → ∞, we obtain from the preceding, upon letting n → ∞, that P(N = ∞) = 0, which proves the result. ■
Remark 4.9. The Borel-Cantelli lemma is actually quite intuitive, for if we define the indicator variable I_i to equal 1 if A_i occurs and to equal 0 otherwise, then N = Σ_{i=1}^∞ I_i, implying that

E[N] = Σ_{i=1}^∞ E[I_i] = Σ_{i=1}^∞ P(A_i).

Consequently, the Borel-Cantelli lemma states that if the expected number of events that occur is finite, then the probability that an infinite number of them occur is 0. This is intuitive, because if there were a positive probability that an infinite number of events could occur then E[N] would be infinite.
Theorem 4.10 (Strong Law of Large Numbers). Let X_1, X_2, . . . be a sequence of i.i.d. random variables, with E[X_i] = µ < ∞ and Var[X_i] = σ² < ∞. Then, with probability 1,

(X_1 + X_2 + · · · + X_n)/n → µ, as n → ∞.

Suppose that X_1, X_2, . . . are independent and identically distributed random variables with mean µ, and let X̄_n = (1/n) Σ_{i=1}^n X_i be the average of the first n of them. The strong law of large numbers states that P(lim_{n→∞} X̄_n = µ) = 1. That is, with probability 1, X̄_n converges to µ as n → ∞. We will give a proof of this result under the assumption that σ², the variance of X_i, is finite (which is equivalent to assuming that E[X_i²] < ∞). Because proving the strong law requires showing, for any ϵ > 0, that |X̄_n − µ| > ϵ for only a finite number of values of n, it is natural to attempt to prove it by utilizing the Borel-Cantelli lemma.
That is, the result would follow if we could show that Σ_{n=1}^∞ P(|X̄_n − µ| > ϵ) < ∞. However, Chebychev's inequality only yields

Σ_{n=1}^∞ P(|X̄_n − µ| > ϵ) ≤ Σ_{n=1}^∞ Var[X̄_n]/ϵ² = (σ²/ϵ²) Σ_{n=1}^∞ 1/n = ∞.
Thus, a straightforward use of Borel-Cantelli does not work. However, as we now show, a tweaking of the argument, where we first consider a subsequence of X̄_n, n ≥ 1, allows us to prove the strong law.
Proof. Suppose first that the X_i are nonnegative random variables. Fix α > 1, and let n_j be the smallest integer greater than or equal to α^j, j ≥ 1. From Chebyshev's inequality we see that

P(|X̄_{n_j} − µ| > ϵ) ≤ Var[X̄_{n_j}]/ϵ² = σ²/(n_j ϵ²).

Consequently,

Σ_{j=1}^∞ P(|X̄_{n_j} − µ| > ϵ) ≤ (σ²/ϵ²) Σ_{j=1}^∞ 1/n_j ≤ (σ²/ϵ²) Σ_{j=1}^∞ (1/α)^j < ∞.

Therefore, by the Borel-Cantelli lemma, it follows that, with probability 1, |X̄_{n_j} − µ| > ϵ for only a finite number of j. As this is true for any ϵ > 0, we see that, with probability 1,

lim_{j→∞} X̄_{n_j} = µ. (4.8)

Because n_j → ∞ as j → ∞, it follows that for any m > α there is an integer j(m) such that n_{j(m)} ≤ m < n_{j(m)+1}. The nonnegativity of the X_i yields that

Σ_{i=1}^{n_{j(m)}} X_i ≤ Σ_{i=1}^m X_i ≤ Σ_{i=1}^{n_{j(m)+1}} X_i
for all but a finite number of m. Consequently, from (4.8) and the preceding, it follows, with probability 1, that

µ/(α + ϵ) < X̄_m < (α + ϵ)µ

for all but a finite number of values of m. As this is true for any ϵ > 0, α > 1, it follows that with probability 1,

lim_{m→∞} X̄_m = µ.
Thus the result is proven when the X_i are nonnegative. In the general case, let

X_i⁺ = { X_i, if X_i ≥ 0;  0, if X_i < 0 },  and  X_i⁻ = { 0, if X_i ≥ 0;  −X_i, if X_i < 0 }.

X_i⁺ and X_i⁻ are called, respectively, the positive and negative parts of X_i. Noting that X_i = X_i⁺ − X_i⁻, it follows from the previous result for nonnegative random variables that, with probability 1,

lim_{m→∞} (1/m) Σ_{i=1}^m X_i⁺ = µ⁺, lim_{m→∞} (1/m) Σ_{i=1}^m X_i⁻ = µ⁻,

where µ⁺ = E[X_i⁺] and µ⁻ = E[X_i⁻]. Therefore, with probability 1,

lim_{m→∞} (1/m) Σ_{i=1}^m X_i = µ⁺ − µ⁻ = µ.
■
Example 4.11. Suppose that a sequence of independent trials is performed. Let E be a fixed event and denote by P(E) the probability that E occurs on any particular trial. Letting

X_i = { 1, if E occurs on the ith trial;  0, if E does not occur on the ith trial },

we have by the Strong Law of Large Numbers that, with probability 1,

(X_1 + · · · + X_n)/n → E[X] = P(E). (4.9)
Since X1 + · · · + Xn represents the number of times that the event E occurs in the first n trials,
we may interpret equation (4.9) as stating that, with probability 1, the limiting proportion of time
that the event E occurs is just P (E).
5 Simulation
In this section we describe the principal methods that are used to generate random variables, taking as given a good U(0, 1) random variable generator. We begin with Monte Carlo integration and then describe the main methods for random variable generation, including inverse-transform, composition and acceptance-rejection. We also describe the generation of normal random variables and multivariate normal random vectors via the Cholesky decomposition. We end with a discussion of how to generate (non-homogeneous) Poisson processes as well as (geometric) Brownian motions.
This chapter is based on Liu (2001), Glasserman (2004), and Ross (2012).
5.1 Monte Carlo Integration

Suppose we wish to compute θ := ∫₀¹ g(x) dx. If we cannot compute θ analytically, then we could use numerical methods. However, we can also use simulation, and this can be especially useful for high-dimensional integrals. The key observation is to note that θ = E[g(U)], where U ∼ U(0, 1). We can use this observation as follows:

i. Generate U₁, U₂, . . . , Uₙ i.i.d. U(0, 1).
ii. Estimate θ with θ̂ₙ := (g(U₁) + · · · + g(Uₙ))/n.

There are two reasons that explain why θ̂ₙ is a good estimator:

1. θ̂ₙ is unbiased, i.e., E[θ̂ₙ] = θ, and
2. θ̂ₙ is consistent, i.e., θ̂ₙ → θ as n → ∞ with probability 1. This follows immediately from the Strong Law of Large Numbers (SLLN), since g(U₁), g(U₂), . . . , g(Uₙ) are i.i.d. with mean θ.
Example 5.1. Suppose we wish to estimate ∫₀¹ x³ dx using simulation. We know the exact answer is 1/4, but we can also estimate this using simulation. In particular, if we generate n U(0, 1) independent variables, cube them and then take their average, we will have an unbiased estimate.
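A minimal Python sketch of this estimator:

    import random

    n = 100_000
    theta_hat = sum(random.random() ** 3 for _ in range(n)) / n
    print(theta_hat)   # close to the exact value 0.25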
Example 5.2. We wish to estimate θ = ∫₁³ (x² + x) dx, again using simulation. Once again we know the exact answer (it's 38/3), but we can also estimate it by noting that

θ := ∫₁³ 2 (x² + x) (1/2) dx = 2E[X² + X],
where X ∼ U(1, 3). So we can estimate θ by generating n i.i.d. U(0, 1) random variables, converting them to U(1, 3) variables, X₁, . . . , Xₙ, and then taking

θ̂ₙ := 2 ( Σ_{i=1}^n (X_i² + X_i) ) / n.
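A sketch; each U(0, 1) draw is converted to a U(1, 3) variable via X = 1 + 2U:

    import random

    n = 100_000
    xs = [1 + 2 * random.random() for _ in range(n)]   # X_i ~ U(1, 3)
    theta_hat = 2 * sum(x ** 2 + x for x in xs) / n
    print(theta_hat)   # close to the exact value 38/3 = 12.666...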
Suppose now that θ := ∫₀¹ ∫₀¹ g(u₁, u₂) du₁ du₂. Then we can write θ = E[g(U₁, U₂)], where U₁, U₂ are i.i.d. U(0, 1) random variables. Note that the joint PDF satisfies f_{U₁,U₂}(u₁, u₂) = f_{U₁}(u₁) f_{U₂}(u₂) = 1 on [0, 1] × [0, 1]. As before, we can estimate θ using simulation by performing the following steps:

i. Generate n independent bivariate vectors (U₁^(i), U₂^(i)) for i = 1, . . . , n, with all U_j^(i)'s i.i.d. U(0, 1).
ii. Estimate θ with θ̂ₙ := (g(U₁^(1), U₂^(1)) + · · · + g(U₁^(n), U₂^(n)))/n.

As before, the SLLN justifies this approach and guarantees that θ̂ₙ → θ w.p. 1 as n → ∞.
Example 5.3 (Computing a Multi-Dimensional Integral). We can use Monte Carlo to estimate
    θ := ∫₀¹ ∫₀¹ (4x²y + y²) dx dy = E[4X²Y + Y²],
where X, Y are i.i.d. U(0,1). (The true value of θ is easily calculated to be 1.) We can also apply Monte Carlo integration to more general problems. For example, if we want to estimate
    θ = ∫∫_A g(x, y) f(x, y) dx dy,
where f(x, y) is a density function on A, then we observe that θ = E[g(X, Y)] where X, Y have joint density f(x, y). To estimate θ using simulation we simply generate n random vectors (X, Y) with joint density f(x, y) and then estimate θ with
    θ̂n := (g(X1, Y1) + · · · + g(Xn, Yn))/n.
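As an illustration, a sketch of Example 5.3 under the same assumptions (Python/NumPy; the sample size is illustrative):

    import numpy as np

    rng = np.random.default_rng(1)

    n = 100_000
    x = rng.uniform(size=n)                     # X_i i.i.d. U(0,1)
    y = rng.uniform(size=n)                     # Y_i i.i.d. U(0,1), independent of X
    theta_hat = np.mean(4 * x**2 * y + y**2)    # estimates E[4 X^2 Y + Y^2]
    print(theta_hat)                            # true value is 1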
5.2 Univariate Random Variables
The Inverse Transform Method for Discrete Random Variables. Suppose X is a discrete random variable with probability mass function (PMF) P(X = xi) = pi for i = 1, 2, 3, where p1 + p2 + p3 = 1. To generate a sample of X, we first generate U ∼ U(0,1) and then set
    X = x1 if 0 ≤ U ≤ p1,  X = x2 if p1 < U ≤ p1 + p2,  and X = x3 if p1 + p2 < U ≤ 1.
More generally, suppose X can take on n distinct values, x1 < x2 < . . . < xn , with
P (X = xi ) = pi for i = 1, . . . , n.
i. Generate U.
ii. Set X = xj if p1 + · · · + p_{j−1} < U ≤ p1 + · · · + pj. That is, we set X = xj if F(x_{j−1}) < U ≤ F(xj).
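In code, the search in step ii is a lookup in the vector of cumulative sums of the pi's. A sketch in Python/NumPy, where the support and the probabilities are illustrative choices of ours:

    import numpy as np

    rng = np.random.default_rng(2)

    x_vals = np.array([1.0, 2.5, 4.0])   # x_1 < x_2 < x_3 (hypothetical support)
    p = np.array([0.3, 0.5, 0.2])        # p_i = P(X = x_i)
    cdf = np.cumsum(p)                   # F(x_1), F(x_2), F(x_3)

    def sample_discrete(size):
        u = rng.uniform(size=size)
        # smallest j with F(x_{j-1}) < U <= F(x_j)
        j = np.searchsorted(cdf, u, side="left")
        return x_vals[j]

    x = sample_discrete(100_000)
    print([np.mean(x == v) for v in x_vals])   # approximately [0.3, 0.5, 0.2]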
Example 5.4 (Generating a Geometric Random Variable). Suppose X is geometric with parameter p, so that P(X = n) = (1 − p)^{n−1} p. Then we can generate X as follows:
i. Generate U .
ii. Set X = j if
    ∑_{i=1}^{j−1} (1 − p)^{i−1} p < U ≤ ∑_{i=1}^{j} (1 − p)^{i−1} p.
In particular, we set
    X = int( log(U) / log(1 − p) ) + 1,
where int(y) denotes the integer part of y. You should convince yourself that this is correct! How
does this compare to the coin-tossing method for generating X?
Example 5.5 (Generating a Poisson Random Variable). Suppose that X is Poisson(λ), so that P(X = n) = exp(−λ)λⁿ/n!. We can generate X as follows:
i. Generate U.
ii. Search sequentially for the value of X:
    set j = 0, p = e^{−λ}, F = p
    while U > F
        set p = λp/(j + 1), F = F + p, j = j + 1
    set X = j
How much work does this take? What if λ is large? Can we find j more efficiently?
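A direct transcription of this sequential search, as a Python sketch (the value of λ is illustrative):

    import numpy as np

    rng = np.random.default_rng(3)

    def poisson_inverse_transform(lam):
        # Sequential search of the Poisson CDF, using P(X = j+1) = lam/(j+1) * P(X = j).
        u = rng.uniform()
        j, p = 0, np.exp(-lam)   # p = P(X = 0)
        F = p                    # F = P(X <= j)
        while u > F:
            p *= lam / (j + 1)
            F += p
            j += 1
        return j

    x = [poisson_inverse_transform(4.0) for _ in range(100_000)]
    print(np.mean(x))            # close to lam = 4.0

Since the loop terminates once F exceeds U, the expected number of iterations is roughly λ, which is the point of the efficiency question above.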
Suppose now that X is a continuous random variable and we want to generate a value of X. Recall that when X was discrete, we could generate a variate by first generating U and then setting X = xj if F(x_{j−1}) < U ≤ F(xj). This suggests that when X is continuous, we might generate X as follows:
i. Generate U.
ii. Set X = F_X^{−1}(U).
To check that this is correct, note that
    P(X ≤ x) = P(F_X^{−1}(U) ≤ x) = P(U ≤ F_X(x)) = F_X(x),
as desired. This argument assumes F_X^{−1} exists, but there is no problem even when F_X^{−1} does not exist. All we have to do is
i. Generate U.
ii. Set X = min{x : F_X(x) ≥ U}.
An advantage of the inverse transform method is that it is 1-to-1, i.e., one U(0,1) variable produces one X variable. As we will see, this property can be useful for some variance reduction techniques.
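For instance, if X ∼ Exp(λ) then F_X(x) = 1 − e^{−λx} and F_X^{−1}(u) = −log(1 − u)/λ. A sketch (Python/NumPy assumed; the parameter is illustrative):

    import numpy as np

    rng = np.random.default_rng(4)

    def exponential_inverse_transform(lam, size):
        u = rng.uniform(size=size)
        return -np.log(1.0 - u) / lam    # X = F_X^{-1}(U); 1 - U may be replaced by U

    x = exponential_inverse_transform(2.0, 100_000)
    print(x.mean())                      # close to 1/lam = 0.5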
Disadvantages of the Inverse Transform Method
The principal disadvantage of the inverse transform method is that F_X^{−1} may not always be computable. For example, suppose X ∼ N(0,1). Then
    F_X(x) = ∫_{−∞}^{x} (1/√(2π)) exp(−z²/2) dz,
so that we cannot even express F_X in closed form. Even if F_X is available in closed form, it may not be possible to find F_X^{−1} in closed form. For example, suppose F_X(x) = x⁵(1 + x)³/8 for 0 ≤ x ≤ 1. Then we cannot compute F_X^{−1}. One possible solution to these problems is to find F_X^{−1} numerically.
Composition Approach
Another method for generating random variables is the composition approach. Suppose again that X has CDF F_X and that we wish to simulate a value of X. We can often write
    F_X(x) = ∑_{j=1}^∞ pj Fj(x),
where the Fj's are also CDFs, pj ≥ 0 for all j, and ∑_j pj = 1. Equivalently, if the densities exist, then we can write
    f_X(x) = ∑_{j=1}^∞ pj fj(x),
where the fj's are density functions. If it's difficult to simulate X directly using the inverse transform method, then we could use the composition algorithm (see below) instead.
Proposition 5.8 (Composition Algorithm). Assume a random variable X has CDF
    F_X(x) = ∑_{j=1}^∞ pj Fj(x),
where the Fj's are also CDFs, pj ≥ 0 for all j, and ∑_j pj = 1. If we follow the steps:
i. Generate a random variable I with P(I = j) = pj for j ≥ 1;
ii. Given I = j, generate Yj with CDF Fj;
iii. Set X = Yj;
then X has the desired CDF F_X.
Proof. We have
    P(X ≤ x) = ∑_{j=1}^∞ P(X ≤ x | I = j) P(I = j)
             = ∑_{j=1}^∞ P(Yj ≤ x) P(I = j)
             = ∑_{j=1}^∞ Fj(x) pj
             = F_X(x). ■
The proof actually suggests that the composition approach might arise naturally from “sequential”
type experiments. Consider the following example.
Example 5.9 (A Sequential Experiment). Suppose we roll a die and let Y ∈ {1, 2, 3, 4, 5, 6} be the outcome. If Y = i, then we generate Zi from the distribution Fi and set X = Zi. What is the distribution of X? How do we simulate a value of X?
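To illustrate Proposition 5.8, here is a sketch that simulates from a two-component exponential mixture (a hyperexponential distribution); the weights and rates are illustrative choices of ours:

    import numpy as np

    rng = np.random.default_rng(5)

    p = np.array([0.3, 0.7])       # mixture weights p_j with sum 1
    rates = np.array([1.0, 5.0])   # F_j is the Exp(rates[j]) CDF

    def sample_mixture(size):
        i = rng.choice(len(p), size=size, p=p)        # step i: P(I = j) = p_j
        return rng.exponential(scale=1.0 / rates[i])  # steps ii-iii: X = Y_I

    x = sample_mixture(100_000)
    print(x.mean())   # compare with 0.3/1.0 + 0.7/5.0 = 0.44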
Acceptance-Rejection Algorithm
Let X be a random variable with density, f(·), and CDF, F_X(·). Suppose it's hard to simulate a value of X directly using either the inverse transform or composition algorithm. We might then wish to use the acceptance-rejection (A-R) algorithm. Suppose we can find a density g(·) from which it is easy to simulate, and a constant a such that
    f(x)/g(x) ≤ a, for all x.
Proposition 5.10 (Acceptance-Rejection Algorithm). If we follow the steps
    generate Y with density g
    generate U
    while U > f(Y)/(a g(Y))
        generate Y
        generate U
    set X = Y,
then X has the desired density f(·).
Proof. We define B to be the event that Y has been accepted in the while loop, i.e., U ≤ f(Y)/(a g(Y)). We need to show that P(X ≤ x) = F_X(x). First observe that
    P(X ≤ x) = P(Y ≤ x | B) = P((Y ≤ x) ∩ B)/P(B).
Conditioning on Y then shows that P(B) = 1/a and P((Y ≤ x) ∩ B) = F_X(x)/a, so that P(X ≤ x) = F_X(x), as required. ■
Example 5.11 (Generating a β(a, b) Random Variable). Recall that X has a β(a, b) distribution if f(x) = c x^{a−1}(1 − x)^{b−1} for 0 ≤ x ≤ 1, where c is a normalizing constant. Suppose now that we wish to simulate from the β(4, 3) distribution, so that
    f(x) = 60x³(1 − x)², for 0 ≤ x ≤ 1.
We could, for example, integrate f(·) to find F(·), and then try to use the inverse transform approach. However, it is hard to find F^{−1}(·). Instead, let's use the acceptance-rejection algorithm:
i. First choose g(y): let's take g(y) = 1 for y ∈ [0, 1], i.e., Y ∼ U(0,1).
ii. Then choose a so that f(y)/g(y) = f(y) ≤ a for all y ∈ [0, 1]. Take a = 3; it is easy to check that this value works.
We then have the following algorithm:
    generate Y ∼ U(0,1)
    generate U ∼ U(0,1)
    while U > f(Y)/3
        generate Y
        generate U
    set X = Y.
The number of iterations, N, required until a candidate is accepted is geometric with mean E[N] = a, so clearly we would like a to be as small as possible. Usually, this is just a matter of calculus.
Example 5.12 (Generating a β(a, b) Random Variable revisited). Recall the β(4,3) example with PDF f(x) = 60x³(1 − x)², for x ∈ [0, 1]. We chose g(y) = 1 for y ∈ [0, 1], so that Y ∼ U(0,1). The constant a had to satisfy
    f(x)/g(x) ≤ a, for all x ∈ [0, 1],
and we chose a = 3. We can do better by choosing
    a = max_{x∈[0,1]} f(x)/g(x) = f(3/5) ≈ 2.0736.
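A sketch of the resulting A-R algorithm for the β(4,3) density, using this improved constant (Python/NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(6)

    def f(x):
        return 60.0 * x**3 * (1.0 - x)**2     # beta(4,3) density on [0,1]

    a = f(3.0 / 5.0)                          # max of f/g with g = U(0,1) density, ~2.0736

    def sample_beta43():
        while True:
            y = rng.uniform()                 # Y ~ g
            u = rng.uniform()
            if u <= f(y) / a:                 # accept with probability f(Y)/(a g(Y))
                return y

    x = np.array([sample_beta43() for _ in range(50_000)])
    print(x.mean())                           # true mean of beta(4,3) is 4/7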
How Do We Choose g(·)? We would like to choose g(·) to minimize the computational load. This can be achieved by taking g(·) “close” to f(·). Then a will be close to 1, and so fewer iterations will be required in the A-R algorithm. There is a tradeoff, however: if g(·) is “close” to f(·), then it will probably also be hard to simulate from g(·). So we often need to find a balance between having a “nice” g(·) and a small value of a.
So far, we have expressed the A-R algorithm in terms of PDFs, thereby implicitly assuming that we are generating continuous random variables. However, the A-R algorithm also works for discrete random variables, where we simply replace PDFs with PMFs. Letting p denote the target PMF and q the proposal PMF, with py/qy ≤ a for all y, the algorithm is:
    generate Y with PMF q
    generate U
    while U > pY/(a qY)
        generate Y
        generate U
    set X = Y.
Generally, we would use this A-R algorithm when we can simulate Y efficiently.
Example 5.13 (Generating from a Uniform Distribution over a 2-D Region). Suppose (X, Y ) is
uniformly distributed over a 2-dimensional area, A. How would you simulate a sample of (X, Y )
? Note first that if X ∼ U (−1, 1), Y ∼ U (−1, 1) and X and Y are independent then (X, Y ) is
uniformly distributed over the region
A := {(x, y) : −1 ≤ x ≤ 1, −1 ≤ y ≤ 1}.
We can therefore (how?) simulate a sample of (X, Y ) when A is a square. Suppose now that A is
a circle of radius 1 centered at the origin. How do we simulate a sample of (X, Y ) in that case?
Remark 5.14. The A-R algorithm is an important algorithm for generating random variables. Moreover, it can be used to generate samples from distributions that are only known up to a constant. Nevertheless, it is very inefficient in high dimensions.
5.3 Gaussian Random Variables
Suppose now that we wish to generate X ∼ N(µ, σ²). Since we can write X = µ + σZ with Z ∼ N(0,1), we need only concern ourselves with generating N(0,1) random variables. One possibility for doing this is to use the inverse transform method. But we would then have to use numerical methods since we cannot find F_Z^{−1}(·) = Φ^{−1}(·) in closed form. Other approaches for generating N(0,1) random variables include the Box-Muller algorithm, the polar method and rational approximations, all described below.
There are many other methods such as the A-R algorithm that could also be used to generate
N (0, 1) random variables.
The Box-Muller algorithm uses two i.i.d. U (0, 1) random variables to produce two i.i.d. N (0, 1)
random variables. Its working is detailed next.
Proposition 5.15 (Box-Muller Algorithm). Let U1, U2 be two i.i.d. U(0,1) random variables, and set
    X = √(−2 log(U1)) cos(2πU2) and Y = √(−2 log(U1)) sin(2πU2).
Then X and Y are i.i.d. N(0,1) random variables.
Proof. The joint PDF of two i.i.d. N(0,1) random variables is
    f(x, y) = (1/√(2π)) exp(−x²/2) · (1/√(2π)) exp(−y²/2).
Write (X, Y) in polar coordinates, X = R cos(θ) and Y = R sin(θ), so that R and θ are the polar coordinates of (X, Y). Note that R = √(−2 log(U1)) and θ = 2πU2.
Since U1 and U2 are i.i.d., R and θ are independent. Clearly θ ∼ U(0, 2π), so
    fθ(θ) = 1/(2π), for 0 ≤ θ ≤ 2π,
and since P(R ≤ r) = P(−2 log(U1) ≤ r²) = P(U1 ≥ e^{−r²/2}) = 1 − e^{−r²/2}, we have
    fR(r) = r e^{−r²/2}, for r ≥ 0.
Therefore,
    f_{R,θ}(r, θ) = (1/(2π)) r e^{−r²/2}, for 0 ≤ θ ≤ 2π, r ≥ 0.
This implies
    P(X ≤ x1, Y ≤ y1) = P(R cos(θ) ≤ x1, R sin(θ) ≤ y1)
                      = ∫∫_A (1/(2π)) r e^{−r²/2} dr dθ,    (5.10)
where A := {(r, θ) : r cos(θ) ≤ x1, r sin(θ) ≤ y1}. We now transform back to (x, y) coordinates with x = r cos(θ) and y = r sin(θ), and note that dx dy = r dr dθ, i.e., the Jacobian of the transformation is r. We then use (5.10) to obtain
    P(X ≤ x1, Y ≤ y1) = (1/(2π)) ∫_{−∞}^{x1} ∫_{−∞}^{y1} exp(−(x² + y²)/2) dy dx
                      = (1/√(2π)) ∫_{−∞}^{x1} exp(−x²/2) dx · (1/√(2π)) ∫_{−∞}^{y1} exp(−y²/2) dy,
as required. ■
One disadvantage of the Box-Muller method is that computing sines and cosines is inefficient. We can get around this problem using the polar method, which is described in the algorithm below:
    set S = 2
    while S > 1
        generate U1, U2 i.i.d. U(0,1)
        set V1 = 2U1 − 1, V2 = 2U2 − 1, S = V1² + V2²
    set X = V1 √(−2 log(S)/S), Y = V2 √(−2 log(S)/S).
Then X and Y are i.i.d. N(0,1). See Chapter 5 of Simulation by Ross (2012) for further details about this algorithm.
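Returning to the Box-Muller transform of Proposition 5.15, a minimal sketch (Python/NumPy assumed; the sample size is illustrative):

    import numpy as np

    rng = np.random.default_rng(7)

    def box_muller(n):
        # n pairs of uniforms produce 2n i.i.d. N(0,1) samples
        u1 = rng.uniform(size=n)
        u2 = rng.uniform(size=n)
        r = np.sqrt(-2.0 * np.log(u1))    # R = sqrt(-2 log U1)
        theta = 2.0 * np.pi * u2          # theta = 2*pi*U2
        return np.concatenate([r * np.cos(theta), r * np.sin(theta)])

    z = box_muller(50_000)
    print(z.mean(), z.var())              # approximately 0 and 1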
Rational Approximations
Let X ∼ N(0,1) and recall that Φ(x) = P(X ≤ x) is the CDF of X. If U ∼ U(0,1), then the inverse transform method seeks xu = Φ^{−1}(U). Finding Φ^{−1} in closed form is not possible, but instead we can use rational approximations, which are very accurate and efficient methods for estimating xu. One such approximation is
    xu ≈ t − (a0 + a1 t)/(1 + b1 t + b2 t²),
where a0, a1, b1 and b2 are constants, and t = √(−2 log(1 − u)). The error is bounded in this case by 0.003. Even more accurate approximations are available, and since they are very fast, many packages (e.g. Python or Matlab) use them for generating normal random variables.
If the n-dimensional vector X is multivariate normal with mean vector µ and covariance matrix Σ, then we write
    X ∼ MNn(µ, Σ).
The standard multivariate normal has µ = 0 and Σ = In, the n × n identity matrix. The PDF of X is given by
    f(x) = (1/((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)ᵀ Σ^{−1} (x − µ)),
and its characteristic function is
    ϕ_X(t) := E[e^{i tᵀX}] = exp(i tᵀµ − (1/2) tᵀΣt).
Notice that it is possible to partition X into X1 = (X1, . . . , Xk)ᵀ and X2 = (X_{k+1}, . . . , Xn)ᵀ. If we extend this notation naturally so that
    µ = (µ1, µ2)ᵀ and Σ = [Σ11 Σ12; Σ21 Σ22],
then we obtain the following results regarding the marginal and conditional distributions of X.
Marginal Distribution
The marginal distribution of a multivariate normal random vector is itself multivariate normal.
In particular,
Xi ∼ MN (µi , Σii ) , for i = 1, 2.
Conditional Distribution
Assuming Σ is positive definite, the conditional distribution of a multivariate normal distribution is also a multivariate normal distribution. In particular,
    X2 | X1 = x1 ∼ MN(µ_{2.1}, Σ_{2.1}),
where µ_{2.1} = µ2 + Σ21 Σ11^{−1} (x1 − µ1) and Σ_{2.1} = Σ22 − Σ21 Σ11^{−1} Σ12.
Linear Combinations
Linear combinations of multivariate normal random vectors remain normally distributed with
mean vector and covariance matrix given by
E[AX + a] = AE[X] + a,
Cov(AX + a) = A Cov(X)A⊤ .
Suppose that we wish to generate X = (X1, . . . , Xn)ᵀ where X ∼ MNn(0, Σ). Note that it is then easy to handle the case where E[X] ≠ 0. Let Z = (Z1, . . . , Zn)ᵀ where the Zi's are i.i.d. N(0,1) for i = 1, . . . , n. If C is an (n × m) matrix, then it follows that
    CᵀZ ∼ MN(0, CᵀC).
Our problem therefore reduces to finding C such that CᵀC = Σ. We can use the Cholesky decomposition of Σ to find such a matrix, C.
Any symmetric positive-definite matrix Σ can be written as Σ = UᵀDU, where U is an upper triangular matrix and D is a diagonal matrix with positive diagonal elements. Since Σ is symmetric positive-definite, we can therefore write
    Σ = UᵀDU = (Uᵀ√D)(√D U) = (√D U)ᵀ(√D U).
The matrix C = √D U therefore satisfies CᵀC = Σ. It is said to be the Cholesky decomposition of Σ.
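A sketch of the resulting sampler (Python/NumPy assumed; µ and Σ are illustrative choices of ours). Note that numpy's cholesky returns a lower-triangular L with LLᵀ = Σ, so L plays the role of Cᵀ above:

    import numpy as np

    rng = np.random.default_rng(8)

    mu = np.array([1.0, -2.0])                # illustrative mean vector
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])            # illustrative SPD covariance matrix

    L = np.linalg.cholesky(Sigma)             # L @ L.T == Sigma

    def sample_mvn(size):
        z = rng.standard_normal(size=(size, len(mu)))  # rows of i.i.d. N(0,1)
        return mu + z @ L.T                            # each row ~ MN(mu, Sigma)

    x = sample_mvn(100_000)
    print(x.mean(axis=0))                     # approximately mu
    print(np.cov(x, rowvar=False))            # approximately Sigma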
5.4 Stochastic Processes
5.4.1 Poisson Process
Recall that a Poisson process, N(t), with intensity λ is a process such that
    P(N(t) = r) = (λt)^r e^{−λt} / r!.
For a Poisson process the numbers of arrivals in non-overlapping intervals are independent and
the distribution of the number of arrivals in an interval only depends on the length of the interval.
The Poisson process is good for modeling many phenomena including the emission of particles
from a radioactive source and the arrivals of customers to a queue. The ith inter-arrival time, Xi ,
is defined to be the interval between the (i − 1)th and ith arrivals of the Poisson process, and it is
easy to see that the Xi ’s are i.i.d. ∼ Exp(λ).
In particular, this means we can simulate a Poisson process with intensity λ by simply generating
the inter-arrival times, Xi , where Xi ∼ Exp(λ).
We have the following algorithm for simulating the first T time units of a Poisson process, where S(I) denotes the time of the Ith arrival:
    set t = 0, I = 0
    generate U
    set t = t − log(U)/λ
    while t < T
        set I = I + 1, S(I) = t
        generate U
        set t = t − log(U)/λ
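A sketch of this algorithm in Python (NumPy assumed; λ and T are illustrative), returning the arrival times in [0, T]:

    import numpy as np

    rng = np.random.default_rng(9)

    def poisson_process(lam, T):
        arrivals = []
        t = -np.log(rng.uniform()) / lam          # first inter-arrival ~ Exp(lam)
        while t < T:
            arrivals.append(t)
            t += -np.log(rng.uniform()) / lam     # next inter-arrival
        return np.array(arrivals)

    s = poisson_process(2.0, 10.0)
    print(len(s))                                 # on average lam * T = 20 arrivals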
In many applications the arrival rate varies with time, which leads us to non-homogeneous Poisson processes. More formally, if λ(t) ≥ 0 is the intensity of the process at time t, then we say that N(t) is a non-homogeneous Poisson process with intensity λ(t). Define the function m(t) by
    m(t) := ∫₀ᵗ λ(s) ds.
Then it can be shown that N(t + s) − N(t) is a Poisson random variable with parameter m(t + s) − m(t), i.e.,
    P(N(t + s) − N(t) = r) = exp(−(m(t + s) − m(t))) (m(t + s) − m(t))^r / r!.
Before we describe the thinning algorithm for simulating a non-homogeneous Poisson process, we
first need the following proposition.
Proposition 5.17. Let N (t) be a Poisson process with constant intensity λ. Suppose that an
arrival that occurs at time t is counted with probability p(t), independently of what has happened
beforehand. Then the process of counted arrivals is a non-homogeneous Poisson process with
intensity λ(t) = λp(t).
The proof can be found in Chapter 11 of Introduction to Probability Models by Ross (2019).
Suppose now N (t) is a non-homogeneous Poisson process with intensity λ(t) and that there
exists a λ such that λ(t) ≤ λ for all t ≤ T . Then we can use the following algorithm, based on
Proposition 5.17, to simulate N (t).
    set t = 0, I = 0
    generate U1
    set t = t − log(U1)/λ
    while t < T
        generate U2
        if U2 ≤ λ(t)/λ then set I = I + 1, S(I) = t
        generate U1
        set t = t − log(U1)/λ
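A sketch of the thinning algorithm in Python (NumPy assumed; the intensity λ(t) = 3 + 2 sin(t) with bound λ = 5 is an illustrative choice of ours):

    import numpy as np

    rng = np.random.default_rng(10)

    def thinned_poisson(lam_t, lam_bar, T):
        # lam_t(t) <= lam_bar must hold for all t <= T
        arrivals = []
        t = -np.log(rng.uniform()) / lam_bar
        while t < T:
            if rng.uniform() <= lam_t(t) / lam_bar:   # count arrival w.p. lam(t)/lam_bar
                arrivals.append(t)
            t += -np.log(rng.uniform()) / lam_bar
        return np.array(arrivals)

    s = thinned_poisson(lambda t: 3.0 + 2.0 * np.sin(t), 5.0, 10.0)
    print(len(s))        # on average m(10) = 30 + 2(1 - cos(10)) ~ 33.7 arrivals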
5.4.2 Brownian Motion
Definition 5.18. A stochastic process, {Xt : t ≥ 0}, is a Brownian motion with parameters (µ, σ) if it satisfies:
i. it has continuous sample paths;
ii. it has independent increments; and
iii. X_{t+s} − X_t ∼ N(µs, σ²s) for all t ≥ 0 and s > 0.
We say that X is a B(µ, σ) Brownian motion with drift, µ, and volatility, σ. When µ = 0 and
σ = 1 we have a standard Brownian motion (SBM). We will use Bt to denote a SBM and we will
always assume (unless otherwise stated) that B0 = 0. Note that if X ∼ B(µ, σ) and X0 = x then
we can write
Xt = x + µt + σBt ,
where B is a SBM. We will usually write a B(µ, σ) Brownian motion in this way.
Remark 5.19. Bachelier (1900) and Einstein (1905) were the first to explore Brownian motion
from a mathematical viewpoint whereas Wiener (1920) was the first to show that it actually exists
as a well-defined mathematical entity.
Questions: (i) What is E [Bt+s Bs ]?; (ii) What is E [Xt+s Xs ] where X ∼ B(µ, σ)?; and (iii) Let
B be a SBM and let Zt := |Bt |. What is the CDF of Zt for t fixed?
Simulating a Standard Brownian Motion
It is not possible to simulate an entire sample path of Brownian motion between 0 and T as this
would require an infinite number of random variables. This is not always a problem, however,
since we often only wish to simulate the value of Brownian motion at certain fixed points in time.
For example, we may wish to simulate Bti for t1 < t2 < . . . < tn , as opposed to simulating Bt for
every t ∈ [0, T ].
Sometimes, however, the quantity of interest, θ, that we are trying to estimate does indeed depend
on the entire sample path of Bt in [0, T ]. In this case, we can still estimate θ by again simulating
Bti for t1 < t2 < . . . < tn but where we now choose n to be very large. We might, for example,
choose n so that |ti+1 − ti | < ϵ for all i where ϵ > 0 is very small. By choosing ϵ to be sufficiently
small, we hope to minimize the numerical error (as opposed to the statistical error), in estimating
θ.
In either case, we need to be able to simulate Bti for t1 < t2 < . . . < tn and for a fixed n. We will now see how to do this. The first observation we make is that the increments
    B_{t1} − B_{t0}, B_{t2} − B_{t1}, . . . , B_{tn} − B_{t_{n−1}}
are mutually independent, and for s > 0, B_{t+s} − B_t ∼ N(0, s).
begin with t0 = 0 and Bt0 = 0. We then generate Bt1 which we can do since Bt1 ∼ N (0, t1 ).
We now generate Bt2 by first observing that Bt2 = Bt1 + (Bt2 − Bt1 ). Then since (Bt2 − Bt1 )
is independent of Bt1 , we can generate Bt2 by generating an N (0, t2 − t1 ) random variable and
simply adding it to Bt1 .
More generally, if we have already generated Bti then we can generate Bti+1 by generating an
N (0, ti+1 − ti ) random variable and adding it to Bti . We have the following algorithm:
    set t0 = 0, B_{t0} = 0
    for i = 1 to n
        generate X ∼ N(0, ti − t_{i−1})
        set B_{ti} = B_{t_{i−1}} + X
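A vectorized sketch of this algorithm (Python/NumPy assumed; the time grid is illustrative). The cumulative sum implements exactly the conditional construction above:

    import numpy as np

    rng = np.random.default_rng(11)

    def brownian_path(times):
        # times: increasing simulation times t_1 < ... < t_n, with t_0 = 0
        t = np.concatenate([[0.0], np.asarray(times)])
        dt = np.diff(t)                            # t_i - t_{i-1}
        increments = rng.normal(0.0, np.sqrt(dt))  # N(0, t_i - t_{i-1}) increments
        return np.cumsum(increments)               # B_{t_i} = B_{t_{i-1}} + increment

    b = brownian_path(np.linspace(0.01, 1.0, 100))
    print(b[-1])                                   # a sample of B_1 ~ N(0, 1)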
Remark 5.20. It is very important that when you generate Bti+1 , you do so conditional on the
value of Bti . If you generate Bti and Bti+1 independently of one another then you are effectively
simulating from different sample paths of the Brownian motion. This is not correct! In fact when
we generate (Bt1 , Bt2 , . . . , Btn ) we are actually generating a random vector that does not consist
of i.i.d. random variables.
Simulating a B(µ, σ) Brownian Motion
Suppose now that we want to simulate a B(µ, σ), X, at the times t1 , t2 , . . . , tn−1 , tn . Then
all we have to do is simulate an SBM, (Bt1 , Bt2 , . . . , Btn ), and use our earlier observation that
Xt = X0 + µt + σBt .
Definition 5.21. A stochastic process, {Xt : t ≥ 0}, is a (µ, σ) geometric Brownian motion (GBM) if log(X) ∼ B(µ − σ²/2, σ). We write X ∼ GBM(µ, σ). The following property of GBM follows immediately from the definition:
i. If Xt > 0, then X_{t+s} is always positive for any s > 0, so limited liability is not violated.
This suggests that GBM might be a reasonable model for stock prices. In fact, we will often model stock prices as GBMs in this course, and we will generally use the following notation: S0 is the known stock price at t = 0, St is the random stock price at time t, and
    St = S0 e^{(µ − σ²/2)t + σBt},
where B is a standard BM. The drift is µ, σ is the volatility, and S is therefore a GBM(µ, σ) process that begins at S0.
Suppose now that we wish to simulate S ∼ GBM(µ, σ). Then it is not hard to see that
    S_{t+∆t} = St e^{(µ − σ²/2)∆t + σ(B_{t+∆t} − Bt)},
so that we can simulate S_{t+∆t} conditional on St for any ∆t > 0 by simply simulating an N(0, ∆t) random variable.
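A sketch that simulates a GBM path on a fixed time grid by exponentiating a B(µ − σ²/2, σ) path (Python/NumPy assumed; all parameter values are illustrative):

    import numpy as np

    rng = np.random.default_rng(12)

    def gbm_path(s0, mu, sigma, times):
        t = np.concatenate([[0.0], np.asarray(times)])
        dt = np.diff(t)
        db = rng.normal(0.0, np.sqrt(dt))               # Brownian increments
        log_incr = (mu - 0.5 * sigma**2) * dt + sigma * db
        return s0 * np.exp(np.cumsum(log_incr))         # S_{t_i}, step-by-step conditional

    s = gbm_path(100.0, 0.05, 0.2, np.linspace(0.01, 1.0, 100))
    print(s[-1])        # a sample of S_1; note E[S_1] = 100 * exp(0.05)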
5.5 Variance Reduction
Suppose that we wish to estimate
    θ := E[g(X1, . . . , Xn)],
where g is some specified function. It is often the case that it is not possible to analytically compute the preceding, and when such is the case we can attempt to use simulation to estimate θ.
This is done as follows: Generate X1^(1), . . . , Xn^(1) having the same joint distribution as X1, . . . , Xn and set
    Y1 = g(X1^(1), . . . , Xn^(1)).
Now, simulate a second set of random variables (independent of the first set), X1^(2), . . . , Xn^(2), having the distribution of X1, . . . , Xn, and set
    Y2 = g(X1^(2), . . . , Xn^(2)).
Continue this until you have generated k (some predetermined number) sets, and so have also computed Y1, Y2, . . . , Yk. Now, Y1, . . . , Yk are independent and identically distributed random variables, each having the same distribution as g(X1, . . . , Xn).
Thus, if we let Ȳ denote the average of these k random variables, that is,
    Ȳ = (1/k) ∑_{i=1}^k Yi,
then
    E[Ȳ] = θ.
Hence, we can use Ȳ as an estimate of θ. As the expected square of the difference between Ȳ and θ is equal to the variance of Ȳ, we would like this quantity to be as small as possible. In the preceding situation, Var(Ȳ) = Var(Yi)/k, which is usually not known in advance but must be estimated from the generated values Y1, . . . , Yk.
Simulation Efficiency
Suppose as usual that we wish to estimate θ := E[h(X)]. Then the standard simulation algorithm is:
i. Generate X1, . . . , Xn.
ii. Estimate θ with θ̂n = (1/n) ∑_{j=1}^n Yj, where Yj := h(Xj).
iii. Approximate 100(1 − α)% confidence intervals (CIs) are then given by
    [θ̂n − z_{1−α/2} σ̂n/√n, θ̂n + z_{1−α/2} σ̂n/√n],
where σ̂n is the usual estimate of Var(Y) based on Y1, . . . , Yn.
One way to measure the quality of the estimator, θ̂n, is by the half-width, HW, of the confidence interval. For a fixed α, we have
    HW = z_{1−α/2} √(Var(Y)/n).
We would like HW to be small, but sometimes this is difficult to achieve. This may be because
Var(Y ) is too large, or too much computational effort is required to simulate each Yj so that n is
necessarily small, or some combination of the two.
Before proceeding to study techniques for variance reduction, we should first describe a measure
of simulation efficiency. Suppose there are two random variables, W and Y , such that E[W ] =
E[Y ] = θ. Then we could choose to either simulate W1 , . . . , Wn or Y1 , . . . , Yn in order to estimate
θ. Let Mw denote the method of estimating θ by simulating the Wi ’s. My is similarly defined.
Which method is more efficient, Mw or My ? To answer this, let nw and ny be the number of
samples of W and Y, respectively, that are needed to achieve a half-width, HW. Then we know that
    nw = (z_{1−α/2}/HW)² Var(W),
    ny = (z_{1−α/2}/HW)² Var(Y).
Let Ew and Ey denote the amount of computational effort required to produce one sample of W and Y, respectively. Then the total effort expended by Mw and My, respectively, to achieve a half-width HW are
    TEw = (z_{1−α/2}/HW)² Var(W) Ew,
    TEy = (z_{1−α/2}/HW)² Var(Y) Ey.
We then say that Mw is more efficient than My if TEw < TEy. Note that TEw < TEy if and only if
    Var(W) Ew < Var(Y) Ey.
We will use the quantity Var(W)Ew as a measure of the efficiency of the simulator, Mw. Note that the last inequality implies that we cannot conclude that one simulation algorithm, Mw, is better than another, My, simply because Var(W) < Var(Y); we also need to take Ew and Ey into consideration. However, it is often the case that we have two simulators available to us, Mw and My, where Ew ≈ Ey and Var(W) ≪ Var(Y). In such cases it is clear that using Mw provides a substantial improvement over using My.
As a result, it is often imperative to address the issue of simulation efficiency. There are a number of things we can do:
i. Develop a good simulation algorithm.
ii. Program carefully to minimize storage requirements. For example, we do not need to store all the Yj's: we only need to keep track of ∑_j Yj and ∑_j Yj² to compute θ̂n and approximate CIs.
iii. Program carefully to minimize execution time.
iv. Decrease the variability of the simulation output that we use to estimate θ. The techniques used to do this are usually called variance reduction techniques.
We will now study some of the simplest variance reduction techniques, and assume that we are doing items (i.) to (iii.) as well as possible.
5.5.1 Antithetic Variables
In the preceding situation, suppose that we have generated Y1 and Y2, identically distributed random variables having mean θ. Then
    Var((Y1 + Y2)/2) = (1/4)[Var(Y1) + Var(Y2) + 2 Cov(Y1, Y2)]
                     = Var(Y1)/2 + Cov(Y1, Y2)/2.
Hence, it would be advantageous (in the sense that the variance would be reduced) if Y1 and Y2
rather than being independent were negatively correlated. To see how we could arrange this, let
us suppose that the random variables X1 , . . . , Xn are independent and, in addition, that each is
simulated via the inverse transform technique.
That is, Xi is simulated from Fi^{−1}(Ui), where Ui is a random number and Fi is the CDF of Xi. Hence, Y1 can be expressed as
    Y1 = g(F1^{−1}(U1), . . . , Fn^{−1}(Un)).
Now, since 1 − U is also uniform over (0,1) whenever U is a random number (and is negatively correlated with U), it follows that Y2 defined by
    Y2 = g(F1^{−1}(1 − U1), . . . , Fn^{−1}(1 − Un))
will have the same distribution as Y1.
will have the same distribution as Y1 . Hence, if Y1 and Y2 were negatively correlated, then
generating Y2 by this means would lead to a smaller variance than if it were generated by a new
set of random numbers. In addition, there is a computational savings since rather than having to
generate n additional random numbers, we need only subtract each of the previous n from 1.
The following theorem will be the key to showing that this technique, known as the use of
antithetic variables, will lead to a reduction in variance whenever g is a monotone function.
Theorem 5.22. If X1 , . . . , Xn are independent, then, for any increasing functions f and g of n
variables,
E[f (X)g(X)] ≥ E[f (X)]E[g(X)], (5.12)
where X = (X1 , . . . , Xn ).
Proof. We use induction on n. Consider first the case n = 1, and let f and g be increasing functions of a single variable. Then, for any x and y,
    (f(x) − f(y))(g(x) − g(y)) ≥ 0,
since if x ≥ y (x ≤ y) then both factors are nonnegative (nonpositive). Hence, for any random variables X and Y,
    (f(X) − f(Y))(g(X) − g(Y)) ≥ 0,
implying that
    E[(f(X) − f(Y))(g(X) − g(Y))] ≥ 0,
or, equivalently,
    E[f(X)g(X)] + E[f(Y)g(Y)] ≥ E[f(X)g(Y)] + E[f(Y)g(X)].
If we suppose that X and Y are independent and identically distributed, as in this case, then
    E[f(X)g(X)] = E[f(Y)g(Y)] and E[f(X)g(Y)] = E[f(Y)g(X)] = E[f(X)]E[g(X)],
and so we obtain the result when n = 1. So assume that (5.12) holds for n − 1 variables, and now suppose that X1, . . . , Xn are independent and f and g are increasing functions. Then, applying the induction hypothesis to the n − 1 variables X1, . . . , X_{n−1},
    E[f(X)g(X) | Xn = xn] ≥ E[f(X) | Xn = xn] E[g(X) | Xn = xn].
Hence,
    E[f(X)g(X) | Xn] ≥ E[f(X) | Xn] E[g(X) | Xn],
and taking expectations of both sides yields
    E[f(X)g(X)] ≥ E[ E[f(X) | Xn] E[g(X) | Xn] ] ≥ E[f(X)]E[g(X)].
The last inequality follows because E[f(X) | Xn] and E[g(X) | Xn] are both increasing functions of Xn, and so, by the result for n = 1,
    E[ E[f(X) | Xn] E[g(X) | Xn] ] ≥ E[ E[f(X) | Xn] ] E[ E[g(X) | Xn] ] = E[f(X)]E[g(X)]. ■
We now discuss the circumstances under which a variance reduction can be guaranteed. Consider
first the case where U is a uniform random variable so that m = 1, U = U and θ = E[h(U )].
Suppose now that h(·) is a non-decreasing function of u over [0,1]. Then if U is large, h(U) will also tend to be large, while 1 − U and h(1 − U) will tend to be small. That is, Cov(h(U), h(1 − U)) < 0. We can similarly conclude that if h(·) is a non-increasing function of u then, once again, Cov(h(U), h(1 − U)) < 0. So for the case where m = 1, a sufficient condition to guarantee a variance reduction is for h(·) to be a monotonic function of u on [0,1].
Let us now consider the more general case where m > 1, U = (U1, . . . , Um) and θ = E[h(U)]. We say h(u1, . . . , um) is a monotonic function of each of its m arguments if, in each of its arguments, it is non-increasing or non-decreasing. We have the following result, which generalizes the m = 1 case above.
Corollary 5.23. If k(u1, . . . , un) is a monotonic function of each of its arguments then, for i.i.d. U(0,1) random variables U1, . . . , Un,
    Cov(k(U1, . . . , Un), k(1 − U1, . . . , 1 − Un)) ≤ 0.
This follows by applying Theorem 5.22 to the functions f(u1, . . . , un) = k(u1, . . . , un) and g(u1, . . . , un) = −k(1 − u1, . . . , 1 − un), both increasing when k is increasing in each argument, which yields
    Cov(k(U1, . . . , Un), −k(1 − U1, . . . , 1 − Un)) ≥ 0.
Since Fi^{−1}(Ui) is increasing in Ui (as Fi, being a distribution function, is increasing), it follows that g(F1^{−1}(U1), . . . , Fn^{−1}(Un)) is a monotone function of U1, . . . , Un whenever g is monotone. Hence, if g is monotone, the antithetic variable approach of twice using each set of random numbers U1, . . . , Un will reduce the variance of the estimate of E[g(X1, . . . , Xn)]: first compute g(F1^{−1}(U1), . . . , Fn^{−1}(Un)), and then g(F1^{−1}(1 − U1), . . . , Fn^{−1}(1 − Un)).
That is, rather than generating k sets of n random numbers, we should generate k/2 sets and use
each set twice.
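As a numerical illustration (a toy example of our own, not from the notes), take θ = E[e^U] = e − 1 with U ∼ U(0,1); since h(u) = e^u is monotone, a variance reduction is guaranteed. A Python/NumPy sketch comparing the two estimators at equal computational budget:

    import numpy as np

    rng = np.random.default_rng(13)
    n = 100_000

    # ordinary estimator: n i.i.d. uniforms
    u = rng.uniform(size=n)
    y_plain = np.exp(u)

    # antithetic estimator: n/2 uniforms, each used twice
    u_half = rng.uniform(size=n // 2)
    y_anti = 0.5 * (np.exp(u_half) + np.exp(1.0 - u_half))

    print(y_plain.mean(), y_anti.mean())                 # both estimate e - 1
    print(y_plain.var() / n, y_anti.var() / (n // 2))    # antithetic variance is smaller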
Now, we can simulate the Xi by generating uniform random numbers U1, . . . , Un and then setting
    Xi = 1 if Ui < pi, and Xi = 0 otherwise.
Written this way, each Xi is a monotone function of Ui, and so the antithetic variable approach of using U1, . . . , Un to generate both k(U1, . . . , Un) and k(1 − U1, . . . , 1 − Un) results in a smaller variance than if an independent set of random numbers was used to generate the second k.
Example 5.26 (Simulating a Queueing System). Consider a given queueing system, let Di denote the delay in queue of the ith arriving customer, and suppose we are interested in simulating the system so as to estimate
    θ := E[D1 + · · · + Dn].
Let X1, . . . , Xn denote the first n interarrival times and S1, . . . , Sn the first n service times of this system, and suppose these random variables are all independent. Now in most systems D1 + · · · + Dn will be a function of X1, . . . , Xn, S1, . . . , Sn — say,
    D1 + · · · + Dn = g(X1, . . . , Xn, S1, . . . , Sn),
where g is typically increasing in the service times and decreasing in the interarrival times. Writing Xi = Fi^{−1}(1 − Ui) and Si = Gi^{−1}(Ūi), where Fi and Gi are the respective CDFs, we obtain
    D1 + · · · + Dn = k(U1, . . . , Un, Ū1, . . . , Ūn),
where k is increasing in each of its arguments. Hence, the antithetic variable approach will reduce the variance of the estimator of θ. Thus, we would generate Ui, Ūi, i = 1, . . . , n and set Xi = Fi^{−1}(1 − Ui) and Si = Gi^{−1}(Ūi) for the first run, and Xi = Fi^{−1}(Ui) and Si = Gi^{−1}(1 − Ūi) for the second. As all the Ui and Ūi are independent, however, this is equivalent to setting Xi = Fi^{−1}(Ui), Si = Gi^{−1}(Ūi) in the first run and using 1 − Ui for Ui and 1 − Ūi for Ūi in the second.
More generally, suppose θ = E[h(X1, . . . , Xm)], where the Xi's are independent with CDFs F1, . . . , Fm, and that each Xi is generated via the inverse transform method, so that the simulation output is
    Z = h(F1^{−1}(U1), . . . , Fm^{−1}(Um)).
Since the CDF of any random variable is non-decreasing, it follows that the inverse CDFs, Fi^{−1}(·), are also non-decreasing. This means that if h(x1, . . . , xm) is a monotonic function of each of its arguments, then h(F1^{−1}(U1), . . . , Fm^{−1}(Um)) is also a monotonic function of the Ui's, and Corollary 5.23 then guarantees a variance reduction.
Example 5.27 (The Barbershop). Suppose that customers arrive at a barbershop according to a non-homogeneous Poisson process, N(t), with intensity λ(t), and that we wish to estimate some expected performance measure θ := E[Y] of a day's operation.
Assume also that the service times of customers are i.i.d. with CDF, F (·), and that they are also
independent of the arrival process, N (t). The usual simulation method for estimating θ would be
to simulate n days of operation in the barbershop, thereby obtaining n samples, Y1 , . . . , Yn , and
then setting
    θ̂n = (1/n) ∑_{j=1}^n Yj.
Suppose now that the barber wants to estimate the average total waiting time, θ, of the first 100 customers. Then
    θ = E[ ∑_{j=1}^{100} Wj ],
where Wj is the waiting time of the jth customer.
Now for each customer, j, there is an inter-arrival time, Ij, which is the time between the (j−1)th and jth arrivals. There is also a service time, Sj, which is the amount of time the barber spends cutting the jth customer's hair. It is therefore the case that Y := ∑_{j=1}^{100} Wj may be written as
    Y = h(I1, . . . , I100, S1, . . . , S100)
for some function h(·).
Now for many queueing systems, h(·) will be a monotonic function of its arguments since we
would typically expect Y to increase as service times increase, and decrease as inter-arrival times
increase. As a result, it might be advantageous to use antithetic variates to estimate θ. Notice
that we are implicitly assuming here that the Ij ’s and Sj ’s can be generated using the inverse
transform method.
Antithetic variates may also be used when generating normal random variables: if X ∼ N(µ, σ²), then X̃ := 2µ − X ∼ N(µ, σ²) as well. Clearly X and X̃ are negatively correlated. So if θ = E[h(X1, . . . , Xm)], where the Xi's are i.i.d. N(µ, σ²) and h(·) is monotonic in its arguments, then we can again achieve a variance reduction.
Example 5.28. Suppose we wish to estimate θ := E[X²], where X ∼ N(2, 1). Then it is easy to see that θ = 5, but we can also estimate it using antithetic variates. Is a variance reduction guaranteed?
5.5.2 Control Variates
Suppose again that we wish to estimate θ := E[Y], where Y = h(X) is the output of a simulation experiment. Suppose that Z is also an output of the simulation, or that we can easily output it if we wish. Finally, we assume that we know E[Z]. Then we can construct many unbiased estimators of θ: for any c ∈ ℝ,
    θ̂c := Y + c(Z − E[Z])
satisfies E[θ̂c] = θ. The variance of this estimator is
    Var(θ̂c) = Var(Y) + c² Var(Z) + 2c Cov(Y, Z),
and we can choose c to minimize this quantity. Simple calculus then implies the optimal value of c is given by
    c* = −Cov(Y, Z)/Var(Z),
and that the minimized variance satisfies
    Var(θ̂_{c*}) = Var(Y) − Cov(Y, Z)²/Var(Z).
In order to achieve a variance reduction it is therefore only necessary that Cov(Y, Z) ≠ 0. The resulting Monte Carlo algorithm proceeds by generating n samples of Y and Z and then setting
    θ̂_{c*} = (1/n) ∑_{i=1}^n (Yi + c*(Zi − E[Z])).
There is a problem with this, however, as we usually do not know Cov(Y, Z). We overcome this problem by doing p pilot simulations and setting
    Ĉov(Y, Z) = (1/(p − 1)) ∑_{j=1}^p (Yj − Ȳp)(Zj − E[Z]).
If it is also the case that Var(Z) is unknown, then we also estimate it with
    V̂ar(Z) = (1/(p − 1)) ∑_{j=1}^p (Zj − E[Z])²,
and then set ĉ* = −Ĉov(Y, Z)/V̂ar(Z).
Assuming we can find a control variate, our control variate simulation algorithm is as follows:
    /* Pilot simulation */
    for i = 1 to p
        generate (Yi, Zi)
    end for
    compute ĉ*
    /* Main simulation */
    for i = 1 to n
        generate (Yi, Zi)
        set Vi = Yi + ĉ*(Zi − E[Z])
    end for
    set θ̂_{ĉ*} = (1/n) ∑_{i=1}^n Vi, and σ̂²_{n,v} = (1/(n − 1)) ∑_{i=1}^n (Vi − θ̂_{ĉ*})²
    set 100(1 − α)% CI = [θ̂_{ĉ*} − z_{1−α/2} σ̂_{n,v}/√n, θ̂_{ĉ*} + z_{1−α/2} σ̂_{n,v}/√n]
Note that the Vi ’s are i.i.d., so we can compute approximate confidence intervals.
Example 5.29. Suppose we wish to estimate θ = E[e^{(U+W)²}], where U, W are i.i.d. U(0,1). In our notation we then have Y := e^{(U+W)²}. The usual approach is to generate n independent pairs (Ui, Wi), set Yi := e^{(Ui+Wi)²}, and estimate θ with Ȳn = (1/n) ∑_i Yi.
Now consider using the control variate technique. First we have to choose an appropriate control variate, Z. There are many possibilities, including
    Z1 := U + W,  Z2 := (U + W)²,  Z3 := e^{U+W}.
Note that we can easily compute E[Zi] for i = 1, 2, 3, and it is also clear that Cov(Y, Zi) ≠ 0. In a simple experiment we used Z3, estimating ĉ* on the basis of a pilot simulation with 100 samples. We reduced the variance by approximately a factor of 4. In general, a good rule of thumb is that we should not be satisfied unless we have a variance reduction on the order of a factor of 5 to 10, though often we will achieve much more.
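A sketch of this experiment (Python/NumPy assumed; the pilot and main sample sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(14)

    def y_and_z(size):
        u = rng.uniform(size=size)
        w = rng.uniform(size=size)
        return np.exp((u + w)**2), np.exp(u + w)   # Y and control variate Z_3

    ez = (np.e - 1.0)**2        # E[Z_3] = E[e^U] E[e^W] = (e - 1)^2

    # pilot simulation to estimate c*
    y_p, z_p = y_and_z(100)
    c_hat = -np.cov(y_p, z_p)[0, 1] / np.var(z_p, ddof=1)

    # main simulation
    y, z = y_and_z(100_000)
    v = y + c_hat * (z - ez)
    print(y.mean(), v.mean())              # both estimate theta
    print(y.var(ddof=1), v.var(ddof=1))    # Var(V) should be noticeably smaller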
Example 5.30 (The Barbershop revisited). Recall Example 5.27, where we assumed that customers arrive at the barbershop according to a non-homogeneous Poisson process, N(t), with intensity λ(t). Recall also that the service times of customers are i.i.d. with CDF, F(·), and that they are also independent of the arrival process, N(t). Then the quantity to be estimated is θ := E[Y], where
    Y := ∑_{j=1}^{N(T)} Wj,
and Wj denotes the waiting time of the jth customer.
Again, a method for estimating θ would be to simulate n days of operation in the barbershop, thereby obtaining n samples, Y1, . . . , Yn, and then setting
    θ̂n = (1/n) ∑_{j=1}^n Yj.
However, a better estimate could be obtained by using a control variate. In particular, let Z denote the total time customers on a given day spend in service, so that
    Z := ∑_{j=1}^{N(T)} Sj,
where Sj is the service time of the jth customer. Then, since service times are i.i.d. and independent of the arrival process, it is easy to see that
    E[Z] = E[N(T)] E[S1] = m(T) E[S1],
so that E[Z] is computable and Z can indeed serve as a control variate for Y.
References
Bachelier, Louis (1900). “Théorie de la spéculation”. In: Annales Scientifiques de l’École Normale
Supérieure 3, pp. 21–86.
Barron, E.N. and J.G. Del Greco (2020). Probability and Statistics for STEM: A Course in One
Semester. Springer.
Capiński, Marek and Peter Ekkehard Kopp (2004). Measure, Integral and Probability. Springer
Science & Business Media. isbn: 978-1-4471-1046-0.
Einstein, Albert (1905). “Über die von der molekularkinetischen Theorie der Wärme geforderte
Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen”. In: Annalen der Physik 322.8,
pp. 549–560.
Glasserman, Paul (2004). Monte Carlo Methods in Financial Engineering. Springer Science &
Business Media.
Liu, Jun S. (2001). Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media.
Ross, Sheldon M. (2012). Simulation. Academic Press.
Ross, Sheldon M. (2019). Introduction to Probability Models. Academic Press.
Wiener, Norbert (1920). “The generalized harmonic analysis”. In: Acta Mathematica 43.1,
pp. 203–239.