STAT230 Course Notes
Joshua Allum
April 2017
Contents
1 Introduction
  1.1 Defining Probability
2 Mathematical Probability Models
3 Counting Techniques
  3.1 Counting Arguments
  3.2 Counting Arrangements
  3.3 Notations
  3.4 Counting Subsets
  3.5 Properties of Combinations
  3.6 Counting Arrangements of a Set with Repeated Elements
4 Probability Rules
  4.1 General Rules
  4.2 Venn Diagrams
  4.3 De Morgan's Laws
  4.4 Rules for Unions of Events
  4.5 Mutually Exclusive Events
  4.6 Independence of Events
5 Conditional Probability
  5.1 Theorems and Rules for Conditional Probability
  5.2 Tree Diagrams
6 Useful Sums and Series
7 Discrete Random Variables and Probability Functions
8 Discrete Distributions
  8.1 Uniform Distribution
  8.2 Hypergeometric Distribution
  8.3 Binomial Distribution
    8.3.1 Comparison of Binomial and Hypergeometric Distributions
    8.3.2 Binomial Estimate of the Hypergeometric Distribution
  8.4 Negative Binomial Distribution
  8.5 Geometric Distribution
  8.6 The Poisson Distribution
    8.6.1 Poisson Estimate to the Binomial Distribution
    8.6.2 Parameters µ and λ
    8.6.3 Distinguishing the Poisson Distribution from other Distributions
  8.7 Combining Models
9 Mean and Variance
10 Continuous Random Variables
Chapter 1
Introduction
Example 1.1.1
The probability of a fair die landing on 3 is 1/6 because there is one way in which the die may
land on 3 and 6 total possible outcomes of faces the die may land on. The sample space of the
experiment, S, is {1, 2, 3, 4, 5, 6} and the event occurs in only one of these six outcomes.
The main limitation of this definition is that it demands that the outcomes of a sample space are
equally likely. This is a problem since a definition of "likelihood" (probability) is needed to include
this postulate in a definition of probability itself.
Example 1.1.2
The probability of a fair die landing on 3 is 1/6 because after a very large series of repetitions (ideally
infinite) of rolling the die, the fraction of times the face with 3 is rolled tends to 1/6.
The main limitation of this definition is that we can never repeat a process indefinitely so we can
never truly know the probability of an event from this definition. Additionally, in some cases we
cannot even obtain a long series of repetitions of processes to produce an estimate due to restrictions
on cost, time, etc.
Example 1.1.3
The probability that a football team will win their next match can be predicted by experts who
regard all the data of past matches and current situations to provide a subjective probability.
This definition is subjective and leads to different people assigning different probabilities to the same
events, with no clear "right" answer. Thus, by this definition, probability is not an objective science.
Probability Model
To avoid many of the limitations of these definitions of probability, we can instead treat probability as a
mathematical system defined by a set of axioms. Thus, we can ignore the numerical values of probabilities
until we consider a specific application. The model is defined as follows:
• A sample space of all possible outcomes of a random experiment is defined.
• A probability between 0 and 1 inclusive is assigned to each simple event in the sample space, such that the sum of the probabilities over all simple events is 1.
Chapter 2
Mathematical Probability Models
Example 2.1.1
The sample space for a roll of a six-sided die is
{1, 2, 3, 4, 5, 6}
Note that a sample space of a probability model for a process is not necessarily unique. Often,
however, we try to choose sample points that are the smallest possible, or "indivisible".
Example 2.1.2
If we define E to be the event that the top face of a six-sided die is even when rolled and O to be the
event the top-face is odd, then the sample space, S, can be defined as
{E, O}
This is the same process as Example 2.1.1 (rolling a six-sided die), so since the sample spaces differ,
clearly, sample spaces are not unique. Moreover, if we are interested in the event that a 3 is rolled,
this sample space is not suitable since it groups the event in question with other events.
A sample space can be either discrete or non-discrete. If a sample space is discrete, it consists of a
finite or countably infinite number of "simple events". A countably infinite set is one that can be put into a
one-to-one correspondence with the set of natural numbers. For example, { 1, 1/2, 1/3, 1/4, . . . } is countably infinite
whereas { x | x ∈ R } is not.
Simple Events
An event in a discrete sample space is a subset of the sample space, i.e., A ⊂ S. If the event is indivisible,
so as to only contain one point, we call it a simple event, otherwise it is a compound event.
Example 2.1.3
A simple event for a roll of a six-sided die is A = {a1 } where ai is the event the top face is i. A
compound event is E = {a2 , a4 , a6 }.
The second condition, that the sum of the probabilities of all sample points is 1, relates to the property
that for a given experiment one simple event in the sample space must occur. Every experiment or process
always has an outcome thus the probability of any outcome being achieved must be 1.
Compound Events
The probability of an event A is the sum of the probabilities of all the simple events that make up A:

P(A) = Σ_{a∈A} P(a)
Example 2.2.1
In the previous example we saw that E = {a2 , a4 , a6 } is a compound event. Thus, the probability of
the compound event E is
P (E) = P (a2 ) + P (a4 ) + P (a6 )
Note that the probability model that we defined does not specify what actual numbers to assign to
the simple events of a process. It only defines the properties that guarantee mathematical consistency.
Thus, if we assigned P (a2 ) to be 0.9, our model would still be mathematically consistent but would
not be consistent with the frequencies we obtain in multiple repetitions of the experiment.
In actual practice we try to define probabilities that are approximately consistent with the frequencies
of the events in multiple repetitions of the process.
Complements
The complement of an event, A, is the set of all outcomes not included in A and is denoted by Ā.
Example 2.2.2
If E = { a1 , a3 , a5 } is a compound event on the sample space { a1 , a2 , a3 , a4 , a5 , a6 }, then the
complement of E is

Ē = { a2 , a4 , a6 }
Because of the nature of complementary events, two complementary events cannot both occur in one
process. The events are mutually exclusive.
Chapter 3
Counting Techniques
Addition Rule
Suppose we can perform process 1 in p ways and process 2 in q ways. If we want to do process 1 or
process 2, but not both, then there are p + q ways to do so.
Example 3.1.1
Suppose a keyboard has only 26 letters and 20 special characters (!%#$). Then there are 46 ways in which a
typist may type a single character. (Process 1: typing a letter. Process 2: typing a special character.)
Multiplication Rule
Again, suppose we can perform process 1 in p ways and process 2 in q ways. If we want to do process 1
and process 2, then there are p × q ways to do so. This is because for each way of doing process 1 we can
do process 2 in q ways.
Example 3.1.2
Suppose the same typist with the same keyboard wants to type a single letter and a single special
character. The typist can do so in 26 × 20 = 520 ways, since there are 26 ways to select the letter and for each
possible letter selection there are 20 possible special character selections.
Try to associate OR and AND with addition and multiplication respectively in your mind.
Often, ORs and ANDs are not explicit or obvious, so you must re-word your problem to
identify implicit ORs and ANDs.
Example 3.1.3
A young boy gets to pick 2 toys from a store for his birthday. How many ways can he pick 2 toys if
the store contains 12 toys? He may pick the same toy multiple times and picks the toys at random.
We can re-word this problem as follows: A young boy selects one of 12 toys and again, selects one of
12 toys. Thus there are 12 × 12 = 144 ways in which he can select 2 toys. Furthermore, we have that
since selections are random, each selection is equally likely. So the probability that the boy selects
any pair of toys is 1/144.
In this case the boy was allowed to select the same toy more than once. This is often referred to as
with replacement. The addition and multiplication rules are generally sufficient to find probability
of processes with replacement but if processes occur without replacement solutions become more
complex and other techniques are often used.
The phrase at random or uniformly, indicates that each point in the sample space is equally likely.
Example 3.1.4
Consider a farmer with 500 different seeds. How many ways can he select 3 seeds randomly to plant?
We can re-word this problem to become: A farmer selects one seed from 500 and then selects one
seed of 499 and then one seed of 498. So there are 500 × 499 × 498 ways to do so.
Generally, if there are n ways of doing a process and it is done k times without replacement, that
is you can only do the process a specific way once, there are n × · · · × (n − k + 1) ways to do it.
Example 3.2.1
Consider the letters of the word “fiesta”. A baby (who cannot spell) randomly rearranges the letters
of the word. What is the probability that “fiesta” is the outcome?
There are six boxes (positions) to fill. We have 6 ways to fill the first position, 5 ways to fill
the second and so on until we have 1 way to fill the 6th position. The number of points in the sample
space is 6 × 5 × 4 × 3 × 2 × 1 = 720. So the probability of each outcome in the sample space is 1/720.
Example 3.2.2
Consider the letters of the word “snake”. If arranged randomly what is the probability that the word
formed begins with a vowel?
There are five boxes (positions) to fill. There are two ways to fill the first box (a or e),
and for each of these ways there are four remaining boxes to fill. The number of ways to fill the 4
remaining boxes is 4 × 3 × 2 × 1 = 24, so the total number of outcomes in which the first letter is a
vowel is 2 × 24 = 48. Therefore, the probability of the event occurring is 48 / (number of sample points).
The five boxes can be filled by any letter to obtain a point in the sample space, so there are 5 × 4 ×
3 × 2 × 1 = 120 sample points. So the probability of the event occurring is 48/120 = 2/5.
Example 3.2.3
Suppose we have 7 meals to distribute randomly to 7 people (one each). Three of the meals are gluten
free and the other four are not. Of the 7 people, two of them cannot eat gluten. How many ways are
there to distribute the meals without giving gluten to someone who cannot eat it?
We can liken this to the boxes example with each person being a box. Let the first two boxes be the
people who cannot eat gluten. Since we cannot place a gluten meal in boxes 1 or 2, there are 3 ways
to fill box 1 (any of the 3 gluten-free meals) and then 2 ways to fill box 2. So there are 3 × 2 = 6 ways
to distribute meals to the two people who cannot eat gluten.
Now there are 5 boxes to be filled with any of the 5 remaining meals. So there are 5 × 4 × 3 × 2 × 1 = 120 ways
to distribute the meals to the other 5 people. This is an implicit "AND" statement, thus there are
6 × 120 = 720 ways to distribute the meals.
3.3 Notations
Because some calculations occur very frequently in statistics we define a notation that helps us to deal with
such problems.
Factorial
We define n! for any natural number n to be
n! = n × (n − 1) × (n − 2) × · · · × 1
and in order to maintain mathematical consistency we define 0! to be 1. This is the number of arrangements
of n possible unique elements, using each once.
n to k Factors
We define n^(k) to be

n^(k) = n × (n − 1) × · · · × (n − k + 1) = n! / (n − k)!

This is the number of arrangements of length k using each element, of n possible unique elements, at most
once.
Power of
As in ordinary mathematics, n^k = n × n × · · · × n (k factors). This represents the number of arrangements
of length k that can be made using each element, of n possible unique elements, as often as we wish (with replacement).
For many problems it is simply impractical to try to count the number of cases by conventional means
because of how big the numbers become. Notations such as n!, n^(k) and n^k allow us to deal with these
large numbers effectively.
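As a quick aside (not part of the original notes), these counting quantities can be evaluated directly in Python 3.8+: math.factorial gives n!, math.perm(n, k) gives n^(k), and ** gives n^k.

```python
from math import factorial, perm

# n! : the number of arrangements of n distinct elements, using each once
print(factorial(6))        # 720, as in the "fiesta" example

# n^(k) : arrangements of length k chosen without replacement
print(perm(500, 3))        # 500 * 499 * 498 = 124251000, as in the seed example

# n^k : arrangements of length k chosen with replacement
print(10 ** 7)             # e.g. the number of 7-digit phone numbers below
```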
Example 3.3.1
An evil advertising company randomly chooses 7-digit phone numbers to call to try to sell products.
Find the probabilities of the following events:
• A: the number is your phone number
• B: the first three numbers are less than 5
• C: the first and last numbers match your phone number
Now assume that all 7 digits are unique (chosen without replacement):
• D: the number is 210-3869
• E: the first three numbers are less than 5
• F : the first and last numbers are 1 and 2 respectively
A: The initial sample space contains all the ways that one can select 7 numbers from the numbers 0
to 9 with replacement. There are 10 choices for each of the seven numbers, therefore the sample
space contains 10^7 points. Thus, since all points are equally likely, P(A) = 1/10^7.
B: Now if the first three numbers are less than 5, there are 5 ways (0 to 4) to select each of the first
three numbers and there are 10 ways to select each of the next four numbers. So there are 5^3 × 10^4
points in B. Therefore, P(B) = (5^3 × 10^4) / 10^7.
C: There is only one way to select the first number such that it matches your number and the same
is true for the last number. Thus, we must only consider the middle digits. There are 10 choices each
for the middle five numbers, so there are 10^5 points in C. Therefore, P(C) = 10^5 / 10^7 = 1/10^2.
D: The new sample space contains all the ways that one can select 7 numbers from the numbers 0
to 9 without replacement. There are 10 choices for the first number, 9 for the second and so on
until there are 4 choices for the last number. Thus, there are 10^(7) points in the sample space and,
since each is equally likely, P(D) = 1/10^(7) = 1/(10 × 9 × 8 × 7 × 6 × 5 × 4).
E: If the first three numbers are less than five, there are 5 ways to select the first number, 4 for
the second and 3 for the third, so there are 5^(3) ways to select the first 3 numbers. The next 4 digits
may be selected from any of the 7 digits that were not used as one of the first 3, so there are 7^(4)
ways to select the final four digits. Therefore, there are 5^(3) × 7^(4) points in E. So, P(E) = (5^(3) × 7^(4)) / 10^(7).
F: There is only one way to select the first and last digits as 1 and 2 respectively, so we must only
consider the middle 5 digits. The 5 digits are selected from 8 numbers without replacement, so there
are 8^(5) ways to do this. Therefore, P(F) = 8^(5) / 10^(7).
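As a sanity check (not in the original notes), these probabilities can be evaluated numerically with math.perm:

```python
from math import perm

# With replacement: the sample space has 10^7 equally likely points
print(1 / 10**7)                                  # P(A)
print((5**3 * 10**4) / 10**7)                     # P(B) = 0.125
print(10**5 / 10**7)                              # P(C) = 0.01

# Without replacement: the sample space has 10^(7) = perm(10, 7) points
print(1 / perm(10, 7))                            # P(D)
print(perm(5, 3) * perm(7, 4) / perm(10, 7))      # P(E)
print(perm(8, 5) / perm(10, 7))                   # P(F)
```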
Combinations
We define (n choose k) to be the number of subsets of size k that can be selected from a set of n elements. We have

(n choose k) = n^(k) / k! = n! / ((n − k)! k!)
Derivation of Choose
Suppose we have a set of n unique elements and we wish to select a subset of size k, such that
k ≤ n, and the elements of the subset are unique (selected without replacement). If we use the boxes
metaphor we have k empty boxes.
There are n ways to select the first element of the subset, (n − 1) ways to select the second and so
on until there are (n − k + 1) ways to select the kth and last element.
So there are n^(k) ways to fill the k boxes, but note that some of the selections will contain all the same
elements as each other, only in varying order. These are not distinct subsets since we do not care about the
arrangement of the items in a subset. Each unique subset can be arranged to form k! permutations of
its k elements. Thus, the number of unique subsets, (n choose k), multiplied by the number of arrangements
of each subset, k!, is n^(k). Therefore, we have

(n choose k) × k! = n^(k)

So it follows that

(n choose k) = n^(k) / k!
Properties of Combinations
• (n choose k) = n! / ((n − k)! k!) = n^(k) / k!
• (n choose k) = (n choose n − k) for all 0 ≤ k ≤ n
• (n choose 0) = (n choose n) = 1
• (1 + x)^n = (n choose 0) + (n choose 1) x + (n choose 2) x^2 + · · · + (n choose n) x^n (Binomial Theorem)
Derivation
Suppose we have a set of n elements of which only k are distinct. Label the k distinct items u1 to uk
and let ni be the number of appearances of ui in the set of n elements, so that n1 + n2 + · · · + nk = n. We want
to form an arrangement of length n using all n elements. There are n! orderings of the n positions, but
interchanging the ni identical copies of ui among themselves does not produce a new arrangement. Each distinct
arrangement is therefore counted n1! × n2! × · · · × nk! times among the n! orderings, so the number of distinct
arrangements is

n! / (n1! n2! · · · nk!)
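As a small check of this result (my own example, not from the notes), the number of distinct arrangements of the letters of "BANANA" can be computed both by the formula and by brute-force enumeration:

```python
from math import factorial
from itertools import permutations

word = "BANANA"                      # n = 6 letters: A three times, N twice, B once
counts = {"A": 3, "N": 2, "B": 1}

# Formula: n! / (n1! * n2! * ... * nk!)
denom = 1
for c in counts.values():
    denom *= factorial(c)
by_formula = factorial(len(word)) // denom

# Brute force: generate all 6! orderings and count the distinct ones
by_enumeration = len(set(permutations(word)))

print(by_formula, by_enumeration)    # both are 60
```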
Chapter 4
Probability Rules
Theorem 4.1.1
For a sample space, S, the probability that some outcome in S occurs is 1. That is,

P(S) = 1

Proof 4.1.1:

P(S) = Σ_{a∈S} P(a) = Σ_{all a} P(a) = 1

since, by the probability model, the probabilities of all the simple events sum to 1.
Theorem 4.1.2
Any event A in a sample space has a probability between 0 and 1 inclusive. That is,

0 ≤ P(A) ≤ 1

Proof 4.1.2:
Note that A is a subset of S, so

P(A) = Σ_{a∈A} P(a) ≤ Σ_{a∈S} P(a) = 1

Now, recall that P(a) ≥ 0 for any sample point a by our probability model. Thus, since P(A) is the sum of
non-negative real numbers, P(A) ≥ 0. So we have

0 ≤ P(A) ≤ 1
Theorem 4.1.3
If A and B are two events such that A ⊆ B, that is, all the sample points in A are also in B, then

P(A) ≤ P(B)

Proof 4.1.3:

P(A) = Σ_{a∈A} P(a) ≤ Σ_{a∈B} P(a) = P(B)
Now, assuming the area of E is half the area of S, we have that the probability of E is the probability
that a randomly chosen point on the area of S will be within E.
Consider now we let G = { 4, 5, 6 } be the event that the number selected is greater than or equal
to 4.

(Venn diagram of the sample space showing the overlapping events E and G)
The total shaded region of the Venn diagram, E ∪ G, contains all the sample points of E and G. It is
the event that any outcome in either E or G, or both, occurs. Thus, E ∪ G is the event that E, G or both,
occurs. Similarly, the union of three events is the event that at least one of the three events occur.
Consider now the intersection E ∩ G. It is the set of all the points that are in both E and G, { 4, 6 }.
Thus, it is the event that an outcome in both E and G occurs. So E ∩ G is the event that E and G both
occur.
The sets A ∩ B and similarly A ∩ B ∩ C are often denoted as AB and ABC respectively.
Finally, the unshaded space in Figure 4.1 is the set of all outcomes that are not in E. It is the complement
of E and is denoted by Ē. It is the event that E does not occur.
Note that the complement of S is the null set, that is, S̄ = ∅, and has a probability of 0.
Theorem 4.3.1
The following are De Morgan's Laws:
1. The complement of A ∪ B is Ā ∩ B̄
2. The complement of A ∩ B is Ā ∪ B̄
(Venn diagram of overlapping events E and G, with the union E ∪ G shaded)
We can see that the area of E ∪ G is not simply the sum of the areas of E and G. So we have that
the probability of E ∪ G is not simply the sum of the probability of E and G. Rather, we must sum the
probabilities and subtract the intersection (which gets included twice in the sum) to obtain P (E ∪ G).
Theorem 4.4.1
For any events, A and B, in a sample space, we have
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
Example 4.4.1
A number between 1 and 6 inclusive is chosen randomly. Let E = { 2, 4, 6 } be the event the number
is even and let G = { 4, 5, 6 } be the event that the number is greater than or equal to 4.
The probability of the number being even or greater than or equal to 4 is P(E ∪ G). Since both E and G
contain 3 points of the six in the sample space, P(E) = P(G) = 1/2. Thus, we can see clearly that
P(E ∪ G) ≠ P(E) + P(G) = 1 since { 1 } is in neither E nor G and has a probability of 1/6. Now, note
E ∩ G = { 4, 6 }, so P(E ∩ G) = 1/3. We have

P(E ∪ G) = P(E) + P(G) − P(E ∩ G) = 1/2 + 1/2 − 1/3 = 2/3
(Venn diagram of three overlapping events E, G and F)
Let AI be the area on the Venn diagram of the event I. The area of the union once again is not simply the
sum of the areas (AE + AG + AF ). Instead we can reason out that when we add the three areas we include
AE∩G , AG∩F , and AF ∩E twice each and AE∩G∩F three times. The sum of these doubly counted areas
(AE∩G + AG∩F + AF ∩E ) also includes AE∩G∩F three times. Thus, when we subtract the area of the doubly
counted segments, AE∩G∩F is also subtracted three times leaving this area unaccounted for. Therefore we
then add AE∩G∩F to find the complete area of E ∪ G ∪ F .
Theorem 4.4.2
For any events, A, B and C, in a sample space, we have

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C)
(Venn diagram of two mutually exclusive, non-overlapping events E and G)
The intersection of two mutually exclusive events contains no sample points, so its probability is 0. For the
mutually exclusive events E and G above, we have

P(E ∩ G) = 0
Another intrinsic property of mutually exclusive events that we can see on a Venn diagram is that the area
of E ∪ G is the sum of the areas of E and G. Therefore, unlike in previous examples, the probability of E ∪ G
is the sum of the probabilities of E and G.
Theorem 4.5.1
For mutually exclusive events, A and B, in a sample space, we have
P (A ∪ B) = P (A) + P (B)
Theorem 4.5.2
More generally for n mutually exclusive events, A1 , A2 , . . . , An , in a sample space, we have
P(A1 ∪ A2 ∪ · · · ∪ An) = P(A1) + P(A2) + · · · + P(An) = Σ_{i=1}^{n} P(Ai)
Probabilities of Complements
Theorem 4.5.3
For any event A, we have
P(Ā) = 1 − P(A)
Proof 4.5.1:
Recall the complement of an event consists of all the sample points not in the event. Thus, for any event A,
its complement Ā contains no points in common with A. So A ∩ Ā = ∅ and A and Ā are mutually exclusive
by definition. Now, consider A ∪ Ā: it spans the whole of the sample space, so we have P(A ∪ Ā) = 1 and,
since A and Ā are mutually exclusive, we have

P(A) + P(Ā) = 1
Independence of Events
Two events A and B are said to be independent if P(A ∩ B) = P(A)P(B); otherwise they are dependent.
Example 4.6.1
Consider an experiment in which a fair die is tossed twice. We define the following events:
• A: The first number rolled is a six
• B: The second number rolled is a six
• C: The sum of the numbers rolled is less than or equal to seven
• D: The sum of the numbers rolled is equal to seven
Suppose the event A occurs. Does this have any impact on the probability of B, C or D occurring?
It is quite clear to see that the events A and B are independent events since rolling a six on the first
toss has no impact on the number that will be rolled on the second toss. Now, events A and C from
the onset appear to be dependent since if you roll a six on the first toss you must roll a one to make
your total less than or equal to seven. To confirm this consider the sample space
(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6)
(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6)
(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6)
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6)
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6)
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)
We can count that 21 of the sample points have sums less than or equal to seven. So the probability
of C occurring is P (C) = 21/36 = 7/12. We also have that P (A) = 1/6. So P (A)P (C) = 7/72 but
we can count that A ∩ C contains only one sample point and hence has a probability of 1/36. Thus,
P (A)P (C) 6= P (A ∩ C) so A and C are dependent events.
At first glance, we see that upon rolling a six as the first number you must roll a 1 for the sum to
equal seven. So at first glance, events A and D seem to be dependent; however, it would be naïve
to assume this. We can count from the sample space that event D contains 6 points and so has a
probability P(D) = 6/36 = 1/6 and P(A) = 1/6. So P(A)P(D) = 1/36. Now, we can count that
the event A ∩ D contains only one point, (6, 1), and so has a probability P(A ∩ D) = 1/36. Therefore,
P (A ∩ D) = P (A)P (D) and the events A and D are independent.
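A short enumeration (not from the notes) confirms both conclusions by comparing P(A)P(C) with P(A ∩ C) and P(A)P(D) with P(A ∩ D):

```python
from fractions import Fraction

# The 36 equally likely outcomes of rolling a die twice
space = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    return Fraction(sum(1 for pt in space if event(pt)), len(space))

A = lambda pt: pt[0] == 6             # first number rolled is a six
C = lambda pt: pt[0] + pt[1] <= 7     # sum is less than or equal to seven
D = lambda pt: pt[0] + pt[1] == 7     # sum is exactly seven

# Dependent: 7/72 versus 1/36
print(prob(A) * prob(C), prob(lambda pt: A(pt) and C(pt)))

# Independent: 1/36 versus 1/36
print(prob(A) * prob(D), prob(lambda pt: A(pt) and D(pt)))
```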
Chapter 5
Conditional Probability
Often we need to calculate the probability of some event A occurring while knowing that some other event B
has already occurred. We call this the conditional probability of A given B and denote it by P (A | B).
The conditional probability of event A, given event B, is

P(A | B) = P(A ∩ B) / P(B), for P(B) > 0
Theorem 5.1.1
For any two events A and B defined on the same sample space, with P (A) > 0 and P (B) > 0,
events A and B are independent if and only if P (A | B) = P (A) or P (B | A) = P (B).
Proof 5.1.1:

P(A | B) = P(A ∩ B) / P(B)  ⟺  P(A ∩ B) = P(A | B) P(B)
and by definition of independence, A and B are independent if and only if P (A ∩ B) = P (A)P (B) which is
true if and only if P (A | B) = P (A). Without loss of generality we can swap events A and B and arrive at
the conclusion.
Product Rules
Theorem 5.1.2
Let A, B, C and D be events on a sample space, with P (A), P (B), P (C), P (D) > 0. We have
P (A ∩ B) = P (A)P (B | A)
P (A ∩ B ∩ C) = P (A)P (B | A)P (C | A ∩ B)
P (A ∩ B ∩ C ∩ D) = P (A)P (B | A)P (C | A ∩ B)P (D | A ∩ B ∩ C)
and so on. . .
Proof 5.1.2:
The first statement comes directly from the definition of conditional probability:

P(A) P(B | A) = P(A) × [P(A ∩ B) / P(A)] = P(A ∩ B)

and the remaining statements follow by applying the same argument repeatedly, and so on. . .
Theorem 5.1.3 (Law of Total Probability)
Let A1, A2, A3, . . . , Ak be mutually exclusive events whose union is the whole sample space (a partition
of the sample space), and let B be an event on the same sample space. We have

P(B) = P(B ∩ A1) + P(B ∩ A2) + P(B ∩ A3) + · · · + P(B ∩ Ak) = Σ_{i=1}^{k} P(Ai) P(B | Ai)
Proof 5.1.3:
Note that the events Ai ∩ B for 1 ≤ i ≤ k are all mutually exclusive events since the Ai's are mutually exclusive.
Thus, the union of the Ai ∩ B's is B, that is,

B = (A1 ∩ B) ∪ (A2 ∩ B) ∪ · · · ∪ (Ak ∩ B)

so, since these events are mutually exclusive,

P(B) = Σ_{i=1}^{k} P(Ai ∩ B) = Σ_{i=1}^{k} P(Ai) P(B | Ai)

by the product rule, as required.
Bayes’ Theorem
Theorem 5.1.4
Let A and B be events on a sample space, with P(A) > 0 and P(B) > 0. We have

P(A | B) = P(B | A) P(A) / P(B)
Proof 5.1.4:

P(A | B) = P(A ∩ B) / P(B) = P(B | A) P(A) / P(B)   by Theorem 5.1.2 (Product Rule)

         = P(B | A) P(A) / [P(A ∩ B) + P(Ā ∩ B)]   by Theorem 5.1.3 (Law of Total Probability)

         = P(B | A) P(A) / [P(B | A) P(A) + P(B | Ā) P(Ā)]   by Theorem 5.1.2 (Product Rule)
Bayes' Theorem allows us to find the conditional probability of some event A given B in terms of
the probability of B given A. It allows us to calculate conditional probabilities using the reversed order
of conditioning.
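As an illustration of this reversed order of conditioning (my own hypothetical numbers, not from the notes), consider a diagnostic test:

```python
# Hypothetical numbers, for illustration only
p_disease = 0.01              # P(A): the person has the condition
p_pos_given_disease = 0.95    # P(B | A): test positive given the condition
p_pos_given_healthy = 0.02    # P(B | complement of A): false positive rate

# Law of Total Probability for the denominator P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' Theorem: P(A | B) = P(B | A) P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))   # about 0.3242
```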
The probabilities of all the branches leading outward from each node must sum to 1, since at least
one of the outcomes must occur.
Chapter 6
Useful Sums and Series
This chapter includes a few useful sums and series that show up in the following chapters.
Other identities can be obtained from the geometric series by differentiation. For example, we have

d/dr Σ_{i=0}^{∞} r^i = Σ_{i=0}^{∞} i r^{i−1} = d/dr [1/(1 − r)] = 1/(1 − r)^2, for |r| < 1
Proof 6.4.1:
We begin with the equality

(1 + y)^{a+b} = (1 + y)^a × (1 + y)^b

Now by the Binomial Theorem we have

Σ_{k=0}^{a+b} (a+b choose k) y^k = [Σ_{i=0}^{a} (a choose i) y^i] × [Σ_{j=0}^{b} (b choose j) y^j]

Consider the coefficient of y^k on the right hand side. It is the sum over all products of binomial terms such that
i + j = k. Thus, the coefficient of y^k on the right hand side is

Σ_{i=0}^{min{a,k}} (a choose i) (b choose k − i)

and since the term is 0 when i > a or i > k, we can extend the sum to infinity. Thus, since the coefficient
on the right hand side is equal to that on the left hand side, we have

(a+b choose k) = Σ_{i=0}^{∞} (a choose i) (b choose k − i)
When x becomes larger than n, the terms of the summation become 0 since

(n choose x) = (n choose n − x) = 0, for x > n
1 + 2 + 3 + · · · + n = n(n + 1)/2

1^2 + 2^2 + 3^2 + · · · + n^2 = n(n + 1)(2n + 1)/6

1^3 + 2^3 + 3^3 + · · · + n^3 = [n(n + 1)/2]^2
Chapter 7
Discrete Random Variables and Probability Functions
Example 7.1.1
Suppose an experiment consists of tossing a coin three times. Let the random variable X be the
number of heads that are obtained, and let the random variable Y be the number of tails obtained. Now,
we have a nice shorthand in that X = 2 is equivalent to the statement "two heads were obtained".
Moreover, we have useful equalities such as X + Y = 3 and X = 3 − Y.
The ranges of X and Y are both { 0, 1, 2, 3 }.
It is very important to understand the purpose of r.v.’s since the remainder of this course features
them heavily.
The formal definition of a random variable is a function that assigns a real number to each point in a
sample space.
Example 7.1.2
Consider the same experiment as above. The sample space is

{ HHH, HHT, HTH, THH, HTT, THT, TTH, TTT }

Let us define X to be the number of heads obtained, and consider the sample point a = HHT. The value of the
function is X(a) = 2; it is found by counting the number of heads in a. The range of X is { 0, 1, 2, 3 }.
Each of the outcomes X = x represents an event, simple or compound. In this case they are:
X Event
0 { TTT }
1 { TTH, THT, HTT }
2 { THH, HTH, HHT }
3 { HHH }
Since some outcome in the range of a random variable must always occur, the events "X = x" are mutually
exclusive subsets of the sample space whose union is the total sample space. For an r.v. X and outcome x,
"X = x" represents some event and we are interested in calculating its probability. We denote the probability
of X = x by P(X = x), and the probability function of X is f(x) = P(X = x), defined for all x ∈ Range(X).
Since the union of the events given by the values of a random variable is the total sample space, we have

Σ_{x∈Range(X)} P(X = x) = 1

The set of pairs { (x, f(x)) | x ∈ Range(X) } is called the probability distribution of X.
The cumulative distribution function (c.d.f.) of X is F(x) = P(X ≤ x). Note that because the events "X = x"
and "X = y" for x ≠ y are mutually exclusive, we have

F(x) = P(X ≤ x) = Σ_{z=0}^{x} P(X = z) for all x ∈ Range(X)

It is the sum of the probabilities that the random variable takes values less than or equal to x.
• lim_{x→∞} F(x) = 1
Chapter 8
Discrete Distributions
As we briefly mentioned in the previous chapter, probability distributions are the set of pairs (x, f (x)) for
all possible outcomes x of a random variable X. Many probability distributions appear commonly on r.v.’s
of similar “real-life” processes. In this chapter we define a few of these common distributions on discrete
random variables, when they occur and how to use them to calculate probabilities.
Another way to define the probability of each value of a random variable with this sample space is

f(x) = 1 / (number of possible values in Range(X))
Suppose we have a collection of N objects of which r are "successes" and N − r are "failures", and we pick
n objects at random without replacement. Let the random variable X be the number of successes obtained.
X has a Hypergeometric distribution and we denote it

X ∼ Hypergeometric

f(x) = P(X = x) = (r choose x) (N − r choose n − x) / (N choose n), for x ≤ min(r, n)
It is important to understand that the terms "successes" and "failures" are simply placeholders that
represent a type of outcome and its complement. They could be replaced by "wins" and "losses",
"whites" and "colors", or any other titles for two distinct groups whose union spans the whole
sample space.
This is used when we know how many items (n) are chosen at random from a set with two different
types and we know the amount of each type in the set.
Example 8.2.1
There is a basket with 11 fruit, 9 apples and 2 oranges. 4 fruit are picked at random from the basket.
Let random variable X be the number of apples selected. Find f (x) = P (X = x). Then find f (3).
X ∼ Hypergeometric. N = 11, n = 4, r = 9.

f(x) = P(X = x) = (9 choose x) (2 choose 4 − x) / (11 choose 4), for x ≤ 4

Hence

f(3) = P(X = 3) = (9 choose 3) (2 choose 1) / (11 choose 4) ≈ 0.509
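A quick numerical check of this example (not part of the notes), computing the probability function directly with math.comb:

```python
from math import comb

N, n, r = 11, 4, 9      # 11 fruit, 4 picked, 9 apples ("successes")

def hypergeom_pmf(x):
    # P(X = x) = C(r, x) C(N - r, n - x) / C(N, n)
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)

print(round(hypergeom_pmf(3), 3))                    # 0.509
print(sum(hypergeom_pmf(x) for x in range(n + 1)))   # probabilities sum to 1
```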
Example 8.2.2
15 cards are drawn from a deck of 52 at random. Let X be the number of red cards drawn. Find
f(x) = P(X = x). Then find f(7).
X ∼ Hypergeometric. N = 52, n = 15, r = 26.

f(x) = P(X = x) = (26 choose x) (26 choose 15 − x) / (52 choose 15), for x ≤ 15

Hence

f(7) = P(X = 7) = (26 choose 7) (26 choose 8) / (52 choose 15) ≈ 0.229
Suppose an experiment consists of n independent trials, each of which results in a "success" with probability p
and a "failure" otherwise, and let X be the number of successes obtained. Then X has a Binomial distribution,
written X ∼ Binomial(n, p), with

f(x) = P(X = x) = (n choose x) p^x (1 − p)^{n−x}, for x = 0, 1, . . . , n

There are (n choose x) different arrangements of x successes and (n − x) failures that satisfy "X = x". Each of
these arrangements has probability p^x (1 − p)^{n−x}, since the probability of obtaining x successes is p^x and
the probability of obtaining n − x failures is (1 − p)^{n−x}. So the probability that X = x, that is, that any one
of the arrangements occurs, is the sum of the probabilities of the unique arrangements, (n choose x) p^x (1 − p)^{n−x}.
The above formula describes the probability of x successes and (n − x) failures multiplied by the number
of different ways of arranging those successes within the total number of trials of the experiment.
Each of the n individual experiments is called a "Bernoulli trial" and the entire process of n trials is
called a Bernoulli process or a Binomial process.
Example 8.3.1
A loaded coin is flipped 10 times, with a probability of a heads occurring being 0.4. Let random
variable X be the number of heads that occur. Find f (x) = P (X = x), then find f (3).
X ∼ Binomial(10, 0.4).

f(x) = P(X = x) = (10 choose x) (0.4)^x (0.6)^{10−x}, for x = 0, 1, . . . , 10

Hence

f(3) = P(X = 3) = (10 choose 3) (0.4)^3 (0.6)^7 ≈ 0.215
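The same kind of check works for the Binomial examples (not part of the notes):

```python
from math import comb

def binomial_pmf(x, n, p):
    # P(X = x) = C(n, x) p^x (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(round(binomial_pmf(3, 10, 0.4), 3))                  # 0.215
print(round(binomial_pmf(2, 22, 0.02), 3))                 # 0.062 (Example 8.3.2 below)
print(sum(binomial_pmf(x, 10, 0.4) for x in range(11)))    # probabilities sum to 1
```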
Example 8.3.2
A football season in a university league has 22 games. The probability of each game being abandoned
(because of bad weather or other hazards) is 0.02. Let X be the number of games abandoned
throughout the whole season. Find f (x) = P (X = x), then find f (2) and f (10).
X ∼ Binomial(22, 0.02).

f(x) = P(X = x) = (22 choose x) (0.02)^x (0.98)^{22−x}, for x = 0, 1, . . . , 22

Hence

f(2) = P(X = 2) = (22 choose 2) (0.02)^2 (0.98)^{20} ≈ 0.062

and

f(10) = P(X = 10) = (22 choose 10) (0.02)^{10} (0.98)^{12} ≈ 5.196 × 10^{−12}
The Hypergeometric distribution is used when a fixed number of objects is drawn without replacement from a
set containing a known number of successes and failures.
The Binomial distribution is used when there is no fixed set of objects to be selected from; instead, the
trials are independent and we know the constant probability of a success for each trial.
Example 8.3.3
Consider Lisa owns a car dealership and has only 750 red cars and 1250 blue cars in stock. A
rich Swedish man enters and picks 50 cars randomly to purchase. Let X be the number of red cars
the Swede purchases.
Since we know the number of successes (750 red cars) and failures (1250 blue cars) as well as the
number of cars drawn, we have that X ∼ Hypergeometric and

f(x) = P(X = x) = (750 choose x) (1250 choose 50 − x) / (2000 choose 50)
Now, consider Lisa has run out of all her stock of cars. She goes to a Swedish car manufacturer’s
factory which is capable of producing any amount of cars. The factory has a 37.5% chance of producing
a red car and otherwise produces a blue car. Lisa orders 50 cars. Let X be the number of red cars
she receives.
Since there is no fixed number of cars to choose from but we do know the probability of each car
being a success, we have that X ∼ Binomial(50, 0.375) and

f(x) = P(X = x) = (50 choose x) (0.375)^x (0.625)^{50−x}
Example 8.3.4
Consider the previous example, suppose the rich Swedish man purchased 50 cars from Lisa. What is
the probability that he purchases 20 red cars?
The number of cars Lisa has in stock is very large and the number of cars being bought is fairly small.
Thus we can approximate the distribution with the probability of a success being 750/2000 = 0.375.
We have

f(20) = P(X = 20) = (50 choose 20) (0.375)^{20} (0.625)^{30} ≈ 0.1072
Now we can calculate the probability using the hypergeometric distribution to determine how good
an estimate this is. We have

f(20) = P(X = 20) = (750 choose 20) (1250 choose 30) / (2000 choose 50) ≈ 0.1084
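A short computation (not in the notes) reproduces both values and makes the quality of the approximation concrete:

```python
from math import comb

red, blue, picked = 750, 1250, 50
total = red + blue

hyper = comb(red, 20) * comb(blue, 30) / comb(total, picked)
binom = comb(picked, 20) * 0.375**20 * 0.625**30

print(round(hyper, 4), round(binom, 4))   # approximately 0.1084 and 0.1072
```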
The Negative Binomial distribution is used to model the number of failures observed before the
k-th success in a sequence of independent trials. Thus, if the total number of trials is fixed in advance, this
distribution is not appropriate.
Example 8.4.1
A bad driver never stops at red lights and keeps driving and running red lights until he is arrested.
The probability of him getting pulled over by a policeman immediately after running a red light is 0.53,
and upon being pulled over 4 times he is arrested. Let X be the number of red lights the driver runs
without being pulled over before he is arrested. Find f (x) = P (X = x), then find f (1) and f (7).
X ∼ NB(4, 0.53)

f(x) = P(X = x) = (x + 3 choose x) (0.53)^4 (0.47)^x, for x = 0, 1, 2, . . .

Hence

f(1) = P(X = 1) = (4 choose 1) (0.53)^4 (0.47) ≈ 0.148

and

f(7) = P(X = 7) = (10 choose 7) (0.53)^4 (0.47)^7 ≈ 0.048
Example 8.4.2
The probability of a football player scoring at least one goal in each game is 0.72. When the player
scores in 26 games, she is awarded a bonus check. Let X be the number of games in which the player
does not score before she is awarded the bonus. Find f (x) = P (X = x), then find f (7), and f (0).
X ∼ NB(26, 0.72)

f(x) = P(X = x) = (x + 25 choose x) (0.72)^{26} (0.28)^x, for x = 0, 1, 2, . . .

Hence

f(7) = P(X = 7) = (32 choose 7) (0.72)^{26} (0.28)^7 ≈ 0.089

and

f(0) = P(X = 0) = (25 choose 0) (0.72)^{26} (0.28)^0 = 0.72^{26} ≈ 1.953 × 10^{−4}
Note that f (0) is simply the probability that the player scores in all of her first 26 games.
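A sketch (not in the notes) that reproduces the answers of Examples 8.4.1 and 8.4.2 from the Negative Binomial probability function:

```python
from math import comb

def neg_binomial_pmf(x, k, p):
    # X = number of failures before the k-th success
    # P(X = x) = C(x + k - 1, x) p^k (1 - p)^x
    return comb(x + k - 1, x) * p**k * (1 - p)**x

print(round(neg_binomial_pmf(1, 4, 0.53), 3))    # 0.148 (Example 8.4.1)
print(round(neg_binomial_pmf(7, 4, 0.53), 3))    # 0.048
print(round(neg_binomial_pmf(7, 26, 0.72), 3))   # 0.089 (Example 8.4.2)
```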
Example 8.5.1
A betting game involves flipping a coin repeatedly. The coin is fixed so that the probability of heads
is 0.7 and the probability of tails is 0.3. On every flip, if you get heads you may flip again, but otherwise (if you get
tails) the game is over. For each heads you flip you get $100. Let X be the number of heads you get.
Find f(2) and F(3).
X is the number of successes (heads) before the first failure (tails). Thus, X ∼ Geometric, so

f(x) = P(X = x) = (0.7)^x (0.3), for x = 0, 1, 2, . . .

Hence
f(2) = P(X = 2) = (0.7)^2 (0.3) = 0.147
and
F(3) = P(X ≤ 3) = (0.7)^3 (0.3) + (0.7)^2 (0.3) + (0.7)^1 (0.3) + (0.7)^0 (0.3) = 0.7599
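A quick check of f(2) and F(3) (not part of the notes):

```python
def geometric_pmf(x, p_fail):
    # X = number of successes before the first failure
    # P(X = x) = (1 - p_fail)^x * p_fail
    return (1 - p_fail)**x * p_fail

print(round(geometric_pmf(2, 0.3), 3))                         # f(2) = 0.147
print(round(sum(geometric_pmf(x, 0.3) for x in range(4)), 4))  # F(3) = 0.7599
```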
Note that there are only two ways to get a total of x ≠ 0 occurrences in an interval of length t + ∆t
for a sufficiently small ∆t, since by individuality the probability of two or more events in the interval
(t, t + ∆t) is negligible. Either there are x occurrences by time t, or there are (x − 1) occurrences by
time t and 1 in the interval (t, t + ∆t). This and the property of independence lead to
d/dt [e^{λt} f_t(1)] = e^{λt} λ e^{−λt} = λ
Integrating both sides we have

e^{λt} f_t(1) = λt + C

Note that for an interval of length 0, the probability of one event must be 0, so the constant C must be 0.
Hence, we have

f_t(1) = λt e^{−λt}
We now use induction to generalize this result for an arbitrary x. Our inductive hypothesis is as
follows
f_t(x) = (λt)^x e^{−λt} / x!
We have already shown that our hypothesis holds for x = 0 and x = 1. Now we assume our hypothesis
is true and recall the differential equation (8.1) with x + 1. We have
d/dt [e^{λt} f_t(x + 1)] = e^{λt} λ f_t(x)

d/dt [e^{λt} f_t(x + 1)] = e^{λt} λ (λt)^x e^{−λt} / x! = λ^{x+1} t^x / x!

e^{λt} f_t(x + 1) = ∫ (λ^{x+1} / x!) t^x dt = λ^{x+1} t^{x+1} / (x! (x + 1)) + C
Again using the boundary condition, with an interval of length 0, we have that C must be 0 for x > 0.
Thus, we have
f_t(x + 1) = (λt)^{x+1} e^{−λt} / (x + 1)!
Thus, if the inductive hypothesis holds for x then it also holds for x+1. So by principle of mathematical
induction we have that the hypothesis is true for all natural numbers x.
Note that this derivation is fairly complex. If at first you do not understand don’t worry. Try reading
it again later.
Setting p = µ/n, we have

f(x) = (n choose x) p^x (1 − p)^{n−x} = [n^(x) / x!] (µ/n)^x (1 − µ/n)^{n−x}

     = (µ^x / x!) × [n^(x) / n^x] × (1 − µ/n)^n (1 − µ/n)^{−x}

     = (µ^x / x!) × [n × (n − 1) × (n − 2) × · · · × (n − x + 1) / n^x] × (1 − µ/n)^n (1 − µ/n)^{−x}

(Note the middle term's numerator is the product of x terms.)

     = (µ^x / x!) × (n/n) × ((n − 1)/n) × ((n − 2)/n) × · · · × ((n − x + 1)/n) × (1 − µ/n)^n (1 − µ/n)^{−x}

     = (µ^x / x!) × 1 × (1 − 1/n) × (1 − 2/n) × · · · × (1 − (x − 1)/n) × (1 − µ/n)^n (1 − µ/n)^{−x}

Letting n → ∞ with µ = np held fixed, each factor (1 − i/n) → 1, (1 − µ/n)^n → e^{−µ} and (1 − µ/n)^{−x} → 1, so

f(x) → µ^x e^{−µ} / x!

which is the Poisson probability function.
Example 8.6.1
Suppose a fire station gets 15 phone calls every 5 minutes on average. The rate of occurrence per minute is λ = 3.
Then, if we are interested in the number of phone calls in 10 minutes, we have that the average number
of phone calls in an interval of ten minutes is µ = 10λ = 30.
Example 8.7.1
Suppose spiders of a certain type catch flies in their webs at a rate of 2 per hour. If there are 10 such spiders,
what is the probability that more than 6 of the spiders catch less than 4 flies in 2 hours?
First we find the probability that a single spider catches less than 4 flies in 2 hours. Let X be the
number of flies the spider catches in 2 hours. The average number of flies caught in 2 hours is µ = 4.
Thus, we have X ∼ Poisson(4).
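The rest of the calculation is not shown above; one way it might be finished (assuming the 10 spiders catch flies independently of one another) is sketched below:

```python
from math import comb, exp, factorial

# Step 1: one spider; X ~ Poisson(4) flies in 2 hours, and we want P(X < 4) = P(X <= 3)
p_few = sum(4**x * exp(-4) / factorial(x) for x in range(4))    # about 0.4335

# Step 2: assuming the 10 spiders are independent, the number of spiders that
# catch fewer than 4 flies is Y ~ Binomial(10, p_few); we want P(Y > 6)
p_more_than_6 = sum(comb(10, y) * p_few**y * (1 - p_few)**(10 - y)
                    for y in range(7, 11))

print(round(p_more_than_6, 3))    # roughly 0.084
```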
It is important to remember that more than one distribution can be necessary. A common mistake
is to correctly use one distribution and not realize the need for another.
Chapter 9
Mean and Variance

(frequency histogram: frequency on the vertical axis, with bars over the outcomes X = 1, 2, 3, 4)
Frequency distributions are good summaries of data because they show the variability in the observed
outcomes clearly. Another way to summarize results are single-number summaries such as the following:
The mean of a sample of outcomes is the average value of the outcomes. It is the sum of the outcomes
divided by the total number of outcomes. The mean of n outcomes, x1, . . . , xn, for a random variable X is

(1/n) Σ_{i=1}^{n} xi = (x1 + x2 + · · · + xn) / n
The median of a sample is an outcome such that half the outcomes are before it and half the outcomes
are after it when the outcomes are arranged in numerical order.
The mode of a sample is the outcome that occurs most frequently. There can be multiple equal modes
in a sample.
Example 9.1.1
A fisherman records the weight of each fish he catches for a week. These are his results. Each value
represents the weight, in pounds, of a fish he caught.
{ 20, 23, 19, 27, 17, 22, 18, 15, 23, 25, 18, 23, 29 }
A frequency distribution of the sample above is
X    Frequency
15   1
17   1
18   2
19   1
20   1
22   1
23   3
25   1
27   1
29   1
And the following is a frequency histogram:

(frequency histogram of the fish weights: frequency on the vertical axis, weight X from 15 to 29 pounds on the horizontal axis, with the tallest bar of height 3 at X = 23)
The median is the middle outcome when the sample is arranged in numerical order:

15, 17, 18, 18, 19, 20, 22, 23, 23, 23, 25, 27, 29

With 13 outcomes, the median is the 7th value, 22 lbs.
The mode is the weight that occurs most frequently. It corresponds to the tallest bar on the histogram.
23 lbs occurs the most (3 times), so it is the mode.
Note that in order to calculate the expected value of a random variable X, we often need to know
the distribution, and hence the probability function, of X.
x       5      10     25     100    200
f(x)    1/3    1/15   13/30  2/15   1/30
Using the relative frequency definition of probability, we know that if we observed a very large number
of outcomes, the fraction of times X = x occurs (relative frequency of x) is f (x).
Thus, in theory, we would expect the mean of a sample of infinitely many outcomes to be

(1/3)(5) + (1/15)(10) + (13/30)(25) + (2/15)(100) + (1/30)(200) ≈ 33.167
This theoretical mean is denoted by µ or E(X), and is known as the expected value of X.
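A quick check of this calculation (not part of the original notes), using exact fractions from the table above:

```python
from fractions import Fraction as F

# Probability function from the table above
f = {5: F(1, 3), 10: F(1, 15), 25: F(13, 30), 100: F(2, 15), 200: F(1, 30)}

assert sum(f.values()) == 1       # the probabilities must sum to 1

mean = sum(x * p for x, p in f.items())
print(mean, float(mean))          # 199/6, approximately 33.167
```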
Example 9.2.1
A slot machine in a casino costs $5 to play. It has probability 0.5 of paying out $2, 0.2 of paying out
$5, 0.1 of paying out $10, and otherwise it does not pay out anything. Let the random variable X be the
amount of money (in dollars) the machine pays out in one play, and Y be the amount of money won
or lost in one play. Find E(X) and E(Y).
Here E(X) = 0.5(2) + 0.2(5) + 0.1(10) + 0.2(0) = 3 and, since Y = X − 5, E(Y) = E(X) − 5 = −2; on average
a player loses $2 per play.
Example 9.2.2
A nightclub lets groups of up to 6 people enter at reduced fees. A randomly selected group in the
nightclub’s line has the following probabilities for its size and cost of entry:
Size of Group (X) Cost of Entry (Y) Probability
1 $10 0.1
2 $18 0.15
3 $26 0.1
4 $34 0.3
5 $42 0.15
6 $50 0.2
1. Find the expected value of the size, X, of a randomly selected group.
2. If the cost of entry of a group (Y) is 8 × (the size of the group) + 2, find the expected value of the
cost of entry, in dollars, of a randomly selected group.
3. Show that the expected value of the cost of entry of a randomly selected group is 8 ×
the expected value of the size of the group + 2.
Theorem 9.2.1
Let X be a discrete random variable with a range of A, and probability function f (x). The expected
value of some function g(X) is given by
E[g(X)] = Σ_{x∈A} g(x) f(x)
Proof 9.2.1:
Let the random variable Y = g(X) have a range of B and a probability function fY (y) = P (Y = y).
E[g(X)] = E(Y) = Σ_{y∈B} y fY(y)
Now, let Cy be { x | g(x) = y }, that is the set of all values of x such that g(X) is y. So
fY(y) = P[g(X) = y] = Σ_{x∈Cy} f(x)
That is, the probability that Y = y is the sum of the probabilities that X = x such that g(x) = y.
Now, we have
E[g(X)] = Σ_{y∈B} y fY(y) = Σ_{y∈B} y Σ_{x∈Cy} f(x) = Σ_{y∈B} Σ_{x∈Cy} y f(x)

        = Σ_{y∈B} Σ_{x∈Cy} g(x) f(x)

Note that the inner summation is over all x such that g(x) = y and the outer is over all y. Thus the double sum
is a sum over all x. So

E[g(X)] = Σ_{y∈B} Σ_{x∈Cy} g(x) f(x) = Σ_{x∈A} g(x) f(x)
Theorem 9.2.2
For constants a, b and c,

E[a g1(X) + b g2(X) + c] = a E[g1(X)] + b E[g2(X)] + c

Proof 9.2.2:

E[a g1(X) + b g2(X) + c] = Σ_{all x} [a g1(x) + b g2(x) + c] f(x)

= Σ_{all x} [a g1(x) f(x) + b g2(x) f(x) + c f(x)]

= Σ_{all x} a g1(x) f(x) + Σ_{all x} b g2(x) f(x) + Σ_{all x} c f(x)

= a Σ_{all x} g1(x) f(x) + b Σ_{all x} g2(x) f(x) + c Σ_{all x} f(x)

= a E[g1(X)] + b E[g2(X)] + c     (recall Σ_{all x} f(x) = 1)
Var(X) = σ^2 = E[(X − µ)^2]
where σ is the standard deviation of X. It is the average squared deviation of a random variable from its
mean. It measures how far out from the mean the values of a random variable are spread.
The definition and formula above are useful for understanding the variance's importance, but they can be
difficult to use to actually calculate the variance. Here are a few other useful formulas for calculating the
variance of a random variable:

Var(X) = σ^2 = E[(X − µ)^2]
        = E[X^2 − 2Xµ + µ^2]
        = E(X^2) − 2µ E(X) + µ^2
        = E(X^2) − µ^2          (9.1)

Var(X) = σ^2 = E[X(X − 1) + X] − µ^2
        = E[X(X − 1)] + E(X) − µ^2
        = E[X(X − 1)] + µ − µ^2
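These identities are easy to verify numerically; here is a small sketch (my own toy distribution, not from the notes):

```python
# An arbitrary small probability function, for illustration only
f = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

mu    = sum(x * p for x, p in f.items())
EX2   = sum(x**2 * p for x, p in f.items())
EXX1  = sum(x * (x - 1) * p for x, p in f.items())

var_def  = sum((x - mu)**2 * p for x, p in f.items())   # E[(X - mu)^2]
var_alt1 = EX2 - mu**2                                   # E(X^2) - mu^2
var_alt2 = EXX1 + mu - mu**2                             # E[X(X-1)] + mu - mu^2

# All three agree (up to floating-point rounding): 0.81
print(round(var_def, 6), round(var_alt1, 6), round(var_alt2, 6))
```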
Properties of Variance
Theorem 9.3.1
For constants a and b,
Var[aX + b] = a2 Var(X)
Proof 9.3.1:

Var[aX + b] = E[(aX + b)^2] − [E(aX + b)]^2,     by equation (9.1)
            = E[a^2 X^2 + 2abX + b^2] − [a E(X) + b]^2
            = a^2 E(X^2) + 2ab E(X) + b^2 − a^2 [E(X)]^2 − 2ab E(X) − b^2
            = a^2 (E(X^2) − [E(X)]^2)
            = a^2 Var(X)
A useful property of the standard deviation is that it is expressed in the same units as the random
variable, unlike the variance.
Binomial Distribution
Let X be a random variable such that X ∼ Binomial(n, p).
The expected value of X is
µ = E(X) = np
The variance of X is
σ 2 = Var(X) = np(1 − p)
Poisson Distribution
Let X be a random variable such that X ∼ Poisson(µ).
The expected value of X is
µ = E(X) = µ
The variance of X is
σ 2 = Var(X) = µ
Now to find Var(X) we consider E[X(X − 1)]. We will use the same technique as above
E[X(X − 1)] = Σ_{x=0}^{n} x(x − 1) f(x) = Σ_{x=0}^{n} x(x − 1) (n choose x) p^x (1 − p)^{n−x}

(when x = 0 or x = 1, the summation terms are also 0, so we can ignore them)

= Σ_{x=2}^{n} [n(n − 1)(n − 2)! / ((x − 2)! [(n − 2) − (x − 2)]!)] p^2 p^{x−2} (1 − p)^{(n−2)−(x−2)}

= n(n − 1) p^2 (1 − p)^{n−2} Σ_{x=2}^{n} (n − 2 choose x − 2) [p / (1 − p)]^{x−2}

Let y = x − 2:   = n(n − 1) p^2 (1 − p)^{n−2} Σ_{y=0}^{n−2} (n − 2 choose y) [p / (1 − p)]^y

(by the Binomial Theorem)   = n(n − 1) p^2 (1 − p)^{n−2} [1 + p/(1 − p)]^{n−2}

= n(n − 1) p^2 (1 − p)^{n−2} [1/(1 − p)]^{n−2} = n(n − 1) p^2
So we have

Var(X) = E[X(X − 1)] + µ − µ^2 = n(n − 1)p^2 + np − (np)^2 = np − np^2 = np(1 − p)

Poisson Distribution
For X ∼ Poisson(µ), the probability function is

f(x) = P(X = x) = µ^x e^{−µ} / x!
So we will use this to find E(X). We have
E(X) = Σ_{x=0}^{∞} x f(x) = Σ_{x=0}^{∞} x µ^x e^{−µ} / x!
(when x = 0, the summation term is also 0, so we can ignore it)
= µ e^{−µ} Σ_{x=1}^{∞} µ^{x−1} / (x − 1)!
Let y = x − 1:   = µ e^{−µ} Σ_{y=0}^{∞} µ^y / y!

(recall e^µ = Σ_{y=0}^{∞} µ^y / y!)   = µ e^{−µ} e^µ = µ
Now to find Var(X) we consider E[X(X − 1)]. We will use the same technique as above:

E[X(X − 1)] = Σ_{x=0}^{∞} x(x − 1) f(x) = Σ_{x=0}^{∞} x(x − 1) µ^x e^{−µ} / x!

(when x = 0 or x = 1, the summation terms are also 0, so we can ignore them)

= µ^2 e^{−µ} Σ_{x=2}^{∞} µ^{x−2} / (x − 2)!

Let y = x − 2:   = µ^2 e^{−µ} Σ_{y=0}^{∞} µ^y / y!

= µ^2 e^{−µ} e^µ = µ^2
So we have
Var(X) = E[X(X − 1)] + µ − µ2 = µ2 + µ − µ2 = µ
Chapter 10
Continuous Random Variables
Theorem 10.1.1
If F is an arbitrary cumulative distribution function and U is uniform on [0, 1], then the random
variable X defined by X = F⁻(U), where F⁻(y) = min{ x | F(x) ≥ y }, has cumulative distribution
function F(x).
Proof 10.1.1:
Note that, for all U < F(x), we have that X ≤ x by applying the inverse function F⁻ to both sides. Now,
by applying F to both sides of X ≤ x, we have U ≤ F(x), for all x. So we can say

P[U < F(x)] ≤ P(X ≤ x) ≤ P[U ≤ F(x)]

Note that P[U < F(x)] = P[U ≤ F(x)] = F(x) since U is uniform and continuous. Thus,

F(x) ≤ P(X ≤ x) ≤ F(x)

and so P(X ≤ x) = F(x).
Example 10.1.1
Suppose we have a random variable U that is uniform on [0, 1] and we want to generate a random
variable X with an exponential distribution. We have that the cumulative distribution function of X is
FX(x) = 1 − e^{−λx} for some λ > 0. Since FX(x) is a continuous, strictly increasing function for x > 0, we
can find its inverse by solving y = FX(x) for x:
y = 1 − e^{−λx}
1 − y = e^{−λx}
ln(1 − y) = −λx
x = −ln(1 − y) / λ

So FX⁻¹(y) = −ln(1 − y) / λ.
Thus, by Theorem 10.1.1, X = FX⁻¹(U) has cumulative distribution function FX(x):

X = −(1/λ) ln(1 − U)

Now, to find fX(x), the probability density function of X, we differentiate the cumulative distribution
function. So,

fX(x) = d/dx [1 − e^{−λx}] = λ e^{−λx}, for x > 0
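A small simulation sketch (not from the notes) of this inverse-transform construction; the sample mean is compared against the known exponential mean 1/λ:

```python
import math
import random

def exponential_sample(lam):
    # Inverse-transform method: X = -ln(1 - U) / lambda with U ~ Uniform[0, 1)
    u = random.random()
    return -math.log(1 - u) / lam

lam = 2.0
samples = [exponential_sample(lam) for _ in range(100_000)]

# The sample mean should be close to the exponential mean 1/lambda = 0.5
print(sum(samples) / len(samples))
```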
Normal Distribution
A random variable X has a Normal distribution if it has probability density function

f(x) = [1 / (σ √(2π))] e^{−(1/2)((x − µ)/σ)^2}, for x ∈ R

where µ ∈ R and σ > 0 are parameters. It turns out that E(X) = µ and Var(X) = σ^2. We write
X ∼ N(µ, σ^2), where X has expected value µ and variance σ^2.
The Normal distribution is the most widely used distribution in probability and statistics. Physical
processes leading to the Normal distribution exist but are a little complicated to describe.
The graph of the probability density function f (x) is symmetric about the line x = µ. The shape of the
graph is often termed a “bell shape” or “bell curve”.
(graph of the Normal probability density function f(x): a symmetric bell-shaped curve centred at x = µ, with the horizontal axis marked at µ ± σ, µ ± 2σ, µ ± 3σ and µ ± 4σ)
This integral cannot be given a simple mathematical expression, so numerical methods are used to compute
its value for given x, µ, σ. Before computers could solve such problems, tables of probabilities F(x) were
created by numerical integration. Only the table of the standard normal distribution, N(0, 1), is required to
find F(x) for all µ, σ since, with a change of variable, the c.d.f. of any normal distribution can be related
to that of the standard normal distribution.
Theorem 10.2.1
Let X ∼ N(µ, σ^2). If Z = (X − µ)/σ, then Z ∼ N(0, 1) and

P(X ≤ x) = P(Z ≤ (x − µ)/σ)
Proof 10.2.1:
Let X ∼ N(µ, σ^2).

P(X ≤ x) = ∫_{−∞}^{x} [1 / (σ √(2π))] e^{−(1/2)((y − µ)/σ)^2} dy

Let z = (y − µ)/σ:   = ∫_{−∞}^{(x−µ)/σ} [1 / √(2π)] e^{−(1/2) z^2} dz

= P(Z ≤ (x − µ)/σ)
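As an illustration (not part of the notes), the standardization in Theorem 10.2.1 combined with the error function gives normal probabilities numerically; the relation Φ(z) = (1 + erf(z/√2))/2 is a standard identity for the standard normal c.d.f.:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    # Standardize, then evaluate the standard normal c.d.f.
    # using Phi(z) = (1 + erf(z / sqrt(2))) / 2
    z = (x - mu) / sigma
    return (1 + erf(z / sqrt(2))) / 2

# P(X <= 6) for X ~ N(5, 4), i.e. mu = 5 and sigma = 2;
# this equals P(Z <= 0.5), which is about 0.6915
print(round(normal_cdf(6, 5, 2), 4))
```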