
STATS 230 Course Notes

Joshua Allum

April 2017

Contents

1 Introduction
1.1 Defining Probability

2 Mathematical Probability Models
2.1 Sample Spaces
2.2 Assigning Probabilities

3 Counting Techniques
3.1 Counting Arguments
3.2 Counting Arrangements
3.3 Notations
3.4 Counting Subsets
3.5 Properties of Combinations
3.6 Counting Arrangements of Set with Repeated Elements

4 Probability Rules
4.1 General Rules
4.2 Venn Diagrams
4.3 De Morgan's Laws
4.4 Rules for Unions of Events
4.5 Mutually Exclusive Events
4.6 Independence of Events

5 Conditional Probability
5.1 Theorems and Rules for Conditional Probability
5.2 Tree Diagrams

6 Useful Sums and Series
6.1 Geometric Series
6.2 Binomial Theorem
6.3 Multinomial Theorem
6.4 Hypergeometric Identity
6.5 Exponential Series
6.6 Integer Series

7 Discrete Random Variables and Probability Functions
7.1 Random Variables
7.2 Probability Function
7.3 Cumulative Distribution Function

8 Discrete Distributions
8.1 Uniform Distribution
8.2 Hypergeometric Distribution
8.3 Binomial Distribution
8.3.1 Comparison of Binomial and Hypergeometric Distributions
8.3.2 Binomial Estimate of the Hypergeometric Distribution
8.4 Negative Binomial Distribution
8.5 Geometric Distribution
8.6 The Poisson Distribution
8.6.1 Poisson Estimate to the Binomial Distribution
8.6.2 Parameters µ and λ
8.6.3 Distinguishing the Poisson Distribution from other Distributions
8.7 Combining Models

9 Mean and Variance
9.1 Summarizing Data on Random Variables
9.2 Expected Value of a Random Variable
9.3 Variance of a Random Variable
9.3.1 Standard Deviation of a Random Variable
9.4 Expected Value and Variance of Discrete Distributions
9.4.1 Derivations of Expected Values and Variances

10 Continuous Random Variables
10.1 Computer Generated Random Variables
10.2 Normal Distribution

Chapter 1

Introduction

1.1 Defining Probability


The Classical Definition
The probability of an event is
the number of ways the event may occur
the total number of possible outcomes
provided all outcomes are equally likely.

Example 1.1.1
The probability of a fair die landing on 3 is 1/6 because there is one way in which the die may land on 3, and there are 6 possible faces the die may land on in total. The sample space of the experiment, S, is {1, 2, 3, 4, 5, 6} and the event occurs in only one of these six outcomes.

The main limitation of this definition is that it demands that the outcomes of a sample space are equally likely. This is a problem since a definition of "likelihood" (probability) is needed to include this postulate in a definition of probability itself.

The Relative Frequency Definition


The probability of an event is the limiting proportion of times that an event occurs in a large number of
repetitions of an experiment.

Example 1.1.2
The probability of a fair die landing on 3 is 1/6 because after a very large series of repetitions (ideally infinite) of rolling the die, the fraction of times the face with 3 is rolled tends to 1/6.
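This limiting behaviour can be illustrated by simulation. A short Python sketch (the function name here is illustrative):

```python
import random

def relative_frequency(n_trials, seed=1):
    """Proportion of rolls of a fair die that land on 3, over n_trials rolls."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, 6) == 3)
    return hits / n_trials

# The proportion is noisy for few repetitions but settles near 1/6 ≈ 0.1667
# as the number of repetitions grows.
print(relative_frequency(100))
print(relative_frequency(1_000_000))
```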

The main limitation of this definition is that we can never repeat a process indefinitely so we can
never truly know the probability of an event from this definition. Additionally, in some cases we
cannot even obtain a long series of repetitions of processes to produce an estimate due to restrictions
on cost, time, etc.

The Subjective Definition


The probability of an event occurring is a measure of how sure the person making the statement is that the
event will occur.

Example 1.1.3
The probability that a football team will win their next match can be predicted by experts who
regard all the data of past matches and current situations to provide a subjective probability.

This definition is subjective and leads to different people assigning different probabilities to the same events, with no clear "right" answer. Thus, by this definition, probability is not an objective science.

Probability Model
To avoid many of the limitations of the definitions of probability, we can instead treat probability as a mathematical system defined by a set of axioms. Thus, we can ignore the numerical values of probabilities until we consider a specific application. The model is defined as follows:
• A sample space of all possible outcomes of a random experiment is defined.

• A set of events, to which we may assign probabilities, is defined.


• A mechanism for assigning probabilities to events is specified.

Chapter 2

Mathematical Probability Models

2.1 Sample Spaces


A sample space, S, is a set of distinct outcomes for an experiment or process, with the property that in a
single trial, one and only one of these outcomes occurs. The outcomes that make up a sample space are
called sample points or simply points.

Example 2.1.1
The sample space for a roll of a six-sided die is

{a1 , a2 , a3 , a4 , a5 , a6 } where ai is the event the top face is i

More simply we could define the sample space as

{1, 2, 3, 4, 5, 6}

Note that a sample space of a probability model for a process is not necessarily unique. Often, however, we try to choose sample points that are the smallest possible or "indivisible".

Example 2.1.2
If we define E to be the event that the top face of a six-sided die is even when rolled and O to be the
event the top-face is odd, then the sample space, S, can be defined as

{E, O}

This is the same process as Example 2.1.1 (rolling a six-sided die), so since the sample spaces differ,
clearly, sample spaces are not unique. Moreover, if we are interested in the event that a 3 is rolled,
this sample space is not suitable since it groups the event in question with other events.

A sample space can be either discrete or non-discrete. If a sample space is discrete, it consists of a finite or countably infinite number of "simple events". A countably infinite set is one that can be put into a one-to-one correspondence with the set of natural numbers. For example, { 1, 1/2, 1/3, 1/4, . . . } is countably infinite whereas { x | x ∈ R } is not.

Simple Events
An event in a discrete sample space is a subset of the sample space, i.e., A ⊂ S. If the event is indivisible,
so as to only contain one point, we call it a simple event, otherwise it is a compound event.

Example 2.1.3
A simple event for a roll of a six-sided die is A = {a1 } where ai is the event the top face is i. A
compound event is E = {a2 , a4 , a6 }.

2.2 Assigning Probabilities


Let S = {a1, a2, a3, . . .} be a discrete sample space. We assign probabilities, P(ai), for i = 1, 2, 3, . . . to each sample point ai such that the following two conditions hold:

• 0 ≤ P(ai) ≤ 1
• Σ_{all i} P(ai) = 1

The set of probabilities { P(ai) | i = 1, 2, 3, . . . } is called a probability distribution on S.

Note that P is a function with the sample space as its domain.

The second condition, that the sum of the probabilities of all sample points is 1, reflects the fact that in a given experiment exactly one simple event in the sample space must occur. Every experiment or process always has an outcome, so the probability that some outcome is achieved must be 1.

Compound Events
The probability of an event A is the sum of the probabilities of all the simple events that make up A:

P(A) = Σ_{a∈A} P(a)

Example 2.2.1
In the previous example we saw that E = {a2 , a4 , a6 } is a compound event. Thus, the probability of
the compound event E is
P (E) = P (a2 ) + P (a4 ) + P (a6 )
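This definition can be mirrored directly in code. A minimal Python sketch, with the distribution stored as a dictionary (the names S and P are illustrative):

```python
# A discrete probability distribution as a mapping from sample points to
# probabilities; P(A) is the sum over the simple events that make up A.
S = {f"a{i}": 1/6 for i in range(1, 7)}   # fair six-sided die

def P(event):
    return sum(S[a] for a in event)

E = {"a2", "a4", "a6"}
print(P(E))                # ≈ 0.5 (up to floating-point rounding)
print(sum(S.values()))     # ≈ 1, as the two conditions require
```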

Note that the probability model that we defined does not specify what actual numbers to assign to
the simple events of a process. It only defines the properties that guarantee mathematical consistency.
Thus, if we assigned P (a2 ) to be 0.9, our model would still be mathematically consistent but would
not be consistent with the frequencies we obtain in multiple repetitions of the experiment.

In actual practice we try to define probabilities that are approximately consistent with the frequencies of the events in multiple repetitions of the process.

Complements
The complement of an event, A, is the set of all outcomes not included in A, and is denoted by A̅.

Example 2.2.2
If E = { a1, a3, a5 } is a compound event on the sample space { a1, a2, a3, a4, a5, a6 }, then the complement of E is

E̅ = { a2, a4, a6 }

Because of the nature of complementary events, two complementary events cannot both occur in one
process. The events are mutually exclusive.

Chapter 3

Counting Techniques

3.1 Counting Arguments


If we have a sample space, S, of some experiment that has a uniform distribution (all sample points are
equally likely), then we can calculate the probability of a compound event A as the number of sample points
in A divided by the total number of sample points.
P(A) = k/n
where k is the number of sample points in A and n is the total number of sample points in the sample space.

Addition Rule
Suppose we can perform process 1 in p ways and process 2 in q ways. If we want to do process 1 or process 2, but not both, then there are p + q ways to do so.

Example 3.1.1
Suppose a keyboard only has 26 letters and 20 special characters (!%#$), there are 46 ways in which a
typist may type a single character. (Process 1: typing a letter. Process 2: typing a special character).

Multiplication Rule
Again, suppose we can perform process 1 in p ways and process 2 in q ways. If we want to do process 1 and process 2, then there are p × q ways to do so. This is because for each way of doing process 1 we can do process 2 in q ways.

Example 3.1.2
Suppose the same typist with the same keyboard wants to type a single letter and a single special
character. The typist can do so in 520 ways, since there are 26 ways to select the letter and for each
possible letter selection there are 20 possible special character selections.

Try to associate OR with addition and AND with multiplication in your mind.

Often, ORs and ANDs are not explicit or obvious, so you must re-word your problem to identify implicit ORs and ANDs.

Example 3.1.3
A young boy gets to pick 2 toys from a store for his birthday. How many ways can he pick 2 toys if
the store contains 12 toys? He may pick the same toy multiple times and picks the toys at random.

We can re-word this problem as follows: A young boy selects one of 12 toys and again, selects one of
12 toys. Thus there are 12 × 12 = 144 ways in which he can select 2 toys. Furthermore, we have that
since selections are random, each selection is equally likely. So the probability that the boy selects
any pair of toys is 1/144.

In this case the boy was allowed to select the same toy more than once. This is often referred to as
with replacement. The addition and multiplication rules are generally sufficient to find probability
of processes with replacement but if processes occur without replacement solutions become more
complex and other techniques are often used.

The phrase at random or uniformly, indicates that each point in the sample space is equally likely.

Example 3.1.4
Consider a farmer with 500 different seeds. How many ways can he select 3 seeds randomly to plant?

We can re-word this problem to become: A farmer selects one seed from 500, then one seed from the remaining 499, and then one seed from the remaining 498. So there are 500 × 499 × 498 ways to do so.

Now, how many ways can he select 5 and 50 seeds randomly?


He can select 5 seeds in 500 × 499 × 498 × 497 × 496 ways and 50 seeds in 500 × · · · × 451 ways.

Generally, if there are n ways of doing a process and it is done k times without replacement (that is, each specific way of doing the process can be used only once), there are n × (n − 1) × · · · × (n − k + 1) ways to do it.

3.2 Counting Arrangements


When the sample space of a process is a set of arrangements of elements, like { abc, acb, bac, bca, cab, cba },
the sample points are called the permutations. Assuming all n elements we are arranging are unique, how
many sample points are there?
Consider trying to fill n boxes: □ □ · · · □. We have n ways to fill the first box (each element can go in the first box), (n − 1) ways to fill the second box, and so on until we have 1 way to fill the nth box. So there are n × (n − 1) × · · · × 1 total permutations in the sample space.

Example 3.2.1
Consider the letters of the word “fiesta”. A baby (who cannot spell) randomly rearranges the letters
of the word. What is the probability that “fiesta” is the outcome?

There are six boxes to fill: □ □ □ □ □ □. We have 6 ways to fill the first position, 5 ways to fill the second, and so on until we have 1 way to fill the 6th position. The number of points in the sample space is 6 × 5 × 4 × 3 × 2 × 1 = 720. So the probability of each outcome in the sample space is 1/720.

Example 3.2.2
Consider the letters of the word "snake". If arranged randomly, what is the probability that the word formed begins with a vowel?

There are five boxes to fill: □ □ □ □ □. There are two ways to fill the first box (a or e), and for each of these ways there are four remaining boxes to fill. The number of ways to fill the 4 remaining boxes is 4 × 3 × 2 × 1 = 24, so the total number of outcomes in which the first letter is a vowel is 2 × 24 = 48. The five boxes can be filled by any letter to obtain a point in the sample space, so there are 5 × 4 × 3 × 2 × 1 = 120 sample points. So the probability of the event occurring is 48/120 = 2/5.
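With only 120 permutations, this example can also be checked by brute-force enumeration:

```python
from itertools import permutations

# All arrangements of the letters of "snake" (all letters distinct).
words = list(permutations("snake"))
vowel_first = [w for w in words if w[0] in "ae"]

print(len(words))                        # 120 sample points
print(len(vowel_first))                  # 48 outcomes starting with a vowel
print(len(vowel_first) / len(words))     # 0.4, i.e. 2/5
```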

Example 3.2.3
Suppose we have 7 meals to distribute randomly to 7 people (one each). Three of the meals are gluten
free and the other four are not. Of the 7 people, two of them cannot eat gluten. How many ways are
there to distribute the meals without giving gluten to someone who cannot eat it?

We can liken this to the boxes example with each person being a box. Let the first two boxes be the people who cannot eat gluten:

□ □ □ □ □ □ □

Since we cannot place a meal containing gluten in boxes 1 or 2, we have 3 ways to fill box 1 (one of the 3 gluten-free meals) and then 2 ways to fill box 2. So there are 6 ways to distribute meals to the gluten-free people:

G G □ □ □ □ □

Now there are 5 boxes to be filled with any of 5 meals, so there are 5 × 4 × 3 × 2 × 1 = 120 ways to distribute the meals to the other 5 people. This is an implicit AND statement, thus there are 6 × 120 = 720 ways to distribute the meals.
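Again the count is small enough to verify by enumeration; a Python sketch with illustrative meal labels:

```python
from itertools import permutations

# 3 gluten-free meals ("G...") and 4 regular meals ("R..."), all distinct.
meals = ["G1", "G2", "G3", "R1", "R2", "R3", "R4"]

# Positions 0 and 1 are the two people who cannot eat gluten.
valid = [p for p in permutations(meals)
         if p[0].startswith("G") and p[1].startswith("G")]
print(len(valid))   # 720
```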

3.3 Notations
Because some calculations occur very frequently in statistics, we define notations that help us deal with such problems.

Factorial
We define n! for any natural number n to be

n! = n × (n − 1) × (n − 2) × · · · × 1

and in order to maintain mathematical consistency we define 0! to be 1. This is the number of arrangements
of n possible unique elements, using each once.

n to k Factors
We define n^(k) to be

n^(k) = n × (n − 1) × · · · × (n − k + 1) = n!/(n − k)!

This is the number of arrangements of length k using each element, of n possible unique elements, at most once.

Powers
As in ordinary mathematics, n^k = n × n × · · · × n (k factors). This represents the number of arrangements of length k that can be made using each element, of n possible unique elements, as often as we wish (with replacement).

For many problems it is simply impractical to try to count the number of cases by conventional means
because of how big the numbers become. Notations such as n! and nk allow us to deal with these
large numbers effectively.
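The three notations correspond directly to functions in Python's math module (math.perm requires Python 3.8 or later):

```python
import math

n, k = 10, 3
print(math.factorial(n))   # n!: arrangements of all n unique elements
print(math.perm(n, k))     # n^(k) = n!/(n-k)!: length-k arrangements, no replacement
print(n ** k)              # n^k: length-k arrangements, with replacement
```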

Example 3.3.1
An evil advertising company randomly chooses 7-digit phone numbers to call to try to sell products. Find the probabilities of the following events:
• A: the number is your phone number
• B: the first three numbers are each less than 5
• C: the first and last numbers match your phone number
Now assume that all 7 digits are unique (chosen without replacement):
• D: the number is 210-3869
• E: the first three numbers are each less than 5
• F : the first and last numbers are 1 and 2 respectively

A: The initial sample space contains all the ways that one can select 7 numbers from the numbers 0 to 9 with replacement. There are 10 choices for each of the seven numbers, therefore the sample space contains 10^7 points. Thus, since all points are equally likely, P(A) = 1/10^7.

B: If the first three numbers are less than 5, there are 5 ways (0 to 4) to select each of the first three numbers and 10 ways to select each of the next four numbers. So there are 5^3 × 10^4 points in B. Therefore, P(B) = (5^3 × 10^4)/10^7.

C: There is only one way to select the first number such that it matches your number, and the same is true for the last number. Thus, we must only consider the middle digits. There are 10 choices for each of the middle five numbers, so there are 10^5 points in C. Therefore, P(C) = 10^5/10^7 = 1/10^2.

D: The new sample space contains all the ways that one can select 7 numbers from the numbers 0 to 9 without replacement. There are 10 choices for the first number, 9 for the second, and so on until there are 4 choices for the last number. Thus, there are 10^(7) points in the sample space and, since each is equally likely, P(D) = 1/10^(7) = 1/(10 × 9 × 8 × 7 × 6 × 5 × 4).

E: If the first three numbers are less than five, there are 5 ways to select the first number, 4 for the second, and 3 for the third, so there are 5^(3) ways to select the first 3 numbers. The next 4 digits may be selected from any of the 7 digits that were not used as one of the first 3, so there are 7^(4) ways to select the final four digits. Therefore, there are 5^(3) × 7^(4) points in E. So P(E) = (5^(3) × 7^(4))/10^(7).

F: There is only one way to select the first and last digits as 1 and 2 respectively, so we must only consider the middle 5 digits. The 5 digits are selected from the remaining 8 numbers without replacement, so there are 8^(5) ways to do this. Therefore, P(F) = 8^(5)/10^(7).
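All six probabilities can be computed numerically; a sketch using math.perm for the falling factorials:

```python
import math

total_with = 10 ** 7               # sample space size, with replacement
total_without = math.perm(10, 7)   # sample space size, without replacement

P_A = 1 / total_with
P_B = (5 ** 3 * 10 ** 4) / total_with
P_C = 10 ** 5 / total_with
P_D = 1 / total_without
P_E = math.perm(5, 3) * math.perm(7, 4) / total_without
P_F = math.perm(8, 5) / total_without

print(P_B, P_C)   # 0.125 and 0.01
```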

3.4 Counting Subsets


In many problems, you will encounter a sample space, S, of some experiment that consists of fixed-length
subsets of some set.

Combinations
We define (n choose k) to be the number of subsets of size k that can be selected from a set of n elements. We have

(n choose k) = n^(k)/k! = n!/((n − k)! k!)

It is read "n choose k".

Derivation of Choose
Suppose we have a set of n unique elements and we wish to select a subset of size k, such that k ≤ n, and the elements of the subset are unique (selected without replacement). If we use the boxes metaphor we have k empty boxes:

□ □ · · · □   (k boxes)

There are n ways to select the first element of the subset, (n − 1) ways to select the second, and so on until there are (n − k + 1) ways to select the kth and last element.
So there are n^(k) ways to fill the k boxes, but note that some of the subsets will contain all the same elements as each other, only in varying order. These subsets are not unique since we do not care about the arrangement of the items in a subset. Each unique subset can be arranged to form k! permutations of its k elements. Thus, the number of unique subsets, (n choose k), multiplied by the number of arrangements of each subset, k!, is n^(k). Therefore, we have

(n choose k) × k! = n^(k)

So it follows that

(n choose k) = n^(k)/k!
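The identity at the heart of this derivation can be checked exhaustively for small n with Python's built-in combinatoric functions:

```python
import math

# (n choose k) × k! = n^(k), checked for all 0 <= k <= n <= 11.
for n in range(12):
    for k in range(n + 1):
        assert math.comb(n, k) * math.factorial(k) == math.perm(n, k)

print(math.comb(10, 4))   # 210 subsets of size 4 from 10 elements
```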

3.5 Properties of Combinations

Here are a few properties of (n choose k).
• n^(k) = n!/(n − k)! = n × (n − 1)^(k−1) for k ≥ 1
• (n choose k) = n!/((n − k)! k!) = n^(k)/k!
• (n choose k) = (n choose n − k) for all 0 ≤ k ≤ n
• (n choose 0) = (n choose n) = 1
• (1 + x)^n = (n choose 0) + (n choose 1)x + (n choose 2)x^2 + · · · + (n choose n)x^n (Binomial Theorem)
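The last property, the Binomial Theorem, can be spot-checked numerically for particular n and x:

```python
import math

n, x = 6, 0.3
lhs = (1 + x) ** n
rhs = sum(math.comb(n, k) * x ** k for k in range(n + 1))
print(lhs, rhs)   # the two values agree to floating-point precision
```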

3.6 Counting Arrangements of Set with Repeated Elements


Thus far we have only discussed counting arrangements of unique items. Now, we consider a case in which
we want to count the number of unique arrangements of size k of a set of n elements that are not necessarily
unique.
Consider we have a set of n elements with k of those elements being unique. Let ni be the number of appearances of the ith of the k unique elements, so n1 + · · · + nk = n. The number of different ways of selecting an arrangement of size n that uses all symbols is:

(n choose n1) × (n − n1 choose n2) × (n − n1 − n2 choose n3) × · · · × (nk choose nk) = n!/(n1! n2! n3! · · · nk!)

Derivation
Suppose we have a set of n elements with k being unique. Let the k unique items be labelled u1 to uk and let ni be the number of appearances of ui in the set of n elements. We want to form an arrangement of length n, so using the boxes metaphor we have n empty boxes:

□ □ · · · □   (n boxes)

We must use each of the n elements once, so we must select n1 boxes to fill with u1's. This can be done in (n choose n1) ways. Next, we must select n2 of the remaining n − n1 boxes to fill with u2's, n3 of the remaining n − n1 − n2 boxes to fill with u3's, and so on until we must select nk of the n − n1 − n2 − · · · − n(k−1) = nk remaining boxes to fill with uk's. Therefore, there are

(n choose n1) × (n − n1 choose n2) × · · · × (nk choose nk) = n!/((n − n1)! n1!) × (n − n1)!/((n − n1 − n2)! n2!) × · · · × nk!/(0! nk!)

ways, which simplifies (each factorial in a denominator cancels with the matching numerator of the next factor) to

n!/(n1! n2! n3! · · · nk!)

to do so.
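For a small set the formula can be verified by enumerating distinct permutations; the word below is an illustrative example with one letter repeated:

```python
import math
from itertools import permutations

word = "sassy"   # n = 5 letters: s appears 3 times, a and y once each
distinct = set(permutations(word))
formula = math.factorial(5) // (math.factorial(3) * math.factorial(1) * math.factorial(1))
print(len(distinct), formula)   # both 20
```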

Chapter 4

Probability Rules

4.1 General Rules


Here are a few basic rules of probabilities. They should be relatively straightforward.

Theorem 4.1.1
For a sample space, S, the probability that some outcome in S occurs is 1. That is,

P(S) = 1

Proof 4.1.1:

P(S) = Σ_{a∈S} P(a) = Σ_{all a} P(a) = 1

since the probabilities of all sample points sum to 1 by the definition of a probability distribution.

Theorem 4.1.2
Any event A in a sample space has a probability between 0 and 1 inclusive. That is,

0 ≤ P(A) ≤ 1 for all A ⊆ S

Proof 4.1.2:
Note that A is a subset of S, so

P(A) = Σ_{a∈A} P(a) ≤ Σ_{a∈S} P(a) = 1

Now, recall that P(a) ≥ 0 for any sample point a by our probability model. Thus, since P(A) is a sum of non-negative real numbers, P(A) ≥ 0. So we have

0 ≤ P(A) ≤ 1



Theorem 4.1.3
If A and B are two events such that A ⊆ B, that is all the sample points in A are also in B, then

P (A) ≤ P (B)

Proof 4.1.3:

P(A) = Σ_{a∈A} P(a) ≤ Σ_{a∈B} P(a) = P(B)

since every sample point of A is also a sample point of B.

4.2 Venn Diagrams


As we have seen already, it is helpful to think of events in a sample space as subsets of the sample space. Consider a sample space S = { 1, 2, 3, 4, 5, 6 }. A number is picked at random; let E be the event that the number is even. We can think of E as the subset of S, { 2, 4, 6 }, and the probability of E is the probability that any of the sample points in E occurs, that is, that 2, 4, or 6 is selected. We can represent the relationships of events in the sample space using Venn diagrams.

Figure 4.1: Single event E

Now, assuming the area of E is half the area of S, we have that the probability of E is the probability
that a randomly chosen point on the area of S will be within E.
Consider now we let G = { 4, 5, 6 } be the event that the number selected is greater than or equal to 4. We have

Figure 4.2: Events E and G



The total shaded region of the Venn diagram, E ∪ G, contains all the sample points of E and G. It is
the event that any outcome in either E or G, or both, occurs. Thus, E ∪ G is the event that E, G or both,
occurs. Similarly, the union of three events is the event that at least one of the three events occur.
Consider now the intersection E ∩ G. It is the set of all the points that are in both E and G, { 4, 6 }.
Thus, it is the event that an outcome in both E and G occurs. So E ∩ G is the event that E and G both
occur.

The sets A ∩ B and similarly A ∩ B ∩ C are often denoted as AB and ABC respectively.

Finally, the unshaded space in Figure 4.1 is the set of all outcomes that are not in E. It is the complement of E and is denoted by E̅. It is the event that E does not occur.

Note that the complement of S is the null set, that is S̅ = ∅, and it has a probability of 0.

4.3 De Morgan’s Laws

Theorem 4.3.1
The following are De Morgan's Laws:

1. The complement of A ∪ B is A̅ ∩ B̅
2. The complement of A ∩ B is A̅ ∪ B̅

4.4 Rules for Unions of Events


Recall Figure 4.2, the Venn diagram of events E and G.

We can see that the area of E ∪ G is not simply the sum of the areas of E and G, so the probability of E ∪ G is not simply the sum of the probabilities of E and G. Rather, we must sum the probabilities and subtract the probability of the intersection (which gets counted twice in the sum) to obtain P(E ∪ G).

Theorem 4.4.1
For any events, A and B, in a sample space, we have

P (A ∪ B) = P (A) + P (B) − P (A ∩ B)

Example 4.4.1
A number between 1 and 6 inclusive is chosen randomly. Let E = { 2, 4, 6 } be the event the number is even and let G = { 4, 5, 6 } be the event that the number is greater than or equal to 4.
The probability of the number being even or greater than or equal to 4 is P(E ∪ G). Since both E and G contain 3 of the six points in the sample space, P(E) = P(G) = 1/2. We can see clearly that P(E ∪ G) ≠ P(E) + P(G) = 1, since { 1 } is in neither E nor G and has a probability of 1/6. Now, note E ∩ G = { 4, 6 }, so P(E ∩ G) = 1/3. We have

P(E ∪ G) = P(E) + P(G) − P(E ∩ G) = 1/2 + 1/2 − 1/3 = 2/3
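This example can be checked directly with Python sets, using the uniform model on S:

```python
# Events from Example 4.4.1 under the uniform distribution on S = {1,...,6}.
S = {1, 2, 3, 4, 5, 6}
E = {2, 4, 6}
G = {4, 5, 6}

def P(A):
    return len(A) / len(S)

lhs = P(E | G)                       # probability of the union
rhs = P(E) + P(G) - P(E & G)         # inclusion-exclusion
print(lhs, rhs)                      # both ≈ 2/3
```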

Now consider the case of the union of three events.

Figure 4.4: Three events

Let A_I be the area on the Venn diagram of the event I. The area of the union once again is not simply the sum of the areas (A_E + A_G + A_F). Instead we can reason out that when we add the three areas we include A_{E∩G}, A_{G∩F}, and A_{F∩E} twice each and A_{E∩G∩F} three times. The sum of these doubly counted areas (A_{E∩G} + A_{G∩F} + A_{F∩E}) also includes A_{E∩G∩F} three times. Thus, when we subtract the areas of the doubly counted segments, A_{E∩G∩F} is also subtracted three times, leaving this area unaccounted for. Therefore we then add A_{E∩G∩F} back to find the complete area of E ∪ G ∪ F.

Theorem 4.4.2
For any events, A, B and C, in a sample space, we have

P (A ∪ B ∪ C) = P (A) + P (B) + P (C) − P (A ∩ B) − P (B ∩ C) − P (C ∩ A) + P (A ∩ B ∩ C)
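A sketch checking this three-event rule on concrete sets, under a uniform model on an illustrative sample space:

```python
# Uniform model on S = {1, ..., 12}; A, B, C chosen with overlapping points.
S = set(range(1, 13))
A, B, C = {1, 2, 3, 4}, {3, 4, 5, 6}, {4, 6, 7, 8}

def P(X):
    return len(X) / len(S)

lhs = P(A | B | C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(B & C) - P(C & A)
       + P(A & B & C))
print(lhs, rhs)   # the two sides agree
```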

4.5 Mutually Exclusive Events


Events A and B are mutually exclusive if and only if A ∩ B = ∅. More simply, the events A and B cannot
both occur in one experiment because they share no points in common and only one sample point is achieved.
In general, events A1, A2, A3, . . . , An are mutually exclusive if and only if Ai ∩ Aj = ∅ for all i ≠ j. This means that at most one of these events may occur in any one experiment.

Probability of the Unions of Mutually Exclusive Events


Consider the Venn diagram of two mutually exclusive events, E and G.

Figure 4.5: Two mutually exclusive events

Clearly the probability of the intersection of two mutually exclusive events is 0, since the intersection doesn't contain any sample points. So we have

P (E ∩ G) = 0

Another intrinsic property of mutually exclusive events that we can see on a Venn diagram is that the area
of E ∪ G is the sum of the areas of E and G. Therefore, unlike in previous examples, the probability of E ∪ G
is the sum of the probabilities of E and G.

Theorem 4.5.1
For mutually exclusive events, A and B, in a sample space, we have

P (A ∪ B) = P (A) + P (B)

Theorem 4.5.2
More generally for n mutually exclusive events, A1 , A2 , . . . , An , in a sample space, we have
P(A1 ∪ A2 ∪ · · · ∪ An) = P(A1) + P(A2) + · · · + P(An) = \sum_{i=1}^{n} P(Ai)

Probabilities of Complements

Theorem 4.5.3
For any event A, we have
P(Ā) = 1 − P(A)

Proof 4.5.1:
Recall the complement of an event consists of all the sample points not in the event. Thus, for any event A,
its complement Ā contains no points in common with A. So A ∩ Ā = ∅, and A and Ā are mutually exclusive
by definition. Now consider A ∪ Ā: it spans the whole of the sample space, so we have P(A ∪ Ā) = 1, and
since A and Ā are mutually exclusive, we have

P(A) + P(Ā) = 1

and it follows that P(Ā) = 1 − P(A), as required. □



4.6 Independence of Events


Events A and B are said to be independent if and only if P (A ∩ B) = P (A)P (B). Otherwise they are
dependent events.
In general, events A1, A2, A3, . . . , An are independent if and only if

P(A_{i1} ∩ A_{i2} ∩ A_{i3} ∩ · · · ∩ A_{ik}) = P(A_{i1})P(A_{i2})P(A_{i3}) · · · P(A_{ik})

for all sets { i1, i2, i3, . . . , ik } of distinct subscripts chosen from { 1, 2, 3, . . . , n }.

Example 4.6.1
Consider an experiment in which a fair die is tossed twice. We define the following events:
• A: The first number rolled is a six
• B: The second number rolled is a six
• C: The sum of the numbers rolled is less than or equal to seven
• D: The sum of the numbers rolled is equal to seven
Suppose the event A occurs. Does this have any impact on the probability of B, C or D occurring?
It is quite clear to see that the events A and B are independent, since rolling a six on the first
toss has no impact on the number that will be rolled on the second toss. Now, events A and C from
the onset appear to be dependent, since if you roll a six on the first toss you must roll a one to keep
your total less than or equal to seven. To confirm this consider the sample space
 
 (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6) 
 
 (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6) 

 

 
(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6)
 

 (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6) 
 
 (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6) 

 

 
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)
 

We can count that 21 of the sample points have sums less than or equal to seven. So the probability
of C occurring is P (C) = 21/36 = 7/12. We also have that P (A) = 1/6. So P (A)P (C) = 7/72 but
we can count that A ∩ C contains only one sample point and hence has a probability of 1/36. Thus,
P (A)P (C) 6= P (A ∩ C) so A and C are dependent events.
At first glance, we see that upon rolling a six as the first number you must roll a 1 for the sum to
equal seven. So at first glance, events A and D seem to be dependent; however, it would be naïve
to assume this. We can count from the sample space that event D contains 6 points and so has a
probability P(D) = 6/36 = 1/6, and P(A) = 1/6. So P(A)P(D) = 1/36. Now, we can count that
the event A ∩ D contains only one point, (6, 1), and so has a probability P(A ∩ D) = 1/36. Therefore,
P(A ∩ D) = P(A)P(D) and the events A and D are independent.
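Both conclusions of the example can be confirmed by enumerating the 36 equally likely outcomes with exact fractions; a minimal sketch:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of tossing a fair die twice.
S = list(product(range(1, 7), repeat=2))

def P(pred):
    # Probability of the event described by the predicate.
    return Fraction(sum(1 for s in S if pred(s)), len(S))

A = lambda s: s[0] == 6                 # first number rolled is a six
C = lambda s: s[0] + s[1] <= 7          # sum at most seven
D = lambda s: s[0] + s[1] == 7          # sum exactly seven

assert P(lambda s: A(s) and C(s)) != P(A) * P(C)  # A, C dependent
assert P(lambda s: A(s) and D(s)) == P(A) * P(D)  # A, D independent
```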

Chapter 5

Conditional Probability

Often we need to calculate the probability of some event A occurring while knowing that some other event B
has already occurred. We call this the conditional probability of A given B and denote it by P (A | B).
The conditional probability of event A, given event B, is
P(A | B) = P(A ∩ B) / P(B),  for P(B) > 0

5.1 Theorems and Rules for Conditional Probability

Theorem 5.1.1
For any two events A and B defined on the same sample space, with P (A) > 0 and P (B) > 0,
events A and B are independent if and only if P (A | B) = P (A) or P (B | A) = P (B).

Proof 5.1.1:

P(A | B) = P(A ∩ B) / P(B)  ⟺  P(A ∩ B) = P(A | B)P(B)

and by definition of independence, A and B are independent if and only if P(A ∩ B) = P(A)P(B), which is
true if and only if P(A | B) = P(A). Without loss of generality we can swap events A and B and arrive at
the conclusion. □

Product Rules

Theorem 5.1.2
Let A, B, C and D be events on a sample space, with P (A), P (B), P (C), P (D) > 0. We have

P (A ∩ B) = P (A)P (B | A)
P (A ∩ B ∩ C) = P (A)P (B | A)P (C | A ∩ B)
P (A ∩ B ∩ C ∩ D) = P (A)P (B | A)P (C | A ∩ B)P (D | A ∩ B ∩ C)

and so on. . .

Proof 5.1.2:
The first statement comes directly from the definition of conditional probability:

P(A)P(B | A) = P(A) · P(A ∩ B)/P(A) = P(A ∩ B)

For the second we have

P(A)P(B | A)P(C | A ∩ B) = P(A ∩ B)P(C | A ∩ B)                  by the first statement
                         = P(A ∩ B) · P(A ∩ B ∩ C)/P(A ∩ B)      by definition of conditional probability
                         = P(A ∩ B ∩ C)

and so on. . . □

Law of Total Probability

Theorem 5.1.3
Let A1, A2, A3, . . . , Ak be mutually exclusive events whose union is the whole sample space, and let B
be an event on the same sample space. We have

P(B) = P(B ∩ A1) + P(B ∩ A2) + P(B ∩ A3) + · · · + P(B ∩ Ak) = \sum_{i=1}^{k} P(Ai)P(B | Ai)

Proof 5.1.3:
Note that the events Ai ∩ B for 1 ≤ i ≤ k are all mutually exclusive, since the Ai 's are mutually exclusive.
And because the Ai 's span the whole sample space, the union of the Ai ∩ B's is B, that is

(A1 ∩ B) ∪ (A2 ∩ B) ∪ (A3 ∩ B) ∪ · · · ∪ (Ak ∩ B) = B

So by Theorem 4.5.1, we have

P (B) = P (A1 ∩ B) + P (A2 ∩ B) + P (A3 ∩ B) + · · · + P (Ak ∩ B)

and by Theorem 5.1.2 (Product Rule), we have


P(B) = P(A1)P(B | A1) + P(A2)P(B | A2) + P(A3)P(B | A3) + · · · + P(Ak)P(B | Ak) = \sum_{i=1}^{k} P(Ai)P(B | Ai)

as required. 

Bayes’ Theorem

Theorem 5.1.4
Let A and B be events on a sample space, with P(A) > 0 and P(B) > 0. We have

P(A | B) = P(B | A)P(A) / P(B) = P(B | A)P(A) / [P(B | A)P(A) + P(B | Ā)P(Ā)]

Proof 5.1.4:

P(A | B) = P(A ∩ B) / P(B) = P(B | A)P(A) / P(B)                               by Theorem 5.1.2 (Product Rule)
         = P(B | A)P(A) / [P(A ∩ B) + P(Ā ∩ B)]                                by Theorem 5.1.3 (Law of Total Probability)
         = P(B | A)P(A) / [P(B | A)P(A) + P(B | Ā)P(Ā)]                        by Theorem 5.1.2 (Product Rule)  □

Bayes’ Theorem allows us to find the conditional probability of some event A given B, in terms of
the probability of B given A. It allows us calculate conditional probabilities using the reversed order
of conditioning.
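As a sketch of this reversed conditioning, suppose (hypothetical numbers, for illustration only) 1% of items are defective (event A), and a test flags 95% of defective items but also 5% of good ones (event B = the test flags the item):

```python
# Hypothetical numbers for illustration only.
p_A = 0.01             # P(A): item is defective
p_B_given_A = 0.95     # P(B | A): test flags a defective item
p_B_given_notA = 0.05  # P(B | complement of A): test flags a good item

# Law of Total Probability gives the denominator P(B).
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)
p_A_given_B = p_B_given_A * p_A / p_B  # Bayes' Theorem
assert round(p_A_given_B, 3) == 0.161
```

Even with a fairly accurate test, P(A | B) is only about 16%, because the prior P(A) is small; this is exactly the kind of reversal the theorem handles.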

5.2 Tree Diagrams


Tree diagrams are a technique that we can use to keep track of conditional probabilities. We start from
a single node and draw new branches to separate nodes for each event; each node represents that event
occurring. On each branch we write the probability of the event it leads to, conditional on the events along
the path so far. To find the probability of any node we multiply the probabilities along the branches leading
to it.

P(A1) = 0.5:  P(B | A1) = 0.7, P(B̄ | A1) = 0.3, so P(A1 ∩ B) = 0.5 × 0.7 and P(A1 ∩ B̄) = 0.5 × 0.3
P(A2) = 0.3:  P(B | A2) = 0.9, P(B̄ | A2) = 0.1, so P(A2 ∩ B) = 0.3 × 0.9 and P(A2 ∩ B̄) = 0.3 × 0.1
P(A3) = 0.2:  P(B | A3) = 0.4, P(B̄ | A3) = 0.6, so P(A3 ∩ B) = 0.2 × 0.4 and P(A3 ∩ B̄) = 0.2 × 0.6

Figure 5.1: Tree diagram

The probabilities of all the branches leading outward from each node must sum to 1, since exactly
one of the outcomes must occur.
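Reading the branch probabilities off Figure 5.1, the paths ending in B give P(B) by summing the products along each path; a quick check:

```python
# (P(Ai), P(B | Ai)) pairs read off the branches of Figure 5.1.
branches = [(0.5, 0.7), (0.3, 0.9), (0.2, 0.4)]

# The root branches A1, A2, A3 must themselves sum to 1.
assert abs(sum(p_a for p_a, _ in branches) - 1.0) < 1e-12

# P(B) = sum over paths ending in B of P(Ai) * P(B | Ai).
p_B = sum(p_a * p_b for p_a, p_b in branches)
assert abs(p_B - 0.70) < 1e-12
```

This is the Law of Total Probability of Section 5.1 applied to the tree.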

Chapter 6

Useful Sums and Series

This chapter includes a few useful sums and series that show up in the following chapters.

6.1 Geometric Series


\sum_{i=0}^{n-1} r^i = 1 + r + r^2 + · · · + r^{n−1} = (1 − r^n)/(1 − r),  for r ≠ 1

For |r| < 1, we have



\sum_{i=0}^{∞} r^i = 1 + r + r^2 + · · · = 1/(1 − r)

Other identities can be obtained from this one by differentiation. For example we have
d/dr \sum_{i=0}^{∞} r^i = \sum_{i=1}^{∞} i r^{i−1} = d/dr [1/(1 − r)] = 1/(1 − r)^2
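Both the finite sum and the differentiated series can be sanity-checked with exact arithmetic; a sketch with r = 1/2 (chosen arbitrarily):

```python
from fractions import Fraction

r, n = Fraction(1, 2), 10  # arbitrary ratio with |r| < 1

# Finite geometric series: 1 + r + ... + r^(n-1) = (1 - r^n)/(1 - r).
assert sum(r**i for i in range(n)) == (1 - r**n) / (1 - r)

# Differentiated series: partial sums of i*r^(i-1) approach 1/(1-r)^2 = 4.
partial = sum(i * r ** (i - 1) for i in range(1, 60))
assert abs(float(partial) - float(1 / (1 - r) ** 2)) < 1e-12
```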

6.2 Binomial Theorem


The binomial theorem describes the algebraic expansion of powers of a polynomial.

(1 + t)^n = 1 + \binom{n}{1} t + \binom{n}{2} t^2 + · · · + \binom{n}{n} t^n = \sum_{x=0}^{n} \binom{n}{x} t^x

for any positive integer n and real number t.


A more general form of this theorem that holds even when n is not a positive integer is

(1 + t)^n = \sum_{x=0}^{∞} \binom{n}{x} t^x,  for |t| < 1

It is an important skill to be able to recognize when a polynomial, infinite or otherwise, with binomial
coefficients can be reduced to a simple polynomial raised to a power.

6.3 Multinomial Theorem


The multinomial theorem is a generalization of the binomial theorem. It describes the algebraic expansion
of powers of a sum in terms of powers of the terms in the sum.
(x1 + x2 + · · · + xm)^n = \sum_{k1+k2+···+km=n} \binom{n}{k1, k2, . . . , km} \prod_{t=1}^{m} x_t^{k_t}

Another common form in which this theorem may be represented is


(x1 + x2 + · · · + xm)^n = \sum_{k1+k2+···+km=n} (n! / (k1! k2! · · · km!)) x1^{k1} x2^{k2} · · · xm^{km}

The summation is over all non-negative integers k1, k2, . . . , km such that k1 + k2 + · · · + km = n.

6.4 Hypergeometric Identity


\sum_{x=0}^{∞} \binom{a}{x} \binom{b}{n−x} = \binom{a+b}{n}

Proof 6.4.1:
We begin with the equality
(1 + y)a+b = (1 + y)a × (1 + y)b
Now by Binomial Theorem we have
\sum_{k=0}^{a+b} \binom{a+b}{k} y^k = \sum_{i=0}^{a} \binom{a}{i} y^i × \sum_{j=0}^{b} \binom{b}{j} y^j

Consider the coefficient of y k on the right hand side. It is the sum of all the binomial terms such that
i + j = k. Thus, the coefficient of y k on the right hand side is
\sum_{i=0}^{min{a,k}} \binom{a}{i} \binom{b}{k−i}

and since when i > a or i > k the term is 0 we can increase the sum to infinity. Thus, since the coefficient
on the right hand side is equal to that on the left hand side we have
\binom{a+b}{k} = \sum_{i=0}^{∞} \binom{a}{i} \binom{b}{k−i}

When x becomes significantly large, the terms of the summation become 0, since

\binom{n}{x} = 0,  for x > n
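The identity is easy to confirm numerically with Python's `math.comb` (the values of a, b, n below are arbitrary; `comb` returns 0 when the lower index exceeds the upper, so the vanishing terms take care of themselves):

```python
from math import comb

# Arbitrary values for a, b, n.
a, b, n = 7, 5, 6

# Terms with x > a vanish because comb(a, x) == 0 for x > a,
# and x is kept at most n so that n - x stays non-negative.
lhs = sum(comb(a, x) * comb(b, n - x) for x in range(n + 1))
assert lhs == comb(a + b, n)  # C(7, x) C(5, 6-x) summed equals C(12, 6) = 924
```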

6.5 Exponential Series


This is an example of a Maclaurin series expansion.
e^t = t^0/0! + t^1/1! + t^2/2! + · · · = \sum_{n=0}^{∞} t^n/n!,  for all t ∈ R

The following limit definition of the exponential function is also useful


e^t = \lim_{n→∞} (1 + t/n)^n,  for all t ∈ R

6.6 Integer Series


The following are useful equalities involving sums of integers.

1 + 2 + 3 + · · · + n = n(n + 1)/2

1^2 + 2^2 + 3^2 + · · · + n^2 = n(n + 1)(2n + 1)/6

1^3 + 2^3 + 3^3 + · · · + n^3 = (n(n + 1)/2)^2

Chapter 7

Discrete Random Variables and


Probability Functions

7.1 Random Variables


A random variable (r.v. for short) is a numerical valued variable that represents the outcome of an experiment
or process. Every random variable has a range associated with it, which is the set of all possible values the
r.v. can take. We denote random variables by capital letters, e.g., A, X, Z.

Example 7.1.1
Suppose an experiment consists of tossing a coin three times. Let the random variable X be the
number of heads obtained, and let the random variable Y be the number of tails obtained. Now
we have a nice shorthand in that X = 2 is equivalent to the statement "two heads were obtained".
Moreover, we have useful equalities such as X + Y = 3 and X = 3 − Y.
The ranges of X and Y are both { 0, 1, 2, 3 }.

It is very important to understand the purpose of r.v.’s since the remainder of this course features
them heavily.

The formal definition of a random variable is a function that assigns a real number to each point in a
sample space.

Example 7.1.2
Consider the same experiment as above. The sample space is

{ HHH, HHT, HTH, HTT, THH, THT, TTH, TTT }

Let us define X as the number of heads obtained and consider the sample point a = HHT. The value of the
function is X(a) = 2, found by counting the number of heads in a. The range of X is { 0, 1, 2, 3 }.
Each of the outcomes X = x represents an event, simple or compound. In this case they are:
X Event
0 { TTT }
1 { TTH, THT, HTT }
2 { THH, HTH, HHT }
3 { HHH }

Since exactly one value in the range must occur each time the experiment is performed, the events
corresponding to the values of a random variable are mutually exclusive subsets of the sample space whose
union is the total sample space. For a r.v. X and outcome x, "X = x" represents some event, and we are
interested in calculating its probability, which we denote by P(X = x).

Since the union of the events of values of a random variable is the total sample space, we have
\sum_{x∈Range(X)} P(X = x) = 1

Discrete Random Variables


Discrete random variables take integer values or, more generally, values in a countable set. Recall that a
set is countable if its elements can be placed in a one-to-one correspondence with a subset of the positive
integers.

Continuous Random Variables


Continuous random variables (which are not the focus of this chapter) take values in some interval of real
numbers like (0, 1), (0, ∞) or (−∞, ∞). You should be aware that there are infinitely many non-integer
values that a r.v. with range (0, 1) could take; between any two of them there are always more possible values.

7.2 Probability Function


The probability function of a discrete random variable X is a function that maps the value of X to the
probability of that value. The probability function is represented by

f (x) = P (X = x) for x ∈ Range(X)

The set of pairs { (x, f (x)) | x ∈ Range(X) } is called the probability distribution of X.

Properties of Probability Functions


The following two properties hold for all probability functions (on discrete r.v.’s).

• f(x) ≥ 0 for all x ∈ Range(X)

• \sum_{x∈Range(X)} f(x) = 1

7.3 Cumulative Distribution Function


Another common function used to describe a probability model is the cumulative distribution function,
usually denoted by F (x). It is defined to be

F (x) = P (X ≤ x) for all x ∈ Range(X)

Note that because the events "X = x" and "X = y" for x ≠ y are mutually exclusive, we have

F(x) = P(X ≤ x) = \sum_{z ≤ x} P(X = z)  for all x ∈ Range(X)

It is the sum of the probabilities that the random variable takes values less than or equal to x.
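For the three-coin-toss random variable of Example 7.1.2 (X = number of heads), the probability function and its cumulative sums can be tabulated directly; a small sketch:

```python
from fractions import Fraction
from itertools import accumulate, product

# X = number of heads in three tosses of a fair coin (Example 7.1.2).
outcomes = list(product("HT", repeat=3))  # 8 equally likely sample points
f = [Fraction(sum(1 for o in outcomes if o.count("H") == x), 8) for x in range(4)]
F = list(accumulate(f))  # F[x] = P(X <= x)

assert f == [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]
assert all(F[i] <= F[i + 1] for i in range(3))  # F is non-decreasing
assert F[-1] == 1                                # probabilities sum to 1
```

The asserts mirror the properties of cumulative distribution functions listed below.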

Properties of Cumulative Distribution Functions


The following four properties hold for all cumulative distribution functions.
• F (x) is a non-decreasing function

• 0 ≤ F (x) ≤ 1 for all x ∈ R


• lim F (x) = 0
x→−∞

• lim F (x) = 1
x→∞

Chapter 8

Discrete Distributions

As we briefly mentioned in the previous chapter, probability distributions are the set of pairs (x, f (x)) for
all possible outcomes x of a random variable X. Many probability distributions appear commonly on r.v.’s
of similar “real-life” processes. In this chapter we define a few of these common distributions on discrete
random variables, when they occur and how to use them to calculate probabilities.

It is important to understand distributions early-on. Distributions, probability functions and cumu-


lative distribution functions are defined on random variables not experiments/processes or sample
spaces.

8.1 Uniform Distribution


Suppose X can take a finite set of consecutive values with each of the values being equally likely. That is
Range(X) = { a, a + 1, a + 2, . . . , b } with each of a, a + 1, a + 2, . . . , b being equally likely. Then X has a
discrete uniform distribution and we denote it
X ∼ Discrete Uniform

f(x) = P(X = x) = 1/(b − a + 1) for all x ∈ Range(X), and f(x) = 0 otherwise

Derivation of Probability Function


The probability of each value of the r.v. is easy to calculate since they are all equal and must add up
to 1. Therefore, k × P (X = a) = 1 where k is the number of possible values of X. The number of
possible values of X is b − (a − 1) = b − a + 1 since Range(X) is between a and b inclusive.

Another way to define the probability of each value of a random variable with this sample space is
1
Number of possible values in Range(X)

8.2 Hypergeometric Distribution


Suppose we have a collections of N objects which can be classified into two different types, successes and
failures. There are r successes and N − r failures. We pick n objects at random without replacement, and

let the random variable X be the number of successes obtained. X has a hypergeometric distribution and
we denote it
X ∼ Hypergeometric

f(x) = P(X = x) = \binom{r}{x} \binom{N−r}{n−x} / \binom{N}{n},  for x ≤ min(r, n)

Derivation of Probability Function


We will use the counting techniques we previously learnt to calculate the probability function. We note
that there are \binom{N}{n} ways to select n objects from the total of N, so the sample space contains
\binom{N}{n} points. Now the number of ways of choosing x successes from the total of r is \binom{r}{x}
and, independently, the number of ways to choose the remainder of the objects, n − x, from the total
remaining objects, N − r, is \binom{N−r}{n−x}. Thus, by the multiplication rule, the probability of X = x
is the product of those expressions divided by the number of points in the sample space,
\binom{r}{x} \binom{N−r}{n−x} / \binom{N}{n}.

It is important to understand that the terms "successes" and "failures" are simply placeholders that
represent a type of outcome and its complement. They could be replaced by "wins" and "losses",
"whites" and "colors", or any other labels for two distinct groups whose union spans the whole
sample space.

This is used when we know how many items (n) are chosen at random from a set with two different
types and we know the amount of each type in the set.

Example 8.2.1
There is a basket with 11 fruit, 9 apples and 2 oranges. 4 fruit are picked at random from the basket.
Let random variable X be the number of apples selected. Find f (x) = P (X = x). Then find f (3).
X ∼ Hypergeometric. N = 11, n = 4, r = 9.

f(x) = P(X = x) = \binom{9}{x} \binom{2}{4−x} / \binom{11}{4},  for 2 ≤ x ≤ 4

Hence

f(3) = P(X = 3) = \binom{9}{3} \binom{2}{1} / \binom{11}{4} ≈ 0.509

Example 8.2.2
15 cards are drawn from a deck of 52 at random. Let X be the number of red cards drawn. Find
f (x) = P (X = x). Then find f (7).

X ∼ Hypergeometric. N = 52, n = 15, r = 26.

f(x) = P(X = x) = \binom{26}{x} \binom{26}{15−x} / \binom{52}{15},  for x ≤ 15

Hence

f(7) = P(X = 7) = \binom{26}{7} \binom{26}{8} / \binom{52}{15} ≈ 0.229
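The value can be computed exactly with integer binomial coefficients; a minimal sketch of the hypergeometric probability function:

```python
from math import comb

def hypergeom_pmf(x, N, r, n):
    # P(X = x): x successes in n draws without replacement
    # from N objects of which r are successes.
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)

# Example 8.2.2: 15 cards from a 52-card deck, X = number of red cards.
f7 = hypergeom_pmf(7, N=52, r=26, n=15)
assert round(f7, 3) == 0.229

# The probabilities over the whole support sum to 1.
assert abs(sum(hypergeom_pmf(x, 52, 26, 15) for x in range(16)) - 1) < 1e-12
```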

8.3 Binomial Distribution


Suppose we have an experiment with two distinct outcomes, success and failure, with the probability of a
success being p and a failure being (1 − p). The experiment is repeated n times independently (these are
called trials). Let the random variable X be the number of successes obtained. X has a binomial distribution.
X ∼ Binomial(n, p)

f(x) = P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x},  for x = 0, 1, 2, . . . , n

Derivation of Probability Function


Since there are n positions in which to place the x successes, there are \binom{n}{x} unique arrangements of
successes and failures that satisfy "X = x". Each of these arrangements has probability p^x (1 − p)^{n−x},
since the probability of obtaining x successes is p^x and the probability of obtaining n − x failures is
(1 − p)^{n−x}. So the probability that X = x, that is, that any one of the arrangements occurs, is the sum of
the probabilities of the unique arrangements, \binom{n}{x} p^x (1 − p)^{n−x}.

The above formula describes the probability of x successes and (n − x) failures, multiplied by the number
of different ways of arranging those successes within the total number of trials of the experiment.

Each of the n individual experiments is called a “Bernoulli trial” and the entire process of n trials is
called a Bernoulli process or a Binomial process.

Example 8.3.1
A loaded coin is flipped 10 times, with a probability of a heads occurring being 0.4. Let random
variable X be the number of heads that occur. Find f (x) = P (X = x), then find f (3).
X ∼ Binomial(10, 0.4).

f(x) = P(X = x) = \binom{10}{x} (0.4)^x (0.6)^{10−x},  for x = 0, 1, . . . , 10

Hence

f(3) = P(X = 3) = \binom{10}{3} (0.4)^3 (0.6)^7 ≈ 0.215

Example 8.3.2
A football season in a university league has 22 games. The probability of each game being abandoned
(because of bad weather or other hazards) is 0.02. Let X be the number of games abandoned
throughout the whole season. Find f (x) = P (X = x), then find f (2) and f (10).
X ∼ Binomial(22, 0.02).

f(x) = P(X = x) = \binom{22}{x} (0.02)^x (0.98)^{22−x},  for x = 0, 1, . . . , 22

Hence

f(2) = P(X = 2) = \binom{22}{2} (0.02)^2 (0.98)^{20} ≈ 0.062

and

f(10) = P(X = 10) = \binom{22}{10} (0.02)^{10} (0.98)^{12} ≈ 5.196 × 10^{−12}

8.3.1 Comparison of Binomial and Hypergeometric Distributions


The Binomial and Hypergeometric distributions are similar in that they both model the distribution of the
number of successes in n trials of an experiment. The difference is that in the hypergeometric distribution
the collection of objects is selected from without replacement, as opposed to the Binomial distribution, in
which successes and failures do not affect the probability of future outcomes (with replacement).

The Hypergeometric distribution is used when there is a fixed number of objects (successes and fail-
ures) to choose from.
The Binomial distribution is used when there is no fixed number of objects to be selected from and
instead we know the constant probability of a success for all the trials.

Example 8.3.3
Suppose Lisa owns a car dealership and has only 750 red cars and 1250 blue cars in stock. A
rich Swedish man enters and picks 50 cars at random to purchase. Let X be the number of red cars
the Swede purchases.
Since we know the number of successes (750 red cars) and failures (1250 blue cars) as well as the
number of trials, we have that X ∼ Hypergeometric and

f(x) = P(X = x) = \binom{750}{x} \binom{1250}{50−x} / \binom{2000}{50}

Now, consider Lisa has run out of all her stock of cars. She goes to a Swedish car manufacturer’s
factory which is capable of producing any amount of cars. The factory has a 37.5% chance of producing
a red car and otherwise produces a blue car. Lisa orders 50 cars. Let X be the number of red cars
she receives.
Since there is no fixed number of cars to choose from but we do know the probability of each car
being a success, we have that X ∼ Binomial(50, 0.375) and
 
50
f (x) = P (X = x) = (0.375)x (0.625)50−x
x

8.3.2 Binomial Estimate of the Hypergeometric Distribution


When the number of objects to choose from, N , is very large and the number of objects being chosen, n,
is relatively small in a hypergeometric distribution, we have that the probability of a success changes only
very slightly due to lack of replacement. Since the number of objects is so large choosing a small number of
objects without replacement barely changes the probability of a success so p is relatively constant. Thus we
can fairly accurately estimate the distribution with a binomial distribution with the original probability of
a success for the first choice.

Example 8.3.4
Consider the previous example, suppose the rich Swedish man purchased 50 cars from Lisa. What is
the probability that he purchases 20 red cars?
The number of cars Lisa has in stock is very large and the number of cars being bought is fairly small.
Thus we can approximate the distribution with the probability of a success being 750/2000 = 0.375.
We have

f(20) = P(X = 20) = \binom{50}{20} (0.375)^{20} (0.625)^{30} ≈ 0.1072

Now we can calculate the probability using the hypergeometric distribution to determine how good
an estimate this is. We have

f(20) = P(X = 20) = \binom{750}{20} \binom{1250}{30} / \binom{2000}{50} ≈ 0.1084

So the approximation is accurate to 2 decimal places.
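The quality of the approximation can be checked directly, using exact integer arithmetic for the hypergeometric value and floating point for the binomial:

```python
from math import comb

# Lisa's stock: N = 2000 cars, r = 750 red; the Swede buys n = 50, asks about x = 20 red.
N, r, n, x = 2000, 750, 50, 20
p = r / N  # 0.375, the first-draw probability of a red car

hyper = comb(r, x) * comb(N - r, n - x) / comb(N, n)
binom = comb(n, x) * p**x * (1 - p) ** (n - x)

assert round(hyper, 3) == 0.108
assert round(binom, 3) == 0.107
assert abs(hyper - binom) < 0.005  # close, as the text claims
```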

8.4 Negative Binomial Distribution


This distribution is similar to the binomial distribution. We have an experiment with two distinct outcomes,
success and failure, with the probability of a success being p and a failure being (1 − p). The experiment
is repeated until a specified amount of successes, k, have been obtained. Let the random variable X be the
number of failures obtained before the k th success. X has a negative binomial distribution.
X ∼ NB(k, p)

f(x) = P(X = x) = \binom{x+k−1}{x} p^k (1 − p)^x,  for x = 0, 1, 2, . . .

Derivation of the Probability Function


The above formula describes the probability of x failures and k successes, multiplied by the number
of different ways of arranging those x failures among the x + k − 1 trials that precede the kth success.
The final trial cannot be a failure, as it is the kth success.

The negative binomial distribution is used to model the number of failures before the kth success of
an experiment. Thus, if the total number of trials is fixed in advance, this distribution is not appropriate.

Example 8.4.1
A bad driver never stops at red lights and keeps driving and running red lights until he is arrested.
The probability of him being pulled over by a police officer immediately after running a light is 0.53,
and upon being pulled over 4 times he is arrested. Let X be the number of red lights the driver runs

without being pulled over before he is arrested. Find f (x) = P (X = x), then find f (1) and f (7).
X ∼ NB(4, 0.53)

f(x) = P(X = x) = \binom{x+3}{x} (0.53)^4 (0.47)^x,  for x = 0, 1, 2, . . .

Hence

f(1) = P(X = 1) = \binom{4}{1} (0.53)^4 (0.47) ≈ 0.148

and

f(7) = P(X = 7) = \binom{10}{7} (0.53)^4 (0.47)^7 ≈ 0.048

Example 8.4.2
The probability of a football player scoring at least one goal in each game is 0.72. When the player
scores in 26 games, she is awarded a bonus check. Let X be the number of games in which the player
does not score before she is awarded the bonus. Find f (x) = P (X = x), then find f (7), and f (0).
X ∼ NB(26, 0.72)

f(x) = P(X = x) = \binom{x+25}{x} (0.72)^{26} (0.28)^x,  for x = 0, 1, 2, . . .

Hence

f(7) = P(X = 7) = \binom{32}{7} (0.72)^{26} (0.28)^7 ≈ 0.089

and

f(0) = P(X = 0) = \binom{25}{0} (0.72)^{26} (0.28)^0 = 0.72^{26} ≈ 1.953 × 10^{−4}
Note that f (0) is simply the probability that the player scores in all of her first 26 games.

8.5 Geometric Distribution


This distribution is identical to the negative binomial distribution with k = 1. We have an experiment with
two distinct outcomes, success and failure, with the probability of a success being p and a failure being
(1 − p). Let the random variable X be the number of failures obtained before the first success. X has a
geometric distribution.
X ∼ Geometric(p)
f (x) = P (X = x) = (1 − p)x p, for x = 0, 1, 2 . . .

Example 8.5.1
A betting game involves flipping a coin repeatedly. The coin is fixed so that the probability of heads
is 0.7 and tails is 0.3. On every flip, if you get heads you may flip again, but otherwise (if you get
tails) the game is over. For each heads you flip you get $100. Let X be the number of heads you get.
Find f (2) and F (3).
Here a "success" is flipping tails (probability 0.3), and X counts the heads (failures) before the first
tails. Thus X ∼ Geometric(0.3), so

f(x) = P(X = x) = (0.7)^x (0.3),  for x = 0, 1, 2, . . .



Hence
f (2) = P (X = 2) = (0.7)2 0.3 = 0.147
and
F (3) = P (X ≤ 3) = (0.7)3 0.3 + (0.7)2 0.3 + (0.7)1 0.3 + (0.7)0 0.3 = 0.7599
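The example's two quantities can be reproduced in a few lines:

```python
p = 0.3  # probability the game ends on any given flip (tails)

def f(x):
    # P(X = x): exactly x heads before the first tails.
    return (1 - p) ** x * p

assert round(f(2), 3) == 0.147

F3 = sum(f(x) for x in range(4))  # F(3) = P(X <= 3)
assert round(F3, 4) == 0.7599
```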

8.6 The Poisson Distribution


This distribution is somewhat unlike the other distributions we have encountered. Suppose an event occurs
an average of µ times per specified interval (of time, space, etc.) according to the following conditions:
• Independence - the number of occurrences in non-overlapping intervals are independent of one an-
other.
• Individuality - events do not occur in clusters, that is, for sufficiently short intervals of length ∆t,
the probability of two or more events occurring is extremely close to 0 (negligible).
• Homogeneity - events occur at a uniform/homogeneous rate λ such that the probability of one
occurrence in the interval (t, t + ∆t) is λ∆t for small ∆t.
Let X be the number of times the event occurs in the interval. X has a Poisson distribution.
X ∼ Poisson(µ)

f(x) = P(X = x) = µ^x e^{−µ} / x!,  for x = 0, 1, 2, . . .

Derivation of the Probability Function


Based on the conditions above we can derive the probability function. Let f_t(x) be the probability of
x occurrences in an interval of length t. We will use the relationship between f_t(x) and f_{t+∆t}(x)
and induction to show the result.
Firstly, we must find f_t(0), the probability of 0 occurrences in an interval of length t. Consider
the probability f_{t+∆t}(0) that there are 0 occurrences in an interval of length t + ∆t: that is, the
probability of no events in the interval of length t and no events in the following interval of length ∆t. We have

f_{t+∆t}(0) = f_t(0)(1 − λ∆t)

(f_{t+∆t}(0) − f_t(0)) / ∆t = −λ f_t(0)

As ∆t → 0, we have the differential equation

(d/dt) f_t(0) = −λ f_t(0)

which has solution

f_t(0) = C e^{−λt}

Note that for an interval of length 0 the probability of zero events must be 1, so the constant C must be
1. Hence, we have

f_t(0) = e^{−λt}

Note that there are only two ways to get a total of x ≠ 0 occurrences in an interval of length t + ∆t
for a sufficiently small ∆t, since by individuality the probability of two or more events in the interval
(t, t + ∆t) is negligible: either there are x occurrences by time t, or there are (x − 1) occurrences by
time t and 1 in the interval (t, t + ∆t). This and the property of independence lead to

f_{t+∆t}(x) = f_t(x)(1 − λ∆t) + f_t(x − 1)(λ∆t)

f_{t+∆t}(x) = f_t(x) − f_t(x)(λ∆t) + f_t(x − 1)(λ∆t)

(f_{t+∆t}(x) − f_t(x)) / ∆t + λ f_t(x) = λ f_t(x − 1)

Now as ∆t → 0 we have the differential equation

(d/dt) f_t(x) + λ f_t(x) = λ f_t(x − 1)

(d/dt) [e^{λt} f_t(x)] = e^{λt} λ f_t(x − 1)    (8.1)
dt
Now consider when x = 1; we have

(d/dt) [e^{λt} f_t(1)] = e^{λt} λ f_t(0)

Substituting the result f_t(0) we obtained earlier, we have

(d/dt) [e^{λt} f_t(1)] = e^{λt} λ e^{−λt} = λ

Integrating both sides we have

e^{λt} f_t(1) = λt + C

Note that for an interval of length 0 the probability of one event must be 0, so the constant C must be
0. Hence, we have

f_t(1) = λt e^{−λt}

We now use induction to generalize this result for an arbitrary x. Our inductive hypothesis is as
follows:

f_t(x) = (λt)^x e^{−λt} / x!

We have already shown that our hypothesis holds for x = 0 and x = 1. Now we assume our hypothesis
is true and recall the differential equation (8.1) with x + 1. We have

(d/dt) [e^{λt} f_t(x + 1)] = e^{λt} λ f_t(x)

(d/dt) [e^{λt} f_t(x + 1)] = e^{λt} λ (λt)^x e^{−λt} / x! = λ^{x+1} t^x / x!

e^{λt} f_t(x + 1) = ∫ (λ^{x+1} / x!) t^x dt = (λ^{x+1} / x!) · t^{x+1} / (x + 1) + C = (λt)^{x+1} / (x + 1)! + C

Again using the boundary condition, with an interval of length 0, we have that C must be 0. Thus, we have

f_t(x + 1) = (λt)^{x+1} e^{−λt} / (x + 1)!

Thus, if the inductive hypothesis holds for x then it also holds for x + 1. So by the principle of mathematical
induction the hypothesis is true for all natural numbers x.

Note that this derivation is fairly complex. If at first you do not understand don’t worry. Try reading
it again later.

8.6.1 Poisson Estimate to the Binomial Distribution

Derivation from Binomial Distribution


The Poisson distribution is a limiting case of the Binomial distribution as n → ∞ and p → 0. If
we take µ = np as a constant as n tends to infinity we have that p tends to 0. Thus consider the
following:

f(x) = \binom{n}{x} p^x (1 − p)^{n−x} = (n^{(x)} / x!) (µ/n)^x (1 − µ/n)^{n−x}

     = (µ^x / x!) × (n^{(x)} / n^x) × (1 − µ/n)^n (1 − µ/n)^{−x}

     = (µ^x / x!) × (n(n − 1)(n − 2) · · · (n − x + 1) / n^x) × (1 − µ/n)^n (1 − µ/n)^{−x}

(Note the middle term's numerator is the product of x terms)

     = (µ^x / x!) × 1 × (1 − 1/n)(1 − 2/n) · · · (1 − (x−1)/n) × (1 − µ/n)^n (1 − µ/n)^{−x}

Now as n approaches infinity we have

\lim_{n→∞} f(x) = (µ^x / x!) × (1)(1) · · · (1) × e^{−µ} (1)^{−x},   since e^k = \lim_{n→∞} (1 + k/n)^n

               = µ^x e^{−µ} / x!,   for x = 0, 1, 2, . . .
Thus, when n is very large and p is very small we can use the Poisson distribution to approximate a
Binomial distribution.
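A numeric illustration of the limit, with n large and p small (the values below are chosen arbitrarily):

```python
from math import comb, exp, factorial

n, p = 1000, 0.003  # n large, p small (arbitrary illustration values)
mu = n * p          # mu = 3.0 held fixed

for x in range(8):
    binom = comb(n, x) * p**x * (1 - p) ** (n - x)
    poisson = mu**x * exp(-mu) / factorial(x)
    # With n = 1000 the two pmfs already agree to about 3 decimal places.
    assert abs(binom - poisson) < 1e-3
```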

8.6.2 Parameters µ and λ


The parameters µ and λ are, as you might expect, strongly linked. The first, µ, is the average number of
occurrences in a specified interval, whereas λ is the uniform rate of occurrence per unit of the interval. Thus
if t is the number of units of time, space, etc., then λt = µ.

Example 8.6.1
Suppose a fire station gets 15 phone calls every 5 minutes, so the rate of occurrence per minute is λ = 3. Then, if we are interested in the number of phone calls in 10 minutes, the average number of phone calls in an interval of ten minutes is µ = 10λ = 30.
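The fire-station example can be carried one step further in code; this is a minimal sketch (the choice of computing P(X = 30) is illustrative).

```python
from math import exp, factorial

# lambda = 3 calls per minute; over t = 10 minutes, mu = lambda * t = 30.
lam, t = 3, 10
mu = lam * t

def poisson_pmf(x, mu):
    """P(X = x) for X ~ Poisson(mu)."""
    return mu**x * exp(-mu) / factorial(x)

# e.g. the probability of exactly 30 calls in the 10-minute interval
p30 = poisson_pmf(30, mu)
```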

8.6.3 Distinguishing the Poisson Distribution from other Distributions


In order to distinguish when to use and not to use a Poisson distribution, we can ask ourselves a few simple questions that rule out a Poisson process.

• Is it reasonable to ask how often the event does not occur?

• Is it possible to specify an upper limit on the value the random variable in question can take?

If the answer to either of these questions is yes, then the random variable in question does not follow a Poisson distribution and the experiment is not a Poisson process.

8.7 Combining Models


It is possible for the solution to a problem to require more than one distribution to model the probability of a complex event. Below are a few examples where it is necessary to use more than one distribution.

Example 8.7.1
Suppose a type of spider catches flies in its web at a rate of 2 per hour. If there are 10 such spiders, what is the probability that more than 6 of them each catch fewer than 4 flies in 2 hours?
First we find the probability that a single spider catches fewer than 4 flies in 2 hours. Let X be the number of flies the spider catches in 2 hours. The average number of flies caught in 2 hours is µ = 4. Thus, we have X ∼ Poisson(4).

    P(X < 4) = 4^0 e^(−4)/0! + 4^1 e^(−4)/1! + 4^2 e^(−4)/2! + 4^3 e^(−4)/3! ≈ 0.43
Now, we can find the probability that more than 6 spiders catch fewer than 4 flies in 2 hours with the knowledge that each spider independently has a 0.43 chance of doing so. Let Y be the number of spiders that catch fewer than 4 flies in 2 hours. We have Y ∼ Binomial(10, 0.43).

    P(Y > 6) = C(10, 7)(0.43)^7 (0.57)^3 + C(10, 8)(0.43)^8 (0.57)^2 + C(10, 9)(0.43)^9 (0.57)^1 + C(10, 10)(0.43)^10 (0.57)^0
             ≈ 0.081

It is important to remember that more than one distribution can be necessary. A common mistake is to use one distribution correctly and not realize the need for another.
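The two-stage computation in the spider example can be sketched directly; the code keeps the intermediate probability at full precision rather than rounding to 0.43, so the final answer differs slightly from the hand computation.

```python
from math import comb, exp, factorial

# Stage 1: one spider. X ~ Poisson(4); find P(X < 4).
mu = 4  # rate 2 flies/hour * 2 hours
p_single = sum(mu**x * exp(-mu) / factorial(x) for x in range(4))

# Stage 2: 10 spiders. Y ~ Binomial(10, p_single); find P(Y > 6).
n = 10
p_many = sum(comb(n, y) * p_single**y * (1 - p_single)**(n - y)
             for y in range(7, n + 1))
```

Here p_single ≈ 0.433, and p_many ≈ 0.084; the notes' value of 0.081 comes from rounding p_single to 0.43 first.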

Chapter 9

Mean and Variance

9.1 Summarizing Data on Random Variables


Often, listing out all of the outcomes of a sample is not a very helpful way of communicating the information obtained from the sample. A common, more helpful way to present the data of a sample is a frequency distribution. A frequency distribution gives the number of times each value of a random variable X occurred.
X    Frequency
1    2
2    6
3    5
4    3

We could also draw a frequency histogram of these frequencies.

[Frequency histogram: bars of heights 2, 6, 5, 3 at X = 1, 2, 3, 4, with frequency on the vertical axis.]
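A frequency distribution like the one above is straightforward to tabulate in code; this is a minimal sketch using Python's standard library, with the raw outcomes chosen to match the table.

```python
from collections import Counter

# Raw outcomes whose frequency distribution matches the table above:
# two 1s, six 2s, five 3s, three 4s.
outcomes = [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4]

freq = Counter(outcomes)  # maps each value of X to its frequency
for x in sorted(freq):
    print(x, freq[x])
```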

Frequency distributions are good summaries of data because they clearly show the variability in the observed outcomes. Another way to summarize results is with single-number summaries such as the following:

The mean of a sample of outcomes is the average value of the outcomes: the sum of the outcomes divided by the total number of outcomes. The mean of n outcomes x_1, ..., x_n for a random variable X is

    x̄ = (1/n) Σ_{i=1}^{n} x_i = (x_1 + x_2 + ··· + x_n)/n

The median of a sample is an outcome such that, when the outcomes are arranged in numerical order, half the outcomes fall before it and half fall after it.

The mode of a sample is the outcome that occurs most frequently. There can be multiple modes in a sample if several outcomes tie for the highest frequency.

Example 9.1.1
A fisherman records the weight of each fish he catches for a week. These are his results. Each value
represents the weight, in pounds, of a fish he caught.
{ 20, 23, 19, 27, 17, 22, 18, 15, 23, 25, 18, 23, 29 }
A frequency distribution of the sample above is
X     Frequency
15    1
17    1
18    2
19    1
20    1
22    1
23    3
25    1
27    1
29    1
And the following is a frequency histogram:

[Frequency histogram: one bar per weight from 15 to 29, with the tallest bar of height 3 at X = 23.]

The mean weight of the sample of fish is

    (20 + 23 + 19 + 27 + 17 + 22 + 18 + 15 + 23 + 25 + 18 + 23 + 29)/13 = 279/13 ≈ 21.46
13 13
The median weight of the sample of fish is found by rearranging the sample into numerical order and selecting the middle outcome. The median is 22.

    15, 17, 18, 18, 19, 20, 22, 23, 23, 23, 25, 27, 29

The mode is the weight that occurs most frequently. It corresponds to the tallest bar on the histogram. 23 lbs occurs the most (3 times), so it is the mode.
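All three summaries of the fisherman's sample can be computed with Python's standard statistics module; a minimal sketch:

```python
from statistics import mean, median, mode

# The fisherman's 13 recorded weights (in pounds).
weights = [20, 23, 19, 27, 17, 22, 18, 15, 23, 25, 18, 23, 29]

m = mean(weights)      # 279/13, about 21.46
med = median(weights)  # middle (7th) of the 13 sorted values: 22
mo = mode(weights)     # most frequent weight: 23
```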

9.2 Expected Value of a Random Variable


The expected value of a random variable X with range A and probability function f(x) is given by

    E(X) = µ = Σ_{x∈A} x f(x)

Note that in order to calculate the expected value of a random variable X, we often need to know
the distribution, and hence the probability function, of X.

Derivation of the Expected Value


Suppose we have a frequency distribution of a random variable X, as shown below (35 outcomes in total):

X      Frequency
5      10
10     7
25     13
100    4
200    1
As we learnt in the previous section, we can calculate the mean as

    [(5 × 10) + (10 × 7) + (25 × 13) + (100 × 4) + (200 × 1)] / 35
    = (5)(10/35) + (10)(7/35) + (25)(13/35) + (100)(4/35) + (200)(1/35)
    = Σ_{x∈A} x × (the fraction of times x occurs),   where A is the range of X

Now suppose we know the probability function of X is as follows:

x       5      10     25      100    200
f(x)    2/7    1/5    13/35   4/35   1/35

Using the relative frequency definition of probability, we know that if we observed a very large number
of outcomes, the fraction of times X = x occurs (relative frequency of x) is f (x).

Thus, in theory, we would expect the mean of a sample of infinitely many outcomes to be

    (5)(2/7) + (10)(1/5) + (25)(13/35) + (100)(4/35) + (200)(1/35) ≈ 29.857

This theoretical mean is denoted by µ or E(X), and is known as the expected value of X.

Example 9.2.1
A slot machine in a casino costs $5 to play. It pays out $2 with probability 0.5, $5 with probability 0.2, $10 with probability 0.1, and otherwise pays out nothing (probability 0.2). Let the random variable X be the amount of money (in dollars) the machine pays out in one play, and Y be the amount of money won or lost in one play. Find E(X) and E(Y).

E(X) = (0)(0.2) + (2)(0.5) + (5)(0.2) + (10)(0.1) = 3

E(Y ) = (−5)(0.2) + (−3)(0.5) + (0)(0.2) + (5)(0.1) = −2


Note that E(Y ) = E(X − 5) = E(X) − 5
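The slot-machine computation can be written out directly from the definition E(X) = Σ x f(x); a minimal sketch:

```python
# Payout X: $0 w.p. 0.2, $2 w.p. 0.5, $5 w.p. 0.2, $10 w.p. 0.1.
pmf = {0: 0.2, 2: 0.5, 5: 0.2, 10: 0.1}

EX = sum(x * p for x, p in pmf.items())        # E(X) = 3

# Net winnings Y = X - 5 (the $5 cost to play).
EY = sum((x - 5) * p for x, p in pmf.items())  # E(Y) = -2
```

Note that EY equals EX − 5, exactly as E(Y) = E(X − 5) = E(X) − 5 above.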

Example 9.2.2
A nightclub lets groups of up to 6 people enter at reduced fees. A randomly selected group in the
nightclub’s line has the following probabilities for its size and cost of entry:
Size of Group (X) Cost of Entry (Y) Probability
1 $10 0.1
2 $18 0.15
3 $26 0.1
4 $34 0.3
5 $42 0.15
6 $50 0.2

1. Let X be the size of a randomly selected group. Find E(X).

E(X) = (0.1)(1) + (0.15)(2) + (0.1)(3) + (0.3)(4) + (0.15)(5) + (0.2)(6) = 3.85

2. The cost of entry of a group, Y, is 8 times the size of the group plus 2; that is, Y = 8X + 2. Find the expected value of the cost of entry, in dollars, of a randomly selected group.

E(8X + 2) = E(Y ) = (0.1)(10) + (0.15)(18) + (0.1)(26) + (0.3)(34) + (0.15)(42) + (0.2)(50) = 32.8

3. Show that the expected value of the cost of entry of a randomly selected group is 8 times the expected value of the size of the group, plus 2.

8E(X) + 2 = 8 × 3.85 + 2 = 30.8 + 2 = 32.8 = E(8X + 2)



Theorem 9.2.1
Let X be a discrete random variable with range A and probability function f(x). The expected value of some function g(X) is given by

    E[g(X)] = Σ_{x∈A} g(x) f(x)

Proof 9.2.1:
Let the random variable Y = g(X) have range B and probability function fY(y) = P(Y = y). Then

    E[g(X)] = E(Y) = Σ_{y∈B} y fY(y)

Now, let Cy = { x | g(x) = y }, the set of all values of x such that g(x) is y. So

    fY(y) = P[g(X) = y] = Σ_{x∈Cy} f(x)

That is, the probability that Y = y is the sum of the probabilities of the values x for which g(x) = y. Now, we have

    E[g(X)] = Σ_{y∈B} y fY(y) = Σ_{y∈B} y Σ_{x∈Cy} f(x) = Σ_{y∈B} Σ_{x∈Cy} y f(x)
            = Σ_{y∈B} Σ_{x∈Cy} g(x) f(x)

Note that the inner summation is over all x such that g(x) = y and the outer is over all y; together, the double sum runs over every x. So

    E[g(X)] = Σ_{y∈B} Σ_{x∈Cy} g(x) f(x) = Σ_{x∈A} g(x) f(x)

where A is the range of X, as required. ∎
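The two sides of Theorem 9.2.1 can be computed separately and compared. The sketch below uses a hypothetical pmf and a g that is deliberately not one-to-one, so the grouping into sets Cy is visible.

```python
from collections import defaultdict

# Hypothetical pmf of X; g(x) = x^2 maps -2 and 2 (and -1 and 1) together.
f = {-2: 0.2, -1: 0.3, 1: 0.4, 2: 0.1}

def g(x):
    return x * x

# Left-hand route: build the pmf of Y = g(X) by pooling mass on each y...
fY = defaultdict(float)
for x, p in f.items():
    fY[g(x)] += p
via_Y = sum(y * p for y, p in fY.items())   # E(Y)

# ...right-hand route: sum g(x) f(x) directly over the range of X.
direct = sum(g(x) * p for x, p in f.items())  # E[g(X)]
```

Both routes give 1.9 here, as the theorem requires.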

Linear Properties of Expected Value

Theorem 9.2.2
For constants a, b and c,

E[ag1 (X) + bg2 (X) + c] = aE[g1 (X)] + bE[g2 (X)] + c



Proof 9.2.2:

    E[ag1(X) + bg2(X) + c] = Σ_{all x} [ag1(x) + bg2(x) + c] f(x)
                           = Σ_{all x} [ag1(x)f(x) + bg2(x)f(x) + cf(x)]
                           = Σ_{all x} ag1(x)f(x) + Σ_{all x} bg2(x)f(x) + Σ_{all x} cf(x)
                           = a Σ_{all x} g1(x)f(x) + b Σ_{all x} g2(x)f(x) + c Σ_{all x} f(x)
                           = aE[g1(X)] + bE[g2(X)] + c     (recall Σ_{all x} f(x) = 1)

9.3 Variance of a Random Variable


The variance of a random variable X is given by

    Var(X) = σ² = E[(X − µ)²]

where σ is the standard deviation of X. The variance is the average squared deviation of a random variable from its mean. It measures how far out from the mean the values of a random variable are spread.
The definition above is useful for understanding the variance's importance, but it can be difficult to use for actually calculating the variance. Here are a few other useful formulas for calculating the variance of a random variable:

    Var(X) = E(X²) − [E(X)]² = E(X²) − µ²                                 (9.1)

    Var(X) = E[X(X − 1)] + E(X) − [E(X)]² = E[X(X − 1)] + µ − µ²          (9.2)

Derivation of Alternative Formulas

    Var(X) = σ² = E[(X − µ)²]
           = E[X² − 2Xµ + µ²]
           = E(X²) − 2µE(X) + µ²     (by the linear property, since µ is a constant)
           = E(X²) − 2µ² + µ²         (since E(X) = µ)
           = E(X²) − µ²

Now note that X² = X(X − 1) + X, so we have

    Var(X) = σ² = E[X(X − 1) + X] − µ²
           = E[X(X − 1)] + E(X) − µ²
           = E[X(X − 1)] + µ − µ²
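All three expressions for the variance can be evaluated side by side on a small pmf; the particular pmf below is hypothetical, chosen only for illustration.

```python
# Hypothetical pmf of a discrete random variable X.
f = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}

mu = sum(x * p for x, p in f.items())

# Definition: E[(X - mu)^2]
v1 = sum((x - mu) ** 2 * p for x, p in f.items())

# Equation (9.1): E(X^2) - mu^2
v2 = sum(x * x * p for x, p in f.items()) - mu ** 2

# Equation (9.2): E[X(X - 1)] + mu - mu^2
v3 = sum(x * (x - 1) * p for x, p in f.items()) + mu - mu ** 2
```

For this pmf, all three come out to 0.84.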

Properties of Variance

Theorem 9.3.1
For constants a and b,

    Var(aX + b) = a² Var(X)

Proof 9.3.1:

    Var(aX + b) = E[(aX + b)²] − [E(aX + b)]²     (by equation (9.1))
                = E[a²X² + 2abX + b²] − [aE(X) + b]²
                = a²E(X²) + 2abE(X) + b² − a²[E(X)]² − 2abE(X) − b²
                = a²(E(X²) − [E(X)]²)
                = a² Var(X)
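A quick numerical check of Theorem 9.3.1, on a hypothetical pmf with arbitrary constants a and b:

```python
# Hypothetical pmf of X, and arbitrary constants for Y = aX + b.
f = {0: 0.25, 1: 0.5, 2: 0.25}
a, b = 3, 7

def var(pmf):
    """Variance computed from the definition E[(X - mu)^2]."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# pmf of Y = aX + b: same probabilities, shifted and scaled support.
fY = {a * x + b: p for x, p in f.items()}
```

Here var(f) = 0.5 and var(fY) = 4.5 = 3² × 0.5; the shift b has no effect on the variance.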


9.3.1 Standard Deviation of a Random Variable


The standard deviation of a random variable X is defined as

    σ = √Var(X) = √E[(X − µ)²]

It is another measure used to quantify the variability of a random variable.

A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the random variable.

9.4 Expected Value and Variance of Discrete Distributions


We can derive the expected value and variance of a random variable from its probability distribution. In this section we apply what we have learned about mean and variance to each of the discrete distributions we have previously encountered.

Binomial Distribution
Let X be a random variable such that X ∼ Binomial(n, p).
The expected value of X is
µ = E(X) = np
The variance of X is
σ 2 = Var(X) = np(1 − p)

Poisson Distribution
Let X be a random variable such that X ∼ Poisson(µ).
The expected value of X is
E(X) = µ
The variance of X is
σ² = Var(X) = µ
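Both sets of formulas can be verified by summing directly over the pmfs; the parameter values below are illustrative, and the Poisson sum is truncated where its tail is negligible.

```python
from math import comb, exp, factorial

def mean_var(pmf):
    """Mean and variance by direct summation over a pmf dict."""
    m = sum(x * q for x, q in pmf.items())
    return m, sum((x - m) ** 2 * q for x, q in pmf.items())

# Binomial(n, p): expect mean np and variance np(1 - p).
n, p = 12, 0.3
binom = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}
mb, vb = mean_var(binom)

# Poisson(mu): expect mean mu and variance mu (sum truncated at x = 59).
mu = 2.5
poisson = {x: mu**x * exp(-mu) / factorial(x) for x in range(60)}
mp, vp = mean_var(poisson)
```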

9.4.1 Derivations of Expected Values and Variances

Derivation of Expected Value and Variance of Binomial Distribution

The probability function of a r.v. X with X ∼ Binomial(n, p) is

    P(X = x) = f(x) = C(n, x) p^x (1 − p)^(n−x)

So we will use this to find E(X). We have

    E(X) = Σ_{x=0}^{n} x f(x) = Σ_{x=0}^{n} x C(n, x) p^x (1 − p)^(n−x)
         = Σ_{x=0}^{n} x · n!/(x!(n − x)!) · p^x (1 − p)^(n−x)

(when x = 0, the summation term is also 0, so we can ignore it)

         = Σ_{x=1}^{n} n(n − 1)!/((x − 1)!(n − x)!) · p^x (1 − p)^(n−x)
         = Σ_{x=1}^{n} n(n − 1)!/((x − 1)![(n − 1) − (x − 1)]!) · p · p^(x−1) (1 − p)^((n−1)−(x−1))
         = np(1 − p)^(n−1) Σ_{x=1}^{n} C(n − 1, x − 1) (p/(1 − p))^(x−1)

Let y = x − 1:

         = np(1 − p)^(n−1) Σ_{y=0}^{n−1} C(n − 1, y) (p/(1 − p))^y
         = np(1 − p)^(n−1) (1 + p/(1 − p))^(n−1)     (by the Binomial Theorem)
         = np(1 − p)^(n−1) · (1 − p + p)^(n−1)/(1 − p)^(n−1)
         = np

Now to find Var(X) we consider E[X(X − 1)]. We will use the same technique as above:

    E[X(X − 1)] = Σ_{x=0}^{n} x(x − 1) f(x) = Σ_{x=0}^{n} x(x − 1) C(n, x) p^x (1 − p)^(n−x)

(when x = 0 or x = 1, the summation terms are also 0, so we can ignore them)

         = Σ_{x=2}^{n} n(n − 1)(n − 2)!/((x − 2)![(n − 2) − (x − 2)]!) · p² · p^(x−2) (1 − p)^((n−2)−(x−2))
         = n(n − 1)p²(1 − p)^(n−2) Σ_{x=2}^{n} C(n − 2, x − 2) (p/(1 − p))^(x−2)

Let y = x − 2:

         = n(n − 1)p²(1 − p)^(n−2) Σ_{y=0}^{n−2} C(n − 2, y) (p/(1 − p))^y
         = n(n − 1)p²(1 − p)^(n−2) (1 + p/(1 − p))^(n−2)     (by the Binomial Theorem)
         = n(n − 1)p²

So we have

    Var(X) = E[X(X − 1)] + µ − µ² = n(n − 1)p² + np − (np)² = np − np² = np(1 − p)



Derivation of Expected Value and Variance of Poisson Distribution

The probability function of a r.v. X with X ∼ Poisson(µ) is

    P(X = x) = f(x) = µ^x e^(−µ) / x!

So we will use this to find E(X). We have

    E(X) = Σ_{x=0}^{∞} x f(x) = Σ_{x=0}^{∞} x µ^x e^(−µ)/x!

(when x = 0, the summation term is also 0, so we can ignore it)

         = µe^(−µ) Σ_{x=1}^{∞} µ^(x−1)/(x − 1)!

Let y = x − 1:

         = µe^(−µ) Σ_{y=0}^{∞} µ^y/y!
         = µe^(−µ) e^µ = µ     (recall e^µ = Σ_{y=0}^{∞} µ^y/y!)

Now to find Var(X) we consider E[X(X − 1)]. We will use the same technique as above:

    E[X(X − 1)] = Σ_{x=0}^{∞} x(x − 1) f(x) = Σ_{x=0}^{∞} x(x − 1) µ^x e^(−µ)/x!

(when x = 0 or x = 1, the summation terms are also 0, so we can ignore them)

         = µ²e^(−µ) Σ_{x=2}^{∞} µ^(x−2)/(x − 2)!

Let y = x − 2:

         = µ²e^(−µ) Σ_{y=0}^{∞} µ^y/y!
         = µ²e^(−µ) e^µ = µ²

So we have

    Var(X) = E[X(X − 1)] + µ − µ² = µ² + µ − µ² = µ

Chapter 10

Continuous Random Variables

10.1 Computer Generated Random Variables


Virtually all computer software has a built-in “pseudo-random number generator” that simulates observations of a random variable U from a uniform distribution, U(0, 1). By applying suitable functions to U, we can generate non-uniform random variables with a given cumulative distribution function F(x).

Theorem 10.1.1
If F is an arbitrary cumulative distribution function and U is uniform on [0, 1], then the random variable X defined by X = F⁻(U), where F⁻(y) = min{ x | F(x) ≥ y }, has cumulative distribution function F(x).

Proof 10.1.1:
Note that U < F(x) implies X ≤ x (apply F⁻ to both sides), and X ≤ x implies U ≤ F(x) (apply F to both sides). So we can say

    [ U < F(x) ] ⊆ [ X ≤ x ] ⊆ [ U ≤ F(x) ],   for all x

Taking probabilities across these containments, we have

    P[ U < F(x) ] ≤ P[ X ≤ x ] ≤ P[ U ≤ F(x) ],   for all x

Note that P[ U < F(x) ] = P[ U ≤ F(x) ] = F(x) since U is uniform and continuous. Thus,

    F(x) ≤ P(X ≤ x) ≤ F(x)

so P[ X ≤ x ] = F(x) for all x, as required. ∎

Example 10.1.1
Suppose we have a random variable U that is uniform on [0, 1] and we want to generate a random variable X with an exponential distribution. The cumulative distribution function of X is FX(x) = 1 − e^(−λx) for some λ > 0. Since FX(x) is a continuous, strictly increasing function for x > 0, let y = FX(x). Now,

    y = 1 − e^(−λx)
    1 − y = e^(−λx)
    ln(1 − y) = −λx
    x = −ln(1 − y)/λ

So FX⁻¹(y) = −ln(1 − y)/λ. Thus, by Theorem 10.1.1, X = FX⁻¹(U) has cumulative distribution function FX(x):

    X = −(1/λ) ln(1 − U)

Now, to find fX(x), the probability density function of X, we differentiate the cumulative distribution function. So,

    fX(x) = d/dx [1 − e^(−λx)] = λe^(−λx),   for x > 0
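The inverse-transform recipe of this example is short to implement; the sketch below draws a large sample and checks that its mean is close to the exponential mean 1/λ (the values of λ, the sample size, and the seed are illustrative).

```python
import random
from math import log

# Generate Exponential(lam) draws via X = -ln(1 - U)/lam with U ~ U(0, 1).
random.seed(1)
lam = 2.0
n = 200_000

# random.random() returns U in [0, 1), so 1 - U is in (0, 1] and log is safe.
xs = [-log(1 - random.random()) / lam for _ in range(n)]

sample_mean = sum(xs) / n  # should be near 1/lam = 0.5
```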

10.2 Normal Distribution


A random variable X has a normal distribution if it has a probability density function of the form

    f(x) = 1/(σ√(2π)) · e^(−(1/2)((x − µ)/σ)²),   for x ∈ ℝ

where µ ∈ ℝ and σ > 0 are parameters. It turns out that E(X) = µ and Var(X) = σ², so we write X ∼ N(µ, σ²), where X has expected value µ and variance σ².
The Normal distribution is the most widely used distribution in probability and statistics. Physical
processes leading to the Normal distribution exist but are a little complicated to describe.
The graph of the probability density function f (x) is symmetric about the line x = µ. The shape of the
graph is often termed a “bell shape” or “bell curve”.
[Graph of f(x): a bell-shaped curve symmetric about x = µ, shown over the range µ − 4σ to µ + 4σ.]

The cumulative distribution function of a Normal distribution is

    F(x) = P(X ≤ x) = ∫_{−∞}^{x} 1/(σ√(2π)) e^(−(1/2)((y − µ)/σ)²) dy

This integral cannot be given a simple mathematical expression, so numerical methods are used to compute its value for given x, µ, σ. Before computers could solve such problems, tables of probabilities F(x) were created by numerical integration. Only a table for the standard normal distribution, N(0, 1), is required to find F(x) for any µ and σ, since with a change of variable the c.d.f. of any normal distribution can be related to that of the standard normal distribution.

Theorem 10.2.1
Let X ∼ N(µ, σ²). If Z = (X − µ)/σ, then Z ∼ N(0, 1) and

    P(X ≤ x) = P(Z ≤ (x − µ)/σ)

Proof 10.2.1:
Let X ∼ N(µ, σ²). Then

    P(X ≤ x) = ∫_{−∞}^{x} 1/(σ√(2π)) e^(−(1/2)((y − µ)/σ)²) dy

Let z = (y − µ)/σ, so dz = dy/σ:

    P(X ≤ x) = ∫_{−∞}^{(x−µ)/σ} 1/√(2π) e^(−z²/2) dz
             = P(Z ≤ (x − µ)/σ)     ∎
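Standardization is exactly what numerical libraries do in practice. The sketch below computes the standard normal c.d.f. Φ via the error function, using the identity Φ(z) = (1 + erf(z/√2))/2, and then evaluates any normal c.d.f. by the change of variable of Theorem 10.2.1.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal c.d.f. via the identity Phi(z) = (1 + erf(z/sqrt(2)))/2."""
    return (1 + erf(z / sqrt(2))) / 2

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2): standardize, then use N(0, 1)."""
    return phi((x - mu) / sigma)
```

For example, normal_cdf(mu, mu, sigma) is exactly 0.5 by symmetry, and phi(1.96) ≈ 0.975, matching the standard normal tables.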
