Probability Notes
Jossy Sayir
js851@cam.ac.uk
Contents
1 Probability Fundamentals
1.1 Tribute and acknowledgments
1.2 What will be covered in this course?
1.3 Probability or statistics?
1.4 Foundations of Probability: the small print
1.5 Random variables
1.5.1 Shortcut for those who don't like the axiomatic approach
1.6 Expectation and entropy
1.7 Independence
1.8 Summary
1.9 Problems
Chapter 1
Probability Fundamentals
1.2 What will be covered in this course?
In this introduction, I aim to give you an idea of:
• where you stand (what you may already know about probability and statistics);
• what you still won't know about probability and statistics after this course.
1 Much of what is taught in school as probability and statistics is in fact combinatorics, a mathematical discipline concerned with counting things. There will be very little combinatorics in this course.
1.3 Probability or statistics?
Rather than provide my own elaborate map of the fields of probability and statistics, it's
easiest to borrow the contents pages of two books. The first of these books is the Riley, Hobson
and Bence [RHB06], a very heavy book covering all the mathematics that would ever be of use
to engineers (and some more.) Chapters 30 and 31 in this book cover probability and statistics,
respectively. Their contents are shown in http://assets.cambridge.org/97805216/79718/toc/
9780521679718_toc.pdf. The level you achieved in the IA "Teach yourself probability" examples
paper covers roughly 30.1-3 and 31.1-2. (Venn Diagrams, basic probability, permutations and
combinations, averages, variances and standard deviation.) In this chapter, I plan to go back over
what you learned in some more depth, then take you onwards to cover a cross-section of chapters
30 and 31 in the [RHB06]. By the end of the module, you should have a fair understanding of
most sections in these two chapters, or understand enough to be able to read up on anything in
these chapters not covered in the lectures. The [RHB06] is recommended reading for those of you
who want a different perspective or want to go beyond the material in the lectures.
The next book I want you to take a look at is the Billingsley [Bil95] “Probability and Measure”,
which is a standard textbook in probability for mathematicians. Its content is shown in http://
eu.wiley.com/WileyCDA/WileyTitle/productCd-1118122372.html. There is no need to dwell
on the detail of all 38 chapters in this book at this stage. The main point of this short excursion into
mathematics is to note the differences between our treatment of this subject for engineering and
the treatment for mathematicians. Notice that the topic of "Random variables" appears in Chapter 5, and "Expected values" and "Moment-generating functions" in Chapter 21. These are all topics that
we will cover in our lectures. Does this mean that we are jumping straight into very advanced and
intricate mathematical concepts, bypassing 20 preliminary chapters that mathematicians require
before delving into these subjects? Not really: our treatment will be based on a simplification, and
this simplified approach luckily covers most applications of probability of interest to engineers and
physicists. I would love to be able to tell you that you will never need a more advanced treatment
of probabilities, but that’s not necessarily true. There exist applications of engineering, physics
and economics where our intuitive understanding no longer applies. My purpose in showing you
the contents of [Bil95] is for you to get a feeling that probability theory is a rich and intricate
mathematical theory. Knowing the limits of your understanding may help you identify cases later
in life when it may become necessary to knock on the door of a friendly mathematician, or to
learn the full theory yourself.
Statistical inference, the process of deriving logical conclusions from premises in the presence
of uncertainty, binds the two fields together, and can often blur the distinction between the two.
Traditionally, statistics was seen as an end in itself. Medics, economists, biologists and social
scientists were taught very little probability and mostly statistics as it was considered a useful
tool for them to analyse data. Once probabilities had been estimated based on data, the range of
statements that one could make based on those probabilities tended to be very narrow, following
prescribed recipes that give little intuition or understanding. In fact, inferring logical conclusions
from data is a lot more delicate than people tended to think. There has been much public concern
about tragic misinterpretations of medical data or miscarriages of justice stemming from erroneous
interpretation of data by specialists who had not been adequately trained in probability theory.
Example 1.1. "Sudden infant death syndrome" (SIDS) is the sudden and unexplained death of an infant. In the UK, SIDS occurs in roughly 4 out of 10,000 live births. Sir Roy Meadow, a reputed paediatrician, repeatedly argued as an expert witness in trials against parents who had lost several infants to SIDS, with what became known as "Meadow's law": one death is a tragedy, two is suspicious and three is murder. He quoted odds of 1:73,000,000 against two SIDS deaths in the same family in white affluent non-smoking families. Meadow's calculation of this figure from statistics he had available was erroneous and overly simplistic. The Royal Statistical
Society took position against Meadow during a historic appeal, with some statisticians calcu-
lating adjusted probabilities showing there was no basis for the guilty verdict. The case led to
several guilty verdicts being overturned on appeal after parents had spent years in jail falsely
convicted of murdering their children on the basis of Meadow’s expert opinions. Meadow
appeared in front of the General Medical Council accused of serious professional misconduct.2
Example 1.2. A hypothetical university “C” bases its admissions decisions on a number of
criteria in an elaborate and expensive selection procedure. About 0.44% of all students admitted nationally study at "C". A sharp admissions officer notices the following:
of those admitted to “C”, 90% use the word “volunteer” in their personal statements. Among
those not admitted to “C”, only 20% use the word “volunteer”. Is the admissions officer onto
something hot? Could “C” abolish its elaborate and expensive selection procedure and take
its decisions solely on the basis of the presence of the word “volunteer” in the candidates’
personal statements?
This is a typical example of a probability calculation and we will see later on in the course
that Bayes’ theorem3 dictates that the probability that a candidate is admitted to “C” given
the presence of the word “volunteer” in their personal statement is about 2%, much improved
from the 0.44% baseline probability, but far from sufficient for a decision ignoring all other
criteria.
The application of probability theory to physics and engineering has not been uncontroversial.
Albert Einstein famously said “God doesn’t play dice with the world”. At the core of this unease,
there is a fundamental tension between physics and mathematics. On the face of it, physics
simply uses mathematics to establish models of the world. A model is valid for a certain range of
parameters. The only test of its validity is whether it allows one to make reliable predictions about
the world. On this account, a physicist should have no feelings about mathematical theories. They
are neither good nor bad, only useful when models work and irrelevant otherwise. This narrow
view however loses sight of intuition. Intuition is the ability of a physicist or engineer to guess the
solution of a mathematical problem without doing the maths. Dirac once said4: "I understand
what an equation means if I have a way of figuring out the characteristics of its solution without
actually solving it.” In order for intuition to work, there must be interaction between our physical
understanding of the world and the mathematical models describing it.
Probability has challenged mathematicians, physicists and many others. Perhaps one reason is
that many examples when developing the theory are associated with gambling and games of luck,
traditionally frowned upon by religious and moral institutions. Another reason, ironically, is that
it deals with notions such as “beliefs”, traditionally frowned upon by exact scientists. A third and
2 More details about this case http://en.wikipedia.org/wiki/Roy_Meadow
3 Theorem named after the English theologian and mathematician Rev Thomas Bayes (1701-1761)
4 As reported by Richard Feynman in the book “The Feynman Lectures on Physics” [FLS64].
final reason is that, as a result of the previous two reasons, probability theory was rarely taught
in depth to engineers and physicists. Hence, many influential scientists never felt at ease with it.
Luckily, the last reason is in the process of weakening and this IB module aims to erase it, at least
as far as Cambridge engineers are concerned.
1.4 Foundations of Probability: the small print
Definition 1.1. A sample space Ω is the set of possible outcomes of a random experiment.
Example 1.3.
• When flipping a coin, Ω = {heads, tails}.
• When throwing a dice, Ω = {1, 2, 3, 4, 5, 6}.
• When analysing financial data, Ω could be the set of all possible values of all stocks in
all of the world’s stockmarkets at all dates and times in the past, present and future.
• When designing a communication system, Ω could be the set of all the files a user might
ever consider transmitting and all the random behaviours (“noise”) the transmission
medium (“channel”) may possibly exhibit in the past, present or future.
• When studying a physical process, Ω could be every observable quantity in the universe.
A sample space can be a discrete finite set, a discrete countably infinite set such as the set
of integers, or a continuous set such as the set of real numbers. Most examples in probability
textbooks tend to concentrate on simple random experiments such as flipping a coin or throwing
a dice. However, most applications of probability theory typically concern much larger sets Ω like
those in our last three examples. In any case, in order for probability theory to make sense, your
sample space must include all random quantities you may ever want to examine jointly. If you are
going to consider two flips of a coin, then your sample space must include both flips. The concept
of a "repeated random experiment" is alien to the axiomatic approach. Probabilistic quantities in different sample spaces Ω1 and Ω2 cannot be compared. The French call a sample space "un univers de probabilités" (a universe of probabilities) and that is an appropriate description of Ω.
It is the universe of anything you may ever want to consider within the same framework or study.
Definition 1.2. An event is a subset A ⊆ Ω of the sample space.
Example 1.4.
• When throwing a dice, the event A = {2, 4, 6} ⊂ Ω that the outcome is even.
5 https://en.wikipedia.org/wiki/Andrey_Kolmogorov
• The event C = {4} that the outcome is 4. This is both an outcome and an event. Some
call it an “atomic” event because it contains just one outcome.
The idea that an event is a subset of possible outcomes is not everyone’s cup of tea. Why
call it “event” if it’s just a subset? Most people would not intuitively equate the notion of event
with the idea of a set or subset. However, you’ve all seen Venn diagrams before, not least in the
“Teach Yourself Probability” IA Maths Examples paper 9. Venn diagrams are simply a pictorial
representation of events as sets. An example Venn diagram is drawn in Figure 1.1.
[Figure 1.1: Venn diagram showing two overlapping events A and B inside the sample space Ω.]
The diagram
shows the sample space Ω and two events A and B. The intersection A ∩ B, the union A ∪ B, the
complements Ā and B̄ are further events that can be pictured in the Venn diagram. Suppose the
random experiment is a throw of a dice, Ω = {1, 2, 3, 4, 5, 6}, and A = {2, 4, 6} and B = {1, 2, 3}. In
set notation, it is obvious that A ∩ B = {2}, A ∪ B = {1, 2, 3, 4, 6}, Ā = {1, 3, 5} and B̄ = {4, 5, 6}.
In “event” terminology, it may be less easy to specify some of these events. A (the event that the
outcome is even), B (the event that the outcome is 3 or less), Ā (the event that the outcome is
NOT even, i.e., odd) and B̄ (the event that the outcome is NOT 3 or less, i.e., 4 or more) are
all easy to think about. However, A ∩ B (the event that the outcome is even AND 3 or less) or
A ∪ B (the event that the outcome is even OR 3 or less) are more difficult to consider intuitively.
The set analogy and its pictorial representation the Venn diagram are tools that help us refine our
intuition about events.
Definition 1.3. A probability measure p is a function that assigns numbers in R to events, such that the following axioms hold.
Axiom 1 For any event A ⊂ Ω, p(A) ≥ 0, i.e., the probability of any event is non-negative.
Axiom 2 p(Ω) = 1, i.e., the probability of the certain event is 1.
Axiom 3 For any events A and B with empty intersection A ∩ B = ∅,
p(A ∪ B) = p(A) + p(B),
i.e., the probability of the union of disjoint events is the sum of their probabilities.
Based on these three axioms, a number of further properties of probability measures can be
deduced, such as
• p(∅) = 0.
• Complement rule: p(Ω − A) = p(Ā) = 1 − p(A).
• Sum rule: p(A ∪ B) = p(A) + p(B) − p(A ∩ B) for any events A and B.
• If A ⊆ B then p(A) ≤ p(B).
You will prove these statements in the Examples Paper 1. For those not familiar with axiomatic
constructs, note how the three axioms are the three “truths” that must be accepted without proof
as minimal conditions for the theory to hold, whereas the four bullet points above are further
“truths” that can be proved as consequences of the axioms.
Example 1.5. The function assigning uniform probabilities of 1/6 to the atomic events containing the 6 outcomes of a throw of the dice is a valid probability measure:
• Axioms 1 and 2 are satisfied: all probabilities are non-negative, and the six disjoint atomic events together make up Ω, so p(Ω) = 6 · 1/6 = 1.
• Since the atomic events have no intersection, the probability of any other event can be calculated using Axiom 3 and corresponds to 1/6 times the number of elements in the event.
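The axioms are easy to experiment with numerically. Here is a minimal Python sketch (our own illustration, not part of the examples papers) that represents events as sets of outcomes and implements the uniform measure of this example:

```python
# Uniform probability measure on a fair dice throw.
OMEGA = {1, 2, 3, 4, 5, 6}

def p(event):
    """Measure of an event: 1/6 per outcome, extended by Axiom 3."""
    assert event <= OMEGA, "an event must be a subset of the sample space"
    return len(event) / 6

assert p(OMEGA) == 1.0          # Axiom 2
A = {2, 4, 6}                   # the outcome is even
B = {1, 2, 3}                   # the outcome is 3 or less
print(p(A), p(B))               # 0.5 0.5
print(p(A & B), p(A | B))       # 0.1666... and 0.8333...
```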
This example should make it easier for you to understand the meaning of the empty event ∅ and the certain event Ω. By definition, the random experiment is sure to have an outcome in Ω
and hence the probability of the event that no outcome occurs ∅ must always be zero. The empty
event is also called the impossible event. For the same reason, the event that any outcome in Ω
occurs is certain and must hence have probability 1 by Axiom 2. Note that the probability of an
intersection of events p(A ∩ B) is sometimes denoted
p(A, B)
Definition 1.4. The conditional probability of an event conditioned on an event of non-zero probability is defined as the joint probability divided by the probability of the conditioning event, i.e.,
p(A | B) = p(A, B)/p(B) = p(A ∩ B)/p(B). (1.3)
If the conditioning event has probability zero, the conditional probability is undefined. The
definition of the conditional probability is sometimes stated alternatively as
p(A ∩ B) = p(A | B) p(B) (1.4)
and called the product rule to match the sum rule above. Intuitively, the conditional probability
can be seen as a way to transfer the probability measure on Ω to a re-defined random experiment
where the conditional event is the new sample space.
Example 1.6. What is the conditional probability that the outcome of the dice throw will be 4
or more, given that it is even? (assume the uniform probability measure defined previously).
Using the events defined in our previous examples, the conditional probability is
p({4, 5, 6} | A) = p({4, 5, 6} ∩ {2, 4, 6})/p(A) = p({4, 6})/p({2, 4, 6}) = (2/6)/(3/6) = 2/3. (1.5)
As illustrated, the conditional probability given the event A is equivalent to re-defining a new
random experiment where the dice outcome is always even, i.e., with a new sample space
Ω′ = A, and transferring the original probability measure on Ω to this new setup. It is easy to
verify that the conditional probability measure will always satisfy the axioms of a probability
measure with respect to the conditional event interpreted as a sample space.
The notable Bayes’ theorem follows directly from the definition of conditional probability:
p(B | A) = p(B ∩ A)/p(A) = p(A | B) p(B)/p(A). (1.6)
Example 1.7. We can now use Bayes’ theorem and the sum rule to evaluate the probability
of a candidate getting admitted at hypothetical university "C" given the use of the word
“volunteer” in the candidate’s personal statement.
Let A be the event that a candidate is admitted at “C”. Let B be the event that the
candidate has the word “volunteer” in their personal statement. We are given the following
values:
• p(A) = 0.0044, i.e., the probability of being admitted to study at “C” nationally is
0.44%.
• p(B | A) = 0.9, i.e., the probability among those admitted to “C” for the word “volun-
teer” to have been in their personal statement is 90%.
• p(B | Ā) = 0.2, i.e., the probability among those not admitted to "C" for the word
“volunteer” to have been in their personal statement is 20%.
Using the sum rule, we deduce that
p(B) = p(B | A) p(A) + p(B | Ā) p(Ā) = 0.9 · 0.0044 + 0.2 · 0.9956 ≈ 0.2031,
and hence, by Bayes' theorem,
p(A | B) = p(B | A) p(A)/p(B) = 0.9 · 0.0044/0.2031 ≈ 0.0195,
i.e., the probability of being admitted to "C" given that a candidate has used the word "volunteer" in their personal statement is 1.95%.
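The arithmetic above is easy to verify numerically. The following Python sketch simply re-evaluates the sum rule and Bayes' theorem with the values given in the example:

```python
# Values given in Example 1.7.
p_A = 0.0044            # p(admitted to "C")
p_B_given_A = 0.9       # p("volunteer" | admitted)
p_B_given_notA = 0.2    # p("volunteer" | not admitted)

# Sum rule: marginalise over the two cases A and not-A.
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem.
p_A_given_B = p_B_given_A * p_A / p_B
print(f"p(A | B) = {p_A_given_B:.4f}")  # about 0.0195, i.e., 1.95%
```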
1.5 Random variables
Definition 1.5. A random variable X is a function that assigns a number to each outcome of the random experiment,
X : Ω −→ numbers.
We will use upper case letters, e.g., X, to denote random variables. Some may find this formal
definition hard to reconcile with their intuition. If X is a function, why in all the world did anyone
think of calling it a variable? This inconsistency is the result of historical evolution, where random
variables existed long before they were formally defined as part of the axiomatic approach. While
the function/variable dichotomy may at first be confusing, it does have one major advantage in
channeling our intuition towards a view of random variables as effects of an underlying common
randomness. In applications of probability theory, we often handle a large number of random
variables. The temptation would be to view them each as the outcomes of individual random
experiments. However, probability theory, as stated in the previous section, is unable to deal with
distinct random experiments. Viewing all random variables across space and time as functions of
an underlying common random experiment has the advantage that all variables remain comparable
within the framework of probability theory.
Example 1.8. For the throw of a dice with Ω = {1, 2, 3, 4, 5, 6},
• let X be a random variable that takes on the value 1 if the outcome is in {2, 4, 6} and 0 otherwise;
• let Y be a random variable that takes on the value 0 if the outcome is in {1, 4}, 1 if the outcome is in {2, 5}, and 2 if the outcome is in {3, 6}.
Despite the common underlying randomness, the variables X and Y are “independent” in the
sense that knowing one says nothing about the other. They act as if they were the outcome
of separate random experiments, but we can only make this statement because they are in
fact the outcome of a common random experiment and therefore comparable within probability
theory. We will formalise what we mean by “independent” random variables shortly.
Events themselves can be viewed as random variables in the sense that they are either true or
false for each outcome, and could hence be equated to a random variable taking the value 1 (true)
for outcomes in the event, or 0 (false) for outcomes not in the event. Such random variables are
called indicator random variables for the event. X in the example above is an indicator random
variable for the event A in the previous section.
Venn diagrams for events indicate the subset of the sample space Ω corresponding to the event,
i.e., the event is “true” for outcomes of the random experiment within the subset and “false” for
outcomes outside the subset. Figure 1.2 shows what the equivalent for a random variable would be,
though this representation is not commonly called a Venn diagram. The random variable partitions
the sample space Ω into subsets corresponding to its possible values. This pictorial representation
works well for discrete random variables defined over finite alphabets, and one would need some
effort of the imagination to extend the notion to discrete random variables defined over countably
infinite alphabets, or continuous random variables, but the principle is the same (think e.g. of a
heat map over Ω for continuous random variables.)
For any random experiment with sample space Ω, conditions on random variables such as
• is X equal to 1?
• is Y smaller or equal to 1?
define events, in that the condition can be stated as “the set of outcomes for which X is 1” or “the
set of outcomes for which Y is 1 or less”. Hence, we can rightfully use our probability measure to
[Figure 1.2: partition of the sample space Ω into the regions X = 1, . . . , X = 6 corresponding to the six values of the random variable X.]
denote the probability of such events, e.g., p(X = 1) and p(Y ≤ 1). We will use the notation
PX(x) = p(X = x) (1.11)
for the probability that the random variable X takes on the value x. PX is called the probability distribution or the probability mass function (PMF) of X. Note that we use upper case letters to denote the random variable and lower case letters to denote possible values.
We will also use the notation
FX(x) = p(X ≤ x) (1.12)
and call FX the cumulative probability function of X. We often want to make joint statements about two random variables X and Y, e.g., that X takes the value x and Y takes the value y. The event corresponding to this statement is the intersection of the two events and we use the notation
notation
PXY (x, y) = p(X = x ∩ Y = y) (1.13)
and call it the joint probability distribution of X and Y. Finally, the notion of conditional probability extends naturally to probability distributions as well and we write
PY|X(y | x) = PXY(x, y)/PX(x) (1.14)
for the conditional probability distribution of Y given X. As for events, this can be stated as
PXY(x, y) = PY|X(y | x) PX(x) (1.15)
in the form of a product rule.
Example 1.9. Consider again a fair dice throw but now let X be a random variable that takes
on the value 0 if the outcome is 3 or less and 1 otherwise, and let Y be a random variable
that takes on the value 0 if the outcome is 2 or less and 1 otherwise.
We have
PXY(0, 0) = p({1, 2}) = 1/3,
PXY(0, 1) = p({3}) = 1/6,
PXY(1, 0) = 0, and
PXY(1, 1) = p({4, 5, 6}) = 1/2. (1.16)
Although the sum rule and Bayes' theorem follow immediately from the previous section, there is usually little interest in the complement event X ≠ x to the event X = x. A more interesting rule follows if we note that, if Y is the set of all values taken on by the random variable Y, then the events Y = y for each y ∈ Y don't intersect and their union must be Ω since every outcome is assigned a value in Y. Hence, we can use Axiom 3 as we did in the sum rule to state
Σ_{y∈Y} PXY(x, y) = p(∪_{y∈Y} (X = x ∩ Y = y)) = p(X = x) = PX(x). (1.17)
This sum rule for probability distributions is also known as the marginalisation of joint probability
distributions and allows us to recover probability distributions of individual variables (also called
their marginal distributions) from their joint distributions. Bayes’ theorem can be applied using
marginalisation to give
PX|Y(x | y) = PY|X(y | x) PX(x) / PY(y) (1.18)
           = PY|X(y | x) PX(x) / Σ_{x′∈X} PY|X(y | x′) PX(x′) (1.19)
where the second step combines the marginalisation with Bayes’ theorem. This final expression is
the one most commonly used as Bayes’ theorem as it allows you to derive the “inverse probabilities”
PX|Y from the “forward probabilities” PY |X .
Example 1.10. We can marginalise the joint distribution in the previous example to obtain
PX (0) = PXY (0, 0) + PXY (0, 1) = 1/2 and PY (0) = PXY (0, 0) + PXY (1, 0) = 1/3. We can
compute conditional probabilities, such as the probability that Y = 0 given X = 0,
PY|X(0 | 0) = PXY(0, 0)/PX(0) = (1/3)/(1/2) = 2/3. (1.20)
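The sum and product rules are mechanical enough to automate. The following Python sketch (the dictionary representation of the joint distribution is our own choice) reproduces the marginals and the conditional probability of Examples 1.9 and 1.10:

```python
from fractions import Fraction as F

# Joint distribution of Example 1.9, keyed by (x, y).
P_XY = {(0, 0): F(1, 3), (0, 1): F(1, 6), (1, 0): F(0), (1, 1): F(1, 2)}

# Sum rule (marginalisation): P_X(x) = sum over y of P_XY(x, y).
P_X = {x: sum(p for (x2, _), p in P_XY.items() if x2 == x) for x in (0, 1)}
P_Y = {y: sum(p for (_, y2), p in P_XY.items() if y2 == y) for y in (0, 1)}
print(P_X[0], P_Y[0])  # 1/2 and 1/3

# Product rule: P_{Y|X}(y | x) = P_XY(x, y) / P_X(x).
print(P_XY[(0, 0)] / P_X[0])  # 2/3
```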
1.5.1 Shortcut for those who don’t like the axiomatic approach
Forget about sample spaces, events, probability measures, and pretty much everything you learned in this chapter so far. . . Just make a list of all the random variables you will ever need, X1, X2, X3, . . . , Xn. Now define a joint distribution satisfying
PX1X2...Xn(x1, x2, . . . , xn) ≥ 0 for all (x1, . . . , xn), with Σ_{(x1,...,xn)} PX1X2...Xn(x1, . . . , xn) = 1,
i.e., the joint probability distribution is non-negative everywhere and sums to 1. Once the joint probability distribution of all the random variables is defined, all other joint, conditional and individual probability distributions can be computed using the sum rule (marginalisation) and the product rule (conditional probabilities).
This is a valid summary of probability theory that doesn’t require an axiomatic construc-
tion and will accurately cover all applications of interest involving a finite collection of discrete
random variables. Once you progress to continuous random variables (which we will do in
this module) or to infinite collections of random variables (which you may do in Part IIA if
you learn about random processes), this edifice will become a bit shaky and you may want to
seek shelter in the safety of the well grounded axiomatic theory.
A more comprehensive treatment based on this approach is [Mac03, Chapter 2], which is recommended reading for this module.
1.6 Expectation and entropy
The expectation (or mean) of a discrete random variable X taking values in a set X is defined as E[X] = Σ_{x∈X} x PX(x); more generally, the expectation of a function f of X is E[f(X)] = Σ_{x∈X} f(x) PX(x).
Example 1.11. Let X be a random variable taking on the value of the outcome of a throw of a fair dice. Then
E[X] = 1 · (1/6) + 2 · (1/6) + . . . + 6 · (1/6) = (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 3.5 (1.24)
Let Y be the random variable taking the value 0 if the outcome is 2 or less, and 1 if the
outcome is 3 or more. Then
E[Y] = 0 · (1/3) + 1 · (2/3) = 2/3. (1.25)
Note that the expected value of a binary random variable such as Y is always the probability
of 1, i.e., E[Y ] = PY (1).
For example,
E[2X] = Σ_{k=1}^{6} 2k PX(k) = (1/6)(2 + 4 + 6 + 8 + 10 + 12) = 7, (1.28)
or
E[X²] = Σ_{k=1}^{6} k² PX(k) = (1/6)(1 + 4 + 9 + 16 + 25 + 36) = 15.167 (1.29)
Expectations are linear operators, which means that they fulfil the following two linearity properties. For any two random variables X and Y,
E[X + Y] = Σ_{(x,y)∈X×Y} (x + y) PXY(x, y)
        = Σ_{(x,y)∈X×Y} x PXY(x, y) + Σ_{(x,y)∈X×Y} y PXY(x, y)
        = Σ_{x∈X} x PX(x) + Σ_{y∈Y} y PY(y)
        = E[X] + E[Y]
where we used the marginalisation rule. For any random variable X and constant c,
E[cX] = Σ_{x∈X} c x PX(x) = c Σ_{x∈X} x PX(x) = c E[X].
Note that in general, the expectation of a product is not the product of expectations, i.e.,
E[XY] ≠ E[X] E[Y] in general, (1.30)
but we will learn in Section 1.7 about a sufficient condition on the random variables under which equality holds.
There are a few expectations of particular interest.
E[X²] = Σ_{x∈X} x² PX(x) (1.31)
is called the second moment of a distribution. The expectation or mean E[X] is also called the
first moment. We will learn a lot more about moments in Chapter 4. Another quantity of interest
is called the central second moment or variance and is defined as
Var(X) = E[(X − E[X])²].
By averaging the squared difference to the mean, it gives an indication of how “spread out” the
distribution is. If a distribution is tightly concentrated around its mean, its variance will be small.
Using linearity, we can re-write the variance as
Var(X) = E[X² − 2X E[X] + E[X]²] = E[X²] − 2E[X] E[X] + E[X]² = E[X²] − E[X]². (1.32)
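As a quick numerical sanity check, the following Python sketch evaluates both variance expressions for the fair dice distribution used above and confirms they agree:

```python
# Fair dice distribution.
P_X = {k: 1 / 6 for k in range(1, 7)}

mean = sum(x * p for x, p in P_X.items())                   # E[X] = 3.5
second_moment = sum(x**2 * p for x, p in P_X.items())       # E[X^2] = 15.167
var_def = sum((x - mean) ** 2 * p for x, p in P_X.items())  # E[(X - E[X])^2]
var_alt = second_moment - mean**2                           # E[X^2] - E[X]^2
print(var_def, var_alt)  # both 2.9166...
```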
When a random variable X takes on a value x, the probability PX(x) quantifies how unexpected the observation is, and there is value in considering the logarithm of this quantity. Since the logarithm of a probability is never positive, we prefer to consider the negative logarithm, or
ι(x) = − log₂ PX(x) = log₂(1/PX(x)). (1.35)
This quantity has a nice intuitive interpretation as a measure of our “surprise” at the outcome
of a random experiment and some call it the “information content” in the value of the random
variable. If the probability distribution assigns a small probability to the value, then its information
content is large and we are more surprised if it occurs. If the value x has probability 1/2, then the
information content is ι(x) = 1. If the value x has probability 1, then the information content is
ι(x) = 0 which is as low as can be, implying that we are not surprised at all when we observe this
value since we believed it would occur with probability 1. The information content is not defined
for values that have zero probability, but we could define it to be infinity, implying that we are
infinitely surprised if the random variable takes on a value that has zero probability.
The average of the information content is a measure of our uncertainty about a random variable
H(X) = E[ι(X)] = Σ_{x∈X} PX(x) ι(x) = Σ_{x∈X} PX(x) log₂(1/PX(x)). (1.36)
It was introduced by Claude Shannon and is known as Shannon’s entropy. Its unit is the bit6 when
the base of the logarithm is 2. Note the subtlety of the expectation notation in the definition of
the entropy, where the expression denotes the expectation of a function of X but the function
makes use of the probability distribution of X. For values of X that have probability zero, the
entropy expression is undefined. We can avoid this problem by noticing that lim_{x→0} x log(1/x) = 0 and extend the definition of the entropy by adopting the convention that "0 · log ∞ = 0".
Example 1.12. Let X be a binary random variable. If PX(0) = PX(1) = 1/2, we have H(X) = 1/2 + 1/2 = 1 bit of uncertainty about the random variable, which is the largest uncertainty we can have for a binary random variable.
If PX(0) = 0 and PX(1) = 1, we have H(X) = 0 bits of uncertainty about the random variable, meaning that we are completely certain about its outcome.
Let Y be the value of a fair dice throw. Then
H(Y) = 6 · (1/6) · log₂ 6 = log₂ 6 = log 6/log 2 = 2.6 bits (1.38)
One useful entropy value to remember is that a binary random variable with probabilities
PX (1) = 0.11 and PX (0) = 0.89 (or vice versa) has entropy H(X) ≈ 1/2.
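A small Python sketch of formula (1.36) reproduces the values quoted above (the function name entropy is our own):

```python
import math

def entropy(dist):
    """Shannon entropy in bits, using the convention 0 * log(1/0) = 0."""
    return sum(p * math.log2(1 / p) for p in dist if p > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit: a fair binary random variable
print(entropy([1 / 6] * 6))    # log2(6) = 2.585 bits: a fair dice throw
print(entropy([0.11, 0.89]))   # about 0.5 bits, as quoted above
```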
Information content and entropy are the object of an application of probability theory called
information theory that was pioneered by Shannon. It is taught in the Part IIA module 3F7 where
you will learn that the entropy of a random variable is an indication of how few binary symbols
you can express it in on average, and hence is a measure of central interest when designing data
compression algorithms.
6 The word bit has by now entered common language as describing a binary digit, but it was originally introduced
by Shannon as a measure of uncertainty or information, and we will stick to its original meaning here. A binary
digit is only a bit if it is equally likely to be 0 and 1.
1.7 Independence
Definition 1.6. Two events A and B are said to be independent if
p(A, B) = p(A ∩ B) = p(A) · p(B), (1.39)
i.e., if their joint probability factors into the product of their individual probabilities.
This definition is easily extended to more than two events, e.g., A, B and C are independent
if p(A, B, C) = p(A)p(B)p(C).
Example 1.14. When throwing a fair dice7 , let A be the event that the outcome is even, and
B the event that the outcome is 2 or less. We have
p(A, B) = p(A ∩ B) = p({2}) = 1/6 (1.40)
and
p(A) p(B) = p({2, 4, 6}) p({1, 2}) = (1/2) · (1/3) = 1/6 (1.41)
so the two events are independent. Knowing whether the outcome is even or odd is not helpful
in determining whether the outcome is 2 or less.
The last statement in the example above is easier to relate to when considering the definition
of conditional probability. If two events A and B are independent and B has non-zero probability,
then
p(A|B) = p(A ∩ B)/p(B) = p(A) p(B)/p(B) = p(A) (1.42)
and hence the probability of A knowing B is the same as the probability of A without knowing
B. This is taken by some as the definition of independence, but has the drawback that it relies on
conditional probability which is only defined when the conditioning event has non-zero probability,
whereas the definition we gave is more general.
Extending the concept of independence to random variables requires some further thought.
Let X and Y be random variables taking on values over the sets X and Y, respectively. Let x1
and x2 be two elements of X and y1 and y2 be two elements of Y. Remember that the probability
distribution of a random variable was derived by arguing that X = x1 defines the event (set) of
outcomes for which X takes on the value x1 and hence PX (x1 ) = p(X = x1 ). Now, it may well
be that X = x1 and Y = y1 define two independent events,
p(X = x1 ∩ Y = y1) = p(X = x1) p(Y = y1), (1.43)
while the events X = x2 and Y = y2 fail the corresponding condition. In this case, although some of the events defined by the random variables are independent, we will not say that the random variables are independent.
Definition 1.7. Two random variables X and Y are independent if all the events corresponding to values of X are independent of all the events corresponding to values of Y, i.e., if
PXY(x, y) = PX(x) PY(y) (1.44)
for all (x, y) ∈ X × Y.
7 In these notes, "dice" is used for both the singular and the plural.
As for independent events, for independent random variables X and Y the relation
PX|Y(x|y) = PXY(x, y)/PY(y) = PX(x) PY(y)/PY(y) = PX(x) (1.46)
holds whenever PY(y) is non-zero.
Thinking further: an event is essentially a binary (indicator) random variable. The joint
distribution of two binary random variables must satisfy 2×2 = 4 conditions for independence,
whereas the associated events need to satisfy only one condition. Clearly, 3 of the 4 conditions
for the random variable are redundant. Can you tell why? Think of Axioms 2 and 3. The
same argument implies that some of the nX ×nY conditions in the general case are redundant.
How many?
For independent random variables, we can express the expectation of the product as the product of expectations, i.e., if Z = XY for independent random variables X and Y, then
E[Z] = E[XY] = Σ_{(x,y)∈X×Y} x y PXY(x, y) (1.47)
     = Σ_{(x,y)∈X×Y} x y PX(x) PY(y) (1.48)
     = (Σ_{x∈X} x PX(x)) (Σ_{y∈Y} y PY(y)) = E[X] E[Y]. (1.49)
Note again that this is not true in general of two random variables that are not independent. Independence is a sufficient condition but not a necessary condition for E[XY] = E[X]E[Y] to hold. Random variables X and Y for which E[XY] = E[X]E[Y] are called uncorrelated. Independent random variables are always uncorrelated, but it is possible in theory for two variables to be uncorrelated but not independent. Think of uncorrelatedness as a "cheap independence": it is easier to verify (just one condition as opposed to conditions for all x and y) and it hints at possible independence without guaranteeing it.
Before we move on, we will give a quick thought to independence when more than two random variables are involved. If three or more random variables satisfy the condition
PX1X2...Xn(x1, x2, . . . , xn) = PX1(x1) PX2(x2) · · · PXn(xn) (1.50)
for all (x1, x2, . . . , xn) in X1 × X2 × . . . × Xn, then it is easy to see, by applying the sum rule, that any Xj and Xk for j ≠ k are independent. We say that the random variables are mutually independent. However, if
PXjXk(xj, xk) = PXj(xj) PXk(xk) (1.51)
for all j, k, and (xj, xk) in Xj × Xk, but (1.50) is not fulfilled, then we say that the random variables are only pairwise independent.
Example 1.15. Let X and Y be binary random variables indicating “heads” in a random
experiment involving two fair coins thrown independently. X is one if the first coin shows
“heads” and zero otherwise, and Y is one if the second coin shows “heads” and zero otherwise.
Let Z be a binary random variable obtained from XORing X and Y , where the XOR operation
gives 1 if either X or Y are 1 but not both, and 0 if both X and Y are zero or both are one.
The joint distribution of X, Y, Z is
PXYZ(0, 0, 0) = 1/4
PXYZ(0, 0, 1) = 0
PXYZ(0, 1, 0) = 0
PXYZ(0, 1, 1) = 1/4
PXYZ(1, 0, 0) = 0
PXYZ(1, 0, 1) = 1/4
PXYZ(1, 1, 0) = 1/4
PXYZ(1, 1, 1) = 0 (1.52)
We obtain PZ (0) = 1/2 by marginalisation of the joint distribution. We see that the three
random variables are not mutually independent by observing PX(0) PY(0) PZ(0) = (1/2)³ = 1/8, which is not equal to PXYZ(0, 0, 0) = 1/4. This is not surprising: Z is a function of X and Y, so X and Y fully determine Z and cannot be jointly independent of it.
On the other hand, we can obtain the joint distribution of X and Z by marginalising over
Y,
PXZ(0, 0) = PXYZ(0, 0, 0) + PXYZ(0, 1, 0) = 1/4
PXZ(0, 1) = PXYZ(0, 0, 1) + PXYZ(0, 1, 1) = 1/4
PXZ(1, 0) = PXYZ(1, 0, 0) + PXYZ(1, 1, 0) = 1/4
PXZ(1, 1) = PXYZ(1, 0, 1) + PXYZ(1, 1, 1) = 1/4 (1.53)
and hence verify that X and Z are independent. Y and Z are also independent by symmetry
and X and Y are independent by definition, so we conclude that X, Y and Z are pairwise
independent.
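These verifications can be automated. The following Python sketch builds the joint distribution (1.52) and checks that every pairwise factorisation holds while the mutual condition (1.50) fails:

```python
from itertools import product
from fractions import Fraction as F

# Joint distribution of Example 1.15: X, Y fair coins, Z = X XOR Y.
P = {(x, y, x ^ y): F(1, 4) for x, y in product((0, 1), repeat=2)}

def joint(x, y, z):
    return P.get((x, y, z), F(0))

def marginal(i, v):
    """Probability that variable number i takes value v (sum rule)."""
    return sum(p for key, p in P.items() if key[i] == v)

def pair(i, j, a, b):
    """Joint distribution of variables i and j, marginalising the third."""
    return sum(p for key, p in P.items() if key[i] == a and key[j] == b)

pairwise = all(pair(i, j, a, b) == marginal(i, a) * marginal(j, b)
               for i, j in ((0, 1), (0, 2), (1, 2))
               for a, b in product((0, 1), repeat=2))
mutual = all(joint(x, y, z) == marginal(0, x) * marginal(1, y) * marginal(2, z)
             for x, y, z in product((0, 1), repeat=3))
print(pairwise, mutual)  # True False: pairwise but not mutually independent
```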
Example 1.16. What is the probability that two or more people have the same birthday in
a group of n people? This problem is sometimes called the “birthday paradox” because the
probability is higher than most people would expect. A few (wrong) assumptions are made
to make the problem easy:
• Leap years and 29 February are ignored8.
• It is assumed that the probability that someone has their birthday on any day of the
year is 1/365, i.e., birthdays are equally probable to be on any day of the year.
• The birthdays in the group are assumed to be independent.
Let us write B1 , B2 , . . . , Bn for the random variables corresponding to the birthdays of
the n people in the group. The event that two or more people have the same birthday is the
complement of the event that all people have different birthdays, whose probability can be
computed as
Σ_{k1} Σ_{k2∉{k1}} Σ_{k3∉{k1,k2}} · · · Σ_{kn∉{k1,...,kn−1}} PB1(k1) PB2|B1(k2|k1) · · · PBn|B1...Bn−1(kn|k1 . . . kn−1). (1.54)
Since all the random variables are independent and uniform, all probabilities in the expression
are 1/365, so it is only a matter of taking sums over the sets. The sum over k1 is over all
365 birthdays, the sum over k2 over all 364 birthdays not equal to k1 , etc. The resulting
expression is hence
p(n different birthdays) = (365/365) · (364/365) · (363/365) · · · ((366 − n)/365) (1.55)
                        = 365! / (365^n (365 − n)!) (1.56)
Hence, the probability that two or more people have the same birthday is
p(shared birthday) = 1 − 365! / (365^n (365 − n)!). (1.57)
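Expression (1.55) is easy to evaluate numerically, as in the following Python sketch; famously, 23 people already suffice for the probability of a shared birthday to exceed 1/2:

```python
def p_shared_birthday(n):
    """Probability that at least two of n people share a birthday."""
    p_all_different = 1.0
    for k in range(n):
        p_all_different *= (365 - k) / 365  # person k avoids k earlier birthdays
    return 1 - p_all_different

for n in (10, 23, 50):
    print(n, round(p_shared_birthday(n), 3))  # 0.117, 0.507, 0.970
```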
1.8 Summary
Introduction:
• Probability theory is a branch of mathematics that deals with uncertain events.
• An event is a subset of Ω.
• A probability measure p is a function that assigns numbers in R to events.
• Axiom 1: For any event A ⊂ Ω, p(A) ≥ 0, i.e., the probability of any event is non-
negative.
• Bayes' theorem: PX|Y(x | y) = PY|X(y | x) PX(x) / Σ_{x′∈X} PY|X(y | x′) PX(x′)
• Expectation: E[f(X)] = Σ_{x∈X} f(x) PX(x), in particular E[X] = Σ_{x∈X} x PX(x) and E[X²] = Σ_{x∈X} x² PX(x).
• Linearity of expectation: E[X + Y] = E[X] + E[Y] and E[cX] = cE[X] for any constant c.
Independence:
• Independent events: p(A, B) = p(A ∩ B) = p(A) · p(B) and hence p(A|B) = p(A)
• Independent random variables: PXY (x, y) = PX (x)PY (y) for all x, y and hence
PX|Y (x|y) = PX (x)
• For independent random variables X and Y , E[XY ] = E[X]E[Y ].
1.9 Problems
Having completed Chapter 1, you should be able to attempt Problems 1 to 5 of Examples Paper 5.
Chapter 2
Discrete Probability Distributions
The previous chapter was loaded with concepts that were new for many of you. This chapter will be
much easier to digest as we simply aim to get familiar with a few common probability distributions
and illustrate them with examples. Note that a comprehensive list of known distributions would
take far more time than we can afford to spend in this course and learning them by heart would also add little educational value, so we picked a few essential distributions that are worth knowing about. If you ever need to know about other distributions than the ones presented in this lecture,
a good place to start is the Wikipedia list of probability distributions.
2.1 The Bernoulli distribution
A Bernoulli random variable X ∼ Ber(p) is a binary random variable with PX(1) = p and PX(0) = 1 − p. Bernoulli random variables arise in many contexts:
• As mentioned in the previous chapter, they are indicator random variables for events.
• They also occur in their own right in digital communications, where information is often encoded into binary symbols.
• Probability textbooks often illustrate Bernoulli distributions using "biased coins". These are coins that have different probabilities of landing on "heads" or "tails".
How to make a biased coin: have you ever seen a biased coin in reality? If you type "how to make a biased coin" into a search engine you get lots of ingenious suggestions, such as bending the coin. My doctoral supervisor Jim Massey was given an extreme version of a biased coin as a retirement present: this had been manufactured by filing down two coins and gluing them together to yield something looking like a coin giving a random variable X ∼ Ber(1) that would always indicate "heads". Disclaimer: please check the legality of defacing coins in your jurisdiction before rushing to produce your own "Massey coin". In the UK, as far as I am aware, it is legal to do so at the time of writing.
The expectation is E[X] = PX(1) = p, as we noted for any binary random variable in the previous chapter. The other expectations of interest are the second moment
E[X²] = PX(0) · 0² + PX(1) · 1² = p (2.3)
and the variance
Var(X) = E[X²] − E[X]² = p − p² = p(1 − p). (2.4)
The entropy of a Bernoulli random variable is known as the binary entropy function of p,
H(X) = H2(p) = PX(0) log₂(1/PX(0)) + PX(1) log₂(1/PX(1)) = p log₂(1/p) + (1 − p) log₂(1/(1 − p)) (2.5)
with H2 (0) = H2 (1) = 0 using our rule of “0 log 0 = 0”. A plot of the binary entropy function is
represented in Figure 2.1. Notice that our uncertainty about a binary random variable is highest
when it is equally likely to be 1 or 0, and at its lowest when it is certain to be 0 or 1. Our
uncertainty is about half a bit when the probability of 1 or 0 is about 0.11.
[Figure 2.1: the binary entropy function H2(p) plotted for p between 0 and 1, rising from 0 at p = 0 to a maximum of 1 bit at p = 1/2 and falling back to 0 at p = 1.]
2.2 The binomial distribution
A binomial random variable Y ∼ B(n, p) is the sum of n independent Ber(p) random variables X1, X2, . . . , Xn: it is the number of ones, "yes" answers, or successes in n independent Bernoulli trials. Note that each individual sequence of values x1, x2, . . . , xn has probability
PX1,...,Xn(x1, . . . , xn) = Π_{k=1}^{n} PXk(xk) = p^(Σk xk) (1 − p)^(n − Σk xk), (2.7)
in other words its probability only depends on the number of ones and number of zeros. For
example,
PX1 X2 X3 X4 X5 (0, 1, 0, 1, 1) = p3 (1 − p)2 (2.8)
because the sequence contains three ones and two zeros. The number of sequences of length n
that have k ones and n − k zeros (when order is irrelevant) is the number of combinations
(n k) = nCk = n!/((n − k)! k!) (2.9)
where the first notation is preferred by engineers, mathematicians and scientists, and the second notation is preferred by designers of calculators because the first notation doesn't fit on a calculator key. We will use both notations interchangeably.
We’ve now deduced an expression for the binomial distribution
PY(k) = (n k) p^k (1 − p)^(n−k) (2.10)
for k = 0, 1, . . . , n.2 We can easily verify that the distribution sums to one using the binomial
expansion3 as follows
(p + (1 − p))^n = 1 = (n 0) p⁰(1 − p)^n + (n 1) p(1 − p)^(n−1) + (n 2) p²(1 − p)^(n−2) + . . . + (n n) p^n(1 − p)⁰ (2.11)
where the expression on the right is precisely the sum of the binomial distribution.
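The distribution (2.10) is straightforward to evaluate with Python's math.comb, and the following sketch also confirms numerically that it sums to one:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P_Y(k) for Y ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 6, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(sum(pmf))  # 1.0 up to floating-point rounding
```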
Example 2.1. A hypothetical student "C." has a probability of p = 1/3 of failing to wake up in time for lectures on any given morning. What is the probability that "C." attends4 k of the six IB Paper 7 "Probability and Statistics" lectures?
The random variable Y counting the number of lectures attended is binomial B(6, 2/3), since each lecture is attended with probability 1 − p = 2/3. Figure 2.2 illustrates the distribution graphically. Observe that this particular distribution has its mode (largest probability) at 4 lectures. In general, the mode of a random variable X ∼ B(n, p) is the largest integer no larger than (n + 1)p, except when (n + 1)p is an integer, in which case there are two modes, at (n + 1)p and (n + 1)p − 1. We will compute the mean and standard deviation of binomial distributions below.
2 Note that a binomial random variable Y ∼ B(n, p) which is the sum of n Bernoulli random variables can take
on n + 1 values from 0 to n.
3 We use a slightly different binomial expansion than the one in your data book here, that can easily be derived from it.
4 If students fail to wake up in time because they were up until 4 a.m. watching my fascinating video lectures, so please be assured of my warmest appreciation!
[Figure 2.2: the binomial distribution PY(k) of the number of lectures attended, for k = 0, 1, . . . , 6.]
Example 2.2. A four-engine airplane remains airborne as long as at least two of its engines function, while a two-engine airplane remains airborne as long as at least one of its engines functions. Suppose each engine fails independently with probability p during a flight. For the two-engine plane,
p(airborne) = 1 − p², (2.12)
while for the four-engine plane, with Y ∼ B(4, p) counting the number of engine failures,
p(airborne) = PY(0) + PY(1) + PY(2) = (1 − p)⁴ + 4p(1 − p)³ + 6p²(1 − p)² = 1 − 4p³ + 3p⁴. (2.13)
Equating the two expressions leads to the surprising conclusion that four engines are not always better than two engines: there is a critical point at p = 1/3 above which it is preferable to fly in a two-engine plane. Personally, I would not be inclined to fly on any airplane whose engines have such a high probability of failure, but the example nonetheless illustrates an interesting point.
The expected value of the binomial distribution can be computed tediously using the definition
of the expectation
E[Y] = Σ_{k=1}^{n} k (n k) p^k (1 − p)^(n−k) (2.14)
or, alternatively, we can use the linearity properties developed in the previous section to note that, if Y = X1 + X2 + . . . + Xn where each Xk is Ber(p), then
E[Y] = E[X1] + E[X2] + . . . + E[Xn] = np. (2.15)
We can compute the second moment of a binomial distribution using the expectation product
rule for independent random variables
E[Y²] = E[(X1 + . . . + Xn)²] = n E[X1²] + 2 (n 2) E[X1]² = np + n(n − 1)p² (2.16)
following a simple derivation, and hence the variance is
Var(Y) = E[Y²] − E[Y]² = np + n(n − 1)p² − (np)² = np(1 − p). (2.17)
Finally, the Poisson distribution introduced in the last section of this chapter gives a good approximation of the binomial distribution for small p and large n.
2.3 The geometric distribution
A geometric random variable Y is the number of independent Bernoulli trials up to and including the first success,
Y ∼ Geom(p), (2.20)
derived from an infinite collection X1, X2, . . . of independent Ber(p) random variables, and therefore
PY(k) = (1 − p)^(k−1) p for k = 1, 2, 3, . . . , (2.21)
i.e., the probability that the first k − 1 variables are zero and that the k-th variable is one. We
leave it as an exercise to verify that
E[Y] = 1/p (2.24)
Hint: use the expression for the sum of a geometric sequence in the maths data book and note
that
(d/dr) S∞ = (d/dr) (a/(1 − r)) = (d/dr) Σ_{k=0}^{∞} a r^k. (2.25)
The variance can also be determined by repeated use of the differentiation trick above to yield
Var(Y) = (1 − p)/p². (2.26)
The entropy of a geometric distribution can also be derived easily using sums of geometric series and the differentiation trick to yield
H(Y) = H2(p)/p. (2.27)
Example 2.3. The geometric distribution often comes into play in repeated random experiments as the distribution of the number of attempts until success or failure. Consider for example trying to generate a random point in the disc of radius r by using Python's pseudo-random generator random.uniform(-r,r) twice, obtaining two random variables (the x coordinate and the y coordinate) ranging between −r and r. For the purposes of this example, let us assume that Python's pseudo-random generator is a perfect generator of uniformly distributed random variables (these are continuous random variables, about which we will learn more in the next chapter). If the resulting point is not in the disc, i.e., if x² + y² > r², the attempt is discarded and two new random numbers need to be generated, and so forth until the point obtained is in the disc. Let Z be the number of attempts until we successfully generate a point in the disc. The probability of success at every attempt is the surface of the disc of radius r divided by the surface of the square of side 2r, i.e.,
p("success") = πr²/(2r)² = π/4 = 0.7854 (2.28)
Z is geometrically distributed with p = π/4. The number of attempts needed averages E[Z] = 1/p = 4/π = 1.27. The distribution of attempts is illustrated in Figure 2.3.
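A Python sketch of this rejection procedure (the function name point_in_disc is our own) confirms the average number of attempts empirically:

```python
import random

def point_in_disc(r=1.0):
    """Repeatedly sample the square until the point lands in the disc."""
    attempts = 0
    while True:
        attempts += 1
        x = random.uniform(-r, r)
        y = random.uniform(-r, r)
        if x**2 + y**2 <= r**2:  # accept only points inside the disc
            return (x, y), attempts

attempts = [point_in_disc()[1] for _ in range(100_000)]
print(sum(attempts) / len(attempts))  # close to 4/pi = 1.273
```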
Note that some textbooks define the geometric distribution as PY (k) = p(1 − p)k for k =
0, 1, 2, 3 . . ., so starting with 0 instead of 1 as we did. These definitions are equivalent in all
respects except that you have to take 1 from the expectation if you start at 0 instead of 1, giving
E[X] = 1/p − 1. The variance and entropy remain the same. Your mathematics data book has
both options, called Geometric (1) (starting from zero) and Geometric (2) (starting from 1 as
presented here.)
5 Pronounced "pwa-sô", where the last vowel is a nasal o. I have attended a talk by a reputed mathematician who spent his life studying Poisson processes yet kept calling them "poison processes". "Poisson" means fish in French and there is nothing poisonous about fish as long as it's fresh.
6 In most probability textbooks the Poisson distribution is described as modeling the number of "events" that occur in a time interval. The problem is that we've given the word "event" a very precise meaning in probability theory, but in this case the word is used in its colloquial sense, "a thing that happens at a given time". We will use the synonym "incident" to avoid confusion even though it feels a bit clunky.
[Figure 2.3: Geometric distribution for the number of attempts to generate a point in a circle.]
2.4 The Poisson distribution
The Poisson distribution5 models the number of incidents6 occurring in a given time interval when incidents occur at an average rate of λ incidents per interval.
Example 2.4. A certain model of 10 Gb/s Ethernet router handles packets at an average rate
of 5 × 106 packets per second. What is the probability distribution of the number of packets
handled in any given microsecond (µs)?
The number of packets handled in a µs follows a Poisson distribution with parameter
λ = 5.
Example 2.5. A bridge is being built for a new motorway linking Cambridge to Milton Keynes,
that will carry traffic at an average rate of 10 vehicles per minute. What is the probability
distribution for the number of vehicles the bridge has to carry in any given hour?
The number of vehicles in an hour follows a Poisson distribution with parameter λ = 600.
Note that this ignores predictable rate fluctuations according to time of day, season, etc. The
structural engineers building the bridge would do well to dimension the structure using more
accurate traffic models than just the average rate of traffic.
Rather than postulate an expression for the Poisson distribution, we will derive it from first
principles. Let Y be the random variable counting the number of incidents in the time interval of
interest,
Y ∼ Poisson(λ). (2.29)
We will refer to the time interval of interest as the “unit interval” (e.g. 1µs.) Let us now sub-
divide the unit interval into n equal sub-intervals. If there are λ incidents per unit interval on
average, there will be λ/n incidents in each sub-interval on average. If we pick n large enough so
that λ/n ≪ 1, since each interval can only contain an integer number of incidents, most intervals will contain no incident and some may contain one. The probability that any interval contains two or more incidents is expected to go to zero as n gets large enough. Hence, we can define indicator random variables X1, X2, . . . , Xn that are 1 if an incident occurs in the corresponding sub-interval and 0 otherwise. Each Xk is then approximately Ber(λ/n), so Y is approximately binomial B(n, λ/n),
PY(k) ≈ (n k) (λ/n)^k (1 − λ/n)^(n−k), (2.30)
with the approximation becoming exact in the limit n → ∞.
[Figure 2.4: Poisson distributions PY(k) for various parameters λ.]
For a fixed k,
lim_{n→∞} (n k)/n^k = lim_{n→∞} n!/(n^k (n − k)! k!) = lim_{n→∞} (n/n) · ((n − 1)/n) · · · ((n − k + 1)/n) · (1/k!) = 1/k!. (2.31)
Furthermore, using a limit from the Mathematics Data Book (page 2), we have
lim_{n→∞} (1 − λ/n)^(n−k) = lim_{n→∞} (1 − λ/n)^n · (1 − λ/n)^(−k) = e^(−λ) · 1 (2.32)
giving
PY(k) = (λ^k/k!) e^(−λ) for k = 0, 1, 2, 3 . . . (2.33)
Since we’ve derived the Poisson distribution as an approximation to a binomial, it is clear that
the Poisson distribution can also be used in reverse to approximate a binomial distribution. This
works for small p and large n and is useful when the binomial distribution is difficult to compute
numerically due to the difficulty of evaluating (n k). Hence, for large n and small p,
(n k) p^k (1 − p)^(n−k) ≈ ((pn)^k/k!) e^(−pn). (2.34)
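The quality of the approximation is easy to inspect numerically; the following Python sketch compares the two distributions for n = 1000 and p = 0.005, so that pn = 5:

```python
from math import comb, exp

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    pmf = exp(-lam)
    for i in range(1, k + 1):  # build lam^k / k! incrementally
        pmf *= lam / i
    return pmf

n, p = 1000, 0.005
for k in (0, 2, 5, 10):
    print(k, binomial_pmf(k, n, p), poisson_pmf(k, n * p))
```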
Poisson distributions for various parameters λ are shown in Figure 2.4. The expectation of
the Poisson distribution is λ by definition, since we assumed that λ was the average number of
incidents per time interval when deriving the distribution. For peace of mind, it’s worth double-
checking by recomputing the expectation from first principles, which is left as an exercise (when
doing so, look out for an expression that turns out to be the power series of the exponential
function.)
The Poisson distribution unusually has identical expectation and variance (which we won't derive but is again a fun exercise if you enjoy juggling around with algebra and derivatives):
E[Y] = Var(Y) = λ. (2.35)
This is a very useful property and often used in data analysis, as the expectation and the variance
are easy to estimate from data and checking if the estimates are close is a good way to validate
a hypothesis that the data source generates Poisson distributed random variables. The entropy is
more difficult to derive and does not have a simple expression.
We will revisit Poisson processes in the next chapter on continuous random variables when we
study the probabilistic properties of the time interval between incidents in a Poisson process.
2.5 Summary
Bernoulli distribution:
• X ∼ Ber(p), PX (1) = 1 − PX (0) = p.
• E[X] = p, E[X 2 ] = p and Var(X) = p(1 − p)
• Binary entropy function: H(X) = H2 (p) = −p log2 p − (1 − p) log2 (1 − p)
Binomial distribution:
• Y ∼ B(n, p), PY(k) = (n k) p^k (1 − p)^(n−k)
• Binomial coefficient: (n k) = nCk = n!/((n − k)! k!)
• The binomial distribution models the number of ones in n independent Bernoulli random variables with Bernoulli parameter p
Geometric distribution:
• Y ∼ Geom(p), PY (k) = p(1 − p)k−1 for k = 1, 2, . . .
• E[Y ] = 1/p, Var(Y ) = (1 − p)/p2 , H(Y ) = H2 (p)/p
Poisson distribution:
• Y ∼ Poisson(λ), PY(k) = (λ^k/k!) e^(−λ) for k = 0, 1, 2, . . .
• E[Y] = Var(Y) = λ
• Approximation of a binomial as a Poisson: if Z ∼ B(n, p) for a small p and a large n, then PZ(k) ≈ ((pn)^k/k!) e^(−pn), i.e., approximately Z ∼ Poisson(pn)
• The Poisson distribution models the number of incidents in a unit interval when the incidents occur at a rate of λ per unit interval
2.6 Problems
Having completed Chapter 2, you should be able to complete problems 1 to 7 in Examples Paper
5.
Chapter 3
Continuous Random Variables
In Chapter 1, we introduced the axiomatic approach but reassured you that, for many engineering
applications, it would be sufficient to think about joint distributions of random variables. A
simplified probability theory is sufficient for tackling a finite number of discrete random variables
defined over finite alphabets. For continuous random variables on the other hand, the axiomatic
approach will help us form a better intuition. Mathematicians normally study probability theory
hand in hand with measure theory, a more general framework for assigning measures (numbers)
to sets, and a probability measure is just one example of such a measure. As engineers, we need to
understand some continuous probability theory because there are topics within that theory that are of central relevance to engineering applications. We will cover those topics, but we do so without studying measure theory. We hope in the two introductory sections to give you sufficient depth of understanding for you to imagine why such a theory exists and why mathematicians require it to provide a precise framework for all possible continuous random variables.
Recall the pictorial view we developed in Chapter 1:
• an event A partitions the sample space Ω into two regions: the set A of outcomes where the event is "true", and the complementary set Ā of outcomes where the event is "false";
• a random variable partitions the sample space Ω into the subsets of outcomes corresponding to its possible values.
For continuous random variables, the picture remains the same but is more difficult to imagine
graphically. For one, the sample space itself must contain a continuum of possible outcomes in
order for random variables defined on it to be truly continuous, so we are dealing with more complex
underlying random experiments and corresponding sample spaces. The probability measure p(.)
will in general assign positive probabilities only to regions of Ω. Although single outcomes are
still technically events (e.g., the atomic event {a} containing only the outcome a of the random
experiment), the probability measure for any event containing only a finite number or countably
infinite number of outcomes will in general be zero in those sample spaces. For random variables,
this means that the event {X = x} that a random variable X takes on a single value x will in
general have probability zero. Exceptions to this are for example random variables that remain
constant in a region of Ω and hence may have positive probability for certain values. These are
called mixed discrete/continuous random variables.
Example 3.1. A vehicle’s speed can be considered a continuous random variable. What is the
probability that a vehicle should be passing a speed camera at a speed precisely equal to the
speed limit of 30.00000 . . . mph?
The speed X can take on a continuum of values and the probability of a single outcome
{X = 30.00000 . . .} is zero. We expect the event {29.9 ≤ X ≤ 30.1} on the other hand to
have non-zero probability. Note that if a car passed one speed camera A driving below 30
mph and the next speed camera B a mile down the road driving above 30 mph, then there
must have been a position between A and B where the car was driving at exactly 30 mph.
The fact that an event has zero probability does not necessarily mean that it will never occur given a suitably infinite set of random variables, such as the infinite set of random variables corresponding to all the speeds of the car over the interval [A, B].
We will in general refrain from considering events with a finite number of outcomes and hence
not consider probability distributions PX (x) = p(X = x) as we did for discrete random variables.
This distribution would in general be zero everywhere and hence be of little use. However, we can
still consider events corresponding to intervals, and in particular, use the cumulative probability
function defined in Chapter 1
FX (x) = p(X ≤ x). (3.1)
This is still well defined and satisfies the following properties:

F_X(x) ≥ 0,
lim_{x→−∞} F_X(x) = 0,
lim_{x→∞} F_X(x) = 1,    (3.2)
F_X(x) is non-decreasing in x,
where the first property follows from Kolmogorov’s first axiom because FX (x) is the probability of
a well defined event; the second and third properties follow from the fact that the limit of the sets
defined by X ≤ x as x goes to −∞ and +∞ is the empty set and the sample space Ω, respectively; and the last property follows because if x1 ≤ x2 then the event {X ≤ x1} is a subset of the
event {X ≤ x2 }, since for every outcome for which X is less than x1 , X is also less than x2 , and
we proved in Chapter 1 that A ⊆ B implies p(A) ≤ p(B) and hence p(X ≤ x1 ) ≤ p(X ≤ x2 ) or
FX (x1 ) ≤ FX (x2 ).
The probability of falling within an interval [a, b] can be expressed as a function of the cumulative probability function, i.e.,

p(a ≤ X ≤ b) = F_X(b) − F_X(a).    (3.3)
The careful reader will have noticed that we brushed over the difference between ’<’ and ’≤’ in
the expression above. This will in general make no difference as the events {X < a} and {X ≤ a}
differ in only a single point. We have seen that single outcomes or any finite number of outcomes
will in general have probability zero for continuous random experiments. This distinction can
become important in the case of mixed discrete/continuous random variables, and one would have
to carefully adjust the expression above if a were a value of X corresponding to a region of Ω with
non-zero probability.
Example 3.2. The probability that the car speed recorded by speed camera A is zero or less
(assuming that the camera records the magnitude of the speed only) is 0, hence FX (x) = 0
for all x ≤ 0. The probability that the speed is less than 1000 mph is 1 as there are no cars
that can drive that fast, hence FX (1000) = limx→∞ FX (x) = 1.
The probability that the car’s speed is between 29.9 and 30.1 mph is FX (30.1) − FX (29.9).
If some cars are equipped with speed limiters set for 29.9 that physically limit the speed
to that value no matter how much the driver accelerates, then the above expression may
no longer apply and one would need to carefully consider whether we mean the “open” or
“closed” interval between 29.9 and 30.1 and what the (now non-zero) probability of driving
exactly 29.900000 . . . mph is.
Joint cumulative probability functions are defined similarly to joint distributions for discrete
random variables
FXY (x, y) = p(X ≤ x ∩ Y ≤ y) (3.4)
and independence of continuous random variables is defined as the independence of all events associated with the random variables, e.g.,

F_XY(x, y) = F_X(x) F_Y(y)    (3.5)

for all (x, y) ∈ X × Y, where the condition “for all” specifies an infinity of conditions in the continuous case.
There are no product or sum rules for cumulative probability functions. The former is a purely formal issue: if we defined F_{Y|X}(y|x) as the probability p(Y ≤ y | X ≤ x), then there would be no problem to state a product rule and Bayes’ theorem with conditional cumulative probability functions. However, this is not normally done due to the danger of interpreting F_{Y|X}(y|x) as p(Y ≤ y | X = x), which is not defined if X is a true continuous random variable since p(X = x) = 0. As for the sum rule or marginalisation, it is not possible to write it with cumulative probabilities because, unlike the events X = x, the events X ≤ x are not mutually exclusive. They intersect, and hence it isn’t possible to make use of Kolmogorov’s Axiom 3 to state a sum rule. You will learn about a product rule and a sum rule for continuous variables in the next section.
Can we apply a similar approach to continuous random variables in order to obtain something along the lines of a probability distribution from the well defined cumulative probability function? If we try to apply the expression above to a continuous random variable for an infinitesimal interval, we obtain in general

p(x ≤ X ≤ x + δ) = F_X(x + δ) − F_X(x) → 0 as δ → 0,

which is in line with our claim for truly continuous random variables that the probability of single outcomes or values is zero. If instead we divide the expression by the size of the interval, we get the definition of a derivative, which is known as the probability density function (PDF):

f_X(x) = lim_{δ→0} (F_X(x + δ) − F_X(x))/δ = dF_X(x)/dx.
A short digression into dimensional analysis: the temptation is great to consider a prob-
ability density function as a sort of probability of a “small interval”. The reason for our
long-winded introduction of this section, rediscovering derivatives from first principles rather
than just plainly defining the PDF as a derivative, is to give you a better understanding of
the subtle difference between the PDF and the probability of a small interval. The difference
is in the division by the length of the small interval.
Say for example that X is the speed of a car in the previous examples. The unit of speed
that we used is the mile per hour (mph). Probability is a unitless quantity. The probability
p(a ≤ X ≤ b) = F_X(b) − F_X(a) is just a number without a unit. The cumulative probability function F_X(x) and the probability distribution P_X(x) of a discrete random variable are all
probabilities and hence unitless as well. The probability density function (PDF) however
is divided by ∆x. In our example, ∆x has the unit mph. Hence, the unit of fX (x) is
hpm=mph−1 . In other words, whatever the unit u of X, the PDF always has the unit u−1
and hence it is clearly not a probability, which must always be unitless.
For continuous random variables, think of fX (x)dx as the infinitesimal equivalent of a
probability, but never of fX (x) itself as this can lead to serious confusion.
Figure 3.1: Cumulative probability function (left) and probability density function (right) for two random variables, X uniform and Y triangular.
The probability density function is a very useful quantity that often takes precedence over the
cumulative probability function in people’s perception because it acts like a probability distribution
in helping us visualise the probabilistic behaviour of random variables. Figure 3.1 shows the
cumulative probability distribution and the probability density function for two random variables.
When using continuous variables, it is worth remembering the following two rules
1. fX (x)dx is the limiting “probability” that X takes values in the interval [x, x + dx], not
fX (x); and
2. when matters get difficult and challenge your intuition, always revert back to the well defined cumulative probability function, which has a “physical” meaning as the probability of an event.
A few fundamental properties that follow immediately from the definition of the probability density
function are
fX (x) ≥ 0 for all x (3.9)
following from the fact that FX (x) is non-decreasing. Also,
p(a ≤ X ≤ b) = F_X(b) − F_X(a) = ∫_a^b f_X(x) dx,    (3.10)
i.e., the probability of an interval is the integral of the PDF over the interval, and hence
∫_{−∞}^{∞} f_X(x) dx = lim_{x→∞} F_X(x) − lim_{x→−∞} F_X(x) = 1 − 0 = 1,    (3.11)
i.e., the probability density function integrates to 1. One would be tempted to conclude that the
density behaves like a probability measure since it’s non-negative and integrates to 1, but note
that the density can be larger than 1, as illustrated for example by considering the uniform density
in the first example of Figure 3.1 for a = 1/4.
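To make the point concrete, here is a small Python sketch (not from the notes) of that uniform density with a = 1/4: its value exceeds 1 everywhere on its support, yet it still integrates to 1.

from scipy.integrate import quad

a = 0.25
f = lambda x: 1.0 / (2 * a) if -a <= x <= a else 0.0   # uniform density on [-a, a]

print(f(0.0))                                 # 2.0: a density value larger than 1
area, _ = quad(f, -1.0, 1.0, points=[-a, a])
print(area)                                   # 1.0: still integrates to 1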
Joint probability density functions are obtained from the joint cumulative probability function through a multiple differentiation, e.g.,

f_XY(x, y) = ∂²F_XY(x, y)/(∂x ∂y)    (3.12)
and independent random variables satisfy

f_XY(x, y) = ∂²(F_X(x) F_Y(y))/(∂x ∂y) = f_X(x) f_Y(y).    (3.13)
In fact, this condition implies independence because there is no integration constant when going
from the PDF to the cumulative probability function, since every cumulative probability function
must begin at 0 and end at 1. Finally, we will state without proof1 that the “sum rule” becomes
an “integral rule” or, in other words, that marginalisation applies to probability density functions
f_X(x) = ∫_{−∞}^{∞} f_XY(x, y) dy.    (3.14)
The conditional probability density function is defined as

f_{Y|X}(y | x) = f_XY(x, y) / f_X(x),    (3.15)

or equivalently, via the product rule,

f_XY(x, y) = f_{Y|X}(y | x) f_X(x),    (3.16)

and combining the product rule with the sum rule yields Bayes’ theorem for densities,

f_{X|Y}(x | y) = f_{Y|X}(y | x) f_X(x) / ∫_{−∞}^{∞} f_{Y|X}(y | x′) f_X(x′) dx′.    (3.17)
So in short, despite all our warnings about not mistaking densities for probabilities, in fact the
PDF (Probability Density Function) acts very much as a PMF (Probability Mass Function) for
all practical purposes since it fulfils a sum rule and a product rule.
The probability density function can also be used to compute expectations
E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx    (3.18)
for any function g(.). In particular, we will be interested in the mean (also called “first moment”)
E[X] = ∫_{−∞}^{∞} x f_X(x) dx,    (3.19)

and the variance Var(X) = E[X²] − E[X]². Note that continuous random variables do not have an entropy in the sense defined for discrete random variables. They do have a quantity known as the “differential entropy” but we will not study it in this course.
We are now ready to study a few important PDFs and CDFs.
3.3 The exponential density

2 My personal experience with London buses leads me to believe that this is a poor example.
3 For those of you who are comfortable with infinitesimal calculus, there is an alternative derivation of the exponential density that gets the PDF directly and provides an alternative insight. We know that f_X(t)dt is the limiting probability that the next packet in a Poisson process will arrive at time t. This can be seen as the probability of the event {no packet will arrive in the interval [0, t) AND a packet will arrive in the interval [t, t + dt)} in the limit. We can write this as follows

f_X(t)dt = p(Y_t = 0 ∩ Y_dt = 1) = P_{Y_t}(0) P_{Y_dt}(1) = ((λt)⁰/0!) e^{−λt} · ((λdt)¹/1!) e^{−λdt} = λ e^{−λt} e^{−λdt} dt.    (3.28)

Since lim_{dt→0} e^{−λdt} = 1, we obtain the exponential density.
Figure 3.2: The exponential probability density function for λ = 1/2, λ = 1, and λ = 2.
The mean of an exponential density function can be calculated easily using integration by parts

E[X] = ∫_0^∞ t f_X(t) dt = [−t e^{−λt}]_0^∞ + ∫_0^∞ e^{−λt} dt    (3.29)
     = [−(1/λ) e^{−λt}]_0^∞ = 1/λ    (3.30)
and the second moment

E[X²] = ∫_0^∞ t² f_X(t) dt = [−t² e^{−λt}]_0^∞ + 2 ∫_0^∞ t e^{−λt} dt    (3.31)
      = 2E[X]/λ = 2/λ²    (3.32)
and hence the variance

Var(X) = E[X²] − E[X]² = 2/λ² − 1/λ² = 1/λ².    (3.33)
The relationship between the variance and the mean for Poisson and exponential random
variables can be exploited as a tool to verify if an arrival process, say for example the packet
arrival in our 10 Gbit/s Ethernet router, is a Poisson process:
• verify whether the mean and variance of the number of arrivals per unit interval obey E[Y] ≈ Var(Y), as should be the case for a Poisson distributed random variable, and
• verify whether the mean and variance of the time intervals between arrivals obey E[X]² ≈ Var(X), as should be the case for exponentially distributed continuous random variables (see the sketch below).
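A simulation sketch of both checks in Python (the rate λ = 2 and the sample size are illustrative choices, not from the notes):

import numpy as np

rng = np.random.default_rng(0)
lam, n = 2.0, 100_000

# Inter-arrival times of a Poisson process with rate lam are Exp(lam)
gaps = rng.exponential(scale=1.0 / lam, size=n)
print(gaps.mean() ** 2, gaps.var())       # both close to 1/lam^2 = 0.25

# Counts per unit interval should be Poisson(lam): mean and variance close to lam
arrivals = np.cumsum(gaps)
counts = np.bincount(arrivals.astype(int))[:-1]   # drop the incomplete last interval
print(counts.mean(), counts.var())        # both close to 2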
The Poisson waiting paradox: the derivation of the exponential density we gave does not
rely anywhere on the last occurrence that happened. Although we described the exponential
density as a model of the time intervals between occurrences, in fact the memoryless nature
of the process means that the time until the next occurrence is exponentially distributed with
parameter λ, no matter when we start counting time. Our derivation only required there to
be no occurrences in the interval [0, t) and one occurrence in the interval [t, t + dt). It did not
require there to have been an occurrence at or near time 0. The average waiting time until
the next occurrence is 1/λ.
Now consider our derivation in reverse, asking the question of how much time has elapsed
since the last occurrence rather than how much time remains until the next occurrence. Since
our derivation was symmetric in time, the answer would be the same: the time elapsed since
the last occurrence follows an exponential density with parameter λ. The average time elapsed
since the last occurrence is 1/λ.
This is known as the waiting time paradox:
• the average time between occurrences is 1/λ,
• the average time until the next occurrence is 1/λ,
• the average time elapsed since the last occurrence is 1/λ,
so the interval straddling any fixed observation time appears to have average length 2/λ, twice the average time between occurrences.
3.4 The Gaussian density

The Gaussian density4 models many quantities encountered in engineering, for example measurement noise. We will understand why the Gaussian density is such a good model in Section 4.4.
If Y follows a Gaussian density with parameters µ and σ², written

Y ∼ N(µ, σ²),    (3.34)

its density is

f_Y(y) = (1/√(2πσ²)) e^{−(y−µ)²/(2σ²)}.    (3.35)
4 The Gaussian density is also called the normal density. The name was introduced by the English mathematician
Karl Pearson who believed that other mathematicians than Gauss deserved to share the credit for its invention.
He later regretted the name, because it may lead people to believe that other densities are abnormal. Gauss’s first
use of the density is undisputed nowadays and all the mathematicians involved in Pearson’s attempted historical
re-engineering have had important mathematical concepts named after them (e.g., the Laplace transform). Sadly,
the name “normal” has stuck, and it is important for you to be aware of this alternative name as you may encounter it frequently.
Figure 3.3: Gaussian probability density functions for µ = −1, σ² = 1/2; µ = 0, σ² = 1; and µ = 1, σ² = 2.
The mean is E[Y ] = µ and the variance is Var(Y ) = E[Y 2 ] − E[Y ]2 = σ 2 . Figure 3.3 shows
Gaussian probability density functions for a few choices of parameters.
A zero mean, unit variance Gaussian random variable
X ∼ N(0, 1) (3.36)
is said to follow a standard Gaussian density. We will show in the next chapter that, if X is
standard Gaussian, then Y = σX + µ is Gaussian with mean µ and variance σ 2 . Conversely, for
any Gaussian random variable Y ∼ N(µ, σ 2 ), X = (Y − µ)/σ is standard Gaussian.
The cumulative probability function or CDF of a standard Gaussian, often denoted Φ(x),

F_X(x) = Φ(x) = p(X ≤ x) = ∫_{−∞}^x (1/√(2π)) e^{−x′²/2} dx′    (3.37)
cannot be expressed in closed form and is hence often tabulated or implemented as a numerical
integral. It is tabulated on Page 29 of your mathematics data book. Most programming and
computation environments such as Python, MATLAB, etc. have library functions for the so-called
“error function” (e.g. math.erf(x) and scipy.special.erf(x) in Python, erf(x) in MATLAB)
which is the probability that a Gaussian random variable with mean 0 and variance 1/2 (!!!5 ) lies
within the interval [−x, x], giving
erf(x) = (2/√π) ∫_0^x e^{−x′²} dx′,    (3.38)
so the standard Gaussian cumulative can be computed from this function as

F_X(x) = Φ(x) = 1/2 + (1/2) erf(x/√2).    (3.39)
Another often tabulated or implemented function is called the Q function, defined as

Q(x) = ∫_x^∞ (1/√(2π)) e^{−x′²/2} dx′ = 1 − Φ(x),    (3.40)
i.e., the probability that a standard Gaussian random variable exceeds the value x. The Gaussian
PDF is an even function fX (x) = fX (−x) and hence the CDF satisfies
FX (x) = Φ(x) = 1 − Φ(−x) = 1 − FX (−x). (3.41)
5 Why?? No idea. . . Computer scientists can be strange sometimes.
A few useful values to remember are, for any Y ∼ N(µ, σ²) or X ∼ N(0, 1),

p(µ − σ ≤ Y ≤ µ + σ) = F_X(1) − F_X(−1) = 2Φ(1) − 1 ≈ 2/3,
p(µ − 2σ ≤ Y ≤ µ + 2σ) = F_X(2) − F_X(−2) = 2Φ(2) − 1 ≈ 0.95,    (3.42)
so a 2σ deviation from the mean is often called the “95% confidence interval”.
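These values are easy to reproduce from (3.39) using only the Python standard library; a minimal sketch:

import math

def Phi(x):
    # Standard Gaussian CDF via the error function, Eq. (3.39)
    return 0.5 + 0.5 * math.erf(x / math.sqrt(2.0))

print(2 * Phi(1) - 1)    # ~0.6827, roughly 2/3
print(2 * Phi(2) - 1)    # ~0.9545, roughly 0.95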
Example 3.3. As captain of your football team, you have the difficult task of picking a player
to shoot the penalty your team has just been awarded. You carefully consider the binary
sequence of n past penalty try-outs for player A (’1’ for score, ’0’ for fail) and would like an
estimate of the probability that player A will score the next penalty.
The sequence of past penalty try-outs can be considered as a sequence of independent
Bernoulli distributed random variables with parameter p (we ignore the “learning effect” or
“anxiety effect” that may introduce some dependency in the data) and hence the number of
scores (’1’s) Y follows a binomial distribution Y ∼ B(n, p).
The unknown quantity in this case is the parameter of the Bernoulli distribution. It is a
continuous random variable limited to a finite range [0, 1]. We shall call it π to distinguish
it from a fixed parameter p, so we can write for example Fπ (p) for the probability p(π ≤ p).
Hence the number of scores is now Y ∼ B(n, π) where we see for the first time a random
variable π as a parameter for the distribution of another random variable Y .
Most people, if given a count of Y scores in n trials, would be tempted to conclude that
p = Y /n. There are obvious problems with this approach:
• It yields a fixed value for the parameter without accounting for our uncertainty. Prob-
ability theory is the mathematical theory of uncertainty and hence one should expect a
probability density for π rather than a fixed value p.
• If n = 0 (no data has been seen yet), the estimator above is p = 0/0 and hence undefined.
• If Y = 1 and n = 1 (only one 1 seen so far and nothing else) the estimator above yields
p = 1 or a probability of 1 to see another 1. This does not seem like a reasonable
estimate given that we’ve seen only one data point so far.
So how can we correctly make statements about π given an observed count of Y scores in n
trials? We will return to this question after introducing the Beta density.
3.5 The Beta density

We will say that a random variable π follows a Beta density with parameters α, β, i.e.,

π ∼ Beta(α, β)    (3.43)

f_π(p) = (Γ(α+β)/(Γ(α)Γ(β))) p^{α−1} (1 − p)^{β−1}, with 0 ≤ p ≤ 1 and α, β > 0.    (3.44)
Figure 3.4: Beta probability density functions for (α, β) = (1, 1), (2, 5), and (2, 2).
The Gamma function Γ(x) is an extension of the factorial function to complex and real numbers and is defined via the integral

Γ(x) = ∫_0^∞ y^{x−1} e^{−y} dy    (3.45)

for which no closed-form expression exists. For integer arguments n, it is equivalent to the factorial: Γ(n) = (n − 1)!.
Example 3.4. We return to our football example. We begin by selecting a prior for the
unknown parameter π determining the probability that player A scores a goal at each penalty
attempted. This prior is a probability density function for π given that no data has been
collected (n = 0). At this point, it would be acceptable to choose any prior that reflects
the captain’s experience, belief, gut feeling, or intuition about player A’s abilities. Let us
assume that the captain is inexperienced or would prefer not to exert any prejudice on player
A’s ability to score a penalty before seeing her in action. The captain assumes a uniform
probability density function fπ (p) = 1 for p ∈ [0, 1], or π ∼ Beta(1, 1).
Having observed Y scores in n trials for player A, the captain calculates the conditional
probability6
P_{Y|π}(k | p) = \binom{n}{k} p^k (1 − p)^{n−k}    (3.49)
and then applies Bayes’ theorem to obtain the posterior density

f_{π|Y}(p | k) = P_{Y|π}(k | p) f_π(p) / P_Y(k).    (3.50)
For a uniform prior, f_π(p) = 1, it is easy to guess (or verify by a long process of integration by parts that you can do as an exercise if you wish) that P_Y(k) = 1/(n + 1) for k = 0, 1, . . . , n.
Hence we get

f_{π|Y}(p | k) = (n + 1) \binom{n}{k} p^k (1 − p)^{n−k}    (3.51)
which is the Beta density Beta(k + 1, n − k + 1). The expected value of π given the data is

E[π | Y = k] = (k + 1)/(n + 2).    (3.52)
In other words, if you’ve seen player A score four goals in five trials, you should estimate the
probability density function of the Bernoulli coefficient to be π ∼ Beta(5, 2). If you want the
expectation of the Bernoulli coefficient, it would be 5/7, which is what most people would
guess was the probability of scoring five goals (one more than scored) in seven penalty trials
(two more than attempted.) This expected value estimator with a uniform prior is called
the Laplace estimator after the French mathematician Pierre-Simon Laplace (1749-1827) and
has the advantage that it gives reasonable answers for the disturbing cases mentioned in the
previous examples box:
• when you haven’t seen any trials, the expected value of the Bernoulli coefficient is
(0 + 1)/(0 + 2) = 1/2;
• when you’ve seen only one trial and it was a success, the expected value of the Bernoulli
coefficient is (1 + 1)/(1 + 2) = 2/3.
The method above is an example of the Bayesian approach to probabilities and statistics. The
football team captain could improve the estimate by using a prior more representative of their
intuition, belief or experience, and hence become more able to take decisions effectively when new
players join the team.
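A sketch of the whole update in Python with scipy (the counts are the ones from the example; the 0.5 threshold is an arbitrary illustration):

from scipy.stats import beta

n, k = 5, 4                                # five penalties observed, four scored

# Uniform prior Beta(1, 1) gives the posterior Beta(k + 1, n - k + 1)
posterior = beta(k + 1, n - k + 1)
print(posterior.mean())                    # Laplace estimator (k + 1)/(n + 2) = 5/7

# The posterior also answers richer questions, e.g. p(pi > 0.5 | data)
print(1 - posterior.cdf(0.5))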
3.6 Summary
Fundamentals:
• Cumulative Probability Distribution (CDF): FX (x) = p(X ≤ x)
• Properties: FX (x) ≥ 0, non-decreasing function of x from 0 at −∞ to 1 at +∞
6 We are stretching our definitions here to conditional probabilities of discrete random variables conditioned on continuous random variables, with a word of caution that such constructs work for most well-behaved engineering examples but would arouse serious suspicion among mathematicians.
Probability density:
• Probability Density Function (PDF): f_X(x) = dF_X(x)/dx = F′_X(x)
• Properties: f_X(x) ≥ 0 for all x, p(a ≤ X ≤ b) = ∫_a^b f_X(x) dx, and ∫_{−∞}^{∞} f_X(x) dx = 1.
• Independence: f_XY(x, y) = f_X(x) f_Y(y) for all (x, y) (necessary and sufficient condition)
• Sum rule: f_X(x) = ∫_{−∞}^{∞} f_XY(x, y) dy
• Product rule: f_XY(x, y) = f_{Y|X}(y | x) f_X(x) (or definition of conditional PDF)
• Bayes’ theorem: f_{X|Y}(x | y) = f_{Y|X}(y | x) f_X(x) / ∫_{−∞}^{∞} f_{Y|X}(y | x′) f_X(x′) dx′
• Expectation: E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx, n-th moment E[X^n].
Exponential Density:
• X ∼ Exp(λ)
• f_X(t) = λ e^{−λt} for t ≥ 0
• E[X] = 1/λ, E[X²] = 2/λ², Var(X) = 1/λ²
Beta density:
• π ∼ Beta(α, β) (π is the name of the random variable, not the number 3.1415 . . .)
• f_π(p) = (Γ(α+β)/(Γ(α)Γ(β))) p^{α−1} (1 − p)^{β−1}, with 0 ≤ p ≤ 1 and α, β > 0
• Γ(x) = ∫_0^∞ y^{x−1} e^{−y} dy and for integers n, Γ(n) = (n − 1)!
• E[π] = α/(α + β)
• The Beta distribution models the density of a Bernoulli parameter after having observed
α − 1 ones and β − 1 zeros, assuming a uniform prior
3.7 Problems
Having completed this chapter, you should be able to complete Examples Paper 5 and start
working on Examples Paper 6.
Chapter 4
Manipulating and Combining Distributions
Having learned about discrete and continuous random variables, we are now ready to start ma-
nipulating them. This is where probability theory really gets interesting.

4.1 Functions of random variables

Consider a random variable Y = g(X) obtained by applying a function g to a random variable X with set of values X. If X is discrete and g is one-to-one, Y simply takes on the value y = g(x) with probability P_Y(y) = P_X(x); if g maps several values of X to the same value y, the probabilities accumulate,

P_Y(y) = Σ_{x ∈ X_y} P_X(x),    (4.1)

where X_y is the set of all x ∈ X such that g(x) = y. Of course, the expression just stated for the second case in fact describes both cases: the set X_y simply contains only one element in the first case.
Example 4.1. Let X be the value of a random dice throw (equal to the outcome). The set of
values is X = {1, 2, 3, 4, 5, 6}.
• Let Y = f_1(X) = X². In this case, all values of X are mapped to distinct values in Y and hence P_Y(y) = P_Y(x²) = P_X(x) = P_X(√y) for all y ∈ Y.
• Let Z = f2 (X) be 0 if X is even and 1 if X is odd. In this case,
P_Z(0) = P_X(2) + P_X(4) + P_X(6) = 3/6 = 1/2.    (4.2)
Figure 4.1 illustrates the two example functions and how they map X to Y and Z. In this
type of graphical representation of functions, when there are “collisions” (values mapping to
the same image), the probabilities accumulate.
Figure 4.1: Y has the same probabilities as X, whereas for Z the probabilities sum to (1/2, 1/2).
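The accumulation of probabilities is mechanical enough to automate; a minimal Python sketch (the helper push_forward is hypothetical, not from the notes) reproducing both mappings of Example 4.1:

from collections import defaultdict

P_X = {x: 1 / 6 for x in range(1, 7)}      # fair die

def push_forward(P, g):
    # Distribution of Y = g(X): accumulate P_X(x) over all x with g(x) = y
    P_Y = defaultdict(float)
    for x, p in P.items():
        P_Y[g(x)] += p
    return dict(P_Y)

print(push_forward(P_X, lambda x: x * x))  # all images distinct: same probabilities
print(push_forward(P_X, lambda x: x % 2))  # parity: {0: 0.5, 1: 0.5} as in (4.2)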
Although the theory for discrete random variables is very simple, it can already lead to inter-
esting problems, as the following examples show.
Example 4.2. Let X have a binomial distribution X ∼ B(99, p). Let Y be a summary of X
where the 100 possible values of X are grouped into 10 bins, i.e., Y = 1 if 0 ≤ X ≤ 9, Y = 2 if
10 ≤ X ≤ 19, etc. Picture for example Y representing a 10-bin “histogram” for X, or giving
a dart player a mark Y between 1 to 10 according to the number of times they hit bull’s eye
in 100 shots. What is the probability distribution of Y ?
The answer

P_Y(y) = Σ_{m=10(y−1)}^{10(y−1)+9} \binom{99}{m} p^m (1 − p)^{99−m}    (4.3)

cannot be brought into a form that can be calculated easily, other than evaluating the sums using an approximation for the binomial coefficients.
Example 4.3. Let X have a geometric distribution PX (k) = p(1 − p)k−1 for k = 1, 2, . . .. Let
Y = (X − 1) mod m, i.e., the value of X − 1, modulo some given number m. What is the
probability distribution of Y ?
X and Y could occur for example in the following scenario. Consider a ripple or syn-
chronous counter that you’ve studied in 1A Paper 3 Digital Circuits last year, that counts
from 0 to m − 1 and then starts again at zero. Say the clock used for the counter has a
probability p of failure at each count. The number of clock cycles X until failure follows a
geometric distribution. We are interested in the probability distribution of the clock state Y
when the clock fails, which is (X − 1) mod m.
Observe that Y will take on the value 0 if X is 1, m + 1, 2m + 1, 3m + 1, etc.

Figure 4.2: Distribution of the time X until failure (green) and clock state Y at failure (red) when the clock operates modulo m = 5.
The resulting distribution does have a closed form obtained via the sum of geometric sequences

P_Y(y) = Σ_{r=0}^{∞} p(1 − p)^{rm+y} = p(1 − p)^y / (1 − (1 − p)^m)    (4.4)
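The closed form (4.4) can be checked against a brute-force truncated version of the infinite sum; a Python sketch with illustrative values of p and m:

p, m = 0.2, 5

def P_Y_closed(y):
    # Eq. (4.4)
    return p * (1 - p) ** y / (1 - (1 - p) ** m)

def P_Y_direct(y, terms=1000):
    # Sum P_X(rm + y + 1) = p (1 - p)^(rm + y) over r = 0, 1, 2, ...
    return sum(p * (1 - p) ** (r * m + y) for r in range(terms))

for y in range(m):
    print(y, P_Y_closed(y), P_Y_direct(y))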
Figure 4.3: The derivative of g and g⁻¹ seen as coordinates of the tangent vector.
Consider first functions g that are one-to-one and increasing everywhere on X (allowing g′(x) = 0 in single points at most), for example:
• g(x) = x² for x ≥ 0
• g(x) = eˣ
Since the function does not decrease anywhere in X, we conclude that the cumulative probability function of the random variable Y = g(X) is

F_Y(y) = p(g(X) ≤ y) = p(X ≤ g⁻¹(y)) = F_X(x(y)),    (4.5)

where x(y) = g⁻¹(y) is the only x such that g(x) = y. We obtain the probability density function
by differentiating, using the chain rule,
f_Y(y) = d/dy F_Y(y) = d/dy F_X(g⁻¹(y)) = f_X(g⁻¹(y)) · 1/g′(g⁻¹(y)).    (4.6)
The inverse of the derivative on the right is the derivative of g −1 (y), which you can either see as
a consequence of
dF_Y/dy = dF_X(x)/dy = (dF_X/dx)(dx/dy) = (dF_X/dx)(dy/dx)⁻¹ = (dF_X/dx) · 1/g′(x)    (4.7)
or by picturing the tangent vector at (x, y) that can be expressed either as t1 = (1, g 0 (x)), or as
t2 = t1 /g 0 (x) = (1/g 0 (x), 1). The first expression t1 is the one that matters in the coordinate
system (x, y) relevant to the function g, whereas the second expression t2 is the one that matters
in the reverse coordinate system (y, x) relevant to the function g −1 that maps y to x. This is
illustrated in Figure 4.3.
Example 4.4. Let X be uniform between 0 and 1 with density f_X(x) = 1 for x ∈ [0, 1], and let Y = g(X) = X². By integration, we have

F_X(x) = x, x ∈ [0, 1].    (4.9)
Figure 4.4: The density f_Y(y) for Y = X² and X uniform between 0 and 1.
Hence, with g⁻¹(y) = √y,

F_Y(y) = F_X(√y) = √y, y ∈ [0, 1],    (4.10)

and

f_Y(y) = f_X(g⁻¹(y)) · 1/g′(g⁻¹(y)) = 1/(2√y).    (4.11)
The density is illustrated in Figure 4.4.
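A quick empirical check of (4.11), sketched in Python: squaring uniform samples and comparing the histogram with 1/(2√y).

import numpy as np

rng = np.random.default_rng(1)
y = rng.random(1_000_000) ** 2             # Y = X^2 with X uniform on [0, 1]

hist, edges = np.histogram(y, bins=10, range=(0.0, 1.0), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
for c, h in zip(centres, hist):
    print(f"y={c:.2f}  empirical={h:.3f}  theory={1 / (2 * np.sqrt(c)):.3f}")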
Now that we have obtained expressions for FY (y) and for fY (y) for functions Y = g(X) such
that g 0 (x) ≥ 0 for all x ∈ X , let us consider the case g 0 (x) ≤ 0 for all x ∈ X (where again, we
assume that g 0 (x) is zero at most in single points so that the function is not constant on any
interval). These are one-to-one functions that do not increase anywhere and hence our first step
would have differed from the previous case in that

F_Y(y) = p(g(X) ≤ y) = p(X ≥ g⁻¹(y)) = 1 − F_X(x),    (4.12)

where x = g⁻¹(y). The differentiation to obtain the probability density function would have an
added minus sign, which is just as well since dx/dy ≤ 0 so our original expression without the
minus sign would have resulted in a negative density.
We can combine the two density expressions for g 0 (x) ≥ 0 or g 0 (x) ≤ 0 everywhere in x ∈ X
into a single expression
f_Y(y) = f_X(x) |dx/dy| = f_X(g⁻¹(y)) · 1/|g′(g⁻¹(y))|    (4.13)
that is valid for all functions Y = g(X) that are either increasing or decreasing everywhere in
X or zero in single points at most. It is possible to extend this expression to functions that
are increasing and decreasing in well-defined intervals, giving you the ability to tackle even quite
complex problems such as the one in the following example.
Example 4.5. The following example is fairly complex and involves a function that is nei-
ther decreasing everywhere nor increasing everywhere. Consider a microprocessor that resets (syncs) an oscillator every time an interrupt occurs. The time X until the next interrupt is an exponential random variable. We wish to know the probability density function of the oscillator state sin X when the next interrupt occurs.

Figure 4.5: For Y = sin X, the union of the non-shaded segments is the event {Y ≤ y}.
In other words, X is an exponential random variable with parameter λ, i.e., f_X(x) = λe^{−λx} for x ≥ 0, and Y = sin X. Can we determine F_Y(y) and f_Y(y)? Since sin(x) is neither increasing nor
decreasing everywhere, we cannot simply apply Equation 4.13. It is easiest to tackle this
problem via the cumulative probability function.
The cumulative function FY (y) = p(Y ≤ y) is the probability of the event {Y ≤ y}
that sin X takes on values less than or equal to y. Since sin x is invertible over the interval
[−π/2, π/2], we use the convention that sin−1 y is the value x in the interval [−π/2, π/2] for
which sin x = y. The event {Y ≤ y} is best visualised on a graph as in Figure 4.5 where
the non-shaded region of the circle and of the sinusoid corresponds to the region for which
sin X ≤ y. From examining this region, we conclude that, for x = sin−1 y,
F_Y(y) = p(Y ≤ y)
       = p(X ≤ x) + Σ_{k=1}^{∞} p((2k − 1)π − x ≤ X ≤ 2kπ + x)
       = p(X ≤ x) + Σ_{k=1}^{∞} [F_X(2kπ + x) − F_X((2k − 1)π − x)]
       = p(X ≤ x) + Σ_{k=1}^{∞} [1 − e^{−λx−2λkπ} − 1 + e^{λx+λπ−2λkπ}]
       = p(X ≤ x) + (e^{λx} e^{λπ} − e^{−λx}) Σ_{k=1}^{∞} e^{−2λkπ}
       = p(X ≤ sin⁻¹ y) + (e^{λ(sin⁻¹ y + π)} − e^{−λ sin⁻¹ y}) · e^{−2λπ}/(1 − e^{−2λπ})
where we have used the “sum of geometric sequence” formula from your Mathematics Data
Book. The first term is zero for negative2 y ∈ [−1, 0) and p(X ≤ sin−1 y) = FX (sin−1 y) =
4.1. FUNCTIONS OF RANDOM VARIABLES 53
−1
1 − e−λ sin y for non-negative y ∈ [0, 1]. Since FX (0) = 0, the cumulative probability FY (y)
remains continuous at y = 0, but its derivative is discontinuous and we will not be able to
derive an expression for f_Y(0). Hence, for negative y ∈ [−1, 0),

f_Y(y) = (dF_Y/dx)(dx/dy) = (λ/√(1 − y²)) · (e^{λ(sin⁻¹ y + π)} + e^{−λ sin⁻¹ y})/(e^{2λπ} − 1),    (4.15)

and for non-negative y ∈ (0, 1],

f_Y(y) = (λ/√(1 − y²)) · (e^{−λ sin⁻¹ y} + (e^{λ(sin⁻¹ y + π)} + e^{−λ sin⁻¹ y})/(e^{2λπ} − 1)).    (4.16)
Note how these expressions could have been obtained directly from fX (x) by using the rules
we introduced for “increasing” and “decreasing” functions respectively for all the points x such
that g(x) = y, then summing all the resulting terms. The cumulative probability function
FY (y) and the density fY (y) are plotted in Figure 4.6 for λ = 1. It is fairly surprising that
such a seemingly simple problem yields a discontinuous probability density function for Y , and
one that is so blatantly asymmetric between positive and negative values. Can you explain
why this is the case?
Figure 4.6: The cumulative probability function F_Y(y) (left) and the density f_Y(y) (right) for Y = sin X with λ = 1.
Before we move on to other manipulations of random variables, we will consider a very simple
function that you will find very useful in practice, e.g., when solving the examples papers. Let
X be any random variable with mean µ = E[X] and variance σ 2 = Var(X), where σ ≥ 0. Now
consider the random variable
Y = (X − µ)/σ.    (4.17)
The function g(x) = (x − µ)/σ is increasing everywhere and g −1 (y) = σy + µ, so we can use the
expressions we derived for increasing functions to state directly that
f_Y(y) = f_X(g⁻¹(y)) dx/dy = σ f_X(σy + µ).    (4.18)
2 The notation [a, b) specifies an interval closed on the left (including a) and open on the right (excluding b).
The mean of Y is then

E[Y] = ∫_{−∞}^{∞} y f_Y(y) dy = ∫_{−∞}^{∞} y σ f_X(σy + µ) dy = ∫_{−∞}^{∞} ((x − µ)/σ) f_X(x) dx = (E[X] − µ)/σ = 0,

where we used a variable substitution y = g(x) in the integration (noting that dx = σ dy), and similarly,
E[Y²] = ∫_{−∞}^{∞} y² f_Y(y) dy = ∫_{−∞}^{∞} y² σ f_X(σy + µ) dy    (4.22)
      = ∫_{−∞}^{∞} ((x − µ)²/σ²) f_X(x) dx = (1/σ²) ∫_{−∞}^{∞} (x − µ)² f_X(x) dx    (4.23)
      = E[(X − µ)²]/σ² = Var(X)/σ² = 1.    (4.24)
Furthermore, Var(Y ) = E[Y 2 ]−E[Y ]2 = 1−02 = 1. We conclude that Y is a random variable with
mean 0 and variance 1. This is true in particular for a Gaussian random variable X ∼ N (µ, σ 2 ),
as already stated when we introduced the Gaussian distribution. This implies for example that if
you need to evaluate the cumulative distribution FY (y) of a variable Y ∼ N(µ, σ 2 ) at y, you can
read out the value Φ(x) for x = (y − µ)/σ in the table on the last page of your Mathematics Data
Book. In 2P6 Communications later this term, your lecturer will make use of this fact in Handout
4 to compute the probability of error in the presence of Gaussian noise, obtaining expressions of
the form Q(A/σ) = 1 − Φ(A/σ) for an error threshold A and zero-mean Gaussian noise of variance
σ 2 . It is easy to show that, conversely, if you take any random variable Y with mean zero and
variance 1, the random variable X = aY + b has mean b and variance a2 .
4.2 Sums of random variables

Consider the sum S = X + Y of two random variables X and Y. Since expectation is a linear operator,

E[S] = E[X + Y] = E[X] + E[Y].    (4.28)

Hence the expectation of a sum S is the sum of the expectations. Note that we have not used independence anywhere in this derivation and hence this statement holds true for any random variables X and Y, dependent or independent. We now look at variances, again making no assumption about independence, and write

Var(X + Y) = E[(X + Y)²] − E[X + Y]²
           = E[X²] + 2E[XY] + E[Y²] − (E[X] + E[Y])²
           = Var(X) + Var(Y) + 2(E[XY] − E[X]E[Y]),
where we used (4.28) in every step of the derivation. The last term in the expression is known as
the covariance
Cov(X, Y ) = E[XY ] − E[X]E[Y ] (4.33)
so the expression can be summarised as
It is easy to see that the covariance is zero when X and Y are independent, by verifying that
E[XY ] = E[X]E[Y ] for independent random variables, which was shown in (1.49) for discrete
random variables and can easily be shown similarly for continuous random variables, so we conclude
that
Var(X + Y ) = Var(X) + Var(Y ) when X and Y are independent, (4.35)
i.e., the variance of a sum of independent random variables is the sum of the variances.
Another measure related to the covariance is the correlation coefficient

ρ = Cov(X, Y)/√(Var(X) Var(Y))    (4.36)
which satisfies −1 ≤ ρ ≤ 1. When ρ > 0 the variables are called correlated and when ρ < 0 they are anti-correlated. When ρ = 0 the variables are called uncorrelated. This is often taken as a “poor man’s independence” because it is easier to verify experimentally than full independence. It is important to understand that independent random variables are always uncorrelated, but uncorrelated random variables can be dependent. The condition for variables to be uncorrelated is a lot weaker than the condition for them to be independent.
We had already observed these properties of mean and variance for discrete random variables
and used them to compute the expectation and variance of the binomial distribution that was
introduced as the probability distribution of a sum of independent Bernoulli random variables
with parameter p, but we now know that they apply to all random variables. Having sorted out
means and variances, we are ready to take on the more difficult problem of determining the full
probability distribution or density function of S given those of X and Y . For this, we will handle
the discrete and continuous cases separately.
For a sum S = X + Y of two discrete random variables, the distribution is obtained by summing over all pairs of values adding up to s,

P_S(s) = Σ_{(x,y): x+y=s} P_XY(x, y) = Σ_{(x,y): x+y=s} P_X(x) P_Y(y),

where the last step follows if X and Y are independent. At this point, it is worth pausing to reflect
about what we mean by ’+’ when we say S = X + Y . I hear you worry that I am about to embark
on a primary school revision exercise, re-deriving addition from first principles. However, there are
many applications in computer science, communications and information theory where we don’t
mean integer addition when we write ’+’. You have already encountered the “modulo” addition
in an example earlier in this chapter, and computer, electronic, or communications engineers may
well mean “modulo n” addition when they write “X + Y ”. You also learned about alternative
addition rules in your IA Paper 4 Computing course and in the IA Paper 3 Digital Circuits and
Information Processes.
For now, we will restrict ourselves to plain integer addition. In order to further develop the
expression above, we also need to specify that the sets X and Y of possible values of X and Y
consist of a range of consecutive integers, possibly infinite and possibly including negative numbers,
and any sum we write on elements of the set is taken in order from the smallest to the largest
integer. If so, we can continue
P_S(s) = Σ_{(x,y): x+y=s} P_XY(x, y) = Σ_{x∈X} P_XY(x, s − x)    (4.40)
       = Σ_{x∈X} P_X(x) P_Y(s − x)    (4.41)
where the condition about consecutive integers ensures in the last line that s − x is a valid value
in Y for the random variable Y . Pause a while to reflect on (4.41). We will return to it in the
next section after taking a first look at the sum of continuous random variables.
Example 4.6. Let X and Y be random variables indicating the value of each dice in a two-
fair-dice throwing experiment, and S = X + Y . We have
P_S(2) = P_X(1)P_Y(1) = 1/36
P_S(3) = P_X(1)P_Y(2) + P_X(2)P_Y(1) = 2/36
P_S(4) = P_X(1)P_Y(3) + P_X(2)P_Y(2) + P_X(3)P_Y(1) = 3/36
P_S(5) = P_X(1)P_Y(4) + P_X(2)P_Y(3) + P_X(3)P_Y(2) + P_X(4)P_Y(1) = 4/36
P_S(6) = P_X(1)P_Y(5) + P_X(2)P_Y(4) + P_X(3)P_Y(3) + P_X(4)P_Y(2) + P_X(5)P_Y(1) = 5/36
P_S(7) = P_X(1)P_Y(6) + P_X(2)P_Y(5) + P_X(3)P_Y(4) + P_X(4)P_Y(3) + P_X(5)P_Y(2) + P_X(6)P_Y(1) = 6/36
P_S(8) = P_X(2)P_Y(6) + P_X(3)P_Y(5) + P_X(4)P_Y(4) + P_X(5)P_Y(3) + P_X(6)P_Y(2) = 5/36
P_S(9) = P_X(3)P_Y(6) + P_X(4)P_Y(5) + P_X(5)P_Y(4) + P_X(6)P_Y(3) = 4/36
P_S(10) = P_X(4)P_Y(6) + P_X(5)P_Y(5) + P_X(6)P_Y(4) = 3/36
P_S(11) = P_X(5)P_Y(6) + P_X(6)P_Y(5) = 2/36
P_S(12) = P_X(6)P_Y(6) = 1/36.    (4.42)
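Since (4.41) is a discrete convolution, the whole table above can be produced in one line of numpy; a sketch:

import numpy as np

die = np.ones(6) / 6                       # P_X = P_Y over the values 1..6
P_S = np.convolve(die, die)                # distribution of S = X + Y over 2..12

for s, prob in enumerate(P_S, start=2):
    print(s, round(prob * 36))             # 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1 (over 36)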
We now consider the sum S = X + Y of two continuous random variables. Addition means plain addition of real numbers in the continuous case and there is no ambiguity about the nature of the sum as there was in the discrete case.
Recall that fX (x) dx is the infinitesimal probability of being in the interval [x, x + dx). Hence,
f_S(s) ds = ∫_{(x,y): x+y=s} f_X(x) dx f_Y(y) dy    (4.43)
          = (∫_{−∞}^{∞} f_X(x) f_Y(s − x) dx) ds    (4.44)

where we have used a change of variable s = y + x to obtain the second expression, and noted that, s being a constant in the first expression, no integration results over the variable s. This leads to

f_S(s) = ∫_{−∞}^{∞} f_X(x) f_Y(s − x) dx.    (4.45)
−∞
Figure 4.7: The densities f_X = f_Y (left) and the triangular density f_S of S = X + Y (right).
Example 4.7. Let X and Y be independent and uniform between 0 and 1, i.e.,
f_X(x) = f_Y(x) = { 1 for x ∈ [0, 1], 0 otherwise }    (4.46)
and S = X + Y .
We use the expression above to get

f_S(s) = ∫_{−∞}^{∞} f_X(x) f_Y(s − x) dx = ∫_0^1 f_Y(s − x) dx = { ∫_0^s 1 dx = s for s ∈ [0, 1]; ∫_{s−1}^1 1 dx = 2 − s for s ∈ [1, 2] }    (4.47)
giving a triangular distribution as illustrated in Figure 4.7.
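The same result can be reproduced numerically by discretising the convolution integral (4.45); a sketch, where the grid spacing dx is an arbitrary choice:

import numpy as np

dx = 0.001
f = np.ones(1000)                          # uniform density on [0, 1], sampled at spacing dx

f_S = np.convolve(f, f) * dx               # discretised version of Eq. (4.45)
for probe in (0.5, 1.0, 1.5):
    print(probe, f_S[int(round(probe / dx))])   # ~0.5, ~1.0, ~0.5: the triangle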
The careful reader will have identified the expression obtained for the sum of two continuous random variables as a convolution f_X ⋆ f_Y of the probability density functions f_X and f_Y, as learned in IA Paper 4. The expression Σ_x P_X(x) P_Y(s − x) in (4.41) for discrete random variables also appears to be a sort of discrete convolution. You may not have encountered a discrete
convolution in your studies in Part I so far, but it is in fact very common, for example when
studying the discrete (“digital”) equivalent of the linear time-invariant systems, and you may
see many examples of this in Part II. It should therefore come as no surprise that we will use
transforms of distributions and densities in the next section to tackle convolutions efficiently.
4.3 Transforms of distributions

A generic transform of a sequence u_1, u_2, . . . can be written as

U(z) = Σ_n u_n z^n,    (4.48)

where “z” is the variable in the “transform domain”. If we multiply the transforms of two sequences u_1, u_2, . . . and v_1, v_2, . . . with each other, we obtain
U(z)V(z) = (Σ_m u_m z^m)(Σ_n v_n z^n)    (4.49)
         = Σ_m Σ_n u_m v_n z^{m+n}    (4.50)
         = Σ_k (Σ_m u_m v_{k−m}) z^k    (4.51)
where we used the substitution k = m + n in the last step. We see that the result is the transform
of the convolution (u1 , u2 , . . .) ? (v1 , v2 , . . .). The transforms you’ve learned or have yet to learn
may have different names for the transform domain variable “z”, but they all have the convolution
property as a consequence of the simple derivation above. Table 4.1 provides a list of common
discrete transforms, most of which are in your information and mathematics data books3 . The
last entry in the table is the transform we will be using for discrete probability distributions:
3 Note that we are only providing this table to help you place the new transforms we introduce within the rich world of transforms that share the “convolution property.” You are not expected to learn or understand all the transforms for this course and only need to learn the PGF and MGF.
the probability generating function (PGF) of a discrete random variable X,

g_X(z) = E[z^X] = Σ_{x∈X} P_X(x) z^x.

Table 4.2: Probability Generating Functions (PGF) for various discrete distributions
Note the expectation notation that piggybacks the definition of the PGF on the definition of
the expectation. It is often the preferred expression in textbooks, although it fails to emphasise
the transform nature of the PGF, which is its main feature in my view. A table of discrete
probability distributions and their PGFs is given in Table 4.2. A similar table is also available in
your mathematics data book on page 27. All of these expressions are easy to derive using power
series and the sum of geometric series from your data book, and you are encouraged to do so as
an exercise.
The PGF of the binomial deserves a special mention. We have introduced the binomial as the
probability distribution of the sum Y = X1 + X2 + . . . + Xn of n independent Bernoulli random
variables with parameter p. Hence the probability distribution of Y results from the convolution
of the Ber(p) distribution n times with itself, which is tedious to evaluate or write down as a multiple summation. We introduced the PGF as a means to evaluate convolutions efficiently as multiplications in the transform domain, and hence we see that, given that the Bernoulli PGF is g_X(z) = 1 − p + pz, the PGF of the binomial distribution is g_Y(z) = (g_X(z))^n = (1 − p + pz)^n.
It is interesting to note that the binomial and Poisson distributions are both closed under
addition: consider a sum S = X + Y of two random variables X and Y . If X and Y are binomial
X ∼ B(n1 , p) and Y ∼ B(n2 , p) with the same parameter p, then
g_S(z) = g_X(z) g_Y(z) = (1 − p + pz)^{n_1} (1 − p + pz)^{n_2} = (1 − p + pz)^{n_1+n_2}    (4.56)
so S is binomial S ∼ B(n_1 + n_2, p). This is not surprising given that S is the sum of n_1 + n_2 independent Bernoulli trials. If X and Y are Poisson X ∼ Po(λ_1) and Y ∼ Po(λ_2) with PGFs g_X(z) = e^{λ_1(z−1)} and g_Y(z) = e^{λ_2(z−1)}, then

g_S(z) = g_X(z) g_Y(z) = e^{λ_1(z−1)} e^{λ_2(z−1)} = e^{(λ_1+λ_2)(z−1)}    (4.58)
which implies that S is a Poisson distributed random variable S ∼Po(λ1 +λ2 ). Again not surprising
if you consider two parallel Poisson processes with rates λ1 and λ2 of occurrence, and you then
mix the two processes so that all occurrences of one or the other process count equally. This is
clearly then a Poisson process with rate λ1 + λ2 occurrences per time interval.
A property of the PGF that is also stated in your databook is the following

g′_X(1) = Σ_{x∈X} x P_X(x) z^{x−1} |_{z=1} = E[X]
g″_X(1) = Σ_{x∈X} x(x−1) P_X(x) z^{x−2} |_{z=1} = E[X²] − E[X]    (4.59)
g_X^{(k)}(1) = E[X(X−1)(X−2) · · · (X−k+1)]
Although this ability to express moments from its derivatives is described in many textbooks as
the “raison d’être” of the PGF, it is in fact a trivial consequence of the definition of the PGF and
rarely useful. It is easy to compute the first and second moments of a sum of independent random
variables without the PGF by using the properties described at the beginning of this section. One
is rarely interested in higher moments of a distribution.
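Still, the property is trivial to verify symbolically; a sketch with sympy (assumed available) for the Poisson PGF e^{λ(z−1)}:

import sympy as sp

z, lam = sp.symbols('z lambda', positive=True)
g = sp.exp(lam * (z - 1))                  # Poisson PGF

EX = sp.diff(g, z).subs(z, 1)              # g'(1) = E[X]
EX2 = sp.diff(g, z, 2).subs(z, 1) + EX     # g''(1) = E[X^2] - E[X]
print(EX)                                  # lambda: the mean
print(sp.simplify(EX2 - EX**2))            # lambda: the variance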
We now turn to transforms of continuous functions. A generic transform of a function g(x) can be written as

G(s) = ∫ g(x) e^{sx} dx.    (4.60)

Note that we used the exponential function rather than z^x as we did in the discrete case, because z^x is sometimes ill defined for continuous x. We’ve left the integration limits unspecified in our generic
transform but each transform will have specific limits (as was the case for discrete transforms)
and these can be infinite. Note that we called our generic “transform variable” s instead of z as in
the discrete case simply to follow convention. Again, it is easy to show the convolution property
for two functions g(x) and f(x),
G(s)F(s) = (∫ g(x) e^{sx} dx)(∫ f(y) e^{sy} dy)    (4.61)
         = ∫∫ g(x) f(y) e^{s(x+y)} dx dy    (4.62)
         = ∫ (∫ g(x) f(z − x) dx) e^{sz} dz    (4.63)
where we have applied the change of variable z = x + y and note that the integration limits
may need adapting as a result (but these are infinite in many cases). As in the discrete case,
the transforms you learned may have different names for the transform domain variable “s”, but they all have the convolution property. Table 4.3 provides a list of common continuous transforms.
The last entry in the table is the transform we will be using for continuous probability density
functions:
the moment generating function (MGF) of a continuous random variable X,

g_X(s) = E[e^{sX}] = ∫_{−∞}^{∞} f_X(x) e^{sx} dx.    (4.64)

Table 4.4: Moment Generating Functions (MGF) for various continuous densities
The MGF can be seen as a two-sided generalisation of the Laplace transform. Note again the
expectation notation that provides a compact definition but hides the transform nature of the
MGF5 .
A table of continuous probability distributions and their MGFs is given in Table 4.4. A similar
table is also available in your mathematics data book on page 28. These expression are mostly
easy to derive and you are encouraged to do so. We derive the MGF of the standard Gaussian
distribution
g_X(s) = ∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2} e^{xs} dx    (4.65)
       = ∫_{−∞}^{∞} (1/√(2π)) e^{−(x²−2xs)/2} dx    (4.66)
       = e^{s²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(x−s)²/2} dx = e^{s²/2}    (4.67)
where the last step follows by recognising that the integral is that of a Gaussian density with
variance 1 and mean s and hence integrates to 1. It is easy to show that, for Y = αX,

g_Y(s) = g_X(αs)    (4.68)

and, for Y = X + β,

g_Y(s) = e^{βs} g_X(s)    (4.69)
and hence determine the MGF of general Gaussian variables from the standard Gaussian expression
above.
4 If you have an old pre-2017 mathematics databook, the MGF was defined as E[e−sX ]. This is not the standard
definition given in the probability textbooks that I checked. The MGF is normally defined as E[esX ] (no minus in
the exponent.) This has been corrected in the current version of the databook.
5 You may be wondering why we could not just pick existing transforms to perform convolutions of probability
distributions (e.g. z-transform) or densities (e.g. Fourier transform.) The answer is in part that we needed slightly
different properties (e.g., sum of probabilities over the values in X ) and in part that other transforms are in fact
used as well. In particular, the Fourier transform of a density is called its characteristic function, can be expressed
as gX (jω), and can be used interchangeably with the MGF in most contexts. Furthermore, when tackling other
types of addition for discrete random variables such as modulo addition, the Discrete Fourier Transform and other
related transforms are used. Your mobile phone, your broadband router, and your digital TV receiver (“freeview”,
cable, etc.), are all evaluating Discrete Fourier Transforms of probability distributions thousands of times per second
in order to tackle modulo sums of Bernoulli random variables when decoding incoming signals. You can learn more
about this in the Part II module 3F7 “Information Theory”.
Now consider two independent Gaussian random variables X ∼ N (µ1 , σ12 ) and Y ∼ N (µ2 , σ22 )
and their sum Z = X + Y . The MGF of the sum is
g_Z(s) = e^{sµ_1 + s²σ_1²/2} e^{sµ_2 + s²σ_2²/2} = e^{s(µ_1+µ_2) + s²(σ_1²+σ_2²)/2}    (4.70)
showing that Z is a Gaussian random variable with the sum of the means and the sum of the
variances, Z ∼ N(µ1 +µ2 , σ12 +σ22 ). Hence the Gaussian density is closed under addition of random
variables.
Finally, like the PGF, the MGF can be used to compute moments of random variables. Note that

g′_X(0) = ∫_{−∞}^{∞} x f_X(x) e^{sx} dx |_{s=0} = E[X].    (4.71)
This argument can be repeated recursively to obtain
E[X] = g′(0)
E[X²] = g″(0)    (4.72)
E[Xⁿ] = g⁽ⁿ⁾(0).
Hence the name “moment generating function”. Note that the last expression is true for all n as can easily be verified, and even for n = 0 since
g(0) = ∫_{−∞}^{∞} f_X(x) e^{0·x} dx = ∫_{−∞}^{∞} f_X(x) dx = 1 = E[X⁰] = E[1].    (4.73)
4.4 The central limit theorem

Let X_1, X_2, . . . be independent random variables with means µ_1, µ_2, . . . and variances σ_1², σ_2², . . ., each possessing an MGF. The central limit theorem states that the sum

Y_n = X_1 + X_2 + . . . + X_n    (4.74)

tends to a Gaussian random variable

Y_n −→ Y ∼ N(µ_1 + . . . + µ_n, σ_1² + . . . + σ_n²)    (4.75)

as n grows to infinity.
The central limit theorem is the reason why the Gaussian distribution was described as being
“of central importance” to probability theory when we first introduced it. Whenever quantities
emerge as a sum of many effects, they tend to be Gaussian. Having learned that the density of a
sum is the convolution of the densities of its terms, this also means more generally that if you pick any functions that have a well-behaved transform, their repeated convolution will tend towards a function of the form e^{−x²}, which is rather surprising.
We will first make a few observations about the theorem and then provide a proof. First, note
that, since we’ve already shown that the mean of the sum is always the sum of the means, and the
sum of the variances of independent random variables is the sum of the variances, we immediately
see that
E[Yn ] = µ1 + µ2 + . . . + µn and Var(Yn ) = σ12 + σ22 + . . . + σn2 . (4.76)
Hence, the “surprising” part of the theorem is the fact that Yn tends to a Gaussian random variable
rather than that its mean and variances are as stated.
What does it mean that Yn “tends to” Y ? This is in fact a subtle question and one that
mathematicians would approach with careful formalism. For our purposes, it suffices to imagine
that the density of Yn is “close” to the density of Y in some sense, and that it becomes “closer”
as n grows. The proof below will give us a better understanding of this process.
The central limit theorem is often presented in the form of an “average” of random variables
instead of a sum. If we multiply the sum by a factor α, since αYn = αX1 + αX2 + . . . + αXn , we
conclude that αYn has mean E[αX1 ]+E[αX2 ]+. . .+E[αXn ] and variance Var(αX1 )+Var(αX2 )+
. . . + Var(αXn ). Now,
E[αX_k] = αE[X_k] = αµ_k    (4.77)

for all k, and

Var(αX_k) = E[(αX_k)²] − E[αX_k]² = α² Var(X_k)    (4.78)

for all k. Hence, we obtain an alternative version of the central limit theorem by picking α = 1/n, to yield that

(1/n)(X_1 + X_2 + . . . + X_n) −→ Y ∼ N((µ_1 + µ_2 + . . . + µ_n)/n, (σ_1² + σ_2² + . . . + σ_n²)/n²).    (4.79)
Note that if µ_1 = µ_2 = . . . = µ_n = µ and σ_1² = σ_2² = . . . = σ_n² = σ², i.e., the random variables X_1, X_2, . . . are independent with equal mean and variance, the equation above implies that

(1/n)(X_1 + X_2 + . . . + X_n) −→ Y ∼ N(µ, σ²/n),    (4.80)
i.e., the mean is the same as the mean of each variable, while variance becomes smaller as more
random variables are added. This is the basis for averaging a sequence of random variables over
time, as the probability that the observed value is close to the mean increases as the number of
observations averaged increases. We will re-visit this fact at the end of this section to discuss a number of probabilistic characteristics that can be estimated statistically from data.
Another choice α = 1/√n would give a sum whose mean grows as µ√n and whose variance remains constant at σ². This is of no practical relevance but we will use this α = 1/√n in our proof of the central limit theorem below.
Proof of the central limit theorem: without loss of generality, let us assume that µ1 = µ2 =
. . . = µn = 0 and that σ12 = σ22 = . . . = σn2 = 1. Our approach6 will seek to prove that
Y_n = (1/√n)(X_1 + X_2 + . . . + X_n)    (4.81)
tends to a standard Gaussian distribution N(0, 1). We already know from the discussion above that E[Y_n] = µ√n = 0 and Var(Y_n) = σ² = 1. Given that every variable X_k for k = 1, 2, . . .
has an MGF, we can write this MGF as a Taylor series around 0, to yield

g_{X_k}(s) = g_{X_k}(0) + g′_{X_k}(0) s + g″_{X_k}(0) s²/2 + g_{X_k}^{(3)}(0) s³/3! + . . .    (4.82)

Recall that g_{X_k}^{(m)}(0) = E[X_k^m]. Hence we can re-write the expression above as
g_{X_k}(s) = E[X_k⁰] + E[X_k] s + E[X_k²] s²/2 + o(s³)    (4.83)
           = 1 + µ_k s + (σ_k² + µ_k²) s²/2 + o(s³)    (4.84)
           = 1 + s²/2 + o(s³).    (4.85)
Note how, despite the fact that the densities of Xk may all be different, the fact that they all
have mean 0 and variance 1 implies that the terms up to the quadratic in the Taylor series of
their MGF must be the same. We can now use the convolution property to observe that
g_{Y_n}(s) = Π_{k=1}^{n} g_{X_k}(s/√n)    (4.86)
           = Π_{k=1}^{n} (1 + s²/(2n) + o(s³/n^{3/2}))    (4.87)
           = (1 + s²/(2n) + o(s³/n^{3/2}))^n −→ e^{s²/2},    (4.88)
where we have used the limit (1 + x/n)^n → e^x from your Mathematics Data Book page 2, and the fact that all terms of power 3 and above of s/√n tend to zero as n grows large. We have shown that the MGF of Y_n tends to the MGF of a standard Gaussian, e^{s²/2}, as n grows, indicating that the density of Y_n approaches the standard Gaussian density and hence proving the theorem.
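The convergence is easy to watch numerically; a Python sketch (uniform X_k and the sample sizes are illustrative choices) of sums of standardised non-Gaussian variables landing close to N(0, 1):

import numpy as np

rng = np.random.default_rng(2)
n, trials = 50, 200_000

# X_k uniform on [0, 1], shifted and scaled to mean 0 and variance 1
X = (rng.random((trials, n)) - 0.5) * np.sqrt(12.0)
Y = X.sum(axis=1) / np.sqrt(n)

print(Y.mean(), Y.var())                                 # ~0 and ~1
print(np.mean(np.abs(Y) <= 1), np.mean(np.abs(Y) <= 2))  # ~0.68 and ~0.95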
The central limit theorem (CLT) and a more general but weaker result called the law of large
numbers (LLN) form the basis for statistical estimation of probabilities from data. We have already had a foretaste of estimating probability parameters from data when we studied the Beta density in Section 3.5. Both the CLT and the LLN can be used to show that, for a sequence of independent random variables X_1, . . . , X_n that have the same mean and variance, the time average of any
function of the variables

Z_n = (1/n)(g(X_1) + g(X_2) + . . . + g(X_n))    (4.90)
is a random variable with expectation E[g(X)] whose variance decreases with n, and hence com-
puting Z_n from data gives an estimate of E[g(X)] whose accuracy increases with n. This ability to estimate probabilistic characteristics from data can be used to ascertain that data follows a given
probabilistic model. There are many characteristics of probability distributions or densities that
can be estimated from data, such as:
• the mean or expectation E[X] = µ
• the variance Var(X) = σ² = E[X²] − µ² and the standard deviation σ = √Var(X)
• the skewness E[(X − µ)3 ]/σ 3 . If the skewness is positive, the distribution (or density) is
skewed to the right. Informally the ‘tail’ of the distribution (or density) is longer to the right.
There are other characteristics of probability distributions or densities that can be estimated from
data but don’t rely on the CLT or the LLN:
• the mode arg max_x P_X(x) or arg max_x f_X(x)
• the median M such that F_X(M) = 1/2 can be estimated by the data middle: order all n data points in decreasing order and pick the n/2-th point
6 Note that this is equivalent to the more general theorem because we can shift and scale all random variables and track the effect on Y_n to show that the sum is a general Gaussian random variable, and inversely we can map the general case to this special case by shifting all random variables so their means are zero and their variances 1.
4.5 Multivariate Gaussians

A random vector X = (X_1, X_2, . . . , X_n) with mean vector µ and covariance matrix Σ is multivariate Gaussian, written X ∼ N(µ, Σ), if its density is

f_X(x) = (1/√((2π)ⁿ det Σ)) e^{−(1/2)(x−µ)ᵀ Σ⁻¹ (x−µ)}.

Figure 4.8: Two-variable Gaussian distribution with zero means, unit variances, and E[X_1 X_2] = 1/2.

A few remarks about this expression:
• the quartiles Q1 and Q3 such that FX (Q1 ) = 1/4 and FX (Q3 ) = 3/4 and the interquartile
range Q3 − Q1 can be estimated similarly to the median
Example 4.8. The distribution of income in Britain is skewed to the right. The average income
is very different from the median, since a few people have very large incomes. The logarithm
of the income is much less skewed.
Sometimes the variance of the distribution can be heavily influenced by very few observations.
The interquartile range also quantifies the spread of a distribution, but it is said to be more robust
toward outliers.
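All of these estimates are one-liners with numpy; a sketch on a right-skewed synthetic sample (the lognormal choice is an illustration, echoing the income example):

import numpy as np

rng = np.random.default_rng(3)
data = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # right-skewed sample

mu, sigma = data.mean(), data.std()
print(mu, data.var(), sigma)                              # mean, variance, std dev
print(np.mean((data - mu) ** 3) / sigma ** 3)             # skewness: positive here

# Median and interquartile range, more robust toward outliers
q1, med, q3 = np.percentile(data, [25, 50, 75])
print(med, q3 - q1)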
• remember that X is a random vector and its possible values x are also vectors. xᵀ denotes transposition.
• the density is a scalar function as before. We can interpret it as the infinitesimal probability
fX (x)dx1 dx2 . . . dxn that X lies in the region [x1 , x1 +dx1 )×[x2 , x2 +dx2 )×. . .×[xn , xn +dxn )
of n dimensional space. For example, for a 2-dimensional density, fX ((1, 2))dxdy gives the
infinitesimal probability of being in a square of dimensions dx × dy whose bottom left corner
is (1, 2).
• The letter Σ normally used for summation is used to denote a matrix here. We have followed
convention in adopting this notation even though I admit that it is rather bizarre.
• The position of the variance in the factor 1/√(2πσ²) of the univariate Gaussian density is now occupied by the determinant of Σ in the multivariate expression.
• The expression in the exponent is a vector-matrix-vector multiplication but you can persuade
yourself that the result is scalar, so the exponential is simply the scalar exponential.
• The position of the variance in the exponent 1/(2σ 2 ) of the univariate Gaussian density is
now taken up by the inverse Σ−1 of Σ.
Consider the covariance matrix Σ. Its diagonal elements Σ_kk = E[X_k²] − µ_k² = Var(X_k) are simply the variances of the individual component random variables X_k. For any k and m, Σ_km = Σ_mk since the definition of the matrix elements is symmetric, and hence the covariance matrix is symmetric. If two components X_k and X_m are independent, then Σ_km = E[X_k]E[X_m] − µ_k µ_m = 0.
The entry Σkm in the covariance matrix is the covariance Cov(Xk , Xm ) of Xk and Xm , as
defined in section 4.2.1, where we had observed that independent random variables always had
zero covariance but it was possible for variables with zero covariance not to be independent.
However, for multi-variate Gaussians it is easy to see that uncorrelated implies independent and
vice-versa. The covariance matrix Σ in that case is diagonal and the density is simply the product
of univariate Gaussian densities.
It can be shown that the marginal density of any component of a multivariate Gaussian is a univariate Gaussian, i.e.,

fXn(x) = ∫_{−∞}^{∞} … ∫_{−∞}^{∞} fX(x1, x2, …, xn−1, x) dx1 dx2 … dxn−1 = (1/√(2πΣnn)) e^{−(x−µn)²/(2Σnn)}. (4.95)
In fact, if we perform any partial marginalisation of a multi-variate Gaussian, we obtain another
multi-variate Gaussian with reduced dimensions. For example, the vector (X1 , . . . , Xk ) for k < n
is multi-variate Gaussian
(X1 , . . . , Xk ) ∼ N(µ1,...,k , Σk×k ) (4.96)
where Σk×k is the k × k submatrix of Σ consisting of its first k rows and first k columns.
Furthermore, conditional densities of variables that are jointly multi-variate Gaussian are also
always uni-variate or multi-variate Gaussian. If we write
Σ = [ A   B
      Bᵀ  C ]    (4.97)

where A is the k × k top-left submatrix, then the conditional density of (X1, …, Xk) given (Xk+1, …, Xn) = y is Gaussian with mean vector µ_{1,…,k} + B C⁻¹ (y − µ_{k+1,…,n}) and covariance matrix A − B C⁻¹ Bᵀ. Note how the conditional mean vector of (X1, …, Xk) depends on the value y of (Xk+1, …, Xn), while the conditional covariance does not.
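This conditioning is a purely mechanical computation on the blocks of Σ. A minimal Python sketch (my own illustration, reusing the covariance of Figure 4.8):

    import numpy as np

    def conditional_gaussian(mu, Sigma, k, y):
        """Mean and covariance of (X1..Xk) given (Xk+1..Xn) = y, X ~ N(mu, Sigma)."""
        A, B, C = Sigma[:k, :k], Sigma[:k, k:], Sigma[k:, k:]
        C_inv = np.linalg.inv(C)
        mu_cond = mu[:k] + B @ C_inv @ (y - mu[k:])   # depends on the observed y
        Sigma_cond = A - B @ C_inv @ B.T              # does not depend on y
        return mu_cond, Sigma_cond

    Sigma = np.array([[1.0, 0.5],
                      [0.5, 1.0]])
    print(conditional_gaussian(np.zeros(2), Sigma, 1, np.array([1.0])))
    # mean 0.5, variance 0.75: observing X2 = 1 shifts X1 and reduces its spread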
4.6 Summary
Functions Y = g(X) of random variables:
• X discrete: PY(y) = Σ_{x∈Xy} PX(x), where Xy is the set of all x such that g(x) = y
• for any X with mean µ and variance σ 2 , Y = (X − µ)/σ has mean 0 and variance 1
• for any X with mean 0 and variance 1, Y = aX + b has mean b and variance a2 .
Mean and variance of sums of random variables:
• E[X + Y ] = E[X] + E[Y ]
• Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X, Y ), where Cov(X, Y ) = E[XY ] − E[X]E[Y ]
is zero if X and Y are independent
• Correlation coefficient ρ = Cov(X, Y) / √(Var(X) Var(Y))
Decision, Estimation and Hypothesis Testing

So far in the course, we have introduced probability as a branch of mathematics dealing with uncertain events. In all the theory you've seen so far, you've learned how to quantify uncertainty and manipulate measures of uncertainty, but never how to resolve uncertainty into certainty. Indeed,
in our example in Chapter 3 where a football captain needed the probability that a player would
score a penalty, we learned that the probabilist’s preferred approach is to compute a density for
the unknown parameter p, rather than decide a specific value for it from the data.
There are occasions in practice when one has no choice but to make statements about uncertain
events that don’t involve densities or probability distributions, for example:
• A decision must be taken based on the data observed. For example, a patient has been
tested for a condition and the doctor and patient must now decide whether to treat or not
to treat for the condition based on the test result.
• An estimate is needed for an unknown quantity based on measured data. For example, we
create a shear wave in a fluid using a vibrating probe and obtain sensor measurements of
the dissipated energy, from which the viscosity of the fluid can be estimated.
• Public information: many statisticians and those funding them believe that the wider public
is not able to understand probabilistic statements. This patronising view of public under-
standing of mathematics leads some to demand certain statements about uncertain measure-
ments. For example, “The public wants to be told whether global warming is happening.
They won’t understand a probability density function of the predicted temperatures over
the next decade.” Recent U.S. presidents have made statements such as “Global warming is
a fact!” or “The concept of global warming was created by and for the Chinese in order to
make U.S. manufacturing non-competitive.” The role of statisticians, on this view, is to prove or disprove the global warming hypothesis, depending on which political patrons they serve.
In this chapter, we will give a brief introduction to decision and estimation theory (there is much
more to come in Part II modules such as 3F3) then proceed in more depth to discuss hypothesis
testing.
[Figure 5.1: Block diagram. A random experiment generates an outcome ω; X(ω) is the quantity of interest, Y(ω) is the observation, and the function d(·) maps Y to a decision or estimate X̂.]

The function d(·) is designed to produce a decision or to provide an estimate of X based on the observation Y. In all cases, the conditional probability distribution PY|X(·|·) or density fY|X(·|·) is known to the decider or estimator.
Example 5.1. Data bits are transmitted over a twisted copper cable to your games console via your home broadband modem. The data bits, valued zero or one, are transmitted as +A Volt for a zero, and −A Volt for a one. The received observations are independent given the data and Gaussian distributed with mean +A or −A, i.e.,

fY|X(y|x) = (1/√(2πσ²)) e^{−(y−(−1)ˣA)²/(2σ²)} for x ∈ {0, 1}. (5.1)

Hence, if the data bit is a Bernoulli random variable PX(1) = 1 − PX(0) = p, then

PX|Y(1|y) = fY|X(y|1) p / (fY|X(y|0)(1 − p) + fY|X(y|1) p) = 1 / (1 + (fY|X(y|0)/fY|X(y|1)) · (1 − p)/p). (5.2)
For a given observation y, it seems like a good idea to decide X̂ = 1 if PX|Y(1|y) > 1/2 and to decide X̂ = 0 otherwise, or in other words

d(y) = 1( (fY|X(y|0)/fY|X(y|1)) · (1 − p)/p < 1 ) = 1( ((1 − p)/p) e^{((y+A)²−(y−A)²)/(2σ²)} < 1 ) (5.3)
where the function 1(·) is called an indicator function and returns a 1 if the condition in the brackets is true and a 0 otherwise. Note that if PX(1) = p = 1/2, then the rule can be further simplified to

d(y) = 1((y + A)² − (y − A)² < 0) = 1(4Ay < 0) = 1(y < 0), (5.4)

i.e., decide X̂ = 1 if the observation is negative, and X̂ = 0 if the observation is positive. If the observation is exactly zero, we could decide either way, but since the probability of this event is zero it does not change the expected performance of our decision.
In this example, a decision for the value of bit X is necessary if your games console requires the data bits to keep your game going, e.g., it is not set up to receive probabilistic statements specifying the probability distribution of X given the observation. In fact, note that the “decision approach” to communications is overly simplistic: all modern communications devices are set up to receive probabilistic statements about data bits from the transmission line, as this allows for much better encoding and decoding to maximise the data transmission rate.
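A short Monte Carlo sketch of Example 5.1 (my own, not from the notes; the values of A, σ and the sample size are invented for illustration) confirms that the rule d(y) = 1(y < 0) achieves the error rate predicted by the Gaussian tail:

    import numpy as np

    rng = np.random.default_rng(1)
    A, sigma, n = 1.0, 0.8, 100_000            # invented channel parameters

    x = rng.integers(0, 2, n)                  # equiprobable data bits (p = 1/2)
    y = np.where(x == 0, A, -A) + rng.normal(0.0, sigma, n)

    x_hat = (y < 0).astype(int)                # the rule d(y) = 1(y < 0)
    print(np.mean(x_hat != x))                 # empirical error rate, about 0.106
    # for this channel, the exact error probability is 1 - Phi(A/sigma)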
The example above used the so-called Maximum A-Posteriori (MAP) rule:
MAP rule: for an observation Y = y, pick X̂ = x to maximise PX|Y (x|y).
The rule for p = 1/2 is an instance of the Maximum Likelihood (ML) rule:

ML rule: for an observation Y = y, pick X̂ = x to maximise PY|X(y|x) or fY|X(y|x).
The ML rule is equivalent to the MAP rule when X is uniform, but is often also used in cases
where the prior distribution PX is unknown to the decider.
In estimation problems, X is continuous and we cannot hope to reconstruct X exactly. Hence,
our aim in general is to find a X̂ that approximates X as closely as possible given the observation
Y , and one must define what is meant by “as closely as possible.” One common way to define this
closeness is to minimise the Mean Squared Error (MSE) E[(X̂ − X)2 |Y = y], where the condi-
tional expectation is defined just like the expectation but using the conditional probability density
function. Such an estimator is called the Minimum Mean Squared Error (MMSE) estimator. Let my = E[X|Y = y], i.e., the expected value of X given the observation y, and note that

E[(X̂ − X)² | Y = y] = (X̂ − my)² + E[(X − my)² | Y = y] ≥ E[(X − my)² | Y = y]

with equality if and only if X̂ = my = E[X|Y = y]. Hence the mean squared error is minimised by picking the conditional expectation of X given the observation as our estimate, giving the following rule:

MMSE rule: for an observation Y = y, pick X̂ = E[X | Y = y].
While we have used a notation implying that the observation Y is a single random variable, in
practice Y is often a vector of random variables. We have only scratched the surface of decision
and estimation theory in this section. Many extensions of these theories are possible. For exam-
ple, decision theory can be extended to take a cost associated with every decision into account.
Estimation theory can be restricted to linear estimators, i.e., what is the best estimator for X
given a vector Y when you are only allowed to use estimators of the form X̂ = c · Y for a given vector c?
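As a concrete illustration (my own sketch, with invented variances): when X and the additive noise are both Gaussian, E[X|Y = y] works out to be linear in y, so the MMSE rule is easy to check numerically:

    import numpy as np

    rng = np.random.default_rng(2)
    sx, sn, n = 1.0, 0.5, 100_000              # invented prior and noise std devs

    x = rng.normal(0.0, sx, n)                 # X ~ N(0, sx^2)
    y = x + rng.normal(0.0, sn, n)             # observation Y = X + noise

    x_mmse = (sx**2 / (sx**2 + sn**2)) * y     # E[X | Y = y] for this model
    print(np.mean((x_mmse - x) ** 2))          # about 0.2 = sx^2 sn^2/(sx^2 + sn^2)
    print(np.mean((y - x) ** 2))               # about 0.25: using y directly is worse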
5.2 Hypothesis testing: simple hypotheses

A hypothesis H is called simple if it is a well defined event, so that its probability P(H) and the conditional distributions of the observation given H and given H̄ are well defined, where H̄ is the complement of H. H is often called the null hypothesis H0 = H, and H̄ the alternative hypothesis H1 = H̄.
Example 5.2. A hypothetical researcher has discovered a simple pregnancy test based on
a retina scan that can be administered easily and inexpensively by every optician. The test
gives a numerical outcome that is Gaussian N (1, 1/4) for a pregnant test person, and Gaussian
N (0, 1) for a test person who is not pregnant.
In the example above, the null hypothesis H0 “the test person is pregnant” is a simple hy-
pothesis. It is a well defined event that has a probability and for which there is a conditional
probability of the observation. Its converse H1 = H̄ “the test person is not pregnant” is equally
well defined.
The outcome of a hypothesis test is a statement concluding either

H0 is true

or

H0 is false (i.e., H1 is true),
possibly with a numerical p-value indicating the strength of the statement. Let the random variable
X be an indicator random variable for our statement, i.e., X = 1 if we claim that H0 is true, and
X = 0 if we claim that H0 is false. It is useful to distinguish between the types of error that we
can make in our statement:
            H0 false          H0 true
X = 0       correct           type I error
X = 1       type II error     correct
This table highlights the asymmetry of the hypothesis testing problem. There may be different
consequences to a type I and a type II error. In the example above, if we tell the test person that
she’s likely to be pregnant, we will recommend that she attend a GP for a more accurate test. If
we tell the test person that she’s not pregnant, there will be no follow-up action. The damage
done by a type I error (undetected pregnancy) is a lot worse than the cost of the type II error
(unnecessary visit to a GP.)
Example (continued): One simple strategy for the developer of the retina scan pregnancy test is to set a threshold at a value β and pick X = 1 (H0 is true) if Y > β and X = 0 (H0 is false) if Y ≤ β. The type I error probability is defined as

P(X = 0 | H0) = P(Y ≤ β | H0),

i.e., the probability that the data lie in a region for which we will state “H1 is true” given that H0 is true. We know that the density of Y given H0 is Gaussian N(1, 1/4). Hence, supposing that the Medicines and Healthcare products Regulatory Agency (MHRA) requires a type I error probability of 0.01 for this type of test, we have

P(Y ≤ β | H0) = P(2(Y − 1) ≤ 2(β − 1)) = 1 − Φ(2(1 − β)) = 0.01,
where 2(Y − 1) is standard Gaussian N(0, 1) (see Section 4.1.2), so we can read out the value z for which Φ(z) = 1 − 0.01 = 0.99 on page 29 of your Mathematics Data book to obtain 2(1 − β) = 2.33
and hence β = −0.165. Hence, the test will return a result of “pregnant” or “probably
pregnant” for Y > −0.165, and “not pregnant” for Y < −0.165.
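For those who prefer to check such numbers by computer, a minimal sketch with scipy (my own; norm.ppf replaces the Data book table, which is why the threshold differs slightly from −0.165):

    from scipy.stats import norm

    # Y | H0 ~ N(1, 1/4); solve P(Y <= beta | H0) = 0.01 for the threshold beta
    beta = norm.ppf(0.01, loc=1.0, scale=0.5)
    print(beta)                                # about -0.163

    # resulting type II error: stating "H0 true" (Y > beta) when Y | H1 ~ N(0, 1)
    print(norm.sf(beta, loc=0.0, scale=1.0))   # about 0.56: many unnecessary GP visits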
[Figure 5.2: Probability p(H0 | Y = −0.2) of the hypothesis given the test result, plotted as a function of the a-priori probability p(H0).]
There is considerable confusion about the meaning of the numerical value associated with a
hypothesis testing statement. In the example above, we required a “not pregnant” result with an
upper bound on the type I error probability of 1%. Most test persons not versed in hypothesis
testing would assume that this means that if they get a “not pregnant” outcome, the probability
that they are not pregnant is 99%. This is wrong! The correct interpretation of this figure is that
the result of the test fell within a region whose total probability is less than 1% for a pregnant
test person.
Example 5.3. Consider the example of a test person whose numerical test result was −0.2
and was given a “not pregnant” outcome. We would now like to depart from the framework of hypothesis testing and compute the probability that the person is pregnant given the test result. In order to use Bayes’ theorem we need the prior probability p = p(H0) that a person taking the test is pregnant. We can write

p(H0 | Y = −0.2) = p fY|H0(−0.2) / (p fY|H0(−0.2) + (1 − p) fY|H1(−0.2)),

where fY|H0 is the N(1, 1/4) density and fY|H1 the N(0, 1) density. This probability is plotted in Figure 5.2 and we observe that the probability that the test person with a numerical result of −0.2 is pregnant varies widely with the a-priori probability p = p(H0). It is easy to calculate that the test person is in fact only 99% sure of not being pregnant if the a-priori probability of pregnancy is 8.1%.
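The curve of Figure 5.2 and the 8.1% figure are easy to reproduce; a minimal sketch (my own):

    from scipy.stats import norm

    y = -0.2
    f0 = norm.pdf(y, loc=1.0, scale=0.5)       # density of Y given H0 (pregnant)
    f1 = norm.pdf(y, loc=0.0, scale=1.0)       # density of Y given H1 (not pregnant)

    def posterior(p):                          # Bayes' theorem, as in the example
        return p * f0 / (p * f0 + (1 - p) * f1)

    for p in [0.01, 0.081, 0.5]:
        print(p, posterior(p))                 # p = 0.081 gives about 0.01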
A central theorem of hypothesis testing known as the Neyman-Pearson theorem shows that the type of threshold test we applied in our example is in fact optimal, in the sense of minimising the type II error probability for a given bound on the type I error probability.
Example 5.4. A hypothetical cafe “HN” has recently opened its doors opposite the department
of Engineering. Its hypothetical owner “S” wonders whether customers know the difference
between a “flat white” and a “cappuccino”. He proposes to use a double-blind statistical
experiment in which customers are given a drink and asked to guess what it is, and has called
upon the probability experts at the department to help him evaluate the test.
Of the 10 customers who participated in the test so far, 7 correctly identified their drink
and 3 did not. We would like to test the following hypotheses:
• H0 : the data is “random”
• H1 : the data is “not random”
H0 can be made more precise, i.e., the customers’ answers were independent and Ber(1/2)
(independent of each other and of the drinks tasted.)
In the example, H0 is a simple hypothesis: given that the decisions are Ber(1/2) independent
of the drinks, it is possible to evaluate the probability of the data observed. The complement
hypothesis H1 is a composite hypothesis: what does it mean for the data to be “not random”?
Do we mean that it is Bernoulli with a different parameter than 1/2? Or not Bernoulli at all?
Or perhaps the customers’ decisions were not independent and we must consider their joint dis-
tribution? Since the hypothesis is not well defined, it is hard or impossible to talk about the
probability of the data given the hypothesis. The approach described in the previous section is no
longer applicable.
In this case, classical statistics resorts to one-sided statistical tests, simply evaluating the prob-
ability of the data given H0 and deciding that H0 is true if the data falls within a pre-determined
confidence range. Hypothesis H1 is ignored, but the statement “H0 is false” is nonetheless often
translated into “H1 is true” despite the fact that H1 is not well defined. We continue our example.
Example 5.5. In the “cappuccino” vs. “flat white” battle, we decide to admit the null hypothesis H0 if the probability of the data observed or any more extreme data is more than 5% (a very common threshold for statistical significance.) The sum of Bernoulli random variables is binomial distributed, and hence

p(Y ≥ 7 | H0) = C(10,7)(1/2)⁷(1/2)³ + C(10,8)(1/2)⁸(1/2)² + C(10,9)(1/2)⁹(1/2)¹ + C(10,10)(1/2)¹⁰(1/2)⁰
             = (1/2¹⁰)(C(10,7) + C(10,8) + C(10,9) + C(10,10)) = 176/1024 = 0.1719,

where C(n, k) denotes the binomial coefficient “n choose k”.
This is far more than 5% and hence we state that H0 is true, or “the data is random”. In other words, “customers are unable to distinguish between a cappuccino and a flat white.”

Note that the size of the experiment has a bearing on our decision. If we increase the number of test persons involved but assume that the proportion of those correctly identifying their drinks remains constant, we get for a 20 customer experiment p(Y ≥ 14 | H0) = 0.0577,
so much closer to the threshold but still favouring H0, and for a 30 customer experiment p(Y ≥ 21 | H0) = 0.0214, so we now decide that H0 is false. The conclusion “the data is not random” or “H1 is true” is risky since we have not clearly decided what we mean by H1 in this context, but most would now conclude “customers can distinguish between a flat white and a cappuccino”.

1 The appellation “simple hypothesis” vs. “composite hypothesis” is not widespread and is borrowed from [RHB06].
Note that we could use the Beta distribution approach introduced in Section 3.5 to make
statements about H1 if we simply assume that “not random” means independent Bernoulli
with an unknown parameter and hence compare H1 against H0 using Bayesian methods.
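These binomial tail probabilities can be checked with a few lines of Python (my own sketch, reproducing the three figures above):

    from scipy.stats import binom

    # P(Y >= k | H0) for the three experiment sizes, under H0: guesses are Ber(1/2)
    for n, k in [(10, 7), (20, 14), (30, 21)]:
        print(n, binom.sf(k - 1, n, 0.5))      # 0.1719, 0.0577, 0.0214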
Note in the example that we use the probability of the data observed or more extreme data
rather than just the probability of the data. It is in the nature of a statistical experiment that the
probability p(Y = y | H0 ) of the data will decrease with the size of the experiment. For example,
if our experiment always yielded half correct guesses and half incorrect guesses (assuming an even
experiment size), the probability of the observed data would be
p(Y = y | H0) = C(n, n/2) (1/2)ⁿ
where n is the number of customers in the measurement. This is the mode (largest value) of the binomial distribution, but it nonetheless decreases² with n. Hence it would appear that the probability of the data being random decreases with the size of the experiment, and we would reject the null hypothesis in all cases if the experiment size grew sufficiently large. This is the reason why statisticians insist on measuring the probability of the measured data or more extreme data. In the extreme example just presented, that value would be about 1/2 for all experiment sizes and hence we would always accept the null hypothesis.
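A quick numerical illustration of this decrease (my own sketch):

    from scipy.special import comb

    # probability of exactly half correct guesses under H0, for growing n
    for n in [10, 20, 40, 80]:
        print(n, comb(n, n // 2, exact=True) / 2 ** n)  # 0.246, 0.176, 0.125, 0.089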
Example 5.6. A bus company claims that on a certain route there is a service every 20 minutes.
Three people complain that this claim is false:
• A: had to wait 45 mins on a particular day. Since p(A | H0) = 0.105, this is not hugely unlikely, and cannot be used to reject H0 at a 5% level.
• B: had to wait 45 mins both Monday and Tuesday. Both events happen independently, so p(B | H0) = 0.105² = 0.011. This seems quite unlikely under the null hypothesis, so we can reject the company's claim, say at a 5% level.
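For completeness, the arithmetic in a short sketch (my own); note that the quoted 0.105 corresponds to modelling the waiting time as exponential with mean 20 minutes, a model the excerpt does not state explicitly:

    import numpy as np

    p_45 = np.exp(-45 / 20)    # P(wait > 45 min) under the exponential model
    print(p_45)                # 0.105, customer A: not significant at the 5% level
    print(p_45 ** 2)           # 0.011, customer B: significant at the 5% level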
2 As an exercise, verify that it decreases by considering two consecutive even test sizes n and n + 2 and showing that C(n+2, (n+2)/2) 2^{−(n+2)} − C(n, n/2) 2^{−n} < 0.
This example shows how sensitive the methods of composite hypothesis testing are to the
exact wording of the hypothesis. A court statistician in cases involving customers A, B and
C against the bus company would award damages to customer B and reject the claims of
customers A and C even though common sense does not detect significant differences between
these customers’ experiences.
The final conclusion of this chapter is that one has to be extremely careful when interpreting
statements obtained using hypothesis testing. The questions to ask are:
• what was the criterion used for accepting or rejecting the null hypothesis?
Statements should ideally be carefully worded to describe exactly what has been calculated, although if this were followed to the letter the statements would become so unintelligible as to make little or no sense to most people.
sense to most people. Perhaps this is a good place to insist once again that computing the prob-
ability of the hypothesis given the data using Bayes’ theorem results in much clearer statements
that may not be pleasant to hear but are in fact informative and accurate, e.g., the probability
that you have cancer given the test result is 85%.
5.4 Summary
Decision and estimation:
• Maximum A-Posteriori (MAP) decision rule: for an observation Y = y, pick X̂ = x to
maximise PX|Y (x|y).
• For a specified worst allowable probability of type I error, the optimal test minimising the probability of a type II error is a threshold-type test (Neyman-Pearson)
• For composite hypotheses (complementary hypothesis H1 not well defined), we can only set a threshold to admit the null hypothesis H0 based on the probability (“p-value”) that the data is as observed or more extreme
Bibliography
[Bil95] Patrick Billingsley. Probability and Measure. John Wiley and Sons, 1995.
[FLS64] Richard P Feynman, Robert B Leighton, and Matthew Sands. The Feynman lectures on
physics. Addison-Wesley Publishing Company, 1964.
[Goo99] Steven N Goodman. Towards evidence-based medical statistics. The P value fallacy.
Annals of Internal Medicine, 130:995–1004, June 1999.
[Mac03] David J C MacKay. Information theory, inference, and learning algorithms. Cambridge
University Press, 2003.
[RHB06] K F Riley, M P Hobson, and S J Bence. Mathematical methods for physics and engi-
neering. Cambridge University Press, 2006.