Probability and Random Variables
A Beginner's Guide
Since it assumes minimal prior technical knowledge on the part of the reader,
this book is suitable for students taking introductory courses in probability
and will provide a solid foundation for more advanced courses in probability
and statistics. It would also be a valuable reference for those needing a
working knowledge of probability theory and will appeal to anyone interested
in this endlessly fascinating and entertaining subject.
David Stirzaker
University of Oxford
PUBLISHED BY CAMBRIDGE UNIVERSITY PRESS (VIRTUAL PUBLISHING)
FOR AND ON BEHALF OF THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge CB2 1RP
40 West 20th Street, New York, NY 10011-4211, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
http://www.cambridge.org
Synopsis viii
Preface xi
1 Introduction 1
1.1 Preview 1
1.2 Probability 1
1.3 The scope of probability 3
1.4 Basic ideas: the classical case 5
1.5 Basic ideas: the general case 10
1.6 Modelling 14
1.7 Mathematical modelling 19
1.8 Modelling probability 21
1.9 Review 22
1.10 Appendix I. Some randomly selected definitions of probability, in random order 22
1.11 Appendix II. Review of sets and functions 24
1.12 Problems 27
A Probability
B Random Variables
On this occasion, I must take notice to such of my readers as are well versed
in Vulgar Arithmetic, that it would not be difficult for them to make
themselves Masters, not only of all the practical Rules in this book, but also
of more useful Discoveries, if they would take the small Pains of being
acquainted with the bare Notation of Algebra, which might be done in the
hundredth part of the Time that is spent in learning to write Short-hand.
This book begins with an introduction, chapter 1, to the basic ideas and methods of
probability that are usually covered in a first course of lectures. The first part of the main
text, subtitled Probability, comprising chapters 2–4, introduces the important ideas of
probability in a reasonably informal and non-technical way. In particular, calculus is not a
prerequisite.
The second part of the main text, subtitled Random Variables, comprising the final
three chapters, extends these ideas to a wider range of important and practical applications.
In these chapters it is assumed that the student has had some exposure to the small
portfolio of ideas introduced in courses labelled `calculus'. In any case, to be on the safe
side and make the book as self-contained as possible, brief expositions of the necessary
results are included at the ends of appropriate chapters.
The material is arranged as follows.
Chapter 1 contains an elementary discussion of what we mean by probability, and how
our intuitive knowledge of chance will shape a mathematical theory.
Chapter 2 introduces some notation, and sets out the central and crucial rules and ideas
of probability. These include independence and conditioning.
Chapter 3 begins with a brief primer on counting and combinatorics, including
binomial coefficients. This is illustrated with examples from the origins of probability,
including famous classics such as the gambler's ruin problem, and others.
Chapter 4 introduces the idea of a probability distribution. At this elementary level the
idea of a probability density, and ways of using it, are most easily grasped by analogy
with the discrete case. The chapter therefore includes the uniform, normal, and exponen-
tial densities, as well as the binomial, geometric, and Poisson distributions. We also
discuss the idea of mean and variance.
Chapter 5 introduces the idea of a random variable; we discuss discrete random
variables and those with a density. We look at functions of random variables, and at
conditional distributions, together with their expected values.
Chapter 6 extends these ideas to several random variables, and explores all the above
concepts in this setting. In particular, we look at independence, conditioning, covariance,
and functions of several random variables (including sums). As in chapter 5 we treat
continuous and discrete random variables together, so that students can learn by the use
of analogy (a very powerful learning aid).
Chapter 7 introduces the ideas and techniques of generating functions, in particular
probability generating functions and moment generating functions. This ingenious and
Oxford
January 1999
1
Introduction
1.1 PREVIEW
This chapter introduces probability as a measure of likelihood, which can be placed on a
numerical scale running from 0 to 1. Examples are given to show the range and scope of
problems that need probability to describe them. We examine some simple interpretations
of probability that are important in its development, and we briefly show how the
well-known principles of mathematical modelling enable us to progress. Note that in this
chapter exercises and problems are chosen to motivate interest and discussion; they are
therefore non-technical, and mathematical answers are not expected.
1.2 PROBABILITY
We all know what light is, but it is not easy to tell what it is.
Samuel Johnson
From the moment we first roll a die in a children's board game, or pick a card (any card),
we start to learn what probability is. But even as adults, it is not easy to tell what it is, in
the general way.
For mathematicians things are simpler, at least to begin with. We have the following:
probability is measured on a scale running from 0, for impossible eventualities, to 1, for
certain ones.
This may seem a trifle arbitrary and abrupt, but there are many excellent and plausible
reasons for this convention, as we shall show. Consider the following eventualities.
(i) You run a mile in less than 10 seconds.
(ii) You roll two ordinary dice and they show a double six.
(iii) You flip an ordinary coin and it shows heads.
(iv) Your weight is less than 10 tons.
If you think about the relative likelihood (or chance or probability) of these eventualities,
you will surely agree that we can compare them as follows.
The chance of running a mile in 10 seconds is less than the chance of a double six,
which in turn is less than the chance of a head, which in turn is less than the chance of
your weighing under 10 tons. We may write
chance of 10 second mile < chance of a double six
< chance of a head
< chance of weighing under 10 tons.
(Obviously it is assumed that you are reading this on the planet Earth, not on some
asteroid, or Jupiter, that you are human, and that the dice are not crooked.)
It is easy to see that we can very often compare probabilities in this way, and so it is
natural to represent them on a numerical scale, just as we do with weights, temperatures,
earthquakes, and many other natural phenomena. Essentially, this is what numbers are
for.
Of course, the two extreme eventualities are special cases. It is quite certain that you
weigh less than 10 tons; nothing could be more certain. If we represent certainty by unity,
then no probabilities exceed this. Likewise it is quite impossible for you to run a mile in
10 seconds or less; nothing could be less likely. If we represent impossibility by zero,
then no probability can be less than this. Thus we can, if we wish, present this on a scale,
as shown in figure 1.1.
The idea is that any chance eventuality can be represented by a point somewhere on
this scale. Everything that is impossible is placed at zero: that the moon is made of
cheese, formation flying by pigs, and so on. Everything that is certain is placed at unity:
the moon is not made of cheese, Socrates is mortal, and so forth. Everything else is
somewhere in [0, 1], i.e. in the interval between 0 and 1, the more likely things being
closer to 1 and the more unlikely things being closer to 0.

Figure 1.1. The probability scale: impossible at 0, certain at 1; the chance that a coin shows heads lies between.
Of course, if two things have the same chance of happening, then they are at the same
point on the scale. That is what we mean by `equally likely'. And in everyday discourse
everyone, including mathematicians, has used and will use words such as very likely,
likely, improbable, and so on. However, any detailed or precise look at probability
requires the use of the numerical scale. To see this, you should ponder on just how you
would describe a chance that is more than very likely, but less than very very likely.
This still leaves some questions to be answered. For example, the choice of 0 and 1 as
the ends of the scale may appear arbitrary, and, in particular, we have not said exactly
which numbers represent the chance of a double six, or the chance of a head. We have not
even justified the claim that a head is more likely than double six. We discuss all this later
in the chapter; it will turn out that if we regard probability as an extension of the idea of
proportion, then we can indeed place many probabilities accurately and confidently on
this scale.
We conclude with an important point, namely that the chance of a head (or a double
six) is just a chance. The whole point of probability is to discuss uncertain eventualities
before they occur. After this event, things are completely different. As the simplest
illustration of this, note that even though we agree that if we flip a coin and roll two dice
then the chance of a head is greater than the chance of a double six, nevertheless it may
turn out that the coin shows a tail when the dice show a double six. Likewise, when the
weather forecast gives a 90% chance of rain, or even a 99% chance, it may in fact not
rain. The chance of a slip on the San Andreas fault this week is very small indeed,
nevertheless it may occur today. The antibiotic is overwhelmingly likely to cure your
illness, but it may not; and so on.
1.3 THE SCOPE OF PROBABILITY
. . . nothing between humans is 1 to 3. In fact, I long ago come to the conclusion
that all life is 6 to 5 against.
Damon Runyon, A Nice Price
Life is a gamble at terrible odds; if it was a bet you wouldn't take it.
Tom Stoppard, Rosencrantz and Guildenstern are Dead, Faber and Faber
In the next few sections we are going to spend a lot of time flipping coins, rolling dice,
and buying lottery tickets. There are very good reasons for this narrow focus (to begin
with), as we shall see, but it is important to stress that probability is of great use and
importance in many other circumstances. For example, today seems to be a fairly typical
day, and the newspapers contain articles on the following topics (in random order).
1. How are the chances of a child's suffering a genetic disorder affected by a grand-
parent's having this disorder? And what difference does the sex of child or ancestor
make?
2. Does the latest opinion poll reveal the true state of affairs?
3. The lottery result.
4. DNA profiling evidence in a trial.
5. Increased annuity payments possible for heavy smokers.
6. An extremely valuable picture (a Van Gogh) might be a fake.
7. There was a photograph taken using a scanning tunnelling electron microscope.
8. Should risky surgical procedures be permitted?
9. Malaria has a significant chance of causing death; prophylaxis against it carries a
risk of dizziness and panic attacks. What do you do?
10. A commodities futures trader lost a huge sum of money.
11. An earthquake occurred, which had not been predicted.
12. Some analysts expected inflation to fall; some expected it to rise.
13. Football pools.
14. Racing results, and tips for the day's races.
15. There is a 10% chance of snow tomorrow.
16. Profits from gambling in the USA are growing faster than any other sector of the
economy. (In connection with this item, it should be carefully noted that profits are
made by the casino, not the customers.)
17. In the preceding year, British postmen had sustained 5975 dogbites, which was
around 16 per day on average, or roughly one every 20 minutes during the time
when mail is actually delivered. One postman had sustained 200 bites in 39 years of
service.
Now, this list is by no means exhaustive; I could have made it longer. And such a list
could be compiled every day (see the exercise at the end of this section). The subjects
reported touch on an astonishingly wide range of aspects of life, society, and the natural
world. And they all have the common property that chance, uncertainty, likelihood,
randomness (call it what you will) is an inescapable component of the story.
Conversely, there are few features of life, the universe, or anything, in which chance is
not in some way crucial.
Nor is this merely some abstruse academic point; assessing risks and taking chances
are inescapable facets of everyday existence. It is a trite maxim to say that life is a lottery;
it would be more true to say that life offers a collection of lotteries that we can all, to
some extent, choose to enter or avoid. And as the information at our disposal increases, it
does not reduce the range of choices but in fact increases them. It is, for example,
1.4 BASIC IDEAS: THE CLASSICAL CASE
The perfect die does not lose its usefulness or justification by the fact that real dice
fail to live up to it.
W. Feller
Our first task was mentioned above; we need to supply reasons for the use of the standard
probability scale, and methods for deciding where various chances should lie on this
scale. It is natural that in doing this, and in seeking to understand the concept of
probability, we will pay particular attention to the experience and intuition yielded by
flipping coins and rolling dice. Of course this is not a very bold or controversial decision;
any theory of probability that failed to describe the behaviour of coins and dice would be
widely regarded as useless. And so it would be. For several centuries that we know of,
and probably for many centuries before that, flipping a coin (or rolling a die) has been the
epitome of probability, the paradigm of randomness. You flip the coin (or roll the die),
and nobody can accurately predict how it will fall. Nor can the most powerful computer
predict correctly how it will fall, if it is flipped energetically enough.
This is why cards, dice, and other gambling aids crop up so often in literature both
directly and as metaphors. No doubt it is also the reason for the (perhaps excessive)
popularity of gambling as entertainment. If anyone had any idea what numbers the lottery
would show, or where the roulette ball will land, the whole industry would be a dead
duck.
At any rate, these long-standing and simple gaming aids do supply intuitively con-
vincing ways of characterizing probability. We discuss several ideas in detail.
I Probability as proportion
Figure 1.2 gives the layout of an American roulette wheel. Suppose such a wheel is spun
once; what is the probability that the resulting number has a 7 in it? That is to say, what is
the probability that the ball hits 7, 17, or 27? These three numbers comprise a proportion
3/38 of the available compartments, and so the essential symmetry of the wheel (assuming it
is well made) suggests that the required probability ought to be 3/38. Likewise the
Figure 1.2. American roulette. Shaded numbers are black; the others are red except for the zeros.
Example 1.4.1. Flip a coin and choose `heads'. Then r = 1, because you win on the
outcome `heads', and n = 2, because the coin shows a head or a tail. Hence the
probability that you win, which is also the probability of a head, is p = 1/2. □
Example 1.4.2. Roll a die. There are six outcomes, which is to say that n = 6. If you
win on an even number then r = 3, so the probability that an even number is shown is
p = 3/6 = 1/2.
Likewise the chance that the die shows a 6 is 1/6, and so on. □
Example 1.4.3. Pick a card at random from a pack of 52 cards. What is the
probability of an ace? Clearly n = 52 and r = 4, so that
p = 4/52 = 1/13. □
Example 1.4.4. A town contains x women and y men; an opinion pollster chooses an
adult at random for questioning about toothpaste. What is the chance that the adult is
male? Here
n = x + y and r = y,
so that p = y/(x + y).
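The rule behind examples 1.4.1–1.4.4, and the roulette question above, is the same classical formula p = r/n for equally likely outcomes. A minimal sketch in Python (the function name is ours, not the book's):

```python
from fractions import Fraction

def classical_probability(favourable: int, total: int) -> Fraction:
    """Classical probability p = r/n for equally likely outcomes."""
    return Fraction(favourable, total)

# Examples 1.4.1-1.4.3 and the roulette question, in one pass.
print(classical_probability(1, 2))   # coin shows heads: 1/2
print(classical_probability(3, 6))   # die shows an even number: 1/2
print(classical_probability(4, 52))  # card drawn is an ace: 1/13
print(classical_probability(3, 38))  # roulette number contains a 7: 3/38
```

Using `Fraction` keeps the answers exact, so 3/6 reduces to 1/2 just as in the text.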
Figure 1.3. The proportion of sixes given in 100 rolls of a die, recorded at intervals of 5 rolls.
Figures are from an actual experiment. Of course, 1/6 = 0.1666…
you win some future similar repetition of this game is close to r(n)/n. We write

(4)    p ≈ r(n)/n = (number of wins in n games)/(number n of games).

The symbol ≈ is read as `is approximately equal to'. Once again we note that
0 ≤ r(n) ≤ n and so we may take it that 0 ≤ p ≤ 1.
Furthermore, if a win is impossible then r(n) = 0, and r(n)/n = 0. Also, if a win is
certain then r(n) = n, and r(n)/n = 1. This is again consistent with the scale introduced
in figure 1.1, which is very pleasant. Notice the important point that this interpretation
supplies a way of approximately measuring probabilities rather than calculating them
merely by an appeal to symmetry.
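Equation (4) suggests measuring a probability by simulation rather than by symmetry. The sketch below (an illustrative Python function of our own devising, not from the book) estimates the chance of a double six and watches r(n)/n settle down:

```python
import random

def relative_frequency(trials: int, seed: int = 0) -> float:
    """Estimate the chance of a double six as r(n)/n over `trials` games."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        die1 = rng.randint(1, 6)
        die2 = rng.randint(1, 6)
        if die1 == 6 and die2 == 6:
            wins += 1
    return wins / trials

# The estimate settles near the classical value 1/36 ≈ 0.0278 as n grows.
for n in (100, 10_000, 200_000):
    print(n, relative_frequency(n))
```

As the text warns, the ratio changes with every extra game, so we never obtain an exact value for p; only the approximation improves.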
Since we can now calculate simple probabilities, and measure them approximately, it is
tempting to stop there and get straight on with formulating some rules. That would be a
mistake, for the idea of proportion gives another useful insight into probability that will
turn out to be just as important as the other two, in later work.
1.5 BASIC IDEAS: THE GENERAL CASE
We must believe in chance, for how else can we account for the successes of those
we detest?
Anon.
We noted that a theory of probability would be hailed as useless if it failed to describe the
behaviour of coins and dice. But of course it would be equally useless if it failed to
describe anything else, and moreover many real dice and coins (especially dice) have
been known to be biased and asymmetrical. We therefore turn to the question of assigning
probabilities in activities that do not necessarily have equally likely outcomes.
It is interesting to note that the desirability of doing this was implicitly recognized by
Cardano (mentioned in the previous section) around 1520. In his Book on Games of
Chance, which deals with supposedly fair dice, he notes that
However, the ideas necessary to describe the behaviour of such biased dice had to wait
for Pascal in 1654, and later workers. We examine the basic notions in turn; as in the
previous section, these notions rely on our concept of probability as an extension of
proportion.
If you actually obtain a pin and perform this experiment, you will get a graph like that
of figure 1.6. It does seem from the figure that r(n)/n is settling down around some
number p, which we naturally interpret as the probability of success. It may be objected
that the ratio changes every time we drop another pin, and so we will never obtain an
exact value for p. But this gap between the real world and our descriptions of it is
observed in all subjects at all levels. For example, geometry tells us that the diagonal of a
unit square has length √2. But, as A. A. Markov has observed,
If we wished to verify this fact by measurements, we should find that the ratio of
diagonal to side is different for different squares, and is never √2.
It may be regretted that we have only this somewhat hit-or-miss method of measuring
probability, but we do not really have any choice in the matter. Can you think of any other
way of estimating the chance that the pin will fall point down? And even if you did think
of such a method of estimation, how would you decide whether it gave the right answer,
except by flipping the pin often enough to see? We can illustrate this point by considering
a basic and famous example.
Example 1.5.1: sex ratio. What is the probability that the next infant to be born in
your local hospital will be male? Throughout most of the history of the human race it was
taken for granted that essentially equal numbers of boys and girls are born (with some
fluctuations, naturally). This question would therefore have drawn the answer 1/2, until
recently.
However, in the middle of the 16th century, English parish churches began to keep
fairly detailed records of births, marriages, and deaths. Then, in the middle of the 17th
century, one John Graunt (a draper) took the trouble to read, collate, and tabulate the
numbers in various categories. In particular he tabulated the number of boys and girls
whose births were recorded in London in each of 30 separate years.
To his, and everyone else's, surprise, he found that in every single year more boys were
born than girls. And, even more remarkably, the ratio of boys to girls varied very little
between these years. In every year the ratio of boys to girls was close to 14:13. The
meaning and significance of this unarguable truth inspired a heated debate at the time.
For us, it shows that the probability that the next infant born will be male is
approximately 14/27. A few moments' thought will show that there is no other way of
answering the general question, other than by finding this relative frequency.
Figure 1.6. Sketch of the proportion p(n) of successes when a Bernoulli pin is dropped n times.
For this particular pin, p seems to be settling down at approximately 0.4.
It is important to note that the empirical frequency differs from place to place and from
time to time. Graunt also looked at the births in Romsey over 90 years and found the
empirical frequency to be 16:15. It is currently just under 0.513 in the USA, slightly less
than 14/27 (≈ 0.519) and 16/31 (≈ 0.516).
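Converting a birth ratio to a probability is just the proportion idea again: a ratio of b boys to g girls gives a male-birth probability of b/(b + g). A one-line Python check of the figures quoted here (the helper is ours):

```python
def boy_probability(boys: int, girls: int) -> float:
    """Turn a boys:girls birth ratio into the proportion of boys."""
    return boys / (boys + girls)

print(round(boy_probability(14, 13), 3))  # Graunt's London ratio 14:13 -> 0.519
print(round(boy_probability(16, 15), 3))  # Romsey ratio 16:15 -> 0.516
```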
Clearly the idea of probability as a relative frequency is very attractive and useful.
Indeed it is generally the only interpretation offered in textbooks. Nevertheless, it is not
always enough, as we now discuss.
Example 1.5.2. If a bond for a million roubles is offered to you for one rouble, and
the sellers are assumed to be rational, then they clearly think the chance of the bond's
being bought back at par is less than one in a million. If you buy it, then presumably you
believe the chances are more than one in a million. If you thought the chances were less,
you would reduce your offer. If you both agree that one rouble is a fair price for the bond,
then you have assigned the value p = 10⁻⁶ for the probability of its redemption. Of
course this may vary according to various rumours and signals from the relevant banks
and government (and note that the more ornate and attractive bonds now have some
intrinsic value, independent of their chance of redemption). □
This example leads naturally to our final candidate for an interpretation of probability.
1.6 MODELLING
If I wish to know the chances of getting a complete hand of 13 spades, I do not set
about dealing hands. It would take the population of the world billions of years to
obtain even a bad estimate of this.
John Venn
The point of the above quote is that we need a theory of probability to answer even the
simplest of practical questions. Such theories are called models.
Example 1.6.1: cards. For the question above, the usual model is as follows. We
assume that all possible hands of cards are equally likely, so that if the number of all
possible hands is n, then the required probability is 1/n. □
Experience seems to suggest that for a well-made, well-shuffled pack of cards, this
answer is indeed a good guide to your chances of getting a hand of spades. (Though we
must remember that such complete hands occur more often than this predicts, because
humorists stack the pack, as a `joke'.) Even this very simple example illustrates the
following important points very clearly.
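Venn's point can be checked directly: under this model the number of equally likely 13-card hands is the binomial coefficient C(52, 13), which Python's standard library computes exactly (a sketch of ours, not the book's calculation):

```python
from math import comb

n = comb(52, 13)  # number of distinct 13-card hands from a 52-card pack
p = 1 / n         # model: all hands equally likely, so a hand of spades has chance 1/n

print(n)          # 635013559600
print(p)

years = n / (60 * 60 * 24 * 365)  # dealing one hand per second, nonstop
print(round(years))  # roughly twenty thousand years per spade hand, on average
```

Even at one deal per second it would take tens of thousands of years, on average, to see a single complete hand of spades, which is why Venn does not set about dealing hands.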
First, the model deals with abstract things. We cannot really have a perfectly shuffled
pack of perfect cards; this `collection of equally likely hands' is actually a fiction. We
create the idea, and then use the rules of arithmetic to calculate the required chances. This
is characteristic of all mathematics, which concerns itself only with rules defining the
behaviour of entities which are themselves undefined (such as `numbers' or `points').
Second, the use of the model is determined by our interpretation of the rules and
results. We do not need an interpretation of what chance is to calculate probabilities, but
without such an interpretation it is rather pointless to do it.
Similarly, you do not need to have an interpretation of what lines and points are to do
geometry and trigonometry, but it would all be rather pointless if you did not have one.
Likewise chess is just a set of rules, but if checkmate were not interpreted as victory, not
many people would play.
Use of the term `model' makes it easier to keep in mind this distinction between theory
and reality. By its very nature a model cannot include all the details of the reality it seeks
to represent, for then it would be just as hard to comprehend and describe as the reality
we want to model. At best, our model should give a reasonable picture of some small part
of reality. It has to be a simple (even crude) description; and we must always be ready to
scrap or improve a model if it fails in this task of accurate depiction. That having been
said, old models are often still useful. The theory of relativity supersedes the Newtonian
model, but all engineers use Newtonian mechanics when building bridges or motor cars,
or probing the solar system.
This process of observation, model building, analysis, evaluation, and modication is
called modelling, and it can be conveniently represented by a diagram; see figure 1.7.
(This diagram is therefore in itself a model; it is a model for the modelling process.)
In figure 1.7, the top two boxes are embedded in the real world and the bottom two
boxes are in the world of models. Box A represents our observations and experience of
some phenomenon, together with relevant knowledge of related events and perhaps past
experience of modelling. Using this we construct the rules of a model, represented by box
B. We then use the techniques of logical reasoning, or mathematics, to deduce the way in
which the model will behave. These properties of the model can be called theorems; this
stage is represented by box C. Next, these characteristics of the model are interpreted in
terms of predictions of the way the corresponding real system should work, denoted by
box D. Finally, we perform appropriate experiments to discover whether these predictions
agree with observation. If they do not, we change or scrap the model and go round the
loop again. If they do, we hail the model as an engine of discovery, and keep using it to
make predictions until it wears out or breaks down. This last step is called using or
checking the model or, more grandly, validation.
Figure 1.7. The modelling loop. In the real world, box A (experiment and measurement) leads by construction to box B in the model world (the rules of the model); deduction yields box C (theorems); interpretation gives box D (predictions), whose use and testing feed back into box A.
This procedure is so commonplace that we rather take it for granted. For example, it
has been used every time you see a weather forecast. Meteorologists have observed the
climate for many years. They have deduced certain simple rules for the behaviour of jet
streams, anticyclones, occluded fronts, and so on. These rules form the model. Given any
configuration of airflows, temperatures, and pressures, the rules are used to make a
prediction; this is the weather forecast. Every forecast is checked against the actual
outcome, and this experience is used to improve the model.
Models form extraordinarily powerful and economical ways of thinking about the
world. In fact they are often so good that the model is confused with reality. If you ever
think about atoms, you probably imagine little billiard balls; more sophisticated readers
may imagine little orbital systems of elementary particles. Of course atoms are not
`really' like that; these visions are just convenient old models.
We illustrate the techniques of modelling with two simple examples from probability.
Example 1.6.2: setting up a lottery. If you are organizing a lottery you have to
decide how to allocate the prize money to the holders of winning tickets. It would help
you to know the chances of any number winning and the likely number of winners. Is this
possible? Let us consider a specific example.
Several national lotteries allow any entrant to select six numbers in advance from the
integers 1 to 49 inclusive. A machine then selects six balls at random (without replace-
ment) from an urn containing 49 balls bearing these numbers. The first prize is divided
among entrants selecting these numbers.
Because of the nature of the apparatus, it seems natural to assume that any selection of
six numbers is equally likely to be drawn. Of course this assumption is a mathematical
model, not a physical law established by experiment. Since there are approximately 14
million different possible selections (we show this in chapter 3), the model predicts that
your chance, with one entry, of sharing the first prize is about one in 14 million. Figure
1.8 shows the relative frequency of the numbers drawn in the first 1200 draws. It does not
seem to discredit or invalidate the model so far as one can tell.
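The count of roughly 14 million selections, derived in chapter 3, can be verified with the binomial coefficient C(49, 6); a quick check in Python:

```python
from math import comb

selections = comb(49, 6)  # ways to choose 6 numbers from the integers 1 to 49
print(selections)         # 13983816, i.e. about 14 million
print(1 / selections)     # chance that a single entry shares the first prize
```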
Figure 1.8. Frequency plot of an actual 6/49 lottery after 1200 drawings. The numbers do seem
equally likely to be drawn.
The next question you need to answer is, how many of the entrants are likely to share
the first prize? As we shall see, we need in turn to ask, how do lottery entrants choose
their numbers?
This is clearly a rather different problem; unlike the apparatus for choosing numbers,
gamblers choose numbers for various reasons. Very few choose at random; they use
birthdays, ages, patterns, and so on. However, you might suppose that for any gambler
chosen at random, that choice of numbers would be evenly distributed over the
possibilities.
In fact this model would be wrong; when the actual choices of lottery numbers are
examined, it is found that in the long run the chances that the various numbers will occur
are very far from equal; see figure 1.9. This clustering of preferences arises because
people choose numbers in lines and patterns which favour central squares, and they also
favour the top of the card. Data like this would provide a model for the distribution of
likely payouts to winners. s
It is important to note that these remarks do not apply only to lotteries, cards, and dice.
Venn's observation about card hands applies equally well to almost every other aspect of
life. If you wished to design a telephone exchange (for example), you would first of all
construct some mathematical models that could be tested (you would do this by making
assumptions about how calls would arrive, and how they would be dealt with). You can
construct and improve any number of mathematical models of an exchange very cheaply.
Building a faulty real exchange is an extremely costly error.
Likewise, if you wished to test an aeroplane to the limits of its performance, you would
be well advised to test mathematical models rst. Testing a real aeroplane to destruction
is somewhat risky.
So we see that, in particular, models and theories can save lives and money. Here is
another practical example.
Figure 1.9. Popular and unpopular lottery numbers: bold, most popular; roman, intermediate
popularity; italic, least popular.
Example 1.6.3: first significant digit. Suppose someone offered the following wager:
(i) select any large book of numerical tables (such as a census, some company
accounts, or an almanac);
(ii) pick a number from this book at random (by any means);
(iii) if the first significant digit of this number is one of {5, 6, 7, 8, 9}, then you win $1;
if it is one of {1, 2, 3, 4}, you lose $1.
Would you accept this bet? You might be tempted to argue as follows: a reasonable
intuitive model for the relative chances of each digit is that they are equally likely. On
this model the probability p of winning is 5/9, which is greater than 1/2 (the odds on winning
would be 5 : 4), so it seems like a good bet. However, if you do some research and
actually pick a large number of such numbers at random, you will find that the relative
frequencies of each of the nine possible first significant digits are given approximately by
f(1) ≃ 0.301,  f(2) ≃ 0.176,  f(3) ≃ 0.125,
f(4) ≃ 0.097,  f(5) ≃ 0.079,  f(6) ≃ 0.067,
f(7) ≃ 0.058,  f(8) ≃ 0.051,  f(9) ≃ 0.046.
Thus empirically the chance of your winning is
f(5) + f(6) + f(7) + f(8) + f(9) ≃ 0.3.
The wager offered is not so good for you! (Of course it would be quite improper for a
mathematician to win money from the ignorant by this means.) This empirical distribu-
tion is known as Benford's law, though we should note that it was first recorded by S.
Newcomb (a good example of Stigler's law of eponymy).
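The frequencies above have the well-known closed form f(d) = log₁₀(1 + 1/d), and the win probability of the wager can be checked in a few lines. A minimal sketch using only the standard library:

```python
import math

# Benford's law: the first significant digit d occurs with frequency log10(1 + 1/d).
f = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d in range(1, 10):
    print(f"f({d}) = {f[d]:.3f}")

# Probability of winning the wager: first digit in {5, 6, 7, 8, 9}.
# The sum telescopes to log10(10/5) = log10(2).
p_win = sum(f[d] for d in range(5, 10))
print(f"P(win) = {p_win:.3f}")   # about 0.301, far below the naive 5/9
```

Note that the nine frequencies sum to log₁₀(10/1) = 1, as any distribution must.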
We see that intuition is necessary and helpful in constructing models, but not sufficient;
you also need experience and observations. A famous example of this arose in particle
physics. At rst it was assumed that photons and protons would satisfy the same statistical
rules, and models were constructed accordingly. Experience and observations showed that
in fact they behave differently, and the models were revised.
The theory of probability exhibits a very similar history and development, and we
approach it in similar ways. That is to say, we shall construct a model that reflects our
experience of, and intuitive feelings about, probability. We shall then deduce results and
make predictions about things that have either not been explained or not been observed,
or both. These are often surprising and even counterintuitive. However, when the
predictions are tested against experiment they are almost always found to be good. Where
they are not, new theories must be constructed.
It may perhaps seem paradoxical that we can explore reality most effectively by playing
with models, but this fact is perfectly well known to all children.
1.7 MATHEMATICAL MODELLING
There are very few things which we know, which are not capable of being reduced
to a mathematical reasoning; and when they cannot, it is a sign our knowledge of
them is very small and confused; and where a mathematical reasoning can be had,
it is as great a folly to make use of any other, as to grope for a thing in the dark,
when you have a candle standing by you.
John Arbuthnot, Of the Laws of Chance
The quotation above is from the preface to the first textbook on probability to appear in
English. (It is in a large part a translation of a book by Huygens, which had previously
appeared in Latin and Dutch.) Three centuries later, we find that mathematical reasoning
is indeed widely used in all walks of life, but still perhaps not as much as it should be. A
small survey of the reasons for using mathematical methods would not be out of place.
The first question is, why be abstract at all? The blunt answer is that we have no choice,
for many reasons.
In the first place, as several examples have made clear, practical probability is
inescapably numerical. Betting odds can only be numerical, monetary payoffs are
numerical, stock exchanges and insurance companies float on a sea of numbers. And even the
simplest and most elementary problems in bridge and poker, or in lotteries, involve
counting things. And this counting is often not a trivial task.
Second, the range of applications demands abstraction. For example, consider the
following list of real activities:
customers in line at a post office counter
cars at a toll booth
data in an active computer memory
1.8 MODELLING PROBABILITY
Rules and Models destroy genius and Art.
W. Hazlitt
First, we examine the real world and select the experiences and experiments that seem
best to express the nature of probability, without too much irrelevant extra detail. You
have already done this, if you have ever flipped a coin, or rolled a die, or wondered
whether to take an umbrella.
Second, we formulate a set of rules that best describe these experiments and
experiences. These rules will be mathematical in nature, for simplicity. (This is not
paradoxical!) We do this in the next chapter.
Third, we use the structure of mathematics (thoughtfully constructed over the millennia
for other purposes), to derive results of practical interest. We do this in the remainder of
the book.
Finally, these results are compared with real data in a variety of circumstances: by
scientists to measure constants, by insurance companies to avoid ruin, by actuaries to
calculate your pension, by telephone engineers to design the network, and so on. This
validates our model, and has been done by many people for hundreds of years. So we do
not need to do it here.
This terse account of our program gives rise to a few questions of detail, which we
address here, as follows. Do we in fact need to know what probability `really' is? The
answer here is, of course, no. We only need our model to describe what we observe. It is
the same in physics; we do not need to know what mass really is to use Newton's or
Einstein's theories. This is just as well, because we do not know what mass really is. We
still do not know even what light `really' is. Questions of reality are best left for
philosophers to argue over, for ever.
Furthermore, in drawing up the rules, do we necessarily have to use the rather
roundabout arguments employed in section 1.2? Is there not a simpler and more
straightforward way to say what probability does? After all, Newton only had to drop apples to
see what gravity, force, and momentum did. Heat burns, electricity shocks, and light
shines, to give some other trivial examples.
By contrast, probability is strangely intangible stuff; you cannot accumulate piles of it,
or run your hands through it, or give it away. No meter will record its presence or absence,
and it is not much used in the home. We cannot deny its existence, since we talk about it,
but it exists in a curiously shadowy and ghost-like way. This difficulty was neatly
pinpointed by John Venn in the 19th century:
But when a science is concerned not so much with objects as with laws, the
difficulty of giving preliminary information becomes greater.
The Logic of Chance
What Venn meant by this is that books on subjects such as fluid mechanics need not
ordinarily spend a great deal of time explaining the everyday concept of a fluid. The
average reader will have seen waves on a lake, watched bathwater going down the
plughole, observed trees bending in the wind, and been annoyed by the wake of passing boats.
And anyone who has flown in an aeroplane has to believe that fluid mechanics
demonstrably works. Furthermore, the language of the subject has entered into everyday
discourse, so that when people use words like wave, or wing, or turbulence, or vortex,
they think they know what they mean. Probability is harder to put your finger on.
1.9 REVIEW
In this chapter we have looked at chance and probability in a non-technical way. It seems
obvious that we recognize the appearance of chance, but it is surprisingly difficult to give
a comprehensive definition of probability. For this reason, and many others, we have
begun to construct a theory of probability that will rely on mathematical models and
methods.
Our first step on this path has been to agree that any probability is a number lying
between zero and unity, inclusive. It can be interpreted as a simple proportion in
situations with symmetry, or as a measure of long-run proportion, or as an estimate of
expected value, depending on the context. The next task is to determine the rules obeyed
by probabilities, and this is the content of the next chapter.
1.10 APPENDIX I. SOME RANDOMLY SELECTED DEFINITIONS OF PROBABILITY, IN RANDOM ORDER
One can hardly give a satisfactory definition of probability.
H. Poincaré
Probability is a degree of certainty, which is to certainty as a part is to the whole.
J. Bernoulli
Probability is the study of random experiments.
S. Lipschutz
Mathematical probability is a branch of mathematical analysis that has developed around the
problem of assigning numerical measurement to the abstract concept of likelihood.
M. Munroe
Probability is a branch of logic which analyses nondemonstrative inferences, as opposed to
demonstrative ones.
E. Nagel
I call that chance which is nothing but want of art.
J. Arbuthnot
Probability is the reason that we have to think that an event will occur, or that a proposition is
true.
G. Boole
Probability describes the various degrees of rational belief about a proposition given different
amounts of knowledge.
J. M. Keynes
An event will on a long run of trials tend to occur with a frequency proportional to its
probability.
R. L. Ellis
One regards two events as equally probable when one can see no reason that would make one
more probable than the other.
P. Laplace
The probability of an event is the ratio of the number of cases that are favourable to it, to the
number of possible cases, when there is nothing to make us believe that one case should occur
rather than any other.
P. Laplace
The probability of an event is the ratio between the value at which an expectation depending on
the happening of the event ought to be computed, and the value of the thing expected upon its
happening.
T. Bayes
The limiting value of the relative frequency of an attribute will be called the probability of that
attribute.
R. von Mises
The probability attributed by an individual to an event is revealed by the conditions under which
he would be disposed to bet on that event.
B. de Finetti
Probability does not exist.
B. de Finetti
Personalist views hold that probability measures the confidence that a particular individual has
in the truth of a particular proposition.
L. Savage
The probability of an outcome is our estimate for the most likely fraction of a number of
repeated observations that will yield that outcome.
R. Feynman
It is likely that the word `probability' is used by logicians in one sense and by statisticians in
another.
F. P. Ramsey
Sets
A set is a collection of things that are called the elements of the set. The elements can be any kind of
entity: numbers, people, poems, blueberries, points, lines, and so on, endlessly.
For clarity, upper case letters are always used to denote sets. If the set S includes some element
denoted by x, then we say x belongs to S, and write x ∈ S. If x does not belong to S, then we write
x ∉ S.
There are essentially two ways of defining a set, either by a list or by a rule.
Example 1.11.1. If S is the set of numbers shown by a conventional die, then the rule is that S
comprises the integers lying between 1 and 6 inclusive. This may be written formally as follows:
S = {x: 1 ≤ x ≤ 6 and x is an integer}.
Alternatively S may be given as a list:
S = {1, 2, 3, 4, 5, 6}.
One important special case arises when the rule is impossible; for example, consider the set of
elephants playing football on Mars. This is impossible (there is no pitch on Mars) and the set
therefore is empty; we denote the empty set by ∅. We may write ∅ as { }.
If S and T are two sets such that every element of S is also an element of T, then we say that T
includes S, and write either S ⊆ T or T ⊇ S. If S ⊆ T and T ⊆ S then S and T are said to be equal,
and we write S = T.
Note that ∅ ⊆ S for every S. Note also that some books use the symbol `⊆' to denote inclusion
and reserve `⊂' to denote strict inclusion, that is to say, S ⊂ T if every element of S is in T, and
some element of T is not in S. We do not make this distinction.
Combining sets
Given any non-empty set, we can divide it up, and given any two sets, we can join them together.
These simple observations are important enough to warrant definitions and notation.
Definition. Let A and B be sets. Their union, denoted by A ∪ B, is the set of elements that are in
A or B, or in both. Their intersection, denoted by A ∩ B, is the set of elements in both A and B.
Note that in other books the union may be referred to as the join or sum; the intersection may be
referred to as the meet or product. We do not use these terms.
We can also remove bits of sets, giving rise to set differences, as follows.
Definition. Let A and B be sets. That part of A which is not also in B is denoted by A \ B, called
the difference of A from B. Elements which are in A or B but not both comprise the symmetric
difference, denoted by A △ B.
Finally we can combine sets in a more complicated way by taking elements in pairs, one from
each set.
Example 1.11.2. Let A be the interval [0, a] of the x-axis, and B the interval [0, b] of the y-
axis. Then C = A × B is the rectangle of base a and height b with its lower left vertex at the origin,
when a, b > 0.
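The four ways of combining sets correspond directly to operators on Python's built-in set type, which makes them easy to experiment with. A minimal sketch; the particular sets are arbitrary illustrative choices:

```python
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

print(A | B)   # union A ∪ B: elements in A or B or both
print(A & B)   # intersection A ∩ B: elements in both
print(A - B)   # difference A \ B: the part of A not also in B
print(A ^ B)   # symmetric difference A △ B: in A or B but not both

# The product A × B as a set of ordered pairs, one element from each set:
product_set = {(a, b) for a in A for b in B}
print(len(product_set))   # 16, i.e. |A| × |B|
```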
Venn diagrams
The above ideas are attractively and simply expressed in terms of Venn diagrams. These provide very
expressive pictures, which are often so clear that they make algebra redundant. See figure 1.10.
In probability problems, all sets of interest A lie in a universal set Ω, so that A ⊆ Ω for all A.
That part of Ω which is not in A is called the complement of A, denoted by Aᶜ. Formally
Aᶜ = Ω \ A = {x: x ∈ Ω, x ∉ A}.
Obviously, from the diagram or by consideration of the elements,
A ∪ Aᶜ = Ω,  A ∩ Aᶜ = ∅,  (Aᶜ)ᶜ = A.
Clearly A ∩ B = B ∩ A and A ∪ B = B ∪ A, but we must be careful when making more intricate
combinations of larger numbers of sets. For example, we cannot write down simply A ∪ B ∩ C; this
is not well defined because it is not always true that
(A ∪ B) ∩ C = A ∪ (B ∩ C).
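A small counterexample makes the need for brackets concrete; the sets A, B, C here are arbitrary choices:

```python
A, C = {1}, {2}
B = set()

left = (A | B) & C    # ({1} ∪ ∅) ∩ {2} = ∅
right = A | (B & C)   # {1} ∪ (∅ ∩ {2}) = {1}
print(left, right)    # set() {1} -- the two bracketings disagree
```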
Size
When sets are countable it is often useful to consider the number of elements they contain; this is
called their size or cardinality. For any set A, we denote its size by |A|; when sets have a finite
number of elements, it is easy to see that size has the following properties.
If sets A and B are disjoint then
|A ∪ B| = |A| + |B|,
and more generally, when A and B are not necessarily disjoint,
|A ∪ B| + |A ∩ B| = |A| + |B|.
Naturally |∅| = 0, and if A ⊆ B then
|A| ≤ |B|.
Finally, for the product of two such finite sets A × B we have
|A × B| = |A| × |B|.
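For finite sets these size rules can be verified directly in code; the sets below are arbitrary examples:

```python
A = {1, 2, 3, 4, 5}
B = {4, 5, 6}

# General rule: |A ∪ B| + |A ∩ B| = |A| + |B|
assert len(A | B) + len(A & B) == len(A) + len(B)

# Disjoint special case: |A ∪ C| = |A| + |C| when A ∩ C = ∅
C = {7, 8}
assert A & C == set()
assert len(A | C) == len(A) + len(C)

# Product rule: |A × B| = |A| × |B|
assert len({(a, b) for a in A for b in B}) == len(A) * len(B)
print("all size rules hold for this example")
```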
When sets are infinite or uncountable, a great deal more care and subtlety is required in dealing
with the idea of size. However, we can see intuitively that we can consider the length of subsets of a
line, or areas of sets in a plane, or volumes in space, and so on. It is easy to see that if A and B are
two subsets of a line, with lengths |A| and |B| respectively, then in general
|A ∪ B| + |A ∩ B| = |A| + |B|.
Therefore |A ∪ B| = |A| + |B| when A ∩ B = ∅.
We can define the product of two such sets as a set in the plane with area |A × B|, which satisfies
the well-known elementary rule for areas and lengths
|A × B| = |A| × |B|
and is thus consistent with the finite case above. Volumes and sets in higher dimensions satisfy
similar rules.
Functions
Suppose we have sets A and B, and a rule that assigns to each element a in A a unique element b in
B. Then this rule is said to define a function from A to B; for the corresponding elements we write
b = f(a).
Here the symbol f(·) denotes the rule or function; often we just call it f. The set A is called the
domain of f, and the set of elements in B that can be written as f(a) for some a is called the range
of f; we may denote the range by R.
Anyone who has a calculator is familiar with the idea of a function. For any function key, the
calculator will supply f(x), if x is in the domain of the function; otherwise it says `error'.
Inverse function
If f is a function from A to B, we can look at any b in the range R of f and see how it arose from A.
This defines a rule assigning elements of A to each element of R, so if the rule assigns a unique
element a to each b this defines a function from R to A. It is called the inverse function and is
denoted by f⁻¹(·):
a = f⁻¹(b).
Example 1.11.3: indicator function. Let A ⊆ Ω and define the following function I(·) on Ω:
I(ω) = 1 if ω ∈ A,
I(ω) = 0 if ω ∉ A.
Then I is a function from Ω to {0, 1}; it is called the indicator of A, because by taking the value 1 it
indicates that ω ∈ A. Otherwise it is zero.
This is about as simple a function as you can imagine, but it is surprisingly useful. For example, note
that if A is finite you can find its size by summing I(ω) over all ω ∈ Ω:
|A| = Σ_{ω ∈ Ω} I(ω).
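The indicator and the size-by-summation identity translate directly into code; the choices of Ω and A below are purely illustrative:

```python
omega_space = range(1, 11)   # Ω = {1, ..., 10}
A = {2, 3, 5, 7}             # A ⊆ Ω

def indicator(A, w):
    """I(ω) = 1 if ω ∈ A, else 0."""
    return 1 if w in A else 0

# |A| = Σ_{ω ∈ Ω} I(ω)
size = sum(indicator(A, w) for w in omega_space)
print(size)   # 4, which is |A|
```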
1.12 PROBLEMS
Note well: these are not necessarily mathematical problems; an essay may be a sufficient answer.
They are intended to provoke thought about your own ideas of probability, which you may well have
without realizing the fact.
1. Which of the definitions of probability in Appendix I do you prefer? Why? Can you produce a
better one?
2. Is there any fundamental difference between a casino and an insurance company? If so, what is
it? (Do not address moral issues.)
3. You may recall the classic paradox of Buridan's mule. Placed midway between two equally
enticing bales of hay, it starved to death because it had no reason to choose one rather than the
other. Would a knowledge of probability have saved it? (The paradox is first recorded by
Aristotle.)
4. Suppose a coin showed heads 10 times consecutively. If it looked normal, would you
nevertheless begin to doubt its fairness?
5. Suppose Alf says his dice are fair, but Bill says they are crooked. They look OK. What would
you do to decide the issue?
6. What do you mean by risk? Many public and personal decisions seem to be based on the
premise that the risks presented by food additives, aircraft disasters, and prescribed drugs are
comparable with the risks presented by smoking, road accidents, and heart disease. In fact the
former group present negligible risks compared with the latter. Is this rational? Is it
comprehensible? Formulate your own view accurately.
7. What kind of existence does chance have? (Hint: What kind of existence do numbers have?)
8. It has been argued that seemingly chance events are not really random; the uncertainty about
the outcome of the roll of a die is just an expression of our inability to do the mechanical
calculations. This is the deterministic theory. Samuel Johnson remarked that determinism
erodes free will. Do you think you have free will? Does it depend on the existence of chance?
9. `Probability serves to determine our hopes and fears' Laplace. Discuss what Laplace meant
by this.
10. `Probability has nothing to do with an isolated case' A. Markov. What did Markov mean by
saying this? Do you agree?
11. `That the chance of gain is naturally overvalued, we may learn from the universal success of
lotteries' Adam Smith (1776). `If there were no difference between objective and subjective
probabilities, no rational person would play games of chance for money' J. M. Keynes
(1921).
Discuss.
12. A proportion f of $100 bills are forgeries. What is the value to you of a proffered $100 bill?
13. Flip a coin 100 times and record the relative frequency of heads over five-flip intervals as a
graph.
14. Flip a broad-headed pin 100 times and record the relative frequency of `point up' over five-flip
intervals.
15. Pick a page of the local residential telephone directory at random. Pick 100 telephone numbers
at random (a column or so). Find the proportion p₂ of numbers whose last digit is odd, and
also the proportion p₁ of numbers whose first digit is odd. (Ignore the area code.) Is there much
difference?
16. Open a book at random and find the proportion of words in the first 10 lines that begin with a
vowel. What does this suggest?
17. Show that A = ∅ if and only if B = A △ B.
18. Show that if A ⊆ B and B ⊆ A then A = B.
Part A
Probability
2
The rules of probability
2.1 PREVIEW
In the preceding chapter we suggested that a model is needed for probability, and that this
model would take the form of a set of rules. In this chapter we formulate these rules.
When doing this, we shall be guided by the various intuitive ideas of probability as a
relative of proportion that we discussed in Chapter 1. We begin by introducing the
essential vocabulary and notation, including the idea of an event. After some elementary
calculations, we introduce the addition rule, which is fundamental to the whole theory of
probability, and explore some of its consequences.
Most importantly we also introduce and discuss the key concepts of conditional
probability and independence. These are exceptionally useful and powerful ideas and
work together to unlock many of the routes to solving problems in probability. By the end
of this chapter you will be able to tackle a remarkably large proportion of the better-
known problems of chance.
Prerequisites. We shall use the routine methods of elementary algebra, together with
the basic concepts of sets and functions. If you have any doubts about these, refresh your
memory by a glance at appendix II of chapter 1.
2.2 NOTATION AND EXPERIMENTS
From everyday experience, you are familiar with many ideas and concepts of probability;
this knowledge is gained by observation of lotteries, board games, sport, the weather,
futures markets, stock exchanges, and so on. You have various ways of discussing these
random phenomena, depending on your personal experience. However, everyday
discourse is too diffuse and vague for our purposes. We need to become routinely much
more precise. For example, we have been happy to use words such as chance, likelihood,
probability, and so on, more or less interchangeably. In future we shall confine ourselves
Table 2.1.
Procedure Outcomes
Roll a die One of 1, 2, 3, 4, 5, 6
Run a horse race Some horse wins it, or there is a dead heat (tie)
Buy a lottery ticket Your number either is or is not drawn
to using the word probability. The following are typical statements in this context.
The probability of a head is 1/2.
The probability of rain is 90%.
The probability of a six is 1/6.
The probability of a crash is 10⁻⁹.
Obviously we could write down an endless list of probability statements of this kind; you
should write down a few yourself (exercise). However, we have surely seen enough such
assertions to realize that useful statements about probability can generally be cast into the
following general form:
(1) The probability of A is p.
In the above examples, A was `a head', `rain', `a six', and `a crash'; and p was `1/2', `90%',
`1/6', and `10⁻⁹' respectively. We use this format so often that, to save ink, wrists, trees, and
time, it is customary to write (1) in the even briefer form
(2) P(A) = p.
This is obviously an extremely efficient and compact written representation; it is still
pronounced as `the probability of A is p'. A huge part of probability depends on
equations similar to (2).
Here, the number p denotes the position of this probability on the probability scale
discussed in chapter 1. It is most important to remember that on this scale
(3) 0 ≤ p ≤ 1.
If you ever calculate a probability outside this interval then it must be wrong!
We shall look at any event A, the probability function P(·), and probability P(A) in
detail in the next sections. For the moment we continue this section by noting that
underlying every such probability statement is some procedure or activity with a random
outcome; see table 2.1.
Useful probability statements refer to these outcomes. In everyday parlance this
procedure and the possible outcomes are often implicit. In our new rigorous model this
will not do. Every procedure and its possible outcomes must be completely explicit; we
stress that if you do not follow this rule you will be very likely to make mistakes. (There
are plenty of examples to show this.) To help in this task, we introduce some very
convenient notation and jargon to characterize all such trials, procedures, and actions.
Definition. Any activity or procedure that may give rise to a well-defined set of
outcomes is called an experiment.
Definition. The set of all possible outcomes is denoted by Ω, and called the sample
space.
The adjective `well-defined' in the first definition just means that you know what all the
possibilities of the experiment are, and could write them down if challenged to do so.
Prior to the experiment you do not know for sure what the outcome will be; when you
carry out the experiment it yields an outcome called the result. Often this result will have
a specific label such as `heads' or `it rains'. In general, when we are not being specific,
we denote the result of an experiment by ω. Obviously ω ∈ Ω; that is to say, the result
always lies in the set of possible outcomes. For example, if Ω is the set of possible
outcomes of a horse race in which Dobbin is a runner, then
{Dobbin wins} ∈ Ω.
In this expression the curly brackets are used to alert you to the fact that what lies inside
them is one (or more) of the possible outcomes.
We conclude with two small but important points. First, any experiment can have many
different sample spaces attached to it.
The second point is in a sense complementary to the rst. It is that you have little to
lose by choosing a large enough sample space to be sure of including every possible
outcome, even where some are implausible.
Example 2.2.2. Suppose you are counting the number of pollen grains captured by a
filter. A suitable sample space is the set of all non-negative integers
Ω = {0, 1, 2, 3, . . .}.
Obviously only a finite number of these are possible (since there is only a finite amount of
pollen in existence), but any cut-off point would be unpleasantly arbitrary, and might be
too small.
(b) The number of cars passing over a bridge in one week is counted.
(c) Two players play the best of three sets at tennis.
(d) You deal a poker hand to each of four players.
2.3 EVENTS
Suppose we have some experiment whose outcomes comprise Ω, the sample space. As
we have noted above, the whole point of probability is to say how likely the outcomes are,
either individually or collectively. We therefore make the following definition.
Definition. Any subset A of the sample space Ω is called an event.
Thus each event comprises one or more possible outcomes ω. By convention, events are
always denoted by capital letters such as A, B, C, . . . , with or without suffixes, super-
fixes, or other adornments such as hats, bars, or stars. Here are a few simple but common
examples.
Example 2.3.2. Suppose you record the number of days this week on which it rains.
The sample space is
Ω = {0, 1, 2, 3, 4, 5, 6, 7}.
One outcome is that it rains on one day,
ω = 1.
The event that it rains more often than not is
A = {4, 5, 6, 7},
comprising the outcomes 4, 5, 6, 7.
Example 2.3.3. A doctor weighs a patient to the nearest pound. Then, to be on the
safe side, we may agree that
Ω = {i: 0 ≤ i ≤ 20 000}.
Some outcomes here seem unlikely, or even impossible, but we lose little by including
them. Then
C = {i: 140 ≤ i ≤ 150}
is the event that the patient weighed between 140 and 150 pounds.
Example 2.3.4. An urn contains a amber and b buff balls. Of these, c balls are
removed without replacing any in the urn, where
c ≤ min{a, b} = a ∧ b.
Then Ω is the collection of all possible sequences of a's and b's of length c. We may
define the event D that the number of a's and b's removed is the same. If c is odd, then
this is the impossible event ∅.
Since events are sets, we can use all the standard ideas and notation of set theory as
summarized in appendix II of Chapter 1. If the outcome ω of an experiment lies in the
event A, then A is said to occur, or happen. In this case we have ω ∈ A. We always have
A ⊆ Ω. If A does not occur, then obviously the complementary event Aᶜ must occur,
since ω lies in one of A or Aᶜ.
The notation and ideas of set theory are particularly useful in considering combinations
of events.
Example 2.3.5. Suppose you take one card from a conventional pack. Simple events
include
A = the card is an ace,
B = the card is red,
C = the card is a club.
More interesting events are denoted using the operations of union and intersection. For
example
A ∩ C = the card is the ace of clubs,
A ∪ B = the card is either red or an ace or both.
Of course the card cannot be red and a club, so we have B ∩ C = ∅, where ∅ denotes
the impossible event, otherwise known as the empty set.
Table 2.2 gives a brief summary of how set notation represents events and their
relationships.
As we have remarked above, many important relationships between events are very
simply and attractively demonstrated by means of Venn diagrams. For example, figure 2.1
demonstrates very neatly that
(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
Figure 2.1. Venn diagrams. In the upper figure the shaded area is equal to (A ∪ B) ∩ C and
(A ∩ C) ∪ (B ∩ C). In the lower figure the shaded area is equal to A ∪ (B ∩ C) and also to
(A ∪ B) ∩ (A ∪ C).
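The two distributive laws can also be confirmed on randomly generated sets, which is a useful habit when a Venn diagram is not to hand. A sketch with an arbitrary small universe:

```python
import random

random.seed(1)
universe = range(20)
for _ in range(1000):
    # Each element joins each set independently with probability 1/2.
    A = {x for x in universe if random.random() < 0.5}
    B = {x for x in universe if random.random() < 0.5}
    C = {x for x in universe if random.random() < 0.5}
    assert (A | B) & C == (A & C) | (B & C)
    assert A | (B & C) == (A | B) & (A | C)
print("distributive laws hold in 1000 random trials")
```

Random testing of course proves nothing, but a single failure would refute the identity; here none occurs.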
2.4 PROBABILITY; ELEMENTARY CALCULATIONS
Now that we have defined events, we can discuss their probabilities. Suppose some
experiment has outcomes that comprise Ω, and A is an event in Ω. Then the probability
that A occurs is denoted by P(A), where of course
(1) 0 ≤ P(A) ≤ 1.
Thus we can think of P as a rule that assigns a number P(A) ∈ [0, 1] to each event A in
Ω. The mathematical term for a rule like this is a function, as we discussed in appendix II
of chapter 1. Thus P(·) is a function of the events in Ω, which takes values in the interval
[0, 1]. Before looking at the general properties of P, it seems sensible to gain experience
by looking at some simple special cases that are either familiar, or obvious, or both.
Example 2.4.1: Bernoulli trial. Many experiments have just two outcomes, for
example: head or tail; even or odd; win or lose; fly or crash; and so on. To simplify
matters these are often thought of as examples of an experiment with the two outcomes
success or failure, denoted by S and F respectively. Then for the events S and F we write
P(S) = p and P(F) = q = 1 − p.
Example 2.4.2: equally likely outcomes. Suppose that an experiment has sample
space Ω, such that each of the |Ω| outcomes in Ω is equally likely. This may be due to
some physical symmetry or the conditions of the experiment. Now let A be any event; the
number of outcomes in A is |A|. The equidistribution of probability among the outcomes
implies that the probability that A occurs should be proportional to |A| (we discussed this
at length in section 1.4). In cases of this type we therefore write
(2) P(A) = |A| / |Ω| = (number of outcomes in A) / (number of outcomes in Ω).
This is a large assumption, but it is very natural and compelling. It is so intuitively
attractive that it was being used explicitly in the 16th century, and it is clearly implicit in
the ideas and writings of several earlier mathematicians. Let us consider some further
examples of this idea in action.
Example 2.4.3. Two dice are rolled, one after the other. Let A be the event that the
second number is greater than the first. Here
|Ω| = 36,
as there are 6 × 6 pairs of the form (i, j), with 1 ≤ i, j ≤ 6. The pairs with i < j are
given by
A = {(5, 6), (4, 5), (4, 6), (3, 4), . . . , (1, 5), (1, 6)};
it is easy to see that |A| = 1 + 2 + ··· + 5 = 15. Hence
P(A) = 15/36 = 5/12.
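The counting in example 2.4.3 is a direct application of the rule P(A) = |A|/|Ω|, and can be checked by brute-force enumeration:

```python
from fractions import Fraction
from itertools import product

omega_space = list(product(range(1, 7), repeat=2))   # all 36 ordered pairs (i, j)
A = [(i, j) for (i, j) in omega_space if j > i]      # second number greater than first

p = Fraction(len(A), len(omega_space))
print(len(A), p)   # 15 5/12
```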
Example 2.4.4. If a coin is flipped n times, what is the probability that each time the
same face shows? Here Ω is the set of all possible sequences of H and T of length n,
which we regard as equally likely. Hence |Ω| = 2ⁿ. The event A comprises all heads or
all tails. Hence |A| = 2 and the required probability is
2/2ⁿ = 2¹⁻ⁿ.
Example 2.4.5: chain. Suppose you are testing a chain to destruction. It has n links
and is stretched between two shackles attached to a ram. The ram places the chain under
increasing tension until a link fails. Any link is equally likely to be the one that snaps,
and so if A is any group of links the probability that the failed link is in A is the
proportion of the total number of links in A. Now |Ω| = n, so
P(A) = |A|/n.
Example 2.4.6: lottery. Suppose you have an urn containing 20 tickets marked 1, 2,
. . . , 20. A ticket is drawn at random. Thus
Ω = {1, 2, . . . , 20} = {n : 1 ≤ n ≤ 20}.
Events in Ω may include:
A = the number drawn is even;
B = the number drawn is divisible by 5;
C = the number drawn is less than 8.
The implication of the words `at random' is that any number is equally likely to be
chosen. In this case our discussions above yield
P(A) = |A|/|Ω| = 10/20 = 1/2,
P(B) = |B|/|Ω| = 4/20 = 1/5,
and
P(C) = |C|/|Ω| = 7/20.
2.4 Probability; elementary calculations 39
Example 2.4.7: dice. Three dice are rolled and their scores added. Are you more
likely to get 9 than 10, or vice versa?
Solution. There are 6³ = 216 possible outcomes of this experiment, as each die has
six possible faces. You get a sum of 9 with outcomes such as (1, 2, 6), (2, 1, 6), (3, 3, 3)
and so on. Tedious enumeration reveals that there are 25 such triples, so
P(9) = 25/216.
A similar tedious enumeration shows that there are 27 triples, such as (1, 3, 6), (2, 4, 4)
and so on, that sum to 10. So
P(10) = 27/216 > P(9).
This problem was solved by Galileo before 1642.
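Galileo's `tedious enumeration' is nowadays a short computation; the following sketch (mine, not Galileo's) counts the ordered triples by their sum:

```python
from itertools import product

# Tally the 6^3 = 216 ordered triples of dice by their sum.
counts = {}
for triple in product(range(1, 7), repeat=3):
    s = sum(triple)
    counts[s] = counts.get(s, 0) + 1

print(counts[9], counts[10])  # 25 and 27
```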
An interesting point about this example is that if your sample space does not distinguish
between outcomes such as (1, 2, 6) and (2, 1, 6), then the possible sums 9 and 10 can
each be obtained in the same number of different ways, namely six. However, actual
experiments with real dice demonstrate that this alternative model is wrong.
Notice that symmetry is quite a powerful concept, and implies more than at first
appears.
Example 2.4.8. You deal a poker hand of five cards face down. Now pick up the fifth
and last card dealt; what is the probability that it is an ace? The answer is
P(A) = 1/13.
Sometimes it is objected that the answer should depend on the first four cards, but of
course if these are still face down they cannot affect the probability we want. By
symmetry any card has probability 1/13 of being an ace; it makes no difference whether
the pack is dealt out or kept together, as long as only one card is actually inspected.
Our intuitive notions of symmetry and fairness enable us to assign probabilities in
some other natural and appealing situations.
Example 2.4.9: rope. Suppose you are testing a rope to destruction: a ram places it
under increasing tension until it snaps at a point S, say. Of course we suppose that the
rope appeared uniformly sound before the test, so the failure point S is equally likely to
be any point of the rope. If the rope is of length 10 metres, say, then the probability that it
fails within 1 metre of each end is naturally P(S lies in those 2 metres) = 2/10 = 1/5.
Likewise the probability that S lies in any 2 metre length of rope is 1/5, as is the probability
that S lies in any 2 metres of the rope, however this 2 metres is made up.
Example 2.4.10: meteorite. Leaving your house one morning at 8 a.m. you find that
a meteorite has struck your car. When did it do so? Obviously meteorites take no account
of our time, so the time T of impact is equally likely to be any time between your leaving
the car and returning to it. If this interval was 10 hours, say, then the probability that the
meteorite fell between 1 a.m. and 2 a.m. is 1/10.
These are special cases of experiments in which the outcome is equally likely to be any
point of some interval [a, b], which may be in time or space.
We can define a probability function for this kind of experiment quite naturally as
follows. The sample space Ω is the interval [a, b] of length b − a. Now let A be any
interval in Ω of length l = |A|. Then the equidistribution of probability among the
outcomes in Ω implies that the probability of the outcome being in A is
P(A) = |A|/|Ω| = l/(b − a).
Example 2.4.9 revisited: rope. What is the probability that the rope fails nearer to
the fixed point of the ram than the moving point? There are 5 metres nearer the fixed
point, so this probability is 5/10 = 1/2.
This argument is equally natural and appealing for points picked at random in regions
of the plane. For definiteness, let Ω be some region of the plane with area |Ω|. Let A be
some region in Ω of area |A|. Suppose a point P is picked at random in Ω, with any point
equally likely to be chosen. Then the equidistribution of probability implies that the
probability of P being in A is
P(A) = |A|/|Ω|.
Example 2.4.10 revisited. The garden of your house is the region Ω, with area 100
square metres; it contains a small circular pond A of radius 1 metre. You are telephoned
by a neighbour who tells you your garden has been struck by a small meteorite. What is
the probability that it hit the pond? Obviously, by everything we have said above
P(hits pond) = |A|/|Ω| = π/100.
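The answer π/100 can be checked by simulation. A sketch, under the extra assumption (mine, for definiteness) that the garden is a 10 × 10 metre square and the pond is centred at (5, 5):

```python
import math
import random

random.seed(1)

# Drop uniform random points in the square garden; count hits on the pond,
# a disc of radius 1 centred at (5, 5).
n, hits = 100_000, 0
for _ in range(n):
    x, y = random.uniform(0, 10), random.uniform(0, 10)
    if (x - 5) ** 2 + (y - 5) ** 2 <= 1:
        hits += 1

estimate = hits / n
print(estimate, math.pi / 100)
```

With 100 000 points the estimate is typically within a few parts in a thousand of π/100 ≈ 0.0314.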
Since it makes no difference which experiment we use to yield these probabilities, there
is something to be said for having a standard format to present the great variety of
probability problems. For reasons of tradition, urns are often used.
Thus example 2.4.11(ii) would be a standard presentation of the probabilities above.
There are three main reasons for this. The first is that urns are often useful in situations
too complicated to be readily modelled by coins and dice. The second reason is that using
urns (instead of more realistic descriptions) enables the student to see the probabilistic
problems without being confused by false intuition. The third reason is historical: urns
were widely used in conducting lotteries and elections (in both cases because they are
opaque, thus preventing cheating in the first place and allowing anonymous voting in the
second). It was therefore natural for early probabilists to use urns as models of real
random behaviour.
2.5 THE ADDITION RULES
Of course not all experiments have equally likely outcomes, so we need to fix rules that
tell us about the properties of the probability function P, in general. Naturally we continue
to require that for any event A
(1) 0 ≤ P(A) ≤ 1,
and in particular the certain event Ω has probability 1, so
(2) P(Ω) = 1.
The most important rule is the following.
Addition rule. If A and B are disjoint events, that is, if A ∩ B = ∅, then
(3) P(A ∪ B) = P(A) + P(B).
This rule lies at the heart of probability. First let us note that we need such a rule, because
A ∪ B is an event when A and B are events, and we therefore need to know its probability.
Second, note that it follows from (3) (by induction) that if A₁, A₂, . . . , Aₙ is any
collection of disjoint events then
(4) P(A₁ ∪ A₂ ∪ · · · ∪ Aₙ) = P(A₁) + · · · + P(Aₙ).
The proof forms exercise 3, at the end of this section.
Third, note that it is sometimes too restrictive to confine ourselves to a finite collection
of events (we have seen several sample spaces with infinitely many outcomes), and we
therefore need an extended version of (4): if A₁, A₂, . . . is any infinite collection of
disjoint events, then
(5) P(A₁ ∪ A₂ ∪ · · ·) = P(A₁) + P(A₂) + · · · .
Definition. Let Ω be a sample space and suppose that P(·) is a probability function
on a family of subsets of Ω satisfying (1), (2), and (5). Then P is called a probability
distribution on Ω.
First, consider an experiment with equally likely outcomes. If A and B are disjoint then
|A ∪ B| = |A| + |B|, so
P(A ∪ B) = |A ∪ B|/|Ω| = |A|/|Ω| + |B|/|Ω| = P(A) + P(B).
Second, consider the interpretation of probability as reflecting relative frequency in the
long run. Suppose an experiment is repeated N times. At each repetition, events A and B
may, or may not, occur. If they are disjoint, they cannot both occur at the same repetition.
We argued in section 1.8 that the relative frequency of any event should be not too far
from its probability. Indeed, it is often the case that the relative frequency N(A)/N of an
event A is the only available guide to its probability P(A). Now, clearly
N(A ∪ B) = N(A) + N(B).
Hence, dividing by N, there is a powerful suggestion that we should have
P(A ∪ B) = P(A) + P(B).
Third, consider probability as a measure of expected value. For this case we resurrect
the benevolent plutocrat who is determined to give away $1 at random. The events A and
B are disjoint; if A occurs you get $1 in your left hand, if B occurs you get $1 in your
right hand. If (A ∪ B)ᶜ occurs, then Bob gets $1. The value of this offer to you is
$P(A ∪ B), the value to your left hand is $P(A), and the value to your right hand is $P(B).
But obviously it does not matter in which hand you get the money, so
P(A ∪ B) = P(A) + P(B).
Finally, consider the case where we imagine a point is picked at random anywhere in
some plane region Ω of area |Ω|. If A ⊆ Ω, we defined
P(A) = |A|/|Ω|.
Since area also satisfies the addition rule, we have immediately, when A ∩ B = ∅, that
P(A ∪ B) = P(A) + P(B).
It is interesting and important to note that in this case the analogy with mass requires the
unit probability mass to be distributed uniformly over the region Ω. We can envisage this
distribution as a lamina of uniform density |Ω|⁻¹ having total mass unity. This may seem
a bizarre thing to imagine, but it turns out to be useful later on.
In conclusion, it seems that the addition rule is natural and compelling in every case
where we have any insight into the behaviour of probability. Of course it is a big step to
say that it should apply to probability in every other case, but it seems inevitable. And
doing so has led to remarkably elegant and accurate descriptions of the real world.
Figure 2.2. P(A) + P(Aᶜ) = P(Ω) = 1.
It is very pleasant to see this consistency with our intuitive probability scale. Note,
however, that the converse is not true, that is, P(A) = 0 does not imply that A = ∅, as we
now see in an example.
Example 2.6.1. Pick a point at random in the unit square, say, and let A be the event
that this point lies on a diagonal of the square. As we have seen above,
P(A) = |A|/|Ω| = |A|,
where |A| is the area of the diagonal and |Ω| = 1. But lines have zero area; so P(A) = 0,
even though the event A is clearly not impossible.
Example 2.6.2. A die is rolled. How many rolls do you need, to have a better than
evens chance of rolling at least one six?
Solution. If you roll a die r times, there are 6^r possible outcomes. On each roll there
are 5 faces that do not show a six, and there are therefore in all 5^r outcomes with no six.
Hence P(no six in r rolls) = 5^r/6^r. Hence, by (4),
P(at least one six) = 1 − P(no six) = 1 − (5/6)^r.
For this to be better than evens, we need r large enough that 1 − (5/6)^r > 1/2. A short
calculation shows that r = 4.
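The `short calculation' can be done exactly with rational arithmetic; a sketch:

```python
from fractions import Fraction

# Find the least r with P(at least one six in r rolls) = 1 - (5/6)^r > 1/2.
r = 1
while 1 - Fraction(5, 6) ** r <= Fraction(1, 2):
    r += 1

print(r, float(1 - Fraction(5, 6) ** r))  # 4, about 0.518
```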
Example 2.6.3: de Méré's problem. Which of these two events is more likely?
(i) Four rolls of a die yield at least one six.
(ii) Twenty-four rolls of two dice yield at least one (6, 6), i.e. double six.
Solution. Let A denote the first event and B the second event. Then Aᶜ is the event
that no six is shown. There are 6⁴ equally likely outcomes, and 5⁴ of these show no six.
Hence by (4)
P(A) = 1 − P(Aᶜ) = 1 − (5/6)⁴.
Likewise
P(B) = 1 − P(Bᶜ) = 1 − (35/36)²⁴.
Now after a little calculation we find that
P(A) = 671/1296 > 1/2 > P(B) ≈ 0.491.
So the first event is more likely.
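The `little calculation' in de Méré's problem, sketched with exact fractions:

```python
from fractions import Fraction

# P(at least one six in four rolls of one die).
p_A = 1 - Fraction(5, 6) ** 4

# P(at least one double six in twenty-four rolls of two dice).
p_B = 1 - Fraction(35, 36) ** 24

print(p_A, float(p_A), float(p_B))  # 671/1296, about 0.518 and 0.491
```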
Difference rule. More generally we have the following rule for differences. Suppose
that B ⊆ A. Then
A = B ∪ (Bᶜ ∩ A) = B ∪ (A \ B) and B ∩ (Bᶜ ∩ A) = ∅.
Hence
P(A) = P(B) + P(Bᶜ ∩ A)
and so
(5) P(A \ B) = P(A) − P(B), if B ⊆ A.
Figure 2.3 almost makes this argument unnecessary.
Of course many events are not disjoint. What can we say of P(A ∪ B) when
A ∩ B ≠ ∅? The answer is given by the following rule.
Inclusion-exclusion rule. This says that, for any events A and B, the probability that
either occurs is given by
(6) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Example 2.6.4: wet and windy. From meteorological records it is known that for a
certain island at its winter solstice, it is wet with probability 30%, windy with probability
40%, and both wet and windy with probability 20%.
Using the above rules we can find the probability of other events of interest. For
example:
(i) P(dry) = P(not wet) = 1 − 0.3 = 0.7, by (4);
(ii) P(dry and windy) = P(windy \ wet) = P(windy) − P(wet and windy) = 0.2, by (5);
(iii) P(wet or windy) = 0.4 + 0.3 − 0.2 = 0.5, by (6).
Figure 2.4. It can be seen that P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof. From (6) this is obviously true for n = 2. Suppose it is true for some n ≥ 2;
then
P(A₁ ∪ A₂ ∪ · · · ∪ Aₙ₊₁) ≤ P(A₁ ∪ · · · ∪ Aₙ) + P(Aₙ₊₁) ≤ P(A₁) + · · · + P(Aₙ₊₁).
The result follows by induction.
2. Wet, windy and warm. Show that for any events A (wet), B (windy), and C (warm),
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(B ∩ C) − P(C ∩ A) + P(A ∩ B ∩ C).
3. Two dice are rolled. How many rolls do you need for a better than evens chance of at least one
double six?
4. Galileo's problem (example 2.4.7) revisited. Let Sₖ be the event that the sum of the three
dice is k. Find P(Sₖ) for all k.
Remark. Pepys put this problem to Isaac Newton, but was then very reluctant to accept
Newton's (correct) answer.
6. Show that the probability that exactly one of the events A or B occurs is
P(A) + P(B) − 2P(A ∩ B).
2.7 CONDITIONAL PROBABILITY; MULTIPLICATION RULE
In real life very few experiments amount to just one action with random outcomes; they
usually have a more complicated structure in which some results may be known before
the experiment is complete. Or conditions may change. We need a new rule to add to
those given in section 2.5; it is called the conditioning rule. Before we give the rule, here
are some examples.
Example 2.7.1. Suppose you are about to roll two dice, one from each hand. What is
the probability that your right-hand die shows a larger number than the left-hand die?
There are 36 outcomes, and in 15 of these the right-hand score is larger. So
(1) P(right-hand larger) = 15/36.
Now suppose you roll the left-hand die first, and it shows 5. What is the probability that
the right-hand die shows more? It is clearly not 15/36. In fact only one outcome will do:
it must show 6. So the required probability is 1/6.
This is a special case of the general observation that if conditions change then results
change. In particular, if the conditions under which an experiment takes place are altered,
then the probabilities of the various outcomes may be altered. Here is another illustration.
Example 2.7.2: stones. Kidney stones are either small (i.e. < 2 cm diameter) or
large (i.e. > 2 cm diameter). Treatment can either succeed or fail. For a sequence of 700
patients the stone sizes and outcomes were as shown in table 2.3. Let L denote the event
that the stone treated is large. Then, clearly, for a patient selected at random from the 700
patients,
(2) P(L) = 343/700.
Also, for a patient picked at random from the 700, the probability of success is
(3) P(S) = 562/700 ≈ 0.8.
However, suppose a patient is picked at random from those whose stone is large. The
probability of success is different from that given in (3); it is obviously
247/343 ≈ 0.72.
That is to say, the probability of success given that the stone is large is different from the
probability of success given no knowledge of the stone.
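The arithmetic of the stones example, using only the counts quoted above (343 large stones among 700 patients, 562 successes overall, of which 247 were on large stones); a sketch:

```python
from fractions import Fraction

patients, large = 700, 343
successes, successes_large = 562, 247

p_L = Fraction(large, patients)                 # P(L)
p_S = Fraction(successes, patients)             # P(S), about 0.80
p_S_given_L = Fraction(successes_large, large)  # P(S | L), about 0.72

print(float(p_S), float(p_S_given_L))
```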
This is natural and obvious, but it is most important and useful to have a distinct
notation, in order to keep the conditions of an experiment clear and explicit in our minds
Definition. The conditional probability that A occurs given that B occurs, denoted by
P(A|B), is given by
(6) P(A|B) = P(A ∩ B)/P(B), when P(B) > 0.
This may seem a little arbitrary, but it is strongly motivated by our interpretation of
probability as an extension of proportion. We may run through the usual examples in the
usual way.
First, consider an experiment with equally likely outcomes, for which
P(A) = |A|/|Ω| and P(B) = |B|/|Ω|.
Given simply that B occurs, all the outcomes in B are still equally likely. Essentially, we
now have an experiment with |B| equally likely outcomes, in which A occurs if and only
if A ∩ B occurs. Hence under these conditions
P(A|B) = |A ∩ B|/|B|.
But
|A ∩ B|/|B| = (|A ∩ B|/|Ω|) ÷ (|B|/|Ω|) = P(A ∩ B)/P(B),
which is what (6) says.
Second, we consider the argument from relative frequency. Suppose an experiment is
repeated a large number n of times, yielding the event B on N(B) occasions. Given that B
occurs, we may confine our attention to these N(B) repetitions. Now A occurs in just
N(A ∩ B) of these, and so empirically
P(A|B) ≈ N(A ∩ B)/N(B) = (N(A ∩ B)/n) ÷ (N(B)/n) ≈ P(A ∩ B)/P(B),
which is consistent with (6).
Third, we return to the interpretation of probability as a fair price. Once again a
plutocrat offers me a dollar. In this case I get the dollar only if both the events A and B
occur, so this offer is worth $P(A ∩ B) to me. But we can also take it in stages: suppose
that if B occurs, the dollar bill is placed on a table, and if A then occurs the bill is mine.
Then
(i) the value of what will be on the table is $P(B),
(ii) the value of a dollar bill on the table is $P(A|B).
The value of the offer is the same, whether or not the dollar bill has rested on the table, so
P(A ∩ B) = P(A|B)P(B),
which is (6) yet again.
Finally consider the experiment in which a point is picked at random in some plane
region Ω. For any regions A and B, if we are given that B occurs then A can occur if and
only if A ∩ B occurs. Naturally then it is reasonable to require that P(A|B) is proportional
to |A ∩ B|, the area of A ∩ B. That is, for some k,
P(A|B) = kP(A ∩ B).
Now we observe that obviously P(Ω|B) = 1, so
1 = kP(Ω ∩ B) = kP(B),
giving k = 1/P(B), as required. Figure 2.5 illustrates the rule (6).
Let us see how this rule applies to the simple examples at the beginning of this section.
Figure 2.5. The left-hand diagram shows all possible outcomes Ω. The right-hand diagram
corresponds to our knowing that the outcome must lie in B; P(A|B) is thus the proportion
P(A ∩ B)/P(B) of these possible outcomes.
Example 2.7.3. A coin is flipped three times. Let A be the event that the first flip
gives a head, and B the event that there are exactly two heads overall. We know that
Ω = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT},
A = {HTT, HHT, HHH, HTH},
B = {HHT, HTH, THH},
A ∩ B = {HHT, HTH}.
Hence by (6),
P(A|B) = P(A ∩ B)/P(B) = |A ∩ B|/|B| = 2/3
and
P(B|A) = P(A ∩ B)/P(A) = 2/4 = 1/2.
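Definition (6) can be checked against direct enumeration of the eight sequences; a sketch:

```python
from fractions import Fraction
from itertools import product

omega = [''.join(seq) for seq in product('HT', repeat=3)]  # 8 outcomes

A = {w for w in omega if w[0] == 'H'}        # first flip is a head
B = {w for w in omega if w.count('H') == 2}  # exactly two heads

def prob(event):
    return Fraction(len(event), len(omega))

p_A_given_B = prob(A & B) / prob(B)
p_B_given_A = prob(A & B) / prob(A)
print(p_A_given_B, p_B_given_A)  # 2/3 and 1/2
```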
The point of this example is that many people are prepared to argue as follows: `If the
coin shows a head, it is either double-headed or the conventional coin. Since the coin was
picked at random, these are equally likely, so P(D|A) = 1/2.'
This is superficially plausible but, as we have seen, it is totally wrong. Notice that this
bogus argument avoids mentioning the sample space Ω; this is very typical of false
reasoning in probability problems.
We conclude this section by looking at (6) again. It assumes that we know P(A ∩ B) and
P(B), and defines P(A|B). However, in applications we quite often know P(B) and P(A|B),
and wish to know P(A ∩ B). In cases like this, we use the following reformulation of (6):
(7) P(A ∩ B) = P(A|B)P(B).
Example 2.7.5: socks. A box contains 5 red socks and 3 blue socks. If you remove 2
socks at random, what is the probability that you are holding a blue pair?
Solution. Let B be the event that the first sock is blue, and A the event that you have a
pair of blue socks. If you have one blue sock, the probability that the second is blue is the
chance of drawing one of the 2 remaining blues from the 7 remaining socks. That is to
say
P(A|B) = 2/7.
Here A = A ∩ B and so, by (7),
P(A) = P(A|B)P(B) = (2/7) × (3/8) = 3/28.
Of course you could do this problem by enumerating the entire sample space for the
experiment, but the above method is much easier.
Finally let us stress that conditional probability is not in any way an unnatural or purely
theoretical concept. It is completely familiar and natural to you if you have ever bought
insurance, played golf, or observed horse racing, to choose just three examples of the
myriad available. Thus:
Golf. If you play against the Open Champion then P(you win) ≈ 0. However, given a
sufficiently large number of strokes it can be arranged that P(you win | handicap) ≈ 1/2.
Thus any two players can have a roughly even contest.
Horse races. Similarly any horse race can be made into a much more open contest by
requiring faster horses to carry additional weights. Much of the betting industry relies on
the judgement of the handicappers in doing this.
The objective of the handicapper in choosing the weights is to equalize the chances to
some extent and introduce more uncertainty into the result. The ante-post odds reflect the
bookmakers' assessment of how far he has succeeded, and the starting prices reflect the
gamblers' assessment of the position. (See section 2.12 for an introduction to odds.)
Of course this is not the limit to possible conditions; if it rains heavily before a race
then the odds will change to favour horses that run well in heavy conditions. And so on.
Clearly this idea of conditional probability is relevant in almost any experiment; you
should think of some more examples (exercise).
2.8 THE PARTITION RULE AND BAYES' RULE
In this section we look at some of the simple consequences of our definition of
conditional probability. The first and most important thing to show is that conditional
probability satisfies the rules for a probability function (otherwise its name would be very
misleading).
Partition rule
(6) P(A) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ).
This has a conditional form as well: for any three events A, B, and C, we have
P(A|C) = P(A|B ∩ C)P(B|C) + P(A|Bᶜ ∩ C)P(Bᶜ|C).
Example 2.8.1: pirates. An expensive electronic toy made by Acme Gadgets Inc. is
defective with probability 10⁻³. These toys are so popular that they are copied and sold
illegally but cheaply. Pirate versions capture 10% of the market, and any pirated copy is
defective with probability 1/2. If you buy a toy, what is the chance that it is defective?
Solution. Let A be the event that you buy a genuine article, and let D be the event
that your purchase is defective. We know that
P(A) = 9/10, P(Aᶜ) = 1/10, P(D|A) = 1/1000, P(D|Aᶜ) = 1/2.
Hence, by the partition rule,
P(D) = P(D|A)P(A) + P(D|Aᶜ)P(Aᶜ) = 9/10000 + 1/20 ≈ 5%.
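The partition-rule arithmetic for the pirates example, as a sketch:

```python
from fractions import Fraction

p_A = Fraction(9, 10)            # genuine article
p_Ac = 1 - p_A                   # pirate copy
p_D_given_A = Fraction(1, 1000)  # genuine toys rarely fail
p_D_given_Ac = Fraction(1, 2)    # pirate copies fail half the time

# Partition rule: P(D) = P(D|A)P(A) + P(D|A^c)P(A^c).
p_D = p_D_given_A * p_A + p_D_given_Ac * p_Ac
print(p_D, float(p_D))  # 509/10000, about 0.05
```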
Solution. Let R denote the event that the result is positive, and D the event that the
individual has the disease. Then by (6)
P(R) = P(R|D)P(D) + P(R|Dᶜ)P(Dᶜ) = 0.95p + 0.1(1 − p) = 0.85p + 0.1.
For a test as bad as this you will get a lot of positive results even if the disease is rare; if it
is rare, most of these will be false positives.
Example 2.8.3. Patients may be treated with any one of a number of drugs, each of
which may give rise to side effects. A certain drug C has a 99% success rate in the
absence of side effects, and side effects only arise in 5% of cases. However, if they do
arise then C has only a 30% success rate. If C is used, what is the probability of the event
A that a cure is effected?
Solution. Let B be the event that no side effects occur. We are given that
P(A|B ∩ C) = 99/100, P(B|C) = 95/100, and P(A|Bᶜ ∩ C) = 30/100.
Hence, by the conditional form of the partition rule,
P(A|C) = P(A|B ∩ C)P(B|C) + P(A|Bᶜ ∩ C)P(Bᶜ|C) = 0.99 × 0.95 + 0.30 × 0.05 = 0.9555.
Of course many populations can be divided into more than two groups, and many
experiments yield an arbitrary number of events. This requires a more general version of
the partition rule.
Extended partition rule. Let A be some event, and suppose that (Bᵢ; i ≥ 1) is a
collection of events such that
A ⊆ B₁ ∪ B₂ ∪ · · · = ⋃ᵢ Bᵢ,
and, for i ≠ j, Bᵢ ∩ Bⱼ = ∅, that is to say, the Bᵢ are disjoint. Then, by the extended
addition rule (5) of section 2.5,
(8) P(A) = P(A ∩ B₁) + P(A ∩ B₂) + · · ·
= Σᵢ P(A ∩ Bᵢ)
= Σᵢ P(A|Bᵢ)P(Bᵢ).
This is the extended partition rule. Its conditional form is
(9) P(A|C) = Σᵢ P(A|Bᵢ ∩ C)P(Bᵢ|C).
Example 2.8.4: coins. You have 3 double-headed coins, 1 double-tailed coin and 5
normal coins. You select one coin at random and flip it. What is the probability that it
shows a head?
Solution. Let D, T, and N denote the events that the coin you select is double-headed,
double-tailed or normal, respectively. Then, if H is the event that the coin shows a head,
by conditional probability we have
P(H) = P(H|D)P(D) + P(H|T)P(T) + P(H|N)P(N)
= 1 × (3/9) + 0 × (1/9) + (1/2) × (5/9) = 11/18.
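The extended partition rule for the nine coins, as a sketch in exact arithmetic:

```python
from fractions import Fraction

# (P(type), P(head | type)) for each kind of coin among the nine.
coin_types = [
    (Fraction(3, 9), Fraction(1)),     # double-headed
    (Fraction(1, 9), Fraction(0)),     # double-tailed
    (Fraction(5, 9), Fraction(1, 2)),  # normal
]

# Extended partition rule: P(H) = sum over types of P(H | type) P(type).
p_H = sum(p_type * p_head for p_type, p_head in coin_types)
print(p_H)  # 11/18
```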
Obviously the list of examples demonstrating the partition rule could be extended
indefinitely; it is a crucial result. Now let us consider the examples given above from
another point of view.
Example 2.8.1 revisited: pirates. Typically, we are prompted to consider this problem
when our toy proves to be defective. In this case we wish to know if it is an authentic
product of Acme Gadgets Inc., in which case we will be able to get a replacement. Pirates,
of course, are famous for not paying compensation. In fact we really want to know
P(A|D), which is an upper bound for the chance that you get a replacement.
Example 2.8.2 revisited: tests. Once again, for the individual the most important
question is, given a positive result do you indeed suffer the disease? That is, what is
P(D|R)?
Bayes' rule
(10) P(A|B) = P(B|A)P(A) / {P(B|A)P(A) + P(B|Aᶜ)P(Aᶜ)}.
Here are some applications of this famous rule or theorem.
Example 2.8.2 continued: false positives. Now we can answer the question posed
above: in the context of this test, what is P(D|R)? By Bayes' rule,
P(D|R) = P(R|D)P(D)/P(R) = 0.95p/(0.85p + 0.1).
If p is appreciable this is reasonably large, and the test looks good. On the other hand, if
p = 10⁻⁶, so the disease is very rare, then
P(D|R) ≈ 10⁻⁵,
which is far from conclusive. Ordinarily one would hope to have further independent tests
to use in this case.
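Bayes' rule for the diagnostic test, with the figures quoted above (P(R|D) = 0.95, P(R|Dᶜ) = 0.1), as a sketch:

```python
def p_disease_given_positive(p):
    # Bayes' rule (10) with P(R|D) = 0.95 and P(R|D^c) = 0.1.
    return 0.95 * p / (0.95 * p + 0.1 * (1 - p))

for p in (0.5, 1e-6):
    print(p, p_disease_given_positive(p))
# For p = 10^-6 the answer is about 10^-5: nearly all positives are false.
```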
Here is an example of Bayes' rule that has the merit of being very simple, albeit
slightly frivolous.
Solution. Let A be the event that the question is answered correctly, and S the event
that the student knew the answer. We require P(S|A). To use Bayes' rule, we need to
2.9 INDEPENDENCE AND THE PRODUCT RULE
At the start of section 2.7 we noted that a change in the conditions of some experiment
will often obviously change the probabilities of various outcomes. That led us to define
conditional probability.
However, it is equally obvious that sometimes there are changes that make no
difference whatever to the outcomes of the experiments, or to the probability of some
event A of interest. For example, suppose you buy a lottery ticket each week; does the
chance of your winning next week depend on whether you won last week? Of course not;
the numbers chosen are independent of your previous history. What does this mean
formally? Let A be the outcome of this week's lottery, and B the event that you won last
week. Then we agree that obviously
(1) P(A|B) = P(A|Bᶜ).
There are many events A and B for which, again intuitively, it seems natural that the
chance that A occurs is not altered by any knowledge of whether B occurs or Bc occurs.
For example, let A be the event that you roll a six and B the event that the dollar exchange
rate fell. Clearly we must assume that (1) holds. You can see that this list of pairs A and B
for which (1) is true could be prolonged indenitely:
A = this coin shows a head, B = that coin shows a head;
A = you are dealt a flush, B = coffee futures fall.
Think of some more examples yourself.
Now it immediately follows from (1) by the partition rule that
(2) P(A) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ)
= P(A|B){P(B) + P(Bᶜ)}, by (1),
= P(A|B).
That is, if (1) holds then
(3) P(A) = P(A|B) = P(A|Bᶜ).
Furthermore, by the definition of conditional probability, we have in this case
(4) P(A ∩ B) = P(A|B)P(B) = P(A)P(B), by (2).
This special property of events is called independence and, when (4) holds, A and B are
said to be independent events. The final version (4) is usually taken to be definitive; thus
we state the
Definition. Events A and B are independent if
(5) P(A ∩ B) = P(A)P(B).
Example 2.9.1. Suppose I roll a die and pick a card at random from a conventional
pack. What is the chance of rolling a six and picking an ace?
Solution. We can look at this in two ways. The first way says that the events
A = roll a six
and
B = pick an ace
are obviously independent in the sense discussed above; that is, P(A|B) = P(A) and of
course P(B|A) = P(B). Dice and cards cannot influence each other. Hence
P(A ∩ B) = P(A)P(B) = (1/6) × (1/13) = 1/78.
Alternatively, we could use the argument of chapter 1, and point out that by symmetry
all 6 × 52 = 312 possible outcomes of die and card are equally likely. Four of them have
an ace with a six, so
P(A ∩ B) = 4/312 = 1/78.
It is very gratifying that the two approaches yield the same answer, but not surprising. In
fact, if you think about the argument from symmetry, you will appreciate that it tacitly
assumes the independence of dice and cards. If there were any mutual influence it would
break the symmetry.
If A and B are not independent, then they are said to be dependent. Obviously
dependence and independence are linked to our intuitive notions of cause and effect.
There seems to be no way in which one coin can cause another to be more or less likely
to show a head. However, you should beware of taking this too far. Independence is yet
another assumption that we make in constructing our model of the real world. It is an
extremely convenient assumption, but if it is inappropriate it will yield inaccurate and
irrelevant results. Be careful.
The product rule (5) has an extended version, as follows: if A₁, A₂, . . . , Aₙ are
independent then
(6) P(A₁ ∩ A₂ ∩ · · · ∩ Aₙ) = P(A₁)P(A₂) · · · P(Aₙ).
Example 2.9.2. A sequence of fair coins is flipped. They each show a head or a tail
independently, with probability 1/2 in each case. Therefore the probability that any given
set of n coins all show heads is 2^-n. Indeed, the probability that any given set of n coins
shows a specified arrangement of heads and tails is 2^-n. Thus, for example, if you flip a
fair coin 6 times,
P(HHHHHH) = P(HTTHTH) = 2⁻⁶.
(The less experienced sometimes find this surprising.)
Example 2.9.3: central heating. Your heating system includes a pump and a boiler in
a circuit of pipes. You might represent this as a diagram like figure 2.6.
Let Fp and Fb be the events that the pump or boiler fail, respectively. Then the event
W that your system works is
W = Fpᶜ ∩ Fbᶜ.
You might assume that pump and boiler break down independently, in which case, by (5),
(7) P(W) = P(Fpᶜ)P(Fbᶜ).
However, your plumber might object that if the power supply fails then both pump and
boiler will fail, so the assumption of independence is invalid. To meet this objection we
define the events
Figure 2.7. The system works only if all three elements in the sequence work.
on using independence, equation (6). This answer is smaller than that given in (8),
showing how unjustified assumptions of independence can mislead.
In the example above the elements of the system were in series. Sometimes elements
are found in parallel.
Example 2.9.3 continued: central heating. Your power supply is actually of vital
importance (in a hospital, say) and you therefore fit an alternative generator for use in
emergencies. The power system can now be represented as in figure 2.8. Let the event
that the emergency power fails be E. If we assume that Fe and E are independent, then
the probability that at least one source of power works is
P(Feᶜ ∪ Eᶜ) = P((Fe ∩ E)ᶜ)
= 1 − P(Fe ∩ E)
= 1 − P(Fe)P(E), by (5),
≥ P(Feᶜ).
Hence the probability that the system works is increased by including the reserve power
unit, as you surely hoped.
Example 2.9.4. Suppose a system can be represented as in figure 2.9. Here each
element works with probability p, independently of the others. Running through the
blocks we can reduce this to figure 2.10, where the expression in each box is the
probability of its working.
Figure 2.8. This system works if either of the two elements works.
Figure 2.9. Each element works independently with probability p. The system works if a route
exists from A to B that passes only through working elements.
Figure 2.10. Solution in stages, showing finally that P(W) = p{1 − (1 − p)(1 − p²)²}, where W is
the event that the system works.
Sometimes elements are in more complicated configurations, in which case the use of
conditional probability helps.
Example 2.9.5: snow. Four towns are connected by five roads, as shown in figure
2.11. Each road is blocked by snow independently with probability θ; what is the
probability that you can drive from A to D?
Figure 2.11. The towns lie at A, B, C, and D.
Solution. Let R be the event that the road BC is open, and Q the event that you can
drive from A to D. Then
P(Q) = P(Q|R)(1 - θ) + P(Q|R^c)θ = (1 - θ^2)^2 (1 - θ) + [1 - {1 - (1 - θ)^2}^2]θ.
The last line follows on using the methods of example 2.9.4, because when R or R^c
occurs the system is reduced to blocks in series and parallel. ∎
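The conditioning on R can be checked by enumerating all 2^5 open/blocked patterns of the five roads; a minimal sketch (θ is the blocking probability, and the road list is our reading of figure 2.11):

```python
from itertools import product

ROADS = ['AB', 'AC', 'BD', 'CD', 'BC']

def can_drive(open_roads):
    # Depth-first search over open roads from A, looking for D.
    adj = {t: set() for t in 'ABCD'}
    for r in open_roads:
        adj[r[0]].add(r[1]); adj[r[1]].add(r[0])
    seen, stack = {'A'}, ['A']
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v); stack.append(v)
    return 'D' in seen

def p_bruteforce(theta):
    total = 0.0
    for s in product([0, 1], repeat=5):
        if can_drive([r for r, x in zip(ROADS, s) if x]):
            pr = 1.0
            for x in s:
                pr *= (1 - theta) if x else theta
            total += pr
    return total

def p_formula(theta):
    # P(Q) = (1 - t^2)^2 (1 - t) + [1 - {1 - (1 - t)^2}^2] t
    return (1 - theta**2)**2 * (1 - theta) + (1 - (1 - (1 - theta)**2)**2) * theta

print(p_formula(0.3), p_bruteforce(0.3))
```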
Note that events can in fact be independent when you might reasonably expect them
not to be.
Example 2.9.6. Suppose three fair coins are flipped. Let A be the event that they all
show the same face, and B the event that there is at most one head. Are A and B
independent? Write `yes' or `no', and then read on.
Very often indeed we need to use a slightly different statement of independence. Just as
P(A|C) is often different from P(A), so also P(A ∩ B|C) may behave differently from
P(A ∩ B). Specifically, A and B may be independent given C, even though they are not
necessarily independent in general. This is called conditional independence; formally we
say that A and B are conditionally independent given C when P(A ∩ B|C) = P(A|C)P(B|C).
Example 2.9.7: high and low rolls. Suppose you roll a die twice. Let A2 be the
event that the first roll shows a 2, and B5 the event that the second roll shows a 5. Also
let L2 be the event that the lower score is a 2, and H5 the event that the higher score is
a 5.
(i) Show that A2 and B5 are independent.
(ii) Show that L2 and H5 are not independent.
(iii) Let D be the event that one roll shows less than a 3 and one shows more than a 3.
Show that L2 and H5 are conditionally independent given D.
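All three claims can be verified by direct enumeration of the 36 equally likely outcomes; a sketch:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))

def prob(event, given=None):
    # Conditional probability by counting equally likely outcomes.
    space = [o for o in outcomes if given is None or given(o)]
    return Fraction(sum(1 for o in space if event(o)), len(space))

A2 = lambda o: o[0] == 2            # first roll shows 2
B5 = lambda o: o[1] == 5            # second roll shows 5
L2 = lambda o: min(o) == 2          # lower score is 2
H5 = lambda o: max(o) == 5          # higher score is 5
D  = lambda o: min(o) < 3 < max(o)  # one roll below 3, one above 3

both = lambda e, f: (lambda o: e(o) and f(o))

indep = prob(both(A2, B5)) == prob(A2) * prob(B5)           # (i) holds
dep   = prob(both(L2, H5)) != prob(L2) * prob(H5)           # (ii) holds
cond  = prob(both(L2, H5), D) == prob(L2, D) * prob(H5, D)  # (iii) holds
print(indep, dep, cond)
```

Given D there are 12 outcomes, of which P(L2|D) = 1/2, P(H5|D) = 1/3, and P(L2 ∩ H5|D) = 1/6, so the product rule holds conditionally even though it fails unconditionally.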
4. Suppose that any child is equally likely to be male or female, and Anna has three children. Let
A be the event that the family includes children of both sexes and B the event that the family
includes at most one girl.
(a) Show that A and B are independent.
(b) Is this still true if boys and girls are not equally likely?
(c) What happens if Anna has four children?
5. Find events A, B, and C such that A and B are independent, but A and B are not conditionally
independent given C.
6. Find events A, B, and C such that A and B are not independent, but A and B are conditionally
independent given C.
7. Two conventional fair dice are rolled. Show that the event that their sum is 7 is independent of
the score on the first die.
8. Some form of prophylaxis is said to be 90% effective at prevention during one year's treatment.
If years are independent, show that the treatment is more likely than not to fail within seven
years.
Example 2.10.1: faults. (i) A factory has two robots producing capeks. (A capek is
not unlike a widget or a gubbins, but it is more colourful.) One robot is old and one is
new; the newer one makes twice as many capeks as the old. If you pick a capek at
random, what is the probability that it was made by the new machine? The answer is
obviously 2/3, and we can display all the possibilities in a natural and appealing way in
figure 2.12. The arrows in a tree diagram point to possible events, in this example N
(new) or N^c (old). The probability of the event is marked beside the relevant arrow.
(ii) Now we are told that 5% of the output of the old machine is defective (D), but 10%
of the output of the new machine is defective. What is the probability that a randomly
selected capek is defective? This time we draw a diagram first, figure 2.13. Now we begin
to see why this kind of picture is called a tree diagram. Again the arrows point to possible
Figure 2.12. First tree for capeks: N (new, probability 2/3) or N^c (old, probability 1/3).
Figure 2.13. The tree continues: N or N^c is followed by D (defective) or D^c (not defective), with
P(D|N) = 1/10, P(D^c|N) = 9/10, P(D|N^c) = 1/20, P(D^c|N^c) = 19/20.
events. However, the four arrows on the right are marked with conditional probabilities,
because they originate in given events. Thus
P(D|N) = 1/10, P(D^c|N^c) = 19/20,
and so on.
The probability of traversing any route in the tree is the product of the probabilities on
the route, by (1). In this case two routes end at a defective capek, so the required
probability is
P(D) = (2/3) × (1/10) + (1/3) × (1/20) = 1/12.
Figure 2.14. Reversed tree for capeks: D or D^c is followed by N or N^c. The branches carry
P(D) = 1/12, P(N|D) = 4/5, P(N^c|D) = 1/5, P(D^c) = 11/12, P(N|D^c) = 36/55, P(N^c|D^c) = 19/55.
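The labels on the reversed tree follow from Bayes' rule; a quick check with exact fractions, using P(N) = 2/3, P(D|N) = 1/10, P(D|N^c) = 1/20 from the example:

```python
from fractions import Fraction as F

p_N = F(2, 3)                      # prior: capek made by new machine
p_D_given_N, p_D_given_Nc = F(1, 10), F(1, 20)

# Partition rule: P(D) = P(D|N)P(N) + P(D|N^c)P(N^c).
p_D = p_D_given_N * p_N + p_D_given_Nc * (1 - p_N)

# Bayes' rule gives the labels on the reversed tree.
p_N_given_D  = p_D_given_N * p_N / p_D
p_N_given_Dc = (1 - p_D_given_N) * p_N / (1 - p_D)

print(p_D, p_N_given_D, p_N_given_Dc)   # 1/12 4/5 36/55
```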
Trees like those in figures 2.13 and 2.14, with two branches at each fork, are known as
binary trees. The order in which we should consider the events, and hence draw the tree,
is usually determined by the problem, but given any two events A and B there are
obviously two associated binary trees.
The notation of these diagrams is natural and self-explanatory. Any edge corresponds
to an event, which is indicated at the appropriate node or vertex. The relevant probability
is written adjacent to the edge. We show the first tree again in figure 2.15, labelled with
symbolic notation.
The edges may be referred to as branches, and the final node may be referred to as a
leaf. The probability of the event at any node, or leaf, is obtained by multiplying the
probabilities labelling the branches leading to it. For example,
(2) P(A^c ∩ B) = P(B|A^c)P(A^c).
Furthermore, since event B occurs at the two leaves marked with an asterisk, the diagram
Figure 2.15. A or A^c is followed by B or B^c; the branches are labelled P(A), P(B|A), P(B^c|A),
P(A^c), P(B|A^c), P(B^c|A^c), and the leaves A ∩ B and A^c ∩ B are marked with an asterisk.
shows that
(3) P(B) = P(B|A)P(A) + P(B|A^c)P(A^c),
as we know.
Figure 2.16 is the reversed tree. If we have the entries on either tree we can find the
entries on the other by using Bayes' rule.
Similar diagrams arise quite naturally in knock-out tournaments, such as Wimbledon.
The diagram is usually displayed the other way round in this case, so that the root of such
a tree is the winner in the final.
Figure 2.16. Reversed tree: B or B^c is followed by A or A^c, with branches labelled P(B),
P(A^c|B), P(B^c), P(A|B^c), P(A^c|B^c), and so on.
Figure 2.17. Sequential tests. D, disease present; R, first test positive; T, second test positive.
For the tested population, the routes through R and T (or R^c and T^c) give accurate diagnoses;
D followed by R^c and T^c gives false negatives, and D^c followed by R and T gives false positives.
Figure 2.18. Evasive tree.
Example 2.10.3: craps, an infinite tree. In this well-known game two dice are rolled
and their scores added. If the sum is 2, 3, or 12 the roller loses; if it is 7 or 11 the roller
wins; if it is any other number, say n, the dice are rolled again. On this next roll, if the
sum is n then the roller wins, if it is 7 the roller loses, otherwise the dice are rolled again.
On this and all succeeding rolls the roller loses with 7, wins with n, or rolls again
otherwise. The corresponding tree is shown in figure 2.19. ∎
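Although the tree in figure 2.19 is infinite, the roller's winning probability can be found exactly by conditioning on the first sum, and then on whether the point n recurs before a 7; a sketch with exact fractions:

```python
from fractions import Fraction as F

# Number of ways each sum of two dice can occur (out of 36).
ways = {s: 6 - abs(s - 7) for s in range(2, 13)}

p_win = F(ways[7] + ways[11], 36)      # immediate win on 7 or 11
for n in (4, 5, 6, 8, 9, 10):          # the possible "points"
    # On repeated rolls, the point n appears before a 7
    # with probability ways[n] / (ways[n] + ways[7]).
    p_win += F(ways[n], 36) * F(ways[n], ways[n] + ways[7])

print(p_win)   # 244/495
```

The answer 244/495 ≈ 0.493 is just less than an even chance, which is why the game survives.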
We conclude this section by remarking that sometimes diagrams other than trees are
useful.
Example 2.10.4: tennis. Rod and Fred are playing a game of tennis, and have
reached deuce. Rod wins any point with probability r or loses it with probability 1 - r.
Let us denote the event that Rod wins a point by R. Then if they share the next two points
the game is back to deuce; an appropriate diagram is shown in figure 2.20. ∎
Figure 2.19. Tree for craps. The game continues indefinitely: at each stage the roller wins (7 or 11
at first, thereafter the point n), loses (2, 3, or 12 at first, thereafter 7), or rolls again.
Figure 2.20. The diagram is not a tree because the edges rejoin at deuce.
Example 2.10.5: coin tossing. Suppose you have a biased coin that yields a head
with probability p and a tail with probability q. Then one is led to the diagram in figure
2.21 as the coin is flipped repeatedly; we truncate it at three flips. ∎
2.11 WORKED EXAMPLES
The rules of probability (we have listed them in subsection 2.14.II), especially the ideas
of independence and conditioning, are remarkably effective at working together to
provide neat solutions to a wide range of problems. We consider a few examples.
Example 2.11.1. A coin shows a head with probability p, or a tail with probability
1 - p = q. It is flipped repeatedly until the first head appears. Find P(E), the probability
of the event E that the first head appears at an even number of flips.
Solution. Let H and T denote the outcomes of the first flip. Then, by the partition
rule,
(1) P(E) = P(E|H)P(H) + P(E|T)P(T).
Now of course P(E|H) = 0, because 1 is odd. Turning to P(E|T), we now require an odd
number of flips after the first to give an even number overall. Furthermore, flips are
independent and so
(2) P(E|T) = P(E^c) = 1 - P(E).
Hence P(E) = q{1 - P(E)}, which yields P(E) = q/(1 + q). ∎
Figure 2.21. Counting heads. There are three routes to the node marked *, so the probability of one
head in three flips is 3pq^2.
Example 2.11.2. You roll a die repeatedly. What is the probability of rolling a six for
the first time at an odd number of rolls?
Solution. Let A be the event that a six appears for the first time at an odd roll. Let S
be the event that the first roll is a six. Then by the partition rule, with an obvious notation,
P(A) = P(A|S)(1/6) + P(A|S^c)(5/6).
But obviously P(A|S) = 1. Furthermore, the rolls are all independent, and so
P(A|S^c) = 1 - P(A).
Therefore
P(A) = 1/6 + (5/6){1 - P(A)},
which yields
P(A) = 6/11. ∎
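Both parity answers fit one pattern: if each trial succeeds with probability p = 1 - q, then the first success falls at an odd trial with probability p/(1 - q^2), and at an even trial with probability qp/(1 - q^2) = q/(1 + q). A sketch checking this against the two examples, and against a partial sum of the underlying geometric series:

```python
from fractions import Fraction as F

def p_first_success_odd(p):
    # Sum over n = 1, 3, 5, ... of q^(n-1) p  =  p / (1 - q^2).
    q = 1 - p
    return p / (1 - q * q)

# Example 2.11.1: first head at an EVEN flip has probability q/(1 + q).
p = F(1, 3); q = 1 - p
assert 1 - p_first_success_odd(p) == q / (1 + q)

# Example 2.11.2: first six at an ODD roll.
six_odd = p_first_success_odd(F(1, 6))

# Partial-sum check of the series for the die.
partial = sum(F(5, 6)**(n - 1) * F(1, 6) for n in range(1, 60, 2))
print(six_odd, float(six_odd - partial))   # 6/11, plus a tiny remainder
```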
Example 2.11.3: Huygens' problem. Two players take turns at rolling dice; they each
need a different score to win. If they do not roll the required score, play continues. At
each of their attempts A wins with probability α, whereas B wins with probability β. What
is the probability that A wins if he rolls first? What is it if he rolls second?
Solution. Let p1 be the probability that A wins when he has the first roll, and p2 the
probability that A wins when B has the first roll. By conditioning on the outcome of the
first roll we see that, when A is first,
p1 = α + (1 - α)p2.
When B is first, conditioning on the first roll gives
p2 = (1 - β)p1.
Hence solving this pair gives
p1 = α/{1 - (1 - α)(1 - β)}
and
p2 = α(1 - β)/{1 - (1 - α)(1 - β)}. ∎
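With concrete numbers the pair of equations is easy to check. Taking α = β = 1/6 (each player needs a six, say) recovers the answer of example 2.11.2 for the first player; a sketch:

```python
from fractions import Fraction as F

def huygens(alpha, beta):
    # p1 = alpha + (1 - alpha) p2  and  p2 = (1 - beta) p1.
    denom = 1 - (1 - alpha) * (1 - beta)
    p1 = alpha / denom
    p2 = alpha * (1 - beta) / denom
    # The closed forms must satisfy the original pair of equations.
    assert p1 == alpha + (1 - alpha) * p2
    assert p2 == (1 - beta) * p1
    return p1, p2

p1, p2 = huygens(F(1, 6), F(1, 6))
print(p1, p2)   # 6/11 5/11: going first is an advantage
```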
Example 2.11.4: Huygens' problem again. Two coins, A and B, show heads with
respective probabilities α and β. They are flipped alternately, giving ABABAB . . .. Find
the probability of the event E that A is first to show a head.
Example 2.11.5: deuce. Rod and Fred are playing a game of tennis, and the game
stands at deuce. Rod wins any point with probability p, independently of any other point.
What is the probability that he wins the game?
The possible progress of the game is made clearer by the tree diagram in Figure 2.22.
Clearly after an odd number of points either the game is over, or some player has the
advantage. After an even number, either the game is over or it is deuce.
Method II. The tree diagram suggests an alternative approach. Let α be the probability
that Rod wins the game eventually given he has the advantage, and β the probability that
Rod wins the game eventually given that Fred has the advantage.
Further, let R be the event that Rod wins the game and Wi be the event that he wins the ith
point. Then, by the partition rule,
P(R) = P(R|W1 ∩ W2)P(W1 ∩ W2)
+ P(R|W1^c ∩ W2^c)P(W1^c ∩ W2^c)
+ P(R|(W1 ∩ W2^c) ∪ (W1^c ∩ W2))P((W1 ∩ W2^c) ∪ (W1^c ∩ W2))
= p^2 + 0 + 2p(1 - p)P(R).
This is the same as we obtained by the first method. ∎
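Solving P(R) = p^2 + 2p(1 - p)P(R) gives P(R) = p^2/{1 - 2p(1 - p)} = p^2/(p^2 + q^2). A sketch that also cross-checks the answer against the advantage probabilities of Method II:

```python
from fractions import Fraction as F

def p_rod_wins(p):
    # From P(R) = p^2 + 2 p (1 - p) P(R):
    q = 1 - p
    P = p * p / (1 - 2 * p * q)
    assert P == p * p / (p * p + q * q)   # same thing: 1 - 2pq = p^2 + q^2
    # Method II cross-check: alpha (advantage Rod), beta (advantage Fred).
    alpha = p + q * P     # win the point, or return to deuce
    beta = p * P          # must win the point, then it is deuce again
    assert P == p * alpha + q * beta
    return P

print(p_rod_wins(F(3, 5)))   # 9/13
```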
Example 2.11.6. Three players, known as A, B, and C, roll a die repeatedly in the
order ABCABCA . . .. The first to roll a six is the winner; find their respective probabilities
of winning.
Figure 2.22. The game from deuce: each point leads to advantage Rod (probability p) or advantage
Fred (probability 1 - p), and thence either to the game being won or back to deuce.
Solution. Let α, β, and γ be the probabilities that A wins when A is, respectively,
first, second, and third to roll. If A fails to roll a six at his first attempt,
the rolls are in the order BCABCA . . . and A is third to roll. Hence, starting from this
point, the probability that A wins is now γ, and we have that
α = 1/6 + (5/6)γ.
Applying a similar argument to the sequence of rolls beginning with B, we find
γ = (5/6)β,
because B must fail to roll a six for A to have a chance of winning, and then the sequence
takes the form CABCAB . . ., in which A is second, with probability β of winning.
Applying the same argument to the sequence of rolls beginning with C yields
β = (5/6)α,
because C must fail to roll a six, and then A is back in first place. Solving these three
equations gives
α = 36/91, β = 30/91, γ = 25/91. ∎
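The three linear equations can be solved mechanically; a sketch with exact fractions (α, β, γ being A's winning probabilities when rolling first, second, and third):

```python
from fractions import Fraction as F

s, f = F(1, 6), F(5, 6)    # chance of rolling a six, and of failing

# alpha = s + f*gamma,  gamma = f*beta,  beta = f*alpha,
# so substituting gives  alpha = s + f^3 * alpha.
alpha = s / (1 - f**3)
beta = f * alpha
gamma = f * beta

print(alpha, beta, gamma)          # 36/91 30/91 25/91
assert alpha + beta + gamma == 1   # from the start, B and C see the same
                                   # game shifted by one and two rolls
```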
Example 2.11.7. A biased coin is flipped repeatedly until the first head is shown.
Find the probability p_n = P(A_n) of the event A_n that n flips are required.
Solution. By the partition rule, and conditioning on the outcome of the first flip,
P(A_n) = P(A_n|H)p + P(A_n|T)q
= p if n = 1, and qP(A_{n-1}) otherwise,
by independence. Hence
p_n = q p_{n-1} = q^2 p_{n-2} = · · · = q^{n-1} p_1 = q^{n-1} p, n ≥ 1. ∎
Of course this result is trivially obvious anyway, but it illustrates the method. Here is a
trickier problem.
Example 2.11.8. A biased coin is flipped up to and including the flip on which it has
first shown two successive tails. Let A_n be the event that n flips are required. Show that, if
p_n = P(A_n), p_n satisfies
p_n = p p_{n-1} + pq p_{n-2}, n > 2.
Solution. As usual we devise a partition; in this case H, TH, TT are three appropriate
disjoint events. Then, for n > 2,
p_n = P(A_n|H)p + P(A_n|TH)qp + P(A_n|TT)q^2 = p p_{n-1} + pq p_{n-2},
since given H the flips start afresh with n - 1 more required, given TH with n - 2 more
required, and given TT the experiment has already stopped at the second flip. ∎
Next we turn to a problem that was considered (and solved) by many 18th century
probabilists, and later generalized by Laplace and others. It arose in Paris with the rather
shadowy figure of a Mr Waldegrave, a friend of Montmort. He is described as an English
gentleman, and proposed the problem to Montmort sometime before 1711. de Moivre
studied the same problem in 1711 in his first book on probability. It seems unlikely that
these events were independent; there is no record of Waldegrave visiting the same coffee
house as de Moivre, but this seems a very likely connection. (de Moivre particularly
favoured Slaughter's coffee house, in St Martin's Lane.) de Moivre also worked as a
mathematics tutor to the sons of the wealthy, so an alternative hypothesis is that
Waldegrave was a pupil or a parent.
The problem is as follows.
The first round is decided by rolling a die; if it is even A0 wins, if it is odd A1 wins.
All following rounds are decided by flipping a coin. If it shows heads the challenger
wins, if it shows tails the challenged wins. Now it is easy to see that if the coin shows
n - 1 consecutive heads then the game is over. Also, the game can only finish when this
occurs. Hence the first round does not count towards this, and so the required result is
given by the probability p_r that the coin first shows n - 1 consecutive heads at the
(r - 1)th flip. But this is a problem we know how to solve; it is just an extension of
example 2.11.8.
First we note that (using an obvious notation) the following is a partition of the sample
space:
{T, HT, H^2 T, . . . , H^{n-2} T, H^{n-1}}.
Using conditional probability and independence of flips, this gives
(3) p_r = (1/2)p_{r-1} + (1/2)^2 p_{r-2} + · · · + (1/2)^{n-1} p_{r-n+1}, r > n,
with
p_n = (1/2)^{n-1}
and
p_1 = p_2 = · · · = p_{n-1} = 0.
In particular, when n = 3, (3) becomes
(4) p_r = (1/2)p_{r-1} + (1/2)^2 p_{r-2}, r ≥ 4.
Solving this constitutes problem 26 in section 2.16. ∎
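For n = 3, recursion (4) is easy to iterate, and it can be checked against brute-force enumeration of coin sequences (here p_r is the probability that the first run of two consecutive heads ends at flip r - 1):

```python
from fractions import Fraction as F
from itertools import product

def first_hh_at(seq):
    # 1-based flip index at which HH first ends, or None.
    for i in range(1, len(seq)):
        if seq[i - 1] == 'H' and seq[i] == 'H':
            return i + 1
    return None

R = 12
p = {1: F(0), 2: F(0), 3: F(1, 4)}   # p_3 = (1/2)^2; p_1 = p_2 = 0
for r in range(4, R + 1):
    p[r] = F(1, 2) * p[r - 1] + F(1, 4) * p[r - 2]

# Brute force: P(first HH ends at flip m), for m up to R - 1.
for m in range(2, R):
    count = sum(1 for seq in product('HT', repeat=m) if first_hh_at(seq) == m)
    assert p[m + 1] == F(count, 2**m)

print([str(p[r]) for r in range(3, 8)])
```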
2.12 ODDS
. . . and this particular season the guys who play the horses are being murdered by
the bookies all over the country, and are in terrible distress. . . . But personally I
consider all horse players more or less daffy anyway. In fact, the way I see it, if a
guy is not daffy he will not be playing the horses.
Damon Runyon, Dream Street Rose
Occasionally, statements about probability are made in terms of odds. This is universally
true of bookmakers who talk of `long odds', `100-1 odds', `the 2-1 on favourite', and so
on. Many of these phrases and customs are also used colloquially, so it is as well to make
it clear what all this has to do with our theory of probability.
In dealing with these ideas we must distinguish very carefully between fair odds and
bookmakers' payoff odds. These are not the same. First, we define fair odds.
Definition. If an event A has probability P(A), then the fair odds against A are
(1) a(A) = {1 - P(A)}/P(A) = {1 - P(A)} : P(A),
and the fair odds on A are
(2) o(A) = P(A)/{1 - P(A)} = P(A) : {1 - P(A)}.
The ratio notation on the right is often used for odds.
For example, for a fair coin the odds on and against a head are
o(H) = (1/2)/(1/2) = a(H) = 1 : 1.
These are equal, so these odds are said to be evens. If a die is rolled, the odds on and
against a six are
o(6) = (1/6)/(1 - 1/6) = 1 : 5,
a(6) = (1 - 1/6)/(1/6) = 5 : 1.
You should note that journalists and reporters (on the principle that ignorance is bliss)
will often refer to `the odds on A', when in fact they intend to state the odds against A.
Be careful.
Now although the fair odds against a head when you flip a coin are 1:1, no bookmaker
would pay out at evens for a bet on heads. The reason is that in the long run she would
pay out just as much in winnings as she would take from losers. Nevertheless, bookmakers
and casinos offer odds; where do they come from? First let us consider casino odds.
When a casino offers odds of 35 to 1 against an event A, it means that if you stake $1
and then A occurs, you will get your stake back plus $35. If A^c occurs then you forfeit
your stake. For this reason such odds are often called payoff odds. How are they fixed?
In fact, 35:1 is exactly the payoff odds for the event that a single number you select
comes up at roulette. In the American roulette wheel there are 38 compartments. In a
well-made wheel they should be equally likely, by symmetry, so the chance that your
number comes up is 1/38.
Now, as we have discussed above, if you get $d with probability P(A) and otherwise
you get nothing, then $P(A)d is the value of this offer to you.
We say that a bet is fair if the value of your return is equal to the value of your stake.
To explain this terminology, suppose you bet $1 at the fair odds given in (1) against A.
You get
$1 + ${1 - P(A)}/P(A) = $1/P(A)
if A occurs, and this return has value P(A) × $1/P(A) = $1, which is indeed your stake.
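The arithmetic is easy to make concrete: betting $1 at fair odds has value exactly equal to the stake, while at the casino's 35:1 roulette payoff with P(A) = 1/38 the value falls short. A sketch:

```python
from fractions import Fraction as F

def bet_value(p, payoff_odds):
    # Stake $1 at odds-against of payoff_odds : 1.
    # With probability p you get back your stake plus the payoff.
    return p * (1 + payoff_odds)

p = F(1, 38)                       # single number at American roulette
fair = bet_value(p, (1 - p) / p)   # fair odds against, from (1)
casino = bet_value(p, 35)          # the quoted 35:1 payoff odds

print(fair, casino)   # 1 18/19: the casino keeps 1/19 of each $1 staked
```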
Then the Tote payoff odds for the jth horse are quoted as
(4) a(j) = (1 - p_j)/p_j,
where
p_j = b_j/{(1 - t)b}
for some positive number t, less than 1.
What does all this mean? For those who together bet a total of $b_j on the jth horse the
total payoff if it wins is
(5) b_j{1 + (1 - p_j)/p_j} = b_j/p_j = (1 - t)b = b - tb,
which is $tb less than the total stake, and independent of j. That is to say, the bookmaker
will enjoy a profit of $tb, the `take', no matter which horse wins. (Bets on places and
other events are treated in a similar but slightly more complicated way.)
Now suppose that the actual probability that the jth horse will win the race is h_j. (Of
course we can never know this probability.) Then the value to the gamblers of their bets
on this horse is h_j(1 - t)b, and the main point of betting on horse races is that this may
be greater than b_j. But usually it will not be.
It is clear that you should avoid using payoff odds (unless you are a bookmaker). You
should also avoid using fair odds, as the following example illustrates.
Example 2.12.1. Find the odds on A ∩ B in terms of the odds on A and the odds on
B, when A and B are independent.
Finally we note that when statisticians refer to an `odds ratio', they mean a quantity
such as
R(A:B) = P(A)/P(A^c) : P(B)/P(B^c).
More loosely, people occasionally call any quotient of the form P(A)/P(B) an odds ratio.
Be careful.
2.13 POPULAR PARADOXES
Probability is the only branch of mathematics in which good mathematicians
frequently get results which are entirely wrong.
C. S. Peirce
This section contains a variety of material that, for one reason or another, seems best
placed at the end of the chapter. It comprises a collection of `paradoxes', which
probability supplies in seemingly inexhaustible numbers. These could have been included
earlier, but the subject is sufficiently challenging even when not paradoxical; it seems
unreasonable for the beginner to be asked to deal with gratuitously tricky ideas as well.
They are not really paradoxical, merely examples of confused thinking, but, as a by-now
experienced probabilist, you may find them entertaining. Many of them arise from false
applications of Bayes' rule and conditioning. You can now use these routinely and
appropriately, of course, but in the hands of amateurs, Bayes' rule is deadly.
Probability has always attracted more than its fair share of disputes in the popular
press; and several of the hardier perennials continue to enjoy a zombie-like existence on
the internet (or web). One may speculate about the reasons for this; it may be no more
than the fact that anyone can roll dice, or pick numbers, but rather fewer take the trouble
to get the algebra right. At any rate we can see that, from the very beginning of the
subject, amateurs were very reluctant to believe what the mathematicians told them. We
observe Pepys badgering Newton, de Mere pestering Pascal, and so on. Recall the words
of de Moivre: `Some of the problems about chance having a great appearance of
simplicity, the mind is easily drawn into a belief that their solution may be attained by the
mere strength of natural good sense; which generally proves otherwise . . .'; so still today.
In the following examples `Solution' denotes a false argument, and Resolution or
Solution denotes a true argument.
Most of the early paradoxes arose through confusion and ignorance on the part of
non-mathematicians. One of the first mathematicians who chose to construct paradoxes was
Lewis Carroll. When unable to sleep, he was in the habit of solving mathematical
problems in his head (that is to say, without writing anything); he did this, as he put it, `as
a remedy for the harassing thoughts that are apt to invade a wholly unoccupied mind'.
The following was resolved on the night of 8 September 1887.
Carroll's paradox. A bag contains two counters, as to which nothing is known except
that each is either black or white. Show that one is black and the other white.
`Solution'. With an obvious notation, since colours are equally likely, the possibilities
have the following distribution:
P(BB) = P(WW) = 1/4, P(BW) = 1/2.
Now add a black counter to the bag, then shake the bag, and pick a counter at random.
What is the probability that it is black? By conditioning on the three possibilities we have
P(B) = 1 × P(BBB) + (2/3) × P(BWB) + (1/3) × P(WWB)
= 1 × (1/4) + (2/3) × (1/2) + (1/3) × (1/4) = 2/3.
But if a bag contains three counters, and the chance of drawing a black counter is 2/3, then
there must be two black counters and one white counter, by symmetry. Therefore, before
we added the black counter, the bag contained BW, viz., one black and one white.
Resolution. The two experiments, and hence the two sample spaces, are different.
The fact that an event has the same probability in two experiments cannot be used to
deduce that the sample spaces are the same. And in any case, if the argument were valid,
and you applied it to a bag with one counter in it, you would nd that the counter had to
be half white and half black, that is to say, random, which is what we knew already. ∎
Galton's paradox (1894). Suppose you flip three fair coins. At least two are alike,
and it is an evens chance whether the third is a head or a tail, so the chance that all three
are the same is 1/2.
Solution. In fact
P(all same) = P(TTT) + P(HHH) = 1/8 + 1/8 = 1/4.
What is wrong?
Resolution. Again this paradox arises from fudging the sample space. This `third'
coin is not identified initially in Ω; it is determined by the others. The chance whether the
`third' is a head or a tail is a conditional probability, not an unconditional probability.
Easy calculations show that
P(3rd is H|HH) = 1/4 and P(3rd is T|HH) = 3/4, where HH denotes the event that there
are at least two heads;
P(3rd is T|TT) = 1/4 and P(3rd is H|TT) = 3/4, where TT denotes the event that there
are at least two tails.
In no circumstances therefore is it true that it is an evens chance whether the `third' is a
head or a tail; the argument collapses. ∎
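The conditional probabilities are easily confirmed by listing the eight outcomes; a sketch:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product('HT', repeat=3))    # 8 equally likely outcomes

all_same = [o for o in outcomes if len(set(o)) == 1]
assert Fraction(len(all_same), len(outcomes)) == Fraction(1, 4)

# HH = at least two heads; the "third" coin is the odd one out,
# so "3rd is H" given HH means all three are heads.
hh = [o for o in outcomes if o.count('H') >= 2]
p_third_H = Fraction(sum(1 for o in hh if o.count('H') == 3), len(hh))
print(p_third_H)   # 1/4, not the claimed 1/2
```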
Bertrand's other paradox. There are three boxes. One contains two black counters,
one contains two white counters and one contains a black and a white counter. Pick a box
at random and remove a counter without looking at it; it is equally likely to be black or
white. The other counter is equally likely to be black or white. Therefore the chance that
your box contains identical counters is 1/2. But this is clearly false: the correct answer is 2/3.
Resolution. This is very similar to Galton's paradox. Having picked a box and
counter, the probability that the other counter is the same is a conditional probability, not
an unconditional probability. Thus easy calculations give (with an obvious notation)
(1) P(both black|B) = 2/3 = P(both white|W);
in neither case is it true that the other counter is equally likely to be black or white. ∎
Simpson's paradox. A famous clinical trial compared two methods of treating kidney
stones, either by surgery or nephrolithotomy; we denote these by S and N respectively. In
all, 700 patients were treated, 350 by S and 350 by N. Then it was found that the cure
rates were
P(cure|S) = 273/350 ≈ 0.78,
P(cure|N) = 289/350 ≈ 0.83.
Surgery seems to have an inferior rate of success at cures. However, the size of the stones
removed was also recorded in two categories:
L: diameter more than 2 cm,
T: diameter less than 2 cm.
When patients were grouped by stone size as well as treatment, the following results
emerged:
P(cure|S ∩ T) ≈ 0.93, P(cure|N ∩ T) ≈ 0.87,
and
P(cure|S ∩ L) ≈ 0.73, P(cure|N ∩ L) ≈ 0.69.
In both these cases surgery has the better success rate; but when the data are pooled to
ignore stone size, surgery has an inferior success rate. This seems paradoxical, which is
why it is known as Simpson's paradox. However, it is a perfectly reasonable property of a
probability distribution, and occurs regularly. Thus it is not a paradox.
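Counts consistent with the quoted rates (they are the ones usually reported for this trial; treat them here as illustrative) make the reversal explicit; a sketch:

```python
from fractions import Fraction as F

# (cured, treated) counts consistent with the rates quoted in the text.
data = {
    ('S', 'T'): (81, 87),    ('N', 'T'): (234, 270),   # small stones
    ('S', 'L'): (192, 263),  ('N', 'L'): (55, 80),     # large stones
}

def rate(treatment, size=None):
    cured = sum(c for (t, s), (c, n) in data.items()
                if t == treatment and size in (None, s))
    total = sum(n for (t, s), (c, n) in data.items()
                if t == treatment and size in (None, s))
    return F(cured, total)

# Surgery wins within each stratum...
assert rate('S', 'T') > rate('N', 'T') and rate('S', 'L') > rate('N', 'L')
# ...yet loses when the strata are pooled: Simpson's paradox.
assert rate('S') < rate('N')
print(float(rate('S')), float(rate('N')))
```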
Another famous example arose in connection with the admission of graduates to the
University of California at Berkeley. Women in fact had a better chance than men of
being admitted to individual faculties, but when the figures were pooled they seemed to
have a smaller chance. This situation arose because women applied in much greater
numbers to faculties where everyone had a slim chance of admission. Men tended to
apply to faculties where everyone had a good chance of admission. ∎
The switching paradox: goats and cars, the Monty Hall problem. Television has
dramatically expanded the frontiers of inanity, so you are not too surprised to be faced
with the following decision. There are three doors; behind one there is a costly car,
behind two there are cheap (non-pedigree) goats. You will win whatever is behind the
door you finally choose. You make a first choice, but the presenter does not open this
door, but a different one (revealing a goat), and asks you if you would like to change your
choice to the final unopened door that you did not choose at first. Should you accept this
offer to switch? Or to put it another way: what is the probability that the car is behind
your first choice compared to the probability that it lies behind this possible fresh choice?
Answer. The blunt answer is that you cannot calculate this probability as the
question stands. You can only produce an answer if you assume that you know how the
presenter is running the show. Many people nd this unsatisfactory, but it is important
to realize why it is the unpalatable truth. We discuss this later; first we show why there
is no one answer.
I The `usual' solution. The usual approach assumes that the presenter is attempting
to make the `game' longer and less dull. He is therefore assumed to be behaving as follows.
Rules. Whatever your first choice, he will show you a goat behind a different door;
with a choice of two goats he picks either at random.
Let the event that the car is behind the door you chose first be C_f, let the event that the
car is behind your alternative choice be C_a, and let the event that the host shows you a
goat be G. We require P(C_a|G), and of course we assume that initially the car is equally
likely to be anywhere. Call your first choice D1, the presenter's open door D2, and the
alternative door D3. Then
(2) P(C_a|G) = P(C_a ∩ G)/P(G)
= P(G|C_a)P(C_a)/{P(G|C_f)P(C_f) + P(G|C_a)P(C_a)}
= P(G|C_a)/{P(G|C_f) + P(G|C_a)},
because P(C_a) = P(C_f), by assumption.
Now by the presenter's rules
P(G|C_a) = 1,
because he must open D2 to show you the goat behind it. However,
P(G|C_f) = 1/2,
because there are two goats to choose from, behind D2 and D3, and he opens D2 with
probability 1/2. Hence
P(C_a|G) = 1/(1 + 1/2) = 2/3.
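Under these rules the conditional probability can also be found by bare enumeration of the presenter's possible choices, without invoking Bayes' rule; a sketch (door 1 is your first choice, and we condition on door 2 being the one he opens):

```python
from fractions import Fraction as F

# Weight of each (car position, door opened) pair under the usual rules:
# the car is uniform on doors 1-3; the presenter opens a goat door
# other than door 1, choosing at random when the car is behind door 1.
weights = {}
for car in (1, 2, 3):
    goats = [d for d in (2, 3) if d != car]
    for opened in goats:
        weights[(car, opened)] = F(1, 3) * F(1, len(goats))

# Condition on the observed event G: the presenter opened door 2.
g = {k: w for k, w in weights.items() if k[1] == 2}
p_g = sum(g.values())
p_switch_wins = sum(w for (car, _), w in g.items() if car == 3) / p_g
p_stick_wins = sum(w for (car, _), w in g.items() if car == 1) / p_g

print(p_switch_wins, p_stick_wins)   # 2/3 1/3
```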
III The `mafia' solution. There are other possible assumptions; here is a very realistic
set-up. Unknown to the producer, you and the presenter are members of the same family.
If the car is behind D1, he opens the door for you; if the car is behind D2 or D3, he opens
the other door concealing a goat. You then choose the alternative because obviously
P(C_a|G) = 1. ∎
Remark. This problem is also sometimes known as the Monty Hall problem, after the
presenter of a programme that required this type of decision from participants. It
appeared in this form in Parade magazine, and generated a great deal of publicity and
follow-up articles. It had, however, been around in many other forms for many years
before that.
Of course this is a trivial problem, albeit entertaining, but it is important. This
importance lies in the lesson that, in any experiment, the procedures and rules that define
the sample space and all the probabilities must be explicit and fixed before you begin.
This predetermined structure is called a protocol. Embarking on experiments without a
complete protocol has proved to be an extremely convenient method of faking results
over the years. And will no doubt continue to be so.
There are many more `paradoxes' in probability. As we have seen, few of them are
genuinely paradoxical. For the most part such results attract fame simply because
someone once made a conspicuous error, or because the answer to some problem is
contrary to uninformed intuition. It is notable that many such errors arise from an
incorrect use of Bayes' rule, despite the fact that as long ago as 1957, W. Feller wrote this
warning:
Unfortunately Bayes' rule has been somewhat discredited by metaphysical applica-
tions of the type described by Laplace. In routine practice this kind of argument can
be dangerous . . . . Plato used this type of argument to prove the existence of
Atlantis, and philosophers used it to prove the absurdity of Newtonian mechanics.
Of course Atlantis never existed, and Newtonian mechanics are not absurd. But despite
all this experience, the popular press and even, sometimes, learned journals continue to
print a variety of these bogus arguments in one form or another.
2. Goats and cars revisited. The `incompetent' solution. Due to a combination of indolence
and incompetence the presenter has failed to find out which door the car is actually behind. So
when you choose the first door, he picks another at random and opens it (hoping it does not
conceal the car). Show that in this case P(C_a|G) = 1/2.
2.14 REVIEW: NOTATION AND RULES
In this chapter we have used our intuitive ideas about probability to formulate rules that
probability must satisfy in general. We have introduced some simple standard notation to
help us in these tasks; we summarize the notation and rules here.
I Notation
Ω: sample space of outcomes
A, B, C, . . .: possible events included in Ω
∅: impossible event
P(·): the probability function
P(A): the probability that A occurs
A ∪ B: union; either A or B occurs or both occur
A ∩ B: intersection; both A and B occur
A^c: complementary event
A ⊆ B: inclusion; B occurs if A occurs
A\B: difference; A occurs and B does not
II Rules
Range: 0 < P(A) < 1
Impossible event: P() 0
Certain event: P() 1
Addition: P(A [ B) P(A) P(B) when A \ B
P
Countable addition: P([ i A i ) i P(A i ) when (A i ; i > 1) are disjoint events
Inclusionexclusion: P(A [ B) P(A) P(B) P(A \ B)
Complement: P(A c ) 1 P(A)
Difference: when B A, P(AnB) P(A) P(B)
Conditioning: P(AjB) P(A \ B)=P(B)
Addition: P(A [ BjC) P(AjC) P(BjC) when A \ C and B \ C are disjoint
Multiplication: P(A \ B \ C) P(AjB \ C)P(BjC)P(C)
P
The partition rule: P(A) i P(AjBi )P(Bi ) when (Bi ; i > 1) are disjoint events, and
A [ i Bi
Bayes' rule: P(Bi jA) P(AjBi )P(Bi )=P(A)
Independence: A and B are independent if and only if P(A \ B) P(A)P(B)
This is equivalent to P(AjB) P(A) and to P(BjA) P(B)
Conditional independence: A and B are conditionally independent given C when
P(A \ BjC) P(AjC)P(BjC)
Value and expected value: If an experiment yields the numerical outcome a with
probability p, or zero otherwise, then its value (or expected value) is ap
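These rules can all be checked mechanically in a small finite sample space by counting. Here is a short Python sketch of that idea (our addition; the two-dice sample space and the events A, B are illustrative choices):

```python
from fractions import Fraction
from itertools import product

# Sample space Omega: two fair dice; probability of an event by counting.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == 6          # first die shows a six
B = lambda w: w[0] + w[1] == 7   # the total is seven

# inclusion-exclusion: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
p_union = prob(lambda w: A(w) or B(w))
assert p_union == prob(A) + prob(B) - prob(lambda w: A(w) and B(w))

# conditioning: P(A|B) = P(A ∩ B)/P(B)
p_A_given_B = prob(lambda w: A(w) and B(w)) / prob(B)

# here A and B happen to be independent: P(A ∩ B) = P(A)P(B)
assert prob(lambda w: A(w) and B(w)) == prob(A) * prob(B)
print(p_A_given_B)  # 1/6
```

Exact rational arithmetic (the `Fraction` type) makes the identities hold exactly, not merely to rounding error.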
2.15 APPENDIX. DIFFERENCE EQUATIONS
On a number of occasions above, we have used conditional probability and independence to show
that the answer to some problem of interest is the solution of a difference equation. For example, in
example 2.11.7 we considered
(1) p_n = q p_{n−1},
in example 2.11.8 we derived
(2) p_n = p p_{n−1} + pq p_{n−2},   pq ≠ 0,
and in exercise 1 at the end of section 2.11 you derived
(3) p_n = (q − p) p_{n−1} + p.
We need to solve such equations systematically. Note that any sequence (x_r; r ≥ 0) in which each
term is a function of its predecessors, so that
(4) x_{r+k} = f(x_r, x_{r+1}, . . . , x_{r+k−1}),   r ≥ 0,
is said to satisfy the recurrence relation (4). When f is linear this is called a difference equation of
order k:
(5) x_{r+k} = a_0 x_r + a_1 x_{r+1} + ⋯ + a_{k−1} x_{r+k−1} + g(r),   a_0 ≠ 0.
When g(r) = 0, the equation is homogeneous:
(6) x_{r+k} = a_0 x_r + a_1 x_{r+1} + ⋯ + a_{k−1} x_{r+k−1},   a_0 ≠ 0.
Solving (1) is easy because p_{n−1} = q p_{n−2}, p_{n−2} = q p_{n−3} and so on. By successive substitution we
obtain
p_n = q^n p_0.
Solving (3) is nearly as easy when we notice that
p_n = 1/2
is a particular solution. Now writing p_n = 1/2 + x_n gives
x_n = (q − p) x_{n−1} = (q − p)^n x_0.
Hence
p_n = 1/2 + (q − p)^n x_0.
Equation (2) is not so easy but, after some work which we omit, it turns out that (2) has solution
(7) p_n = c_1 λ_1^n + c_2 λ_2^n
where λ_1 and λ_2 are the roots of
x² − px − pq = 0
and c_1 and c_2 are arbitrary constants. You can verify this by substituting (7) into (2).
Having seen these preliminary results, you will not now be surprised to see the general solution to
the second-order difference equation: let
(8) x_{r+2} = a_0 x_r + a_1 x_{r+1} + g(r),   r ≥ 0.
Suppose that π(r) is any function such that
π(r + 2) = a_0 π(r) + a_1 π(r + 1) + g(r)
and suppose that λ_1 and λ_2 are the roots of
x² = a_0 + a_1 x.
Then the solution of (8) is given by
x_r = c_1 λ_1^r + c_2 λ_2^r + π(r),   λ_1 ≠ λ_2,
x_r = (c_1 + c_2 r) λ_1^r + π(r),   λ_1 = λ_2,
where c_1 and c_2 are arbitrary constants. Here π(r) is called a particular solution, and you should
note that λ_1 and λ_2 may be complex, as then may c_1 and c_2.
The solution of higher-order difference equations proceeds along similar lines; there are more λ's
and more c's.
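The relation between the recurrence and its closed form c_1 λ_1^r + c_2 λ_2^r is easy to test numerically. The following Python sketch does this for the homogeneous case of equation (2); it is our illustration, and the values p = q = 1/2 and the initial conditions p_0 = 1, p_1 = p are arbitrary choices.

```python
# Equation (2): x_n = p*x_{n-1} + p*q*x_{n-2}; lambda_1, lambda_2 are the
# roots of x^2 - p x - p q = 0.
p, q = 0.5, 0.5
disc = (p ** 2 + 4 * p * q) ** 0.5
l1, l2 = (p + disc) / 2, (p - disc) / 2

# Iterate the recurrence from (arbitrary) initial values x_0 = 1, x_1 = p.
x = [1.0, p]
for n in range(2, 10):
    x.append(p * x[n - 1] + p * q * x[n - 2])

# Fit c_1, c_2 from the two initial conditions, then compare term by term.
c2 = (x[1] - l1 * x[0]) / (l2 - l1)
c1 = x[0] - c2
closed = [c1 * l1 ** n + c2 * l2 ** n for n in range(10)]
assert all(abs(a - b) < 1e-12 for a, b in zip(x, closed))
print(x[9])
```

Two initial values determine c_1 and c_2, just as two boundary conditions pin down the arbitrary constants in the text.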
2.16 PROBLEMS
1. The classic slot machine has three wheels each marked with 20 symbols. You rotate the wheels
by means of a lever, and win if each wheel shows a bell when it stops. Assume that the outside
wheels each have one bell symbol, the central wheel carries 10 bells, and that wheels are
independently equally likely to show any of the symbols (academic licence). Find:
(a) the probability of getting exactly two bells;
(b) the probability of getting three bells.
2. You deal two cards from a conventional pack. What is the probability that their sum is 21?
(Court cards count 10, and aces 11.)
3. You deal yourself two cards, and your opponent two cards. Your opponent reveals that the sum
of those two cards is 21; what is the probability that the sum of your two cards is 21? What is
the probability that you both have 21?
4. A weather forecaster says that the probability of rain on Saturday is 25%, and the probability of
rain on Sunday is 25%. Can you say the chance of rain at the weekend is 50%? What can you
say?
5. My lucky number is 3, and your lucky number is 7. Your PIN is equally likely to be any
number between 1001 and 9998. What is the probability that it is divisible by at least one of
our two lucky numbers?
6. You keep rolling a die until you first roll a number that you have rolled before. Let A_k be the
event that this happens on the kth roll.
(a) What is P(A_12)? (b) Find P(A_3) and P(A_6).
7. Ann aims three darts at the bullseye and Bob aims one. What is the probability that Bob's dart
is nearest the bull? Given that one of Ann's darts is nearest, what is the probability that Bob's
dart is next nearest? (They are equally skilful.)
8. In the lottery of 1710, one in every 40 tickets yielded a prize. It was widely believed at the
time that you needed to buy 40 tickets at least, to have a better than evens chance of a prize.
Was this belief correct?
9. (a) You have two red cards and two black cards. Two cards are picked at random; show that
the probability that they are the same colour is 1/3.
(b) You have one red card and two black cards; show that the probability that two cards
picked at random are the same colour is 1/3. Are you surprised?
(c) Calculate this probability when you have
(i) three red cards and three black cards, (ii) two red cards and three black cards.
10. A box contains three red socks and two blue socks. You remove socks at random one by one
until you have a pair. Let T be the event that you need only two removals, R the event that the
first sock is red and B the event that the first sock is blue. Find
(a) P(B|T), (b) P(R|T), (c) P(T).
11. Let A, B and C be events. Show that
A ∩ B = (A^c ∪ B^c)^c,
and
A ∪ B ∪ C = (A^c ∩ B^c ∩ C^c)^c.
Let A and B be events with P(A) = 3/5 and P(B) = 1/2. Show that
1/10 ≤ P(A ∩ B) ≤ 1/2
and give examples to show that both extremes are possible. Can you find bounds for P(A ∪ B)?
14. Show that if P(A|B) > P(A), then
P(B|A) > P(B) and P(A^c|B) < P(A^c).
15. Show that if A is independent of itself, then either P(A) = 0 or P(A) = 1.
16. A pack contains n cards labelled 1, 2, 3, . . . , n (one number on each card). The cards are dealt
out in random order. What is the probability that
(a) the kth card shows a larger number than its k − 1 predecessors?
(b) each of the first k cards shows a larger number than its predecessors?
(c) the kth card shows n, given that the kth card shows a larger number than its k − 1
predecessors?
17. Show that P(A\B) ≤ P(A).
18. Show that
P(∪_{r=1}^n A_r) = Σ_r P(A_r) − Σ_{r<s} P(A_r ∩ A_s) + ⋯ + (−1)^{n−1} P(∩_{r=1}^n A_r).
Is there a similar formula for P(∩_{r=1}^n A_r)?
19. Show that
P(A ∩ B) − P(A)P(B) = P((A ∪ B)^c) − P(A^c)P(B^c).
20. An urn contains a amber balls and b buff balls. A ball is removed at random.
(a) What is the probability that it is amber?
(b) Whatever colour it is, it is returned to the urn with a further c balls of the same colour as
the first. Then a second ball is drawn at random from the urn. Show that the probability
that it is amber is a/(a + b).
21. In the game of antidarts a player shoots an arrow into a rectangular board measuring six metres
by eight metres. If the arrow is within one metre of the centre it scores 1 point, between one
and two metres away it scores 2, between two and three metres it scores 3, between three and
four metres and yet still on the board it scores 4, and further than four metres but still on the
board it scores 5. William Tell always lands his arrows on the board but otherwise they are
purely random.
(a) Show that the probability that his first arrow scores more than 3 points is 1 − 3π/16.
(b) Find the probability that he scores a total of exactly 4 points in his first two arrows.
(c) Show that the probability that he scores exactly 15 points in three arrows is given by
(1 − (2/3) sin⁻¹(3/4) − (1/8)√7)³.
22. A mole has a network of burrows as shown in figure 2.23. Each night he sleeps at one of the
junctions. Each day he moves to a neighbouring junction but he chooses a passage randomly,
all choices being equally likely from those available at each move.
(a) He starts at A. Find the probability that two nights later he is at B.
(b) Having arrived at B, find the probability that two nights later he is again at B.
(c) A second mole is at C at the same time as the first mole is at A. What is the probability
that two nights later the two moles share the same junction?
23. Three cards in an urn bear pictures of ants and bees; one card has ants on both sides, one
card has bees on both sides, and one has an ant on one side and a bee on the other.
A card is removed at random and placed flat. If the upper face shows a bee, what is the
probability that the other side shows an ant?
24. You pick a card at random from a conventional pack and note its suit. With an obvious notation
define the events
A_1 = S ∪ H, A_2 = S ∪ D, A_3 = S ∪ C.
Show that A_j and A_k are independent when j ≠ k, 1 ≤ j, k ≤ 3.
25. A fair die is rolled repeatedly. Find
(a) the probability that the number of sixes in k rolls is even,
(b) the probability that in k rolls the number of sixes is divisible by 3.
26. Waldegrave's problem, example 2.11.10. Show that, with four players, equation (4) in this
example has the solution
p_r = (1/(2√5)) ((1 + √5)/4)^{r−2} − (1/(2√5)) ((1 − √5)/4)^{r−2}.
27. Karel flips n + 1 fair coins and Newt flips n fair coins. Karel wins if he has more heads than
Newt, otherwise he loses. Show that P(Karel wins) = 1/2.
28. Arkle (A) and Dearg (D) are connected by roads as in figure 2.24. Each road is independently
blocked by snow with probability p. Find the probability that it is possible to travel by road
from A to D.
Funds are available to snow-proof just one road. Would it be better to snow-proof AB or BC?
29. You are lost on Mythy Island in the summer, when tourists are two-thirds of the population. If
you ask a tourist for directions the answer is correct with probability 3/4; answers to repeated
questions are independent even if the question is the same. If you ask a local for directions, the
answer is always false.
(a) You ask a passer-by whether Mythy City is East or West. The answer is East. What is the
probability that it is correct?
(b) You ask her again, and get the same reply. Show that the probability that it is correct is 1/2.
(c) You ask her one more time, and the answer is East again. What is the probability that it is
correct?
(d) You ask her for the fourth and last time and get the answer West. What is the probability
that East is correct?
(e) What if the fourth answer were also East?
30. A bull is equally likely to be anywhere in the square field ABCD, of side 1. Show that the
probability that it is within a distance x from A is
πx²/4,   0 ≤ x ≤ 1,
(x² − 1)^{1/2} + πx²/4 − x² cos⁻¹(1/x),   1 ≤ x ≤ √2.
The bull is now tethered to the corner A by a chain of length 1. Find the probability that it is
nearer to the fence AB than the fence CD.
31. A theatre ticket is in one of three rooms. The event that it is in the ith room is B_i, and the event
that a cursory search of the ith room fails to find the ticket is F_i, where
0 < P(F_i|B_i) < 1.
Show that P(B_i|F_i) < P(B_i), that is to say, if you fail to find it in the ith room on one search,
then it is less likely to be there. Show also that P(B_i|F_j) > P(B_i) for i ≠ j, and interpret this.
32. 10% of the surface of a sphere S is coloured blue, the rest is coloured red. Show that, however
the colours are distributed, it is possible to inscribe a cube in S with 8 red vertices. (Hint: Pick
a cube at random from the set of all possible inscribed cubes, let B(r) be the event that the rth
vertex is blue, and consider the probability that any vertex is blue.)
3 Counting and gambling
It is clear that the enormous variety which can be seen both in nature and in
the actions of mankind, and which makes up the greater part of the beauty of
the universe, arises from the many different ways in which objects are
arranged or chosen. But it often happens that even the cleverest and best-informed
men are guilty of that error of reasoning which logicians call the
insufficient, or incomplete, enumeration of cases.
J. Bernoulli (ca. 1700)
3.1 PREVIEW
We have seen in the previous chapter that many chance experiments have equally likely
outcomes. In these problems many questions can be answered by merely counting the
outcomes in events of interest. Moreover, quite often simple counting turns out to be
useful and effective in more general circumstances.
In the following sections, therefore, we review the basic ideas about how to count
things. We illustrate the theory with several famous examples, including birthday
problems and lottery problems. In particular we solve the celebrated problem of the
points. This problem has the honour of being the first to be solved using modern methods
(by Blaise Pascal in 1654), and therefore marks the official birth of probability. A natural
partner to it is the even more famous gambler's ruin problem. We conclude with a brief
sketch of the history of chance, and some other famous problems.
Prerequisites. You need only the usual basic knowledge of elementary algebra. We
shall often use the standard factorial notation
r! = r(r − 1) × ⋯ × 3 × 2 × 1.
Remember that 0! = 1, by convention.
3.2 FIRST PRINCIPLES
When outcomes are equally likely,
P(A) = |A|/|Ω|
and we `only' have to count the elements of A and Ω. For example, suppose you are dealt
five cards at poker; what is the probability of a full house? You first need the number of
ways of being dealt five cards, assumed equally likely. Next you need the number of such
hands that comprise a full house (three cards of one kind and two of another kind, e.g.
QQQ33). We shall give the answer to this problem shortly; first we remind ourselves of
the basic rules of counting. No doubt you know them informally already, but it can do no
harm to collect them together explicitly here.
The first is obvious but fundamental.
Correspondence rule. Suppose we have two finite sets A and B. Let the numbers of
objects in A and B be |A| and |B| respectively. Then if we can show that each element of
A corresponds to one and only one element of B, and vice versa, then |A| = |B|.
Example 3.2.1. Let A = {11, 12, 13} and B = {♣, ♦, ♠}. Then |A| = |B| = 3. ∎
Example 3.2.2: reflection. Let A be a set of distinct real numbers. Define the set B
such that
B = {−b: b ∈ A}.
Then |A| = |B|. ∎
Example 3.2.3: choosing. Let A be a set of size n. Let c(n, k) be the number of ways
of choosing k of the n elements in A. Then
c(n, k) = c(n, n − k),
because to each choice of k elements there corresponds one and only one choice of the
remaining n − k elements. ∎
Addition rule. Suppose that A and B are disjoint finite sets, so that A ∩ B = ∅. Then
|A ∪ B| = |A| + |B|.
Example 3.2.4: choosing. Let A be a set containing n elements, and recall that
c(n, k) is the number of ways of choosing k of these elements. Show that
(1) c(n, k) = c(n − 1, k) + c(n − 1, k − 1).
Solution. We can label the elements of A as we please; let us label one of them the
first element. Let B be the collection of all subsets of A that contain k elements. This can
be divided into two sets: B(f), in which the first element always appears, and B(f^c), in
which the first element does not appear. Now on the one hand
|B(f)| = c(n − 1, k − 1)
because the first element is guaranteed to be in all these. On the other hand
|B(f^c)| = c(n − 1, k)
because the first element is not in these, and we still have to choose k from the n − 1
remaining. Obviously |B| = c(n, k), by definition. Hence, by the addition rule, (1)
follows. ∎
The addition rule has an obvious extension to the union of several disjoint sets; write
this down yourself (exercise).
The third counting rule will come as no surprise. As we have seen several times in
chapter 2, we often combine simple experiments to obtain more complicated sample
spaces. For example, we may roll several dice, or flip a sequence of coins. In such cases
the following rule is often useful.
Multiplication rule. Let A and B be finite sets, and let C be the set obtained by
choosing any element of A and any element of B. Thus C is the collection of ordered
pairs
C = {(a, b): a ∈ A, b ∈ B}.
Then
(2) |C| = |A||B|.
This rule is often expressed in other words; one may speak of decisions, or operations, or
selections. The idea is obvious in any case. To establish (2) it is sufficient to display all
the elements of C in an array:
(a_1, b_1) . . . (a_1, b_n)
. . .
(a_m, b_1) . . . (a_m, b_n)
Here m = |A| and n = |B|. The rule (2) is now obvious by the addition rule. Again, this
rule has an obvious extension to the product of several sets.
Solution. There are three choices for the first step, then two for the second, then one
for the last. The required number is 3! ∎
For our final rule we consider the problem of counting the elements of A ∪ B, when A
and B are not disjoint. This is given by the inclusion-exclusion rule, as follows.
3.3 ARRANGING AND CHOOSING
Example 3.3.1. You have five books on probability. In how many ways can you
arrange them on your bookshelf?
Solution. Any of the five can go on the left. This leaves four possibilities for the
second book, and so by the multiplication rule there are 5 × 4 = 20 ways to put the first
two on your shelf. That leaves three choices for the third book, yielding 5 × 4 × 3 = 60
ways of shelving the first three. Then there are two possibilities for the penultimate book,
and only one choice for the last book, so there are altogether
5 × 4 × 3 × 2 × 1 = 5! = 120
ways of arranging them.
Incidentally, in the course of showing this we have shown that the number of ways of
arranging a selection of r books, 0 ≤ r ≤ 5, is
5 × 4 × ⋯ × (5 − r + 1) = 5!/(5 − r)!  ∎
It is quite obvious that the same argument works if we seek to arrange a selection of
r things from n things. We can choose the first in n ways, the second in n − 1 ways,
and so on, with the last chosen in n − r + 1 ways. By the product rule, this gives
n(n − 1) ⋯ (n − r + 1) ways in total. We display this result, and note that the conven-
tional term for such an ordering or arrangement is a permutation. (Note also that
algebraists use it differently.)
x^(r) = x!/(x − r)!
In particular, r^(r) = r!
Note that various other notations are used for this, most commonly (x)_r in the general
case, and ^x P_r when x is an integer.
Next we turn to the problem of counting arrangements when the objects in question are
not all distinguishable.
In the above example involving books, we naturally assumed that the books were all
distinct. But suppose that, for whatever strange reason, you happen to have two new
copies of some book. They are unmarked, and therefore indistinguishable. How many
different permutations of all five are possible now? There are in fact 60 different
arrangements. To see this we note that in the 120 arrangements in example 3.3.1 there are
60 pairs in which each member of the pair is obtained by exchanging the positions of the
two identical books. But these pairs are indistinguishable, and therefore the same. So
there are just 60 different permutations.
We can generalize this result as follows. If there are n objects of which n_1 form one
indistinguishable group, n_2 another, and so on up to n_r, where
(3) n_1 + n_2 + ⋯ + n_r = n,
then there are
(4) M(n_1, . . . , n_r) = n!/(n_1! n_2! ⋯ n_r!)
distinct permutations of these n objects. It is easy to prove this, as follows. For each of
the M such arrangements suppose that the objects in each group are then numbered, and
hence distinguished. Then the objects in the first group can now be arranged in n_1! ways,
and so on for all r groups. By the multiplication rule there are hence n_1! n_2! ⋯ n_r! M
permutations. But we already know that this number is n!. Equating these two gives (4).
This argument is simpler than it may appear at first sight; the following example makes
it obvious.
Example 3.3.2. Consider the word `dada'. In this case n = 4, n_1 = n_2 = 2, and (4)
gives
M(2, 2) = 4!/(2!2!) = 6,
as we may verify by exhaustion:
aadd, adad, daad, dada, adda, ddaa.
Now, as described above we can number the a's and d's, and permute these now
distinguishable objects for each of the six cases. Thus aadd yields
a_1 a_2 d_1 d_2,  a_2 a_1 d_1 d_2,  a_1 a_2 d_2 d_1,  a_2 a_1 d_2 d_1,
and likewise for the other five cases. There are therefore 6 × 4 = 24 permutations of 4
objects, as we know already since 4! = 24. ∎
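The exhaustion in this example is exactly the kind of thing a computer does well. Here is a brief Python sketch (our addition) that repeats the count for `dada' using the standard library:

```python
from itertools import permutations
from math import factorial

# All 4! orderings of the letters of 'dada', with duplicates collapsed by
# putting them in a set, should number M(2, 2) = 4!/(2! 2!) = 6.
distinct = set(permutations("dada"))
assert len(distinct) == factorial(4) // (factorial(2) * factorial(2))
print(len(distinct))  # 6
```

The set does the work of identifying the pairs of arrangements that differ only by swapping the two a's or the two d's.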
Once again we interject a brief note on names and notation. The numbers
M(n_1, . . . , n_r) are called multinomial coefficients. An alternative notation is
M(n_1, . . . , n_r) = (n_1 + n_2 + ⋯ + n_r choose n_1, n_2, . . . , n_r).
The most important case, and the one which we see most often, is the binomial coefficient
M(n − r, r) = (n choose r) = n!/((n − r)! r!)
This is also denoted by ^n C_r. We can also write it as
(x choose r) = x^(r)/r!,
which makes sense when x is any real number. For example
(−1 choose r) = (−1)^r.
Binomial coefcients arise very naturally when we count things without regard to their
order, as we shall soon see.
In counting permutations the idea of order is essential. However, it is often the case
that we choose things and pay no particular regard to their order.
Example 3.3.3: quality control. You have a box of numbered components, and you
have to select a fixed quantity (r, say) for testing. If there are n in the box, how many
different selections are possible?
If n = 4 and r = 2, then you can see by exhaustion that from the set {a, b, c, d} you
can pick six pairs, namely
ab, ac, ad, bc, bd, cd. ∎
There are many classical formulations of this basic problem; perhaps the most fre-
quently met is the hand of cards, as follows.
You are dealt a hand of r cards from a pack of n. How many different possible hands
are there? Generally n = 52; for poker r = 5, for bridge r = 13.
The answer is called the number of combinations of r objects from n objects. The key
result is the following.
(5) c(n, r) = (n choose r) = n!/(r!(n − r)!)
First derivation of (5). We know from (1) that the number of permutations of r things
from n things is n^(r). But any permutation can also be fixed by performing two operations:
(i) choose a subset of size r;
(ii) choose an order for the subset.
Suppose that step (i) can be made in c(n, r) ways; this number is what we want to find.
We know step (ii) can be made in r! ways. By the multiplication rule (2) of section 3.2
the product of these two is n^(r), so
(6) c(n, r) r! = n^(r) = n!/(n − r)!
Hence
(7) c(n, r) = (n choose r) = n!/(r!(n − r)!)  ∎
This argument is very similar to that used to establish (4); and this remark suggests an
alternative proof.
Second derivation of (5). Place the n objects in a row, and mark the r selected objects
with the symbol S. Those not selected are marked F. Therefore, by construction, there is
a one-one correspondence between the combinations of r objects from n and the
permutations of r S-symbols and n − r F-symbols. But, by (4), there are
(8) M(r, n − r) = n!/(r!(n − r)!)
permutations of these S- and F-symbols. Hence using the correspondence rule (see the
start of section 3.2) proves (5). ∎
Another useful method of counting a set is to split it up in some useful way. This
supplies another derivation.
Third derivation of (5). As above we denote the number of ways of choosing a subset
of size r from a set of n objects by c(n, r). Now suppose one of the n objects is in
some way distinctive; for definiteness we shall say it is pink. Now there are two distinct
methods of choosing subsets of size r:
(i) include the pink one and choose r − 1 more objects from the remaining n − 1;
(ii) exclude the pink one and choose r of the n − 1 others.
There are c(n − 1, r − 1) ways to choose using method (i), and c(n − 1, r) ways to
choose using method (ii). By the addition rule their sum is c(n, r), which is to say
(9) c(n, r) = c(n − 1, r) + c(n − 1, r − 1).
Of course we always have c(n, 0) = c(n, n) = 1, and it is an easy matter to check that the
solution of (9) is
c(n, r) = (n choose r);
we just plod through a little algebra:
(10) (n choose r) = n!/(r!(n − r)!) = n(n − 1)!/(r(r − 1)!(n − r)(n − r − 1)!)
= [(n − 1)!/((r − 1)!(n − r − 1)!)][1/r + 1/(n − r)]
= (n − 1 choose r) + (n − 1 choose r − 1).  ∎
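The recurrence (9) is easy to confirm by machine for a range of cases; this short Python sketch (our addition, with illustrative bounds) checks it against the built-in binomial coefficient:

```python
from math import comb

# Check c(n, r) = c(n-1, r) + c(n-1, r-1) for 1 <= r < n <= 11.
for n in range(1, 12):
    for r in range(1, n):
        assert comb(n, r) == comb(n - 1, r) + comb(n - 1, r - 1)
print(comb(11, 5))  # 462
```

This is of course the rule that generates Pascal's triangle, which we meet in the next section.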
2. Show that the multinomial coefficient can be written as a product of binomial coefficients:
M(n_1, . . . , n_r) = (s_r choose s_{r−1})(s_{r−1} choose s_{r−2}) ⋯ (s_2 choose s_1)
where s_r = Σ_{i=1}^r n_i.
3. Four children are picked at random (with no replacement) from a family which includes exactly
two boys. The chance that neither boy is chosen is half the chance that both are chosen. How
large is the family?
4. You flip a fair coin n times. What is the probability that
(a) there have been exactly three heads?
(b) there have been at least two heads?
(c) there have been equal numbers of heads and tails?
(d) there have been twice as many tails as heads?
3.4 BINOMIAL COEFFICIENTS AND PASCAL'S TRIANGLE
The binomial coefficients
c(n, r) = (n choose r)
can be simply and memorably displayed as an array. There are of course many ways to
organize such an array; let us place them in the nth row and rth column like this:
0th row → 1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1
. . .
↑
0th column
Example 3.4.1: demonstration of (3). Suppose that you have n people and a van with
k ≤ n seats. In how many ways w can you choose these k travellers, with one driver?
(i) You can choose k to go in (n choose k) ways, and choose one of these k to drive in k ways.
So
w = k (n choose k).
(ii) You can choose k − 1 passengers in
(n choose k − 1)
ways, and then pick the driver in n − (k − 1) ways. So
w = (n − k + 1) (n choose k − 1).
Now (3) follows. ∎
We give one more example of this technique: its use to prove a famous formula.
Example 3.4.2: Van der Monde's formula. Remarkably, it is true that for integers
m, n, and r ≤ m ∧ n,
Σ_k (m choose k)(n choose r − k) = (m + n choose r).
Solution. Suppose there are m men and n women, and you wish to form a team with
r members. In how many distinct ways can this be done? Obviously in
(m + n choose r)
ways if you choose directly from the whole group. But now suppose you choose k men
from those present and r − k women from those present. This may be done in
(m choose k)(n choose r − k)
ways, by the multiplication rule. Now summing over all possible k gives the left-hand
side, by the addition rule. ∎
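Identities like this one invite a numerical spot check. The following Python sketch (ours; the triples m, n, r are arbitrary) verifies the formula directly:

```python
from math import comb

# Van der Monde: sum over k of C(m, k) C(n, r-k) equals C(m+n, r).
# math.comb(n, k) returns 0 when k > n, which handles the edge terms.
for m, n, r in [(3, 4, 2), (5, 5, 5), (6, 2, 2)]:
    lhs = sum(comb(m, k) * comb(n, r - k) for k in range(r + 1))
    assert lhs == comb(m + n, r)
print(comb(10, 5))  # 252
```

The counting argument in the text is mirrored exactly: each term of the sum counts the teams with exactly k men.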
5. Ant. An ant walks on the non-negative plane integer lattice starting at (0, 0). When at (j, k)
it can step either to (j + 1, k) or (j, k + 1). In how many ways can it walk to the point (r, s)?
3.5 CHOICE AND CHANCE
Example 3.5.1: personal identifier numbers. Commonly PINs have four digits. A
computer assigns you a PIN at random. What is the probability that all four are different?
Solution. Conventionally PINs do not begin with zero (though there is no technical
reason why they should not). Therefore, using the multiplication rule,
|Ω| = 9 × 10 × 10 × 10.
Now A is the event that no digit is repeated, so
|A| = 9 × 9 × 8 × 7.
Hence
P(A) = |A|/|Ω| = (9 × 9 × 8 × 7)/(9 × 10³) = 0.504. ∎
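Since the sample space has only 9000 members, we can also check this by exhaustion; here is a short Python sketch (our addition):

```python
# All four-digit PINs that do not start with 0: 1000..9999, i.e. 9000 of them.
pins = [f"{n:04d}" for n in range(1000, 10000)]
favourable = sum(1 for p in pins if len(set(p)) == 4)  # all digits distinct
prob = favourable / len(pins)
assert favourable == 9 * 9 * 8 * 7
print(round(prob, 3))  # 0.504
```

Brute force and the multiplication rule agree, as they must.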
Example 3.5.2: poker dice. A set of poker dice comprises five cubes each showing
{9, 10, J, Q, K, A}, in an obvious notation. If you roll such a set of dice, what is the
probability of getting a `full house' (three of one kind and two of another)?
Solution. Obviously |Ω| = 6⁵, because each die may show any one of the six faces. A
particular full house is chosen as follows:
choose a face to show three times;
choose another face to show twice;
choose three dice to show the first face.
By the multiplication rule, and (5) of section 3.3, we have therefore that
|A| = 6 × 5 × (5 choose 3).
Hence
P(full house) = 6 × 5 × (5 choose 3)/6⁵ ≈ 0.039. ∎
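With only 6⁵ = 7776 outcomes, the whole sample space can be enumerated; this Python sketch (ours) confirms the count 6 × 5 × (5 choose 3) = 300:

```python
from itertools import product
from math import comb

# Enumerate all 6^5 rolls and count those with face-counts {3, 2}.
full_houses = 0
for roll in product(range(6), repeat=5):
    counts = sorted(roll.count(f) for f in set(roll))
    if counts == [2, 3]:
        full_houses += 1

assert full_houses == 6 * 5 * comb(5, 3)
print(full_houses / 6 ** 5)  # 300/7776 ≈ 0.0386
```

Enumeration of this kind is a useful habit: it catches slips in counting arguments before they propagate.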
Example 3.5.3: birthdays. For reasons that are mysterious, some (rather vague)
significance is sometimes attached to the discovery that two individuals share a birthday.
Given a collection of people, for example a class or lecture group, it is natural to ask for
the chance that at least two do share the same birthday.
We begin by making some assumptions that greatly simplify the arithmetic, without in
any way sacrificing the essence of the question or the answer. Specifically, we assume that
there are r individuals (none of whom was born on 29 February) who are all indepen-
dently equally likely to have been born on any of the 365 days of a non-leap year.
Let s_r be the probability that at least two of the r share a birthday. Then we ask the
following two questions:
(i) How big does r need to be to make s_r > 1/2? That is, how many people do we need to
make a shared birthday more likely than not?
(ii) In particular, what is s_24?
(In fact births are slightly more frequent in the late summer, multiple births do occur, and
some births occur on 29 February. However, it is obvious, and it can be proved, that the
effect of these facts on our answers is practically negligible.)
Before we tackle these two problems we can make some elementary observations. First,
we can see easily that
s_2 = 1/365 ≈ 0.003
because there are (365)² ways for two people to have their birthdays, and in 365 cases
they share it. With a little more effort we can see that
s_3 = 1093/133225 ≈ 0.008
because there are (365)³ ways for three people to have their birthdays, there are
365 × 364 × 363 ways for them to be different, and so there are (365)³ − 365 ×
364 × 363 ways for at least one shared day. Hence, as required,
(1) s_3 = ((365)³ − 365 × 364 × 363)/(365)³.
These are rather small probabilities but, at the other extreme, we have s_366 = 1, which
follows from the pigeonhole principle. That is, even if 365 people have different birthdays
then the 366th person must share one. At this point, before we give the solution, you
should write down your intuitive guesses (very roughly) at the answers to (i) and (ii).
Solution. The method for finding s_r has already been suggested by our derivation of
s_3. We first find the number of ways in which all r people have different birthdays. There
are 365 possibilities for the first, then 364 different possibilities for the second, then 363
possibilities different from the first two, and so on. Therefore, by the multiplication rule,
there are
365 × 364 × ⋯ × (365 − r + 1)
ways for all r birthdays to be different.
Also, by the multiplication rule, there are (365)^r ways for the birthdays to be
distributed. Then by the addition rule there are
(365)^r − 365 × ⋯ × (365 − r + 1)
ways for at least one shared day. Thus
(2) s_r = ((365)^r − 365 × ⋯ × (365 − r + 1))/(365)^r
= 1 − (364 × ⋯ × (365 − r + 1))/(365)^{r−1}.
Now after a little calculation (with a calculator) we find that approximately
s_24 ≈ 0.54,  s_23 ≈ 0.51,  s_22 ≈ 0.48.
Thus a group of 23 randomly selected people is sufficiently large to ensure that a shared
birthday is more likely than not.
This is generally held to be surprisingly low, and at variance with uninformed intuition.
How did your guesses compare with the true answer? ∎
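The `little calculation' of (2) is now more conveniently done by computer than by calculator; this Python sketch (our addition) evaluates s_r for any r:

```python
# s_r = 1 - (365/365)(364/365)...((365-r+1)/365), as in equation (2).
def s(r):
    prob_all_different = 1.0
    for i in range(r):
        prob_all_different *= (365 - i) / 365
    return 1 - prob_all_different

assert s(22) < 0.5 < s(23)  # 23 people suffice, 22 do not
print(round(s(23), 2), round(s(24), 2))  # 0.51 0.54
```

Multiplying the factors one at a time avoids the huge numerator and denominator of (2), which would overflow a naive factorial computation.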
At this point we pause to make a general point. You must have noticed that in all the
above examples the sample space Ω has the property that
|Ω| = n^r,  r ≥ 1
for some n and r. It is easy to see that this is so when the experiment comprises r objects
that could each independently supply any one of n outcomes. This is just the same as the
sample space you get if from an urn containing n distinct balls you remove one, inspect
it, and replace it, and do this r times altogether. This situation is therefore generally
called sampling with replacement.
If you did not replace the balls at any time then
|Ω| = n^(r) = n!/(n − r)!,  1 ≤ r ≤ n.
Naturally this is called sampling without replacement.
We now consider some classic problems of this latter kind.
Example 3.5.4: bridge hands. You are dealt a hand at bridge. What is the probability
that it contains s spades, h hearts, d diamonds, and c clubs?
Solution. By the multiplication rule the number of such hands is
(13 choose s)(13 choose h)(13 choose d)(13 choose c),
and every hand of 13 cards is equally likely, so the required probability is this number
divided by (52 choose 13).
With a calculator and some effort you can show that, for example, the probability of 4
spades and 3 of each of the other three suits is
P(A(4, 3, 3, 3)) = (13 choose 4)(13 choose 3)³/(52 choose 13) ≈ 0.026.
In fact, no other specified hand is more likely. ∎
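The `calculator and some effort' reduces to one line of Python; this sketch (ours) evaluates the probability exactly before rounding:

```python
from math import comb

# P(4 spades and 3-3-3 in the other suits) = C(13,4) C(13,3)^3 / C(52,13).
p = comb(13, 4) * comb(13, 3) ** 3 / comb(52, 13)
assert abs(p - 0.0264) < 0.001
print(round(p, 3))  # 0.026
```

Integer arithmetic with `math.comb` keeps the binomial coefficients exact; only the final division introduces floating point.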
Example 3.5.5: bridge continued; shapes. You are dealt a hand at bridge. Find the
probability of the event B that the hand contains s_1 of one suit, s_2 of another suit and so
on, where s_1 + s_2 + s_3 + s_4 = 13 and
s_1 ≥ s_2 ≥ s_3 ≥ s_4.
Notice how this differs from the case when suits are specified. The shape (4, 3, 3, 3)
has probability 0.11, approximately, even though it was the most likely hand when suits
were specified.
Example 3.5.6: poker. You are dealt a hand of 5 cards from a conventional pack. A
full house comprises 3 cards of one value and 2 of another (e.g. 3 twos and 2 fours). If the
hand has 4 cards of one value (e.g. 4 jacks), this is called four of a kind. Which is more
likely?
Solution. (i) First we note that comprises all possible choices of 5 cards from 52
cards. Hence
52
jj :
5
108 3 Counting and gambling
(ii) For a full house you can choose the value of the triple in 13 ways, and then you can
choose their 3 suits in
4
3
ways. The value of the double can then be chosen in 12 ways, and their suits in
4
2
ways. Hence
4 4 52
P(full house) 13 12
3 2 5
' 0:0014:
(iii) Four of a kind allows 13 choices for the quadruple and then 48 choices for the
other card. Hence
P(four of a kind) = 13 × 48 / C(52, 5) ≈ 0.00024. □
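The two counts can be checked in a couple of lines; a sketch:

```python
from math import comb

hands = comb(52, 5)  # |Ω| = 2 598 960 possible 5-card hands

# full house: value of the triple (13), its suits C(4,3),
# then value of the pair (12), its suits C(4,2)
p_full_house = 13 * comb(4, 3) * 12 * comb(4, 2) / hands

# four of a kind: value of the quadruple (13), then any of the 48 remaining cards
p_four = 13 * 48 / hands

print(p_full_house, p_four)  # a full house is exactly 6 times as likely
```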
Example 3.5.7: tennis. Rod and Fred are playing a game of tennis. The scoring is
conventional, which is to say that scores run through (0, 15, 30, 40, game), with the usual
provisions for deuce at 40–40.
Rod wins any point with probability p. What is the probability g that he wins the
game? We assume that all points are won or lost independently.
You can use the result of example 2.11.5.
Solution. Let A_k be the event that Rod wins the game and Fred wins exactly k points
during the game; let A_d be the event that Rod wins from deuce. Clearly
g = P(A_0) + P(A_1) + P(A_2) + P(A_d).
Let us consider these terms in order.
(i) For A_0 to occur, Rod wins 4 consecutive points; P(A_0) = p^4.
(ii) For A_1 to occur Fred wins a point at some time before Rod has won his 4 points.
There are
C(4, 1) = 4
occasions for Fred to win his point, and in each case the probability that Rod wins 4 and
Fred 1 is p^4(1 − p). Therefore
P(A_1) = 4p^4(1 − p).
(iii) Likewise for A_2 we must count the number of ways in which Fred can win 2
points. This is just the number of ways of choosing where he can win 2 points, namely
C(5, 2) = 10.
Hence
P(A_2) = 10p^4(1 − p)^2.
(iv) Finally, Rod can win having been at deuce; we denote the event deuce by D. For D
to occur Fred must win 3 points, and so by the argument above
P(D) = C(6, 3) p^3(1 − p)^3.
The probability that Rod wins from deuce is found in example 2.11.5, so combining that
result with the above gives
P(A_d) = P(A_d | D)P(D) = [p^2/(1 − 2p(1 − p))] × C(6, 3) p^3(1 − p)^3.
Thus
g = p^4 + 4p^4(1 − p) + 10p^4(1 − p)^2 + 20p^5(1 − p)^3/(1 − 2p(1 − p)). □
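The expression for g is easy to evaluate numerically; a sketch (the function name game_prob is ours, and the deuce factor p^2/(1 − 2p(1 − p)) is the result quoted from example 2.11.5):

```python
from math import comb

def game_prob(p):
    """Probability that Rod wins a tennis game, winning each point
    independently with probability p."""
    q = 1 - p
    deuce = comb(6, 3) * p**3 * q**3        # P(D): three points each in six
    win_from_deuce = p**2 / (1 - 2 * p * q)  # quoted from example 2.11.5
    return p**4 + 4 * p**4 * q + 10 * p**4 * q**2 + deuce * win_from_deuce

print(game_prob(0.5))  # 0.5, as symmetry demands
print(game_prob(0.6))  # winning 60% of points gives roughly a 74% chance of the game
```

Note how a modest edge on each point is amplified over the game; this is a useful sanity check on the algebra above.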
Remark. The first probabilistic analysis of tennis was carried out by James Bernoulli,
and included as an appendix to his book published in 1713. Of course he was writing
about real tennis (the Jeu de Paume), not lawn tennis, but the scoring system is essentially
the same. The play is extremely different.
2. Poker dice. You roll 5 poker dice. Show that the probability of 2 pairs is
(1/2!) × 6 × 5 × 4 × (5!/(2! 2! 1!)) × 6^{−5} ≈ 0.23.
Explain the presence of 1/2! in this expression.
3. Bridge. Show that the probability that you have x spades and your partner has y spades is
C(13, x) C(39, 13 − x) C(13 − x, y) C(26 + x, 13 − y) / [C(52, 13) C(39, 13)].
What is the conditional probability that your partner has y spades given that you have x spades?
4. Tennis. Check that, in example 3.5.7, when p = 1/2 we have g = 1/2 (which we know directly
in this case by symmetry).
5. Suppose Rod and Fred play n independent points. Rod wins each point with probability p, or loses
it to Fred with probability 1 − p. Show that the probability that Rod wins exactly k points is
C(n, k) p^k(1 − p)^{n−k}.
3.6 APPLICATIONS TO LOTTERIES
Now in the way of Lottery men do also tax themselves in the general, though out of
hopes of Advantage in particular: A Lottery therefore is properly a Tax upon
unfortunate self-conceited fools; men that have good opinion of their own luckiness,
or that have believed some Fortune-teller or Astrologer, who had promised them
great success about the time and place of the Lottery, lying Southwest perhaps from
the place where the destiny was read.
Now because the world abounds with this kinde of fools, it is not fit that every man
that will, may cheat every man that would be cheated; but it is rather ordained, that
the Sovereign should have the Guardianship of these fools, or that some Favourite
should beg the Sovereign's right of taking advantage of such men's folly, even as in
the case of Lunaticks and Idiots.
Wherefore a Lottery is not tollerated without authority, assigning the proportion in
which the people shall pay for their errours, and taking care that they be not so
much and so often couzened, as they themselves would be.
William Petty (1662)
Lotto's a taxation
On all fools in the nation
But heaven be praised
It's so easily raised.
Traditional
In spite of the above remarks, lotteries are becoming ever more widespread. The usual
form of the modern lottery is as follows. There are n numbers available; you choose r of
them and the organizers also choose r (without repetition). If the choices are the same,
you are a winner.
Sometimes the organizers choose one extra number (or more), called a bonus number.
If your choice includes this number and r − 1 of the other r chosen by the organizers,
then you win a consolation prize.
Lotteries in this form seem to have originated in Genoa in the 17th century; for that
reason they are often known as Genoese lotteries. The version currently operated in
England has n = 49 and r = 6, with one bonus number. Just as in the 17th century, the
natural question is, what are the chances of winning? This is an easy problem: there are
C(n, r)
ways of choosing r different numbers from n numbers, and these are equally likely. The
probability that your single selection of r numbers wins is therefore
(1) p_w = 1/C(n, r).
In this case, when (n, r) = (49, 6), this gives
p_w = 1/C(49, 6) = (1 × 2 × 3 × 4 × 5 × 6)/(49 × 48 × 47 × 46 × 45 × 44)
= 1/13 983 816.
It is also straightforward to calculate the chance of winning a consolation prize using the
bonus number. The bonus number can replace any one of the r winning numbers to yield
your selection of r numbers, so
(2) p_c = r/C(n, r).
When (n, r) = (49, 6), this gives
p_c = 1/2 330 636.
An alternative way of seeing the truth of (2) runs as follows. There are r winning numbers
and one bonus ball. To win a consolation prize you can choose the bonus ball in just one
way, and the remaining r − 1 numbers in
C(r, r − 1)
ways. Hence, as before,
p_c = C(r, r − 1) × 1/C(n, r).
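A sketch evaluating (1) and (2) for the English lottery:

```python
from math import comb

n, r = 49, 6
p_win = 1 / comb(n, r)          # (1): one chance in 13 983 816
p_consolation = r / comb(n, r)  # (2): bonus ball replaces any of the r winners

print(comb(n, r))                # 13983816
print(round(1 / p_consolation))  # 2330636
```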
The numbers drawn in any national or state lottery attract much more attention than
most other random events. Occasionally this gives rise to controversy because our
intuitive feelings about randomness are not sufficiently well developed to estimate the
chances of more complicated outcomes.
For example, whenever the draw yields runs of consecutive numbers (such as
{2, 3, 4, 8, 38, 42}, which contains a run of length three), it strikes us as somehow less
random than an outcome with no runs. Indeed it is not infrequently asserted that there are
`too many' runs in the winning draws, and that this is evidence of bias. (Similar assertions
are sometimes made by those who enter football `pools'.) In fact calculation shows that
intuition is misleading in this case. We give some examples.
Example 3.6.1: chance of no runs. Suppose you pick r numbers at random from a
sequence of n numbers. What is the probability that no two of them are adjacent, that is
to say, the selection contains no runs? We just need to count the number of ways s of
choosing r objects from n objects in a line, so that there is at least one unselected object
as a spacer between each pair of selected objects. The crucial observation is that if we
strike out or ignore the r − 1 necessary spacers then we have an unconstrained selection
of r from n − (r − 1) objects. Here are examples with n = 4 and r = 2; unselected
objects are denoted by ○, selected objects by ●, and the unselected object used as a
spacer is d:
● d ● ○,  ○ ● d ●,
and so on. Conversely any selection of r objects from n − (r − 1) objects can be turned
into a selection of r objects from n objects with no runs, simply by adding r − 1 spacers.
Therefore the number we seek is
s = C(n − (r − 1), r).
Hence the probability that the r winning lottery numbers contain no runs at all is
(3) p_s = C(n + 1 − r, r)/C(n, r).
For example, if n = 49 and r = 6 then
p_s = C(44, 6)/C(49, 6) ≈ 0.505.
So in the first six draws you are about as likely to see at least one run as not. This is
perhaps more likely than intuition suggests.
When the bonus ball is drawn, the chance of no runs at all is now
C(43, 7)/C(49, 7) ≈ 0.375.
The chance of at least one run is not far short of 2/3. □
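Formula (3) is simple to evaluate; a sketch:

```python
from math import comb

def p_no_runs(n, r):
    """Probability that r numbers drawn from 1..n contain no two adjacent,
    by the spacer argument: C(n - r + 1, r) / C(n, r)."""
    return comb(n - r + 1, r) / comb(n, r)

print(round(p_no_runs(49, 6), 3))  # 0.505: six main numbers
print(round(p_no_runs(49, 7), 3))  # 0.375: with the bonus ball included
```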
4. Example 3.6.1 revisited: no runs. You pick r numbers at random from a sequence of n
numbers (without replacement). Let s(n, r) be the number of ways of doing this such that no
two of the r selected are adjacent. Show that
s(n, r) = s(n − 2, r − 1) + s(n − 1, r).
Now set s(n, r) = c(n − r + 1, r) = c(m, k), where m = n − r + 1 and k = r. Show that c(m, k) satisfies
the same recurrence relation, (9) of section 3.3, as the binomial coefficients. Deduce that
s(n, r) = C(n − r + 1, r).
3.7 THE PROBLEM OF THE POINTS
Prolonged gambling differentiates people into two groups; those playing with the
odds, who are following a trade or profession; and those playing against the odds,
who are indulging a hobby or pastime, and if this involves a regular annual outlay,
this is no more than what has to be said of most other amusements.
John Venn
In this section we consider just one problem, which is of particular importance in the
history and development of probability. In previous sections we have looked at several
problems involving dice, cards, and other simple gambling devices. The application of
the theory is so natural and useful that it might be supposed that the creation of prob-
ability parallelled the creation of dice and cards. In fact this is far from being the case.
The greatest single initial step in constructing a theory of probability was made in
response to a more recondite question, the problem of the points.
Roughly speaking the essential question is this.
Two players, traditionally called A and B, are competing for a prize. The contest takes
the form of a sequence of independent similar trials; as a result of each trial one of the
contestants is awarded a point. The first player to accumulate n points is the winner; in
colloquial parlance A and B are playing the best of 2n − 1 points. Tennis matches are
usually the best of five sets; n = 3.
The problem arises when the contest has to be stopped or abandoned before either has
won n points; in fact A still needs a points (having n − a already) and B still needs b
points (having n − b already). How should the prize be fairly divided? (Typically the
`prize' consisted of stakes put up by A and B, and held by the stakeholder.)
For example, in tennis, sets correspond to points and men play the best of five sets. If
the players were just beginning the fourth set when the court was swallowed up by an
earthquake, say, what would be a fair division of the prize? (assuming a natural reluctance
to continue the game on some other nearby court).
This is a problem of great antiquity; it first appeared in print in 1494 in a book by Luca
Pacioli, but was almost certainly an old problem even then. In his example, A and B were
playing the best of 11 games for a prize of ten ducats, and are forced to abandon the game
when A has 5 points (needing 1 more) and B has 2 points (needing 4 more). How should
the prize be divided?
Though Pacioli was a man of great talent (among many other things his book includes
the first printed account of double-entry book-keeping), he could not solve this problem.
Nor could Tartaglia (who is best known for showing how to find the roots of a cubic
equation), nor could Forestani, Peverone, or Cardano, who all made attempts during the
16th century.
In fact the problem was finally solved by Blaise Pascal in 1654, who, with Fermat,
thereby officially inaugurated the theory of probability. In that year, probably sometime
around Pascal's birthday (19 June; he was 31), the problem of the points was brought to
his attention. The enquiry was made by the Chevalier de Mere (Antoine Gombaud) who,
as a man-about-town and gambler, had a strong and direct interest in the answer. Within a
very short time Pascal had solved the problem in two different ways. In the course of a
correspondence with Fermat, a third method of solution was found by Fermat.
Two of these methods use ideas that were well known at that time, and are familiar to
you now from the previous section. That is, they relied on counting a number of equally
likely outcomes.
Pascal's great step forward was to create a method that did not rely on having equally
likely outcomes. This breakthrough came about as a result of his explicit formulation of
the idea of the value of a bet or lottery, which we discussed in chapters 1 and 2. That is, if
you have a probability p of winning $1 then the game is worth $ p to you.
It naturally follows that, in the problem of the points, the prize should be divided in
proportion to the players' respective probabilities of winning if the game were to be
continued. The problem is therefore more precisely stated thus.
Precise problem of the points. A sequence of fair coins is flipped; A gets a point for
every head, B a point for every tail. Player A wins if there are a heads before b tails,
otherwise B wins. Find the probability that A wins.
Solution. Let α(a, b) be the probability that A wins and β(a, b) the probability that B
wins. If the first flip is a head, then A now needs only a − 1 further heads to win, so the
conditional probability that A wins, given a head, is α(a − 1, b). Likewise the conditional
probability that A wins, given a tail, is α(a, b − 1). Hence, by the partition rule,
(1) α(a, b) = ½α(a − 1, b) + ½α(a, b − 1).
Thus if we know α(a, b) for small values of a and b, we can find the solution for any a
and b by this simple recursion. And of course we do know such values of α(a, b), because
if a = 0 and b > 0, then A has won and takes the whole prize: that is to say
(2) α(0, b) = 1.
Likewise if b = 0 and a > 0 then B has won, and so
(3) α(a, 0) = 0.
How do we solve (1) in general, with (2) and (3)? Recall the fundamental property of
Pascal's triangle: the entries c(j + k, k) = d(j, k) satisfy
(4) d(j, k) = d(j − 1, k) + d(j, k − 1).
You don't need to be a genius to suspect that the solution α(a, b) of (1) is going to be
connected with the solutions
d(j, k) = c(j + k, k) = C(j + k, k)
of (4). We can make the connection even more transparent by writing
α(a, b) = (1/2)^{a+b} u(a, b).
Then (1) becomes
(5) u(a, b) = u(a − 1, b) + u(a, b − 1)
with
(6) u(0, b) = 2^b and u(a, 0) = 0.
There are various ways of solving (5) with the conditions (6), but Pascal had the
inestimable advantage of having already obtained the solution by another method. Thus
he had simply to check that the answer is indeed
(7) u(a, b) = 2 Σ_{k=0}^{b−1} C(a + b − 1, k),
and
(8) α(a, b) = (1/2)^{a+b−1} Σ_{k=0}^{b−1} C(a + b − 1, k).
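The recursion (1) with boundary conditions (2) and (3) can be checked against the closed form (8); a sketch in exact rational arithmetic:

```python
from math import comb
from fractions import Fraction
from functools import lru_cache

@lru_cache(None)
def alpha(a, b):
    """P(A gets a heads before b tails), computed from the recursion (1)-(3)."""
    if a == 0:
        return Fraction(1)
    if b == 0:
        return Fraction(0)
    return (alpha(a - 1, b) + alpha(a, b - 1)) / 2

def alpha_closed(a, b):
    """Pascal's closed form (8)."""
    return Fraction(sum(comb(a + b - 1, k) for k in range(b)), 2 ** (a + b - 1))

# Pacioli's 1494 example: A needs 1 more point, B needs 4
print(alpha(1, 4))  # 15/16, so the stakes should split 15 : 1
assert all(alpha(a, b) == alpha_closed(a, b) for a in range(1, 8) for b in range(1, 8))
```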
At long last there was a solution to this classic problem. We may reasonably ask why
Pascal was able to solve it in a matter of weeks, when all previous attempts had failed for
at least 150 years. As usual the answer lies in a combination of circumstances: mathematicians
had become better at counting things; the binomial coefficients were better
understood; notation and the techniques of algebra had improved immeasurably; and
Pascal had a couple of very good ideas.
Pascal immediately realized the power of these ideas and techniques and quickly
invented new problems on which to use them. We discuss the best known of them in the
next section.
3.8 THE GAMBLER'S RUIN PROBLEM
In writing on these matters I had in mind the enjoyment of mathematicians, not the
benet of the gamblers; those who waste time on games of chance fully deserve to
lose their money as well.
P. de Montmort
Following the contributions of Pascal and Fermat, the next advances were made by
Christiaan Huygens, who was Newton's closest rival for top scientist of the 17th century.
Born in the Netherlands, he visited Paris in 1655 and heard about the problems Pascal
had solved. Returning to Holland, he wrote a short book Calculations in Games of
Chance (van Rekeningh in Speelen van Geluck). Meanwhile, Pascal had proposed and
solved another famous problem.
Pascal's problem of the gambler's ruin. Two gamblers, A and B, play with three
dice. At each throw, if the total is 11 then B gives a counter to A; if the total is 14 then A
gives a counter to B. They start with 12 counters each, and the first to possess all 24 is the
winner. What are their chances of winning?
Pascal gives the correct solution. The ratio of their respective chances of winning,
p_A : p_B, is
150 094 635 296 999 121 : 129 746 337 890 625,
which is the same as
282 429 536 481 : 244 140 625
on dividing by 3^{12}.
Unfortunately it is not certain what method Pascal used to get this result. However,
Huygens soon heard about this new problem, and solved it in a few days (sometime
between 28 September 1656 and 12 October 1656). He used a version of Pascal's idea of
value, which we have discussed several times above.
Now of course we do not know for sure if this was Pascal's method, but Pascal was
certainly at least as capable of extending his own ideas as Huygens was. The balance of
probabilities is that he did use this method. By long-standing tradition this problem is
always solved in books on elementary probability, and so we now give a modern version
of the solution. Here is a general statement of the problem.
Gambler's ruin. Two players, A and B again, play a series of independent games.
Each game is won by A with probability α, or by B with probability β; the winner of each
game gets one counter from the loser. Initially A has m counters and B has n. The victor
of the contest is the first to have all m + n counters; the loser is said to be `ruined', which
explains the name of this problem. What are the respective chances of A and B to be the
victor?
Note that α + β = 1, and for the moment we assume α ≠ β.
Just as in the problem of the points, suppose that at some stage A has a counters (so B
has m + n − a counters), and let A's chances of victory at that point be v(a). If A wins
the next game his chance of victory is now v(a + 1); if A loses the next game his chance
of victory is v(a − 1). Hence, by the partition rule,
(1) v(a) = αv(a + 1) + βv(a − 1),  1 ≤ a ≤ m + n − 1.
Furthermore we know that
(2) v(m + n) = 1
because in this case A has all the counters, and
(3) v(0) = 0
because A then has no counters.
From section 2.15, we know that the solution of (1) takes the form
v(a) = c1 λ^a + c2 μ^a,
where c1 and c2 are constants, and λ and μ are the roots of
(4) αx² − x + β = 0.
Trivially, the roots of (4) are 1 and β/α ≠ 1 (since we assumed α ≠ β). Hence,
using (2) and (3), we find that
(5) v(a) = (1 − (β/α)^a)/(1 − (β/α)^{m+n}).
In particular, when A starts with m counters,
p_A = v(m) = (1 − (β/α)^m)/(1 − (β/α)^{m+n}).
This method of solution of difference equations was unknown in 1656, so other approaches
were employed. In obtaining the answer to the gambler's ruin problem, Huygens
(and later workers) used intuitive induction with the proof omitted. Pascal probably did
use (1) but solved it by a different route. (See the exercises at the end of the section.)
Finally we consider the case when α = β. Now (1) is
(6) v(a) = ½v(a + 1) + ½v(a − 1)
and it is easy to check that, for arbitrary constants c1 and c2,
v(a) = c1 + c2 a
satisfies (6). Now using (2) and (3) gives
(7) v(a) = a/(m + n).
So somebody does win; the probability that the game is unresolved is zero.
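A sketch of (5) and (7), applied to Pascal's dice problem; rolls totalling neither 11 nor 14 change nothing, so each decisive game is won by A with probability 27/42 (27 ways to total 11 with three dice, 15 ways to total 14):

```python
def victory_prob(a, alpha, beta, total):
    """v(a): probability that A, currently holding a of the `total` counters,
    ends up with them all; each game is won by A w.p. alpha, by B w.p. beta."""
    if alpha == beta:
        return a / total                     # equation (7)
    rho = beta / alpha
    return (1 - rho**a) / (1 - rho**total)   # equation (5)

pA = victory_prob(12, 27/42, 15/42, 24)
print(pA / (1 - pA))  # ≈ (9/5)**12, Pascal's ratio 282429536481 : 244140625
print(victory_prob(12, 0.5, 0.5, 24))  # 0.5: fair game, equal stakes
```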
2. Solve the equation (1) as follows.
(a) Rearrange (1) as
α{v(a + 1) − v(a)} = β{v(a) − v(a − 1)}.
(b) Sum and use successive cancellation to get
α{v(a + 1) − v(1)} = β{v(a) − v(0)} = βv(a).
(c) Deduce that
v(a) = [(1 − (β/α)^a)/(1 − β/α)] v(1).
(d) Finally derive (5).
Every step of this method would have been familiar to Pascal in 1656.
3. Adapt the method of the last exercise to deal with the case when α = β in the gambler's ruin
problem.
4. Suppose a gambler plays a sequence of fair games, at each of which he is equally likely to lose a
point or gain a point. Show that the chance of being a points ahead before first being d points
down is a/(a + d).
3.9 SOME CLASSIC PROBLEMS
I have made this letter longer than usual, because I lack the time to make it shorter.
Pascal in a letter to Fermat.
Pascal and Fermat corresponded on the problem of the points in 1654, and on the
gambler's ruin problem in 1656. Their exchanges mark the official inauguration of probability
theory. (Pascal's memorial in the Church of St Etienne-du-Mont in Paris warrants
a visit by any passing probabilist.) These ideas quickly circulated in intellectual circles,
and in 1657 Huygens published a book on probability, On Games of Chance (in Latin and
Dutch editions); an English translation by Arbuthnot appeared in 1692.
This pioneering text was followed in remarkably quick succession by several books on
probability. A brief list would include the books of de Montmort (1708), J. Bernoulli
(1713), and de Moivre (1718), in French, Latin, and English respectively.
It is notable that the development of probability in its early stages was so extensively
motivated by simple games of chance and lotteries. Of course, the subject now extends
far beyond these original boundaries, but even today most people's first brush with
probability will involve rolling a die in a simple board game, wondering about lottery
odds, or deciding which way to finesse the missing queen. Over the years a huge amount
of analysis has been done on these simple but naturally appealing problems. We therefore
give a brief random selection of some of the better-known classical problems tackled by
these early pioneers and their later descendants. (We have seen some of the easier
classical problems already in chapter 2, such as Pepys' problem, de Mere's problem,
Galileo's problem, Waldegrave's problem, and Huygens' problem.)
Example 3.9.1: problem of the points revisited. As we have noted above, Pascal was
probably assisted in his elegant and epoch-making solution of this problem by the fact
that he could also solve it another way. A typical argument runs as follows.
Solution. Recall that A needs a points and B needs b points; A wins any game with
probability p. Now let A_k be the event that when A has first won a points, B has won k
points at that stage. Then
(1) A_j ∩ A_k = ∅,  j ≠ k,
and
P(A_k) = P(A wins the (a + k)th game and a − 1 of the preceding a + k − 1 games)
= pP(A wins a − 1 of a + k − 1 games)
= C(a + k − 1, a − 1) p^a(1 − p)^k,
by exercise 5 of section 3.5. Now the event that A wins is ∪_{k=0}^{b−1} A_k, and the solution
α(a, b) = Σ_k P(A_k) follows, using (1) above. (See problem 21 also.) □
Example 3.9.2: problem of the points extended. It is natural to extend the problem
of the points to a group of n players P1, . . . , Pn, where P1 needs a1 games to win, P2
needs a2, and so on, and the probability that P_r wins any game is p_r. Naturally Σ p_r = 1.
The same argument as that used in the previous example shows that if P1 wins the contest
when P_r has won x_r games (2 ≤ r ≤ n, x_r < a_r), this has probability
(2) [(a1 + x2 + ··· + x_n − 1)!/((a1 − 1)! x2! ··· x_n!)] p1^{a1} p2^{x2} ··· p_n^{x_n}.
Thus the total probability that P1 wins the contest is the sum of all such terms as each x_r
runs over 0, 1, . . . , a_r − 1. □
Solution. The mathematician must have removed the boxes from their pockets
n + 1 + n − k = 2n − k + 1 times in all. If the last ((2n − k + 1)th, unsuccessful) removal is of the
right-hand box, then the previous n right-hand removals may be chosen from any of the
previous 2n − k. This has probability
C(2n − k, n) 2^{−(2n−k+1)}.
The same is true for the left pocket, so
p_k = C(2n − k, n) 2^{−(2n−k)}. □
Example 3.9.4: occupancy problem. Suppose a fair die with s faces (or sides) is
rolled r times. What is the probability a that every side has turned up at least once?
Solution. Let A_j be the event that the jth side has not been shown. Then
(3) a = 1 − P(A1 ∪ A2 ∪ ··· ∪ A_s)
= 1 − Σ_{j=1}^{s} P(A_j) + Σ_{j<k} P(A_j ∩ A_k) − ··· + (−1)^s P(A1 ∩ ··· ∩ A_s)
on using problem 18 of section 2.16. Now by symmetry P(A_j) = P(A_k), P(A_j ∩ A_k) =
P(A_m ∩ A_n), and so on. Hence
a = 1 − sP(A1) + C(s, 2)P(A1 ∩ A2) − ··· + (−1)^s P(∩_{j=1}^{s} A_j).
Since P(A1 ∩ ··· ∩ A_k) = (1 − k/s)^r, this gives
a = Σ_{k=0}^{s} (−1)^k C(s, k)(1 − k/s)^r.
Remark. This example may look a little artificial, but in fact it has many practical
applications. For example, if you capture, tag (if not already tagged), and release r
animals successively in some restricted habitat, what is the probability that you have
tagged all the s present? Think of some more such examples yourself.
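The inclusion-exclusion argument of example 3.9.4 yields a = Σ_{k=0}^{s} (−1)^k C(s, k)(1 − k/s)^r, which is easy to evaluate; a sketch:

```python
from math import comb

def p_all_sides(s, r):
    """P(every face of an s-sided fair die appears at least once in r rolls),
    by inclusion-exclusion over the set of unseen faces."""
    return sum((-1) ** k * comb(s, k) * (1 - k / s) ** r for k in range(s + 1))

print(round(p_all_sides(6, 6), 4))   # 6!/6^6 ≈ 0.0154: all faces in six rolls is rare
print(round(p_all_sides(6, 30), 4))  # ≈ 0.975: thirty rolls almost surely show every face
```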
Example 3.9.5: derangements and coincidences. Suppose the lottery machine were
not stopped after the winning draw, but allowed to go on drawing numbers until all n were
removed. What is the probability d that no number r is the rth to be drawn by the
machine?
Solution. Let A_r be the event that the rth number drawn is in fact r; that is to say, the
rth ball that rolls out bears the number r. Then
3.10 Stirling's formula 121
(6) d = 1 − P(∪_{r=1}^{n} A_r)
= 1 − Σ_{r=1}^{n} P(A_r) + ··· + (−1)^n P(A1 ∩ ··· ∩ A_n)
= 1 − nP(A1) + C(n, 2)P(A1 ∩ A2) − ··· + (−1)^n P(∩_{r=1}^{n} A_r)
by problem 18 of section 2.16 and symmetry, as usual. Now for any set of k numbers
(7) P(A1 ∩ ··· ∩ A_k) = (1/n)(1/(n − 1)) ··· (1/(n − k + 1)) = (n − k)!/n!
Hence
(8) d = 1 − n(1/n) + C(n, 2)((n − 2)!/n!) − ··· + (−1)^k C(n, k)((n − k)!/n!) + ··· + (−1)^n(1/n!)
= 1/2! − 1/3! + ··· + (−1)^n(1/n!).
It is remarkable that as n → ∞ we have d → e^{−1}. □
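A sketch of the series in (8); the convergence to 1/e is strikingly fast:

```python
from math import factorial, exp

def p_derangement(n):
    """Probability that no number appears in its own position among n:
    the truncated alternating series of equation (8)."""
    return sum((-1) ** k / factorial(k) for k in range(n + 1))

for n in (4, 6, 10):
    print(n, round(p_derangement(n), 6))
print(round(exp(-1), 6))  # the limit 1/e ≈ 0.367879
```

Already at n = 10 the series agrees with 1/e to seven decimal places, since the error is at most 1/(n + 1)!.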
2. Derangements once again. Let d_n be the number of derangements of the first n integers.
Show that d_{n+1} = nd_n + nd_{n−1}, by considering which number is in the first place in each
derangement.
3.10 STIRLING'S FORMULA
above, especially (for example) in even the simplest problems involving poker hands or
suit distributions in bridge hands.
For another example, consider the basic problem of proportions in flipping coins.
Example 3.10.1. A fair coin is flipped repeatedly. Routine calculations show that
(1) P(exactly 6 heads in 10 flips) = C(10, 6) 2^{−10} ≈ 0.2,
(2) P(exactly 30 heads in 50 flips) = C(50, 30) 2^{−50} ≈ 0.04,
(3) P(exactly 600 heads in 1000 flips) = C(1000, 600) 2^{−1000} ≈ 5 × 10^{−11}.
These are simple but not straightforward. The problem is that n! is impossibly large for
large n. (Try 1000! on your pocket calculator.) □
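With exact integer arithmetic these three probabilities are now routine; a sketch (math.comb handles the enormous factorials exactly):

```python
from math import comb

p10 = comb(10, 6) / 2**10      # (1)
p50 = comb(50, 30) / 2**50     # (2)
p1000 = comb(1000, 600) / 2**1000  # (3)

print(round(p10, 3), round(p50, 3))  # 0.205 0.042
print(p1000)                          # of order 1e-11
```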
Furthermore, an obvious next question in flipping coins is to ask for the probability that
the proportion of heads lies between 0.4 and 0.6, say, or any other range of interest. Even
today, summing the relevant probabilities including factorials would be an exceedingly
tedious task, and for 18th century mathematicians it was clearly impossible. de Moivre
and others therefore set about finding useful approximations to the value of n!, especially
for large n. That is, they tried to find a sequence (a(n); n ≥ 1) such that as n increases
a(n)/n! → 1,
and of course, such that a(n) can be relatively easily calculated. For such a sequence we
use the notation n! ∼ a(n). In 1730 de Moivre showed that a suitable sequence is given
by
(4) a(n) = B n^{n+1/2} e^{−n}
where
(5) log B ≈ 1 − 1/12 + 1/360 − 1/1260 + 1/1680.
Inspired by this, Stirling showed that in fact
(6) B = (2π)^{1/2}.
We therefore write:
Stirling's formula
(7) n! ∼ (2πn)^{1/2} n^n e^{−n}.
This enabled de Moivre to prove the first central limit theorem in 1733. We meet this
important result later.
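A sketch comparing (7) with the exact factorial; the ratio tends to 1, the relative error being close to 1/(12n):

```python
from math import factorial, pi, e, sqrt

def stirling(n):
    """De Moivre-Stirling approximation (7): n! ~ sqrt(2*pi*n) * n**n * e**(-n)."""
    return sqrt(2 * pi * n) * n**n / e**n

for n in (5, 10, 20):
    print(n, factorial(n) / stirling(n))  # ratio tends to 1 from above
```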
Remark. Research by psychologists has shown that, before the actual calculations,
many people (probabilistically unsophisticated) estimate that the probabilities defined in
(1), (2), and (3) are roughly similar, or even the same. This may be called the fallacy of
proportion, because it is a strong, but wrongly applied, intuitive feeling for proportionality
that leads people into this error. Typically they are also very reluctant to believe the truth,
even when it is demonstrated as above.
Use Stirling's formula to obtain an approximate value for this. (Then compare your answer with
the exact result, 53 644 737 765 488 792 839 237 440 000.)
2. Use Stirling's formula to approximate the number of ways of being dealt one hand at bridge,
C(52, 13) = 635 013 559 600.
3.11 REVIEW
As promised above we have surveyed the preliminaries to probability, and observed its
foundation by Pascal, Fermat, and Huygens. This has, no doubt, been informative and
entertaining, but are we any better off as a result? The answer is yes, for a number of
reasons: principally
(i) We have found that a large class of interesting problems can be solved simply by
counting things. This is good news, because we are all quite confident about
counting.
(ii) We have gained experience in solving simple classical problems which will be
very useful in tackling more complicated problems.
(iii) We have established the following combinatorial results.
The number of possible sequences of length r using elements from a set of
size n is n^r. (Repetition permitted.)
The number of permutations of length r using elements from a set of size n is
n(n − 1) ··· (n − r + 1). (Repetition not permitted.)
The number of combinations (choices) of r elements from a set of size n is
C(n, r) = n(n − 1) ··· (n − r + 1) / (r(r − 1) ··· 1).
The number of subsets of a set of size n is 2^n.
The number of derangements of a set of size n is
n!(1 − 1/1! + 1/2! − 1/3! + 1/4! − ··· + (−1)^n(1/n!)).
(iv) We can record the following useful approximations.
Stirling's formula says that as n increases
√(2π) n^{n+1/2} e^{−n} / n! → 1.
I Finite series
Consider the series
s_n = Σ_{r=1}^{n} a_r = a_1 + a_2 + ··· + a_n.
The variable r is a dummy variable or index of summation, so any symbol will suffice:
Σ_{r=1}^{n} a_r = Σ_{i=1}^{n} a_i.
In general
Σ_{r=1}^{n} (ax_r + by_r) = a Σ_{r=1}^{n} x_r + b Σ_{r=1}^{n} y_r.
In particular
Σ_{r=1}^{n} 1 = n;
Σ_{r=1}^{n} r = ½n(n + 1), the arithmetic sum;
Σ_{r=1}^{n} r² = (1/6)n(n + 1)(2n + 1) = 2C(n + 1, 3) + C(n + 1, 2);
Σ_{r=1}^{n} r³ = (Σ_{r=1}^{n} r)² = ¼n²(n + 1)²;
Σ_{r=0}^{n} C(n, r) x^r y^{n−r} = (x + y)^n, the binomial theorem;
Σ_{a+b+c=n; a,b,c≥0} M(a, b, c) x^a y^b z^c = Σ_{a+b+c=n; a,b,c≥0} C(a + b + c, a + b) C(a + b, a) x^a y^b z^c = (x + y + z)^n.
II Limits
Very often we have to deal with infinite series. A fundamental and extremely useful concept in this
context is that of the limit of a sequence.
Definition. Let (s_n; n ≥ 1) be a sequence of real numbers. If there is a number s such that
|s_n − s| may ultimately always be as small as we please then s is said to be the limit of the sequence
s_n. Formally we write
lim_{n→∞} s_n = s
if and only if for any ε > 0, there is a finite n_0 such that
|s_n − s| < ε
for all n > n_0.
Notice that s_n need never actually take the value s, it must just get closer to it in the long run. (For
example, let s_n = n^{−1}.)
(1 − x)^{−1/2} = 1 + ½x + (3/2)(1/2)(x²/2!) + (5/2)(3/2)(1/2)(x³/3!) + ···
= Σ_{r=0}^{∞} C(2r, r)(x/4)^r.
In particular, we often use the case n = 2:
Σ_{r=0}^{∞} (r + 1)x^r = (1 − x)^{−2}.
Also, by definition, for all x,
exp x = e^x = Σ_{r=0}^{∞} x^r / r!.
3.13 PROBLEMS
1. Assume people are independently equally likely to have any sign of the Zodiac.
(a) What is the probability that four people have different signs?
(b) How many people are needed to give a better than evens chance that at least two of them
share a sign?
(There are 12 signs of the Zodiac.)
2. Five digits are selected independently at random (repetition permitted), each from the ten
possibilities {0, 1, . . . , 9}. Show that the probability that they are all different is 0.3
approximately.
What is the probability that six such random digits are all different?
3. Four digits are selected independently at random (without repetition) from {0, 1, . . . , 9}. What
is the probability that
(a) the four digits form a run? (e.g. 2, 3, 4, 5)
(b) they are all greater than 5?
(c) they include the digit 0?
(d) at least one is greater than 7?
(e) all the numbers are odd?
4. You roll 6 fair dice. You win a small prize if at least 2 of the dice show the same, and you win a
big prize if there are at least 4 sixes. What is the probability that you
(a) get exactly 2 sixes?
(b) win a small prize?
(c) win a large prize?
(d) win a large prize given that you have won a small prize?
5. Show that the probability that your poker hand contains two pairs is approximately 0.048, and
that the probability of three of a kind is approximately 0.021.
6. Show that n_{(r)}, the number of permutations of r from n things, satisfies the recurrence relation
n_{(r)} = (n − 1)_{(r)} + r(n − 1)_{(r−1)}.
7. Show that
C(2n, n) = Σ_{k=0}^{n} C(n, k)².
8. A construction toy comprises n bricks, which can each be any one of c different colours. Let
w(n, c) be the number of different ways of making up such a box. Show that
w(n, c) = w(n − 1, c) + w(n, c − 1)
and that
w(n, c) = C(n + c − 1, n).
9. Pizza problem. Let R_n be the largest number of bits of a circular pizza which you can
produce with n straight cuts. Show that
R_n = R_{n−1} + n
and that
R_n = C(n + 1, 2) + 1.
10. If n people, including Algernon and Zebedee, are randomly placed in a line (queue), what is
the probability that there are exactly k people in line between Algernon and Zebedee?
What if they were randomly arranged in a circle?
11. A combination lock has n buttons. It opens if k different buttons are depressed in the correct
order. What is the chance of opening a lock if you press k different random buttons in random
order?
12. In poker a straight is a hand such as {3, 4, 5, 6, 7}, where the cards are not all of the same suit
(for that would be a straight flush), and aces may rank high or low. Show that
P(straight) = 10{C(4, 1)^5 − C(4, 1)} / C(52, 5) ≈ 0.004.
Show also that P(straight flush) ≈ 0.000015.
13. The Earl of Yarborough is said to have offered the following bet to anyone about to be dealt a
hand at whist: if you paid him one guinea, and your hand then contained no card higher than a
nine, he would pay you one thousand guineas. Show that the probability y of being dealt such a
hand is
y = 5394/9 860 459.
What do you think of the bet?
14. (a) Adonis has k cents and Bubear has n − k cents. They repeatedly roll a fair die. If it is
even, Adonis gets a cent from Bubear; otherwise, Bubear gets a cent from Adonis. Show
that the probability that Adonis first has all n cents is k/n.
(b) There are n + 1 beer glasses {g_0, g_1, . . . , g_n} in a circle. A wasp is on g_0. At each flight
the wasp is equally likely to fly to either of the two neighbouring glasses. Let L_k be the
event that the glass g_k is the last one to be visited by the wasp (k ≠ 0). Show that
P(L_k) = 1/n.
15. Consider the standard 6 out of 49 lottery.
(a) Show that the probability that 4 of your 6 numbers match those drawn is
13 545 / 13 983 816.
(b) Find the probability that all 6 numbers drawn are odd.
(c) What is the probability that at least one number fails to be drawn in 52 consecutive
drawings?
16. Matching. The first n integers are placed in a row at random. If the integer k is in the kth
place in the row, that is a match. What is the probability that `1' is first, given that there are
exactly m matches?
17. You have n sovereigns and r friends, n > r. Show that the number of ways of dividing the
coins among your friends so that each has at least one is
\binom{n − 1}{r − 1}.
18. A biased coin is flipped 2n times. Show that the probability that the number of heads is the
same as the number of tails is
\binom{2n}{n} (pq)^n.
Use Stirling's formula to show how this behaves as n → ∞.
19. Suppose n objects are placed in a row. The operation S_k is defined thus: `Pick one of the first k
objects at random, and swap it with the object in the kth place'. Now perform S_n, S_{n−1},
. . . , S_1. Show that the final order is equally likely to be any one of the n! permutations of the
objects.
20. Your computer requires you to choose a password comprising a sequence of m characters
drawn from an alphabet of a possibilities, with the constraint that not more than two consecutive
characters may be the same. Let t(m) be the total number of passwords, for m > 2. Show
that
t(m) = (a − 1){t(m − 1) + t(m − 2)}.
Hence find an expression for t(m).
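The recurrence in problem 20 can be checked by brute force for a small alphabet. The sketch below is an editorial aside, not part of the text; the function name and the choice a = 3 are illustrative. It enumerates all strings with no three consecutive equal characters and compares the counts with the recurrence.

```python
from itertools import product

def count_valid(m, a):
    """Count length-m strings over an alphabet of size a in which no
    three consecutive characters are equal, by direct enumeration."""
    return sum(
        1 for s in product(range(a), repeat=m)
        if not any(s[i] == s[i + 1] == s[i + 2] for i in range(m - 2))
    )

# Check t(m) = (a - 1) * (t(m - 1) + t(m - 2)) for a small alphabet.
a = 3
t = {m: count_valid(m, a) for m in range(1, 7)}
for m in range(3, 7):
    assert t[m] == (a - 1) * (t[m - 1] + t[m - 2])
```

The check works because a valid string of length m either ends in a single new character (there are (a − 1)t(m − 1) of these) or in a doubled character preceded by a different one ((a − 1)t(m − 2) of these).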
21. Suppose A and B play a series of a + b − 1 independent games, each won by A with probability
p, or by B with probability 1 − p. Find the probability that A wins at least a games, and hence
obtain the solution (9) in exercise 2 of section 3.7, the problem of the points.
4
Distributions: trials, samples, and
approximation
4.1 PREVIEW
This chapter deals with one of the most useful and important ideas in probability, that is,
the concept of a probability distribution. We have seen in chapter 2 how the probability
function P assigns or distributes probability to the events in Ω. We have seen in chapter 3
how the outcomes in Ω are often numbers or can be indexed by numbers. In these, and
many other cases, P naturally distributes probability to the relevant numbers, which we
may regard as points on the real line. This all leads naturally to the idea of a probability
distribution on the real line, which often can be easily and obviously represented by
simple and familiar functions.
We shall look at the most important special distributions in detail: Bernoulli,
geometric, binomial, negative binomial, and hypergeometric. Then we consider some
important and very useful approximations, especially the Poisson, exponential, and
normal distributions.
In particular, we shall need to deal with problems in which probability is assigned to
intervals in the real line, or even to the whole real line. In such cases we talk of a
probability density, using a rather obvious analogy with the distribution of matter.
Finally, probability distributions and densities in the plane are briefly considered.
Prerequisites. We use elementary results about sequences and series, and their limits,
such as
lim_{n→∞} (1 + x/n)^n = e^x.
See the appendix to chapter 3 for a brief account of these notions.
4.2 INTRODUCTION; SIMPLE EXAMPLES
Very often all the outcomes of some experiment are just numbers. We give some examples.
Temperature. You observe a thermometer and record the temperature to the nearest
degree. The outcome is an integer.
Counter. You turn on your Geiger counter, and note the time when it has counted 10^6
particles. The outcome is a positive real number.
Lottery. The lottery draw yields seven numbers between 1 and 49.
Obviously we could produce yet another endless list of experiments with random
numerical outcomes here: you weigh yourself; you sell your car; you roll a die with
numbered faces, and so on. Write some down yourself. In such cases it is customary and
convenient to denote the outcome of the experiment before it occurs by some appropriate
capital letter, such as X .
We do this in the interests of clarity. Outcomes in general (denoted by ω) can be
anything: rain, or heads, or an ace, for example. Outcomes that are denoted by X (or any
other capital) can only be numerical. Thus, in the second example above we could say
`Let T be the temperature observed'.
In the second example we might say
`Let X be the time needed to count 10^6 particles'.
In all examples of this kind, events are of course just described by suitable sets of
numbers. It is natural and helpful to specify these events by using the previous notation; thus
{a ≤ T ≤ b}
means that the temperature recorded lies between a and b degrees, inclusive. Likewise
{T = 0}
is the event that the temperature is zero. In the same way
{X > x}
means that the time needed to count 10^6 particles is greater than x. In all these cases X
and T are being used in the same way as we used ω in earlier chapters, e.g. rainy days in
example 2.3.2, random numbers in example 2.4.11, and so on.
Finally, because these are events, we can discuss their probabilities. For the events
given above, these would be denoted by
P(a ≤ T ≤ b), P(T = 0), P(X > x),
respectively.
The above discussion has been fairly general; we now focus on a particularly important
special case. That is, the case when X can take only integer values.
Definition. Let X denote the outcome of an experiment in which X can take only
integer values. Then the function p(x) given by
p(x) = P(X = x), x ∈ Z,
is called the probability distribution of X. Obviously p(x) ≥ 0, and we shall show that
Σ_x p(x) = 1.
Note that we need only discuss this function for integer values of x, but it is convenient
(and possible) to imagine that p(x) = 0 when x is not an integer. When x is an integer,
p(x) then supplies the probability that the event {X = x} occurs. Or, more briefly, the
probability that X = x.
Example 4.2.1: die. Let X be the number shown when a fair die is rolled. As always
X ∈ {1, 2, 3, 4, 5, 6},
and of course
P(X = x) = 1/6, x ∈ {1, 2, 3, 4, 5, 6}.
Example 4.2.2: Bernoulli trial. Suppose you engage in some activity that entails that
you either win or lose, for example, a game of tennis or a bet. All such activities are given
the general name of a Bernoulli trial. Suppose that the probability that you win the trial is
p.
Let X be the number of times you win. Putting it in what might seem a rather stilted
way, we write
X ∈ {0, 1}
and
P(X = 1) = p.
Obviously X = 0 and X = 1 are complementary, and so by the complement rule
P(X = 0) = 1 − p = q,
where p + q = 1. The event X = 1 is traditionally known as `success', and X = 0 is
known as `failure'.
The Bernoulli trial is the simplest, but nevertheless an important, random experiment,
and an enormous number of examples are of this type. For illustration consider the
following.
(i) Flip a coin; we may let {head} = {success} = S.
(ii) Each computer chip produced is tested; S = {the chip passes the test}.
(iii) You attempt to start your car one cold morning; S = {it starts}.
(iv) A patient is prescribed some remedy; S = {he is thereby cured}.
In each case the interpretation of failure is obvious; F = S^c.
Inherent in most of these examples is the possibility of repetition. This leads to another
important definition.
The above assumptions enable us to calculate the probability of any given sequence of
successes and failures very easily, by independence. Thus, with an obvious notation,
P(SFS) = pqp = p^2 q,
P(FFFS) = q^3 p,
and so on.
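Such sequence probabilities are purely mechanical to compute. The following sketch is ours, not the book's; it assumes the sequence is encoded as a string of 'S' and 'F' characters, and simply multiplies p or q per trial.

```python
def sequence_probability(seq, p):
    """Probability of a given sequence of independent Bernoulli trials,
    e.g. 'SFS': multiply p for each success and q = 1 - p for each failure."""
    q = 1 - p
    prob = 1.0
    for outcome in seq:
        prob *= p if outcome == 'S' else q
    return prob

p = 0.5
assert sequence_probability('SFS', p) == p * (1 - p) * p    # p^2 q
assert sequence_probability('FFFS', p) == (1 - p) ** 3 * p  # q^3 p
```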
The choice of examples and vocabulary makes it clear in which kind of questions we
are interested. For example:
(i) How long do we wait for the first success?
(ii) How many failures are there in any n trials?
(iii) How long do we wait for the rth success?
The answers to these questions take the form of a collection of probabilities, as we see in
the next few sections.
Further natural sources of distributions arise from measurement and counting. For
example, suppose n randomly chosen children are each measured to the nearest inch, and
N_r is the number of children whose height is recorded as r inches. Then we have argued
often above that ρ_r = N_r/n is (or should be) a reasonable approximation to the
probability p_r that a randomly selected child in this population is r inches tall. Of course
ρ_r ≥ 0 and
Σ_r ρ_r = n^{−1} Σ_r N_r = 1.
Thus ρ_r satisfies the rules for a probability distribution, as well as representing an
approximation to p_r. Such a collection is called an empirical distribution.
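An empirical distribution takes only a few lines of Python to build. This sketch is an editorial illustration (the height data are invented): count each observed value, divide by n, and check that the proportions obey the two rules for a distribution.

```python
from collections import Counter

def empirical_distribution(observations):
    """Return the proportion of observations taking each value.
    The proportions are non-negative and sum to 1, so they obey the
    same rules as a probability distribution."""
    n = len(observations)
    counts = Counter(observations)
    return {x: counts[x] / n for x in sorted(counts)}

heights = [60, 61, 61, 62, 62, 62, 63]   # invented data, in inches
rho = empirical_distribution(heights)

assert all(v >= 0 for v in rho.values())
assert abs(sum(rho.values()) - 1) < 1e-12
```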
Example 4.2.3: Benford's distribution revisited. Let us recall this classic problem,
stated as follows. Take any large collection of numbers, such as the Cambridge statistical
tables, or a report on the Census, or an almanac. Offer to bet, at evens, that a number
picked at random from the book will have first significant digit less than 5. The more
people you can find to accept this bet, the more you will win.
The untutored instinct expects intuitively that all nine possible numbers should be
equally likely. This is not so. Actual experiment shows that empirically the distribution of
probability is close to
(1) p(k) = log_{10}(1 + 1/k), 1 ≤ k ≤ 9.
This is Benford's distribution, and the actual values are approximately
p(1) = 0.301, p(2) = 0.176, p(3) = 0.125,
p(4) = 0.097, p(5) = 0.079, p(6) = 0.067,
p(7) = 0.058, p(8) = 0.051, p(9) = 0.046.
You will notice that p(1) + p(2) + p(3) + p(4) ≈ 0.7; the odds on your winning are
better than two to one. This is perhaps even more flagrant than a lottery.
It turns out that the same rule applies if you look at a larger number of significant
digits. For example, if you look at the first two significant digits, then these pairs lie in the
set {10, 11, . . . , 99}. It is found that they have the probability distribution
p(k) = log_{10}(1 + 1/k), 10 ≤ k ≤ 99.
Why should the distribution of first significant digits be given by (1)? Superficially it
seems rather odd and unnatural. It becomes less unnatural when you recall that the choice
of base 10 in such tables is completely arbitrary. On another planet these tables might be
in base 8, or base 12, or indeed any base. It would be extremely strange if the first digit
distribution was uniform (say) in base 10 but not in the other bases.
We conclude that any such distribution must be in some sense base-invariant. And,
recently, T. P. Hill has shown that Benford's distribution is the only one which satisfies
this condition.
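Equation (1) is easy to explore numerically. The sketch below (our code, not the book's) tabulates Benford's distribution, and confirms both that it sums to 1 and that the first four digits carry about 0.7 of the probability, as claimed above.

```python
from math import log10

# Benford's distribution, (1): p(k) = log10(1 + 1/k), 1 <= k <= 9.
p = {k: log10(1 + 1 / k) for k in range(1, 10)}

# The probabilities sum to 1: the arguments telescope, since
# (2/1)(3/2)...(10/9) = 10 and log10(10) = 1.
assert abs(sum(p.values()) - 1) < 1e-12

# The bet of example 4.2.3: first significant digit less than 5.
win = p[1] + p[2] + p[3] + p[4]   # = log10(5), just under 0.7
assert abs(win - log10(5)) < 1e-12
assert win > 0.69
```

The individual values, rounded, reproduce the table: p(1) ≈ 0.301, p(9) ≈ 0.046.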
In these and all the other examples we consider, a probability distribution is just a
collection of numbers p(x) satisfying the conditions noted above,
Σ_x p(x) = 1, p(x) ≥ 0.
This is ne as far as it goes, but it often helps our intuition to represent the collection
p(x) as a histogram. This makes it obvious at a glance what is going on. For example,
figure 4.1 displays p(0) and p(1) for Bernoulli trials with various values of p(0).
For another example, consider the distribution of probabilities for the sum Z of the
scores of two fair dice. We know that
p(2) = 1/36, p(3) = 2/36, . . . , p(7) = 6/36,
p(8) = 5/36, . . . , p(12) = 1/36,
where p(2) = P(Z = 2), and so on. This distribution is illustrated in figure 4.2, and is
known as a triangular distribution.
Before we turn to more examples let us list the principal properties of a distribution
p(x). First, and obviously by the definition,
(2) 0 ≤ p(x) ≤ 1.
Second, note that if x_1 ≠ x_2 then the events {X = x_1} and {X = x_2} are disjoint. Hence,
by the addition rule (3) of section 2.5,
(3) P(X ∈ {x_1, x_2}) = p(x_1) + p(x_2).
Figure 4.1. Histograms of p(0) and p(1) for Bernoulli trials, for five values of p(0), including p(0) = 1, 1/2, e^{−1}, and 0.
Figure 4.2. Probability distribution of the sum of the scores shown by two fair dice.
More generally, by the extended addition rule (5) of section 2.5, we have the result that
P(X ∈ A) = Σ_{x ∈ A} p(x) for any set A of integers; in particular, Σ_x p(x) = 1.
Note that this way of thinking about probability distributions suggests a neat way of
writing down the probability distribution of a Bernoulli trial: p(x) = p^x q^{1−x}, x ∈ {0, 1}.
Another extremely important but simple distribution, which we have often met before,
is the uniform distribution, for which p(x) = 1/n, x ∈ {1, 2, . . . , n}.
Returning to the histograms discussed above, we see that in these examples each bar of
the histogram is of unit width, and the bar at x is of height p(x); it therefore has area
p(x). Thus the algebraic rules laid out from (2) to (5) can be interpreted in terms of areas.
Most importantly, we can see that the probability that X lies between a and b,
P(a < X < b), is just the area of the histogram lying between a and b. This of course is
why such diagrams are so appealing. Note that the value of the distribution function F(x)
at x is just the area of the histogram to the left of x. Figure 4.3 gives an example.
This idea becomes even more appealing and attractive when we recall that not all
experiments have outcomes confined to the integers, or even to a countable set. Weather-
cocks may point in any direction, isotopes may decay at any time, ropes may snap at any
point. In these cases it is natural to replace the discrete bars of the histogram by a smooth
curve, so that it is still true that P(a , X , b) is represented by the area under the curve
between a and b. Such a curve is called a density. The curve has the property that the
shaded area yields P(a < X ≤ b); we denote this by
(9) P(a < X ≤ b) = ∫_a^b f(x) dx.
We return to this idea in much more detail later.
Figure 4.3. Benford's distribution, (1) in section 4.2. The shaded area is F(4) = P(X ≤ 4), the
probability that you win the bet described in example 4.2.3; it equals 0.7.
4.3 WAITING; GEOMETRIC DISTRIBUTIONS
For all of us, one of the most familiar appearances of probability arises in waiting. You
wait for a server to become free, you wait for a traffic light to switch to green, you wait
for your number to come up on the roulette wheel, you wait for a bus, and so on. Some of
these problems are too complicated for us to analyse here, but some yield a simple and
classical model.
To respect the traditions of the subject, suppose you are flipping a coin, on the
understanding that you get a prize when a head appears for the first time and then you
stop. How long do you have to wait? Obviously there is a physical limit to the number of
tosses; the prize-giver will go bankrupt, or the coin will wear out, or the universe may
even cease to exist, in a finite time. So we suppose that if you have not won the prize on
or before the Kth flip, you quit. Let the probability that you stop on the kth flip be p(k).
Clearly you have to flip at least once to win, so
p(k) = 0, for k = 0, −1, −2, . . . .
Then the probability of heads on the first flip is 1/2; the probability of one tail followed by a
head is 1/4, the probability of two tails followed by a head is 1/8, and so on. The probability of
k − 1 tails followed by a head is 2^{−k}, and
p(k) = 2^{−k}, for k = 1, 2, . . . , K − 1.
The probability of K − 1 tails is 2^{−(K−1)}, and you stop on the next flip, whatever it is, so
p(K) = 2^{−(K−1)}.
Since you never make more than K flips,
p(k) = 0, k = K + 1, K + 2, . . . .
Putting all these together we see that the number of flips until you stop has the
distribution
(1) p(k) = 2^{−k}, for 1 ≤ k ≤ K − 1,
    p(K) = 2^{−(K−1)},
    p(k) = 0, otherwise.
Now, the sequence 2^{−1}, 2^{−2}, 2^{−3}, . . . is a geometric series (see the appendix to chapter 3)
with ratio 1/2. It is therefore called the geometric distribution truncated at K, with
parameter 1/2.
Suppose we now imagine that the coin can be flipped indefinitely. Then the distribution
is
(2) p(k) = 2^{−k}, for k ≥ 1,
    p(k) = 0, otherwise.
This is called the geometric distribution with parameter 1/2.
The assumption that the coin can be tossed indefinitely is not as unrealistic as it sounds.
After all, however many times it has been flipped, you should be able to toss it once more.
And no one objects to the idea of a line being prolonged indefinitely. In both cases we
allow continuation indefinitely because it is almost always harmless, and often very
convenient.
Example 4.3.1: die. If you roll a die and wait for a six, then the same argument as
that used for (2) shows that the number of rolls required has the distribution
(3) p(j) = (1/6)(5/6)^{j−1}, j ≥ 1.
Example 4.3.2: trials. More generally, suppose you have a sequence of independent
Bernoulli trials in which you win with probability p, or lose with probability 1 − p. Then
the number of trials you perform until your first win has the distribution
(4) p(i) = p(1 − p)^{i−1}, i ≥ 1.
This is the geometric distribution with parameter p.
A word of warning is appropriate here; you must be quite clear what you are counting.
Let p(i) be the distribution of the number of trials before you win. This number can be
zero if you win on the first trial, so we should write
(5) p(i) = p(1 − p)^i, i ≥ 0.
This is not the geometric distribution (which is on the positive integers). It is a geometric
distribution.
In any case it is easy to see that p(i) is indeed a probability distribution, as defined in
(4) of section 4.2, because
(6) Σ_i p(i) = Σ_{i=0}^∞ p q^i = p/(1 − q) = 1.
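A quick simulation supports (4). The sketch below is ours; the success probability, sample size, and seed are arbitrary choices. It repeats the experiment many times and compares the empirical frequencies of the waiting time with the geometric probabilities.

```python
import random

random.seed(1)

def trials_until_first_win(p):
    """Perform Bernoulli trials until the first success; return how many."""
    n = 1
    while random.random() >= p:   # failure, with probability 1 - p
        n += 1
    return n

p, N = 0.3, 100_000
samples = [trials_until_first_win(p) for _ in range(N)]

# Empirical frequencies should be close to p(i) = p * (1 - p)**(i - 1).
for i in range(1, 6):
    freq = samples.count(i) / N
    assert abs(freq - p * (1 - p) ** (i - 1)) < 0.01
```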
Example 4.3.3: unlucky numbers. Suppose a certain lottery takes place each week.
Let p be the probability that some given number d is drawn on any given week. After n
successive draws, let p(k) be the probability that d last appeared k weeks ago. What is
p(k)?
Solution. Note first that the probability that d does not appear in any given draw is
1 − p. Now the last occurrence of d is k weeks ago only if it is drawn that week and then
is not drawn on k occasions. This yields
(7) p(k) = p(1 − p)^k, 0 ≤ k ≤ n − 1.
Obviously d fails to appear at all with probability (1 − p)^n, and in accordance with the
principal property of a distribution, (5) in section 4.2, we do indeed have
Σ_{k=0}^{n−1} p(k) + (1 − p)^n = 1.
Comparison of (7) with data from real lotteries shows it to be an excellent description of
reality.
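The accounting in the last display can be confirmed numerically; this is an editorial check, with p and n chosen arbitrarily for illustration.

```python
# The probabilities p(k) = p(1 - p)^k for k = 0, ..., n - 1, together with
# the probability (1 - p)^n that d never appears, must exhaust all cases.
p, n = 0.12, 52   # illustrative values, not from the text
total = sum(p * (1 - p) ** k for k in range(n)) + (1 - p) ** n
assert abs(total - 1) < 1e-12
```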
Remark. Lotteries and roulette wheels publish and keep records of their results. This
is for two incompatible reasons. The rst is that they wish to demonstrate that the
numbers that turn up are indeed completely random. The second is that some gamblers
choose to bet on numbers that have not appeared for a long time. The implicit assump-
tion, that such numbers are more likely to appear next time, is the gambler's fallacy.
Other gamblers choose to bet on the numbers that have appeared most often. Do you
think this is more rational?
Example 4.3.4: `sudden death'. Suppose two players A and B undertake a series of
trials such that each trial independently yields one of the following:
(a) a win for A with probability p;
(b) a win for B with probability q;
(c) a draw (or no result, or a void trial), with probability 1 − p − q.
The game stops at the rst win by either A or B. This is essentially the format of the game
of craps, and such contests are also often used to resolve golf and other tournaments in
which players are tied for the lead at the end of normal play. In this context, they are
called sudden-death playoffs. We may ask:
(i) What is the probability a n that A wins at the nth trial?
(ii) What is the probability that A wins overall?
(iii) What is the probability (n) that the game lasts for n trials?
Solution. For (i): First we note that A wins at the nth trial if and only if the first n − 1
trials are drawn, and A wins the nth. Hence, using independence,
a_n = (1 − p − q)^{n−1} p.
For (ii): By the addition rule for probabilities,
(8) Σ_{n=1}^∞ a_n = p Σ_{n=1}^∞ (1 − p − q)^{n−1} = p/(p + q).
Of course we already know an alternative method for this. Let A_w be the event that A is
the overall winner, and denote the possible results of the first trial by A, B, and D. Then
2. `Sudden death' continued. Let D_n be the event that the duration of the game is n trials, and
let A_w be the event that A is the overall winner. Show that A_w and D_n are independent.
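The sudden-death analysis can be tested by simulation. The sketch below is ours, with arbitrary values of p and q and an arbitrary seed; it plays many games and checks that A's overall winning frequency is near p/(p + q), in accordance with (8).

```python
import random

random.seed(2)

def sudden_death(p, q):
    """Play trials until A or B wins; return the winner and trial count.
    Each trial: A wins with probability p, B with probability q,
    otherwise the trial is drawn and play continues."""
    n = 0
    while True:
        n += 1
        u = random.random()
        if u < p:
            return 'A', n
        if u < p + q:
            return 'B', n

p, q, N = 0.2, 0.3, 100_000
wins_A = sum(1 for _ in range(N) if sudden_death(p, q)[0] == 'A')

# P(A wins overall) = p / (p + q) = 0.4 here, as in (8).
assert abs(wins_A / N - p / (p + q)) < 0.01
```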
4.4 THE BINOMIAL DISTRIBUTION AND SOME RELATIVES
As we have remarked above, in many practical applications it is necessary to perform
some fixed number, n, of Bernoulli trials. Naturally we would very much like to know the
probability of r successes, for various values of r. Here are some obvious examples, some
familiar and some new.
(i) A coin is flipped n times. What is the chance of exactly r heads?
(ii) You have n chips. What is the chance that r are defective?
(iii) You treat n patients with the same drug. What is the chance that r respond well?
(iv) You buy n lottery scratch cards. What is the chance of r wins?
(v) You type a page of n symbols. What is the chance of r errors?
(vi) You call n telephone numbers. What is the chance of making r sales?
This is obviously yet another list that could be extended indefinitely, but in every case the
underlying problem is the same. It is convenient to standardize our names and notation
around Bernoulli trials so we ask the following: in a sequence of n independent Bernoulli
trials with P(S) = p, what is the probability p(k) of k successes?
For variety, and in deference to tradition, we often speak in terms of coins: if you flip a
biased coin n times, what is the probability p(k) of k heads, where P(H) = p?
These problems are the same, and the answer is given by the binomial distribution:
(1) p(k) = \binom{n}{k} p^k q^{n−k}, 0 ≤ k ≤ n.
Proof of (1). When we perform n Bernoulli trials there are 2^n possible outcomes,
because each yields either S or F. How many of these outcomes comprise exactly k
successes and n − k failures? The answer is
\binom{n}{k},
because this is the number of distinct ways of ordering k successes and n − k failures.
(We proved this in section 3.3; see especially the lines before (8).) Now we observe that,
by independence, any given outcome with k successes and n − k failures has probability
p^k q^{n−k}. Hence
p(k) = \binom{n}{k} p^k q^{n−k}, 0 ≤ k ≤ n.
It is interesting, and a useful exercise, to obtain this result in a different way by using
conditional probability. It also provides an illuminating connection with many earlier
ideas, and furthermore illustrates a useful technique for tackling harder problems. In this
case the solution is very simple and runs as follows.
Another proof of (1). Let A(n, k) be the event that n flips show k heads, and let
p(n, k) = P(A(n, k)).
The first flip gives H or T, so by the partition rule (6) of section 2.8
(2) p(n, k) = P(A(n, k)|H)P(H) + P(A(n, k)|T)P(T).
But given H on the first flip, A(n, k) occurs if there are exactly k − 1 heads in the next
n − 1 flips. Hence
P(A(n, k)|H) = p(n − 1, k − 1).
Likewise
P(A(n, k)|T) = p(n − 1, k).
Hence substituting in (2) yields
(3) p(n, k) = p p(n − 1, k − 1) + q p(n − 1, k).
Of course we know that p(n, 0) = q^n and p(n, n) = p^n, so equation (3) successively
supplies values of p(n, k) just as in Pascal's triangle and the problem of the points.
It is now a very simple matter to show that the solution of (3) is indeed given by the
binomial distribution
p(n, k) = \binom{n}{k} p^k q^{n−k}, 0 ≤ k ≤ n.
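Recursion (3) really does rebuild the binomial distribution row by row, just as in Pascal's triangle. The sketch below (our code; the function name is ours) generates row n from row n − 1 and verifies the result against the closed form.

```python
from math import comb

def binomial_by_recursion(n, p):
    """Build the row p(n, 0), ..., p(n, n) from recursion (3):
    p(n, k) = p * p(n-1, k-1) + q * p(n-1, k), starting from p(0, 0) = 1."""
    q = 1 - p
    row = [1.0]                      # n = 0: no trials, no successes
    for _ in range(n):
        new = [q * row[0]]           # k = 0: failure on this trial
        for k in range(1, len(row)):
            new.append(p * row[k - 1] + q * row[k])
        new.append(p * row[-1])      # k = n: success on every trial
        row = new
    return row

n, p = 8, 0.35
row = binomial_by_recursion(n, p)
for k in range(n + 1):
    assert abs(row[k] - comb(n, k) * p**k * (1 - p)**(n - k)) < 1e-12
```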
The connection with Pascal's triangle is made completely obvious if the binomial
probabilities are displayed as a diagram (or graph) as in figure 4.5. This is very similar to
n = 0:  1
n = 1:  q, p
n = 2:  q^2, 2pq, p^2
n = 3:  q^3, 3q^2 p, 3qp^2, p^3
a tree diagram (though it is not in fact a tree). The process starts at the top where no trials
have yet been performed. Each trial yields S or F, with probabilities p and q, and
corresponds to a step down to the row beneath. Hence any path of n steps downwards
corresponds to a possible outcome of the first n trials. The kth entry in the nth row is the
sum of the probabilities of all possible paths to that vertex, which is just p(k). The first
entry at the top corresponds to the obvious fact that the probability of no successes in no
trials is unity.
The binomial distribution is one of the most useful, and we take a moment to look at
some of its more important properties. First we record the simple relationship between
p(k + 1) and p(k), namely
(4) p(k + 1) = \binom{n}{k + 1} p^{k+1} (1 − p)^{n−k−1}
            = [(n − k)/(k + 1)] [n!/(k!(n − k)!)] p^k (1 − p)^{n−k} [p/(1 − p)]
            = [(n − k)/(k + 1)] [p/(1 − p)] p(k).
This recursion, starting either with p(0) = (1 − p)^n or with p(n) = p^n, is very useful in
carrying out explicit calculations in practical cases.
It is also very useful in telling us about the shape of the distribution in general. Note
that
(5) p(k)/p(k + 1) = [(k + 1)/(n − k)] [(1 − p)/p],
which is less than 1 whenever k < (n + 1)p − 1. Thus the probabilities p(k) increase up
to this point. Otherwise, the ratio in (5) is greater than 1 whenever k > (n + 1)p − 1;
Figure 4.6. Binomial distributions. On the left, fixed p and varying n: from top to bottom n = 10, 20, . . . , 100. On the right, fixed n and varying p: from
top to bottom, p = 10%, 20%, . . . , 100%. The histograms have been smoothed for simplicity.
the probabilities decrease past this point. The largest term is p([(n + 1)p]), where
[(n + 1)p] is the largest integer not greater than (n + 1)p. If (n + 1)p happens to be
exactly an integer then
(6) p([(n + 1)p − 1]) / p([(n + 1)p]) = 1
and both these terms are maximal.
The shape of the distribution becomes even more obvious if we draw it; see figure 4.6,
which displays the shape of binomial histograms for various values of n and p.
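The statements about the mode can be checked numerically; the parameter values in this sketch are arbitrary, and the code is an editorial illustration rather than part of the text.

```python
from math import comb, floor

def binomial_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# When (n + 1)p is not an integer, the largest term is p([(n + 1)p]),
# since by (5) the probabilities rise while k < (n + 1)p - 1 and fall after.
n, p = 20, 0.3
mode = floor((n + 1) * p)   # here [(n + 1)p] = [6.3] = 6
probs = [binomial_pmf(n, p, k) for k in range(n + 1)]
assert probs[mode] == max(probs)

# When (n + 1)p is an integer, two adjacent terms tie for the maximum,
# as in (6): here (n + 1)p = 5, so k = 4 and k = 5 are both maximal.
n, p = 9, 0.5
pmf = [binomial_pmf(n, p, k) for k in range(n + 1)]
assert abs(pmf[4] - pmf[5]) < 1e-15
```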
We shall return to the binomial distribution later; for the moment we continue looking
at the simple but important distributions arising from a sequence of Bernoulli trials. So
far we have considered the geometric distribution and the binomial distribution. Next we
have the negative binomial distribution: in a sequence of independent Bernoulli trials with
P(S) = p, the number of trials needed to obtain the kth success has the distribution
(7) p(n) = \binom{n − 1}{k − 1} p^k (1 − p)^{n−k}, n ≥ k.
Example 4.4.1: krakens. Each time you lower your nets you bring up a kraken with
probability p. What is the chance that you need n fishing trips to catch k krakens? The
answer is given by (7).
We can derive this distribution by conditioning also. Let p(n, k) be the probability that
you require n flips to obtain k heads; let F(n, k) be the event that the kth head occurs at
the nth flip. Then, noting that the first flip yields either H or T, we have
4.5 SAMPLING
A problem that arises in just about every division of science and industry is that of
counting or assessing a divided population. This is a bit vague, but a few examples should
make it clear.
Votes. The population is divided into those who are going to vote for the Progressive
party and those who are going to vote for the Liberal party. The politician would like to
know the proportions of each.
Soap. There are those who like `Soapo' detergent and those who do not. The
manufacturers would like to know how many of each there are.
Potatoes. Some plants are developing scab, and others are not. The farmer would like
to know the rate of scab in his crop.
Chips. These are perfect or defective. The manufacturer would like to know the
failure rate.
Fish. These are normal, or androgynous due to polluted water. What proportion are
deformed?
Turkeys. A film has been made. The producers would like to know whether viewers
will like it or hate it.
It should now be quite obvious that in all these cases we have a population or collection
divided into two distinct non-overlapping classes, and we want to know how many there
are in each. The list of similar instances could be prolonged indefinitely; you should think
of some yourself.
The examples have another feature in common: it is practically impossible to count all
the members of the population to find out the proportion in the two different classes. All
we can do is look at a part of the population, and try to extrapolate to the whole of it.
Naturally, some thought and care is required here. If the politician canvasses opinion in
his own office he is not likely to get a representative answer. The farmer might get a
depressing result if he looked only at plants in the damp corner of his field. And so on.
After some thought, you might agree that a sensible procedure in each case would be to
take a sample of the population in such a way that each member of the population has an
equal chance of being sampled. This ought to give a reasonable snapshot of the situation;
the important question is, how reasonable? That is, how do the properties of the sample
relate to the composition of the population? To answer this question we build a
mathematical model, and use probability.
The classical model is an urn containing balls (or slips of paper). The number of balls
(or slips) is the size of the population, the colour of the ball (or slip) denotes which group
it is in. Picking a ball at random from the urn corresponds to choosing a member of the
population, every member having the same chance to be chosen.
Having removed one ball, we are immediately faced with a problem. Do we put it back
or keep it out before the next draw? The answer to this depends on the real population
being studied. If a fish has been caught and dissected, it cannot easily be put back in the
pool and caught again. But voters can be asked for their political opinions any number of
times. In the first case balls are not replaced in the urn, so this is sampling without
replacement. In the second case they are; so that is sampling with replacement. Let us
consider an example of the latter.
white balls. A sample of r balls is removed at random; what is the probability p(k) that it
includes exactly k mauve balls and r − k white balls?
Example 4.5.3: wildlife sampling. Naturalists and others often wish to estimate the
size N of a population of more or less elusive creatures. (They may be nocturnal, or
burrowing, or simply shy.) A simple and popular method is capture-recapture, which is
executed thus:
(i) capture a animals and tag (or mark) them;
(ii) release the a tagged creatures and wait for them to mix with the remaining N − a;
(iii) capture n animals and count how many are already tagged (these are recaptures).
2. Acceptance sampling. A shipment of components (called a lot) arrives at your factory. You
test their reliability as follows. For each lot of 100 components you take 10 at random, and test
these. If no more than one is defective you accept the lot. What is the probability that you accept
a lot of size 100 which contains 7 defectives?
3. In (5), show that p(r − 1) p(r + 1) ≤ {p(r)}^2.
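Exercise 2 can be checked with a short computation using the probabilities of sampling without replacement (the hypergeometric distribution). This sketch is ours, not the book's solution; the function name is illustrative.

```python
from math import comb

def hypergeom_pmf(N, D, n, k):
    """P(k defectives in a sample of n, drawn at random without
    replacement from a lot of N items containing D defectives)."""
    return comb(D, k) * comb(N - D, n - k) / comb(N, n)

# Acceptance sampling: test 10 of 100 components, 7 of which are
# defective; accept the lot if at most one tested component is defective.
N, D, n = 100, 7, 10
p_accept = hypergeom_pmf(N, D, n, 0) + hypergeom_pmf(N, D, n, 1)
assert 0.85 < p_accept < 0.86   # a little under 0.86
```

So even a lot with a 7% defect rate is accepted most of the time, which is worth knowing before relying on such a scheme.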
4.6 LOCATION AND DISPERSION
Suppose we have some experiment, or other random procedure, that yields outcomes in
some finite set of numbers D, with probability distribution p(x), x ∈ D. The following
property of a distribution turns out to be of great practical and theoretical importance:
the mean of the distribution is the number
μ = Σ_{x ∈ D} x p(x).
Natural questions are, why this number, and why is it useful? We answer these queries
shortly; first of all let us look at some simple examples.
Example 4.6.1: coin. Flip a fair coin and count the number of heads. Trivially
f0, 1g, and p(0) 12 p(1). Hence the mean is
12 3 0 12 3 1 12:
This example is truly trivial, but it does illustrate that the mean is not necessarily one of
the possible outcomes of the experiment. In this case the mean is half a head. (Journalists
and others with an impaired sense of humour sometimes seek to find amusement in this;
the average family size will often achieve the same effect, as it involves fractional
children. Of course real children are fractious not fractions . . .) s
148 4 Distributions: trials, samples, and approximation
Example 4.6.2: die. If you roll a die once, the outcome has distribution p(x) = 1/6,
1 ≤ x ≤ 6. Then
μ = 1/6 × 1 + 1/6 × 2 + 1/6 × 3 + 1/6 × 4 + 1/6 × 5 + 1/6 × 6 = 7/2. s
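Readers who like to experiment can compute such means mechanically. A small Python sketch (the dictionaries here are ours):

```python
def mean(dist):
    # dist maps each outcome x to its probability p(x); mean = sum of x * p(x)
    return sum(x * p for x, p in dist.items())

coin = {0: 1/2, 1: 1/2}              # number of heads in one flip
die = {x: 1/6 for x in range(1, 7)}  # score of one fair die

mu_coin = mean(coin)  # half a head
mu_die = mean(die)    # 7/2
```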
Example 4.6.3: craps. If you roll two dice, then the distribution of the sum of their
scores is given in Figure 4.6. After a simple but tedious calculation you will find that
μ = 7. s
At first sight the mean, μ, may not seem very useful or fascinating, but there are in fact
many excellent reasons for our interest in it. Here are some of them.
Sample mean and relative frequency. Suppose you have a number n of similar
objects, n potatoes, say, or n hedgehogs. You could then measure any numerical attribute
(such as spines, or weight, or length), and obtain a collection of observations
{x_1, x_2, ..., x_n}. It is widely accepted that the average
x̄ = (1/n) Σ_{r=1}^{n} x_r
is a reasonable candidate for a single number to represent or typify this collection of
measurements. Now suppose that some of these numbers are the same, as they often will
be in a large set of data. Let the number of times you obtain the value x be N(x); thus the
proportion yielding x is
P(x) = N(x)/n.
We have argued above that, in the long run, P(x) is close to the probability p(x) that x
occurs. Now the average x̄ satisfies
x̄ = n⁻¹(x_1 + ··· + x_n) = n⁻¹ Σ_x x N(x) = Σ_x x P(x) ≈ Σ_x x p(x),
approximately, in the long run. It is important to remark that we can give this informal
observation plenty of formal support, later on.
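That informal observation is easy to watch in a simulation. Here is a Python sketch for a fair die; the sample size and seed are arbitrary choices of ours:

```python
import random

random.seed(7)  # fix the seed so the experiment is reproducible

n = 100_000
rolls = [random.randint(1, 6) for _ in range(n)]
sample_mean = sum(rolls) / n  # close to the mean 7/2 for large n
```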
Mean as centre of gravity. We have several times made the point that probability is
analogous to mass; we have a unit lump of probability, which is then split up among the
Example 4.6.4: sample mean. Suppose you have n lottery tickets bearing the
numbers x1 , x2 , . . . , x n (or perhaps you have n swedes weighing x1 , x2 , . . . , x n ); one of
these is picked at random. What is the mean of the resulting distribution?
Of course we have the probability distribution
p(x_1) = p(x_2) = ··· = p(x_n) = 1/n
and so
μ = Σ x p(x) = n⁻¹ Σ_{r=1}^{n} x_r = x̄.
The sample mean is equal to the average. s
Example 4.6.5: binomial mean. By definition (1), the mean of the binomial distribu-
tion is given by
μ = Σ_{k=0}^{n} k p(k) = Σ_{k=1}^{n} k [n!/(k!(n − k)!)] p^k q^{n−k}
= np Σ_{k=1}^{n} [(n − 1)!/((k − 1)!(n − k)!)] p^{k−1} q^{n−k}
= np Σ_{x=0}^{n−1} C(n − 1, x) p^x q^{n−1−x}
= np(p + q)^{n−1}
= np.
We shall nd neater ways of deriving this important result in the next chapter. s
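Before the neater derivations arrive, the result μ = np is easy to confirm by direct summation. A Python sketch, with parameters chosen arbitrarily:

```python
from math import comb

def binomial_mean(n, p):
    # Evaluate sum_k k * C(n, k) p^k q^(n - k) directly
    q = 1 - p
    return sum(k * comb(n, k) * p**k * q**(n - k) for k in range(n + 1))

mu = binomial_mean(20, 0.3)  # should equal 20 * 0.3 = 6
```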
Example 4.6.6: geometric mean. As usual, from (1) the mean of the geometric
distribution is
μ = Σ_{k=1}^{∞} k q^{k−1} p = p/(1 − q)² = p⁻¹.
Note that we summed the series by looking in appendix 3.12.III. s
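The series converges fast enough that a partial sum already shows the answer. A Python sketch with p = 0.2, an arbitrary choice:

```python
p = 0.2
q = 1 - p

# Partial sum of sum_{k >= 1} k q^(k-1) p; the neglected tail is negligible here.
partial = sum(k * q**(k - 1) * p for k in range(1, 2001))
# partial is very close to 1/p = 5
```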
These examples, and our discussion, make it clear that the mean is useful as a guide to
the location of a probability distribution. This is convenient for simple-minded folk such
as journalists (and the media in general); if you are replying to a request for information
about accident rates, or defective goods, or lottery winnings, it is pointless to supply the
press with a distribution; it will be rejected. You will be allowed to use at most one
number; the mean is a simple and reasonably informative candidate.
Furthermore, we shall find many more theoretical uses for it later on. But it does have
drawbacks, as we now discover; the keen-eyed reader will have noticed already that while
the mean tells you where the centre of probability mass is, it does not tell you how spread
out or dispersed the probability distribution is.
Example 4.6.7. In a casino the following bets are available for the same price (a
price greater than $1000).
(i) You get $1000 for sure.
(ii) You get $2000 with probability 1/2, or nothing.
(iii) You get $10⁶ with probability 10⁻³, or nothing.
Calculating the mean of these three distributions we find
For (i), μ = $1000.
For (ii), μ = 1/2 × $2000 + 1/2 × $0 = $1000.
For (iii), μ = 10⁻³ × $10⁶ + (1 − 10⁻³) × $0 = $1000.
Thus all these three distributions have the same mean, namely $1000. But obviously they
are very different bets! Would you be happy to pay the same amount to play each of these
games? Probably not; most people would prefer one or another of these wagers, and your
preference will depend on how rich you are and whether you are risk-averse or risk-
seeking. There is much matter for speculation and analysis here, but we note merely the
trivial point that these three distributions vary in how spread out they are about their
mean. That is to say, (i) is not spread out at all, (ii) is symmetrically disposed not too far
from its mean and (iii) is very spread out indeed. s
There are various ways of measuring such a dispersion, but it seems natural to begin by
ignoring the sign of deviations from the mean μ, and just looking at their absolute
magnitude, weighted of course by their probability. It turns out that the algebra is much
simplified in general if we use the following measure of dispersion in a probability
distribution.
The variance is a weighted average of the squared distance of outcomes from the mean; it
is sometimes called the second moment about the mean because of the analogy with mass
mentioned often above.
Example 4.6.7 revisited. For the three bets on offer it is easy to find the variance in
each case:
For (i), σ² = 0.
For (ii), σ² = 1/2 (0 − 1000)² + 1/2 (2000 − 1000)² = 10⁶.
For (iii), σ² = (1 − 10⁻³)(0 − 10³)² + 10⁻³(10⁶ − 10³)² ≈ 10⁹.
Clearly, as the distribution becomes more spread out σ² increases dramatically. s
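The three bets make a convenient machine check. In this Python sketch (ours, not the text's) the exact variance of bet (iii) comes out as 999 000 000, which is indeed about 10⁹:

```python
def mean_var(dist):
    # dist maps outcome -> probability; return (mean, variance)
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    return mu, var

bets = {
    "i": {1000: 1.0},
    "ii": {0: 1/2, 2000: 1/2},
    "iii": {0: 1 - 1e-3, 10**6: 1e-3},
}
results = {name: mean_var(d) for name, d in bets.items()}
# every mean is 1000; the variances are 0, 10^6 and 999 000 000
```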
In order to keep the same scale, it is often convenient to use σ rather than σ².
Definition. The positive square root σ of the variance σ² is known as the standard
deviation of the distribution. n
(3) σ² = Σ_{x∈D} (x − μ)² p(x).
Example 4.6.10. Show that for any distribution p(x) with mean μ and variance σ²
we have
σ² = Σ_{x∈D} x² p(x) − μ².
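This identity is also easy to verify numerically; for a fair die both forms give 35/12. A Python sketch:

```python
def mean(dist):
    return sum(x * p for x, p in dist.items())

def var_from_definition(dist):
    # sum of (x - mu)^2 p(x)
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def var_from_identity(dist):
    # sum of x^2 p(x), minus mu^2
    mu = mean(dist)
    return sum(x * x * p for x, p in dist.items()) - mu ** 2

die = {x: 1/6 for x in range(1, 7)}
v1 = var_from_definition(die)  # 35/12
v2 = var_from_identity(die)    # the same
```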
Example 4.6.13: distribution with finite mean but infinite variance. Let
p(x) = c/|x|³, x = ±1, ±2, ...,
where c⁻¹ = Σ_{x≠0} |x|⁻³. Then Σ_{x=1}^{∞} x p(x) = c Σ_{x=1}^{∞} x⁻² = π²c/6 < ∞, and by
symmetry μ = 0. However,
σ² = Σ_x x² p(x) = 2c Σ_{x=1}^{∞} 1/x = ∞. s
Remark: median and mode. We have seen in examples 4.6.11 and 4.6.12 above that
the mean may not be finite, or even exist. Nevertheless in these examples (and many
similar cases) we would like a rough indication of location. Luckily, some fairly obvious
candidates offer themselves. If we look at example 4.6.11 we note that the distribution is
symmetrical about zero, and the values ±1 are considerably more likely than any others.
These two observations suggest the following two ideas.
Roughly speaking, outcomes are equally likely to be on either side of a median, and the
most likely outcomes are modes.
Example 4.6.11 revisited. Here any number in [−1, 1] is a median, and ±1 are both
modes. (Remember there is no mean for this distribution.) s
Example 4.6.12 revisited. Here 1 is the only mode, and it is also the only median
because 6/π² > 1/2. (Remember that μ = ∞ in this case.) s
Remark: mean and median. It is important to stress that the mean is only a crude
summary measure of the distribution. It tells you something about the distribution of
probability, but not much. In particular it does not tell you that
Σ_{x>μ} p(x) = 1/2.
5. Benford. Show that the expected value of the first significant digit in (for example) census
data is 3.44, approximately. (See example 4.2.3 for the distribution.)
4.7 APPROXIMATIONS: A FIRST LOOK
At this point the reader may observe this expanding catalogue of different distributions
with some dismay. Not only are they too numerous to remember with enthusiasm, but
many comprise a tiresomely messy collection of factorials that promise tedious calcula-
tions ahead.
Fortunately, things are not as bad as they seem because, for most practical purposes,
many of the distributions we meet can be effectively approximated by much simpler
functions. Let us recall an example to illustrate this.
Example 4.7.1: polling voters. Voters belong either to the red party or the green
party. There are r reds, g greens, and v = r + g voters altogether. You take a random
sample of size n, without asking any voter twice. Let A_k be the event that your sample
includes k greens. This is sampling without replacement, and so of course from (2) of
section 4.5 you know that
(1) P(A_k) = C(g, k) C(r, n − k) / C(g + r, n),
where C(m, j) = m!/(j!(m − j)!) denotes a binomial coefficient. This is a hypergeometric
distribution. The formula is rather disappointing, as calculating it for many values of the
parameters is going to be dull and tedious at best. And results are unlikely to appear in a
simple form.
However, it is often the case that v, g, and r are very large compared with k and n.
(Typically n might be 1000, while r and g are in the millions.) In this case if we set
p = g/v, q = 1 − p = r/v
and remember that k/v and n/v are very small, we can argue as follows. For fixed n and
k, as v, g, and r become increasingly large,
g/v → p, (g − 1)/v → p, ..., (g − k + 1)/v → p,
(v − k + 1)/v → 1, ..., (r − n + k + 1)/v → q,
and so on. Hence
(2) P(A_k) = [g!/(k!(g − k)!)] × [r!/((n − k)!(r − n + k)!)] × [n!(r + g − n)!/(r + g)!]
= C(n, k) × [(g/v)((g − 1)/v) ··· ((g − k + 1)/v)] × [(r/v) ··· ((r − n + k + 1)/v)]
÷ [(v/v)((v − 1)/v) ··· ((v − n + 1)/v)]
≈ C(n, k) p^k q^{n−k}
for large r, g, and v. Thus in these circumstances the hypergeometric distribution is very
well approximated by the binomial distribution, for many practical purposes. s
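The quality of the approximation is easy to inspect numerically. In this Python sketch the electorate figures are invented, in the spirit of the example:

```python
from math import comb

def hypergeometric(g, r, n, k):
    # P(A_k): k greens in a sample of n, drawn without replacement
    return comb(g, k) * comb(r, n - k) / comb(g + r, n)

def binomial(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

g = r = 1_000_000            # a large electorate, half green
n = 10
p = g / (g + r)
worst = max(abs(hypergeometric(g, r, n, k) - binomial(n, p, k))
            for k in range(n + 1))
# worst is microscopic: the binomial stand-in is excellent here
```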
Example 4.7.2: rare greens. Suppose in the above example that there are actually
very few greens; naturally we need to make our sample big enough to have a good chance
of registering a reasonable number of them. Now if g, and hence p, are very small, we
have
(3) P(A₁) ≈ np(1 − p)^{n−1}.
For this to be a reasonable size as p decreases we must increase n in such a way that np
stays at some desirable constant level, λ say.
In this case, if we set np = λ, which is fixed as n increases, we have as n → ∞
(1 − 1/n) → 1, ..., (1 − (k − 1)/n) → 1,
(1 − p)^{−k} = (1 − λ/n)^{−k} → 1,
and
(1 − p)^n = (1 − λ/n)^n → e^{−λ}.
Hence
(4) P(A_k) = C(n, k) p^k (1 − p)^{n−k}
= [n(n − 1) ··· (n − k + 1)/k!] (λ/n)^k (1 − λ/n)^{n−k}
= (λ^k/k!) (1 − 1/n)(1 − 2/n) ··· (1 − (k − 1)/n) (1 − λ/n)^n (1 − λ/n)^{−k}
→ (λ^k/k!) e^{−λ}, as n → ∞.
This is called the Poisson distribution. We should check that it is a distribution; it is,
since each term is positive and
Σ_{k=0}^{∞} e^{−λ} λ^k/k! = e^{−λ} e^{λ} = 1.
It is so important that we devote the next section to it, giving a rather different
derivation. s
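The convergence can be watched numerically: hold np = λ = 2 fixed and let n grow. A Python sketch; the chosen values of λ and n are arbitrary:

```python
from math import comb, exp, factorial

def binomial_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    return exp(-lam) * lam**k / factorial(k)

lam = 2.0
worst = {n: max(abs(binomial_pmf(n, lam / n, k) - poisson_pmf(lam, k))
                for k in range(20))
         for n in (100, 1000, 10000)}
# worst[n] shrinks as n grows, roughly like 1/n
```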
4.8 SPARSE SAMPLING; THE POISSON DISTRIBUTION
Another problem that arises in almost every branch of science is that of counting rare
events. Once again, this slightly opaque statement is made clear by examples.
Accidents. Any stretch of road, or road junction, is subject to the occasional accident.
How many are there in a given stretch of road? How many are there during a fixed period
at some intersection?
Misprints. An unusually good typesetter makes a mistake very rarely. How many are
there on one page of a broadsheet newspaper? How many does she make in a year?
Currants. A frugal baker adds a small packet of currants to his batch of dough. How
many currants are in each bun? How many in each slice of currant loaf?
Clearly this is another list which could be extended indefinitely. You have to think only
for a moment of the applications to counting: colonies of bacteria on a dish; flaws in a
carpet; bugs in a program; earwigs in your dahlias; daisy plants in your lawn; photons in
your telescope; lightning strikes on your steeple; wasps in your beer; mosquitoes on your
neck; and so on.
Once again we need a canonical example that represents or acts as a model for all the
rest. Tradition is not so inflexible in this case (we are not bound to coins and urns as we
were above). For a change, we choose to count the meteorites striking Bristol during a
time period of length t, [0, t], say.
The period is divided up into n equal intervals; as we make the intervals smaller
(weeks, days, seconds, . . .), the number n becomes larger. We assume that the intervals
are so small that the chance of two or more strikes in the same interval is negligible.
Furthermore meteorites take no account of our calendar, so it is reasonable to suppose
that strikes in different intervals are independent, and that the chance of a strike is the
same for each of the n intervals, p say. (A more advanced model would take into account
the fact that meteorites sometimes arrive in showers.) Thus the total number of strikes in
the n intervals is the same as the number of successes in n Bernoulli trials, with
distribution
p(k) = C(n, k) p^k (1 − p)^{n−k}, 0 ≤ k ≤ n,
which is binomial. These assumptions are in fact well supported by observation.
Now obviously p depends on the size of the interval; there must be more chance of a
strike during a month than during a second. Also it seems reasonable that if p is the
chance of a strike in one minute, then the chance of a strike in two minutes should be
about 2p, and so on. This amounts to the assumption that np/t is a constant, which we
call λ. So
np = λt.
Thus as we increase n and decrease p so that λt is fixed, we have exactly the situation
considered in example 4.7.2, with λ replaced by λt. Hence, as n → ∞,
P(k strikes in [0, t]) → e^{−λt} (λt)^k / k!,
the Poisson distribution of (4) in section 4.7.
The important point about the above derivation is that it is generally applicable to many
other similar circumstances. Thus, for example, we could replace `meteorites' by
`currants' and `the interval [0, t]' by `the cake'; the `n divisions of the interval' then
become the `n slices of the cake', and we find that a fruit cake made from a large batch of
well-mixed dough will contain a number of currants with a Poisson distribution, approxi-
mately.
The same argument has yielded approximate Poisson distributions observed for flying
bomb hits on London in 1939–45, soldiers disabled by horse-kicks in the Prussian
Cavalry, accidents along a stretch of road, and so on. In general, rare events that occur
independently but consistently in some region of time or space, or both, will often follow
a Poisson distribution. For this reason it is sometimes called the law of rare events.
Notice that we have to count events that are isolated, that is to say occur singly, because
we have assumed that only one event is possible in a short enough interval. Therefore we
do not expect the number of people involved in accidents at a junction to have a simple
Poisson distribution, because there may be several in each vehicle. Likewise the number
of daisy flowers in your lawn may not be Poisson, because each plant has a cluster of
flowers. And the number of bacteria on a Petri dish may not be Poisson, because the
separate colonies form tightly packed groups. The colonies, however, may well have an
approximately Poisson distribution.
We have come a long way from the hypergeometric distribution but, surprisingly, we
can go further still. It will turn out that for large values of the parameter , the Poisson
distribution can be usefully approximated by an even more important distribution, the
normal distribution. But this lies some way ahead.
4.9 CONTINUOUS APPROXIMATIONS
We have seen above that, in many practical situations, complicated and unwieldy
distributions can be usefully replaced by simpler approximations; for example, sometimes
the hypergeometric distribution can be approximated by a binomial distribution; this in
turn can sometimes be approximated by a Poisson distribution. We are now going to extend
this idea even further.
First of all consider an easy example. Let X be the number shown by a fair n-faced die.
Thus X has a uniform distribution on {1, ..., n}, and its distribution function is shown
in Figure 4.8, for some large unspecified value of n.
Now if you were considering this distribution for large values of n, and sketched it
many times every day, you would in general be content with the picture in Figure 4.9.
The line
(1) y = x/n, 0 ≤ x ≤ n,
is a very good approximation to the function
Figure 4.9. The line y = x/n provides a reasonable approximation to figure 4.8.
(2) F(x) = [x]/n
(recall that [x] means `the integer part of x'), as is obvious from the figures. Indeed, for
all x
|y − F(x)| = |x − [x]|/n ≤ 1/n,
so if we use the approximation to calculate probabilities, we find that we have
P(a < X ≤ b) ≈ b/n − a/n,
where the exact result is
P(a < X ≤ b) = [b]/n − [a]/n.
The difference between the exact and approximate answers is always less than 2/n, which
may be negligible for large n. The function y = x/n is an excellent continuous approximation
to F(x) for large n.
We can also consider a natural continuous approximation to the actual discrete
distribution
p(k) = 1/n, 1 ≤ k ≤ n,
when p(k) is displayed as a histogram; see figure 4.10. Clearly the constant function
y′ = 1/n, 0 ≤ x ≤ n,
fits p(k) exactly. Now remember that
F(x) = Σ_{k≤x} p(k),
so that F(x) is just the area under the histogram to the left of x. It is therefore important
to note also that y = x/n is the area under y′ = 1/n to the left of x, as we would wish.
Figure 4.10. Histogram of the uniform discrete distribution p(k) = 1/n, 1 ≤ k ≤ n.
Even neater results are obtained by scaling X by a factor of n; that is to say, we write
P(a < X/n ≤ b) = ([nb] − [na])/n
exactly, and then define U(x) = x and u(x) = 1, for 0 ≤ x ≤ 1. Then for large n
(3) P(a < X/n ≤ b) ≈ b − a = U(b) − U(a).
In particular note that when h is small this gives
P(x < X/n ≤ x + h) ≈ h = u(x)h.
The uniform distribution is so simply dealt with as to be almost dull. Next let us
consider a much more important and interesting case. The geometric distribution
(4) p(k) = p(1 − p)^{k−1}, k ≥ 1,
arose when we looked at the waiting time X for a success in a sequence of Bernoulli
trials. Once again we consider two figures. Figure 4.11 shows the distribution function
P(X ≤ x) = F(x) = Σ_{k=1}^{[x]} p(k) = 1 − (1 − p)^{[x]}, x ≥ 1,
for a reasonably small value of p. Figure 4.12 shows what you would be content to sketch
in general, to gain a good idea of how the distribution behaves. We denote this curve by
E(x).
It is not quite so obvious this time what E(x) actually is, so we make a simple
calculation.
Let p = λ/n, where λ is fixed and n may be as large as we please. Now for any fixed x
Figure 4.11. The geometric distribution function y(x) = P(X ≤ x) = Σ_{k=1}^{[x]} p(k) = 1 − (1 − p)^{[x]}.
Figure 4.12. The function y = E(x) provides a reasonable approximation to figure 4.11.
Thus
(8) P(a < X/n ≤ b) = (1 − p)^{[na]} − (1 − p)^{[nb]}
= (1 − λ/n)^{[na]} − (1 − λ/n)^{[nb]}
≈ e^{−λa} − e^{−λb}
= E(b) − E(a).
It can be shown that, for some constant c,
|(1 − λ/n)^{[na]} − e^{−λa}| < c/n,
so this approximation is not only simple, it is close to the correct expression for large n.
Just as for the uniform distribution, we can obtain a natural continuous approximation
to the actual discrete distribution (4), when expressed as a histogram.
Figure 4.13. Histogram of the geometric distribution p(k) = q^{k−1} p, together with the continuous
approximation y = λe^{−λx} (broken line).
From (8) we have, for small h,
P(a < X/n ≤ a + h) ≈ e^{−λa} − e^{−λ(a+h)} ≈ λe^{−λa} h.
Thus, as Figure 4.13 and (8) suggest, the distribution (4) is well fitted by the curve
e(x) = λe^{−λx}.
Again, just as F(x) is the area under the histogram to the left of x, so also does E(x) give
the area under the curve e(x) to the left of x.
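A few lines of Python confirm how well E(x) = 1 − e^{−λx} tracks the geometric distribution function; λ = 1 and n = 1000 below are arbitrary choices of ours:

```python
from math import exp

lam, n = 1.0, 1000
p = lam / n

def F_geometric(x):
    # P(X/n <= x) = 1 - (1 - p)^[nx] for the geometric waiting time X
    return 1 - (1 - p) ** int(n * x)

def E(x):
    return 1 - exp(-lam * x)

worst = max(abs(F_geometric(k / 10) - E(k / 10)) for k in range(1, 51))
# worst is of order 1/n, as the error bound above suggests
```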
These results, though interesting, are supplied mainly as an introduction to our
principal task, which is to approximate the binomial distribution. That we do next, in
section 4.10.
as U (x), u(x), E(x), and e(x) did for the uniform and geometric distributions respectively.
And Φ(x) gives the area under the curve φ(x) to the left of x; see figure 4.14.
Figure 4.14. The normal function φ(x) = (2π)^{−1/2} exp(−x²/2). The shaded area is Φ(y) =
∫_{−∞}^{y} φ(x) dx. Note that φ(−x) = φ(x) and Φ(−x) = 1 − Φ(x).
This is one of the most remarkable and important results in the theory of probability; it
was first shown by de Moivre. A natural question is, why is the result so important that de
Moivre expended much effort proving it, when so many easier problems could have
occupied him?
The most obvious motivating problem is typified by the following. Suppose you
perform 10⁶ Bernoulli trials with P(S) = 1/2, and for some reason you want to know the
probability that the number of successes lies between a = 500 000 and b = 501 000.
This probability is given by
(5) Σ_{k=a}^{b} C(10⁶, k) 2^{−10⁶}.
Calculating this sum is a very unattractive prospect indeed; it is natural to ask if there is any
hope for a useful approximation. Now, a glance at the binomial diagrams in section 4.4
shows that there is some hope. As n increases, the binomial histograms are beginning to
get closer and closer to a bell-shaped curve. To a good approximation, therefore, we
might hope that adding up the huge number of small but horrible terms in (5) could be
replaced by finding the appropriate area under this bell-shaped curve; if the equation of
the curve were not too difficult, this might be an easier task. It turns out that our hope is
justified, and there is such a function. The bell-shaped curve is the very well-known
function
(6) f(x) = (2π)^{−1/2} exp(−x²/2).
This was first realized and proved by de Moivre in 1733. He did not state his results in
this form, but his conclusions are equivalent to the following celebrated theorem.
Here X is the number of successes, with μ = np and σ = (npq)^{1/2}; then
(7) P(a < X ≤ b) ≈ Φ((b − μ)/σ) − Φ((a − μ)/σ)
and
(8) P(a < X ≤ a + 1) ≈ (1/σ) φ((a − μ)/σ).
Alternatively, as we did in section 4.9, we can scale X and write (7) and (8) in the
equivalent forms
(9) P(a < (X − μ)/σ ≤ b) ≈ Φ(b) − Φ(a)
and, for small h,
(10) P(a < (X − μ)/σ ≤ a + h) ≈ h φ(a).
As in our previous examples, Φ(x) supplies the area under φ(x), to the left of x; there are
tables of this function in many books on probability and statistics (and elsewhere), so that
we can use the theorem in practical applications. Table 4.1 lists φ(x) and Φ(x) for some
half-integer values of x.
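Nowadays one need not rely on printed tables: Φ can be computed from the error function supplied by any standard maths library. A Python sketch:

```python
from math import erf, exp, pi, sqrt

def phi(x):
    # the normal function phi(x) = (2 pi)^(-1/2) exp(-x^2 / 2)
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):
    # its area to the left of x, via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

# entries of the kind tabulated in Table 4.1
table = {x / 2: Phi(x / 2) for x in range(0, 7)}
```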
We give a sketch proof of de Moivre's theorem later on, in section 4.15; for the moment
let us concentrate on showing how useful it is. For example, consider the expression (5)
above. By (7), to use the normal approximation we need to calculate
μ = np = 500 000
and
σ = (npq)^{1/2} = 500.
Now de Moivre's theorem says that, approximately, the required probability is
(11) Φ(2) − Φ(0) ≈ 0.977 − 0.5 = 0.477,
from Table 4.1. This is so remarkably easy that you might suspect a catch; however, there
is no catch, this is indeed our answer. The natural question is, how good is the
approximation? We answer this by comparing the exact and approximate results in a
number of cases. From the discussion above, it is obvious already that the approximation
should be good for large enough n, for it is in this case that the binomial histograms can
be best fitted by a smooth curve.
We omit any detailed proof of this, but it is intuitively clear, if you just remember that
Φ(x) measures the area under φ(x). (Draw a diagram.)
The result (14) is sometimes called the local limit theorem.
One further approximate relationship that is occasionally useful is Mills' ratio: for
large positive x
(15) 1 − Φ(x) ≈ φ(x)/x.
We offer no proof of this either.
It is intuitively clear from all these results that the normal approximation is better
It can be shown with much calculation, which we omit, that for p = 1/2 and n ≥ 10, the
error in the approximation (13) is always less than 0.01, when you use the continuity
correction. For n ≥ 20, the maximum error is halved again.
If p ≠ 1/2, then larger values of n are required to keep the error small. In fact the worst
error is given approximately by the following rough and ready formula when npq ≥ 10:
(16) worst error ≈ |p − q| / (10 (npq)^{1/2}).
If you do not use the continuity correction the errors may be larger, especially when
|a − b| is small.
Here are some examples to illustrate the use of the normal approximation. In each case
you should spend a few moments appreciating just how tiresome it would be to answer
the question using the binomial distribution as it stands.
Example 4.10.1. In the course of a year a fanatical gambler makes 10 000 fair
wagers. (That is, winning and losing are equally likely.) The gambler wins 4850 of these
and loses the rest. Was this very unlucky? (Hint: Φ(−3) ≈ 0.0013.)
Solution. The number X of wins (before the year begins) is binomial, with μ = 5000
and σ = 50. Now
P(X ≤ 4850) = P((X − 5000)/50 ≤ −3)
≈ Φ(−3)
≈ 0.0013.
The chance of winning 4850 games or fewer was only 0.0013, so one could regard the
actual outcome, winning only 4850 games, as unlucky. s
Example 4.10.2: rivets. A large steel plate is fixed with 1000 rivets. Any rivet is
flawed with probability 10⁻². If the plate contains more than 100 flawed rivets it will
spring in heavy seas. What is the probability of this?
Solution. The number X of flawed rivets is binomial B(10³, 10⁻²), with μ = 10 and
σ² = 9.9. Hence
P(X > 100) = 1 − P(X ≤ 100) ≈ 1 − P((X − 10)/3.2 ≤ 90/3.2)
≈ 1 − Φ(28)
≈ (1/28) φ(28), by (15),
≈ (1/28)(2π)^{−1/2} exp(−392).
This number is so small that it can be ignored for all practical purposes. The ship would
have rusted to nothing while you were waiting. Its seaworthiness could depend on how
many plates like this were used, but we do not investigate further here. s
Example 4.10.3: cheat or not? You suspect that a die is crooked, i.e. that it has been
weighted to show a six more often than it should. You decide to roll it 180 times and
count the number of sixes. For a fair die the expected number of sixes is (1/6) × 180 = 30,
and you therefore contemplate adopting the following rule. If the number of sixes is
between 25 and 35 inclusive then you will accept it as fair. Otherwise you will call it
crooked. This is a serious allegation, so naturally you want to know the chance that you
will call a fair die crooked. The probability that a fair die will give a result in your
`crooked' region is
p(c) = 1 − Σ_{k=25}^{35} C(180, k) (1/6)^k (5/6)^{180−k}.
Calculating this is fairly intimidating. However, the normal approximation easily and
quickly gives
p(c) = 1 − P(−1 ≤ (X − 30)/5 ≤ 1) ≈ 1 − [Φ(1) − Φ(−1)]
≈ 0.32.
This value is rather greater than you would like: there is a chance of about a third that
you accuse an honest player of cheating. You therefore decide to weaken the test, and
accept that the die is honest if the number of sixes in 180 rolls lies between 20 and 40.
The normal approximation now tells you that the chance of calling the die crooked when
it is actually fair is
1 − P(−2 ≤ (X − 30)/5 ≤ 2) ≈ 1 − [Φ(2) − Φ(−2)]
≈ 0.04.
Whether this is a safe level for false accusations depends on whose die it is. s
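For this example the exact binomial tails are still within a computer's reach, so the approximation can be checked directly. A Python sketch:

```python
from math import comb, erf, sqrt

def Phi(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def binomial_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 180, 1/6
exact = 1 - sum(binomial_pmf(n, p, k) for k in range(25, 36))
approx = 1 - (Phi(1) - Phi(-1))  # the normal approximation, about 0.32
# the two differ by a few hundredths; a continuity correction narrows the gap
```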
Example 4.10.4: airline overbooking. Acme Airways has discovered by long experi-
ence that there is a 1/10 chance that any passenger with a reservation fails to show up for the
flight. If AA accepts 441 reservations for a 420-seat flight, what is the probability that
they will need to bump at least one passenger?
Solution. We assume that passengers show up or not independently of each other. The
number that shows up is binomial B(441, 9/10), and we want the probability that this
number exceeds 420. The normal approximation to the binomial shows that this prob-
ability is very close to
1 − Φ((420 + 1/2 − 396.9)/(441 × (9/10) × (1/10))^{1/2}) = 1 − Φ(3.75)
≈ 0.0001. s
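Since modern machines can sum the binomial tail exactly, the size of this probability is easy to confirm. A Python sketch using the figures of the example:

```python
from math import comb

n, p = 441, 9/10   # reservations; probability a passenger shows up
seats = 420

# exact probability that more passengers show up than there are seats
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(seats + 1, n + 1))
# exact is tiny (well under 0.001): this level of overbooking is very safe
```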
4.11 DENSITY
We have devoted much attention to discrete probability distributions, particularly those
with integer outcomes. But, as we have remarked above, many experiments have
outcomes that may be anywhere on some interval of the line; a rope may break anywhere,
a meteorite may strike at any time. How do we deal with such cases? The answer is
suggested by the previous sections, in which we approximated probabilities by expressing
them as areas under some curve. And this idea was mentioned even earlier in example
4.2.4, in which we pointed out that areas under a curve can represent probabilities.
We therefore make the following definition.
Example 4.11.1: uniform density. Suppose a rope of length l under tension is equally
likely to fail at any point. Let X be the point at which it does fail, supposing one end to
be at the origin. Then, for 0 ≤ a < b < l,
P(a < X ≤ b) = (b − a) l⁻¹ = ∫_a^b l⁻¹ dx.
Hence X has density
f(x) = l⁻¹, 0 ≤ x ≤ l. s
Remark. Note that f(x) is not itself a probability; only the area under f(x) can be a
probability. This is obvious from the above example, because if the rope is short, and
l < 1, then f(x) > 1. This is not possible for a probability.
P(x < X ≤ x + h) ≈ h f(x).
Thus h f(x) is the approximate probability that X lies within the small interval (x, x + h];
this idea replaces discrete probabilities.
Obviously from (1) we have that
(2) f(x) ≥ 0,
and
(3) ∫_{−∞}^{∞} f(x) dx = 1.
Furthermore we have, as in the discrete case, the following
See appendix 4.14 for a discussion of the integral. For now, just read ∫_a^b f(x) dx as the area under f(x) between
a and b.
Key rule for densities. Let C be any subset of R such that P(X ∈ C) exists. Then
P(X ∈ C) = ∫_{x∈C} f(x) dx.
We shall find this very useful later on. For the moment here are some simple examples.
First, from the above denition of density we see that any function f (x) satisfying (2)
and (3) can be regarded as a density. In particular, and most importantly, we see that the
functions used to approximate discrete distributions in sections 4.9 and 4.10 are densities.
We return to these later. Here is one final complementary example which provides an
interpretation of the above remarks.
Example 4.11.4. Suppose you have a lamina L whose shape is the region lying
between y = 0 and y = f(x), where f(x) ≥ 0 and L has area 1. Pick a point P at random
in L, with any point equally likely to be chosen. Let X be the x-coordinate of the point P.
Then by construction P(a < X ≤ b) = ∫_a^b f(x) dx, and so f(x) is the density of X. s
Example 4.12.1. You roll two dice. The possible outcomes are the set of ordered
pairs {(i, j): 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}. s
Example 4.12.2. Your doctor weighs and measures you. The possible outcomes are
of the form (x grams, y millimetres), where x and y are positive and less than 10⁶, say.
(We assume your doctor's scales and measures round off to whole grams and millimetres
respectively.) s
You can easily think of other examples yourself. The point is that these outcomes are
not single numbers, so we cannot usefully identify them with points on a line. But we can
usefully identify them with points in the plane, using Cartesian coordinates for example.
Just as scalar outcomes yielded distributions on the line, so these outcomes yield
distributions on the plane. We give some examples to show what is going on.
The rst natural way in which such distributions arise is in the obvious extension of
Bernoulli trials to include ties. Thus each trial yields one of
{S, F, T} = {success, failure, tie}.
We shall call these de Moivre trials.
Solution. Just as for the binomial distribution of n Bernoulli trials, there are several
different ways of showing this. The simplest is to note that, by independence, any
sequence of n trials including exactly x successes and y failures has probability
p^x q^y (1 − p − q)^{n−x−y}, because the remaining n − x − y trials are all ties. Next, by (4) of
section 3.3, the number of such sequences is the trinomial coefficient
n!/(x! y! (n − x − y)!),
and this proves (1) above. s
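The trinomial probabilities sum to 1 over all feasible pairs (x, y), which is quick to confirm by machine; the parameters in this Python sketch are arbitrary:

```python
from math import factorial

def trinomial_pmf(n, p, q, x, y):
    # P(x successes, y failures, and n - x - y ties) in n de Moivre trials
    coeff = factorial(n) // (factorial(x) * factorial(y) * factorial(n - x - y))
    return coeff * p**x * q**y * (1 - p - q) ** (n - x - y)

n, p, q = 10, 0.5, 0.3
total = sum(trinomial_pmf(n, p, q, x, y)
            for x in range(n + 1) for y in range(n - x + 1))
# total equals 1, up to rounding error
```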
4.13 Distributions in the plane 173
Example 4.12.4: uniform distribution. Suppose you roll two dice, and let X and Y
be their respective scores. Then by the independence of the dice
P(X = x, Y = y) = 1/36, 1 ≤ x, y ≤ 6.
This is the uniform distribution on {1, 2, 3, 4, 5, 6}². s
It should now be clear that, at this simple level, such distributions can be treated using
the same ideas and methods as we used for distributions on the line. There is of course a
regrettable increase in the complexity of notation and equations, but this is inevitable. All
consideration of the more complicated problems that can arise from such distributions is
postponed to chapter 6, but we conclude with a brief glance at location and spread.
Given our remarks about probability distributions on the line, it is natural to ask what
can be said about the location and spread of distributions in the plane, or in three
dimensions. The answer is immediate if we pursue the analogy with mass. Recall that we
visualized a discrete probability distribution p(x) on the line as being a unit mass divided
up so that a mass p(x) is found at x. Then the mean is just the centre of gravity of this
mass distribution, and the variance is its moment of inertia about the mean.
With this in mind it seems natural to regard a distribution in R² (or R³) as being a distribution of masses p(x, y) such that Σ_{x,y} p(x, y) = 1. Then the centre of gravity is at G = (x̄, ȳ), where
x̄ = Σ_{x,y} x p(x, y)  and  ȳ = Σ_{x,y} y p(x, y).
We define the mean of the distribution p(x, y) to be the point (x̄, ȳ). By analogy with the spread of mass, the spread of this distribution is indicated by its moments of inertia with respect to the x- and y-axes,
σ₁² = Σ_{x,y} (x − x̄)² p(x, y)
and
σ₂² = Σ_{x,y} (y − ȳ)² p(x, y).
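For the two-dice uniform distribution of example 4.12.4, the mean and the moments of inertia defined above are easy to compute directly. A small Python sketch (the helper name is ours):

```python
def mean_and_moments(p):
    """Given p mapping (x, y) -> probability, return (xbar, ybar, s1sq, s2sq)."""
    xbar = sum(x * pr for (x, y), pr in p.items())
    ybar = sum(y * pr for (x, y), pr in p.items())
    s1sq = sum((x - xbar) ** 2 * pr for (x, y), pr in p.items())
    s2sq = sum((y - ybar) ** 2 * pr for (x, y), pr in p.items())
    return xbar, ybar, s1sq, s2sq

# The uniform distribution on {1,...,6}^2: two fair dice.
p = {(x, y): 1 / 36 for x in range(1, 7) for y in range(1, 7)}
print(mean_and_moments(p))  # mean (3.5, 3.5); each moment of inertia is 35/12
```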
What is the probability that the sum of the two numbers is 3? Find the mean and variance of X
and Y .
2. You roll a die, which shows X , and then ip X fair coins, which show Y heads. Find
P(X x, Y y), and hence calculate the mean of Y .
174 4 Distributions: trials, samples, and approximation
4.13 REVIEW
In this chapter we have looked at the simplest models for random experiments. These
give rise to several important probability distributions. We may note in particular the
following.
Bernoulli trial
P(S) = P(success) = p = 1 − q,
P(F) = P(failure) = q.
Geometric distribution for the first success in a sequence of independent Bernoulli trials:
p(k) = P(k trials for 1st success) = pq^{k−1},  k ≥ 1.
Negative binomial distribution for the number of Bernoulli trials needed to achieve k successes:
p(n) = \binom{n−1}{k−1} p^k q^{n−k},  n ≥ k.
Poisson distribution with parameter λ:
p(k) = e^{−λ} λ^k/k!,  k ≥ 0.
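The geometric, negative binomial, and Poisson distributions reviewed above can be checked numerically. A minimal Python sketch (the function names are ours, and each sum is truncated where the tail is negligible):

```python
from math import comb, exp, factorial

def geometric_pmf(k, p):
    """P(k trials are needed for the first success), k >= 1."""
    return p * (1 - p) ** (k - 1)

def negative_binomial_pmf(n, k, p):
    """P(n trials are needed to achieve k successes), n >= k."""
    return comb(n - 1, k - 1) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(k events) for a Poisson distribution with parameter lam, k >= 0."""
    return exp(-lam) * lam**k / factorial(k)

# Sanity check: each distribution sums (essentially) to 1.
print(round(sum(geometric_pmf(k, 0.3) for k in range(1, 200)), 6))             # 1.0
print(round(sum(negative_binomial_pmf(n, 4, 0.3) for n in range(4, 400)), 6))  # 1.0
print(round(sum(poisson_pmf(k, 2.5) for k in range(100)), 6))                  # 1.0
```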
We introduced the ideas of mean and variance as measures of the location and spread
of the probability in a distribution. Table 4.2 shows some important means and variances.
Very importantly, we went on to note that probability distributions could be well
approximated by continuous functions, especially as the number of sample points
becomes large. We use these approximations in two ways. First, the local approximation,
which says that if p(k) is a probability distribution, there may be a function f (x) such
that p(k) ≃ f(x). Second, the global approximation, which follows from the local:
Σ_{k=a}^{b} p(k) ≃ the area under f(x) between a and b
= F(b) − F(a) for some function F(x).
Occasionally it is useful to improve these approximations by making a continuity
correction to take account of the fact that f (x) is continuous but p(k) is not.
In particular we considered the normal approximation to the binomial distribution
p(k) = \binom{n}{k} p^k q^{n−k}
with mean μ = np and variance σ² = npq. This approximation is given by
p(k) ≃ (1/σ) φ((k − μ)/σ),
where
φ(x) = (2π)^{−1/2} exp(−½x²)
and
F(x) = Φ(x) = ∫_{−∞}^{x} φ(u) du.
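The local and global approximations, together with the continuity correction, can be compared with exact binomial probabilities. A Python sketch (the choice n = 100, p = 1/2 is ours, for illustration; Φ is computed from the error function):

```python
from math import comb, erf, exp, pi, sqrt

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def phi(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):
    """Standard normal distribution function, computed from the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

n, p = 100, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))

# Local approximation: p(k) is close to (1/sigma) * phi((k - mu)/sigma).
print(binom_pmf(50, n, p), phi(0) / sigma)

# Global approximation for P(45 <= X <= 55), with the continuity correction.
exact = sum(binom_pmf(k, n, p) for k in range(45, 56))
approx = Phi((55.5 - mu) / sigma) - Phi((44.5 - mu) / sigma)
print(round(exact, 4), round(approx, 4))
```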
Convergence. Let (x_n; n ≥ 1) be a sequence of real numbers. Suppose that there is a real number a such that |x_n − a| is always ultimately as small as we please; formally,
|x_n − a| < ε  for all n > n₀,
where ε is arbitrarily small and n₀ is finite. n
In this case the sequence (x_n) is said to converge to the limit a. We write either
x_n → a as n → ∞,
or
lim_{n→∞} x_n = a.
Now let f(x) be any function defined in some interval (α, β), except possibly at the point x = a. Let (x_n) be a sequence converging to a, such that x_n ≠ a for any n. Then (f(x_n); n ≥ 1) is also a sequence; it may converge to a limit l.
Limits of functions. If the sequence (f(x_n)) converges to the same limit l for every sequence (x_n) converging to a, x_n ≠ a, then we say that the limit of f(x) at a is l. We write either
f(x) → l as x → a,  or  lim_{x→a} f(x) = l. n
Suppose now that f(x) is defined in the interval (α, β), and let lim_{x→a} f(x) be the limit of f(x) at a. This may or may not be equal to f(a). Accordingly we define: f(x) is said to be continuous at a if lim_{x→a} f(x) = f(a).
Now, given a continuous function f(x), we are often interested in two principal questions about f(x).
(i) What is the slope (or gradient) of f (x) at the point x a?
(ii) What is the area under f (x) lying between a and b?
4.14 Appendix. Calculus 177
Question (i) is answered by looking at chords of f (x). For any two points a and x, the slope of the
chord from f (a) to f (x) is
s(x) = (f(x) − f(a))/(x − a).
If s(x) has a limit as x → a, then this is what we regard as the slope of f(x) at a. We call it the derivative of f(x), and say that f(x) is differentiable at a.
For question (ii), let f(x) be a function defined on [a, b]. Then the area under the curve f(x) in [a, b] is denoted by
∫_a^b f(x) dx,
and is called the integral of f (x) from a to b. In general, areas below the x-axis are counted as
negative; for a probability density this case does not arise, because density functions are never
negative.
The integral is also defined as a limit, but any general statements would take us too far afield. For well-behaved positive functions you can determine the integral as follows. Plot f(x) on squared graph paper with interval length 1/n. Let S_n be the number of squares lying entirely between f(x) and the x-axis between a and b. Set
I_n = S_n/n².
Then
lim_{n→∞} I_n = ∫_a^b f(x) dx.
The function f (x) is said to be integrable.
Of course we almost never obtain integrals by performing such a limit. We almost always use a
method that relies on the following, most important, connexion between differentiation and
integration.
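The counting-squares recipe above can be carried out mechanically. A Python sketch (we take f(x) = x² on [0, 1], whose integral is 1/3, and approximate the minimum of f over each strip by its endpoint values, which is adequate for functions that are monotone on each strip):

```python
def squares_integral(f, a, b, n):
    """Approximate the integral of a positive f over [a, b] by counting the
    complete (1/n)-by-(1/n) squares lying between the curve and the x-axis."""
    count = 0
    i = 0
    while a + (i + 1) / n <= b + 1e-12:
        lo, hi = a + i / n, a + (i + 1) / n
        # Column height: the curve's minimum over the strip, sampled at the
        # strip's endpoints.
        height = min(f(lo), f(hi))
        count += int(height * n)  # complete squares in this column
        i += 1
    return count / n**2           # I_n = S_n / n^2

f = lambda x: x * x
for n in (10, 100, 1000):
    print(n, squares_integral(f, 0.0, 1.0, n))  # tends to 1/3 as n grows
```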
4.15 APPENDIX. SKETCH PROOF OF THE NORMAL LIMIT THEOREM
In this section we give an outline of a proof that the binomial distribution is well approximated by the bell-shaped curve φ(x). We will do the symmetric case first, since the algebra is more transparent and enables the reader to appreciate the method unobscured by technical details. To be precise, the symmetric binomial distribution is
p(k) = \binom{n}{k} 2^{−n},
with mean μ = n/2 and variance σ² = n/4. It is the claim of the normal limit theorem that for large n and moderate k − μ
p(k) ≃ (2πσ²)^{−1/2} exp{−(k − μ)²/(2σ²)}.
(It can be shown more precisely that a sufficient condition is that (k − μ)³/n² must be negligibly small as n increases.)
To demonstrate the theorem we need to remember three things.
Arithmetic series
(2) Σ_{k=1}^{n−1} k = ½ n(n − 1).
Exponential approximation
e^x = 1 + x + x²/2! + ⋯,
which implies that for small x we can write
(3) 1/(1 − x) ≃ 1 + x ≃ e^x,
and the quadratic and higher terms can be neglected.
Stirling's formula
n! ≃ (2πn)^{1/2} nⁿ e⁻ⁿ as n → ∞.
Now to sketch the theorem. Assume k > n/2 and n even, without much loss of generality. Then
p(k) = n! 2^{−n}/(k!(n − k)!)
= [n! 2^{−n}/((n/2)!(n/2)!)] × [(n/2)(n/2 − 1)⋯(n − k + 1)]/[k(k − 1)⋯(n/2 + 1)].
Now we use Stirling's formula in the first term, and divide top and bottom of the second term by (n/2)^{k−n/2}, to give
p(k) ≃ (2πσ²)^{−1/2} × [(1 − 2/n)(1 − 4/n)⋯(1 − (2/n)(k − 1 − n/2))]/[(1 + 2/n)(1 + 4/n)⋯(1 + (2/n)(k − n/2))].
Now we use the exponential approximation, and then sum the arithmetic series as follows. From the above,
p(k) ≃ (2πσ²)^{−1/2} exp{−(4/n)[1 + 2 + ⋯ + (k − 1 − n/2)] − (2/n)(k − n/2)}
= (2πσ²)^{−1/2} exp{−(2/n)[(k − n/2)(k − 1 − n/2) + (k − n/2)]}
= (2πσ²)^{−1/2} exp{−½((k − μ)/σ)²},
as required.
In the asymmetric case, when p ≠ q, we use exactly the same line of argument, but the algebra becomes a bit more tedious. We have in this case
p(k) = \binom{n}{k} p^k q^{n−k}.
We shall assume for convenience that k > np, and that np and nq are integers. The method can still be forced through when they are not, but with more fiddly details. Remember that
μ = np
and
σ² = npq.
Recall that we regard terms like (k − np)³/n² and (k − np)/n, and anything smaller, as negligible as n increases. Off we go:
P(k successes) = p(k) = \binom{n}{k} p^k q^{n−k}
= [n! p^k q^{n−k}/((nq)!(np)!)] × [nq(nq − 1)⋯(nq − k + 1 + np)]/[(np + k − np)(np + k − np − 1)⋯(np + 1)]
≃ (2πnpq)^{−1/2} × [(1 − 1/(nq))(1 − 2/(nq))⋯(1 − (k − 1 − np)/(nq))]/[(1 + 1/(np))(1 + 2/(np))⋯(1 + (k − np)/(np))], by Stirling,
≃ (2πnpq)^{−1/2} exp{−(1/(nq) + 1/(np))[1 + 2 + ⋯ + (k − 1 − np)] − (k − np)/(np)}, by (3),
= (2πnpq)^{−1/2} exp{−(1/(2npq))(k − np)(k − 1 − np) − (k − np)/(np)}
≃ (2πnpq)^{−1/2} exp{−(k − np)²/(2npq)}, ignoring the negligible term (k − np)(p − q)/(2npq),
= (2πσ²)^{−1/2} exp{−½((k − μ)/σ)²},
as required.
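Both the symmetric and the asymmetric case of the theorem are easy to test numerically. A Python sketch comparing the exact binomial probability with the approximating normal curve (the sample values of n, p, and k are our choices):

```python
from math import comb, exp, pi, sqrt

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_approx(k, n, p):
    """The approximating normal curve with mean np and variance npq."""
    mu, var = n * p, n * p * (1 - p)
    return exp(-((k - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)

# Symmetric and asymmetric cases: the ratio is close to 1 for k near the mean.
for n, p in ((100, 0.5), (400, 0.3)):
    k = int(n * p) + 5
    print(n, p, binom_pmf(k, n, p) / normal_approx(k, n, p))
```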
4.16 PROBLEMS
1. You roll n dice; all those that show a six are rolled again. Let X be the number of resulting
sixes. What is the distribution of X ? Find its mean and variance.
2. For what values of c1 and c2 are the following two functions probability distributions?
(a) p(x) = c₁x, 1 ≤ x ≤ n.
(b) p(x) = c₂/{x(x + 1)}, x ≥ 1.
3. Show that for the Poisson distribution p(x) = λ^x e^{−λ}/x! the variance is λ. Show also that
{p(x)}² ≥ p(x − 1)p(x + 1).
10. Tagging. A population of n animals has had a number t of its members captured, tagged,
and released back into the population. At some later time animals are captured again, without
replacement, until the first capture at which m tagged animals have been caught. Let X be the number of captures necessary for this. Show that X has the distribution
p(k) = P(X = k) = (t/n) \binom{t−1}{m−1} \binom{n−t}{k−m} / \binom{n−1}{k−1},
where m ≤ k ≤ n − t + m.
11. Runs. You flip a coin h + t times, and it shows h heads and t tails. An unbroken sequence of heads, or an unbroken sequence of tails, is called a run. (Thus the outcome HTTHH contains 3 runs.) Let X be the number of runs of heads in your sequence. Show that X has the distribution
p(x) = P(X = x) = \binom{h−1}{x−1} \binom{t+1}{x} / \binom{h+t}{h}.
What is the distribution of the number of runs of tails?
12. You roll two fair n-sided dice, each bearing the numbers {1, 2, …, n}. Let X be the sum of
their scores. What is the distribution of X ? Find continuous functions T (x) and t(x) such that
for large n
P(X/n ≤ x) ≃ T(x)
and
P(x < X/n ≤ x + h) ≃ t(x)h.
13. You roll two dice; let X be the score shown by the rst die, and let W be the sum of the scores.
Find
p(x, w) = P(X = x, W = w).
14. Consider the standard 6/49 lottery (six numbers are chosen from {1, …, 49}). Let X be the largest number selected. Show that X has the distribution
p(x) = \binom{x−1}{5} / \binom{49}{6},  6 ≤ x ≤ 49.
What is the distribution of the smallest number selected?
15. When used according to the manufacturer's instructions, a given pesticide is supposed to kill
any treated earwig with probability 0.96. If you apply this treatment to 1000 earwigs in your
garden, what is the probability that there are more than 20 survivors? (Hint: Φ(3.2) ≃ 0.9993.)
16. Candidates to compete in a quiz show are screened; any candidate passes the screen-test with
probability p. Any contestant in the show wins the jackpot with probability t, independently of
other competitors. Let X be the number of candidates who apply until one of them wins the
jackpot. Find the distribution of X .
17. Find the largest term in the hypergeometric probability distribution, given in (2) of section 4.5. If m = w + t, find the value of t for which (2) is largest, when m, r, and k are fixed.
18. You perform n independent de Moivre trials, each with r possible outcomes. Let X i be the
number of trials that yield the ith possible outcome. Prove that
P(X₁ = x₁, …, X_r = x_r) = (n!/(x₁! ⋯ x_r!)) p₁^{x₁} ⋯ p_r^{x_r},
where pi is the probability that any given trial yields the ith possible outcome.
19. Consider the standard 6/49 lottery again, and let X be the largest number of the six selected,
and Y the smallest number of the six selected.
(a) Find the distribution P(X = x, Y = y).
(b) Let Z be the number of balls drawn that have numbers greater than the largest number not
drawn. Find the distribution of Z.
20. Two integers are selected at random with replacement from {1, 2, …, n}. Let X be the absolute difference between them (X ≥ 0). Find the probability distribution of X, and its
expectation.
21. A coin shows heads with probability p, or tails with probability q. You flip it repeatedly. Let X be the number of flips until at least two heads and at least two tails have appeared. Find the distribution of X, and show that it has expected value 2{(pq)⁻¹ − 1 − pq}.
22. Each day a robot manufactures m + n capeks; each capek has probability δ of being defective, independently of the others. A sample of size n (without replacement) is taken from each day's output, and tested (n ≥ 2). If two or more capeks are defective, then every one in that day's output is tested and corrected. Otherwise the sample is returned and no action is taken. Let X be the number of defective capeks in the day's output after this procedure. Show that
P(X = x) = [{m + 1 + (n − 1)x} m!/((m − x + 1)! x!)] δ^x (1 − δ)^{m+n−x},  x ≤ m + 1.
Show that X has expected value
{m + n + m(n − 1)δ} δ (1 − δ)^{n−1}.
23. (a) Let X have a Poisson distribution with parameter λ. Use the Poisson and normal approximations to the binomial distribution to deduce that for large enough λ
P((X − λ)/√λ ≤ x) ≃ Φ(x).
(b) In the years 1979–99 in Utopia the average number of deaths per year in traffic accidents is 730. In the year 2000 there are 850 deaths in traffic accidents, and on New Year's Day 2001, there are 5 such deaths, more than twice the daily average for 1979–99.
The newspaper headlines speak of `New Year's Day carnage', without mentioning the total figures for the year 2000. Is this rational?
24. Families. A woman is planning her family and considers the following possible schemes.
(a) Bear children until a girl is born, then stop.
(b) Bear children until the family first includes children of both sexes, and then stop.
(c) Bear children until the family first includes two girls and two boys, then stop.
Assuming that boys and girls are equally likely, and multiple births do not occur, find the mean
family size in each case.
25. Three points A, B, and C are chosen independently at random on the perimeter of a circle. Let p(a) be the probability that at least one of the angles of the triangle ABC exceeds πa. Show that
p(a) = 1 − (3a − 1)²,  1/3 ≤ a ≤ 1/2,
and
p(a) = 3(1 − a)²,  1/2 ≤ a ≤ 1.
26. (a) Two players play a game comprising a sequence of points in which the loser of a point
serves to the following point. The probability is p that a point is won by the player who
serves. Let f_m be the expected number of the first m points that are won by the player who serves first. Show that
f_m = pm + (1 − 2p) f_{m−1}.
Find a similar equation for the number that are won by the player who first receives service. Deduce that
f_m = m/2 − [(1 − 2p)/(4p)]{1 − (1 − 2p)^m}.
(b) Now suppose that the winner of a point serves to the following point, things otherwise
being as above. Of the first m points, let e_m be the expected number that are won by the player who serves first. Find e_m. Is it larger or smaller than f_m?
Review of Part A, and preview of Part B
I have yet to see a problem, however complicated, which, when you looked at
it in the right way, did not become still more complicated.
P. Anderson, New Scientist, 1969
We began by discussing several intuitive and empirical notions of probability, and how
we experience it. Then we defined a mathematical theory of probability using the framework of experiments, outcomes, and events. This included the ideas of independence and conditioning. Finally, we considered many examples in which the outcomes
were numerical, and this led to the extremely important idea of probability distributions
on the line and in higher dimensions. We also introduced the ideas of mean and variance,
and went on to look at probability density. All this relied essentially on our definition of
probability, which proved extremely effective at tackling these simple problems and
ideas.
Now that we have gained experience and insight at this elementary level, it is time to
turn to more general and perhaps more complicated questions of practical importance.
These often require us to deal with several random quantities together, and in more
technically demanding ways. It is also desirable to have a unied structure, in which
probability distributions and densities can be treated together.
For all these reasons, we now introduce the ideas and methods of random variables,
which greatly aid us in the solution of problems that cannot easily be tackled using the
naive machinery of Part A. This is particularly important, as it enables us to get to grips
with modern probability. Everything in Part A would have been familiar in the 19th
century, and much of it was known to de Moivre in 1750. The idea of a random variable
was finally made precise only in 1933, and this has provided the foundations for all the
development of probability since then. And that growth has been swift and enormous.
Part B provides a rst introduction to the wealth of progress in probability in the 20th
century.
Part B
Random Variables
5
Random variables and their distributions
5.1 PREVIEW
It is now clear that for most of the interesting and important problems in probability, the
outcomes of the experiment are numerical. And even when this is not so, the outcomes
can nevertheless often be represented uniquely by points on the line, or in the plane, or in
three or more dimensions. Such representations are called random variables. In the
preceding chapter we have actually been studying random variables without using that
name for them. Now we develop this idea with new notation and background. There are
many reasons for this, but the principal justification is that it makes it much easier to
solve practical problems, especially when we need to look at the joint behaviour of
several quantities arising from some experiment. There are also important theoretical
reasons, which appear later.
In this chapter, therefore, we first define random variables, and introduce some new
notation that will be extremely helpful and suggestive of new ideas and results. Then we
give many examples and explore their connections with ideas we have already met, such
as independence, conditioning, and probability distributions. Finally we look at some new
tasks that we can perform with these new techniques.
Prerequisites. We shall use some very elementary ideas from calculus; see the
appendix to chapter 4.
5.2 INTRODUCTION TO RANDOM VARIABLES
In chapter 4 we looked at experiments in which the outcomes in Ω were numbers; that is to say, Ω ⊆ R or, more generally, Ω ⊆ Rⁿ. This enabled us to develop the useful and attractive properties of probability distributions and densities. Now, experimental outcomes are not always numerical, but we would still like to use the methods and results of chapter 4. Fortunately we can do so, if we just assign a number to any outcome ω ∈ Ω in some natural or convenient way. We denote this number by X(ω). This procedure simply defines a function on Ω; often there will be more than one such function. In fact it is almost always better to work with such functions than with events in the original sample space.
Of course, the key to our success in chapter 4 was using the probability distribution function
F(x) = P(X ≤ x) = P(B_x),
where the event B_x is given by
B_x = {X ≤ x} = {ω: X(ω) ≤ x}.
We therefore make the following definition.
Definition. A random variable X is a real-valued function X(ω) defined on the sample space Ω. n
Remark. You may well ask, as many students do on first meeting the idea, why we
need these new functions. Some of the most important reasons arise in slightly more
advanced work, but even at this elementary level you will soon see at least four reasons.
(i) This approach makes it much easier to deal with two or more random variables;
(ii) dealing with means, variances, and related quantities, is very much simpler when
we use random variables;
(iii) this is by far the best machinery for dealing with functions and transformations;
(iv) it unifies and simplifies the notation and treatment for different kinds of random variable.
Here are some simple examples.
Example 5.2.3: medical. You go for a check-up. The sample space is rather too large
to describe here, but what you are interested in is a collection of numbers comprising
your height, weight, and values of whatever other physiological variables your physician
measures. s
Example 5.2.4: breakdowns. You buy a car. Once again, the sample space is large,
but you are chiefly interested in the times between breakdowns, and the cost of repairs
each time. These are numbers, of course. s
Example 5.2.5: opinion poll. You ask people whether they approve of the present
government. The sample space could be
{approve strongly, approve, indifferent, disapprove, disapprove strongly}.
You might find it very convenient in analysing your results to represent Ω by the numerical scale
S = {−2, −1, 0, 1, 2},
or if you prefer, you could use the non-negative scale
Q = {0, 1, 2, 3, 4}.
You are then dealing with random variables. s
Example 5.2.6: craps. You roll two dice, yielding X and Y. You play the game using the combined score Z = X + Y, where 2 ≤ Z ≤ 12, and Z is a random variable. s
Example 5.2.7: medical. Your physician may measure your weight and height, yielding the random variables X kilograms and Y metres. It is then customary to find the value V of your body-mass index, where
V = X/Y².
It is felt to be desirable that the random variable V should be inside, or not too far outside, the interval [20, 25]. s
Example 5.2.8: poker. You are dealt a hand at poker. The sample space comprises
\binom{52}{5}
possible hands. What you are interested in is the number of pairs, and whether or not you have three of a kind, four of a kind, a flush, and so on. This gives you a short set of numbers telling you how many of these desirable features you have. s
Example 5.2.9: election. In an election, let the number of votes garnered by the ith
candidate be X i . Then in the simple rst-past-the-post system the winner is the one with
the largest number of votes Y (= max_i X_i). s
Example 5.2.10: coins. Flip a coin n times. Let X be the number of heads and Y the number of tails. Clearly X + Y = n. Now let Z be the remainder on dividing X by 2, i.e.
Z = X modulo 2.
Then Z is a random variable taking values in {0, 1}. If X is even then Z = 0; if X is odd then Z = 1. s
If we take account of order in flipping coins, we can construct a rich array of interesting random variables with complicated relationships (which we will explore later).
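The coin-flipping random variables X, Y, and Z of example 5.2.10 can be exhibited concretely as functions on the sample space. A Python sketch (with n = 3 flips, for illustration):

```python
from itertools import product

n = 3
omega = list(product("HT", repeat=n))  # sample space: all sequences of n flips

X = lambda w: w.count("H")  # number of heads
Y = lambda w: w.count("T")  # number of tails
Z = lambda w: X(w) % 2      # remainder on dividing X by 2

for w in omega:
    assert X(w) + Y(w) == n  # X + Y = n on every outcome

# For a fair coin every outcome has probability 1/2^n, so:
print(sum(Z(w) for w in omega) / len(omega))  # P(Z = 1) = 0.5
```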
It is important to realize that although all random variables have the above structure,
and share many properties, there are significantly different types. The following example
shows this.
Example 5.2.11. You devise an experiment that selects a point P randomly from the interval [0, 2], where any point may be chosen. Then the sample space is [0, 2], or formally
Ω = {ω: ω ∈ [0, 2]}.
Now define X and Y by
X(ω) = 0 if 0 ≤ ω < 1,  X(ω) = 1 if 1 ≤ ω ≤ 2,  and Y(ω) = ω².
Naturally X and Y are both random variables, as they are both suitable real-valued functions on Ω. But clearly they are very different in kind; X can take one of only two values, and is said to be discrete. By contrast Y can take any one of an uncountable number of values in [0, 4]; it is said to be continuous. s
We shall develop the properties of these two kinds of random variable side by side
throughout this book. They share many properties, including much of the same notation,
but there are some differences, as we shall see.
Furthermore, even within these two classes of random variable there are further
subcategories, which it is often useful to distinguish. Here is a short list of some of them.
Indicator random variable. If X can take only the values 0 or 1, then X is said to be an indicator. If we define the event on which X = 1,
A = {ω: X(ω) = 1},
then X is said to be the indicator of A.
Discrete random variable. If X can take any value in a set D that is countable, then
X is said to be discrete. Usually D is some subset of the integers, so we assume in future
that any discrete random variable is integer valued unless it is stated otherwise.
Summary
(i) We have an experiment, a sample space Ω, and associated probabilities given by P. That is, it is the job of the function P(·) to tell us the probability of any event in Ω.
(ii) We have a random variable X defined on Ω. That is, given ω ∈ Ω, X(ω) is some real number, x, say.
Now of course the possible values x of X are more or less likely depending on P and X. What we need is a function to tell us the probability that X takes any value up to x. To find that, we simply define the event
B_x = {ω: X(ω) ≤ x}.
Then, obviously,
P(X ≤ x) = P(B_x).
This is the reason why (as we claimed above) random variables have probability
distributions just like those in chapter 4. We explore the consequences of this in the rest
of the chapter.
3. Example 5.2.10 continued. Suppose you are flipping a coin that moves you 1 metre east when it shows a head, or 1 metre west when it shows a tail. Describe the random variable W denoting your position after n flips.
4. Give an example in which Ω is uncountable, but the random variable X defined on Ω is discrete.
5.3 DISCRETE RANDOM VARIABLES
Let X be a random variable that takes values in some countable set D. Usually this set is either the integers or some obvious subset of the integers, such as the positive integers. In fact we will take this for granted, unless it is explicitly stated otherwise. In the first part of this book we used the function P(·), which describes how probability is distributed around Ω. Now that we are using random variables, we need a different function to tell us how probability is distributed over the possible values of X.
Definition. The probability distribution of the random variable X is the function p(x) given by
(1) p(x) = P(X = x),  x ∈ D. n
Remark. Recall from section 5.2 that P(X = x) denotes P(A_x), where A_x = {ω: X(ω) = x}. Sometimes we use the notation p_X(x), to avoid ambiguity.
Of course p(x) has exactly the same properties as the distributions in chapter 4, namely
(2) 0 ≤ p(x) ≤ 1
and
(3) Σ_{x∈D} p(x) = 1.
Here are some simple examples to begin with, several of which are already familiar.
Trivial random variable. Let X be constant, that is to say X(ω) = c for all ω. Then
p(c) = 1. s
We have remarked several times that the whole point of the distribution p(x) is to tell
us how probability is distributed over the possible values of X . The most important
demonstration of this follows.
Key rule for the probability distribution. Let X have distribution p(x), and let C be any collection of possible values of X. Then
(4) P(X ∈ C) = Σ_{x∈C} p(x).
This is essentially the same rule as we had in chapter 4, and we prove it in the same way.
As usual A_x = {ω: X(ω) = x}. Thus, if x ≠ y we have A_x ∩ A_y = ∅. Hence, by the addition rule for probabilities,
P(X ∈ C) = P(∪_{x∈C} A_x)
= Σ_{x∈C} P(A_x)
= Σ_{x∈C} p(x). h
An important application of the key rule (4) arises when we come to consider functions
of random variables. In practice, it is often useful or necessary (or both) to consider such
functions, as the following examples show.
Example 5.3.2: switch function. Let T denote the temperature in some air-conditioned room. If T > b, then the a.c. unit refrigerates; if T < a, then the a.c. unit heats. Otherwise it is off. The state of the a.c. unit is therefore given by S(T), where
S(T) = 1 if T > b,
S(T) = 0 if a ≤ T ≤ b,
S(T) = −1 if T < a.
Naturally, using (4),
P(S = 0) = Σ_{t=a}^{b} P(T = t). s
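The switch function and the computation of P(S = 0) can be sketched directly. (The particular distribution of T — uniform on the integers 15 to 24 — and the thresholds a = 18, b = 21 are our illustrative assumptions, not the book's.)

```python
from fractions import Fraction

def S(T, a, b):
    """State of the a.c. unit: 1 refrigerate, 0 off, -1 heat."""
    if T > b:
        return 1
    if T < a:
        return -1
    return 0

# Hypothetical distribution: T uniform on the integers 15..24, with a = 18, b = 21.
p_T = {t: Fraction(1, 10) for t in range(15, 25)}
a, b = 18, 21

# Using (4): P(S = 0) is the sum of P(T = t) over a <= t <= b.
p_S0 = sum(pr for t, pr in p_T.items() if S(t, a, b) == 0)
print(p_S0)  # 2/5
```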
Example 5.3.3. Let X be the score shown by a die. Then
P(|X − 7/2| < 2) = P(3/2 < X < 11/2) = 2/3,
and
P((X − 7/2)² < 2) = P(7/2 − √2 < X < 7/2 + √2) = P(2.1 < X < 4.9) = 1/3. s
Example 5.3.4. Let X be the score shown by a die, and let Y = 2X. Then g(x) = 2x, so that g⁻¹(y) = ½y and
P(Y = 4) = P(X = 2) = 1/6,
which is obvious anyway, of course. s
If g(x) is not a one-to-one function, then several values of X may give rise to the same value y = g(x). In this case we must sum over all these values to get
(6) P(Y = y) = Σ_{x: g(x)=y} P(X = x).
We can write this argument out in more detail as follows. Let A_y be the event that g(X) = y. That is,
A_y = {ω: g(X(ω)) = y}.
Then using the key rule (4) we have
P(Y = y) = P(X ∈ A_y)
= Σ_{x: g(x)=y} p(x).
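Equation (6) is a mechanical computation once p(x) and g are given. A Python sketch with a fair die and the non-one-to-one function g(x) = (x − 3)², which is our choice of example:

```python
from collections import defaultdict
from fractions import Fraction

p_X = {x: Fraction(1, 6) for x in range(1, 7)}  # a fair die
g = lambda x: (x - 3) ** 2                      # not one-to-one

# Equation (6): sum p(x) over all x with g(x) = y.
p_Y = defaultdict(Fraction)
for x, pr in p_X.items():
    p_Y[g(x)] += pr

print(dict(p_Y))  # p_Y[1] = p_Y[4] = 1/3 and p_Y[0] = p_Y[9] = 1/6
```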
Finally, we define another function that is very useful in describing the behaviour of random variables. As we have seen in many examples above, we are often interested in quite simple events and properties of X, such as X ≤ x, or X > x. For this reason we introduce the distribution function, as follows.
Definition. Let X have probability distribution p(x). Then the cumulative distribution function (or c.d.f.) of X is F(x), where
(7) F(x) = P(X ≤ x) = Σ_{y≤x} p(y), by (4). n
Remark. If we wish to stress the role of X, or avoid ambiguity, we often use the notation p_X(x) and F_X(x) to denote respectively the distribution and the distribution function of X.
It is important to note that knowledge of F(x) also determines p(x), because
(9) p(x) = P(X ≤ x) − P(X ≤ x − 1) = F(x) − F(x − 1).
This is often useful when X is indeed integer valued.
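The passage between p(x) and F(x) given by (7) and (9) can be checked directly. A Python sketch for a fair die:

```python
from fractions import Fraction

p = {x: Fraction(1, 6) for x in range(1, 7)}  # a fair die

def F(x):
    """The c.d.f. (7): F(x) = P(X <= x)."""
    return sum(pr for y, pr in p.items() if y <= x)

# Recovering the distribution from F by (9): p(x) = F(x) - F(x - 1).
recovered = {x: F(x) - F(x - 1) for x in range(1, 7)}
print(recovered == p)    # True
print(F(0), F(3), F(6))  # 0 1/2 1
```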
Informally, when x > 0 is not too small, 1 − F(x) and F(−x) are known as the right tail and left tail of X, respectively.
Remark. If two random variables are the same, then they have the same distribution. That is, if X(ω) = Y(ω) for all ω then obviously
P(X = x) = P(Y = x).
However, the converse is not necessarily true. To see this, flip a fair coin once and let X be the number of heads, and Y the number of tails. Then
P(X = 1) = P(Y = 1) = 1/2 = P(X = 0) = P(Y = 0),
so X and Y have the same distribution. But
X(H) = 1, X(T) = 0 and Y(H) = 0, Y(T) = 1.
Hence X and Y are never equal.
3. Change of units. Let X have distribution p(x), and let Y = a + bX, for some a and b. Find the distribution of Y in terms of p(x), and the distribution function of Y in terms of F_X(x), when b > 0. What happens if b ≤ 0?
5.4 CONTINUOUS RANDOM VARIABLES; DENSITY
We discovered in section 5.3 that discrete random variables have discrete distributions,
and that any discrete distribution arises from an appropriate discrete random variable.
What about random variables that are not discrete? As before, the answer has been
foreshadowed in chapter 4.
Let X be a random variable that may take values in an uncountable set C, which is all
or part of the real line R. We need a function to tell us how probability is distributed over
the possible values of X . It cannot be discrete; we recall the idea of density.
Definition. The random variable X is said to be continuous, with density f(x), if, for all a ≤ b,
(1) P(a ≤ X ≤ b) = ∫_a^b f(x) dx. n
The probability density f(x) is sometimes called the p.d.f. When we need to avoid ambiguity, or stress the role of X, we may use f_X(x) to denote the density.
Of course f(x) has the properties of densities in chapter 4, which we recall as
(2) f(x) ≥ 0
and
(3) ∫_{−∞}^{∞} f(x) dx = 1.
We usually specify densities only at those points where f (x) is not zero.
From the denition above it is possible to deduce the following basic identity, which
parallels that for discrete random variables, (4) in section 5.3.
Just as in the discrete case, f(x) shows how probability is distributed over the possible values of X. Then the key rule tells us just how likely X is to fall in any subset B of its values (provided of course that P(X ∈ B) exists).
It is important to remember one basic difference between continuous and discrete
random variables: the probability that a continuous random variable takes any particular
value x is zero. That is, from (4) we have
(5) P(X = x) = ∫_x^x f(u) du = 0.
Such densities also arose in chapter 4 as useful approximations to discrete distributions.
(Very roughly speaking, the idea is that if probability masses become very small and
close together, then for practical purposes we may treat the result as a density.) This led
to the continuous uniform density as an approximation to the discrete uniform distribu-
tion, and the exponential density as an approximation to the geometric distribution. Most
importantly, it led to the normal density as an approximation to the binomial distribution.
We can now display these in our new format. Remember that, as we remarked in chapter
4, it is possible that f(x) > 1, because f(x) is not a probability. However, informally we can observe that, for small h, P(x ≤ X ≤ x + h) ≃ f(x)h. The probability that X lies in (x, x + h) is approximately hf(x). The smaller h is, the better the approximation.
As in the discrete case, the two properties (2) and (3) characterize all densities; that is
to say any nonnegative function f (x ), such that the area under f (x ) is 1, is a density.
Example 5.4.3: an unbounded density. In contrast to the discrete case, densities not
only can exceed 1, they need not even be bounded. Let X have density
(8) f(x) = ½ x^{−1/2}, 0 < x < 1.
Then f(x) > 0 and, as required,
∫_0^1 f(x) dx = [x^{1/2}]_0^1 = 1,
but f(x) is not bounded as x → 0. s
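A simulation sketch of this example (the inverse-transform construction below is an assumption of the sketch, not part of the text): if U is uniform on (0, 1) then X = U² has distribution function P(X ≤ x) = √x, and hence exactly the unbounded density (8).

```python
import random

random.seed(1)
samples = [random.random() ** 2 for _ in range(200_000)]  # X = U^2

# The empirical CDF should match F(x) = sqrt(x), even though f blows up at 0.
for x in (0.04, 0.25, 0.81):
    empirical = sum(s <= x for s in samples) / len(samples)
    print(round(empirical, 2), round(x ** 0.5, 2))
```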
Our next two examples are perhaps the most important of all densities. Firstly:
Example 5.4.4: normal density. The standard normal random variable X has density
φ(x), where
(9) φ(x) = (2π)^{−1/2} exp(−½x²), −∞ < x < ∞.
We met this density as an approximation to a binomial distribution in chapter 4, and we
shall meet it again in similar circumstances in chapter 7. For the reasons suggested by
that result, it is a distribution that is found empirically in huge areas of science and
statistics. It is easy to see that φ(x) ≥ 0, but not so easy to see that (3) holds. We
postpone the proof of this to chapter 6. s
Secondly:
Example 5.4.5: exponential density with parameter λ. Let X have density function
(10) f(x) = λe^{−λx}, x > 0; f(x) = 0 otherwise.
5.4 Continuous random variables; density 201
Example 5.4.6: the Poisson process and the exponential density. Recall our derivation of the Poisson distribution. Suppose that events can occur at random anywhere in the
interval [0, t], and these events are independent, `rare', and `isolated'. We explained the
meaning of these terms in section 4.8, in which we also showed that, on these assumptions, the number of events N(t) in [0, t] turns out to have approximately a Poisson
distribution,
(12) P(N(t) = n) = e^{−λt}(λt)^n / n!.
When t is interpreted as time (or length), then the positions of the events are said to form
a Poisson process. A natural way to look at this is to start from t = 0, and measure the
interval X until the first event. Then X is said to be the waiting time until the first event,
and is a random variable.
Clearly X is greater than t if and only if N(t) = 0. From (12) this gives
P(X > t) = P(N(t) = 0) = e^{−λt}.
Now from (10) we find that, if X is exponential with parameter λ,
(13) P(X > t) = ∫_t^∞ λe^{−λu} du = e^{−λt}.
We see that the waiting time in this Poisson process does have an exponential density. s
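A rough simulation of this connection, approximating the Poisson process by Bernoulli trials in small time slots (the rate and slot width are illustrative choices, not from the text): the waiting time to the first event should have tail close to e^{−λt}.

```python
import math, random

random.seed(2)
lam, h = 1.5, 0.01   # illustrative rate; h is the width of each time slot

def first_event_time():
    """Wait through independent Bernoulli(lam * h) slots until the first event."""
    t = 0.0
    while random.random() >= lam * h:
        t += h
    return t

waits = [first_event_time() for _ in range(100_000)]
t0 = 1.0
tail = sum(w > t0 for w in waits) / len(waits)
print(round(tail, 2), round(math.exp(-lam * t0), 2))  # the two should be close
```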
Just as for discrete random variables, one particular probability is so important and
useful that it has a special name and notation.
Definition. Let X have density f(x). Then the distribution function F(x) of X is
given by
(14) F(x) = ∫_{−∞}^x f(u) du = P(X ≤ x) = P(X < x),
since P(X = x) = 0. n
Sometimes this is called the cumulative distribution function, but not by us. As in the
discrete case the survival function is given by
(15) F̄(x) = 1 − F(x) = P(X > x),
and we may denote F̄(x) by F̄_X(x), to avoid ambiguity. We have seen that the distribution
function F(x) is defined in terms of the density f(x) by (14). It is a very important and useful fact that the density can be derived from the distribution function by differentiation:
(16) f(x) = dF(x)/dx = F′(x).
dx
This is just the fundamental theorem of calculus, which we discussed in appendix 4.14.
This means that in solving problems, we can choose to use either F or f , since one can
always be found from the other.
Here are some familiar densities and their distributions.
Example 5.4.7: Cauchy density. We know that tan^{−1}(∞) = π/2 and tan^{−1}(−∞) = −π/2, and tan^{−1}(x) is an increasing function. Hence
(22) F(x) = 1/2 + (1/π) tan^{−1} x
is a distribution function, and differentiating gives
(23) f(x) = 1/(π(1 + x²)),
which is known as the Cauchy density. s
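Since this F can be inverted in closed form, it gives an easy way to sample from the Cauchy density; the following sketch (an illustration, not from the text) compares the empirical distribution of such samples with the distribution function above.

```python
import math, random

random.seed(3)

def cauchy_sample():
    # invert F: x = tan(pi * (u - 1/2)) for uniform u
    return math.tan(math.pi * (random.random() - 0.5))

xs = [cauchy_sample() for _ in range(200_000)]
for x in (-1.0, 0.0, 2.0):
    F = 0.5 + math.atan(x) / math.pi           # the Cauchy distribution function
    empirical = sum(s <= x for s in xs) / len(xs)
    print(round(F, 2), round(empirical, 2))
```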
We can use the distribution function to show that the Poisson process, and hence the
exponential density, has intimate links with another important family of densities.
Example 5.4.9: the gamma density. As in example 5.4.6, let N (t) be the number of
events of a Poisson process that occur in [0, t]. Let Y_r be the time that elapses from t = 0 until the moment when the rth event occurs. Now a few moments' thought shows that
Y_r > t if and only if N(t) < r.
Hence
(26) 1 − F_{Y_r}(t) = P(Y_r > t) = P(N(t) < r) = Σ_{x=0}^{r−1} e^{−λt}(λt)^x / x!, by (12).
Therefore Y_r has density f_{Y_r}(y) obtained by differentiating (26):
(27) f_{Y_r}(y) = λ(λy)^{r−1} e^{−λy} / (r − 1)!, 0 ≤ y < ∞.
This is known as the gamma density, with parameters λ and r. s
Remark. You may perhaps be wondering what happened to the sample space Ω and
the probability function P(·), which played a big part in early chapters. The point is that,
since random variables take real values, we might as well let Ω be the real line R. Then
any event A is a subset of R with length |A|, and
P(A) = ∫_{x∈A} f(x) dx.
We do not really need to mention Ω again explicitly. However, it is worth noting that this
shows that any non-negative function f(x), such that ∫ f(x) dx = 1, is the density
function of some random variable X.
5 . 5 F U N C T I O N S O F A C O N T I N U O U S R A N D O M VA R I A B L E
Just as for discrete random variables, we are often interested in functions of continuous
random variables.
Example 5.5.1. Many measurements have established that if R is the radius of the
trunk, at height one metre, of a randomly selected tree in Siberia, then R has a certain
density f (r). The cross-sectional area of such a tree at height one metre is then roughly
A = πR².
What is the density of A? s
This exemplifies the general problem, which is: given random variables X and Y, such
that
Y = g(X)
for some function g, what is the distribution of Y in terms of that of X ? In answering this
we find that the distribution function appears much more often in dealing with continuous
random variables than it did in the discrete case. The reason for this is rather obvious; it
is the fact that P(X = x) = 0 for random variables with a density. The elementary lines
of argument, which served us well for discrete random variables, sometimes fail here for
that reason. Nevertheless, the answer is reasonably straightforward, if g(X) is a one-to-one
function. Let us consider the simplest example.
Example 5.5.2: scaling and shifting. Let X have distribution F(x ) and density f (x ),
and suppose that
(1) Y = aX + b, a > 0.
Then, arguing as we did in the discrete case,
(2) F_Y(y) = P(Y ≤ y) = P(aX + b ≤ y) = P(X ≤ (y − b)/a) = F((y − b)/a).
Thus the distribution of Y is just the distribution F of X , when it has been shifted a
distance b along the axis and scaled by a factor a.
The scaling factor becomes even more apparent when we find the density of Y. This is obtained by differentiating F_Y(y), to give
(3) f_Y(y) = (d/dy) F_Y(y) = (d/dy) F((y − b)/a) = (1/a) f((y − b)/a).
You may wonder why we imposed the condition a . 0. Relaxing it shows the reason, as
follows. Let Y = aX + b with no constraints on a. Then we note that if a = 0 then Y is
just the constant b, which is to say that
(4) P(Y = b) = 1, a = 0.
If a ≠ 0, we must consider its sign. If a > 0 then
(5) P(aX ≤ y − b) = P(X ≤ (y − b)/a) = F((y − b)/a).
If a < 0 then
(6) P(aX ≤ y − b) = P(X ≥ (y − b)/a) = 1 − F((y − b)/a).
In each case, when a ≠ 0 we obtain the density of Y by differentiating F_Y(y) to get
f_Y(y) = (1/a) f_X((y − b)/a), a > 0,
or
f_Y(y) = −(1/a) f_X((y − b)/a), a < 0.
We can combine these to give
(7) f_Y(y) = (1/|a|) f_X((y − b)/a), a ≠ 0. s
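A quick Monte Carlo sketch of this combined formula with a negative scale factor (a standard normal X and the particular a and b below are illustrative choices, not from the text): a histogram estimate of f_Y should match f_X((y − b)/a)/|a|.

```python
import math, random

random.seed(4)
a, b = -2.0, 1.0                                   # illustrative choices
phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

ys = [a * random.gauss(0, 1) + b for _ in range(200_000)]
width = 0.2
for y0 in (-1.0, 1.0, 3.0):
    hist = sum(y0 - width / 2 < y <= y0 + width / 2 for y in ys) / (len(ys) * width)
    formula = phi((y0 - b) / a) / abs(a)           # the density formula for Y
    print(round(hist, 2), round(formula, 2))
```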
The general case, when Y g(X ), can be tackled in much the same way. The basic
idea is rather obvious; it runs as follows. Because Y = g(X), we have
(8) F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y).
Next, we differentiate to get the density of Y:
(9) f_Y(y) = (d/dy) F_Y(y) = (d/dy) P(g(X) ≤ y).
Now if we play about with the right-hand side of (9), we should obtain useful expressions
for f_Y(y), when g(·) is a friendly function.
We can clarify this slightly hazy general statement by examples.
Example 5.5.4. Let X have a continuous distribution function F(x), and let
Y = F(X).
Then, as above,
F_Y(y) = P(Y ≤ y) = P(F(X) ≤ y) = P(X ≤ F^{−1}(y)) = F(F^{−1}(y)) = y, 0 < y < 1,
so that Y is uniform on (0, 1). s
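The conclusion that Y = F(X) is uniform on (0, 1) is easy to check by simulation; the exponential choice of F below is illustrative, not from the text.

```python
import math, random

random.seed(5)
F = lambda x: 1.0 - math.exp(-x)                  # exponential distribution function
ys = [F(random.expovariate(1.0)) for _ in range(200_000)]

for y0 in (0.25, 0.5, 0.9):
    print(round(sum(y <= y0 for y in ys) / len(ys), 2))  # should be close to y0
```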
Example 5.5.5: normal densities. Let X have the standard normal density
f(x) = φ(x) = (2π)^{−1/2} exp(−x²/2),
and suppose Y = σX + μ, where σ ≠ 0. Then, by example 5.5.2,
F_Y(y) = Φ((y − μ)/σ), σ > 0,
F_Y(y) = 1 − Φ((y − μ)/σ), σ < 0,
where Φ is the standard normal distribution function. Differentiating shows that Y has density
f_Y(y) = (2πσ²)^{−1/2} exp(−(y − μ)²/(2σ²)).
Example 5.5.6: powers. Let X have density f and distribution F. What is the density of Y, where Y = X²?
Solution. Here some care is needed, for the function is not one-to-one. We write, as usual,
F_Y(y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = F(√y) − F(−√y),
so that Y has density
f_Y(y) = (d/dy) F_Y(y) = (1/(2√y)) {f(√y) + f(−√y)}. s
Example 5.5.7: continuous to discrete. Let X have an exponential density with parameter λ, and let Y = [X], where [X] is the integer part of X. What is the distribution of Y?
Solution. Trivially, for any integer n, we have [x] ≥ n if and only if x ≥ n. Hence
P(Y ≥ n) = e^{−λn}, n ≥ 0,
and so
P(Y = n) = P(Y ≥ n) − P(Y ≥ n + 1) = e^{−λn}(1 − e^{−λ}), n ≥ 0.
Thus Y has a geometric distribution. s
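A simulation sketch of this example (the parameter value is an illustrative choice): the integer part of an exponential sample should follow the geometric distribution just derived.

```python
import math, random

random.seed(6)
lam = 0.7                                          # illustrative parameter
ys = [math.floor(random.expovariate(lam)) for _ in range(200_000)]

for n in (0, 1, 3):
    empirical = sum(y == n for y in ys) / len(ys)
    formula = math.exp(-lam * n) * (1 - math.exp(-lam))
    print(round(empirical, 2), round(formula, 2))
```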
5 . 6 E X P E C TAT I O N
In chapter 4 we introduced the ideas of mean μ and variance σ² for a probability
distribution. These were suggested as guides to the location and spread of the distribution,
respectively. Recall that for a discrete distribution (p(x); x ∈ D), we defined
(1) μ = Σ_x x p(x)
and
(2) σ² = Σ_x (x − μ)² p(x).
Now, since any discrete random variable has such a probability distribution, it follows
that we can calculate its mean using (1). This is such an important and useful attribute
that we give it a formal definition.
Definition. (3) If X is discrete with distribution p(x), the expected value (or mean) of X is EX = Σ_x x p(x). (4) If X is continuous with density f(x), then EX = ∫_{−∞}^{∞} x f(x) dx. n
Remark. We note that (3) and (4) immediately demonstrate one of the advantages of
using the concept of random variables. That is, EX denotes the mean of the distribution
of X, regardless of its type (discrete, continuous, or whatever). The use of the expectation
symbol unifies these ideas for all categories of random variable.
Now the definition of expectation in the continuous case may seem a little arbitrary, so
we expend a brief moment on explanation. Recall that we introduced probability densities
originally as continuous approximations to discrete probability distributions. Very
roughly speaking, as the distance h between discrete probability masses decreases, so
they merge into what is effectively a probability density. Symbolically, as h → 0, we have
for X ∈ A
P(X ∈ A) = Σ_{x∈A} f_X(x), where f_X(x) is a discrete distribution,
         → ∫_{x∈A} f(x) dx, where f(x) is a density function.
Likewise we may appreciate that, as h → 0,
EX = Σ_x x f_X(x) → ∫ x f(x) dx.
We omit the details that make this argument a rigorous proof; the basic idea is obvious.
Let us consider some examples of expectation.
Though simple, this equation is more important than it looks! We recall from chapter 2
that it was precisely this relationship that enabled Pascal to make the first nontrivial
calculations in probability. It was a truly remarkable achievement to combine the notions
of probability and expectation in this way. He also used the following.
Example 5.6.2: two possible values. Let X take the value a with probability p(a), or
b with probability p(b). Of course p(a) + p(b) = 1. Then
(6) EX = a p(a) + b p(b).
This corresponds to a wager in which you win a with probability p(a), or b with
probability p(b), where your stake is included in the value of the payouts. The wager is
said to be fair if EX = 0. s
Let us return to consider discrete random variables for a moment. When X is integer
valued, and non-negative, the following result is often useful.
Example 5.6.8: tail sum. When X ≥ 0, and X is integer valued, show that
(10) EX = Σ_{r=0}^{∞} P(X > r) = Σ_{r=0}^{∞} {1 − F(r)}.
Solution. By definition,
(11) EX = Σ_{r=1}^{∞} r p(r) = p(1)
                             + p(2) + p(2)
                             + p(3) + p(3) + p(3)
                             + ···
        = Σ_{r=1}^{∞} p(r) + Σ_{r=2}^{∞} p(r) + Σ_{r=3}^{∞} p(r) + ···
on summing the columns on the right-hand side of (11). This is just (10), as required. s
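The identity (10) can also be checked numerically; the Poisson distribution below is an illustrative choice, truncated far enough into the tail that the remainder is negligible.

```python
import math

mu = 2.3                                          # illustrative Poisson parameter
p = lambda x: math.exp(-mu) * mu ** x / math.factorial(x)

N = 60  # truncation point; the tail beyond it is vanishingly small
mean_direct = sum(x * p(x) for x in range(N))                      # definition of EX
mean_tails = sum(sum(p(x) for x in range(r + 1, N)) for r in range(N))  # sum of P(X > r)
print(round(mean_direct, 6), round(mean_tails, 6))  # both equal mu
```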
It is natural to wonder whether some simple expression similar to (10) holds for
continuous random variables. Remarkably, the following example shows that it does.
Example 5.6.10: tail integral. Let the non-negative continuous random variable X
have density f (x ) and distribution function F(x ). Then
(12) EX = ∫_0^∞ {1 − F(x)} dx = ∫_0^∞ P(X > x) dx.
The proof is the second part of problem 25 at the end of the chapter. Here we use this
result in considering the exponential density.
Finally in this section, we note that since the mean gives a measure of location, it is
natural in certain circumstances to obtain an idea of the probability in the tails of the
distribution by scaling with respect to the mean. This is perhaps a bit vague; here is an
example to make things more precise. We see more such examples later.
Example 5.6.14: leading batsmen. In any innings a batsman faces a series of balls.
At each ball (independently), he is out with probability r, or scores a run with probability
p, or scores no run with probability q = 1 − p − r. Let his score in any innings be X.
Show that his average score is a = EX = p/r and that, for large a, the probability that
his score in any innings exceeds twice his average is approximately e^{−2}.
Solution. First we observe that the only relevant balls are those in which the batsman
scores, or is out. Thus, by conditional probability,
P(scores | relevant ball) = p/(p + r),
P(out | relevant ball) = r/(p + r).
Thus X is geometric, with parameter r/(p + r), and we know that
P(X > n) = (p/(p + r))^{n+1}, n ≥ 0,
and
a = EX = (p + r)/r − 1 = p/r.
Hence
P(X > 2a) = (p/(p + r))^{2a+1} = (1/(1 + r/p))^{2a+1} = (1/(1 + 1/a))^{2a+1} ≈ e^{−2} for large a. s
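The quality of the approximation can be examined directly, since the exact tail is (1 + 1/a)^{−(2a+1)}; the values of a below are illustrative.

```python
import math

# exact tail probability versus the limiting value e^-2 as the average a grows
for a in (10, 50, 250):
    exact = (1 + 1 / a) ** (-(2 * a + 1))
    print(a, round(exact, 4), round(math.exp(-2), 4))
```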
Remark. This result is due to Hardy and Littlewood (Math. Gazette, 1934), who
derived it in connexion with the batting statistics of some exceptionally prolific cricketers
in that season.
This is a good moment to stress that despite appearing in different definitions, discrete
and continuous random variables are very closely related; you may regard them as two
varieties of the same species.
Broadly speaking, continuous random variables serve exactly the same purposes as
discrete random variables, and behave in the same way. The similarities make themselves
apparent immediately since we use the same notation: X for a random variable, EX for
its expectation, and so on.
There are some differences in development, and in the way that problems are
approached and solved. These differences tend to be technical rather than conceptual, and
lie mainly in the fact that probabilities and expectations may need to be calculated by
means of integrals in the continuous case. In a sense this is irrelevant to the probabilistic
properties of the questions we want to investigate. This is why we choose to treat them
together, in order to emphasize the shared ideas rather than the technical differences.
Find EX .
3. Let X have the gamma density
f(x) = λ^r x^{r−1} e^{−λx} / (r − 1)!, x ≥ 0.
Find EX.
and by definition
(4) EY = Σ_y y p_Y(y).
y
Remark. Some labourers in the field of probability make the mistake of assuming
that (5) and (6) are the definitions of EY. This is not so. They are unconscious of the fact
that EY is actually defined in terms of its own distribution, as (4) states in the discrete
case. For this reason the theorem is occasionally known as the law of the unconscious
statistician.
Proof of (i). Consider the right-hand side of (5), and rearrange the sum so as to group
all the terms in which g(x) = y, for some fixed y. Then for these terms
Σ_{x: g(x)=y} g(x) p(x) = Σ_{x: g(x)=y} y p(x) = y Σ_{x: g(x)=y} p(x) = y p_Y(y).
Now summing over all y, we obtain the defining expression for EY, as required.
The proof of (ii) is similar in conception but a good deal more tedious in the exposition,
so we omit it. h
The above theorem is one of the most important properties of expectation, and it has a
vital corollary.
Corollary: linearity of expectation. Let X be any random variable and let the random
variable Z satisfy
Z = g(X) + h(X)
for functions g and h. Then
(7) EZ = E g(X) + E h(X).
Proof for continuous case. The proof uses (6), and proceeds along similar lines, with
integrals replacing sums. h
5.7 Functions and moments 215
Example 5.7.1: dominance. Suppose that g(x) ≤ c, for some constant c, and all x.
Then for the discrete random variable X
E g(X) = Σ_x g(x) p(x)
       ≤ Σ_x c p(x), since g(x) ≤ c,
       = c, since Σ_x p(x) = 1.
The same argument works for a continuous random variable, so in either case we have
shown that
(8) E g(X) ≤ c. s
With all these new results, we can look with fresh eyes at the variance, briefly
mentioned in section 5.6 and defined by (2) in that section.
The important point to notice here, as we did for expectation, is that if we write
(10) σ_X² = EX² − (EX)²
then this definition holds for any random variable, whether discrete or continuous. The
advantages of using random variables become ever more obvious.
Example 5.7.3: die. Let X be the number shown by rolling a fair die, numbered from
1 to 6. Then
EX² = (1/6)(1² + 2² + 3² + 4² + 5² + 6²) = 91/6
and so
var X = 91/6 − (7/2)² = 35/12. s
Example 5.7.4: change of location and scale. Let X have mean μ and variance σ².
Find the mean and variance of Y , where Y aX b.
Solution. We have
E(aX + b) = aE(X) + b = aμ + b
and
E(aX + b)² = a²EX² + 2abEX + b².
Hence
var Y = E(aX + b)² − {E(aX + b)}² = a² var X = a²σ². s
Example 5.7.5: normal density. Let X have the standard normal density φ(x). Then
EX = 0, and
var X = ∫_{−∞}^{∞} x² φ(x) dx = [−x e^{−x²/2}/(2π)^{1/2}]_{−∞}^{∞} + ∫_{−∞}^{∞} φ(x) dx = 1,
on integrating by parts.
Now let
Y = σX + μ.
We have shown in example 5.6.9 that Y has the N(μ, σ²) density, namely
f_Y(y) = (2πσ²)^{−1/2} exp(−(y − μ)²/(2σ²)).
By the law of the unconscious statistician (6) we can now calculate
(11) EY = E(μ + σX) = μ
and
(12) var Y = E{(Y − μ)²} = E(σ²X²) = σ².
You could verify this by explicit calculation using the density of Y, if you wished. The
fact that the N(μ, σ²) density has mean μ and variance σ² makes the notation even more
transparent and reasonable. s
It is by now obvious that we are very often interested simply in probabilities such as
P(X > x), or P(|X| > x). These are simple to ask for, but frequently hard to find, or too
complicated to be useful. One important use of expectation is to provide bounds for these
probabilities (and many others).
Example 5.7.6: Markov's inequality. Let X be any random variable; show that
(13) P(|X| ≥ a) ≤ E|X|/a, a > 0.
Solution. Let Y be the indicator of the event that |X| ≥ a. Then it is always true that
aY ≤ |X|.
Now using (8) yields the required result. s
Example 5.7.7: Chebyshov's inequality. Let X be any random variable; show that
P(|X| ≥ a) ≤ EX²/a², a > 0.
Solution. We have
(14) P(|X| ≥ a) = P(X² ≥ a²) ≤ E(X²)/a²
on using (13) applied to the random variable X². s
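A sketch comparing the two bounds with the true tail probability for an exponential random variable with mean 1 (so E|X| = 1 and EX² = 2; the choice of distribution and of a is illustrative, not from the text).

```python
import math, random

random.seed(7)
xs = [random.expovariate(1.0) for _ in range(200_000)]
a = 4.0
tail = sum(x >= a for x in xs) / len(xs)   # true tail, close to e^-4
markov = 1.0 / a                           # Markov bound: E|X| / a
chebyshov = 2.0 / a ** 2                   # Chebyshov bound: EX^2 / a^2
print(round(tail, 3), markov, chebyshov)   # the tail sits inside both bounds
```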
Example 5.7.8: fashion retailer. A shop stocks fashionable items; the number that
will be demanded by the public before the fashion changes is X, where X > 0 has
distribution function F(x). Each item sold yields a profit of a; all those unsold when the
fashion changes must be dumped at a loss of b each. Since X is large, we shall assume
that its distribution is well approximated by some density f(x). If the shop stocks c of
these items, what should the manager choose c to be, in order that the expected net profit
is greatest?
Solution. The net profit g(c, X) garnered, when c are stocked and the demand is X,
is given by
g(c, X) = aX − b(c − X), X ≤ c,
g(c, X) = ac, X > c.
Hence by (6), we have approximately
E g(c, X) = ∫_{−∞}^{c} {ax − b(c − x)} f(x) dx + ∫_c^∞ ac f(x) dx
          = ac + (a + b) ∫_0^c (x − c) f(x) dx.
The maximum of this function of c is found by equating its first derivative to zero; thus
0 = a + (a + b) (d/dc) ∫_0^c (x − c) f(x) dx
  = a − (a + b) ∫_0^c f(x) dx
  = a − (a + b) F(c).
Hence F(c) = a/(a + b), and the manager should order around c items, where c is the
point at which the distribution function F first reaches the level a/(a + b), as x
increases. s
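A simulation sketch of the retailer's rule (the exponential demand and the values of a and b below are illustrative assumptions, not from the text): stocking the level c* with F(c*) = a/(a + b) should beat stocking noticeably less or noticeably more.

```python
import math, random

random.seed(8)
a, b, mean_demand = 3.0, 1.0, 100.0
# for exponential demand, F(c) = a/(a+b) gives c* = -mean * log(1 - a/(a+b))
c_star = -mean_demand * math.log(1 - a / (a + b))

def profit(c, x):
    return a * x - b * (c - x) if x <= c else a * c

demands = [random.expovariate(1 / mean_demand) for _ in range(100_000)]
results = []
for c in (0.8 * c_star, c_star, 1.2 * c_star):
    avg = sum(profit(c, x) for x in demands) / len(demands)
    results.append((c, avg))
    print(round(c, 1), round(avg, 1))
# the middle stock level c_star should yield the largest average profit
```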
4. St Petersburg problem. You have to determine the fair entry fee for the following game. A
fair coin is flipped until it first shows a head; if there have been X tails up to this point then the
prize is Y = $2^X. Find EY. Would anyone pay this entry fee to play?
5 . 8 C O N D I T I O NA L D I S T R I B U T I O N S
Suppose we are considering some random variable X . Very often we may be told that X
obeys some condition, or it may be convenient to impose conditions.
Example 5.8.1. Let X be the lifetime of the lightbulb illuminating your desk.
Suppose you know that it has survived for a time t up to now. What is the distribution of
X , given this condition?
Except in very special cases, we must expect that the distribution of X given this
condition is different from the unconditional distribution prevailing before you screw
it in. s
This, and other obvious examples, lead us to define conditional distributions, just as
similar observations led us to define conditional probability in chapter 2.
Once again it is convenient to deal separately with discrete and continuous random
variables; conditional densities appear in section 5.9. Let X be discrete with probability
distribution p(x). Recall that there is an event A_x = {ω: X(ω) = x}, so that
p(x) = P(X = x) = P(A_x).
Now, given that some event B has occurred, we have the usual conditional probability
P(A_x | B) = P(A_x ∩ B)/P(B).
It is therefore an obvious step to make the following definition.
Of course, just as with events, it may happen that the conditional distribution of X
given B is the same as the unconditional distribution of X . That is, we may have that for
all x
(1) p(x|B) = p(x).
In this case we say that X is independent of B. This is of course consistent with our
previous definition of independence, because (1) is equivalent to
P(A_x | B) = P(A_x ∩ B)/P(B) = P(A_x),
which says that A_x and B are independent.
Here are some examples.
P(X ∈ C | B) = Σ_{x∈C} p(x|B).
Thus conditional distributions obey the key rule, just like unconditional distributions;
compare this with (4) of section 5.3. s
Example 5.8.3. Let X be geometric with parameter p, and let B be the event that
X > a. Then for x > a
P(X = x | B) = P({X = x} ∩ B)/P(B) = pq^{x−1}/P(B) = pq^{x−1}/q^a = pq^{x−a−1}.
This is still a geometric distribution! s
Example 5.8.4. Let U be uniform on {1, …, n} and let B be the event that
a < U ≤ b, where 1 < a < b < n. Then for a < r ≤ b
P(U = r | B) = P({U = r} ∩ B)/P(B) = (1/n) / ((b − a)/n) = 1/(b − a).
This is still a uniform distribution! s
Example 5.8.5. Let X be Poisson with parameter λ, and let B be the event X ≠ 0.
Then for x > 0
P(X = x | B) = e^{−λ} λ^x / ((1 − e^{−λ}) x!).
This is not a Poisson distribution, but it is still a distribution, for obviously
Σ_{x=1}^{∞} e^{−λ} λ^x / ((1 − e^{−λ}) x!) = 1. s
This last result is generally true, and we single it out for special notice.
Since p(xjB) is a distribution, it may have an expected value. Because of the condition
that B occurs, it is naturally called conditional expectation.
Definition. For a discrete random variable X and any event B, the conditional
expectation of X given B is
(2) E(X|B) = Σ_x x p_{X|B}(x|B). n
x
Example 5.8.6. You flip a fair coin three times; let X be the number of heads. Find
the conditional expectation of X given that at least two heads are shown.
Example 5.8.7: runs. A biased coin shows a head with probability p, or a tail with
probability q = 1 − p; it is flipped repeatedly. A run of heads is any unbroken sequence
of heads, either until the first tail or after the last tail, or between any two tails. A run of
tails is defined similarly. Find the distribution of the lengths of (i) the first run, and (ii) the
second run.
and
E(X|T) = Σ_{x=1}^{∞} x q^{x−1} p = p^{−1}.
Likewise
E(Y|H) = p^{−1} and E(Y|T) = q^{−1}. s
Partition rule for expectation. Let X be a discrete random variable, and let
(B_r; r ≥ 1) be a partition of Ω, which is to say that B_j ∩ B_k = ∅ for j ≠ k and
∪_r B_r = Ω. Then
(5) EX = Σ_r E(X|B_r) P(B_r).
Example 5.8.8: runs revisited. Recall our terminology in example 5.8.7: when you
repeatedly flip a biased coin, the length of the first run is X and the length of the second
run is Y .
From the results of that example we now see that
EX = pE(X|H) + qE(X|T) = p/q + q/p,
whereas
EY = pE(Y|H) + qE(Y|T) = p·p^{−1} + q·q^{−1} = 2. s
Conditional expectation offers a very neat way of analysing random variables that arise
as a result of a sequence of independent actions, the archetype of which is, of course,
flipping a coin or coins.
Example 5.8.9: a new way of finding the mean and variance of the geometric distribution.
Of course we already know one way of doing this: you sum the appropriate
series. The following is a typical application of conditional expectation.
We know that if a biased coin (showing a head with probability p) is flipped repeatedly,
then the number of flips X up to and including the first head is geometric with parameter
p. Let H and T denote the possible outcomes of the first flip. By the partition rule (4),
(7) EX = E(X|H)P(H) + E(X|T)P(T) = pE(X|H) + qE(X|T).
Let us consider these terms. On the one hand, given H, we have immediately that X = 1.
Hence
(8) E(X|H) = 1.
On the other hand, given T, we know that the number of further flips necessary to obtain
a head has the same distribution as X. Hence
(9) E(X|T) = 1 + EX.
There are many problems of this kind, which can be tricky if tackled head on but which
are extremely simple if conditional expectation is used correctly. (It is perhaps for this
reason that they are so often found in examinations.) We conclude this section with a few
classic examples.
Example 5.8.10: quiz. You are a contestant in a quiz show, answering a series of
questions. You answer correctly with probability p, or incorrectly with probability q; you
get $1 for every correct answer, and you are eliminated when you first give two
consecutive wrong answers. The questions are independent. Find the expected number of
questions you attempt, and your expected total prize money.
Solution. Let X be the number of questions answered, and Y your total prize money.
To use conditional expectation we need a partition of the sample space; let C be the event
that you answer a question correctly. Then an appropriate partition is supplied by the
three events {C, C^cC, C^cC^c}, where
Example 5.8.11. Two archers (Actaeon and Baskerville) take it in turns to aim at a
target; they hit the bull, independently at each attempt, with respective probabilities α
and β. Let X be the number of shots until the first bull. What is EX?
Solution. Let B denote a bull. Then using our by now familiar new method, we write
EX = E(X|B)α + E(X|B^cB)(1 − α)β + E(X|B^cB^c)(1 − α)(1 − β)
   = α + 2(1 − α)β + (2 + EX)(1 − α)(1 − β).
Hence
EX = (2 − α)/(α + β − αβ). s
Hence
1
2 12 2 (n 1) 12 n1 (n 1) 12 n1
2
EX
1 12 12 n1 )
2 n 2:
So the required expectation, E(1 X ), is 2 n 1. We shall see an even neater way of
doing this in exercise 4 of section 6.11. s
Example 5.8.13: duration of gambler's ruin. Recall the gambler's ruin problem in
which you gain or lose one point with equal probabilities ½ at each bet. You begin with k
points; if you ever reach zero points, or n points, then the game is over. All bets are
independent. Let T_k be the number of bets until the game is over; show that
ET_k = k(n − k).
Solution. Let W be the event that you win the first bet, and μ_k = ET_k. Then
μ_k = E(T_k|W)P(W) + E(T_k|W^c)P(W^c)
    = ½(1 + μ_{k+1}) + ½(1 + μ_{k−1})
    = ½μ_{k+1} + ½μ_{k−1} + 1, 0 < k < n.
Naturally μ_0 = μ_n = 0. Hence it is easy to verify that indeed
μ_k = k(n − k). s
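A simulation sketch of this result (the particular k and n below are illustrative choices):

```python
import random

random.seed(9)

def duration(k, n):
    """Number of fair bets until the game ends, starting from k points."""
    steps = 0
    while 0 < k < n:
        k += random.choice((-1, 1))
        steps += 1
    return steps

k, n, trials = 3, 10, 20_000
mean_T = sum(duration(k, n) for _ in range(trials)) / trials
print(round(mean_T, 1), k * (n - k))  # the simulated mean should be near k(n - k)
```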
2. Gamblers ruined unfairly. Suppose you win each point with probability p, or lose it with
probability q = 1 − p; as always, bets are independent. Let μ_k be the expected number of bets
until the game is over, given that you start with k points. (As usual the game stops when you
first have either no points or n points.) Show that
μ_k = pμ_{k+1} + qμ_{k−1} + 1
and deduce that for p ≠ q
μ_k = { k − n [1 − (q/p)^k] / [1 − (q/p)^n] } (q − p)^{−1}.
5 . 9 C O N D I T I O NA L D E N S I T Y
We have found conditional probability mass functions to be very useful on many
occasions. Naturally we expect conditional density functions to be equally useful. They
are, but they require a slightly indirect approach. We start with the conditional distribution function.
Definition. Let X have distribution function F(x), and let B be an event. Then the
conditional distribution function of X given B is
(1) F_{X|B}(x|B) = P(X ≤ x | B) = P({X ≤ x} ∩ B)/P(B). n
Example 5.9.1. Let X be uniform on (0, a), and let B be the event 0 ≤ X ≤ b, where
b < a. Find F_{X|B}(x|B).
Example 5.9.1 revisited. Let X be uniform on (0, a) and let B be the event
{0 ≤ X ≤ b}. Then differentiating (2) gives the conditional density
f_{X|B}(x|B) = b^{−1}, 0 ≤ x ≤ b. s
Just as in the discrete case, f (xjB) satises the same key rule as any ordinary density.
Example 5.9.2: lack of memory. Let X be exponential with parameter λ, and let B_t
be the event that X > t. Show that the conditional density of X − t, given B_t, is also
exponential with parameter λ.
Remark. The importance of this result is clear when we recall that the exponential
density is a popular model for waiting times. Let X be the waiting time until your light
bulb fails. Suppose X is exponential, and your light bulb has survived for a time t. Then
the above result says that the further survival time is still exponential, as it was to begin
with.
Roughly speaking, a component or device with this property cannot remember how old
it is. Its future life has the same distribution at any time t, if it has survived until t.
Recall that, among discrete random variables, the geometric distribution also has this
property, as you would expect.
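The lack-of-memory property is easy to see in simulation: conditional on X > t, the overshoot X − t has tail probabilities matching a fresh exponential, e^{−λs}. The parameter values below are illustrative choices, not from the text.

```python
import math, random

random.seed(10)
lam, t = 0.5, 3.0                                  # illustrative values
xs = [random.expovariate(lam) for _ in range(200_000)]
survivors = [x - t for x in xs if x > t]           # X - t, given X > t

for s in (1.0, 2.0, 4.0):
    conditional = sum(v > s for v in survivors) / len(survivors)
    print(round(conditional, 2), round(math.exp(-lam * s), 2))  # close pairs
```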
Example 5.9.3: conditional survival. Let X be the lifetime of your light bulb, and let
B be the event that it survives for a time a. Find the conditional distribution of X given B
for each of the following distributions of X :
At this point we note that a conditional density may also have a mean, called the
conditional expectation. It is given by
(6) E(X|B) = ∫_{−∞}^{∞} x f(x|B) dx.
This is just as important as the conditional expectation defined for discrete random
variables in (2) of section 5.8, and has much the same properties. We explore some of
these later on. Finally we record one natural and important special case. It may be that the
continuous random variable X is independent of B; formally we write
(7) F_{X|B}(x|B) = F_X(x).
This is essentially our usual definition, for it just says that the events B and {X ≤ x}
are independent. Differentiating gives just what we would anticipate:
(8) f_{X|B}(x|B) = f_X(x),
whenever X and B are independent.
In any case we find from the key rule that
(9) P({X ∈ C} ∩ B) = P(X ∈ C)P(B),
whenever X and B are independent.
5.10 REVIEW
This chapter has concerned itself with random variables and their properties, most
importantly their distributions and moments.
Distributions
When X is discrete:
  p(x) = P(X = x),
  P(X ∈ A) = Σ_{x∈A} p(x),
  and when X is integer valued, p(x) = F(x) − F(x − 1).
When X is continuous:
  f(x)h ≈ P(x < X ≤ x + h) for small h,
  P(X ∈ A) = ∫_{x∈A} f(x) dx,
  and f(x) = dF(x)/dx.
Functions. Suppose that random variables X and Y are such that Y = g(X) for some
function g. Then
  if both are discrete, p_Y(y) = Σ_{x: g(x)=y} p_X(x);
  if both are continuous, f_Y(y) = (d/dy) ∫_{x: g(x)≤y} f_X(x) dx.
Discrete:
  X             p(x)                                   μ           σ²
  indicator     p(1) = p, p(0) = 1 − p = q             p           pq
  binomial      C(n, x) p^x (1 − p)^{n−x},             np          np(1 − p)
  B(n, p)       0 ≤ x ≤ n
  Poisson       e^{−λ} λ^x / x!,  x ≥ 0                λ           λ
  uniform       n^{−1},  1 ≤ x ≤ n                     ½(n + 1)    (n² − 1)/12
  {1, 2, …, n}

Continuous:
  X             f(x)                                    EX          var X
  uniform       (b − a)^{−1},  a ≤ x ≤ b                ½(b + a)    (b − a)²/12
  exponential   λe^{−λx},  x ≥ 0                        λ^{−1}      λ^{−2}
  normal        (2πσ²)^{−1/2} exp(−(x − μ)²/(2σ²)),     μ           σ²
  N(μ, σ²)      −∞ < x < ∞
  gamma         λ^r x^{r−1} e^{−λx}/(r − 1)!,  x ≥ 0    rλ^{−1}     rλ^{−2}
As the ΔA_k become arbitrarily small, we obtain the double integral of f over C in the limit. (Many
details have been omitted here.) It only remains to choose our coordinates, and the shapes of the
areas ΔA_k. We consider two important cases.
Cartesian coordinates. In this case it is very natural to let each ΔA_k be a small rectangle with
sides having lengths denoted by Δx_k and Δy_k, as shown in figure 5.1. In the limit we obtain the
required volume, denoted by
(1) V = ∬_C f(x, y) dx dy = ∬_C f(x, y) dy dx.
Polar coordinates. In this case it is more natural to let each ΔA_k be a small curvilinear
quadrilateral, as shown in figure 5.2. In this case ΔA_k ≈ r_k Δr_k Δθ_k, and we obtain the volume in the
form
(2) V = ∬_C f(r, θ) r dr dθ.
C
Figure 5.1. ΔA_k = Δx_k Δy_k.
Figure 5.2. ΔA_k ≃ r_k Δr_k Δθ_k.
The point of these examples is that you can often save yourself a great deal of effort by choosing
the most appropriate coordinates.
5.12 PROBLEMS
1. You roll 5 dice. Let X be the smallest number shown and Y the largest.
(a) Find the distribution of X , and EX .
(b) Find the distribution of Y , and EY .
2. (a) You roll 5 dice; what is the probability that the sum of the scores is 18 or more?
(b) You roll 4 dice; what is the probability that the sum of the scores is 14 or more?
3. Let X have density f(x) = cx^d, x ≥ 1. Find (a) c, (b) EX, (c) var X. In each case state for what values of d your answer holds.
4. Let X have the exponential density. Find
(a) P(sin X > ½),
(b) EX^n, n ≥ 1.
5. Show that for any random variable with finite mean EX,
(EX)² ≤ E(X²).
6. Which of the following can be density functions? For those that are, find the value of c, and the distribution function F(x).
(a) f(x) = cx(1 − x), 0 ≤ x ≤ 1; 0 otherwise.
(b) f(x) = cx^{−1}, x ≥ 1; 0 otherwise.
(c) f(x) = c exp(−x² − 4x), −∞ < x < ∞.
(d) f(x) = c e^x (1 + e^x)^{−2}, −∞ < x < ∞.
7. Which of the following can be distribution functions? For those that are, find the density.
(a) F(x) = 1 − exp(−x²), x ≥ 0; 0 otherwise.
(b) F(x) = exp(−x^{−1}), x > 0; 0 otherwise.
(c) F(x) = e^x (e^x + e^{−x})^{−1}, −∞ < x < ∞.
(d) F(x) = (e^x − e^{−x})(e^x + e^{−x})^{−1}, −∞ < x < ∞.
8. Let X be Poisson with parameter λ. Find the distribution and expectation of X, given that X is odd.
9. Let X have the density
f(x) = λ exp(λx)/(2 sinh λa), −a < x < a.
Show that
EX = a coth λa − λ^{−1},
and
var X = λ^{−2} − (a/sinh λa)².
10. Your dart is equally likely to hit any point of a circular dart board. Its height above the bull is Y
(negative if below the bull), and its distance from the bull is R. Find the density and
distribution of Y , and of R. What is ER?
11. An urn contains one carmine ball and one magenta ball. A ball is drawn at random; if it is
carmine the game is over. If it is magenta then the ball is returned to the urn together with one
extra magenta ball. This procedure is repeated until 10 draws have been made or a carmine ball
is drawn, whichever is sooner. Let X be the number of draws. Find p_X(x) and EX.
Now suppose the game can only be terminated by the appearance of a carmine ball; let Y be the number of draws. Find the distribution p_Y(y) and EY.
12. You are racing around a circular track C; the pits are at a point P on the circumference of C.
Your car is equally likely to break down at any point B of the track. Let X be the distance from
B to P in a straight line. Find the density, mean, and variance of X .
13. You fire a musket on the planet Zarg with muzzle velocity V, making an angle Θ with the horizontal ground. The musket ball strikes the ground at a distance X = (V²/g) sin 2Θ away, where g is the acceleration due to gravity; in Zarg units g = 1.
(a) If V is constant and Θ is uniform on [0, π/2], find the density of X.
(b) If Θ is constant and V has density f_V(x) = cx² e^{−x²}, x > 0, find the density of X. What is c in this case?
14. The diameter D of a randomly selected lead shot has density f(x), x > 0. Find the density of the weight of a randomly selected lead shot. (Assume any shot is spherical.)
15. I try to open my door with one of the three similar keys in my pocket; one of them is the
correct key, the other two will not turn. Let X be the number of attempts necessary if I choose
keys at random from my pocket and drop those that fail to the ground. Let Y be the number of
attempts necessary if I choose keys at random from my pocket and replace those that fail in my
pocket. Find EX and EY .
16. You are at the origin between two walls lying at x = ±1. You shine your torch so that the beam makes an angle Θ with the line x = 0; Θ is uniformly distributed on [0, 2π]. Let Y be the y-coordinate of the point where the beam strikes a wall. Show that Y has a Cauchy density.
17. Let S be the speed of a randomly selected molecule in a gas. According to the kinetic theory of gases, S has probability density
f(s) = α s² e^{−s²/(2σ²)}.
Find α. The kinetic energy of a molecule of mass m is X = ½mS². Find the density of X.
18. Let X be a non-negative random variable with finite expectation, having distribution function F(x) and density f(x). Show that for x > 0,
x{1 − F(x)} ≤ ∫_x^∞ u f(u) du,
and deduce that as x → ∞, x{1 − F(x)} → 0. Hence show that EX = ∫_0^∞ {1 − F(x)} dx.
19. A fair three-sided die has its faces labelled 1, 2, 3. It is rolled repeatedly. Let X n be the number
of rolls until the sum of the numbers shown is at least n. Show that for n ≥ 4
3EX_n = EX_{n−1} + EX_{n−2} + EX_{n−3} + 3.
Suppose now that the three faces are shown with respective probabilities p, q, and r. Write
down the equivalent equation for the expectations EX n .
20. Let X be a random variable with EX³ < ∞. The skewness of X is given by
skw X = E(X − μ)³/σ³,
where μ = EX and σ² = var X.
(a) If X is Bernoulli with parameter p, show that
skw X = (q − p)/(qp)^{1/2}.
(b) For any random variable X show that
skw X = (EX³ − 3μEX² + 2μ³)/σ³.
(c) If X is Poisson with parameter λ, show that skw X = λ^{−1/2}.
(d) If X is geometric with parameter p, p(x) = q^{x−1} p, x ≥ 1, show that skw X = (1 + q)/q^{1/2}.
21. Let X be a random variable with density f(x) = cx^{α−1} e^{−x}, x ≥ 0.
(a) Find c, and evaluate EX and var X.
(b) Show that if α > 1, and s, t > 0, then
P(X > s + t | X > t) < P(X > s).
What if α < 1?
and that
EX^r = ∫_0^∞ r x^{r−1} P(X > x) dx.
26. (a) Show that if X is a random variable with var X = 0, then, for some constant a, P(X = a) = 1.
27. Use Markov's inequality to show that, for any t > 0 and any random variable such that E(e^{tX}) exists,
P(X ≥ a) ≤ e^{−at} E(e^{tX}), for a > 0.
Deduce that
P(X ≥ a) ≤ inf_{t>0} {e^{−at} E e^{tX}}.
6.1 PREVIEW
In chapter 5 we looked at probability distributions of single random variables. But of
course we often wish to consider the behaviour of two or more random variables together.
This chapter extends the ideas of chapters 4 and 5, so that we can make probability
statements about collections and sequences of random variables.
The most important instrument in this venture is the joint probability distribution,
which we meet in section 6.2. We also define the concept of independence for random
variables, and explore some consequences. Jointly distributed random variables have joint
moments, and we look at the important ideas of covariance and correlation. Finally, we
consider conditional distributions and conditional expectation in this new setting.
Prerequisites. We shall use one new technique in this chapter; see appendix 5.11 on
double integrals.
6.2 JOINT DISTRIBUTIONS
Looking back to chapter 4, we can see that we have already considered such pairs of random variables, when we looked at distributions in the plane. In that case Ω = R², and for each outcome ω = (x, y), we set
(X(ω), Y(ω)) = (x, y).
We extend these simple ideas in the same way as we did in chapter 5.
Thus, let X and Y be a pair of discrete random variables defined on Ω. As usual we assume that X and Y take integer values, unless otherwise stated. As in chapter 5, the natural function to tell us how probability is distributed over the values of the pairs (X, Y) is the following.
Any function satisfying (2) and (3) is a joint or bivariate probability distribution. Note
that, as usual, we specify any p(x, y) by giving its values where it is not zero. Here are
some simple examples of joint distributions.
Example 6.2.1: pair of dice. Two fair dice are rolled, yielding the scores X and Y. We know already that
p(x, y) = 1/36, 1 ≤ x, y ≤ 6. □
Example 6.2.2: pair of indicators. Let X and Y be indicators, so that X ∈ {0, 1} and Y ∈ {0, 1}. Then
(X, Y) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}
and the joint distribution is just the array
p(0, 0)  p(0, 1)
p(1, 0)  p(1, 1). □
Example 6.2.3: flipping a coin. Suppose a coin shows a head with probability p, or a tail with probability q. You flip the coin repeatedly. Let X be the number of flips until the
240 6 Jointly distributed random variables
first head, and Y the number of flips until the first tail. Then, obviously, the joint probability distribution of X and Y is
p(1, y) = p^{y−1} q, y ≥ 2,
p(x, 1) = p q^{x−1}, x ≥ 2. □
As before, questions about the joint behaviour of X and Y are answered by a key rule.
Key rule for joint distributions. Let X and Y have joint distribution p(x, y), and let C be a collection of possible values of (X, Y). Then
(4) P((X, Y) ∈ C) = Σ_{(x,y)∈C} p(x, y).
The proof is essentially the same as that of (4) in section 5.3 and is left as a routine
exercise. This is of course the same rule that we used in chapter 4 to look at distributions
in the plane (allowing for changes in notation and emphasis).
Our first application of the key rule is exceedingly important and useful: the marginal distributions
(5) p_X(x) = Σ_y p(x, y),
and
(6) p_Y(y) = Σ_x p(x, y).
In each case the sum is taken over all possible values of y and x respectively; when calculated in this way these are sometimes called the marginal distributions of X and Y. It is of course most important and useful that we can obtain them from p(x, y).
Remark. The use of the term `marginal' is explained if we write the joint probabilities p(x, y) in the form of an array. Then the distribution p_X(x) of X is given by the column sums and the distribution p_Y(y) of Y by the row sums; these are conveniently placed at the margins.
p_Y(n)   p(1, n) … p(m, n)
  ⋮        ⋮         ⋮
p_Y(2)   p(1, 2) … p(m, 2)
p_Y(1)   p(1, 1) … p(m, 1)
Example 6.2.1 revisited: dice. For two dice we know that the joint distribution is uniform; p(x, y) = 1/36. Obviously and trivially
(7) p_X(x) = Σ_{y=1}^{6} p(x, y) = 1/6. □
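Row and column sums of this kind are routine to compute mechanically. A short Python sketch (our illustration, using the two-dice joint distribution above and exact rationals):

```python
from fractions import Fraction

# Joint distribution of two fair dice: p(x, y) = 1/36.
p = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}

# Marginals as in (5) and (6): sum out the other variable.
p_X = {x: sum(p[(x, y)] for y in range(1, 7)) for x in range(1, 7)}
p_Y = {y: sum(p[(x, y)] for x in range(1, 7)) for y in range(1, 7)}
```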
Remark. We have shown that p(x, y) always yields the marginals p_X(x) and p_Y(y). However, the converse is not true. To see this, consider two experiments, (A) and (B).
(A) Flip a fair coin; let X be the number of heads and Y the number of tails. Then
p_X(0) = p_X(1) = p_Y(0) = p_Y(1) = ½,
with
p(0, 1) = ½, p(0, 0) = 0; p(1, 0) = ½, p(1, 1) = 0.
(B) Flip two fair coins; let X be the number of heads shown by the first and Y the number of heads shown by the second. Then
p_X(0) = p_X(1) = p_Y(0) = p_Y(1) = ½,
which is the same as in (A). But
p(0, 1) = p(0, 0) = p(1, 0) = p(1, 1) = ¼,
which is different from (A). In general the marginals do not determine the joint
distribution. There is an important exception to this, which we examine in section 6.4.
Simple and empirical distributions are inevitably presented in the form of an array. In
theoretical applications we usually have an algebraic representation, which occupies less
space, saving trees and avoiding writer's cramp.
Example 6.2.6. A pair of dice bear the numbers 1, 2, 3 twice each, on pairs of opposite faces. Both dice are rolled, yielding the scores X and Y respectively. Obviously
p(j, k) = 1/9, for 1 ≤ j, k ≤ 3.
We could display the joint probabilities p(j, k) as a very dull array, if we wished.
Now suppose we roll these dice again, and consider the difference between their scores, denoted by U, and the sum of their scores, denoted by V. Then
−2 ≤ U ≤ 2 and 2 ≤ V ≤ 6.
For example
p(0, 4) = P(U = 0, V = 4) = P({2, 2}) = 1/9.
Eventually we produce the following array of probabilities:
V = 6 |   0     0    1/9    0     0
V = 5 |   0    1/9    0    1/9    0
V = 4 |  1/9    0    1/9    0    1/9
V = 3 |   0    1/9    0    1/9    0
V = 2 |   0     0    1/9    0     0
        U = −2  −1    0     1     2
We could write this algebraically, but it is more informative and appealing as shown. Observe that U and V both have triangular distributions, but U is symmetrical about 0 and V is symmetrical about 4. □
Solution. By (3),
1 = c Σ_{x,y} (x + y) = c n²(n + 1).
Next,
p_X(x) = c Σ_{y=1}^{n} (x + y) = {nx + ½n(n + 1)}/{n²(n + 1)} = {x + ½(n + 1)}/{n(n + 1)}.
Likewise,
p_Y(y) = {y + ½(n + 1)}/{n(n + 1)}. □
Example 6.2.8. You roll three dice. Let X be the smallest number shown and Y the largest. Find the joint distribution of X and Y.
Solution. Simple enumeration is sufficient here. For x < y − 1 there are three possibilities: the three dice show different values, or two show the larger, or two show the smaller. Hence
(8) p(x, y) = {6(y − x − 1) + 3 + 3}/216 = (y − x)/36, x < y − 1.
For x = y − 1 there are two possibilities, and
p(x, y) = (3 + 3)/216 = 1/36, x = y − 1.
For x = y, there is one possibility, so
p(x, y) = 1/216, x = y.
It is easy for you to check that Σ_{x,y} p(x, y) = 1, as it must be. □
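The check suggested here (and the three-case formula itself) can be confirmed by brute enumeration of all 216 outcomes; a Python sketch of our own:

```python
from fractions import Fraction
from itertools import product

# Tabulate (min, max) over all 6^3 equally likely rolls.
counts = {}
for roll in product(range(1, 7), repeat=3):
    key = (min(roll), max(roll))
    counts[key] = counts.get(key, 0) + 1
p = {k: Fraction(c, 216) for k, c in counts.items()}

def formula(x, y):
    # The three cases of the solution above.
    if x == y:
        return Fraction(1, 216)
    if y == x + 1:
        return Fraction(6, 216)
    return Fraction(y - x, 36)

ok = all(p.get((x, y), 0) == formula(x, y)
         for x in range(1, 7) for y in range(x, 7))
```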
Example 6.2.9: Benford's distribution for significant digits. Suppose you take a large volume of numerical data, such as can be found in an almanac or company accounts. Pick a number at random and record the first two significant digits, which we denote by (X, Y). It is found empirically that X has the distribution
(9) p(x) = log₁₀(1 + 1/x), 1 ≤ x ≤ 9.
As we noted in example 4.2.3 it has recently been proved that there are theoretical grounds for expecting this result. Likewise it is found empirically, and theoretically, that the pair (X, Y) has the joint distribution
p(x, y) = log₁₀(1 + 1/(10x + y)), 1 ≤ x ≤ 9; 0 ≤ y ≤ 9.
Of course, we find the marginal distribution of X to be
p_X(x) = Σ_{y=0}^{9} p(x, y) = Σ_{y=0}^{9} log₁₀((10x + y + 1)/(10x + y)) = log₁₀(1 + 1/x),
in agreement with (9). □
Obtaining the marginals in this way is attractive and useful, but the key rule can be applied to find more interesting probabilities than just the marginals. The point is that in most applications of interest, the region C (see equation (4)) is determined by the joint behaviour of X and Y. For example, to find P(X = Y) we set
C = {(x, y): x = y};
to find P(X > Y) we set
C = {(x, y): x > y};
and so on. Here is an example.
Finally we note that just as a single random variable X has a distribution function F(x) = P(X ≤ x), so too do jointly distributed random variables have joint distribution functions.
Definition. Let X and Y have joint distribution p(x, y). Then their joint distribution function is
(11) F(x, y) = Σ_{i≤x} Σ_{j≤y} p(i, j) = P(X ≤ x, Y ≤ y). ■
Once again we can find p(x, y) if we know F(x, y), though it is not quite so simple as it was for one random variable:
(12) p(x, y) = F(x, y) − F(x, y − 1) − F(x − 1, y) + F(x − 1, y − 1).
The proof of (12) is easy on substituting (11) into the right-hand side.
Example 6.2.11. You roll r dice. Let X be the smallest number shown and Y the largest. Find the joint distribution of X and Y.
Hence
p(x, y) = 6^{−r} {(y − x + 1)^r − 2(y − x)^r + (y − x − 1)^r}, y ≠ x,
p(x, y) = 6^{−r}, y = x.
When r = 3 we recover (8), of course. □
Definition. The random variables X and Y are said to be jointly continuous, with joint density f(x, y), if for all a < b and c < d
(1) P(a < X < b, c < Y < d) = ∫_c^d ∫_a^b f(x, y) dx dy. ■
This is the natural extension of the definition of density for one random variable, which is
(2) P(a < X < b) = ∫_a^b f(x) dx.
In (2) the integral represents the area under the curve f(x); in (1) the double integral represents the volume under the surface f(x, y). It is clear that f(x, y) has properties similar to those of f(x), that is,
(3) f(x, y) ≥ 0,
and
(4) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.
And, most importantly, we likewise have the
Key rule for joint densities. Let X and Y have joint density f(x, y). The probability that the point (X, Y) lies in some set C is given by
(5) P((X, Y) ∈ C) = ∬_C f(x, y) dx dy.
The integral represents the volume under the surface f(x, y) above C. (There is a technical necessity for C to be a set nice enough for this volume to be defined, but that will not bother us here.) It is helpful to observe that (5) is completely analogous to the rule for discrete random variables
P((X, Y) ∈ C) = Σ_{(x,y)∈C} p(x, y).
Definition. Let the region A have area |A|. Then X and Y are said to be jointly uniform on A if they have joint density
(7) f(x, y) = 1/|A| for (x, y) ∈ A, and 0 elsewhere. ■
6.3 Joint density 247
In general, just as for discrete random variables, use of the key rule provides us with
any required probability statement about X and Y . In particular we note the
Example 6.3.5. Let (X, Y) be uniformly distributed over the circle centred at the origin, radius 1. Find (i) the joint density of X and Y, (ii) the marginal density of X. □
As in the case of a single continuous random variable, the distribution function is often
useful.
Definition. Let X and Y have joint density f(x, y). The joint distribution function of X and Y is denoted by F(x, y), where
(11) F(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(u, v) dv du. ■
Differentiating, we find that
(12) ∂²F(x, y)/∂x∂y = f(x, y),
and it yields the marginal distributions of X and Y:
F_X(x) = F(x, ∞), F_Y(y) = F(∞, y).
Obviously 0 ≤ F(x, y) ≤ 1, and F is non-decreasing as x or y increases. Furthermore, as in the discrete case we have
(13) P(a ≤ X ≤ b, c ≤ Y ≤ d) = F(b, d) − F(a, d) − F(b, c) + F(a, c).
Example 6.3.7. Let X and Y have joint density f(x, y) = x + y, 0 ≤ x, y ≤ 1. Then for 0 ≤ x, y ≤ 1
F(x, y) = ∫_0^y ∫_0^x (u + v) du dv = ½xy(x + y).
For 0 ≤ x ≤ 1, y > 1,
F(x, y) = ∫_0^1 ∫_0^x (u + v) du dv = ½x(x + 1).
For 0 ≤ y ≤ 1, x > 1,
F(x, y) = ½y(y + 1).
Obviously for x, y > 1 we have F(x, y) = 1. □
6.4 INDEPENDENCE
The concept of independence has been useful and important on many previous occasions.
Recall that events A and B are independent if
(1) P(A ∩ B) = P(A)P(B).
In section 5.9 we noted that the event A and the random variable X are independent if, for any C,
(2) P({X ∈ C} ∩ A) = P(X ∈ C)P(A).
It therefore comes as no surprise that the following definition is equally useful and important.
Definition. Let X and Y have joint distribution function F(x, y). Then X and Y are independent if and only if for all x and y
(3) F(x, y) = F_X(x)F_Y(y). ■
Remark. We can relate this to our basic concept of independence in (1) by noting that (3) says
P(B_x ∩ B_y) = P(B_x)P(B_y),
where B_x = {ω: X(ω) ≤ x} and B_y = {ω: Y(ω) ≤ y}.
As usual, the general statement (3) implies different special forms for discrete and
continuous random variables.
Discrete case. If X and Y have the joint discrete distribution p(x, y), then X and Y are independent if, for all x, y,
(4) p(x, y) = p_X(x) p_Y(y).
Continuous case. If X and Y have joint density f(x, y), then they are independent if, for all x, y,
(5) f(x, y) = f_X(x) f_Y(y).
In any case the importance of independence lies in the special form of the key rule.
Key rule for independent random variables. If X and Y are independent then, for any events {X ∈ A} and {Y ∈ B},
(6) P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B).
The practical implications of this rule explain why independence is mostly employed in
two converse ways:
(i) We assume that X and Y are independent, and use the rule to find their joint behaviour.
(ii) We find that the joint distribution of X and Y satisfies (3), and deduce that they are independent; this simplifies all future calculations.
Of these, (i) is the more usual. Note that all these ideas and definitions are extended in
obvious and trivial ways to any sequence of random variables. Here are some simple
examples.
Example 6.4.1. Pick a card at random from a conventional pack of 52 cards. Let X denote the suit (in bridge order, so X(C) = 1, X(D) = 2, X(H) = 3, X(S) = 4), and Y the rank with aces low (so 1 ≤ Y ≤ 13). Then for any x, y
P(X = x, Y = y) = 1/52 = (1/4) × (1/13) = P(X = x)P(Y = y),
so that X and Y are independent. □
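The factorization can be verified for every one of the 52 cards at once; a short Python sketch of ours:

```python
from fractions import Fraction

# A card is a (suit, rank) pair; each of the 52 is equally likely.
deck = [(s, r) for s in range(1, 5) for r in range(1, 14)]
p = {card: Fraction(1, 52) for card in deck}

# Marginals of suit and rank.
p_X = {s: sum(v for (s2, r), v in p.items() if s2 == s) for s in range(1, 5)}
p_Y = {r: sum(v for (s, r2), v in p.items() if r2 == r) for r in range(1, 14)}

# Check p(x, y) = p_X(x) p_Y(y) at every point of the support.
independent = all(p[(s, r)] == p_X[s] * p_Y[r] for (s, r) in deck)
```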
Example 6.4.2. Pick a point (X, Y) uniformly at random in the rectangle 0 ≤ x ≤ a, 0 ≤ y ≤ b. Then
f(x, y) = 1/(ab).
The marginal densities are
f_X(x) = ∫_0^b f(x, y) dy = 1/a,
and
f_Y(y) = 1/b.
Hence
f(x, y) = 1/(ab) = f_X(x) f_Y(y),
and X and Y are independent. □
In fact it follows from our definition of independence that if the joint distribution F(x, y) factorizes as the product of a function of x and a function of y, for all x, y, then X and Y are independent.
However, a little thought is needed in applying this result.
Example 6.4.3: Bernoulli trials. In n trials the joint distribution of the number of successes X and the number of failures Y is
p(x, y) = n! p^x (1 − p)^y / (x! y!).
This looks like a product of functions of x and y, but of course it is not, because it is only valid for x + y = n. Here X and Y are not independent. □
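The failure of independence shows up immediately at any point off the line x + y = n; here is a small Python check of ours for n = 2 and p = ½ (arbitrary choices):

```python
from fractions import Fraction
from math import comb

# n Bernoulli trials: X successes, Y = n - X failures.
n, pr = 2, Fraction(1, 2)

def joint(x, y):
    if x + y != n:
        return Fraction(0)
    return comb(n, x) * pr**x * (1 - pr)**y

p_X = {x: sum(joint(x, y) for y in range(n + 1)) for x in range(n + 1)}
p_Y = {y: sum(joint(x, y) for x in range(n + 1)) for y in range(n + 1)}

# At (0, 0) the joint probability is 0, yet both marginals are positive,
# so the product form p(x,y) = p_X(x) p_Y(y) fails.
dependent = joint(0, 0) != p_X[0] * p_Y[0]
```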
and likewise
p_Y(y) = θ^{y−2}(1 − θ)^{−1},
then we find
p(x, y) = θ^{x+y−2} = p_X(x) p_Y(y) = θ^{x+y−4}(1 − θ)^{−2}
provided that
θ² = (1 − θ)²,
which entails θ = ½. If θ ≠ ½, then p(x, y) is not a distribution. □
The most striking property of this density is that it has rotational symmetry about the origin. That is to say, in polar coordinates with r² = x² + y² we have
f(x, y) = (1/2π) exp(−½r²).
This does not depend on the angle θ = tan^{−1}(y/x). Roughly speaking, the point (X, Y) is equally likely to lie in any direction from the origin. Hence, for example,
P(0 < Y < X) = P((X, Y) lies in the first octant) = 1/8.
As a bonus, we can use (7) to prove what we skipped over in chapter 5, namely that
∫_{−∞}^{∞} exp(−½x²) dx = √(2π).
To see this, let X and Y be independent with the same density
f(x) = c exp(−½x²).
Then (X, Y) has joint density
f(x, y) = c² exp{−½(x² + y²)}.
Since f(x, y) is a density, we have by appendix 5.11 that
1 = ∬ f(x, y) dx dy = c² ∬ exp(−½r²) r dr dθ
  = c² ∫_0^{2π} dθ ∫_0^{∞} r e^{−r²/2} dr
  = 2πc²,
as required. □
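Readers who distrust the polar trick can confirm the value √(2π) ≈ 2.5066 by direct numerical quadrature; a minimal Python sketch of ours, using the midpoint rule (the interval [−10, 10] is wide enough that the neglected tails are astronomically small):

```python
from math import exp, pi, sqrt

# Composite midpoint rule for the integral of exp(-x^2/2) over [-10, 10].
N = 200_000
a, b = -10.0, 10.0
h = (b - a) / N
integral = h * sum(exp(-0.5 * (a + (k + 0.5) * h) ** 2) for k in range(N))
```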
by the independence of X, Y, and Z. Since there are six such disjoint possibilities, we have
(8) F(u, v, w) = 6u(v − u)(w − v) + F(u, v, v) + F(u, u, w) − F(u, u, v).
Now we obtain the required density by differentiating (8) with respect to u, v, and w. Hence
f(u, v, w) = 6, 0 < u < v < w < 1. □
As an application of this, suppose that three points are placed at random on [0, 1], independently. What is the probability that no two are within a distance ¼ of each other? By the key rule, this is
∭_{v−u>1/4, w−v>1/4} f(u, v, w) du dv dw = 6 ∫_0^{1/2} ∫_{u+1/4}^{3/4} ∫_{v+1/4}^{1} dw dv du = 1/8.
6.5 FUNCTIONS
We have seen in section 5.7 that it is very easy to deal with functions of a single random
variable. Most problems in real life involve functions of more than one random variable,
however, and these are more interesting.
Example 6.5.1. Your steel mill rolls a billet of steel from the furnace. It is 10 metres long but, owing to the variations to be expected in handling several tons of white-hot metal, the height and width are random variables X and Y. The volume is the random variable Z = 10XY.
What can we say of Z? □
The general problem for functions of two random variables amounts to this: let the random variable
Z = g(X, Y)
be a function of X and Y; what is the distribution of Z?
As we did for single random variables it is convenient to deal separately with the
discrete and continuous cases; also, as it was for single random variables, the answer to
our question is supplied by the key rules for joint distributions.
Continuous case. Let the random variables X, Y, and Z satisfy Z = g(X, Y), where X and Y have density f(x, y). Then by (5) of section 6.3,
(2) F_Z(z) = P(Z ≤ z) = ∬_{x,y: g(x,y)≤z} f(x, y) dx dy.
When Z is also continuous, its density f_Z(z) is easily obtained by differentiating (2) above.
These ideas are best grasped by inspection of examples.
Example 6.5.2. (i) Suppose two numbers X and Y are picked at random from {1, 2, …, 49}, without replacement. What is the distribution of Z = max{X, Y} = X ∨ Y?
(ii) Lottery revisited. Let Z be the largest of six numbers picked from {1, 2, …, 49} in a draw for the lottery. What is the distribution of Z?
For (i):
p_Z(z) = Σ_{x=1}^{z−1} p(x, z) + Σ_{y=1}^{z−1} p(z, y) = (z − 1) C(49, 2)^{−1}, 2 ≤ z ≤ 49.
For (ii): With an obvious notation,
p(x₁, …, x₆) = 1/(49 · 48 · 47 · 46 · 45 · 44) if xᵢ ≠ x_j for i ≠ j, 1 ≤ i, j ≤ 6,
and 0 otherwise.
Using (1) again gives, after a little calculation,
p(z) = C(z − 1, 5)/C(49, 6), 6 ≤ z ≤ 49. □
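Both answers can be checked exactly: the probabilities must sum to 1 (for the lottery this is the hockey-stick identity Σ C(z−1, 5) = C(49, 6)). A Python sketch of ours:

```python
from fractions import Fraction
from math import comb

# (ii) P(largest of 6 distinct numbers from {1,...,49} is z) = C(z-1,5)/C(49,6).
p = {z: Fraction(comb(z - 1, 5), comb(49, 6)) for z in range(6, 50)}

# (i) P(larger of 2 distinct numbers is z) = (z-1)/C(49,2).
p2 = {z: Fraction(z - 1, comb(49, 2)) for z in range(2, 50)}
```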
Equation (2) can be used for similar purposes with continuous random variables.
Solution. By (5),
P(Z ≤ z) = ∬_{x∨y≤z} f(x, y) dx dy.
Since X < Y always, this calculation is rather an easy one. In fact Z = Y, so Z has the density
f(z) = f_Y(z) = z e^{−z}, z > 0. □
Example 6.5.4: sum of uniform random variables. Let X and Y be independent and uniform on [0, 1]. What is the density of Z = X + Y?
Hence, differentiating,
f_Z(z) = z for 0 ≤ z ≤ 1, and f_Z(z) = 2 − z for 1 ≤ z ≤ 2.
This is a triangular density. □
Example 6.5.5. Let X and Y have the joint density derived in example 6.4.6, so that
f(x, y) = (1/2π) exp{−½(x² + y²)}.
What is the distribution of Z = (X² + Y²)^{1/2}?
Often the direct use of independence allows calculations to be carried out simply,
without using (1) or (2).
Joint distributions of two or more functions of several random variables are obtained in
much the same way, only with a good deal more toil and trouble. We look at a few simple
examples here; a general approach to transformations of continuous random variables is
deferred to section 6.13.
Example 6.5.7. Let X and Y be independent and geometric, with parameters λ and μ respectively. Define
U = min{X, Y} = X ∧ Y,
V = max{X, Y} = X ∨ Y,
W = V − U.
Find the joint probability distribution of U and V, and of U and W. Show that U and W are independent of each other.
A similar result is true for exponential random variables; this is important in more
advanced probability.
P(U > u, W ≤ w) = e^{−(λ+μ)u} − λ(λ + μ)^{−1} e^{−(λ+μ)u−μw} − μ(λ + μ)^{−1} e^{−(λ+μ)u−λw}.
Hence we can find the joint density by differentiation, yielding
f(u, w) = λμ(λ + μ)^{−1}(e^{−λw} + e^{−μw}) · (λ + μ) e^{−(λ+μ)u}.
Thus U and W, where U is exponential with parameter λ + μ and W has density
f_W(w) = λμ(λ + μ)^{−1}(e^{−λw} + e^{−μw}), 0 ≤ w < ∞,
are independent by (5) of section 6.4. □
If we are sufficiently careful and persistent, we can establish even more surprising results in this way. Here is one final illustration.
Example 6.5.9. Let X and Y be independent and exponential, both with parameter 1. Define
U = X + Y,
V = X/(X + Y).
Find the joint density of U and V. Deduce that U and V are independent, and find their marginal density functions.
[Figure: the transformation in the (x, y) plane; the point with coordinates (u, v) corresponds to (x, y) = (uv, u − uv).]
Hence, surprisingly, we find that U and V are independent, where U has a gamma density and V is uniform on (0, 1). □
6.6 SUMS OF RANDOM VARIABLES
In the previous section we looked in a general way at how to find the probability distributions of various functions of random variables. In practical applications it most often turns out that we are interested in the sum of random variables. For example:
The success or failure of an insurance company or bank depends on the cumulative
sum of payments in and out.
We have noted above that practical estimates or measurements very often use the sample mean of observations X_r, that is,
X̄ = (1/n) Σ_{r=1}^{n} X_r.
Quality control often concerns itself with the total sum of errors or defective items in some process.
And so on; think of some more yourself. In this section we therefore look at various ways of finding the distributions of sums of random variables. We begin with some easy examples; in particular, we first note that in a few cases we already know the answer.
This is useful but limited; the time has come to give a general approach to this problem.
As with much else, the answer is supplied by the key rule for joint distributions.
Sum of discrete random variables. Let X and Y have joint distribution p(x, y), and let Z = X + Y. Then, by (1) of section 6.5,
(1) p_Z(z) = P(X + Y = z) = Σ_{x,y: x+y=z} p(x, y)
(2)        = Σ_x p(x, z − x).
Probability density of a sum. If the pair (X, Y) has density f(x, y), the density of Z = X + Y is given by
(4) f_Z(z) = ∫_{−∞}^{∞} f(x, z − x) dx.
In particular, when X and Y are independent, so that f(x, y) = f_X(x) f_Y(y), this becomes
(5) f_Z(z) = ∫_{−∞}^{∞} f_X(x) f_Y(z − x) dx,
or equivalently
(6) f_Z(z) = ∫_{−∞}^{∞} f_X(z − y) f_Y(y) dy.
Proof of (4). Let X and Y have joint density f(x, y), with Z = X + Y. Then by the key rule for joint densities, (5) of section 6.3,
(7) F_Z(z) = P(Z ≤ z) = ∬_{x+y≤z} f(x, y) dx dy
           = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{z−x} f(x, y) dy dx.
Now differentiating this with respect to z, and recalling the fundamental theorem of calculus in the more general form (see appendix 4.14), gives
(8) f_Z(z) = ∫_{x=−∞}^{∞} f(x, z − x) dx,
as required. ■
Here are some examples. In each of the following X and Y are independent, with
Z X Y.
Example 6.6.3: binomial sum. If X is binomial B(n, p) and Y is binomial B(m, p), then by (2)
p_Z(z) = Σ_{x=0}^{z} C(n, x) p^x q^{n−x} · C(m, z − x) p^{z−x} q^{m−z+x}
       = p^z q^{m+n−z} Σ_{x=0}^{z} C(n, x) C(m, z − x)
       = C(m + n, z) p^z q^{m+n−z}.
Thus Z is binomial B(m + n, p), as we knew already. □
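The convolution in (2) is mechanical, so the whole calculation can be replayed exactly in a few lines of Python (our sketch; the parameters n = 4, m = 7, p = 1/3 are arbitrary):

```python
from fractions import Fraction
from math import comb

def binom_pmf(n, p):
    return {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

def convolve(pX, pY):
    """pmf of Z = X + Y for independent X, Y, by equation (2)."""
    pZ = {}
    for x, px in pX.items():
        for y, py in pY.items():
            pZ[x + y] = pZ.get(x + y, Fraction(0)) + px * py
    return pZ

p = Fraction(1, 3)
pZ = convolve(binom_pmf(4, p), binom_pmf(7, p))
same = pZ == binom_pmf(11, p)   # should be B(4 + 7, p), exactly
```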
Example 6.6.4: Poisson sum. If X and Y are Poisson, with parameters λ and μ respectively, then by (2)
p_Z(z) = Σ_{x=0}^{z} e^{−λ} (λ^x/x!) · e^{−μ} (μ^{z−x}/(z − x)!) = (e^{−(λ+μ)}/z!) Σ_{x=0}^{z} C(z, x) λ^x μ^{z−x}
       = e^{−(λ+μ)} (λ + μ)^z / z!.
Thus Z is Poisson with parameter λ + μ. □
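After the common factor e^{−(λ+μ)} is cancelled, the identity being used here is the binomial theorem, and it can be checked exactly with rational λ and μ; a Python sketch of ours:

```python
from fractions import Fraction
from math import factorial

# The Poisson(r) pmf is e^{-r} r^x / x!; the factor e^{-(lam+mu)} is common
# to both sides, so comparing the rational parts r^x / x! is an exact check.
lam, mu = Fraction(3, 2), Fraction(5, 4)

def term(rate, x):
    return rate**x / factorial(x)

ok = all(
    sum(term(lam, x) * term(mu, z - x) for x in range(z + 1))
    == term(lam + mu, z)
    for z in range(20)
)
```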
Example 6.6.5: geometric sum. Let X and Y be independent and geometric with parameter p. Find the mass functions of U = X + Y and V = X − Y.
Solution. By (2),
p_U(u) = Σ_{x=1}^{u−1} p_X(x) p_Y(u − x) = Σ_{x=1}^{u−1} q^{x−1} p · q^{u−x−1} p
       = (u − 1) q^{u−2} p², u ≥ 2.
For v > 0 we have
p_V(v) = Σ_{y=1}^{∞} p_X(v + y) p_Y(y) = Σ_{y=1}^{∞} p² q^{v+y−1} q^{y−1} = p² q^v/(1 − q²)
       = p q^v/(1 + q).
For v < 0 we have
p_V(v) = Σ_{x−y=v} p_X(x) p_Y(y) = Σ_{x=1}^{∞} p_X(x) p_Y(x − v) = Σ_{x=1}^{∞} p² q^{x−1} q^{x−v−1} = p² q^{−v}/(1 − q²)
       = p q^{−v}/(1 + q).
Finally, for v = 0,
p_V(0) = Σ_{x=1}^{∞} p² q^{2x−2} = p²/(1 − q²)
       = p/(1 + q). □
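Both closed forms are easy to test in Python; the sum formula exactly, and the difference formula to within the (astronomically small) error of truncating the geometric series. Our sketch, with p = 1/3 an arbitrary choice:

```python
from fractions import Fraction

p = Fraction(1, 3)
q = 1 - p

def geom(x):           # p_X(x) = q^(x-1) p, x >= 1
    return q**(x - 1) * p

# U = X + Y: the finite convolution against (u - 1) q^(u-2) p^2, exactly.
u_ok = all(
    sum(geom(x) * geom(u - x) for x in range(1, u)) == (u - 1) * q**(u - 2) * p**2
    for u in range(2, 30)
)

# V = X - Y: truncate the infinite sum; compare with p q^|v| / (1 + q).
def pV(v, terms=200):
    return sum(geom(y + max(0, -v)) * geom(y + max(0, v))
               for y in range(1, terms))

v_ok = all(abs(float(pV(v)) - float(p * q**abs(v) / (1 + q))) < 1e-12
           for v in range(-5, 6))
```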
Example 6.6.6: uniform sum. Let X and Y be uniform on (0, a). Then by (6)
f_Z(z) = ∫ f_X(x) f_Y(z − x) dx
       = ∫_0^z a^{−2} dx = z a^{−2}, 0 ≤ z ≤ a,
       = ∫_{z−a}^{a} a^{−2} dx = (2a − z) a^{−2}, a ≤ z ≤ 2a,
which is the triangular density on (0, 2a), as we already know. □
Next we consider the sum of two independent normally distributed random variables. It
is worth recalling that the normal density was first encountered in chapter 4 as an
approximation to the binomial distribution, and that the sum of two independent binomial
random variables (with the same parameter p) was found to be binomial in example
6.6.3. We should therefore expect that the sum of two independent normal random
variables is itself normal. This is so, as we now see.
Example 6.6.8: normal sum. Let X and Y be independent and normal with zero mean, having variances σ² and τ² respectively. We let ρ² = σ² + τ². Then we have from (5), as usual,
f_Z(z) = ∫_{−∞}^{∞} (1/(2πστ)) exp{−x²/(2σ²) − (z − x)²/(2τ²)} dx
       = (2πρ²)^{−1/2} ∫_{−∞}^{∞} (ρ/((2π)^{1/2} στ)) exp{−(ρ²/(2σ²τ²))(x − σ²z/ρ²)²} dx × exp{−z²/(2ρ²)}
       = (2πρ²)^{−1/2} exp{−z²/(2ρ²)}.
This last step follows because the integrand is just the
N(σ²z/ρ², σ²τ²/ρ²)
density, which when integrated over R gives 1, as always.
Thus we have shown that if X and Y are independent normal random variables, with zero mean and variances σ² and τ², then X + Y is also normal, with variance σ² + τ². □
The same argument will give the distribution function or the survival function of a sum of random variables. For example, let X and Y be discrete and independent, with Z = X + Y. Then
(9) F_Z(z) = P(Z ≤ z) = P(X + Y ≤ z) = Σ_y P(X ≤ z − y)P(Y = y) = Σ_y F_X(z − y) p_Y(y).
Likewise, for the survival function F̄_Z(z) = P(Z > z),
(10) F̄_Z(z) = Σ_y F̄_X(z − y) p_Y(y).
As an application, let us reconsider Pepys' problem, which appeared in exercise (5) at the end of section 2.6. We can now solve a more general case.
Example 6.6.9: extended Pepys' problem. Show that if A_n has 6n dice and needs at least n sixes, then A_n has an easier task than A_{n+1}. That is to say, the chance of at least n sixes in 6n rolls is greater than the chance of at least n + 1 sixes in 6(n + 1) rolls.
Solution. We can divide A_{n+1}'s rolls into two groups, one of size 6n, yielding X sixes, and the other of size 6, yielding Y sixes. Write Z = X + Y. Then we need to show that
P(Z ≥ n + 1) ≤ P(X ≥ n).
Write p(n) = P(X = n), p_Y(n) = P(Y = n), and F̄(n) = P(X ≥ n). We know that X is binomial B(6n, 1/6) and Y is binomial B(6, 1/6). From what we have proved about the binomial distribution, we have
(11) p(n − 5) ≤ p(n − 4) ≤ p(n − 3) ≤ p(n − 2) ≤ p(n − 1) ≤ p(n),
and
(12) EY = 1.
Now, by conditioning on Y, we obtain
P(Z ≥ n + 1) = P(X + Y ≥ n + 1)
            = Σ_{r=0}^{6} P(X + Y ≥ n + 1 | Y = r) p_Y(r)
            = Σ_{r=0}^{6} P(X ≥ n + 1 − r) p_Y(r)
            ≤ F̄(n), by (11) and (12)
            = P(X ≥ n),
as required. (For the inequality, note that P(X ≥ n + 1 − r) ≤ F̄(n) + (r − 1)p(n) by (11), and Σ_r (r − 1) p_Y(r) = EY − 1 = 0 by (12).) □
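The conclusion is easy to confirm by exact computation of the binomial tail probabilities for the first few n; a Python sketch of ours:

```python
from fractions import Fraction
from math import comb

def at_least(k, n, p):
    """P(at least k successes in n Bernoulli(p) trials), exactly."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

p6 = Fraction(1, 6)
# P(at least n sixes in 6n rolls), for n = 1, ..., 5.
probs = [at_least(n, 6 * n, p6) for n in range(1, 6)]
decreasing = all(a > b for a, b in zip(probs, probs[1:]))
```

The first entry is the probability Pepys originally asked about: at least one six in 6 rolls.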
6.7 EXPECTATION; THE METHOD OF INDICATORS
In the previous sections we have looked at the distribution of functions of two or more
random variables. It is often useful and interesting to know the expected value of such
functions, so the following results are very important.
Addition rule for expectation. For any two random variables X and Y with a joint distribution, we have
(3) E(X + Y) = EX + EY.
The proof is easy. Suppose X and Y are discrete; then by (1)
E(X + Y) = Σ_x Σ_y (x + y) p(x, y)
         = Σ_x Σ_y x p(x, y) + Σ_y Σ_x y p(x, y)
         = Σ_x x p_X(x) + Σ_y y p_Y(y), by (6) of section 6.2,
         = EX + EY.
Example 6.7.1: dice. You add the scores X_r from n rolls of a die. By the above,
E(Σ_{r=1}^{n} X_r) = Σ_{r=1}^{n} EX_1
                   = Σ_{r=1}^{n} 7/2, by (4),
                   = 7n/2.
The important thing about this trivial example is that the calculation is extremely easy, though the actual probability distribution of Σ_{r=1}^{n} X_r is extremely complicated. □
Example 6.7.2: waiting for r successes. Suppose you undergo a sequence of Bernoulli trials with P(S) = p; let T be the number of trials until the rth success. What is ET?

Solution. We know from exercise 3 at the end of section 6.6 that T has a negative binomial distribution, so

ET = \sum_{n=r}^{\infty} n \binom{n-1}{r-1} (1 - p)^{n-r} p^r.

Summing this series is feasible but dull. Here is a better way. Let X_1 be the number of trials up to and including the first success, X_2 the further number of trials to the second success, and so on for X_3, X_4, \dots, X_r. Then each of X_1, X_2, \dots has a geometric distribution with parameter p and mean p^{-1}. Hence

ET = E \sum_{k=1}^{r} X_k = r p^{-1}. s
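For a quick check (not part of the text), one can sum the negative binomial series directly, truncating it far enough out that the tail is negligible, and compare with r/p:

```python
from math import comb

def neg_binomial_mean(r, p, nmax=2000):
    """Sum n * P(T = n) over the negative binomial pmf, truncated at nmax."""
    return sum(n * comb(n - 1, r - 1) * (1 - p) ** (n - r) * p ** r
               for n in range(r, nmax))

# The dull series sum agrees with the easy answer ET = r/p.
assert abs(neg_binomial_mean(3, 0.5) - 3 / 0.5) < 1e-9
assert abs(neg_binomial_mean(5, 0.3) - 5 / 0.3) < 1e-6
```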
Example 6.7.3: coupon collecting. Each packet of some ineffably dull and noxious product contains one of d different types of flashy coupon. Each packet is independently equally likely to contain any of the d types. How many packets do you expect to need to buy until the moment when you first possess all d types?

Solution. Let T_1 be the number of packets bought until you have one type of coupon, T_2 the further number required until you have two types of coupon, and so on up to T_d. Obviously T_1 = 1. Next, at each purchase you obtain a new type with probability (d - 1)/d, or not with probability 1/d. Hence T_2 is geometric, with mean d/(d - 1). Likewise, T_r is geometric with mean d/(d - r + 1), for 1 \le r \le d. Hence the expected number of packets purchased is

(5)   E(T_1 + T_2 + \dots + T_d) = \sum_{r=1}^{d} ET_r = \sum_{r=1}^{d} \frac{d}{d - r + 1}. s
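The sum in (5) equals d(1 + 1/2 + ... + 1/d), and it can be cross-checked against a quite different route: summing the tail probabilities P(T > t) obtained by inclusion-exclusion. The sketch below (an illustration, not from the text) verifies that both routes give the same exact value:

```python
from math import comb
from fractions import Fraction

def mean_by_stages(d):
    """E(T1 + ... + Td) = sum_{r=1}^{d} d/(d - r + 1), as in (5)."""
    return sum(Fraction(d, d - r + 1) for r in range(1, d + 1))

def mean_by_tail_sum(d):
    """ET = sum_{t>=0} P(T > t), with inclusion-exclusion giving
    P(T > t) = sum_j (-1)^(j+1) C(d, j) ((d - j)/d)^t; summing the
    geometric series over t leaves sum_j (-1)^(j+1) C(d, j) d/j."""
    return sum((-1) ** (j + 1) * comb(d, j) * Fraction(d, j)
               for j in range(1, d + 1))

assert mean_by_stages(5) == mean_by_tail_sum(5) == Fraction(137, 12)
```

With d = 5 coupon types you expect to buy 137/12 ≈ 11.4 packets.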
Example 6.7.4. A dart hits a plane target at the point (X, Y), where X and Y have density

f(x, y) = \frac{1}{2\pi} \exp\{-\tfrac12 (x^2 + y^2)\}.

Let R be the distance of the dart from the bullseye at the origin. Find ER^2.

Solution. Of course we could find the density of R, and then the required expectation. It is much easier to note that X and Y are each N(0, 1) random variables, and then

ER^2 = E(X^2 + Y^2) = EX^2 + EY^2 = var X + var Y = 2. s
We have looked first at the addition law for expectations because of its paramount importance. But, of course, the law of the unconscious statistician works for many other functions of interest. Here are some examples.
Example 6.7.5. Let the point (X, Y) be uniformly distributed on the unit disc, so that f(x, y) = 1/\pi for x^2 + y^2 \le 1. Find (i) E\{(X^2 + Y^2)^{1/2}\}, (ii) E|X \wedge Y|, (iii) E(X^2 + Y^2), (iv) E\{X^2/(X^2 + Y^2)\}.

Solution. It is clear that polar coordinates are going to be useful here. In each case, by application of (2), we have the following.

For (i):

E\{(X^2 + Y^2)^{1/2}\} = \frac{1}{\pi} \int_0^{2\pi} \int_0^1 r^2 \, dr \, d\theta = \frac{2}{3}.

For (ii): By symmetry the answer is the same in each octant, so

E|X \wedge Y| = 8 \iint_{0 < y < x} y\, f(x, y) \, dx \, dy = \frac{8}{\pi} \int_0^{\pi/4} \int_0^1 r^2 \sin\theta \, dr \, d\theta = \frac{8}{3\pi}\left(1 - \frac{1}{\sqrt 2}\right).

For (iii):

E(X^2 + Y^2) = \frac{1}{\pi} \int_0^{2\pi} \int_0^1 r^3 \, dr \, d\theta = \frac{1}{2}.

For (iv): By symmetry E\{X^2/(X^2 + Y^2)\} = E\{Y^2/(X^2 + Y^2)\}, and their sum is 1. Hence

E\{X^2/(X^2 + Y^2)\} = \frac{1}{2}. s
In concluding this section we first recall a simple but important property of indicator random variables: if I_A is the indicator of the event A, then EI_A = P(A).

Example 6.7.6: reliability. A structure contains n distinct elements, the ith of which works with probability p_i. Find the expected number \mu of working elements.

Solution. Let X_i be the indicator of the event that the ith element works. Since the elements are different we have EX_i = p_i, where p_i is not necessarily equal to p_j for i \ne j. Nevertheless

(7)   \mu = E \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} p_i.

The point here is that we do not need to know anything about the joint distribution of the failures of elements; we need only know their individual failure rates in this structure in order to find \mu. s
Corollary: binomial mean. In the special case when p_i = p for all i, we know that the number of elements working is a binomial random variable X with parameters n and p. Hence EX = np, as we showed more tediously in chapter 4.
This idea will also supply the mean of other sampling distributions discussed in
chapter 4.
Example 6.7.7: hypergeometric mean. Suppose n balls are drawn at random without replacement from an urn containing f fawn balls and m mauve balls. What is the expected number of fawn balls removed? As usual, n \le f \wedge m.

Solution. Let Y be the number of fawn balls and let X_r be the indicator of the event that the rth ball drawn is fawn. Then

EX_r = \frac{f}{m + f}

and the answer is

EY = E \sum_{r=1}^{n} X_r = \frac{nf}{m + f}.

Since we know that Y has a hypergeometric distribution, this shows that

(8)   EY = \sum_y y\, P(Y = y) = \sum_{y=0}^{n} y \binom{f}{y} \binom{m}{n - y} \bigg/ \binom{m + f}{n} = \frac{nf}{m + f}.
You may care to while away an otherwise idle moment in proving this by brute force. s
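Rather than proving (8) by hand, an idle moment can be spent letting a machine do the brute force. This sketch (an illustration, not from the text) evaluates the hypergeometric sum exactly and compares it with the indicator answer nf/(m + f):

```python
from math import comb
from fractions import Fraction

def hypergeometric_mean(f, m, n):
    """Brute-force EY = sum_y y C(f, y) C(m, n - y) / C(m + f, n)."""
    total = comb(m + f, n)
    return sum(Fraction(y * comb(f, y) * comb(m, n - y), total)
               for y in range(0, min(f, n) + 1))

# Agrees with the method of indicators: EY = nf/(m + f).
for f, m, n in [(4, 6, 3), (5, 5, 4), (7, 2, 2)]:
    assert hypergeometric_mean(f, m, n) == Fraction(n * f, m + f)
```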
Example 6.7.8: inclusion-exclusion. For 1 \le r \le n, let I_r be the indicator of the event A_r. Then if we define X by

X = 1 - \prod_{r=1}^{n} (1 - I_r),

X is the indicator of the union A_1 \cup \dots \cup A_n; expanding the product and taking the expected value of each term yields the inclusion-exclusion formula for P(\cup_r A_r). s
Indicators can be used to prove the following useful and familiar result: if X takes values in the non-negative integers, then

EX = \sum_{x=1}^{\infty} P(X \ge x).

Here is an illustration.
Example 6.7.9. An urn contains c cobalt balls and d dun balls. They are removed at random without replacement; let X be the number of dun balls removed before the first cobalt ball. Then X \ge x if and only if the first x balls are all dun. The probability of this is

P(X \ge x) = \binom{d}{x} \bigg/ \binom{c + d}{x}.

Hence

EX = \sum_{x=1}^{d} \binom{d}{x} \bigg/ \binom{c + d}{x}.

This is continued in exercise 3. s
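The tail sum above can be checked against direct enumeration of all orderings of the urn. The following sketch (an illustration, not from the text) does this for a small urn:

```python
from fractions import Fraction
from itertools import permutations
from math import comb

def mean_by_tail_sum(c, d):
    """EX = sum_{x=1}^{d} C(d, x) / C(c + d, x)."""
    return sum(Fraction(comb(d, x), comb(c + d, x)) for x in range(1, d + 1))

def mean_by_enumeration(c, d):
    """Average, over all equally likely orderings of the urn, of the
    number of dun balls drawn before the first cobalt ball."""
    total, count = Fraction(0), 0
    for perm in set(permutations('C' * c + 'D' * d)):
        x = 0
        for ball in perm:
            if ball != 'D':
                break
            x += 1
        total += x
        count += 1
    return total / count

assert mean_by_tail_sum(2, 3) == mean_by_enumeration(2, 3) == Fraction(1)
```

With c = 2 cobalt and d = 3 dun balls both routes give EX = 1 exactly.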
We conclude this section with yet another look at the method of indicators.
Example 6.7.10: matching. The first n integers are drawn at random out of a hat (or urn, if you prefer). Let X be the number of occasions when the integer drawn in the rth place is in fact r. Find var X.

Solution. This is tricky if you try to use the distribution of X, but simple using indicators. Let I_r indicate the event that r is drawn in the rth place. Then

X = \sum_{r=1}^{n} I_r.

Furthermore

EI_r = \frac{1}{n} \quad \text{and} \quad E(I_r I_s) = \frac{1}{n(n-1)}, \quad r \ne s.

Hence

var X = E(X^2) - (EX)^2 = E\Big\{\Big(\sum_{r=1}^{n} I_r\Big)^2\Big\} - 1 = n E(I_r^2) + n(n - 1) E(I_r I_s) - 1 = 1 + 1 - 1 = 1. s
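The tidy conclusion EX = var X = 1, for every n ≥ 2, can be confirmed by exhaustive enumeration of permutations; the sketch below (an illustration, not from the text) counts fixed points over all n! equally likely orderings:

```python
from fractions import Fraction
from itertools import permutations

def fixed_point_moments(n):
    """Exact mean and variance of the number of matches (fixed points)
    over all n! equally likely permutations of 0, ..., n-1."""
    perms = list(permutations(range(n)))
    counts = [sum(1 for i, v in enumerate(p) if i == v) for p in perms]
    mean = Fraction(sum(counts), len(perms))
    var = Fraction(sum(c * c for c in counts), len(perms)) - mean ** 2
    return mean, var

# EX = var X = 1, whatever the value of n >= 2.
for n in (2, 3, 4, 5, 6):
    assert fixed_point_moments(n) == (Fraction(1), Fraction(1))
```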
4. We have E(X + Y) = EX + EY, for any X and Y. Show that it is not necessarily true that

median(X + Y) = median X + median Y,

nor is it necessarily true that

mode(X + Y) = mode X + mode Y.
6.8 INDEPENDENCE AND COVARIANCE
We have seen on many occasions that independence has useful and important consequences for random variables and their distributions. Not surprisingly, this is also true for expectation. The reason for this is the following product rule: if X and Y are independent, then

E(XY) = EX\, EY.

For discrete X and Y the proof is easy: by independence p(x, y) = p_X(x)\, p_Y(y), so that

E(XY) = \sum_x \sum_y xy\, p_X(x)\, p_Y(y) = \sum_x x\, p_X(x) \sum_y y\, p_Y(y) = EX\, EY.

When X and Y are jointly continuous the proof just replaces summations by integrals. h
Compare this with your previous derivation. Here is a slightly more complicated application.

Example: coupon collecting revisited. Each box independently contains one of n types of coupon, each type being equally likely; let X_1 be the number of boxes you open until you first have a complete set. Find EX_1.

Solution. Obviously your first box supplies you with the first coupon of your set. Let N_1 be the number of boxes you then need to open to get a coupon different from that in the first box. The probability that any coupon is the same as your first is 1/n; the probability that it is different is (n - 1)/n. Boxes are independent. Hence

P(N_1 = x) = \left(\frac{1}{n}\right)^{x-1} \frac{n - 1}{n}, \quad x \ge 1,

which is geometric with parameter (n - 1)/n. Now let N_2 be the further number of boxes you need to open to obtain the third coupon of your set. The same line of argument shows that

P(N_2 = x) = \left(\frac{2}{n}\right)^{x-1} \frac{n - 2}{n}, \quad x \ge 1,

which is geometric with parameter (n - 2)/n.

Continuing in this way yields a series of geometric random variables (N_k; 1 \le k \le n - 1) with respective parameters (n - k)/n. Obviously the process stops at N_{n-1}, because this yields the nth and final member of your complete set. Also

X_1 = 1 + N_1 + \dots + N_{n-1},

and furthermore the random variables N_1, \dots, N_{n-1} are independent, because the boxes are. Hence

EX_1 = 1 + EN_1 + \dots + EN_{n-1} = 1 + \frac{n}{n-1} + \frac{n}{n-2} + \dots + \frac{n}{1} = n \sum_{k=1}^{n} \frac{1}{k}.
When X and Y are not independent, the product rule does not necessarily hold, so it is convenient to make the following definition: the covariance of X and Y is

cov(X, Y) = E\{(X - EX)(Y - EY)\} = E(XY) - EX\, EY.

At this point we note that while independent X and Y have zero covariance, the converse is not true.

Example 6.8.3. Let X be any bounded non-zero random variable with a distribution symmetric about zero; that is to say, p(x) = p(-x), or f(x) = f(-x). Let Y = X^2. Then Y is not independent of X, but nevertheless

cov(X, Y) = EX^3 - EX\, EX^2 = 0. s
Despite this, their covariance is clearly a rough and ready guide to the mutual dependence of X and Y. An even more useful guide is the correlation,

\rho(X, Y) = \frac{cov(X, Y)}{(var X\, var Y)^{1/2}}.

This may appear unnecessarily complicated, but the point is that if we change the location and scale of X and Y then \rho is essentially unchanged, since

(8)   \rho(aX + b, cY + d) = sign(ac)\, \rho(X, Y),

where

sign(x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0. \end{cases}

Thus \rho is scale-free, but covariance is not, because

(9)   cov(aX + b, cY + d) = ac\, cov(X, Y).
Here are two simple routine examples.
Example 6.8.5. Roll two dice, yielding the scores X and Y respectively. By independence, cov(X, Y) = 0. Now let U = min\{X, Y\} and V = max\{X, Y\}, with joint distribution

p(u, v) = \begin{cases} \frac{1}{18}, & u < v \\ \frac{1}{36}, & u = v. \end{cases}

Then, using (9) of section 6.7 and some arithmetic, we find that EU = \frac{91}{36}. Also EV = 7 - EU = \frac{161}{36}, with E(UV) = E(XY) = EX\, EY = \frac{49}{4}, and so

cov(U, V) = \frac{49}{4} - \frac{91}{36} \cdot \frac{161}{36} = \left(\frac{35}{36}\right)^2. s
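The arithmetic here is easy to check by enumerating all 36 equally likely outcomes exactly; the sketch below (an illustration, not from the text) does so with exact rationals:

```python
from fractions import Fraction

outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]  # 36 equally likely
EU = Fraction(sum(min(x, y) for x, y in outcomes), 36)
EV = Fraction(sum(max(x, y) for x, y in outcomes), 36)
EUV = Fraction(sum(min(x, y) * max(x, y) for x, y in outcomes), 36)

assert EU == Fraction(91, 36)
assert EV == 7 - EU == Fraction(161, 36)
assert EUV == Fraction(49, 4)                  # E(UV) = E(XY) = EX EY
assert EUV - EU * EV == Fraction(35, 36) ** 2  # cov(U, V)
```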
The use of covariance extends our range of examples of the use of indicators.
We end this section with a look at one of the most important results in probability, the
so-called law of large numbers.
At various points throughout the book we have remarked that, given a set X_1, \dots, X_n of independent observations or readings, which we take to be random variables, their average

(11)   \bar X_n = n^{-1} \sum_{r=1}^{n} X_r

is a quantity of natural interest. We used this idea to motivate our interest in, and definitions of, expectation. In the case when the X's are indicators, we also used this expression to motivate our ideas of probability. In that case \bar X_n is just the relative frequency of whatever the X's are indicating.

In both cases we claimed that, empirically, as n \to \infty the averages \bar X_n tend to settle down around some mean value. It is now time to justify that assertion. Obviously we need to do so, for if \bar X_n in (11) did not ever display this type of behaviour, it would undermine our theory, to say the least.

We remind ourselves that the X's are independent and identically distributed, with mean \mu and variance \sigma^2 < \infty. Then we have the following so-called weak law of large numbers.
6. `You can never foretell what any one man will do, but you can say with precision what an
average number of men will be up to'. Attributed to Sherlock Holmes by A. Conan Doyle.
(a) Has Conan Doyle said what he presumably meant to say?
(b) If not, rephrase the point correctly.
7. Let X and Y be independent. Is it ever true that var(XY) = var X\, var Y?
Example 6.9.1. You roll a fair die, which shows X. Then you flip X fair coins, which show Y heads. Clearly, as always,

P(X = 1) = \frac{1}{6}.

However, suppose we observe that Y = 2. Now it is obviously impossible that X = 1. Knowledge of Y has imposed conditions on X, and to make this clear we use an obvious notation and write

P(X = 1 \mid Y = 2) = 0. s

What can we say in general about such conditional probabilities? Recall from chapter 2 that for events A and B

(1)   P(A \mid B) = P(A \cap B)/P(B).

Also recall from section 5.8 that for a discrete random variable X and an event B,

(2)   P(X = x \mid B) = P(\{X = x\} \cap B)/P(B).
The following definition is now almost self-evident.

Definition. Let X and Y be discrete random variables, with joint distribution p(x, y). Then the conditional distribution of X given Y is denoted by p(x|y), where

(3)   p(x|y) = P(X = x, Y = y)/P(Y = y) = p_{X,Y}(x, y)/p_Y(y). n

Remark. Of course this is just (1) written in terms of random variables, since

P(X = x \mid Y = y) = P(A_x \mid A_y) = P(A_x \cap A_y)/P(A_y),

where A_x = \{\omega : X(\omega) = x\} and A_y = \{\omega : Y(\omega) = y\}. All the above definition really comprises is the name and notation.
In view of this connection, it is not surprising that the partition rule also applies.
Partition rule for discrete distributions. Let X and Y have joint distribution p(x, y). Then we have

(4)   p_X(x) = \sum_y p(x|y)\, p_Y(y).

To prove this, just multiply (3) by p_Y(y) and sum over y. Of course, it also follows directly from the partition rule in chapter 2, provided always that X and Y are defined on the same sample space. h
Here are some examples to show the ways in which (3) and (4) are commonly applied.
Example 6.9.1 continued. You roll a die, which shows X, and flip X coins, which show Y heads. Find p(x, y) and p_Y(y).

Solution. Given X = x, the number of heads is binomial B(x, \tfrac12). That is to say,

p(y|x) = \binom{x}{y} 2^{-x}, \quad 0 \le y \le x.

Hence

p(x, y) = p(y|x)\, p_X(x) = \frac{1}{6} \binom{x}{y} 2^{-x}, \quad 0 \le y \le x, \; 1 \le x \le 6.

Finally

p_Y(y) = \frac{1}{6} \sum_{x = y \vee 1}^{6} \binom{x}{y} 2^{-x}. s
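As a sanity check (an illustration, not from the text), the marginal p_Y really is a distribution, and the impossibility of X = 1 given Y = 2 drops out of the joint distribution:

```python
from fractions import Fraction
from math import comb

def p_joint(x, y):
    """p(x, y) = (1/6) C(x, y) 2^(-x), for 1 <= x <= 6 and 0 <= y <= x."""
    if 1 <= x <= 6 and 0 <= y <= x:
        return Fraction(comb(x, y), 6 * 2 ** x)
    return Fraction(0)

p_Y = [sum(p_joint(x, y) for x in range(1, 7)) for y in range(0, 7)]

assert sum(p_Y) == 1          # a genuine distribution
assert p_joint(1, 2) == 0     # hence P(X = 1 | Y = 2) = 0
```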
Example 6.9.2. Let X and Y be independent geometric random variables, each with parameter p. Suppose Z = X + Y. Find the distribution of X given Z.

Solution. By (3) and independence, for 1 \le x \le z - 1,

P(X = x \mid Z = z) = \frac{P(X = x)\, P(Y = z - x)}{P(Z = z)} = \frac{pq^{x-1} \cdot pq^{z-x-1}}{(z - 1) p^2 q^{z-2}} = \frac{1}{z - 1},

where q = 1 - p. Thus, given Z = z, X is uniform on \{1, 2, \dots, z - 1\}. s

Example 6.9.3: Poisson splitting. Let N be Poisson with parameter \lambda, and suppose that each of N objects is independently of the first type, with probability p, or of the second type, with probability q = 1 - p. Let X and Y be the respective numbers of each type. Then

(6)   P(X = x, Y = y) = P(X = x, Y = y, N = x + y) = P(X = x, Y = y \mid N = x + y)\, P(N = x + y)
    = \frac{(x + y)!}{x!\, y!}\, p^x q^y\, \frac{\lambda^{x+y} e^{-\lambda}}{(x + y)!}
    = \frac{(\lambda p)^x e^{-\lambda p}}{x!} \cdot \frac{(\lambda q)^y e^{-\lambda q}}{y!}, \quad x, y \ge 0.

This factorizes for all x and y, so X and Y are independent, being Poisson with parameters \lambda p and \lambda q respectively. s
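The factorization can be spot-checked numerically. In this sketch (an illustration, not from the text; the parameter values are my choice) the joint probability built by conditioning on N agrees, term by term, with the product of two Poisson pmfs:

```python
from math import comb, exp, factorial

lam, p = 3.0, 0.4   # illustrative parameters
q = 1 - p

def poisson(mu, k):
    return exp(-mu) * mu ** k / factorial(k)

for x in range(15):
    for y in range(15):
        # joint pmf via conditioning on N = x + y
        joint = poisson(lam, x + y) * comb(x + y, x) * p ** x * q ** y
        # ... equals Poisson(lam p) at x times Poisson(lam q) at y
        assert abs(joint - poisson(lam * p, x) * poisson(lam * q, y)) < 1e-12
```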
Next we return to the probability p(x|y) defined in (3), and stress the point that the conditional distribution of X given Y is indeed a distribution. That is to say,

(7)   p(x|y) \ge 0

and

(8)   \sum_x p(x|y) = 1.

The first of these is obvious; to see the second, write

\sum_x p(x|y) = \sum_x \frac{p(x, y)}{p_Y(y)} = \frac{p_Y(y)}{p_Y(y)} = 1.
Even more importantly, since p(x|y) is itself a distribution it has an expectation, the conditional expectation of X given Y = y,

E(X \mid Y = y) = \sum_x x\, p(x|y),

and then, by (4),

\sum_y E(X \mid Y = y)\, p_Y(y) = \sum_x x \sum_y p(x|y)\, p_Y(y) = \sum_x x\, p_X(x) = EX. h

We give two simple illustrations of this. Further interesting examples follow later. We note that there is a conditional law of the unconscious statistician, that is,

E(g(X) \mid Y = y) = \sum_x g(x)\, p(x|y).

Furthermore,

E\, g(X, Y) = \sum_y E(g(X, Y) \mid Y = y)\, p_Y(y).
Example 6.9.5: dice. You roll two dice; let U be the minimum and V the maximum of the two numbers shown. Then, as we know,

p(u, v) = \begin{cases} \frac{2}{36}, & 1 \le u < v \le 6 \\ \frac{1}{36}, & u = v. \end{cases}

Hence, as usual,

p_U(u) = \sum_v p(u, v) = \frac{1}{36}(13 - 2u), \quad 1 \le u \le 6,

and

p_V(v) = \sum_u p(u, v) = \frac{1}{36}(2v - 1), \quad 1 \le v \le 6.

Thus, for example,

p(u|v) = \begin{cases} \dfrac{2}{2v - 1}, & u < v \\[1ex] \dfrac{1}{2v - 1}, & u = v. \end{cases}

Therefore

E(U \mid V = v) = \sum_{u=1}^{v} u\, p(u|v) = \frac{v^2}{2v - 1},

and of course

E(U) = \sum_{v=1}^{6} E(U \mid V = v)\, p_V(v) = \sum_{v=1}^{6} \frac{1}{36} v^2 = \frac{91}{36},

which we obtained in a more tedious way earlier. s
Example 6.9.6: random sum. Suppose that we can regard insurance claims as independent random variables X_1, X_2, \dots, having a common distribution p(x). Suppose the number of claims next year is N, where N is independent of the X_i. Find the expected total of next year's claims.

Solution. The total is S_N = X_1 + \dots + X_N. Conditioning on N,

E(S_N) = \sum_n E(S_N \mid N = n)\, p_N(n) = \sum_n n\, EX_1\, p_N(n) = EN\, EX_1. s
We conclude this section with a remarkable extension of the idea of a `random sum', which we looked at in the above example. In that case the number of terms N in the sum S_N was independent of the summands X_1, X_2, \dots. However, this is often not the case. For example, suppose you are a trader (or gambler), who makes a profit of X_1, X_2, \dots on a sequence of deals until your retirement after the Nth deal. This index N is a random variable, and it can depend only on your previous deals. That is to say, you may retire because X_1 + X_2 + \dots + X_N > \$10^9, and you decide to take up golf; or you may retire because S_N < -\$10^9, and you are ruined (or in gaol for fraud). You cannot choose to retire on the basis of future deals. Either way, the event \{N > k\} that you continue trading after the kth deal depends only on X_1, \dots, X_k. Then the following amazing result is true.
Example 6.9.7: Wald's equation. If the X_i are independent and identically distributed with finite mean, N is integer valued with EN < \infty, and \{N > k\} is independent of X_{k+1}, X_{k+2}, \dots, then

(12)   E \sum_{r=1}^{N} X_r = EN\, EX_1.

To prove this, let I_{k-1} be the indicator of the event \{N > k - 1\}. Then I_{k-1} is independent of X_k, and we can write

E \sum_{r=1}^{N} X_r = E \sum_{k=1}^{\infty} X_k I_{k-1}, \quad \text{because } I_{k-1} = 0 \text{ for } N < k,

= \sum_{k=1}^{\infty} E(X_k I_{k-1}), \quad \text{because } E|X_k| < \infty,

= \sum_{k=1}^{\infty} EX_k\, EI_{k-1}, \quad \text{by independence},

= EX_1 \sum_{k=1}^{\infty} P(N > k - 1) = EX_1\, EN. s
The principal application of this is to gambling; in a casino it is always the case that EX_k < 0. It follows that, no matter what system you play by, when you stop you have ES_N < 0. You must expect to lose.
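Wald's equation can be illustrated with the simplest possible stopping rule, not taken from the text: stop at the first success in Bernoulli(p) trials, with X_r the indicator of success on trial r. Then \sum_{r \le N} X_r = 1 always, so (12) predicts EN \cdot EX_1 = (1/p) \cdot p = 1:

```python
def wald_check(p, nmax=5000):
    """N = number of Bernoulli(p) trials up to and including the first
    success, X_r = indicator of success on trial r, so sum_{r<=N} X_r = 1.
    Wald's equation then reads 1 = EN * EX_1 = (1/p) * p."""
    EN = sum(n * (1 - p) ** (n - 1) * p for n in range(1, nmax))  # truncated series
    EX1 = p
    return EN * EX1

assert abs(wald_check(0.3) - 1.0) < 1e-9
assert abs(wald_check(0.8) - 1.0) < 1e-9
```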
4. Gambler's ruin. Suppose you are a gambler playing a sequence of fair games for unit stakes, with initial fortune k. Let N be the first time at which your fortune is either 0 or n, 0 \le k \le n; at this time you stop playing. Use Wald's equation to show that the probability that you stop with a fortune of size n is k/n.

5. Derangements and matching. There are n coats belonging to n people, who make an attempt to leave by choosing a coat at random. Those who have their own coat can leave; the rest hang the coats up at random, and then make another attempt to leave by choosing a coat at random. Let N be the number of attempts until everyone leaves. Use the method of proving Wald's equation to show that EN = n. (Hint: recall that the expected number of matches in a derangement of n objects is 1.)
It is natural next to consider the case when X and Y are jointly continuous random variables with density f(x, y). Of course we cannot use the elementary arguments that give (1), because P(Y = y) = 0 for all y. Nevertheless, as we shall show, slightly more complicated reasoning will supply the following very appealing definition and results, paralleling (1)-(5).

Definition. Let X and Y have density f(x, y). Then for f_Y(y) > 0 the conditional density of X given Y = y is defined by

(6)   f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)}. n

Just as in the discrete case, we can recover the unconditional density by integrating; writing f(x|y) for f_{X|Y}(x|y),

(7)   f_X(x) = \int_{-\infty}^{\infty} f(x|y)\, f_Y(y) \, dy.

Note crucially that f(x|y) is indeed a density, as it is non-negative and

\int_{-\infty}^{\infty} f(x|y) \, dx = \frac{1}{f_Y(y)} \int_{-\infty}^{\infty} f(x, y) \, dx = 1.

Therefore it has the properties of a density, including the key rule

(8)   P(X \in C \mid Y = y) = \int_{x \in C} f_{X|Y}(x|y) \, dx.
Example 6.10.1. Let X and Y be uniform on the triangle 0 \le x \le y \le 1, so that f(x, y) = 2 for 0 \le x \le y \le 1. First we can easily calculate the marginals:

f_X(x) = \int_x^1 2 \, dy = 2(1 - x), \qquad f_Y(y) = \int_0^y 2 \, dx = 2y.

Hence, by (6), we have the conditional densities

(11)   f_{X|Y}(x|y) = \frac{f}{f_Y} = \frac{1}{y}, \quad 0 \le x \le y,

(12)   f_{Y|X}(y|x) = \frac{f}{f_X} = \frac{1}{1 - x}, \quad x \le y \le 1.

Of course this just confirms what intuition and inspection of figure 6.2 would tell us: given Y = y, X is clearly uniform on [0, y]; and, given X = x, Y is clearly uniform on [x, 1].
Figure 6.2. The point (X, Y) is picked uniformly at random in the triangle \{(0, 0), (0, 1), (1, 1)\}.
Example 6.10.2. Let the random variables U and V have joint density

f(u, v) = e^{-v}, \quad 0 < u < v < \infty.

Find the marginals f_U(u), f_V(v), and the conditional densities f_{U|V}(u|v) and f_{V|U}(v|u). What is the density of Y = V - U?

Solution. The marginals are

f_U(u) = \int_u^{\infty} e^{-v} \, dv = e^{-u}, \quad u > 0, \qquad f_V(v) = \int_0^{v} e^{-v} \, du = v e^{-v}, \quad v > 0,

so by (6) the conditional densities are

(16)   f_{U|V}(u|v) = \frac{e^{-v}}{v e^{-v}} = \frac{1}{v}, \quad 0 < u < v,

so that, given V = v, U is uniform on (0, v); and

f_{V|U}(v|u) = \frac{e^{-v}}{e^{-u}} = e^{u - v}, \quad u < v < \infty.

Now

(17)   P(V - U \le y \mid U = u) = \int_u^{u + y} f_{V|U}(v|u) \, dv = e^{u}(e^{-u} - e^{-(y + u)}) = 1 - e^{-y}.

Therefore

P(V - U \le y) = \int_0^{\infty} (1 - e^{-y}) e^{-u} \, du = 1 - e^{-y}.

Since this is the distribution function of an exponential random variable, it follows that Y = V - U has density e^{-y}. Since (17) does not depend on u, it is also true that U and V - U are independent. s
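The joint independence of U and V − U can also be seen numerically: P(U ≤ a, V − U ≤ b) factorizes as a product of two exponential distribution functions. The following sketch (an illustration, not from the text) integrates the joint density over the relevant region and checks the factorization:

```python
from math import exp

def joint_prob(a, b, steps=2000):
    """P(U <= a, V - U <= b) for f(u, v) = e^{-v}, 0 < u < v, computed by
    integrating over the region {0 < u < a, u < v < u + b}."""
    h = a / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        # inner integral of e^{-v} for v in (u, u + b)
        total += (exp(-u) - exp(-(u + b))) * h
    return total

a, b = 1.0, 2.0
product = (1 - exp(-a)) * (1 - exp(-b))   # product of exponential d.f.s
assert abs(joint_prob(a, b) - product) < 1e-6
```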
Remark: link with Poisson process. We have noted above that the times of occurrence of earthquakes, meteorite strikes, and other rare random events are well described by a Poisson process. This has the property that intervals between events are independent exponential random variables. In view of what we have proved above, we may regard U as the time of the first event, and V the time of the second event, in such a process. Equation (16) then has the following interpretation: given that the time of the second event in such a sequence is V, the time of the first event is uniform on (0, V). It was equally likely to have been any time in (0, V)!
Sometimes the use of a conditional density offers a slightly different approach to calculations that we can already do.

Before we conclude this section, we return to supply additional reasons for making the definition (6). Let us consider the event

B = \{y < Y \le y + h\}.

Now by ordinary conditional probability, it follows that the conditional distribution function of X given B is

(18)   P(X \le x \mid B) = \frac{F(x, y + h) - F(x, y)}{F_Y(y + h) - F_Y(y)},

where F(x, y) is the joint distribution function of X and Y, and F_Y(y) is the distribution function of Y.

Now if we let h \downarrow 0 in (18) this supplies an attractive candidate for the conditional distribution function of X, given Y = y. We get

(19)   F_{X|Y}(x|y) = \frac{1}{f_Y(y)} \int_{-\infty}^{x} f(u, y) \, du.

As usual, the derivative of the distribution function (when it exists) is the density. Thus, differentiating (19), we obtain

(20)   f(x \mid Y = y) = f(x, y)/f_Y(y),

in agreement with (6). Likewise the conditional expectation is

E(X \mid Y = y) = \int_{-\infty}^{\infty} x\, f(x|y) \, dx,

when the integral exists. Hence, multiplying by f_Y(y) and integrating over y, we obtain

(21)   E(X) = \int_{-\infty}^{\infty} E(X \mid Y = y)\, f_Y(y) \, dy.
The name is explained by a glance at the original partition rule in section 2.8. We can use this rule to supply a slightly different derivation of the convolution rule for the distribution of sums of continuous random variables.
Example 6.10.4: convolution rule. Let X and Y have density f(x, y), and set Z = X + Y. Then

F_Z(z) = P(Z \le z) = P(X + Y \le z) = \int P(X + Y \le z \mid Y = y)\, f_Y(y) \, dy, \quad \text{by (22)},

= \int P(X + y \le z \mid Y = y)\, f_Y(y) \, dy = \int F_{X|Y}(z - y \mid y)\, f_Y(y) \, dy.

This is the equivalent for continuous random variables of the expression which we derived for discrete random variables.
6.11 APPLICATIONS OF CONDITIONAL EXPECTATION
As we have seen, conditional distributions and conditional expectations can be very
useful. This usefulness is even more marked in more advanced work, so we now introduce
a new and compact notation for conditional expectation.
The essential idea that underlies this is the fact that conditional expectation is a random variable. This sounds like a paradox, but just recall the definitions. First let X and Y be discrete; then given Y = y we have

(1)   E(X \mid Y = y) = \sum_x x\, p(x, y)/p_Y(y).

Likewise, when X and Y are jointly continuous with density f(x, y),

(2)   E(X \mid Y = y) = \int_{-\infty}^{\infty} x\, f(x|y) \, dx.

In both cases (1) and (2), we observe that E(X|Y = y) is a function of y. Let us denote it by \psi(y); that is,

(3)   E(X \mid Y = y) = \psi(y).

But y ranges over all the possible values of Y, and therefore \psi(Y) is a function of Y. And as we noted in section 5.2, a function of a random variable is a random variable! For clarity and consistency we can now write

(4)   \psi(Y) = E(X \mid Y)

on the understanding that \psi(Y) is the function of Y that takes the value E(X|Y = y) when Y = y.

As a random variable, \psi(Y) may have an expectation, of course; and we know from sections 6.9 and 6.10 that this expectation is EX. For convenience, we repeat the argument here: suppose X and Y are discrete; then

EX = \sum_y E(X \mid Y = y)\, p_Y(y) = \sum_y \psi(y)\, p_Y(y) = E\{\psi(Y)\} = E\{E(X \mid Y)\}.

Thus

(5)   EX = E\{E(X \mid Y)\},
and this can often simplify the calculation of covariances and correlations, and other
expectations.
Here are some examples.
Example 6.11.1: potatoes. A sack contains n potatoes. The weight of the rth potato
is a random variable X r (in kilos), where the X r are independent and identically
distributed. The sack of potatoes weighs 100 kilos, and you remove m potatoes at random.
What is the expected weight of your sample of size m?
Solution. We set S_n = \sum_{r=1}^{n} X_r and S_m = \sum_{r=1}^{m} X_r. Then the question asks for E(S_m | S_n), when S_n = 100. Now it is clear (essentially by symmetry) that E(X_j | S_n) = E(X_k | S_n) for all j and k. Summing over 1 \le j \le n gives S_n = E(S_n | S_n) = n\, E(X_i | S_n), so

E(X_i \mid S_n) = \frac{S_n}{n},

and hence

E(S_m \mid S_n) = \sum_{i=1}^{m} E(X_i \mid S_n) = \frac{m}{n} S_n.

Thus the expected weight of your sample is 100m/n kilos. s
Example 6.11.2: thistles. A thistle plant releases X seeds, where X is Poisson with parameter \lambda. Each seed independently germinates with probability p; the total crop is Y thistle seedlings. Find cov(X, Y) and \rho(X, Y).

Solution. By (5),

E(XY) = E\{E(XY \mid X)\} = E\{X\, E(Y \mid X)\}.

The total crop number Y, conditional on X, is binomial with parameters X and p. Hence

E(Y \mid X) = Xp

and

(6)   E(Y^2 \mid X) = Xp(1 - p) + X^2 p^2.

Thus

E(XY) = E(X^2 p) = (\lambda + \lambda^2) p,

giving

cov(X, Y) = (\lambda + \lambda^2) p - EX\, EY = \lambda p + \lambda^2 p - \lambda \cdot \lambda p = \lambda p.

Finally, using (6),

var Y = E\{E(Y^2 \mid X)\} - (EY)^2 = \lambda p(1 - p) + (\lambda + \lambda^2) p^2 - (\lambda p)^2 = \lambda p.

Hence

\rho(X, Y) = \frac{cov(X, Y)}{(var X\, var Y)^{1/2}} = \frac{\lambda p}{(\lambda \cdot \lambda p)^{1/2}} = \sqrt{p}. s
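The striking conclusion that the correlation is √p, independent of λ, is easy to confirm by brute force: build the joint distribution of (X, Y) numerically and compute the moments. In this sketch (an illustration, not from the text; the parameter values are my choice) the series are truncated where the Poisson mass is negligible:

```python
from math import exp, factorial, comb, sqrt

lam, p = 2.5, 0.36   # illustrative parameters
NMAX = 60            # Poisson(2.5) mass beyond 60 is negligible

def poisson(mu, k):
    return exp(-mu) * mu ** k / factorial(k)

EX = EY = EXY = EX2 = EY2 = 0.0
for x in range(NMAX):
    px = poisson(lam, x)
    for y in range(x + 1):
        # joint pmf: X ~ Poisson(lam), and Y | X = x ~ Binomial(x, p)
        pxy = px * comb(x, y) * p ** y * (1 - p) ** (x - y)
        EX += x * pxy
        EY += y * pxy
        EXY += x * y * pxy
        EX2 += x * x * pxy
        EY2 += y * y * pxy

cov = EXY - EX * EY
rho = cov / sqrt((EX2 - EX ** 2) * (EY2 - EY ** 2))
assert abs(cov - lam * p) < 1e-9    # cov(X, Y) = lambda p
assert abs(rho - sqrt(p)) < 1e-9    # rho(X, Y) = sqrt(p)
```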
4. Waldegrave's problem again. Suppose a coin shows heads with probability p, and you flip it repeatedly. Let X_n be the number of flips until it first shows a run of n consecutive heads. Show that

E(X_n \mid X_{n-1}) = X_{n-1} + 1 + q\, EX_n.

Deduce that

EX_n = p^{-1} + p^{-1} EX_{n-1} = \sum_{k=1}^{n} p^{-k}.

Hence derive the result of example 5.8.12.
5. Let X and Y have the joint density

f(x, y) = c\, x(y - x) e^{-y}, \quad 0 \le x \le y < \infty.

(a) Find c.
(b) Show that

f(x|y) = 6x(y - x) y^{-3}, \quad 0 \le x \le y,
f(y|x) = (y - x) \exp\{-(y - x)\}, \quad x \le y < \infty.

(c) Deduce that E(X|Y) = \tfrac12 Y and E(Y|X) = X + 2.
6.12 BIVARIATE NORMAL DENSITY
We know well by now that if X and Y are independent N(0, 1) random variables, then they have joint density

(1)   \phi(x, y) = \frac{1}{2\pi} \exp\{-\tfrac12 (x^2 + y^2)\}.

Very often it is necessary to consider random variables that are separately normal but not independent. With this in mind, define

(2)   U = \sigma X,
(3)   V = \sigma \rho X + \sigma (1 - \rho^2)^{1/2} Y,

where \sigma > 0 and |\rho| \le 1. Then, from what we know already about sums of independent normal random variables, we find that U is N(0, \sigma^2) and V is

N(0, \sigma^2 \rho^2 + \sigma^2 \{(1 - \rho^2)^{1/2}\}^2) = N(0, \sigma^2).

Thus U and V are also separately normal. But what about the joint distribution of U and V? They are clearly not independent! Proceeding in the usual way, we calculate

F(u, v) = P(U \le u, V \le v).
Definition. If log X and log Y jointly have a bivariate normal density, then X and Y have a bivariate log normal density. n

Example 6.12.1. Let W and H have a bivariate log normal density, and let Z = WH^2. Show that Z is also log normal.

Solution. First, remark that log Z = log W + 2 log H. By assumption, log W and log H have a bivariate normal density, centred at (\mu, \nu) say, with parameters \sigma, \tau, and \rho. Hence log Z, being a linear combination of jointly normal random variables, has a normal density, with mean \mu + 2\nu and variance

\sigma^2 + 4\rho\sigma\tau + 4\tau^2.

Thus Z is log normal. (See problem 24.) s
Finally we note the important point that the joint density f(u, v) in (4) factorizes as the product of separate functions of u and v if and only if \rho = 0. This tells us that U and V are independent if and only if \rho = 0. That is, jointly normal random variables are independent if and only if they are uncorrelated, unlike most other bivariate distributions.
the spread (or variance) of the heights of successive generations to increase. But it does
not.
Galton resolved this dilemma by finding the distribution of the sons' heights conditional on that of their parents. He found empirically that this was normal, with a mean intermediate between the population mean and their parents' height. He called this phenomenon regression to the mean, and described it mathematically by assuming that the heights of fathers and sons were jointly binormal about the population mean. Then the conditional densities behaved exactly in accordance with the observations. It would be hard to overestimate the importance of this brilliant analysis.
One curious and interesting consequence is the following.
Example 6.12.2: doctor's paradox. Suppose your doctor measures your blood pressure. If its value X is high (where X = 0 is average), you are recalled for a further measurement, giving a second value Y. On the reasonable assumption that X and Y have approximately a standard bivariate normal distribution, the conditional density of Y given X > 0 is seen from (6) to have mean \rho X, which is less than X. It seems that merely to revisit your doctor makes you better, whether or not you are treated for your complaint. This result may partly explain the well-known placebo effect. s
(b) var(X \mid Y = y) = \sigma^2 (1 - \rho^2),

(c) E(X \mid X + Y = v) = \frac{\sigma^2 + \rho\sigma\tau}{\sigma^2 + 2\rho\sigma\tau + \tau^2}\, v,

(d) var(X \mid X + Y = v) = \frac{\sigma^2 \tau^2 (1 - \rho^2)}{\sigma^2 + 2\rho\sigma\tau + \tau^2}.

(Hint: use exercise 1; no integrations are required.)
6.13 CHANGE-OF-VARIABLES TECHNIQUE; ORDER STATISTICS
At several places in earlier sections it has been necessary or desirable to find the joint distribution of a pair of random variables U and V, where U and V are defined as functions of X and Y,

(1)   U = u(X, Y), \quad V = v(X, Y).

In many cases of interest X and Y have joint density f(x, y); the question is, what is the joint density f_{U,V}(u, v)?

We have always succeeded in answering this, because the transformations were mostly linear (or bilinear), which simplifies things. (The most recent example was in the previous section, when we considered U and V as linear combinations of the normal random variables X and Y.) Otherwise, the transformation was to polars. Not all transformations are linear or polar, and we therefore summarize a general technique here. We consider two dimensions for simplicity.

The proofs are not short, and we omit them, but it is worth remarking that in general it is intuitively clear what we are doing. The point about f(x, y) is that f(x, y)\, \delta x\, \delta y is roughly the probability that (X, Y) lies in the small rectangle

R = (x, x + \delta x) \times (y, y + \delta y).

The joint density of U and V has the same property, so we merely have to rewrite f(x, y)\, \delta x\, \delta y in terms of u(x, y) and v(x, y). The first bit is easy, because

f(x, y) = f(x(u, v), y(u, v)).

The problem arises in finding out what the transformation does to the rectangle R. We have seen one special case: in polars, when x = r \cos\theta and y = r \sin\theta, we replace \delta x\, \delta y by r\, \delta r\, \delta\theta. In general, the answer is given by the following.
Change of variables. Let S and T be sets in the plane. Suppose that u = u(x, y) and v = v(x, y) define a one-one function from S to T, with unique inverses x = x(u, v) and y = y(u, v) from T to S. Define the determinant

(2)   J(u, v) = \begin{vmatrix} \partial x/\partial u & \partial y/\partial u \\ \partial x/\partial v & \partial y/\partial v \end{vmatrix} = \frac{\partial x}{\partial u}\frac{\partial y}{\partial v} - \frac{\partial x}{\partial v}\frac{\partial y}{\partial u},

where all the derivatives are continuous in T. Then the joint density of U = u(X, Y) and V = v(X, Y) is given by

(3)   f_{U,V}(u, v) = f_{X,Y}(x(u, v), y(u, v))\, |J(u, v)|.

Informally we see that the rectangle R, with area \delta x\, \delta y, has become a different shape, with area |J(u, v)|\, \delta u\, \delta v.
Example 6.13.1: bivariate normal. From (2) and (3) in section 6.12 we see that

x(u, v) = \frac{u}{\sigma} \quad \text{and} \quad y(u, v) = \frac{v - \rho u}{\sigma (1 - \rho^2)^{1/2}}.

Hence

|J| = \begin{vmatrix} 1/\sigma & -\rho/\{\sigma(1 - \rho^2)^{1/2}\} \\ 0 & 1/\{\sigma(1 - \rho^2)^{1/2}\} \end{vmatrix} = \frac{1}{\sigma^2 (1 - \rho^2)^{1/2}}

and

f(u, v) = \frac{1}{2\pi \sigma^2 (1 - \rho^2)^{1/2}} \exp\left\{-\frac{1}{2}\left(\frac{u^2}{\sigma^2} + \left(\frac{v - \rho u}{\sigma (1 - \rho^2)^{1/2}}\right)^2\right)\right\},

just as before. s
5. Let X and Y be independent exponentially distributed random variables with parameters \lambda and \mu respectively. Find the density of Z = X/(X + Y).
6.14 REVIEW
In this chapter we have considered jointly distributed random variables. Such variables X and Y have a distribution function

F(x, y) = P(X \le x, Y \le y),

where

P(a < X \le b, c < Y \le d) = F(b, d) - F(a, d) - F(b, c) + F(a, c).

Random variables (X, Y) when discrete have a distribution p(x, y), and when continuous have a density f(x, y) such that

f(x, y) = \frac{\partial^2 F}{\partial x \partial y}.

Marginals. In the discrete case

p_X(x) = \sum_y p(x, y), \qquad p_Y(y) = \sum_x p(x, y);

in the continuous case

f_X(x) = \int f(x, y) \, dy, \qquad f_Y(y) = \int f(x, y) \, dx.

Functions. In the discrete case

P(g(X, Y) = z) = \sum_{g = z} p(x, y);

in the continuous case

P(g(X, Y) \le z) = \iint_{g \le z} f(x, y) \, dx \, dy.

Independence. Random variables X and Y are independent if and only if, for all x and y,

F(x, y) = F_X(x)\, F_Y(y).
6.15 PROBLEMS
1. You flip a fair coin repeatedly; let X be the number of flips required to obtain n heads.
(a) Show that EX = 2n.
(b) Show also that P(X < 2n) = \tfrac12.
(Elaborate calculations are not required.)
(c) Atropos flips a fair coin n + 1 times; Belladonna flips a fair coin n times. The one with the larger number of heads wins. Show that P(A wins) = \tfrac12.
2. Each packet of Acme Gunk is equally likely to contain one of three different types of coupon. Let the number of packets required to complete your set of the three different types be X. Find the distribution of X.
3. Your coin shows heads with probability p. You flip it repeatedly; let X be the number of flips until the first head, and Y the number until the first tail. Find E(min\{X, Y\}) and E|X - Y|.
4. The random variable X is uniform on [0, 1] and, conditional on X = x, Y is uniform on (0, x). What is the joint density of X and Y? Find
(a) f_Y(y), (b) f_{X|Y}(x|y), (c) E(X|Y).
5. Rod and Fred play the best of five sets at tennis. Either Fred wins a set, with probability 1 - r, or Rod wins it, with probability r, independently of other sets. Let X be the number of sets Fred wins and Y the number of sets Rod wins. Find the joint distribution of X and Y, and calculate cov(X, Y) when r = \tfrac12.
6. You roll n dice obtaining X sixes. The dice showing a six are rolled again yielding Y sixes.
Find the joint distribution of X and Y . What is E(X jY )? What is var(X jY )?
7. You make n robots. Any robot is independently faulty with probability φ. You test all the robots; if any robot is faulty your test will detect the fault with probability δ, independently of the other tests and robots. Let X be the number of faulty robots, and Y the number detected as faulty. Show that, given Y, the expected number of faulty robots is
E(X|Y) = {nφ(1 − δ) + (1 − φ)Y}/(1 − φδ).
8. Two impatient, but also unpunctual, people arrange to meet at noon. Art arrives at X hours
after noon and Bart at Y hours after noon, where X and Y are independently and uniformly
distributed on (0, 1). Each will wait at most 10 minutes before leaving, and neither will wait
after 1.00 p.m. What is the probability that they do in fact meet? What is the probability that
they meet given that neither has arrived by 12.30?
9. An urn contains three tickets bearing the numbers a, b, and c. Two are taken at random without replacement; let their numbers be X and Y. Write down the joint distribution of X and Y. Show that cov(X, Y) = (1/9)(ab + bc + ca − a² − b² − c²).
10. You roll two dice, showing X and Y. Let U be the minimum of X and Y. Write down the joint distribution of U and Y, and find cov(U, X).
11. Runs revisited. A coin shows heads with probability p, or tails with probability q. You flip it repeatedly; let X be the length of the opening run until a fresh face appears, and Y the length of the second run. Find the joint distribution of X and Y, and show that
cov(X, Y) = −(p − q)²/(pq).
12. Let X and Y be independent standard normal random variables. Let Z = X²/(X² + Y²). Show that EZ = 1/2 and var Z = 1/8.
13. A casino offers the following game. There are three fair coins on the table, of values 5, 10, and 20 respectively. You can nominate one coin; if its value is a, then your entry fee is a. Denote the values of the remaining coins by b and c.
Now the coins are flipped; let X be the indicator of the event that yours shows a head, and Y and Z the indicators of heads for the other two. If aX > bY + cZ then you win all three coins; if aX < bY + cZ then you get nothing; if aX = bY + cZ then your stake is returned. Which coin would you nominate?
14. Suppose that X_1, …, X_n are independent and that each has a distribution which is symmetric about 0. Let S_n = Σ_{r=1}^n X_r.
(a) Show that the distribution of S_n is also symmetric about zero.
(b) Is this still necessarily true if the X_r are not independent?
15. Let X and Y be independent standard normal random variables. Show that
E(min{|X|, |Y|}) = √2(√2 − 1)/√π.
16. Let U and V have joint density f(u, v), and set X = UV. Show that
P(UV < x) = ∫_0^∞ ∫_{−∞}^{x/v} f(u, v) du dv + ∫_{−∞}^0 ∫_{x/v}^∞ f(u, v) du dv.
17. Buffon's needle. The floorboards in a large hall are planks of width 1. A needle of length a ≤ 1 is dropped at random onto the floor. Let X be the distance from the centre of the needle to the nearest joint between the boards, and Θ the angle which the needle makes with the joint. Argue that X and Θ have joint density
f(x, θ) = 4/π, 0 ≤ x ≤ 1/2, 0 ≤ θ ≤ π/2.
Deduce that the probability that the needle lies across a joint is 2a/π.
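A short simulation makes the 2a/π answer tangible. The sketch below (plain Python; the function name is mine) samples (X, Θ) from the density above and counts crossings, which occur exactly when X ≤ (a/2) sin Θ.

```python
import math
import random

def buffon_cross_probability(a, trials=200_000, seed=1):
    """Estimate P(needle of length a <= 1 crosses a joint) by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0.0, 0.5)              # distance from centre to nearest joint
        theta = rng.uniform(0.0, math.pi / 2)  # angle with the joint
        if x <= (a / 2) * math.sin(theta):     # needle reaches across the joint
            hits += 1
    return hits / trials

a = 0.8
est = buffon_cross_probability(a)
print(est, 2 * a / math.pi)   # the estimate should be close to 2a/pi
```

Historically this was used the other way round: counting crossings gives a (slow) estimate of π.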
18. An urn contains j jet and k khaki balls. They are removed at random until all the balls
remaining are of the same colour. Find the expected number left in the urn.
19. Show that cov(X, Y + Z) = cov(X, Y) + cov(X, Z). When Z and Y are uncorrelated show that
ρ(X, Y + Z) = aρ(X, Y) + bρ(X, Z),
where
a² + b² = 1.
Suppose Y_1, …, Y_n are uncorrelated. Show that
ρ(X, Σ_{r=1}^n Y_r) = Σ_{r=1}^n a_r ρ(X, Y_r),
where
Σ_{r=1}^n a_r² = 1.
Can you see a link with Pythagoras' theorem?
20. You are the finance director of a company that operates an extremely volatile business. You assume that the quarterly profits in any year, X_1, X_2, X_3, X_4, are independent and identically distributed continuous random variables. What is the probability that X_1 > X_2 > X_3 > X_4? Would you be very concerned if this happened? Would you be thrilled if X_1 > X_2 < X_3 < X_4? Would you expect a bonus if profits increased monotonically for six successive quarters?
21. Let X and Y be independent and uniform on {1, 2, …, n}, and let U = min{X, Y} and V = max{X, Y}. Show that
EU = (n + 1)(2n + 1)/(6n), EV = (4n − 1)(n + 1)/(6n),
cov(U, V) = {(n² − 1)/(6n)}², var U = var V = (n² − 1)(2n² + 1)/(36n²),
and that
ρ(U, V) = (n² − 1)/(2n² + 1) → 1/2 as n → ∞.
22. Let X and Y have joint density f(x, y) = 1, 0 ≤ x, y ≤ 1. Let U = X ∧ Y and V = X ∨ Y. Show that ρ(U, V) = 1/2. Explain the connexion with the preceding problem.
23. There are 10 people in a circle, and each of them flips a fair coin. Let X be the number whose coin shows the same as both neighbours.
(a) Find P(X = 10) and P(X = 9).
(b) Show that EX = 5/2 and var X = 25/8.
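Part (b) can be checked by brute force, since there are only 2¹⁰ equally likely configurations of coins around the circle; a minimal sketch with exact arithmetic:

```python
from itertools import product
from fractions import Fraction

# Enumerate all 2**10 coin configurations around the circle, count for each
# how many people match both neighbours, and compute EX and var X exactly.
n = 10
total = Fraction(0)
total_sq = Fraction(0)
for coins in product((0, 1), repeat=n):
    x = sum(coins[i] == coins[(i - 1) % n] == coins[(i + 1) % n] for i in range(n))
    total += x
    total_sq += x * x
mean = total / 2**n
var = total_sq / 2**n - mean**2
print(mean, var)   # 5/2 and 25/8
```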
24. Let X and Y be random variables such that Y = e^X. If X has the N(μ, σ²) normal distribution, find the density function of Y. This is known as the lognormal density, with parameters μ and σ². Find the mean and variance of Y.
25. You are selling your house. You receive a succession of offers X_0, X_1, X_2, …, where we assume that the X_i are independent identically distributed random variables. Let N be the number of offers necessary until you get an offer better than the first one, X_0. That is,
N = min{n: X_n > X_0}, n ≥ 1.
Show that
P(N > n) = P(X_0 ≥ X_1, X_0 ≥ X_2, …, X_0 ≥ X_n) ≥ 1/(n + 1).
Deduce that EN = ∞. Why is this a poor model? (Hint: Let A_k be the event that no offer is larger than the kth. Then P(N > n) = P(A_0) and 1 = P(⋃_{k=0}^n A_k); use Boole's inequality.)
26. Let X, Y, and Z be jointly distributed random variables such that P(X ≥ Y) = 1.
(a) Is it true that P(X ≥ x) ≥ P(Y ≥ x) for all x?
(b) Is it true that P(X ≥ Z) ≥ P(Y ≥ Z)?
(c) Show that EX ≥ EY.
27. Let X and Y have the joint distribution
p(x, y) = c{(x + y − 1)(x + y)(x + y + 1)}^{−1}, x, y ≥ 1.
Find the value of c, and EX.
28. You plant n seeds; each germinates independently with probability γ. You transplant the resulting X seedlings; each succumbs independently to wilt with probability δ. Let Y be the number of plants you raise. Find the distributions of X and Y, and calculate cov(X, Y).
29. Are you normal? When a medical test measuring some biophysical quantity is administered to a population, the outcomes are usually approximately normally distributed, N(μ, σ²). If your measurement is further than 2σ from μ, you are diagnosed as abnormal and a candidate for medical treatment. Suppose you undergo a sequence of such tests with independent results.
(a) What is the probability that any given test indicates that you are abnormal?
(b) After how many tests will you expect to have at least one abnormal result?
(c) After how many tests will the probability of your being regarded as abnormal exceed 1/2?
30. Recall the gambler's ruin problem: at each flip of a fair coin you win $1 from the casino if it is heads, or lose $1 if it is tails.
(a) Suppose initially you have $k, and the casino has $(n − k). Let X_k be the number of flips until you or the casino has nothing. If m_k = EX_k, show that for 0 < k < n
2m_k = m_{k+1} + m_{k−1} + 2.
Deduce that m_k = k(n − k).
(b) Suppose initially that you have $Y, where Y is binomial B(n, p), and the casino has $(n − Y), and let the duration of the game be X. Find EX and cov(X, Y). Show that X and Y are uncorrelated if p = 1/2.
(c) Suppose that initially you have $Y and the casino has $Z, where Y and Z are independent. Find EX and cov(X, Y), where X is the duration of the game.
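The recurrence in part (a) is easy to solve exactly by machine, which confirms m_k = k(n − k). In this sketch (names are mine) each m_k is written as a linear function of the unknown m_1, propagated through the recurrence, and pinned down by the boundary condition m_n = 0.

```python
from fractions import Fraction

def ruin_durations(n):
    """Solve 2 m_k = m_{k+1} + m_{k-1} + 2 with m_0 = m_n = 0 exactly.
    Write m_k = a_k + b_k * t where t = m_1, propagate the recurrence
    m_{k+1} = 2 m_k - m_{k-1} - 2, then choose t so that m_n = 0."""
    a = [Fraction(0), Fraction(0)]  # constant parts of m_0, m_1
    b = [Fraction(0), Fraction(1)]  # coefficients of the unknown t
    for k in range(1, n):
        a.append(2 * a[k] - a[k - 1] - 2)
        b.append(2 * b[k] - b[k - 1])
    t = -a[n] / b[n]                # enforce the boundary condition m_n = 0
    return [a[k] + b[k] * t for k in range(n + 1)]

n = 10
m = ruin_durations(n)
print(m)   # matches k*(n - k) for every k
```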
31. (a) Let X_1, X_2, …, X_n be independent and uniformly distributed on (0, 1), and define M_n = max{X_1, …, X_n}. Show that as n → ∞,
P(n(1 − M_n) > y) → e^{−y}, y > 0.
(b) Let X_1, X_2, …, X_n be independent and identically distributed on (0, 1), with density f(x), where 0 < f(1) < ∞. Let M_n = max{X_1, …, X_n}. Show that as n → ∞,
P(n(1 − M_n) > y) → exp{−f(1)y}.
306 6 Jointly distributed random variables
36. Let X and Y have zero mean and unit variance, with correlation coefficient ρ. Define U = X and V = Y − ρX. Show that U and V are uncorrelated.
37. Alf, Betty, and Cyril each have a torch; the lifetimes of the single battery in each are independent and exponentially distributed with parameters λ, μ, and ν respectively. Find the probability that they fail in the order: Alf's, Betty's, Cyril's.
38. Let X and Y be independent standard normal random variables, and let Z = X + Y. Find the distribution and density of Z given that X ≥ 0 and Y ≥ 0.
(Hint: No integration is required; use the circular symmetry of the joint density of X and Y.)
Show that E(Z|X > 0, Y > 0) = 2√(2/π).
39. Let X_1, …, X_n be a collection of random variables with the property that the collection {X_1, …, X_r} is independent of X_{r+1} for 1 ≤ r < n. Prove that the X_i are all mutually independent.
40. Let X and Y be independent and let each have the distribution N(0, 1). Let
U = 2XY/(X² + Y²)^{1/2} and V = (X² − Y²)/(X² + Y²)^{1/2}.
Show that U and V are also N(0, 1), and independent.
44. Random stake. You make n wagers, where the probability of winning each wager is p. The stake is the same for each, but is a random variable X chosen before the wagers. Thus your winnings are W = Σ_{r=1}^n XY_r, where
Y_r = +1 with probability p, and Y_r = −1 with probability 1 − p.
Find cov(X, W) and show that X and W are uncorrelated if the game is fair. Are they independent?
45. For any random variable X with finite fourth moment we define the kurtosis of X by
kur X = E{(X − μ)⁴}/[E{(X − μ)²}]² = μ_4/σ⁴.
(a) If X is N(μ, σ²), show that kur X = 3.
(b) If X is exponential with parameter λ, show that kur X = 9.
(c) If X_1, …, X_n are independent and identically distributed and S_n = Σ_{r=1}^n X_r, show that kur S_n = 3 + (kur X_1 − 3)/n.
(d) If X is Poisson with parameter λ, show that kur X = 3 + 1/λ.
(e) If X is binomial B(n, p), show that kur X = 3 + (1 − 6pq)/(npq).
Remark. Kurtosis is an indication of how peaked the distribution of X is above its mean; large values indicate strong peaking.
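Part (e) is easy to sanity-check numerically by computing μ_4/σ⁴ directly from the binomial pmf; a quick sketch (the function name is mine):

```python
import math

def binomial_kurtosis(n, p):
    """Compute mu_4 / sigma^4 for B(n, p) directly from the pmf."""
    q = 1 - p
    pmf = [math.comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    mean = sum(k * w for k, w in enumerate(pmf))
    mu2 = sum((k - mean) ** 2 * w for k, w in enumerate(pmf))
    mu4 = sum((k - mean) ** 4 * w for k, w in enumerate(pmf))
    return mu4 / mu2**2

n, p = 12, 0.3
q = 1 - p
kurt = binomial_kurtosis(n, p)
formula = 3 + (1 - 6 * p * q) / (n * p * q)
print(kurt, formula)   # the two values agree
```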
46. Let X_1, X_2, X_3, … be a sequence of independent identically distributed random variables, each uniform on (0, 1). Let
N = min{n: Σ_{r=1}^n X_r > x}, 0 < x < 1.
48. Plates. Your factory produces rectangular plates of length X and width Y, where X and Y are independent; they have respective means μ_X, μ_Y and respective variances σ²_X, σ²_Y. Find the variance of the perimeter B and the area A of any plate, and show that A and B are not independent.
49. Chi-squared. Let X_1, …, X_n be independent N(0, 1) random variables. Show that the density of Z = Σ_{r=1}^n X_r² is
f(x) = {2^{n/2} Γ(½n)}^{−1} x^{n/2−1} e^{−x/2}, x > 0,
known as the chi-squared density, with n degrees of freedom, χ²(n).
50. Student's t-density. Let X have the χ²(n) density, and let Y be N(0, 1) and independent of X. Show that the density of Z = Y/(X/n)^{1/2} is
Γ(½(n + 1))/{(nπ)^{1/2} Γ(½n)} · (1 + x²/n)^{−(n+1)/2},
known as `Student's t-density', with n degrees of freedom, t(n).
7
Generating functions
7.1 PREVIEW
The most important thing about random variables is that they all have a distribution. Where they have moments, we like to know those as well. However, distributions are often cumbersome to deal with; as an example of this, recall the convolution formulas for the sums of independent random variables in section 6.6. And even simple tasks, such as finding moments, are often wearisome. We may suspect, furthermore, that there are even more tedious computations to come.
Fortunately there are miraculous devices to help us with many of these humdrum tasks; they are called generating functions. In this chapter we define the probability generating function and the moment generating function; then we explore some of their simpler properties and applications. These functions were first used by Euler and de Moivre in the 18th century; Euler used them in number theory and de Moivre actually used them to help with probability distributions. They seem to have hit on the idea independently, a classic illustration of the theory that great minds think alike.
7.2 INTRODUCTION
We introduce the basic idea with two examples. First, recall that the binomial distribution is
(1) p(r) = C(n, r) p^r q^{n−r}, 0 ≤ r ≤ n.
But we know, by the binomial theorem, that these numbers are the coefficients of s^r in the function
(2) G(s) = (q + ps)^n = Σ_{r=0}^n C(n, r) p^r q^{n−r} s^r = Σ_{r=0}^n p(r)s^r.
Therefore the collection of probabilities in (1) is exactly equivalent to the function G(s) in (2), in the sense that if we know G(s) we can find the p(r), and if we know the p(r) we can find G(s). The function G(s) has effectively bundled up the n + 1 probabilities in (1) into a single entity, and they can then be generated from G(s) whenever we so desire.
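This equivalence is concrete enough to compute with: the coefficient list of (q + ps)^n, built by repeatedly multiplying by (q + ps), reproduces the binomial probabilities exactly. A sketch (the function name is mine):

```python
import math

def pgf_coefficients(n, p):
    """Coefficient list of G(s) = (q + p*s)**n, built by repeated
    multiplication by the polynomial (q + p*s)."""
    q = 1 - p
    coeffs = [1.0]
    for _ in range(n):
        nxt = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):   # multiply current polynomial by (q + p*s)
            nxt[i] += c * q
            nxt[i + 1] += c * p
        coeffs = nxt
    return coeffs

n, p = 8, 0.35
from_g = pgf_coefficients(n, p)
direct = [math.comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]
print(max(abs(u - v) for u, v in zip(from_g, direct)))   # essentially zero
```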
Second, we likewise know that the moments of the exponential density are
(3) μ_r = ∫_0^∞ x^r λe^{−λx} dx = r!/λ^r,
after prolonged integration. But we also know, by Taylor's theorem, that these numbers appear in the coefficients of t^r in the expansion of the function
(4) M(t) = Σ_{r=0}^∞ μ_r t^r/r! = Σ_{r=0}^∞ t^r/λ^r = λ/(λ − t), |t| < λ.
Therefore the collection of moments in (3) is exactly equivalent to the function M(t) in (4), in the sense defined above. The function M(t) has effectively bundled up the moments in (3) into a single entity, and the moments μ_r can thus be generated from M(t), whenever we so desire.
With this preliminary, the following definitions are obvious.
Definition. The function
(5) G(s) = Es^X
is the probability generating function of X. Sometimes we denote it by G_X(s), and we may refer to it as the p.g.f. of X. ■
Definition. The function
(6) M(t) = E(e^{tX})
is the moment generating function of X. Sometimes we denote it by M_X(t), and we may refer to it as the m.g.f. of X. ■
Note that if X has a probability generating function, then (5) and (6) yield
(8) M_X(t) = G_X(e^t).
Next we remark that the series in (5) converges for |s| ≤ 1, because Σ_r p(r) = 1. Moments are not so well behaved, which is why we need the extra condition to derive (7). We shall always assume that, for some δ > 0, M(t) does converge for |t| < δ.
The functions G(s) and M(t) have the following essential properties.
First, G(s) determines the probabilities p(r) uniquely; that is, G_X(s) = G_Y(s) if and only if p_Y(r) = p_X(r).
Second, M(t) determines the moments uniquely; that is, if M_X(t) = M_Y(t) (and both exist for |t| < δ), then EX^r = EY^r, and conversely.
However, there is more. In fact G_X(s) also determines the moments of X, and M_X(t) also determines the distribution of X. We emphasize these two results as follows:
Theorem. Let Es^X = G(s), and let G^{(r)}(s) be the rth derivative of G(s). Then
(9) G^{(r)}(1) = E{X(X − 1)⋯(X − r + 1)}.
In particular
(10) G′(1) = G^{(1)}(1) = EX.
Theorem. Let E(e^{tX}) = M(t) < ∞, for |t| < δ, where X is continuous with density f(x). Then
(11) f(x) ∝ ∫ e^{−tx} M(t) dt.
This looks nice enough, but unfortunately the integrand is a function of a complex variable, and the integral is taken around a curve in the Argand plane. So we neither prove (11), nor do we ever evaluate it; it is enough to know it is there. It is called the inversion theorem for the m.g.f. M(t).
To sum up: if we know that
G(s) = Σ_{k=0}^∞ p(k)s^k = Es^X
then P(X = k) = p(k), and
G^{(r)}(1) = E{X(X − 1)⋯(X − r + 1)};
if we know that
M(t) = ∫_{−∞}^∞ e^{tx} f(x) dx = Ee^{tX} < ∞, |t| < δ,
7.3 EXAMPLES OF GENERATING FUNCTIONS
Before we can use generating functions we need to learn to recognize them (very much as
drivers need to learn to recognize road signs before setting off). Here are some popular
varieties.
and
var X = G″(1) + G′(1) − {G′(1)}² = npq.
Example 7.3.2: normal density. If X has the standard normal density φ(x), then its m.g.f. is
√(2π) M(t) = ∫_{−∞}^∞ e^{tx} e^{−x²/2} dx
= ∫_{−∞}^∞ exp{−½(x − t)² + ½t²} dx
= exp(½t²) ∫_{−∞}^∞ e^{−v²/2} dv, setting x = t + v.
Hence
M(t) = e^{t²/2} = Σ_{r=0}^∞ t^{2r}/(r! 2^r)
and so X has moments
(8) μ_{2r} = (2r)!/(r! 2^r). ■
Of course, by the law of the unconscious statistician, we do not need to find the density of g(X) in order to find the m.g.f. of g(X), as the following demonstrates.
Example 7.3.3: squared normal. Let X be a standard normal random variable. Find the m.g.f. of Y = X².
Solution. We seek
M_Y(t) = Ee^{tY} = Ee^{tX²}
= ∫_{−∞}^∞ (2π)^{−1/2} exp(tx² − ½x²) dx
= (1 − 2t)^{−1/2} ∫_{−∞}^∞ (2π)^{−1/2} exp(−½y²) dy,
on setting
y = x(1 − 2t)^{1/2};
we obtain
M_Y(t) = (1 − 2t)^{−1/2}. ■
Hence show that if Z has a normal density with mean μ and variance σ², then
M_Z(t) = exp(μt + ½σ²t²).
7.4 APPLICATIONS OF GENERATING FUNCTIONS
Generating functions are remarkably versatile objects, and can be used to do many things.
But for us, they are mostly used for two purposes: finding limits and working with sums.
I Limits
The following results are crucial.
Example 7.4.1: uniform limit. Let Y_n be uniform on {1, 2, …, n}. Then Y_n/n has moment generating function
M_n(t) = Ee^{tY_n/n} = n^{−1}{e^{t/n} + e^{2t/n} + ⋯ + e^{nt/n}}
= e^{t/n}(e^t − 1)/{n(e^{t/n} − 1)}
= e^{t/n}(e^t − 1)/(t + t²/(2n) + t³/(6n²) + ⋯)
→ (e^t − 1)/t, as n → ∞,
= Ee^{tU},
where U is uniform on [0, 1], by (2) of section 7.3. Hence, as n → ∞,
P(Y_n/n ≤ x) → x = P(U ≤ x), 0 ≤ x ≤ 1. ■
Compare the transparent simplicity of these derivations with the tedious chore of
working directly with the distributions, as we did in chapter 4. Many other limit theorems
can be proved this way, but we have to move on.
Example 7.4.3: Bernoulli trials and the binomial distribution. Let (I_k; 1 ≤ k ≤ n) be a collection of independent Bernoulli trials with
P(I_k = 1) = p = 1 − q.
Then Es^{I_k} = q + ps, and for the sum X = Σ_{k=1}^n I_k we have
Es^X = Π_{k=1}^n Es^{I_k}, by independence,
= (q + ps)^n.
Therefore X is binomial with parameters n and p. We already know this, of course, but compare the brevity and elegance of this line of reasoning with more primitive methods. ■
The above example was especially simple because X and Y had the same parameter p.
It is interesting to compare this with the case when they have different parameters.
The same idea works for continuous random variables if we use the moment generating
function. Here is an illustration.
Example 7.4.6: normal sums. Let X and Y be normal and independent, with distributions N(μ, σ²) and N(ν, τ²) respectively. Find the distribution of Z = X + Y.
3. I have a die in which the three pairs of opposite faces bear the numbers 1, 3, and 5. I roll the die twice and add the scores, then flip a fair coin twice and add the number of heads to the sum of the scores shown by the die. Find the distribution of the total.
Thus X is uniform on {0, 1, …, n}, a result which we showed with very much more effort in chapter 6.
7.5 RANDOM SUMS AND BRANCHING PROCESSES
Generating functions are even more useful in dealing with the sum of a random number
of random variables. This may sound a little arcane, but it is a very commonly arising
problem, as we noted when we proved Wald's equation in chapter 6.
If T = X_1 + ⋯ + X_N, then Es^T = G_N{G_X(s)}. ■
Now we notice that the above argument works just as well for continuous random
variables X 1 , X 2 , . . . , provided that we use the moment generating function. Thus we
have in general
Example 7.5.2. Let each X_r be exponential with parameter λ, and let N be geometric with parameter p. Then
(7) M(t) = λ/(λ − t) and G(s) = ps/(1 − qs).
Therefore
(8) G(M(t)) = pλ/(λ − t − qλ) = λp/(λp − t).
Therefore T = Σ_{r=1}^N X_r has an exponential distribution with parameter λp. (Remember that the sum of a fixed number of exponential random variables has a gamma distribution.) ■
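The composition in (8) can be checked with exact rational arithmetic; this sketch (lam is my name for the rate λ) confirms that G(M(t)) equals the exponential(λp) m.g.f. at several sample points.

```python
from fractions import Fraction

lam, p = Fraction(3), Fraction(1, 4)
q = 1 - p

def M(t):                 # m.g.f. of exponential(lam), valid for t < lam
    return lam / (lam - t)

def G(s):                 # p.g.f. of geometric(p) on {1, 2, ...}
    return p * s / (1 - q * s)

for t in (Fraction(0), Fraction(1, 2), Fraction(-2), Fraction(7, 10)):
    assert G(M(t)) == lam * p / (lam * p - t)   # exponential(lam*p) m.g.f.
print("G(M(t)) equals the exponential(lam*p) m.g.f. at all sample points")
```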
Probably the most famous and entertaining application of these results is to the theory of branching processes. This subject had its origins in the following basic question: why do some families die out, while others survive?
The question has been posed informally ever since people started using family names. But the first person to use generating functions in its solution was H. W. Watson in 1873, in answer to a challenge by Francis Galton. (The challenge appeared in the April 1 issue of the Educational Times.) For this reason it is often known as the Galton–Watson problem, though Watson's analysis was flawed. The correct solution in modern format was eventually produced by J. F. Steffensen in 1930. I. J. Bienaymé had earlier realized what the answer should be, but failed to supply any reasons.
The problem they were all interested in is as follows. A population reproduces itself in generations; the number in the nth generation is Z_n. The rules for reproduction are these.
(i) Each member of the nth generation produces a family (maybe of size zero) in the (n + 1)th generation.
(ii) Family sizes of all individuals are independent and identically distributed random variables, with distribution (p(x); x ≥ 0) and probability generating function G(s).
With these rules, can we describe Z_n in the long run? Let Es^{Z_n} = G_n(s), and assume Z_0 = 1. Then the solution to the problem is based on the following result:
(9) G_n(s) = G(G_{n−1}(s)) = G_{n−1}(G(s)) = G(G(⋯(G(s))⋯)),
where the right-hand side is the n-fold iterate of the function G(·).
The proof of (9) relies on the following observations:
(i) every member of the nth generation has an ancestor in the first generation;
(ii) the rth member of the first generation has Z^{(r)}_{n−1} descendants in the nth generation, where Z^{(r)}_{n−1} has the same distribution as Z_{n−1}. Hence
(10) Z_n = Z^{(1)}_{n−1} + ⋯ + Z^{(Z_1)}_{n−1}.
Now this is a random sum of independent random variables, so
(11) Es^{Z_n} = G_n(s) = G_{Z_1}(G_{n−1}(s)) = G_1(G_{n−1}(s)).
The same argument applied to the (n − 1)th generation shows that
(12) Z_n = Z^{(1)}_1 + Z^{(2)}_1 + ⋯ + Z^{(Z_{n−1})}_1,
which gives
(13) G_n(s) = G_{n−1}(G(s)).
Iterating either (11) or (13) gives (9). Decompositions such as (10) and (12) lie at the heart of many arguments in the theory of branching processes.
Now recall that our interest was motivated by the question, does the family of descendants of the first individual become extinct? We can examine this by considering the events
A_n = {Z_n = 0}.
Obviously if A_n occurs, then the descendants of the first individual have become extinct by the nth generation. Furthermore, setting s = 0 in (9) shows that
Example 7.5.3: geometric branching. Suppose that, in the branching process defined above, each family size has a type of geometric distribution such that
(15) P(Z_1 = k) = q^k p, k ≥ 0; p + q = 1.
Then G(s) = p/(1 − qs), and we can show by induction that the nth iterate of G(·) is
(16) G_n(s) = p{q^n − p^n − qs(q^{n−1} − p^{n−1})}/{q^{n+1} − p^{n+1} − qs(q^n − p^n)}, p ≠ q;
G_n(s) = {n − (n − 1)s}/(n + 1 − ns), p = q.
Now EZ_1 = q/p, and from (16) we have that
G_n(0) = p(q^n − p^n)/(q^{n+1} − p^{n+1}), p ≠ q;
G_n(0) = n/(n + 1), p = q;
→ 1 if q ≤ p, and → p/q if p < q,
as n → ∞. This all agrees with what we said above, of course. ■
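The limits claimed for G_n(0) are easy to watch numerically: iterate η → G(η) = p/(1 − qη) from η = 0 and compare with 1 for q ≤ p and with p/q for q > p. A sketch (names are mine):

```python
def extinction_probability(p, n_iter=5000):
    """Iterate eta -> G(eta) = p/(1 - q*eta) from eta = 0; the iterates are
    G_n(0), the probability of extinction by generation n."""
    q = 1 - p
    eta = 0.0
    for _ in range(n_iter):
        eta = p / (1 - q * eta)
    return eta

for p in (0.7, 0.5, 0.3):
    q = 1 - p
    print(p, extinction_probability(p), 1.0 if q <= p else p / q)
```

Note that at the critical case p = q = 1/2 the convergence is only of order 1/n (since G_n(0) = n/(n + 1)), while away from criticality it is geometrically fast.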
2. Rounding errors. Suppose that you round off 10⁸ numbers to the nearest integer, and then add them to get the total S. Assume that the rounding errors are independent and uniform on [−1/2, 1/2]. What is the probability that S is wrong by (a) more than 3, (b) more than 6?
3. Let X be Poisson with parameter λ, and let Y be gamma with parameters r and 1. Explain why we can say, without any elaborate calculations, that (X − λ)/√λ is approximately normally distributed as λ → ∞, and (Y − r)/√r is approximately normally distributed as r → ∞.
7.7 RANDOM WALKS AND OTHER DIVERSIONS
Generating functions can be applied to many new problems, and also provide new ways of doing old problems. We give a few randomly selected examples here. Many of them rely on a particular application of conditional expectation, that is, the fact that
(1) Es^X = E{E(s^X|Y)}
for any discrete random variables X and Y. Similarly, you may sometimes use
E exp(tX) = E[E{exp(tX)|Y}]
in the continuous case.
Our rst example is extremely famous, and arises in a startling number of applications,
in various disguises. As usual we prefer to use a standard format and nomenclature; the
following is hallowed by tradition.
Example 7.7.1: simple random walk. Starting from the origin, a particle performs a random walk on the integers, with independent and identically distributed jumps X_1, X_2, … such that
(2) P(X_1 = 1) = p and P(X_1 = −1) = q = 1 − p.
Its position after n steps is S_n = Σ_{r=1}^n X_r. Let T_r be the number of jumps until the particle visits r for the first time (r > 0), and let T_0 be the number of jumps until it first revisits the origin. Show that
(3) E(s^{T_r}) = (Es^{T_1})^r, r > 0.
Hence find Es^{T_1}. Use this to deduce that
(4) Es^{T_0} = 1 − (1 − 4pqs²)^{1/2}.
Solution. For any k, let T_{k,k+1} be the number of steps from T_k until T_{k+1}, that is, the number of steps to reach k + 1 after first having arrived at k. Then T_{0,1} = T_1, and the random variables T_{k,k+1} are independent and identically distributed. Furthermore
(5) T_r = T_{0,1} + T_{1,2} + ⋯ + T_{r−1,r}.
Hence
(6) Es^{T_r} = (Es^{T_{0,1}})^r = (Es^{T_1})^r.
Next we observe that
(7) Es^{T_1} = E{E(s^{T_1}|X_1)}.
Now, trivially,
E(s^{T_1}|X_1 = 1) = s,
and not quite so trivially
E(s^{T_1}|X_1 = −1) = E(s^{1+T_{−1,0}+T_{0,1}}) = s(Es^{T_1})², by (6).
Therefore, by conditional expectation (conditioning on X_1),
(8) Es^{T_1} = ps + qs(Es^{T_1})².
Hence Es^{T_1} is a root of the quadratic qsx² − x + ps = 0. Only one of these roots is a probability generating function that converges for |s| ≤ 1, and it is
(9) Es^{T_1} = {1 − (1 − 4pqs²)^{1/2}}/(2qs).
For the last part we note that
Es^{T_0} = E{E(s^{T_0}|X_1)}.
Now E(s^{T_0}|X_1 = 1) = Es^{1+T_{1,0}} = sE(s^{T_{1,0}}) and E(s^{T_0}|X_1 = −1) = sEs^{T_{0,1}}. From their definitions we see that Es^{T_{1,0}} is obtained from Es^{T_{0,1}} by simply interchanging p and q. Then, by conditional expectation again,
(10) Es^{T_0} = psEs^{T_{1,0}} + qsEs^{T_{0,1}} = 1 − (1 − 4pqs²)^{1/2}. ■
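As a quick numerical check of (8) and (9), the closed form for Es^{T_1} should satisfy the quadratic qsx² − x + ps = 0, and at s = 1 it should give P(T_1 < ∞) = min(1, p/q); a sketch (the function name is mine):

```python
import math

def g_T1(s, p):
    """The closed form (9) for Es^{T_1} of the simple random walk."""
    q = 1 - p
    return (1 - math.sqrt(1 - 4 * p * q * s * s)) / (2 * q * s)

for p in (0.3, 0.5, 0.8):
    q = 1 - p
    for s in (0.2, 0.5, 0.9):
        x = g_T1(s, p)
        assert abs(q * s * x * x - x + p * s) < 1e-12   # it satisfies (8)
    # at s = 1 the p.g.f. gives the probability that 1 is ever reached
    assert abs(g_T1(1.0, p) - min(1.0, p / q)) < 1e-12
print("(9) satisfies (8); Es^{T_1} at s = 1 equals min(1, p/q)")
```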
Remark. The name `random walk' was invented by Karl Pearson in 1905, to describe a similar problem in two (or more) dimensions. Following the solution of that problem by Rayleigh, Pearson noted the corollary that `. . . in open country the most likely place to find a drunken man is somewhere near his starting point'. Since then it has also been known as the `drunkard's walk' problem. The solution of Rayleigh's problem of random flights leads to a similar corollary for drunken birds in the open air.
Example 7.7.2: Huygens' problem. Two coins are flipped alternately; they show heads with respective probabilities α and β. Let X be the number of flips up to and including the first head. Find Es^X.
Solution. The sequence must begin with one of the three mutually exclusive outcomes H, TH, or TT. Now
E(s^X|H) = s, E(s^X|TH) = s², E(s^X|TT) = E(s^{2+X′}) = s²Es^X,
where X′ has the same distribution as X. Hence
(11) Es^X = αE(s^X|H) + (1 − α)βE(s^X|TH) + (1 − α)(1 − β)E(s^X|TT)
= αs + (1 − α)βs² + (1 − α)(1 − β)s²Es^X.
Thus
(12) Es^X = {αs + (1 − α)βs²}/{1 − (1 − α)(1 − β)s²}.
From this it is a trivial matter to calculate any desired property of X. For example, the probability that the second coin flipped first shows a head is just the sum of the coefficients of the even powers of s in Es^X, and from (12) this is simply
(1 − α)β/{1 − (1 − α)(1 − β)}. ■
Example 7.7.3: Waldegrave's problem revisited. In our final visit to this problem, we find the generating function Es^N of the number of rounds played in this game. In the usual notation we write N = 1 + X, where X is the number of flips of a coin until it first shows n − 1 consecutive heads. Then by conditional probability and independence,
(13) Es^X = ½E(s^X|T) + ½²E(s^X|HT) + ⋯ + ½^{n−1}E(s^X|H^{n−2}T) + ½^{n−1}E(s^X|H^{n−1})
= (½s + ½²s² + ⋯ + ½^{n−1}s^{n−1})Es^X + (½s)^{n−1}.
Hence
Es^X = (½s)^{n−1}(1 − ½s)/{1 − s + (½s)^n}
and so
(14) Es^N = sEs^X = s^n ½^{n−1}(1 − ½s)/{1 − s + (½s)^n}. ■
Example 7.7.4: tail generating functions. We have seen above that it is often useful to know P(X > n) for integer-valued random variables. Let
T(s) = Σ_{n=0}^∞ P(X > n)s^n.
Show that if Es^X = G(s), then
(15) T(s) = {1 − G(s)}/(1 − s).
Solution. We have
T(s) = Σ_{n=0}^∞ E{I(X > n)}s^n
= E Σ_{n=0}^{X−1} s^n, because I is an indicator,
= E{(1 − s^X)/(1 − s)}, summing the geometric series,
= {1 − G(s)}/(1 − s). ■
Example 7.7.5: coupons. Suppose any packet of Acme Deathweed is equally likely to contain any one of four different types of coupon. If the number of packets you need to collect the set is T, find Es^T and P(T = k).
Solution. We know that the number of packets bought between the consecutive appearances of a new type is geometric, with parameters 1, 3/4, 1/2, 1/4 respectively for each. Hence, using the geometric p.g.f.,
(16) Es^T = s · (3s/4)/(1 − s/4) · (s/2)/(1 − s/2) · (s/4)/(1 − 3s/4)
= (3/32)s⁴{(1/2)/(1 − s/4) − 4/(1 − s/2) + (9/2)/(1 − 3s/4)}.
Hence
(17) P(T = k) = (3/32){(1/2)(1/4)^{k−4} − 4(1/2)^{k−4} + (9/2)(3/4)^{k−4}}
= 3(1/4)^{k−1} − 3(1/2)^{k−1} + (3/4)^{k−1}
≈ (3/4)^{k−1}, for large k. ■
Finally we note that just as pairs of random variables may have joint distributions, so
too may they have joint generating functions.
Example 7.7.6: de Moivre trials. In a certain ballgame, suppose any pitch independently results in a ball, a strike, or a hit, with respective probabilities p, q, and r, where p + q + r = 1. Let n pitches yield X_n balls, Y_n strikes, and n − X_n − Y_n hits. Then
E(s^{X_1}t^{Y_1}) = r + ps + qt.
Hence, by independence of the pitches,
(21) G_n(s, t) = E(s^{X_n}t^{Y_n}) = (r + ps + qt)^n.
We see that
(22) E(s^{X_n}) = G_n(s, 1) = (r + q + ps)^n,
so the number of balls X_n is binomial, and clearly X_n and Y_n are not independent, by (21). Furthermore
(23) E(X_nY_n) = ∂²G_n/∂s∂t (1, 1) = n(n − 1)pq.
Hence
cov(X_n, Y_n) = n(n − 1)pq − np·nq = −npq. ■
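The covariance can be confirmed by summing over the trinomial joint distribution directly; a sketch (the function name is mine):

```python
from math import comb

def trinomial_cov(n, p, q):
    """cov(X_n, Y_n) computed directly from the trinomial joint pmf."""
    r = 1 - p - q
    ex = ey = exy = 0.0
    for x in range(n + 1):
        for y in range(n - x + 1):
            # P(X_n = x, Y_n = y) for de Moivre trials
            w = comb(n, x) * comb(n - x, y) * p**x * q**y * r**(n - x - y)
            ex += x * w
            ey += y * w
            exy += x * y * w
    return exy - ex * ey

n, p, q = 9, 0.3, 0.45
cov_val = trinomial_cov(n, p, q)
print(cov_val, -n * p * q)   # the two agree
```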
2. Waldegrave's problem revisited. Four card players are bored with bridge, and play Waldegrave's game instead. North is A_0. Show that the probability that either of East or West wins is 2/5.
3. Waldegrave's problem once again. Suppose that in each round the challenger wins with probability p; the conditions of the problem are otherwise unchanged. Show that
EN = (p^{−(n−1)} − p)/(1 − p)
and
Es^N = s^n p^{n−1}(1 − ps)/{1 − s + (1 − p)p^{n−1}s^n}.
The bridge players in exercise 2 want the North–South pair to have the same chance of winning as the East–West pair; show that this is impossible for any choice of p such that 0 < p < 1.
4. Suppose any pitch results in a ball, a strike, a hit, or a foul, with respective probabilities p, q, r, 1 − p − q − r. Then n independent pitches yield X_n balls, Y_n strikes, Z_n hits, and n − X_n − Y_n − Z_n fouls. Find the joint p.g.f. of these four random variables. Now suppose the number of pitches is N, where N is Poisson with parameter λ. Show that the numbers of balls, hits, strikes, and fouls are independent Poisson random variables.
5. If X_1, …, X_r jointly have the multinomial distribution, show that
E(s_1^{X_1}s_2^{X_2}⋯s_r^{X_r}) = (p_1s_1 + p_2s_2 + ⋯ + p_rs_r)^n.
7.8 REVIEW
In this chapter we have introduced the idea of generating functions, in particular the probability generating function (p.g.f.)
G(s) = Es^X
and the moment generating function (m.g.f.)
M(t) = Ee^{tX}.
You can think of such functions as organizers which store a collection of objects that they will regurgitate on demand. Remarkably, they will often produce other information if differently stimulated: thus the p.g.f. will produce the moments, and the m.g.f. will produce the probability distribution (in most cases).
Furthermore, these functions are particularly adept at handling sequences, sums, and collections of random variables, as was exemplified in section 7.3. We applied the idea in looking at branching processes, the central limit theorem, and random walks.
7.9 APPENDIX. TABLES OF GENERATING FUNCTIONS
Table 7.1. Discrete distributions
Distribution    Mass function                              p.g.f.
Bernoulli       p^x(1 − p)^{1−x}; x = 0, 1                 1 − p + ps
binomial        C(n, x) p^x(1 − p)^{n−x}; 0 ≤ x ≤ n        (1 − p + ps)^n
7.10 PROBLEMS
1. Consider the coupon-collecting problem with three different types of coupon, and let T be the number of packets needed until you first possess all three types. Find P(T = k) using a probability generating function. Show that ET = 11/2 and var T = 27/4.
2. Consider Huygens' problem with three coins A, B, and C, which show heads with probabilities α, β, and γ respectively. They are flipped repeatedly in the order ABCABCAB…. Let X be the number of flips until the first head. Find Es^X, and hence deduce the probability that C is the first to show a head.
3. Let X_n have a negative binomial distribution with parameters n and p = 1 − λ/n. Show that X_n has probability generating function {ps/(1 − qs)}^n, and deduce that as n → ∞, the distribution of X_n − n converges to a Poisson distribution.
4. If X has moment generating function M(t), then the function K(t) = log M(t) is called the cumulant generating function; if the function K(t) is expanded as
K(t) = Σ_{r=1}^∞ k_r t^r/r!
then k_r is called the rth cumulant of X. What are the cumulants when X is (a) Poisson? (b) normal? (c) exponential?
5. You are taking a test in which the test paper contains 116 questions; you have one hour. You decide to spend no more than one minute on any one question, and the times (in minutes) spent on questions are independent, with density f(x) = 6x(1 - x), 0 < x < 1. Show that there is a 20% chance, approximately, that you will not attempt all the questions.
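The normal approximation behind the 20% answer can be reproduced directly; a minimal sketch (the mean 1/2 and variance 1/20 of the density 6x(1 - x) are assumed, having been computed by hand):

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n = 116                      # number of questions
mu, var = 0.5, 1.0 / 20.0    # mean and variance of f(x) = 6x(1 - x) on (0, 1)
z = (60.0 - n * mu) / sqrt(n * var)
p_run_out = 1.0 - normal_cdf(z)   # P(total time exceeds the hour)
print(round(p_run_out, 3))
```

The printed value is close to 0.2, agreeing with the problem's claim.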
Show that
E e^{tY_n} = { p exp(qt/(npq)^{1/2}) + q exp(-pt/(npq)^{1/2}) }^n,
and hence deduce de Moivre's central limit theorem.
8. Let X have the normal distribution N(0, σ²). Show that
E X^{2k} = σ^{2k} (2k)! / (2^k k!).
9. Let X and Y be independent normal random variables N(0, 1). Find E e^{t(X² + Y²)} and E e^{tXY}.
10. Let T(s) be the tail generating function of X (defined in example 7.7.4). Show that
EX = lim_{s→1} T(s).
11. Let Z_n be the size of the nth generation in an ordinary Galton-Watson branching process with Z_0 = 1. Let T_n be the total number of individuals who have ever lived, up to and including the nth generation. If E s^{T_n} = H_n(s), show that
H_n(s) = s G_1(H_{n-1}(s)),
where G_1(s) = E s^{Z_1}.
12. Let Z_n be the size of the nth generation in an ordinary Galton-Watson branching process with E Z_1 = μ. Show that
E(Z_m Z_n) = μ^{n-m} E Z_m²,  m ≤ n.
Hence find cov(Z_m, Z_n) and ρ(Z_m, Z_n).
13. Find the probability of extinction of a Galton-Watson branching process when the initial population Z_0 is a random variable with probability generating function P(s).
19. Bivariate normal m.g.f. Let X and Y be independent standard normal random variables; let U = X and V = rX + (1 - r²)^{1/2} Y. Show that the joint moment generating function of U and V is
M(s, t) = E(e^{sU + tV}) = exp{½(s² + 2str + t²)}.
Hence find ρ(U, V).
20. Let X and Y be independent standard normal random variables and let
U = X and V = rX + (1 - r²)^{1/2} Y.
Find the joint moment generating function of U and V.
21. Find the moment generating functions E e^{tX} corresponding to the following density functions on (-∞, ∞):
(a) ½ e^{-|x|};  (b) 1/(π cosh x);  (c) exp(-x - e^{-x}).
(For (b) use the fact that ∫₀^∞ {x^{a-1}/(1 + x)} dx = π/sin aπ.)
For what values of t do they exist?
22. You flip two fair coins. Let I, J, and K be the indicators of the respective events that
(a) the first shows a head,
(b) the second shows a head,
(c) they both show heads or they both show tails.
Find the joint probability generating function
G_{IJK}(x, y, z) = E(x^I y^J z^K).
Show that each pair factorizes, but that
G_{IJK}(x, y, z) ≠ G_I(x) G_J(y) G_K(z).
The events are pairwise independent, but not independent.
23. Random walk in the plane. A particle takes a sequence of independent unit steps in the plane, starting at the origin. Each step has equal probability ¼ of being north, south, east, or west. It first reaches the line x + y = a after T steps, and at the point (X, Y). Show that
G_T(s) = E s^T = [s^{-1}{1 - (1 - s²)^{1/2}}]^a,  |s| ≤ 1.
Deduce that
E s^{X-Y} = G_T(½(s + s^{-1})).
24. Two particles perform independent random walks on the vertices of a triangle; that is to say, at any step each particle moves along the clockwise edge with probability p, or the anticlockwise edge with probability q = 1 - p. At time n = 0 both are at the same vertex; let T be the number of steps until they again share a vertex. Let S be the number of steps until they first share a vertex if initially they are at different vertices. Show that
E s^T = (p² + q²)s + 2pqs E s^S,
E s^S = pqs + (1 - pq)s E s^S.
Hence find E s^T and show that ET = 3.
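The striking conclusion that ET = 3 whatever the value of p is easy to test by simulation; a sketch that tracks only the difference of the two positions mod 3, which is all that matters for meeting:

```python
import random

def meet_time(p, rng):
    """Steps until two independent walkers on a triangle, started at the
    same vertex, again share a vertex; only the difference of positions
    mod 3 is tracked."""
    d, t = 0, 0
    while True:
        a = 1 if rng.random() < p else -1   # step of particle 1
        b = 1 if rng.random() < p else -1   # step of particle 2
        d = (d + a - b) % 3
        t += 1
        if d == 0:
            return t

rng = random.Random(1)
trials = 100_000
avg = sum(meet_time(0.3, rng) for _ in range(trials)) / trials
print(avg)  # theory: ET = 3 for every p
```

Changing 0.3 to any other p in (0, 1) leaves the average near 3, as the algebra predicts.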
25. More compounding. Let X have a Poisson distribution with parameter Λ, where Λ is a random variable having an exponential density with parameter λ. Find E(s^X | Λ), and hence show that
P(X = k) = λ(λ + 1)^{-(k+1)},  0 ≤ k.
26. Show that
G(x, y, z, w) = ⅛(xyzw + xy + yz + zw + zx + yw + xw + 1)
is the joint generating function of four variables that are pairwise and triple-wise independent, but which are nevertheless not independent.
27. Let X ≥ 0 have probability generating function G(s). Show that
E{(X + 1)^{-1}} = ∫₀¹ G(s) ds.
Hence find E{(X + 1)^{-1}} when X is (a) Poisson, (b) geometric, (c) binomial, (d) logarithmic.
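The identity can be verified numerically in the Poisson case, where ∫₀¹ G(s) ds = ∫₀¹ e^{λ(s-1)} ds = (1 - e^{-λ})/λ; a sketch:

```python
from math import exp, factorial

lam = 3.0

# E{1/(X + 1)} for X ~ Poisson(lam), summed directly
direct = sum(exp(-lam) * lam ** k / factorial(k) / (k + 1) for k in range(100))

# the same quantity as the integral of G(s) = exp(lam*(s - 1)) over [0, 1],
# computed by the midpoint rule
n = 100_000
h = 1.0 / n
integral = sum(exp(lam * ((i + 0.5) * h - 1.0)) for i in range(n)) * h

print(direct, integral)  # both equal (1 - exp(-lam))/lam
```

The two estimates agree with each other and with the closed form (1 - e^{-λ})/λ.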
28. Let X have moment generating function M(t). Show that for a > 0:
(a) if t > 0, P(X ≥ a) ≤ e^{-at} M(t);
(b) if t < 0, P(X ≤ a) ≤ e^{-at} M(t).
Now let X be Poisson with parameter λ. Show that, for b > 1,
P(X ≥ bλ) ≤ inf_{t>0} {e^{-bλt} M(t)} = (e^{b-1}/b^b)^λ,
and for b < 1,
P(X ≤ bλ) ≤ inf_{t<0} {e^{-bλt} M(t)} = (e^{b-1}/b^b)^λ.
In particular, verify that
P(X ≥ 2λ) ≤ (e/4)^λ and P(X ≤ λ/2) ≤ (2/e)^{λ/2}.
Compare these with the bounds yielded by Chebyshov's inequality (1/λ and 4/λ, respectively).
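The relative strength of the two kinds of bound is easy to tabulate; a sketch for λ = 10 and b = 2, comparing the exact tail with the m.g.f. bound (e/4)^λ and Chebyshov's 1/λ:

```python
from math import exp, factorial

def poisson_tail(k, lam):
    """Exact P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(exp(-lam) * lam ** j / factorial(j) for j in range(k))

lam = 10
exact = poisson_tail(2 * lam, lam)     # P(X >= 2*lam)
chernoff = (exp(1) / 4.0) ** lam       # the (e/4)^lam bound above
chebyshev = 1.0 / lam                  # Chebyshov: P(|X - lam| >= lam) <= 1/lam
print(exact, chernoff, chebyshev)
```

The ordering exact < (e/4)^λ < 1/λ shows the m.g.f. bound improving on Chebyshov by an order of magnitude here, while still overstating the true tail.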
29. Three particles perform independent symmetric random walks on the vertices of a triangle; that is to say, at any step each particle moves independently to either of the other two vertices with equal probability ½. At n = 0, all three are at the same vertex.
(a) Let T be the number of steps until they all again share a vertex. Find E s^T, and show that ET = 9.
(b) Suppose that they all start at different vertices; let R be the time until they first share a vertex. Do you think ER > ET or ER < ET? Find ER and test your conjecture.
(c) Let S be the number of steps until they all again share the same vertex from which they began. Find ES.
30. Poisson number of de Moivre trials. Suppose any ball yields a wicket with probability p, or one or more runs with probability q, or neither of these with probability r, where p + q + r = 1. Suppose the total number of balls N has a Poisson distribution with parameter λ, independent of their outcomes. Let X be the number of wickets, and Y the number of balls from which runs are scored. Show that X and Y are independent Poisson random variables by calculating G(s, t) = E(s^X t^Y).
Remark. It can be shown that the characteristic function of a Cauchy random variable X is φ_X(t) = e^{-|t|}.
32. Show that if X_1, ..., X_n are independent Cauchy random variables and X = n^{-1} Σ_{r=1}^n X_r, then X has the same Cauchy density as the X_i.
HINTS AND SOLUTIONS
Section 2.2
1. (a) All sequences of j's and k's of length m.
(b) The non-negative integers.
(c) New rules: (a1 , a2 ), (a3 , a4 ), (a5 , a6 ), where a i < 7 (1 < i < 6).
(d) All quadruples (x1, x2, x3, x4), where each x_i is a choice of five different elements from
(C, D, H, S) × (A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K),
and x_i ∩ x_j = ∅ for i ≠ j.
Section 2.3
1. (a) Ω = {(i, j): 1 ≤ i, j ≤ 6}, A = {(i, j): i + j = 3}.
(b) Ω = {i: 0 ≤ i ≤ 100}, A = {i: 0 ≤ i ≤ 4}.
(c) Ω = {B, G} × {B, G} × {B, G}, A = {BBB, GGG}.
(d) Ω = the non-negative integers, A = the integers in [10, 15].
(e) Ω = {(r1, f1), (r2, f2), (r3, f3)} where 0 ≤ r_i, f_i ≤ 7,
A = {r1 + r2 = 7} ∪ {r1 + r3 = 7} ∪ {r2 + r3 = 7}.
(f) Ω = {(x, y): x, y ≥ 0}, A = {(x, y): x > y}.
2. Draw two Venn diagrams.
Section 2.4
1. 18/37.
2. There are ½ × 52 × 51 pairs of cards, and 6 pairs of aces, so P(two aces) = 6/(½ × 52 × 51) = (13 × 17)^{-1} = 1/221.
3. area ABD/area ABC = (½ × |BD| × height)/(½ × |BC| × height) = |BD|/|BC|.
4. ¼.
5. b/(a + b).
Section 2.5
1. S ∩ F = ∅ and S ∪ F = Ω. Hence 1 = P(Ω) = P(S) + P(F), by (3).
2. If |Ω| = n, then there are at most 2^n different events in Ω.
3. P(∪_{r=1}^{n+1} A_r) = P(∪_{r=1}^{n} A_r) + P(A_{n+1}), by (3). Now induction yields (4).
4. A = (A \ B) ∪ (A ∩ B). By the addition rule, P(A) = P(A \ B) + P(A ∩ B). Hence P(A ∩ B) ≤ P(A). The same is true with B and A interchanged.
5. Use 1 = P(Ω) = P(A) + P(A^c) = P(Ω) + P(∅).
Section 2.6
1. Obvious from P(B) = P(A) + P(B \ A).
2. P(A ∪ B ∪ C) = P(A) + P(B ∪ C) - P(A ∩ (B ∪ C))
= P(A) + P(B) + P(C) - P(B ∩ C) - P((A ∩ B) ∪ (A ∩ C)),
and so on.
3. P(at least one double six in r throws) = 1 - (35/36)^r;
1 - (35/36)^{24} ≈ 0.491 < ½ < 0.506 ≈ 1 - (35/36)^{25}.
So 25 is the number needed.
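The threshold of 25 throws (de Méré's problem) can be confirmed mechanically; a sketch:

```python
def p_double_six(r):
    """P(at least one double six in r throws of two fair dice)."""
    return 1.0 - (35.0 / 36.0) ** r

# find the smallest r for which the chance exceeds one half
r = 1
while p_double_six(r) <= 0.5:
    r += 1
print(r, round(p_double_six(24), 3), round(p_double_six(25), 3))
```

The loop stops at r = 25, with the bracketing values 0.491 and 0.506 quoted in the hint.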
4. By enumeration of cases, the probabilities are P(S = k) = a_k/216, where, in order from a_3 to a_18, the a_i are
1, 3, 6, 10, 15, 21, 25, 27, 27, 25, 21, 15, 10, 6, 3, 1.
5. (a) 1 - (5/6)⁶;
(b) 1 - (35/36)^{12} < 1 - (5/6)⁶.
Section 2.7
1. 0.1/0.25 = 40%.
2. (a) We know 0 ≤ P(A ∩ B) ≤ P(B). Divide by P(B).
(b) P(Ω|B) = P(Ω ∩ B)/P(B) = 1.
(c) P(A1 ∪ A2|B) = P((A1 ∩ B) ∪ (A2 ∩ B))/P(B). Expand this.
For the last part set A1 = A, A2 = A^c, and use (c).
3. RHS = {P(A ∩ B ∩ C)/P(B ∩ C)} × {P(B ∩ C)/P(C)} × P(C) = LHS.
4. Use the above exercise to give
P(all red) = P(3 red|2 red) P(2 red|1 red) P(1 red) = (3/13) × (4/14) × (5/15) = 2/91.
By the addition rule
2
P(same colour) P(red) P(green) P(blue) 91 45 3 91
1 4
91:
5. LHS = P(A ∩ B)/P(A ∪ B)
≤ min{P(A ∩ B)/P(A), P(A ∩ B)/P(B)} = RHS.
Section 2.8
1. (a) αβ + (1 - α)γ; (b) αβ/{αβ + (1 - α)γ}.
2. P(reject) = 10^{-4} × 95/100 + (1 - 10^{-4}) × 5/100. Then we have
(a) 10^{-4} × 95/100 ÷ P(reject);
(b) 1 - P(reject).
3. (a) You can only say what this probability is if you assume (or know) what your friend decided
to tell you in all possible circumstances. Otherwise you cannot determine this probability.
Consider two cases for example.
(i) Your friend has decided that if she has both red aces she will say `one is the ace of
diamonds'. In this case the probability that she has both red aces, if she tells you she
has the ace of hearts, is zero.
(ii) Your friend has decided that if she has one ace she will say `I have a club', but if she
has two aces, she will say `one is the ace of hearts'. In this case the probability that
she has both red aces, if she tells you she has the ace of hearts, is unity.
(b) In this case you do know what the answers would be in all cases, because you have received a truthful answer to a fixed question. We calculate:
P(either is A_H) = 1 - P(neither is A_H) = 1 - (3 × 2)/(4 × 3) = ½,
P(both red aces) = (2 × 1)/(4 × 3) = ⅙,
hence
P(both red aces|either is A_H) = (⅙)/(½) = ⅓.
4. Σ_i P(A|B_i ∩ C) P(B_i|C) = Σ_i P(A ∩ B_i ∩ C)/P(C) = P(A ∩ C)/P(C).
Section 2.9
1. P(A^c ∩ B^c) - P(A^c)P(B^c) = P((A ∪ B)^c) - (1 - P(A))(1 - P(B)) = P(A ∩ B) - P(A)P(B) = 0 iff A and B are independent.
2. (a) 0.35, (b) 0.2, (c) 2/9, (d) 0.08.
3. P(n - 1 tails followed by head) = {P(T)}^{n-1} P(H).
4. Let P(G) = q, P(B) = p, where p + q = 1.
(a) P(both sexes) = 1 - p³ - q³, P(at most one girl) = p³ + 3p²q,
P(both sexes and at most one girl) = 3p²q. Then
(1 - p³ - q³)(p³ + 3p²q) = 3p²q
for independence, which occurs when p = q = ½ and when pq = 0, and not otherwise.
(b) P(both sexes) = 1 - p⁴ - q⁴, P(at most one girl) = p⁴ + 4p³q,
P(both sexes and at most one girl) = 4p³q. For independence
(1 - p⁴ - q⁴)(p⁴ + 4p³q) = 4p³q,
which occurs when pq = 0 and for just one other value of p, which is approximately p = 0.4.
5. Flip two coins, with A = first coin shows head, B = second coin shows head, C = both coins show heads.
6. See example 2.9.7.
Section 2.10
1. P(A1 ∩ A2) = P(A1|A2)P(A2), and so on, giving the LHS by successive cancellation.
2. As figure 2.20, terminated after the sixth deuce.
3. Modified version of figure 2.19.
4. 6p²q².
Section 2.11
Section 2.12
1. By denition, for a unit stake, you get 1 a with probability P(A). The value of this is
(1 a )P(A), so the casino's take is
t 1 (1 a )P(A):
From (1) we have P(A) (1 a )1 , so
t 1 (1 a )(1 a )1 ( a a )(1 a )1 :
Section 2.16
1 2 1 2
1. (a) 20 3 12 2 3 20
1
3 12 3 19
20; (b) 1
2 3 20 .
2. 32/663.
3. 15/442, (32/663) × (15/442).
4. (a) No; ¼ ≤ P(rain at weekend) ≤ ½.
(b)
5. P(3 divides PIN) = 2999/8998, P(7 divides PIN) = 1286/8998, P(21 divides PIN) = 169/8998, so P(either one divides PIN) = 4116/8998.
6. (a) P(A12) = 0; (b) P(A3) = 60/6³, P(A6) = 100/6⁴.
7. ¼, ⅓.
8. No. In fact 28 will do.
9. (c) (i) ⅖, (ii) ⅖.
10. (c) P(T) = ⅖ (from part (c) of the previous answer). Hence (a) ¼; (b) ¾.
11. Use a Venn diagram, or check the elements;
by the first result, A ∪ B ∪ C = ((A ∪ B)^c ∩ C^c)^c = (A^c ∩ B^c ∩ C^c)^c.
12. Use the above problem and induction.
13. P(A ∩ B) = P(A) + P(B) - P(A ∪ B).
If B ⊆ A, this gives P(A ∩ B) = P(B) = ½.
If A ∪ B = Ω, this gives P(A ∩ B) = 1/10.
The bounds are as given because
max{P(A), P(B)} ≤ P(A ∪ B) ≤ 1.
14. The first inequality follows from P(A ∩ B) > P(A)P(B). The second follows from
P(A^c ∩ B) = P(B) - P(A ∩ B) < P(B) - P(B)P(A) = P(B)P(A^c).
15. P(A ∩ A) = {P(A)}² gives P(A) = 0 or P(A) = 1.
p_k = ⅙(1 - p_{k-1} - q_{k-1}) + ⅚ p_{k-1};  q_k = ⅙ p_{k-1} + ⅚ q_{k-1}.
Section 3.2
1. (a) 6; (b) 0; (c) 6.
3. 6!/6⁶ ≈ 0.015.
4. For every collection of x_i such that Σ_{i=1}^{1000} x_i = 1100, there is a one-one map x_i → 7 - x_i to the collection of 7 - x_i such that Σ(7 - x_i) = 5900.
Section 3.3
1. (a) Use the one-one correspondence between choosing r objects from n and choosing the remaining n - r.
(b) Verify trivially from (7).
(c) Set up the same difference equation for C(n, r) and C(n, n - r).
2. Expand RHS to give (4).
Section 3.4
1. Each element is either in or not in any set, giving 2^n choices.
2. There are C(n, k) ways to choose the k brackets to supply x^k; the rest of the brackets supply y^{n-k}.
3. Set x = y = 1 in exercise 2.
4. The answer to the hint is C(r - 1, k - 1), as there are r - 1 numbers less than r, of which we choose k - 1. Now sum over all possibilities for the largest number selected.
5. C(r + s, r).
Section 3.5
1. |Ω| = 9 × 10³. |A| = (number of PINs with double zero) + (number of PINs with single zero) + (number of PINs with no zeros). Hence P(A) = 0.432.
2. Choose three faces in C(6, 3) ways; divide by 2! to avoid counting the pairs twice; permute these symbols in 5!/(2!2!1!) ways.
3. Divide the given expression by
P(you have x spades) = C(13, x) C(39, 13 - x)/C(52, 13).
5. Choose the points to win in C(n, k) ways.
Section 3.6
1. Choose 5 non-adjacent objects from 48, and choose one to make a pair in 49 objects.
3. Choose k of your selection from the selected numbers and then choose r - k from the n - r unselected.
Section 3.9
1. The probability that a given choice of r players draw their own name and the remaining n - r do not is
(1/n) × (1/(n - 1)) × ⋯ × (1/(n - r + 1)) × {1 - 1/1! + 1/2! - ⋯ + (-1)^{n-r}/(n - r)!}.
There are C(n, r) such choices of r players, giving the result.
Section 3.13
1. (a) 55/96; (b) five will do. (P(at least two of the five share a sign) ≈ 0.6.)
2. (b) 0.15 approximately.
3. (a) 7/C(10, 4); (b) 1 - C(9, 3)/C(10, 4); (c) C(9, 3)/C(10, 4) = 2/5;
(d) 1 - C(8, 4)/C(10, 4) = 2/3; (e) 5/C(10, 4) = 1/42.
4. (a) ½(5/6)⁵; (b) 1 - 6!/6⁶ = 319/324; (c) 406/6⁶;
(d) (406/6⁶)/(319/324).
5. Choose the ranks of the pairs in C(13, 2) ways, the suits of the pairs in C(4, 2) C(4, 2) ways, and the other card in 44 ways. Then
C(13, 2) C(4, 2) C(4, 2) × 44 / C(52, 5) ≈ 0.048.
6. If the first thing is not chosen there are (n - 1)_r permutations of r things from the remaining n - 1. The other term arises if the first thing is chosen. Then use the addition rule. Alternatively, substitute in the formula.
7. Use van der Monde, example 3.4.2.
8. Consider the boxes that include the first colour, and those that do not.
9. When you make the nth cut, the previous n - 1 cuts divide it into n segments at most. So the largest number of extra bits produced by this cut is n. Hence R_n = R_{n-1} + n. Now verify the given solution.
10. (a) 2(n - k - 1)/n!; (b) 2/(n - 1)!
11. (n - k)!/n!.
12. Choose the start of the run in 10 ways and the suits in 4⁵ ways; exclude the straight flush.
13. C(32, 13)/C(52, 13). This gives fair odds of 1800:1, approximately, so by exercise 1 of section 2.12 the Earl's percentage take was around 44%. A nice little earner, provided he insisted on a final shuffle himself.
14. (a) This is (7) of section 3.8.
(b) Let A be the event that the wasp has visited g_{k+1} when its last flight is to g_k from g_{k-1}. And let B be the event that the wasp has visited g_{k-1} when its last flight is to g_k from g_{k+1}. It must do one or the other, and A ∩ B = ∅.
By the first part, P(L_k|A) = P(L_k|B) = (n - 1)^{-1}. Hence, using the partition rule,
P(L_k) = P(L_k|A)P(A) + P(L_k|B)P(B) = (n - 1)^{-1}{P(A) + P(B)} = (n - 1)^{-1}.
15. (a) C(6, 4) C(43, 2)/C(49, 6); (b) C(25, 6)/C(49, 6).
(c) Let A_r be the event that the number r has failed to turn up. Then
P(∪_{r=1}^{49} A_r) = Σ P(A_r) - Σ P(A_r ∩ A_s) + ⋯
= 49 P(A_1) - C(49, 2) P(A_1 ∩ A_2) + ⋯ + C(49, 43) P(A_1 ∩ ⋯ ∩ A_43).
For any set of k numbers, P(A_1 ∩ ⋯ ∩ A_k) = C(49 - k, 6)/C(49, 6), and so
P(∪_{r=1}^{49} A_r) = 49 C(48, 6)/C(49, 6) - C(49, 2) C(47, 6)/C(49, 6) + ⋯ + C(49, 43) C(6, 6)/C(49, 6).
Section 4.2
1. (a) 1 - (1 - p)^n; (b) 3p²(1 - p) + p³.
2. (a) 1 - q⁴ - 4q³(1 - q); (b) 1 - q³ - 3q²(1 - q).
(a) ≥ (b); the moral seems obvious.
3. Imagine that you are `in gaol', and look at Figure 4.1.
Section 4.3
1. P(A_n) = (5/6)^{n-1} × ⅙, P(A_n ∩ E) = (5/6)^{2m-1} × ⅙ (n = 2m), P(E) = 5/11,
and P(A_n|E) = (11/25)(25/36)^m (n = 2m). Yes, but not the geometric distribution.
2. From example 4.3.4, P(A_n|D_n) = a_n/(a_n + b_n) = p/(p + q) = P(A_n).
Section 4.4
2. C(n, k)² p^{2k} q^{2n-2k} ÷ {C(n, k-1) p^{k-1} q^{n-k+1} × C(n, k+1) p^{k+1} q^{n-k-1}}
= (k + 1)(n - k + 1)/{(n - k)k} ≥ 1.
3. The correspondence rule in action.
Section 4.5
1. Consider the ratio
a N 1a N a N 1a N 1
: ,
r n r n r n r n
which reduces to an: r(N 1). This gives increasing terms up to the integer nearest to an=r;
thereafter the terms decrease.
93 93 100
2. 7 .
10 9 10
r n r a r N a n r
3. < 1:
r1 n r1 a r1 N a n r1
Section 4.6
1. Σ_{x=1}^n x n^{-1} = n^{-1} × ½n(n + 1) = ½(n + 1).
2. Σ_{x=1}^n n^{-1} x² - μ² = ⅙(n + 1)(2n + 1) - ¼(n + 1)². (See subsection 3.12.I.)
3. σ² = Σ_{x=1}^∞ x² q^{x-1} p - p^{-2}. Use the negative binomial theorem from subsection 3.12.III to sum the series: Σ_{x=1}^∞ ½x(x + 1)q^{x-1} = (1 - q)^{-3}.
4. Σ_k k λ^k e^{-λ}/k! = λ Σ_{k≥1} λ^{k-1} e^{-λ}/(k - 1)! = λ.
Section 4.8
1. n = 200, p = 1/40, np = 5. So
(a) 1 - P(fewer than 4) = 1 - Σ_{r=0}^{3} 5^r e^{-5}/r! ≈ 0.74,
(b) P(none) = e^{-5} ≈ 0.0067.
2. n = 404, p = 10^{-2}, np = 4.04;
P(bump into no more than three) = e^{-4.04}{1 + 4.04 + ½(4.04)² + ⅙(4.04)³} ≈ 0.43.
3. k = [λ] if λ is not an integer. If λ is an integer then p(λ - 1) = p(λ) = e^{-λ}λ^λ/λ!.
Section 4.9
1. T(x) = 1 - (1 - x)²; t(x) = 2(1 - x).
Section 4.10
1. Exact binomial: p(12) = 0.028, p(16) = 0.0018;
normal approximation: p(12) ≈ 0.027, p(16) ≈ 0.0022.
2. Show that the mode m is [np] and then use Stirling's formula.
3. np = 800, (npq)^{1/2} = 20; the probability is 1 - Φ(4) ≈ ¼φ(4) ≈ 0.00003, which is extremely small. But if observed it would suggest the new one is better.
Section 4.11
1. By symmetry we might just as well pick a point in the semicircle 0 ≤ y ≤ (1 - x²)^{1/2}. Now use example 4.11.4.
2. By (5), ½ × 3 × 3a + ½ × 2 × 3a = 1; thus a = 2/15.
P(|X| > 1) = P(-3 ≤ X < -1) + P(1 < X ≤ 2)
= ½ × 2 × 2a + ½ × 1 × (3/2)a = 11/30.
Section 4.12
1. p(x, y) = ⅙, x ≠ y; P(X + Y = 3) = ⅓; P(X = x) = ⅓ = P(Y = y); so each is uniform on {1, 2, 3} with mean 2 and variance ⅔.
2. p(x, y) = ⅙ C(x, y) 2^{-x}, 0 ≤ y ≤ x ≤ 6.
Σ_{x,y} y p(x, y) = Σ_{x=1}^{6} ⅙ × ½x = 7/4.
Section 4.16
1. X is binomial B(n, 6^{-2}) with mean 6^{-2}n and variance 6^{-2}n(1 - 6^{-2}).
2. (a) c1 = 2/{n(n + 1)}; (b) c2 = 1.
3. Σ_x x² λ^x e^{-λ}/x! = Σ_x x(x - 1) λ^x e^{-λ}/x! + Σ_x x λ^x e^{-λ}/x! = λ² + λ.
4. When [np] = k.
5. x 8 > (x 2 1)4 ; the distribution is geometric.
6. P(X = x) = p{(q + r)^{x-1} - r^{x-1}} + q{(p + r)^{x-1} - r^{x-1}}, x ≥ 2.
! !
y1 X
y j y j
j
P(Y y) p q i r y ji
j1 i k i
! !
y1 Xy k yk
qk p i r y ki , y > j k:
k1 i j i
7. (a) pq/(1 - p²); (b) p^{r-1}q/(1 - p^r); (c) p^r(1 - p^s)/(1 - p^{r+s}).
8. P(X 2n) 12(1 ) n1 (1 ) n1 f(1 ) (1 )g;
P(X 2n 1) 12(1 ) n1 (1 ) n1 ( ), n > 1.
P(E) 2=f2( )g. Not in general, but B is independent of E and
fX 2ng when .
n k1
9. p(n k) pk q n
k1
k n
(n k 1)(n k 2) . . . (k 1)k
1
n! k k
n
k n
n1 1
1 1 1 ! e .
n! k k k n!
p(n + k) is the probability that in repeated Bernoulli trials the (n + k)th trial is the kth success; this is the probability of exactly n failures before the kth success. The result shows that as failures become rarer, in the long run they have a Poisson distribution. This is consistent with everything in section 4.8.
10. P(kth recapture is mth tagged)
= P(kth is tagged) × P(1st k - 1 recaptures include exactly m - 1 tagged).
Now P(kth is tagged) is t/n, and the second term is hypergeometric. Hence the result.
11. (a) The total number of possible sequences is C(h + t, h). The number of ways of having x runs of heads is the number of ways of dividing the h heads into x groups, which we may do with x - 1 bars placed in any of the h - 1 gaps, that is in C(h - 1, x - 1) ways. The t tails must then be distributed with at least one in these x - 1 positions, and any number (including zero) at each end. Adding 2 (notional) tails to go at the ends shows that this is the same as the number of ways of dividing t + 2 tails into x + 1 groups, none of which is empty. This may be done in C(t + 1, x) ways, by problem 17 of section 3.13, and the result follows.
12. t(x) = 1 - |x|, |x| ≤ 1.
T(x) = ½(1 + x)², -1 ≤ x ≤ 0;
1 - ½(1 - x)², 0 ≤ x ≤ 1.
13. p(x, w) = 1/36, x + 1 ≤ w ≤ x + 6, 1 ≤ x ≤ 6.
14. (a) Choose the numbers less than x in C(x - 1, 5) ways.
(b) C(49 - x, 5)/C(49, 6), 1 ≤ x ≤ 44.
15. 7 × 10⁴ approximately.
16. P(X = k) = (1 - pt)^{k-1} pt.
17. p(k̂) in (2) of section 4.5, where in the general case k̂ is the integer part of
(m + 1)(r + 1)/(m + w + 2).
What are the special cases?
The ratio p_t(k)/p_{t-1}(k) is
(t - m)(t - r)/{t(t - m - r + k)}.
So p(k) is largest for fixed m, r, and k when t = [mr/k].
x y1 49
19. (a) , 1 < y , x 4 < 45.
4 6
49 z 1 49
(b) .
6z 6
20. P(X = 0) = n^{-1}, P(X = x) = 2n^{-2}(n - x), 1 ≤ x ≤ n - 1.
mean = Σ_{x=1}^{n-1} x p(x) = 2n^{-2} Σ_{x=1}^{n-1} (nx - x²) = (n² - 1)/(3n).
21. P(X = n) = (n - 1)p²q^{n-2} + (n - 1)q²p^{n-2}, n ≥ 4.
mean = Σ_{n≥4} n(n - 1)(q^{n-2}p² + p^{n-2}q²). Now use the negative binomial theorem from 3.12.III with n = 3 to obtain the result.
23. (b) Deaths per day may be taken to be Poisson with parameter 2: P(5 or more in one day) = 1 - 7e^{-2} ≈ 0.054, just over 5%. This is not so unlikely. You would expect at least one such day each month. However, as deaths per annum are approximately normal, we calculate, using part (a),
P(X > 850) = P((X - 730)/√730 > 120/√730) ≈ 1 - Φ(120/√730) ≈ 3 × 10^{-5}.
So 2000 really was an extremely bad year, compared with the previous two decades.
24. (a) 2; (b) 3, using problem 6; (c) 11/2, using problem 21.
Section 5.2
Section 5.3
1. F(x) = ½n^{-2}(n + x)(n + x + 1), -n ≤ x ≤ 0;
1 - ½n^{-2}(n - x)(n - x + 1), 0 ≤ x ≤ n.
2. If P(X = x) = p(x), x ∈ D, then we let Ω = D and define X(ω) = ω, together with
P(A) = Σ_{x∈A} p(x) for any event A in Ω.
3. If b > 0, p_Y(y) = p_X((y - a)/b), F_Y(y) = F_X((y - a)/b).
If b = 0 then P(Y = a) = 1. If b < 0 then p_Y(y) is as above, but
F_Y(y) = P(X ≥ (y - a)/b) = Σ_{x ≥ (y-a)/b} p_X(x).
Section 5.4
1. c = 2a^{-2}; F(x) = x²/a².
2. λf1 + (1 - λ)f2 ≥ 0 and ∫{λf1 + (1 - λ)f2} dx = λ + 1 - λ = 1.
(b) Not in general. For example, if f1 = f2 = ½, 0 ≤ x ≤ 2, then f1 f2 = ¼, which is not a density. But consider f1 = f2 = 1, 0 ≤ x ≤ 1, when f1 f2 is a density.
3. Check that (20) and (21) hold. Yes; likewise.
Section 5.5
1. From example 5.5.6,
1 p p 1
f Y ( y) p f( y) ( y)g p exp(12 y), y > 0:
2 y 2 y
2. Uniform on the integers f0, 1, . . . , m 1g; pY (k) m1 .
1 2=3
3. 3y f ( y 1=3 ).
4. f Y ( y) f ( y), by symmetry about y 12.
Section 5.6
1. ∫₀¹ 2x² dx = ⅔.
2. Σ_{x=1}^n x² × 2/{n(n + 1)} = {2/(n(n + 1))} × ⅙n(n + 1)(2n + 1) = ⅓(2n + 1).
3. EX = ∫₀^∞ {λ^r x^r e^{-λx}/(r - 1)!} dx = (r/λ) ∫₀^∞ {λ^{r+1} x^r e^{-λx}/r!} dx = r/λ, because the integrand is a density with integral unity.
Section 5.7
1. B(n, p) has mean np and variance npq. Hence EX = n/2, EY = 2n/4; var X = n/4, var Y = 6n/16. Use example 5.7.4 to give EZ = n, var Z = 2n.
2. Let I be the indicator of the event h(X) ≥ a. Then aI ≤ h(X) always. Now take the expected value of each side, and use the fact that EI = P(I = 1) = P(h(X) ≥ a).
3. P(X = n) = 2^{-n}, EX = 2, var X = 2.
(a) P(|X - 2| ≥ 2) ≤ E(|X - 2|²)/4 = ¼ var X = ½.
Actually P(|X - 2| ≥ 2) = P(X ≥ 4) = 2^{-3}.
(b) P(X ≥ 4) ≤ E|X|/4 = 2/4. Actually P(X ≥ 4) = 2^{-3}.
Section 5.8
1. (a) Condition on the first point, and then on the second, to get (with an obvious notation) first
EX rE(X jR) E(X jF) 1 and second E(X jR) 1 E(X jF), E(X jF) 1
rE(X jR). Hence E(X ) (2 r)=(1 r).
(b) EY (2 EY )2r 2(r2 2 ). So EY 2=(1 2r).
(c) E(X jL) EX and E(Y jL) EY .
2. Condition on the rst point. Check that k satises the recurrence, together with 0 n 0.
Section 5.9
1. Given that X > t, we have shown that X - t has the same distribution as Y, where Y is exponential. Hence E(X|X > t) = t + λ^{-1}, and var(X|X > t) = var Y = λ^{-2}.
Section 5.12
1. (a) P(X ≥ x) = {(6 - x + 1)/6}⁵, 1 ≤ x ≤ 6. So
p(x) = P(X = x) = P(X ≥ x) - P(X ≥ x + 1) = {(7 - x)/6}⁵ - {(6 - x)/6}⁵.
EX = Σ_{x=1}^{6} P(X ≥ x) = (6/6)⁵ + (5/6)⁵ + ⋯ + (1/6)⁵ = 4067/2592 ≈ 1.57.
(b) P(Y ≥ y) = 1 - P(Y < y) = 1 - {(y - 1)/6}⁵. So p(y) = P(Y = y) = (y/6)⁵ - {(y - 1)/6}⁵.
By symmetry Y has the same distribution as 7 - X, so EY = 7 - EX ≈ 5.43. Or do the sum.
2. (a) ½, by symmetry. (b) Let A_r be the event that the first two dice sum to r, and B_r the event that the other two sum to r. Then we know
P(A_r) = P(B_r) = 6^{-2} min{r - 1, 13 - r}, 2 ≤ r ≤ 12.
Then P(sum of 4 dice = 14) = Σ_{r=2}^{12} P(A_r)P(B_{14-r}) = Σ_{r=2}^{12} {P(A_r)}²
= 6^{-4}(1² + 2² + 3² + 4² + 5² + 6² + 5² + 4² + 3² + 2² + 1²) = 6^{-4} × 146. Hence by symmetry
P(sum of 4 dice ≥ 14) = ½ + ½ × 146/6⁴.
and let a → ∞.
Section 6.2
1. p(1, 6) → 1, and p(x, y) → 0 otherwise.
2. (a) (n - 1)/(2n), (b) 1/n.
3. If the parties are P, Q, and R, then if voters' preferences are distributed like this, it follows that ⅔ of them prefer P to Q, ⅔ prefer Q to R, and ⅔ prefer R to P. So whoever is elected, ⅔ of the voters preferred some other party.
Section 6.3
1. F(x, y) = 1 - e^{-x} - xe^{-y} for 0 ≤ x ≤ y < ∞;
1 - e^{-y} - ye^{-y} for 0 ≤ y ≤ x < ∞.
F_X(x) = 1 - e^{-x}, F_Y(y) = 1 - e^{-y} - ye^{-y};
f_X(x) = e^{-x}, f_Y(y) = ye^{-y}.
2. X and Y are not jointly continuous, so there is no contradiction. (Their joint distribution is said
to be singular.)
3. ∂²F/∂x∂y < 0, so this cannot be a joint distribution.
Section 6.4
1. ∫₀¹ ∫₀¹ ∫₀^{xy} dz dx dy = ¼.
Section 6.5
1. P(W ≤ w) is the volume of the pyramid x ≥ 0, y ≥ 0, z ≥ 0, x + y + z ≤ w. The volume of such a pyramid is ⅓ × area of base × height = ⅓ × ½w² × w = ⅙w³. Hence, differentiating, f_W(w) = ½w², 0 ≤ w ≤ 1. Consideration of other pyramids in the cube 0 ≤ x, y, z ≤ 1 yields the rest of f_W(w).
2. From the solution to example 6.5.2(ii), either by symmetry or by using similar arguments,
p_Y(y) = C(49 - y, 5)/C(49, 6), 1 ≤ y ≤ 44.
Section 6.6
1. P(A_n) = P(at least n sixes in 6n rolls)
= 1 - P(fewer than n sixes in 6n rolls)
= 1 - Σ_{r=0}^{n-1} C(6n, r) (5/6)^{6n-r} (1/6)^r ≥ P(A_{n+1}), as we showed in (11). The inequality follows.
Alternatively, if you have a computer big enough for symbolic algebra, it will rewrite the expression for P(A_n) in a form which is monotone in n.
2. f_Z(z) = e^{-z/2} - e^{-z}.
3. The number of flips of an unfair coin until the nth head has the negative binomial distribution. The waiting times between heads are geometric. (Or use induction.)
4. Recall from section 4.8 that the number of meteorites up to time t is Poisson, and the gaps between meteorites are exponential. Or verify the induction, using
∫₀^z f(z - x) λe^{-λx} dx = ∫₀^z f(x) λe^{-λ(z-x)} dx
= ∫₀^z {λ^n x^{n-1} e^{-λx}/(n - 1)!} λe^{-λ(z-x)} dx = {λ^{n+1} e^{-λz}/(n - 1)!} ∫₀^z x^{n-1} dx.
Section 6.7
1. Let I_r be the indicator of the event that your n coupons include the rth type. Find E Σ_r I_r.
2. Show that Π_r (1 - I_r) ≥ 1 - Σ_r I_r. (Induction is easy.)
3. E(S_r/S_k) = r/k, for r ≤ k. The cobalt balls divide the dun balls into c + 1 groups, with the same expectation. Hence EX = d/(c + 1), since the sum of the groups is d.
Section 6.8
Section 6.9
1. Discrete case: p(x|y) = p_X(x) if and only if p(x, y) = p_X(x)p_Y(y). By the above, when X and Y are independent, E(X|Y) = EX.
2. p(v|u) = 2(13 - 2u)^{-1}, u < v;
(13 - 2u)^{-1}, u = v.
E(V|U = u) = Σ_{v=u}^{6} v p(v|u) = (42 - u²)/(13 - 2u).
E(UV) = 49/4; cov(U, V) = 49/4 - (91/36)(161/36) = 1225/1296.
3. E(Σ_1^N X_r)² = E Σ_1^N X_r² + E Σ_{r≠s} X_r X_s
= EN E(X_1²) + {E(N²) - EN}(EX_1)².
Now subtract (E Σ_1^N X_r)².
Section 6.10
1. (a) f(u, v) = 2f(u)f(v), 0 ≤ u < v; f_V(v) = {∫₀^v 2f(u) du} f(v);
f(u|v) = f(u, v)/f_V(v) = f(u)/∫₀^v f(u) du.
E(U|V = v) = ∫₀^v u f(u) du / ∫₀^v f(u) du.
(b) From above.
2. Z = X + Y has density λ²ze^{-λz}, and X and Z have joint density λ²e^{-λz}, 0 ≤ x < z < ∞. This follows either from example 6.10.2 or directly from
f(x, z) = ∂²/∂x∂z P(X ≤ x, Z ≤ z) = ∂²/∂x∂z ∫₀^x λe^{-λu} (∫₀^{z-u} λe^{-λy} dy) du.
Section 6.11
1. var(Y|X) = E(Y²|X) - {ψ(X)}², and var{ψ(X)} = E{ψ(X)²} - (EY)², where ψ(X) = E(Y|X).
2. We need to use the fact that
E[{X - E(X|Y)}{E(X|Y) - g(Y)}] = E([E{X - E(X|Y)|Y}]{E(X|Y) - g(Y)}) = 0.
So, writing ψ = E(X|Y),
E(X - g)² = E(X - ψ + ψ - g)² = E(X - ψ)² + E(ψ - g)² + 2E{(X - ψ)(ψ - g)}
= E(X - ψ)² + E(ψ - g)² ≥ E(X - ψ)².
3. p_{X|Y}(0|0) = a/(a + c), p_{X|Y}(1|0) = c/(a + c),
p_{X|Y}(0|1) = b/(b + d), p_{X|Y}(1|1) = d/(b + d).
So E(X|Y = 0) = c/(a + c), E(X|Y = 1) = d/(b + d), and
E(X|Y) = c(1 - Y)/(a + c) + dY/(b + d), as required.
Section 6.12
1. By definition V = (rσ2/σ1)(U - μ1) + σ2(1 - r²)^{1/2} Y. Hence the conditional density of V, given U = u, is normal with mean E(V|U) = μ2 + (rσ2/σ1)(U - μ1) and variance var(V|U) = σ2²(1 - r²).
2. (a) Using (2) and (3) with σ = 1,
P(U > 0, V > 0) = P(X > 0, rX + (1 - r²)^{1/2} Y > 0).
In polar coordinates the region (x > 0, rx + (1 - r²)^{1/2} y > 0) is the region
ρ > 0, -r/(1 - r²)^{1/2} < tan θ < ∞  (ρ > 0, -r < sin θ ≤ 1).
Hence
P(U > 0, V > 0) = ∫₀^∞ ∫_{-sin^{-1}r}^{π/2} (1/2π) ρ e^{-ρ²/2} dθ dρ
= ¼ + (1/2π) sin^{-1} r.
(b) P(0 < U < V) = P(0 < V < U) = ½P(0 < U, 0 < V), by symmetry.
(c) max(U, V) = max{X, rX + (1 - r²)^{1/2}Y}. The line
y = x(1 - r)/(1 - r²)^{1/2}
divides the plane into two regions; above the line
x < rx + (1 - r²)^{1/2}y,
and below the line the inequality is reversed. In polars this line is given by
tan θ = {(1 - r)/(1 + r)}^{1/2}.
Note that
sin θ = {(1 - r)/2}^{1/2}, cos θ = {(1 + r)/2}^{1/2}.
Hence, integrating ρ²e^{-ρ²/2}/(2π) against the angular contribution of each region,
E{max(U, V)} = {1/√(2π)} × √2 (1 - r)^{1/2} = {(1 - r)/π}^{1/2}.
!
2 2
1
3. cot1 . To see this, recall that U and V are independent if and only if they are
2 2r
uncorrelated, which is to say that 0 E(UV ) (EY 2 EX 2 )12 sin 2 EXY cos 2.
Section 6.13
1. (a) The inverse is x = u, y = v/u, so
J = [[1, 0], [-v/u², 1/u]] and |J| = |u|^{-1}.
The density of V is the marginal ∫ |u|^{-1} f(u, v/u) du.
(b) In this case |J| = |z|.
(c) Use (b), or use the circular symmetry and problem 16 of section 5.12.
(d) Using (a) we have
1 v
f V (v) f (u, v) du f X (u) f Y du
juj u
1 1
2 2 2
1 e u =2 (u 2 v2 )1=2 u du 1 e v =2 e y =2 dy
z 0
2. x = uv, y = u - uv, |J| = |u|.
f(u, v) = f_X(uv) f_Y(u - uv)|u|
= {λ^n (uv)^{n-1} e^{-λuv}/(n - 1)!} × {λ^m (u(1 - v))^{m-1} e^{-λ(u-uv)}/(m - 1)!} × |u|/u? No:
= {λ^{n+m}/((n - 1)!(m - 1)!)} (1 - v)^{m-1} v^{n-1} u^{m+n-1} e^{-λu}, 0 ≤ v ≤ 1, u ≥ 0.
As this factorizes, U and V are independent.
Section 6.15
1. (a) Expected number to get one head is 2.
(b) Exploit the symmetry of the binomial B(2n + 1, ½) distribution.
2. X 1 S T , where S is geometric with parameter 23, and T is geometric (and independent
of S) with parameter 13. By the convolution rule,
P
p X (x) P(S T x 1) x2 r1 P(S r)P(T x r 1)
P x2 1 r1 2 2 x r2 1 2 x1 1 1 x1
r1 3 333 3 33 3 2 2 :
3. For any numbers x and y, by inspection we have
min{x, y} = ½(x + y) - ½|x - y|.
Hence E min{X, Y} = ½EX + ½EY - ½E|X - Y|. But min{X, Y} = 1, EX = p^{-1}, EY = q^{-1}.
Hence
E|X - Y| = 1/p + 1/q - 2 = p/q + q/p.
4. f(x, y) = x^{-1}, 0 ≤ y ≤ x ≤ 1.
(a) f_Y(y) = -log y, 0 < y ≤ 1.
(b) f_{X|Y}(x|y) = -x^{-1}/log y.
(c) E(X|Y) = (Y - 1)/log Y.
332
5. cov(X , Y ) 27 8 16 .
6. p(x, y) = {n!/((n - x)!(x - y)!y!)} (30/36)^{n-x} (5/36)^{x-y} (1/36)^y.
From this, or by direct argument, p(x|y) = C(n - y, x - y) (30/35)^{n-x} (5/35)^{x-y}, which is to say that, given Y, X - Y is binomial B(n - Y, 5/35). Hence E(X - Y|Y) = (n - Y)(5/35) and var(X - Y|Y) = (n - Y)(5/35)(30/35). Therefore E(X|Y) = (1/7)(n + 6Y), and var(X|Y) = (6/49)(n - Y).
7. This is essentially the same as problem 6:
P(faulty \ not detected) (1 )
P(faultyjnot detected) :
P(not detected as faulty) 1
(1 )
Hence, given Y , X is then binomial B b Y , .
1
8. P(meet) = 11/36; P(meet|after 12:30) = 5/9.
9. p(a, b) = ⅙, etc.; E(XY) = ⅓(ab + bc + ca), EX = ⅓(a + b + c).
10. p(u, y) = 1/36, 1 ≤ u < y ≤ 6;
(1/36)(6 - u + 1), u = y.
cov(U, X) = 35/24.
P(U = 0) = P(V = 0) = P(W = 0) = 1/8. Hence E(U) = E(V) = E(W) = 0, so the expected net gain is nil for any nomination. However, 4 var U = 525, 4 var V = 875, and 4 var W = 950, so you will choose a if you are risk-averse, but c if you are risk-friendly.
14. (a) The random variables -X_1, ..., -X_n are also independent and have the same joint distribution as X_1, ..., X_n. Thus Σ_1^n X_i has the same distribution as -Σ_1^n X_i, as required.
(b) No. For example:

    Y = 1     7/36    0       5/36
    Y = 0     2/36    4/36    6/36
    Y = -1    3/36    8/36    1/36
              X = -1  X = 0   X = 1

Here p_Y(j) = p_X(j) = ⅓, for all j, and each is symmetric about zero, but
P(X + Y = 2) = 5/36 ≠ 3/36 = P(X + Y = -2).
Finally, since there are 52 individuals, S + 2D + 3M ≤ 52, and taking expectations shows EM ≤ 0.2; now use P(M > 0) ≤ EM, which is obvious (and in any case follows from Markov's inequality).
34. These are just special cases of the coupon collector's problem. Use your calculator.
38. {2Φ(z/√2) - 1}² = F_Z(z). Differentiate for density.
E(Z|X > 0, Y > 0) = E(X|X > 0, Y > 0) + E(Y|X > 0, Y > 0)
= 2E(X|X > 0).
43. Use indicators: I_k indicates a head on the kth flip; J_k indicates that the (k - 1)th and kth flips are different. Then X = Σ_1^n I_k, R = 1 + Σ_2^n J_k, and so E(XR) = E{Σ_{k=1}^n I_k (1 + Σ_{k=2}^n J_k)}.
Now calculate EI_k = p, EJ_k = 2pq, E(I_k J_k) = qp, E(I_k J_{k+1}) = pq, and so on.
Section 7.2
1. From (9), G^{(2)}(1) = E{X(X - 1)} = EX² - EX = var X + (EX)² - EX. Now use G′(1) = EX.
2. (a) Σ_{k=1}^∞ q^{k-1} p s^k = ps Σ_{k=1}^∞ (qs)^{k-1} = ps/(1 - qs); (b) p/(1 - qs).
3. (a) ⅙(s - s⁷)/(1 - s) = ⅙ Σ_{r=1}^{6} s^r, so p(r) = ⅙; this is a die.
(b) p(0) = q, p(1) = p. This is an indicator, or a Bernoulli trial.
4. G_Y(s) = E s^Y = E s^{aX+b} = s^b E{(s^a)^X} = s^b G_X(s^a).
EY = G′_Y(1) = [b s^{b-1} G_X(s^a) + a s^{a+b-1} G′_X(s^a)]_{s=1} = b + aEX.
5. 2(1 e t te t )t 2 .
6. (1 ) pe t (1 qe t )1 ;
EX (1 ) p1 , var X (1 )( q) p2 .
Section 7.3
1. Ee^{tY} = Ee^{t(a+bX)} = e^{at} M_X(bt). We know that if X is N(0, 1), then μ + σX is N(μ, σ²). Hence
M_Z(t) = M_X(σt) e^{μt} = e^{μt} exp{½(σt)²}.
2. M_X(t) = (e^{at} − 1)/(at) = Σ_{r=0}^∞ (at)^r/(r + 1)! = Σ_{r=0}^∞ (EX^r) t^r/r!, so EX^r = a^r/(r + 1).
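A quick numeric sketch of solution 2: for X uniform on (0, a), the moments EX^r = a^r/(r + 1) are checked against a midpoint Riemann sum, and the m.g.f. against its series (a = 2 and t = 0.7 are arbitrary test values):

```python
import math

a = 2.0
N = 100000
xs = [(i + 0.5) * a / N for i in range(N)]  # midpoint grid on (0, a)

# EX^r = a^r/(r + 1) for the first few moments.
for r in range(5):
    moment = sum(x**r for x in xs) / N
    assert abs(moment - a**r / (r + 1)) < 1e-6

# M_X(t) = (e^{at} - 1)/(at) agrees with the series sum (at)^r/(r+1)!.
t = 0.7
M = (math.exp(a * t) - 1) / (a * t)
series = sum((a * t)**r / math.factorial(r + 1) for r in range(60))
assert abs(M - series) < 1e-12
```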
3. M_X(t) = ∫₀^∞ e^{tx} λ^r x^{r−1} e^{−λx}/(r − 1)! dx. Now set y = (λ − t)x.
4. (a) G′(1) = [np(q + ps)^{n−1}]_{s=1} = np, G″(1) = n²p² − np².
(b) G′(s) = p/(1 − qs)², G″(s) = 2pq/(1 − qs)³, so
var X = 2q/p² + 1/p − 1/p² = q/p².
(c) M′(t) = rλ^r(λ − t)^{−r−1}, M″(t) = r(r + 1)λ^r(λ − t)^{−r−2}, so
var X = r(r + 1)/λ² − r²/λ² = r/λ².
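The formula var X = G″(1) + G′(1) − G′(1)² behind part (b) can be checked by finite differences on the geometric p.g.f.; a minimal sketch, with p = 0.35 as an arbitrary test value:

```python
# Geometric p.g.f. G(s) = ps/(1 - qs); check var X = q/p^2 numerically.
p = 0.35
q = 1 - p

def G(s):
    return p * s / (1 - q * s)

h = 1e-4
G1 = (G(1 + h) - G(1 - h)) / (2 * h)          # G'(1) = 1/p
G2 = (G(1 + h) - 2 * G(1) + G(1 - h)) / h**2  # G''(1) = 2q/p^2
var = G2 + G1 - G1 * G1                       # var X = G''(1) + G'(1) - G'(1)^2

assert abs(G1 - 1 / p) < 1e-4
assert abs(G2 - 2 * q / p**2) < 1e-2
assert abs(var - q / p**2) < 1e-2
```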
Section 7.4
1. exp{(λ + μ)(s − 1)}.
2. (q_n + p_n s)^n = {1 + (s − 1)(λ/n)}^n → exp{λ(s − 1)}.
3. (1/3)(s + s³ + s⁵) · (1/2)(1 + s) = (1/6)(s + s² + s³ + s⁴ + s⁵ + s⁶), so the distribution is the same as that of the sum of two conventional fair dice, namely triangular.
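The polynomial identity can be confirmed by convolving coefficient lists (index = power of s); squaring the resulting fair-die p.g.f. then exhibits the triangular distribution of two dice:

```python
from fractions import Fraction as F

def mul(A, B):
    """Multiply two p.g.f.s given as coefficient lists (index = power of s)."""
    C = [F(0)] * (len(A) + len(B) - 1)
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            C[i + j] += a * b
    return C

odd = [F(0), F(1, 3), F(0), F(1, 3), F(0), F(1, 3)]  # (1/3)(s + s^3 + s^5)
coin = [F(1, 2), F(1, 2)]                            # (1/2)(1 + s)

die = mul(odd, coin)
assert die == [F(0)] + [F(1, 6)] * 6  # (1/6)(s + s^2 + ... + s^6): a fair die

two = mul(die, die)  # sum of two fair dice: triangular on 2..12
assert two[2:8] == [F(k, 36) for k in range(1, 7)]
assert two[7:13] == [F(k, 36) for k in range(6, 0, -1)]
```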
Section 7.5
1. Es^X = −Σ_k (sp)^k/{k log(1 − p)} = log(1 − sp)/log(1 − p), so
Es^T = G_N(G_X(s)) = e^{−λ} exp{λ log(1 − sp)/log(1 − p)} = {(1 − p)/(1 − ps)}^{−λ/log(1−p)}.
This is a negative binomial p.g.f., since a negative power of (1 − x) is expanded in series by the negative binomial theorem.
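The compounding identity Es^T = G_N(G_X(s)) can be spot-checked numerically against the negative binomial closed form; λ and p below are arbitrary test values, with G_N Poisson and G_X logarithmic:

```python
import math

lam, p = 1.7, 0.3

def G_X(s):
    # p.g.f. of the logarithmic distribution on {1, 2, ...}
    return math.log(1 - s * p) / math.log(1 - p)

def G_N(s):
    # p.g.f. of Poisson(lam)
    return math.exp(lam * (s - 1))

def G_T(s):
    # claimed negative binomial form for the compound sum
    return ((1 - p) / (1 - p * s)) ** (-lam / math.log(1 - p))

for s in (0.0, 0.3, 0.6, 0.99):
    assert abs(G_N(G_X(s)) - G_T(s)) < 1e-12
```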
Section 7.6
1. nμ = 2000, σ√n = 100/√6. So
P(1900 < X < 2200) = P((1900 − 2000)/(100/√6) < (X − 2000)/(100/√6) < (2200 − 2000)/(100/√6)).
Now use the central limit theorem.
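A numeric sketch of the resulting approximation, with Φ computed from math.erf; the standardized bounds work out to −√6 and 2√6:

```python
import math

def Phi(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sd = 2000, 100 / math.sqrt(6)
lo = (1900 - mu) / sd   # = -sqrt(6), about -2.45
hi = (2200 - mu) / sd   # = 2*sqrt(6), about 4.90
approx = Phi(hi) - Phi(lo)

assert abs(lo + math.sqrt(6)) < 1e-9
assert abs(hi - 2 * math.sqrt(6)) < 1e-9
assert 0.99 < approx < 0.995  # the interval is very likely under the CLT
```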
2. nμ = 0; σ√n = 3. Hence
P(−3 < error < 3) ≈ Φ(1) − Φ(−1), P(−6 < error < 6) ≈ Φ(2) − Φ(−2).
So (a) 2{1 − Φ(1)} ≈ 0.32, (b) 2{1 − Φ(2)} ≈ 0.04.
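The quoted values follow from Φ(1) ≈ 0.8413 and Φ(2) ≈ 0.9772; a one-line check via math.erf:

```python
import math

def Phi(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

a = 2 * (1 - Phi(1))  # two-sided tail beyond 1 s.d.: about 0.317
b = 2 * (1 - Phi(2))  # two-sided tail beyond 2 s.d.: about 0.046

assert abs(a - 0.3173) < 1e-3
assert abs(b - 0.0455) < 1e-3
```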
3. If λ is an integer then X has the same distribution as the sum of λ independent Poisson random variables, each with parameter 1. The central limit theorem applies to this sum. Y has the same distribution as the sum of r independent exponential random variables, each with parameter 1.
Section 7.7
1. (a) [Es^{T_1}]_{s=1} = {1 − (1 − 4pq)^{1/2}}/(2q) = {1 − (1 − 2q)}/(2q) = 1.
We have taken the positive root (1 − 4pq)^{1/2} = 1 − 2q, because q < 1/2. Differentiating yields ET_1 = (p − q)^{−1}; otherwise, you can write ET_1 = E{E(T_1 | X_1)} = p + q{E(T_1 | X_1 = −1) + 1} = p + q{2ET_1 + 1}, so ET_1 = (1 − 2q)^{−1} = (p − q)^{−1}.
(b) [Es^{T_1}]_{s=1} = 1, but the derivative at s = 1 is infinite.
(c) Es^{X_1} = ps + qs^{−1}, so Es^{S_n} = (ps + qs^{−1})^n.
(d) G_Y(ps + qs^{−1}).
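Taking the first-passage p.g.f. in its standard form G(s) = {1 − (1 − 4pqs²)^{1/2}}/(2qs) (an assumption here, matching section 7.7), the claims in (a) can be checked numerically:

```python
import math

p = 0.7
q = 1 - p  # q < 1/2, so passage to +1 is certain

def G(s):
    # assumed standard first-passage p.g.f. for the simple random walk
    return (1 - math.sqrt(1 - 4 * p * q * s * s)) / (2 * q * s)

# G(1) = {1 - (1 - 2q)}/(2q) = 1: the walk reaches +1 with probability 1.
assert abs(G(1) - 1) < 1e-12

# ET_1 = G'(1) = 1/(p - q), by a central finite difference.
h = 1e-6
ET = (G(1 + h) - G(1 - h)) / (2 * h)
assert abs(ET - 1 / (p - q)) < 1e-3  # = 2.5 for p = 0.7
```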
2. From (14), with n = 3, P(N is even) = ½{G_N(1) + G_N(−1)} = ½(1 − 1/5) = 2/5.
3. For equal chances, they need p such that G_N(−1) = 0, which is impossible.
4. E(s^{X_n} t^{Y_n} u^{Z_n} v^{n−X_n−Y_n−Z_n}) = {ps + qt + ru + (1 − p − q − r)v}^n.
Section 7.10
1. Like example 5, except that
Es^T = {(2/3)s/(1 − (1/3)s)} · {(1/3)s/(1 − (2/3)s)}.
Now use partial fractions.
2. Es^X = {αs + (1 − α)βs² + (1 − α)(1 − β)γs³}/{1 − (1 − α)(1 − β)(1 − γ)s³},
and we need the sum of the coefficients of powers of s³, which is
(1 − α)(1 − β)γ/{1 − (1 − α)(1 − β)(1 − γ)}.
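The Greek parameters are illegible in this transcription; writing them as α, β, γ (arbitrary test values below), the closed form can be checked by the roots-of-unity filter, which extracts exactly the coefficients of powers of s³:

```python
import cmath

a, b, c = 0.5, 0.3, 0.6  # stand-ins for the three illegible parameters

def G(s):
    # Es^X as displayed above
    num = a * s + (1 - a) * b * s**2 + (1 - a) * (1 - b) * c * s**3
    den = 1 - (1 - a) * (1 - b) * (1 - c) * s**3
    return num / den

# Sum of the coefficients of s^0, s^3, s^6, ...:
# (1/3){G(1) + G(w) + G(w^2)} with w a primitive cube root of unity.
w = cmath.exp(2j * cmath.pi / 3)
filtered = (G(1) + G(w) + G(w * w)) / 3

closed = (1 - a) * (1 - b) * c / (1 - (1 - a) * (1 - b) * (1 - c))
assert abs(filtered - closed) < 1e-12
```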
3. Es^{X_n−n} = p_n^n/(1 − q_n s)^n = (1 − λ/n)^n/(1 − λs/n)^n → e^{−λ}e^{λs} = e^{λ(s−1)}.
4. (a) k(t) = λ(e^t − 1), so k_r = λ.
(b) k(t) = μt + ½σ²t², so k_1 = μ, k_2 = σ², k_r = 0, r ≥ 3.
(c) k(t) = −log(1 − t/λ); so k_r = (r − 1)!λ^{−r}.
5. P(attempt all questions)
= P(Σ_{r=1}^{20} X_r ≤ 60) = P{(Σ_{r=1}^{20} X_r − 58)/116^{1/2} ≤ 2/116^{1/2}}.
Hence
G = q²s²/(1 − pst − pqs²t).
Then G_D(s) = G(s, 1) and G_S(t) = G(1, t). Some plodding gives cov(D, S) = p(p − q)q^{−4}.
21. (a) 1/(1 − t²), |t| < 1; (b) 1/cos(½πt), |t| < 1.
(c) Set e^{−x} = y in M(t) = ∫_{−∞}^∞ e^{tx} e^{−x} e^{−e^{−x}} dx, to obtain M(t) = ∫₀^∞ y^{−t} e^{−y} dy = Γ(1 − t),
where the gamma function is defined by Γ(x) = ∫₀^∞ e^{−y} y^{x−1} dy, for x > 0.
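Part (c) can be spot-checked by numerical integration; a plain midpoint rule suffices, since the integrand decays rapidly in both tails (t = 0.3 is an arbitrary test value):

```python
import math

# Check M(t) = Gamma(1 - t) for the density f(x) = e^{-x} e^{-e^{-x}}.
t = 0.3
a, b, N = -10.0, 60.0, 200000
h = (b - a) / N
total = 0.0
for i in range(N):
    x = a + (i + 0.5) * h
    # integrand e^{tx} f(x), written as a single exponential
    total += math.exp(t * x - x - math.exp(-x))
total *= h

assert abs(total - math.gamma(1 - t)) < 1e-6
```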
22. G(x, y, z) = ¼(xyz + x + y + z). Hence
G(x, y) = ¼(xy + y + x + 1) = ½(1 + x) · ½(1 + y) = G(x)G(y),
and so on.
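The p.g.f. ¼(xyz + x + y + z) puts mass ¼ on each of (1,1,1), (1,0,0), (0,1,0), (0,0,1); enumeration confirms pairwise independence without full independence:

```python
from fractions import Fraction as F
from itertools import product

pmf = {(1, 1, 1): F(1, 4), (1, 0, 0): F(1, 4),
       (0, 1, 0): F(1, 4), (0, 0, 1): F(1, 4)}

def P(event):
    return sum(w for v, w in pmf.items() if event(v))

# Pairwise independent: P(X = i, Y = j) = P(X = i)P(Y = j) for all i, j...
for i, j in product((0, 1), repeat=2):
    lhs = P(lambda v: v[0] == i and v[1] == j)
    assert lhs == P(lambda v: v[0] == i) * P(lambda v: v[1] == j)

# ...but not independent: P(X = Y = Z = 1) = 1/4, while P(X = 1)^3 = 1/8.
assert P(lambda v: v == (1, 1, 1)) == F(1, 4)
assert P(lambda v: v[0] == 1) == F(1, 2)
```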
23. (a) Let (X_n, Y_n) be the position of the walk after n steps, and let U_n = X_n + Y_n. By inspection, U_n performs a simple random walk with p = q = ½, so by example 7.7.1 the first result follows.
(b) Let V_n = X_n − Y_n. It is easy to show that V_n performs a simple symmetric random walk that is independent of U_n, and hence also independent of T. The result follows from exercise 1(d) at the end of section 7.7.
24. Condition on the first step. This leads to
Es^T = (p² + q²)s + 2p²q²s²/{1 − (1 − pq)s}.
Differentiate the equations, or argue directly, to get ET = 1 + 2pq ES and ES = 1 + (1 − pq)ES.
Remember to look at the contents for larger topics. Abbreviations used in this index: m.g.f. = moment generating function; p.g.f. = probability generating function.