Probability and Statistics

UNIVERSITY OF ZIMBABWE
MATHEMATICAL METHODS FOR PHYSICS/METEOROLOGY/FORENSIC PHYSICS 2
HIPH203/HMCS102/HMPH103/HMS204/HFOSCP203/HSST103
Lecturer : Mr. T. Mazikana

Course title : Math Methods 2
Department : Mathematics and Computational Sciences
Course duration : 1 Semester
Probability and Statistics.
Chapter 1
Introduction to Probability Theory
1.1 Probability
Probability theory provides a mathematical foundation for concepts such as “probability”, “information”,
“belief”, “uncertainty”, “confidence”, “randomness”, “variability”, “chance” and “risk”.
Probability theory is important to empirical scientists because it gives them a rational framework
to make inferences and test hypotheses based on uncertain empirical data. Probability theory is
also useful to engineers building systems that have to operate intelligently in an uncertain world.
For example, some of the most successful approaches in machine perception (e.g., automatic speech
recognition, computer vision) and artificial intelligence are based on probabilistic models. Moreover,
probability theory is also proving very valuable as a theoretical framework for scientists trying
to understand how the brain works. Many computational neuroscientists think of the brain as a
probabilistic computer built with unreliable components, i.e., neurons, and use probability theory
as a guiding framework to understand the principles of computation used by the brain. Consider
the following examples:
• You need to decide whether a coin is loaded (i.e., whether it tends to favor one side over the
other when tossed). You toss the coin 6 times and in all cases you get “Tails”. Would you
say that the coin is loaded?
• You are trying to figure out whether newborn babies can distinguish green from red. To do
so you present two colored cards (one green, one red) to 6 newborn babies. You make sure
that the 2 cards have equal overall luminance so that they are indistinguishable if recorded
by a black and white camera. The 6 babies are randomly divided into two groups. The first
group gets the red card on the left visual field, and the second group on the right visual field.
You find that all 6 babies look longer to the red card than the green card. Would you say
that babies can distinguish red from green?
• A pregnancy test has 99% sensitivity (i.e., 99 out of 100 pregnant women test positive) and 95%
specificity (i.e., 95 out of 100 non-pregnant women test negative). A woman believes she
has a 10% chance of being pregnant. She takes the test and tests positive. How should she
combine her prior beliefs with the results of the test?
• You need to design a system that detects a sinusoidal tone of 1000Hz in the presence of white
noise. How should you design the system to solve this task optimally?
• How should the photoreceptors in the human retina be interconnected to maximize information
transmission to the brain?
While these tasks appear different from each other, they all share a common problem: The need to
combine different sources of uncertain information to make rational decisions. Probability theory
provides a very powerful mathematical framework to do so. We now go into the mathematical
aspects of probability theory.
1.2 Sample Spaces
A set S that consists of all possible outcomes of a random experiment is called a sample space,
and each outcome is called a sample point. Often there will be more than one sample space that
can describe outcomes of an experiment, but there is usually only one that will provide the most
information.
Example 1.2.1. If we toss a die, then one sample space is given by {1, 2, 3, 4, 5, 6} while another is
{even, odd}. It is clear, however, that the latter would not be adequate to determine, for example,
whether an outcome is divisible by 3.
The sample space is also called the outcome space, reference set, and universal set. It is
often useful to portray a sample space graphically. In such cases, it is desirable to use numbers in
place of letters whenever possible. If a sample space has a finite number of points, it is called a
finite sample space. If it has as many points as there are natural numbers 1, 2, 3, . . . , it is called a
countably infinite sample space. If it has as many points as there are in some interval on the x axis,
such as 0 ≤ x ≤ 1, it is called a noncountably infinite sample space. A sample space that is finite
or countably infinite is often called a discrete sample space, while one that is noncountably infinite
is called a nondiscrete sample space.
Example 1.2.2. The sample space resulting from tossing a die yields a discrete sample space.
However, picking any number, not just integers, from 1 to 10, yields a non-discrete sample space.
1.3 Events
We have defined outcomes as the elements of a sample space S. In practice, we are interested in
assigning probability values not only to outcomes but also to sets of outcomes. For example, we
may want to know the probability of getting an even number when rolling a die. In other words,
we want the probability of the set {2, 4, 6}. An event is a subset A of the sample space S, i.e., it
is a set of possible outcomes. If the outcome of an experiment is an element of A, we say that the
event A has occurred. An event consisting of a single point of S is called a simple or elementary
event.
As particular events, we have S itself, which is the sure or certain event since an element of S must
occur, and the empty set ∅, which is called the impossible event because an element of ∅ cannot
occur.
By using set operations on events in S, we can obtain other events in S. For example, if A and B
are events, then
1. A ∪ B is the event “either A or B or both.” A ∪ B is called the union of A and B.
2. A ∩ B is the event “both A and B.” A ∩ B is called the intersection of A and B.
3. A′ is the event “not A.” A′ is called the complement of A.
4. A − B = A ∩ B′ is the event “A but not B.” In particular, A′ = S − A.
If the sets corresponding to events A and B are disjoint, i.e., A ∩ B = ∅, we often say that the
events are mutually exclusive. This means that they cannot both occur. We say that a collection
of events A1 , A2 , . . . , An is mutually exclusive if every pair in the collection is mutually exclusive.
Example 1.3.1. Consider an experiment of tossing a coin twice, let A be the event “at least one
head occurs” and B the event “the second toss results in a tail.” Find the events A ∪ B, A ∩ B, A′
and A − B.
Solution: We observe that A = {HT, TH, HH}, B = {HT, TT} and so we have
A ∪ B = {HT, TH, HH, TT} = S,
A ∩ B = {HT},
A′ = {TT},
A − B = {TH, HH}.
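The set operations in this example map directly onto Python's built-in set type. The sketch below is illustrative only: the variable names are our own, and each outcome is written as a two-letter string such as "HT".

```python
from itertools import product

# Sample space for two coin tosses: HH, HT, TH, TT
S = {"".join(p) for p in product("HT", repeat=2)}

A = {s for s in S if "H" in s}       # at least one head occurs
B = {s for s in S if s[1] == "T"}    # the second toss results in a tail

union = A | B            # "either A or B or both"
intersection = A & B     # "both A and B"
complement_A = S - A     # "not A"
difference = A - B       # "A but not B"

print(sorted(union), sorted(intersection), sorted(complement_A), sorted(difference))
```

Because events are just subsets of S, every event operation in this section is a single set expression.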
1.4 The Concept of Probability
In any random experiment there is always uncertainty as to whether a particular event will or will
not occur. As a measure of the chance, or probability, with which we can expect the event to occur,
it is convenient to assign a number between 0 and 1. If we are sure or certain that an event will
occur, we say that its probability is 100% or 1. If we are sure that the event will not occur, we
say that its probability is zero. If, for example, the probability is 1/4, we would say that there is
a 25% chance it will occur and a 75% chance that it will not occur. Equivalently, we can say that
the odds against occurrence are 75% to 25%, or 3 to 1.
There are two important procedures by means of which we can estimate the probability of an
event.
1. CLASSICAL APPROACH: If an event can occur in h different ways out of a total of n
possible ways, all of which are equally likely, then the probability of the event is h/n.
Example 1.4.1. Suppose we want to know the probability that a head will turn up in a single
toss of a coin. Since there are two equally likely ways in which the coin can come up, namely
heads and tails (assuming it does not roll away or stand on its edge), and of these two ways
a head can arise in only one way, we reason that the required probability is 1/2. In arriving
at this, we assume that the coin is fair, i.e., not loaded in any way.
2. FREQUENCY APPROACH: If after n repetitions of an experiment, where n is very
large, an event is observed to occur in h of these, then the probability of the event is h/n.
This is also called the empirical probability of the event.
Example 1.4.2. If we toss a coin 1000 times and find that it comes up heads 532 times, we
estimate the probability of a head coming up to be 532/1000 = 0.532.
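The frequency approach can be tried out with a simulated coin. The following is a minimal sketch, assuming a fair coin modelled by Python's pseudo-random generator; the function name and the seed are our own choices, made so that the run is repeatable.

```python
import random

def relative_frequency(n_tosses, seed=0):
    """Estimate P(head) as h/n from n simulated tosses of a fair coin."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The estimate h/n tends to settle near 0.5 as n grows.
for n in (10, 1000, 100000):
    print(n, relative_frequency(n))
```

For small n the estimate can be far from 0.5, which illustrates why the "large number" of repetitions matters.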
Both the classical and frequency approaches have serious drawbacks, the first because the words
“equally likely” are vague and the second because the “large number” involved is vague. Because
of these difficulties, mathematicians have been led to an axiomatic approach to probability.
1.5 The Axioms of Probability
Suppose we have a sample space S. If S is discrete, all subsets correspond to events and conversely;
if S is nondiscrete, only special subsets (called measurable) correspond to events. To each event A
in the class C of events, we associate a real number P (A). The function P is called a probability function,
and P (A) the probability of the event, if the following axioms are satisfied.
Axiom 1. For every event A in class C, P (A) ≥ 0

Axiom 2. For the sure or certain event S in the class C, P (S) = 1
Axiom 3. For any number of mutually exclusive events A1 , A2 , . . . , in the class C,
P (A1 ∪ A2 ∪ . . .) = P (A1 ) + P (A2 ) + . . . In particular, for two mutually exclusive events A1
and A2 , P (A1 ∪ A2 ) = P (A1 ) + P (A2 ).
1.6 Some Important Theorems on Probability
From the above axioms we can now prove various theorems on probability that are important in
further work.
Theorem 1.6.1. If A1 ⊂ A2 , then P (A1 ) ≤ P (A2 ) and P (A2 − A1 ) = P (A2 ) − P (A1 ).
Theorem 1.6.2. For every event A, 0 ≤ P (A) ≤ 1, i.e., a probability always lies between 0 and 1.
Theorem 1.6.3. For ∅, the empty set, P (∅) = 0, i.e., the impossible event has probability zero.
Theorem 1.6.4. If A′ is the complement of A, then P (A′) = 1 − P (A).
Theorem 1.6.5. If A = A1 ∪ A2 ∪ A3 ∪ . . . ∪ An , where A1 , A2 , . . . , An are mutually exclusive
events, then
P (A) = P (A1 ) + P (A2 ) + P (A3 ) + . . . + P (An ).
In particular, if A = S, the sample space, then
P (A1 ) + P (A2 ) + P (A3 ) + . . . + P (An ) = 1.
Theorem 1.6.6. If A and B are any two events, then P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
More generally, if A1 , A2 , A3 are any three events, then
P (A1 ∪A2 ∪A3 ) = P (A1 )+P (A2 )+P (A3 )−P (A1 ∩A2 )−P (A2 ∩A3 )−P (A3 ∩A1 )+P (A1 ∩A2 ∩A3 ).
Generalizations to n events can also be made.
Theorem 1.6.7. For any events A and B, P (A) = P (A ∩ B) + P (A ∩ B′).
Theorem 1.6.8. If an event A must result in the occurrence of one of the mutually exclusive events
A1 , A2 , . . . , An , then
P (A) = P (A ∩ A1 ) + P (A ∩ A2 ) + · · · + P (A ∩ An ).
1.7 Assignment of Probabilities
If a sample space S consists of a finite number of outcomes a1 , a2 , . . . , an , then by Theorem 1.6.5,
P (A1 ) + P (A2 ) + . . . + P (An ) = 1
where A1 , A2 , . . . , An are elementary events given by Ai = {ai }.
It follows that we can arbitrarily choose any nonnegative numbers for the probabilities of these
simple events as long as the previous equation is satisfied. In particular, if we assume equal
probabilities for all simple events, then
\[ P(A_k) = \frac{1}{n}, \qquad k = 1, 2, \ldots, n. \]
And if A is any event made up of h such simple events, we have
\[ P(A) = \frac{h}{n}. \]
This is equivalent to the classical approach to probability. We could of course use other procedures
for assigning probabilities, such as the frequency approach.
Assigning probabilities provides a mathematical model, the success of which must be tested by
experiment in much the same manner that theories in physics or other sciences must be tested
by experiment.
Example 1.7.1. A single die is tossed once. Find the probability of a 2 or 5 turning up.
Solution: The sample space is S = {1, 2, 3, 4, 5, 6}. If we assign equal probabilities to the sample
points, i.e., if we assume that the die is fair, then
\[ P(1) = P(2) = \cdots = P(6) = \frac{1}{6}. \]
The event that either 2 or 5 turns up is indicated by 2 ∪ 5. Therefore,
\[ P(2 \cup 5) = P(2) + P(5) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}. \]
1.8 Conditional Probability
Let A and B be two events such that P (A) > 0. Denote by P (B|A) the probability of B given that A
has occurred. Since A is known to have occurred, it becomes the new sample space replacing the
original S. From this we are led to the definition
\[ P(B|A) \equiv \frac{P(A \cap B)}{P(A)} \tag{1.8.1} \]
or
\[ P(A \cap B) \equiv P(A)\,P(B|A). \tag{1.8.2} \]
In words, this is saying that the probability that both A and B occur is equal to the probability
that A occurs times the probability that B occurs given that A has occurred. We call P (B|A)
the conditional probability of B given A, i.e., the probability that B will occur given that A
has occurred. It is easy to show that conditional probability satisfies the axioms of probability
previously discussed.
Example 1.8.1. Find the probability that a single toss of a die will result in a number less than 4
if
(a) no other information is given and

(b) it is given that the toss resulted in an odd number.
Solution:
(a) Let B denote the event {less than 4}. Since B is the union of the events 1, 2, or 3 turning
up, we see by Theorem 1.6.5 that
\[ P(B) = P(1) + P(2) + P(3) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}, \]
assuming equal probabilities for the sample points.
(b) Letting A be the event {odd number}, we see that P (A) = 3/6 = 1/2. Also, P (A ∩ B) = 2/6 = 1/3.
Then
\[ P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{1/3}{1/2} = \frac{2}{3}. \]
Hence, the added knowledge that the toss results in an odd number raises the probability from
1/2 to 2/3.
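For a finite sample space with equally likely outcomes, both the classical probability h/n and the conditional probability in (1.8.1) can be computed by counting. The sketch below is a hedged illustration for the fair die; the helper names `prob` and `cond_prob` are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

S = set(range(1, 7))                 # fair die, equally likely outcomes

def prob(event):
    """Classical probability h/n of an event (a subset of S)."""
    return Fraction(len(event & S), len(S))

def cond_prob(b, a):
    """P(B|A) = P(A ∩ B) / P(A), defined only when P(A) > 0."""
    return prob(a & b) / prob(a)

B = {x for x in S if x < 4}          # less than 4
A = {x for x in S if x % 2 == 1}     # odd number

print(prob(B), cond_prob(B, A))
```

Conditioning on A amounts to counting inside A only, which is the sense in which A "becomes the new sample space".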
1.9 Theorems on Conditional Probability
Theorem 1.9.1. For any three events A1 , A2 , A3 , we have
P (A1 ∩ A2 ∩ A3 ) = P (A1 )P (A2 |A1 )P (A3 |A1 ∩ A2 ). (1.9.1)
In words, the probability that A1 and A2 and A3 all occur is equal to the probability that A1
occurs times the probability that A2 occurs given that A1 has occurred times the probability that
A3 occurs given that both A1 and A2 have occurred. The result is easily generalized to n events.
Theorem 1.9.2. If an event A must result in one of the mutually exclusive events A1 , A2 , . . . , An ,
then
P (A) = P (A1 )P (A|A1 ) + P (A2 )P (A|A2 ) + . . . + P (An )P (A|An ). (1.9.2)
1.10 Independent Events
If P (B|A) = P (B), i.e., the probability of B occurring is not affected by the occurrence or
nonoccurrence of A, then we say that A and B are independent events. This is equivalent to
P (A ∩ B) = P (A)P (B). (1.10.1)
Conversely, if this equation holds, then A and B are independent.
We say that three events A1 , A2 , A3 are independent if they are pairwise independent,
P (Aj ∩ Ak ) = P (Aj )P (Ak ), j ≠ k, where j, k = 1, 2, 3, (1.10.2)
and
P (A1 ∩ A2 ∩ A3 ) = P (A1 )P (A2 )P (A3 ). (1.10.3)
Both of these properties must hold in order for the events to be independent. Independence of
more than three events is easily defined.
Note: In order to use this multiplication rule, all of your events must be independent.
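To see why both (1.10.2) and (1.10.3) are required, consider two tosses of a fair coin with the three events defined in the sketch below. This example is our own illustration, not part of the original notes: the events are pairwise independent, yet the product rule for all three fails.

```python
from fractions import Fraction
from itertools import product

S = set(product("HT", repeat=2))     # two tosses of a fair coin

def prob(event):
    """Classical probability of an event (a subset of S)."""
    return Fraction(len(event), len(S))

A1 = {s for s in S if s[0] == "H"}   # first toss is a head
A2 = {s for s in S if s[1] == "H"}   # second toss is a head
A3 = {s for s in S if s[0] == s[1]}  # both tosses agree

# Condition (1.10.2): every pair satisfies the product rule.
pairwise = all(
    prob(x & y) == prob(x) * prob(y)
    for x, y in ((A1, A2), (A1, A3), (A2, A3))
)
# Condition (1.10.3): the triple product rule.
triple = prob(A1 & A2 & A3) == prob(A1) * prob(A2) * prob(A3)
print(pairwise, triple)
```

Here the triple intersection has probability 1/4 while the product of the three probabilities is 1/8, so the three events are not independent.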
1.11 Bayes’ Theorem or Rule
Suppose that A1 , A2 , . . . , An are mutually exclusive events whose union is the sample space S, i.e.,
one of the events must occur. Then if A is any event, we have the following important theorem:
Theorem 1.11.1. (Bayes’ Rule):
\[ P(A_k|A) = \frac{P(A_k)\,P(A|A_k)}{\sum_{j=1}^{n} P(A_j)\,P(A|A_j)}. \tag{1.11.1} \]
This enables us to find the probabilities of the various events A1 , A2 , . . . , An that can cause A to occur.
For this reason Bayes’ theorem is often referred to as a theorem on the probability of causes.
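Bayes' rule answers the pregnancy-test question posed in Section 1.1. The sketch below is a minimal illustration using the figures quoted there (99 of 100 pregnant women test positive, 95 of 100 non-pregnant women test negative, prior belief 10%); the function name is hypothetical.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(A_k | A) for each k, given P(A_k) and P(A | A_k)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                 # P(A), by Theorem 1.9.2
    return [j / total for j in joint]

# A_1 = pregnant, A_2 = not pregnant; A = positive test
priors = [0.10, 0.90]
likelihoods = [0.99, 1 - 0.95]         # P(positive | A_k)
posterior = bayes_posterior(priors, likelihoods)
print(round(posterior[0], 4))
```

The posterior is 0.099/0.144 = 0.6875, so a positive result raises her belief from 10% to about 69%, well short of certainty because false positives among the non-pregnant are relatively common.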
Chapter 2
Introduction
2.1 What is statistics?
The word statistics has two meanings:
1. It refers to sets of data relating to a wide range of topics such as the size of populations,
production activity, retail prices, incomes, rainfall, etc.
2. It refers to the theory and methods used for the collection, description, analysis and
interpretation of numerical data.
Based on the above definitions, one can say that statistics comprises two branches:
– Descriptive statistics is concerned with summarizing and describing numerically a body
of data.
– More importantly, inferential statistics is the process of reaching generalizations about
the whole (called the population) by examining a portion or many portions (called
samples).
2.2 Statistics in engineering and scientific experimentation
Statistical methods are applied to an enormous diversity of problems in fields such as:
Agriculture (which varieties grow best?)
Genetics, Biology (selecting new varieties, species)
Economics (how are the living standards changing?)
Market Research (comparison of advertising campaigns)
Education (what is the best way to teach small children reading?)
Environmental Studies (do strong electric or magnetic fields induce higher cancer rates?)
Quality engineering
2.3 What does a statistician need to be able to do?
1. Formulate a real world problem in statistical terms.
2. Give advice on efficient data collection.
3. Analyse data effectively and extract the maximum amount of information.
4. Interpret and report the results.
2.4 Definition of terms
Below are the definitions of some common terms in statistics:
Population: the totality of all objects/items with which we are concerned.
Sample: a part of the population that is actually under study.
2.4.1 Why do we sample?
There are various reasons why statisticians use samples and some are as follows:
Cost-effective: Considering a sample is more cost-effective with respect to time, money, and
labour than considering the whole population.
Accessibility: Some members of the population may not be accessible, therefore it is only
logical to consider a sample.
Utility: In some experimental methods it will be futile to consider the whole population if
the process involves destroying the objects/items/individuals.
Precision: fewer human errors arise when handling a sample than when surveying the whole population.
2.5 Descriptive statistics/Explanatory statistics
2.5.1 Measures of central tendency
Mode: is the value or observation which occurs most frequently.
– Advantage: It is simple and straightforward to identify.
– Disadvantage: It ignores some of the collected data.
Median: The sample median is obtained by first ordering the n observations from smallest
to largest (with any repeated values included so that every sample observation appears in the
ordered list). Then,
\[ x^{*} = \begin{cases} \text{the single middle value} = \left(\dfrac{n+1}{2}\right)^{\text{th}} \text{ ordered value}, & \text{if } n \text{ is odd}, \\[6pt] \text{the average of the two middle values} = \text{average of the } \left(\dfrac{n}{2}\right)^{\text{th}} \text{ and } \left(\dfrac{n}{2}+1\right)^{\text{th}} \text{ ordered values}, & \text{if } n \text{ is even}. \end{cases} \]
Advantages of median:
– It is easy and straightforward to compute.
– It is hardly affected by extreme values in a data set.
Disadvantage(s) of median:
– It depends only on the middle value(s) of the ordered data, so it does not use the actual
magnitudes of the other observations.
Arithmetic mean: the sum of all n observations or values divided by the sample size n, that
is
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{2.5.1} \]
– Advantage: It considers all the values in the sample to find the outcome.
– Disadvantage: It is affected by extreme values in a data set.
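Variance is the sum of the squared deviations from the mean of n values divided by the
degrees of freedom (n − 1), that is
\[ \begin{aligned}
\mathrm{Var}[x_i] = s^2 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \\
&= \frac{\sum_{i=1}^{n} \left(x_i^2 - 2\bar{x}x_i + \bar{x}^2\right)}{n-1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - 2\,\dfrac{\sum_{i=1}^{n} x_i}{n}\left(\sum_{i=1}^{n} x_i\right) + n\,\dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n^2}}{n-1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}.
\end{aligned} \tag{2.5.2} \]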
Example 2.5.1. Determine the mode, median, mean and variance of the following dataset: 4, 5, 1, 4, 12, 10.
Solution
1. Mode: Since 4 is the most frequently occurring observation, it is the mode.
2. Median: To find the median, first arrange the observations in order of size, either
ascending or descending, that is,
1, 4, 4, 5, 10, 12
Thus, the median = (4 + 5)/2 = 4.5
3. Mean:
\[ \bar{x} = \frac{4 + 5 + 1 + 4 + 12 + 10}{6} = \frac{36}{6} = 6. \]
4. Variance:
\[ \mathrm{Var}[x_i] = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1} = \frac{302 - \dfrac{(36)^2}{6}}{5} = 17.2 \]
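These hand calculations can be cross-checked with Python's standard `statistics` module. The sketch below is illustrative only; `statistics.variance` is the sample variance with divisor n − 1, and the `shortcut` line evaluates the computational form (2.5.2) directly.

```python
import statistics

data = [4, 5, 1, 4, 12, 10]
n = len(data)

mode = statistics.mode(data)
median = statistics.median(data)
mean = statistics.mean(data)
variance = statistics.variance(data)   # sample variance, divisor n - 1

# Computational form (2.5.2): (sum of squares - (sum)^2 / n) / (n - 1)
shortcut = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

print(mode, median, mean, variance, shortcut)
```

Both variance computations agree, which confirms that the shortcut formula is just an algebraic rearrangement of the definition.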
2.5.2 Graphical presentation of data
Prepare your own notes on histograms, stem-and-leaf plots and box plots (focus on how to construct
these graphs, and look at their advantages and disadvantages).
Figure 2.1: Histogram sample
Homework
Explain clearly, using the following data, how each of the following is constructed:
23 29 40 28 15 22 46 39 22 17 26 33 35 49 20
36 25 15 31 17 43 54 36 30 30 40 27 24 20 28 42
22 37 17 39 17 22 9 26 29
(a) Stem and leaf plot,
(b) Histogram,
(c) Box plot.
Figure 2.2: Box plot sample
Chapter 3
Random variables
3.1 Continuous random variables in one-dimension
A discrete random variable is a random variable with a finite (or countably infinite) set of
real numbers for its range.
Examples of discrete random variables: number of scratches on a surface, proportion of
defective parts among 1000 tested, number of transmitted bits received in error.
A continuous random variable is a random variable with an interval (either finite or infinite)
of real numbers for its range.
Examples of continuous random variables: electrical current, length, pressure, temperature,
time, voltage, weight.
A continuous random variable can also be defined as a random variable that can
take any of the infinitely many values in an interval.
With continuous random variables, the interval on the real line may be open or closed,
bounded or unbounded.
For instance, the interval could be [0, 1], (0, ∞) or (−∞, ∞).
To define the probability of an event involving a continuous random variable, we cannot simply
count the number of ways the event can occur (as we can with a discrete random variable).
Rather, we introduce a function f called a probability density function.
This function must be non-negative and have the property that the area of the region bounded
by the graph of f and the x-axis, −∞ < x < +∞, is 1, that is,
\[ P(-\infty < x < +\infty) = \int_{-\infty}^{\infty} f(x)\,dx = 1. \tag{3.1.1} \]
The probability that x lies in the interval [c, d] is given by
\[ P(c \le x \le d) = \int_{c}^{d} f(x)\,dx. \tag{3.1.2} \]
Another property of a probability density function is that
\[ f(x) \ge 0, \qquad -\infty < x < +\infty. \tag{3.1.3} \]
Another way to describe the probability distribution of a random variable is to define a
function (of a real number x) that provides the probability that X is less than or equal to x, that is
F (x) = P (X ≤ x).
F (x) is known as the cumulative distribution function (cdf) of a continuous random
variable with probability density function f (x),
\[ F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du, \qquad \text{for } -\infty < x < +\infty. \tag{3.1.4} \]
Therefore
\[ \frac{dF(x)}{dx} = f(x). \tag{3.1.5} \]
Example 3.1.1. Suppose that the battery failure time, measured in hours, has a probability
density function (p.d.f)
\[ f(x) = \frac{2}{(x+1)^3}, \qquad x \ge 0. \]
(i) Determine whether this is a valid p.d.f.
(ii) Find the probability that a randomly selected battery from the warehouse will have a lifetime
less than 5 hours.
(iii) Determine the cumulative distribution function of this continuous random variable.
Solution
(i) To be a valid p.d.f over the given interval, f must have two characteristics.
1. It must be non-negative over the entire interval, and
2. Its definite integral over the interval must be exactly 1.
Since the given p.d.f is non-negative over the defined interval, we now investigate the second
characteristic, that is,
\[ P(0 \le x < \infty) = \int_{0}^{\infty} \frac{2}{(x+1)^3}\,dx = \lim_{b\to\infty} \left[ -\frac{1}{(x+1)^2} \right]_{0}^{b} = 1. \]
(ii)
\[ P(0 \le x \le 5) = \int_{0}^{5} \frac{2}{(x+1)^3}\,dx = \left[ -\frac{1}{(x+1)^2} \right]_{0}^{5} = \frac{35}{36}. \]
(iii)
\[ F(x) = \int_{0}^{x} \frac{2}{(u+1)^3}\,du = \left[ -\frac{1}{(u+1)^2} \right]_{0}^{x} = 1 - \frac{1}{(x+1)^2}, \qquad x \ge 0. \]
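The results of this example can be checked numerically. The sketch below is a rough illustration that uses a simple midpoint rule in place of exact integration; the function names and the number of steps are our own assumptions.

```python
def pdf(x):
    """Battery-failure density f(x) = 2 / (x + 1)^3 for x >= 0."""
    return 2.0 / (x + 1.0) ** 3 if x >= 0 else 0.0

def cdf(x):
    """Closed form F(x) = 1 - 1 / (x + 1)^2 for x >= 0."""
    return 1.0 - 1.0 / (x + 1.0) ** 2 if x >= 0 else 0.0

def midpoint_integral(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# P(0 <= x <= 5) three ways: numerical integral, closed-form cdf, exact fraction
p_less_than_5 = midpoint_integral(pdf, 0.0, 5.0)
print(p_less_than_5, cdf(5.0), 35 / 36)
```

The three printed values should agree to several decimal places, and `cdf(x)` approaches 1 for large x, consistent with part (i).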
3.2 Measures of central tendency and continuous random variables
Expected value or Mean: If f is a probability density function (pdf) for a continuous random
variable x over the interval [a, b], then the expected value of x is given by
\[ \mu = E(x) = \int_{a}^{b} x f(x)\,dx. \tag{3.2.1} \]
Variance and Standard Deviation: If f is a probability density function for a continuous
random variable x over the interval [a, b], then the variance of x is given by
\[ \sigma^2 = V(x) = \int_{a}^{b} x^2 f(x)\,dx - \mu^2 = E(X^2) - \mu^2, \]
where µ is the expected value of x. The standard deviation of x is
\[ \sigma = \sqrt{V(x)}. \]
Example 3.2.1. Let the continuous random variable X denote the current measured in a thin
copper wire in milliamperes. Assume that the range of X is [0, 20 mA], and assume that the
probability density function of X is
f (x) = 0.05, 0 ≤ x ≤ 20
(a) What is the probability that a current measurement is less than 10 milliamperes?
(b) Determine the expected value and the variance.
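Solution
(a)
\[ P(X < 10) = \int_{0}^{10} 0.05\,dx = 0.5. \]
(b)(i) Expected value:
\[ E(x) = \int_{0}^{20} x f(x)\,dx = \left[ \frac{0.05\,x^2}{2} \right]_{0}^{20} = 10. \]
(ii) Variance:
\[ V(x) = \int_{0}^{20} (x - 10)^2 f(x)\,dx = \left[ \frac{0.05\,(x-10)^3}{3} \right]_{0}^{20} = 33.33. \]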
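As a cross-check on part (b)(ii), the same variance follows from the shortcut formula V(x) = E(X²) − µ² given in Section 3.2. This short derivation is added for illustration:

\[ \begin{aligned}
E(X^2) &= \int_{0}^{20} 0.05\,x^2\,dx = 0.05\left[\frac{x^3}{3}\right]_{0}^{20} = \frac{400}{3}, \\
V(x) &= E(X^2) - \mu^2 = \frac{400}{3} - 10^2 = \frac{100}{3} \approx 33.33.
\end{aligned} \]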
3.3 Median and Mode of continuous random variables
Median: Another useful measure of central tendency is the median. We define the median to
be the number m such that exactly half of the probability lies below m and the other half
lies above m, that is,
P (a ≤ x ≤ m) = 0.5
Example 3.3.1. Determine the median value for the following p.d.f
\[ f(x) = \frac{1}{10}\,e^{-x/10}, \qquad x > 0. \]
Solution: Using the definition of median, we have
\[ \begin{aligned}
P(0 \le x \le m) &= \frac{1}{10}\int_{0}^{m} e^{-x/10}\,dx \\
&= \left[ -e^{-x/10} \right]_{0}^{m} \\
&= -e^{-m/10} + 1 \\
&= 0.5
\end{aligned} \]
Simplifying gives
m = −10 ln 0.5 ≈ 6.93.
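When the equation F(m) = 0.5 cannot be solved by hand, the median can be located numerically. The sketch below is one possible approach (bisection on the cdf), shown for the density of this example so that the result can be compared with the closed form; the function names and the search interval [0, 100] are assumptions.

```python
import math

def cdf(x):
    """F(x) = 1 - exp(-x/10) for the density f(x) = (1/10) exp(-x/10), x > 0."""
    return 1.0 - math.exp(-x / 10.0) if x > 0 else 0.0

def median_by_bisection(F, lo, hi, tol=1e-10):
    """Find m with F(m) = 0.5, assuming F is increasing and F(lo) < 0.5 < F(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if F(mid) < 0.5:
            lo = mid      # the median lies to the right of mid
        else:
            hi = mid      # the median lies to the left of (or at) mid
    return (lo + hi) / 2.0

m = median_by_bisection(cdf, 0.0, 100.0)
print(m, -10 * math.log(0.5))
```

Both printed numbers should agree with m ≈ 6.93 found above.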
Mode: The mode is the value that appears most often in a set of data. The mode of a
continuous probability distribution is the value x at which its probability density function
has its maximum value, so the mode is at the peak.
Tutorial #1
1. Show that the following functions are probability density functions for some k and determine
the value of k. Then determine the mean and variance of X.
(a) f (x) = kx² for 0 < x < 4. [k = 3/64, µ = 3, σ² = 3/5]
(b) f (x) = k(1 + 2x) for 0 < x < 2. [k = 1/6, µ = 11/9, σ² = 23/81]
(c) f (x) = ke^(−x) for x > 0. [k = 1, µ = 1, σ² = 1]
(d) f (x) = k for 100 < x < 100 + k, k > 0. [k = 1, µ = 201/2, σ² = 1/12]
2. Suppose that
\[ f(x) = \begin{cases} e^{-(x-6)}, & x > 6 \\ 0, & x \le 6 \end{cases} \]
Determine the following probabilities:
(a) P (X > 6) (b) P (X > 4) (c) P (6 ≤ X < 8).
3. Show that for a continuous random variable
\[ \int_{a}^{b} (x - \mu)^2 f(x)\,dx = \int_{a}^{b} x^2 f(x)\,dx - \mu^2, \qquad a \le x \le b. \]
4. The probability density function of the time to failure of an electronic component in an
airplane (in hours) is
\[ f(x) = \frac{1}{1000}\,e^{-0.001x}, \qquad x > 0. \]
(a) Determine the probability that
(i) a component lasts more than 3000 hours before failure. [e^(−3)]
(ii) a component fails in the interval from 1000 to 2000 hours. [0.2325]
(iii) a component fails before 1000 hours. [0.6321]
(b) Determine the number of hours at which 10% of all components have failed. [105.36 hrs]
3.4 Continuous random variables in two-dimensions
Joint probability distributions
Analogous to the probability density function of a single continuous random variable, a joint
probability density function can be defined over two-dimensional space.
The double integral of fXY (x, y) over a region R provides the probability that (X, Y ) assumes
a value in R.
This integral can be interpreted as volume under the surface fXY (x, y) over the region R.
Definition: A joint probability density function for the continuous random variables X and Y,
denoted as fXY (x, y), satisfies the following properties:
1.
\[ f_{XY}(x, y) \ge 0 \quad \text{for all } x, y \tag{3.4.1} \]
2.
\[ \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\,dy = 1. \tag{3.4.2} \]
3. For any region R of two-dimensional space,
\[ P([X, Y] \in R) = \iint_{R} f_{XY}(x, y)\,dx\,dy. \tag{3.4.3} \]
Example 3.4.1. A privately owned business operates both a drive-in facility and a walk-in
facility. On a randomly selected day, let X and Y, respectively, be the proportions of the time that
the drive-in and the walk-in facilities are in use, and suppose that the joint density function of
these random variables is
\[ f(x, y) = \begin{cases} \frac{2}{5}(2x + 3y), & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0, & \text{otherwise} \end{cases} \]
(a) Determine whether this is a valid p.d.f.
(b) Find P [(X, Y ) ∈ A], where A = {(x, y) | 0 < x < 0.5, 0.25 < y < 0.5}.
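Solution
(a) It can easily be verified that
fXY (x, y) ≥ 0, for all x, y.
Moreover,
\[ \begin{aligned}
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\,dy &= \int_{0}^{1}\int_{0}^{1} \frac{2}{5}(2x + 3y)\,dx\,dy \\
&= \int_{0}^{1} \left[ \frac{2x^2}{5} + \frac{6xy}{5} \right]_{x=0}^{x=1} dy \\
&= \int_{0}^{1} \left( \frac{2}{5} + \frac{6y}{5} \right) dy \\
&= \left[ \frac{2y}{5} + \frac{3y^2}{5} \right]_{0}^{1} = 1,
\end{aligned} \]
so f is a valid p.d.f.
(b) To calculate the probability, we use
\[ \begin{aligned}
P[(X, Y) \in A] &= P\left(0 < X < \tfrac{1}{2},\ \tfrac{1}{4} < Y < \tfrac{1}{2}\right) \\
&= \int_{0.25}^{0.5}\int_{0}^{0.5} \frac{2}{5}(2x + 3y)\,dx\,dy \\
&= \frac{13}{160}.
\end{aligned} \]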
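Both double integrals in this example can be approximated on a grid. The sketch below is a hedged numerical check using a midpoint rule over a rectangle; the function names and the grid size are our own choices. Because the integrand is linear in x and y, the midpoint rule is accurate here up to rounding error.

```python
def joint_pdf(x, y):
    """f(x, y) = (2/5)(2x + 3y) on the unit square, 0 elsewhere."""
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        return 0.4 * (2.0 * x + 3.0 * y)
    return 0.0

def midpoint_double_integral(f, x0, x1, y0, y1, steps=200):
    """Midpoint-rule approximation of the integral of f over a rectangle."""
    hx = (x1 - x0) / steps
    hy = (y1 - y0) / steps
    total = 0.0
    for i in range(steps):
        x = x0 + (i + 0.5) * hx
        for j in range(steps):
            y = y0 + (j + 0.5) * hy
            total += f(x, y)
    return total * hx * hy

total_mass = midpoint_double_integral(joint_pdf, 0.0, 1.0, 0.0, 1.0)
p_A = midpoint_double_integral(joint_pdf, 0.0, 0.5, 0.25, 0.5)
print(total_mass, p_A, 13 / 160)
```

The first value confirms property (3.4.2) and the second matches 13/160 = 0.08125.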
3.5 Marginal and conditional probability distributions
In probability theory and statistics, the marginal distribution of a subset of a collection of random
variables is the probability distribution of the variables contained in the subset. It gives the
probabilities of various values of the variables in the subset without reference to the values of the other
variables. This contrasts with a conditional distribution, which gives the probabilities contingent
upon the values of the other variables.
Definition: If the joint density function of continuous random variables X and Y is fXY (x, y), the
marginal probability density functions of X and Y are
\[ f_X(x) = \int_{R_x} f_{XY}(x, y)\,dy, \tag{3.5.1} \]
\[ f_Y(y) = \int_{R_y} f_{XY}(x, y)\,dx, \tag{3.5.2} \]
respectively, where Rx denotes the set of all points in the range of (X, Y ) for which X = x, and
Ry denotes the set of all points in the range of (X, Y ) for which Y = y.
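Example 3.5.1. From Example 3.4.1, the marginal distributions are as follows:
\[ f_X(x) = \int_{0}^{1} \frac{2}{5}(2x + 3y)\,dy = \left[ \frac{4xy}{5} + \frac{6y^2}{10} \right]_{y=0}^{y=1} = \frac{4x + 3}{5}, \qquad 0 \le x \le 1, \]
\[ f_Y(y) = \int_{0}^{1} \frac{2}{5}(2x + 3y)\,dx = \frac{2(1 + 3y)}{5}, \qquad 0 \le y \le 1. \]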
Definition 3.5.1. Given continuous random variables X and Y with joint probability density
function fXY (x, y), the conditional probability density function of Y given X = x is
\[ f_{Y|x}(y) = f(y|x) = \frac{f_{XY}(x, y)}{f_X(x)} \quad \text{for } f_X(x) > 0. \tag{3.5.3} \]
Similarly, the conditional probability density function of X given that Y = y is
\[ f_{X|y}(x) = f(x|y) = \frac{f_{XY}(x, y)}{f_Y(y)} \quad \text{for } f_Y(y) > 0. \tag{3.5.4} \]
3.5.1 Conditional mean and variance for continuous random variables
The conditional mean of X given Y = y, denoted as E(X|y) or µ_{X|y}, is
\[ E(X|y) = \int_{R_y} x\,f_{X|y}(x)\,dx. \tag{3.5.5} \]
The conditional mean of Y given X = x, denoted as E(Y |x) or µ_{Y|x}, is
\[ E(Y|x) = \int_{R_x} y\,f_{Y|x}(y)\,dy. \tag{3.5.6} \]
The conditional variance of Y given X = x, denoted as V (Y |x) or σ²_{Y|x}, is
\[ V(Y|x) = \int_{R_x} (y - \mu_{Y|x})^2 f_{Y|x}(y)\,dy = \int_{R_x} y^2 f_{Y|x}(y)\,dy - \mu_{Y|x}^2. \tag{3.5.7} \]
The conditional variance of X given Y = y, denoted as V (X|y) or σ²_{X|y}, is
\[ V(X|y) = \int_{R_y} (x - \mu_{X|y})^2 f_{X|y}(x)\,dx = \int_{R_y} x^2 f_{X|y}(x)\,dx - \mu_{X|y}^2. \tag{3.5.8} \]
Example 3.5.2. Consider the pdf fXY (x, y) = x + y, for 0 < x < 1 and 0 < y < 1.
Determine
1. f (Y |x)
2. the conditional mean of Y given that X = 0.5
3. P (0.25 < Y < 0.5|x = 0.5)
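Solution
1.
\[ f_X(x) = \int_{0}^{1} (x + y)\,dy = \left[ xy + \frac{y^2}{2} \right]_{0}^{1} = x + 0.5. \]
Thus
\[ f(Y|x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{x + y}{x + 0.5}. \]
2.
\[ E(Y|x = 0.5) = \int_{0}^{1} y\,f(Y|x)\,dy = \int_{0}^{1} y \left( \frac{0.5 + y}{0.5 + 0.5} \right) dy = \left[ \frac{y^2}{4} + \frac{y^3}{3} \right]_{0}^{1} = \frac{7}{12}. \]
3.
\[ P(0.25 < Y < 0.5 \,|\, x = 0.5) = \int_{0.25}^{0.5} \left( \frac{0.5 + y}{0.5 + 0.5} \right) dy = \left[ \frac{(1 + y)\,y}{2} \right]_{0.25}^{0.5} = \frac{7}{32} = 0.21875. \]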
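For a general x in (0, 1), the same steps give the conditional mean as a function of x. This short derivation is added as a check; it reduces to 7/12 at x = 0.5:

\[ E(Y|x) = \int_{0}^{1} y\,\frac{x + y}{x + 0.5}\,dy = \frac{1}{x + 0.5}\left[ \frac{x y^2}{2} + \frac{y^3}{3} \right]_{0}^{1} = \frac{x/2 + 1/3}{x + 0.5} = \frac{3x + 2}{6x + 3}, \qquad E(Y|x = 0.5) = \frac{3.5}{6} = \frac{7}{12}. \]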
3.6 Independence of continuous random variables
The definition of independence for continuous random variables is similar to that of discrete random
variables. For continuous random variables if fXY (x, y) = fX (x)fY (y) for all x and y then the
random variables X and Y are said to be independent.
Example 3.6.1. Demonstrate that for the pdf
\[ f_{XY}(x, y) = e^{-(x+y)}, \qquad x > 0,\ y > 0, \]
the random variables X and Y are independent.
Solution
It can easily be verified that fXY (x, y) is a valid pdf. The marginal densities are
\[ f_X(x) = \int_{0}^{\infty} e^{-(x+y)}\,dy = e^{-x}\int_{0}^{\infty} e^{-y}\,dy = e^{-x}, \]
\[ f_Y(y) = \int_{0}^{\infty} e^{-(x+y)}\,dx = e^{-y}\int_{0}^{\infty} e^{-x}\,dx = e^{-y}. \]
Clearly
fXY (x, y) = fX (x)fY (y),
which implies that the random variables X and Y are independent.
3.7 Properties of variance
1. If X is a random variable and a and c are constants, then
var[aX + c] = a² var[X]
2. If X and Y are random variables with a joint probability distribution, and a and b are constants,
then
var[aX + bY ] = a² var[X] + b² var[Y ] + 2ab cov[X, Y ]
3. If X and Y are random variables with a joint probability distribution, and a and b are constants,
then
var[aX − bY ] = a² var[X] + b² var[Y ] − 2ab cov[X, Y ]
4. If X and Y are independent random variables (so that cov[X, Y ] = 0), and a and b are constants,
then
var[aX ± bY ] = a² var[X] + b² var[Y ]
Example 3.7.1. If X and Y are random variables (r.v.) with a joint probability distribution such
that var[X] = 2, var[Y ] = 4 and cov(X, Y ) = −2, find
(a) var[Z] where Z = 3X − 4Y + 8.
(b) cov(P, Q) where P = X − 2Y and Q = 3X + Y.
Solution
(a)
var[Z] = 3² var[X] + 4² var[Y ] − 2(3)(4) cov(X, Y )
= 9(2) + 16(4) − 24(−2)
= 18 + 64 + 48
= 130.
(b)
cov(X, Y ) = cov(X − 2Y, 3X + Y )

= cov(3X, X − 5X, Y − 2Y, Y )
= 3cov(X, X) − 5cov(X, Y ) − 2cov(Y, Y )
= 3var[X] − 5cov(X, Y ) − 2var[Y ]
= 3(2) − 5(−2) − 2(4)
= 8.
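The variance and covariance rules of Section 3.7 can be applied mechanically. The sketch below is illustrative only: the function names are assumptions, and the inputs are the values given in Example 3.7.1 (var[X] = 2, var[Y] = 4, cov(X, Y) = −2).

```python
def var_linear_combination(a, b, var_x, var_y, cov_xy):
    """var[aX + bY] = a^2 var[X] + b^2 var[Y] + 2ab cov[X, Y].

    A subtraction such as 3X - 4Y is handled by passing b = -4.
    Additive constants are ignored because they do not change the variance.
    """
    return a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy


def cov_linear_combinations(p, q, var_x, var_y, cov_xy):
    """cov(p[0] X + p[1] Y, q[0] X + q[1] Y), expanded by bilinearity."""
    return (p[0] * q[0] * var_x
            + (p[0] * q[1] + p[1] * q[0]) * cov_xy
            + p[1] * q[1] * var_y)


var_z = var_linear_combination(3, -4, var_x=2, var_y=4, cov_xy=-2)
cov_pq = cov_linear_combinations((1, -2), (3, 1), var_x=2, var_y=4, cov_xy=-2)
print(var_z, cov_pq)
```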
3.8 The Normal distribution function
One of the most important examples of a continuous probability distribution is the Normal
distribution.
The pdf of the Normal distribution (Gaussian distribution) is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\mu)^2/2\sigma^2}, \quad -\infty < x < \infty \tag{3.8.1}$$
where µ and σ are the mean and standard deviation, respectively.
If we let Z be the standardised variable corresponding to x, that is, if we let
$$Z = \frac{x - \mu}{\sigma} \tag{3.8.2}$$
then (3.8.1) becomes
$$\Phi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2} \tag{3.8.3}$$
Equation (3.8.3) is often termed the standard normal density function.
Properties of the Normal curve
1. The mode and mean occur at x = µ.
2. The curve is symmetric about a vertical axis through the mean.
3. The curve has its points of inflection at x = µ + σ and x = µ − σ.
4. The curve approaches the horizontal axis asymptotically.
5. The total area under the curve is equal to unity.
Example 3.8.1. 1. Find the area under the standard normal curve
(a) between z = 0 and z = 1.2
(b) between z = −0.68 and z = 0
2. Suppose the current measurements in a strip of wire are assumed to follow a normal distribution
with a mean of 10 milliamperes and a variance of 4 (milliamperes)$^2$. What is the probability
that a measurement will exceed 13 milliamperes?
Solution
1. (a)
$$P(0 \le Z \le 1.2) = P(Z \le 1.2) - P(Z \le 0) = 0.8849 - 0.5 = 0.3849$$
(b)
$$P(-0.68 \le Z \le 0) = P(Z \le 0) - P(Z \le -0.68) = 0.5 - 0.2483 = 0.2517$$
2. Let X denote the current in milliamperes. The requested probability can be represented as
P (X > 13). Let $Z = \frac{X - 10}{2}$. Therefore
$$P(X > 13) = P\left(\frac{X - 10}{2} > \frac{13 - 10}{2}\right) = P(Z > 1.5) = 0.06681.$$
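The table look-ups in Example 3.8.1 can be reproduced with the error function from the Python standard library. This is a minimal sketch; the function names `std_normal_cdf` and `normal_tail` are illustrative assumptions, not notation used in the notes.

```python
import math


def std_normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_tail(x, mean, sd):
    """P(X > x) for X ~ N(mean, sd^2), obtained by standardising x."""
    return 1.0 - std_normal_cdf((x - mean) / sd)


area_a = std_normal_cdf(1.2) - std_normal_cdf(0.0)    # area between z = 0 and z = 1.2
area_b = std_normal_cdf(0.0) - std_normal_cdf(-0.68)  # area between z = -0.68 and z = 0
p_exceed = normal_tail(13, mean=10, sd=2)             # P(X > 13) with variance 4

print(round(area_a, 4), round(area_b, 4), round(p_exceed, 5))
```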
Chapter 4
Hypothesis testing
4.1 Introduction
Definition: A hypothesis is a statement about a population.
Hypothesis testing is concerned with deciding between two hypotheses, H0 (the null hypothesis)
and H1 (the alternative hypothesis).
H0 is an assertion that a parameter in a statistical model takes a particular value.
It is the hypothesis we usually set up with the expectation of rejecting it.
H1 expresses the way in which the value of a particular parameter in a statistical model may
deviate from that specified in H0
Example 4.1.1. A machine that produces metal cylinders is set to make cylinders with a diameter
of 50 mm. Is it practical that all cylinders that this machine will produce will have a diameter of
exactly 50 mm?
Solution
(i) H0 : µ = 50 (The cylinders produced by the machine have the set diameter, 50 mm)
(ii) H1 : µ ≠ 50 (There is a possibility that the machine can produce cylinders whose diameter is not
50 mm)
Definitions
Test statistics: H0 generally reflects a position of no change.
We conduct a test not to prove H0 but to see if it should be rejected.
Such tests are based on the value of sample statistics, such as x̄, z or t scores, and these are
called test statistics.
Critical region: It is a subset of the values of the test statistic that might be observed in an experiment.
The subset is chosen so that its total probability is low under H0 and its values are better explained by H1 .
Type I error: rejecting H0 when it is true.
Type II error: failing to reject H0 when it is not true.
Level of significance: the probability of committing a type I error.
The P-value is the smallest level of significance that would lead to rejection of the null
hypothesis H0 with the given data.
4.2 Steps in hypothesis testing
1. State your H0 and H1
(a) If H1 is of the form µ ≠ µ0 , then a two tail test is used.
(b) If H1 is of the form µ > µ0 , then a single tail or one sided test to the right is used.
(c) If H1 is of the form µ < µ0 , then a single tail or one sided test to the left is used.
Figure 4.1: The distribution of Z0 when H0 : µ = µ0 is true, with the critical region for (a) the two
sided alternative H1 : µ ≠ µ0 , (b) the one-sided alternative H1 : µ > µ0 , and (c) the one-sided alternative H1 :
µ < µ0 .
2. Choose the level of significance.
3. Choose the appropriate test statistic and establish the critical region.
4. Compute the value of the test statistic based on the sample data.
5. Draw a conclusion, that is, reject or fail to reject H0 .
Remark: We reject H0 when the computed value lies in the critical region.
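The decision rule in the steps above can be sketched for a Z test statistic as follows. This is an illustrative sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the function name `z_test_rejects` and its `tail` argument are hypothetical and not part of the notes.

```python
from statistics import NormalDist


def z_test_rejects(z_value, alpha, tail):
    """Return True when the computed Z value lies in the critical region.

    tail is "two" for H1: mu != mu0, "right" for H1: mu > mu0,
    and "left" for H1: mu < mu0.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        return abs(z_value) > std_normal.inv_cdf(1 - alpha / 2)
    if tail == "right":
        return z_value > std_normal.inv_cdf(1 - alpha)
    if tail == "left":
        return z_value < std_normal.inv_cdf(alpha)
    raise ValueError("tail must be 'two', 'right' or 'left'")


# At the 5% level the two-tail critical values are about -1.96 and 1.96,
# and the one-tail critical value is about 1.645.
print(z_test_rejects(1.5, 0.05, "two"))    # inside (-1.96, 1.96): fail to reject
print(z_test_rejects(2.1, 0.05, "right"))  # beyond 1.645: reject
```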
Example 4.2.1. An electrical firm manufactures light bulbs that have a lifetime that is approximately
normally distributed with a mean of 12 hours and variance 0.64 hours$^2$. A light bulb is
selected at random and tested, and its lifetime is found to be 13.3 hours. Determine whether
this bulb belongs to the manufacturer. Use the 5% level of significance.
Solution
1. H0 : µ = 12 (The bulb belongs to the manufacturer)
H1 : µ ≠ 12 (The bulb does not belong to the manufacturer)
2. Level of significance: 5%
3. Test statistic: Z (Normal distribution).
4. Computed Z-value:
$$Z = \frac{x - \mu}{\sigma} = \frac{13.3 - 12}{\sqrt{0.64}} = 1.625$$
5. Conclusion: Since the computed Z value (1.625) does not lie in the critical region
[Z < −1.96 ∪ Z > 1.96], we fail to reject H0 at the 5% level of significance and we conclude that the
bulb belongs to the manufacturer.
Figure 4.2: vulnerability of the z-score
4.3 Comparing a single mean to a specified value, when the population variance is known
It is typical that we compare a single observation to a specified value

Usually we take a sample of size n, from which we compute the sample mean x̄ which we then
compare with a specified value.

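For independent observations, each with variance $\sigma^2$,
$$\mathrm{var}[\bar{x}] = \mathrm{var}\left[\frac{\sum_{i=1}^{n} x_i}{n}\right] = \frac{1}{n^2}\,\mathrm{var}[x_1 + x_2 + \cdots + x_n] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
Therefore the standard error is
$$\mathrm{S.E}[\bar{x}] = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}$$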
Example 4.3.1. A soft-drink bottler purchases 10 bottles from a glass company. The bottler
wants to know if the mean breaking strength exceeds 200 pounds per square inch (psi). If so,
she wants to accept the bottles. Past experience indicates that for 4 specimen bottles the variance
of the breaking strength is 100 psi$^2$ and the mean is 214 psi. Investigate at the 5% level of significance
whether the bottler should accept or reject the bottles.
Solution
1. H0 : µ = 200
H1 : µ > 200
2. Level of significance: 5%
3. Test statistic: Z (Normal distribution)
4. Computed Z-value
$$Z = \frac{\bar{x} - \mu}{\mathrm{S.E}[\bar{x}]} = \frac{214 - 200}{10/\sqrt{4}} = \frac{14}{5} = 2.8$$
5. Conclusion: Reject H0 , since the computed value (2.8) lies in the critical region Z > 1.645.
The bottler should accept the lot since the mean breaking strength is greater than 200 psi.
4.4 The difference between two means when the population variances $\sigma_1^2$ and $\sigma_2^2$ are known
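Since we have two populations under study, we need to deduce the standard error
for the difference between the two sample means.
We know that
$$\mathrm{var}[\bar{x}] = \frac{\sigma^2}{n}.$$
Then, for independent samples,
$$\mathrm{var}[\bar{x}_1 - \bar{x}_2] = \mathrm{var}[\bar{x}_1] + \mathrm{var}[\bar{x}_2] = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$
Thus
$$\mathrm{S.E}[\bar{x}_1 - \bar{x}_2] = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$
Hence
$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\mathrm{S.E}[\bar{x}_1 - \bar{x}_2]}.$$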
Example 4.4.1. A manufacturer claims that the average tensile strength of synthetic fibre A exceeds
the average tensile strength of synthetic fibre B. To test his claim, 50 pieces of each type
of synthetic fibre are tested under similar conditions. Type A had an average tensile strength of
43.7 psi and a variance of 11.8 psi$^2$, while type B had an average tensile strength of 41.5 psi and a
variance of 46.3 psi$^2$. At the 5% significance level, test the manufacturer’s claim.
Solution
The problem objective is to compare two populations, thus
1. H0 : µA − µB = 0
H1 : µA − µB > 0
2. Test statistic
$$Z = \frac{(\bar{x}_A - \bar{x}_B) - (\mu_A - \mu_B)}{\sqrt{\dfrac{\sigma_A^2}{n_A} + \dfrac{\sigma_B^2}{n_B}}} = \frac{(43.7 - 41.5) - 0}{\sqrt{\dfrac{11.8}{50} + \dfrac{46.3}{50}}} = 2.04$$
3. Level of significance 5% (one tail test, critical region Z > 1.645)
4. Conclusion: Reject H0 , since 2.04 > 1.645. There is sufficient evidence to allow us to conclude that µA > µB ,
hence the manufacturer’s claim is supported.
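The two-sample Z statistic of Section 4.4 is straightforward to compute. A minimal sketch follows, using the summary figures quoted in Example 4.4.1; the function name `two_sample_z` and the parameter `delta0` (the hypothesised difference under H0) are illustrative assumptions.

```python
import math


def two_sample_z(mean1, mean2, var1, var2, n1, n2, delta0=0.0):
    """Z statistic for H0: mu1 - mu2 = delta0 with known population variances."""
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return ((mean1 - mean2) - delta0) / standard_error


z_value = two_sample_z(43.7, 41.5, var1=11.8, var2=46.3, n1=50, n2=50)
Z_CRITICAL = 1.645  # one-tail table value at the 5% level

print(round(z_value, 2), z_value > Z_CRITICAL)
```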
4.5 Hypothesis testing when the population variance is not known and n < 30
Example 4.5.1. A manufacturer of television picture tubes has a production line that produces an
average of 100 tubes per day. Because of new government regulations, a new safety device has been
installed, which the manufacturer believes will reduce average daily output. A random sample of 15
days’ output after the installation of the safety device is shown below.
93, 103, 95, 101, 91, 105, 96, 94, 101, 88, 98, 94, 101, 92, 95
At 5% significance level, is there sufficient evidence to conclude that the average daily output has
decreased following the installation of the safety device?
Solution
Here the population variance is unknown and must be estimated from the sample.
Also observe that the sample size (n = 15) is less than 30, hence we use the t distribution.
From the data, x̄ = 96.47 and s = 4.85.
1. H0 : µ = 100 (Daily production has not changed)
H1 : µ < 100 (Daily production has decreased)
2. Level of significance 5% (one tail test, critical region $t < -t_{0.05,14} = -1.761$)
3. Test statistic
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{96.47 - 100}{4.85/\sqrt{15}} = -2.82.$$
4. Conclusion: Reject H0 since the computed t value lies in the critical region (−2.82 < −1.761),
and conclude that there is enough evidence to show that the average daily production has
decreased.
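The sample mean, sample standard deviation and t statistic of Example 4.5.1 can be computed directly from the 15 observations. This is a sketch under the assumption that the critical value is read from a t table (it is hard-coded below rather than computed); the function name `one_sample_t` is illustrative.

```python
import math
import statistics

daily_output = [93, 103, 95, 101, 91, 105, 96, 94, 101, 88, 98, 94, 101, 92, 95]


def one_sample_t(sample, mu0):
    """t statistic for H0: mu = mu0 when the population variance is unknown."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation, divisor n - 1
    return (mean - mu0) / (s / math.sqrt(n))


t_value = one_sample_t(daily_output, mu0=100)
T_CRITICAL = 1.761  # table value t(0.05, 14) quoted in the example

# Left-tail test: reject H0 when t is below -T_CRITICAL.
reject_h0 = t_value < -T_CRITICAL
print(round(t_value, 2), reject_h0)
```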
4.6 Difference between two population means when the population variances $\sigma_1^2$ and $\sigma_2^2$ are not known and $(n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2 < 30$
In order to use t distribution to make a valid test of hypothesis about µ1 − µ2 the following
conditions must be met.
1. The two population random variables ($x_1$ and $x_2$) are normally distributed.
2. The two samples must be independent.
3. The two population variances are equal, that is, $\sigma_1^2 = \sigma_2^2$.
By condition 3, we have a common variance estimate known as the pooled variance, given by
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
Since we are comparing two populations, the test statistic is
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}.$$
Example 4.6.1. The manager of a large production facility believes that worker productivity is a
function of, among other things, the design of the job, which refers to the sequence of movements.
Two designs are being considered for the production of a new product. In an experiment, six workers
using design A had a mean assembly time of 7.60 minutes, with a standard deviation of 2.36
minutes, for this product. (The six observations were 8.2, 5.3, 6.5, 5.1, 9.7, 10.8.) Eight workers using
design B had a mean assembly time of 9.20 minutes, with a standard deviation of 1.35 minutes.
(The observations were 9.5, 8.3, 7.5, 10.9, 11.3, 9.3, 8.8, 8.0.) Can we conclude at the 5% level of
significance that the average assembly times differ for the two designs? Assume that the times are
normally distributed.
Solution
Here we are being asked to determine if µ1 ≠ µ2 .
Observe that $(n_1 - 1) + (n_2 - 1) = 12 < 30$, so we use the t distribution.
1. H0 : µ1 − µ2 = 0
H1 : µ1 − µ2 ≠ 0
2. Test statistic. First compute the pooled variance:
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(6 - 1)2.36^2 + (8 - 1)1.35^2}{6 + 8 - 2} = 3.38$$
Then
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{(7.60 - 9.20) - 0}{\sqrt{3.38\left(\dfrac{1}{6} + \dfrac{1}{8}\right)}} = -1.61$$
3. Conclusion: We fail to reject H0 , since the computed t value does not lie in the critical
region ($|t| > t_{0.025,12} = 2.179$). Therefore we conclude that there is insufficient evidence that
a difference in mean assembly times exists between designs A and B.
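The pooled-variance t statistic of Section 4.6 can also be computed from the raw observations listed in Example 4.6.1. The sketch below is illustrative (the function name `pooled_t` is an assumption); because it works from unrounded sample variances, the pooled variance may differ in the second decimal place from a hand calculation based on rounded standard deviations.

```python
import math
import statistics

design_a = [8.2, 5.3, 6.5, 5.1, 9.7, 10.8]
design_b = [9.5, 8.3, 7.5, 10.9, 11.3, 9.3, 8.8, 8.0]


def pooled_t(sample1, sample2):
    """Return (t, pooled variance) for H0: mu1 - mu2 = 0, assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)  # sample variance, divisor n - 1
    var2 = statistics.variance(sample2)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / standard_error
    return t, pooled_var


t_value, pooled_var = pooled_t(design_a, design_b)
T_CRITICAL = 2.179  # table value t(0.025, 12) for a two-tail test at the 5% level

reject_h0 = abs(t_value) > T_CRITICAL
print(round(t_value, 2), round(pooled_var, 2), reject_h0)
```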
4.7 Paired Comparison
Here, we consider two samples as in a two sample t-test; the difference is that in this
experimental design the samples are not independent.
Observations occur in pairs such that the two observations in a pair are taken from the
same experimental unit,
or from two similar experimental units (similar with respect to a certain attribute).
Example 4.7.1. Gasohol has received much attention in recent years as a possible alternative to
gasoline as a fuel for automobiles. To compare the mileages per gallon that can be achieved with the
two fuels, the following test was performed. Eight cars were selected and their fuel tanks completely
cleaned. Each car was driven twice over a predetermined course, once using gasohol and once using
gasoline, and the miles per gallon were recorded for each trip.
At the 10% significance level, do the data support the hypothesis that the mean mileage per gallon of
gasohol is less than that of gasoline?
Solution
This is a paired comparison.
Mileage with gasohol     30   36   34   22   12   32   15   31
Mileage with gasoline    35   42   40   27   15   33   20   35
Difference (D)           -5   -6   -6   -5   -3   -1   -5   -4
(Difference)^2           25   36   36   25    9    1   25   16
It follows that
$$\bar{x}_d = \frac{\sum D}{n} = \frac{-35}{8} = -4.375$$
$$s_d = \sqrt{\frac{n\sum D^2 - \left(\sum D\right)^2}{n(n - 1)}} = \sqrt{\frac{8(173) - (-35)^2}{8(7)}} = 1.69$$
1. H0 : µd = 0
H1 : µd < 0
2. Test statistic
$$t = \frac{\bar{x}_d - \mu_d}{s_d/\sqrt{n}} = \frac{-4.375 - 0}{1.69/\sqrt{8}} \approx -7.3$$
3. Level of significance 10% (one tail test, critical region $t < -t_{0.1,7} = -1.42$)
4. Conclusion: Reject H0 , since −7.3 < −1.42, and conclude that the mean mileage for gasohol is less than that of
gasoline at the 10% level of significance.
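A paired comparison reduces to a one-sample t test on the differences. The following minimal sketch recomputes the quantities of Example 4.7.1 from the tabulated mileages; the variable names are illustrative, and the critical value is the table value quoted in the example rather than something computed here.

```python
import math
import statistics

gasohol = [30, 36, 34, 22, 12, 32, 15, 31]
gasoline = [35, 42, 40, 27, 15, 33, 20, 35]

# Work with the within-pair differences (gasohol minus gasoline).
differences = [a - b for a, b in zip(gasohol, gasoline)]

n = len(differences)
mean_d = statistics.mean(differences)
s_d = statistics.stdev(differences)  # divisor n - 1
t_value = (mean_d - 0) / (s_d / math.sqrt(n))

T_CRITICAL = 1.42  # table value t(0.1, 7), one-tail test at the 10% level
reject_h0 = t_value < -T_CRITICAL

print(mean_d, round(s_d, 2), round(t_value, 2), reject_h0)
```

Using the unrounded standard deviation gives a t value of about −7.34; rounding intermediate results changes the second decimal place but not the conclusion.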
4.8 Confidence Intervals
Making inferences by using confidence intervals
In general a (1 − α)100% confidence interval (CI) for the population parameter θ is given by:
$$\hat{\theta} - \text{critical value} \cdot \mathrm{SE}[\hat{\theta}] < \theta < \hat{\theta} + \text{critical value} \cdot \mathrm{SE}[\hat{\theta}]. \tag{4.8.1}$$
The critical value is that for a two tail test.
4.8.1 CI for the true population mean µ, when the population variance $\sigma^2$ is known
Example 4.8.1. Revisit the soft-drink bottler problem (Example 4.3.1).
A (1 − α)100% CI for µ is given by
$$\bar{x} - z_{\alpha/2} \cdot \mathrm{SE}[\bar{x}] < \mu < \bar{x} + z_{\alpha/2} \cdot \mathrm{SE}[\bar{x}]. \tag{4.8.2}$$
H0 : µ = 200, H1 : µ > 200
Level of significance: 5% ⇒ 95% CI for µ
Recall that x̄ = 214, n = 4 and $\sigma^2 = 100$, so that
$$214 - 1.96\sqrt{\frac{\sigma^2}{n}} < \mu < 214 + 1.96\sqrt{\frac{\sigma^2}{n}}$$
$$204.2 < \mu < 223.8$$
Conclusion: Since the value stated in the null hypothesis does not lie within the 95% CI, we
reject H0 at the 5% level of significance and conclude that the mean breaking strength is greater
than 200.
4.8.2 CI for the true population mean µ when the population variance $\sigma^2$ is unknown and n < 30
Example 4.8.2. Revisit Example 4.5.1 (the television picture tubes problem).
A (1 − α)100% CI for µ in this case is given by
$$\bar{x} \pm t_{(\alpha/2,\,n-1)} \cdot \mathrm{SE}[\bar{x}] = 96.47 \pm 2.14\,\frac{4.85}{\sqrt{15}}$$
$$\Rightarrow 93.79 < \mu < 99.15$$
H0 : µ = 100, H1 : µ < 100.
Conclusion: Reject H0 (the value stated in the null hypothesis is not in the CI). Conclude that the
average production has decreased.
4.8.3 CI for the difference between two populations
CI for $\mu_1 - \mu_2$ when $\sigma_1^2$ and $\sigma_2^2$ are known and $n_1 + n_2 - 2 > 30$.
Example 4.8.3. Revisit Example 4.4.1.
A (1 − α)100% CI for $\mu_1 - \mu_2$ is given by
$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = (43.7 - 41.5) \pm 1.96\sqrt{\frac{11.8}{50} + \frac{46.3}{50}}$$
$$\Rightarrow 0.087 < \mu_1 - \mu_2 < 4.31$$
H0 : µ1 − µ2 = 0, H1 : µ1 − µ2 > 0
Conclusion: Reject H0 , since 0 does not lie in the CI. Conclude that the average tensile strengths of the two fibres are
different.
4.8.4 CI for the difference between two populations
CI for $\mu_1 - \mu_2$ when $\sigma_1^2$ and $\sigma_2^2$ are unknown and $n_1 + n_2 - 2 < 30$.
Example 4.8.4. Revisit Example 4.6.1.
A (1 − α)100% CI for $\mu_1 - \mu_2$ is given by
$$(\bar{x}_1 - \bar{x}_2) \pm t_{(\alpha/2,\,n_1+n_2-2)}\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = (7.6 - 9.2) \pm 2.179\sqrt{3.38\left(\frac{1}{6} + \frac{1}{8}\right)}$$
$$\Rightarrow -3.76 < \mu_1 - \mu_2 < 0.56$$
H0 : µ1 − µ2 = 0, H1 : µ1 − µ2 ≠ 0
Conclusion: Fail to reject H0 , since 0 lies in the CI. There is insufficient evidence to conclude that
there is a significant difference in mean assembly times between designs A and B.
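All of the intervals in Section 4.8 share the form estimate ± critical value × standard error. A minimal sketch of that pattern is given below; the helper name `confidence_interval` is an assumption, and the critical values are table values supplied by hand.

```python
import math


def confidence_interval(estimate, critical_value, standard_error):
    """Return (lower, upper) = estimate -/+ critical_value * standard_error."""
    margin = critical_value * standard_error
    return estimate - margin, estimate + margin


# Single mean, variance known: x-bar = 214, sigma^2 = 100, n = 4, z = 1.96.
mean_ci = confidence_interval(214, 1.96, math.sqrt(100 / 4))

# Difference of two means, pooled variance 3.38, n1 = 6, n2 = 8, t(0.025, 12) = 2.179.
diff_ci = confidence_interval(7.6 - 9.2, 2.179, math.sqrt(3.38 * (1 / 6 + 1 / 8)))

print([round(v, 2) for v in mean_ci], [round(v, 2) for v in diff_ci])
```

A null value lying outside the interval corresponds to rejecting H0 in a two tail test at the matching significance level.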
4.9 Non-parametric statistics: χ2 test
Tests which do not involve population parameters such as the mean and variance, or which
make no assumption regarding the form of the distribution, are referred to as non-parametric
tests.
One of the non-parametric tests is the $\chi^2$ test.
It tests for independence and normality of populations.
It makes use of the observed and expected frequencies or count data.
The general form of the $\chi^2$ test statistic is given by
$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{4.9.1}$$
where $O_{ij}$ and $E_{ij}$ denote the observed and the expected frequencies respectively.
Example 4.9.1.
A manager at a shirt manufacturing firm, which operates three shifts daily, wishes to determine if
there is a relationship in the quality of workmanship among the three shifts. After an inspection of the
600 shirts produced on a particular day, the manager compiled the data in the table below, showing
the number of shirts with flaws produced by each shift.
                          Shift
Shirt condition      1      2      3    Total
Flawed              10      9     11       30
No flaws           240    191    139      570
Total              250    200    150      600
Do the data indicate that there is a relationship in the quality of workmanship among the three
shifts? (Use α = 0.10)
Solution
1. H0 : The two classifications are independent
H1 : The two classifications are dependent
2. Test statistic: $\chi^2$ test with $(r - 1)(c - 1) = (2 - 1)(3 - 1) = 2$ degrees of freedom; critical region $\chi^2 > 4.61$
3. Expected frequencies are shown in brackets in each cell
                              Shift
Shirt condition        1             2             3        Total
Flawed            10 (12.5)       9 (10)       11 (7.5)        30
No flaws         240 (237.5)    191 (190)     139 (142.5)     570
Total                250           200           150          600
Each expected frequency is (row total × column total)/(grand total), for example
$$E_{11} = \frac{30 \times 250}{600} = 12.5$$
4. Thus
$$\chi^2 = \sum_{i=1}^{6} \frac{(O_i - e_i)^2}{e_i} = \frac{(10 - 12.5)^2}{12.5} + \frac{(240 - 237.5)^2}{237.5} + \cdots + \frac{(139 - 142.5)^2}{142.5} = 2.35$$
5. Conclusion: Since 2.35 < 4.61, we fail to reject H0 , and we conclude that the two classifications are independent.
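The expected frequencies and the $\chi^2$ statistic for a contingency table follow mechanically from the row and column totals. The sketch below uses the shirt data from the example above; the function name `chi_square_independence` is illustrative, and the critical value is the table value quoted in the solution.

```python
observed = [
    [10, 9, 11],      # flawed shirts, shifts 1 to 3
    [240, 191, 139],  # shirts without flaws, shifts 1 to 3
]


def chi_square_independence(table):
    """Return (chi-square statistic, expected frequencies) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # Expected count = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return statistic, expected


chi2, expected = chi_square_independence(observed)
CHI2_CRITICAL = 4.61  # table value for 2 degrees of freedom at alpha = 0.10

print(round(chi2, 2), expected, chi2 > CHI2_CRITICAL)
```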
Chapter 5
Linear Regression and Correlation
5.1 Introduction
In many real world problems there are two or more variables that are related, and it is
important to explore this relationship.
For example, in an industrial situation it is known that the tar content in the outlet stream in a chemical
process is related to the inlet temperature.
It may be of interest to develop a method of prediction, that is,
a procedure for estimating the tar content for various levels of the inlet temperature.
One such model is a regression model.
Applications of regression are numerous and occur in almost every field, including:
– engineering
– physical sciences
– economics and resource management
– life and biological sciences
Simple linear regression is a model with a single regressor or independent variable, say x.
The relationship between the response y and the regressor x is a straight line.
Figure 5.1: Deviations of the data from the estimated regression model.
The simple linear regression model is
$$y = \beta_0 + \beta_1 x + \varepsilon \tag{5.1.1}$$
where $\beta_0$ is the y-intercept, $\beta_1$ is the slope, and $\varepsilon$ is a random error component.

Errors are assumed to be normally distributed with zero mean and an unknown but constant
variance.
Errors are also assumed to be uncorrelated, implying that the value of one error does not
depend on the value of any other error.
Parameters β0 and β1 are called regression coefficients.
β1 gives the change in the mean of the distribution of y produced by a unit change in x.
If the range of data on x includes x = 0, then the y-intercept β0 is the mean of the
distribution of the response y when x = 0.
However, if the range of values for x does not include zero, then β0 has no practical
interpretation.
5.2 Least squares estimation of β0 and β1
Parameters β0 and β1 are unknown and must be estimated using sample data
These data may result either from a controlled experiment designed specifically to collect
the data, or from existing records.
The method of least squares is used to estimate β0 and β1 , that is,
We need to estimate β0 and β1 so that the sum of squares of the differences between the
observations yi and the straight line is at minimum.
From (5.1.1) we may write
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, 3, \cdots, n. \tag{5.2.1}$$
Equation (5.1.1) can be viewed as a population regression model while equation (5.2.1) is a
simple regression model written in terms of the n pairs of data $(y_i, x_i)$, i = 1, 2, · · · , n.
The fitted regression line is
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,$$
such that each pair of observations satisfies the relation
$$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i \tag{5.2.2}$$
where $e_i = y_i - \hat{y}_i$ is called a residual and describes the error in the fit of the model at the
ith data point.
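We need to find $\hat{\beta}_0$ and $\hat{\beta}_1$ so as to minimize the Residual Sum of Squares (RSS).
$$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \tag{5.2.3}$$
Taking partial derivatives of (5.2.3) with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$ gives
$$\frac{\partial\,\mathrm{RSS}}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) \tag{5.2.4}$$
$$\frac{\partial\,\mathrm{RSS}}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)x_i \tag{5.2.5}$$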
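Setting the partial derivatives to zero and rearranging the terms we obtain
$$n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \tag{5.2.6}$$
$$\hat{\beta}_0\sum_{i=1}^{n} x_i + \hat{\beta}_1\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \tag{5.2.7}$$
From (5.2.6) we have
$$\hat{\beta}_0 = \frac{\sum_{i=1}^{n} y_i}{n} - \hat{\beta}_1\frac{\sum_{i=1}^{n} x_i}{n} = \bar{y} - \hat{\beta}_1\bar{x}. \tag{5.2.8}$$
Substituting (5.2.8) into (5.2.7) one gets
$$\hat{\beta}_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{S_{xy}}{S_{xx}} \tag{5.2.9}$$
where
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \tag{5.2.10}$$
$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \tag{5.2.11}$$
$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \tag{5.2.12}$$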
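The least squares estimates in (5.2.8) and (5.2.9) translate directly into code. The following is a minimal sketch; the function name `least_squares` and the small data set are hypothetical and chosen only so that the points lie exactly on a line.

```python
def least_squares(xs, ys):
    """Return (beta0_hat, beta1_hat) for the line y = beta0 + beta1 * x.

    Uses beta1_hat = Sxy / Sxx and beta0_hat = y_bar - beta1_hat * x_bar.
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x**2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    beta1 = sxy / sxx
    beta0 = sum_y / n - beta1 * sum_x / n
    return beta0, beta1


# Hypothetical data lying exactly on y = 1 + 2x, so the fit recovers the line.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
beta0_hat, beta1_hat = least_squares(xs, ys)
print(beta0_hat, beta1_hat)
```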
5.3 Analysis of variance approach to regression analysis
Often the problem of analysing the quality of the fitted regression line is handled through an
ANOVA approach.
The analysis of variance approach in a simple regression model tests the hypotheses:
– H0 : β1 = 0 (The slope is not significantly different from zero)
– H1 : β1 ≠ 0 (The slope is significantly different from zero)
One-way ANOVA Table
Source        df       SS           MS                   F
Regression    1        Regr SS      Regr SS              Regr SS / s^2
Residual      n − 2    SSE          s^2 = SSE/(n − 2)
Total         n − 1    SS(Total)
5.3.1 The coefficient of determination
The quantity $r^2$ is called the coefficient of determination.
Mathematically it is given by
$$r^2 = \frac{\text{Regression SS}}{S_{yy}} = 1 - \frac{\text{Residual SS}}{S_{yy}} \tag{5.3.1}$$
Since $S_{yy}$ is a measure of the variability in the response y without considering the effect of
the regressor x, and the residual SS is a measure of the variability in y remaining after x has
been considered,
$r^2$ is the proportion of the variation in the response y accounted for by the regressor x.
By formulation the coefficient of determination is in the range $0 \le r^2 \le 1$.
Values of $r^2$ close to 1 imply that the model explains most of the variation in y.
5.3.2 Inferences concerning the regression coefficients
1. Confidence interval (CI) for β1
A (1 − α)100% CI for the slope β1 is given by
$$\hat{\beta}_1 - t_{(\alpha/2,\,n-2)}\sqrt{\frac{S^2}{S_{xx}}} < \beta_1 < \hat{\beta}_1 + t_{(\alpha/2,\,n-2)}\sqrt{\frac{S^2}{S_{xx}}} \tag{5.3.2}$$
where $t_{(\alpha/2,\,n-2)}$ is the value of the t-distribution with (n − 2) degrees of freedom, and $S^2$
is the residual mean square from the ANOVA table.
2. Confidence interval (CI) for β0
A (1 − α)100% CI for the intercept β0 is given by
$$\hat{\beta}_0 - t_{(\alpha/2,\,n-2)}\sqrt{S^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)} < \beta_0 < \hat{\beta}_0 + t_{(\alpha/2,\,n-2)}\sqrt{S^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)} \tag{5.3.3}$$
3. Prediction of new observations
The equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ may be used to predict the mean response $y \mid x_0$ at $x = x_0$,
where $x_0$ is not necessarily one of the pre-chosen values.
4. Prediction interval for $y_0$
A (1 − α)100% prediction interval for a new observation $y_0$ at $x = x_0$ is given by
$$\hat{y}_0 - t_{(\alpha/2,\,n-2)} \cdot \mathrm{SE}[y_0] < y_0 < \hat{y}_0 + t_{(\alpha/2,\,n-2)} \cdot \mathrm{SE}[y_0]$$
$$\text{where } \mathrm{SE}[y_0] = \sqrt{S^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}$$
Example 5.3.1. An experiment on the amount of converted sugar in a certain biochemical process
at various temperatures was conducted, and the following results were obtained.
Temperature (x) 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
Converted sugar (y) 8.1 7.8 8.5 9.8 9.5 8.9 8.6 10.2 9.3 9.2 10.5
(a) Fit a simple linear regression model to the data.
(b) Carry out an analysis of variance (ANOVA) to test at the 5% level of significance whether the
slope is significantly different from zero. From the ANOVA table, compute the coefficient of
determination, $r^2$, and interpret it.
(c) Predict the amount of converted sugar when the coded temperature is 1.75. Find a 95%
prediction interval for this prediction.
Solution
(a) From the given data we have the following
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{16.5}{11} = 1.5, \tag{5.3.4}$$
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{100.4}{11} = 9.13, \tag{5.3.5}$$
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 1.1, \tag{5.3.6}$$
$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 7.20, \tag{5.3.7}$$
$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 1.99.$$
Using the above results we have
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{1.99}{1.1} = 1.81 \tag{5.3.8}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 9.13 - 1.81(1.5) = 6.41 \tag{5.3.9}$$
Thus, the fitted regression equation is $\hat{y} = 6.41 + 1.81x$.
Remark: To draw the fitted line on the scatter plot, we need to compute at least two fitted
response values at two given values of the regressor x, e.g.
$$\hat{y}\,|_{x=1.0} = 6.41 + 1.81(1.0) = 8.22 \tag{5.3.10}$$
$$\hat{y}\,|_{x=2.0} = 6.41 + 1.81(2.0) = 10.03. \tag{5.3.11}$$
(b) For the ANOVA table we need to compute the following
$$\text{Regression SS} = \hat{\beta}_1 S_{xy} = 1.81(1.99) = 3.60 \tag{5.3.12}$$
$$\text{Residual SS} = S_{yy} - \hat{\beta}_1 S_{xy} = 7.20 - 3.60 = 3.60$$
The hypotheses are:
H0 : β1 = 0
H1 : β1 ≠ 0
One-way ANOVA Table
Source        df     SS      MS      F
Regression    1      3.60    3.60    9.0
Residual      9      3.60    0.40
Total         10     7.20
Critical value: $F^{0.05}_{(1,9)} = 5.12$, so the critical region is F > 5.12.
Conclusion: Since the computed F-value (9.0) is greater than the critical value (5.12) at the 5%
level of significance, we reject H0 and conclude that the slope is significantly different from
zero.
$$r^2 = \frac{\text{Regression SS}}{S_{yy}} = \frac{3.60}{7.20} = 0.50$$
Comment: This implies that 50% of the variation in converted sugar (y) is explained by the
temperature (x) and the remainder is unaccounted for by our regression model.
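(c) The predicted converted sugar at a coded temperature of 1.75 is
$$\hat{y}_0 = 6.41 + 1.81(1.75) = 9.58$$
$$\mathrm{SE}[y_0] = \sqrt{S^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \tag{5.3.13}$$
$$= \sqrt{0.40\left(1 + \frac{1}{11} + \frac{(1.75 - 1.5)^2}{1.1}\right)} = 0.6776 \tag{5.3.14}$$
Therefore the 95% prediction interval for $y_0$ is
$$\hat{y}_0 - t_{0.025,9} \cdot \mathrm{SE}[y_0] < y_0 < \hat{y}_0 + t_{0.025,9} \cdot \mathrm{SE}[y_0]$$
$$9.58 - 2.26(0.6776) < y_0 < 9.58 + 2.26(0.6776)$$
$$8.05 < y_0 < 11.11$$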
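The whole of Example 5.3.1 (fit, ANOVA quantities, $r^2$ and prediction interval) can be reproduced from the tabulated data. The sketch below is illustrative: the variable names are assumptions, and the t critical value is a table value entered by hand. Because it carries unrounded intermediate values, its results may differ from the hand calculation in the last reported digit.

```python
import math

temperature = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
sugar = [8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5]

n = len(temperature)
x_bar = sum(temperature) / n
y_bar = sum(sugar) / n

# Corrected sums of squares and cross-products, as in (5.2.10) to (5.2.12).
sxx = sum(x * x for x in temperature) - sum(temperature) ** 2 / n
syy = sum(y * y for y in sugar) - sum(sugar) ** 2 / n
sxy = sum(x * y for x, y in zip(temperature, sugar)) - sum(temperature) * sum(sugar) / n

beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar

# ANOVA quantities.
regression_ss = beta1 * sxy
residual_ss = syy - regression_ss
s2 = residual_ss / (n - 2)       # residual mean square
f_value = regression_ss / s2
r_squared = regression_ss / syy

# 95% prediction interval for a new observation at x0 = 1.75.
x0 = 1.75
y0_hat = beta0 + beta1 * x0
se_y0 = math.sqrt(s2 * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx))
T_CRITICAL = 2.262               # table value t(0.025, 9)
lower = y0_hat - T_CRITICAL * se_y0
upper = y0_hat + T_CRITICAL * se_y0

print(round(beta0, 2), round(beta1, 2), round(f_value, 1), round(r_squared, 2))
print(round(lower, 2), round(upper, 2))
```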
THE END OF THE NOTES !!!!!!!!!!!!!!!!!!
Mathematics 2 (Stats): Practice Questions
1. Verify that the function f is a probability density function (p.d.f.) over the given interval
(a) $f(x) = \frac{1}{8}$, x ∈ [0, 8]
(b) $f(x) = \frac{1}{6}e^{-x/6}$, x ∈ [0, ∞)
(c) $f(x) = 12x^2(1 - x)$, x ∈ [0, 1].
2. Find the constant k so that the function f is a probability density function over the given
interval
(a) $f(x) = kx$, x ∈ [1, 5]
(b) $f(x) = \frac{k}{b - a}$, x ∈ [a, b]
(c) $f(x) = k(4 - x^2)$, x ∈ [−2, 2]
3. The distribution of petrol consumption at a garage is given by
$$f(x) = \begin{cases} k(x - 1)(3 - x), & 1 < x < 3 \\ 0, & \text{otherwise} \end{cases}$$
(a) Determine the value of k, so that f (x) becomes a valid p.d.f.
(b) Find the mode, mean and variance of X
(c) Find the probability that X is greater than 2.5 litres
4. Let X denote the reaction time, in seconds, to a certain stimulus and Y denote the temperature
(°F) at which a certain reaction starts to take place. Suppose that the two random variables
X and Y have the joint density
$$f(x, y) = \begin{cases} k(2x + y), & 2 < x < 6,\ 0 < y < 5 \\ 0, & \text{otherwise} \end{cases}$$
(a) Find k,
(b) P (X > 3, Y > 2)
(c) P (X + Y < 4)
5. Let X and Y denote the lengths of life, in years, of two components in an electronic system.
If the joint density function of these variables is
$$f(x, y) = \begin{cases} e^{-(x+y)}, & x > 0,\ y > 0 \\ 0, & \text{otherwise} \end{cases}$$
find P (0 < X < 1|Y = 2).
6. Given that var[X] = 2 and var[Y ] = 3, compute
(a) $\mathrm{var}\left[\frac{1}{2}X + 10\right]$,
(b) var[4Y + 9],
(c) $\mathrm{var}\left[\frac{9}{2}Y - 3\right]$
7. If X and Y are random variables (r.v.) with a joint probability distribution, such that var[X] =
1.5, var[Y ] = 2 and cov(X, Y ) = −1, find
(a) var[Z], where Z = X − 2Y + 1
(b) var[W ], where W = X + Y + 1
(c) cov(P, Q), where P = X + Y , and Q = X − Y
(d) cov(P, Q), where P = X + Y , and Q = X − 3Y
8. The compressive strength of samples of cement can be modeled by a normal distribution with
a mean of 6000 kilograms per square centimeter and a standard deviation of 100 kilograms
per square centimeter.
(a) What is the probability that a sample’s strength is less than 6250 kg/cm$^2$?
(b) What is the probability that a sample’s strength is between 5800 and 5900 kg/cm$^2$?
(c) What strength is exceeded by 95% of the samples?
9. The diameter of holes for cable harness is known to have a standard deviation of 0.01 cm. A
random sample of size 30 yields an average diameter of 1.5045 cm. Use α = 0.01, to test the
hypothesis that the true mean hole diameter is 1.50 cm.
10. A contractor makes a large purchase of cement. The bags of cement are supposed to weigh 94
kg. The contractor decided to test a sample of bags to see if he is getting stipulated weight.
A random sample of size 9 yielded the following weights.
94.1 93.4 92.8 93.4 93.5 94.0 93.8 92.9 94.2
Make an appropriate test at the 0.05 level of significance.
11. Iron ore is extracted from rocks obtained from two different sites A and B. The observations
are of percentage of iron ore per sample:
A: 47.9 51.3 42.4 54.9 61.8
B: 39.7 50.3 36.8 29.6 41.2 43.7
Assume normal populations, $N(\mu_A, \sigma_A^2)$ and $N(\mu_B, \sigma_B^2)$ respectively.
(a) Test the hypothesis H0 : µA = 50 against the alternative H1 : µA < 50.
(b) Test the hypothesis H0 : µA = µB against the alternative H1 : µA ≠ µB .
(c) Find 95% confidence intervals for µA and µB .
(d) Find a 90% confidence interval for µA − µB .
12. The average fuel consumption for 10 small cars before and after a certain additive substance
was introduced into their fuel was observed and the data obtained were recorded as follows:
After : 47 38 44 48 52 55 44 52 60 44
Before : 40 39 32 33 40 27 36 56 50 40
Suppose that the differences in fuel consumption are normally distributed with mean $\mu_D$ and
variance $\sigma_D^2$.
(a) At 5% significance level, is there sufficient evidence to conclude that the additive sub-
stance increases fuel consumption in each vehicle?
(b) Construct a 90% confidence interval for µD .
13. A market survey was conducted for the purpose of forming a demographic profile of people who would
like to own an electronic engineering company. The data collected are presented below:
Response          Men    Women
Interested         32       20
Not interested    118      130
Is there sufficient evidence to conclude that the desire to own an electronic engineering com-
pany is related to gender? (Test using α = 0.05).
14. Voltage output (y) and engine speed (x), in metres per second, were observed for a turbine at
a hydroelectric station and recorded as follows:
Engine speed (x) : 166 169 186 202 203
Voltage Output (y) : 1.6 1.3 1.9 1.6 2.2
(a) Plot the data on graph paper.
(b) Fit a simple linear regression model to the data.
(c) Carry out an analysis of variance (ANOVA) to test at the 5% level of significance whether
the slope is significantly different from zero. From the ANOVA table, compute the
coefficient of determination, $r^2$, and interpret it.
(d) Predict the amount of converted sugar when the coded temperature is 1.75. Find a 95%
prediction interval for this prediction.
(e) Find the standard errors of the estimated parameters β̂0 and β̂1 .
15. Explain clearly, using the following data, how the following are constructed:
23 29 40 28 15 22 46 39 22 17 26 33 35 49 20,
36 25 15 31 17 43 54 36 30 30 40 27 24 20 28 42
22 37 17 39 17 22 9 26 29
(a) Stem and leaf plot,

(b) Histogram,
(c) Box plot.
16. An experiment was conducted to compare the speeds of the word-processing packages of two
brands of minicomputers A and B. Forty people with similar backgrounds were randomly
selected and divided into two groups. One group was assigned minicomputer A and another
to minicomputer B. Each person was asked to perform the same word processing job and the
length of time it takes for each person to complete the job was recorded. Past experience
has shown that the population, associated with both minicomputer A and B are normally
distributed. The times required by the group using minicomputer A had a mean of 14.8
minutes and a variance of 3.9 minutes$^2$. For the group using minicomputer B, the mean
length of time to complete the task was 12.3 minutes and the variance was 4.3 minutes$^2$.
Is there sufficient evidence to conclude that the mean length of time required to complete
a word-processing task using minicomputer A is less than that of minicomputer B? (Use
α = 0.01).

 Examples of discrete random variables: number of scratches on a surface, proportion of

 Examples of continuous random variables: electrical current, length, pressure, temperature,

 Rather, we introduce a function f called a probability density function.

 The probability that x lies in the interval [c, d] is given by

 Another property of probability density function is that

f (x) ≥ 0 − ∞ < x < +∞ . (3.1.3)

 Another way to describe the probability distribution of a random variable is to define a

 F (x) is known as the cumulative distribution function (cdf) of a continuous random

(i) Determine whether this is a valid p.d.f

3.2 Measures of central tendency and continuous random vari-

 Variance and Standard Deviation: If f is a probability density function for a continuous

(b) Determine the expected value and the variance.

3.3 Median and Mode of continuous random variables

Solution Using the definition of median, we have

m = −10 ln 0.5 ≈ 6.93.

Determine the following probabilities

Joint probability distributions

fXY (x, y) ≥ 0, for all x, y (3.4.1)

3. For any region R of two-dimensional space

The word statistics has two meanings:

Agriculture (which varieties grow best?)

Genetics, Biology (selecting new varieties, species)

Economics (how are the living standards changing?)

Market Research (comparison of advertising campaigns)

Education (what is the best way to teach small children reading?)

Population: is the totality of all objects/items which we are concerned.

Sample: It is a part of population with which we are concerned or under study.

Mode: is the value or observation which occurs most frequently.

Examples of discrete random variables: number of scratches on a surface, proportion of

Examples of continuous random variables: electrical current, length, pressure, temperature,

Rather, we introduce a function f called a probability density function.

The probability that x lies in the interval [c, d] is given by

Another property of probability density function is that

Another way to describe the probability distribution of a random variable is to define a

F (x) is known as the cumulative distribution function (cdf) of a continuous random

Variance and Standard Deviation: If f is a probability density function for a continuous

The conditional mean of X given Y = y, denoted as E(X|y) or µX|y , is

The conditional mean of Y given X = x, denoted as E(Y |x) or µY |x , is

The conditional variance of Y given X = x, denoted as V (Y |x) or σY2 |x , is

The conditional variance of X given Y = y, denoted as V (X|y) or σX|y

The pdf of the Normal distribution (Gaussian distribution) is given by

If we let z be the standardised variable corresponding to x, that is, if we let

Properties of Normal curve

Definition: A hypothesis is a statement about a population.

H0 is an assertion that a parameter in a statistical model takes a particular value.

It is the hypothesis we usually set up with the expectation of rejecting it.

Test statistics: H0 generally reflects a position of no change

We conduct a test not to prove H0 but to see if it should be rejected.

Critical region: It is a subset of a test statistic that might be observed in an experiment.

Type I error : is the probability of rejecting the H0 when it is true

Type II error: is the probability of failing to reject H0 when it is not true.

Level of significance: is the probability of committing a type I error.

1. H0 : µ=12 (The bulb belongs to the manufacturer)

It is typical that we compare a single observation to a specified value

Here the mean x̄ and the variance are unknown.

1. H0 : µ = 100 (Daily production has not changed)

By condition 3, we have common variance known as pooled variance, given by

Since we are comparing two populations, thus

Here were being asked to determine if µ1 6= µ2 .

Observe that (n1 − 1) + (n2 − 1) < 30, ⇒ t distribution.