Probability and Statistics
HIPH203/HMCS102/HMPH103/HMS204/HFOSCP203/HSST103
Chapter 1
1.1 Probability
• You need to decide whether a coin is loaded (i.e., whether it tends to favor one side over the
other when tossed). You toss the coin 6 times and in all cases you get “Tails”. Would you
say that the coin is loaded?
• You are trying to figure out whether newborn babies can distinguish green from red. To do
so you present two colored cards (one green, one red) to 6 newborn babies. You make sure
that the 2 cards have equal overall luminance so that they are indistinguishable if recorded
by a black and white camera. The 6 babies are randomly divided into two groups. The first
group gets the red card on the left visual field, and the second group on the right visual field.
You find that all 6 babies look longer at the red card than at the green card. Would you say
that babies can distinguish red from green?
• A pregnancy test has 99% sensitivity (i.e., 99 out of 100 pregnant women test positive) and 95%
specificity (i.e., 95 out of 100 non-pregnant women test negative). A woman believes she
has a 10% chance of being pregnant. She takes the test and tests positive. How should she
combine her prior beliefs with the results of the test?
• You need to design a system that detects a sinusoidal tone of 1000Hz in the presence of white
noise. How should you design the system to solve this task optimally?
• How should the photo receptors in the human retina be interconnected to maximize informa-
tion transmission to the brain?
While these tasks appear different from each other, they all share a common problem: The need to
combine different sources of uncertain information to make rational decisions. Probability theory
provides a very powerful mathematical framework to do so. We now go into the mathematical
aspects of probability theory.
A set S that consists of all possible outcomes of a random experiment is called a sample space,
and each outcome is called a sample point. Often there will be more than one sample space that
can describe outcomes of an experiment, but there is usually only one that will provide the most
information.
Example 1.2.1. If we toss a die, then one sample space is given by {1, 2, 3, 4, 5, 6} while another is
{even, odd}. It is clear, however, that the latter would not be adequate to determine, for example,
whether an outcome is divisible by 3.
The sample space is also called the outcome space, reference set, and universal set. It is
often useful to portray a sample space graphically. In such cases, it is desirable to use numbers in
place of letters whenever possible. If a sample space has a finite number of points, it is called a
finite sample space. If it has as many points as there are natural numbers 1, 2, 3, . . . , it is called a
countably infinite sample space. If it has as many points as there are in some interval on the x axis,
such as 0 ≤ x ≤ 1, it is called a noncountably infinite sample space. A sample space that is finite
or countably infinite is often called a discrete sample space, while one that is noncountably infinite
is called a nondiscrete sample space.
Example 1.2.2. The sample space resulting from tossing a die yields a discrete sample space.
However, picking any number, not just integers, from 1 to 10, yields a non-discrete sample space.
1.3 Events
We have defined outcomes as the elements of a sample space S. In practice, we are interested in
assigning probability values not only to outcomes but also to sets of outcomes. For example, we
may want to know the probability of getting an even number when rolling a die. In other words,
we want the probability of the set {2, 4, 6}. An event is a subset A of the sample space S, i.e., it
is a set of possible outcomes. If the outcome of an experiment is an element of A, we say that the
event A has occurred. An event consisting of a single point of S is called a simple or elementary
event.
As particular events, we have S itself, which is the sure or certain event since an element of S must
occur, and the empty set ∅, which is called the impossible event because an element of ∅ cannot
occur.
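By using set operations on events in S, we can obtain other events in S. For example, if A and B
are events, then A ∪ B is the event “either A or B or both”, A ∩ B is the event “both A and B”,
A′ is the event “not A”, and A − B = A ∩ B′ is the event “A but not B”.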
If the sets corresponding to events A and B are disjoint, i.e., A ∩ B = ∅, we often say that the
events are mutually exclusive. This means that they cannot both occur. We say that a collection
of events A1 , A2 , . . . , An is mutually exclusive if every pair in the collection is mutually exclusive.
Example 1.3.1. Consider an experiment of tossing a coin twice. Let A be the event “at least one
head occurs” and B the event “the second toss results in a tail.” Find the events A ∪ B, A ∩ B, A′
and A − B.
Solution: We observe that A = {HT, TH, HH}, B = {HT, TT} and so we have
A ∪ B = {HT, TH, HH, TT} = S,
A ∩ B = {HT},
A′ = {TT},
A − B = {TH, HH}.
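The set operations in Example 1.3.1 can be checked mechanically. The sketch below uses Python's built-in sets; the string encoding of outcomes (for instance "HT" for a head followed by a tail) and the variable names are illustrative choices, not part of the notes.

```python
# Sample space for two coin tosses; "HT" means head first, tail second.
S = {"HH", "HT", "TH", "TT"}

A = {o for o in S if "H" in o}      # at least one head occurs
B = {o for o in S if o[1] == "T"}   # the second toss results in a tail

union = A | B            # A ∪ B
intersection = A & B     # A ∩ B
complement_A = S - A     # A′, taken relative to the sample space S
difference = A - B       # A − B

print(sorted(union), sorted(intersection), sorted(complement_A), sorted(difference))
```

The complement is computed as S − A because a complement is only defined relative to the sample space.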
In any random experiment there is always uncertainty as to whether a particular event will or will
not occur. As a measure of the chance, or probability, with which we can expect the event to occur,
it is convenient to assign a number between 0 and 1. If we are sure or certain that an event will
occur, we say that its probability is 100% or 1. If we are sure that the event will not occur, we
say that its probability is zero. If, for example, the probability is 1/4, we would say that there is
a 25% chance it will occur and a 75% chance that it will not occur. Equivalently, we can say that
the odds against occurrence are 75% to 25%, or 3 to 1.
There are two important procedures by means of which we can estimate the probability of an
event.
Both the classical and frequency approaches have serious drawbacks, the first because the words
“equally likely” are vague and the second because the “large number” involved is vague. Because
of these difficulties, mathematicians have been led to an axiomatic approach to probability.
1.5 The Axioms of Probability
Suppose we have a sample space S. If S is discrete, all subsets correspond to events and conversely;
if S is nondiscrete, only special subsets (called measurable) correspond to events. To each event A
in the class C of events, we associate a real number P (A). The P is called a probability function,
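and P(A) the probability of the event, if the following axioms are satisfied.
Axiom 1. For every event A in the class C, P(A) ≥ 0.
Axiom 2. For the sure or certain event S in the class C, P(S) = 1.
Axiom 3. For any number of mutually exclusive events A1 , A2 , . . . in the class C,
P(A1 ∪ A2 ∪ . . .) = P(A1) + P(A2) + . . .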
From the above axioms we can now prove various theorems on probability that are important in
further work.
Theorem 1.6.1. If A1 ⊂ A2 , then P (A1 ) ≤ P (A2 ) and P (A2 − A1 ) = P (A2 ) − P (A1 ).
Theorem 1.6.2. For every event A, 0 ≤ P (A) ≤ 1, i.e., a probability between 0 and 1.
Theorem 1.6.3. For ∅, the empty set, P (∅) = 0, i.e., the impossible event has probability zero.
Theorem 1.6.4. If A′ is the complement of A, then P(A′) = 1 − P(A).
Theorem 1.6.5. If A = A1 ∪ A2 ∪ A3 ∪ . . . ∪ An , where A1 , A2 , . . . , An are mutually exclusive
events, then
P (A) = P (A1 ) + P (A2 ) + P (A3 ) + . . . + P (An ).
In particular, if A = S, the sample space, then
P (A1 ) + P (A2 ) + P (A3 ) + . . . + P (An ) = 1.
Theorem 1.6.6. If A and B are any two events, then P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
More generally, if A1 , A2 , A3 are any three events, then
\[ P(A_1 \cup A_2 \cup A_3) = P(A_1) + P(A_2) + P(A_3) - P(A_1 \cap A_2) - P(A_2 \cap A_3) - P(A_3 \cap A_1) + P(A_1 \cap A_2 \cap A_3). \]
Generalizations to n events can also be made.
Theorem 1.6.7. For any events A and B, P(A) = P(A ∩ B) + P(A ∩ B′).
Theorem 1.6.8. If an event A must result in the occurrence of one of the mutually exclusive events
A1 , A2 , . . . , An , then
P (A) = P (A ∩ A1 ) + P (A ∩ A2 ) + · · · + P (A ∩ An ).
It follows that we can arbitrarily choose any nonnegative numbers for the probabilities of these
simple events as long as the previous equation is satisfied. In particular, if we assume equal
probabilities for all simple events, then
\[ P(A_k) = \frac{1}{n}, \qquad k = 1, 2, \ldots, n. \]
And if A is any event made up of h such simple events, we have
\[ P(A) = \frac{h}{n}. \]
This is equivalent to the classical approach to probability. We could of course use other procedures
for assigning probabilities, such as the frequency approach.
Assigning probabilities provides a mathematical model, the success of which must be tested by
experiment in much the same manner that theories in physics or other sciences must be tested
by experiment.
Example 1.7.1. A single die is tossed once. Find the probability of a 2 or 5 turning up.
Solution: The sample space is S = {1, 2, 3, 4, 5, 6}. If we assign equal probabilities to the sample
points, i.e., if we assume that the die is fair, then
\[ P(1) = P(2) = \cdots = P(6) = \frac{1}{6}. \]
The event that either 2 or 5 turns up is indicated by 2 ∪ 5. Therefore,
\[ P(2 \cup 5) = P(2) + P(5) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}. \]
1.8 Conditional Probability
Let A and B be two events such that P(A) > 0. Denote by P(B|A) the probability of B given that A
has occurred. Since A is known to have occurred, it becomes the new sample space replacing the
original S. From this we are led to the definition
\[ P(B \mid A) \equiv \frac{P(A \cap B)}{P(A)} \tag{1.8.1} \]
or
\[ P(A \cap B) \equiv P(A)\,P(B \mid A). \tag{1.8.2} \]
In words, this is saying that the probability that both A and B occur is equal to the probability
that A occurs times the probability that B occurs given that A has occurred. We call P (B|A)
the conditional probability of B given A, i.e., the probability that B will occur given that A
has occurred. It is easy to show that conditional probability satisfies the axioms of probability
previously discussed.
Example 1.8.1. Find the probability that a single toss of a die will result in a number less than 4
if (a) no other information is given and (b) it is given that the toss resulted in an odd number.
Solution:
(a) Let B denote the event {less than 4}. Since B is the union of the events 1, 2, or 3 turning
up, we see by Theorem 1.6.5 that
\[ P(B) = P(1) + P(2) + P(3) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}, \]
assuming equal probabilities for the sample points.
(b) Letting A be the event {odd number}, we see that P(A) = 3/6 = 1/2. Also, P(A ∩ B) = 2/6 = 1/3.
Then
\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{1/3}{1/2} = \frac{2}{3}. \]
Hence, the added knowledge that the toss results in an odd number raises the probability from
1/2 to 2/3.
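As a quick check of Example 1.8.1, the following sketch enumerates the six equally likely outcomes of a fair die and applies definition (1.8.1) with exact fractions. The helper names prob and cond_prob are hypothetical and introduced only for this illustration.

```python
from fractions import Fraction

# Sample space of a fair die; every outcome is assumed equally likely.
S = set(range(1, 7))

def prob(event):
    """Classical probability h/n of an event (a subset of S)."""
    return Fraction(len(event & S), len(S))

def cond_prob(b, a):
    """P(B|A) = P(A ∩ B) / P(A), assuming P(A) > 0."""
    return prob(a & b) / prob(a)

B = {x for x in S if x < 4}        # number less than 4
A = {x for x in S if x % 2 == 1}   # odd number

p_B = prob(B)
p_B_given_A = cond_prob(B, A)
print(p_B, p_B_given_A)
```

Using Fraction instead of floats keeps the results in the same form as the hand computation.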
1.9 Theorems on Conditional Probability
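Theorem 1.9.1. For any three events A1 , A2 , A3 , we have
\[ P(A_1 \cap A_2 \cap A_3) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2). \]
In words, the probability that A1 and A2 and A3 all occur is equal to the probability that A1
occurs times the probability that A2 occurs given that A1 has occurred times the probability that
A3 occurs given that both A1 and A2 have occurred. The result is easily generalized to n events.
Theorem 1.9.2. If an event A must result in one of the mutually exclusive events A1 , A2 , . . . , An ,
then
\[ P(A) = P(A_1)\,P(A \mid A_1) + P(A_2)\,P(A \mid A_2) + \cdots + P(A_n)\,P(A \mid A_n). \]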
If P(B|A) = P(B), i.e., the probability of B occurring is not affected by the occurrence or
nonoccurrence of A, then we say that A and B are independent events. This is equivalent to
\[ P(A \cap B) = P(A)\,P(B). \]
Notice also that if this equation holds, then A and B are independent.
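We say that three events A1 , A2 , A3 are independent if they are pairwise independent,
\[ P(A_j \cap A_k) = P(A_j)\,P(A_k), \qquad j \neq k, \quad j, k = 1, 2, 3, \]
and
\[ P(A_1 \cap A_2 \cap A_3) = P(A_1)\,P(A_2)\,P(A_3). \]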
Both of these properties must hold in order for the events to be independent. Independence of
more than three events is easily defined.
Note: In order to use this multiplication rule, all of your events must be independent.
1.11 Bayes’ Theorem or Rule
Suppose that A1 , A2 , . . . , An are mutually exclusive events whose union is the sample space S, i.e.,
one of the events must occur. Then if A is any event, we have the following important theorem:
\[ P(A_k \mid A) = \frac{P(A_k)\,P(A \mid A_k)}{\sum_{j=1}^{n} P(A_j)\,P(A \mid A_j)}. \tag{1.11.1} \]
This enables us to find the probabilities of the various events A1 , A2 , . . . , An that can occur.
For this reason Bayes’ theorem is often referred to as a theorem on the probability of causes.
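Bayes' theorem can be illustrated with the pregnancy-test question from Section 1.1: a prior belief of 10%, a test that is positive for 99 of 100 pregnant women, and negative for 95 of 100 non-pregnant women. The sketch below is one way to evaluate (1.11.1); the function name posterior and the list layout are assumptions made for this example.

```python
def posterior(priors, likelihoods, k):
    """Bayes' rule (1.11.1): P(A_k | A) for mutually exclusive, exhaustive A_1..A_n.

    priors[j] is P(A_j); likelihoods[j] is P(A | A_j).
    """
    evidence = sum(p * l for p, l in zip(priors, likelihoods))  # denominator of (1.11.1)
    return priors[k] * likelihoods[k] / evidence

# A_1 = pregnant, A_2 = not pregnant; A = the test is positive.
priors = [0.10, 0.90]
# P(positive | pregnant) = 0.99; P(positive | not pregnant) = 1 - 0.95 = 0.05.
likelihoods = [0.99, 0.05]

p_pregnant_given_positive = posterior(priors, likelihoods, 0)
print(round(p_pregnant_given_positive, 4))
```

Under these numbers the positive result raises the woman's probability of being pregnant from 10% to roughly 69%.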
Chapter 2
Introduction
1. It refers to the sets of data relating to a wide range of topics such as the size of popula-
tions, production activity, retail prices, incomes, rainfall, etc.
2. Statistics refers to the theory and methods used for collection, description, analysis and
interpretation of numerical data.
Based on the above definitions, one can say that statistics comprises two branches
2.2 Statistics in engineering and scientific experimentation
Environmental Studies (do strong electric or magnetic fields induce higher cancer rates?)
Quality engineering
2.4.1 Why do we sample?
There are various reasons why statisticians use samples and some are as follows:
Cost-effective: Considering a sample is more cost effective with respect to time, money, and
labour than considering the whole population.
Accessibility: Some members of the population may not be accessible, so it is only
logical to consider a sample.
Utility: In some experimental methods it would be futile to consider the whole population if
the process involves destroying the objects/items/individuals.
Precision: Fewer human errors occur when handling a sample than when surveying the whole population.
2.5 Descriptive statistics/Explanatory statistics
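Median: The sample median is obtained by first ordering the n observations from smallest
to largest (with any repeated values included so that every sample observation appears in the
ordered list). Then,
\[ x^{*} = \begin{cases} \text{the single middle value} = \left(\dfrac{n+1}{2}\right)^{\text{th}} \text{ ordered value}, & \text{if } n \text{ is odd}, \\[2ex] \text{the average of the two middle values} = \text{average of the } \left(\dfrac{n}{2}\right)^{\text{th}} \text{ and } \left(\dfrac{n}{2}+1\right)^{\text{th}} \text{ ordered values}, & \text{if } n \text{ is even}. \end{cases} \]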
Advantages of median:
Disadvantage(s) of median:
Arithmetic mean: the sum of all n observations or values divided by the sample size n, that
is
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{2.5.1} \]
– Advantages: It considers all the values in the sample to find the outcome.
– Disadvantage: It is affected by extreme values in a data set.
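Variance is the sum of the squared deviations from the mean of n values divided by the
degrees of freedom (n − 1), that is
\[
\begin{aligned}
\operatorname{Var}[x_i] = s^2 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \\
&= \frac{\sum_{i=1}^{n} \left(x_i^2 - 2\bar{x}x_i + \bar{x}^2\right)}{n-1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - 2\,\dfrac{\sum_{i=1}^{n} x_i}{n}\left(\sum_{i=1}^{n} x_i\right) + n\,\dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n^2}}{n-1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}.
\end{aligned}
\tag{2.5.2}
\]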
n−1
Example 2.5.1. Determine the mode, median and mean from the following dataset: 4,5,1,4,12,10.
Solution
2. Median: To find the median, first arrange the observations by size, in either
ascending or descending order, that is,
1, 4, 4, 5, 10, 12
Thus, the median is
\[ \frac{4+5}{2} = 4.5 \]
3. Mean:
\[ \bar{x} = \frac{4 + 5 + 1 + 4 + 12 + 10}{6} = \frac{36}{6} = 6. \]
4. Variance:
\[ \operatorname{Var}[x_i] = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1} = \frac{302 - \dfrac{(36)^2}{6}}{5} = 17.2 \]
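The summary statistics of Example 2.5.1 can be reproduced with Python's standard statistics module, as in the sketch below. The variable shortcut evaluates the computational formula (2.5.2) directly, so the two routes to the sample variance can be compared; the variable names are illustrative only.

```python
import statistics

data = [4, 5, 1, 4, 12, 10]

mode = statistics.mode(data)          # most frequent value
median = statistics.median(data)      # average of the two middle values (n is even)
mean = statistics.mean(data)          # arithmetic mean (2.5.1)
variance = statistics.variance(data)  # sample variance with n - 1 in the denominator

# Computational formula (2.5.2): (sum of x^2 - (sum of x)^2 / n) / (n - 1)
n = len(data)
shortcut = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

print(mode, median, mean, variance, shortcut)
```

Note that statistics.variance divides by n − 1; statistics.pvariance would divide by n instead.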
Prepare your own notes on Histograms, Stem and leaf plots and Box-plots (focus on how to con-
struct these graphs, look at advantages and disadvantages)
Home work
Explain clearly, using the following data, how the following are constructed:
23 29 40 28 15 22 46 39 22 17 26 33 35 49 20,
36 25 15 31 17 43 54 36 30 30 40 27 24 20 28 42
22 37 17 39 17 22 9 26 29
(b) Histogram,
Chapter 3
Random variables
A discrete random variable is a random variable with a finite (or countably infinite) set of
real numbers for its range.
A continuous random variable is a random variable with an interval (either finite or infinite)
of real numbers for its range.
A continuous random variable can also be defined as a random variable whose values can
be any of infinitely many points in an interval.
With continuous random variables, the interval on the real line may be open or closed,
bounded or unbounded.
For instance, the interval could be [0, 1], (0, ∞) or (−∞, ∞).
To define the probability of an event involving a continuous random variable, we cannot simply
count the number of ways the event can occur (as we can with a discrete random variable).
This function must be non-negative and have the property that the area of the region bounded
by the graph of f and the x-axis, −∞ < x < +∞, is 1, that is,
\[ P(-\infty < x < +\infty) = \int_{-\infty}^{\infty} f(x)\,dx = 1. \tag{3.1.1} \]
Therefore
\[ \frac{dF(x)}{dx} = f(x) \tag{3.1.5} \]
Example 3.1.1. Suppose that the battery failure time, measured in hours, has a probability
density function (p.d.f)
\[ f(x) = \frac{2}{(x+1)^3}, \qquad x \ge 0. \]
(ii) Find the probability that a randomly selected battery from the warehouse will have a lifetime
less than 5 hours.
(iii) Determine the cumulative distribution function of this continuous random variable.
Solution
(i) To be a valid p.d.f over the given interval, f must have two characteristics.
1. It must be non-negative over the entire interval, and,
2. Its definite integral over the interval must be exactly 1.
Since the given p.d.f is non-negative over the defined interval, we now investigate the second
characteristic, that is,
\[ P(0 \le x < \infty) = \int_0^{\infty} \frac{2}{(x+1)^3}\,dx = \lim_{b \to \infty} \left[ -\frac{1}{(x+1)^2} \right]_0^{b} = 1. \]
(ii)
\[ P(0 \le x \le 5) = \int_0^{5} \frac{2}{(x+1)^3}\,dx = \left[ -\frac{1}{(x+1)^2} \right]_0^{5} = \frac{35}{36}. \]
(iii)
\[ F(x) = \int_0^{x} \frac{2}{(u+1)^3}\,du = \left[ -\frac{1}{(u+1)^2} \right]_0^{x} = 1 - \frac{1}{(x+1)^2}. \]
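The closed-form results of Example 3.1.1 can be cross-checked numerically. The sketch below compares the cumulative distribution function with a simple midpoint-rule integral of the density; midpoint_integral is a hypothetical helper written for this illustration, not a library routine.

```python
def pdf(x):
    """Battery failure-time density f(x) = 2 / (x + 1)^3 for x >= 0."""
    return 2.0 / (x + 1.0) ** 3

def cdf(x):
    """Closed-form CDF F(x) = 1 - 1 / (x + 1)^2 from part (iii)."""
    return 1.0 - 1.0 / (x + 1.0) ** 2

def midpoint_integral(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

p_less_than_5 = cdf(5.0)                     # part (ii), via the CDF
numeric = midpoint_integral(pdf, 0.0, 5.0)   # part (ii), via numerical integration

print(p_less_than_5, numeric)
```

Both values should agree with 35/36 to several decimal places, and cdf(x) approaches 1 as x grows, which is the normalisation condition checked in part (i).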
Expected value or Mean: If f is a probability density function (pdf) for a continuous random
variable x over the interval [a, b], then the expected value of x is given by
\[ \mu = E(x) = \int_a^b x f(x)\,dx. \tag{3.2.1} \]
Variance: The variance of x is
\[ V(x) = \int_a^b (x - \mu)^2 f(x)\,dx, \]
where µ is the expected value of x. The standard deviation of x is $\sigma = \sqrt{V(x)}$.
Example 3.2.1. Let the continuous random variable X denote the current measured in a thin
copper wire in milliamperes. Assume that the range of X is [0, 20mA], and assume that the
probability density function of X is
f (x) = 0.05, 0 ≤ x ≤ 20
(a) What is the probability that a current measurement is less than 10 milliamperes?
Solution
(a)
\[ P(X < 10) = \int_0^{10} 0.05\,dx = 0.5. \]
(b)(i) Expected value:
\[ E(x) = \int_0^{20} x f(x)\,dx = 0.05 \left[ \frac{x^2}{2} \right]_0^{20} = 10. \]
(ii) Variance:
\[ V(x) = \int_0^{20} (x - 10)^2 f(x)\,dx = 0.05 \left[ \frac{(x-10)^3}{3} \right]_0^{20} = 33.33. \]
Median: Another useful measure of central tendency is the median. We define the median to
be the number m such that precisely half of the x-values lie below m and the other half of
the x-values lie above m. That is,
\[ P(a \le x \le m) = 0.5 \]
Example 3.3.1. Determine the median value for the following p.d.f
\[ f(x) = \frac{1}{10} e^{-x/10}, \qquad x > 0. \]
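Solution: The median m satisfies
\[ P(0 < x \le m) = \int_0^{m} \frac{1}{10} e^{-x/10}\,dx = \left[ -e^{-x/10} \right]_0^{m} = -e^{-m/10} + 1 = 0.5. \]
Simplifying gives
\[ m = 10 \ln 2 \approx 6.93. \]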
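The median condition of Example 3.3.1 is easy to confirm numerically. In the sketch below, exp_cdf is a hypothetical helper implementing the cumulative distribution function 1 − e^(−x/10) obtained from the integral above, and the median is taken from solving −e^(−m/10) + 1 = 0.5.

```python
import math

def exp_cdf(x, scale=10.0):
    """CDF of the density f(x) = (1/scale) * exp(-x/scale), x > 0."""
    return 1.0 - math.exp(-x / scale)

# Solving -exp(-m/10) + 1 = 0.5 gives m = 10 * ln(2).
median = 10.0 * math.log(2.0)

print(median, exp_cdf(median))
```

Evaluating the CDF at the computed median returns 0.5, as the definition requires.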
Mode: The mode is the value that appears most often in a set of data. The mode of a
continuous probability distribution is the value x at which its probability density function
has its maximum value, so the mode is at the peak.
Tutorial #1
1. Show that the following functions are probability density functions for some k and determine
the value of k. Then determine the mean and variance of X.
(a) $f(x) = kx^2$ for 0 < x < 4. [k = 3/64, µ = 3, σ² = 3/5]
(b) $f(x) = k(1 + 2x)$ for 0 < x < 2. [k = 1/6, µ = 11/9, σ² = 23/81]
(c) $f(x) = ke^{-x}$ for x > 0. [k = 1, µ = 1, σ² = 1]
(d) $f(x) = k$ for 100 < x < 100 + k, k > 0. [k = 1, µ = 201/2, σ² = 1/12]
2. Suppose that
\[ f(x) = \begin{cases} e^{-(x-6)}, & 6 < x, \\ 0, & x \le 6. \end{cases} \]
4. The probability density function of the time to failure of an electronic component in an
airplane (in hours) is
\[ f(x) = \frac{1}{1000} e^{-0.001x}, \qquad x > 0 \]
(a) Determine the probability that
(i) a component lasts more than 3000 hours before failure. [$e^{-3} \approx 0.0498$]
(ii) a component fails in the interval from 1000 to 2000 hours. [0.2325]
(iii) a component fails before 1000 hours. [0.6321]
(b) Determine the number of hours at which 10% of all components have failed. [105.36hrs]
3.4 Continuous random variables in two-dimensions
Analogous to the probability density function of a single continuous random variable, a joint
probability density function can be defined over two-dimensional space.
The double integral of fXY (x, y) over a region R provides the probability that (X, Y ) assumes
a value in R.
This integral can be interpreted as volume under the surface fXY (x, y) over the region R.
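Definition: A joint probability density function for the continuous random variables X and Y,
denoted as fXY (x, y), satisfies the following properties:
1.
\[ f_{XY}(x, y) \ge 0 \quad \text{for all } x, y. \]
2.
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\,dy = 1. \tag{3.4.2} \]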
Example 3.4.1. A privately owned business operates both a drive-in facility and a walk-in
facility. On a randomly selected day, let X and Y, respectively, be the proportions of the time that
the drive-in and the walk-in facilities are in use, and suppose that the joint density function of
these random variables is
\[ f(x, y) = \begin{cases} \frac{2}{5}(2x + 3y), & 0 \le x \le 1,\ 0 \le y \le 1, \\ 0, & \text{otherwise.} \end{cases} \]
(a) Determine whether this is a valid p.d.f
(b) Find P [(X, Y ) ∈ A], where A = {(x, y)|0 < x < 0.5, 0.25 < y < 0.5}
Solution
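Thus
\[
\begin{aligned}
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\,dy &= \int_0^1 \int_0^1 \frac{2}{5}(2x + 3y)\,dx\,dy \\
&= \int_0^1 \left[ \frac{2x^2}{5} + \frac{6xy}{5} \right]_{x=0}^{x=1} dy \\
&= \int_0^1 \left( \frac{2}{5} + \frac{6y}{5} \right) dy \\
&= \left[ \frac{2y}{5} + \frac{3y^2}{5} \right]_0^1 = 1.
\end{aligned}
\]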
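Both parts of Example 3.4.1 can be checked numerically. The sketch below integrates the joint density with a two-dimensional midpoint rule; double_midpoint is a hypothetical helper written for this illustration, and the grid size n = 200 is an arbitrary choice. Because the integrand is linear in x and y on the unit square, the midpoint rule is exact there up to floating-point rounding.

```python
def joint_pdf(x, y):
    """Joint density f(x, y) = (2/5)(2x + 3y) on the unit square, 0 elsewhere."""
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        return 0.4 * (2.0 * x + 3.0 * y)
    return 0.0

def double_midpoint(f, x0, x1, y0, y1, n=200):
    """Approximate the double integral of f over [x0, x1] x [y0, y1]."""
    hx = (x1 - x0) / n
    hy = (y1 - y0) / n
    total = 0.0
    for i in range(n):
        x = x0 + (i + 0.5) * hx
        for j in range(n):
            y = y0 + (j + 0.5) * hy
            total += f(x, y)
    return total * hx * hy

# (a) The density must integrate to 1 over the unit square.
total_mass = double_midpoint(joint_pdf, 0.0, 1.0, 0.0, 1.0)

# (b) P[(X, Y) in A] with A = {0 < x < 0.5, 0.25 < y < 0.5}.
p_A = double_midpoint(joint_pdf, 0.0, 0.5, 0.25, 0.5)

print(total_mass, p_A)
```

Integrating by hand over A gives (2/5)(0.203125) = 0.08125, which the numerical value reproduces.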
In probability theory and statistics, the marginal distribution of a subset of a collection of random
variables is the probability distribution of the variables contained in the subset. It gives the prob-
abilities of various values of the variables in the subset without reference to the values of the other
variables. This contrasts with a conditional distribution, which gives the probabilities contingent
upon the values of the other variables.
Definition: If the joint density function of continuous random variables X and Y is fXY (x, y), the
marginal probability density functions of X and Y are
\[ f_X(x) = \int_{R_x} f_{XY}(x, y)\,dy, \tag{3.5.1} \]
\[ f_Y(y) = \int_{R_y} f_{XY}(x, y)\,dx, \tag{3.5.2} \]
respectively, where Rx denotes the set of all points in the range of (X, Y ) for which X = x, and
Ry denotes the set of all points in the range of (X, Y ) for which Y = y.
\[ f_Y(y) = \int_{x=0}^{x=1} \frac{2}{5}(2x + 3y)\,dx = \frac{2(1 + 3y)}{5}. \]
Definition 3.5.1. Given continuous random variables X and Y with joint probability density
function fXY (x, y), the conditional probability density function of Y given X = x is
\[ f_{Y|x}(y) = f(y \mid x) = \frac{f_{XY}(x, y)}{f_X(x)} \quad \text{for } f_X(x) > 0, \tag{3.5.3} \]
and the conditional probability density function of X given Y = y is
\[ f_{X|y}(x) = f(x \mid y) = \frac{f_{XY}(x, y)}{f_Y(y)} \quad \text{for } f_Y(y) > 0. \tag{3.5.4} \]
3.5.1 Conditional mean and variance for continuous random variables
\[ V(X \mid y) = \int_{R_y} (x - \mu_{X|y})^2 f_{X|y}(x)\,dx = \int_{R_y} x^2 f_{X|y}(x)\,dx - \mu_{X|y}^2 \tag{3.5.8} \]
Example 3.5.2. Consider the pdf fXY (x, y) = x + y, for 0 < x < 1 and 0 < y < 1.
Determine
1. f (Y |x)
Solution
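1. The marginal density of X is
\[ f(x) = \int_0^1 (x + y)\,dy = \left[ xy + \frac{y^2}{2} \right]_0^1 = x + 0.5 \]
Thus
\[ f(Y \mid x) = \frac{f_{XY}(x, y)}{f(x)} = \frac{x + y}{x + 0.5}, \qquad 0 < y < 1. \]
2.
\[ P(0.25 < Y < 0.5 \mid x = 0.5) = \int_{0.25}^{0.5} \frac{0.5 + y}{0.5 + 0.5}\,dy = \left[ \frac{(1 + y)y}{2} \right]_{0.25}^{0.5} = \frac{7}{32} = 0.21875. \]
3. With x = 0.5,
\[ E(Y \mid x = 0.5) = \int_0^1 y f(Y \mid x)\,dy = \int_0^1 y\,\frac{0.5 + y}{0.5 + 0.5}\,dy = \left[ \frac{y^2}{4} + \frac{y^3}{3} \right]_0^1 = \frac{7}{12}. \]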
The definition of independence for continuous random variables is similar to that of discrete random
variables. For continuous random variables if fXY (x, y) = fX (x)fY (y) for all x and y then the
random variables X and Y are said to be independent.
Solution
\[ f_X(x) = \int_0^{\infty} e^{-(x+y)}\,dy = e^{-x} \int_0^{\infty} e^{-y}\,dy = e^{-x}. \]
\[ f_Y(y) = \int_0^{\infty} e^{-(x+y)}\,dx = e^{-y} \int_0^{\infty} e^{-x}\,dx = e^{-y}. \]
Clearly $f_{XY}(x, y) = e^{-(x+y)} = f_X(x) f_Y(y)$, so X and Y are independent.
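\[ \operatorname{var}[aX + c] = a^2 \operatorname{var}[X] \]
2. If X and Y are random variables with a joint probability distribution, and a and b are constants,
then
\[ \operatorname{var}[aX + bY] = a^2 \operatorname{var}[X] + b^2 \operatorname{var}[Y] + 2ab \operatorname{cov}[X, Y] \]
3. If X and Y are random variables with a joint probability distribution, and a and b are constants,
then
\[ \operatorname{var}[aX - bY] = a^2 \operatorname{var}[X] + b^2 \operatorname{var}[Y] - 2ab \operatorname{cov}[X, Y] \]
4. If X and Y are independent random variables (so that cov[X, Y] = 0), and a and b are constants,
then
\[ \operatorname{var}[aX \pm bY] = a^2 \operatorname{var}[X] + b^2 \operatorname{var}[Y] \]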
Example 3.7.1. If X and Y are random variable (r.v) with joint probability distribution, such
that var[X] = 2, var[Y ] = 4 and cov(X, Y ) = −2, find
Solution
(a)
(b)
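The variance rules above reduce to one formula once the sign is absorbed into the coefficient b. The sketch below applies it to the values stated in Example 3.7.1 (var[X] = 2, var[Y] = 4, cov(X, Y) = −2). The quantities requested in the example are not shown in these notes, so the coefficient choices below (X + Y, X − Y and 3X − 4Y + 8) are purely illustrative assumptions, as is the function name var_linear.

```python
def var_linear(a, b, var_x, var_y, cov_xy):
    """var[aX + bY + c] = a^2 var[X] + b^2 var[Y] + 2ab cov[X, Y].

    An additive constant c does not change the variance, so it is not a parameter.
    A negative b covers the var[aX - bY] rule.
    """
    return a * a * var_x + b * b * var_y + 2 * a * b * cov_xy

var_x, var_y, cov_xy = 2, 4, -2   # values given in Example 3.7.1

v_sum = var_linear(1, 1, var_x, var_y, cov_xy)     # var[X + Y]
v_diff = var_linear(1, -1, var_x, var_y, cov_xy)   # var[X - Y]
v_combo = var_linear(3, -4, var_x, var_y, cov_xy)  # var[3X - 4Y + 8], hypothetical coefficients

print(v_sum, v_diff, v_combo)
```

With a negative covariance, the variance of the sum is smaller than the variance of the difference, which the first two results show.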
One of the most important examples of a continuous probability distribution is the Normal
distribution.
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}, \qquad -\infty < x < \infty \tag{3.8.1} \]
where µ and σ are the mean and standard deviation, respectively.
Equation (3.8.3) is often termed the standard normal density function.
2. The curve is symmetric about a vertical axis through the mean.
3. The curve has its points of inflection at x = µ + σ and x = µ − σ.
4. The curve approaches the horizontal axis asymptotically
5. The total area under the curve is equivalent to unity.
Example 3.8.1. 1. Find the area under the standard normal curve
2. Suppose the current measurements in a strip of wire are assumed to follow a normal distribution
with a mean of 10 milliamperes and a variance of 4 (milliamperes)². What is the probability
that a measurement will exceed 13 milliamperes?
Solution
1. (a)
P (0 ≤ Z ≤ 1.2) = P (Z ≤ 1.2) − P (Z ≤ 0)
= 0.8849 − 0.5
= 0.3849
2. Let X denote the current in milliamperes. The requested probability can be represented as
P(X > 13). Let Z = (X − 10)/2. Therefore
\[ P(X > 13) = P\left( \frac{X - 10}{2} > \frac{13 - 10}{2} \right) = P(Z > 1.5) = 0.06681. \]
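The table look-ups in Example 3.8.1 can be reproduced without tables, since the standard normal cumulative distribution function can be written in terms of the error function. The sketch below uses math.erf from the Python standard library; the function name phi is an illustrative choice.

```python
import math

def phi(z):
    """Standard normal CDF, P(Z <= z), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Part 1(a): area under the standard normal curve between 0 and 1.2.
p_between = phi(1.2) - phi(0.0)

# Part 2: X ~ Normal(mean 10, variance 4), so sigma = 2; standardise and use the upper tail.
mu, sigma = 10.0, 2.0
p_exceeds_13 = 1.0 - phi((13.0 - mu) / sigma)

print(round(p_between, 4), round(p_exceeds_13, 5))
```

The results agree with the tabulated values 0.3849 and 0.06681 to the precision of the table.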
Chapter 4
Hypothesis testing
4.1 Introduction
Hypothesis testing is concerned with deciding between two hypotheses, H0 (the null hypothesis)
and H1 (the alternative hypothesis).
H1 expresses the way in which the value of a particular parameter in a statistical model may
deviate from that specified in H0
Example 4.1.1. A machine that produces metal cylinders is set to make cylinders with a diameter
of 50 mm. Is it practical that all cylinders that this machine will produce will have a diameter of
exactly 50 mm?
Solution
(i) H0 : µ = 50 (All cylinders produced by the machine have the set diameter, 50mm)
(ii) H1 : µ ≠ 50 (There is a possibility that the machine can produce cylinders whose diameter is not
50 mm)
Definitions
Such tests are based on the value of sample statistics, such as x̄, z or t scores and these are
called test statistics.
The subset is chosen so that the total probability is low on H0 and is better explained by H1 .
The P-value is the smallest level of significance that would lead to rejection of the null
hypothesis H0 with the given data.
4.2 Steps in hypothesis testing
Figure 4.1: The distribution of Z0 when H0 : µ = µ0 is true, with the critical region for (a) the
two-sided alternative H1 : µ ≠ µ0 , (b) the one-sided alternative H1 : µ > µ0 , and (c) the one-sided
alternative H1 : µ < µ0 .
3. Choose the appropriate test statistic and establish the critical region.
Remark: We reject H0 when the computed value lies in the critical region.
Example 4.2.1. An electrical firm manufactures light bulbs that have a lifetime that is approx-
imately normally distributed with a mean of 12 hours and variance 0.64 hours2 . A light bulb is
selected at random, and is tested, and the lifetime is found to be 13.3 hours. Determine whether
this bulb belongs to the manufacturer. Use the 5% level of significance.
Solution
2. Level of significance: 5%
Figure 4.2: vulnerability of the z-score
4.3 Comparing a single mean to a specified value, when the population variance is known
Solution
1. H0 : µ = 200
H1 : µ > 200
2. Level of significance: 5%
3. Test statistic: Z (Normal distribution)
4. Computed Z-value
\[ Z = \frac{\bar{x} - \mu}{\mathrm{S.E.}[\bar{x}]} = \frac{214 - 200}{10/2} = 2.8 \]
5. Conclusion: Reject H0 . The manufacturer should accept the lot since the mean breaking
strength is greater than 200 pascals.
4.4 The difference between two means when population variances
σ1² and σ2² are known
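Since we have two populations under study, we need to derive the standard error
of the difference between the two sample means.
We know that
\[ \operatorname{var}[\bar{x}] = \frac{\sigma^2}{n}. \]
Then, for independent samples,
\[ \operatorname{var}[\bar{x}_1 - \bar{x}_2] = \operatorname{var}[\bar{x}_1] + \operatorname{var}[\bar{x}_2] = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}. \]
Thus
\[ \mathrm{S.E.}[\bar{x}_1 - \bar{x}_2] = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]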
Example 4.4.1. A manufacturer claims that the average tensile strength of synthetic fibre A ex-
ceeds the average tensile strength of synthetic fibre B. To test his claim, 50 pieces of each type
of synthetic fibre are tested under similar conditions. Type A had an average tensile strength of
43.7 psi and a variance of 11.8 psi2 , while type B had an average tensile strength of 41.5 psi and a
variance of 46.3 psi2 . At 5% significance level, test the manufacturer’s claim.
Solution
1. H0 : µA − µB = 0
H1 : µA − µB > 0
2. Test statistic
\[ Z = \frac{(\bar{x}_A - \bar{x}_B) - (\mu_A - \mu_B)}{\sqrt{\dfrac{\sigma_A^2}{n_A} + \dfrac{\sigma_B^2}{n_B}}} = \frac{(43.7 - 41.5) - 0}{\sqrt{\dfrac{11.8}{50} + \dfrac{46.3}{50}}} = 2.04 \]
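The test statistic of Example 4.4.1 can be recomputed as in the sketch below. The function name two_sample_z is hypothetical, and the critical value 1.645 is the standard upper 5% point of the standard normal distribution for a one-sided test; it is supplied here as an assumption, since the critical region for this example is not shown in the notes.

```python
import math

def two_sample_z(mean_a, mean_b, var_a, var_b, n_a, n_b, delta0=0.0):
    """Z statistic for H0: mu_A - mu_B = delta0 with known (or large-sample) variances."""
    standard_error = math.sqrt(var_a / n_a + var_b / n_b)
    return ((mean_a - mean_b) - delta0) / standard_error

z = two_sample_z(43.7, 41.5, 11.8, 46.3, 50, 50)

z_crit = 1.645            # upper 5% point of N(0, 1), one-sided alternative mu_A - mu_B > 0
reject_h0 = z > z_crit

print(round(z, 2), reject_h0)
```

Since the computed Z exceeds 1.645, the sketch rejects H0 at the 5% level, which supports the manufacturer's claim.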
Example 4.5.1. A manufacturer of television picture tubes has a production line that produces an
average of 100 tubes per day. Because of new government regulations, a new safety device has been
installed, which the manufacturer believes will reduce average daily output. A random sample of 15
days’ output after the installation of the safety device is shown below.
93, 103, 95, 101, 91, 105, 96, 94, 101, 88, 98, 94, 101, 92, 95
At 5% significance level, is there sufficient evidence to conclude that the average daily output has
decreased following the installation of the safety device?
Solution
Also observe that the sample size (n = 15) is less than 30, hence we use the t distribution.
3. Test statistic
\[ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{96.47 - 100}{4.85/\sqrt{15}} = -2.82. \]
4. Conclusion: Reject H0 since the computed t value lies in the critical region (−2.82 < −1.761),
and conclude that there is enough evidence to show that the average daily production has
decreased.
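The one-sample t statistic of Example 4.5.1 can be recomputed from the raw data, as in the sketch below. It uses the standard statistics module for the sample mean and sample standard deviation; the critical value −1.761 is the one quoted in the conclusion above, and the variable names are illustrative.

```python
import math
import statistics

# Daily output for the 15 sampled days after the safety device was installed.
output = [93, 103, 95, 101, 91, 105, 96, 94, 101, 88, 98, 94, 101, 92, 95]

mu0 = 100                          # hypothesised mean under H0
n = len(output)
x_bar = statistics.mean(output)    # sample mean
s = statistics.stdev(output)       # sample standard deviation (n - 1 denominator)

t = (x_bar - mu0) / (s / math.sqrt(n))

t_crit = -1.761                    # lower 5% point of t with 14 degrees of freedom, as in the text
reject_h0 = t < t_crit

print(round(x_bar, 2), round(s, 2), round(t, 2), reject_h0)
```

The sample mean and standard deviation come out as about 96.47 and 4.85, matching the values substituted into the test statistic.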
In order to use t distribution to make a valid test of hypothesis about µ1 − µ2 the following
conditions must be met.
1. The two population random variables (x1 and x2 ) are normally distributed.
2. The two samples must be independent.
3. The two population variances are equal, that is σ12 = σ22 .
Example 4.6.1. The manager of a large production facility believes that worker productivity is a
function of, among other things, the design of the job, which refers to the sequence of movements.
Two designs are being considered for the production of a new product. In an experiment, six workers
using design A had a mean assembly time of 7.60 minutes, with a standard deviation of 2.36
minutes, for this product. (The six observations were 8.2, 5.3, 6.5, 5.1, 9.7, 10.8.) Eight workers using
design B had a mean assembly time of 9.20 minutes, with a standard deviation of 1.35 minutes.
(The observations were 9.5, 8.3, 7.5, 10.9, 11.3, 9.3, 8.8, 8.0.) Can we conclude at the 5% level of
significance that the average assembly times differ for the two designs? Assume that the times are
normally distributed.
Solution
1. H0 : µ1 − µ2 = 0
H1 : µ1 − µ2 ≠ 0
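2. Test statistic
\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} = \frac{(7.60 - 9.20) - 0}{\sqrt{3.38 \left( \dfrac{1}{6} + \dfrac{1}{8} \right)}} = -1.61 \]
Recall that
\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(6 - 1)2.36^2 + (8 - 1)1.35^2}{6 + 8 - 2} = 3.38 \]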
3. Conclusion: We fail to reject H0, since the computed t value does not lie in the critical
region. Therefore there is not sufficient evidence to conclude that
a difference in mean assembly times exists between designs A and B.
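The pooled two-sample t test of Example 4.6.1 can be recomputed from the raw observations, as sketched below. The function name pooled_t is hypothetical, and the critical value 2.179 is the standard two-sided 5% point of the t distribution with 12 degrees of freedom, supplied here as an assumption because the critical region is not shown in the notes. Working from unrounded sample variances gives a pooled variance of about 3.37 rather than 3.38; the test statistic still rounds to −1.61.

```python
import math
import statistics

design_a = [8.2, 5.3, 6.5, 5.1, 9.7, 10.8]
design_b = [9.5, 8.3, 7.5, 10.9, 11.3, 9.3, 8.8, 8.0]

def pooled_t(x, y):
    """Two-sample t statistic assuming equal population variances.

    Returns the t statistic and the pooled variance estimate s_p^2.
    """
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * statistics.variance(x) + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    t_stat = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t_stat, sp2

t, sp2 = pooled_t(design_a, design_b)

t_crit = 2.179                 # t_{0.025, 12}, two-sided test at the 5% level
reject_h0 = abs(t) > t_crit

print(round(t, 2), round(sp2, 2), reject_h0)
```

Because |t| is below the critical value, the sketch agrees with the conclusion that H0 is not rejected.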
Here, we consider two samples as in a two-sample t-test; the difference is that in this
experimental design the samples are not independent.
Observations occur in pairs such that the two observations in a pair are taken from the
same experimental unit,
or from two similar experimental units (similar with respect to a certain attribute).
Example 4.7.1. Gasohol has received much attention in recent years as a possible alternative to
gasoline as a fuel for automobiles. To compare the mileages per gallon that can be achieved with the
two fuels, the following test was performed. Eight cars were selected and their fuel tanks completely
cleaned. Each car was driven twice over a predetermined course, once using gasohol and once using
gasoline, and the miles per gallon were recorded for each trip.
At 10% significance level, does the data support the hypothesis that the mean mileage per gallon of
gasohol is less than that of gasoline?
Solution
It follows that
\[ \bar{x}_d = \frac{\sum D}{n} = \frac{-35}{8} = -4.375 \]
\[ s_d = \sqrt{\frac{n \sum D^2 - \left(\sum D\right)^2}{n(n-1)}} = \sqrt{\frac{8(173) - (-35)^2}{8(7)}} = 1.69 \]
1. H0 : µd = 0
H1 : µd < 0
2. Test statistic
\[ t = \frac{\bar{x}_d - \mu_d}{s_d/\sqrt{n}} = \frac{-4.375 - 0}{1.69/\sqrt{8}} = -7.32 \]
3. Level of significance: 10% (one-tailed test; critical region t < −t0.1,7 = −1.42)
4. Conclusion: Reject H0 and conclude that the mean mileage for gasohol is less than that of
gasoline at 10% level of significance.
In general a (1 − α)100% confidence interval (CI) for the population parameter θ is given by:
CI for the true population mean µ, when the population variance σ 2 is known
Conclusion: Since the stated value in the null hypothesis does not lie within the 95% CI we
reject H0 at 5% level of significance and conclude that the mean breaking strength is greater
than 200.
4.8.2 CI for true population mean µ when the population variance σ² is unknown and n < 30.
Conclusion: Reject H0 (value stated in the null hypothesis not in the CI). Conclude that the
average production has changed.
CI for two populations when µ1 − µ2 , σ12 and σ22 are known and n1 + n2 − 2 > 30.
H0 : µ1 − µ2 = 0, H1 : µ1 − µ2 > 0
Conclusion. Reject H0 . Conclude that the average tensile strengths of the two synthetic fibres are
different.
4.8.4 CI for the difference between two populations
CI for µ1 − µ2 for two populations when σ1² and σ2² are unknown and n1 + n2 − 2 < 30.
H0 : µ1 − µ2 = 0, H1 : µ1 − µ2 ≠ 0
Conclusion. Fail to reject H0. There is insufficient evidence to conclude that
there is a significant difference in mean assembly times between designs A and B.
Tests which do not involve population parameters such as the mean and variance or which
make no assumption regarding the form of the distribution are referred to as non-parametric
tests.
$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \tag{4.9.1}$$
where Oij and Eij denote the observed and the expected frequencies, respectively.
Example 22
A manager at a shirt manufacturing firm, which operates three shifts daily, wishes to determine if
there is a relationship in the quality of workmanship among the three shifts. After an inspection of the
600 shirts produced on a particular day, the manager compiled the data in the table below, showing
the number of shirts with flaws produced by each shift.
                      Shift
Shirt condition     1      2      3    Total
Flawed             10      9     11       30
No flaws          240    191    139      570
Total             250    200    150      600
Does the data indicate that there is a relationship in the quality of workmanship among the three
shifts? (Use α = 0.10)
Solution
                            Shift
Shirt condition      1              2             3          Total
Flawed            10 (12.5)      9 (10)       11 (7.5)         30
No flaws         240 (237.5)   191 (190)     139 (142.5)      570
Total               250           200           150           600
$$E_{11} = \frac{30 \times 250}{600} = 12.5$$
4. Thus
$$\chi^2 = \sum_{i=1}^{6} \frac{(O_i - e_i)^2}{e_i} = \frac{(10 - 12.5)^2}{12.5} + \frac{(240 - 237.5)^2}{237.5} + \cdots + \frac{(139 - 142.5)^2}{142.5} = 2.35$$
5. Conclusion: We fail to reject H0 , and we conclude that the two classifications are independent.
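The test of independence for the shirt data can be reproduced with a short script. This is a minimal sketch, not the notes' own working: it uses only the observed counts from the table, and it relies on the fact that a chi-square variable with 2 degrees of freedom has upper-tail probability exp(−x/2), so no table lookup is needed for this particular case. All variable names are illustrative.

```python
import math

# Observed counts from Example 22 (rows: flawed, no flaws; columns: shifts 1 to 3)
observed = [
    [10, 9, 11],
    [240, 191, 139],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 2 only: P(chi-square > x) = exp(-x / 2), so the critical value
# at level alpha is -2 * ln(alpha) and the p-value has a closed form.
alpha = 0.10
critical = -2 * math.log(alpha)
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.3f}, df = {df}, critical = {critical:.3f}, p = {p_value:.3f}")
print("Reject H0" if chi2 > critical else "Fail to reject H0")
```

For tables with other degrees of freedom the closed form no longer applies, and the critical value would have to come from a chi-square table or a statistics library.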
Chapter 5
5.1 Introduction
In many real world problems there are two or more variables that are related, and it is
important to explore this relationship.
For example, in an industrial situation it is known that the tar content in the outlet stream of a chemical
process is related to the inlet temperature.
Regression analysis provides a procedure for estimating the tar content for various levels of the inlet temperature.
Applications of regression are numerous and occur in almost every field, including:
– engineering
– physical sciences
– economics and resource management
– life and biological sciences
The relationship between the response y and the regressor, x, is a straight line.
Figure 5.1: Deviations of the data from the estimated regression model.
5.2 Least squares estimation of β0 and β1
Parameters β0 and β1 are unknown and must be estimated using sample data.
These data may result either from a controlled experiment designed specifically to collect
the data, or from existing records.
We need to estimate β0 and β1 so that the sum of squares of the differences between the
observations yi and the straight line is a minimum.
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, 2, 3, \cdots, n. \tag{5.2.1}$$
Equation (5.1.1) can be viewed as a population regression model while equation (5.2.1) is a
simple regression model written in terms of the n pairs of data (yi , xi ), i = 1, 2, · · · , n.
The fitted regression line is
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,$$
such that each pair of observations satisfies the relation
$$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i \tag{5.2.2}$$
where $e_i = y_i - \hat{y}_i$ is called a residual and describes the error in the fit of the model at the
ith data point.
We need to find $\hat{\beta}_0$ and $\hat{\beta}_1$ so as to minimize the Residual Sum of Squares (RSS).
$$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \tag{5.2.3}$$
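Taking partial derivatives of (5.2.3) with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$ gives
$$\frac{\partial\,\mathrm{RSS}}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) \tag{5.2.4}$$
$$\frac{\partial\,\mathrm{RSS}}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\,x_i, \tag{5.2.5}$$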
Setting the partial derivatives to zero and rearranging the terms we obtain
$$n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \tag{5.2.6}$$
$$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \tag{5.2.7}$$
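Solving these two equations simultaneously gives $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ and
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \tag{5.2.9}$$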
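where
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \tag{5.2.10}$$
$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \tag{5.2.11}$$
$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n} \tag{5.2.12}$$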
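The least squares formulas translate directly into code. The sketch below is illustrative only: the function name `least_squares_fit` is hypothetical, and the data are the temperature and converted-sugar values from Example 5.3.1 later in this chapter. As a sanity check, the residuals from the fitted line should satisfy the two normal equations, that is, they should sum to zero and be orthogonal to x.

```python
def least_squares_fit(x, y):
    """Return (b0, b1, Sxx, Syy, Sxy) for the fitted line y-hat = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    syy = sum(yi * yi for yi in y) - sum_y ** 2 / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    b1 = sxy / sxx                       # slope estimate, Sxy / Sxx
    b0 = sum_y / n - b1 * sum_x / n      # intercept estimate, y-bar - b1 * x-bar
    return b0, b1, sxx, syy, sxy

# Data from Example 5.3.1: temperature (x) and converted sugar (y)
x = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
y = [8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5]

b0, b1, sxx, syy, sxy = least_squares_fit(x, y)

# Residuals e_i = y_i - (b0 + b1 * x_i); the normal equations require
# sum(e_i) = 0 and sum(e_i * x_i) = 0 up to floating-point error.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sum_e = sum(residuals)
sum_ex = sum(e * xi for e, xi in zip(residuals, x))

print(f"Sxx = {sxx:.2f}, Syy = {syy:.2f}, Sxy = {sxy:.2f}")
print(f"fitted line: y-hat = {b0:.3f} + {b1:.3f} x")
```

The computational forms of Sxx, Syy and Sxy subtract two numbers of similar size, so for data with large values and small spread a centred form such as Σ(x − x̄)² may be numerically safer.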
Often the problem of analysing the quality of the fitted regression line is handled through an
ANOVA approach.
The analysis of variance approach in a simple regression model tests the hypotheses:
– H0 : β1 = 0 (The slope is not significantly different from zero)
– H1 : β1 ≠ 0 (The slope is significantly different from zero)
Source        df     SS           MS                   F
Regression    1      Regr SS      Regr SS              Regr SS / s²
Residual      n-2    SSE          s² = SSE/(n-2)
Total         n-1    SS(Total)
Mathematically, the coefficient of determination $r^2$ is given by
$$r^2 = \frac{\text{Regression SS}}{S_{yy}} = 1 - \frac{\text{residual SS}}{S_{yy}} \tag{5.3.1}$$
Since $S_{yy}$ is a measure of the variability in the response y without considering the effect of
the regressor x, and the residual SS is a measure of the variability in y remaining after x has
been considered, $r^2$ is the proportion of the variation in the response y accounted for by the regressor x.
Values of $r^2$ close to 1 imply that the model explains most of the variation in y.
where $t_{\alpha/2,\,n-2}$ is the value of the t-distribution with n − 2 degrees of freedom, and $s^2$
is the residual mean square from the ANOVA table.
The equation ŷ = β̂0 + β̂1x may be used to predict the mean response y|x0 at x = x0,
where x0 is not necessarily one of the pre-chosen values.
Example 5.3.1. An experiment on the amount of converted sugar in a certain biochemical process
at various temperatures was conducted, and the following results were obtained.
Temperature (x)       1.0   1.1   1.2   1.3   1.4   1.5   1.6   1.7   1.8   1.9   2.0
Converted sugar (y)   8.1   7.8   8.5   9.8   9.5   8.9   8.6  10.2   9.3   9.2  10.5
(b) Carry out an analysis of variance (ANOVA) to test at the 5% level of significance whether the
slope is significantly different from zero. From the ANOVA table, compute the coefficient of
determination, r², and interpret it.
(c) Predict the amount of converted sugar when the coded temperature is 1.75. Find a 95% prediction
interval for this prediction.
Solution
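(a) From the given data we have the following
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{16.5}{11} = 1.5, \tag{5.3.4}$$
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{100.4}{11} = 9.13, \tag{5.3.5}$$
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 1.1, \tag{5.3.6}$$
$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 7.20, \tag{5.3.7}$$
$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 1.99.$$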
H0 : β1 = 0
H1 : β1 ≠ 0
Source        df    SS      MS      F
Regression    1     3.60    3.60    9.0
Residual      9     3.60    0.40
Total         10    7.20
$$r^2 = \frac{\text{Regression SS}}{S_{yy}} = \frac{3.60}{7.20} = 0.50$$
Comment: This implies that 50% of the variation in converted sugar (y) is explained by the
temperature (x) and the remainder is unaccounted for by our regression model.
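The ANOVA table and r² for this example can be rebuilt from the summary statistics alone. The following is a hedged sketch rather than the notes' own working: the function name `regression_anova` is hypothetical, the identity Regression SS = Sxy²/Sxx is the standard one for simple linear regression (it is not written out in the surviving text), and the critical value 5.12 for F with (1, 9) degrees of freedom at the 5% level is assumed to have been read from a standard F table.

```python
def regression_anova(n, sxx, syy, sxy):
    """Return the ANOVA quantities for simple linear regression as a dict."""
    ss_reg = sxy ** 2 / sxx      # regression SS (standard identity, assumed here)
    ss_res = syy - ss_reg        # residual SS (SSE)
    ms_res = ss_res / (n - 2)    # s^2, the residual mean square
    return {
        "ss_reg": ss_reg,
        "ss_res": ss_res,
        "ms_res": ms_res,
        "f": ss_reg / ms_res,    # regression MS has 1 df, so MS = SS
        "r2": ss_reg / syy,
    }

# Summary statistics quoted in the solution to Example 5.3.1
table = regression_anova(n=11, sxx=1.1, syy=7.20, sxy=1.99)

# Assumption: F_{0.05; 1, 9} = 5.12, taken from a standard F table
F_CRIT = 5.12
slope_significant = table["f"] > F_CRIT

print(f"Regression SS = {table['ss_reg']:.2f}, Residual SS = {table['ss_res']:.2f}")
print(f"s^2 = {table['ms_res']:.2f}, F = {table['f']:.1f}, r^2 = {table['r2']:.2f}")
print("Slope significantly different from zero:", slope_significant)
```

Because the inputs are the rounded summary values, the results agree with the table to the two decimal places shown there.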
THE END OF THE NOTES !!!!!!!!!!!!!!!!!!
Mathematics 2 (Stats): Practice Questions
1. Verify that the function f is a probability density function (p.d.f.) over the given interval
2. Find the constant k so that the function f is a probability density function over the given
interval
4. Let X denote the reaction time, in seconds, to a certain stimulus and Y denote the temperature
(°F) at which a certain reaction starts to take place. Suppose that the two random variables
X and Y have the joint density
$$f(x, y) = \begin{cases} k(2x + y), & 2 < x < 6,\ 0 < y < 5 \\ 0, & \text{otherwise} \end{cases}$$
(a) Find k,
(b) P (X > 3, Y > 2)
(c) P (X + Y < 4)
5. Let X and Y denote the lengths of life, in years, of two components in an electronic system.
If the joint density function of these variables is
$$f(x, y) = \begin{cases} e^{-(x+y)}, & x > 0,\ y > 0 \\ 0, & \text{otherwise} \end{cases}$$
(a) var 12 X + 10
7. If X and Y are random variables (r.v.s) with a joint probability distribution such that var[X] =
1.5, var[Y ] = 2 and cov(X, Y ) = −1, find
8. The compressive strength of samples of cement can be modeled by a normal distribution with
a mean of 6000 kilograms per square centimeter and a standard deviation of 100 kilograms
per square centimeter.
(a) What is the probability that a sample’s strength is less than 6250 kg/cm²?
(b) What is the probability that a sample’s strength is between 5800 and 5900 kg/cm²?
(c) What strength is exceeded by 95% of the samples?
9. The diameter of holes for a cable harness is known to have a standard deviation of 0.01 cm. A
random sample of size 30 yields an average diameter of 1.5045 cm. Use α = 0.01 to test the
hypothesis that the true mean hole diameter is 1.50 cm.
10. A contractor makes a large purchase of cement. The bags of cement are supposed to weigh 94
kg. The contractor decided to test a sample of bags to see if he is getting stipulated weight.
A random sample of size 9 yielded the following weights.
11. Iron ore is extracted from rocks obtained from two different sites A and B. The observations
are of percentage of iron ore per sample:
A: 47.9 51.3 42.4 54.9 61.8
B: 39.7 50.3 36.8 29.6 41.2 43.7
Assume normal populations, $N(\mu_A, \sigma_A^2)$ and $N(\mu_B, \sigma_B^2)$ respectively.
12. The average fuel consumption for 10 small cars before and after a certain additive substance
was introduced into their fuel was observed and the data obtained were recorded as follows:
After : 47 38 44 48 52 55 44 52 60 44
Before : 40 39 32 33 40 27 36 56 50 40
Suppose that the differences in fuel consumption are normally distributed with mean $\mu_D$ and
variance $\sigma_D^2$.
(a) At 5% significance level, is there sufficient evidence to conclude that the additive sub-
stance increases fuel consumption in each vehicle?
(b) Construct a 90% confidence interval for µD .
13. A market survey was conducted for the purpose of forming a demographic profile of people who would
like to own an electronic engineering company. The data collected is presented below:
Is there sufficient evidence to conclude that the desire to own an electronic engineering com-
pany is related to gender? (Test using α = 0.05).
14. Voltage output (y) and engine speed (x), in metres per second, for a turbine at
a hydroelectric station were recorded as follows:
Engine speed (x) : 166 169 186 202 203
Voltage Output (y) : 1.6 1.3 1.9 1.6 2.2
(a) Plot the data on a graph paper.
(b) Fit a simple linear regression model to the data.
(c) Carry out an analysis of variance (ANOVA) to test at the 5% level of significance whether
the slope is significantly different from zero. From the ANOVA table, compute the coefficient
of determination, r², and interpret it.
(d) Predict the amount of converted sugar when the coded temperature is 1.75. Find a 95%
prediction interval for this prediction.
(e) Find the standard errors of the estimated parameters β̂0 and β̂1 .
15. Explain clearly, using the following data, how the following are constructed:
23 29 40 28 15 22 46 39 22 17 26 33 35 49 20,
36 25 15 31 17 43 54 36 30 30 40 27 24 20 28 42
22 37 17 39 17 22 9 26 29
16. An experiment was conducted to compare the speeds of the word-processing packages of two
brands of minicomputers A and B. Forty people with similar backgrounds were randomly
selected and divided into two groups. One group was assigned minicomputer A and another
to minicomputer B. Each person was asked to perform the same word processing job and the
length of time it took each person to complete the job was recorded. Past experience
has shown that the populations associated with both minicomputers A and B are normally
distributed. The times required by the group using minicomputer A had a mean of 14.8
minutes and a variance of 3.9 minutes². For the group using minicomputer B, the mean
length of time to complete the task was 12.3 minutes and the variance was 4.3 minutes².
Is there sufficient evidence to conclude that the mean length of time required to complete
a word-processing task using minicomputer A is less than that of minicomputer B? (Use
α = 0.01).