Applied Statistics with Python
Applied Statistics with Python: Volume I: Introductory Statistics and Regression concentrates on applied and computational aspects of statistics, focusing on conceptual understanding and Python-based calculations. Based on years of experience teaching introductory and intermediate Statistics courses at Touro University and Brooklyn College, this book compiles multiple aspects of applied statistics, teaching the reader useful skills in statistics and computational science with a focus on conceptual understanding. This book does not require previous experience with statistics and Python, explaining the basic concepts before developing them into more advanced methods from scratch. Applied Statistics with Python is intended for undergraduate students in business, economics, biology, social sciences, and natural science, while also being useful as a supplementary text for more advanced students.
Key Features:
• Concentrates on more introductory topics such as descriptive statistics, probability, probability distributions, proportion and means hypothesis testing, as well as one-variable regression.
• The book's computational (Python) approach allows us to study Statistics much more effectively. It removes the tedium of hand/calculator computations and enables one to study more advanced topics.
• The standardized sklearn Python package gives efficient access to machine learning topics.
• Randomized homework as well as exams are provided in the author's course shell on the My Open Math web portal (free).
Leon Kaganovskiy
Contents
Preface

1 Introduction
  1.1 Introduction to Statistics
  1.2 Data Introduction
  1.3 Sampling Methods
  1.4 Statistical Experiments
  1.5 Roadmap for the Book

3 Probability
  3.1 Basic Concepts of Probability
  3.2 Addition Rule
  3.3 Simple Finite Probability Distributions
  3.4 Complement of an Event
  3.5 Independence
  3.6 Conditional Probability
  3.7 Contingency Tables Probability
  3.8 General Multiplication Rule and Tree Diagrams
  3.9 Expected Value of Probability Distribution

4 Probability Distributions
  4.1 Probability Distributions
  4.2 Normal Distribution
    4.2.1 Normal Distribution Model
    4.2.2 Normal Probability Calculations
    4.2.3 Central Limit Theorem
    4.2.4 Normal Distribution Examples
    4.2.5 68-95-99.7 Rule
  4.3 Binomial Distribution
    4.3.1 Permutations and Combinations
    4.3.2 Binomial Formula
    4.3.3 Normal Approximation of the Binomial Distribution

Bibliography

Index
Preface
Minitab, R, etc., but it is usually done separately in computer labs and only for some problems based on data files [15, 13, 5, 14, 12, 18, 3, 16, 2, 7].
On the other hand, there are a number of software-based Statistics textbooks which assume that students have already taken a Statistics course [8, 17, 4, 9, 1, 6, 19, 10, 11, 20]. Some of them use Python, but most use R. In this book, I don't assume previous experience in Statistics. We learn the basic concepts and develop more advanced methods from scratch. I use Python as a computational tool at every step because of its amazing versatility and flexibility. It is a free programming language that is used by beginners and advanced practitioners alike, with many free add-on libraries for Statistics and its applications. It is a scripting language that allows turning otherwise tedious hand calculations into step-by-step functions where students can see the results of every computation and obtain information from the data in real time.
The main difference between this textbook and others is that it weaves code into the text at every step in a clear and accessible way to give students a hands-on computational tool that can benefit them in this and many other courses. The homework for each chapter and exams are provided in my course shell on the My Open Math web portal. It is an amazing free platform for teaching Mathematics and Statistics. All problems are randomized. I provide scripting files for each chapter with clear instructions for each problem, and students have remarked that they liked the course a lot compared to their peers who had to deal with old-style approaches to Statistics.
1 Introduction
The treatment group had about a 5% better chance of having a sizeable cholesterol reduction. The main question of Statistics is whether it happened by chance or if it is a statistically significant result. There are always natural variations (fluctuations) in any data - a perfectly fair coin flipped 200 times does not land on 100 heads and 100 tails. Each such trial would come out slightly differently. Then the question becomes, how likely is it to obtain a 5% difference if the new drug and old drug perform about the same? We have to wait until a later chapter to develop computational tools to answer this question, but intuitively, the larger the difference, the less probable it is to obtain it by chance.
import pandas as pd
mydata = pd.read_csv("https://raw.githubusercontent.com/leonkag/Statistics0/main/HELPrct.csv")
mydata = mydata[['cesd','age','mcs','pcs','homeless','substance','sex']];
mydata # NOTE that Python counts from row 0!!!
The file above has a data frame (data matrix) form. Each row (case) is a patient, and each column is a variable describing that patient. Variables have different types. The age, cesd (depression score), mcs (mental health score), and pcs (physical health score) are all numerical (quantitative) variables. The homeless, sex, and substance are all categorical (qualitative) variables (labels).
mydata.head(10)
Variables wage, age, educ, and exper are numerical, while race, sex, south,
married, and sector are categorical.
Lastly, the MHEALTH.csv data file contains a lot of information about the health measures of 40 males. All variables are numerical.
In addition, numerical (quantitative) variables can be subdivided into discrete and continuous. Discrete variables can only take on distinct, separate values, which are typically counted in whole numbers (0, 1, 2, . . .). Unlike continuous variables, which can take on any value within a range, discrete variables have a finite or countably infinite number of possible values. For example, the number of students in a class, the number of cars in a garage, the number of books on a shelf, etc. are discrete, while distance, time, height, weight, etc. are continuous.
For numerical variables, both discrete and continuous, numerical statistics like the mean and median, as well as transformations, make sense. For categorical variables, whether nominal or ordinal, only counting in each group (level) makes sense. Also note that there are some variables represented via numbers that are categorical, like zip codes. The mean zip code of students in a class doesn't make any sense.
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/HELPrct.csv"
mydata = pd.read_csv(url) # save as mydata file
# mydata.head(10)
sns.scatterplot(data=mydata, x="mcs", y="cesd")
Not surprisingly, as the mental score becomes higher, the depression score decreases, so these variables are negatively associated (dependent). In this setting, the x-axis mcs would be an explanatory (independent, predictor) variable, while the y-axis cesd would be a response (dependent) variable. On the other hand, men's weight (WT) is positively associated with waist measurement (WAIST) as shown in the Figure below.
[ ]: url="https://raw.githubusercontent.com/leonkag/Statistics0/main/MHEALTH.csv"
mydata = pd.read_csv(url) # save as mydata file
# mydata.head(10)
sns.scatterplot(data=mydata, x="WAIST", y="WT")
Remember that association does not imply causation. The mcs does not cause the reduction in cesd score; they are just measurements of depression and mental state which are related to each other. A classic example of association without causation is NYC subway fare and the average pizza slice price, which have stayed almost equal to each other throughout the years since the 1950s. Both prices have grown over the years due to inflation, etc., not one causing the other.
Data is collected in two primary ways: observational studies and experiments. If researchers do not directly interfere with how the data arises, it is an observational study. The data sets considered in this section are all observational. Any opinion polls, surveys, etc. are observational. A study looking back in time at medical or business records (a retrospective study), and a prospective study, where a cohort (group) of similar individuals is followed over many years into the future, are both observational. Experiments, however, involve manipulating one or more independent variables to observe their effect on a dependent variable. The example at the beginning of the chapter on the effects of a new cholesterol-reducing drug was an experimental study. Researchers randomized patients into treatment and control groups, and proportions were compared. The explanatory group variable has two levels (treatment and control). The proportions by group are the responses. Observational studies may only show the association/relationship between variables, not cause and effect. Only a carefully designed randomized experiment may be used to establish cause and effect.
To illustrate why it would be wrong to infer cause and effect based on a simple association, let's consider the following example. A medical study focusing on blood pressure and mortality may find that low blood pressure is associated with a higher risk of mortality. However, low blood pressure does not cause death directly; rather, the association is due to the confounding effect of heart disease. A variable that is associated with both the explanatory (low blood pressure) and response variable (mortality) but is not directly accounted for by the study is called a confounding (lurking) variable. There are ways to account for confounding variables; they should not be dismissed.
In some studies, a randomized block design is called for, which involves dividing participants into homogeneous blocks before random assignment. It can be particularly useful when there are factors that may influence the outcome but are not the primary focus of the study (confounding variables). For example, considering again the new drug experiment, it seems logical to assume that the drug reaction could be different for patients with moderately high cholesterol levels (moderate) vs. patients with very high cholesterol levels (high). Therefore, we create a blocking variable Chol with two levels of moderate and high, separate all 200 patients according to this variable, and then take simple random samples of equal sizes from the moderate and high groups, respectively. This strategy ensures each treatment group has an equal number of moderate and high cholesterol level patients.
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (2.1)$$
The way to compute it depends on whether you are dealing with a small set of data to be typed into an array manually or a data file.
Let's first assume that we just have a small data set of heights for students in a class, and we want to find the average. A Python array can be defined with [], but the numpy library needs to be imported to use numerical arrays and statistical functions effectively. Numpy arrays are better for Statistics, as they are vectorized (addition, subtraction, and many other operations are automatically done element-by-element) and have a rich set of mathematical functions. First, np.array() is defined, then np.mean is used to compute the mean. The results are printed with format specifications like {:.4f} to display the desired number of digits rather than the full 16-digit double-precision accuracy.
[2]: import numpy as np # numerical library
x = np.array([69,64,63,73,74,64,66,70,70,68,67])
xbar = np.mean(x);
print('xbar = {:.4f} for n = {:d} observations'.format(xbar,len(x)))
Note that if any of the data are missing and represented by nan, the mean
computes to nan. To avoid it, we have to adjust our command:
[4]: x = np.array([69,64,63,73,74,64,66,70,70,68,67,float("nan")])
xbar = np.mean(x); print('regular mean xbar = {:.4f}'.format(xbar))
xbar = np.nanmean(x); print('nanmean() xbar = {:.4f}'.format(xbar))
Alternatively, we can trim the mean to a specified percentage on the low and high extremes.
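The trimming call itself is not shown here; a minimal sketch using scipy.stats.trim_mean on the height data from above (the 10% cut on each side is an assumed choice):
[ ]: from scipy import stats
import numpy as np
x = np.array([69,64,63,73,74,64,66,70,70,68,67])   # height data from above
# drop 10% of the observations on each end, then average the rest
print('10% trimmed mean = {:.4f}'.format(stats.trim_mean(x, proportiontocut=0.1)))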
Next, let's find the average for a variable defined in the data file. We use the cesd depression score in the HELPrct data file, which we used before:
We can also find the mean of several columns at once with the np.mean() function or the .mean method, as illustrated below.
$$\mu = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$
It is usually impractical to find the mean of the entire population, so µ is estimated using the sample mean x̄. For the depression score, x̄ = 32.85 is a point estimate of the true population mean of the depression score for all such substance abusers. Naturally, for a larger sample, the accuracy improves. In a later chapter, we develop a quantitative way (confidence interval) to quantify the accuracy of such point estimates.
$$\frac{72 \cdot 29 + 84}{30} = \frac{2088 + 84}{30} = 72.4$$
which is a little bit higher than before, as it should be since the missing
student's score is higher than the mean.
The shape of the histogram above indicates that most hourly workers made below $10/hour, much fewer made around $20/hour, and only one made over $40/hour (an outlier). The shape of hourly wages trails off to the right - a right-skewed shape. The most common example of a right-skewed distribution is income. Most people stay in a lower income range with only a few making a lot, resulting in a long, thin tail to the right, illustrated in the distribution of household income in 2014 reproduced below:
Next, the histogram of the retirement age data below has a longer tail to the
left - left-skewed shape. Most people retire in their 60s and 70s, but a few
could take early retirement in their 40s and 50s.
The modes are not very rigorously defined but help to get a better sense of the data. For example, a class grade distribution is often unimodal with only one prominent peak, which is assigned the B or B- range. However, sometimes there are two separate groups of students with good and bad grades with their own peaks - bimodal.
Let's illustrate it first with the grades list considered before. Note the difference between the population vs. sample standard deviation calculation in Python.
x = np.array([69,64,63,73,74,64,66,70,70,68,67]); xbar = np.mean(x)   # height data without the missing value
d = x - xbar      # deviations from the mean
d2 = d**2         # squared deviations
stdframe = pd.DataFrame({'x':x,'xbar':np.repeat(xbar,len(x)),'d':d,'d2':d2})
print(stdframe,'\n')
print('sum of deviations =', np.sum(d), '\n')
############################################################################
x xbar d d2
0 69 68.0 1.0 1.0
1 64 68.0 -4.0 16.0
2 63 68.0 -5.0 25.0
3 73 68.0 5.0 25.0
4 74 68.0 6.0 36.0
5 64 68.0 -4.0 16.0
6 66 68.0 -2.0 4.0
7 70 68.0 2.0 4.0
8 70 68.0 2.0 4.0
9 68 68.0 0.0 0.0
10 67 68.0 -1.0 1.0
s = 12.5145
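The actual standard deviation calls are not shown above; a minimal sketch of the population vs. sample versions for the height array x defined above:
[ ]: print('population std (divide by n)  = {:.4f}'.format(np.std(x)))           # ddof=0
print('sample std     (divide by n-1) = {:.4f}'.format(np.std(x, ddof=1)))     # ddof=1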
It should be noted that the mean, standard deviation, and any other summary statistics do not completely describe the data. It is still crucial to look at the actual histogram. For example, the Figure below shows three distributions that look quite different, but all have the same mean 0 and standard deviation 1. Using modality, we can distinguish between the first plot (bimodal) and the last two (unimodal). Using skewness, we can distinguish between the last plot (right-skewed) and the first two.
We made a note before that the standard deviation is in the same units as the original observations. Say, if the stock price is given in dollars, so is the standard deviation. However, how would you compare the variability of two data sets given in different units (say, stock prices in dollars vs. in pesos)? Analogously, how would you compare the variability of two stocks with very different overall prices - say, one stock sells in hundreds of dollars and another in cents?
The most important property of the median is that, unlike the mean, it is not affected as much by extreme values (outliers). If we replace 100 by 100 million in the example above, the mean changes dramatically, but the median is still 5.
The median breaks ordered data in two halves; in turn, quartiles break the halves in half. The first quartile Q1 is the 25th percentile (25% of the data are below this value and 75% above). The third quartile Q3 is the 75th percentile (75% below, 25% above). A boxplot represents the data graphically as shown in the diagram below.
The actual box is from Q1 to Q3 and the median is drawn as a line inside this
box. The interquartile range is the length of the box: IQR = Q3 − Q1 . It
contains the middle 50% of the data, so it is also a measure of variability. The
upper whisker extends from Q3 to the highest observation below Q3 + 1.5 ·
IQR. If there is no observation at or close to this value, the upper whisker may
be considerably below this value. The observations above the upper whisker
are considered to be outliers and denoted by dots or stars. Similarly, the lower
whisker extends from Q1 down to the lowest data point above Q1 −1.5·IQR (it
could be considerably above). The observations below this value are outliers
as well. The Figure below compares boxplots of wage from CPS85 and cesd
depression scores from HELPrct.
[ ]: import pandas as pd; import numpy as np;
import seaborn as sns; import matplotlib.pyplot as plt;
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/HELPrct.csv"
HELPrct = pd.read_csv(url)
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/CPS85.csv"
CPS85 = pd.read_csv(url)
fig, axes = plt.subplots(1, 2, figsize=(10, 3), sharey=False)
sns.boxplot(data=CPS85, x="wage", ax=axes[0])
sns.boxplot(data=HELPrct, x="cesd", ax=axes[1])
Q1, Q3, IQR, lower bound, upper bound = 5.25 11.25 6.0 -3.75 20.25
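The cell that produced these numbers is not shown; a sketch of one way such values could be computed for the wage variable (np.percentile is an assumed choice; the 1.5·IQR fences follow the whisker rule described above):
[ ]: Q1, Q3 = np.percentile(CPS85['wage'], [25, 75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR     # outlier fences / whisker limits
print('Q1, Q3, IQR, lower bound, upper bound =', Q1, Q3, IQR, lower, upper)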
The boxplot of wage is highly skewed, and many observations are identified as outliers above the upper whisker at 20.25. The lower whisker would have been at −3.75, which is below the lowest wage value, so it stops at that value. The cesd boxplot on the right side of the Figure above is approximately symmetric and doesn't show any outliers.
[ ]: HELPrct[['cesd','mcs','pcs']].describe()
[ ]: CPS85[['wage','exper','educ']].describe()
The median and quartiles Q1 and Q3 (therefore IQR) did not change. They are so-called robust statistics - extreme observations have little effect. However, the mean and standard deviation changed (non-robust statistics - affected strongly by extremes). In principle, if observations are highly skewed like incomes, the median and IQR are better measures of center and spread, while for more symmetrical data like IQ scores and heights, the mean and standard deviation are better measures, respectively.
The center graph in the Figure below shows that for symmetrical bell-shaped data, the mean, median, and mode coincide. The right-skewed data shows that extreme tail values pull the mean to the right much further than the median (robust). Analogously, for the left-skewed data, the extreme left tail values pull the mean to the left much more than the median.
[ ]: mytable = mydata['substance'].value_counts()
print(mytable)
mytable.plot.bar()
alcohol 177
cocaine 152
heroin 124
Name: substance, dtype: int64
It shows the counts for each type of substance, with alcohol the highest, then cocaine, and finally heroin. The normalize=True option converts counts into proportions, which is more informative.
[ ]: mytable = mydata['substance'].value_counts(normalize=True)
print(mytable)
mytable.plot.bar()
alcohol 0.390728
cocaine 0.335541
heroin 0.273731
Name: substance, dtype: float64
O.plot.bar(stacked=False,ax=axes[1])
[ ]: <Axes: xlabel='homeless'>
Each value in the table represents the number of times a particular com-
bination of variable outcomes occurred. For example, there are 59 homeless
cocaine abusers and 74 housed alcoholics. The margins provide row and col-
umn totals. For example, the total number of alcoholics is 103+74 = 177, and
the total number of housed patients is 74+93+77 = 244, and the grand total
of all patients is 453. We can also create a table of proportions. The overall
proportions are obtained by dividing each entry by the grand total:
For example, the proportion of housed alcoholics out of the total is 74/453 =
0.163. Analogously, the proportion of homeless cocaine abusers is 59/453 =
0.13.
It is more informative to investigate conditional (row and column) proportions.
Row proportions:
The output provided a proportions breakdown for housed and homeless. For example, 103/209 = 0.493 of all homeless are alcoholics, 59/209 = 0.282 of all homeless are cocaine abusers, etc. Analogously, 74/244 = 0.303 of all housed are alcoholics, etc.
Column proportions:
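The tables themselves are not reproduced here; a minimal sketch of how all of these tables can be built with pd.crosstab (an alternative to constructing them by hand as done elsewhere in this chapter), assuming the homeless and substance columns of the HELPrct data loaded as mydata:
[ ]: print(pd.crosstab(mydata['homeless'], mydata['substance'], margins=True))        # counts with totals
print(pd.crosstab(mydata['homeless'], mydata['substance'], normalize='all'))      # overall proportions
print(pd.crosstab(mydata['homeless'], mydata['substance'], normalize='index'))    # row proportions
print(pd.crosstab(mydata['homeless'], mydata['substance'], normalize='columns'))  # column proportions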
[ ]: mydata.groupby('sex')['cesd'].describe()
We can present graphical displays like boxplots, histograms, etc. broken down
by categorical variables too.
[ ]: sns.boxplot(data=mydata,x='sex',y='cesd')
We can break up the cesd by more than one categorical variable - here gender and substance - and represent the 2nd categorical variable by the hue of the graph:
[ ]: mydata.groupby(['sex','substance'])['cesd'].describe()
[ ]: plt.figure(figsize=(10,3))
sns.boxplot(data=mydata,x='sex',y='cesd',hue='substance')
On the other hand, two histograms on the same graph overlap each other and are harder to distinguish. It may be better to display them separately as shown in the two figures below. The option col="sex" creates two graphs in two columns.
The graph with two separate histograms for males and females is preferable. Alternatively, we can produce a density plot, which is a smoothed version of a histogram, as shown in the Figure below:
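The plotting cell for the density version is not shown; a minimal sketch using seaborn (kde=True on a displot, or kdeplot directly, are assumed choices):
[ ]: sns.displot(data=mydata, x='cesd', col='sex', kde=True)   # histograms with smoothed density overlay
# or a pure density plot broken down by sex:
# sns.kdeplot(data=mydata, x='cesd', hue='sex')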
Let's begin the discussion of probability with a sample space, which is the set
of all possible outcomes of an experiment. Any subset of the sample space is an
event. A simple event consists of exactly one element of the sample space. For
example, the sample space for tossing two coins is S = {HH, HT, T H, T T }.
Any of the above outcomes would be a simple event. A more general event
would be, say, tossing one head and one tail A = {HT, T H}. For equally
likely outcomes, the probability could be defined simply as a ratio of the
number of outcomes in the event of interest to the total number of simple
events in the sample space. Then, the probability of one head and one tail is
P (A) = 2/4 = 1/2.
More generally, we use a frequentist approach where the probability of an outcome is the proportion of times the outcome would occur if the process is observed an infinite number of times. Let's consider a simulation of a large, but of course finite, number of repetitions of tossing a fair coin. Let p̂n be the proportion of heads in the first n coin tosses. The Figure below shows two sequences of fair coin tosses. Initially, they are very different, but as the number of tosses increases, p̂n eventually converges to the theoretical probability p = 1/2. Note that after just a few tosses, or even 10 or 20, the proportion still fluctuates a lot; only in a long run of many thousands of tosses does p̂n stabilize around the true theoretical probability. It is a common misconception that if you flip a fair coin 10 or 20 times, the proportion of heads should be close to half, but that is only achieved in the long run (large numbers). For the same reason, if red came out a few times in a row at a casino table, it is not more likely for black to appear; they only equilibrate in the long run!
def coinFlip(size):
    flips = np.random.randint(0, 2, size=size)
    return flips.mean()
coinFlip = np.frompyfunc(coinFlip, 1, 1)
x = np.arange(1, 10000, 1)
y1 = coinFlip(x)
y2 = coinFlip(x)
fig, ax = plt.subplots()
ax.plot(x, y1); ax.plot(x, y2);
ax.set_xscale('log')
For example, rolling an odd number on a die ({1,3,5}) is disjoint from rolling
an even number ({2,4,6}). However, rolling a number less than 3 ({1,2}) and
rolling an even number ({2,4,6}) are not disjoint because they share {2} (in-
tersection is {2}).
Given two disjoint events A1 and A2 , the probability that one occurs or the
other is given by the addition rule:
$$P(Q \text{ or } K) = P(Q) + P(K) = \frac{4}{52} + \frac{4}{52} = \frac{8}{52} \approx 0.154$$
$$P(K \text{ or } R) = P(K) + P(R) - P(K \text{ and } R) = \frac{4}{52} + \frac{26}{52} - \frac{2}{52} = \frac{28}{52} = \frac{7}{13}$$
In the case they are disjoint, P (A1 and A2 ) = 0 and we have a simple addition
rule.
For a fair coin, the probability distribution is:

x    H    T
p    0.5  0.5

The coin need not be fair, so an equally valid probability distribution might be:

x    H    T
p    0.6  0.4

For a fair die, each outcome is equally likely:

x    1    2    3    4    5    6
p    1/6  1/6  1/6  1/6  1/6  1/6

A fair die and coin are examples of a uniform distribution where all outcomes are equally likely; an unfair coin is not.
The sum of outcomes while rolling two fair dice provides a very interesting
example of non-uniform distribution. Possible sums are given by:
. 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12
The probabilities are obtained by counting how many ways to get a particular
sum out of all possible 6 · 6 = 36 equally likely combinations:
x p
0 2 0.027778
1 3 0.055556
2 4 0.083333
3 5 0.111111
4 6 0.138889
5 7 0.166667
6 8 0.138889
7 9 0.111111
8 10 0.083333
9 11 0.055556
10 12 0.027778
   x         p
1  1  0.301030
2  2  0.176091
3  3  0.124939
4  4  0.096910
5  5  0.079181
6  6  0.066947
7  7  0.057992
8  8  0.051153
9  9  0.045757
This simple algebraic formula is surprisingly useful for many complex probability questions. It is often hard or impossible to find the probability of the event, but the probability of the complement can be found and subtracted from 1 to answer the original question.
For rolling two dice, say, what is the probability that the sum is 9 or less? One can count directly, but the complement rule is more efficient:
$$P(\text{sum} \le 9) = 1 - P(\text{sum} \ge 10) = 1 - \frac{6}{36} = \frac{30}{36} = \frac{5}{6}$$
There are more applications of the complement rule in future sections, but first independence must be discussed.
3.5 Independence
Random processes or events are independent if the occurrence of one event does not affect the occurrence of the other. Sometimes it is obvious - one die roll does not affect another, and one coin flip does not affect another. Other times it requires research; for example, it has been shown that eye color is independent of gender, but it was not obvious to start with. On the other hand, it is well known by insurance companies that age and rate of accidents are not independent. Younger, less experienced, and more reckless drivers get into accidents more often and, as a result, are charged higher insurance premiums. Stock prices day to day are not independent of each other, but the nature of this dependence is still a mystery.
For example, about 27% of the US population has blue eyes. Suppose two people are selected at random, independently from each other. The probability that both have blue eyes is:
$$P(\text{both blue eyes}) = 0.27 \cdot 0.27 = 0.0729$$
The probability that at least one of the two randomly chosen people has blue eyes is best found using the complement rule:
$$P(\text{at least one blue eyes}) = 1 - P(\text{none}) = 1 - 0.73^2 = 0.4671$$
Alternatively, one could list all the options {BB, BN, NB, NN}, of which three have at least one blue-eyed person (still you would have to compute three products and add them). If we have three people, there are 2³ = 8 total options {BBB, BBN, BNB, BNN, NBB, NBN, NNB, NNN}, and there are seven combinations in which at least one person has blue eyes. The complement is much easier to compute:
$$P(\text{at least one blue eyes}) = 1 - P(\text{none}) = 1 - P(NNN) = 1 - 0.73^3 = 0.61098$$
Finally, for a larger number of people, for example 10, using the complement
is the only option:
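The computation itself is not shown above; a quick sketch:
[ ]: p_none = 0.73**10            # probability that none of the 10 people has blue eyes
print('P(at least one blue eyes) = {:.4f}'.format(1 - p_none))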
$$P(\text{Red}) \cdot P(\text{King}) = \frac{26}{52} \cdot \frac{4}{52} = \frac{104}{52 \cdot 52} = \frac{2}{52}$$
These probabilities are the same, so picking red cards and picking kings at
random is independent, which was not obvious.
Note also that we should not confuse independent events and disjoint events.
Compare the above with computations we have done before for the probability
of picking a king OR a red card (Addition Rule ):
$$P(K \text{ or } R) = P(K) + P(R) - P(K \text{ and } R) = \frac{4}{52} + \frac{26}{52} - \frac{2}{52} = \frac{28}{52} = \frac{7}{13}$$
The probability P(K and R) = 2/52 ≠ 0, so these events are not disjoint, but they are independent. It is a common misconception that mutual exclusiveness (disjoint events) and independence are the same. On the contrary, for any nontrivial disjoint events A and B, independence cannot even hold.
Birthday Problem
This is a classical problem in probability that illustrates ideas of complement
and independence. Suppose that k people are selected at random from the
general population. What are the chances that at least two of those k were
born on the same day?
"At least two" gives way too many options, but the complement ("none") is much easier to track.
kv = np.arange(2, 61)                                             # class sizes to consider
pv = np.array([np.prod((365 - np.arange(k))/365) for k in kv])    # P(no shared birthday among k people)
pv = 1 - pv # complement
df = pd.DataFrame({'class_size':kv, 'probability': pv});
# print(df)
plt.plot(kv,pv)
plt.axhline(y = 0.5, color = 'r', linestyle = '-')
The Figure above shows that the probabilities greatly exceed what our intu-
ition would suggest. It is more than a 50% chance for a class of 23 already,
and by 40, it is almost 90%.
Note that
$$P(A|B) = \frac{2}{3} = \frac{2/6}{3/6} = \frac{P(A \text{ and } B)}{P(B)}$$
The equation
$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)} \qquad (3.4)$$
is actually used as the definition of the conditional probability.
As a more practical example, a car insurance company will consider informa-
tion about a person's driving history (conditioned on past history) to assess
the probability of an accident.
First, without replacement:
$$P(\text{Ace first and Queen second}) = \frac{4}{52} \cdot \frac{4}{51}$$
$$P(\text{Ace first and Ace second}) = \frac{4}{52} \cdot \frac{3}{51}$$
Second, with replacement:
$$P(\text{Ace first and Queen second}) = \frac{4}{52} \cdot \frac{4}{52} = \left(\frac{4}{52}\right)^2$$
$$P(\text{Ace first and Ace second}) = \frac{4}{52} \cdot \frac{4}{52} = \left(\frac{4}{52}\right)^2$$
Note that in sampling without replacement, the 2nd probability is conditional,
while in sampling with replacement, all probabilities are independent and
actually the same.
Example
Consider a marketing study on social class and purchasing habits for two
brands A and B. The code below to display data frame with margins is a bit
technical.
         Lower  Middle  Upper  Sum
Brand A      2      22     21   45
Brand B     24      28      3   55
Sum         26      50     24  100
nr = 2; nc = 3;
O = pd.DataFrame({'Lower':[2,24], 'Middle':[22,28], 'Upper':[21,3]},
index= ['Brand A','Brand B'])
print(O)
Om = pd.DataFrame.copy(O) # creating data frame with margins
Om['Sum'] = O.sum(axis=1);
c = O.sum(axis=0);
Om.loc[len(Om.index)] = c.tolist() + [O.sum().sum()]
print(Om)
$$P(\text{Brand B and Lower}) = \frac{24}{100} = 0.24$$
$$P(\text{Brand A and Middle}) = \frac{22}{100} = 0.22$$
The probabilities above are joint probabilities connecting events with and. On the other hand, marginal probabilities are row and column total probabilities for each variable separately:
$$P(\text{Brand B}) = \frac{55}{100} = 0.55$$
$$P(\text{Lower}) = \frac{26}{100} = 0.26$$
Row probabilities:
$$P(\text{Lower} \mid \text{Brand A}) = \frac{2}{45} = 0.0444$$
$$P(\text{Middle} \mid \text{Brand A}) = \frac{22}{45} = 0.489$$
Column probabilities:
$$P(\text{Brand A} \mid \text{Lower}) = \frac{2}{26} = 0.0769$$
$$P(\text{Brand A} \mid \text{Middle}) = \frac{22}{50} = 0.44$$
Note above that switching the order of conditioning produces different probabilities: P(Lower | Brand A) ≠ P(Brand A | Lower).
A ratio of probabilities formula can also be used to nd any of the conditional
probabilities. For example:
$$P(\text{Brand B} \mid \text{Upper}) = \frac{3}{24} = \frac{3/100}{24/100} = \frac{P(\text{Brand B and Upper})}{P(\text{Upper})} = 0.125$$
The answer is the same, but it is more work. However, it is needed when the
full contingency table is not given.
The main question in a contingency table is whether row and column variables
depend on each other. Let's compare conditional and marginal probabilities:
$$P(\text{Brand A} \mid \text{Lower}) = \frac{2}{26} = 0.077$$
$$P(\text{Brand A} \mid \text{Middle}) = \frac{22}{50} = 0.44$$
$$P(\text{Brand A} \mid \text{Upper}) = \frac{21}{24} = 0.875$$
$$P(\text{Brand A}) = \frac{45}{100} = 0.45$$
These probabilities are very different, so the row variable (Brand) is dependent on the column variable (Social Class). What if they were close, but not exactly the same? The chi-squared test in Chapter 6 will answer how close is close enough for independence.
Let's also check if the joint probability is the product of marginal ones:
print('\nDependence check')
print('Two conditional probabilities and the marginal:')
print('P(DoveYes | Male) = a/(a+c): ', a/(a+c))
print('P(DoveYes | Female) = b/(b+d): ', b/(b+d))
print('P(DoveYes) = (a+b)/(a+b+c+d): ', (a+b)/(a+b+c+d))
Dependence check
Two conditional probabilities and the marginal:
P(DoveYes | Male) = a/(a+c): 0.2
P(DoveYes | Female) = b/(b+d): 0.2
P(DoveYes) = (a+b)/(a+b+c+d): 0.2
The dependence check shows that the gender and Dove soap usage are inde-
pendent of each other.
We can also illustrate the Addition Rule. For example:
Example
In the next problem, we consider a similar two-by-two contingency table of
a new scan technique to check for muscle tear against the gold standard
(it could be a standard scan or surgery). In addition, we introduce standard
terminology for these types of problems.
The gold standard (truth) is in the columns and new test results are in
the rows, although it does not matter - just be consistent. The table shows
standard terminology. The patients who have the condition and tested positive
are called True Positive, and those who tested negative are called False
Negative. The patients who don't have the condition and tested positive
are called False Positive, and those who tested negative are called True
Negative. The row totals give the total number of positive and negative
tests respectively, while column totals give the total number of patients with
and without the condition. There is also the grand total in the lower right
corner. First, the code below sets up the contingency table, and the overall
probabilities are computed via dividing each entry by the grand total of 100:
$$P(\text{has tear and test positive}) = \frac{47}{100} = 0.47$$
$$P(\text{has tear and test negative}) = \frac{2}{100} = 0.02$$
$$P(\text{no tear and test positive}) = \frac{5}{100} = 0.05$$
$$P(\text{no tear and test negative}) = \frac{46}{100} = 0.46$$
The marginal probabilities:
$$P(\text{has tear}) = \frac{49}{100} = 0.49$$
$$P(\text{no tear}) = \frac{51}{100} = 0.51$$
$$P(\text{test positive}) = \frac{52}{100} = 0.52$$
$$P(\text{test negative}) = \frac{48}{100} = 0.48$$
Next, we set row probabilities with new test terminology.
First, positive predictive value is the probability that a person who tests
positive indeed has a condition P (has tear|test positive).
[ ]: print('\nPos Predict Value = a/(a+b) = ',a,'/',a+b,' = ', a/(a+b))
A positive predictive value close to 100% is important for a useful test. A value close to 50/50 indicates a useless (uninformative) test.
Similarly, negative predictive value is the probability that the person who
tests negative indeed has no condition P (no tear|test negative).
[ ]: print('Neg Predict Value = d/(c+d) = ',d,'/',c+d,' = ', d/(c+d))
The specificity (true negative rate) is the probability that the person who has no condition tests negative, P(test negative|no tear).
[ ]: print('Specificity = d/(b+d) = ',d,'/',b+d,' = ',d/(b+d))
Finally, the overall accuracy of the test is defined as the proportion of correct results:
[ ]: def ProbabilityTable22(a,b,c,d,rnames,cnames):
import pandas as pd; import numpy as np;
ProbabilityTable22(a=47,b=5,c=2,d=46,rnames=['Test Positive','Test Negative'],
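The body of ProbabilityTable22() is not reproduced above; a minimal sketch of the quantities it presumably reports, for the muscle-tear table (a = 47 true positives, b = 5 false positives, c = 2 false negatives, d = 46 true negatives):
[ ]: a, b, c, d = 47, 5, 2, 46
print('Positive Predictive Value = {:.4f}'.format(a/(a+b)))   # P(has tear | test positive)
print('Negative Predictive Value = {:.4f}'.format(d/(c+d)))   # P(no tear | test negative)
print('Sensitivity               = {:.4f}'.format(a/(a+c)))   # P(test positive | has tear)
print('Specificity               = {:.4f}'.format(d/(b+d)))   # P(test negative | no tear)
print('Accuracy                  = {:.4f}'.format((a+d)/(a+b+c+d)))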
Example
In the previous problem on the new scan for muscle tear diagnostics, the number of patients with and without the muscle tear was about the same. Sometimes, however, we are testing for a rare condition/infection. For example, for the infectious disease data below, we apply the same function to compute all the probabilities.
[ ]: ProbabilityTable22(a=90,b=625,c=10,d=2575,rnames=['Test Positive','Test Negative'], cnames=['Has Infection','No Infection'])
The Specificity and Sensitivity are quite high, but the Positive Predictive Value is very low. This does not necessarily invalidate the test; rather, the infection has a low prevalence - the proportion of the population with the condition is only about 3% here. Retesting for positive cases might be in order or, if possible, focusing on subgroups of the population with a higher occurrence of the condition.
Example
Here is another practical application. A rm introduces a new lie detector test
and advertises it to the Police, FBI, etc. Here are the results of the test run
on 100 people divided about equally between actors who lied and who did not
lie.
[ ]: ProbabilityTable22(a=44,b=16,c=8,d=32,rnames=['Test Positive','Test Negative'], cnames=['Lied','Not Lied'])
Overall, it is not very impressive, but you have to compare it to the competi-
tion.
$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$
can be used to trivially solve for the joint probability P(A and B) to obtain the General Multiplication Rule:
$$P(A \text{ and } B) = P(B) \cdot P(A|B)$$
For the independent events, the conditional probability is the same as the orig-
inal probability P (A|B) = P (A), so we get independent events Multiplication
Rule again P (A and B) = P (A) · P (B).
It would be redundant, but the general multiplication rule can be illustrated
for the muscle tear example:
$$P(\text{test positive and has tear}) = P(\text{has tear}) \cdot P(\text{test positive} \mid \text{has tear})$$
$$\frac{a}{a+b+c+d} = \frac{a+c}{a+b+c+d} \cdot \frac{a}{a+c}$$
$$\frac{47}{100} = \frac{49}{100} \cdot \frac{47}{49}$$
In reality, we usually don't have a full contingency table, but rather we know the overall incidence of the disease:
$$P(\text{has infection}) = \frac{a+c}{a+b+c+d} = \frac{100}{3300} = 0.0303$$
as well as the manufacturer-provided Sensitivity and Specificity:
$$\text{Sensitivity} = P(\text{test positive} \mid \text{has infection}) = \frac{90}{100} = 0.9$$
$$\text{Specificity} = P(\text{test negative} \mid \text{no infection}) = \frac{2575}{3200} = 0.8047$$
Based on the above information only, the probability tree below can be con-
structed:
[ ]: import requests; from io import BytesIO; from PIL import Image
url = 'https://raw.githubusercontent.com/leonkag/Statistics0/main//TreeInfect1.JPG'
A tree starts with a root. The next level gives the marginal probabilities of having an infection or not (they add up to 1). The next column gives conditional probabilities of testing positive or negative given the previous branch of infection (use the given sensitivity and specificity and the complement law). Finally, the last column lists joint probabilities as products of marginal and conditional probabilities:
$$P(\text{has infection and test positive}) = P(\text{has infection}) \cdot P(\text{test positive} \mid \text{has infection}) = 0.0303 \cdot 0.9 = 0.02727$$
$$P(\text{has infection and test negative}) = P(\text{has infection}) \cdot P(\text{test negative} \mid \text{has infection}) = 0.0303 \cdot 0.1 = 0.00303$$
$$P(\text{no infection and test positive}) = P(\text{no infection}) \cdot P(\text{test positive} \mid \text{no infection}) = 0.9697 \cdot 0.19531 = 0.18939$$
$$P(\text{no infection and test negative}) = P(\text{no infection}) \cdot P(\text{test negative} \mid \text{no infection}) = 0.9697 \cdot 0.80469 = 0.7803$$
More importantly, a probability tree can be used to find the reversed conditional probability P(has infection | tested positive) - the Positive Predictive Value:
$$P(\text{has infection} \mid \text{tested positive}) = \frac{0.0303 \cdot 0.9}{0.0303 \cdot 0.9 + 0.9697 \cdot 0.19531} = 0.12587$$
$$P(\text{no infection} \mid \text{tested negative}) = \frac{0.9697 \cdot 0.80469}{0.9697 \cdot 0.80469 + 0.0303 \cdot 0.1} = 0.99613$$
Example
Let's consider a more realistic infectious disease problem where the contingency table is not given. Assume a manufacturer produced a new test for an infectious disease with a prevalence of 0.05 (P(randomly selected person has the disease) = 0.05). Also, the manufacturer claims that their test has Sensitivity = P(test positive | has infection) = 0.91 and Specificity = P(test negative | no infection) = 0.84. Note that this formulation works for any new test, scan, or device; for example, we can assess the effectiveness of a new alarm system, lie detector test, etc. Note that another way to give the conditional probabilities would be to say that the false negative rate is 9%, and the false positive rate is 16%. Based on this information we can create the following tree shown in the Figure below:
$$P(\text{has infection} \mid \text{tested positive}) = \frac{0.05 \cdot 0.91}{0.05 \cdot 0.91 + 0.95 \cdot 0.16} = 0.23038$$
$$P(\text{no infection} \mid \text{tested negative}) = \frac{0.95 \cdot 0.84}{0.95 \cdot 0.84 + 0.05 \cdot 0.09} = 0.99439$$
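A small sketch of the same calculation in Python, using the stated prevalence, sensitivity, and specificity (the variable names are mine):
[ ]: prevalence, sensitivity, specificity = 0.05, 0.91, 0.84
p_pos_inf   = prevalence * sensitivity               # infected and tests positive
p_pos_noinf = (1 - prevalence) * (1 - specificity)   # healthy but tests positive
p_neg_noinf = (1 - prevalence) * specificity         # healthy and tests negative
p_neg_inf   = prevalence * (1 - sensitivity)         # infected but tests negative
print('P(infection | positive) = {:.5f}'.format(p_pos_inf/(p_pos_inf + p_pos_noinf)))
print('P(no infection | negative) = {:.5f}'.format(p_neg_noinf/(p_neg_noinf + p_neg_inf)))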
Example
Conditional trees and Bayes' Theorem have many applications. In this example, a business application for quality control is investigated. Assume a computer manufacturer buys computer chips from firms A, B, and C with supply shares and defective percentages given in the table below. If a computer was returned with a defective chip, how likely is it that it came from each of the firms?
             Firm A  Firm B  Firm C
Supply %         30      50      20
Defective %       4       3       5
Based on the supply percentages, the main branches of the probability tree
below are created and the defective percentages are conditional probabilities
for each supplier.
$$P(\text{Firm A} \mid \text{defective}) = \frac{0.30 \cdot 0.04}{0.30 \cdot 0.04 + 0.50 \cdot 0.03 + 0.20 \cdot 0.05} = 0.3243$$
$$P(\text{Firm B} \mid \text{defective}) = \frac{0.50 \cdot 0.03}{0.30 \cdot 0.04 + 0.50 \cdot 0.03 + 0.20 \cdot 0.05} = 0.4054$$
$$P(\text{Firm C} \mid \text{defective}) = \frac{0.20 \cdot 0.05}{0.30 \cdot 0.04 + 0.50 \cdot 0.03 + 0.20 \cdot 0.05} = 0.2703$$
Example
Consider an example of placing a bet of $10 on a single number in casino roulette, say 27, but it does not matter which one. There are 38 equally likely slots on a roulette wheel. Most likely, you are going to lose your bet, with probability 37/38. In the unlikely event of winning, with probability 1/38, the net gain is $350. Therefore the probability distribution is:
Let's assume you bet 100 times. You would expect to win on average $350 · 100 · (1/38) = $921.05 and lose $10 · 100 · (37/38) = $973.68, for a net expected gain of −$52.63. Per game, this amounts to −$52.63/100 = −$0.53, a loss of about 53 cents per game.
Note that you cannot lose this amount in a single game, most of the time you
lose your $10 bets and occasionally win $350. However, over 100 games, on
average, you expect to lose this amount per game. Note that this amount does
not depend on the number of bets you place:
$$\frac{\$350 \cdot 100 \cdot \frac{1}{38} - \$10 \cdot 100 \cdot \frac{37}{38}}{100} = \$350 \cdot \frac{1}{38} - \$10 \cdot \frac{37}{38}$$
Thus, we obtain the definition of the Expected Value for a Probability Distribution:
$$\mu = E(X) = \sum_{i=1}^{n} x_i \cdot P(X = x_i) = x_1 \cdot P(X = x_1) + x_2 \cdot P(X = x_2) + \dots + x_n \cdot P(X = x_n)$$
The standard deviation for a data set was defined in Chapter 2 using squared deviations from the mean: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$. Analogously, the probability distribution standard deviation is defined using squared deviations from the expected value:
$$\sigma = \sqrt{\sum_{i=1}^{n} (x_i - \mu)^2 \cdot P(X = x_i)} = \sqrt{(x_1 - \mu)^2 \cdot P(X = x_1) + \dots + (x_n - \mu)^2 \cdot P(X = x_n)}$$
$$\sigma^2 = \sum_{i=1}^{n} (x_i - \mu)^2 \cdot P(X = x_i)$$
[ ]: import numpy as np
x = np.array([-10,350])
p = np.array([37/38,1/38])
mu = np.sum(x*p); sig = np.sqrt(np.sum((x-mu)**2*p));
print('Expected Value = {:.3f}, Standard Deviation = {:.3f}'.format(mu,sig))
print('mu +- 2sig = {:.3f}, {:.3f}'.format(mu-2*sig, mu+2*sig))
Expected Value = -0.526, Standard Deviation = 57.626
mu +- 2sig = -115.779, 114.726
Example
In the context of insurance, expected value refers to the average amount the
company expects to pay to a customer. It is a key concept in risk management
to determine the pricing of insurance premiums.
Consider auto insurance with a premium of $800 per year for coverage against
accidents. Based on the historical data, the following probabilities and corre-
sponding payouts are used:
[ ]: import numpy as np
x = np.array([0 , 5000, 10000])
p = np.array([0.96 , 0.03, 0.01])
mu = np.sum(x*p); sig = np.sqrt(np.sum((x-mu)**2*p));
print('Expected Value = {:.3f}, Standard Deviation = {:.3f}'.format(mu,sig))
print('mu +- 2sig = {:.3f}, {:.3f}'.format(mu-2*sig, mu+2*sig))
So, the expected value for the insurance company in this example is $250 - the
average expected payout per policy. This information is useful for making a
business decision on the level of premium accounting for administrative costs,
profit margin, etc. The actual payouts in any given period vary, but averaged
over many policies, it gives a measure of risk.
4 Probability Distributions
On the other hand, a continuous random variable can take any value
within a given range. It is often associated with measurements like height,
weight, time, temperature, distance, etc. Unlike discrete random variables,
continuous ones have an uncountably infinite number of possible outcomes. Any real number range contains infinitely many values, so the probability of any specific value is 0. Instead, areas under a continuous probability distribution function are used to compute the probabilities. The two requirements for a continuous probability distribution function are:
1. f(x) ≥ 0 for all x
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
For example, an electric outlet voltage varies slightly around the prescribed
value of 120 volts. One assumption might be that it has a uniform distri-
bution in the range 119 − 121 V, i.e. the values are spread evenly over the
range of possibilities.
For a general uniform distribution on the interval [a, b], the probability distribution function is given by:
$$f(x) = \begin{cases} \frac{1}{b-a} & \text{if } a < x < b \\ 0 & \text{if } x < a \text{ or } x > b \end{cases} \;=\; \begin{cases} \frac{1}{121-119} = \frac{1}{2} & \text{if } 119 < x < 121 \\ 0 & \text{if } x < 119 \text{ or } x > 121 \end{cases}$$
$$\text{Total Area} = \text{base} \cdot \text{height} = (b-a) \cdot \frac{1}{(b-a)} = 1$$
The probability of any particular event (interval) is equal to the area under the probability distribution function over that interval (see Figure below). For example, to find the probability that the voltage is between 119 and 120.5 V, we compute:
$$P(119 < V < 120.5) = \text{base} \cdot \text{height} = (120.5 - 119) \cdot \frac{1}{2} = 0.75$$
Also, note that a single particular value has no area, so its probability is 0: P(V = 120) = 0. We could ask instead for a tight interval around the value of interest: $P(119.9 < V < 120.1) = (120.1 - 119.9) \cdot \frac{1}{2} = 0.1$.
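The same interval probabilities can be checked numerically with scipy.stats.uniform (loc is the left endpoint, scale the width of the interval); a minimal sketch:
[ ]: from scipy.stats import uniform
V = uniform(loc=119, scale=2)           # uniform distribution on [119, 121]
print(V.cdf(120.5) - V.cdf(119))        # P(119 < V < 120.5) = 0.75
print(V.cdf(120.1) - V.cdf(119.9))      # P(119.9 < V < 120.1) = 0.1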
[ ]: import requests; from io import BytesIO; from PIL import Image
url = 'https://raw.githubusercontent.com/leonkag/Statistics0/main/UniformDistrib.JPG'
The probability distribution function is given by:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (4.1)$$
However, one cannot use it to find areas under the curve analytically, only numerically. Changing the mean µ shifts the normal curve horizontally, while σ specifies the spread, as illustrated in the Figure below. On the left, we show distributions of height for populations of three different countries with means 66 in, 69 in, and 72 in, respectively, but with the same standard deviation of 3 in. On the right, all distributions have the same mean of 69 in, but the standard deviations are 3 in, 6 in, and 9 in, respectively.
from scipy.stats import norm   # normal distribution pdf used below
plt.subplot(1, 2, 2)
x = np.arange(45, 95, 0.01)
#define multiple normal distributions
plt.plot(x, norm.pdf(x, 69, 3), label='μ: 69, σ: 3')
plt.plot(x, norm.pdf(x, 69, 6), label='μ: 69, σ: 6')
plt.plot(x, norm.pdf(x, 69, 9), label='μ: 69, σ: 9')
plt.legend()
It is used to compute the areas under any other normal distribution N(µ, σ), but first, the given distribution must be standardized to a Z-score with mean 0 and standard deviation 1. The Z-score is the number of standard deviations above or below the mean:
$$z = \frac{x - \mu}{\sigma} \qquad (4.3)$$
For example, someone with an IQ score of 130 on the standard IQ scale with mean µ = 100 and standard deviation σ = 15 has the standardized score z = (130 − 100)/15 = 2.
Therefore, Jane did better on her SAT than John on his Regents test.
4.2.2 Normal Probability Calculations
Let's now start discussing probability questions for normal distributions.
For example, what is the fraction (percentage) of students who scored be-
low John's score? Or equivalently, if we select a random student, what is the
probability that their score is below 70? This is given by the left tail area
under the normal probability distribution function shown in the Figure be-
low. I created my own (user) function plot_normal() with def command. In
parentheses, it has its input parameters. This function does not produce any
output; it plots the areas under the normal curve with given parameters and
the corresponding standard normal area.
z score = 0.5000
probability = 0.6915
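The cell that produced this output is not shown. A sketch that reproduces it, assuming the Regents scores follow N(µ = 60, σ = 20) - values inferred from the outputs shown here and in the quantile examples below, not stated in the visible text:
[ ]: from scipy.stats import norm
mu, sig, x = 60, 20, 70        # assumed Regents mean/sd and John's score of 70
z = (x - mu)/sig
print('z score = {:.4f}'.format(z))
print('probability = {:.4f}'.format(norm.cdf(z)))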
[ ]: import numpy as np; import matplotlib.pyplot as plt
from matplotlib.pyplot import figure; from scipy.stats import norm
def plot_normal(mu,sig,x1,x2,z1,z2):
    # Plot the shaded area under N(mu, sig) between x1 and x2 (left panel)
    # and the corresponding standard normal area between z1 and z2 (right panel)
    figure(figsize=(10, 3), dpi=80)
    plt.subplot(1, 2, 1)
    xv = np.arange(mu-5*sig, mu+5*sig, 0.01)
    plt.plot(xv, norm.pdf(xv, mu, sig))
    px=np.arange(x1,x2,0.01)
    plt.fill_between(px,norm.pdf(px,mu,sig),color='r')
    plt.subplot(1, 2, 2)
    zv = np.arange(-5, 5, 0.01)
    plt.plot(zv, norm.pdf(zv, 0, 1))
    pz=np.arange(z1,z2,0.01)
    plt.fill_between(pz,norm.pdf(pz),color='r')
[ ]: plot_normal(mu=mu,sig=sig,x1=mu-5*sig,x2=x,z1=-5,z2=z)
The next common question to ask would be: what is the fraction (percentage) of students who scored above John's score (at least as good as John's)? Or equivalently, if we select a random student, what is the probability that their score is above John's score of 70? This is given by the right tail area under the normal probability distribution function shown in the Figure below. The total area under any probability distribution is 1, so we can find the complement area to the left and then subtract it from 1:
plot_normal(mu=mu,sig=sig,x1=x,x2=mu+5*sig,z1=z,z2=5)
z score = 0.5000
probability = 0.3085
Finally, a school board may decide that students who scored between 45 and 59 require extra help. What proportion of students falls in between these bounds? Or equivalently, what is the probability that a randomly chosen student falls in these bounds?
This probability is best obtained as a difference between the probability of scoring below 59 and scoring below 45, as shown in the Figure below:
plot_normal(mu=mu,sig=sig,x1=x1,x2=x2,z1=z1,z2=z2)
The last several problems have focused on finding the percentile (lower tail area) or upper tail area for a particular observation. There is an opposite kind of question - what value (quantile) corresponds to a particular percentile? For example, what is the value corresponding to the 95th percentile on the Regents test? Or equivalently, which score corresponds to the top 5% on the Regents exam? To answer this question, first, we need to use the inverse function of the cumulative probability distribution (CDF) to find the standardized z-score corresponding to 95% of the data below it, as illustrated in the Figure below. Then, find the raw score x from the definition of z:
$$z = \frac{x - \mu}{\sigma} \implies z \cdot \sigma = x - \mu \implies x = \mu + z \cdot \sigma \qquad (4.4)$$
plot_normal(mu=mu,sig=sig,x1=mu-5*sig,x2=x,z1=-5,z2=z)
z = 1.6449
raw score x = 92.8971
The value (quantile) corresponding to the 95th percentile is 92.897. The 50th
percentile corresponds to z = 0, so the raw score is x = µ+z ·σ = µ+0·σ = µ.
Let's also compute a quantile corresponding to the lower 10th percentile on
the Regents test. Or equivalently, which score corresponds to the bottom 10%
scores on the Regents exam? The Figure below illustrates the bottom 10%
area under the normal distribution.
plot_normal(mu=mu,sig=sig,x1=mu-5*sig,x2=x,z1=-5,z2=z)
z = -1.2816
raw score x = 34.3690
Example
Let's consider a business-oriented example. Suppose that the gas mileage of
a particular type of car is normally distributed with a mean of 22 mpg and a
standard deviation of 6 mpg.
First, which percentage of cars has mpg below 19 (left tail on the graph shown
below)? Or equivalently, if we select a random car, what is the probability that
its mpg is below 19?
mu = 22; sig = 6; x = 19    # mileage mean, sd, and the value of interest
z = (x-mu)/sig; p = norm.cdf(z);
print('z score = {:.4f}, p = {:.4f}'.format(z,p))
plot_normal(mu=mu,sig=sig,x1=mu-5*sig,x2=x,z1=-5,z2=z)
Next, what is the fraction (percentage) of cars with mpg above 28 (right
tail)? Or equivalently, if we select a random car, what is the probability that
its mpg is above 28?
plot_normal(mu=mu,sig=sig,x1=x,x2=mu+5*sig,z1=z,z2=5)
plot_normal(mu=mu,sig=sig,x1=x1,x2=x2,z1=z1,z2=z2)
Let's also consider the mean of a sample of n = 30 cars. What is the probability that a sample mean mpg is below 19? The standard deviation for sample means is much smaller according to the CLT:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{6}{\sqrt{30}} = 1.095$$
Note that with a much smaller $\sigma_{\bar{x}}$, we get a much sharper (tighter) normal distribution and, as a result, much less chance that a sample mean will be below 19, as can be seen in the Figure below.
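A short sketch of this sample-mean calculation, using the values from the example above:
[ ]: import numpy as np
from scipy.stats import norm
mu, sig, n, x = 22, 6, 30, 19
sig_xbar = sig/np.sqrt(n)            # standard deviation of the sample mean
z = (x - mu)/sig_xbar
print('z = {:.4f}, P(sample mean < 19) = {:.4f}'.format(z, norm.cdf(z)))
The resulting probability is well under 1%, compared with about 31% for a single car.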
plot_normal(mu=mu,sig=sig,x1=mu-5*sig,x2=x,z1=-5,z2=z)
z = 1.2816
raw score x = 29.6893
plot_normal(mu=mu,sig=sig,x1=mu-5*sig,x2=x,z1=-5,z2=z)
z = -1.4051
raw score x = 13.5696
The value (quantile) corresponding to the 8th percentile is 13.57. In this case,
z is negative, and the quantile is below the mean.
Example
The distribution of baby weights is known to be normal with a mean around
3400 g and a standard deviation of approximately 600 g.
plot_normal(mu=mu,sig=sig,x1=mu-5*sig,x2=x,z1=-5,z2=z)
Next, what is the fraction (percentage) of babies who are heavier than (weigh-
ing more than) 4500? Equivalently, for a randomly chosen baby, what is the
probability that the weight is above 4500? This is given by the right tail area
under the normal probability distribution function shown below:
Which fraction (percentage) of babies have a weight between 2200 and 4600
g as shown in the Figure below? Equivalently, for a randomly chosen baby,
what is the probability that their weight is within the above bounds?
mu = 3400; sig = 600; x1 = 2200; x2 = 4600   # baby weight mean, sd, and the bounds of interest
z1 = (x1-mu)/sig; z2 = (x2-mu)/sig;
print('z1, z2 scores = {:.4f}, {:.4f}'.format(z1,z2))
p = norm.cdf(z2)-norm.cdf(z1);
print('probability = {:.4f}\n\n'.format(p))
plot_normal(mu=mu,sig=sig,x1=x1,x2=x2,z1=z1,z2=z2)
The above is close to 95%, but not quite. Let's consider the middle 95% of the data in more detail. First, let's determine the quantile, or baby's weight, corresponding to the 97.5th percentile (i.e., the weight such that 97.5% of babies have lower weights). Equivalently, which weight separates the top 2.5% of baby weights?
plot_normal(mu=mu,sig=sig,x1=mu-5*sig,x2=x,z1=-5,z2=z)
z = 1.9600
raw score x = 4575.9784
plot_normal(mu=mu,sig=sig,x1=mu-5*sig,x2=x,z1=-5,z2=z)
z = -1.9600
raw score x = 2224.0216
The quantiles are approximately 2224 g for the 2.5th percentile and 4576 g
for the 97.5th percentile. These bounds contain the middle 95% of the data.
Medical professionals usually use them to define cutoffs for low birth weight
and high birth weight babies requiring special medical attention.
Let's also illustrate the CLT by comparing the probability that one baby
weighs more than, say, 3600 g with the probability that a sample of n = 30 babies
has an average weight above 3600 g. For the sample, we must adjust the
standard deviation:

σx̄ = σ/√n = 600/√30 = 109.545

As always, σx̄ is much smaller than σ, so the normal distribution of the sample means
is much sharper (tighter), and there is much less chance that a sample mean
will be above 3600 g, as can be seen in the Figure below.
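A sketch of the single-baby versus sample-mean comparison just described (the values are from this example; the output formatting is mine):

import numpy as np
from scipy.stats import norm

mu, sig, x, n = 3400, 600, 3600, 30
p_one = 1 - norm.cdf((x - mu)/sig)        # P(one baby weighs more than 3600 g)
se = sig/np.sqrt(n)                       # standard error for the mean of n = 30 babies
p_mean = 1 - norm.cdf((x - mu)/se)        # P(sample mean weight exceeds 3600 g)
print('P(one baby > 3600) = {:.4f}'.format(p_one))
print('P(mean of 30 babies > 3600) = {:.4f}'.format(p_mean))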
For values one standard deviation away from the mean, the z-score is:

z = ((µ ± σ) − µ)/σ = ±σ/σ = ±1
Analogously, for the 2 standard deviations:
z = ((µ ± 2σ) − µ)/σ = ±2σ/σ = ±2
The area under the normal probability distribution function contained within
z = ±1, ±2, and ± 3 is computed as follows:
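A minimal sketch of this computation with scipy's norm.cdf (the printed areas should reproduce the familiar 68-95-99.7 percentages up to rounding):

from scipy.stats import norm

for k in [1, 2, 3]:
    area = norm.cdf(k) - norm.cdf(-k)   # area within k standard deviations of the mean
    print('area within z = +/-{:d}: {:.4f}'.format(k, area))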
Therefore, all the standard percentages given in the Figure below check out.
Let's make this statement just a bit more precise with a quantile approach.
If we want exactly 95% of the data in the middle, we need to find z-scores
corresponding to 2.5% on the left and 97.5% on the right.
The normal distribution is symmetric, so these values have the same magni-
tude, just opposite signs. They are approximately z = ±1.960, which is quite
close to z = ±2, but not the same, as shown in the Figure below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
z1 = norm.ppf(0.025); z2 = norm.ppf(0.975)   # quantiles cutting off 2.5% in each tail
print('z1 = {:.4f}, z2 = {:.4f}'.format(z1,z2))
zv = np.arange(-4, 4, 0.01)
plt.plot(zv, norm.pdf(zv, 0, 1))
pz = np.arange(z1,z2,0.01)
plt.fill_between(pz,norm.pdf(pz),color='r')
z1 = -1.9600, z2 = 1.9600
The Standard Normal distribution has an e^(−x²/2) term, which decreases to 0 very
fast but never touches or crosses it, so a normal random variable can fall 4, 5,
or even more standard deviations away from the mean, but it is very unlikely,
as shown below.
µ = E(X) = Σ x · P(X = x) = 1 · p + 0 · (1 − p) = p

σ = √(p(1 − p)(1 − p + p)) = √(p(1 − p))
n! = n · (n − 1) · (n − 2) · (n − 3) · ... · 3 · 2 · 1 = n · (n − 1)!
For example
0! = 1
1! = 1
2! = 2 · 1 = 2
3! = 3 · 2 · 1 = 6
4! = 4 · 3 · 2 · 1 = 24
5! = 5 · 4 · 3 · 2 · 1 = 120
6! = 6 · 5 · 4 · 3 · 2 · 1 = 720
How many ways are there to sit these faculty members around the table? Once
again order matters, and we have 7 ways to choose a person for the 1st chair,
6 ways for the 2nd, etc., so
7 · 6 · 5 · 4 · 3 · 2 · 1 = 7! = 7P7
Generally,
nPn = n!/(n − n)! = n!/0! = n!/1 = n!
On the other hand, let's consider how many ways there are to choose 3 faculty
members for a committee. In this case, the order does NOT matter. We know that
there are 7P3 ways to choose 3 out of 7 when order does matter. For each
of these choices, there are 3! ways to arrange the 3 faculty members inside the
chosen group. Therefore, the number of combinations (order does not
matter) of 3 items selected from 7 different items is:
7C3 = 7P3/3! = (7!/4!)/3! = 7!/(3!·4!) = (7 · 6 · 5)/6 = 35
Generally, the number of combinations (order does not matter) of r
items selected from n different items is:

nCr = n!/(r!(n − r)!)   (4.7)
There are many examples of the number of combinations. For example, how
many different poker hands can one draw from a standard deck of 52 cards?
A poker hand is 5 cards, and it does not matter in which order they come;
therefore, the number of combinations is:

52C5 = 52!/(5!·47!) = 2598960
Another standard example is a state lottery. How many ways are there to choose 6
balls out of 50 numbered balls? Note that you win if you have the correct
numbers in any order, so order does not matter and we use combinations:

50C6 = 50!/(6!·44!) = 15890700

Therefore, if you buy one lottery ticket, your chance of winning is 1/15890700. The
computations above are illustrated in the Python code below.
[ ]: import math
fact = math.factorial(7); print('factorial of 7 = ',fact )
fact = math.factorial(50); print('factorial of 50 = ',fact )
nPr = math.perm(7,3); print('7P3 = ',nPr )
nCr = math.comb(7,3); print('7C3 = ',nCr )
nCr = math.comb(52,5); print('52C5 = ',nCr )
nCr = math.comb(50,6); print('50C6 = ',nCr )
P = 1/nCr; print('Probability to win = ',P )
factorial of 7 = 5040
factorial of 50 =
30414093201713378043612608166064768844377641568960512000000000000
7P3 = 210
7C3 = 35
52C5 = 2598960
50C6 = 15890700
Probability to win = 6.292988980976294e-08
nCk · p · p · ... · p · (1 − p) · (1 − p) · ... · (1 − p)

P(X = k) = nCk · p^k · (1 − p)^(n−k) = [n!/(k!(n − k)!)] · p^k · (1 − p)^(n−k)   (4.8)
Like any other standard distribution, there are well-known formulas for mean
and standard deviation of the Binomial Distribution:
µ = E = n · p    σ = √(n · p · (1 − p))   (4.9)

We would not have to compute these from scratch using the basic formulas
µ = E = Σ xi · P(X = xi) and σ = √(Σ (xi − µ)² · P(X = xi)).
A very small sample of size n = 3 was used just to illustrate the formula. More
realistically, consider a random sample of n = 25 independent patients; for
each one either the drug works (success, p = 0.87) or it doesn't (failure, 1 − p =
1 − 0.87 = 0.13). Let's illustrate typical questions for the Binomial distribution.
First, let's find the probability that the drug works for exactly 20 patients:
P(X = 20) = [25!/(20!·5!)] · 0.87^20 · 0.13^5
p1 = 0.1217
We can compute this probability for any number of successes in the range
k = 0, 1, ..., n, which produces the Probability Distribution Function (PDF) shown
as a barplot below.
Next, what is the probability that 20 or fewer (fewer than 21) patients
responded well to this drug?
p2 = 0.2183
p3 = 0.7817
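The outputs p1, p2, and p3 above come from computations of the following form (a minimal sketch; the author's cells also produce the barplot of the distribution):

import numpy as np
from scipy.stats import binom

n, p = 25, 0.87
p1 = binom.pmf(k=20, n=n, p=p)        # P(X = 20): the drug works for exactly 20 patients
p2 = binom.cdf(k=20, n=n, p=p)        # P(X <= 20): 20 or fewer respond well
p3 = 1 - binom.cdf(k=20, n=n, p=p)    # P(X >= 21): more than 20 respond well
print('p1 = {:.4f}, p2 = {:.4f}, p3 = {:.4f}'.format(p1, p2, p3))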
Next, let's consider the probability that between 16 and 21 patients responded
well to the drug. Unlike for a continuous distribution, for a discrete distribution
this in-between question requires subtracting the CDF at one less than the lower
limit, so that the lower endpoint is included:
[ ]: p4 = binom.cdf(k=21,n=n,p=p) - binom.cdf(k=15,n=n,p=p);
print('p4 = {:.4f}'.format(p4))
p4 = 0.4116
Finally, let's find the mean and standard deviation of this distribution.
In the code below, we have also included the µ ± 2σ bounds of usual values (at
least 75% by Chebyshev's Theorem). The barplot of the probability distribution
shows a similar range of likely outcomes.
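A sketch of the mean, standard deviation, and µ ± 2σ bounds referred to above, using n = 25 and p = 0.87 from this example:

import numpy as np

n, p = 25, 0.87
mu = n*p                          # binomial mean
sigma = np.sqrt(n*p*(1 - p))      # binomial standard deviation
print('mu = {:.4f}, sigma = {:.4f}'.format(mu, sigma))
print('usual values: [{:.4f}, {:.4f}]'.format(mu - 2*sigma, mu + 2*sigma))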
P(X = 2) = [20!/(2!·18!)] · 0.03^2 · 0.97^18
n = 20; p = 0.03; k = 2;
p1 = binom.pmf(k=k,n=n,p=p);
print('Probability of {:d} out of {:d} = p1 = {:.4f}\n'.format(k,n,p1))
kv = np.arange(n+1);
pv = binom.pmf(k=kv,n=n,p=p);
plt.bar(kv, pv)
p2 = binom.cdf(k=k,n=n,p=p);
print('Probability of {:d} or less out of {:d} = p2 = {:.4f}\n'.format(k,n,p2))
p3 = 1-binom.cdf(k=k,n=n,p=p);
print('Probability of {:d} or more out of {:d} = p3 = {:.4f}\n'.format(k+1,n,p3))
p4 = binom.cdf(k=3,n=n,p=p) - binom.cdf(k=1,n=n,p=p);
print('Probability between 2 to 3 = {:.4f}\n'.format(p4))
The barplot above shows the probability distribution for the number of side
effects.
Example
Here is a quality control example. Suppose about 8% of all computer chips
produced by a manufacturing firm are defective. Assume there are n = 45
(independent) chips selected at random; each chip is either defective (p = 0.08)
or not (1 − p = 1 − 0.08 = 0.92).
Once again, the questions are answered computationally; we only show one
Binomial formula with factorials.
What is the probability that 4 chips are defective?
P(X = 4) = [45!/(4!·41!)] · 0.08^4 · 0.92^41
n = 45; p = 0.08; k = 4;
p1 = binom.pmf(k=k,n=n,p=p);
print('Probability of {:d} out of {:d} = p1 = {:.4f}\n'.format(k,n,p1))
kv = np.arange(n+1);
pv = binom.pmf(k=kv,n=n,p=p);
plt.bar(kv, pv)
p2 = binom.cdf(k=k,n=n,p=p);
print('Probability of {:d} or less out of {:d} = p2 = {:.4f}\n'.format(k,n,p2))
p3 = 1-binom.cdf(k=k,n=n,p=p);
print('Probability of {:d} or more out of {:d} = p3 = {:.4f}\n'.format(k+1,n,p3))
p4 = binom.cdf(k=9,n=n,p=p) - binom.cdf(k=1,n=n,p=p);
print('Probability between 2 to 9 = {:.4f}\n'.format(p4))
In the previous example about defective computer chips, the sample size was
not small, n = 45, but the distribution was still a little skewed right. Let's increase
the sample (batch) size to n = 300; then, as the Figure below shows, the
binomial distribution looks almost exactly bell-shaped (normal).
n = 300; p = 0.08;
kv = np.arange(50);
pv = binom.pmf(k=kv,n=n,p=p);
plt.bar(kv, pv)
n · p ≥ 10 and n · (1 − p) ≥ 10
Then, the approximate normal has the same mean and standard deviation as
the parent binomial distribution:
µ = n · p    σ = √(n · p · (1 − p))
For n = 45 and p = 0.08, n · p = 45 · 0.08 = 3.6 < 10, so the success/failure
conditions fail, the distribution is still a bit skewed, and the normal approximation
should not be used. For n = 300 and p = 0.08, the conditions become
n · p = 24 ≥ 10 and n · (1 − p) = 276 ≥ 10, so the approximation is justified.
Consider the probability that there would be 30 or more defective chips. Using
the Binomial formula directly, we obtain:

P(X ≥ 30) = Σ_{k=30}^{300} [300!/(k!(300 − k)!)] · 0.08^k · 0.92^(300−k)
# Exact computation
n = 300; p = 0.08;
pb = 1-binom.cdf(k=29,n=n,p=p);
print('Binomial probability 30 or more defective = pb = {:.6f}\n'.format(pb))
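The normal approximation referenced in the next paragraph presumably looks like the sketch below; with the continuity correction, "X ≥ 30" is replaced by the area to the right of 29.5:

import numpy as np
from scipy.stats import norm

n, p = 300, 0.08
mu = n*p                                # mean of the approximating normal
sigma = np.sqrt(n*p*(1 - p))            # standard deviation of the approximating normal
pn  = 1 - norm.cdf((30   - mu)/sigma)   # without continuity correction
pnc = 1 - norm.cdf((29.5 - mu)/sigma)   # with continuity correction
print('normal approx = {:.6f}, with continuity correction = {:.6f}'.format(pn, pnc))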
This approximation with the continuity correction is much closer to the exact
answer. Such continuity correction is even more important for the probability
of a small interval. For example, what is the probability that the number of
defective chips is between 25 and 35?
[ ]: pb = binom.cdf(k=35,n=n,p=p) - binom.cdf(k=24,n=n,p=p);
print('Binomial probability pb = {:.4f}\n'.format(pb))
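For comparison, a sketch of the corresponding normal approximations for this interval; with the continuity correction, 25 ≤ X ≤ 35 becomes the area from 24.5 to 35.5:

import numpy as np
from scipy.stats import norm

n, p = 300, 0.08
mu, sigma = n*p, np.sqrt(n*p*(1 - p))
pn  = norm.cdf((35   - mu)/sigma) - norm.cdf((25   - mu)/sigma)   # no continuity correction
pnc = norm.cdf((35.5 - mu)/sigma) - norm.cdf((24.5 - mu)/sigma)   # with continuity correction
print('normal approx = {:.4f}, with continuity correction = {:.4f}'.format(pn, pnc))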
The exact probability is much closer to the one computed with the continuity
correction.
5
Inferential Statistics and Tests for
Proportions
For example, many polling agencies (Pew, Gallup, etc.) conduct regular pres-
idential approval polls. A sample of approximately 1000-2000 people is ran-
domly selected. A sample proportion p̂ estimates the true population pro-
portion p. Consider a particular poll of sample size n = 1600 with 660 ap-
proving of the presidential performance. Then the sample proportion is
p̂ = yes/total = 660/1600 = 0.4125, the point estimate. To find the true population
proportion p we would have to conduct a census of, say, N = 250 million people,
which is not realistic. However, p̂ approximates p.
A different random sample of the same size would produce a slightly different
sample proportion, say, p̂ = yes/total = 665/1600 = 0.416. Variation in this
point estimate from one sample to another is quantified by the sampling error.
Note that an unbiased simple random sample is always assumed to avoid
any systematic error.
To understand the behavior of the sampling error, let's simulate its variation
with a known population approval rating of p = 0.41. It is not known in reality,
but it is set to this particular value for simulation purposes.
from sklearn.utils import resample
import numpy as np

p = 0.41; n = 1600; numsamples = 10000      # assumed simulation settings (true p and poll size are from the text)
N = 100000                                  # hypothetical population size, used only to build the population array
numYes = p*N; numNo = (1-p)*N
print('numYes, numNo = {:.1f}, {:.1f}'.format(numYes, numNo))
possible_entries = np.concatenate((np.repeat(1,int(numYes)), np.repeat(0,int(numNo))))
pv = np.repeat(0.0,numsamples) # sample proportions
for j in range(numsamples):
    # resample below produces a SIMPLE RANDOM SAMPLE!!!
    onesample = resample(possible_entries, n_samples=n, replace=True)
    pv[j] = np.sum( onesample == 1) / n
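Continuing the cell above, the simulated sampling distribution can be summarized as follows (a sketch; it assumes the array pv and the numpy import from the previous cell):

import seaborn as sns       # plotting library choice is an assumption
print('mean of sample proportions = {:.4f}'.format(np.mean(pv)))
print('std of sample proportions  = {:.4f}'.format(np.std(pv)))   # the simulated standard error
sns.histplot(pv)            # histogram of the simulated sampling distribution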
The standard error is about twice as much as the one obtained for the n = 1600
sample size. However, if the sample size is very small, n = 25, and the true
proportion p = 0.1 is further away from 0.5 than 0.41, the resulting sampling
distribution is skewed right (not bell-shaped), as shown in the Figure below:
First, let's verify the conditions of this CLT Theorem for the above examples:
1. n = 1600, p = 0.41 =⇒ np = 656 ≥ 10, n(1 − p) = 944 ≥ 10
2. n = 400, p = 0.41 =⇒ np = 164 ≥ 10, n(1 − p) = 236 ≥ 10
3. n = 25, p = 0.10 =⇒ np = 2.5 < 10, n(1 − p) = 22.5 ≥ 10
Thus, for the first two simulations, the CLT conditions hold and the normal
distribution is a valid approximation, but in the third case it fails and the
distribution is skewed right. Independence of individual observations must hold;
it is ensured, for example, by a simple random sample of less than 10% of the
population size. In the unlikely case when the sample size n exceeds 10% of
the population size N, the standard error must be multiplied by a finite
population correction factor √((N − n)/(N − 1)), which is very close to 1 for n < 0.1N.
Let's compare the results of the first simulation with n = 1600 to the pre-
dictions of the CLT. In the simulation p = 0.41; therefore, the corresponding
normal distribution mean and standard error are:
µ = p = 0.41  and  SE = √(p(1 − p)/n) = √(0.41(1 − 0.41)/1600) ≈ 0.0123
The theoretical standard error is very close to the simulation standard error
0.0122.
In reality, the true population proportion p is not known. Assume one poll
of n = 1600 is taken and x = 675 participants approve of the presidential
performance, so the sample proportion is p̂ = 675/1600 ≈ 0.422. We cannot check
the success/failure conditions with the unknown true population proportion p, so
instead the sample proportion p̂ is substituted:

n·p ≈ n·p̂ = 1600 · 0.422 = 675.2 ≥ 10
n·(1 − p) ≈ n·(1 − p̂) = 1600 · (1 − 0.422) = 924.8 ≥ 10
Therefore, the sample proportion approximately follows a normal distribution
with mean and standard error:
µ = 0.422    SE = √(p(1 − p)/n) ≈ √(p̂(1 − p̂)/n) = √(0.422(1 − 0.422)/1600) = 0.01235
The standard error is extremely close to the one computed with the exact
p = 0.41.
Provided the conditions of the CLT are satisfied, p̂ follows a normal dis-
tribution with mean and standard deviation (called the standard error) given by:

µ = p̂  and  SE = √(p̂(1 − p̂)/n).
Note that the unknown true population proportion p was substituted by its
sample proportion approximation p̂.
As we know, for a normal distribution, 95% of the data are within about 1.96
standard deviations away from the mean, as shown in the Figure below.
A few confidence intervals do not contain the true population proportion, but
this does not invalidate the theory. It is a manifestation of natural sampling
variability. Just as some observations naturally occur more than 1.96 standard
deviations from the mean, analogously some point estimates naturally land fur-
ther away from the true proportion. A confidence interval gives a plausible range
of values. The values outside of it are implausible (much less likely), but not
impossible!
else:
    # interval does not contain p
    ax.errorbar(phat, j, lolims=True, xerr=1.96*SE, yerr=0.0, linestyle='', c='red')
ax.axvline(p0, color='darkorange')
print('Proportion inside = ', numin/numsamples)
Example
A sample of n = 1600 randomly chosen voters is taken and x = 675 of them
approve of the presidential performance. Construct and interpret a 95% confidence
interval for the population proportion.
The interpretation is that we are 95% confident that the actual proportion
of American voters who approve of the presidential performance is between
CIL = 0.400 and CIR = 0.446.
The critical z⋆ is found in such a way that the area between −z⋆ and z⋆ under
the standard normal distribution N(0, 1) corresponds to the confidence level.
These z⋆ bounds for the standard levels are given in the Python code above.
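A minimal standalone sketch of the z⋆ bounds and the resulting intervals for this example (it may differ from the author's table in the last digit due to rounding):

import numpy as np
from scipy.stats import norm

x, n = 675, 1600
phat = x/n                                   # sample proportion
SE = np.sqrt(phat*(1 - phat)/n)              # standard error based on phat
for CL in [0.90, 0.95, 0.99]:
    zstar = norm.ppf(1 - (1 - CL)/2)         # critical value z*
    e = zstar*SE                             # margin of error
    print('{:.0%}: z* = {:.4f}, CI = ({:.4f}, {:.4f})'.format(CL, zstar, phat - e, phat + e))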
point estimate ± z⋆ · SE = p̂ ± z⋆ √(p̂(1 − p̂)/n)   (5.1)
As the level of confidence increases, the area of the confidence region in the
middle increases, so the corresponding critical bound z⋆ increases and the
interval becomes wider. Thus, we are more confident but about a wider inter-
val. To use a fishing analogy again: to be more sure to trap the fish, a wider
net must be used (a trade-off relationship).
The best way to write a confidence interval is in terms of the margin of error e:

e = z⋆ · SE = z⋆ √(p̂(1 − p̂)/n)  ⇒  CI = p̂ ± e   (5.2)
Example
In the previous example, construct 90%, 95%, and 99% confidence intervals.
The table above shows, column by column, all the steps for each confidence
level. The sample proportion p̂ = x/n = 675/1600 ≈ 0.422 and the standard error
SE = √(p̂(1 − p̂)/n) ≈ √(0.422(1 − 0.422)/1600) ≈ 0.0123 are the same regardless of the
level. The margin of error is the product e = z⋆ · SE, which increases with the
confidence level. The confidence interval is p̂ ± e, which gets wider and wider
as the level of confidence increases.
[ ]: def OneProportionCI(x,n,ConfidenceLevels):
         import numpy as np; from scipy.stats import norm; import pandas as pd
         import matplotlib.pyplot as plt
         print('One Proportion Confidence Interval function')
         print('Number of successes x = ', x)
         print('Sample size n = ', n)
         print('Confidence Levels',ConfidenceLevels)
         ConfLevel = np.array(ConfidenceLevels)
         phat = x/n                              # sample proportion
         SE = np.sqrt(phat*(1-phat)/n)           # standard error
         zstar = norm.ppf(1-(1-ConfLevel)/2)     # critical z* for each level
         MarginErr = zstar*SE                    # margin of error e = z*SE
         CIL = phat - MarginErr; CIR = phat + MarginErr
         df = pd.DataFrame({'ConfLevel':ConfLevel,'zstar':zstar,'SE':SE,
                            'MarginErr':MarginErr,'phat':phat,'CIL':CIL,'CIR':CIR});
         pd.set_option("display.precision", 4); print(df,'\n')
OneProportionCI(x=10,n=100,ConfidenceLevels=[0.9,0.95,0.99])
The most common 95% confidence interval contains the claimed 8% proportion
of defective items, so there is no reason to reject the company's claim. In fact, in this
case, all common levels of confidence produce intervals containing the claimed
8%.
Example
To assess the efficacy of a new drug, a random sample of n = 159 patients was
taken and the drug was effective in 86% of them. The company claimed that
it is at least 80% effective. Construct 90%, 95%, and 99% confidence intervals
and interpret the results.
In this case, the actual number of people for whom the drug was effective was
not given, only a percentage, 86%. Therefore, it is computed on the fly with
x = 0.86 · 159:

p̂ = x/n  ⇒  x = p̂ · n
[ ]: OneProportionCI(x=0.86*159,n=159,ConfidenceLevels=[0.9,0.95,0.99])
Example
A poll of US adults found that 26% of 310 Republicans supported a generic
National Health Plan. A politician claims that more than 1/3 of Republicans
supported such a plan. Construct the standard confidence intervals and interpret
the results.
[ ]: OneProportionCI(x=310*0.26,n=310,ConfidenceLevels=[0.9,0.95,0.99])
The most common 95% confidence interval is well below 1/3 ≈ 0.3333 =
33.33%; therefore, the claim can be rejected. In fact, in this case, all common
levels of confidence produce confidence intervals below the claimed 1/3.
The first step is to state the null and alternative hypotheses. The null hy-
pothesis is generally the skeptical perspective (no difference), agreeing with
the status quo. The alternative hypothesis rejects this status quo. In
this problem, it would be not 50/50, i.e., there is a preference one way or another
(the majority supports or the majority opposes), a two-sided alternative.
H0: p = 0.5 (50% support term limits).
H1: p ≠ 0.5 (fewer than 50% or more than 50% support term limits).
The hypothesis test determines whether the sample proportion p̂ = 550/1200 =
0.4583 is significantly different from the claimed null hypothesis proportion
p0 = 0.5.
In hypothesis testing, it is always initially assumed that the null hy-
pothesis H0 is true. Then, the probability of observing the actual data or
something even more extreme is found (the p-value). Thus, assuming H0: p = 0.5
is true sets the null proportion p0 = 0.5 that is used in all computations.
First, check the conditions of applicability of the Central Limit Theorem:
1. The poll was based on a simple random sample, so independence holds.
2. n · p0 = 1200 · 0.5 = 600 ≥ 10 and n · (1 − p0 ) = 1200 · 0.5 = 600 ≥ 10 =⇒
success/failure conditions hold.
Therefore, the distribution of sample proportions p̂ (null distribution) is
normal with mean and standard deviation given by:
µ = p0 = 0.5  and  SE = √(p0(1 − p0)/n) = √(0.5(1 − 0.5)/1200) = 0.014.
For the confidence interval approach of the previous section, p = p̂ was used
in the success/failure conditions and the standard error (SE) computation.
Now, H0 is assumed true, so p = p0 is used.
The p-value is the probability of observing the sample proportion p̂ = x/n =
0.4583 or something even more extreme under our bell-shaped null distribu-
tion centered at p0 = 0.5 with standard error SE = 0.014.
The alternative H1: p ≠ 0.5 is two-sided; therefore both left and right tails
must be accounted for. The normal distribution is symmetric, so the p-value
can be found as twice the area of a single tail (see the Figure below).
To find this tail area, we first find the z corresponding to the observed p̂ = 0.458:

z = (p̂ − p0)/SE = (0.458 − 0.5)/0.014 = −2.887
Then, Python is used to find the area under the tail and multiply it by 2 to
obtain p-value = 0.0038924. It measures how likely it is to observe a sample pro-
portion of p̂ = 0.458 or something more extreme, and it is much less than the given
significance (error) level α = 0.05, so the initial assumption H0: p = 0.5 can
be rejected. Thus, the proportion of voters supporting 2-term limits for gov-
ernors, p̂ = 0.458, is significantly different from 0.5 = 50%. Note that even
though p̂ = 0.458 < 0.5, it is incorrect to change the alternative hypothesis
H1: p ≠ 0.5 to the one-sided H1: p < 0.5 post factum. It would have negative
accuracy implications, as will be explained later.
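The tail-area calculation quoted above can be checked directly with a standalone sketch (small differences from the quoted numbers come only from rounding of SE and z):

import numpy as np
from scipy.stats import norm

x, n, p0 = 550, 1200, 0.5
phat = x/n                                   # observed sample proportion
SE = np.sqrt(p0*(1 - p0)/n)                  # standard error under H0
z = (phat - p0)/SE                           # standardized statistic
pval = 2*(1 - norm.cdf(abs(z)))              # two-sided p-value
print('z = {:.4f}, two-sided p-value = {:.7f}'.format(z, pval))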
Using the confidence interval approach in the code below, we are 95% confident
(for a 5% significance error rate) that the true population proportion is between
43.01% and 48.65%, which does not contain the claimed H0: p = 0.5. This
confirms our conclusion to reject H0.
print('H0: p = p0')
print('H1: p not p0 or one-sided test\n')
xv = np.arange(-4, 4, 0.001)
axs[1].plot(xv, norm.pdf(xv, 0, 1))
xL=np.arange(-4,-np.abs(z),0.001)
axs[1].fill_between(xL,norm.pdf(xL,0,1),color='r')
xR=np.arange(np.abs(z),4,0.001)
axs[1].fill_between(xR,norm.pdf(xR,0,1),color='b')
#------------------------------------------------------------------
[ ]: OneProportionTest(x=550,n=1200,p0=0.5,ConfidenceLevels=[0.9,0.95,0.99])
H0: p = p0
H1: p not p0 or one-sided test
H0: p = p0
H1: p not p0 or one-sided test
The hypothesis test tries to infer whether the sample proportion p̂ = 0.117 is
significantly different from H0: p0 = 0.09. This null value p0 is used to check
the assumptions of the CLT. The sample of widgets is assumed to be a simple
random sample, so the observations are independent. Also, the code above
shows that n · p0 ≥ 10 and n · (1 − p0) ≥ 10, so the success/failure conditions
are satisfied. Therefore, the sampling distribution of p̂ (null distribution)
is normal with mean and standard deviation given by µ = p0 and
SE = √(p0(1 − p0)/n) (see the code above).
z = (p̂ − p0)/SE = (0.117 − 0.09)/0.026 = 1.021
The alternative H1: p ≠ p0 is two-sided, so both tails add up to p-value =
0.3073751 > 0.05 = α. Therefore, there is not enough evidence to reject the
initial assumption H0: p = p0, so there is no reason to reject the company's claim.
Also, from the confidence interval approach, we are 95% confident (for a 5% sig-
nificance error rate) that the true population proportion is between 5.92% and
17.41%, which contains the claimed H0: p = 0.09. This confirms the conclusion
not to reject H0.
Example
About 7% of the population of some country is left-handed. The rest
are right-handed, except for about 1% who are ambidextrous (no dominant hand). In
a random sample of 300 children from a particular region, 32 are left-handed.
Do these data provide evidence that the 7% value is inaccurate for children
in this region? Use a stricter 1% level of significance.
H0: p = p0
H1: p not p0 or one-sided test
The hypothesis test tries to infer whether the sample proportion p̂ = 0.107
is significantly different from H0: p0 = 0.07. This null value p0 is used to
check the assumptions of the Central Limit Theorem (CLT). The sample of
children is assumed to be a simple random sample, so the observations are
independent. Also, the code above shows that n · p0 ≥ 10 and n · (1 − p0) ≥ 10,
so the success/failure conditions are satisfied. Therefore, the sampling distribution
of p̂ (null distribution) is normal with mean and standard deviation given
by µ = p0 and SE = √(p0(1 − p0)/n) (see the code and the Figure above).
z = (p̂ − p0)/SE = (0.107 − 0.07)/0.015 = 2.489
The alternative H1: p ≠ p0 is two-sided, so both tails add up to p-value
= 0.0128069 > 0.01 = α. Therefore, there is not enough evidence to reject
the initial assumption H0: p = p0, so there is no reason to reject the null claim. Note,
however, that if we used the usual α = 0.05 level, we would have rejected
H0.
Also, from the confidence interval approach, we are 99% confident (for a 1% sig-
nificance error rate) that the true population proportion is between 6.08% and
15.26%, which contains the claimed H0: p = 0.07. This confirms our conclu-
sion not to reject H0. Note again that at the 95% level, the CI is between 7.17%
and 14.16% and does not contain H0: p = 0.07, which would have resulted in
the rejection of H0.
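A minimal standalone check of the z statistic and two-sided p-value quoted in this example:

import numpy as np
from scipy.stats import norm

x, n, p0 = 32, 300, 0.07
phat = x/n
SE = np.sqrt(p0*(1 - p0)/n)                  # standard error under H0
z = (phat - p0)/SE
pval2 = 2*(1 - norm.cdf(abs(z)))             # two-sided p-value
print('phat = {:.4f}, z = {:.4f}, p-value = {:.7f}'.format(phat, z, pval2))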
So far, only two-sided hypothesis tests have been considered. It is often not
known in advance which way the data will land, so the two-sided test is a safer, more
conservative approach (medical studies are usually two-sided). A one-sided
hypothesis may overlook data supporting the opposite conclusion, but one-sided
tests are still quite common. Here is an example.
Example
A school district claims that at least 80% of its students passed the SAT.
The state education board needs to check this claim. In a random sample of
200 students, 149 passed. Test at the 5% error level.
The Python function written above does not require any changes for one-sided
tests, since it always prints both two-sided and one-sided p-values (the one-sided
p-value is just half of the two-sided one). However, the interpretation changes.
The "at least" in the problem formulation implies a one-sided test, and the
state board is trying to refute the school district's claim, so the alternative is
to the left. Whether a one-sided test refutes or supports the claim is a somewhat
subtle, but important, decision.
H0: p = p0
H1: p not p0 or one-sided test
The hypothesis test tries to infer whether the sample proportion p̂ = 0.745 is
significantly less than H0: p0 = 0.80. This null value p0 is used to check the
assumptions of the CLT. The sample of students is assumed to be a simple
random sample, so the observations are independent. Also, the code above
shows that n · p0 ≥ 10 and n · (1 − p0) ≥ 10, so the success/failure conditions
are satisfied. Therefore, the sampling distribution of p̂ (null distribution)
is normal with mean and standard deviation given by µ = p0 and
SE = √(p0(1 − p0)/n) (see the code above).
z = (p̂ − p0)/SE = (0.745 − 0.8)/0.028 = −1.944
The alternative H1: p < 0.80 is one-sided, so only the left tail area gives p-value
= 0.025915 < 0.05 = α. Thus, the state officials can reject the district's claim.
My code provides both two-tailed and one-tailed p-values. If a two-sided test
were performed, then p-value = 0.0518299 > 0.05, so we would have failed to
reject H0.
Note that if, say, 165 students had passed the test, then p̂ = 165/200 = 0.825 >
0.80. Thus, p̂ would already be opposite to the alternative H1: p < 0.80; therefore
it would be pointless to continue with the calculation - the data supports H0. Only
because the actual p̂ = 149/200 = 0.745 < 0.80 was there a point in continuing.
Moreover, when the sample proportion is below the claimed 80%, the
question becomes whether it is significantly below. Assume, for example, that p̂ =
155/200 = 0.775 < 0.80; then the code and the Figure below give a very large
p-value, for which we cannot reject H0. Therefore, not only should p̂ be below
0.80, it has to be sufficiently below to reject it.
[ ]: OneProportionTest(x=155,n=200,p0=0.80,ConfidenceLevels=[0.9,0.95,0.99])
H0: p = p0
H1: p not p0 or one-sided test
sample proportion phat = 155.00/200 = 0.7750
Hypothesis Testing Approach:
success/failure: n*p0 = 160.0000, n*(1-p0) = 40.0000 <=10??
Standard Error = SE =sqrt(p0*(1-p0)/n) =
sqrt(0.8000*(1-0.8000)/200) = 0.0283
Standardized z = (phat-p0)/SE = (0.7750-0.8000)/0.0283=-0.8839
P-values: 2-sided = 0.3767591178, 1-sided =0.1883795589
Example
Assume a random sample of 1000 students at a top state university is taken
and 629 students are from that state. Can we conclude that more than 60%
of students are in-state at a 5% level of significance?
[ ]: OneProportionTest(x=629,n=1000,p0=0.60,ConfidenceLevels=[0.9,0.95,0.99])
H0: p = p0
H1: p not p0 or one-sided test
The hypothesis test tries to infer whether the sample proportion p̂ = 0.629
is significantly greater than H0: p0 = 0.60. The assumptions of the CLT
hold as before. Therefore, the sampling distribution of p̂ (null distribution)
is normal with mean and standard deviation given by µ = p0 and
SE = √(p0(1 − p0)/n) (see the code).
z = (p̂ − p0)/SE = (0.629 − 0.60)/0.0155 = 1.879
The alternative H1: p > 0.60 is one-sided, so only the right tail area shown
in the Figure below gives p-value = 0.030607 < 0.05 = α. Therefore,
there is enough evidence to reject the initial assumption. Thus, more than
60% of the students are in-state (the one-sided alternative claim has been
substantiated). Note that if the test were two-sided, the p-value would have
been 0.06121 > 0.05 and we would have failed to reject H0.
Example
In a large survey one year, people were asked if they thought unemployment
would increase, and 46% thought that it would. The next year, a small
opinion poll was conducted, and 282 out of 649 respondents said that they thought unem-
ployment would increase. At the 10% level, is there enough evidence to show
that the proportion of citizens who believe unemployment will increase is
less than the proportion who felt that way in the earlier survey?
Note that in the code, we added an unusually low confidence level, 0.8; we will
need it later.
H0: p = p0
H1: p not p0 or one-sided test
The hypothesis test tries to infer whether the sample proportion p̂ = 0.435
is significantly less than H0: p0 = 0.46. The assumptions of the CLT hold as
before. Therefore, the sampling distribution of p̂ (null distribution) is nor-
mal with mean and standard deviation given by µ = p0 and SE = √(p0(1 − p0)/n)
(see the code).
z = (p̂ − p0)/SE = (0.435 − 0.46)/0.020 = −1.303
The alternative H1: p < 0.46 is one-sided, so only the left tail area gives p-
value = 0.0963422 < 0.10 = α, as shown in the Figure below. Therefore, there is
enough evidence to reject the initial assumption. Thus, significantly
fewer people thought that unemployment would increase. If we had used the more
common α = 0.05, we would have failed to reject H0. Note that my Python
code provides both one-tailed and two-tailed p-values. In this case, if we had
done a two-sided test, the p-value would have been 0.1926844 > 0.10, and we
would have failed to reject H0 as well.
The rows show the actual truth, while the columns correspond to the decision
made by the hypothesis test based on the data. Therefore, a Type 1 Error
is rejecting the null hypothesis H0 when it is actually true. Its probability is
the significance level α in the tails of the confidence interval regions shown again
in the Figure below.
Note that if you are rejecting the null H0, you might be making a Type
1 error, and if you are failing to reject H0, you might be making a Type 2
error. It does not have to be the case, but it is a possibility. The Type 1 and
Type 2 errors are in a trade-off relationship. If the Type 1 error is reduced, it is
harder to reject H0 when it is actually true, but it is also harder to reject H0
when it is actually false, which increases the Type 2 error, and vice versa. If the sample
size is increased, both types of errors are reduced simultaneously, which
increases overall accuracy.
The null hypothesis must only be rejected if there is strong enough evidence
against it. The most common choice is the α = 0.05 = 5% significance
error level (the probability of a Type 1 error), so that if H0 is true, it is only rejected
about 5% of the time.
Both sample sizes are rather large. The reason is that the given margin of error
of 1% is too small, so it is increased to 2% and we obtain a 4-fold reduction
in sample sizes.
Let's illustrate the sample size needed for different confidence levels.
For p = 0.7:
ConfLevel zstar na
0 0.90 1.6449 1420.4103
1 0.95 1.9600 2016.7659
2 0.99 2.5758 3483.3207
For p = 0.5:
ConfLevel zstar nb
0 0.90 1.6449 1690.9647
1 0.95 1.9600 2400.9118
2 0.99 2.5758 4146.8104
As the confidence level increases, z⋆ increases and the required sample sizes
increase.
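The tabulated sample sizes follow from n = z⋆² p(1 − p)/e² with the 2% margin of error mentioned above; a sketch (round up to the next integer in practice):

import numpy as np
from scipy.stats import norm

e = 0.02                                         # target margin of error (2%)
for p in [0.7, 0.5]:
    for CL in [0.90, 0.95, 0.99]:
        zstar = norm.ppf(1 - (1 - CL)/2)
        n_needed = zstar**2*p*(1 - p)/e**2       # required sample size
        print('p = {:.1f}, level = {:.0%}: n = {:.1f}'.format(p, CL, n_needed))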
Example
In this example, we are going to investigate the file HELPrct.csv already used
in Chapter 2. It contains a plethora of information about 453 substance
abusers. We concentrate on the proportion of homeless vs. housed subjects in
this sample and want to test if it is 50/50.
mytable = mydata['homeless'].value_counts() #
print(mytable)
mytable.plot.bar()
homeless
housed 244
homeless 209
Name: count, dtype: int64
[40]: OneProportionTest(x=209,n=209+244,p0=0.50,ConfidenceLevels=[0.9,0.95,0.99])
H0: p = p0
H1: p not p0 or one-sided test
The hypothesis test tries to infer whether the sample proportion p̂ = 0.461 is
significantly different from H0: p0 = 0.5. This null value p0 is used to check
the assumptions of the CLT. The patients can be assumed independent from
each other (based on the selection, a simple random sample, etc.). Also, the code
above shows that n · p0 ≥ 10 and n · (1 − p0) ≥ 10, so the success/failure conditions
are satisfied. Therefore, the sampling distribution of p̂ (null distribution)
is normal with mean and standard deviation given by µ = p0 and
SE = √(p0(1 − p0)/n) (see the code above).
z = (p̂ − p0)/SE = (0.461 − 0.5)/0.023 = −1.644
The alternative H1: p ≠ p0 is two-sided, so both tails add up to p-value =
0.1000846 > 0.05 = α, as shown in the Figure above. Therefore, there is not
enough evidence to reject the initial assumption H0: p = p0. Thus, there is no reason
to reject the 50/50 claim.
Using the confidence interval approach, we are 95% confident (for a 5% signifi-
cance error rate) that the true population proportion is between 41.55% and
50.73%, which contains the claimed H0: p = 0.50. This confirms our conclusion
not to reject H0.
e = z⋆ · SE = z⋆ √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)    CI = p̂1 − p̂2 ± e
For the hypothesis testing approach for two proportions, H0: p1 − p2 = 0 (equal
population proportions); therefore, the sample proportions are pooled as follows.
Let p̂1 = x1/n1 and p̂2 = x2/n2 be the two sample proportions; then the pooled
proportion is:

ppool = (x1 + x2)/(n1 + n2) = (n1·p̂1 + n2·p̂2)/(n1 + n2) = p̂1·n1/(n1 + n2) + p̂2·n2/(n1 + n2)
It is used in the success/failure conditions and standard error computation:
1) Success/failure conditions:
n1 ppool ≥ 10, n1 (1 − ppool) ≥ 10, and
n2 ppool ≥ 10, n2 (1 − ppool) ≥ 10
2) SE = √(ppool(1 − ppool)/n1 + ppool(1 − ppool)/n2) = √(ppool(1 − ppool)(1/n1 + 1/n2))
Example
A computer manufacturing company evaluates the quality of computer chips
provided by companies A and B. In a random sample from manufacturer A, 23
out of 80 chips were defective, and for manufacturer B, 30 out of 64 were
defective. Compare the proportions at the 5% level and draw conclusions.
diffphats = ph1-ph2;
print('difference phat1-phat2 = {:.4f}-{:.4f} = {:.4f}'.format(ph1,ph2,diffphats))
z = (diffphats)/SE; # standardized statistic
print('Standardized statistic = z = diffphats/SE = {:.4f}/{:.4f} = {:.4f}'
      .format(diffphats,SE,z))
pval2 = 2*(1-norm.cdf(np.abs(z))); # 2 sided p-value
print('P-values: 2-sided = {:.10f}, 1-sided ={:.10f}\n'.format(pval2,pval2/2))
print(' {:.4f},{:.4f},{:.4f},{:.4f}'.format(n1*ph1,n1*(1-ph1),n2*ph2,n2*(1-ph2)))
SEci = np.sqrt(ph1*(1-ph1)/n1+ph2*(1-ph2)/n2);
print('Standard Error = SEci = sqrt(ph1*(1-ph1)/n1+ph2*(1-ph2)/n2) =')
print(' = sqrt({:.4f}(1-{:.4f})/{:d}+{:.4f}(1-{:.4f})/{:d}) = {:.4f}'
      .format(ph1,ph1,n1,ph2,ph2,n2,SEci))
zstar = norm.ppf(1-alpha/2); # z critical
MarginErr = zstar*SEci; # margin of error
CIL = diffphats - MarginErr; CIR = diffphats + MarginErr;
df = pd.DataFrame({'ConfLevel':ConfLevel,'zstar':zstar,'SEci':SEci,
                   'MarginErr':MarginErr,'diffphats':diffphats,'CIL':CIL,'CIR':CIR});
[42]: TwoProportionTest(x1=23,n1=80,x2=30,n2=64,CIs=[0.90,0.95,0.99])
H0: p1-p2 = 0
H1: p1-p2 not 0 or one-sided test
The hypothesis test aims to infer whether the sample proportion difference
p̂1 − p̂2 = 0.288 − 0.469 = −0.181 is significantly different from the claimed
null hypothesis difference of 0. The samples of chips are assumed to be simple
random samples, so the observations are independent within groups and the
samples are independent from each other. Also, the code above shows that:
n1 ppool = 80 · 0.368 = 29.444 ≥ 10, n1 (1 − ppool) = 80 · (1 − 0.368) = 50.556 ≥ 10,
n2 ppool = 64 · 0.368 = 23.556 ≥ 10, n2 (1 − ppool) = 64 · (1 − 0.368) = 40.444 ≥ 10,
so the success/failure conditions are satisfied.
Therefore, based on the CLT, the sampling distribution of p̂1 − p̂2 (null distri-
bution) is normal with mean and standard deviation given by:
µ = 0 and SE = √(ppool(1 − ppool)(1/n1 + 1/n2)) = √(0.368(1 − 0.368)(1/80 + 1/64)) = 0.081
(see the left plot of the Figure above).
The p-value is the probability of seeing the observed sample proportion
difference p̂1 − p̂2 = −0.181 or something even more extreme under the
normal null distribution. The z corresponding to p̂1 − p̂2 = −0.181 is computed
in the code above.
SE = √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2) = √(0.288(1 − 0.288)/80 + 0.469(1 − 0.469)/64) = 0.0803
Therefore, say, the 95% confidence interval for the difference of the population pro-
portions p1 − p2 is:
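A sketch of this interval computation for the chip example (z⋆ = 1.96 for the 95% level):

import numpy as np
from scipy.stats import norm

x1, n1, x2, n2 = 23, 80, 30, 64
ph1, ph2 = x1/n1, x2/n2
diff = ph1 - ph2                                          # observed difference of proportions
SEci = np.sqrt(ph1*(1 - ph1)/n1 + ph2*(1 - ph2)/n2)       # unpooled SE used for the CI
zstar = norm.ppf(0.975)                                   # 95% critical value
print('95% CI: ({:.4f}, {:.4f})'.format(diff - zstar*SEci, diff + zstar*SEci))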
Example
A total of 87 out of 100 students preparing with firm B's preparation guide
passed the medical admission test (MCAT), while 91 out of 120 passed it with
firm P's guide. Is there a significant difference at the 1% level? How about at the
5% level?
H0: p1-p2 = 0
H1: p1-p2 not 0 or one-sided test
The hypothesis test tries to infer whether the sample proportion difference
p̂1 − p̂2 = 0.87 − 0.758 = 0.112 is significantly different from the claimed null
hypothesis difference of 0.
The samples of students are assumed to be simple random samples, so the
observations are independent within groups, and the samples themselves
are independent from each other. Also, the code above shows that
n1 ppool, n1 (1 − ppool), n2 ppool, and n2 (1 − ppool) are all at least 10, so the success/failure
conditions are satisfied.
Therefore, according to the CLT, the sampling distribution of p̂1 − p̂2 (null
distribution) is normal with mean µ = p1 − p2 = 0 and standard deviation
given by SE = √(ppool(1 − ppool)(1/n1 + 1/n2)) (see the code as well as the left plot
of the Figure above).
The p-value is the probability of seeing the observed sample proportion
difference or something even more extreme under our bell-shaped normal null
distribution. The z corresponding to p̂1 − p̂2 is computed in the code above.
Therefore, we are 99% confident (1% significance error) that the true difference of popu-
lation proportions is between −2.11% and 24.45%, which contains the claimed
H0: p1 − p2 = 0, confirming that we do not reject H0 at this level. Once again, the 95% CI (1.06%,
21.27%) does not contain 0, so at the 5% level we would have rejected H0.
Example
In this example, we look again at the substance abusers data set. We want
to investigate whether the proportions of people who received treatment are different for
homeless vs. housed patients. Is there a significant difference at the 5% level?
How about at the 10% level?
[45]: TwoProportionTest(x1=68,n1=209,x2=61,n2=244,CIs=[0.90,0.95,0.99])
H0: p1-p2 = 0
H1: p1-p2 not 0 or one-sided test
The hypothesis test tries to infer whether the sample proportion difference
p̂1 − p̂2 = 0.325 − 0.25 = 0.075 is significantly different from the claimed null
hypothesis difference of 0.
The substance abuse patients in each group are assumed to be simple random
samples, so the observations are independent within groups and the samples
themselves are independent of each other. Also, the code above shows that
n1 ppool, n1 (1 − ppool), n2 ppool, and n2 (1 − ppool) are all at least 10, so the success/failure
conditions are satisfied. Therefore, according to the CLT, the sampling distri-
bution of p̂1 − p̂2 (null distribution) is normal with mean µ = p1 − p2 = 0
and standard deviation given by SE = √(ppool(1 − ppool)(1/n1 + 1/n2)) (see the cal-
culations and the left plot of the Figure above).
Thus, we are 95% confident (for a 5% significance error rate) that the true
difference of population proportions is between −0.82% and 15.89%, which does contain
the claimed H0: p1 − p2 = 0. This confirms our conclusion not to reject H0
at the 95% level. However, the 90% CI given by (0.52%, 14.55%) does not contain
0 and leads to the rejection of H0.
6
Goodness of Fit and Contingency
Tables
Example
A shop owner wants to compare the numbers of t-shirts of each size that were
sold to the ordered proportions. Assume the sold (observed) numbers of t-
shirts and the expected proportions for each size are as given in the first two
columns of the following table:
The numbers of shirts of each size (observed values) are quite close to the
expected values in this case (not identical, due to sampling variation). Are the
differences large enough to indicate that the observed (sample) sales do not
follow the expected (population) proportions?
Let's formulate it as a hypothesis test:
H0: The observed sales follow the expected proportions, or O ≈ E (no bias,
natural sampling fluctuation).
H1: The observed sales do NOT follow the expected proportions, or O ≠ E.
In the hypothesis tests for proportions, a standardized test statistic was used; similarly,
here the standardized residual for category i is:

(Oi − Ei)/√Ei

where the index i refers to the category (Small, Medium, etc.). As always, residuals have
varying signs, so adding them would lead to cancellations; instead, their squares
are considered.
The function defined below computes the steps of the goodness-of-fit method
and plots a chi-squared distribution with df = k − 1 = 4 − 1 = 3 degrees of
freedom, with the shaded area corresponding to the p-value, as shown in the
Figure below.
[ ]: def GoodnessFit(Observed,BinNames,Pexpect):
         import numpy as np; import matplotlib.pyplot as plt; import pandas as pd
         from scipy.stats import chisquare; from scipy.stats import chi2
         O = np.array(Observed); P_exp = np.array(Pexpect); n = len(O)
         total = O.sum(); E = total*P_exp              # expected counts
         print('Total = {:d}; Expected = Total*[{:.3f},{:.3f},....,{:.3f}] ='
               .format(total,P_exp[0],P_exp[1],P_exp[n-1]))
         print(' [{:.3f},{:.3f},....,{:.3f}]'.format(E[0],E[1],E[n-1]))
         print('Residuals = (Observed - Expected)/sqrt(Expected) = ')
         print('=({:d}-{:.2f})/sqrt({:.2f}), ... , ({:d}-{:.2f})/sqrt({:.2f})\n'
               .format(O[0],E[0],E[0],O[n-1],E[n-1],E[n-1]))
         R = (O-E)/np.sqrt(E); R2 = R**2;
         X2 = R2.sum(); df = n-1                       # chi-squared statistic and degrees of freedom
         print('df = len(O)-1 ={:d}-1 = {:d}'.format(n,df))
         print('Chi-Squared = sum( (O-E)**2/E )= sum( Residuals**2) = ')
         print('=({:d}-{:.2f})**2/{:.2f} + ... + ({:d}-{:.2f})**2/{:.2f} = {:.3f}'
               .format(O[0],E[0],E[0],O[n-1],E[n-1],E[n-1],X2))
         pval = 1 - chi2.cdf(X2,df)
         print('p-value = 1 - chi2.cdf(X2,df) = ', pval, '\n')
df = len(O)-1 =4-1 = 3
Chi-Squared = sum( (O-E)**2/E )= sum( Residuals**2) =
=(25-22.50)**2/22.50 + ... + (68-67.50)**2/67.50 = 0.648
p-value = 1 - chi2.cdf(X2,df) = 0.8853267818237286
Example
The data for a particular hospital's emergency room admissions over the pre-
vious month are given below. To properly allocate resources, investigate whether the
admissions are uniformly distributed by day of the week.
H0: The observed numbers of hospital admissions follow the
expected proportions, O ≈ E (all 7 days equally likely, so pi = 1/7).
H1: The observed values do NOT follow the expected proportions, O ≠ E.
[ ]: GoodnessFit(Observed=[99,72,69,62,67,74,92],
BinNames=['Sun','Mon','Tue','Wed','Thr','Fri','Sat'],
Pexpect=[1/7,1/7,1/7,1/7,1/7,1/7,1/7])
df = len(O)-1 =7-1 = 6
Chi-Squared = sum( (O-E)**2/E )= sum( Residuals**2) =
=(99-76.43)**2/76.43 + ... + (92-76.43)**2/76.43 = 14.781
p-value = 1 - chi2.cdf(X2,df) = 0.022027570437209598
Adding the observed values, we obtain Total = 535. If the observed values
follow the expected proportions, 535 · (1/7) ≈ 76.429 admissions are expected on
Sunday, and the same on every other day. All the expected values are given in the table
above, and none is below 5. The resulting test statistic is χ² = 14.781 with df = 6,
which gives a p-value of about 0.022 (see the output above); since this is below 0.05,
H0 is rejected and the admissions do not appear to be uniformly distributed over the
days of the week.
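The same computation can be cross-checked with scipy's chisquare function, which defaults to equal expected counts (a minimal sketch using the admissions data above):

from scipy.stats import chisquare

observed = [99, 72, 69, 62, 67, 74, 92]          # ER admissions Sun..Sat
stat, pval = chisquare(observed)                 # uniform expected counts by default
print('Chi-Squared = {:.3f}, p-value = {:.4f}'.format(stat, pval))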
Example
In the HELPrct file on substance abusers' health data, investigate whether the sub-
stance variable is equally distributed among alcohol, cocaine, and heroin.
Note that we employ value_counts() to get the counts for each substance
level.
H0: The observed numbers of substance abusers in each group follow the
expected proportions, O ≈ E (equally likely).
H1: The observed values do NOT follow the expected proportions, or O ≠ E.
[ ]: import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/HELPrct.csv"
mydata = pd.read_csv(url) # save as mydata file
counts = mydata['substance'].value_counts(); counts
[ ]: substance
alcohol 177
cocaine 152
heroin 124
Name: count, dtype: int64
[ ]: GoodnessFit(Observed=[177,152,124],
BinNames=['alcohol','cocaine','heroin'],
Pexpect=[1/3,1/3,1/3])
df = len(O)-1 =3-1 = 2
Chi-Squared = sum( (O-E)**2/E )= sum( Residuals**2) =
=(177-151.00)**2/151.00 + ... + (124-151.00)**2/151.00 = 9.311
p-value = 1 - chi2.cdf(X2,df) = 0.009507929549892102
Example
An M&M candy pack is supposed to have 23% blue, 23% orange, 15% green,
15% yellow, 12% red, and 12% brown candies. A random sample of several
packs is selected and the different colors are counted. Do the observed counts
follow the claimed proportions?
df = len(O)-1 =6-1 = 5
Chi-Squared = sum( (O-E)**2/E )= sum( Residuals**2) =
=(58-66.47)**2/66.47 + ... + (29-34.68)**2/34.68 = 4.320
p-value = 1 - chi2.cdf(X2,df) = 0.5043501118366657
The observed values sum up to Total = 289. If they follow the expected propor-
tions, there should be 289 · 0.23 ≈ 66.47 blue candies, etc. The expected values are
in the table above, and none of them is below 5. The resulting test statistic is
χ² = 4.320 with df = 5 and p-value ≈ 0.504 > 0.05 (see the output above), so there is
not enough evidence to reject H0: the observed counts are consistent with the claimed
proportions.
Example
For a number of numerical data sets, the leading digit distribution is surpris-
ingly heavily skewed right (Benford Law): leading 1 - 30% likely, leading 2 -
17.6%, . . . , leading 9 < 5%. If they were uniformly distributed, each of the nine
digits would occur about 11.1% of the time. The formula for the probability
distribution of the rst digit d = 1, 2, ..., 9 is given by:
The observed counts of leading digits from a large financial document are
given below. Do they follow Benford's Law?
H0: The observed counts of the leading digits follow the expected proportions,
O ≈ E.
H1: The observed counts do NOT follow the expected proportions, or O ≠ E.
[ ]: import numpy as np
nv = np.arange(1,10)
P_exp = np.array(np.log10(nv+1)-np.log10(nv)); P_exp
# log10(2)-log10(1), log10(3)-log10(2), log10(4)-log10(3)
[ ]: GoodnessFit(Observed=[1110,601,462,345,260,223,205,180,156],
BinNames=["1","2","3","4","5","6","7","8","9"],
Pexpect=P_exp)
df = len(O)-1 =9-1 = 8
Chi-Squared = sum( (O-E)**2/E )= sum( Residuals**2) =
=(1110-1066.25)**2/1066.25 + ... + (156-162.07)**2/162.07 = 6.058
p-value = 1 - chi2.cdf(X2,df) = 0.6407462181172101
The observed counts add up to Total = 3542. If the observed counts follow
the expected proportions, 3542 · 0.301 ≈ 1066.248 of the leading digits should
be 1, etc. The expected values are shown in the table and the second Figure
above, and none of them is below 5. The resulting test statistic is χ² = 6.058 with
df = 8 and p-value ≈ 0.641 > 0.05 (see the output above), so there is no evidence
against H0: the leading digits are consistent with Benford's Law.
Example
A supplement company claims that its extract is effective in preventing
common cold viruses. Healthy volunteers randomized into groups were given
a placebo, a low dose, or a high dose of the supplement and were exposed to a cold
virus. The results are summarized in the table below. Test the claim that
getting a cold infection is independent of the treatment group (i.e., the row and
column variables are independent) at the 5% level.
              Placebo   Low Dose   High Dose   Sum
Infected          10         28          31     69
Not Infected      40         60          64    164
Sum               50         88          95    233
H0: Getting an infection is independent of the treatment (i.e., row and col-
umn variables are independent).
H1: Dependent.
The χ² statistic approach is used again, but what are the expected counts?
As always, initially assume H0 is true: getting an infection is independent of
the treatment group. Let's concentrate on a particular cell in the upper left
corner. Under independence,

P(Infected and Placebo) = P(Infected) · P(Placebo) = (69/233) · (50/233)
Therefore, the expected count in this cell should be:

E = (Grand Total) · (Probability) = 233 · (69/233) · (50/233) = (69 · 50)/233
E1,1 = (69 · 50)/233 = 14.807
E1,2 = (69 · 88)/233 = 26.06
.................................
E2,3 = (164 · 95)/233 = 66.87
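All expected counts can be generated at once from the row and column totals, following the E = (row total × column total)/grand total rule (a short sketch):

import numpy as np

row_totals = np.array([69, 164])                 # Infected, Not Infected
col_totals = np.array([50, 88, 95])              # Placebo, Low Dose, High Dose
grand_total = row_totals.sum()
E = np.outer(row_totals, col_totals)/grand_total # expected counts under independence
print(np.round(E, 3))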
To illustrate this number of degrees of freedom, let's consider the same contingency table
as above, but specify only two cell entries. Then all the other entries can be
found by subtracting from the totals. For example, the entry for Infected and
High Dose is 69 − 10 − 28 = 31, and Not Infected and Placebo is 50 − 10 = 40.
              Placebo   Low Dose   High Dose   Sum
Infected          10         28           ?     69
Not Infected       ?          ?           ?    164
Sum               50         88          95    233
The χ² statistic is small and the corresponding p-value (see the output of the code
below) is above 0.05, so there is not enough evidence to reject H0: getting an infection is
independent of the treatment group, which implies that the supplement is not
effective. Note in the code below that the counts must be entered column by
column, and the row names go in the data frame index. The second Figure shows
a stacked barplot of the data.
[ ]: def ChiSqIndependence(Observed):
         import numpy as np; import matplotlib.pyplot as plt;
         import pandas as pd; from scipy.stats import chi2_contingency;
         from scipy.stats import chi2
         # correction=False so that chi2 = z**2 for 2x2 tables, matching the text
         X2, pval, df, E = chi2_contingency(Observed, correction=False)
         E = pd.DataFrame(E, index=Observed.index, columns=Observed.columns)
         print("Expected: ")
         print(E)
         print('Chi-Squared = {:.4f}, df = {:d}, p-value = {:.4f}'.format(X2, df, pval))
[ ]: import pandas as pd
O = pd.DataFrame({'Placebo':[10,40], 'LowDose':[28,60], 'HighDose':[31,64]},
index= ['Infected','NotInfected'])
ChiSqIndependence(O)
Note that when there are only two rows (as in this case) or only two columns,
we can look at this problem from the proportion-comparison point of view:
H0: p1 = p2 = p3, the proportions of infected individuals in each treatment group
are the same.
H1: At least one of the proportions is different.
As such, it is a generalization of the two-proportion test studied before. In
fact, the proportions of infected individuals in our example are given below
for Placebo, low dose, and high dose of the supplement, respectively.
[ ]: O.iloc[0,:]/O.sum(axis=0)
[ ]: Placebo 0.200000
LowDose 0.318182
HighDose 0.326316
dtype: float64
Example
A polling company conducted a study to determine the support for a na-
tional healthcare plan among randomly chosen respondents with different
party affiliations. The data are shown in the code below. Test the claim that
the response is independent of the party affiliation (i.e., the row and column vari-
ables are independent).
H0: The response is independent of the party affiliation (i.e., the row and column
variables are independent).
H1: Dependent.
Let's start with the code and then explain the steps. Note that none of the
expected counts is below 5, so the assumptions for the applicability of the χ² test
hold.
[ ]: import pandas as pd
O = pd.DataFrame({'Support':[250,431,320], 'Oppose':[489,220,298]},
index= ['Republicans','Democrats','Independents'])
ChiSqIndependence(O)
There are only two columns in this table; therefore, the problem can be re-
written from the proportion-comparison point of view:
H0: p1 = p2 = p3, the proportions of support in each party are the same.
H1: At least one of the proportions is different.
The proportions of support in our example are given below for Republicans,
Democrats, and Independents, respectively, and according to the problem re-
sults, they are significantly different.
[ ]: O.iloc[:,0]/O.sum(axis=1)
[ ]: Republicans 0.338295
Democrats 0.662058
Independents 0.517799
dtype: float64
Example
A marketing study of shopping habits by social class produced the data below.
Investigate if the brand choice is independent of the social class (i.e., row and
column variables are independent).
H0: The brand choice is independent of the social class (i.e., row and column
variables are independent).
H1: Dependent.
None of the expected counts below is less than 5, so the χ² approach is applicable.
[ ]: import pandas as pd
     O = pd.DataFrame({'Upper Class':[130,30,20,20], 'Middle Class':[100,400,60,40],
                       'Low Class':[70,70,20,40]},
                       index= ['Brand A','Brand B','Brand C','Brand D'])
     ChiSqIndependence(O)
Example
Consider the HELPrct data le again. Test the claim that the gender (sex) is
independent of the preferred substance (substance) at a 5% level.
H0: The substance is independent of the gender (i.e., row and column variables
are independent).
H1: Dependent.
[ ]: import pandas as pd
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/HELPrct.csv"
mydata = pd.read_csv(url)
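The contingency table of substance by sex and the test call are presumably built along these lines (the crosstab construction is my assumption; it continues the cell above and uses the ChiSqIndependence function defined earlier in this chapter):

O = pd.crosstab(mydata['substance'], mydata['sex'])   # rows: substance, columns: sex
print(O)
ChiSqIndependence(O)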
The χ² = 2.026 is very small, resulting in a large p-value = 0.363 > 0.05, as shown
in the chi-squared distribution Figure above. Therefore, there is not enough
evidence to reject H0: the substance is independent of gender. The stacked bar
plot above shows this as well.
With only two columns, the problem can be reformulated as a proportion test:
H0: p1 = p2 = p3, the proportions of males are the same for each substance.
H1: At least one of the proportions is different.
The observed proportions of males for each substance are:
[ ]: O.iloc[:,1]/O.sum(axis=1)
[ ]: substance
alcohol 0.796610
cocaine 0.730263
heroin 0.758065
dtype: float64
Example
Let's come back to an example considered in the previous chapter. First, we
recall the main idea: 87 out of 100 students preparing with the Barron
guide passed the medical admission test (MCAT), while 91 out of 120 passed it
with the Princeton guide. Is there a significant difference at the 1% level? How
about at the 5% level? We recast this problem as a contingency table as follows:
test the claim that passing is independent of the type of review guide used (i.e., the
row and column variables are independent).
[ ]: import pandas as pd
O = pd.DataFrame({'Barrons':[87,13], 'Princeton':[91,29]},
index= ['Passed','Failed'])
ChiSqIndependence(O)
The χ² = 4.4033 results in a p-value = 0.0358 > 0.01, as shown in the chi-
squared distribution Figure above. Therefore, there is not enough evidence to
reject H0 at the 0.01 level, so at this stricter level, MCAT passing is indepen-
dent of the preparation type. We can also see it in the stacked bar plot above.
On the other hand, if α = 0.05 were used, we would have rejected H0.
Comparing these results with the two-proportion test in the previous chapter,
we notice that the p-values are the same and z² = 2.098² = 4.4033 = χ²,
which is true in general.
7
Inference for Means
Mean = µ    Standard Error = SE = σ/√n   (7.1)
Note, in addition, that if the original population distribution is normal, then
this theorem holds for samples of any size.
What is remarkable is that this is true for any distribution; the original distri-
bution does NOT have to be bell-shaped. Proving this theorem, much like the
Central Limit Theorem (CLT) for proportions, would require a more advanced
course, but it can be illustrated with a simulation again. We use the exponential
distribution, which is always skewed right, as shown in the Figure below.
Waiting times in line or lifetimes of some devices tend to have such skewed-
right distributions. We use it here purely for illustration purposes.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon

xv = np.arange(0, 4, 0.01)       # grid of x values for the density curves
L = 1;
dv2 = expon.pdf(x=xv,scale=1/L)
plt.plot(xv,dv2,'g--',label='L=1')
L = 2;
dv3 = expon.pdf(x=xv,scale=1/L)
plt.plot(xv,dv3,'b-.',label='L=2')
plt.legend()
The code below simulates the distribution of sample means from an exponen-
tial distribution with a mean of 1/2. The sample sizes are n = 100.
[ ]: # Simulation of sample means, sample size n = 100
     import seaborn as sns; import numpy as np
     L = 2;
     n = 100; numsamples = 20000
     xv = np.repeat(0.0,numsamples) # sample means
     for j in range(numsamples):
         onesample = np.random.exponential(scale=1/L, size=n)  # one sample from the exponential
         xv[j] = np.mean(onesample)                            # store its sample mean
     sns.histplot(xv)                                          # sampling distribution of the mean
L = 2;
n = 40; numsamples = 10000
xv = np.repeat(0.0,numsamples) # sample means
for j in range(numsamples):
    onesample = np.random.exponential(scale=1/L, size=n)  # smaller sample of size 40
    xv[j] = np.mean(onesample)
sns.histplot(xv)
The simulation below uses a very small sample size, n = 10, and the resulting distri-
bution of sample means is skewed right and no longer bell-shaped, as the Figure shows.
L = 2;
n = 10; numsamples = 10000
xv = np.repeat(0.0,numsamples) # sample means
for j in range(numsamples):
    onesample = np.random.exponential(scale=1/L, size=n)  # very small sample of size 10
    xv[j] = np.mean(onesample)
sns.histplot(xv)
However, let's rerun the previous simulation with n = 10 but sampling from
a standard normal distribution. This results again in a normal sampling dis-
tribution shown in the Figure below, which illustrates the comment made at
the end of the CLT that for a normal underlying distribution, the CLT holds
for a sample of any size.
onesample = np.random.normal(size=n)
xv[j] = np.mean(onesample)
The CLT for sample means has an important complication compared to the
CLT for sample proportions. For proportions, the standard error uses the same
proportion p: SE = √(p(1 − p)/n), whereas for means, the standard error SE = σ/√n
depends on the population standard deviation σ, which is usually unknown
and must be estimated. Any estimate is imperfect, so a more spread-out
t-distribution is needed, which will be introduced shortly.
There are two conditions required to apply the Central Limit Theorem for a
sample mean x̄:
1. Independence
The sample observations must be independent. This is satisfied, for example, by a simple random sample of size less than 10% of the population.
2. Normality
When a sample is small, the sample observations must come from a normally distributed population. The larger the sample, the more this condition can be relaxed. This condition is admittedly more vague. For realistic data, especially with smaller sample sizes, there is often no good way to check the normality condition. For example, the Figure below is a histogram of 20 randomly chosen numbers from a normal distribution, but it does not look bell-shaped at all.
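The book's code for this figure is not shown in this excerpt; a minimal sketch that produces such a histogram:

[ ]: import numpy as np; import seaborn as sns
smallsample = np.random.normal(size=20)   # 20 draws from a standard normal
sns.histplot(smallsample)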
[ ]: <Axes: ylabel='Count'>
When data is not directly available, we can try to assess the normality and skewness of the data based on experience. For example, consider the distribution of bank account balances. The large majority of accounts hold moderate amounts, while a relatively tiny fraction has amassed millions and billions. Therefore, the distribution is extremely skewed. On the other hand, test scores tend to be bell-shaped, and we can use the CLT even for smaller sample sizes.
7.1.1 T-distribution
The Central Limit Theorem above had the true population standard deviation σ in the formula for the standard error SE = σ/√n, which is usually unknown. For the distribution of sample proportions, the unknown true population proportion was replaced by the null-hypothesis value or the sample proportion; for means, σ must similarly be estimated.
The sample standard deviation s can estimate the population value σ quite accurately for larger sample sizes n ≥ 30. However, for a smaller sample n < 30, this estimate is less precise. In this case, W. Gosset proved that the distribution of sample means should be modeled by a slightly different t-distribution. It is also symmetric and bell-shaped, but it is more spread out (has thicker tails). Therefore, observations are more likely to fall beyond 2 standard deviations away from the mean than under the normal distribution. Much like χ2, the t-distribution is not one, but a family of distributions based on the degrees of freedom df = n − 1, as the code below illustrates.
[ ]: from scipy.stats import t, norm
import numpy as np; import matplotlib.pyplot as plt
xv = np.arange(-6,6,0.01)
dv = t.pdf(x=xv,df = 2)
plt.plot(xv,dv,'r--',label='t, df=2')
dv = t.pdf(x=xv,df = 4)
plt.plot(xv,dv,'g--',label='t, df=4')
dv = t.pdf(x=xv,df = 10)
plt.plot(xv,dv,'b--',label='t, df=10')
dv = t.pdf(x=xv,df = 30)
plt.plot(xv,dv,'y--',label='t, df=30')
dv = norm.pdf(x=xv)
plt.plot(xv,dv,label='normal')
plt.legend()
[ ]: <matplotlib.legend.Legend at 0x7f5140ce4e50>
The Figure below shows our usual 90%, 95%, and 99% middle areas for confidence intervals on the t-distribution with df = n − 1 = 10 − 1 = 9. The overall shape is identical to the standard normal curve, but the critical values are larger and the intervals are wider.
[ ]: # Excerpt: 95% and 99% middle areas on a t-distribution with df = 9
import numpy as np; import matplotlib.pyplot as plt; from scipy.stats import t
df1 = 9; xv = np.arange(-6,6,0.01)
tstar = t.ppf(1-(1-np.array([0.90,0.95,0.99]))/2, df=df1)   # critical values t*
fig, axs = plt.subplots(ncols=3); fig.set_figheight(3); fig.set_figwidth(10)
axs[1].plot(xv, t.pdf(xv,df=df1))
px=np.arange(-tstar[1],tstar[1],0.01)
axs[1].fill_between(px,t.pdf(px,df=df1),color='r')
axs[1].text(-4,0.03,"0.025"); axs[1].text(2.9,0.03,"0.025")
axs[2].plot(xv, t.pdf(xv,df=df1))
px=np.arange(-tstar[2],tstar[2],0.01)
axs[2].fill_between(px,t.pdf(px,df=df1),color='r')
axs[2].text(-4.1,0.03,"0.005"); axs[2].text(2.8,0.03,"0.005")
pd.set_option("display.precision", 4); print(df,'\n')
The confidence interval for a population mean is:

point estimate ± t⋆df × SE = x̄ ± t⋆df · s/√n

where the margin of error is:

Margin of error = e = t⋆df · s/√n    (7.2)

CI = x̄ ± e    (7.3)
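As a quick illustration of formulas (7.2)-(7.3), here is a minimal sketch (not the book's OneMeanCI function) computing a 95% t-interval from the summary statistics of the emissions Example that follows:

[ ]: import numpy as np; from scipy.stats import t
xbar = 0.174; s = 0.016; n = 20          # summary statistics from the example below
SE = s/np.sqrt(n)                        # standard error
tstar = t.ppf(1 - 0.05/2, df=n-1)        # critical value for 95% confidence
e = tstar*SE                             # margin of error
print('95% CI: ({:.4f}, {:.4f})'.format(xbar - e, xbar + e))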
Example
A random sample of 20 cars was tested for harmful emissions. The mean was 0.174 g/mi with a standard deviation of 0.016 g/mi (min 0.142, max 0.202). Construct 90%, 95%, and 99% confidence intervals. The EPA requires that these emissions be no more than 0.165 g/mi. Is this requirement being met?
[ ]: def OneMeanCI(xbar,s,n,ConfidenceLevels):
    import numpy as np; from scipy.stats import t; import pandas as pd
    import matplotlib.pyplot as plt
    print('One Mean Confidence Interval function')
    print('Sample mean xbar = ', xbar)
    print('Sample standard deviation = ', s)
    print('Sample size n = ', n)
    print('Confidence Levels',ConfidenceLevels,'\n')
    # Core computation (a sketch of the steps omitted in this excerpt):
    ConfidenceLevels = np.array(ConfidenceLevels); df1 = n - 1
    SE = s/np.sqrt(n)                                # standard error
    tstar = t.ppf(1-(1-ConfidenceLevels)/2, df=df1)  # critical values t*
    e = tstar*SE                                     # margins of error
    df = pd.DataFrame({'ConfLevel':ConfidenceLevels,'tstar':tstar,'e':e,
                       'CIlower':xbar-e,'CIupper':xbar+e})
    pd.set_option("display.precision", 4); print(df,'\n')
    # Shade the middle areas on the t-distribution (95% and 99% panels shown)
    xv = np.arange(-6,6,0.01)
    fig, axs = plt.subplots(ncols=3); fig.set_figheight(3); fig.set_figwidth(10)
    axs[1].plot(xv, t.pdf(xv,df=df1))
    px=np.arange(-tstar[1],tstar[1],0.01)
    axs[1].fill_between(px,t.pdf(px,df=df1),color='r')
    axs[1].text(-4,0.03,"0.025"); axs[1].text(2.9,0.03,"0.025")
    axs[2].plot(xv, t.pdf(xv,df=df1))
    px=np.arange(-tstar[2],tstar[2],0.01)
    axs[2].fill_between(px,t.pdf(px,df=df1),color='r')
    axs[2].text(-4.1,0.03,"0.005"); axs[2].text(2.8,0.03,"0.005")
[ ]: OneMeanCI(xbar=0.174,s=0.016,n=20,ConfidenceLevels=[0.90,0.95,0.99])
Example
In a test of the effectiveness of garlic in lowering cholesterol levels, 25 randomly selected subjects regularly consumed raw garlic in fixed amounts. The changes in cholesterol levels have a mean of −1.7 and a standard deviation of 4.2. The minimum value of this difference was −10.1 and the maximum was 7.9. Construct 90%, 95%, and 99% confidence intervals of the mean net change in LDL cholesterol after the treatment.
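The book's call for this example is not shown in this excerpt; it would presumably mirror the previous one:

[ ]: OneMeanCI(xbar=-1.7,s=4.2,n=25,ConfidenceLevels=[0.90,0.95,0.99])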
H0 : µ = µ0    SE = s/√n    df = n − 1    tdf = (x̄ − µ0)/SE
[ ]: def OneMeanTtest(xbar,s,n,mu0,ConfidenceLevels):
    import numpy as np; from scipy.stats import t; import pandas as pd
    import matplotlib.pyplot as plt
    print('One Mean T test detailed computation')
    print('Sample mean xbar = {:.4f}'.format(xbar))
    print('Sample standard deviation = {:.4f}'.format(s))
    print('Sample size n = {:d}'.format(n))
    print('Null hypothesis claimed mean mu0 = ', mu0)
    print('Confidence Levels',ConfidenceLevels,'\n')
    print('H0: mu = mu0')
    print('H1: mu not mu0 or one-sided test\n')
    # Core computation (a sketch of the steps omitted in this excerpt):
    ConfidenceLevels = np.array(ConfidenceLevels); df1 = n - 1
    SE = s/np.sqrt(n); meansdiff = xbar - mu0
    t1 = meansdiff/SE                           # t-statistic
    pval2 = 2*(1 - t.cdf(x=abs(t1), df=df1))    # two-sided p-value
    print('SE = {:.4f}, t = {:.4f}'.format(SE,t1))
    print('pval two and one tailed : ', pval2, pval2/2,'\n')
    tstar = t.ppf(1-(1-ConfidenceLevels)/2, df=df1); e = tstar*SE
    df = pd.DataFrame({'ConfLevel':ConfidenceLevels,'tstar':tstar,'e':e,
                       'CIlower':xbar-e,'CIupper':xbar+e})
    pd.set_option("display.precision", 4); print(df,'\n')
    CohenD = meansdiff/s
    print("Cohen D = meansdiff/s = {:.4f}/{:.4f} = {:.4f}\n"
          .format(meansdiff,s,CohenD))
[ ]: OneMeanTtest(xbar=127,s=41,n=24,mu0=150,ConfidenceLevels=[0.90,0.95,0.99])
H0: mu = mu0
H1: mu not mu0 or one-sided test
Using the confidence interval approach, we are 95% confident (for a 5% significance error rate) that the true population mean is between 109.69 and 144.31. This interval does not contain the claimed H0 : µ = 150, confirming the rejection of H0. If the 99% level were used, the CI (103.51, 150.49) would have contained 150 (just barely), resulting in failing to reject H0.
Example
When 20 people used a popular diet for one year, their mean weight loss was 2.7 lb and the standard deviation was 6.1 lb (min −8, max 10). Use a 0.01 significance level to test the claim that the mean weight loss is not 0 lb. Based on these results, does the diet appear to be effective?
H0 : µ = 0 (assume true)
H1 : µ ̸= 0
The hypothesis test seeks to determine if the sample mean x̄ = 2.7 is significantly different from the claimed null hypothesis value µ0 = 0.
[ ]: OneMeanTtest(xbar=2.7,s=6.1,n=20,mu0=0,ConfidenceLevels=[0.90,0.95,0.99])
H0: mu = mu0
H1: mu not mu0 or one-sided test
The assumptions for the t-test hold the same way as in the previous problem. A simple random sample ensures independence. The min/max of the data are within 2.5 standard deviations of the mean, so there are no extreme outliers. The p-value is the probability of seeing the observed sample mean x̄ = 2.7 or something even more extreme under our t null distribution shown in the Figure above.
t = (x̄ − µ0)/SE = (2.7 − 0)/1.364 = 1.979
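The corresponding two-tailed p-value can be confirmed directly; a minimal check with scipy:

[ ]: from scipy.stats import t
import numpy as np
xbar = 2.7; s = 6.1; n = 20; mu0 = 0
SE = s/np.sqrt(n)                          # about 1.364
t1 = (xbar - mu0)/SE                       # about 1.979
pval = 2*(1 - t.cdf(abs(t1), df=n-1))      # two-sided p-value
print('t = {:.3f}, p-value = {:.4f}'.format(t1, pval))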
The alternative is two-sided H1 : µ ̸= 0, so both tails must be accounted for. The areas under the tails add up to p-value = 0.0624378 > 0.01 = α, as shown in the Figure above. Thus, we conclude that there is not enough evidence to reject the initial assumption H0. Therefore, the diet does not appear to be effective. The conclusion would also stand at the most common α = 0.05 (5%) level, but not at 10%.
Using the confidence interval approach, we are 99% confident (for a 1% significance error rate) that the true population mean is between −1.2 and 6.6. This interval contains the claimed H0 : µ = 0, which confirms our conclusion not to reject H0. The 95% CI also contains 0, but the 90% CI does not, which confirms our previous observations.
So far we have considered only two-sided tests. Let's investigate some one-sided tests, adjusting the confidence levels as we did for proportion tests.
Example
A company claims to increase students' Math scores to at least 70 on the Regents exam. The consumer protection agency is testing their claim. A sample of 20 students is randomly selected; the mean is 65 with a standard deviation of 9. Test the claim at a 5% level.
H0 : µ = 70 (assume true).
H1 : µ < 70
The hypothesis test seeks to determine if the sample mean x̄ = 65 is significantly lower than the claimed null hypothesis value of 70.
[ ]: OneMeanTtest(xbar=65,s=9,n=20,mu0=70,ConfidenceLevels=[0.90,0.95,0.99])
H0: mu = mu0
H1: mu not mu0 or one-sided test
The assumptions for the t-test hold. As always, a simple random sample is assumed, which ensures independence. The min/max of the data are not given in this case, but test scores tend to be normally distributed.
By an argument similar to the proportion case, any sample mean x̄ above 70 already invalidates H1 : µ < 70, and there is no reason to continue to test such an alternative; the consumer protection agency has no case against the company. Moreover, even if the sample mean x̄ is below 70, it must be small enough to be significantly below 70 to reject H0. In the case above it was, but the code below shows the same calculation for x̄ = 68, which is not low enough, resulting in a p-value above 0.05 and failure to reject H0, as shown in the Figure below.
[ ]: OneMeanTtest(xbar=68,s=9,n=20,mu0=70,ConfidenceLevels=[0.90,0.95,0.99])
H0: mu = mu0
H1: mu not mu0 or one-sided test
Example
A water bottle company claims that there is at most 1 ppm (part per million) of arsenic in their bottled water. A consumer agency is set to test this claim. A random sample of 25 bottles was collected, and the mean concentration of arsenic was 1.09 with a standard deviation of 0.24 (min 0.9, max 1.37). Test the claim at 5%.
Because of the at most claim, the alternative is one-sided. Also, because the consumer protection agency is testing the claim, they are trying to refute it, so the alternative is the opposite of the company's claim.
H0 : µ = 1 (assume true).
H1 : µ > 1
The hypothesis test seeks to determine if the sample mean x̄ = 1.09 is significantly higher than the claimed null hypothesis value of 1.
[ ]: OneMeanTtest(xbar=1.09,s=0.24,n=25,mu0=1,ConfidenceLevels=[0.90,0.95,0.99])
H0: mu = mu0
H1: mu not mu0 or one-sided test
The assumptions for the t-test hold: independent observations, and the min/max of the data are within 2.5 standard deviations of the mean, so there are no extreme outliers.

t = (x̄ − µ0)/SE = (1.09 − 1)/0.048 = 0.09/0.048 = 1.875
The alternative is one-sided to the right, H1 : µ > 1, so only the right tail is included, which gives p-value = 0.0365063 < 0.05 = α, as shown in the Figure above. Therefore, there is enough evidence to reject the initial assumption H0: µ = 1. Thus, the mean concentration of arsenic in the company's water bottles is significantly higher than 1. Note that if we used a two-sided test in this case, the p-value = 0.0730126 > 0.05 = α, which would lead to failing to reject H0. Once again, it is important to follow the given hypothesis direction.
Yet again, for the one-sided test, the confidence level is adjusted to 1 − 2α = 1 − 2 · 0.05 = 0.90. Thus, we are 90% confident (10% significance error) that the true population mean is between 1.01 and 1.17, which does NOT contain the claimed H0 : µ = 1. This confirms our conclusion to reject H0. If we wrongly used the 95% confidence interval of 0.99 to 1.19, the conclusion would have been the opposite.
Before we start with examples, let's remark that even though the t-test is quite robust to departures from normality, in more extreme cases a non-parametric Wilcoxon signed-rank test is an essential alternative. It uses data ranks rather than actual values. Ranks are resistant to outliers and skewness, making this test a better choice for small, skewed samples. It can be used, for example, to determine whether the center of the sample (mean or median) is equal to a prescribed value. Generally, non-parametric procedures have less power to reject H0 when it is false than the corresponding parametric procedure when the normality assumption holds. When data is available, we will employ both the t-test and the Wilcoxon test and compare the results.
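For instance, here is a minimal sketch (an addition, using simulated skewed data for illustration) of a one-sample Wilcoxon signed-rank test alongside the t-test with scipy:

[ ]: import numpy as np
from scipy.stats import wilcoxon, ttest_1samp
rng = np.random.default_rng(1)
sample = rng.exponential(scale=2.0, size=15)   # small, skewed sample (simulated)
mu0 = 2.0                                      # hypothesized center
print(ttest_1samp(sample, popmean=mu0))        # parametric t-test
print(wilcoxon(sample - mu0))                  # non-parametric signed-rank test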
Example
Greenhouse gas CO2 emissions are a very serious issue for the environment. Assume we are given that last year the CO2 emission in a country was 3.26 metric tons per capita. The data below give a random sample of CO2 emissions this year. At the 1% level, is there enough evidence to show that the mean CO2 emissions level is lower this year?
The problem asks for lower emissions; therefore the alternative is one-sided:
H0 : µ = 3.26 (assume true).
H1 : µ < 3.26
The hypothesis test seeks to determine if the sample mean x̄ = 3.092 is significantly lower than the claimed null hypothesis value.
This time, the actual data are given, so we first consider the histogram and boxplot of the data in the code below, and use the scipy.stats function stats.ttest_1samp() to perform the t-test. In addition, the scipy.stats wilcoxon function is used to run the Wilcoxon test on the data. Note that it takes the differences mydata.x - 3.26 as the input. In both tests, we have to remember to specify alternative = "less" to compute the left-tail test; otherwise a two-sided alternative is computed automatically. A step-by-step code is used as well.
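The built-in calls themselves are not shown in this excerpt; a sketch of what they look like, using a hypothetical placeholder sample (the actual CO2 data are not reproduced here):

[ ]: import pandas as pd; from scipy import stats
# hypothetical placeholder values for illustration only
co2 = pd.Series([3.1, 2.9, 3.3, 3.0, 2.8, 3.2, 3.1, 2.95])
print(stats.ttest_1samp(co2, popmean=3.26, alternative="less"))
print(stats.wilcoxon(co2 - 3.26, alternative="less"))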
OneMeanTtest(xbar=xbar1,s=s1,n=n1,mu0=3.26,
ConfidenceLevels=[0.90,0.95,0.98, 0.99])
H0: mu = mu0
H1: mu not mu0 or one-sided test
Example
A new experimental reading technique was tried on 16 elementary school children. Their reading scores are given below. Is there enough evidence to show that the mean score is higher than the State requirement of 65? Test at the 10% level.
Note that we are checking if the mean score is higher than the State requirement, so the alternative is one-sided. Also, once again we are not trying to refute the claim, just checking if it is higher, so the alternative is greater than 65.
H0 : µ = 65 (assume true)
H1 : µ > 65
The hypothesis test seeks to determine if the sample mean x̄ = 72.938 is significantly higher than the claimed null hypothesis value of 65.
The actual data are given once again, so a histogram and a boxplot are used to check for extreme skewness and/or outliers. Note again that alternative="greater" must be used to ensure the correct one-sided test.
OneMeanTtest(xbar=xbar1,s=s1,n=n1,mu0=65,ConfidenceLevels=[0.80,0.90,0.95,0.99])
H0: mu = mu0
H1: mu not mu0 or one-sided test
Example
Consider again the HELPrct.csv file and concentrate on the depression cesd score. Let's say that, for the general population, the mean score is around 30. Investigate whether it is different for this sample of substance abusers. Use a 5% level of significance.
H0 : µ = 30 (assume true)
H1 : µ ̸= 30
The hypothesis test seeks to determine if the sample mean x̄ = 32.848 is significantly different from the claimed null hypothesis value of 30.
The code is a bit different: the HELPrct.csv data file must be loaded, and we reference the variable cesd.
[ ]: import pandas as pd; import numpy as np;
import matplotlib.pyplot as plt; import seaborn as sns
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/HELPrct.csv"
mydata = pd.read_csv(url) # save as mydata file
OneMeanTtest(xbar=xbar1,s=s1,n=n1,mu0=30,ConfidenceLevels=[0.90,0.95,0.99])
H0: mu = mu0
H1: mu not mu0 or one-sided test
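As a cross-check (an addition, not the book's code), the built-in one-sample tests can be run directly on the cesd column loaded above:

[ ]: from scipy import stats
print(stats.ttest_1samp(mydata['cesd'], popmean=30))   # two-sided t-test
print(stats.wilcoxon(mydata['cesd'] - 30))             # non-parametric check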
We can assume these patients are a simple random sample of all such patients, so they are independent. The histogram and boxplot above are almost perfectly bell-shaped (normally distributed). Thus, the assumptions for the t-test hold well in this case (the t-test is robust against mild departures anyway).
Using the confidence interval approach, we are 95% confident (for a 5% significance error rate) that the true population mean is between 31.69 and 34, which does not contain the claimed H0 : µ = 30. This confirms our conclusion to reject H0.
Example
Consider the HELPrct.csv file again and concentrate on the mental health mcs score. From previous studies of substance abusers, the average mcs score is around 32. Investigate whether it is different for this sample of substance abusers. Conduct a hypothesis test at a 5% level.
H0 : µ = 32 (assume true)
H1 : µ ̸= 32
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/HELPrct.csv"
mydata = pd.read_csv(url) # save as mydata file
OneMeanTtest(xbar=xbar1,s=s1,n=n1,mu0=32,ConfidenceLevels=[0.90,0.95,0.99])
H0: mu = mu0
H1: mu not mu0 or one-sided test
As in the previous problem, the t-test assumptions hold. Even though the distribution of mcs scores is a bit skewed right, the sample is so large that the distribution of sample means is normal in any case by the Central Limit Theorem.

t = (x̄ − µ0)/SE = (31.677 − 32)/0.603 = −0.536
The alternative H1 is two-sided, so both the left and right tails need to be accounted for. The areas under the tails add up to p-value = 0.5922422 > 0.05 = α, so we conclude that there is not enough evidence to reject the initial assumption H0. Thus, this sample gives no reason to doubt that the mean population mcs mental score for these substance abusers is 32. Note also that the non-parametric Wilcoxon test leads to the same conclusion.
Also, we are 95% confident (5% significance error) that the true population mean is between 30.49 and 32.86, which contains the claimed H0 : µ = 32. This confirms our conclusion not to reject H0.
The sample size required to estimate a population mean to within a margin of error e at a given confidence level is:

n = (z∗ s / e)²    (7.4)
Example
What sample size is needed to estimate the mean height for the population of adults in the United States? Assume that we want to be 95% confident that the sample mean is within 0.2 in of the population mean, with a standard deviation of 3 in.
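The book's code for this computation is not shown in this excerpt; a minimal sketch applying formula (7.4):

[ ]: import numpy as np; from scipy.stats import norm
zstar = norm.ppf(1 - 0.05/2)       # about 1.96 for 95% confidence
s = 3; e = 0.2
n = (zstar*s/e)**2                 # about 864.3
print('Required n = {:.4f}, rounded up to {:d}'.format(n, int(np.ceil(n))))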
As always, the estimate is rounded up, so you would need 865 subjects in the
study to achieve the desired accuracy.
The estimate's dependence on z∗ and e is the same as for the sample size in the proportion test, n = (z∗)² p(1−p)/e². First, n is directly proportional to (z∗)², so as the level of confidence increases 0.90 → 0.95 → 0.99, z∗ increases and the resulting n increases. Conversely, as the margin of error e increases (lower accuracy), n decreases. Let's illustrate this with a modification of the above example, computing n for several confidence levels and several margins of error.
ConfLevel zstar e n
0 0.95 1.96 0.1 3457.3129
1 0.95 1.96 0.2 864.3282
2 0.95 1.96 0.3 384.1459
For the one-sample means test studied in the previous sections, Cohen defined the effect size to be:

d = (x̄ − µ0)/s

which is essentially the t or z score of x̄ based on the original standard deviation s, not the standard error SE = s/√n. He also introduced guidelines on what constitutes a small, medium, or large effect: small (d = 0.2), medium (d = 0.5), and large (d = 0.8).
Another criticism is that too much focus is placed on the probability of making a Type I error (rejecting H0 when it is actually true) and not enough attention is given to the Type II error (not rejecting H0 when it is actually false). Ideally, we need a method sensitive enough to detect real effects in the underlying population, which is measured by the statistical power (the probability of rejecting H0 when it is actually false, i.e., the complement of the Type II error). A reasonable level of power is 0.80-0.90, which translates into an 80%-90% probability of finding the effect if it exists in the population. Power is influenced by a number of factors:
1. Sample size: a larger sample increases the power.
2. Effect size: a bigger effect increases the power (a big effect is easier to detect).
3. The significance criterion α (the Type I error probability) is usually set at 0.05. Reducing α to 0.01, say, makes it harder to reject H0 when it is true, but also when it is false, so the power decreases. Increasing α to 0.10, say, makes it easier to reject H0 when it is true, but also when it is false, so the power increases.
4. Whether the hypothesis is one- or two-tailed: one-tailed statistical tests tend to be more powerful than two-tailed tests.
The code below shows that, given any three of the sample size n, the probability of a Type I error α, the effect size d, and the power, the fourth one can be found with the TTestPower().solve_power() function.
[ ]: from statsmodels.stats.power import TTestPower
n1 = 40; print('Given sample size = n1 = ', n1)
d1 = 0.5; print('Given Cohen d effect size = difference/s = ', d1)
alpha1 = 0.05; print('Significance level = alpha = ', alpha1)
power1 = TTestPower().solve_power(nobs=n1, effect_size = d1, \
power = None, alpha = alpha1)
print('For the values above, power = {:.4f}\n'.format(power1))
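Conversely, one can ask how large a sample is needed to reach, say, 80% power for the same effect size (an added illustration, not from the book's code):

[ ]: n_needed = TTestPower().solve_power(nobs=None, effect_size=0.5,
                                        power=0.80, alpha=0.05)
print('Required sample size for 80% power: {:.1f}'.format(n_needed))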
It is more illuminating to plot power curves, which show how changes in effect size and sample size impact the power of the statistical test. The Figure below shows how the power of the test increases with a larger effect size (it is easier to detect a larger effect). The increase is also much steeper for larger sample sizes.
[ ]: import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestIndPower
# power vs. effect size, one curve per sample size (a sketch of the omitted plotting call)
TTestIndPower().plot_power(dep_var='effect_size',
    nobs=np.array([10, 20, 50, 100]), effect_size=np.arange(0.2, 1.0, 0.05))
Analogously, the Figure below shows how the power of the test increases with the sample size. The curve is also steeper for a larger effect size.
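A sketch of the corresponding plot (power as a function of sample size, one curve per effect size); this is an added illustration under the same assumptions as the previous cell:

[ ]: TTestIndPower().plot_power(dep_var='nobs',
    nobs=np.arange(5, 150), effect_size=np.array([0.2, 0.5, 0.8]))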
The key observation is that because the data are paired, we can construct the differences. Therefore, we get just one set of differences, and we are back to the one-sample t-test derived in the previous section, with the same exact code!

H0 : µd = 0
H1 : µd ̸= 0 (or < or >)

SE = sd/√n    df = n − 1    tdf = (d̄ − 0)/SE

where d̄ is the mean of the sample differences, n is the number of pairs, and sd is their standard deviation.
Example
In a test of the effectiveness of a new exercise system, 16 randomly selected subjects were observed following this system. The weight changes (before − after) have a mean of 2.7 lb and a standard deviation of 5.5 lb. The minimum value of this difference was −3.9 and the maximum was 9.1. Is there a difference at the 5% level?
Solution:
This is a classic example of paired before-and-after data. In fact, we are already given summaries in terms of differences. The only change to the code used for one-sample means tests is to set the null hypothesis H0 to a true population mean difference of 0, implying no difference.
H0 : µd = 0 (assume true)
H1 : µd ̸= 0
The hypothesis test seeks to determine if the sample mean of 2.7 is significantly different from the claimed null hypothesis value of 0, i.e., whether there is a significant difference before and after the exercise course.
[ ]: OneMeanTtest(xbar=2.7,s=5.5,n=16,mu0=0,ConfidenceLevels=[0.90,0.95,0.99])
H0: mu = mu0
H1: mu not mu0 or one-sided test
The assumptions for the t-test hold the same way for differences. A simple random sample of subjects ensures independence. The min/max of the differences are within 2.5 standard deviations of the mean, so there are no extreme outliers.

t = (x̄ − µ0)/SE = (2.7 − 0)/1.375 = 1.964
Example
Consider the data set textbooks available from the openintro library. It has prices for 73 textbooks at the UCLA bookstore and on Amazon. Test whether there is a difference at the 1% level.
This is another classic example of paired data: each book has a different price from the two sellers. This time, we have the actual data set, so we can explore the boxplot and the histogram of the differences in the Figure below. The mean, standard deviation, and sample size of the differences have to be computed before my step-by-step function can be applied.
H0 : µ = 0 (assume true)
H1 : µ ̸= 0
The hypothesis test seeks to determine if the sample mean x̄ = 12.762 is significantly different from the claimed null hypothesis value of 0, i.e., whether there is a significant difference between the prices for the same textbooks at the UCLA bookstore and at Amazon.
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/
,→UCLAtextbooks.csv"
H0: mu = mu0
H1: mu not mu0 or one-sided test
Therefore, the sampling null distribution of (x̄ − µ0)/SE is approximately a t-distribution (although the standard normal could have been used for such a large sample as well), where SE = s/√n = 14.255/√73 = 1.668.
The p-value is the probability of seeing the observed sample mean x̄ or something even more extreme under our t null distribution shown in the Figure above.

t = (x̄ − µ0)/SE = (12.762 − 0)/1.668 = 7.649
The alternative is two-sided H1 : µ ̸= 0, so both tails must be accounted for. The areas under the tails add up to p-value = 6.93 · 10⁻¹¹ < 0.01 = α, so we conclude that there is enough evidence to reject the initial assumption H0: µ = 0. Thus, there is a difference in book prices.
Using the confidence interval approach, we are 99% confident (1% significance error) that the true population mean of the differences is between 8.35 and 17.18, which does not contain the claimed H0 : µ = 0. This confirms our conclusion to reject H0.
Example
To determine if massage is effective for treating muscle pain, a pilot study was conducted in which a certified therapist treated randomly chosen patients. Pain level was measured immediately before and after the treatment. The data are given in the code below. Do the data show that the treatment reduces pain? Test at the 5% level.
First, the boxplot and histogram of the differences (before − after) are explored in the Figure below. Because the question asks about a reduction in pain, it implies that the average pain scores before are larger than after; therefore it is a one-sided test to the right:
H0 : µd = 0 (assume true)
H1 : µd > 0 (note it is a one-sided test)
The hypothesis test determines whether the sample mean x̄ = 1.65 is significantly larger than the claimed null hypothesis H0: µd = 0, which would imply a significant reduction in pain from before to after the treatment. Note that alternative = "greater" must be used in the built-in tests.
[ ]: mydata = pd.DataFrame({'before':[2, 3, 8, 2, 3, 3, 2, 6, 2, 5, 5, 3, 3, 3, 7, 5, 1, 3, 9, 1],
                       'after' :[0, 2, 6, 0, 0, 3, 3, 4, 1, 5, 2, 3, 0, 3, 3, 0, 2, 1, 4, 1]})
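The built-in paired tests can be run directly on these columns (an added cross-check; the book's step-by-step function output follows below):

[ ]: from scipy import stats
diff = mydata['before'] - mydata['after']           # paired differences
print(stats.ttest_rel(mydata['before'], mydata['after'], alternative='greater'))
print(stats.wilcoxon(diff, alternative='greater'))  # non-parametric check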
H0: mu = mu0
H1: mu not mu0 or one-sided test
Example
A small sample of students is tested before and after they use the tutoring center. Test whether their grades improved at the 1% level.
Once again, we have the actual data set; thus we can explore the boxplots and the histograms of the differences in the Figure below. Note also that a statement about improvement of grades means that the average scores before are smaller than after; therefore it is a one-sided test to the left.
H0 : µ = 0 (assume true).
H1 : µ < 0 (one-sided test).
The hypothesis test seeks to determine if the mean of the differences x̄ = −2.167 is significantly smaller than the claimed null hypothesis value µ0 = 0, i.e., whether there is a significant improvement in the test scores due to tutoring.
H0: mu = mu0
H1: mu not mu0 or one-sided test
A simple random sample ensures independence. The data set is so small in this problem that we cannot really confirm normality from the histogram and/or boxplot; however, test scores tend to be bell-shaped, so it is reasonable to assume that the assumptions for the t-test hold.
Hypothesis test:
H0 : µ1 = µ2 ⇐⇒ µ1 − µ2 = 0
H1 : µ1 ̸= µ2 ⇐⇒ µ1 − µ2 ̸= 0
The standard error for the difference of means is computed differently depending on the variances of the groups:
1. Pooled Variance: the population variances are assumed to be equal, σ1² = σ2². If the original data are available, Levene's test for equality of variances can assess this. If not, equality is assumed if the ratio of the larger sample standard deviation to the smaller is less than 2 (rule of thumb).

sp² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2)    SE = √( sp² (1/n1 + 1/n2) )

Degrees of Freedom = df = n1 + n2 − 2
2. Unpooled Variance
In case the population variances cannot be assumed equal, there is no exact solution, but Welch's approximation is used:

Standard Error = SE = √( s1²/n1 + s2²/n2 )

Degrees of Freedom = df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
Example
Compare the test scores of independent samples of men and women on a standardized accounting test. We take a random sample of 31 people (16 men and 15 women). The sample means are 80 and 84, and the standard deviations are 12 and 14, for men and women, respectively. Note that women outperform men by 4 points, but that might be due to sampling error. We would like to test whether this difference is significant at the 5% level.
H0 : µ1 = µ2 ⇐⇒ µ1 − µ2 = 0 (assumed true)
H1 : µ1 ̸= µ2 ⇐⇒ µ1 − µ2 ̸= 0
The function below is rather long, but it just implements the formulas outlined
above.
[ ]: def TwoMeansTtests2(x1,s1,n1,x2,s2,n2,ConfLevels):
    import numpy as np; from scipy.stats import t; import pandas as pd
    import matplotlib.pyplot as plt
    print('Two Independent Means T test detailed computation')
    print('Sample means, standard deviations, and sample sizes are:')
    print('1st Sample: x1 = {:.3f}, s1 = {:.3f}, n1 = {:d}'
          .format(x1,s1,n1))
    print('2nd Sample: x2 = {:.3f}, s2 = {:.3f}, n2 = {:d}'
          .format(x2,s2,n2))
    print('Confidence Levels',ConfLevels,'\n')
    ConfLevels = np.array(ConfLevels)
    sdratio = np.maximum(s1/s2,s2/s1)
    print("standard deviations ratio <2? = {:.4f}\n".format(sdratio))
    meansdiff = x1-x2
    print('meansdiff = x1-x2 = {:.4f} - {:.4f} = {:.4f}\n'.
          format(x1,x2,meansdiff))
    print("=============================================")
    print("1st Pooled approach: ----------------------------------------------")
    # Pooled variance and standard error (reconstructed from the printed output below)
    sp2 = ((n1-1)*s1**2 + (n2-1)*s2**2)/(n1+n2-2)
    SE = np.sqrt(sp2*(1/n1+1/n2))
    print("SE = np.sqrt(sp*(1/n1+1/n2)) = np.sqrt({:.3f}*(1/{:d}+1/{:d}) = {:.3f}"
          .format(sp2,n1,n2,SE))
    df1 = n1+n2-2
    print("degree of freedom = df = n1+n2-2 = {:d}+{:d}-2 = {:d}"
          .format(n1,n2,df1))
    t1 = meansdiff/SE
    print('t = meansdiff/SE = {:.3f}/{:.3f} = {:.3f}'
          .format(meansdiff,SE,t1))
    pval2 = 2*(1 - t.cdf(x=abs(t1), df=df1))
    print('pval2 = 2*(1 - spst.t.cdf(x=abs(t1), df=df1))')
    print("pval two and one tailed : ", pval2, pval2/2,'\n')
    print("=========================================")
    # The unpooled (Welch) approach, the confidence intervals, and Cohen's d
    # follow analogously in the full function (see the printed output below);
    # they are omitted in this excerpt.
    print("==========================================")
[ ]: TwoMeansTtests2(x1=80,s1=12,n1=16,
x2=84,s2=14,n2=15,ConfLevels=[0.90,0.95,0.99])
H0: mu1-mu2 = 0
H1: mu1-mu2 not 0 or one-sided test
===================================================
1st Pooled approach: ----------------------------------------------
sp2 = ((n1-1)*s1**2 + (n2-1)*s2**2)/(n1+n2-2) =
(16-1)12.000**2 + (15-1)14.000**2)/(16+15-2) = 169.103
SE = np.sqrt(sp*(1/n1+1/n2)) = np.sqrt(169.103*(1/16+1/15) = 4.674
degree of freedom = df = n1+n2-2 = 16+15-2 = 29
t = meansdiff/SE = -4.000/4.674 = -0.856
pval2 = 2*(1 - spst.t.cdf(x=abs(t1), df=df1))
pval two and one tailed : 0.39908551964946626 0.19954275982473313
========================================================
2nd UNpooled approach (Welch Approximation):-------------
SE = np.sqrt(s12/n1 + s22/n2) =
sqrt(12.000**2/16 + 14.000**2/15) = 4.698
degree of freedom = df = 27.67390755241222
t = meansdiff/SE = -4.000/4.698 = -0.852
pval2 = 2*(1 - spst.t.cdf(x=abs(t1), df=df1))
pval two and one tailed : 0.4017926906364071 0.20089634531820355
===============================================================
Compute Cohen's d:
averagesd = (s1+s2)/2 = (12.000+14.000)/2 = 13.000
Cohen's d = meansdiff/averagesd = (-4.000/13.000)/2 = -0.308
The groups of men and women are independent random samples, so the independence within and between the groups should hold. We are not given min and max values to check the rule of thumb for each group, but test scores tend to be bell-shaped, so normality can be assumed (the t-test is robust to departures from normality in any case).
In this case, the standard deviations are quite close, so the pooled approach can be used, but the results for the pooled and unpooled approaches are almost the same anyway.
The p-value > 0.05 = α, so we conclude that there is not enough evidence to reject the initial assumption H0: µ1 − µ2 = 0. Thus, there is no significant difference between men's and women's test scores. Also, all 95% confidence intervals contain the claimed H0: µ1 − µ2 = 0, which confirms our conclusion not to reject H0.
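The same test can be verified from the summary statistics alone with scipy (an added cross-check, not part of the book's function):

[ ]: from scipy.stats import ttest_ind_from_stats
# pooled (equal_var=True) and Welch (equal_var=False) versions
print(ttest_ind_from_stats(mean1=80, std1=12, nobs1=16,
                           mean2=84, std2=14, nobs2=15, equal_var=True))
print(ttest_ind_from_stats(mean1=80, std1=12, nobs1=16,
                           mean2=84, std2=14, nobs2=15, equal_var=False))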
Example
In this example we are given data summaries on two companies' daily pay: sample sizes 20 and 25, sample means 240 and 220, and standard deviations 26 and 12, respectively. The min and max for the 1st group are 200 and 275, and for the 2nd group 200 and 245. We would like to determine whether the difference in daily pay between the two companies is statistically significant at the 1% level.
Note that because we were told to use a 1% significance level, α has to be set to 0.01 (confidence level 0.99).
H0 : µ1 = µ2 ⇐⇒ µ1 − µ2 = 0 (assumed true)
H1 : µ1 ̸= µ2 ⇐⇒ µ1 − µ2 ̸= 0
[ ]: TwoMeansTtests2(x1=240,s1=26,n1=20,
x2=220,s2=12,n2=25,ConfLevels=[0.90,0.95,0.99])
H0: mu1-mu2 = 0
H1: mu1-mu2 not 0 or one-sided test
========================================================
1st Pooled approach: ----------------------------------------------
sp2 = ((n1-1)*s1**2 + (n2-1)*s2**2)/(n1+n2-2) =
(20-1)26.000**2 + (25-1)12.000**2)/(20+25-2) = 379.070
SE = np.sqrt(sp*(1/n1+1/n2)) = np.sqrt(379.070*(1/20+1/25) = 5.841
degree of freedom = df = n1+n2-2 = 20+25-2 = 43
t = meansdiff/SE = 20.000/5.841 = 3.424
pval2 = 2*(1 - spst.t.cdf(x=abs(t1), df=df1))
pval two and one tailed : 0.001366515043855987 0.0006832575219279935
===================================================
2nd UNpooled approach (Welch Approximation):-------------
SE = np.sqrt(s12/n1 + s22/n2) =
sqrt(26.000**2/20 + 12.000**2/25) = 6.290
degree of freedom = df = 25.442573732854534
t = meansdiff/SE = 20.000/6.290 = 3.180
pval2 = 2*(1 - spst.t.cdf(x=abs(t1), df=df1))
pval two and one tailed : 0.0038511672711767364 0.0019255836355883682
===================================================
Compute Cohen's d:
averagesd = (s1+s2)/2 = (26.000+12.000)/2 = 19.000
Cohen's d = meansdiff/averagesd = (20.000/19.000)/2 = 1.053
The groups of employees from the two firms are assumed to be independent random samples, so the independence within and between the groups holds. The min and max for each group are within the x̄ ± 2.5s bounds, so the observations are not too extreme and normality can be assumed (the t-test is robust to departures from normality in any case).
This time, the ratio of the standard deviations is more than 2, so the pooled approach should not be used. We can also see in the final output that its results are quite different. The conclusions are still the same, though. The p-value < 0.01 = α, so we conclude that there is enough evidence to reject the initial assumption H0: µ1 − µ2 = 0. Thus, there is a significant difference between the groups. Also, the 99% confidence interval does NOT contain the claimed 0, which confirms our conclusion to reject H0.
Example
Consider again the HELPrct file. Compare the mean cesd depression scores by gender. We would like to test whether there is a significant difference at the 1% level.
This time we have an actual data file, so in the Figure below we explore the histogram and boxplot separated by gender. Also, Levene's test is used to assess the equality of group variances.
H0 : µ1 = µ2 ⇐⇒ µ1 − µ2 = 0 (assumed true)
H1 : µ1 ̸= µ2 ⇐⇒ µ1 − µ2 ̸= 0
[ ]: import pandas as pd; import numpy as np; from scipy import stats
import matplotlib.pyplot as plt; import seaborn as sns
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/HELPrct.csv"
mydata = pd.read_csv(url) # save as mydata file
mydata[['cesd','sex']].head()
[ ]: cesd sex
0 49 male
1 30 male
2 39 male
3 15 female
4 39 male
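The grouping, Levene's test, and the built-in two-sample t-test can be done along these lines (a sketch; the book's own TwoMeansTtests2 output follows below):

[ ]: cesd_f = mydata[mydata['sex']=='female']['cesd']
cesd_m = mydata[mydata['sex']=='male']['cesd']
print(stats.levene(cesd_f, cesd_m))                     # equality of variances
print(stats.ttest_ind(cesd_f, cesd_m, equal_var=True))  # pooled two-sample t-test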
H0: mu1-mu2 = 0
H1: mu1-mu2 not 0 or one-sided test
========================================================
1st Pooled approach: ----------------------------------------------
sp2 = ((n1-1)*s1**2 + (n2-1)*s2**2)/(n1+n2-2) =
(107-1)13.018**2 + (346-1)12.103**2)/(107+346-2) = 151.889
SE = np.sqrt(sp*(1/n1+1/n2)) = np.sqrt(151.889*(1/107+1/346) = 1.363
degree of freedom = df = n1+n2-2 = 107+346-2 = 451
t = meansdiff/SE = 5.290/1.363 = 3.880
pval2 = 2*(1 - spst.t.cdf(x=abs(t1), df=df1))
pval two and one tailed : 0.00011999508048665675 5.999754024332837e-05
========================================================
2nd UNpooled approach (Welch Approximation):-------------
SE = np.sqrt(s12/n1 + s22/n2) =
sqrt(13.018**2/107 + 12.103**2/346) = 1.417
degree of freedom = df = 166.5919791073123
t = meansdiff/SE = 5.290/1.417 = 3.734
pval2 = 2*(1 - spst.t.cdf(x=abs(t1), df=df1))
pval two and one tailed : 0.000258725844173302 0.000129362922086651
========================================================
Compute Cohen's d:
averagesd = (s1+s2)/2 = (13.018+12.103)/2 = 12.560
Cohen's d = meansdiff/averagesd = (5.290/12.560)/2 = 0.421
Female and male substance abusers can be assumed to be independent random samples, so the independence within and between the groups holds. The Figure above shows histograms and boxplots for males and females. The graphs are not particularly skewed, with no extreme outliers, so we can assume normality. The sample sizes are so large that the CLT guarantees the normality of the sample means in any case. We also look at the non-parametric test just to compare.
Because Levene's test shows a p-value much higher than 0.05, equal variances can be assumed and the pooled approach is justified. The pooled and unpooled approaches give the same answer anyway.
Example
In this example, we would like to determine if there is a significant difference in the average prices of concert tickets in Kansas City and Salt Lake City. We sample 11 ticket stubs from Kansas City and 10 from Salt Lake City. Test the claim using a 0.05 level of significance. Note that the Welch approximation is used!
This time, the data set is small and is best entered on the fly into a pd.DataFrame. The Figure below shows histograms and boxplots separated by city. Levene's test is used to assess the equality of group variances. Note how the data are entered in long format: all prices are entered in the variable price, and the categorical variable city identifies which city each price is from.
H0 : µ1 = µ2 ⇐⇒ µ1 − µ2 = 0 (assumed true)
H1 : µ1 ̸= µ2 ⇐⇒ µ1 − µ2 ̸= 0
[ ]: import pandas as pd; import numpy as np; from scipy import stats
import matplotlib.pyplot as plt; import seaborn as sns
[ ]: <Axes: xlabel='price'>
H0: mu1-mu2 = 0
H1: mu1-mu2 not 0 or one-sided test
========================================================
1st Pooled approach: ----------------------------------------------
sp2 = ((n1-1)*s1**2 + (n2-1)*s2**2)/(n1+n2-2) =
(11-1)1.027**2 + (10-1)2.635**2)/(11+10-2) = 3.844
SE = np.sqrt(sp*(1/n1+1/n2)) = np.sqrt(3.844*(1/11+1/10) = 0.857
degree of freedom = df = n1+n2-2 = 11+10-2 = 19
t = meansdiff/SE = -4.136/0.857 = -4.828
pval2 = 2*(1 - spst.t.cdf(x=abs(t1), df=df1))
pval two and one tailed : 0.00011681853580802759 5.8409267904013795e-05
=======================================================
2nd UNpooled approach (Welch Approximation):-------------
SE = np.sqrt(s12/n1 + s22/n2) =
sqrt(1.027**2/11 + 2.635**2/10) = 0.889
degree of freedom = df = 11.459853276011748
t = meansdiff/SE = -4.136/0.889 = -4.653
pval2 = 2*(1 - spst.t.cdf(x=abs(t1), df=df1))
pval two and one tailed : 0.0006297276542999164 0.0003148638271499582
========================================================
Compute Cohen's d:
averagesd = (s1+s2)/2 = (1.027+2.635)/2 = 1.831
Cohen's d = meansdiff/averagesd = (-4.136/1.831)/2 = -2.259
The two cities' ticket prices are independent random samples, so independence within and between the groups is maintained. The Figure above shows a histogram and boxplot by group. The graphs are not extremely skewed, with no extreme outliers. We also looked at the non-parametric test to compare.
Because Levene's test shows a p-value smaller than 0.05, equal variances cannot be assumed, and the pooled approach should not be used.
8.1 Correlation
As an example, let's return to the file MHEALTH.csv mentioned before in Chapters 1 and 2. It contains many health measurements for 40 men. Let's investigate the dependence of weight (WT) on waist size (WAIST). Whenever you plan a correlation/regression study, you must always start with a scatterplot, as shown in the Figure below.
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/MHEALTHs.csv"
mydata = pd.read_csv(url) # save as mydata file
# mydata.head(10)
sns.scatterplot(data=mydata, x="WAIST", y="WT")
where n is the number of data point pairs. The scipy.stats library has an effective function to perform this rather tedious computation.
Note that because standardized z-scores are used to define r, it is a dimensionless number that does not depend on the units of measurement of either variable. For example, assume that x is the square footage (ft²) of an apartment, and y is its price in dollars. If we change x to square meters and/or y to thousands of dollars, the correlation coefficient will not change. Also, x and y enter the r formula symmetrically, so they can be interchanged.
[ ]: <Axes: >
The formula for the correlation coefficient, r = (1/(n−1)) Σᵢ₌₁ⁿ (zx)i · (zy)i, and the Figure below can help in understanding the sign and magnitude of the correlation.
[ ]: import numpy as np; import matplotlib.pyplot as plt; import seaborn as sns
from scipy.stats import pearsonr
a = 1; b = 2; sig = 1; n1 = 100          # intercept, slope, noise, sample size (illustrative)
x = np.random.uniform(0, 5, size=n1)
fig, axs = plt.subplots(ncols=3); fig.set_figheight(3); fig.set_figwidth(10)
y = a + b*x + np.random.normal(loc=0.0, scale=sig, size=n1)   # positive association
r = pearsonr(x,y);
print("Pearson correlation coefficient r and corresponding p-value:")
print(r)
sns.scatterplot(x=x,y=y, ax = axs[1])
y = a - b*x + np.random.normal(loc=0.0, scale=sig, size=n1)   # negative association
r = pearsonr(x,y);
print("Pearson correlation coefficient r and corresponding p-value:")
print(r)
sns.scatterplot(x=x,y=y, ax = axs[2])
[ ]: <Axes: >
For example, for WT and WAIST size in the MHEALTH data file, the above formulas give:
Note that the strength of the correlation is given by the magnitude of r, while the significance is determined by the p-value of the corresponding t-test. They are related, but not the same. For example, for a large data set, even a weak correlation coefficient can be statistically significant.
Many possible pairwise correlation coefficients exist in the case of several interrelated numerical variables. The code below gives an efficient way to obtain all the pairwise correlations at once and to visualize them.
[ ]: r = mydata[['AGE','HT','WT','WAIST','PULSE','CHOL','BMI']].corr(method='pearson'); r
[ ]: sns.set(rc={'figure.figsize':(9.7,4.27)})
sns.heatmap(r,linewidth=1,annot=True,annot_kws={'size':20})
[ ]: <Axes: >
Therefore, WT is weakly correlated with AGE, not correlated with CHOL (cholesterol), moderately correlated with HT (height), and strongly positively correlated with BMI. We can check individual significance with the p-value, for example for WT and HT:
[ ]: pearsonr(mydata['HT'],mydata['WT'])
[ ]: PearsonRResult(statistic=0.5222462687506603, pvalue=0.000547057356449642)
[ ]: r, pval = pearsonr(mydata['HT'],mydata['WT'])
print('r = {:.3f}, pval = {:.10f}, r-squared = {:.3f}'.format(r, pval, r**2))
For example, for the relationship between HT and WT, r² = 0.522² = 0.273. Therefore, 27.3% of the variation in WT can be explained by the linear relationship with HT. On the other hand, for the relationship between WT and PULSE, r² = 0.056² = 0.0031. Therefore, only 0.31% of the variation in WT can be explained by the linear relationship with PULSE (essentially no relationship).
To repeat the point made at the beginning, correlation does NOT imply causation. Only a randomized experiment can be used to assess causality!
[ ]: mydata = pd.DataFrame({'NumberMeds':[3,4,1,0,3,1,1,3,5,7,3,2,2,0],
                       'TestScore':[16,19,12,14,16,16,11,11,10,11,13,16,20,14]})
[ ]: NumberMeds TestScore
NumberMeds 1.000000 -0.162685
TestScore -0.162685 1.000000
The correlation coefficient is negative (as predicted), but it is weak and not statistically significant at any acceptable level of significance. Note that we have to divide the p-value that we obtained by 2 (one-sided test), but it is still far greater than 0.05. Thus, the number of medications was not significantly related to memory performance.
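The p-value referred to above can be obtained directly; a minimal check (an addition, using the same data):

[ ]: from scipy import stats
r, pval = stats.pearsonr(mydata['NumberMeds'],mydata['TestScore'])
print('r = {:.3f}, pval = {:.4f}, pval/2 = {:.4f}'.format(r, pval, pval/2))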
Kendall's rank correlation tests the similarities in the ordering of data ranked
by quantities. Pearson and Spearman use the individual observations as the
basis of the correlation, while Kendall's correlation coecient uses pairs of
observations. It determines the strength of association based on the pattern
of concordance and discordance between the pairs. For the medication problem
above, the conclusions are the same.
[ ]: r, pval = stats.kendalltau(mydata['NumberMeds'],mydata['TestScore'])
print('r = {:.3f}, pval = {:.10f}, pval/2 = {:.10f}, r-squared = {:.3f}'.
format(r, pval, pval/2, r**2))
r = mydata.corr(method='kendall'); r
[ ]: NumberMeds TestScore
NumberMeds 1.000000 -0.099381
TestScore -0.099381 1.000000
[ ]: mydata = pd.DataFrame({'Experience':[1,1,2,2,4,4,7,7,10,10,12,12,12,20],
                       'Income':[50,51,55,53,61,62,70,74,89,90,91,90,90,98],
                       'Performance':[16,19,21,24,20,21,25,24,28,30,31,32,28,35]})
r = mydata.corr(method='pearson'); r
[ ]: sns.heatmap(r,linewidth=1,annot=True,annot_kws={'size':8})
[ ]: <Axes: >
[ ]: n r CI95% p-val
pearson 14 0.403083 [-0.19, 0.78] 0.172036
In the above output, the years of Experience have been covaried (partialled out). The correlation between Income and Performance has been reduced considerably, to less than half of the original, and the p-value > 0.05, so it is no longer significant. Partialling out removed the part of the correlation between Income and Performance that was due to years of Experience, and there is not much correlation left.
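The format of the outputs in this section suggests a partial-correlation routine; a minimal sketch with the pingouin package (an assumption, since the book's actual call is not shown in this excerpt) that produces output of this kind:

[ ]: import pingouin as pg
# correlation of Income and Performance, controlling for Experience
pg.partial_corr(data=mydata, x='Income', y='Performance', covar='Experience')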
[ ]: n r CI95% p-val
pearson 14 0.611744 [0.09, 0.87] 0.026291
The correlation above has been reduced to a moderate 0.61, but it is still significant.
[ ]: n r CI95% p-val
pearson 14 0.428899 [-0.16, 0.79] 0.143632
We can illustrate this idea using the diagram above, showing overlapping circles for each of our three variables. It is not mathematically accurate; the circles are exaggerated to convey the idea. The shared variance between Income and Performance is represented by the areas a and b. Experience is related to (shares variance with) both Income and Performance, with areas c, b, and d. On the other hand, area a shows the unique shared variance of Income and Performance. Area b represents the part of the relationship between Income and Performance which is due to Experience. If we remove the influence of Experience (area b), the correlation between Income and Performance is considerably reduced (area a only).
A medical example of partial correlation might be a relationship between
symptoms and quality of life, with depression partialled out (covaried). If the
correlation between symptoms and quality of life is considerably reduced after
partialling out depression, it can be concluded that a part of the relationship
between symptoms and quality of life is due to depression.
ŷ = E(Y |X = x) = b0 + b1 x (8.4)
To find the best line, a residual for each data point i = 1...n is defined as the vertical distance between the actual y-value yi of the data point and its prediction by the line ŷi:
ei = yi − ŷi = yi − (b0 + b1 · xi ) (8.5)
The Figure below shows data with the best regression line on the left side
and residuals on the right side using sns.regplot() and sns.residplot(),
respectively.
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/MHEALTH.csv"
mydata = pd.read_csv(url) # save as mydata file
fig, axs = plt.subplots(ncols=2); fig.set_figheight(3); fig.set_figwidth(8)
sns.regplot(data=mydata,x="WAIST", y="WT",ax=axs[0]);
sns.residplot(data=mydata,x="WAIST", y="WT",color='green',ax=axs[1])
It can be shown that the least-squares slope is:

b1 = r · sy/sx    (8.6)

where r is the correlation coefficient, and sx and sy are the standard deviations. Since the standard deviations are positive, the sign of the slope of the line is the same as the sign of the correlation coefficient r. Also, it can be shown that the point of means (x̄, ȳ) is always on the regression line. Therefore, the point-slope form of the equation of the regression line is:

y − ȳ = b1(x − x̄)
y = ȳ − b1x̄ + b1x

Therefore, the y-intercept is

b0 = ȳ − b1x̄    (8.7)
[ ]: x = mydata['WAIST']; y = mydata['WT'];
xbar = np.mean(x); sx = np.std(x,ddof=1);
ybar = np.mean(y); sy = np.std(y,ddof=1);
r, pval = pearsonr(x,y); b1 = sy*r/sx; b0 = ybar-b1*xbar;
print('y-intercept b0 = {:.4f} and slope b1 = {:.4f}'.format(b0,b1))
Based on the code above, the y-intercept is b0 = −44.08, and the slope is b1 = 2.37. Therefore,

ŷ = b0 + b1x = −44.08 + 2.37 · x

or, in terms of the original variables,

predicted WT = −44.08 + 2.37 · WAIST

A slope is rise over run; therefore, for each 1 cm increase in WAIST, the model expects, on average, a gain of 2.37 lb of WT. Note that for the linear model it matters that the explanatory (independent) variable WAIST explains the response (dependent) variable WT, whereas for correlation the order did not matter.
Let's find the residual for a given man with a WAIST size of 95 cm and a weight of 181 lb (data point (95, 181)).
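The computation itself is not shown in this excerpt; a minimal sketch using the b0 and b1 found above:

[ ]: yhat = b0 + b1*95       # predicted weight for WAIST = 95 cm
residual = 181 - yhat    # observed minus predicted
print('predicted WT = {:.2f}, residual = {:.2f}'.format(yhat, residual))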
[ ]: # x, a, b, sig, n1 reuse the simulated-data setup from an earlier cell;
# illustrative values are assumed here so the cell runs on its own
a = 1; b = 2; sig = 1; n1 = 50
x = np.random.uniform(0, 2, size=n1)
x2 = np.append(x, [2.1,2.2])
y2 = a + b*x2 + np.random.normal(loc=0.0, scale=sig, size=n1+2);
y2[n1]=20; y2[n1+1]=20.2;
df = pd.DataFrame({'x':x2,'y':y2})
fig, axs = plt.subplots(nrows=2); fig.set_figheight(7);
fig.set_figwidth(7)
sns.regplot(data=df,x='x',y='y',ax=axs[0]);
import statsmodels.api as sm
from statsmodels.formula.api import ols
# fit simple linear regression model
linear_model = ols('y ~ x', data=df).fit()
sns.histplot(linear_model.resid,ax=axs[1])
[ ]: <Axes: ylabel='Count'>
correlation, slope, ANOVA, etc. The correlation analysis was already done in the previous section. The standard error of the slope estimate is:

SE(b1) = √[ (1/(n−2)) · Σᵢ₌₁ⁿ (yi − ŷi)² / Σᵢ₌₁ⁿ (xi − x̄)² ]    (8.8)
The regression summaries for WT vs. WAIST are given in the extensive tables below, and the Figure shows the scatterplot with the best-fit line.
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/MHEALTH.csv"
mydata = pd.read_csv(url) # save as mydata file
# mydata.head(10)
fig, ax = plt.subplots(); fig.set_size_inches(7.5, 5.5)
sns.regplot(data=mydata, x="WAIST", y="WT")
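The summary tables referred to above can be produced along these lines (a sketch; the model is fitted with statsmodels ols, consistent with the lmdiag diagnostics below):

[ ]: import statsmodels.api as sm
from statsmodels.formula.api import ols
linear_model = ols('WT ~ WAIST', data=mydata).fit()   # simple linear regression
print(linear_model.summary())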
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[ ]: plt.figure(figsize=(7.5,5.5))
import lmdiag
lmdiag.plot(linear_model);
The diagnostic plots above provide extremely important data summaries. The top-left one is Residuals vs. Fitted values. This plot checks whether the residuals have non-linear patterns. If you find equally spread residuals around a horizontal line without distinct patterns (as in this case), that is a good indication that we don't have a non-linear relationship. The top-right plot shows a quantile-quantile (QQ) plot of the standardized residuals vs. the standard normal distribution quantiles. If the residuals approximately follow a straight line (as in this case), the normality assumption for the residuals holds. The bottom-left figure is the Scale-Location plot. It checks whether the residuals are spread equally along the range of the predictors, i.e., the assumption of equal variance (homoscedasticity). The top-left figure shows this too, but this one shows a possible funnel pattern more clearly. This example shows a line with equally (randomly) spread points, so homoscedasticity holds. The bottom-right figure shows Residuals vs. Leverage. We discuss it in more detail in the next section.
The coef part of the table gives the y-intercept, the slope, and the corresponding statistical analysis. For the slope:
This confidence interval does not contain 0, which confirms, yet again, that the slope is not 0 and the linear model is valid.
The code output also shows an analogous statistical analysis for the y-intercept, H0 : β0 = 0, which is rarely used (unless we look for direct proportionality). It is the output of the linear model when the input is zero, but x = 0 is often impractical. For example, in our WT vs. WAIST example, WAIST = 0 is not practical, and the resulting b0 = −44.08 corresponds to a negative weight, which does not make sense either. We will see some other examples in future sections where b0 has a practical meaning.
The leverage of the i-th observation measures how far its predictor value is from the mean:

hi = 1/n + (xi − x̄)²/s²x

A high-leverage point is not necessarily influential; influence is measured by Cook's distance, which estimates the magnitude of the effect of deleting the i-th observation from the model. Cook's distance is defined in the more general context of multiple regression with p explanatory variables (predictors); in our case p = 1.
Let:
ŷj = the predicted mean response of observation j for the model fitted with all n observations;
ŷj^(−i) = the same prediction from the model fitted without the i-th observation;
se = the residual standard error. Then Cook's distance is:

Di = Σⱼ₌₁ⁿ ( ŷj − ŷj^(−i) )² / [ (p + 1) se² ]    (8.10)
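In practice these quantities do not need to be computed by hand; statsmodels exposes them through the fitted model's influence measures (an added illustration):

[ ]: influence = linear_model.get_influence()
leverage = influence.hat_matrix_diag          # leverages h_i
cooks_d, cooks_p = influence.cooks_distance   # Cook's distances and p-values
print('max leverage = {:.3f}, max Cook distance = {:.3f}'
      .format(leverage.max(), cooks_d.max()))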
The Residuals vs. Leverage diagnostic plot, with Cook's distances shown as dashed level curves at levels 0.5, 1, etc., illustrates the interplay between leverage and influence. The Figure below shows the effect of adding one point (big square) to the original data on the scatterplot and on the Residuals vs. Leverage plot. Highly influential points cross Cook's level curves, as shown in the bottom-right plot below.
In the top plots of the above Figure, the additional point has low leverage since its predictor value is close to x̄. Its inclusion only slightly modifies the y-intercept of the original line, not the slope, so it has very low influence, as can be seen in the top-right graph where none of the points cross the Cook's distance level 0.5. In the middle plots, the additional point is far from the mean of the original observations, so it has high leverage. However, it conforms very well to the regression line fitted without it. Its inclusion changes the regression line very slightly, so the influence is still low, and it does not cross the Cook's distance level 0.5. Lastly, in the bottom plots, the additional point is very far away from the original observations and has high leverage. This time, however, its inclusion significantly alters the regression line by turning it clockwise, so it has very high influence, and it crosses the Cook's distance level curves 0.5 and 1.
[ ]: <matplotlib.legend.Legend at 0x7e0530f2cdf0>
The total vertical deviation of the data point (2, 15) from the overall response mean is yi − ȳ = 15 − 10 = 5. It can be broken into the sum of two deviations: the deviation of the fitted value from the mean (ŷi − ȳ) and the residual (yi − ŷi). As always, deviations are both positive and negative and would cancel each other out in a sum, so squared deviations are considered. The total sum of squares can be broken into the explained (regression) and residual sums of squares, SST = SSR + SSE, as summarized in the ANOVA table below:
[ ]: sm.stats.anova_lm(linear_model)
For simple regression with only one explanatory variable, the F-test gives F = t² (i.e., 143.12 ≈ 11.96²) and the p-value is the same as for the slope hypothesis t-test. Thus, this confirms, yet again, the validity of the linear model. This is not true for multiple regression, where the F-test provides overall significance and each variable has its own t-test. Moreover, the averaged residual sum of squares (the mean squared error) MSE = (1/df) Σᵢ₌₁ⁿ (yi − ŷi)² in the ANOVA table can be used to calculate the

Residual standard error = √MSE = √[ (1/df) Σᵢ₌₁ⁿ (yi − ŷi)² ]    (8.11)

For WT vs. WAIST, the residual standard error is √149.30 = 12.22.
When correlation was introduced, the coefficient of determination r² was defined, which measures the proportion of the variation in the response variable that is explained by the linear model. More precisely:

r² = SSR/SST = 21360.1/(21360.1 + 5671) = 0.79019    (8.12)

which is the ratio of the explained variation in the ANOVA to the total variation. It is exactly the r² that we obtained before.
SE = √( 12.22² + 12.22²/40 + 0.198² · (81 − 91.28)² ) = 12.54
Both the individual prediction interval and the mean prediction confidence interval are computed with a single command:
[ ]: xs = 81   # WAIST value at which to predict
predictions = linear_model.get_prediction(pd.DataFrame({'WAIST':[xs]}))
predictions.summary_frame(alpha=0.05)
obs_ci_upper
0 173.520245
In fact, we can compute prediction intervals for several values at once as shown
in the code below:
obs_ci_upper
0 171.215614
1 173.520245
2 175.831061
3 178.148110
4 180.471431
5 182.801063
6 185.137038
7 187.479384
8 189.828124
9 192.183278
10 194.544859
11 196.912876
12 199.287332
13 201.668227
14 204.055555
15 206.449304
The Figure below shows the 95% confidence intervals for the mean prediction around the regression line and for the individual prediction (much wider). As the point moves further away from the mean x̄, the confidence intervals get wider and wider.
[ ]: alpha = 0.05
predictions = linear_model.get_prediction(mydata).summary_frame(alpha)
print("Predictions based on the model: \n\n:")
print(predictions.head())
plt.fill_between(mydata['WAIST'], predictions['obs_ci_lower'],
predictions['obs_ci_upper'], alpha=.1, label='Prediction interval')
plt.fill_between(mydata['WAIST'], predictions['mean_ci_lower'],
predictions['mean_ci_upper'], alpha=.5, label='Confidence interval')
plt.scatter(mydata['WAIST'], mydata['WT'], label='Observed',
marker='x', color='black')
plt.plot(mydata['WAIST'], predictions['mean'], label='Regression line')
plt.xlabel('WAIST'); plt.ylabel('WT'); plt.legend()
plt.show()
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower \
0 134.379140 3.729878 126.828396 141.929884 108.520470
1 135.328369 3.662232 127.914568 142.742170 109.509354
2 135.565676 3.645394 128.185962 142.945390 109.756429
3 138.650669 3.429421 131.708169 145.593169 112.963019
4 141.261048 3.251448 134.678836 147.843260 115.668421
obs_ci_upper
0 160.237810
1 161.147383
2 161.374923
3 164.338319
4 166.853676
Linear Model Examples
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/LongevSmoke.csv"
mydata = pd.read_csv(url) # save as mydata file
r = pearsonr(mydata['Packs'],mydata['Longevity']);
print("\nPearson correlation: ")
print(r,'\n')
Packs Longevity
0 0 80
1 0 70
2 0 75
3 0 77
4 1 72
5 1 70
6 1 69
7 2 68
8 2 65
9 2 70
Pearson correlation:
PearsonRResult(statistic=-0.8882477180477968, pvalue=4.323033187879979e-06)
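The model-fitting cell itself is not shown in the extracted text; a minimal sketch of how it could look with the statsmodels formula interface, consistent with the output that follows (the exact call is an assumption):
[ ]: import statsmodels.formula.api as smf
linear_model = smf.ols('Longevity ~ Packs', data=mydata).fit()
print(linear_model.summary())
print('\nHigher accuracy printout of p-values: \n', linear_model.pvalues, '\n')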
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
p-values:
Intercept 5.735547e-18
Packs 4.323033e-06
dtype: float64
/usr/local/lib/python3.10/dist-packages/scipy/stats/_stats_py.py:1806:
UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=16
warnings.warn("kurtosistest only valid for n>=20 ... continuing "
[ ]: plt.figure(figsize=(7,5))
import lmdiag
lmdiag.plot(linear_model);
Based on the diagnostic plots and the data with regression line Figure above, we conclude the following. The assumptions of regression: a linear trend (no pattern of residuals), normality of residuals (QQ plot), constant variability (no funnel pattern), and independent observations (no time series, Durbin-Watson is close to 2) are all satisfied. Residuals vs. Leverage does not have any points over the Cook's distance level curves. Thus, the linear model is applicable.
Slope b1 = −4.04; therefore, for each extra pack smoked per day, on average, the longevity decreases by 4.04 years. The hypothesis test for the slope:
H0: β1 = 0 - zero slope, no linear relationship
H1: β1 ≠ 0 - non-zero slope, there is a linear relationship
The same t-test t = −7.24 and p-value = 4.3230332 · 10⁻⁶ < 0.05 lead to the rejection of H0; so there is a significant linear relationship.
The 95% confidence interval for the slope (−5.24, −2.84) doesn't contain 0, which is yet another indication of the non-zero slope.
The y-intercept b0 = 75.39 has practical meaning, unlike most other problems.
It is the expected value of longevity for a non-smoker (packs smoked per day
equal to 0).
Thus, it has been shown in several ways that the linear model ŷ = 75.39 − 4.04 · x, or \widehat{Longevity} = 75.39 − 4.04 · Packs, is appropriate for these data, so it can be used for prediction. The code below gives the predicted fit as well as the individual prediction confidence intervals and the tighter mean confidence intervals.
[ ]: xs =np.linspace(0, 7, num=8)
predictions = linear_model.get_prediction(pd.DataFrame({'Packs':xs}))
df = pd.DataFrame(predictions.summary_frame(alpha=0.05))
df.insert(loc=0, column='xs', value=xs); df
obs_ci_upper
0 82.905376
1 78.597405
2 74.482056
3 70.565239
4 66.837262
5 63.276258
6 59.854679
7 56.545135
We can also compute a residual for a specific value slightly outside of the data set (reasonable extrapolation) as well as the typical residual error (Residual standard error se).
[ ]: xs = 5; yexact = 60; b = linear_model.params; ys = b[0] + b[1]*xs;
print('xs = {:.3f}, ys = {:.3f}, yexact = {:.3f}, residual = {:.3f}'.
format(xs,ys,yexact,yexact-ys))
se = np.sqrt(linear_model.scale);
print('Residual standard error = se = {:.3f}'.format(se))
Example
Consider correlation and regression for the WT vs. CHOL level in the MHEALTH file. If the linear model is good, predict several typical values.
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/MHEALTH.csv"
mydata = pd.read_csv(url) # save as mydata file
sns.regplot(data=mydata,x='CHOL',y='WT')
r = pearsonr(mydata['CHOL'],mydata['WT']);
print("\nPearson correlation:")
print(r,'\n')
Pearson correlation:
PearsonRResult(statistic=-0.025909074093860555, pvalue=0.8739101443884549)
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
p-values:
Intercept 9.702267e-25
CHOL 8.739101e-01
dtype: float64
[ ]: plt.figure(figsize=(7,5))
import lmdiag
lmdiag.plot(linear_model);
Most importantly, there is NO linear trend, as shown in the data with regression line Figure above. Also, based on the diagnostic plots above, the normality of residuals (QQ plot) is approximately satisfied. The constant variability seems to fail; there is a funnel pattern. Independent observations (no time series, although Durbin-Watson is a bit below 2, indicating some autocorrelation) seem to be satisfied. The Residuals vs. Leverage plot also shows no extreme behavior. With no linear trend, there is no reason to continue; however, we pursue the analysis to see how each test fails in detail.
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/HELPrct.csv"
mydata = pd.read_csv(url) # save as mydata file
# print(mydata.head(10))
sns.regplot(data=mydata,x='mcs',y='cesd')
r = pearsonr(mydata['mcs'],mydata['cesd']);
print("\nPearson correlation:")
print(r,'\n')
Pearson correlation:
PearsonRResult(statistic=-0.6819239139610712, pvalue=3.0189415357078893e-63)
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
p-values:
Intercept 7.491255e-176
mcs 3.018942e-63
dtype: float64
[ ]: plt.figure(figsize=(8,6))
import lmdiag
lmdiag.plot(linear_model);
Based on the diagnostic plots and the data with regression line Figure above, the following conclusions can be made. The assumptions of regression: linear trend (no pattern of residuals), normality of residuals (QQ plot), constant variability (no funnel pattern), and independent observations (no time series) are all satisfied. In the Residuals vs. Leverage plot, none of the points go over the Cook's distance level curves (they are not even visible); so none of the observations are too influential. Thus, the linear model is applicable.
Slope b1 = −0.665, so for each extra point of mcs score, the depression score cesd decreases by 0.665 on average. The hypothesis test for the slope:
H0: β1 = 0 - zero slope, no linear relationship
H1: β1 ≠ 0 - non-zero slope, there is a linear relationship.
The same t-test t = −19.8 and p-value ≈ 0 < 0.05 lead to the rejection of H0, i.e., there is a significant linear relationship.
The 95% confidence interval for the slope (−0.73, −0.6) doesn't contain 0, which is yet another indication of the non-zero slope.
Having confirmed in several ways that the linear model ŷ = 53.90 − 0.665 · x is appropriate, let's predict with it:
obs_ci_upper
0 65.340466
1 58.653597
2 51.990796
3 45.352127
4 38.737559
5 32.146963
6 25.580118
The output above shows both tighter mean predictions and wider individual
predictions.
We can also compute a residual for a specific value slightly outside of the data set (reasonable extrapolation). The typical residual error using this linear model is given by the Residual standard error se computed below.
Example
Investigate the correlation and regression of a particular stock return vs. the
S&P500 stock average. Predict several typical values.
r = pearsonr(mydata['ReturnsABC'],mydata['Returns500']);
print("\nPearson correlation:")
print(r,'\n')
ReturnsABC Returns500
0 0.11 0.20
1 0.06 0.18
2 -0.08 -0.14
3 0.12 0.18
4 0.07 0.13
5 0.08 0.12
6 -0.10 -0.20
7 0.09 0.14
8 0.06 0.13
9 -0.08 -0.17
Pearson correlation:
PearsonRResult(statistic=0.9535123202590954, pvalue=9.475962314785722e-10)
/usr/local/lib/python3.10/dist-packages/scipy/stats/_stats_py.py:1806:
UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=18
warnings.warn("kurtosistest only valid for n>=20 ... continuing "
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
p-values:
Intercept 3.552459e-01
Returns500 9.475962e-10
dtype: float64
[ ]: plt.figure(figsize=(7,5))
import lmdiag
lmdiag.plot(linear_model);
Based on the diagnostic plots and the data with regression line Figure above, the following conclusions can be made. The assumptions of regression: linear trend in the data (no pattern of residuals), normality of residuals (QQ plot), constant variability (no funnel pattern), and independent observations (no time series) are satisfied. In the Residuals vs. Leverage plot, none of the observations cross over the Cook's distance level curves. Thus, we can apply the linear model to this data.
Slope b1 = 0.59, so for each extra point of return of the S&P500, the returns of stock ABC increase by 0.59 on average.
The hypothesis test for the slope:
H0: β1 = 0 - zero slope, no linear relationship
H1: β1 ≠ 0 - non-zero slope, there is a linear relationship
The t-test is the same, t = 12.66 and p-value ≈ 0 < 0.05, so H0 can be rejected, i.e., there is a significant linear relationship.
obs_ci_upper
0 -0.117955
1 -0.063351
2 -0.007363
3 0.050228
4 0.109538
5 0.170535
6 0.233051
In the code below, we investigate a residual for a particular value. Also, the typical residual error is given by the Residual standard error se.
[ ]: xs = 0.3; yexact = 0.15; b = linear_model.params; ys = b[0] + b[1]*xs;
print('xs = {:.3f}, ys = {:.3f}, yexact = {:.3f}, residual = {:.3f}'.
format(xs,ys,yexact,yexact-ys))
se = np.sqrt(linear_model.scale);
print('Residual standard error = se = {:.3f}'.format(se))
For the standardized stock returns in this problem, the slope of the linear model b1 = 0.59 gives the beta of the stock, which measures the volatility of the stock relative to the stock market as a whole (represented by the S&P500). Thus, this ABC stock is about 60% as volatile as the market.
[ ]: year x y
0 1975 0 4500
1 1978 3 29000
2 1982 7 90000
3 1985 10 229000
4 1989 14 1200000
[ ]: sns.regplot(data=mydata,x='x',y='y')
r = pearsonr(mydata['x'],mydata['y']);
print("Pearson correlation : ")
print(r)
print(linear_model.summary())
print('\nHigher accuracy printout of p-values: \n',linear_model.pvalues,'\n')
Pearson correlation :
PearsonRResult(statistic=0.8562985042360467, pvalue=0.01389401025685245)
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.733
Model: OLS Adj. R-squared: 0.680
Method: Least Squares F-statistic: 13.74
Date: Fri, 19 Apr 2024 Prob (F-statistic): 0.0139
Time: 20:24:39 Log-Likelihood: -106.69
No. Observations: 7 AIC: 217.4
Df Residuals: 5 BIC: 217.3
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1.024e+06 8.05e+05 -1.272 0.259 -3.09e+06 1.05e+06
x 2.406e+05 6.49e+04 3.707 0.014 7.38e+04 4.07e+05
==============================================================================
Omnibus: nan Durbin-Watson: 0.872
Prob(Omnibus): nan Jarque-Bera (JB): 0.583
Skew: 0.409 Prob(JB): 0.747
Kurtosis: 1.846 Cond. No. 22.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
p-values:
Intercept 0.259373
x 0.013894
dtype: float64
The linear fit is clearly very poor as shown in the Figure above; still, it is statistically significant with the slope's p-value below 0.05. Therefore, a scatterplot must always be examined.
r = pearsonr(mydata['x'],mydata['logy']);
print("Pearson correlation = ", r)
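The logy column used above must be created from the original y by a log transform; a minimal sketch of the transformation and the linearized fit (variable names and the smf call are assumptions consistent with the surrounding cells):
[ ]: import numpy as np
import statsmodels.formula.api as smf
mydata['logy'] = np.log(mydata['y'])   # linearize y = a*exp(b*x) into log(y) = log(a) + b*x
linear_model = smf.ols('logy ~ x', data=mydata).fit()
print(linear_model.summary())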
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
p-values:
Intercept 1.744644e-07
x 6.795843e-06
dtype: float64
[ ]: a = np.exp(linear_model.params.values[[0]]);
b = linear_model.params.values[[1]];
# fit function
def f(x, a, b):
return a*np.exp(b*x)
Since the fitted model is y = a · e^{bx}, over any interval of length h the response is multiplied by e^{bh}; setting this factor equal to 2 gives the doubling time:

e^{bh} = 2 \implies bh = \ln(2) \implies h = \frac{\ln(2)}{b}

Therefore, the doubling time is given by:
[ ]: doubling_time = np.log(2)/b;
print(doubling_time)
[2.02195535]
Another common nonlinear model that can be linearized by taking logarithms is the power function:

y = a \cdot x^{b} \qquad (8.18)
mydata = pd.DataFrame({'Planet': ['Mercury','Venus','Earth','Mars','Jupiter','Saturn'],
                       'x': [0.39, 0.72, 1.00, 1.52, 5.20, 9.54],
                       'y': [0.24, 0.62, 1.00, 1.89, 11.86, 29.46]})
[ ]: Planet x y
0 Mercury 0.39 0.24
1 Venus 0.72 0.62
2 Earth 1.00 1.00
3 Mars 1.52 1.89
4 Jupiter 5.20 11.86
5 Saturn 9.54 29.46
[ ]: sns.regplot(data=mydata,x='x',y='y')
r = pearsonr(mydata['x'],mydata['y']);
print("Pearson correlation: ")
print(r,'\n')
Pearson correlation:
PearsonRResult(statistic=0.9924524861041891, pvalue=8.523247707493205e-05)
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.985
Model: OLS Adj. R-squared: 0.981
Method: Least Squares F-statistic: 262.0
Date: Fri, 19 Apr 2024 Prob (F-statistic): 8.52e-05
Time: 20:26:42 Log-Likelihood: -10.091
No. Observations: 6 AIC: 24.18
Df Residuals: 4 BIC: 23.77
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -2.2213 0.886 -2.508 0.066 -4.681 0.238
x 3.1790 0.196 16.186 0.000 2.634 3.724
==============================================================================
Omnibus: nan Durbin-Watson: 1.846
Prob(Omnibus): nan Jarque-Bera (JB): 0.722
Skew: -0.803 Prob(JB): 0.697
Kurtosis: 2.442 Cond. No. 6.29
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
p-values:
Intercept 0.066209
x 0.000085
dtype: float64
The linear fit is very poor again (see Figure above) but still statistically significant with a p-value well below 0.05. As always, the scatterplot must be investigated.
[ ]: mydata['logx'] = np.log(mydata['x'])
mydata['logy'] = np.log(mydata['y'])
sns.regplot(data=mydata,x='logx',y='logy')
r = pearsonr(mydata['logx'],mydata['logy']);
print("Pearson correlation: ")
print(r,'\n')
Pearson correlation:
PearsonRResult(statistic=0.9999855135331572, pvalue=3.147850623314139e-10)
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
p-values:
Intercept 8.697703e-01
logx 3.147851e-10
dtype: float64
[ ]: a = np.exp(linear_model.params.values[[0]]);
b = linear_model.params.values[[1]];
# fit function
def f(x, a, b):
return a*x**b
The power function fit is very good as shown in the Figure above. It gives

\text{period } (y) = (\text{distance } (x))^{1.5}

which is exactly Kepler's law.
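A quick numerical check of the exponent; a minimal sketch, assuming linear_model still holds the log-log fit above:
[ ]: b = linear_model.params.values[1]   # slope of the log-log fit is the power-law exponent
print('Estimated exponent: {:.3f} (Kepler: 1.5)'.format(b))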
The logistic model is

y = \frac{L}{1 + e^{a + bx}} \qquad (8.20)

As x increases, y initially increases exponentially and is concave up, but later tapers off into a concave-down curve converging to L, which is called the population carrying capacity (estimated graphically). To linearize this equation:

\frac{1}{y} = \frac{1 + e^{a+bx}}{L} \implies \frac{L}{y} - 1 = e^{a+bx} \implies \frac{L - y}{y} = e^{a+bx} \implies \qquad (8.21)

\ln\frac{L - y}{y} = a + bx \qquad (8.22)

Therefore, \ln\frac{L-y}{y} is linear in x; the corresponding transformation is applied in a code cell further below.
L = 3000; a = 3; b=-0.1
x = np.arange(0,100,5);
n = len(x)
y = L/(1+np.exp(a+b*x)) + 3*np.random.normal(0,1,n)
mydata = pd.DataFrame({'x':x,'y':y})
pd.set_option("display.precision", 4);
print(mydata.head(),'\n')
x y
0 0 146.7065
1 5 227.8411
2 10 360.8243
3 15 554.7068
4 20 810.6091
[2]: sns.regplot(data=mydata,x='x',y='y')
r = pearsonr(mydata['x'],mydata['y']);
Pearson correlation =
PearsonRResult(statistic=0.9389555271027002, pvalue=8.947154877473107e-10)
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.882
Model: OLS Adj. R-squared: 0.875
Method: Least Squares F-statistic: 134.1
Date: Sun, 21 Apr 2024 Prob (F-statistic): 8.95e-10
Time: 19:39:32 Log-Likelihood: -146.29
No. Observations: 20 AIC: 296.6
Df Residuals: 18 BIC: 298.6
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 379.9340 165.107 2.301 0.034 33.057 726.811
x 34.4061 2.971 11.579 0.000 28.163 40.649
==============================================================================
Omnibus: 3.951 Durbin-Watson: 0.121
Prob(Omnibus): 0.139 Jarque-Bera (JB): 1.395
Skew: 0.057 Prob(JB): 0.498
Kurtosis: 1.711 Cond. No. 107.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
p-values:
Intercept 3.3551e-02
x 8.9472e-10
dtype: float64
As before, the linear fit is very poor but statistically significant with a p-value well below 0.05; only the scatterplot above clarifies this. The logistic function fit is very good, as shown below.
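The linearized response ln((L − y)/y) used below can be created and fit as follows; a minimal sketch (L = 3000 is the carrying capacity set in the simulation above, and the smf call is an assumption consistent with the output that follows):
[ ]: import numpy as np
import statsmodels.formula.api as smf
mydata['logLy'] = np.log((L - mydata['y'])/mydata['y'])   # ln((L-y)/y) = a + b*x
linear_model = smf.ols('logLy ~ x', data=mydata).fit()
print(linear_model.summary())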
r = pearsonr(mydata['x'],mydata['logLy']);
print("Pearson correlation = ")
print(r,'\n')
Pearson correlation =
PearsonRResult(statistic=-0.9930846432128577, pvalue=3.3500306084557878e-18)
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.0991 0.160 19.384 0.000 2.763 3.435
x -0.1033 0.003 -35.888 0.000 -0.109 -0.097
==============================================================================
Omnibus: 35.230 Durbin-Watson: 1.679
Prob(Omnibus): 0.000 Jarque-Bera (JB): 94.154
Skew: -2.738 Prob(JB): 3.59e-21
Kurtosis: 12.110 Cond. No. 107.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
p-values:
Intercept 1.6513e-13
x 3.3500e-18
dtype: float64
[ ]: a = linear_model.params.values[[0]];
b = linear_model.params.values[[1]];
# fit function
def f(x, a, b, L):
return L/(1+np.exp(a+b*x))
The Figures above show a very good fit, both in the transformed (linearized) and the original coordinates.
Example
Investigate the linear regression model of cesd by sex in the HELPrct data file.
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/HELPrct.csv"
mydata = pd.read_csv(url) # save as mydata file
# print(mydata.head(10))
sns.boxplot(data=mydata,x='sex',y='cesd')
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
p-values:
Intercept 1.1363e-113
sex[T.male] 1.2000e-04
dtype: float64
[ ]: plt.figure(figsize=(10,7))
import lmdiag
lmdiag.plot(linear_model);
The Figures above show a boxplot rather than a scatterplot because the predictor is categorical. There are only two levels of the predictor variable sex, so the residual graph looks quite different, but at least the variability is about the same for both groups. The QQ-plot does not indicate a violation of normality. It is not a time series, and the Residuals vs. Leverage plot also shows no extreme behavior. Thus, the model can be applied to this data. Note that the correlation between the numerical variable cesd and the categorical variable sex cannot be considered.
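The grouped means printed below can be obtained with a simple groupby; a minimal sketch (the exact cell is not shown in the extracted text):
[ ]: print('Grouped means: ')
print(mydata.groupby('sex')['cesd'].mean())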
Grouped means:
sex
female 36.8879
male 31.5983
Name: cesd, dtype: float64
Let's run the t-test for the comparison of the means with equal variance
assumption:
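A minimal sketch of such a test with scipy (the group order male, female is chosen so that the sign of the statistic matches the regression slope):
[ ]: from scipy import stats
male = mydata.loc[mydata['sex'] == 'male', 'cesd']
female = mydata.loc[mydata['sex'] == 'female', 'cesd']
t, p = stats.ttest_ind(male, female, equal_var=True)   # pooled-variance t-test
print('t = {:.2f}, p-value = {:.5f}'.format(t, p))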
The t-test t = −3.88 and p-value = 0.00012 < 0.05 are the same. The difference in sample means x̄_male − x̄_female = 31.6 − 36.89 = −5.29 is the slope of the regression line b1. Therefore, linear regression with a binary categorical predictor is equivalent to a t-test with the assumption of equal variance (regression assumes constant variance).
Having confirmed in several ways that the linear model ŷ = 36.89 − 5.29 · x is a good fit for the data, we use it for prediction. There are only two values of the explanatory variable to predict - female/male; so it just gives the average values of the cesd score by gender (sex).
[9]: xs =['female','male']
predictions = linear_model.get_prediction(pd.DataFrame({'sex':xs}))
df = pd.DataFrame(predictions.summary_frame(alpha=0.05))
df.insert(loc=0, column='xs', value=xs);
pd.set_option("display.precision", 2); print(df,'\n')
obs_ci_upper
0 61.22
1 55.85
The typical residual error using this linear model is given by Residual standard
error se computed below.
[11]: se = np.sqrt(linear_model.scale);
print('Residual standard error = se = {:.4f}'.format(se))
Example
Consider the linear regression model for cesd by substance in the HELPrct data file.
url="https://raw.githubusercontent.com/leonkag/Statistics0/main/HELPrct.csv"
mydata = pd.read_csv(url) # save as mydata file
# print(mydata.head(10))
sns.boxplot(data=mydata,x='substance',y='cesd')
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
p-values:
Intercept 2.75e-139
substance[T.cocaine] 3.04e-04
substance[T.heroin] 7.30e-01
dtype: float64
[ ]: plt.figure(figsize=(10,7))
import lmdiag
lmdiag.plot(linear_model);
The Figures above show boxplots of the three substances and the diagnostic plots of the model. The substance variable has three levels: alcohol, cocaine, and heroin, so two indicator variables are needed. Alcohol is the base level alphabetically. The cocaine and heroin differences are given by two indicator variables:

\hat{y} = b_0 + b_{1c} \cdot \text{substance[cocaine]} + b_{1h} \cdot \text{substance[heroin]}

For example, if we consider an alcoholic, then the indicators for cocaine and heroin are both zero, substance[cocaine] = substance[heroin] = 0, so ŷ = b0 = 34.37, the mean cesd score for the alcohol group. The depression score means grouped by substance produce exactly the same results:
Grouped means:
substance
alcohol 34.37
cocaine 29.42
heroin 34.87
Name: cesd, dtype: float64
Therefore, the slopes b1c = −4.95 and b1h = 0.5 give the mean differences relative to the reference level of alcohol. The same problem is more naturally considered using the Analysis of Variance (ANOVA) in a separate chapter. In addition, the linear model output has a separate hypothesis test for each slope.
For cocaine vs. alcohol:
H0: β1c = 0 - zero slope for cocaine
H1: β1c ≠ 0 - non-zero slope
The t-test t = −3.64 and p-value = 0.0003 < 0.05, so H0 is rejected and there is a significant slope (difference) in average cesd scores for cocaine vs. alcohol.
For heroin vs. alcohol:
H0: β1h = 0 - zero slope for heroin
H1: β1h ≠ 0 - non-zero slope
The t-test t = 0.35 and p-value = 0.73 > 0.05, so H0 is not rejected and there is not a significant slope (difference) in average cesd scores for heroin vs. alcohol.
Index

first quartile, 22
fluctuations, 1
frequency, 14
funnel pattern, 252
general addition rule, 38
goodness of fit, 143
HELPrct, 2
histogram, 14
hypothesis testing, 112
independent, 42
independent variable, 250
inferential statistics, 9, 99
influence, 258
interquartile range, 23
joint probabilities, 47
least squares regression line, 249
left tail, 74
left-skewed shape, 15
levels, 4
leverage, 258
linear model, 248
linear trend, 238
mean, 11
median, 22
mode, 17
multimodal, 17
multiplication rule, 42
mutually exclusive, 37
negative predictive value, 53
nominal, 4
non-parametric test, 195
non-response rate, 8
normal distribution, 67
null hypothesis, 113
numpy, 11
observational study, 7
observed counts, 143
one-sample means t-test, 184
one-sample means test, 171
one-sided means t-test, 190
one-sided test, 119
ordinal, 4
outlier, 23
overestimate, 250
paired (dependent), 210
pandas library, 2
parameter, 8
partial correlation, 246
Pearson correlation coefficient, 238
percentile, 24
permutations, 87
pie chart, 28
placebo effects, 9
point estimate, 13, 99
polls, 1
pooled proportion, 134
pooled variance, 221
population, 7
population mean, 13
population variance, 18
positive predictive value, 52
prevalence, 55
probability, 35
quantile, 72
quantitative variables, 2
random processes, 35
randomized block design, 9
randomized experimental design, 8
regression ANOVA, 261
regression categorical predictor, 296
residual, 249
residual standard error, 263
response variable, 5, 250
retrospective studies, 7
sampling error, 99
scatterplot, 4, 237
seaborn, 14
sensitivity, 53
shape, 15
simple random sample, 8
Spearman correlation, 245
specificity, 53
standard deviation, 18
standard error, 100, 173
standard normal distribution, 68
statistic, 8
statistically significant, 1
statistics, 1
symmetric, 16
t-distribution, 179
third quartile, 22
time series, 253
trade-off relationship, 109
treatment group, 1
tree diagrams, 57
trimmed mean, 12
true negative, 51
true positive, 51
two proportions test, 134
two-sided, 113
two-way contingency table, 157
type 1 error, 126
type 2 error, 127
Z-score, 69