Statistical Data Analysis
F. Bijma
M.C.M. de Gunst
Department of Mathematics
Faculty of Sciences
VU University Amsterdam
These lecture notes are based on the lecture notes Statistische Data Analyse (in Dutch)
by M.C.M. de Gunst and A.W. van der Vaart.
Contents
1 Introduction 1
2 Summarizing data 3
2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Summarizing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Summarizing univariate data . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Data transformation . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Summarizing bivariate data . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Summarizing multivariate data . . . . . . . . . . . . . . . . . . . . 16
3 Exploring distributions 17
3.1 The quantile function and location-scale families . . . . . . . . . . . . . . . 18
3.2 QQ-plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Symplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Two-sample QQ-plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 Goodness of fit tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5.1 Shapiro-Wilk Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5.2 Kolmogorov-Smirnov test . . . . . . . . . . . . . . . . . . . . . . . 32
3.5.3 Chi-Square tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 The bootstrap 37
4.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Bootstrap estimators for a distribution . . . . . . . . . . . . . . . . . . . . 39
4.3 Bootstrap confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 Bootstrap tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 Limitations of the bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 Robust Estimators 53
5.1 Robust estimators for location . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.1.1 Trimmed means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.1.2 M -Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6 Nonparametric methods 70
6.1 The one-sample problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.1.1 The sign test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.1.2 The signed rank test . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 Asymptotic efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3 Two-sample problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.3.1 The median test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.3.2 The Wilcoxon two-sample test . . . . . . . . . . . . . . . . . . . . . 82
6.3.3 The Kolmogorov-Smirnov test . . . . . . . . . . . . . . . . . . . . . 85
6.3.4 Permutation tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.3.5 Power and asymptotic efficiency . . . . . . . . . . . . . . . . . . . . 87
6.4 Tests for correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.1 The rank correlation test of Spearman . . . . . . . . . . . . . . . . 89
6.4.2 The rank correlation test of Kendall . . . . . . . . . . . . . . . . . . 90
6.4.3 Permutation tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Chapter 1
Introduction
Statistics is the science of collecting, analyzing and interpreting data. In the ideal situation
a statistician is involved in all stages of a study: in formulation of the question of interest,
in determining the experimental procedure, the data collection, the data analysis, and
in the interpretation and presentation of the results of the analysis. It will be clear
that a statistician not only needs to know the statistical theory, but he/she also should
be able to translate a practical problem into statistical terms, to give advice about the
experimental design, to judge the quality of the data, to choose a proper statistical model
and appropriate analysis tools, to perform the statistical analysis, and to translate the
results of the analysis back to the practical problem for which the data were collected.
Unfortunately, the help of a statistician is frequently sought only after the data have been collected. In
these situations the data are often not optimal, which makes the statistical analysis more
difficult. But also in such cases the statistician needs to be aware of what the practical
problem is and what this means in statistical terms.
In these lecture notes, the first stages of a study—the formulation of the problem,
experimental design, and the data collection—are not considered. The aim is to give
practical insight into the analysis of the data. Therefore, it is assumed that the data are
already there, and that it is known why they are there and how they were obtained.
The first thing that needs to be done for a data analysis is to get an impression of
the data and to summarize them in an appropriate way. This is discussed in Chapter
2. Next, when one wants to choose a model, it is useful to investigate the underlying
distribution of the data. This is treated in Chapter 3. Even after a thorough preliminary
investigation of the data, it is often the case that some model assumptions that are needed
for a particular analysis method are not fulfilled. In that case one could use a so-called
robust method, introduced in Chapter 5, which is relatively insensitive to small deviations
from the assumptions. Another way to prevent making unrealistic model assumptions is
to assume as little as possible and use one of the nonparametric methods. These are the
topic of Chapter 6. In the other chapters some frequently used techniques are discussed,
like the bootstrap (Chapter 4), multivariate linear regression (Chapter 8), and methods
for the analysis of categorical data (Chapter 7).
For all methods that are presented, not only is the theory important, but also when and
how to use them in practice. These lecture notes contain many examples that illustrate
these aspects. In addition, students should gain experience in application of the methods
by means of exercises in which (simple) data sets will be analyzed with the aid of the
computer.
Chapter 2
Summarizing data
The title of this chapter consists of the words ‘summarizing’ and ‘data’. The title
of the course also contains the word ‘data’. The first part of this chapter explains what is
meant by the term ‘data’ and what types of data can occur. When the data come in
larger quantities, the first step of their analysis is to make an appropriate summary. In
the second part of the chapter guidelines for making a good data summary and several
techniques for obtaining summaries are discussed.
2.1 Data
The term ‘data’ is abundantly used in scientific research. Data are the quantified results
of a study. A collection of data generally consists of characteristics of individuals or
experimental units.
Example 2.1 For every subscriber of a certain magazine the following character-
istics are given: the sex of the subscriber, the highest completed educational level
(primary school, secondary school, or higher education), and the number of months
since the subscription started.
To quantify these measurements we could use the following code: male = 0, female
= 1; primary school = 0, secondary school = 1, higher education = 2. Then the
measurement (1,0,23) represents a female subscriber whose highest educational level
is primary school and who subscribed 23 months ago.
***
The data set in the example above contains three different types of data. These different
types of data are measured on different measurement scales. The sex is an example of a
measurement on a so-called nominal scale or nominal level. With measurements on this
scale the individuals are divided into categories. The different categories are identified
by a different code (a number, letter, word, or name). The structure of a nominal scale
is not changed by a 1-1 transformation: the coding ‘male = 0’, ‘female = 1’ would have
been essentially the same as ‘male = M’, ‘female = F’. Note that there is no ordering for
this kind of data. Nominal scales are qualitative scales. Arithmetic operations cannot
be performed on data measured on a qualitative scale. For data measured on a nominal
scale the mode can be used as a measure of location (see next section). However, location
concepts like median or mean, as well as measures of spread, have no meaning.
The highest educational level is an example of a measurement on an ordinal scale. The
categories are not only identified, but they can also be ordered. The distances between
the categories have no meaning. We cannot say that the difference between the primary
and secondary school education is the same as the difference between the secondary and
higher school education, or that higher education is three times as good as primary school:
there is just only an ordering of the categories. A frequently used ordering is that of a
5-point scale: the patient indicates that she is feeling much worse, worse, as good, better,
much better; a test panel judges the taste of chocolate of brand A much less tasty, less
tasty, as tasty, more tasty, much more tasty than that of brand B, and so on. Ordinal
scales are also qualitative scales: there is only an ordering without measurable distances.
For measurements on an ordinal scale, the median and the mode can be useful, but the
mean and measures of spread have no meaning.
The number of months that a person is a subscriber is an example of a measurement
on a quantitative scale. The measurements have more meaning than just falling into a
category or indicating an ordering. Someone who subscribed 26 months ago is a subscriber
for twice as long as someone who subscribed 13 months ago. For quantitative data intervals
are meaningful, and on these data arithmetic operations can be performed. The location
measures mean, mode and median can all be used. Moreover, for quantitative data
measures of spread can be used too. A quantitative scale for which intervals are meaningful
but ratios are not is called an interval scale; a quantitative scale for which both intervals and
ratios are meaningful is a ratio scale. An interval scale has an arbitrary zero point, a ratio
scale has a unique, true zero point.
The characteristics or properties of individuals or experimental units will be indicated
by the term variable. Apart from the above mentioned partition of variables according to
their measurement level, there are several other partitions:
- discrete and continuous variables. A discrete variable can only take a finite (or
countable) number of values; continuous variables take values in a continuum, i.e. a
“full” part of the real line. Nominal and ordinal variables are discrete by definition.
- univariate, bivariate, and multivariate variables. For this partition the dimension
of the variable is the determining factor: on one subject one, two or more variables
are measured. In Example 2.1 the measurement (1,0,23) is a three-dimensional or
trivariate measurement.
- independent and dependent variables, sometimes called structural variables and vari-
ables to be measured. Dependent variables are the object of the study, whereas the
independent or structural variables are quantities that may have an influence on the
variables under study.
Example 2.3 When one is interested in the effect of the composition of
a fibre on the strength of the fibre, one could measure the strength of the fibre
for different compositions of the fibre. The strength of the fibre is the dependent
variable, the composition of the fibre is the independent variable.
***
Example 2.4 In classical statistics it is often assumed that the data are a sample
from a normal distribution. If this assumption would be exactly correct, then an
appropriate summary would consist of the sample mean and the sample variance
only. This is because then these two quantities are in the technical (statistical) sense
sufficient. In practice exact normality is of course almost never the case, but it is
often sufficient when normality holds only approximately. However, at the start
of an analysis, it is usually not known whether the assumption of (approximate)
normality is appropriate. Therefore in summarizing a data set one should not let
oneself be guided by this assumption. Only after (approximate) normality has been
established, a summary consisting of sample mean and sample variance suffices.
***
5 | 2
5 | 6
6 | 0000004444444
6 | 8888
7 | 222222224
7 | 6666
8 | 004
8 | 888
9 | 2
From the figure the idea behind the plot will be clear: the 2 and the 6 behind the 5s
indicate that both 52 and 56 occur once in the data set; 6 zeros behind the 6 means that
60 occurs six times in the data set. The numbers 5, 5, 6, 6, 7, etc. in the first column are
the stems of the plot; the sequences of numbers behind them in the second column are the
leaves. Stems without leaves (no occurrence in the data) may also occur in a stem-and-leaf
plot. In most statistical packages one can choose the number of stems and the place of
the |. The stem-and-leaf plot in the figure is an example of a so-called split stem-and-leaf
plot, because the natural stems, the tens, are split into two. With stem-and-leaf plots
little information about the data gets lost. Stem-and-leaf plots give an impression about
the shape of the data distribution while retaining most of the numerical information.
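Such a plot is also easy to produce with a few lines of code. The following is a minimal sketch in Python (not tied to any particular statistical package); the data vector is simply read off the stem-and-leaf plot above.

data = ([52, 56] + [60] * 6 + [64] * 7 + [68] * 4 + [72] * 8 + [74]
        + [76] * 4 + [80, 80, 84] + [88] * 3 + [92])

def split_stem_and_leaf(xs):
    # Stems are the tens; each stem is split into a low half (leaves 0-4)
    # and a high half (leaves 5-9). Empty halves are simply not printed here.
    lines = {}
    for x in sorted(xs):
        stem, leaf = divmod(x, 10)
        half = 0 if leaf < 5 else 1
        lines.setdefault((stem, half), []).append(str(leaf))
    for (stem, half), leaves in sorted(lines.items()):
        print(f"{stem} | {''.join(leaves)}")

split_stem_and_leaf(data)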
A frequently used graphical summary of univariate data is the histogram. A histogram
is a graph in which a scale distribution for the measured variable is given along the
horizontal axis. Above each interval or bin that is induced by this scaling distribution, a
bar is drawn such that the area of the bar is proportional to the frequency of the data in
that bin. When the widths of the bins are all equal, then the heights of the bars are also
proportional to the frequency of the data in that bin. When the bin widths are not equal,
this does not hold. Obviously, it can be very misleading to display bins of unequal width
as if they all had the same width; it can result in a completely wrong impression of the spread of the data.
The choice of the bin sizes is somewhat arbitrary. Different histograms of the same data
set can give a slightly different impression. Too few or too many intervals always give a
bad result. Figure 2.2 shows two histograms of the pulsation data of Example 2.5. The
bin sizes of the histogram on the left were chosen to be all equal to 10, for the histogram
on the right the bin sizes were chosen to be 5. Although for the histogram on the right
less data information got lost, the histogram on the left gives a better impression of the
global spread of the data. Note that several of the bars in the histogram on the right only
represent 1 data point. A histogram gives an idea of the distribution of the data. When
the histogram is scaled such that the total area of all bars is equal to 1, it can be used
as a rough estimate of the probability density function of the distribution from which the
data originate. Although with histograms generally more information about the data gets
lost than with stem-and-leaf plots, histograms are used much more often.
Figure 2.2: Histograms of the pulsation data of Example 2.5 with bin width 10 (left) and bin width 5 (right).
While a histogram can be used as an estimate of the probability density function, the
empirical (cumulative) distribution function gives an impression of the (cumulative) dis-
tribution function of the underlying distribution. For a sample {x1 , . . . , xn } the empirical
distribution function is defined as
\hat{F}_n(x) = \frac{1}{n}\, \#(x_j \le x) = \frac{1}{n} \sum_{j=1}^{n} 1_{\{x_j \le x\}}.
(Here the indicator 1{xj ≤x} equals 0 or 1 depending on whether xj > x or xj ≤ x).
Denote the ordered set of sample values, arranged in order from smallest to largest, by
(x(1) , . . . , x(n) ); x(i) is called i-th order statistic of the sample. If x < x(1) , Fn (x) = 0, if
x(1) ≤ x < x(2) , Fn (x) = 1/n, if x(2) ≤ x < x(3) , Fn (x) = 2/n, and so on. If there is
a single observation with value x, then Fn has a jump of height 1/n at x; if there are k
observations with the same value x, then Fn has a jump of height k/n at x. The empirical
distribution function is the sample analogue of the distribution function F of a random
variable X: F (x) gives the probability that X ≤ x, Fn (x) gives the proportion of the
sample that is less than or equal to x. In Figure 2.3 the empirical distribution function
of the pulsation data of Example 2.5 is depicted.
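The empirical distribution function is easily computed and drawn as a step function. Below is a minimal sketch in Python (numpy and matplotlib assumed); the sample x is artificial and only stands in for real data such as the pulsation data.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(loc=72, scale=10, size=40)   # artificial sample

xs = np.sort(x)
n = len(xs)

def ecdf(t):
    # fraction of observations less than or equal to t
    return np.searchsorted(xs, t, side="right") / n

print(ecdf(xs.mean()))

# Right-continuous step function with a jump of 1/n at every order statistic.
plt.step(xs, np.arange(1, n + 1) / n, where="post")
plt.xlabel("x")
plt.ylabel("empirical distribution function")
plt.show()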
Figure 2.3: Empirical distribution function of the pulsation data of Example 2.5.
Besides graphical summaries, numerical summaries can be given: with one or more
numbers an impression is given of the whole data set. For a sample x1 , . . . , xn one can,
for instance, give the sample size n, the sample mean x̄ and the sample variance s2 .
The latter two quantities give an impression of the location and scale, respectively, of
the distribution from which the data were generated. In particular, the sample mean
corresponds to the population mean or, in statistical terms, to the expectation of the
underlying distribution. Similarly, the sample variance corresponds to the population
variance or to the variance of the underlying distribution. There are, however, also other
quantities that give information about location and scale of the distribution. Table 2.1
gives an overview of the most commonly used quantities in this context.
The (sample) quartiles mentioned in Table 2.1 are roughly those values of the sample
such that one quarter, and three quarters, respectively, of the sample are smaller than these values.
Table 2.1: The most commonly used numerical summaries of a univariate sample x1 , . . . , xn .

sample size:      n
location:         mean                      \bar{x} = n^{-1} \sum_{i=1}^{n} x_i
                  median                    \mathrm{med}(x) = x_{((n+1)/2)} if n is odd;  \tfrac{1}{2}\big(x_{(n/2)} + x_{(n/2+1)}\big) if n is even
scale:            variance                  s^2 = (n-1)^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
                  standard deviation        s = \sqrt{s^2}
                  coefficient of variation  cv = s/\bar{x}
                  range                     (x_{(1)}, x_{(n)})
                  quartiles                 \mathrm{quart}(x), 3\mathrm{quart}(x)
                  interquartile range       3\mathrm{quart}(x) - \mathrm{quart}(x)
skewness:         skewness                  b_1 = \sqrt{n} \sum_{j=1}^{n} (x_j - \bar{x})^3 \big/ \big\{ \sum_{j=1}^{n} (x_j - \bar{x})^2 \big\}^{3/2}
size of tails:    kurtosis                  b_2 = n \sum_{j=1}^{n} (x_j - \bar{x})^4 \big/ \big\{ \sum_{j=1}^{n} (x_j - \bar{x})^2 \big\}^{2}
Table 2.2: Expectation µ, variance σ 2 , skewness β1 and kurtosis β2 of some frequently occurring distributions.

Bernoulli Bern(p):            µ = p,          σ^2 = p(1-p),                     β1 = (1-2p)/\sqrt{p(1-p)},            β2 = 3 + (1-6p(1-p))/(p(1-p))
Binomial Bin(n, p):           µ = np,         σ^2 = np(1-p),                    β1 = (1-2p)/\sqrt{np(1-p)},           β2 = 3 + (1-6p(1-p))/(np(1-p))
Hypergeometric Hyp(m, r, N):  µ = mr/N,       σ^2 = mr(N-r)(N-m)/(N^2(N-1)),    β1 = (N-2r)(N-2m)(N-1)^{1/2}/\big((N-2)(mr(N-r)(N-m))^{1/2}\big),
                              β2 = 3 + \big[(N-1)N^2\{N(N+1)-6r(N-r)-6m(N-m)\} + 6mr(N-r)(N-m)(5N-6)\big]/\big[mr(N-r)(N-m)(N-2)(N-3)\big]
Poisson P(µ):                 µ = µ,          σ^2 = µ,                          β1 = µ^{-1/2},                        β2 = 3 + µ^{-1}
Normal N(µ, σ 2 ):            µ = µ,          σ^2 = σ^2,                        β1 = 0,                               β2 = 3
Uniform Unif(a, b):           µ = (a+b)/2,    σ^2 = (b-a)^2/12,                 β1 = 0,                               β2 = 9/5
t-distribution t_ν:           µ = 0,          σ^2 = ν/(ν-2)  (ν > 2),           β1 = 0  (ν > 3),                      β2 = 3 + 6/(ν-4)  (ν > 4)
Chi-square χ^2_ν:             µ = ν,          σ^2 = 2ν,                         β1 = 2(2/ν)^{1/2},                    β2 = 3 + 12/ν
Exponential exp(λ):           µ = λ^{-1},     σ^2 = λ^{-2},                     β1 = 2,                               β2 = 9
The sample quantities b1 and b2 given in Table 2.1 correspond with the population quantities
\beta_1 = \frac{\mathrm{E}(X - \mathrm{E}X)^3}{\{\mathrm{E}(X - \mathrm{E}X)^2\}^{3/2}}
and
\beta_2 = \frac{\mathrm{E}(X - \mathrm{E}X)^4}{\{\mathrm{E}(X - \mathrm{E}X)^2\}^{2}}.
Both quantities are location and scale invariant. In Table 2.2 the values of β1 and β2
are given for some frequently occurring distributions. We remark that for symmetric
distributions β1 = 0 .
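The sample skewness b1 and kurtosis b2 of Table 2.1 can be computed directly from their definitions, as in the Python sketch below. Note that these ‘raw’ versions may differ slightly from the bias-corrected variants that some statistical packages return by default.

import numpy as np

def sample_skewness(x):
    # b1 = sqrt(n) * sum (x_j - mean)^3 / ( sum (x_j - mean)^2 )^(3/2)
    d = np.asarray(x, dtype=float) - np.mean(x)
    return np.sqrt(len(d)) * np.sum(d**3) / np.sum(d**2) ** 1.5

def sample_kurtosis(x):
    # b2 = n * sum (x_j - mean)^4 / ( sum (x_j - mean)^2 )^2
    d = np.asarray(x, dtype=float) - np.mean(x)
    return len(d) * np.sum(d**4) / np.sum(d**2) ** 2

rng = np.random.default_rng(0)
z = rng.normal(size=1000)
print(sample_skewness(z))   # close to beta1 = 0 for normal data
print(sample_kurtosis(z))   # close to beta2 = 3 for normal data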
Finally, we mention a combination of a graphical and a numerical summary: the
boxplot. A scale distribution for the measured variable is given along the vertical axis,
next to which a rectangle, ‘the box’, is drawn. The top and the bottom of the box are
situated at the third and the first quartile, respectively. The width of the box may vary.
A horizontal line in the box gives the position of the sample median. From the middle
of the top and bottom of the box the so-called whiskers extend. They connect the box
to the most extreme data points that lie at most 1.5 times the interquartile range from
the edges of the box. More extreme data points are considered as extreme values and are
depicted by separate symbols like ◦ or ∗. The factor 1.5 that determines the maximum
length of the whiskers is sometimes replaced by another value. Figure 2.4 shows boxplots
for the data of Example 2.5 with factors 1.5 and 1.
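The numerical ingredients of a boxplot are easily computed as well. The following Python sketch implements the 1.5 x interquartile-range rule; np.quantile is used for the quartiles and may follow a slightly different convention than a given statistical package.

import numpy as np

def boxplot_summary(x, factor=1.5):
    x = np.sort(np.asarray(x, dtype=float))
    q1, med, q3 = np.quantile(x, [0.25, 0.5, 0.75])
    iqr = q3 - q1
    # whiskers run to the most extreme observations within factor * IQR of the box
    lower = x[x >= q1 - factor * iqr].min()
    upper = x[x <= q3 + factor * iqr].max()
    extremes = x[(x < lower) | (x > upper)]
    return {"quartiles": (q1, med, q3), "whiskers": (lower, upper),
            "extreme values": extremes}

rng = np.random.default_rng(2)
print(boxplot_summary(rng.normal(size=50)))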
Figure 2.4: Boxplots for the pulsation data of Example 2.5 with factors 1.5 (left) and 1 (right).
2.2.2 Data transformation
Example 2.6 The picture on the left in Figure 2.5 shows a histogram of the num-
bers of insects counted on a specific type of bushes in the dunes. This is a skewed
distribution. Application of the square-root-transformation on each of the data
points yields a data set that looks more symmetric; see the histogram in Figure 2.5
on the right. By application of a simple transformation a more symmetric picture
is obtained. Moreover, in this picture two ‘peaks’ can be seen that are less visible in
the picture on the left. The meaning of these peaks should be further investigated;
for instance, it may be the case that the number of insects depends on the location
of the bush, or that the inspected bushes are located in two very different areas.
***
Figure 2.5: Histogram of numbers of insects on bushes (left) and histogram of the square root of the
same numbers (right).
2.2.3 Summarizing bivariate data
Table 2.3: 3 × 3 contingency table of blood group against disease of 8766 persons.
Bivariate data can be summarized in a cross table, also called contingency table. To construct such a table the possible values of the
first and second variable are partitioned into k and r disjoint categories or intervals,
respectively. Then a table of k rows and r columns is made, such that the frequency of
the number of subjects with their value of the first variable in the i-th category and their
value of the second variable in the j-th category is presented in the cell on the i-th row
and in the j-th column.
Example 2.7 In a study on the relationship between blood group and two different
diseases a sample of 8766 people consisting of 2697 patients with a stomach ulcer
or stomach cancer and a control group of 6087 people without these diseases was
divided based on blood group. The result is given in Table 2.3. There seems to be
a certain relation between the variables blood group and disease. In Chapter 7 this
table will be further investigated.
***
Whereas in a scatter plot the individual data values can still be recognized, in a contin-
gency table this information may get lost (when the categories consist of more than one
value). As is illustrated in the example above, the advantage of contingency tables is that
they can be used not only to summarize data that are measured on a quantitative scale,
but also to summarize data that are measured on a nominal or ordinal scale. Contingency
tables will be discussed further in Chapter 7.
The numerical quantities that are most frequently used to summarize bivariate data
are given in Table 2.4. Of the quantities mentioned in the table the correlation coefficients
need some explanation. The sample correlation coefficient rxy is a measure of the strength
of the linear relationship between x and y. It can take values from −1 to 1. A rxy -value
close to −1 means that there is a strong negative linear relation between x and y. Equality
to −1 or 1 means that the relationship is exactly linear.
Table 2.4: Numerical summaries of bivariate data (x1 , y1 ), . . . , (xn , yn ). The vectors (r1 , . . . , rn ) and
(t1 , . . . , tn ) are the ranks of (x1 , . . . , xn ) and (y1 , . . . , yn ), respectively, in the ordered samples x and y.
The quantity Nτ is the number of pairs (i, j) with i < j for which either xi < xj and yi < yj , or xi > xj
and yi > yj . The sign function sgn(x) is defined as: 1 if x > 0, 0 if x = 0 and −1 if x < 0.
The rank correlation coefficients, however, are only measures of the rank correlation
between x and y, i.e. measures for the interdependence between the ranks of the x- and
y-values in the ordered samples. Values of both rank correlation coefficients may range
from −1 to 1. When the rank correlation coefficient has a large positive value, then a
larger value of x in the sample will generally correspond with a larger value of y, whereas
when the rank correlation coefficient has a negative value, then larger values of x will
generally correspond with smaller values of y. To obtain Kendall’s τ all xi are compared
with all yj ; for Spearman’s rs the xi are only compared with the corresponding yi . Note
that the correlation coefficient rxy can only be used for measurements on a quantitative
scale, whereas the rank correlation coefficients can also be used for measurements on an
ordinal scale. Besides the three quantities that are mentioned here, there are many other
measures for dependence between variables. The correlation measures will reappear in
Chapter 6.
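In Python the three correlation measures are available in scipy.stats. The sketch below, with x and y standing for an arbitrary bivariate sample, merely illustrates how they are obtained in practice.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(scale=0.8, size=30)    # sample with positive dependence

r_xy, _ = stats.pearsonr(x, y)       # sample correlation coefficient
r_s, _ = stats.spearmanr(x, y)       # Spearman's rank correlation coefficient
tau, _ = stats.kendalltau(x, y)      # Kendall's tau
print(r_xy, r_s, tau)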
Chapter 3
Exploring distributions
In this chapter we discuss several methods to investigate whether data stem from a cer-
tain probability distribution. The question whether a data set comes from a ‘known’
distribution is important for several reasons:
- Fitting a known distribution to the data is an efficient way of summarizing the data.
For example, it would be very convenient if the statement ‘the data follow a normal
distribution with mean µ and variance σ 2 ’ would suffice as a summary of the data.
- It is desirable, if possible, to find a model that somehow ‘explains’ the data. In-
vestigating whether the data come from a certain distribution is not only helpful
for testing an existing model, it can also give an indication which kind of model we
need to look for.
- After having conducted an explorative data analysis, one often wants to apply more
formal statistical methods, like statistical tests. Such methods are usually based on
certain assumptions about the underlying probability mechanism. These assump-
tions should be plausible, and therefore it is of great importance to test them, at
least visually, on the data. For instance, you can think of checking the normality of
the measurement errors in a regression analysis before applying a t- or F -test.
Unless stated otherwise, we assume that x1 , . . . , xn are realizations from independent ran-
dom variables X1 , . . . , Xn which all have the same (unknown) probability distribution
with distribution function F . We often shortly speak of ‘the distribution F ’, and say
that x1 , . . . , xn are independent realizations from F . The goal is to get to know more
about F . The independence and identical distribution are not tested! First, we discuss
graphical methods to investigate whether univariate data stem from a certain probabil-
ity distribution, whether the underlying distribution F is symmetric, and whether two
samples originate from the same distribution. Some important techniques for obtaining a
first impression of the shape of a distribution, like drawing histograms or box plots, have
already been treated in the foregoing chapter, and will not be discussed further. Next, we
discuss several tests for verifying whether univariate data stem from a certain probability
distribution.
3.1 The quantile function and location-scale families
Figure 3.1: A distribution function F with the quantiles F −1 (a) and F −1 (b) corresponding to the levels a and b.
When the random variable X has distribution function F , the distribution function
of a + bX is Fa,b , given by
F_{a,b}(x) = F\Big(\frac{x - a}{b}\Big).
The family of distributions {Fa,b : a ∈ IR, b > 0} is called the location-scale family
corresponding to F . When X has expectation E X = 0 and variance Var(X) = 1, then
the expectation and variance of the distribution Fa,b are given by a and b2 . It is not
difficult to verify the following relation between the quantile functions:
F_{a,b}^{-1}(\alpha) = a + b\, F^{-1}(\alpha).
In other words: the points \{(F^{-1}(\alpha), F_{a,b}^{-1}(\alpha)) : \alpha \in (0, 1)\} are on the straight line y =
a + bx. Figure 3.2 illustrates that two normal distributions belong to the same location-
scale family.
Figure 3.2: The quantiles of the N(2,16)-distribution plotted against the quantiles of the N(0,1)-distribution.
3.2 QQ-plots
Let x1 , . . . , xn be independent realizations from a distribution F . Then approximately
a fraction i/(n + 1) of the data will be below the i-th order statistic x(i) . Hence, x(i)
approximately equals the i/(n + 1)-quantile of the data. Therefore, the points
\Big\{ \big( F^{-1}\big(\tfrac{i}{n+1}\big),\; x_{(i)} \big) : i = 1, \ldots, n \Big\}
are expected to lie approximately on a straight line.1 A QQ-plot is a graph of these n
points. The Qs in the name are abbreviations of the word ‘quantile’. The choice of i/(n+1)
is in fact rather arbitrary. One can also plot the points \{(F^{-1}(\tfrac{i - \varepsilon_1}{n + \varepsilon_2}),\; x_{(i)}) : i = 1, \ldots, n\} for
fixed (small) values of ε1 and ε2 . In particular in the case that F is a normal distribution,
the points
\Big\{ \big( F^{-1}\big(\tfrac{i - 3/8}{n + 2/8}\big),\; x_{(i)} \big) : i = 1, \ldots, n \Big\}
follow a straight line slightly better than the points defined above.
1 A more formal argument for this reasoning is the well known result E F (X(i) ) = i/(n + 1) for
measurements from a continuous distribution function F . Therefore, one can expect E X(i) ≈ F −1 (i/(n + 1)).
The reasoning would be stronger if the last relation would hold with exact equality. However, this is
almost never the case. For the normal distribution function Φ one has for example E X(i) < Φ−1 (i/(n + 1))
if i < (n + 1)/2, equality if i = (n + 1)/2, and the reverse inequality if i > (n + 1)/2.
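Constructing a normal QQ-plot thus amounts to plotting the order statistics against the quantiles Φ−1 ((i − 3/8)/(n + 2/8)). A minimal sketch in Python (numpy, scipy and matplotlib assumed):

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.normal(loc=2, scale=4, size=50)          # a sample from N(2, 16)

n = len(x)
i = np.arange(1, n + 1)
theor = stats.norm.ppf((i - 3/8) / (n + 2/8))    # quantiles of N(0, 1)
plt.plot(theor, np.sort(x), "o")                 # points of the QQ-plot
plt.xlabel("quantiles N(0,1)")
plt.ylabel("order statistics")
plt.show()                                       # roughly the line y = 2 + 4x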
Example 3.1 Samples consisting of 50 independent measurements from the N (2, 16)-
distribution were simulated by using a random number generator. In Figure 3.3 the
QQ-plots of these samples are shown.
***
Figure 3.3: Three QQ-plots of samples consisting of 50 data points from N(2,16) against quantiles of
N(0,1).
A QQ-plot yields a method to judge whether the data come from a certain distribution
by only looking at the plot. When the plot yields approximately the line y = x, this is
an indication that the data come from the distribution F . Deviations from the line
y = x indicate differences between the true distribution of the data and F . The kind of
deviation from y = x suggests the kind of difference between the true distribution and
F . The simplest case of such a deviation is when the QQ-plot is a straight line but not
the line y = x, as in Figure 3.3. This is an indication that the data do not originate
from F , but come from another member of the location-scale family of F . Interpreting
a bent curve is more complicated. Such QQ-plots mainly yield information about the
relative thickness of the tails of the distribution of the data with respect to those of F .
To give an impression of possible deviations from straight lines in QQ-plots a number of
QQ-plots of ‘true’ quantile functions are plotted in Figure 3.4. In these plots the points
{(F −1 (α), G−1 (α)) : α ∈ (0, 1)} are drawn for different distribution functions F and G.
For the assessment of a QQ-plot one just looks whether the points are more or less
on a straight line. This illustrates the way QQ-plots are judged: informally, using prior
experience.
Figure 3.4: Plots of pairs of ‘true’ quantile functions: uniform-normal, logistic-normal, lognormal-
normal, exponential-χ24 .
Example 3.2 The table below gives the diameters (in inches) of a sample of 31 cherry trees,
measured at 4 foot and 6 inches above the ground, and their volumes (in cubic feet).
diameter 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3
volume 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 24.2
diameter 11.4 11.4 11.7 12.0 12.9 12.9 13.3 13.7 13.8 14.0
volume 21.0 21.4 21.3 19.1 22.2 33.8 27.4 25.7 24.9 34.5
diameter 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0 20.6
volume 31.7 36.3 38.3 42.6 55.4 55.7 58.3 51.5 51.0 77.0
These data were collected from cut cherry trees in order to make predictions of the
timber yield (in volume) of a cherry tree from the easy to measure diameter of the
uncut tree. For this purpose a linear regression model was used with volume as
response variable (Yi ) and diameter as independent variable (xi ). The results are
shown in Figure 3.5. The figure on the left shows a scatter plot of the data, together
with the estimated regression line y = α̂ + β̂x.
Figure 3.5: Linear regression of volume on diameter of 31 cherry trees and the QQ-plot of the residuals
of the regression analysis against the normal distribution.
Under the assumption that the measurement errors are independent and normally distributed, a
confidence interval for the slope β is given by
\hat{\beta} \pm t_{n-2,\alpha_0}\, \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}},
where s^2 = (n - 2)^{-1} \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2 is the common estimate of the variance of
the measurement errors and tn−2,α0 is the right α0 -point of the t-distribution with
n − 2 degrees of freedom. For the given data set the 90% confidence interval for the
slope is equal to (3.257, 4.098).
Under the normality assumption it is also possible to give simultaneous confi-
dence curves for the entire regression line. The formula for these curves is
y = \hat{\alpha} + \hat{\beta} x \pm \sqrt{2 F_{2,n-2,1-\alpha_0}}\; s\, \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}.
In this expression F2,n−2,1−α0 is the (1 − α0 )-quantile (or the right α0 -point) of the
F -distribution with 2 and n − 2 degrees of freedom. In the graph on the right in
Figure 3.5 the 90% confidence curves are drawn. When the usual assumptions about
the linear regression model are valid, including the normality assumption, then the
true regression line lies between the two curves for each x with a confidence of 90%.
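The computations behind these confidence statements require only the least squares estimates and a t- and an F-quantile. The Python sketch below carries them out for the cherry tree data listed above; treat its numerical output as illustrative only, since the details of the original analysis are not fully reproduced here.

import numpy as np
from scipy import stats

# Cherry tree data as listed above (diameter in inches, volume in cubic feet).
d = np.array([8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0, 11.0, 11.1, 11.2, 11.3,
              11.4, 11.4, 11.7, 12.0, 12.9, 12.9, 13.3, 13.7, 13.8, 14.0,
              14.2, 14.5, 16.0, 16.3, 17.3, 17.5, 17.9, 18.0, 18.0, 20.6])
v = np.array([10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9, 24.2,
              21.0, 21.4, 21.3, 19.1, 22.2, 33.8, 27.4, 25.7, 24.9, 34.5,
              31.7, 36.3, 38.3, 42.6, 55.4, 55.7, 58.3, 51.5, 51.0, 77.0])

n = len(d)
sxx = np.sum((d - d.mean()) ** 2)
beta = np.sum((d - d.mean()) * (v - v.mean())) / sxx     # least squares slope
alpha = v.mean() - beta * d.mean()                       # least squares intercept
s2 = np.sum((v - alpha - beta * d) ** 2) / (n - 2)       # error variance estimate

# 90% confidence interval for the slope
t = stats.t.ppf(0.95, df=n - 2)
half = t * np.sqrt(s2 / sxx)
print("slope:", beta, "90% CI:", (beta - half, beta + half))

# simultaneous 90% confidence curves for the regression line
f = stats.f.ppf(0.90, dfn=2, dfd=n - 2)
xs = np.linspace(d.min(), d.max(), 100)
width = np.sqrt(2 * f) * np.sqrt(s2) * np.sqrt(1 / n + (xs - d.mean()) ** 2 / sxx)
band = (alpha + beta * xs - width, alpha + beta * xs + width)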
What is the value of these confidence regions? In order to answer this question,
first of all we should inspect the method of data collection. (Unfortunately, nothing
is known about the collection of the cherry tree data.) Next, we should investigate
the different assumptions that were made to construct the confidence regions. To
judge whether normality of the measurement errors is a plausible assumption, a
QQ-plot of the residuals Yi − α̂ − β̂xi (i.e., the vertical distances from the data
points to the regression curve) was made. It is shown in the middle figure of Figure
3.5. When the measurement errors are independent and normally distributed, the
residuals will be normally distributed as well. Based on the QQ-plot there is no
reason to worry about this assumption.
We note that the residuals in a linear regression analysis are not independent
and identically distributed. However, these are not strict conditions for the QQ-plot
technique to be valid. A necessary condition is that the data have approximately
identical marginal distributions and that the so-called quantile function of the data
is a good estimator of the ‘true’ marginal quantile function. This condition is fulfilled
in the case of the cherry tree data.
We will come back to this data set later.
***
With respect to the use of QQ-plots a word of caution is in place. One should be
careful not to draw too strong conclusions based on QQ-plots. The notion of ‘straight
line’ is rather subjective and in the case of less than 30 data points it is very difficult to
distinguish two distributions anyway.
Example 3.3 The normal and logistic distributions are very difficult to distinguish
when only a few measurements are available. Figure 3.6 shows QQ-plots of a sample
of 20 data points from a normal distribution against the normal, the logistic and
the Cauchy distribution.
***
3.3 Symplots
A random variable X is symmetrically distributed around θ if X − θ and θ − X follow the
same distribution. If X has a continuous distribution, then X is symmetrically distributed
around θ if its probability density is symmetric around θ. A symmetric probability dis-
tribution looks simpler than an asymmetric distribution. Moreover, the notion of ‘center
of the distribution’ makes more sense in the case of a symmetric distribution. Therefore,
one sometimes tries to find an adequate transformation of the data to obtain symmetry.
To judge whether or not a sample originates from a symmetric distribution, a his-
togram or a stem-and-leaf plot can be used. Naturally, the skewness parameter gives
information about symmetry too, although in spite of its name one should not overesti-
mate its usefulness: a value of zero for the skewness parameter does not necessarily mean
Figure 3.6: QQ-plots of two samples consisting of 20 independent realizations from the N(0,1) distri-
bution against the quantiles of N(0,1) (left), logistic (middle) and Cauchy (right) distribution. The first
(second) row shows the three QQ-plots of the first (second) sample.
Figure 3.7: The probability density of a symmetric distribution.
that the distribution looks symmetric. Also other parameters can give an indication about
the amount of symmetry of a distribution. For instance, a large difference between mean
and median indicates a skewed distribution.
Skewness can also be assessed with the quantile function. From Figure 3.7 it can easily
be derived that the quantile function of a distribution F that is symmetric around θ satisfies
F^{-1}(1 - \alpha) - \theta = \theta - F^{-1}(\alpha), \qquad 0 < \alpha < 1.
This equation holds for each symmetric distribution F . It means that for a symmetric
distribution the points {(θ − F −1 (α), F −1 (1 − α) − θ) : α ∈ (0, 1)} lie on a straight line.
For data x1 , . . . , xn from a symmetric distribution we expect that the points {(med(x) −
x(i) , x(n−i+1) − med(x)) : i = 1, . . . , [n/2]} also lie approximately on a straight line. A plot of these points
is called a symmetry plot or, briefly, a symplot.
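A symplot is easily drawn directly from this definition. A minimal sketch in Python (numpy and matplotlib assumed), applied to a skewed artificial sample:

import numpy as np
import matplotlib.pyplot as plt

def symplot(x):
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    med = np.median(x)
    k = n // 2                         # [n/2] points
    below = med - x[:k]                # med(x) - x_(i)
    above = x[::-1][:k] - med          # x_(n-i+1) - med(x)
    plt.plot(below, above, "o")
    lim = max(below.max(), above.max())
    plt.plot([0, lim], [0, lim], ":")  # the line y = x
    plt.xlabel("Distance Below Median")
    plt.ylabel("Distance Above Median")
    plt.show()

rng = np.random.default_rng(5)
symplot(rng.exponential(size=100))     # skewed sample: points drift away from y = x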
Example 3.4 The phenomenon of Raynaud is the coloring in three phases (white-
purple/blue-red) of the fingers and/or toes when somebody is exposed to cold.
Below the β-thromboglobulin values of three groups of patients are given:
- 32 patients without organ impediments (PRRP)
- 41 patients with scleroderma (SDRP)
- 23 patients with certain other auto-immune illnesses (CTRP).
PRRP: 22, 25, 27, 30.5, 32.5, 34, 41, 41, 43.5, 43.5, 44.5, 44.5, 44.5, 48.5, 50.5, 53,
55.5, 58.5, 58.5, 63.5, 68.5, 68.5, 68.5, 73.5, 80, 89.5, 92, 101.5, 104, 119, 119,
124.5
SDRP: 18, 22, 23.5, 25.5, 27.5, 29.5, 29.5, 31.5, 31.5, 33, 35.5, 35.5, 36.5, 39, 43, 43,
43, 45, 46, 48.5, 49.5, 49.5, 52, 54, 56.5, 62, 65, 67.5, 69.5, 71.5, 72.5, 72.5, 76,
81.5, 95, 105, 106.5, 108.5, 130, 190, 218
CTRP: 20, 23.5, 27, 32, 41, 44, 51, 53, 55.5, 58.5, 62.5, 62.5, 65, 67, 69.5, 72, 80, 88.5,
91, 138, 146.5, 160.5, 219.
Figure 3.8 shows symplots and histograms of these three variables, which clearly
have a skewed distribution. A logarithmic transformation yields a more symmetric
picture (Figure 3.9), facilitating a comparison between the three data sets. The table
below contains the mean values of the three variables as well as the mean values of
the logarithms of the variables. Between brackets the standard deviations are given.
The difference between the mean of CTRP on the one hand and the means of PRRP
and SDRP on the other hand is striking, since this difference vanishes after taking
the logarithms. The explanation lies partly in the large standard deviations of the
data before the logarithmic transformation. Taking these large standard deviations
into account, the initial difference is not that large. Moreover, the distribution of
CTRP is rather skewed to the right. The large observations in the right tail may
be the reason for the high value of the mean. By the transformation the three
distributions become more comparable. The absence of an observable difference
in means after transformation, suggests that the β-thromboglubine values are not
Figure 3.8: Symplots and histograms of the variables PRRP, SDRP and CTRP. The dotted line in
the symplots is the line y = x.
Table 3.1: The mean of the variables PRRP, SDRP and CTRP (top row) and the mean of the logarithms
of these variables (bottom row). Standard deviations are given between brackets.
Figure 3.9: Symplots and histograms of the variables log(PRRP), log(SDRP) and log(CTRP). The
dotted line in the symplots is the line y = x.
3.4 Two-sample QQ-plots
A graphical method to investigate whether two samples x1 , . . . , xm and y1 , . . . , yn
originate from the same distribution is the two-sample QQ-plot, also called an empir-
ical QQ-plot. When the sample sizes are equal (m = n), then this is simply a plot
of the points {(x(i) , y(i) ) : i = 1, 2, . . . , n}. When m < n it is a plot of the points
\{(x_{(i)}, y^{*}_{(i)}) : i = 1, 2, \ldots, m\}, where
y^{*}_{(i)} = \tfrac{1}{2}\Big( y_{([i\frac{n+1}{m+1}])} + y_{([i\frac{n+1}{m+1} + \frac{m}{m+1}])} \Big).
The underlying idea is to match x(i) with y(j) for which i/(m + 1) ≈ j/(n + 1), that
is, j ≈ i(n + 1)/(m + 1). Assessing a two sample QQ-plot is similar to assessing a
standard QQ-plot: points approximately on a straight line indicate that the underlying
distributions of the two samples belong to the same location-scale family.
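The interpolation rule above can be implemented in a few lines. The following Python sketch (numpy assumed) returns the points of a two-sample QQ-plot for samples x and y with len(x) <= len(y); the helper name two_sample_qq is of course arbitrary.

import numpy as np

def two_sample_qq(x, y):
    """Points of a two-sample QQ-plot; assumes len(x) <= len(y)."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    m, n = len(x), len(y)
    if m == n:
        return x, y
    i = np.arange(1, m + 1)
    # match x_(i) with the y-order statistics near position i(n+1)/(m+1)
    j1 = np.floor(i * (n + 1) / (m + 1)).astype(int)
    j2 = np.floor(i * (n + 1) / (m + 1) + m / (m + 1)).astype(int)
    y_star = 0.5 * (y[j1 - 1] + y[j2 - 1])    # subtract 1: 0-based indexing
    return x, y_star

rng = np.random.default_rng(6)
a = rng.normal(size=31)
b = rng.normal(loc=1, scale=2, size=34)
qx, qy = two_sample_qq(a, b)     # points roughly on the line y = 1 + 2x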
Example 3.5 To compare two methods, A and B, for the production of textile,
in a textile factory the number of flaws in pieces of woven fabric were counted.
Samples of size 31 and 34 of pieces of fabric produced with method A and method
B, respectively, yielded the following data:
Figure 3.10: Two sample QQ-plot of the number of flaws in weaving in fabric produced by the two
methods.
3.5 Goodness of fit tests
A goodness of fit test is a test of the null hypothesis that the distribution F underlying the
data belongs to a given class of distributions F0 , against the alternative that it does not:
H0 : F ∈ F0 ,
H1 : F ∉ F0 .
Usually the null hypothesis consists of one distribution (F0 = {F0 }) or a small family of
distributions (e.g., F0 is a location-scale family), and, hence, the alternative hypothesis
is extensive. Unfortunately, it is not possible to find a test that has high statistical power
against all possible alternatives. Therefore, we will look for a test that has reasonably
high statistical power against a large number of alternatives. Such a test is called an
‘omnibus test’. When such a test does not reject the null hypothesis, this is considered
as an indication that the null hypothesis may be correct. This formulation is on purpose
rather vague: strong conclusions can only be drawn when the amount of data is large. For
example, distinguishing the logistic from the normal distribution using the tests discussed
below—with standard significance levels—requires around 1000 measurements, a number
that is often not available in practice. Nevertheless, goodness of fit tests are a meaningful
supplement to the graphical methods discussed earlier. The results of these tests can
either confirm the impression of a QQ-plot, or give an additional reason for being careful
with drawing conclusions.
As all statistical tests, goodness of fit tests only yield qualitative conclusions; the
conclusion of a test is about the presence of a certain effect and says little about the size
of the effect. In the case of goodness of fit tests, the null hypothesis may be rejected while
the true distribution is for all practical purposes very close to the null hypothesis. This
problem plays a bigger role as the sample size is larger: for very large data sets the power
of a test also becomes very high and as a result even the smallest deviation from the null
hypothesis will almost certainly be detected.
3.5.1 Shapiro-Wilk Test
The Shapiro-Wilk test for normality is based on the statistic
W = \frac{\big(\sum_{i=1}^{n} a_i X_{(i)}\big)^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
with coefficients a1 , . . . , an that are specified below. The idea behind the test is that under the null hypothesis of normality a linear
regression model holds with X(1) , . . . , X(n) as the values of the response variable, with the
expectations of the order statistics of a sample of size n from a standard normal distri-
bution as the corresponding values of the independent variable, and with the unknown
values of µ and σ as the intercept and slope coefficients, respectively. The numerator of
W turns out to be a constant times the square of the least squares estimator of σ, and
hence a constant times an estimator for σ 2 .
To see this, let us first introduce some notation. Consider a sample Z1 , . . . , Zn from
the standard normal distribution. Write c = (c1 , . . . , cn )′ = (E Z(1) , . . . , E Z(n) )′ and define
Σ to be the covariance matrix of (Z(1) , . . . , Z(n) )′ . Now assume that the null hypothesis
is true: X1 , . . . , Xn is a sample from a normal distribution with mean µ and variance σ 2 .
Since Xi has the same distribution as σZi + µ we have that
E X(i) = µ + σci , i = 1, . . . , n.
This means that for (Y1 , . . . , Yn ) = (X(1) , . . . , X(n) ) the linear regression model holds. The
common α and β are now given by µ and σ, while the usual xi are given by the ci . There is a
difference, however, between this linear regression problem and the usual linear regression
setting, because the observations (Y1 , . . . , Yn ) = (X(1) , . . . , X(n) ) are not uncorrelated in
this case. One can easily derive that the covariance matrix of (X(1) , . . . , X(n) ) is equal to
σ 2 Σ. In order to find the regression line a weighted least squares approach is common:
determine estimators µ̂ and σ̂ by minimizing the following expression for µ and σ
\|\Sigma^{-1/2}(Y - \mu 1 - \sigma c)\|^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} (\Sigma^{-1})_{i,j}\, (Y_i - \mu - \sigma c_i)(Y_j - \mu - \sigma c_j).
In this expression µ1 equals the vector with all entries equal to µ. The so obtained least
squares estimator σ̂ is a linear combination of the components of the observation vector
Y , namely σ̂ = c′ Σ−1 Y /c′ Σ−1 c.
The coefficients ai in the test statistic of the Shapiro-Wilk test are given by
a' = \frac{c'\Sigma^{-1}}{\sqrt{c'\Sigma^{-1}\Sigma^{-1}c}}.
Obviously, apart from a constant, the ai equal the coefficients in σ̂. This constant is
chosen such that \sum_{i=1}^{n} a_i^2 = 1. Furthermore, it holds that an−i+1 = −ai , which implies
\sum_{i=1}^{n} a_i = 0. Now it becomes clear that when the null hypothesis holds, also the numerator
of the statistic W equals a constant times an estimator for σ 2 .
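In practice the coefficients ai are not computed by hand: the Shapiro-Wilk statistic and an associated p-value are available in scipy.stats, which relies on an approximation to the exact coefficients. A minimal sketch in Python:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(loc=2, scale=4, size=50)

w, p = stats.shapiro(x)          # Shapiro-Wilk test for normality
print("W =", w, "p-value =", p)
# Small values of W (and hence small p-values) point to non-normality.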
Example 3.6 The Shapiro-Wilk test was applied to the residuals of Example 3.2.
The value of the test statistic is equal to w = 0.9789. This corresponds to a left
p-value of approximately 0.78, and, hence, the null hypothesis is not rejected. This
means that the residuals cannot be distinguished from a sample from a normal
distribution, which confirms the impression of the QQ-plot in Figure 3.5.
It should be noted, that in principle this test cannot be used for testing normality
of the residuals, because the residuals are dependent. However, for large sample sizes
and a reasonable choice of design points xi this dependence vanishes and the critical
value is at least approximately correct. A correct way of applying the Shapiro-
Wilk test in the present situation is to apply the test on transformed residuals for
which independence does hold, as follows. Under the linear regression model the
vector R of residuals has a multivariate normal distribution with covariance matrix
proportional to Σ = (I −X(X T X)−1 X T ) (where X = (1, (x1 , . . . , xn )T ), cf. Chapter
8). This matrix has rank n − 2. Since this matrix is independent of the parameters,
it is possible to find a vector of n − 2 independent components each with the same
normal distribution by a linear transformation of R. Computing the Shapiro-Wilk
test statistic for this new vector yields a value of w = 0.974. It follows that in this
case the corrected test yields only a marginally different result.
***
This statistic equals the amount of explained variance in the weighted regression problem
described above (and it also equals the weighted correlation coefficient between c and Y .)
Small values of W2 indicate that the points in the QQ-plot are not on a straight line, and,
therefore, indicate non-normality. Other possibilities for test statistics are the amount of
explained variance in an unweighted regression, or the expression above with ci replaced
by Φ−1 ((i − 3/8)/(n + 2/8)). The distribution of these statistics is more complicated.
Apart from the Shapiro-Wilk type tests, that relate to QQ-plots, there exist many other
tests for testing normality. In case one is specifically interested in deviations in skewness
or kurtosis (which equal 0 and 3, respectively, for all normal distributions), it is better to
directly base a test statistic on these statistical quantities.
3.5.2 Kolmogorov-Smirnov test
We now turn to testing the hypotheses
H0 : F = F0 ,
H1 : F ≠ F0 ,
for a specific, fully specified distribution F0 (that is, F0 = {F0 }). First we consider the
Kolmogorov-Smirnov test. In Chapter 2, the (sample) empirical distribution function of
an observed sample x1 , . . . , xn was introduced. Generally, a different sample will give a
different empirical distribution function, even if the two samples were drawn from the
same distribution. So it is natural to consider the corresponding stochastic version of
the (sample) empirical distribution function. Let x1 , . . . , xn be realizations from random
variables X1 , . . . , Xn that are independent and identically distributed. The empirical
distribution function of X1 , . . . , Xn is the stochastic function
\hat{F}_n(x) = \frac{1}{n} \sum_{j=1}^{n} 1_{\{X_j \le x\}}.
The indicator 1{Xj ≤x} equals 0 or 1 when Xj > x or Xj ≤ x, respectively. Hence, the
random variable nF̂n (x) equals the number #(Xj ≤ x). Note that the notation for the
stochastic empirical distribution function and the sample empirical distribution function
are the same: both F̂n . When F is the true distribution function of X1 , . . . , Xn , nF̂n (x)
is binomially distributed with parameters n and F (x). Therefore, F̂n (x) is an unbiased
estimator of F (x). Moreover, by the law of large numbers we have F̂n (x) → F (x). We
may conclude that F̂n is a reasonable estimator for F . For this conclusion, no assumptions
on F are needed.
A test for the null hypothesis that the true distribution F of X1 , . . . , Xn is equal to F0
can be based on a distance measure between F̂n and F0 . The null hypothesis is rejected
when the distance between these two functions is large. The Kolmogorov-Smirnov test is
based on the maximum vertical distance between F̂n and F0 : its test statistic is
Dn = sup F̂n (x) − F0 (x) .
−∞<x<∞
The null hypothesis is rejected for large values of Dn . However, the random variable Dn
does not have one of the well-known distributions. Its critical values can be found from
tables or from statistical software2 . One of the properties of the Kolmogorov-Smirnov
test that makes this test valuable, is that the distribution of Dn under the null hypothesis
that the true distribution equals F0 , is the same for each continuous distribution, and
hence does not depend on F0 . This is expressed as: Dn is distribution free over the
2 See for example the book Empirical Processes with Applications to Statistics by Shorack and Wellner.
Figure 3.11: Graph of the empirical distribution function (step function) of 15 observations from an
exponential distribution, together with its true distribution function (smooth curve).
class of continuous distribution functions, and one says that the Kolmogorov-Smirnov
test belongs to the class of distribution free, or nonparametric, tests. This means that the
critical value of the test is also independent of F0 and that one table suffices to perform
the test. This property of Dn is formulated as a theorem. The proof of the theorem shows
a simple way of computing Dn .
Theorem 3.1 The statistic Dn is distribution free over the class of continuous distribu-
tion functions.
Proof. From Figure 3.11 it can easily be seen that the empirical distribution function
F̂n (x) equals i/n in X(i) while it equals (i − 1)/n just left of X(i) . Therefore,
D_n = \max_{1 \le i \le n} \max\Big\{ \Big|\tfrac{i}{n} - F_0(X_{(i)})\Big| ,\; \Big|\tfrac{i-1}{n} - F_0(X_{(i)})\Big| \Big\}.
Since under H0 , X1 , . . . , Xn is a sample from F0 , we have that F0 (X1 ), . . . , F0 (Xn ) is a
sample from the uniform distribution on [0,1] (the so-called ‘integral transformation’). By
the monotonicity of F0 , the distribution of the vector (F0 (X(1) ), . . . , F0 (X(n) )) is equal to
the distribution of the order statistics of a sample from the uniform distribution. Since
this distribution is independent of F0 , the distribution of Dn is also independent of F0 .
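The formula in the proof also gives a convenient way to compute Dn numerically. The following Python sketch (numpy and scipy assumed) does so for a fully specified null distribution F0 and compares the result with the value returned by scipy.stats.kstest; the two values should coincide.

import numpy as np
from scipy import stats

def kolmogorov_smirnov(x, F0):
    # D_n = max_i max( |i/n - F0(x_(i))| , |(i-1)/n - F0(x_(i))| )
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    u = F0(x)                         # F0 evaluated in the order statistics
    i = np.arange(1, n + 1)
    return np.max(np.maximum(np.abs(i / n - u), np.abs((i - 1) / n - u)))

rng = np.random.default_rng(8)
x = rng.exponential(size=15)
print(kolmogorov_smirnov(x, stats.expon.cdf))
print(stats.kstest(x, "expon").statistic)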
Note that the Kolmogorov-Smirnov test is, in the given form, only applicable for testing
simple hypotheses, because the postulated null-distribution F0 occurs in the definition of
the test statistic Dn . This is why in this simple form the test cannot be used to test
‘normality’, that is, to test the composite null hypothesis that the true distribution is
a member of the location-scale family of normal distributions. However, the test can be
adapted for such purposes. For example, a test for normality can be based on the adapted
statistic
\tilde{D}_n = \sup_{-\infty < x < \infty} \big| \hat{F}_n(x) - \Phi\big(S^{-1}(x - \bar{X})\big) \big|,
where X̄ and S 2 are the sample mean and the sample variance, respectively. Finding
critical values for such a test is not straightforward. Theorem 3.1 does not hold anymore.
It would be incorrect to use the critical values of the Kolmogorov-Smirnov test statistic
for this adapted test statistic. In Chapter 4 a method is presented to obtain approximate
values for the correct critical values for the adapted test statistic. We refer to the literature
for more information about the distribution of the test statistic based on the estimated
parameters (e.g., Shorack and Wellner (1986)) as well as for related tests, like the Cramér-
von Mises test, which is based on the distance measure
CM_n = \int_{-\infty}^{\infty} \big( \hat{F}_n(x) - F_0(x) \big)^2 \, dF_0(x).
3.5.3 Chi-square tests
For the chi-square test the real line is partitioned into k disjoint intervals I1 , I2 , . . . , Ik−1 , Ik .
The number of observations in interval Ii is defined as Ni and the test statistic is given
by
X^2 = \sum_{i=1}^{k} \frac{[N_i - n p_i]^2}{n p_i},
where pi = F0 {Ii } is the probability under F0 that an observation lies in Ii . The num-
ber npi is the expectation of Ni when the true distribution of each of the observations
X1 , . . . , Xn is equal to F0 . The statistic X 2 thus measures, in a weighted way, the devia-
tion of the observed frequencies Ni from their expected values under the null hypothesis.
The null hypothesis is rejected for large values of X 2 . The exact distribution of the
statistic under the null hypothesis, and hence the critical value, depends on F0 and the
choice of the intervals Ii . However, for large values of n and for fixed k this distribution is
approximately equal to the chi-square distribution with k − 1 degrees of freedom, which
we denote as χ2k−1 -distribution. As a rule of thumb, this approximation is reliable when
the expected number of observations under the null hypothesis equals at least 5 in each
interval.
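As an illustration, the following Python sketch (numpy and scipy assumed) computes X2 and its approximate p-value for a simple null hypothesis, with the intervals chosen to have probability 1/k each under F0 so that the rule of thumb is satisfied.

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.exponential(size=200)     # data; H0: standard exponential distribution
n, k = len(x), 8

edges = stats.expon.ppf(np.linspace(0, 1, k + 1))   # k cells of probability 1/k under F0
N, _ = np.histogram(x, bins=edges)                  # observed cell frequencies N_i
p = np.full(k, 1 / k)                               # cell probabilities p_i under F0

X2 = np.sum((N - n * p) ** 2 / (n * p))
pval = stats.chi2.sf(X2, df=k - 1)                  # approximate p-value from chi^2_{k-1}
print("X2 =", X2, "p-value =", pval)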
The theoretical justification of using the χ2 distribution lies in the following (see
also Chapter 7). The vector of cell frequencies N (n) = (N1 , . . . , Nk ) has a multinomial
distribution with parameters n and p1 , . . . , pk . It can be shown that for a fixed number
of cells k, the variable X 2 = X 2 (n) converges, for n → ∞, in distribution to a χ2k−1 -
distribution, that is,
X^2(n) \xrightarrow{\;d\;} \chi^2_{k-1}, \qquad n \to \infty,
where χ2k−1 now also represents a random variable with the χ2k−1 -distribution. This result
is closely related to the Central Limit Theorem and the fact that a sum of squares of
independent random variables with the standard normal distribution follows a chi-square
distribution.
Like the Kolmogorov-Smirnov test, the chi-square test is in its simplest form only
applicable to test simple null hypotheses. For testing a composite null hypothesis, con-
sisting of more than one distribution function, the statistic and the critical values have
to be adapted. Usually, the following statistic is used in such a case:
\tilde{X}^2 = \sum_{i=1}^{k} \frac{[N_i - n\hat{p}_i]^2}{n\hat{p}_i},
where p̂i is an estimate of the probability of the i-th interval under the null hypothesis. For testing
normality, for instance, the p̂i can be obtained by maximizing the multinomial likelihood, which is
proportional to
\prod_{i=1}^{k} \Phi\Big(\frac{I_i - \mu}{\sigma}\Big)^{N_i}.
Here we write Φ((I − µ)/σ) for the probability of finding an observation in I under the
N (µ, σ 2 )-distribution. The values of µ and σ 2 that maximize this multinomial likelihood
are in general not equal to X̄ and S 2 .
Chapter 4
The bootstrap1
The bootstrap is a technique which can be used, among other things, for estimating the
distribution of an estimator or test statistic, for constructing confidence intervals, and for
performing tests.
Without modern computers the bootstrap would not be useful, because with this tech-
nique the theoretical derivation of a probability distribution generally is replaced by com-
puter simulation. This is why we first discuss the general idea of simulation.
4.1 Simulation
Let Z1 , Z2 , . . . , ZB be independent replications from a probability distribution G. Then
according to the Law of Large Numbers we have that
\frac{1}{B} \cdot \#(Z_i \in A,\ 1 \le i \le B) \to G(A), \quad \text{with probability 1, if } B \to \infty.
For continuously distributed, real valued random variables this principle can be expressed
in a graphical way. In this case a properly scaled histogram of Z1 , Z2 , . . . , ZB will for
large values of B closely look like the density of G. This is illustrated in Figure 4.1, which
shows histograms for B = 20, 50, 100 and 1000 observations from a χ²_4 distribution, each
time together with the true density.
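As a minimal sketch (Python is used here purely for illustration; the notes do not prescribe any particular software), the effect of increasing B can be imitated by simulating from a χ²_4 distribution and comparing the scaled histogram with the true density:

    # Sketch: a properly scaled histogram of B simulated values approximates the density of G.
    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(1)

    for B in (20, 50, 100, 1000):
        z = rng.chisquare(df=4, size=B)                          # B replications from chi^2_4
        heights, edges = np.histogram(z, bins=15, range=(0, 15), density=True)
        mids = 0.5 * (edges[:-1] + edges[1:])
        err = np.max(np.abs(heights - chi2.pdf(mids, df=4)))     # crude distance to the density
        print(f"B = {B:4d}: max |histogram - density| = {err:.3f}")

The error typically decreases as B grows, which is the graphical counterpart of the convergence statement above.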
The above mentioned concept ‘independent replications’ is precisely defined in prob-
ability theory. With a computer it is possible to generate pseudo random numbers from
most standard probability distributions. A sequence of pseudo random numbers is gen-
erated according to a fixed formula and is therefore not a real sequence of independent
¹ This somewhat strange name comes from B. Efron. A bootstrap is a strap or loop at the back of a boot. The connection between statistics and bootstraps goes via the Baron von Münchhausen, who ‘pulled himself up by his own bootstraps’. (Read this chapter and find the connection yourself.)
Figure 4.1: Histograms of samples of 20, 50, 100 and 1000 independent observations from a χ²_4-distribution, together with the true density.
replications from a given distribution. For most applications the pseudo random num-
bers cannot be distinguished from real independent observations. This is why the word
‘pseudo’ is often omitted. Instead of ‘generating’ one also speaks of simulating and, when
this principle is used for solving a problem, it is called the Monte Carlo method.
Random numbers from more general probability distributions can be generated in two
steps.
- Generate random numbers or vectors U1 , . . . , Un from a certain (standard) distribu-
tion P .
- Next, form with a fixed function ψ the random numbers ψ(U1 ), . . . , ψ(Un ).
A special case of this is the inverse transformation. Here P is the uniform distribution on (0,1) and ψ = G⁻¹ is the inverse of a distribution function G. In this case G⁻¹(U1), . . . , G⁻¹(Un) is a sample from the distribution with distribution function G, because P(G⁻¹(Ui) ≤ x) = P(Ui ≤ G(x)) = G(x).
Evidently, this procedure is only useful if the quantile function G⁻¹ is available on the
computer.
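A minimal sketch of the inverse transformation, taking the exponential distribution as an example because its quantile function G⁻¹(u) = −log(1 − u)/λ has a closed form (the rate λ = 2 is an arbitrary illustrative choice):

    # Sketch: inverse transformation -- if U ~ Uniform(0,1), then G^{-1}(U) has distribution G.
    import numpy as np

    rng = np.random.default_rng(2)
    lam = 2.0                                   # illustrative rate of the exponential distribution
    u = rng.uniform(size=10_000)                # step 1: uniform random numbers
    x = -np.log(1.0 - u) / lam                  # step 2: psi(U) = G^{-1}(U)

    print("sample mean:", x.mean(), "  theoretical mean:", 1 / lam)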
The empirical distribution P̂n is a general, nonparametric estimator for P; one can often construct better estimators when it is known that the distribution P belongs to a particular class
of distributions, like the class of normal distributions or symmetric distributions.
The second type of frequently used estimators for P is the parametric estimator, which
is appropriate in situations where the unknown distribution P is known to belong to a
parametric family like the normal or exponential family, but its parameter value θ is
unknown. In this case it is natural to first find an estimator θ̂n of θ, and then estimate
P by the distribution in the family which has θ̂n as its parameter value. This parametric
estimator of P will be denoted by Pθ̂n . The estimator QPθ̂n for the distribution of Tn is a
parametric bootstrap estimator.
Example 4.1 Consider the case where Tn is the mean of a random sample X1 , . . . , Xn
from an unknown distribution P, i.e. Tn = X̄ = n⁻¹ Σ_{j=1}^{n} Xj. Then Tn is an un-
biased estimator for the expectation µP of P . The unbiasedness means that if we
would compute Tn for each of a large number of different (independent) samples,
the average of these Tn ’s would be close to µP , but this does not say much about the
accuracy of Tn for a single sample. One of the measures for the quality of Tn is the
variance VarP (Tn ). It can be expressed in terms of the variance of one observation,
σ²_P = Var(X1), according to the formula

VarP(X̄) = σ²_P / n.
Of course, in most cases σP2 is unknown. An unbiased estimator for σP2 is the sample
variance

S² = (1/(n − 1)) Σ_{j=1}^{n} (Xj − X̄)².
Hence, the value S 2 /n gives an impression of the accuracy of the sample mean
X̄ = n⁻¹ Σ_{j=1}^{n} Xj as an estimator for µP. A rule of thumb which is often used is that the true value of µP is equal to X̄ plus or minus a couple of times (most often twice) S/√n.
This statement can be made precise by means of a confidence interval. To com-
pute this interval more information about the complete distribution of the sample
mean is necessary. In classical statistics it is not unusual to assume that the un-
known distribution P is a normal distribution. Under this assumption X̄ has a
N (µP , σP2 /n) distribution, which can be estimated by the N (X̄, S 2 /n) distribution.
Here the true distribution of X̄, QP = QP (X̄), is estimated by the parametric
bootstrap estimator Q_{Pθ̂n} = N(X̄, S²/n).
Since frequently the underlying distribution P is not normal, we often do not
want to assume that P is normal. In case the normality assumption is not made,
the (empirical) bootstrap still yields an estimate for the unknown distribution of the sample mean X̄.
The bootstrap estimator can be applied in many more complex situations than that of
example 4.1; for example, to estimate the distribution or variance of the median, or of a
trimmed mean (see next chapter), for which no simple formulas are available. Moreover,
the bootstrap estimator can be used for the assessment of the accuracy of estimators of
much more complex parameters, such as regression curves.
Until now, we only considered the theoretical concept of the bootstrap estimator.
A practical question is whether a bootstrap estimate can in fact be computed, once the
realized sample x1 , . . . , xn from X1 , . . . , Xn is known. In the above the estimator is written
as QP̃n , but is this, given the observations x1 , . . . , xn , a useful expression? In most cases
the answer is negative. Often QP̃n is a complex function of P̃n , for which no explicit
expression is available. The distribution QP̂n in Example 4.1, for example, is a discrete
distribution of which the probabilities can in principle be written out, but for which no
useful formula is possible. However, when a fast computer is available, it is always possible
to give a (stochastic) approximation of the estimate: one can simulate it.
For the bootstrap this goes as follows. Suppose that the original observations x1 , . . . , xn
were obtained from a distribution P , and that P̃n is an estimator for P . Given the val-
ues x1 , . . . , xn , the estimate P̃n = P̃n (x1 , . . . , xn ) is fixed. But this means that given
x1 , . . . , xn , the probability mechanism P̃n = P̃n (x1 , . . . , xn ) is known, and that a new
sample X1∗ , . . . , Xn∗ can be generated from it by computer simulation. Obviously, the
new sample X1∗ , . . . , Xn∗ is not a randomly selected sample like the original sample, but a
pseudo random, simulated sample. It is called a bootstrap sample, and the Xi∗ are called
the bootstrap values. It is important that X1∗ , . . . , Xn∗ are generated from P̃n in the same
manner as X1 , . . . , Xn were sampled from P (for instance of the same size and with the
same (in)dependence structure). Write Tn∗ = Tn (X1∗ , . . . , Xn∗ ) for the corresponding boot-
² It should be noted that other estimators than the empirical bootstrap are possible, even if the normality assumption does not hold. For instance, the earlier mentioned N(X̄, S²/n) estimator. After all, for large n the sample mean will be approximately normally distributed because of the central limit theorem, provided that σ²_P < ∞.
³ In the example the bootstrap estimator is the variance of the mean of a sample of size n from the distribution P̂n. Given the values of the original sample x1, . . . , xn, which need to be considered as fixed, the bootstrap estimate can in this example be explicitly determined. This goes as follows. Like before, the variance of the mean is 1/n times the variance of one observation. One observation from P̂n has expectation x̄ = Σ_{j=1}^{n} xj · n⁻¹ and variance Σ_{j=1}^{n} (xj − x̄)² · n⁻¹. (Use the formulas for expectation and variance of a discrete distribution to compute this.)
strap value of Tn . Then the bootstrap estimate for the distribution of Tn can be written
as
QP̃n (x1 ,...,xn ) = QP̃n (Tn∗ | x1 , . . . , xn ).
The vertical bar in the middle stands for ‘given the value of’: determination of the boot-
strap distribution is always done with the original observations x1 , . . . , xn being given,
thus fixed. A (stochastic) approximation of the estimate QP̃n(Tn* | x1, . . . , xn) can now be obtained by simulating B values T*n,1, . . . , T*n,B from this distribution and then approximating QP̃n(Tn* | x1, . . . , xn) by the empirical distribution of these B bootstrap values. Simulation of the B bootstrap values T*n,1, . . . , T*n,B of Tn can be done according to the following bootstrap sampling scheme, which consists of three steps:
given x1 , . . . , xn ,
1. simulate a large number, B say, bootstrap samples (like X1∗ , . . . , Xn∗ ) according to
P̃n (x1 , . . . , xn ), all of size n, and independent from each other;
2. compute for each of the B bootstrap samples the corresponding bootstrap value Tn* = Tn(X1*, . . . , Xn*);
3. approximate QP̃n(Tn* | x1, . . . , xn) by the empirical distribution of the B bootstrap values T*n,1, . . . , T*n,B.
Example 4.2 Consider again the case where X1 , . . . , Xn is a random sample from
a distribution P and where for P̃n the empirical estimator P̂n is used. Then the
bootstrap values X1∗ , . . . , Xn∗ form a sample from the discrete distribution P̂n . Now,
sampling from a discrete distribution where all mass points have equal probability
is the same as randomly selecting one of the mass points. Therefore, in this case the
values X1∗ , . . . , Xn∗ can be obtained by drawing a random sample with replacement
of size n from the set {x1 , . . . , xn }. With replacement, because the different Xi∗ need
to be independent, like the values Xi in the original sample. Hence, each of the B
bootstrap samples in step 1 is obtained by ‘resampling’ n times with replacement
from the original observations {x1 , . . . , xn }. This technique is therefore also called
a resampling scheme.
Random sampling from a set of n points can be implemented on a computer in
a reasonably efficient way. This is why the simple bootstrap which is based on the
empirical estimator, is frequently used. In principle it is possible to sample from
any probability distribution, so that instead of the empirical estimator any other
estimator for P̃n can be used. With respect to the approximation for QP̃n (Tn∗ |
x1 , . . . , xn ), the choice of P̃n only makes a difference for the first step of the bootstrap
scheme, the second and third steps are the same.
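A minimal sketch of this resampling scheme, with the sample median as the statistic Tn and a small placeholder data vector (both choices are purely for illustration):

    # Sketch: empirical bootstrap -- resample n times with replacement from {x_1, ..., x_n}.
    import numpy as np

    rng = np.random.default_rng(3)
    x = np.array([2.1, 3.4, 1.9, 5.0, 2.7, 4.2, 3.3, 2.8])     # placeholder observations
    B = 1000

    t_star = np.empty(B)
    for b in range(B):
        sample = rng.choice(x, size=x.size, replace=True)       # step 1: one bootstrap sample
        t_star[b] = np.median(sample)                           # step 2: bootstrap value of T_n
    # step 3: the empirical distribution of t_star approximates Q_{P_hat_n}(T_n* | x_1, ..., x_n)
    print("bootstrap estimate of sd(median):", t_star.std(ddof=1))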
***
Example 4.3 The statistic Tn = (n − 1)/Σ_{i=1}^{n} Xi is an unbiased estimator for
the parameter λ when a sample X1 , . . . , Xn from an exponential distribution with
parameter λ is available. The observed values:
2.24, 0.029, 0.155, 0.551, 0.495, 0.15, 0.64, 1.132, 0.03, 0.062, 1.771, 1.001,
0.478, 0.897, 0.205, 0.254, 0.564, 0.274, 0.517, 0.505, 0.51, 0.565, 1.011, 1.357,
2.76
Figure 4.2: Histogram of the 100 bootstrap values of the statistic (n − 1)/Σ_{i=1}^{n} Xi of Example 4.3.
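The computation behind Figure 4.2 can be sketched as follows; whether the original figure was produced with the empirical bootstrap used below or with a parametric (exponential) bootstrap is not stated in the text, so this is only one possible reconstruction:

    # Sketch: 100 bootstrap values of T_n = (n-1)/sum(X_i) for the data of Example 4.3.
    import numpy as np

    x = np.array([2.24, 0.029, 0.155, 0.551, 0.495, 0.15, 0.64, 1.132, 0.03, 0.062,
                  1.771, 1.001, 0.478, 0.897, 0.205, 0.254, 0.564, 0.274, 0.517, 0.505,
                  0.51, 0.565, 1.011, 1.357, 2.76])
    n = x.size
    rng = np.random.default_rng(4)

    t_star = np.array([(n - 1) / rng.choice(x, size=n, replace=True).sum() for _ in range(100)])
    print("observed estimate of lambda:", (n - 1) / x.sum())
    print("bootstrap sd of the estimator:", t_star.std(ddof=1))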
As mentioned above, the approximation for the bootstrap estimate by computer sim-
ulation is a stochastic approximation: the bootstrap estimate itself is being estimated. If
we are unlucky, the obtained B bootstrap samples are all extreme, and the approximation
very bad. However, the law of large numbers says that for B → ∞ this approximation
converges in probability to the bootstrap estimate. Informally, ‘when B = ∞ then the
approximation is exact’. In theory B can be made arbitrarily large, because it does not
depend on the original data but is chosen by the statistician. The larger B, the better the
approximation. The fact that also the computation time increases with B is a practical
limitation. It depends on the complexity of the estimator Tn , how severe this practical
limitation is. For instance, some estimators need one day to compute them only once.
Then for the computation of the bootstrap values (in step 2) B days, or B computers,
are needed!
How large do we take B and how well does the bootstrap work? While using the
(approximating) bootstrap estimator in principle two errors are made:
- the difference between the bootstrap estimator QP̃n(Tn* | X1, . . . , Xn) and the true distribution QP(Tn) of Tn;
- the difference between the empirical distribution of the simulated B bootstrap values T*n,i and QP̃n(Tn* | X1, . . . , Xn).
The first error is in some sense an unavoidable error, because on the basis of the orig-
inal observations the true distribution of Tn cannot be determined with certainty. The
magnitude of this error depends on the specific statistic Tn and the chosen estimator P̃n .
When the empirical distribution P̂n is used, this error will be small, provided that the
distribution QP (Tn ) does not change too much by ‘discretizing the underlying distribution
P’. As we shall see, this is not always the case.
The second error is of a completely different nature. This one we can control ourselves
and it can be made arbitrarily small by making B sufficiently large. Most of the time
we will not let the computer run endlessly, but be satisfied with making this second error
relatively small with respect to the first one. It has turned out that for simple cases
values of B between 20 and 200 yield good results, although sometimes values of 1000 are
necessary. For more complex situations values of 10,000 or more are no exception.
[Tn − Gn⁻¹(1 − α), Tn − Gn⁻¹(α)]

is a confidence interval with confidence level 1 − 2α for θ. The quantile function Gn⁻¹ is usually unknown, but by replacing it by an estimator G̃n⁻¹ a confidence interval

[Tn − G̃n⁻¹(1 − α), Tn − G̃n⁻¹(α)]
            PRRP     CTRP
mean        61.56    75.11
median      54.25    62.5

Table 4.1: Mean and median β-thromboglobulin level of the groups PRRP and CTRP.
with confidence level approximately 1 − 2α is obtained. When for G̃n⁻¹ the quantile function of the bootstrap estimator G̃n = QP̃n(Tn* − Tn | X1, . . . , Xn) is used, then this interval is called a bootstrap confidence interval. In practice a bootstrap estimator G̃n will itself be approximated by means of simulation.
In summary, a bootstrap confidence interval with approximate confidence level 1 − 2α
is obtained in 3 steps: given x1 , . . . , xn ,
- simulate a large number, B say, bootstrap samples (like X1*, . . . , Xn*) from P̃n, all of size n, and independent of each other;
- compute for each of the B bootstrap samples the corresponding bootstrap value (like Tn* = Tn(X1*, . . . , Xn*)); call the resulting values T*n,1, . . . , T*n,B and write Zi* = T*n,i − Tn (i = 1, . . . , B);
- determine the α- and (1 − α)-quantiles of Z1*, . . . , ZB* as approximations of G̃n⁻¹(α) and G̃n⁻¹(1 − α), and take [Tn − G̃n⁻¹(1 − α), Tn − G̃n⁻¹(α)] as the confidence interval.
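A minimal sketch of these three steps, with the sample median as Tn and simulated placeholder data; the empirical quantiles of the Zi* play the role of G̃n⁻¹(α) and G̃n⁻¹(1 − α):

    # Sketch: basic bootstrap confidence interval [T_n - q_{1-alpha}(Z*), T_n - q_alpha(Z*)].
    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.exponential(scale=2.0, size=40)        # placeholder sample
    alpha = 0.05
    B = 2000

    t_n = np.median(x)
    z_star = np.array([np.median(rng.choice(x, size=x.size, replace=True)) - t_n
                       for _ in range(B)])
    lower = t_n - np.quantile(z_star, 1 - alpha)   # T_n - G~^{-1}(1 - alpha)
    upper = t_n - np.quantile(z_star, alpha)       # T_n - G~^{-1}(alpha)
    print(f"approximate {1 - 2 * alpha:.0%} confidence interval: ({lower:.2f}, {upper:.2f})")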
Example 4.4 Example 3.5 gives data concerning the β-thromboglobulin level in
the blood of three different groups of patients: PRRP, SDRP and CTRP. The mean
and median β-thromboglobulin level of the first and last group are given in Table 4.1.
It seems that the β-thromboglobulin level of the group CTRP is much higher than
that of the group PRRP. In Chapter 3 we already saw that the difference vanishes
after a logarithmic transformation.
The boxplots (Figure 4.3) confirm that the conclusion is not that simple. It is true
that the β-thromboglobulin levels of the CTRP group seem to be somewhat larger,
but this seems in the first place to be the result of a larger variation.
In order to be able to draw a clearer conclusion, with the bootstrap an (approximate) 90% confidence interval for the difference in median between the two groups was derived. For this a vector of 1000 bootstrap values of the difference was generated by resampling each group separately (a sketch of such a scheme is given at the end of this example).
Figure 4.3: Boxplots of the β-thromboglobulin levels of PRRP (left) and CTRP (right).
Figure 4.4: Histogram of the 1000 bootstrap values z* described in Example 4.4.
According to formula 4.2 this leads to the confidence interval (-22.5, 5.5) with ap-
proximate confidence level 90% for the difference in median. The fact that the value
0 lies in this interval confirms the doubt about the existence of a systematic differ-
ence between the two groups of patients. The observed difference can very well be
the result of chance.
The choice of 90% is, as always, rather arbitrary. Figure 4.4 (and the corre-
sponding set of z ∗ -values) is therefore very helpful, because from this the confidence
intervals for all confidence levels can be determined. In particular, we remark that
for this set of z ∗ -values the confidence interval with level 70% still contains the value
0.
The bootstrap scheme here is more complex than in Example 4.3, because we
now deal with a two sample problem. Formally, we can describe the statistical
model of the observations by the parameter P = (F, G), where F and G are the
distributions of the β-thromboglobulin levels in the groups PRRP and CTRP, re-
spectively. This parameter is estimated in the bootstrap scheme above by the pair
P̂n = (F̂n , Ĝn ) of the empirical distributions of the two samples.
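A minimal sketch of such a two-sample scheme, resampling each group separately from its own empirical distribution; the two data vectors below are placeholders, since the β-thromboglobulin measurements themselves are not reproduced at this point:

    # Sketch: bootstrap values z* of the difference of the sample medians of two groups.
    import numpy as np

    rng = np.random.default_rng(6)
    prrp = np.array([33., 42., 50., 54., 58., 62., 71., 79., 88., 95.])     # placeholder values
    ctrp = np.array([30., 41., 55., 60., 65., 78., 92., 110., 140., 180.])  # placeholder values
    B = 1000

    t = np.median(prrp) - np.median(ctrp)
    z_star = np.empty(B)
    for b in range(B):
        f_star = rng.choice(prrp, size=prrp.size, replace=True)   # resample from F_hat
        g_star = rng.choice(ctrp, size=ctrp.size, replace=True)   # resample from G_hat
        z_star[b] = (np.median(f_star) - np.median(g_star)) - t
    ci = (t - np.quantile(z_star, 0.95), t - np.quantile(z_star, 0.05))
    print("approximate 90% confidence interval for the difference in median:", ci)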
***
4.4 Bootstrap tests
Suppose that the null hypothesis H0: P ∈ P0 is rejected for large values of a test statistic Tn with observed value t. Rejecting H0 whenever the p-value PH0(Tn > t) is at most α gives a test with significance level ≤ α. For actually performing the test, the p-value
PH0 (Tn > t) has to be evaluated. Because Tn is nonparametric, this can be done by
choosing a suitable distribution P0 ∈ P0 and evaluating PP0 (Tn > t) for this P0 . Sometimes
this can be done exactly. Often an asymptotic approximation for n → ∞ is known, for
instance a normal approximation. A third possibility is simulation: simulate B bootstrap
values of Tn from the distribution QP0 (Tn ) and approximate the p-value with the fraction
of bootstrap values that exceed the observed value t of Tn .
In case Tn is not nonparametric under the null hypothesis, one can proceed as follows.
Choose, under the assumption that the null hypothesis is correct, an appropriate estimator
P̃n , i.e. P̃n ∈ P0 , for P . Determine the bootstrap estimate Q(Tn∗ | x1 , . . . , xn ) = QP̃n for
QP0 = QP0 (Tn ). Simulate B bootstrap values Tn∗ from QP̃n and determine a ‘p-value’ as
the fraction of bootstrap values which exceed the observed value t of Tn . The test “Reject
H0 if p ≤ α” is at best approximately of level ≤ α. Asymptotically, the probability of an
error of the first type under the true distribution is often exactly equal to α. When Tn
for large n is approximately nonparametric, then the significance level of the test is also
approximately equal to ≤ α. (Remember, that the significance level is the supremum of
the probabilities of an error of the first type.)
Example 4.5 In Example 3.2 a linear regression was performed of the volume yi
on the diameter xi of 31 cherry trees. The least squares estimates for the intercept
and slope of the regression line were α̂ = −36.94 and β̂ = 5.07. We wish to test
whether the normality assumption of the regression model (“the measurement errors
Yi − α − βxi are N (0, σ 2 ) distributed, with σ 2 unknown”) is satisfied by means of
the adapted Kolmogorov-Smirnov test statistic D̃n, which measures the distance between the empirical distribution of the residuals and the fitted normal distribution. Here F̂n is the empirical distribution function of the residuals Ri = Yi − α̂ − β̂xi and S² = Σ_{i=1}^{n} Ri²/(n − 2). For the cherry tree data the observed value of the latter is
s2 = 18.079. The test statistic is nonparametric under the null hypothesis: for every
α, β and σ, D̃n has the same distribution. This distribution can be approximated
while using a simulation scheme.
Figure 4.5: Histogram of 1000 bootstrap values of the statistic D̃n in Example 4.5.
The observed value of D̃n is 0.0953. To approximate the p-value PH0 (D̃n >
0.0953) 1000 bootstrap values of D̃n were generated according to the following
scheme:
- simulate a sample of size 31 from N (0, s2 ); denote the values by e∗1 , . . . , e∗31
- compute yi∗ = α + βxi + e∗i , (i = 1, . . . , 31) with α and β as computed before,
that is α = α̂ and β = β̂.
- determine the least squares estimates α̂* and β̂* for linear regression of yi* on xi;
- compute Ri* = yi* − α̂* − β̂*xi, (i = 1, . . . , 31);
- compute D̃n* from R1*, . . . , R31* in the same manner as D̃n from R1, . . . , R31.
This scheme was executed 1000 times. Figure 4.5 gives a histogram of the 1000
values of D̃n∗ . The observed D̃n = 0.0953 is the 0.316-quantile of the values D̃n∗ .
The p-value of about 68% gives absolutely no reason to doubt the normality of the
residuals.
In step one a parametric bootstrap is performed: the bootstrap sample is drawn
from a normal distribution and not from the empirical distribution of the residuals.
This is because the goal is to approximate the distribution of D̃n under the null hy-
pothesis, which says that the measurement errors are normally distributed. Instead
of D̃n also another test statistic could be used. One point of criticism of the use of D̃n is, for example, that under the null hypothesis the residuals are indeed normally distributed, but they are neither independent nor do they have equal variances.
the test would be appropriate. However, this is usually omitted due to the amount
of extra work that is needed for this.
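A minimal sketch of this bootstrap scheme; the diameters x and volumes y below are placeholders for the cherry tree data, and the Kolmogorov-Smirnov form sup_x |F̂n(x) − Φ(x/s)| used for D̃n is an assumption about the exact definition of the statistic:

    # Sketch: parametric bootstrap approximation of the null distribution of D~_n.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(7)
    x = np.linspace(8, 21, 31)                       # placeholder diameters
    y = -36.94 + 5.07 * x + rng.normal(0, 4.3, 31)   # placeholder volumes around the fitted line

    def ks_residuals(x, y):
        beta, alpha = np.polyfit(x, y, 1)            # least squares fit: slope, intercept
        r = y - alpha - beta * x
        s = np.sqrt(np.sum(r ** 2) / (r.size - 2))
        r_sorted = np.sort(r)
        ecdf_hi = np.arange(1, r.size + 1) / r.size
        ecdf_lo = np.arange(0, r.size) / r.size
        f0 = norm.cdf(r_sorted / s)                  # fitted N(0, s^2) distribution function
        stat = max(np.max(np.abs(ecdf_hi - f0)), np.max(np.abs(ecdf_lo - f0)))
        return stat, alpha, beta, s

    d_obs, alpha_hat, beta_hat, s = ks_residuals(x, y)
    d_star = np.empty(1000)
    for b in range(1000):
        e_star = rng.normal(0, s, x.size)            # step 1: errors from N(0, s^2)
        y_star = alpha_hat + beta_hat * x + e_star   # step 2: new responses
        d_star[b], *_ = ks_residuals(x, y_star)      # steps 3-5: refit and recompute the statistic
    print("approximate p-value:", np.mean(d_star > d_obs))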
***
4.5 Limitations of the bootstrap
Figure 4.6: Histograms of 12 independent realizations of the empirical bootstrap estimator (B = 500) for the distribution of the mean of a sample of size 30 from a Cauchy distribution, together with the density function of the Cauchy distribution.
The whole procedure (drawing a sample of size 30 from the standard Cauchy distribution and computing the empirical bootstrap estimator with B = 500) was repeated 12 times to get an idea of the variation of the quality of the bootstrap estimator.
Figure 4.6 gives the histograms of the 500 bootstrap values. The smooth curve in
the graphs is the true density of the mean. (According to the theory the mean itself
has a standard Cauchy distribution also.) It is clear that the bootstrap estimator
in this case behaves rather badly.
The explanation lies in the heavy tails of the Cauchy distribution. Firstly, they
cause the 12 obtained samples to be of a quite different nature. As a consequence,
the bootstrap histogram is in some cases very concentrated and in other cases fairly
spread out. Secondly, the empirical distribution P̂30 of the 30 original observations
differs very much from the underlying Cauchy distribution with respect to the tails.
After all, the empirical distribution does not really have tails. Because the distri-
bution of the mean strongly depends on the tails of the underlying distribution, the
distribution of the mean of 30 observations from P̂30 (in Figure 4.6 approximated
by a histogram) hardly resembles the distribution of the mean of 30 observations
from a Cauchy distribution.
One could object, that n = 30 or B = 500 is not large enough. This is not
correct. Increasing B does not yield any different results. Moreover, it can be
proved theoretically that in this case the bootstrap estimator does not converge to
the true distribution when B → ∞ and n → ∞, but that it remains of a stochastic
nature. The simulation results that are shown in Figure 4.6 therefore give exactly
the right impression.
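A minimal sketch of one realization of the experiment behind Figure 4.6:

    # Sketch: empirical bootstrap (B = 500) for the mean of 30 Cauchy observations.
    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.standard_cauchy(30)                    # one original sample of size 30
    boot_means = np.array([rng.choice(x, size=30, replace=True).mean() for _ in range(500)])

    # The true distribution of the mean is again standard Cauchy; the bootstrap histogram
    # typically looks nothing like it and changes drastically with the original sample.
    print("range of the 500 bootstrap means:", boot_means.min(), boot_means.max())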
***
Chapter 5
Robust Estimators
Most methods from statistics are based on a priori assumptions on the probability distri-
bution from which the observations originate. These assumptions may vary in detail. One
may, for instance, assume that x1 , . . . , xn are a sample from a normal distribution, or,
instead, from a symmetric distribution. Such assumptions are almost never completely
correct. This holds in particular for the more stringent assumptions, like normality. It
is therefore important to investigate how sensitive a method is for deviations from the
assumptions. When we are not sure that the ‘ideal’ mathematical model is correct, it
should be the case that a small deviation from that model only results in a small change
in the conclusions. Statistical methods which are relatively insensitive to small deviations
from the model assumptions are called robust methods. One could give a more precise
meaning to this description, but in general the term ‘robust statistics’ is used in a rather
vague way. Robustness is in particular often described as insensitivity to the occurrence
of
- large deviations from the assumptions in some observations;
possible to identify, or even correct, mistakes by careful screening of the data. In general,
it is recommended to perform a statistical data analysis both with and without suspect
data points and compare the results. Obviously, it is wrong to delete observations from a
data set just for subjective reasons. A good robust procedure decreases the influence of
incorrect observations in an objective manner.
5.1.1 Trimmed means
The α-trimmed mean is defined as

Tn,α = (1/(n − 2[αn])) Σ_{i=[αn]+1}^{n−[αn]} X(i),

where X(1) ≤ . . . ≤ X(n) are the ordered observations and [αn] is the integer part of αn. In words: the [αn] largest and smallest observations are deleted and the average is taken over the remaining values. It is clear that outliers have a smaller influence on Tn,α than on X̄n, because they are ‘trimmed away’ when Tn,α is used (if 0 < α < ½).
The 0-trimmed mean is equal to X̄n. The other extreme, the (almost) ½-trimmed mean
is close to the median. The median is very robust against the occurrence of outliers. Even
when almost all observations are extreme, the median is a useful location estimator.
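A minimal sketch of the computation of an α-trimmed mean (scipy.stats.trim_mean provides essentially the same quantity):

    # Sketch: alpha-trimmed mean -- delete the [alpha*n] smallest and largest observations.
    import numpy as np

    def trimmed_mean(x, alpha):
        x = np.sort(np.asarray(x, dtype=float))
        k = int(alpha * x.size)                   # number of observations trimmed at each end
        return x[k:x.size - k].mean() if k < x.size // 2 else np.median(x)

    x = [3.1, 2.8, 2.9, 3.0, 3.3, 2.7, 14.2]      # illustrative data with one outlier
    print(np.mean(x), trimmed_mean(x, 0.2), np.median(x))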
Example 5.1 Example 3.6 gives data concerning numbers of flaws in textile pro-
duced with methods A and B. The observed mean numbers of flaws are 8.8 for
method A and 7.6 for method B. At first sight it seems that method B yields fewer
flaws than method A. However, to see whether the difference is systematic (or ‘signif-
icant’), the observed difference in mean of 1.2 flaws needs to be related to the accuracy with which this difference is estimated.
Figure 5.1: Boxplots of the numbers of flaws in textile produced by method A (left) and method B (right).
Boxplots of the numbers of flaws yield new insight, see Figure 5.1. Sample B
is clearly stochastically smaller than sample A. Method B, however, also resulted
in a small number of extremely bad products, whereas the products of method A
were more homogeneous. Because method B yields relatively many outliers, the
mean number of flaws is perhaps not the best measure for location in this case. The
values of a number of trimmed means for the two samples are given in Table 5.1. It is
remarkable that for sample A the trimmed means for different trimming percentages
are almost the same. For sample B trimming with a percentage of 10% leads to
a much smaller estimate for location than the mean. The difference of the 20% trimmed means is 2.3 flaws. A bootstrap estimate of the standard deviation of the difference of the 20% trimmed means (based on B = 500 bootstrap samples from the empirical distribution of samples A and B) is 0.55. The difference of 2.3 flaws is therefore about 4 times the standard deviation of the estimator used.
From this it can be concluded that there is an important difference between the two
20% trimmed means of the two samples.
In this example the outliers of sample B are most likely not caused by mea-
α A B
0 8.8 7.6
0.1 8.9 6.8
0.2 8.9 6.6
0.3 8.8 6.5
0.4 8.9 6.5
0.5 9.0 6.5
Table 5.1: α-trimmed means of samples A and B for different values of α.
surement error. They are an indication for a possible structural problem with the
production procedure of method B. A report of an analysis of these data should
therefore not be limited to just the difference in location. For the textile factory
the results of the analysis are perhaps only interesting when it is possible to reduce
the outliers for method B to an acceptable level.
***
The sensitivity of the mean for outliers can be quantified by means of an influence
function. Suppose that we have a random sample of n observations x1 , . . . , xn and that
we next obtain one additional observation of size y. When we calculate the mean for both
cases, the difference of these means, ‘the influence of an additional observation in y’, is
(1/(n + 1)) (Σ_{i=1}^{n} xi + y) − (1/n) Σ_{i=1}^{n} xi = (y − x̄n)/(n + 1).
The change in the location estimate is therefore proportional to the additional observation
y. Unfortunately, a similar calculation for other estimators, like the trimmed means, does
not yield such a clear result. To quantify the influence of an additional observation it
is therefore useful to apply an asymptotic approximation: take the limit for n → ∞.
It holds that X̄n converges to E X1 (in probability) when n → ∞. Hence, n + 1 times
the influence of an additional observation in y converges for n → ∞ to y − E X1 . The
function y → IF(y) = y − E X1 is the (asymptotic) influence function of the (sample)
mean. Figure 5.2.a shows a graph of this function. From this it can be seen that the
influence of an additional observation is proportional to its size. Extreme observations
thus have an extremely large influence.
For the trimmed means one can argue in a similar way. The asymptotic influence
function of the α-trimmed mean looks like the graph in Figure 5.2.b. From this we see
that the influence of an additional observation is bounded (if α > 0), which is intuitively
clear. Note, however, that the influence of an extreme observation is not equal to zero!
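The influence functions in Figure 5.2 can be imitated empirically: the sketch below computes (n + 1) times the change in the estimate caused by one extra observation y, for the mean and for a 10%-trimmed mean, on a simulated sample:

    # Sketch: empirical influence of one additional observation y on two location estimators.
    import numpy as np
    from scipy.stats import trim_mean

    rng = np.random.default_rng(11)
    x = rng.normal(size=200)

    def influence(estimator, y):
        # (n + 1) * (estimate with y added - estimate without y), cf. the calculation above
        return (x.size + 1) * (estimator(np.append(x, y)) - estimator(x))

    for y in (-10, -2, 0, 2, 10):
        inf_mean = influence(np.mean, y)
        inf_trim = influence(lambda a: trim_mean(a, 0.1), y)
        print(f"y = {y:4d}:  mean {inf_mean:7.2f}   10%-trimmed {inf_trim:7.2f}")

The influence on the mean grows proportionally with y, whereas the influence on the trimmed mean stays bounded.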
Boundedness of the influence function is an important prerequisite for an estimator
to be robust. Estimators with an unbounded influence function, like X̄n , are not robust.
Estimators with a bounded asymptotic influence function are sometimes called B-robust.
Figure 5.2: a) Influence function of X̄n. b) Influence function of the α-trimmed mean. (The scale of the axes has no meaning here.)
The maximum absolute value of the asymptotic influence function is called the gross error
sensitivity. The smaller the gross error sensitivity, the more B-robust the estimator.
The influence function of both the mean and the trimmed mean is continuous. This
means that a small change in one observation has a small influence on the value of the
estimate. This too is a robustness property.
Until now we have neglected an important aspect. What in fact do X̄n and Tn,α
estimate? They are both location estimators, but ‘location’ is a vague concept. Of X̄n we know that it is an estimator of E_F X1 = ∫ x dF(x)¹, the population mean, or expectation, of the distribution F. Analogously, Tn,α is an estimator for the trimmed (population) mean of F:

Tα(F) = (1/(1 − 2α)) ∫_{F⁻¹(α)}^{F⁻¹(1−α)} x dF(x).

In words: Tα(F) is the mean or expectation of F between the α and the 1 − α quantile.²
¹ In the following we write ∫ g(x) dF(x) for the expectation of g(X) when X has distribution F. This expression is equal to Σ_x g(x) F{x} when F is a discrete distribution, and it is equal to ∫ g(x) f(x) dx when F is a continuous distribution with probability density f.
² A mathematical argumentation that Tn,α is an estimator for Tα(F) can best be based on asymptotics. According to the law of large numbers it holds that X̄n →^{PF} ∫ x dF(x), n → ∞ (if E_F|X1| < ∞). Analogously one can show that (under some conditions) Tn,α →^{PF} Tα(F), n → ∞. (Here the notation →^{PF} stands for: convergence in probability when F denotes the true distribution of the observations. The exact definition of Yn →^{PF} θ is: PF(|Yn − θ| > ε) → 0 for all ε > 0.)
5.1.2 M -Estimators
There are many other robust measures for location. An important class of such measures
is formed by the so-called M -estimators. An M -estimator can be regarded as a general-
ization of a maximum likelihood estimator. For a given function ρ, it can be defined as
the value of θ that maximizes the expression
∏_{i=1}^{n} ρ(Xi − θ).
When for ρ a probability density f is taken, then one finds the maximum likelihood
estimator for the problem in which n independent observations from the density f (x − θ)
were obtained.
Like the maximum likelihood estimator, most of the time an M -estimator Mn is not
computed as the solution of a maximization problem, but as the solution of an equation
of the form
Σ_{i=1}^{n} ψ(Xi − Mn) = 0,

for a fixed function ψ. When ψ(x) = (d/dx) log ρ(x) is chosen, this yields the same result as the maximization of the expression above. Some often-used ψ functions are given in Table 5.3³. The third column in the table gives the distribution for which the estimator
is exactly the maximum likelihood estimator. The Huber distribution with parameters b
³ The sign function sgn(x) is defined as: 1 if x > 0, 0 if x = 0 and −1 if x < 0.
ψ(x)                                              Mn         distribution
x                                                 X̄n         normal
sgn(x)                                            med(Xi)    Laplace
−b if x ≤ −br;  x/r if |x| ≤ br;  b if x ≥ br     Huber      Huber
tanh(x/r)                                                    logistic
x(r² − x²)² 1_[−r,r](x)                                      biweight

Table 5.3: Some ψ-functions for M-estimators for location; b and r are positive constants.
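A minimal sketch of computing an M-estimate of location with the Huber ψ from Table 5.3, by solving Σ ψ(Xi − Mn) = 0 with bisection; the constants b and r and the contaminated test sample are illustrative choices:

    # Sketch: Huber M-estimator of location via bisection on sum(psi(x_i - m)) = 0.
    import numpy as np

    def psi_huber(u, b=1.5, r=1.0):
        return np.clip(u / r, -b, b)        # -b for u <= -b*r, u/r in between, b for u >= b*r

    def m_estimate(x, psi):
        lo, hi = np.min(x), np.max(x)       # the root lies between the extremes
        for _ in range(60):                  # sum(psi(x - m)) is decreasing in m
            mid = 0.5 * (lo + hi)
            if np.sum(psi(x - mid)) > 0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    rng = np.random.default_rng(12)
    x = np.concatenate([rng.normal(0, 1, 95), rng.normal(0, 10, 5)])   # contaminated sample
    print("mean:", x.mean(), " median:", np.median(x), " Huber M-estimate:", m_estimate(x, psi_huber))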
Figure 5.3: Graphs of some ψ functions for M-estimators of location (the panels include the logistic and biweight ψ functions). Again, the scale of the axes has no meaning.
variance of the mean of n observations from the contaminated normal distribution above,
does not exist either. One could say: the mean has infinite variance and is therefore a bad
estimator for the location θ. A trimmed mean with trimming fraction of, for example,
α = 0.1 could work here. It is true that a trimmed mean would be a little less efficient
than the mean when the distribution would be exactly normal after all, but it is infinitely
better when there is a substantial number of outliers among the observations. By putting
up with a somewhat less efficient estimator in the case of the ideal model (normality),
one gains enormously in robustness in case of the contaminated normal model.
Example 5.2 To illustrate that a trimmed mean Tn,α is a better estimator for the
parameter θ than the mean X̄n in case the data originate from the contaminated dis-
tribution F above, a simulation study was performed. We simulated 1000 indepen-
dent samples, each of size 50, from the distribution 95/100 N (0, 1) + 5/100 Cauchy
and computed 1000 times the values of X̄n and Tn,α for α = 0.1. Figure 5.4 gives
boxplots of the twice 1000 values. For the given distribution F the point of symme-
try is θ = 0. Therefore, a good estimator of θ should yield values close to 0. From
the boxplots it is clear that the mean does not give the desired results: the spread
in the boxplot on the left is much larger than in the one on the right.
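A minimal sketch of this simulation study:

    # Sketch: 1000 samples of size 50 from 0.95*N(0,1) + 0.05*Cauchy; mean vs 10%-trimmed mean.
    import numpy as np
    from scipy.stats import trim_mean

    rng = np.random.default_rng(13)
    means, trimmed = np.empty(1000), np.empty(1000)
    for i in range(1000):
        contaminated = rng.uniform(size=50) < 0.05
        x = np.where(contaminated, rng.standard_cauchy(50), rng.normal(size=50))
        means[i], trimmed[i] = x.mean(), trim_mean(x, 0.1)

    # The spread of the plain means is far larger, as in the boxplots of Figure 5.4.
    print("IQR of the means:        ", np.subtract(*np.percentile(means, [75, 25])))
    print("IQR of the trimmed means:", np.subtract(*np.percentile(trimmed, [75, 25])))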
***
where Tα(F) is the trimmed population mean. For most F this can be extended to the asymptotic normality of the distribution of Tn,α: for large n, √n (Tn,α − Tα(F)) is approximately N(0, Aα(F)) distributed, or

Tn,α ≈^D N(Tα(F), Aα(F)/n).
Although for large n the variance of Tn,α is approximately equal to Aα (F )/n, the numer-
ator of this quantity Aα (F ) is called the asymptotic variance of Tn,α . The asymptotic
variance Aα (F ) has a value that depends on α and F , and it is a measure for the quality
(with respect to efficiency) of Tn,α as estimator of Tα (F ). In Figure 5.5 Aα (F ) is plotted as
a function of α for different distributions F . The horizontal line in the plots indicates the
minimum, hence optimal, value that the asymptotic variance of any location estimator
Figure 5.4: Boxplots of 1000 means and 1000 10%-trimmed means of 1000 independent samples of size 50 from the distribution 95/100 N(0, 1) + 5/100 Cauchy.
can reach for that distribution according to the asymptotic Cramér-Rao inequality. This
optimal value is called the asymptotic Cramér-Rao lower bound.⁵
The first thing that we see in Figure 5.5 is that the asymptotic variance in the case
of the Cauchy distribution converges very fast to infinity when α → 0. This implies that
the mean (α = 0) is a very bad location estimator for the Cauchy distribution. On the
other hand, for the normal distribution the asymptotic variance is minimal for α = 0;
as we know, no trimming is indeed optimal for the normal distribution. For the Laplace
(or double exponential) distribution the opposite holds: the median (α → 0.5) is optimal
in that case. From all pictures together it can be concluded that when little a priori
information about the underlying, true distribution is available, a trimming percentage
of about 10 to 20% is advisable. This is, because for all four distributions the asymptotic
variance for α = 0.1 to 0.2 is reasonably small. It should be noted that these pictures
are a bit misleading. It is theoretically possible to construct for every α a distribution for
which the α-trimmed mean behaves very badly. However, such distributions are rarely
⁵ The Cramér-Rao inequality says that for the asymptotic variance A(F) of every (reasonable) sequence of location estimators, A(F) ≥ 1/IF, where IF = ∫ (f′/f)² dF is the Fisher information for location. The Fisher information is therefore an absolute measure for judging the quality of an estimator.
Figure 5.5: Asymptotic variances of α-trimmed means as a function of α for four distributions (Cauchy, Laplace, normal, logistic). The horizontal line indicates the asymptotic Cramér-Rao lower bound.
met in practice.
From the plots in Figure 5.5 not only can it be deduced for which α the asymptotic
variance is minimal. It can also be seen whether the smallest possible value for any
estimator, the asymptotic Cramér-Rao lower bound, i.e. the horizontal line, is reached for
this α. If this is the case, then the trimmed mean with this value of α is for this distribution
asymptotically optimal, with respect to efficiency, among all location estimators. For
the Laplace, normal and logistic distribution we see that one of the trimmed means is
asymptotically optimal. In the case of the Cauchy distribution none of the trimmed means
is asymptotically optimal.
Above only the trimmed means were compared with each other. Other estimators
could be more efficient. Many estimators, including the M -estimators, are, like the
trimmed means, asymptotically normal and their asymptotic variance A(F ) and influ-
ence function IF(y, F ) relate to each other as
A(F) = ∫ IF(y, F)² dF(y),
In any case, it is always good practice to start with judgement of the data by making
plots, such as histograms, box plots, QQ-plots, etc. When it appears that the true
distribution, for example, has heavy tails, then a robust estimator should be preferred.
In the above described adaptive estimation procedure the choice for an estimator is
based on an initial estimate of the true underlying distribution, combined with theo-
retical results about the behavior of estimators under different distributions. A more
straightforward way of adaptation is to base the choice on estimates of the variances of
the estimators, for instance on bootstrap estimates. An estimator with small estimated
variance then should be preferred. This approach is illustrated in the next example. It
should be stressed that also here it concerns a data analysis and it is difficult to justify
the approach in a formal, mathematical way. The results should therefore be taken with
a pinch of salt. It is important to note that in any data analysis report it should be men-
tioned which procedure is followed. When an estimator is chosen based on a preliminary
data analysis, this should be described, preferably including the results of this preliminary
analysis.
Example 5.3 The tau is a heavy lepton (electron type particle) that was discovered
in the 1970s. 6 It has a very short lifetime and decays into more stable particles. For
a subclass of all ‘tau events’, named B1 , these are the Bρ , Bπ , Bε and Bµ particles
and a class Bother of more complicated particles.
[Diagram: all tau events, with the B1 events subdivided into Bρ, Bπ, Bε, Bµ and Bother.]
                    B1       Bρ       Bπ       Bε       Bµ       D
weighted mean       86.72    22.31    10.78    17.68    17.70    18.25
                    (.27)    (.86)    (.60)    (.41)    (.38)    (1.22)
25% trimmed mean    85.88    22.32    10.22    18.26    18.18    16.90

Table 5.4: Estimates of the percentages of particles; the estimated standard deviations of the weighted means are given in parentheses.
its percentage did not exceed 12%. The question was whether Bρ , Bπ , Bε , Bµ and
Bother form the complete list of particles belonging to B1 , or whether there exists
another particle that had escaped the physicists’ attention until that time.
To investigate this the difference

D = B1 − (Bρ + Bπ + Bε + Bµ)

was considered. A percentage of D particles larger than 12% would indicate the existence of an undis-
covered particle. In Table 5.4 different estimates for the percentages of particles are
given based on these data. The first row of Table 5.4 gives estimates of the percent-
ages based on weighted means of the data, with weights inversely proportional to
the reported variances; the second row gives the estimated standard deviations of
these weighted means. The third row gives estimates of the percentages based on
0.25-trimmed means of the data.
Under the assumption that the observations come from a normal distribution, a
98% confidence interval for D is equal to [15.41, 21.09]. Can we trust this interval?
Dissatisfaction with the normality assumption and doubts about the reported stan-
dard deviations motivated a bootstrap analysis. The standard deviations reported
by the laboratories were neglected and an estimate for a confidence interval for D
was obtained based on the 0.25-trimmed means from the third row of Table 5.4.
The choice for the 0.25-trimmed means was made based on the bootstrap es-
timates for the standard deviations that are given in Table 5.5. We see that the
estimated standard deviation for B1 increases with increasing trimming percentage,
trim B1 Bρ Bπ Bε Bµ D
0 0.35 0.43 0.46 0.55 0.94 1.39
0.1 0.41 0.41 0.54 0.54 0.69 1.18
0.2 0.46 0.40 0.60 0.37 0.54 1.04
0.25 0.43 0.39 0.64 0.35 0.54 1.04
0.3 0.48 0.38 0.61 0.38 0.49 1.01
0.5 0.66 0.37 0.73 0.40 0.36 1.11
Table 5.5: Bootstrap estimates for the standard deviations of (the difference of) trimmed means.
Table 5.6: 98% Bootstrap confidence intervals for the difference D of the 0.25-trimmed means (first
three rows) and weighted means (last row).
whereas it decreases for Bρ . The estimated standard deviation for D reaches a min-
imum for α = 0.3. A trimming percentage of 25% gives an almost equal value and
was chosen because it was simple to explain.
Finally, 98% bootstrap confidence intervals for the difference D of the 0.25-
trimmed means were computed following different methods. Without going into
detail, we give the results in Table 5.6. Based on these results 14.2 is a reasonable
lower bound for a one-sided 99% confidence interval. It seems that indeed another,
unknown, particle exists.
***
ψ(x)                          loc(X1, . . . , Xn)    Sn
x² − 1                        X̄                      S
sgn(|x| − Φ⁻¹(3/4))           med(Xi)                MADn
[x² − 1 − a]^b_{−b}
A popular, very robust estimator for spread is the median absolute deviation
MADn = (1/Φ⁻¹(3/4)) med(|Xi − med(X1, . . . , Xn)|).
The somewhat strange factor 1/Φ−1 (3/4) is not always used. Without this factor the me-
dian absolute deviation would for the case that the data come from a normal distribution
be an estimator for Φ−1 (3/4) times the standard deviation σ, instead of for σ itself. More
precisely, for the median absolute deviation as defined above it holds that
MADn →^{PF} σ, n → ∞,
when X1 , . . . , Xn is a sample from an N (µ, σ 2 ) distribution. For other than normal dis-
tributions, the median absolute deviation does not necessarily converge to the standard
deviation of the distribution (if it exists). To establish this, the factor 1/Φ−1 (3/4) should
be replaced for each situation by another factor. This illustrates how vague the concept
‘spread’ is. On the other side, the median absolute deviation has the advantage that the
estimator is also useful for distributions with heavy tails, like the Cauchy distribution, for
which the standard deviation is not defined.
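A minimal sketch of the median absolute deviation with the factor 1/Φ⁻¹(3/4) (scipy.stats.median_abs_deviation with scale='normal' should give essentially the same value):

    # Sketch: MAD_n = med(|X_i - med(X)|) / Phi^{-1}(3/4), consistent for sigma under normality.
    import numpy as np
    from scipy.stats import norm

    def mad(x):
        x = np.asarray(x, dtype=float)
        return np.median(np.abs(x - np.median(x))) / norm.ppf(0.75)

    rng = np.random.default_rng(14)
    x = rng.normal(loc=5.0, scale=2.0, size=10_000)
    print("sample sd:", x.std(ddof=1), "  MAD_n:", mad(x))   # both close to sigma = 2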
Finally, we mention M -estimators for spread. These are defined as the solution Sn of
an equation of the form
Σ_{i=1}^{n} ψ((Xi − loc(X1, . . . , Xn))/Sn) = 0,
Chapter 6
Nonparametric methods
When one uses one of the classical tests, such as the t- or F -test, in the situation where
the assumptions—like normality—that are necessary for this test do not exactly hold,
then the actual level and power of the test will be different from the level and power that
are predicted by the theory. In the usual asymmetrical testing framework it is considered
particularly undesirable to have the real level of the test larger than the α that was chosen
beforehand. The so-called nonparametric (or distribution free) tests remedy this
problem. For these tests the actual level is always exactly α for a broad class of possible
underlying distributions. In the terminology of the foregoing chapter these tests are very
robust with respect to the level of the test. Moreover, a number of these tests turns out
to have a reasonably high power for a broad set of alternatives. This is why they are good
competitors of the classical tests, like the t-test. The Wilcoxon-tests, for example, are a
little less efficient than the t-test when the observations are exactly normally distributed,
but they result in a large gain in power when the data do not come from a normal
distribution.
In this chapter we discuss some nonparametric tests for the one-sample problem, the
two-sample problem and the correlation in bivariate data. For nonparametric tests for
other problems we refer to the literature. For simplicity we omit in the notation in this
chapter, except in Section 6.2, the dependence of the test statistics on n, e.g. a test
statistic Tn is denoted as T .
6.1 The one-sample problem
6.1.1 The sign test
Let X1, . . . , Xn be a sample from a distribution with median m, for which the probability of an observation being exactly equal to m is 0. To test the null hypothesis H0: m = m0 against the alternative H1: m ≠ m0, the sign test uses the test statistic T, the number of observations Xi that are larger than m0.
Under the null hypothesis Xi has median m0 , so that under the null hypothesis T has
a binomial distribution with parameters n and ½. Because the distribution of the test
statistic turns out to be the same for every possible distribution of the null hypothesis,
the test is distribution free or nonparametric. The statistic T is called nonparametric
under the null hypothesis.
A relatively large value of T indicates that the true median is larger than m0, whereas
a relatively small value of T is an indication that the true median is smaller than m0 . In
the given two-sided problem the null hypothesis is therefore rejected for large and small
values of T . In terms of p-values: H0 is rejected when the observed value t of T satisfies
PH0(T ≤ t) ≤ ½α, or PH0(T ≥ t) ≤ ½α.
Example 6.1 To judge the level of an exam a group of 13 randomly chosen students
was asked to make the exam beforehand. The results were:
3.7, 5.2, 6.9, 7.2, 6.4, 9.3, 4.3, 8.4, 6.5, 8.1, 7.3, 6.1, 5.8. We test, with level α = 0.05,
the null hypothesis that the median of the (future) exam results in the population
of all students is ≤ 6, against the alternative that the median is > 6. We thus have
a right-sided test. The value t = 9 yields the p-value PH0 (T ≥ 9) = 0.13. This is
larger than α and hence, the null hypothesis is not rejected. We may conclude that
the level of the exam is probably suitable.
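A minimal sketch of this computation (T is the number of grades above m0 = 6, and the p-value follows from the Bin(13, ½) distribution):

    # Sketch: right-sided sign test for the exam data of Example 6.1.
    import numpy as np
    from scipy.stats import binom

    grades = np.array([3.7, 5.2, 6.9, 7.2, 6.4, 9.3, 4.3, 8.4, 6.5, 8.1, 7.3, 6.1, 5.8])
    m0 = 6
    t = np.sum(grades > m0)                      # observed value of T
    p_value = binom.sf(t - 1, grades.size, 0.5)  # P_H0(T >= t)
    print("t =", t, "  p-value =", round(p_value, 2))   # t = 9, p-value = 0.13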
***
In the foregoing example we were lucky that there were no grades exactly equal to
6. If there had been grades equal to 6, this would have violated the assumption that the
probability of an observation being equal to the median is 0. For less ‘lucky’ results the
sign test can be adjusted in a simple way: leave out the values for which Xi = m0 and
perform the test as described above. This yields a test which is nonparametric conditional
on the number of the Xi that is equal to m0 .
6.1.2 The signed rank test
The signed rank test of Wilcoxon concerns observations X1, . . . , Xn from a distribution with a continuous distribution function F that is symmetric around a point m. The hypotheses are

H0: m = m0 against H1: m ≠ m0.
Note that, for a distribution that has a unique median and is symmetric, the point of
symmetry is equal to the median, so that in this case the hypotheses also concern the
median. To see how the test statistic is constructed, let Zi = Xi − m0 . Since F has a
continuous distribution function, the n values |Z1 |, . . . , |Zn | are different from each other
with probability 1. Let (R1 , . . . , Rn ) denote the vector of ranks of |Z1 |, . . . , |Zn | in the
corresponding vector of order statistics. That is, |Zi| is the Ri-th in size (in increasing order) of the |Z1|, . . . , |Zn|. The signed rank test is based on the test statistic

V = Σ_{i=1}^{n} Ri sgn(Xi − m0).
Each of the n signs sgn(Xi − m0 ) is equal to 1 or −1. A value of 1 is an indication that the
true symmetry point is larger than m0 . This indication is stronger when |Xi − m0 |, and
hence Ri , is larger. Relatively large values of V thus indicate that the true distribution of
X1 , . . . , Xn has a larger point of symmetry than m0 , whereas relatively small values of V
point to the opposite. Critical values and p-values again follow from the distribution of
V under the null hypothesis. From the following theorem it follows that the signed rank
test is nonparametric.

Theorem 6.1 Let Z1, . . . , Zn be independent and identically distributed with a distribution that is continuous and symmetric around 0. Then:
(i) The vectors (R1, . . . , Rn) and (sgn(Z1), . . . , sgn(Zn)) are independent.
(ii) The vector of ranks (R1, . . . , Rn) is distributed as a random permutation of the numbers {1, 2, . . . , n}.
(iii) The variables sgn(Z1), . . . , sgn(Zn) are independent and identically distributed with P(sgn(Zi) = −1) = P(sgn(Zi) = 1) = ½.

Proof. Because Zi is continuously and symmetrically distributed around 0, we have for every x > 0

P(|Zi| ≤ x) = 2P(0 < Zi ≤ x)

and

P(sgn(Zi) = 1) = P(sgn(Zi) = −1) = ½.

From this last equation property (iii) follows immediately.
Furthermore, we may conclude that for every x > 0

P(|Zi| ≤ x, sgn(Zi) = 1) = P(0 < Zi ≤ x) = ½ P(|Zi| ≤ x) = P(|Zi| ≤ x) P(sgn(Zi) = 1).

Analogously,

P(|Zi| ≤ x, sgn(Zi) = −1) = P(|Zi| ≤ x) P(sgn(Zi) = −1).

This proves that |Zi| and sgn(Zi) are independent. But then the same holds for the vectors (|Z1|, . . . , |Zn|) and (sgn(Z1), . . . , sgn(Zn)). This implies (i), because the ranks (R1, . . . , Rn) are a function of the first vector only.
Property (ii) is an immediate consequence of the fact that |Z1 |, . . . , |Zn | are indepen-
dent and identically distributed with a continuous distribution function.
Let R̃1 , . . . , R̃n be a random permutation of the numbers {1, 2, . . . , n}, and Q1 , . . . , Qn
a sequence of independent random variables with P(Qi = −1) = P(Qi = 1) = ½, which
is also independent from the permutation R̃1 , . . . , R̃n . Then, according to Theorem 6.1,
under the null hypothesis the test statistic V of the signed rank test satisfies
V =^D Σ_{i=1}^{n} Qi R̃i.
Hence, the signed rank test is nonparametric. Moreover, it easily follows that under H0
V =^D Σ_{i=1}^{n} Qi R̃i =^D Σ_{i=1}^{n} (−Qi)R̃i = − Σ_{i=1}^{n} Qi R̃i =^D −V,
from which it follows that under H0 the statistic V is symmetrically distributed around
0.
Critical values and p-values of the signed rank test can be deduced from Theorem 6.1.
They are tabulated and the test is standardly available in statistical software packages.
For large n a normal approximation can be applied. Under H0 it holds that
V / √(n(n + 1)(2n + 1)/6) →^D N(0, 1), n → ∞.¹
An equivalent test statistic, which is frequently used instead of V , is
V+ = Σ_{i: Xi > m0} Ri,
which is the sum of the ranks of only the positive differences Zi = Xi −m0 , where the ranks
are as defined above, i.e. the ranks in the ordered sequence of all absolute differences |Zi |.
A test based on V+ is also nonparametric. Although the distribution of V+ under H0 is
different from that of V , for a particular data set testing a pair of hypotheses based on
V+ gives the same p-value, and hence the same conclusion, as testing the same pair of
hypotheses based on V . It holds that
V+ = ½V + n(n + 1)/4,

and under H0

(V+ − n(n + 1)/4) / √(n(n + 1)(2n + 1)/24) →^D N(0, 1), n → ∞.
¹ The arrow with D on top denotes convergence in distribution: ‘Xn →^D X, n → ∞’ means that for every x it holds that P(Xn ≤ x) → P(X ≤ x), when n → ∞.
In practice it may happen that one or more of the observations equal m0 . In that
case, the procedure is the same as with the sign test: the observations that equal m0 are
deleted and the test is performed as described above, conditionally on the number of the
observations that equal m0 .
It also often occurs that groups of the same values, called ties, are present in the
sample. The occurrence of ties is contrary to the assumption of the continuity of the
distribution function that we made up to now. The signed rank test is adapted to this
situation as follows. One starts with deleting all values Xi that equal m0 , or equivalently,
all values Zi = 0. Next, one assigns adjusted ranks to the remaining values: every member
of a group of equal values gets a ‘pseudo rank’, namely the mean of the ranks that the
group members would have gotten if they all would have been different. For example,
the ranks of (3, 2, 2, 5, 3, 3) in the corresponding vector of order statistics (2, 2, 3, 3, 3, 5)
become (4, 1½, 1½, 6, 4, 4). The test statistic V is then computed as before. Under H0 and
the given pattern of ties this statistic still has a fixed distribution, which depends on the
pattern of the ties (the number and sizes of the groups of equal observations). Theorem 6.1
is no longer applicable and the critical values of the test need to be adjusted. In many
statistical packages for large n a normal approximation with the correct adjustment is
used by default.
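A minimal sketch of computing V and V+ with removal of zeros and average ranks for ties, using the exam data of Example 6.1 once more for illustration (scipy.stats.wilcoxon provides an equivalent test based on the signed ranks):

    # Sketch: signed rank statistics V = sum(R_i * sgn(Z_i)) and V+ = sum of ranks of positive Z_i.
    import numpy as np
    from scipy.stats import rankdata

    def signed_rank(x, m0):
        z = np.asarray(x, dtype=float) - m0
        z = z[z != 0]                          # delete observations equal to m0
        ranks = rankdata(np.abs(z))            # average ('pseudo') ranks in case of ties
        v = np.sum(ranks * np.sign(z))
        v_plus = np.sum(ranks[z > 0])
        return v, v_plus

    x = [3.7, 5.2, 6.9, 7.2, 6.4, 9.3, 4.3, 8.4, 6.5, 8.1, 7.3, 6.1, 5.8]   # exam data
    v, v_plus = signed_rank(x, m0=6)
    print("V =", v, "  V+ =", v_plus, "  check: V+ - V/2 =", v_plus - v / 2)  # equals n(n+1)/4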
6.2 Asymptotic efficiency
A test that has more power than another test for the same number of observations is
said to be more efficient. In other words, a more efficient test needs fewer observations
to obtain the same power as the less efficient test. Like in the case of (robust) estimators
we will consider the asymptotic case for the number of observations n tending to infinity,
and we will see that the asymptotic variance (now of the test statistic) plays a role in
determining the asymptotic efficiency.
To illustrate this we consider the one-sample problem again. Let X1 , . . . , Xn be ob-
servations from a distribution F . According to H0 F belongs to a class F0 (for example,
all distributions with median m0 ), whereas according to H1 F belongs to a class F1 . The
power of a test is the function

π(F) = PF(H0 is rejected).
For a test to be good π(F ) should be small when F ∈ F0 and large when F ∈ F1 .
Because in this chapter we are specifically interested in the situation that F0 and F1 are
large classes of probability distributions, it is rather difficult to compare the power of two
tests for every possible F . This is why we will only discuss the so-called shift alternatives.
These are alternatives that can be obtained by shifting a distribution that belongs to F0
over a certain distance θ. Such alternatives therefore have a location that is shifted over
this distance θ, whereas the scale stays the same. To fix ideas, we limit the discussion
to right-sided testing problems, i.e. θ > 0; the reasoning is the same for left-sided and
two-sided problems.
Choose a fixed F0 ∈ F0 , and assume that the distribution that is shifted with respect to
F0 over a positive distance θ belongs to the alternative hypothesis. The shifted distribution
is denoted by Fθ (·) := F0 (· − θ). If, for example, one would consider hypotheses about
the median, F0 could be a distribution with median m0 , and then Fθ would have median
m0 + θ. The power for the class of shift alternatives Fθ can be written as

πn(θ) := π(Fθ) = PFθ(H0 is rejected),

where the index n expresses the dependence on the number of observations.
For a suitable right-sided test the value πn (0) of the power function under H0 is small,
whereas πn (θ) is ‘large’ for θ > 0, i.e. under the alternative hypothesis.
Suppose that H0 is rejected for a large value of the test statistic Tn . Assume that this
test statistic is asymptotically normally distributed in the sense that
(6.1)   √n (Tn − µ(θ)) →^{D,θ} N(0, σ²(θ)), n → ∞,
for suitable functions µ(θ) and σ 2 (θ) (the asymptotic mean and asymptotic variance of
Tn ). This means: for large n, Tn is approximately distributed as N (µ(θ), σ 2 (θ)/n), when
θ is the true value of the parameter.
Rejecting the null hypothesis for large values of Tn is equivalent to rejecting the null hypothesis if √n(Tn − µ(0))/σ(0) > cn for a suitably chosen critical value cn. Due to the asymptotic normality (6.1) with θ = 0, the level of this test satisfies

πn(0) = PF0(√n(Tn − µ(0))/σ(0) > cn) ≈ 1 − Φ(cn),

where Φ is the distribution function of the standard normal distribution. From this we
see that to make the level of the test equal to a chosen α, the critical value cn should be
equal to ξ1−α , the 1 − α-quantile of the N (0, 1) distribution. The test thus becomes
“Reject H0 if √n(Tn − µ(0))/σ(0) > ξ1−α”.
For local alternatives θn = h/√n one obtains

(6.3)   πn(θn) = πn(h/√n) ≈ 1 − Φ(ξ1−α − h µ′(0)/σ(0))

for large n.² The value µ′(0)/σ(0) is called the slope of the test. From (6.3) it can be
seen that when h increases, the power of the test increases faster when the slope is larger.
This is illustrated in Figure 6.1. Hence: the larger the slope, the better the test.
Figure 6.1: The asymptotic power of two tests with different slope (slope 1 and slope 2) as a function of h; at h = 0 both powers equal the level α.
Let a second test based on statistic T̃n have power π̃n which satisfies
π̃n(θn) = π̃n(h/√n) ≈ 1 − Φ(ξ1−α − h µ̃′(0)/σ̃(0)).
Then the asymptotic relative efficiency of the test Tn with respect to T̃n is defined as the
square of the quotient of the slopes of the tests:
(6.4)   are(Tn, T̃n) = ( (µ′(0)/σ(0)) / (µ̃′(0)/σ̃(0)) )².
If are(Tn , T̃n ) > 1, then Tn is more efficient than T̃n ; if are(Tn , T̃n ) < 1, then T̃n is more
efficient.
Note that, because σ(0) in (6.4) is the asymptotic standard deviation of the test
statistic under the null hypothesis, the asymptotic variance of the test statistic Tn plays
² When µ is differentiable in 0,

√n (µ(h/√n) − µ(0)) = h · (µ(h/√n) − µ(0)) / (h/√n) → h µ′(0), n → ∞.
a role in determining the asymptotic efficiency, as was already indicated in the beginning
of this section. The smaller the asymptotic variance, the more efficient the test.
Like in the case of estimators, where the asymptotic relative efficiency could be ex-
plained in terms of the number of observations needed for two estimators to have ap-
proximately equal asymptotic variances, it is possible to interpret the asymptotic relative
efficiency of two tests in terms of sample sizes. Namely, it is the ratio of the numbers
of observations needed for the tests to have approximately equal asymptotic power, √ for
a given level α. To see this, we note that for a sequence of alternatives θn = h/ n it
follows from (6.3) that the first test Tn with n observations reaches a power
πn (θn ) ≈ 1 − Φ( ξ1−α − (µ′(0)/σ(0)) √n θn ),
and that for the second test T̃n with ñ observations the power is
π̃ñ (θn ) ≈ 1 − Φ( ξ1−α − (µ̃′(0)/σ̃(0)) √ñ θn ).
Therefore, the powers of the two tests are approximately equal when (µ′(0)/σ(0)) √n = (µ̃′(0)/σ̃(0)) √ñ, or when

ñ/n = ( (µ′(0)/σ(0)) / (µ̃′(0)/σ̃(0)) )² = are(Tn , T̃n ).
In words: the second test needs are(Tn , T̃n ) times as many observations as the first test to obtain approximately the same power. If are(Tn , T̃n ) > 1, then the first test is to be preferred.
What has been neglected so far, is the fact that are(Tn , T̃n ) in fact depends on the
type of shift alternatives Fθ that is considered. Indeed, we see from (6.1) and (6.4) that
are(Tn , T̃n ) depends on the asymptotic distribution of the two test statistics for θ = 0,
which in turn depends on F0 . Table 6.1 gives the relative efficiencies of the one-sample
tests that we have discussed for a couple of shift alternatives. From the table we see that
none of the three tests is optimal in all four cases. The sign test, which is based on a
very simple principle, turns out to be the best against Laplace shift alternatives. The
Wilcoxon signed rank test takes a middle position in between the t-test and the sign test.
For the signed rank test the loss of efficiency with respect to the t-test in case the true
underlying distribution is exactly normal, is small (3/π = 0.955 ≈ 1), whereas the gain
with respect to the t-test for Laplace shift alternatives is considerable (3/2=1.5). This
makes the signed rank test a serious competitor of the t-test: besides the fact that this test has the correct level of significance for a large class of null-distributions, it also has good efficiency properties!

              N(0,1)               logistic               uniform               Laplace
          t     s     w        t      s     w         t     s     w         t     s     w
    t     1                    1                      1                     1
    s    2/π    1    2/3     π²/12    1    3/4       1/3    1    1/3        2     1    4/3
    w    3/π   3/2    1      π²/9    4/3    1         1     3     1        3/2   3/4    1

Table 6.1: Asymptotic relative efficiencies (row-variable with respect to column-variable) of the t-test (t), sign test (s) and Wilcoxon signed rank test (w) for shift alternatives of different F0.
An advantage of the sign test which is not apparent from the table is that this test is nonparametric against all alternatives, whereas the Wilcoxon signed rank test and the t-test require symmetry and normality, respectively. This makes the sign test the best choice in many situations.
We have restricted our comparison of the different tests to shift alternatives of a couple
of distributions. Although these are in most cases the alternatives of interest, in some
contexts it may be necessary to know how the tests compare for other alternatives.
6.3 Two-sample problems

Usually Xi and Yi within a pair (Xi , Yi ) of a paired data set are not independent,
although the differences Zi = Yi − Xi for the different i often are independent. When the
latter holds, one could test whether one sample is stochastically larger³ than another one
by applying a one-sample test to Z1 , . . . , Zn . For instance, with the sign test one could
test whether the median of the distribution of the differences is significantly different from
0. When Xi and Yi are independent, then under the null hypothesis that the two samples
have the same distribution, the Zi are automatically symmetrically distributed around
zero. Wilcoxon’s symmetry test, the signed rank test, is in this case the obvious test.
From now on we shall limit ourselves in this section to two unpaired, independent
samples X1 , . . . , Xm and Y1 , . . . , Yn . Suppose that X1 , . . . , Xm and Y1 , . . . , Yn have true
distributions F and G, respectively. Consider the testing problem
H0 : F = G
H1 : F ̸= G.
In the ‘classical model’ for this problem the two samples are both normally distributed
with expectations µ and ν and the testing problem is H0 : µ = ν against H1 : µ ̸= ν.
If, in addition, it is assumed that the two samples have equal variances, then the test is
based on the statistic
T = (X̄ − Ȳ ) / ( S_{X,Y} √(1/m + 1/n) ),

where S_{X,Y} = √(S²_{X,Y}) and S²_{X,Y} = (1/(m + n − 2)) { ∑_{i=1}^{m} (Xi − X̄)² + ∑_{i=1}^{n} (Yi − Ȳ )² }. Under the null hypothesis this statistic has a tm+n−2 distribution. When the variances are not assumed to be equal, then a test
can be based on
T̃ = (X̄ − Ȳ ) / √( S²_X /m + S²_Y /n ).
This statistic does not have a t-distribution, but approximations for the critical values
are available.
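Both classical statistics are available in standard software. A minimal sketch with simulated, purely illustrative data (the sample sizes echo Example 6.5 below, but the numbers themselves are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=31)   # illustrative sample X_1,...,X_m
y = rng.normal(loc=9.0, scale=3.0, size=34)    # illustrative sample Y_1,...,Y_n

# T: pooled variance, exact t_{m+n-2} null distribution
t_pooled = stats.ttest_ind(x, y, equal_var=True)
# T~: unequal variances, approximate (Welch) critical values
t_welch = stats.ttest_ind(x, y, equal_var=False)

print(t_pooled.statistic, t_pooled.pvalue)
print(t_welch.statistic, t_welch.pvalue)
```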
In the sequel we do not make the normality assumption. Unless stated otherwise, we
assume for convenience that F and G have continuous distribution functions. When this
assumption is not fulfilled, the presented tests can still be used, but need to be adapted.
In practice this only needs to be done when there is a relatively large number of ties.
³ When X has distribution function FX and Y has distribution function FY , then (the distribution of) Y is stochastically larger than (the distribution of) X if FY (x) ≤ FX (x) for all x.
Under the null hypothesis every possible set of ranks for the second sample in the combined sample is equally likely:

PH0 ( R(1) = r1 , . . . , R(n) = rn ) = 1 / (N choose n)

for every subset r1 < r2 < · · · < rn of {1, 2, . . . , N }. The distribution of the test statistic is given by

PH0 (W = w) = #( r1 < r2 < · · · < rn with ∑_{i=1}^{n} ri = w ) / (N choose n).
⁴ The notation [x] stands for the entier, or floor, of x: the largest integer not greater than x.
Hence, the test is nonparametric. Critical values and p-values can be found by straight-
forward computation, from tables or with statistical software. For large m and n a normal
approximation can be used, because
( W − ½ n(N + 1) ) / √( mn(N + 1)/12 )  →ᴰ  N(0, 1),   m, n → ∞,
(provided 0 < P (Xi < Yj ) < 1). An equivalent test statistic, which is often used instead
of W and which gives different critical values but exactly the same p-values as W , is
U = ∑_{i=1}^{m} ∑_{j=1}^{n} 1{Xi < Yj} = W − ½ n(n + 1).

For U the corresponding normal approximation is

( U − ½ mn ) / √( mn(N + 1)/12 )  →ᴰ  N(0, 1),   m, n → ∞.
It should be noted that there is no unanimity about the definition of the Mann-
Whitney test. Not only are both the above-defined equivalent test statistics W and U
used, but also the test statistics W̃ and, equivalently to this, Ũ , where in their definitions
the roles of the first and second sample are reversed with respect to the definitions of W
and U . This means that for two-sided testing problems testing with W̃ or Ũ yields the
same p-value as testing with W or U . Note that for one-sided testing problems testing
with W̃ or Ũ instead of W or U results in the same p-values, but the critical regions lie
on the other side.
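For concreteness, a small sketch with made-up numbers that computes W and U directly from the ranks and compares them with a library routine; as remarked above, packages differ in which of the equivalent statistics they report, and the library p-value here is exact rather than the normal approximation:

```python
import numpy as np
from scipy import stats

x = np.array([2.1, 3.4, 1.7, 4.0, 2.8])          # made-up first sample X_1,...,X_m
y = np.array([3.9, 4.4, 2.5, 5.1, 3.2, 4.8])     # made-up second sample Y_1,...,Y_n
m, n = len(x), len(y)
N = m + n

ranks = stats.rankdata(np.concatenate([x, y]))   # ranks in the combined sample
W = ranks[m:].sum()                              # sum of the ranks of the Y's
U = W - n * (n + 1) / 2                          # #{(i,j): X_i < Y_j} when there are no ties

# normal approximation for the right p-value of W
z = (W - n * (N + 1) / 2) / np.sqrt(m * n * (N + 1) / 12)
p_right = 1 - stats.norm.cdf(z)

# library routine; passing y first makes its U statistic equal to the U above
res = stats.mannwhitneyu(y, x, alternative="greater")
print(W, U, p_right, res.statistic, res.pvalue)
```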
When there are groups of equal values, ties, in the combined sample X1 , . . . , Xm , Y1 , . . . ,
. . . , Yn , the test can be adjusted as follows. First, pseudo ranks are defined in the follow-
ing way. Each element of a tie gets a pseudo rank which is the mean of the ranks the
elements of the tie would have gotten if they would have been different from each other.
Observations that have no equal get their normal rank. Let R1 , . . . , Rn be the pseudo
ranks of Y1 , . . . , Yn in the combined samples X1 , . . . , Xm , Y1 , . . . , Yn . The test statistic is
W = ∑_{i=1}^{n} Ri . It can be proved that, given the pattern of ties, W has a fixed distribution
under H0 , so that the test is nonparametric given the pattern of ties.
To see what is meant with the ‘pattern of ties’, suppose that K different values are
present in the sample X1 , . . . , Xm , Y1 , . . . , Yn , and that the smallest value occurs T1 times,
the second smallest T2 times, . . . , and the largest TK times. The pattern of ties is then
described by the vector (K, T1 , . . . , TK ). The conditional distribution of (R1 , R2 , . . . , Rn )
given (K, T1 , . . . , TK ) is, under the null hypothesis H0 : F = G, the same as the distribu-
tion of n numbers randomly selected without replacement from the set of all pseudo ranks of the combined sample. For large m and n a normal approximation with a correction for ties can be used:

( W − ½ n(N + 1) ) / √( mn (N³ − ∑_{i=1}^{K} Ti³) / (12 N (N − 1)) )  →ᴰ  N(0, 1),   m, n → ∞.

Table 6.2: The second sample (method B) together with the ranks of its elements in the combined sample.

B:     5    8    7    22    6    7    2    6    6   20    7    6
rank:  9   36  26.5   65  15.5 26.5  1.5 15.5 15.5  64  26.5 15.5
B:     9   13    4    3    6    7    7    4    6    8    6    6
rank: 45   61    6    3   15.5 26.5 26.5   6  15.5  36  15.5 15.5
B:     8    4   11    9    7    6   17    7    4    6
rank: 36    6   56   45  26.5 15.5  63  26.5   6  15.5
Example 6.5 In Example 3.6 data were given concerning flaws in samples of sizes
31 and 34 of pieces of textile woven with methods A and B. Between the samples
there is a mean difference of about 2 flaws, with method B yielding fewer flaws.
Whether the difference is systematic or occurred by chance (namely, obtaining just
these perhaps not representative samples), can be further investigated by means of
a test. Because the data are clearly not normally distributed, the use of a t-test
will be misleading. We chose the Mann-Whitney (Wilcoxon two-sample) test. The
ordered combined sample is:
2, 2, 3, 4, 4, 4, 4, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8,
8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 11, 12, 12,
13, 14, 17, 20, 22.
The second sample together with its ranks in the combined sample are given in
Table 6.2.
We test H0 : “The location of the second sample produced with method B is
larger than the location of the first sample produced with method A” against H1 :
“The location of the second sample produced with method B is smaller than the
location of the first sample produced with method A”. The test statistic W has value
w = 885. The normal approximation for the left p-value PH0 (W ≤ 885) is equal to
0.0009. Hence, H0 is rejected for every reasonable value of α. This indicates that
the location of the second sample produced with method B is considerably smaller
than that of the first sample.
Use of the test statistic Ũ with H0 and H1 reversed, would have given ũ = 764,
and the same (but now right instead of left) p-value 0.0009, and, hence, the same
conclusion.
Some doubt about the normal approximation is always justified. Although in
principle it is possible to compute the p-value exactly, as a control the bootstrap was
applied. From the set of pseudo ranks of the combined sample a sample of size n = 34 was drawn without replacement 6000 times. Of each sample the sum w∗ was
computed. The value w = 885 was the 0.0005-quantile of the empirical distribution
of the set of 6000 values of w∗ . The bootstrap approximation for the left p-value is
therefore equal to 0.0005, which confirms the normal approximation.
***
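The resampling scheme used in this example is easy to program. A sketch with a small made-up combined sample and a made-up observed value in place of the real flaw counts and w = 885:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)

combined = np.array([2, 3, 3, 4, 6, 6, 7, 7, 8, 9], dtype=float)  # made-up combined sample
m, n = 4, 6                                  # sample sizes (31 and 34 in the example)
pseudo_ranks = rankdata(combined)            # mean ranks within groups of ties

B = 6000                                     # number of resamples, as in the example
w_star = np.empty(B)
for b in range(B):
    # a sample of size n drawn without replacement from the pseudo ranks
    w_star[b] = rng.choice(pseudo_ranks, size=n, replace=False).sum()

w_obs = 33.0                                 # made-up observed value of W
left_p = np.mean(w_star <= w_obs)            # simulated left p-value
print(left_p)
```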
6.3.3 The Kolmogorov-Smirnov test

The two-sample Kolmogorov-Smirnov test compares the empirical distribution functions F̂m of X1 , . . . , Xm and Ĝn of Y1 , . . . , Yn by means of the test statistic D = max_x |F̂m (x) − Ĝn (x)|. An easy way to compute the test statistic when there are no ties follows from

D = max_{1≤i≤n} max{ |F̂m (Y(i) ) − Ĝn (Y(i) )| , |F̂m (Y(i) ) − (Ĝn (Y(i) ) − 1/n)| }

(6.5)     = max_{1≤i≤n} max{ | (R(i) − i)/m − i/n | , | (R(i) − i)/m − (i − 1)/n | }.
Here R(1) , . . . , R(n) are the ordered ranks of Y1 , . . . , Yn in the combined sample X1 , . . . , Xm ,
Y1 , . . . , Yn . From this formula it follows that the value of D is completely determined by
the positions taken by Y1 , . . . , Yn in the ordered combined sample X1 , . . . , Xm , Y1 , . . . , Yn . There are (m + n choose n) possible groups of n positions. Under the null hypothesis these are all
equally likely. This is why the test is nonparametric. The null hypothesis is rejected for
large values of D. The p-values again can be found in tables, from computer packages or
by direct computation. Figure 6.2 illustrates the idea of the Kolmogorov-Smirnov test,
although for the data set used in this picture, the Kolmogorov-Smirnov test should be
adapted, because it contains relatively many ties.
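Formula (6.5) translates directly into code. A sketch for made-up samples without ties, compared with a library implementation of the two-sample Kolmogorov-Smirnov statistic:

```python
import numpy as np
from scipy.stats import rankdata, ks_2samp

x = np.array([1.2, 2.7, 0.4, 3.1, 2.2])       # made-up sample X_1,...,X_m
y = np.array([2.9, 3.6, 1.8, 4.2, 3.3, 2.5])  # made-up sample Y_1,...,Y_n (no ties)
m, n = len(x), len(y)

# ordered ranks R_(1) < ... < R_(n) of the Y's in the combined sample
R = np.sort(rankdata(np.concatenate([x, y]))[m:])
i = np.arange(1, n + 1)

# formula (6.5)
D = np.max(np.maximum(np.abs((R - i) / m - i / n),
                      np.abs((R - i) / m - (i - 1) / n)))

print(D, ks_2samp(x, y).statistic)            # the two values agree when there are no ties
```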
Figure 6.2: The empirical distribution function of the number of flaws in textile produced by methods
A and B from Example 3.6.
6.3.4 Permutation tests

Let T = T (X1 , . . . , Xm , Y1 , . . . , Yn ) be a test statistic for the two-sample problem, and let Z(1) ≤ · · · ≤ Z(m+n) denote the ordered values of the combined sample. Under H0 : F = G every assignment of these ordered values to the two samples is equally likely,

PH0 ( (X1 , . . . , Xm , Y1 , . . . , Yn ) = (Z(π1 ) , . . . , Z(πm+n ) ) | Z(1) , . . . , Z(m+n) ) = 1/(m + n)! ,

for every permutation π of {1, . . . , m + n}. A right-sided permutation test rejects H0 if the conditional probability, given Z(1) , . . . , Z(m+n) , of a value of T at least as large as the observed one is smaller than or equal to the level α. We thus see that given the values Z(1) , . . . , Z(m+n) ,
a permutation test is nonparametric. A left-sided or two-sided test is defined analogously.
Of course, the distribution of Z(1) , . . . , Z(m+n) does depend on the underlying distribution,
so that the unconditional distribution is not nonparametric. The unconditional level of the test, however, is smaller than or equal to α, because it is the expectation over Z(1) , . . . , Z(m+n) of conditional levels that are each at most α.
6.4 Tests for correlation

Figure 6.3: Scatter plot of systolic blood pressure against years since migration of 39 Peruvian men.
In the classical model for this situation it is assumed that the vectors (Xi , Yi ) originate
from a bivariate normal distribution. This distribution is completely determined by five
parameters: the two expectations, the two variances and the correlation coefficient ρ.
Only the last parameter concerns the dependence of the Xi and Yi variables. Therefore, hypotheses about the dependence between Xi and Yi can be formulated as hypotheses about ρ.
6.4.1 The rank correlation test of Spearman

Let R1 , . . . , Rn be the ranks of X1 , . . . , Xn and S1 , . . . , Sn the ranks of Y1 , . . . , Yn , and define Spearman's rank correlation coefficient as

rs = ∑_{i=1}^{n} (Ri − R̄)(Si − S̄) / √( ∑_{i=1}^{n} (Ri − R̄)² ∑_{i=1}^{n} (Si − S̄)² ).

Note that rs is just the sample correlation coefficient of the ranks of the two samples.
Here it holds that R̄ = S̄ = (n + 1)/2. If there are no ties in the samples, so that
{R1 , . . . , Rn } = {S1 , . . . , Sn } = {1, . . . , n}, we also have
∑_{i=1}^{n} Ri² = n(n + 1)(2n + 1)/6 ;     ∑_{i=1}^{n} ( Ri − R̄ )² = (n³ − n)/12 .
A little computation shows that rs then can be written as
rs = 1 − 6 ∑_{i=1}^{n} (Ri − Si )² / (n³ − n).
Therefore the test may as well be based on the statistic L = ∑_{i=1}^{n} (Ri − Si )². For this quantity a normal approximation holds under the null hypothesis, irrespective of the
existence of ties. More precisely, given the pattern of ties, L is asymptotically normally
distributed, with the asymptotic expectation and variance depending on this pattern:
(L − µn ) / σn  →ᴰ  N(0, 1),   n → ∞,

where

µn = EH0 L = (n³ − n)/6 − ∑ (Ui³ − Ui )/12 − ∑ (Vi³ − Vi )/12

and

σn² = σ²_{H0}(L) = ( (n − 1) n² (n + 1)² / 36 ) [ 1 − ∑ (Ui³ − Ui )/(n³ − n) ] [ 1 − ∑ (Vi³ − Vi )/(n³ − n) ].
In these formulas the Ui and the Vi give the patterns of ties for the two sequences of ranks;
U1 is the number of times that the smallest rank occurs in the sequence R1 , . . . , Rn , U2 is the number of times that the second smallest rank occurs, etc.
Of course, a p-value can also be approximated by means of simulation instead of with
a normal approximation.
6.4.2 The rank correlation test of Kendall

Kendall's rank correlation coefficient is defined (when there are no ties) as τ = 4Nτ /(n(n − 1)) − 1, where the statistic Nτ is equal to the number of pairs (i, j) with i < j for which either Xi < Xj and Yi < Yj , or Xi > Xj and Yi > Yj .
Sometimes the statistic Nτ itself is used as test statistic. This statistic lies between 0
and n(n − 1)/2, and the test based on Nτ is equivalent to the test based on τ .
6.4.3 Permutation tests

A permutation test can be based on any statistic T = T (X1 , . . . , Xn , Y1 , . . . , Yn ) that measures dependence, for instance rs or τ . Under the null hypothesis of independence, given X1 , . . . , Xn and the ordered values Y(1) , . . . , Y(n) , each of the n! ways to assign the Y -values to the X-values is equally likely, so that T has a known conditional distribution. A permutation test based on T rejects the null hypothesis for an observed value t
of T if
PH0 ( T ≥ t | X1 , . . . , Xn , Y(1) , . . . , Y(n) ) = #( permutations π with T (X1 , . . . , Xn , Y(π1 ) , . . . , Y(πn ) ) ≥ t ) / n!
is smaller than or equal to α.
In case n! is too large to compute the probability PH0 (T ≥ t | X1 , . . . , Xn , Y(1) , . . . , Y(n) ) in practice, this probability can be approximated by the fraction of B randomly drawn permutations π for which T (X1 , . . . , Xn , Y(π1 ) , . . . , Y(πn ) ) ≥ t.
Example 6.7 (Continuation of Example 6.6.) The rank correlation between the
variables systolic blood pressure and migration is equal to −0.17. The rank cor-
relation as well as the ordinary correlation is small, so that there seems to be no
relationship between the two variables. A bootstrap approximation (with B = 1000)
for the left p-value of the permutation test based on Spearman’s rank correlation
coefficient is equal to 0.151. This confirms that there is no significant correlation
between systolic blood pressure and years since migration.
***
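A permutation test based on Spearman's rank correlation coefficient, approximated with B random permutations as above, can be sketched as follows; the data here are simulated stand-ins, not the blood pressure and migration measurements of the example:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
x = rng.uniform(0, 40, size=39)               # simulated stand-in for years since migration
y = 120 + rng.normal(scale=10, size=39)       # simulated stand-in for systolic blood pressure

r_obs, _ = spearmanr(x, y)                    # observed rank correlation r_s

B = 1000                                      # number of random permutations, as in the example
r_star = np.empty(B)
for b in range(B):
    r_star[b], _ = spearmanr(x, rng.permutation(y))

left_p = np.mean(r_star <= r_obs)             # approximate left p-value of the permutation test
print(r_obs, left_p)
```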
Chapter 7

Analysis of categorical data
In this chapter we discuss the analysis of categorical data that can be summarized in a
so-called contingency table which was already mentioned in Chapter 1. A contingency
table consisting of r rows and c columns generally looks like Table 7.1. In this table n studied objects are divided into the categories of two variables. The quantity Nij
is the number of objects that are grouped in category i of the row variable and in category
j of the column variable. We say that there are Nij objects in cell (i, j), and Nij is called
the cell frequency of cell (i, j). A dot instead of an index means that the sum is taken over
the quantities with all possible values of that index. For example, N·j is the sum of the
frequencies in the j-th column, that is: the total number of objects in the j-th category
of the column variable. The Ni· and N·j are called the marginal frequencies or, for short, marginals. The goal is to investigate whether there is a relationship between the row and
column variable, and if so, which categories are involved in this relationship.
Example 7.1 In a study on the relationship between blood group and certain dis-
eases a large sample of 8766 people consisting of patients with an ulcer, patients
with stomach cancer and a control group of people without these diseases was di-
vided into blood groups O, A or B. The division is given in Table 7.2.
***
7.1 Fisher’s exact test for 2 × 2 tables
Table 7.2: 3 × 3 contingency table of blood group against disease of 8766 people.
B1 B2 total
A1 k N1· − k N1·
A2 N·1 − k N2· − N·1 + k N2·
total N·1 N·2 n
Table 7.3: The general 2 × 2 contingency table with fixed marginals, determined by the value k in cell
(1, 1).
Table 7.4: 2 × 2 contingency table of 35 students scored against gender and study type.
Given the marginals, the null hypothesis that there is no relationship between the two variables is tested. The test statistic is the variable N11 , which has under the null
hypothesis a hypergeometric distribution with parameters (n, N1· , N·1 ) with probability
mass function

P (N11 = k) = (N1· choose k) (N2· choose N·1 − k) / (n choose N·1) .
This test is called Fisher’s exact test, since the distribution of the test statistic under the null hypothesis is exact. For testing independence between row and column variables the test is performed two-sided.
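A sketch with an invented 2 × 2 table of cell frequencies: the hypergeometric null distribution of N11 given the marginals, and the two-sided Fisher exact test via scipy:

```python
import numpy as np
from scipy import stats

table = np.array([[8, 12],        # invented 2x2 table of cell frequencies
                  [15, 5]])
n = table.sum()
N1_dot = table[0].sum()           # first row total
N2_dot = table[1].sum()           # second row total
N_dot1 = table[:, 0].sum()        # first column total

# hypergeometric null distribution of N11 given the marginals
k = np.arange(max(0, N_dot1 - N2_dot), min(N1_dot, N_dot1) + 1)
pmf = stats.hypergeom.pmf(k, n, N1_dot, N_dot1)

# two-sided Fisher exact test
odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")
print(pmf.sum(), p_value)         # the pmf sums to 1; p_value is the exact two-sided p-value
```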
7.2 The chi-square test for contingency tables

The data in a contingency table can be obtained in different ways; we distinguish three sampling models.

A. One single random sample of size n = 8766 is classified in two ways: according
to blood group and according to disease. In the model that belongs to this situation
the rc-vector of all cell frequencies Nij has an rc-nomial (here a 9-nomial) distribution with parameters n, p11 , . . . , prc , where

(7.1)   ∑_{i=1}^{r} ∑_{j=1}^{c} pij = 1.
B. The n objects originate from r independent (row) samples with fixed sizes N1· , . . . , Nr· (in the example one sample from each blood group), and each object is classified according to the column variable. The data in the i-th row of the table then form a sample from a c-nomial distribution with parameters Ni· , pi1 , . . . , pic , where pi· = ∑_{j=1}^{c} pij = 1, for i = 1, . . . , r.
The relation (7.2) still holds, but in this case only the marginals N·1 , . . . , N·c , the column
totals in the table, are random variables, whereas the marginals N1· , . . . , Nr· , the row
marginals in the table, are not. This is because the latter are the fixed sample sizes of
the r samples that were determined by the investigator.
C. In this model the role of the variables are interchanged with respect to Model B:
choose c independent random (column) samples (in the example one of size 1796 from
people with an ulcer, one of size 883 from stomach cancer patients and one of size 6087
from people who do not have an ulcer nor suffer from stomach cancer), and classify the
objects according to the row variable (blood group). The data in the table then originate
from c independent samples from an r-nomial distribution with parameters N·j , p1j , . . . , prj , j = 1, . . . , c, for the j-th sample, where

(7.4)   p·j = ∑_{i=1}^{r} pij = 1,   j = 1, . . . , c.
In this case the N·j are the fixed sample sizes and thus not random.
Each of the three models for an r × c contingency table is a set of multinomially distributed random variables. As mentioned above, one is most often interested in investi-
gating whether there is a relationship between the row and column variable. Hypotheses
about such relationship can be translated into hypotheses about the cell probabilities pij .
The chi-square test for a contingency table is in a way a generalization of the chi-square
test for goodness of fit as discussed in Chapter 3. In Chapter 3 only one k-nomial sam-
ple was present, and hence r = 1. The theory is the same though. These approximate
chi-square tests are based on the following two theorems, which we will not prove.
Theorem 7.1 Let, for m = 2, 3, . . ., the ℓ-vector N_m = (N1 , . . . , Nℓ ) with ∑_{j=1}^{ℓ} Nj = m be multinomially distributed with parameters m, p1 , . . . , pℓ , which satisfy pj > 0 for all j and ∑_{j=1}^{ℓ} pj = 1. Then it holds that

∑_{j=1}^{ℓ} (Nj − m pj )² / (m pj )  →ᴰ  χ²_{ℓ−1} ,   m → ∞.
Theorem 7.2 The sum of s independent χ2ν distributed random variables has a χ2sν dis-
tribution.
When the sample sizes are sufficiently large Theorems 7.1 and 7.2 can be used to design
approximate chi-square tests for each of the three cases A, B and C. However, in general
the probabilities pij are unknown and need to be estimated. When these parameters are
estimated by means of maximum likelihood the number of degrees of freedom of the limit
distribution is decreased by the number of estimated parameters.
For Model A the hypothesis of independence of the row and column variables is tested;
in terms of cell probabilities this becomes

H0A : pij = pi· p·j for all i = 1, . . . , r, j = 1, . . . , c.

By Theorem 7.1 the statistic

(7.6)   ∑_{i=1}^{r} ∑_{j=1}^{c} (Nij − n pij )² / (n pij )

with pij = pi· p·j has approximately a χ²_{rc−1} -distribution when n is large. The maximum
likelihood estimators for the unknown probabilities pi· p·j are equal to
(7.7)   under H0A :   p̂i· p̂·j = (Ni· /n)(N·j /n),   i = 1, . . . , r,  j = 1, . . . , c.
The number of estimated parameters is (r − 1) + (c − 1). Inserting the estimated proba-
bilities of (7.7) in (7.6) yields the test statistic
(7.8)   X²_A = ∑_{i=1}^{r} ∑_{j=1}^{c} (Nij − n p̂ij )² / (n p̂ij ),

where

(7.9)   p̂ij = p̂i· p̂·j = Ni· N·j / n² ,   i = 1, . . . , r,  j = 1, . . . , c.

If n is large, X²_A has approximately a chi-square distribution with rc − 1 − (r − 1) − (c − 1) = (r − 1)(c − 1) degrees of freedom under H0A .
For Model B one tests the hypothesis H0B that the r row samples originate from r identical c-nomial distributions. This leads to the analogous test statistic

(7.13)   X²_B = ∑_{i=1}^{r} ∑_{j=1}^{c} (Nij − n p̂ij )² / (n p̂ij ),

where p̂ij is defined as in (7.9). If n is large this test statistic has approximately a chi-square distribution with r(c − 1) − (c − 1) = (r − 1)(c − 1) degrees of freedom under H0B .
In model C the roles of row and column variable interchange with respect to model
B and one tests the hypothesis that the c samples originate from c identical r-nomial
distributions. This leads to the test statistic

(7.17)   X²_C = ∑_{i=1}^{r} ∑_{j=1}^{c} (Nij − n p̂ij )² / (n p̂ij ),
where p̂ij is defined as in (7.9). If n is large this test statistic has approximately a chi-
square distribution with c(r − 1) − (r − 1) = (r − 1)(c − 1) degrees of freedom under
H0C .
It turns out that the three test statistics in (7.8), (7.13) and (7.17) are identical and
all have approximately a chi-square distribution with (r − 1)(c − 1) degrees of freedom
under the corresponding null hypothesis.
Theorem 7.3 Under the null hypotheses H0A , H0B , H0C in the Models A, B and C, and
for n, the row totals, and the column totals, respectively, sufficiently large, the test statistic
(7.18)   X² = ∑_{i=1}^{r} ∑_{j=1}^{c} (Nij − n p̂ij )² / (n p̂ij ),

with p̂ij defined as in (7.9), approximately has a χ²-distribution with (r − 1)(c − 1) degrees of freedom.
Example 7.3 For the data in Example 7.1 it is not known whether they originate
from one sample, from three samples from the three blood groups, or from three
samples from the different diseases. Let us assume that they come from one sample
of size 8766. We then assume Model A, and we wish to test whether the variable
blood group and the variable disease are independent. We test the hypothesis H0A
against the alternative that pij ̸= pi· p·j for at least one pair (i, j). We use the test
statistic X 2 . The null hypothesis is rejected for large values of X 2 . The value of
X 2 is in this case equal to 40.54; the number of degrees of freedom is 4. The (right)
p-value is smaller than 0.001, so that for the usual significance levels H0A will be
rejected.
Under the Models B and C for testing the null hypothesis H0B and H0C , re-
spectively, the same test statistic would have been used as under Model A. H0B and
H0C would have been rejected also.
***
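With statistical software the test of Theorem 7.3 takes only a few lines. A sketch with an invented 3 × 3 table; these are not the blood group frequencies, for which the statistic equals 40.54:

```python
import numpy as np
from scipy.stats import chi2_contingency

N = np.array([[400, 300, 2000],    # invented 3x3 table of cell frequencies N_ij
              [300, 300, 2200],
              [150, 120, 1000]])

chi2, p_value, dof, expected = chi2_contingency(N, correction=False)
# 'expected' holds the estimated expected frequencies n * p_hat_ij = N_i. * N_.j / n
print(chi2, dof, p_value)          # dof = (r - 1)(c - 1) = 4 here
```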
The methods that were discussed in this section rely on a chi-square limit distribution.
In order for the approximation to be reasonable for a given data set, the following rule of
thumb is used. If EH0 Nij > 1 for all (i, j) and at least 80% of the EH0 Nij > 5, then the
approximation will be reasonable. Here EH0 Nij , the expected number of observations in
cell (i, j) under H0 , equals npi· p·j , Ni· pj , and N·j pi for Model A, B and C, respectively.
In practice the expectations EH0 Nij are not known and are to be estimated. For this the
probabilities in the three expressions for EH0 Nij are replaced by their estimators (7.7),
(7.12) and (7.16), so that for all models EH0 Nij is estimated by Ni· N·j /n = n p̂ij with p̂ij given by (7.9).
7.3 Identification of cells with extreme values

A table with the contributions

(7.19)   Cij = (Nij − n p̂ij ) / √(n p̂ij ) ,   i = 1, . . . , r,  j = 1, . . . , c,

gives insight into which cells contribute relatively much to the value of the test statis-
tic. For cells with a considerably large contribution it should be checked whether the
observation in the cell is correct. Note that an incorrect observation not only leads to
an incorrect value in the corresponding cell, but also in the other cells in the same row
and/or column: the contributions are dependent. This is why it is not easy to decide which cells are extreme solely on the basis of a table with contributions.
Often other normalizations of the residuals are used. We mention
(7.20)   Vij = (Nij − n p̂ij ) / √( Ni· (n − Ni· ) N·j (n − N·j ) / (n² (n − 1)) ) ,

(7.21)   Uij = (Nij − n p̂ij ) / √( n p̂ij (1 − Ni· /n)(1 − N·j /n) ) = √( n/(n − 1) ) Vij ,

and

(7.22)   Ṽij = (1/gij ) Vij ,   with gij = 1 + e^{−n p̂ij } ,
for i = 1, . . . , r, j = 1, . . . , c. Since given the row and column totals, Nij has under the
three models and under the corresponding null hypotheses (exactly) a hypergeometric
distribution with expectation np̂ij and variance Ni· (n − Ni· )N·j (n − N·j )/(n2 (n − 1)), the
statistic Vij has conditionally on the row and column totals under the null hypothesis
expectation 0 and variance 1 under all three models. Under the three models, for large n
Vij is approximately N (0, 1) distributed under the null hypothesis, so that p-values for the
Vij , based on the standard normal distribution, preferably computed with a continuity
correction, give an impression of the extremeness of each of the cell frequencies separately:
small and large p-values indicate that the value in the corresponding cell is an outlier under
the null hypothesis. For large n, Uij hardly differs from Vij , and each Uij is for large n also approximately N (0, 1) distributed under the null hypothesis. The same holds
for Ṽij . The additional factor 1/gij in (7.22) serves as a correction for the fact that Vij
is not exactly standard normally distributed. The p-values of Ṽij based on the N (0, 1)-
distribution in some cases give a better approximation of the exact p-values per cell, than
those of the other statistics.
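The contributions Cij of (7.19) and the standardized residuals Vij of (7.20) are easily computed from the table of cell frequencies. A sketch with an invented table:

```python
import numpy as np

def residual_tables(N):
    """Contributions C_ij as in (7.19) and standardized residuals V_ij as in (7.20)."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    row = N.sum(axis=1, keepdims=True)                 # row totals N_i.
    col = N.sum(axis=0, keepdims=True)                 # column totals N_.j
    expected = row * col / n                           # n * p_hat_ij
    C = (N - expected) / np.sqrt(expected)             # (7.19)
    var = row * (n - row) * col * (n - col) / (n**2 * (n - 1))
    V = (N - expected) / np.sqrt(var)                  # (7.20)
    return C, V

C, V = residual_tables([[400, 300, 2000],              # invented table
                        [300, 300, 2200],
                        [150, 120, 1000]])
print(np.round(C, 2))
print(np.round(V, 2))
```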
Example 7.4 Let us consider the data of Example 7.1 again. Because for these
data the null hypothesis was rejected (for all models), the next step is to investigate
the nature of the relationship, i.e. of the dependence in Model A or of the inhomo-
geneity of the samples in Models B and C. Below a couple of tables are presented that give insight into this.
First we compare for the different models the cell probabilities as estimated
under the respective null hypotheses (Tables 7.6–7.8), with those as estimated not
under the null hypothesis (Table 7.5). We see that there are differences between the
Table 7.5: Table of estimated cell probabilities. The inner part of the table gives the values p̂i· p̂·j under
H0A ; the row totals are the values p̂i under H0C ; the column totals are the values p̂j under H0B .
two types of estimates, but that they are not very large. In particular, the Tables 7.5
and 7.6 are almost identical. This illustrates the fact that it makes sense to apply
a chi-square test: based on the comparison of the cell probabilities estimated under
and not under the null hypothesis it is difficult to conclude that the null hypothesis
does not hold.
On the other hand, the Tables 7.5 and 7.6 are not irrelevant, for they indicate how large the deviation from independence is. The difference between the estimated probabilities in the two tables is about 0.01. At first sight this is quite small. Whether or not this is of practical relevance depends on the context and goal of the analysis. It cannot be inferred from the fact that the deviation from independence or homogeneity is statistically significant. The statistical significance only says that
the observed relationship (dependence/inhomogeneity) is not due to an artifact of
taking a sample (which is large in size in this example). It does not mean that the
relationship is strong, i.e. that the dependence or inhomogeneity is large, but only
that the relationship almost surely exists.
Next, we investigate the nature of the relationship by studying the residuals
(Table 7.9). The residuals in the cells (1,1), (1,3) and (2,1) seem to be relatively
large. However, Tables 7.9 and 7.10 illustrate that it is good to normalize the residuals: the residuals 41.28 and 41.88 yield rather different contributions.
Vice versa, the contributions −2.16 and −2.22 originate from two very different
residuals. From Table 7.10 with the contributions we see that cell (1,1) and cell
(2,1) do make a large contribution to the chi-square test statistic indeed, but that
the contribution of cell (1,3) is not very large after all. Also in Table 7.11 with
the normalized residuals Uij the cells (1,1) and (2,1) stand out. We therefore may
conclude that there seems to be a positive relationship between having blood group
O and getting an ulcer, and a negative relationship between getting an ulcer and
having blood group A. In this example the values of Uij , Vij and Ṽij are the same,
due to the fact that the sample size is very large.
***
Table 7.6: Table of estimated cell probabilities pij under Model A, not under the null hypothesis; the
table gives the values Nij /n.
Table 7.7: Table of estimated cell probabilities pj under Model B, not under the null hypothesis; the
table gives the values Nij /Ni· .
Table 7.8: Table of estimated cell probabilities pi under Model C, not under the null hypothesis; the
table gives the values Nij /N·j .
Table 7.11: Table of the standardized residuals Vij ; Uij and Ṽij are the same as Vij in this case.
Example 7.5 For the data of Example 7.1 it turns out that the Cij and the Uij (and
hence the Vij and Ṽij ) do not have any outliers based on their empirical distribution,
whereas in fact there are a couple of large contributions. In view of the fact that
the chi-square test rejects the null hypothesis, this illustrates the limitations of this
method for smaller tables.
***
7.4 The bootstrap for contingency tables

When the expected cell frequencies are too small to satisfy the rule of thumb of the previous section, the chi-square approximation is unreliable and the chi-square test is not applicable. In that case the exact, conditional, distribution of the test statistic given the marginals can be estimated with the bootstrap. When n is large, the chi-square test should be preferred, because the bootstrap procedure takes a lot of computation time in that case.
The bootstrap is, however, more generally applicable than just for determining a
bootstrap estimate of the exact, conditional, distribution of the chi-square test statistic.
With the bootstrap also estimates of the conditional distribution of any other function of
the cell frequencies can be derived. Hence, the bootstrap can be employed for finding cells
with extreme values too. For instance, estimation of the conditional distribution—under
one of the three null hypotheses—of the largest, second largest, etc. cell frequency, allows us to compare the largest, second largest, etc. cell frequency in the contingency table of
the data with the corresponding estimated distribution. If, for example, the largest value
in the table lies in the tail of the estimated distribution, then this value is extreme under
the null hypothesis.
We now describe the general procedure for deriving a bootstrap estimate of the distri-
bution of a function of the cell frequencies under one of the null hypotheses and given the
row and column totals. Because the distribution of Nij given the marginals is the same
under all three hypotheses, namely hypergeometric with expectation np̂ij and variance
Ni· (n − Ni· )N·j (n − N·j )/(n2 (n − 1)), it does not matter whether we work under Model
A, B or C. Let N n = (N11 , . . . , N1c , N21 , . . . , N2c , . . . , Nr1 , . . . , Nrc )T , and let F0 be the
conditional distribution of N n under the null hypothesis given the marginals. Let the
function of the cell frequencies T = T (N n ) be the statistic of interest. The bootstrap
procedure for estimation of its distribution is as follows:
1. Generate B samples N∗n,1 , . . . , N∗n,B from F0 , that is, B contingency tables with the same row and column totals as the observed table.
2. Compute the value T∗n,b = T (N∗n,b ) for b = 1, . . . , B.
3. Use the empirical distribution of T∗n,1 , . . . , T∗n,B as an estimate of the conditional distribution of T (N n ) under the null hypothesis, given the marginals.
The value t of T (N n ) for the original data set can now be compared with the bootstrap
estimate of the distribution of T (N n ). We remark that, although this does not show in the notation, the bootstrap values T∗n,1 , . . . , T∗n,B of course do not only depend on n, but
also on the row and column totals. Hoaglin, Mosteller and Tukey (1985) describe how to
generate the samples from F0 . Unfortunately, this procedure takes a long computation
time even for moderately large n.
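A simple, if not the fastest, way to generate tables from F0 is the permutation representation of the conditional null distribution: give every object its row label, randomly permute the column labels, and cross-tabulate. The sketch below uses this approach, as an alternative to the procedure of Hoaglin, Mosteller and Tukey referred to above, with an invented 2 × 2 table:

```python
import numpy as np

def bootstrap_tables(N, B, rng):
    """Draw B tables with the same marginals as N from the conditional null distribution F0."""
    N = np.asarray(N)
    r, c = N.shape
    rows = np.repeat(np.arange(r), N.sum(axis=1))    # row label of each of the n objects
    cols = np.repeat(np.arange(c), N.sum(axis=0))    # column label of each of the n objects
    tables = np.empty((B, r, c), dtype=int)
    for b in range(B):
        perm = rng.permutation(cols)                 # random matching of column to row labels
        for i in range(r):
            tables[b, i] = np.bincount(perm[rows == i], minlength=c)
    return tables

rng = np.random.default_rng(3)
N = np.array([[12, 30],                              # invented table
              [28, 20]])
stars = bootstrap_tables(N, 500, rng)                # 500 bootstrap tables, as in Example 7.6
smallest = stars.reshape(500, -1).min(axis=1)        # bootstrap distribution of the smallest cell
print(np.mean(smallest <= N.min()))                  # estimated left p-value for the smallest cell
```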
Example 7.6 Table 7.12 gives the results of a study into illiteracy in urban and
rural areas. As expected, the row and column variables are not independent: we
find a value of 14.82 for the chi-square statistic, which has for one degree of free-
dom a very small p-value. For these data 500 bootstrap 2 × 2-contingency tables
with the same marginals as Table 7.12 were simulated under the null hypothesis
of independence. For each of the tables the smallest and largest value in the table
was determined. The empirical distributions of these values are bootstrap estimates
for the conditional distributions, given the marginals, under the null hypothesis of
the minimum and the maximum in a table. Based on these bootstrap estimates
the value 6 in the table turns out to be an outlier (6 was the 0.002-quantile of the empirical distribution of the bootstrapped smallest values), but the value 36 is not (36 was the 0.41-quantile of the empirical distribution of the bootstrapped largest values). We may therefore conclude that in particular there are fewer illiterate people in the urban areas than would be the case under independence of the variables ‘area’ and ‘literacy’.
***
Chapter 8

Linear regression analysis
Regression analysis in its simplest form consists of fitting models for one continuous re-
sponse to experimental data. In the models it is assumed that the value of the response
depends on the values of a number of other variables, the independent or structural vari-
ables. In the context of a regression model, these variables often are called the explanatory
variables. The response is assumed to be observed with a measurement error; the cor-
responding values of the explanatory variables are assumed to be known exactly. The
relationship between the two types of variables depends on the unknown values of a set of
parameters. These unknown parameter values need to be estimated. Often the estimation
is done by the so-called least squares method, but other methods are also available. The
least squares method tries to find those values of the parameters that minimize the sum of the squared deviations of the observations from their expected values under the model.
In the case of linear regression the expected response depends in a linear way on the vector
of unknown parameters. Generally the response is assumed to be normally distributed.
Two more general classes of regression models are the class of nonlinear regression models
in which the expected response depends in a nonlinear way on the parameters, and the
class of generalized linear models in which the expected response depends on a function of
a linear model and the response can have a different distribution than the normal one. In
this chapter we restrict ourselves to univariate linear regression models in which the one
response variable can depend on more than one explanatory variable, the so-called multi-
ple linear regression models; see, for instance, Seber and Lee (2003) or Weisberg (2005).
If there is more than one response variable, one speaks of multivariate linear regression,
see Mardia et al. (1980) for an introduction. We shall not address this case. For nonlinear
models we refer to Seber and Wild (2003) or Bates and Watts (2007), and for generalized
linear models to McCullagh and Nelder (1989) or Dobson and Barnett (2008).
8.1 The multiple linear regression model

In the multiple linear regression model it is assumed that, for i, k = 1, . . . , n,
(8.1)   Yi = β0 + xi1 β1 + · · · + xip βp + ei ,
        E ei = 0,
        E ei ek = σ²  if i = k,   and  E ei ek = 0  if i ≠ k,
where Yi is the i-th observation of the response, xij the (known) value of the j-th explana-
tory variable for this observation, β0 , β1 , . . . , βp , and σ 2 are unknown constants, and ei is
the unknown stochastic measurement error in the i-th observation. We see that in this
model the Yi are uncorrelated random variables with E Yi = β0 + xi1 β1 + · · · + xip βp , and
Var Yi = σ 2 . The constant β0 is called the intercept.
It is practical to use matrix notation in this situation. Model (8.1) then becomes

(8.2)   Y = Xβ + e,   E e = 0,   Cov(e) = σ² In×n ,

where Y = (Y1 , . . . , Yn )T is the response vector, X is the n × (p + 1) design matrix whose i-th row is (1, xi1 , . . . , xip ), β = (β0 , β1 , . . . , βp )T , and e = (e1 , . . . , en )T is the vector of measurement errors. Writing 1 for the column of ones and X1 , . . . , Xp for the columns of X that contain the values of the explanatory variables, the model can also be written as

(8.3)   Y = β0 1 + β1 X1 + β2 X2 + · · · + βp Xp + e,   E e = 0,   Cov(e) = σ² In×n .
The first goal is to estimate β and σ 2 . Next the quality of the model with the estimated
parameter values needs to be investigated. In the sequel the “i-th observation point”
means the i-th response yi combined with the corresponding values xi1 , . . . , xip of the
explanatory variables. Unless stated otherwise, we assume from now on that e1 , . . . , en
are independent and normally distributed.
The least squares estimator β̂ of β minimizes the sum of squares

(8.4)   ∑_{i=1}^{n} (Yi − β0 − xi1 β1 − · · · − xip βp )² = ‖Y − Xβ‖².

It can be obtained in the usual way, by differentiation of (8.4) with respect to the different
βj and setting the result equal to 0. Doing this, we see that β̂ satisfies X T (Y − X β̂) = 0,
or X T X β̂ = X T Y , and we find
(8.5) β̂ = (X T X)−1 X T Y.
The vector of predicted or fitted values is Ŷ = X β̂, and the vector of residuals is

RY (X) = R = Y − Ŷ .

A measure for the quality of the fit is the residual sum of squares

(8.6)   RSS = ∑_{i=1}^{n} Ri² = ∑_{i=1}^{n} (Yi − Ŷi )².
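In code, the estimator (8.5), the fitted values and RSS take only a few lines. A sketch with simulated data; the measurements of Table 8.1 below could be substituted for x and y:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 31, 2
x = rng.uniform(0, 10, size=(n, p))                             # simulated explanatory variables
y = 2.0 + 3.0 * x[:, 0] - 1.5 * x[:, 1] + rng.normal(size=n)    # simulated responses

X = np.column_stack([np.ones(n), x])                 # design matrix with a column of ones
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)         # (8.5): solve X'X beta = X'Y
y_hat = X @ beta_hat                                 # fitted values
RSS = np.sum((y - y_hat) ** 2)                       # (8.6)
print(beta_hat, RSS)
```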
Table 8.1: Diameter (in inches), volume (in cubic feet) and length (in feet) of 31 cherry trees.

diameter 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3
volume 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 24.2
length 70.0 65.0 63.0 72.0 81.0 83.0 66.0 75.0 80.0 75.0 79.0
diameter 11.4 11.4 11.7 12.0 12.9 12.9 13.3 13.7 13.8 14.0
volume 21.0 21.4 21.3 19.1 22.2 33.8 27.4 25.7 24.9 34.5
length 76.0 76.0 69.0 75.0 74.0 85.0 86.0 71.0 64.0 78.0
diameter 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0 20.6
volume 31.7 36.3 38.3 42.6 55.4 55.7 58.3 51.5 51.0 77.0
length 80.0 74.0 72.0 77.0 81.0 82.0 80.0 80.0 80.0 87.0
Example 8.1 In Example 3.2 univariate regression was performed in which the
response variable ‘volume’ was regressed on the explanatory variable ‘diameter’ for
data of 31 cut cherry trees. These data were collected to investigate to which extent
the timber yield (volume) of a tree can be predicted by means of the diameter
at the height of 4 feet and 6 inches of the uncut tree. The question is whether
the diameter on its own is sufficient to predict the volume. Therefore, besides the
volume (in cubic feet) and the diameter (in inches), also the length (in feet) of the
31 trees was measured, see Table 8.1.
We now assume the model (8.1) with p = 2, where Yi , xi1 and xi2 represent
the volume, the diameter and the length of the i-th tree. With the least squares
method we find estimates for β0 , β1 , β2 and σ². An important model assumption that should be checked is the normality
of the measurement errors. This can once again be done by means of a normal
QQ-plot of the residuals. This plot, see Figure 8.1, gives us no reason to doubt the
normality assumption.
***
Figure 8.1: Normal QQ-plot of the residuals of the bi-variate linear regression of volume on diameter
and length for data of 31 cherry trees.
We remark that under the assumption of normality of the measurement errors the least
squares estimates are the same as the maximum likelihood estimates. In case it is unlikely
that the measurement errors are normally distributed, for example if a Poisson or gamma
distribution or a symmetric distribution with heavier tails seems more appropriate, then
it is recommended to use maximum likelihood or a robust method to estimate β, instead
of the least squares method.
The conclusions about an individual explanatory variable generally depend on which other variables are included in the model. A thorough statistical analysis should therefore consider all possible models, with
all combinations of explanatory variables, and compare them to each other. In several
statistical packages this is standard practice. The statistician’s task is to choose proper
selection criteria. Below we will discuss several techniques to compare two different linear
regression models with each other.
Consider the model

(8.8)   Y = β0 1 + β1 X1 + · · · + βp Xp + e
with residual sum of squares RSS given by (8.6), and compare it with the model without
explanatory variables,
(8.9) Y = β0 1 + e.
The residual sum of squares for this special model without explanatory variables is denoted
by SSY . Because for this model the design matrix reduces to the n-vector containing
only ones, β̂0 = Ȳ , and Ŷ = X β̂ = β̂0 1 = Ȳ 1. This means that Ŷi = Ȳ for every i, and
the residual sum of squares for this special case thus turns into

SSY = ∑_{i=1}^{n} (Yi − Ȳ )² ,
which is called the sum of squares for Y or the total sum of squares. The difference SSreg = SSY − RSS corresponds to the part of the sum of squares for Y that is explained by the larger model (8.8), but not by the smaller model (8.9). Therefore, the size of SSreg is an indicator of the
usefulness of regression of Y on the X-variables. The coefficient of determination R2 is a
scaled version of this difference:
(8.10)   R² = SSreg / SSY = (SSY − RSS) / SSY = 1 − RSS / SSY .
It represents the fraction of the variability in Y which is explained by regression of Y on
X. The quantity R2 is also called the fraction of explained variance. Its size thus is a
measure of the overall quality of the model (8.8). For linear regression models with an
intercept, it holds that 0 ≤ R² ≤ 1. The closer R² is to 1, the better the model. In
general, addition of a new variable to a model will make R2 larger.
For simple linear regression, where there is only one explanatory variable, the coeffi-
cient of determination is equal to the square of the correlation coefficient between Y and
X1 . In the case of multiple regression it is equal to the square of the so-called multiple
correlation coefficient of Y and the X-variables. The multiple correlation coefficient is
the largest correlation in absolute value between Y and any linear combination of the
X-variables, which is by definition equal to ρ(Y, Ŷ ), the correlation coefficient between Y and its predicted value under the considered regression model. Although the determination
coefficient is very often used, its importance should not be overrated. It only gives a global
indication of the model fit.
8.1.2.2 F -tests
A different scaling of SSreg than the one used for the definition of the determination
coefficient in (8.10), leads to a frequently used test statistic. Indeed, if SSreg exceeds
some critical level, we will conclude that exploiting our knowledge of the values of the X’s
yields a better model than if we would ignore them. More precisely, define the F -statistic

F = ( SSreg /p ) / ( RSS/(n − p − 1) ).
If e1 , . . . , en are independent and normally distributed, then under the model (8.9) F
is the ratio of two independent chi-square distributed random variables divided by their
numbers of degrees of freedom p and (n − p − 1), respectively, and F therefore has an
F -distribution with p and (n − p − 1) degrees of freedom. Hence, for the testing problem
H0 : β1 = · · · = βp = 0,
H1 : βj ≠ 0 for at least one j, 1 ≤ j ≤ p,
the following F -test may be used: reject H0 if F ≥ Fp,(n−p−1);1−α . Here Fµ,ν;1−α is the
(1 − α)-quantile of the F -distribution with µ and ν degrees of freedom, and α is the level
of the test.
Example 8.2 For the multiple regression with two variables of Example 8.1 we
find SSY = 8106.08, so that SSreg = 7684.16, R2 = 0.95 and F = 254.97. We see
from the value of R2 that a large fraction of the variability in volume is explained
by regression on diameter and length. The given F-value relates to the following
testing problem:
H0 : model (8.9) holds, this is: β1 = β2 = 0,
H1 : model (8.8) holds with βj ̸= 0 for some j, 1 ≤ j ≤ 2.
Because the right p-value of 254.97 for the F2,28 -distribution is practically zero, H0 is rejected for all reasonable significance levels. Later on we will see that diameter is
a good predictor of volume, whereas length is less useful in this respect.
***
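The quantities in this example follow directly from SSY, RSS and SSreg. A sketch with simulated data, so the numerical output will of course differ from the cherry tree values:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(5)
n, p = 31, 2
x = rng.uniform(0, 10, size=(n, p))
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)   # simulated data
X = np.column_stack([np.ones(n), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
RSS = np.sum((y - X @ beta_hat) ** 2)
SSY = np.sum((y - y.mean()) ** 2)
SSreg = SSY - RSS

R2 = SSreg / SSY                                # (8.10), fraction of explained variance
F = (SSreg / p) / (RSS / (n - p - 1))           # overall F-statistic
p_value = f_dist.sf(F, p, n - p - 1)            # right p-value of the F_{p, n-p-1} distribution
print(R2, F, p_value)
```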
It is not always relevant to compute the overall test statistic F: often it is already
known that at least some of the variables are correlated with the response variable, so
that a large value for F is to be expected. Instead of testing whether all variables together
are of importance, it is then more interesting to investigate which part of the available
set of variables should be selected for inclusion in the model. For this purpose another
F -test can be used. If one is interested to know whether besides the explanatory variables
X1 , . . . , Xp also Xp+1 , . . . , Xq (q > p) should be included in the model, one considers the
testing problem
H0 : βp+1 = · · · = βq = 0;   β0 , β1 , . . . , βp arbitrary,
H1 : βj ≠ 0 for at least one j with p + 1 ≤ j ≤ q;   β0 , β1 , . . . , βp arbitrary.
With RSSp and RSSq the residual sums of squares of the fitted models with p and with q explanatory variables, the partial F -statistic

F p,q = ( (RSSp − RSSq ) / (q − p) ) / ( RSSq / (n − q − 1) )

has under H0 , and under the normality assumption, an F -distribution with (q − p) and (n − q − 1) degrees of freedom, and H0 is rejected for large values of F p,q .
8.1.2.3 t-tests
Investigating whether or not to include one single variable, Xk say (0 ≤ k ≤ p), is
generally done by means of a t-test, rather than by using a partial F -test. In this case
the hypotheses are:
H0 : βk = 0; βj arbitrary for 0 ≤ j ≤ p, j ̸= k,
H1 : βk ̸= 0; βj arbitrary for 0 ≤ j ≤ p, j ̸= k.
If the model (8.8) with p explanatory variables X1 , . . . , Xp is fitted, then—again under the assumption of independence and normality of the measurement errors—the statistic Tk obtained by dividing β̂k by its estimated standard deviation has under H0 a t-distribution with (n − p − 1) degrees of freedom. Hence, for significance level α, H0 is rejected if

(8.13)   |Tk | = |β̂k | / √( Ĉov(β̂)kk ) ≥ t(n−p−1);1−α/2 ,

where tν;1−α/2 is the (1 − α/2)-quantile of the t-distribution with ν degrees of freedom and Ĉov(β̂)kk is the k-th diagonal element of the estimated covariance matrix Ĉov(β̂) of β̂. In most statistical computer
packages for regression, the vector of t-values corresponding to the estimates β̂k belongs to the standard output.
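A sketch of the computation behind these t-values, assuming the standard least squares estimates σ̂² = RSS/(n − p − 1) and Ĉov(β̂) = σ̂² (XᵀX)⁻¹:

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(6)
n, p = 31, 2
x = rng.uniform(0, 10, size=(n, p))
y = 1.0 + 2.0 * x[:, 0] + 0.0 * x[:, 1] + rng.normal(size=n)   # simulated data; beta_2 = 0
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
RSS = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = RSS / (n - p - 1)                   # assumed estimator of sigma^2
cov_hat = sigma2_hat * XtX_inv                   # assumed estimated covariance matrix of beta_hat

t_values = beta_hat / np.sqrt(np.diag(cov_hat))  # the statistics T_k of (8.13)
p_values = 2 * t_dist.sf(np.abs(t_values), n - p - 1)
print(np.round(t_values, 2), np.round(p_values, 4))
```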
We note that the testing problem of this t-test is the same as that of the partial F -
test for the case where the question is whether Xk should be included next to the other
(p − 1) explanatory variables. For this situation F p−1,p = Tk2 . Under the relevant null
hypothesis F p−1,p has an F -distribution with 1 and (n − p − 1) degrees of freedom and the
two tests are equivalent. This means that the tests in this case yield the same p-value,
and, hence, the same conclusion for the same significance level. It is important to keep in
mind that both the F - and t-tests should only be used if the independence and normality
assumptions for the measurement errors are plausible.
Unfortunately, the partial F -tests and t-tests have the undesirable property that the
conclusion about whether or not one or more new variables should be included in the
model, may depend on which variables were already present in the model. For instance, if
a series of tests is performed such that each time one single variable is added if it is tested
significant, the final model may depend on the order in which the available variables were
tested.
It also often happens that based on a t- or F -test it is concluded that a certain variable
should be added to a model whereas the determination coefficient of the new model with
this variable is barely larger than that of the old model without the variable. In that
case, addition of the variable hardly improves the model. Since simpler models are easier to work with, a general guideline for the choice of a model is: choose the model with the largest determination coefficient, but only if the gain in determination coefficient over the smaller model is of practical relevance. In addition, one should always keep in mind that the
modeling problem concerns a concrete, practical situation: it should always be questioned
whether addition or deletion of variables makes sense in the context of the situation that
the model intends to describe.
The sample correlation coefficient of Y and a single explanatory variable X1 is a measure of the strength of the linear relationship between X1 and Y . As such, it gives an indication of the usefulness
of inclusion of X1 as the explanatory variable in a simple linear regression model with
response variable Y . Analogously, if p explanatory variables are available, there exists a
measure for the linear relationship between one of them, Xk say, and Y corrected for the
(p − 1) other variables, which gives a global indication of the usefulness of including Xk
into a multiple regression model in addition to the other (p − 1) variables. This measure
therefore addresses the same problem as the above described t-test and is called the par-
tial correlation coefficient of Xk corrected for the (p − 1) other variables. It is defined
as the correlation coefficient between two vectors of residuals, namely between RY (X−k )
which is the vector of residuals from the linear regression model (with intercept) that
regresses Y on all explanatory variables except Xk , and RXk (X−k ) which is the vector of
residuals from the linear regression model that regresses Xk on the same (p − 1) other
explanatory variables. The vectors RY (X−k ) and RXk (X−k ) represent the part of Y and
Xk , respectively, that cannot be explained by the (p − 1) other variables. This means that
if the absolute value of the partial correlation coefficient is close to 1, it makes sense to
include Xk in a model for Y in which the other (p − 1) explanatory variables are already
present.
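The partial correlation coefficient can be computed exactly as described: regress out the other variables from both Y and Xk and correlate the two residual vectors. A sketch with simulated data:

```python
import numpy as np

def partial_correlation(y, X, k):
    """Correlation of R_Y(X_-k) and R_Xk(X_-k); X has one column per explanatory variable."""
    def residuals(response, predictors):
        Z = np.column_stack([np.ones(len(response)), predictors])   # regression with intercept
        coef, *_ = np.linalg.lstsq(Z, response, rcond=None)
        return response - Z @ coef

    others = np.delete(X, k, axis=1)            # the (p - 1) other explanatory variables
    r_y = residuals(y, others)                  # R_Y(X_-k)
    r_xk = residuals(X[:, k], others)           # R_Xk(X_-k)
    return np.corrcoef(r_y, r_xk)[0, 1]

rng = np.random.default_rng(7)
x = rng.normal(size=(50, 2))                    # simulated explanatory variables
y = 3.0 * x[:, 0] + 0.2 * x[:, 1] + rng.normal(size=50)
print(partial_correlation(y, x, 0), partial_correlation(y, x, 1))
```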
Example 8.3 Let us consider once more the cherry tree data of the foregoing
examples, and let us investigate whether or not the variable ‘length’ should be
added to the model in addition to the variable ‘diameter’. To this end, consider the
testing problem
H0 : β2 = 0; β0 , β1 arbitrary,
H1 : β2 ̸= 0; β0 , β1 arbitrary.
We find t2 = 2.61 with a (two-sided) p-value of 0.0144 for the t-distribution with 28
degrees of freedom, and F 1,2 = 6.79 with almost the same p-value, 0.0145 for the
F -distribution with 1 and 28 degrees of freedom. The small difference in p-values
is due to rounding errors. We may conclude that for levels of α larger than 0.0145
H0 will be rejected. Note that not only the p-values are almost the same, but also
t22 = 6.81 equals F 1,2 up to rounding error.
Depending on how accurately one wishes to predict the timber yield, based on
these results one could decide not to include the variable length, especially since
this variable is hard to measure in practice. Taking the determination coefficient
into account, we see that for simple linear regression of volume on diameter it is
R2 = 0.94, whereas the determination coefficient increases to R2 = 0.95 due to
addition of the variable length (multiple regression on diameter and length). Since
this is a minor increase, the values of the determination coefficients suggest that
inclusion of the variable ‘length’ is not very useful.
This conclusion is to some extent supported by the value 0.44 of the partial
correlation coefficient of length corrected for diameter. But how do we explain the
results of the F - and t-test, which with the small p-value of around 0.014 seem to indicate that inclusion of length is significantly better? Here the general warning against thoughtlessly applying statistical tests is in order. The significance of a test means that an observed effect, here the increase of R² from 0.94 to 0.95, cannot be attributed to chance: if new data were collected, one would most likely find
a similar effect. However, significance of a test does not mean that the effect is of
practical importance. In the case of the cherry tree data one will probably decide
that diameter on its own is sufficient to predict the timber yield, not only because
the difference of the two determination coefficients is small, but also because the
length of an uncut tree is difficult to establish.
Although practically not very relevant, for illustrative purposes we could also
investigate whether a simple linear regression model with only length would be
appropriate. In other words, we could test whether the variable ‘diameter’ should
be included in the model in addition to the variable ‘length’. In that case we would
test the hypotheses:
H0 : β1 = 0; β0 , β2 arbitrary,
H1 : β1 ̸= 0; β0 , β2 arbitrary.
The resulting values t1 = 17.82 and F 1,2 = 317.41 for the test statistics indicate
that the null hypothesis will be rejected for all reasonable values of the significance
level. A simple linear regression with only length is therefore not appropriate. This
also follows from the value 0.35 of the determination coefficient for this model and
the value 0.96 of the partial correlation coefficient for diameter corrected for length.
***
8.2 Diagnostics
For the estimation and testing procedures we have postulated a certain model including
a couple of model assumptions, and we tacitly assumed that the model and the model
assumptions were correct or at least plausible. Of course, in practice this is not always
the case. Hence, another very important aspect of a statistical analysis is to check the
model assumptions and to investigate the quality of the model. This is called diagnostic regression or, for short, diagnostics. Quantities that give information on the model quality
and assumptions are called diagnostic quantities. Unlike the global quantities that we saw
until now, such as β̂ and R2 , which give a global picture, a diagnostic quantity generally
has a different value for each observation point. The importance of diagnostic regression
is illustrated by the following example.
Example 8.4 For four fictional data sets the values of the response variable Y and
explanatory variable X are:
dataset 1, 2, 3, X 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00 7.00 5.00
dataset 1, Y 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68
dataset 2, Y 9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74
dataset 3, Y 7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73
dataset 4, X 8.00 8.00 8.00 8.00 8.00 8.00 8.00 19.00 8.00 8.00 8.00
dataset 4, Y 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.50 5.56 7.91 6.89
When the simple linear regression model

Y = β0 + β1 X + e

is fitted to each of the four data sets, the parameter estimates and the determination coefficient have the same values for each of the four cases: β̂0 ≈ 3.0, β̂1 ≈ 0.5 and R² ≈ 0.67.
From the fact that these global quantities are the same for the four sets, it is tempt-
ing to conclude that simple linear regression is equally good in the four situations.
The simplest form of diagnostic regression, making scatter plots, shows that this
is not true. Figure 8.2 gives scatter plots of the Y -values against the X-values for
each of the four cases. In the first case the data are as we would expect if a simple
linear regression model would be a good description of reality. The graph for the
second case suggests that the analysis based on simple linear regression is incorrect
and that a smooth curve, for instance a quadratic polynomial could be fitted to
the data so that much less unexplained variation in Y would remain. The third
case shows a picture from which we see that a simple linear regression model seems
appropriate for all points, except one. It would perhaps be better not to include this
point in the global analysis. This would give: ŷ = 4.0 + 0.34x, a very different line.
Without more information it is impossible to say which of the two lines is ‘better’.
In the last case, there is not sufficient information to infer something about the
quality of the linear regression model. The slope β̂1 of the regression line is mainly
determined by the value y8 . If the eighth observation point were deleted, the parameter β1 could not even be estimated. A global analysis that depends on one point to such a large extent should not be trusted!
***
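The identical global results for the four data sets are easy to verify. A short sketch that fits the simple linear regression to each set and prints β̂0 , β̂1 and R²:

```python
import numpy as np

# the four fictional data sets of Example 8.4
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    X = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)
    print(np.round(b, 2), round(r2, 2))          # (nearly) identical estimates and R^2
```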
Figure 8.2: Scatter plots of four fictional data sets with their linear regression lines.
For a simple linear regression model the relationship between the response Y and the
explanatory variable X = X1 can simply be represented by a scatter plot to which the
estimated regression line may be added. For multiple regression it is less straightforward.
It is generally useful to make scatter plots of the response Y against each of the explana-
tory variables and, because there may be relationships between the explanatory variables
themselves, also of each pair of explanatory variables.
Example 8.5 Figure 8.3 shows several scatter plots for the cherry tree data. The
results of our global analysis are confirmed in these plots: there appears to be
a strong linear relationship between volume and diameter. Moreover, there is a
moderate linear relationship between the two explanatory variables ‘diameter’ and
‘length’.
***
What a scatter plot of Y against one of the explanatory variables, Xk for instance, does
not show, is the relationship between Y and Xk after correction for possible relationships
between Xk and the other explanatory variables.

Figure 8.3: Some scatter plots for the cherry tree data.

Going back to the definition of the
partial correlation coefficient, we see that a scatter plot of the vector of residuals RY (X−k )
of the regression of Y on all Xj except Xk , against the vector of residuals RXk (X−k ) of
the regression of Xk on the other Xj represents just this relationship. This graph is called
the added variable plot for Xk . A strong linear relationship between the plotted residuals
corresponds with a strong linear relationship between Y and Xk corrected for the other
variables, and indicates that Xk should be added to the model that already contains the
other variables. The added variable plot has the property that if the linear regression
model
RY (X−k ) = α0 + α1 RXk (X−k ) + e,
of RY (X−k ) on RXk (X−k ) is fitted, then the least squares estimators of α0 and α1 satisfy
α̂0 = 0 and α̂1 = β̂k , respectively. Here β̂k is the least squares estimator of βk in (8.8).
The residuals of both regression procedures are the same, so that the residual sums of
squares RSS for both models are equal too.
Added variable plots can be interpreted in the same way as scatter plots for simple
linear regression. They can give information about nonlinear relationships or outliers. To
determine whether or not a variable should be included in the model, the added variable
plot is a very informative tool: whereas the partial correlation coefficient and the statistic
tk defined in (8.13) summarize the overall effect of adding the variable Xk , the added
variable plot for Xk shows this effect for each observation separately.
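As a sketch of how the two residual vectors of an added variable plot can be computed, assuming numpy and a design matrix X that already includes the column of ones (the helper name added_variable_residuals is mine):

import numpy as np

def added_variable_residuals(X, y, k):
    # X: n x (p+1) design matrix including the column of ones; k: index of the column X_k
    X_rest = np.delete(X, k, axis=1)
    # residuals of the regression of y on all columns except X_k
    r_y = y - X_rest @ np.linalg.lstsq(X_rest, y, rcond=None)[0]
    # residuals of the regression of X_k on the other columns
    r_xk = X[:, k] - X_rest @ np.linalg.lstsq(X_rest, X[:, k], rcond=None)[0]
    return r_xk, r_y

# The slope of the least squares line through the plotted residuals should reproduce
# the coefficient of X_k in the full regression, in line with the property above:
#   np.polyfit(r_xk, r_y, 1)[0]  is (numerically) equal to
#   np.linalg.lstsq(X, y, rcond=None)[0][k]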
Example 8.6 Figure 8.4 displays added variable plots for the cherry tree data. The
added variable plot for diameter (left) shows a strong linear relationship between the
two plotted variables, which indicates that the variable diameter should be included
into the model, a conclusion that already was drawn based on the t1 -value of 17.82.
The added variable plot for length, however, is more difficult to interpret. Although
it shows a slight upward trend, there is no clear linear relationship. This, too, reflects the t2-value of 2.61 found above.
As we saw in Example 8.3 above, the partial correlation coefficients for diameter
and length, which are the correlation coefficients of the two corresponding added
variable plots, are 0.96 and 0.44, respectively.
***
Figure 8.4: The cherry tree data: added variable plots for diameter (left) and length (right).
Other plots that can give useful diagnostic information are plots of (a small plotting sketch follows this list):
- the residuals against explanatory variables that are not included in the model. Variables for which a clear linear relationship is seen in the plots should be included in the model.
- the residuals against the predicted responses. If the spread of the residuals seems to increase or decrease for increasing values of the predicted response, this can be an indication that the variance of the measurement errors is not constant. A transformation could help here, or one could resort to so-called weighted linear regression, in which the variance of the measurement errors ei is allowed to be different for different i. If the graph shows a nonlinear relationship, then a linear regression model is perhaps not the best model. Transformation of the response or of the explanatory variable(s), or the use of a nonlinear model, could be more appropriate.
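A minimal plotting sketch of the residual plots described in the list above, assuming numpy and matplotlib are available (the function name, parameter other_variable and axis labels are mine):

import numpy as np
import matplotlib.pyplot as plt

def residual_plots(X, y, other_variable=None):
    # least squares fit, predicted responses and residuals
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    predicted = X @ beta
    resid = y - predicted

    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    axes[0].scatter(predicted, resid)          # residuals against predicted responses
    axes[0].axhline(0.0, linestyle="--")
    axes[0].set_xlabel("predicted response")
    axes[0].set_ylabel("residual")
    if other_variable is not None:             # residuals against a variable not in the model
        axes[1].scatter(other_variable, resid)
        axes[1].axhline(0.0, linestyle="--")
        axes[1].set_xlabel("candidate explanatory variable")
        axes[1].set_ylabel("residual")
    plt.show()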
For the assessment of the quality of a model the problem is, of course, that a discrepancy
between model and data is not necessarily caused by inappropriate model assumptions,
but can also be the result of incorrect data values. Based on a statistical analysis alone,
it cannot be decided which of the two is the case. This is why it is important to always
keep the experimental situation in mind. Moreover, it depends on the type of discrepancy
whether or not it is easily detected. In fact, in the fitting procedure the model is fitted to
all data, and hence also to deviating data. In case of deviating observations the residuals
therefore do not need to be large. In some cases a deviating observation may even lead
to a very small residual for that observation.
8.2.3 Outliers
One of the assumptions in a regression analysis is that the chosen model is appropriate
for all observation points. However, in practice it is often the case that one or more
observation points have a response value that does not seem to match with the model
that is best suited for the majority of the points. Such observation points, with extremely
large or extremely small values of the response compared to what is expected under the
model, are called outliers, a term that we already met in Chapter 5. The following example
illustrates that it is always useful to start with inspecting a couple of plots for possible
outliers.
Example 8.7 Around 1850 the Scottish physicist J.D. Forbes collected data about
the boiling point of water (in degrees Fahrenheit) under different values of pressure
(in inches of Mercury) by performing experiments at different altitudes in the Alps.
The goal of these experiments was to investigate the relationship between boiling
point and pressure, so that the more easily measured boiling point could serve as a
predictor for pressure and ultimately for altitude. The explanatory variable was x
= boiling point, and on the pressure data a transformation was performed so that the response variable became y = 100 × log(pressure).
Figure 8.5 gives a couple of relevant plots for these data. The upper left plot is a
scatter plot of y- against x-values. There obviously is a strong linear relationship
between the two variables, although one of the points, observation 12, does not lie on
the line with the other points. If the residuals are plotted against the corresponding
x values (top-right), then we see that most of the residuals are small, except for the
value of the 12-th point. Because the linear relationship between x and y is very
strong, a plot of the residuals against the predicted responses, ŷ, gives the same
picture, see the lower left plot. The lower right plot shows a normal QQ-plot of the
residuals. Also here the 12-th observation is not in line with the others.
***
For a simple linear regression problem, as in the example above, it is usually sufficient for detecting outliers to inspect a scatter plot of the response Y against X = X1, the single explanatory variable. In a multiple linear regression setting scatter plots of the response
against the explanatory variables as well as added variable plots may give information
about this, but the problem is obviously more complex.
Once one or more possible outliers are visually detected, one could apply a more formal
procedure to decide whether or not these points should be considered extreme. One such
procedure makes use of a t-test like the one discussed above, combined with a so-called
mean shift outlier model. Suppose that the k-th point is suspected to be an outlier. Then
one could think that the multivariate regression model is correct for all points, except for
this one. To adjust for this, the original model is adapted for the k-th point in such a way
that the expected mean for the k-th response under the original model, namely xTk β, is shifted by an unknown amount δ so that the resulting model could fit the k-th point
too. This resulting model is the mean shift outlier model:
Yi = xTi β + ei          for i ≠ k,
Yi = xTi β + δ + ei      for i = k,

or, in matrix notation,

(8.14)    Y = Xβ + uδ + e = X∗ β∗ + e,
where u is a fictional "explanatory" variable with ui = 0 for i ≠ k and uk = 1; the
new design matrix X∗ is the n × (p + 2)-matrix (X, u), and the new parameter vector β∗ = (β, δ)T.

Figure 8.5: Forbes' data: scatter plots of y against x, residuals against x, residuals against ŷ and a normal QQ-plot of the residuals.
If model (8.14) holds, the k-th observation point is an outlier and the corresponding
response has expectation xTk β + δ. If the value of δ is large enough, then indeed it is
plausible that the k-th point is an outlier. In other words, if δ is significantly different
from zero, the k-th point is an outlier. But then we may just use the t-test (8.13) to test
the significance of δ in the model (8.14). Generally, the sign of δ is known from the plots,
and a one-sided test should be used in this situation. Let us assume for simplicity that
δ > 0; for δ < 0 the procedure is analogous. We test the following hypotheses:
H0 : δ = 0, β arbitrary,
H1 : δ > 0, β arbitrary.
124 Chapter 8. Linear regression analysis
For this testing problem a test with significance level α is: reject H0 if
tp+1 = δ̂ / √( Ĉov(β̂∗)p+1,p+1 ) ≥ t(n−p−2);1−α ,

where δ̂ in the numerator of the test statistic is the least squares estimate of the shift δ in the model (8.14), and Ĉov(β̂∗)p+1,p+1 in the denominator is the (p + 2)-th diagonal element of Ĉov(β̂∗) = σ̂∗² (X∗T X∗)−1, with σ̂∗² = ‖Y − X∗ β̂∗‖²/(n − p − 2) the estimate of the error variance in the extended model (8.14).
The model (8.14) and the corresponding test procedure can be extended in the obvious
way for more than one outlier. Evidently, once it is decided that a certain point is an
outlier, it should be investigated what could be the cause for this. If it is likely that
measurement error is the cause, then the analysis could be repeated without the point.
If measurement error is not a likely cause, or if the cause is unknown, it is advisable to
try other types of models. In general, it should always be mentioned in the report of the
analysis and its results which points, if any, were omitted from the analysis.
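A small sketch of the mean shift outlier test described above, assuming numpy and scipy and a design matrix X that includes the column of ones (the function name is mine). It returns the t-statistic for δ and a two-sided p-value; for the one-sided test described above the p-value should be halved when the sign of δ̂ agrees with the suspected direction.

import numpy as np
from scipy import stats

def mean_shift_outlier_test(X, y, k):
    # X: n x (p+1) design matrix (with intercept); k: index of the suspected outlier
    n, q = X.shape                        # q = p + 1
    u = np.zeros(n)
    u[k] = 1.0                            # the dummy "explanatory" variable of model (8.14)
    X_star = np.column_stack([X, u])
    beta_star, *_ = np.linalg.lstsq(X_star, y, rcond=None)
    resid = y - X_star @ beta_star
    sigma2_star = resid @ resid / (n - q - 1)            # error variance estimate, df = n - p - 2
    cov = sigma2_star * np.linalg.inv(X_star.T @ X_star)
    t = beta_star[-1] / np.sqrt(cov[-1, -1])             # t-statistic for delta
    p_two_sided = 2 * stats.t.sf(abs(t), df=n - q - 1)
    return t, p_two_sided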
Example 8.8 Huber's data set consists of six fictitious observation points, with x-values

x: −4 −3 −2 −1 0 10.
Figure 8.6 shows several plots for these data. The scatter plot of y against x (top-
left) suggests that if the 6-th observation is not taken into consideration, a straight
line is a reasonable model for the other 5 points. If a straight line is fitted through all
points, then the resulting line (top-middle) is rather different from the line obtained
by performing linear regression without the 6-th point (top-right). In a plot of the
residuals of the regression on all points against the x-values (bottom-left) we see that
there is one large residual, but this is not the one corresponding to observation 6 but the one of observation 1.

Figure 8.6: Huber's data: scatter plots of y against x with fitted regression lines (top row); residuals of the simple and of the quadratic regression against x, and a normal QQ-plot of the residuals of the quadratic regression (bottom row).
If there is a good reason not to ignore observation 6, then the scatter plot of the
data suggests that we may investigate a model in which a quadratic term is included.
If x2 is added as a second explanatory variable, the t-test for testing whether the
corresponding coefficient is significantly different from zero yields a p-value smaller
than 0.01, and therefore the coefficient is significantly different from zero. For this
multiple linear regression model with the two explanatory variables x and x2 the
residuals are smaller than the residuals from the simple linear regression and a plot
of these residuals against x looks good (bottom-middle), although the residual of
the 6-th observation is still a bit small compared to the other residuals. The latter
may be caused by the fact that the quadratic nature of the relationship between x
and y is mainly determined by this one point.
Finally, a normal QQ-plot of the residuals of the quadratic regression shows
a straight line (bottom-right). We may conclude that for the complete data set
the multiple linear regression model with x and x2 as explanatory variables is a
reasonable model.
***
The example above shows that for detecting outliers in the explanatory variables checking
for large residuals is not sufficient, because observation 6, which is deviating in some way,
has a small residual for both the simple and the quadratic model. A small residual may
be a good sign if the model fits the data well and the point with the small residual is just
one of the points where the model fits very well. However, it may also be a bad sign if
the small residual is the result of the regression “being forced” into a certain direction
by extreme values of the corresponding point, like in the middle plot of the top row in
Figure 8.6. To see how it can be investigated whether there is a potential problem due
to one or more extreme values of the explanatory variables for the general multiple linear
regression situation, let us once again consider the multiple regression model (8.2), and
the predicted response

(8.15)    Ŷ = X β̂ = X(XT X)−1 XT Y = HY,

where H = X(XT X)−1 XT is the so-called hat matrix, with elements hij. Its diagonal elements satisfy 1/n ≤ hii ≤ 1.¹ If hii is far away from the mean of all diagonal elements, (p+1)/n, then the i-th observation is in some way deviating from the other data points, and caution is in place. In particular, if hii equals the maximum possible value 1, then all hij, j ≠ i, will be 0, and the prediction Ŷi, which from (8.15) can be seen to satisfy

Ŷi = hii Yi + Σ_{j≠i} hij Yj,
will be equal to Yi . If hii is large, but not exactly equal to the maximum possible value
1, then with large probability Ŷi will be very close to Yi , but whether or not this will
actually happen depends on where the other points lie with respect to the i-th point.
¹ The term 1/n on the left stems from the fact that we consider models with an intercept; for models without an intercept this term would vanish.
Observation points with large hii are called potential or leverage points; hii is called the
potential or leverage of the i-th observation point. For a particular hii to be considered
large, sometimes the condition hii > 2(p + 1)/n, i.e., larger than twice the mean of all
diagonal elements of H, is used.
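A minimal sketch, assuming numpy, of how the leverages and the rule of thumb hii > 2(p+1)/n can be computed (the function names are mine):

import numpy as np

def leverages(X):
    # diagonal of the hat matrix H = X (X^T X)^{-1} X^T
    return np.diag(X @ np.linalg.solve(X.T @ X, X.T))

def leverage_points(X):
    # indices of the points with h_ii larger than twice the mean (p + 1)/n
    n, q = X.shape
    return np.where(leverages(X) > 2 * q / n)[0]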
As explained, an observation point with large hii is not necessarily influential but has
the potential to be so. In particular, two groups of n observation points with the same
values for the explanatory variables, and hence the same H, but with different realizations
of the response variables may have different influences on the regression. For instance,
for Huber’s data set, imagine what would have been the regression line and the size of
the residuals if the 6-th response would have been -8.00 instead of 0.00. The notion of
‘potential’ influence can also be understood by considering the variances of the residuals.
For the vector of residuals R it holds that

(8.17)    R = Y − Ŷ = (In×n − H)Y.

If e1, . . . , en are independent and normally distributed, then the same holds for Y1, . . . , Yn and it follows from (8.17) that R is normally distributed with expectation vector E(R) = 0 and covariance matrix Cov(R) = σ²(In×n − H). Hence, the residuals may be correlated and have different variances: Var(Ri) = σ²(1 − hii).
Example 8.9 Let us illustrate the use of the hat matrix for detection of leverage
points by considering Huber’s data set again. In the case of simple linear regression
the diagonal elements of H satisfy

hii = 1/n + (xi − x̄)² / Σ_{l=1}^{n} (xl − x̄)²,

where x̄ = (1/n) Σ_{i=1}^{n} xi. If xi = x̄, hii attains the minimum value 1/n. The larger the distance of xi to x̄ is, the larger is hii. The 6×6 hat matrix for simple linear regression for Huber's data is

          0.290   0.259   0.228   0.197   0.167  −0.141
          0.259   0.236   0.213   0.190   0.167  −0.064
    H =   0.228   0.213   0.197   0.182   0.167   0.013
          0.197   0.190   0.182   0.174   0.167   0.090
          0.167   0.167   0.167   0.167   0.167   0.167
         −0.141  −0.064   0.013   0.090   0.167   0.936 .
We see that h66 is close to 1, which agrees with the impression from the plots that the 6th point is far away from the other points. Because x5 = x̄, h55 equals the minimum value 1/n = 1/6. We also see that the further away xi is from x̄ = x5, the larger is hii. For the predicted responses it follows, for example, that Ŷ6 = 0.936 Y6 + Σ_{j≠6} h6j Yj, so that Ŷ6 is almost completely determined by Y6.
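The closed-form expression for hii given above can be used to reproduce the diagonal of H; a small numpy sketch for Huber's x-values (the variable names are mine):

import numpy as np

x = np.array([-4.0, -3.0, -2.0, -1.0, 0.0, 10.0])   # Huber's x-values
n = len(x)
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
print(np.round(h, 3))   # [0.29, 0.236, 0.197, 0.174, 0.167, 0.936]; the minimum 1/6 is attained at x = x_bar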
Instead of checking the values of the hii separately, it may be handy to make an index
plot for the leverages: the leverage of the i-th point is plotted against i. This may be
especially useful for large data sets. Figure 8.7 shows the index plot for the leverages of
Huber’s data.
Figure 8.7: Index plot for the leverages of Huber’s data for simple linear regression.
A leverage point does not necessarily have a large influence on the estimates. If, for instance, the 6-th observation of Huber's data would have been on the line with the other observations, then deletion of this point would have had little effect. In that case, the point still would have been a leverage point, but not an influence point. Obviously, not only leverage points, but also outliers, observation points with an extreme value of the response variable, can be influence points.
Instead of studying the effect of one particular observation point, one could make use of
a quantitative measure for the influence that one point has on the regression, and compare
this measure for the different points. This is especially useful when it is not completely
clear beforehand which observation points should be investigated. After comparing the
measures, one could decide which points should be inspected more closely. In the sequel an index (i) means "with the i-th observation point omitted". For example, β̂(i) is the estimator of β computed without the i-th point:
β̂(i) = (X(i)T X(i))−1 X(i)T Y(i) .
One way to quantify the influence of the i-th point is by comparing β̂ and β̂(i). Cook's distance does exactly this. To introduce this measure, we first note that under model (8.2) and the independence and normality assumptions of the measurement errors, the set of vectors b defined by

(8.19)    { b : (b − β̂)T XT X (b − β̂) ≤ (p + 1) σ̂² F(p+1),(n−p−1);1−α }

is a (1 − α)100% confidence region for the parameter vector β. This set is centered
around β̂ and has the form of an ellipsoid. When, for fixed n and p, α is chosen smaller,
the set becomes larger and the probability that the region contains the real value of β also
becomes larger. Hence, these regions form some kind of metric that can tell us whether
a particular value of b lies far away from β̂. Now we are interested whether β̂(i) lies
far away from β̂. It follows that if we could find a value of α such that β̂(i) would lie
outside the (1 − α)100% confidence region for β around β̂ if the difference between β̂(i)
and β̂ is too large, then we would have found a means to quantify the influence of the
i-th point. Experience has shown that a value of α of around 0.5, which yields values of
F(p+1),(n−p−1);1−0.5 in (8.19) around 1 for most n and p, satisfies this requirement.
The definition of Cook's distance makes use of this argumentation: the Cook's distance Di for the i-th point is defined as

(8.20)    Di = (β̂(i) − β̂)T XT X (β̂(i) − β̂) / ((p + 1) σ̂²) = (Ŷ(i) − Ŷ)T (Ŷ(i) − Ŷ) / ((p + 1) σ̂²),

where Ŷ = X β̂ and Ŷ(i) = X β̂(i). Observation points with a large Cook's distance have substantial influence on β̂ and, as can be seen from the second equality in the definition (8.20), also on the predicted values. Deletion of these points can change the results of the analysis drastically.
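A sketch of how the Cook's distances can be computed without refitting the model n times, assuming numpy; it uses the standard identity Di = Ri² hii /((p+1) σ̂² (1 − hii)²), which is equivalent to the definition (8.20) (the function name is mine):

import numpy as np

def cooks_distances(X, y):
    n, q = X.shape                               # q = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.diag(X @ XtX_inv @ X.T)               # leverages
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - q)             # estimate of the error variance
    # D_i = R_i^2 h_ii / ((p + 1) sigma2 (1 - h_ii)^2), equivalent to deleting point i and refitting
    return resid ** 2 * h / (q * sigma2 * (1 - h) ** 2)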
Example 8.10 Below the Cook’s distances for a couple of data sets are listed with
the relatively large values in bold.
Forbes’ data:
0.064, 0.005, 0.001, 0.000, 0.000, 0.001, 0.001, 0.001, 0.006, 0.002, 0.005, 0.469,
0.000, 0.054, 0.051, 0.007, 0.010.
Huber’s data:
0.520, 0.015, 0.005, 0.135, 0.096, 26.399.
We see that in only one case, in Huber’s data set, the Cook’s distance is much larger
than 1. Still, the last observation point of the cherry tree data, which has the largest
influence on β̂ and the predicted responses in both multiple and simple regression,
should be investigated. This is because the difference in Cook’s distance with the
other points is quite remarkable.
For Forbes’ data the 12-th observation point, which we found to be an outlier,
is not an influence point according to the above criterion. It does have a Cook’s
distance which is larger than that of the other points, though, and therefore has
relatively more influence than the other points.
Finally, the leverage point in Huber's data clearly is an influence point.
***
8.3 Collinearity
In this section we discuss the important concept of collinearity. If two explanatory vari-
ables have an exact linear relationship with each other, then it is intuitively clear that if
one of them is already in the model, it is not necessary any more to include the other.
This is because the variation in the response variable that can be explained by the first
variable is exactly the same as the variation that can be explained by the second variable.
Hence, if the first one is already in the model, no variation that can be explained by the
second is left over. Moreover, if both variables are included, the design matrix X is no
longer of maximal rank, X T X is no longer invertible, and the corresponding β̂ cannot
be determined. If there is an almost exact linear relationship between two variables, the
above is almost true: in this case there may be numerical problems with computation
of the inverse (X T X)−1 and determining the estimate β̂ may be difficult in the sense
that one of the estimators β̂j has a large variance, so that the corresponding estimate
may be unreliable. The same can happen if more than two explanatory variables have a
strong linear relationship. If two or more explanatory variables have a very strong linear
relationship it is said that they are (approximately) collinear. Formally, a set of variables
X1 , . . . , Xm is called (exactly) collinear if there exist constants c0 , c1 , . . . , cm , not all equal
to 0, such that
(8.21) c0 + c1 X1 + · · · + cm Xm = 0.
A strong linear relationship between two explanatory variables is the simplest form of a
collinearity. A collinearity may also exist between more than two variables, and within a
group of more than four variables more than one collinearity may be present.
Obviously, besides being a potential cause of problems with numerically solving the normal equations XT Y = XT Xβ, collinearity in X is also a statistical problem. Therefore, it is crucial to
determine whether there exist collinearities between groups of explanatory variables, not
only to prevent numerical problems, but, more importantly, to prevent bad estimates. In
the context of a statistical data analysis one generally only needs to be concerned about a
collinearity if it causes inefficient estimation of one or more of the coefficients βj . Indeed,
in case of collinearity in X the variance of one or more β̂j can be unpleasantly large, so
that the estimate β̂j is useless. If there is a collinearity between a group of explanatory
variables, at least one of them is approximately a linear combination of the others, so
that this variable does not give more information than is already contained in the others
together. If in this case the whole group is included in the model, at least one of the
components of the least squares estimator β̂ cannot be trusted.
A remedy for collinearity in a design matrix is generally not found by mere numerical
arguments. For example, from a numerical point of view the problem could be solved
by deleting one or more columns of X. However, it is the question whether the resulting
model is still useful. From a statistical point of view, the interpretation of the model should
always be taken into account. An important question in this respect is what could be the
cause of the collinearity. For instance, there is a big difference between collinearities caused
by measurement error and collinearities caused by a true linear relationship. Therefore
it is not only important to detect collinearities, but also to investigate their cause and to
study the effect of possible remedies. In the following we discuss a couple of measures that
give information about the amount of collinearity within a set of explanatory variables.
The variance inflation factor of the j-th explanatory variable is defined as VIFj = 1/(1 − Rj²), where Rj² is the determination coefficient of the regression of Xj on the other explanatory variables. It can be shown that

(8.23)    Var(β̂j) = σ² · 1/(1 − Rj²) · 1/ Σ_{i=1}^{n} (xij − x̄j)² .

This expression shows that if, due to a collinearity, VIFj is large, the variance of β̂j is inflated with precisely this factor VIFj. Note that (8.23) does not hold for j = 0, because R0² is not defined.
In conclusion, the V IFj ’s represent the increase of the variance of the estimators due
to a linear relationship between the explanatory variables, i.e., due to collinearity. As such
they indicate which estimates should be distrusted: if V IFj is close to 1, the estimate
β̂j can be trusted; if V IFj is large, β̂j is unreliable. However, based on the values of the
V IFj ’s it is difficult to distinguish the different collinearities, so that it is not clear which
Xj are involved in which collinearities. The V IFj ’s therefore do not give a solution for
the problem.
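A sketch, assuming numpy, of how the variance inflation factors can be computed directly from their definition (the function name is mine):

import numpy as np

def variance_inflation_factors(X):
    # X: n x (p+1) design matrix including the column of ones; returns VIF_1, ..., VIF_p
    vifs = []
    for j in range(1, X.shape[1]):
        xj = X[:, j]
        others = np.delete(X, j, axis=1)
        fitted = others @ np.linalg.lstsq(others, xj, rcond=None)[0]
        r2_j = 1 - np.sum((xj - fitted) ** 2) / np.sum((xj - xj.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2_j))          # VIF_j = 1 / (1 - R_j^2)
    return np.array(vifs)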
Example 8.11 The two variance inflation factors for the regression of the cherry
tree volumes on diameter and length are both equal to 1.369. Therefore, based on
the variance inflation factors there is no reason to distrust the estimates of β1 and
β2 .
***
A matrix X which has one or more columns of predominantly very small or predomi-
nantly very large elements relative to the other columns may cause numerical problems.
It is often advised in this case to scale the columns of X such that they have the same
(Euclidean) length, for example length 1. Such scaling usually decreases the condition
number of the matrix and diminishes the risk of numerical problems. However, in a re-
gression analysis the columns of the design matrix X generally have a specific meaning
and one has to make sure that the scaled version of a column still has the same meaning
in the practical context of the regression problem. Moreover, scaling of the design matrix
does not solve the problem of too large variances of the β̂j , because scaling does not have
any effect on the collinearities. Another approach that is often taken to avoid numerical
problems, namely centering of the design matrix (subtracting from each column its mean) and subsequently performing regression without an intercept, has the same drawbacks.
The condition number can be used to get a first impression about whether or not
further investigation for collinearity is needed. A frequently used, but not very well
founded, criterion for further investigation is κ(X) > 30 (for columns of roughly equal
Euclidean length). The condition number does not give any indication about where the
collinearities are. In that sense it is a weaker measure than the variance inflation factors
that show which columns of X are approximately orthogonal to the others, and which β̂j
should be distrusted.
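A minimal numpy sketch for the condition number, with an option to first scale the columns to Euclidean length 1 as discussed above (the function name and the scale_columns flag are mine):

import numpy as np

def condition_number(X, scale_columns=False):
    # kappa(X) = max_j mu_j / min_j mu_j, with mu_j the singular values of X
    if scale_columns:
        X = X / np.linalg.norm(X, axis=0)        # give every column Euclidean length 1
    mu = np.linalg.svd(X, compute_uv=False)
    return mu.max() / mu.min()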
Example 8.13 The condition number for the cherry tree data for regression on
diameter and length is equal to 959.4; scaling of the columns of the design matrix
yields 32.2. We will see below that the large condition number is not caused by
collinearity of diameter and length, which we saw in Figure 8.3 to have some linear
relationship, but not a very strong one. Instead it is due to the fact that the length
values have relatively little variation and, as a result, the corresponding column in
the design matrix is approximately collinear to the column of ones which represents
the intercept.
***
Instead of from the eigenvalues of X T X, κ(X) can also be computed from the so-called
singular value decomposition of X. The singular value decomposition is a decomposition
of the form
(8.25) X = U DV T
with the n×(p+1)- matrix U and (p+1)×(p+1) matrix V satisfying U T U = V T V = Ip+1
and the (p + 1) × (p + 1) matrix D being a diagonal matrix with nonnegative diagonal
elements µ1 , . . . , µp+1 . The µ1 , . . . , µp+1 are called the singular values of X. The singular
value decomposition is not unique, because a permutation of the diagonal elements of D,
with the columns of U and V simultaneously permuted in the same way, also yields a
singular value decomposition of X. It is easy to see that the squares of the singular values
of X are the eigenvalues of X T X, so that
(8.26)    κ(X) = √( maxj νj / minj νj ) = maxj µj / minj µj .
The numbers κk(X) = maxj µj / µk, k = 1, . . . , p + 1, are called the condition indices of X; a small singular value µk thus corresponds to a large condition index κk(X). From the decomposition (8.25) it moreover follows that X Vk = µk Uk for each k, that is,

(8.28)    v0k X0 + v1k X1 + · · · + vpk Xp = µk Uk ,

where vjk is the element (j, k) of V, Vk the k-th column of matrix V, and Uk the k-th column of U. If µk is (approximately) equal to zero, the left-hand side of (8.28) is thus (approximately) equal to the zero vector. It follows from (8.21) and (8.28) that the columns Xj for which the corresponding j-th component vjk of Vk is substantially different from 0 take part in the collinearity corresponding to µk.
The singular value decomposition (8.25) not only tells us which variables take part in
a collinearity, but it can also help us to determine which collinearities we should worry
about. As was explained at the beginning of this section, in principle we only need to
be concerned about a collinearity if it causes unreliable estimates of one or more of the
βj . Whether or not βj can be estimated accurately can be investigated by inspecting the
variance of its estimator β̂j , like we did in the subsection about variance inflation factors.
The variance inflation factors, however, could only tell us which variances are ‘inflated’,
but not which variables constitute the collinearity that possibly caused the inflation and
therefore could not help us to find a solution for the problem of inaccurate estimation.
To see how the singular value decomposition can help to indicate which collinearities should be worried about, we first recall that if a set of explanatory variables is almost collinear, each of the corresponding columns of X can almost be written as a linear combination of the others, which makes it impossible to accurately estimate the individual βj that are related to this collinearity. Conversely, if the variances of a set of β̂j s are to a large extent determined by the same small singular value, the variables corresponding to this set of β̂j s form the collinearity that corresponds to this small singular value.
variances are affected by which singular value. Now, the singular value decomposition
(8.25) gives rise to the following useful decomposition of the variances of the β̂j that
exactly shows this. We have

Cov(β̂) = σ² (XT X)−1 = σ² V D−2 V T,

so that for j = 0, 1, . . . , p,

(8.29)    Var(β̂j) = σ² Σ_{k=1}^{p+1} vjk² / µk² .
Note that each of the components of the sum in (8.29) is associated with exactly one
of the singular values µk . Because µk occurs in the denominator, the contribution of
the k-th component will be large when µk is small, or equivalently, when κk (X) is large.
Of course, the magnitude of this contribution also depends on the size of vjk in the
numerator. Therefore, if the variance of β̂j is large, then it can be deduced from (8.29)
which components have contributed to this large variance: the relative contribution of µk
to Var(β̂j ) is
(8.30)    πkj = (vjk² / µk²) / Σ_{l=1}^{p+1} (vjl² / µl²) ,    k = 1, . . . , p + 1;  j = 0, 1, . . . , p.
The numbers π1j , . . . , πp+1,j , are called the variance decomposition proportions of β̂j . If
πkj is large, and µk small, we may conclude that the magnitude of Var(β̂j ) is to a large
extent caused by the collinearity that belongs to µk , or equivalently to the collinearity
that belongs to the large condition index κk (X). Hence, if we have detected all Var(β̂j )
whose sizes to a large extent are determined by the collinearity that belongs to the large
condition index κk (X), we know that the variables that correspond to these βj form that
particular collinearity.
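A sketch, assuming numpy, of how the condition indices and the matrix of variance decomposition proportions πkj can be obtained from the singular value decomposition (the function name is mine; an exactly zero singular value leads to division by zero, while in practice computed singular values are usually small but nonzero):

import numpy as np

def condition_indices_and_vdp(X):
    # singular value decomposition X = U D V^T; columns V_k of V, singular values mu_k
    U, mu, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T
    kappa = mu.max() / mu                           # condition indices kappa_k(X)
    phi = V ** 2 / mu ** 2                          # phi[j, k] = v_jk^2 / mu_k^2, cf. (8.29)
    pi = (phi / phi.sum(axis=1, keepdims=True)).T   # pi[k, j] = pi_kj as in (8.30)
    return kappa, pi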
Whether or not a variance decomposition proportion can be considered sufficiently
large depends on the context. A proportion πkj = 1 means that a (possibly large) variance
is completely caused by one collinearity. This can easily happen if one of the condition
indices is larger than the others. A large variance may also be caused by two collinearities
of nearly equal strength. In that case proportions between 0.3 and 1 are not unusual.
We stress that a large variance decomposition proportion πkj is only problematic if
Var(β̂j ) is also large.
From the above it follows that to investigate the nature of a collinearity based on
the singular value decomposition, two approaches can be used. In both cases it is first
determined which singular values µk cause a large variance, Var(β̂l ) say. In most situations
there will be only one, and πlj ≈ 1. With the first approach, one next considers for each
such µk the column Vk of V . According to (8.28) the coordinates of this column indicate
the linear combination of the columns of X which approximately equals zero, and thus
form the collinearity belonging to µk .
The second approach is generally easier. Instead of Vk , it considers for each µk that
causes a large variance, the variance decompositions of the other β̂j . If besides Var(β̂l )
also the variances of β̂j1 , . . . , β̂jm are largely determined by µk , then this is an indication
that the columns Xl and Xj1 , . . . , Xjm are collinear. An operational description of the
second approach is: write down the matrix πkj . If µk is small (or κk large), then search in
the k-th row for the πkj1 , . . . , πkjm that are close to 1 (or at least differ substantially from
0). It is important to note that at least two columns are needed to form a collinearity:
besides πkl at least one other πkj should be relatively large for a collinearity to exist.
Finally, we remark that the existence of collinearities only depends on the structure of
the design matrix, but that the variance of an estimator being large or not also depends on
the structure of the response vector. This means that for step one of the two approaches
all data need to be considered, whereas to perform the second step it suffices to investigate
the design matrix. Once it is determined that a collinearity leads to serious problems, the
only solution is to change the design matrix X.
Consider a matrix B with five columns in which the 5th column equals twice the 4th column. Moreover, the last two
columns are orthogonal to the first three, which themselves are not orthogonal to
each other. Although this matrix does not contain a column of ones and thus differs
from the general design matrix that we use, the (second step of the) two approaches
explained above can well be illustrated with it.
The singular values of B are

µ1 = 170.7,  µ2 = 60.5,  µ3 = 7.6,  µ4 = 36368.4,  µ5 = 0.0.
Because the last two columns are exactly collinear, one of the singular values, µ5 ,
is exactly equal to 0. Also the third, µ3 , is relatively small compared to the others,
indicating the existence of one more collinearity. The singular values are permuted
so that the corresponding matrix V in the singular value decomposition of B reflects
the orthogonality structure of B by blocks of zeros:
          0.540   0.625  −0.556   0.000   0.000
         −0.836   0.383  −0.393   0.000   0.000
    V =   0.033  −0.680  −0.733   0.000   0.000
          0.000   0.000   0.000  −0.447  −0.894
          0.000   0.000   0.000  −0.894   0.447 .
(Note that the reported values of the µk and V are rounded for clarity. Of course,
these numerically determined values will never turn out to be exactly equal to 0.)
For the small value µ5 we conclude from the fact that only the values of the 4th
and 5th element in the 5th column of V are substantially different from 0, that
the 4th and the 5th column of the “design matrix” B are collinear. Furthermore,
for the relatively small value µ3 we see from the values of the elements in the 3rd
column of V that the first three columns of the “design matrix” B are collinear.
This illustrates the second step of the first approach.
The condition indices and variance decomposition proportions for B are
cond.ind.    vdp(β̂1)   vdp(β̂2)   vdp(β̂3)   vdp(β̂4)   vdp(β̂5)
2.131e+02     0.002      0.000      0.000      0.000      0.000
6.008e+02     0.020      0.015      0.013      0.000      0.000
4.784e+03     0.979      0.977      0.987      0.000      0.000
1.000e+00     0.000      0.000      0.000      0.000      0.000
∞             0.000      0.000      0.000      1.000      1.000
In the third row, with the large condition index 4.784e+03, the proportions for β̂1, β̂2 and β̂3 are large, and in the last row, with condition index ∞, those for β̂4 and β̂5 equal 1; this again identifies the two collinearities. The same conclusions would have been reached if another permutation of the singular values would have been chosen: this would only have resulted in a permutation of the rows of the matrix with condition indices and variance proportions. This illustrates the second step of the second approach.
If the second approach is being performed on a scaled version of B, we obtain
cond.ind.    vdp(β̂1)   vdp(β̂2)   vdp(β̂3)   vdp(β̂4)   vdp(β̂5)
1.326e+00     0.001      0.001      0.047      0.000      0.000
1.600e+01     0.994      0.994      0.953      0.000      0.000
1.039e+00     0.005      0.005      0.000      0.000      0.000
1.000e+00     0.000      0.000      0.000      0.000      0.000
∞             0.000      0.000      0.000      1.000      1.000
We see that in this case the largest condition index is still infinitely large. This
should not come as a surprise, because an exact collinearity between the last two columns does not disappear by scaling the columns of a matrix. The other condition indices have
become smaller, as expected, but still the same conclusions as for the unscaled
matrix may be drawn. This illustrates that scaling may reduce numerical problems,
but does not make collinearities disappear.
***
For the cherry tree data, the matrix of condition indices and variance decomposition proportions for the regression of volume on diameter and length can be computed in the same way. The third row of this matrix contains a large condition index and two large variance
proportions. The latter belong to β̂0 and β̂2 . Apparently, the column of ones
which represents the intercept, and length are collinear. The large variance of the
estimator of the intercept for regression of volume on diameter and length that we
found in Example 8.1 is most likely caused by this collinearity.
Scaling of the design matrix makes its condition slightly better:
cond.ind.   vdp(β̂1)   vdp(β̂2)   vdp(β̂3)
  1.000      0.001      0.004      0.001
  9.964      0.055      0.827      0.015
 32.178      0.945      0.169      0.984
Scaling therefore may be better from a numerical point of view, but it does not
make the collinearity between the column of ones and length disappear.
Due to collinearity the estimator of the intercept for regression of volume on diameter and length has a larger variance than for regression on diameter only, even though the model with both explanatory variables is a plausible one.
lengths the simpler linear regression suffices. This range is of course more or less
the range that is of practical interest too. Moreover, if we are mostly interested
in predicting the volume Y = β0 + β1 X1 + β2 X2 and to a lesser extent in the
exact values of β0 , β1 and β2 , then we do not even need to be concerned about the
collinearity of the intercept and length.
***
8.3.4 Remedies
As mentioned earlier, it is impossible to give general guidelines for solving problems with
a regression analysis due to collinearity. Deletion of one or more columns or replacement
of the columns involved in a collinearity by a suitable linear combination of these columns
could solve the problems. However, whether the new model still makes sense depends on
the context, that is, the real situation that the model aims to describe and the goal of
the study. Thorough knowledge of the experimental situation and common sense are the
best guidelines here.
We remark that in the case of collinearity the use of least squares estimators is not
recommended. Although they are still the best unbiased estimators, there exist better
biased ones. We will not elaborate on this.
We conclude with two examples that illustrate the complexity of the collinearity issue.

In the first example a student performs an experiment on four trial fields to study the effect of the quantities X1 and X2 (in kg) of two chemical compounds on a response Y. The quantities used on the four fields are:

field   X1   X2
  1      0    0
  2      0    1
  3      1    0
  4      1    1
One month after the experiment started it turns out that one more trial field is
available. The student decides to use this fifth field to obtain additional information.
Because the chemical compounds came in 50 kg bags, the student chooses to use
the remainder and spread on the fifth field 48 kg of both compounds. The design
matrix for this experiment becomes
X0   X1   X2
 1    0    0
 1    0    1
 1    1    0
 1    1    1
 1   48   48
Since the correlation coefficient between X1 and X2 is 0.999, the last two columns
of this matrix are highly collinear. This is confirmed by the values of the other
collinearity measures: for the variance inflation factors of the above matrix we
find V IF1 = V IF2 = 1/(1 − ρ2 (X1 , X2 )) = 903, and the variance decomposition
proportions are
cond.ind.   prop(β̂0)   prop(β̂1)   prop(β̂2)
  1.0          0           0           0
 34.3          1           0           0
 73.2          0           1           1
The collinearity in the design matrix is solely due to the high values of the com-
pounds for the fifth field: the last two columns of the design matrix for the experi-
ment with only the four original fields have correlation coefficient 0. Scaling of the
matrix is no option here, for the only sensible scaling would be the one in which
the column of ones is given the same length as that of the other two. All other
scalings would mean that regression would be performed on quantities that are not
actually tested and thus would not give a realistic picture.
This example also shows why centering of the design matrix, another frequently
used approach for avoiding numerical problems, is often not useful: the means of X1
and X2 are 10, so that centering would yield negative values of compound quantities.
The resulting new columns have no practical meaning any more.
What now is the best thing that the student could have done with the fifth field?
If the model
Y = β0 + β1 X1 + β2 X2 + e
holds in all five cases, then how should, given the first four points, the additional
point be chosen such that the most accurate estimates of β would have been ob-
tained? It turns out that this is indeed achieved by making X1 and X2 for the fifth
field as large as possible. For example, if the point (x51 , x52 ) = (48, 48) is added,
then one gets Var(β̂1 ) = Var(β̂2 ) = 0.5σ 2 , whereas if (x51 , x52 ) = (0.5, 0.5), it holds
that Var(β̂1 ) = Var(β̂2 ) = σ 2 . The real problem in this case is not the collinearity
in the design matrix, but that the linear model most likely is not suited for all five
situations. It is highly probable that the model is a reasonable description of reality
for small values of X1 and X2 , but not for large ones.
***
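The variance claim for the choice of the fifth field can be checked with a few lines of numpy (the function name is mine); the returned values are the diagonal elements of (XT X)−1 corresponding to β1 and β2, i.e. Var(β̂j)/σ²:

import numpy as np

def slope_variances(x5):
    # design matrix: the four original fields plus a fifth field with (x51, x52) = (x5, x5)
    X = np.array([[1, 0, 0],
                  [1, 0, 1],
                  [1, 1, 0],
                  [1, 1, 1],
                  [1, x5, x5]], dtype=float)
    cov = np.linalg.inv(X.T @ X)      # Cov(beta_hat) / sigma^2
    return cov[1, 1], cov[2, 2]       # Var(beta_1)/sigma^2 and Var(beta_2)/sigma^2

print(slope_variances(48))    # approximately (0.5, 0.5)
print(slope_variances(0.5))   # approximately (1.0, 1.0)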
When the collinearity measures do not indicate collinearity, this does not necessarily imply that there is no collinearity present in the design matrix. On the other hand, if the collinearity measures do indicate collinearity, this may be caused by a different problem than collinearity itself. In particular, leverage points can mask or induce collinearity. The following example illustrates this.
Example 8.17 The top-left picture in Figure 8.8 is a scatter plot of two variables
that are highly collinear. If we consider the collinearity measures, we find for the
variance inflation factors V IF1 = V IF2 = 1.09, and for the condition number
κ = 14.6. These values do not reflect the collinearity. It is obvious that this is most
likely caused by the isolated observation point. This point is a leverage point with
leverage 0.998. Ignoring this point, we obtain the top-right scatter plot in Figure 8.8,
and the collinearity measures become V IF1 = V IF2 = 257.02 and κ = 210.6.
The bottom-left graph in Figure 8.8 shows a scatter plot of two variables that
are definitely not collinear. The isolated point is a leverage point with leverage
0.965. The variance inflation factors are V IF1 = V IF2 = 1.03, and κ = 30.68.
Ignoring the leverage point, we find V IF1 = V IF2 = 1.20 and κ = 15.74; the
corresponding scatter plot is the bottom-right plot in Figure 8.8. In this case the
leverage point introduces collinearity to some extent: the condition number with the
point included is much larger than without the point, whereas the variance inflation
factors are barely influenced by the presence of the point.
***
Both examples of Example 8.17 once again illustrate the importance of looking at the
data first.
Figure 8.8: Scatter plots of data sets for which a leverage point masks (top) and induces (bottom)
collinearity.
Chapter 9
References
Chatfield, C., Collins, A., (1981). Introduction to Multivariate Analysis. Chapman and
Hall/CRC.
Christensen, R., (1997). Log-Linear Models and Logistic Regression. Springer, New York, Berlin.
Dobson, A.J., Barnett, A., (2008). An Introduction to Generalized Linear Models, 3rd ed. Chapman and Hall/CRC.
Efron, B., Tibshirani, R., (1986). Bootstrap methods for standard errors, confidence
intervals, and other measures of statistical accuracy. Statistical Science 1, 54-75.
Efron, B., Tibshirani, R., (1993). An Introduction to the Bootstrap. Chapman and
Hall/CRC.
Everitt, B.S., (1992). The Analysis of Contingency Tables. Chapman and Hall/CRC,
London.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A., (2005). Robust Statistics: the Approach based on Influence Functions. Wiley-Interscience.
Hoaglin, D.C., Mosteller, F., Tukey, J.W., (2000). Understanding Robust and Exploratory
Data Analysis. Wiley-Interscience.
Hoaglin, D.C., Mosteller, F., Tukey, J.W., (2006). Exploring Data Tables, Trends, and
Shapes. Wiley-Interscience.
Huber, P.J., (1981). Robust Statistics. John Wiley & Sons, New York.
Huff, D., (1993). How to lie with statistics. W.W. Norton & Company.
Lehmann, E.L., D’Abrera, H.J.M., (2006). Nonparametrics: Statistical Methods based on
Ranks. Springer.
Mardia, K.V., Kent, J.T., Bibby, J.M., (1980). Multivariate Analysis. Academic Press,
London.
McCullagh, P., Nelder, J., (1989). Generalized Linear Models. Chapman and Hall/CRC.
Seber, G.A.F., Lee, A.J., (2003). Linear Regression Analysis, 2nd ed. John Wiley & Sons, New York.
Seber, G.A.F., Wild, C.J., (2003). Nonlinear Regression. John Wiley & Sons, New York.
Shapiro, S.S., Wilk, M.B., (1965). An analysis of variance test for normality (complete samples). Biometrika 52, 591-611.
Shorack, G.R., Wellner, J.A., (2009). Empirical Processes with Applications to Statistics.
Society for Industrial & Applied Mathematics.
Tukey, J.W., (1977). Exploratory Data Analysis. Addison-Wesley, Reading.
Weisberg, S., (2005). Applied Linear Regression, 3rd ed. John Wiley & Sons, New York.