Mathematica 50, 2 (2011) 111–119
Concept of Data Depth and Its Applications*

Ondřej VENCÁLEK
Department of Mathematical Analysis and Applications of Mathematics
Faculty of Science, Palacký University
17. listopadu 12, 771 46 Olomouc, Czech Republic
e-mail: ondrej.vencalek@upol.cz
Dedicated to Lubomír Kubáček on the occasion of his 80th birthday
(Received March 31, 2011)
Abstract
Data depth is an important concept of the nonparametric approach to multivariate data analysis. The main aim of the paper is to review possible applications of the data depth, including outlier detection, robust and affine-equivariant estimates of location, rank tests for multivariate scale difference, control charts for multivariate processes, and depth-based classifiers solving the discrimination problem.

Key words: data depth, nonparametric multivariate analysis, applications, rank

2010 Mathematics Subject Classification: 62G05, 62G15, 60D05, 62H05
1 Introduction
Data depth is an important concept of the nonparametric approach to multivariate data analysis. It provides one possible way of ordering multivariate data. We call this ordering a central-outward ordering. Basically, any function which provides a reasonable central-outward ordering of points in multidimensional space can be considered a depth function. This vague understanding of the notion of a depth function has led to a variety of depth functions, which have been introduced ad hoc since the 1970s. A formal definition of a depth function was formulated by Zuo and Serfling in 2000 [8].
* Supported by the grant GAUK B-MAT 150110.
The most widely used depth function is the halfspace depth function. The halfspace depth of a point $x \in \mathbb{R}^d$ with respect to a probability measure $P$ is defined as the minimum probability mass carried by any closed halfspace containing $x$, that is

$$D(x; P) = \inf_H \{ P(H) \colon H \text{ a closed halfspace in } \mathbb{R}^d,\ x \in H \}.$$

The halfspace depth is sometimes called location depth or Tukey depth, as it was first defined by Tukey in 1975 [7]. The halfspace depth is well defined for all $x \in \mathbb{R}^d$. Its sample version (the empirical halfspace depth), defined on a random sample $X_1, \dots, X_n$ from the distribution $P$, is the halfspace depth with respect to the empirical probability measure $P_n$. This definition is very intuitive and easily interpretable. Moreover, the halfspace depth has many desirable properties, which have made this depth function very popular and widely used. In particular, the halfspace depth is affine invariant and has all the other desirable properties stated in the general definition of a depth function by Zuo and Serfling.
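To illustrate the definition, the empirical halfspace depth can be approximated by scanning a finite number of directions: for each direction $u$ one records the fraction of sample points in the closed halfspace $\{y \colon u^T y \ge u^T x\}$ and takes the minimum over all scanned directions. The following sketch in R implements this random-direction approximation (the function name and the approximation scheme are ours; exact algorithms are available, e.g., in the R packages depth and ddalpha):

# Approximate empirical halfspace depth of a point x with respect to the
# sample X (an n x d matrix) by scanning m random directions. The result is
# an upper bound that approaches the exact depth as m grows.
halfspace.depth <- function(x, X, m = 1000) {
  depth <- 1
  for (i in 1:m) {
    u <- rnorm(ncol(X))
    u <- u / sqrt(sum(u^2))              # random direction on the unit sphere
    frac <- mean(X %*% u >= sum(u * x))  # mass of the halfspace {y: u'y >= u'x}
    depth <- min(depth, frac)
  }
  depth
}

# Example: a central point has depth close to 1/2, an outlying point close to 0
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)        # 100 bivariate standard normal points
halfspace.depth(c(0, 0), X)
halfspace.depth(c(3, 3), X)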
The notion of rank is crucial in many applications. Consider a $d$-dimensional probability distribution $P$ and a random sample $X_1, \dots, X_n$ from this distribution. (The empirical probability measure based on the sample is again denoted by $P_n$.) For any point $x \in \mathbb{R}^d$ we define

$$r_P(x) = P\bigl(D(X; P) \le D(x; P) \mid X \sim P\bigr) \qquad (1)$$

and

$$r_{P_n}(x) = \#\{ X_i \colon D(X_i; P_n) \le D(x; P_n),\ i = 1, \dots, n \} / n. \qquad (2)$$
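Once the depths of all sample points are available, the sample rank (2) is a one-line computation. A small sketch in R, reusing the function halfspace.depth above (because of the random-direction approximation the computed depths are slightly noisy; an exact depth routine would remove this caveat):

# Empirical depth-based rank r_{P_n}(x): the fraction of sample points whose
# depth does not exceed the depth of x, cf. (2).
depth.rank <- function(x, X) {
  dx <- halfspace.depth(x, X)
  di <- apply(X, 1, function(xi) halfspace.depth(xi, X))
  mean(di <= dx)
}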
2 Outlier detection: a bagplot
Rousseeuw et al. [6] proposed a bivariate generalization of the univariate boxplot, the so-called bagplot. They used the halfspace depth to order the data, but other depth functions might be used as well. The bagplot consists of

- the deepest point (the point with maximal depth),
- the bag, that is, the central area which contains 50 % of all points; the bag is usually dark coloured,
- the fence, which is found by magnifying the bag by a factor of 3; the fence is usually not plotted; observations outside the fence are flagged as outliers,
- the loop, which is the area between the bag and the fence; it is usually light coloured.
The bagplot procedure is available in the R library aplpack. As an example, we plot the bagplot of the car data of Chambers and Hastie, available in the library rpart. Figure 1 displays the car weight and engine displacement of 60 cars. Five outliers were detected.
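The figure can be reproduced along the following lines (a sketch; the column names Weight and Disp. are those of the car.test.frame data as distributed with rpart):

library(rpart)     # provides car.test.frame, the car data of Chambers and Hastie
library(aplpack)   # provides bagplot()
bagplot(car.test.frame$Weight, car.test.frame$Disp.,
        xlab = "Weight", ylab = "Displacement")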
Figure 1: An example of a bagplot.
3 Affine-equivariant and robust estimates of location
Donoho and Gasko [1] have shown that two basic location estimators based on the halfspace depth, the deepest point and the trimmed mean (with trimming based on the halfspace depth), are both affine equivariant and robust (in the sense of a high breakdown point). The combination of these two properties is quite rare in multivariate statistics. The most important results are summarized in the following theorem:
Theorem 1 Let $X_1, \dots, X_n$ be a sample determining the empirical version $P_n$ of an absolutely continuous distribution $P$ on $\mathbb{R}^d$, with $d \ge 2$. Assume the data to be in general position (no ties, no more than two points on any line, three in any plane, and so forth).

Consider the deepest point $T^*(P_n) = \arg\max_x D(x; P_n)$ and the $\alpha$-trimmed mean $T^\alpha(P_n) = \mathrm{Ave}(X_i \colon D(X_i; P_n) \ge \alpha n)$, the average of all points whose depth is at least $\alpha n$.

Denote $\alpha^* := \max_x D(x; P)$ ($\alpha^* = 1/2$ if $P$ is centrally symmetric). Then

1. The breakdown point of $T^*(P_n)$ is greater than or equal to $1/(d+1)$. It converges almost surely to $1/3$ as $n \to \infty$ if $P$ is centrally symmetric.

2. For each $\alpha \le \alpha^*/(1 + \alpha^*)$, $T^\alpha(P_n)$ is well defined for sufficiently large $n$ and its breakdown point converges almost surely to $\alpha$.
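Both estimators are straightforward to compute once the sample depths are known. A minimal sketch in R, reusing halfspace.depth from the introduction: since that function returns depth normalized by n, the trimming condition "depth at least αn" becomes "normalized depth at least α"; the deepest point is approximated by the deepest sample point:

# Depth of every sample point of X (n x d matrix)
depths <- apply(X, 1, function(xi) halfspace.depth(xi, X))

# Deepest point T*(P_n), approximated over the sample points only
T.star <- X[which.max(depths), ]

# alpha-trimmed mean T^alpha(P_n): average of points with normalized depth >= alpha
alpha <- 0.2
T.alpha <- colMeans(X[depths >= alpha, , drop = FALSE])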
4 Rank tests for multivariate scale difference
Liu and Singh [4] combined ranks based on data depth with well-known one-dimensional nonparametric procedures to test scale difference between two or more distributions.

Consider two $d$-dimensional distributions $P_1$ and $P_2$, which possibly differ in dispersion only. Denote $X_1, \dots, X_{n_1}$ a random sample from $P_1$ and $Y_1, \dots, Y_{n_2}$ a random sample from $P_2$. Denote the combined sample as $\{W_1, \dots, W_{n_1+n_2}\} = \{X_1, \dots, X_{n_1}, Y_1, \dots, Y_{n_2}\}$ and denote $P_{n_1+n_2}$ the empirical distribution function based on the combined sample.
We want to test the hypothesis $H_0$ of equal scales against the alternative that $P_2$ has a larger scale, in the sense that the scale of $P_2$ is an expansion of the scale of $P_1$. If the scale of $P_2$ is greater, then obviously the observations from the second distribution tend to be more outlying than the observations from $P_1$. Consider the sum of the non-normalized ranks for the sample from $P_2$:

$$R(Y_1, \dots, Y_{n_2}) = (n_1 + n_2) \sum_{i=1}^{n_2} r_{P_{n_1+n_2}}(Y_i).$$
Now we proceed as in the case of testing for a (negative) location shift in the univariate setting. This leads us to the Wilcoxon rank-sum procedure. When $n_1$ and $n_2$ are sufficiently large, we can rely on the asymptotic behaviour of the test statistic (under the null hypothesis):

$$R^* = \frac{R(Y_1, \dots, Y_{n_2}) - n_2(n_1 + n_2 + 1)/2}{[n_1 n_2 (n_1 + n_2 + 1)/12]^{1/2}} \xrightarrow{D} N(0, 1),$$

and hence we reject $H_0$ if $R^* \le \Phi^{-1}(\alpha)$, where $\Phi^{-1}(\alpha)$ is the $\alpha$-quantile of the standard normal distribution.
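The whole two-sample procedure can be sketched in R as follows (our illustration, again with the approximate halfspace depth; possible ties in depth are ignored):

# Depth-based Wilcoxon-type test of scale expansion of P2 against P1
depth.scale.test <- function(X, Y) {
  W  <- rbind(X, Y)                  # combined sample
  n1 <- nrow(X); n2 <- nrow(Y)
  dW <- apply(W, 1, function(w) halfspace.depth(w, W))
  ranks <- sapply(dW, function(d) sum(dW <= d))  # non-normalized ranks
  R <- sum(ranks[(n1 + 1):(n1 + n2)])            # rank sum of the Y-sample
  R.star <- (R - n2 * (n1 + n2 + 1) / 2) /
            sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
  c(statistic = R.star, p.value = pnorm(R.star)) # reject for small R.star
}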
We can proceed similarly when considering more than two (say $K > 2$) distributions. We test the hypothesis that the underlying distributions are identical against the alternative that the scales of these distributions are not all the same, in the sense of scale contraction. The construction of the test follows the idea of the well-known Kruskal-Wallis test. Let $\bar{R}_i$ denote the average of the non-normalized ranks (based on data depth) of the observations from the $i$-th sample in the combined sample. The total number of observations in the combined sample (from all $K$ samples) is $N$. Under the null hypothesis it holds:

$$T = \frac{12}{N(N+1)} \sum_{i=1}^{K} n_i \bar{R}_i^2 - 3(N+1) \xrightarrow{D} \chi^2_{K-1}.$$

We reject the null hypothesis at an approximate level $\alpha$ if $T \ge \chi^2_{K-1}(1-\alpha)$, where $\chi^2_{K-1}(1-\alpha)$ is the $(1-\alpha)$ quantile of the chi-squared distribution with $(K-1)$ degrees of freedom.
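The K-sample version can be sketched analogously (our illustration; rank() assigns midranks in the case of ties):

# Kruskal-Wallis-type statistic on depth-based ranks; 'samples' is a list of
# K matrices with d columns each.
depth.kw.test <- function(samples) {
  W  <- do.call(rbind, samples)
  N  <- nrow(W)
  n  <- sapply(samples, nrow)
  dW <- apply(W, 1, function(w) halfspace.depth(w, W))
  ranks  <- rank(dW)                     # ranks of depths in the combined sample
  groups <- rep(seq_along(samples), n)
  Rbar   <- tapply(ranks, groups, mean)  # average rank per sample
  stat   <- 12 / (N * (N + 1)) * sum(n * Rbar^2) - 3 * (N + 1)
  c(statistic = stat,
    p.value = pchisq(stat, df = length(samples) - 1, lower.tail = FALSE))
}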
There is a simple graphical tool developed by Liu, Parelius and Singh (see [5]) to visualize differences in the scales of multivariate distributions. They defined a scale curve as a plot of $p \in (0,1)$ versus the volume of $C_p$, the $p$-th central region. (The $p$-th central region $C_p$ is defined as the smallest region enclosed by depth contours to amass probability $p$, that is $C_p = \bigcap_t \{R(t) \colon P(R(t)) \ge p\}$, where $R(t) = \{x \in \mathbb{R}^d \colon D(x; P) > t\}$.) The sample scale curve, based on a random sample $X_1, \dots, X_n$, plots the volumes of the convex hulls containing the $\lceil np \rceil$ most central points versus $p$. By plotting the scale curves of the compared distributions in one plot, the difference in scales can be easily visualized.
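In two dimensions the sample scale curve is easy to compute with the base function chull and the shoelace formula for the area of a polygon. A sketch (the grid of values of p and the example samples in the comments are our choices):

# Sample scale curve: area of the convex hull of the ceiling(n*p) deepest
# sample points, evaluated on a grid of p (two-dimensional data only).
scale.curve <- function(X, p.grid = seq(0.1, 0.99, by = 0.01)) {
  n <- nrow(X)
  depths <- apply(X, 1, function(xi) halfspace.depth(xi, X))
  ord <- order(depths, decreasing = TRUE)   # points from deepest to most outlying
  sapply(p.grid, function(p) {
    C <- X[ord[1:ceiling(n * p)], , drop = FALSE]
    hull <- C[chull(C), ]                   # vertices of the convex hull
    x <- hull[, 1]; y <- hull[, 2]
    abs(sum(x * c(y[-1], y[1]) - c(x[-1], x[1]) * y)) / 2   # shoelace formula
  })
}

# Example: a larger-scale sample gives a uniformly higher curve
# X1 <- matrix(rnorm(500), ncol = 2); X2 <- 2 * matrix(rnorm(500), ncol = 2)
# pg <- seq(0.1, 0.99, by = 0.01)
# plot(pg, scale.curve(X2), type = "l", lty = 2); lines(pg, scale.curve(X1))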
The following example illustrates the methodology. We simulated 250 points from the bivariate standard normal distribution N(0, I) and 250 points from a bivariate normal distribution with a larger scale. The value of the test statistic was $R^* = -2.14$, which is less than $\Phi^{-1}(0.05) = -1.64$. We thus (correctly) reject the null hypothesis of identical distributions. The difference in dispersions can be seen in Figure 2.
Figure 2: Empirical scale curves based on samples of 250 points from N(0, I) (solid line) and from a bivariate normal distribution with larger scale.

5 Control charts for multivariate processes

Liu [3] proposed control charts for multivariate processes based on data depth. Let $G$ denote the prescribed ("in-control") distribution of the monitored $d$-dimensional process and let $X_1, X_2, \dots$ denote the successive measurements. The chart plots the depth-based ranks $r_G(X_1), r_G(X_2), \dots$ (or $r_{G_n}(X_1), r_{G_n}(X_2), \dots$ if $G$ needs to be estimated). Under the null hypothesis (the process is in control) it holds:

1. $r_G(X) \sim U[0, 1]$,

2. $r_{G_n}(X) \xrightarrow{D} U[0, 1]$, provided that $D(\cdot; G_n) \to D(\cdot; G)$ uniformly as $n \to \infty$.
The uniform convergence of $D(\cdot; G_n)$ holds, for example, for the halfspace depth if $G$ is absolutely continuous. The expected value of $r_G(X_i)$ under the in-control distribution is $1/2$, which serves as the central line of the chart; values of the rank below a small constant $\alpha$ (the lower control limit) signalize a possible quality deterioration.
Similarly to Liu [3], we can demonstrate the procedure on simulated data. Let the prescribed distribution $G$ be the bivariate standard normal distribution. First, we generate 500 observations from this distribution to get a sample version $G_n$ (we consider $G$ to be unknown, to mimic real applications). Subsequently, we generate new observations: 40 observations from the bivariate standard normal distribution (process in control) and then 40 observations from a bivariate normal distribution with the mean shifted to $(2, 2)^T$ and both scales doubled. The control chart is shown in Figure 3.
Figure 3: Control chart for a multivariate process.
There is one so-called false alarm in the first half of the observations. The out-of-control status in the second half of the observations is correctly detected 30 times (out of 40 observations). The change is apparent from the chart.
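The simulation can be reproduced along the following lines (a sketch reusing depth.rank from the introduction; the lower control limit α = 0.05 is our choice, as the paper does not state the value used):

set.seed(2)
G.sample <- matrix(rnorm(1000), ncol = 2)   # 500 reference points ~ N(0, I)
new.obs  <- rbind(
  matrix(rnorm(80), ncol = 2),                     # 40 in-control observations
  matrix(rnorm(80, mean = 2, sd = 2), ncol = 2))   # 40 shifted, scales doubled
r <- apply(new.obs, 1, function(x) depth.rank(x, G.sample))
plot(r, type = "b", ylim = c(0, 1),
     xlab = "observation number", ylab = "depth-based rank")
abline(h = 0.5, lty = 2)   # central line: expected rank 1/2 while in control
abline(h = 0.05)           # lower control limit, here alpha = 0.05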
Liu called this type of control chart the r chart. She also proposed multivariate versions of the Shewhart chart (the Q chart) and of the CUSUM chart (the S chart).
6 Depth-based methods of discrimination
During the last ten years, quite a lot of effort has been put into the development of a nonparametric approach to the discrimination problem which uses the methodology of data depth.
Recall the nub of the discrimination problem. Consider $k \ge 2$ groups of objects. Each object can be represented by $d \in \mathbb{N}$ numerical characteristics. Each group of objects is characterized by the distribution of the numerical characteristics of its members. We denote these distributions $P_1, \dots, P_k$. The distributions are unknown. In what follows we assume the distributions to be absolutely continuous. Consider further $k$ independent random samples $X_{i,1}, \dots, X_{i,n_i}$, $i = 1, \dots, k$, from the distributions $P_1, \dots, P_k$. These random samples (known as the training set) provide the only available information on the considered distributions. Any vector $x \in \mathbb{R}^d$ representing an object not included in the training set is considered to be a realization of a random vector from one of the distributions $P_1, \dots, P_k$, but it is unknown from which of them. One needs to estimate to which group the object belongs. The goal is to find a general rule which allocates an arbitrary $d$-dimensional real vector to one of the considered distributions (groups). The rule (known as a classifier) has the form of a function $d \colon \mathbb{R}^d \to \{1, \dots, k\}$.
Probably the most widely used classifier based on data depth is the so-called maximal depth classifier. It is based on the simple idea of assigning a new observation (represented by a vector $x$) to the distribution with respect to which it has maximal depth. An arbitrary depth function can be used:

$$d(x) = \arg\max_{j=1,\dots,k} D(x; P_j). \qquad (3)$$

Since the theoretical depth is usually unknown, an empirical version based on the data from the training set is used:

$$d(x) = \arg\max_{j=1,\dots,k} D(x; \hat{P}_j), \qquad (4)$$

where $D(x; \hat{P}_j)$ is the depth of $x$ with respect to the empirical distribution of the $j$-th group, based on the appropriate points from the training set ($X_{j,1}, \dots, X_{j,n_j}$).
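A minimal sketch of the empirical classifier (4) in R, reusing halfspace.depth (representing the training set as a list of matrices is our convention):

# Maximal depth classifier: assign x to the group in whose training sample
# it is deepest.
max.depth.classifier <- function(x, training) {
  depths <- sapply(training, function(Xj) halfspace.depth(x, Xj))
  which.max(depths)
}

# Example with two groups differing in location
training <- list(matrix(rnorm(100), ncol = 2),
                 matrix(rnorm(100, mean = 2), ncol = 2))
max.depth.classifier(c(1.8, 2.1), training)   # allocates to group 2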
A detailed inspection of the method is provided in a paper by Ghosh and Chaudhuri [2]. The maximal depth classifier is known to be asymptotically optimal (it has the lowest possible average misclassification rate) in certain situations. Ghosh and Chaudhuri showed the asymptotic optimality of the classifier for a very special case, assuming that the considered distributions:

- are elliptically symmetric, with density functions strictly decreasing in every direction from their centers of symmetry,
- differ only in location (they have equal dispersions and are of the same type),
- have equal prior probabilities.
In addition, the depth function used must also satisfy some conditions. Ghosh and Chaudhuri formulated the following optimality theorem:

Theorem 2 Suppose $P_1, \dots, P_k$ are elliptically symmetric distributions with densities $f_i(x) = g(\|x - \mu_i\|)$, $i = 1, \dots, k$, where $g$ satisfies $g(cx) < g(x)$ for every $x$ and every constant $c > 1$. Consider the equal-priors case. When using the halfspace, simplicial, majority, or projection depth, the average misclassification rate of the empirical depth-based classifier (4) converges to the optimal Bayes risk as $\min(n_1, \dots, n_k) \to \infty$.
The maximal depth classifier is not optimal when the considered distributions differ in dispersion. This fact can cause serious problems even in a very simple situation: consider, for example, two bivariate normal distributions with equal prior probabilities, $P_1 = N\bigl((0,0)^T, 4I\bigr)$ and $P_2 = N\bigl((1,0)^T, I\bigr)$, where $I$ denotes the $2 \times 2$ identity matrix. Denote the new observation $x = (x_1, x_2)^T$. In this case the optimal Bayes rule has the following form: $d(x) = 2$ iff $(x_1 - 4/3)^2 + x_2^2 < 4/9 + (16/3) \ln 2$. The expected misclassification rate for group 1 is about 0.3409, for group 2 it is about 0.1406, hence the optimal Bayes risk is about 0.2408. The theoretical maximal depth classifier, which is equivalent to the classifier minimizing the Mahalanobis distance, has the form: $d(x) = 2$ iff $(x_1 - 4/3)^2 + x_2^2 < 4/9$. The expected misclassification rate is 0.0435 for group 1 and 0.8104 for group 2, yielding an average misclassification rate of about 0.4270, which is much higher than the optimal Bayes risk. (The expected misclassification rates were computed by numerical integration of the densities.)
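These rates are easy to confirm by Monte Carlo simulation in place of the numerical integration (a sketch; with 10^6 draws the digits agree only approximately):

set.seed(3)
n <- 1e6
bayes.c <- 4/9 + 16/3 * log(2)   # threshold of the optimal Bayes rule
# Group 1: N((0,0)', 4I); misclassified iff allocated to group 2
x1 <- rnorm(n, 0, 2); x2 <- rnorm(n, 0, 2)
mean((x1 - 4/3)^2 + x2^2 < bayes.c)   # approx. 0.3409 (Bayes rule)
mean((x1 - 4/3)^2 + x2^2 < 4/9)       # approx. 0.0435 (maximal depth rule)
# Group 2: N((1,0)', I); misclassified iff allocated to group 1
y1 <- rnorm(n, 1, 1); y2 <- rnorm(n, 0, 1)
mean((y1 - 4/3)^2 + y2^2 >= bayes.c)  # approx. 0.1406 (Bayes rule)
mean((y1 - 4/3)^2 + y2^2 >= 4/9)      # approx. 0.8104 (maximal depth rule)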
As we can see from the example above, the class of problems that can be satisfactorily solved using the classifier (4) is quite narrow. The problem of the maximal depth classifier arises from the discrepancy between the depth function and the density function. The optimal Bayes classifier is based on the density function. While the depth function is affine invariant, the density function does not have this property. More sophisticated classifiers are needed to overcome this problem.
7 Conclusion
The concept of data depth provides a useful tool for nonparametric multivariate statistical inference. Ordering (and ranks) based on data depth provides a basis for many nonparametric multivariate procedures such as outlier detection, estimation of some basic characteristics of a random vector, testing for multivariate scale difference, construction of control charts for multivariate processes, or construction of classifiers solving the discrimination problem.
References
[1] Donoho, D. L., Gasko, M.: Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Annals of Statistics 20 (1992), 1803–1827.

[2] Ghosh, A. K., Chaudhuri, P.: On maximum depth and related classifiers. Scandinavian Journal of Statistics 32 (2005), 327–350.

[3] Liu, R. Y.: Control charts for multivariate processes. Journal of the American Statistical Association 90 (1995), 1380–1387.

[4] Liu, R. Y., Singh, K.: Rank tests for multivariate scale difference based on data depth. In: Liu, R. Y., Serfling, R., Souvaine, D. L. (eds.): Robust Multivariate Analysis, Computational Geometry and Applications. DIMACS Series, American Mathematical Society, 2006, 17–34.

[5] Liu, R. Y., Parelius, J. M., Singh, K.: Multivariate analysis by data depth: Descriptive statistics, graphics and inference (with discussion). Annals of Statistics 27 (1999), 783–858.

[6] Rousseeuw, P. J., Ruts, I., Tukey, J.: The bagplot: a bivariate boxplot. The American Statistician 53 (1999), 382–387.

[7] Tukey, J.: Mathematics and picturing data. In: Proceedings of the 1975 International Congress of Mathematicians 2 (1975), 523–531.

[8] Zuo, Y., Serfling, R.: General notions of statistical depth function. Annals of Statistics 28 (2000), 461–482.