Week 2 Notes
One is often interested in the expected value or the covariances of linear combinations of the elements in X.
highly correlated, someone has the idea to work with the average $Y_1 = \tfrac{1}{2}(X_1 + X_2)$ and the increase $Y_2 = (X_2 - X_1)$. What are the mean and variance of $Y = (Y_1, Y_2)$? Clearly,
$$C = \begin{pmatrix} \tfrac12 & \tfrac12 \\ -1 & 1 \end{pmatrix}, \qquad \text{and} \qquad CX = \begin{pmatrix} \tfrac12 (X_1 + X_2) \\ X_2 - X_1 \end{pmatrix}.$$
Therefore,
$$E(Y) = C\,E(X) = \begin{pmatrix} \tfrac12(8.0 + 8.2) \\ 8.2 - 8.0 \end{pmatrix} = \begin{pmatrix} 8.1 \\ 0.2 \end{pmatrix},$$
and
$$\operatorname{Cov}(Y) = \begin{pmatrix} \tfrac12 & \tfrac12 \\ -1 & 1 \end{pmatrix} \operatorname{Cov}(X) \begin{pmatrix} \tfrac12 & -1 \\ \tfrac12 & 1 \end{pmatrix} = \begin{pmatrix} 0.9 & 0 \\ 0 & 0.4 \end{pmatrix}.$$
You can check that the diagonal elements in Cov(Y) are those you would obtain with the usual formulas $\operatorname{Var}(X_1 \pm X_2) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) \pm 2\operatorname{Cov}(X_1, X_2)$, together with $\operatorname{Var}(aX) = a^2\operatorname{Var}(X)$.
However, notice that Cov(Y) also tells us about the covariance between Y1 and
Y2 . In fact, it turns out that working with the average score and the increase may
be a good idea, as they are uncorrelated and can hence be thought to measure
complementary features (namely, overall academic performance and improvement
after 1 year of studies).
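As a quick numerical check, the same computation can be done in R. This is a sketch only; the mean vector (8.0, 8.2) and the covariance matrix below are the ones implied by the calculation above.

mu <- c(8.0, 8.2)                      # E(X) implied by the example above
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), 2, 2)       # Cov(X) implied by the example above
C <- matrix(c(1/2, 1/2,
              -1,  1), 2, 2, byrow = TRUE)
C %*% mu                               # E(Y) = C E(X): (8.1, 0.2)
C %*% Sigma %*% t(C)                   # Cov(Y) = C Cov(X) C^T: diag(0.9, 0.4)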
Definition 4.2.1 (Principal Components). Let $X \in \mathbb{R}^p$ be a random vector with $E(X^T X) < \infty$. The 1st principal component (PC) loading is a vector $v_1 \in \mathbb{R}^p$, $|v_1| = 1$, that maximises the variance of $v_1^T X$, or in other words
$$v_1 = \arg\max_{|v| = 1} \operatorname{Var}(v^T X).$$
Figure 17: Several 1D projections of the data, and their respective density estimates. Notice that some of the projections are well spread, whereas some are very concentrated.
The $k$-th PC loading ($2 \le k \le p$) is a vector $v_k \in \mathbb{R}^p$, $|v_k| = 1$, that maximises $\operatorname{Var}(v_k^T X)$ subject to
$$\operatorname{Cov}(v_k^T X, v_j^T X) = 0, \qquad j = 1, \ldots, k - 1.$$
2. Notice that the PC loadings live in the same space as the random
vector X. The PC scores are random variables (take values in R).
You can stack them to create a vector in Rp , but the k-th entry of
this vector will not have the same meaning as the k-th entry of X!
Figure 18: The 2 PC projections of the data (PC 1 in blue, PC 2 in red), and the
respective density estimates of the PC scores.
Theorem 4.2.3 (PC loadings, population version). Let $X$ be a random vector with covariance $\Sigma = \operatorname{Cov}(X)$, and suppose that $\Sigma$ has eigenvalues $\lambda_1 \geq \ldots \geq \lambda_p \geq 0$ and corresponding orthonormal eigenvectors $e_1, \ldots, e_p$. Then $v_1 = e_1, \ldots, v_p = e_p$ are the PC loadings of $X$, and the corresponding PC scores are $Y_1 = e_1^T X, Y_2 = e_2^T X, \ldots, Y_p = e_p^T X$. Furthermore, $\operatorname{Var}(Y_1) = \lambda_1, \ldots, \operatorname{Var}(Y_p) = \lambda_p$.
Proof. Let us find the first principal component. The goal is to find $v_1$ maximizing
$$\frac{v_1^T \Sigma v_1}{v_1^T v_1}.$$
Recall that the eigendecomposition allows us to reconstruct $\Sigma = E \Lambda E^T$, where $E$ is a $p \times p$ orthogonal matrix ($E^T E = I = E E^T$) with the eigenvector $e_i$ as its $i$th column. Writing $z_1 = E^T v_1$, we can write the objective function as
$$\frac{v_1^T \Sigma v_1}{v_1^T v_1} = \frac{z_1^T \Lambda z_1}{z_1^T z_1} = \frac{\sum_{i=1}^p \lambda_i z_{1i}^2}{\sum_{i=1}^p z_{1i}^2} \leq \lambda_1,$$
and it suffices to see that for $v_1 = e_1$ we attain the maximum value $\lambda_1$, which is the case since $z_1 = E^T e_1 = (1, 0, \ldots, 0)^T$. For the $k$-th principal component ($k \geq 2$), we maximise the same quotient subject to $v_k^T v_1 = 0, \ldots, v_k^T v_{k-1} = 0$. Because $v_i = e_i$ for $i < k$, the first $k - 1$ components of $z_k = E^T v_k$ are equal to zero. Therefore the objective function becomes
$$\frac{z_k^T \Lambda z_k}{z_k^T z_k} = \frac{\sum_{i=k}^p \lambda_i z_{ki}^2}{\sum_{i=k}^p z_{ki}^2} \leq \lambda_k,$$
hence again it suffices to see that for $v_k = e_k$ we attain the maximum value $\lambda_k$. We can see that this is the case, since $z_k = E^T e_k = (0, \ldots, 1, \ldots, 0)^T$ (where the 1 is in the $k$th position), and hence
$$\frac{\sum_{i=k}^p \lambda_i z_{ki}^2}{\sum_{i=k}^p z_{ki}^2} = \lambda_k.$$
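As a small sanity check of the theorem in R (a sketch with an arbitrarily chosen covariance matrix): the eigenvalues of Σ are the variances of the PC scores, and the scores are uncorrelated.

Sigma <- matrix(c(2.0, 0.6, 0.2,
                  0.6, 1.0, 0.3,
                  0.2, 0.3, 0.5), 3, 3)   # an arbitrary covariance matrix
ev <- eigen(Sigma)                        # eigenvalues in decreasing order
E <- ev$vectors
# Cov(E^T X) = E^T Sigma E is diagonal, with the eigenvalues on the diagonal
round(t(E) %*% Sigma %*% E, 10)
ev$values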
This means that there is an isometry between the PC scores and the
original vector X, and that no information is lost by working with
the PC scores. In particular, X and Y have the same total variance
(check this as an exercise).
Writing $Y^{(k)} = (Y_1, \ldots, Y_k)^T$ for the vector of the first $k$ PC scores, we have
$$E|Y^{(k)} - E\,Y^{(k)}|^2 = \lambda_1 + \cdots + \lambda_k,$$
and
$$\frac{E|Y^{(k)} - E\,Y^{(k)}|^2}{E|Y - E\,Y|^2} = \frac{\lambda_1 + \cdots + \lambda_k}{\lambda_1 + \cdots + \lambda_k + \cdots + \lambda_p}$$
is the percentage of variance explained by the first $k$ PCs.
maximal total variance. Fortunately, it turns out that our iterative definition of PCA is a good one:
Proof. The proof uses Poincaré’s inequalities (Lemma 2.4.3); see the video.
This result tells us that when looking at the coordinates of X with respect to q orthonormal vectors, the total variance is maximized by taking these to be the first q eigenvectors of Cov(X), but minimized when taking these to be the last q eigenvectors of Cov(X).
Now suppose that instead of looking at coordinates of $X$ with respect to some orthonormal vectors, we are interested in orthogonal projections $P X$ of $X$. If we let $P_q = E_q E_q^T$ and $P_{-q} = E_{-q} E_{-q}^T$, where $E_q$ (respectively $E_{-q}$) denotes the $p \times q$ matrix whose columns are the first (respectively last) $q$ eigenvectors of $\operatorname{Cov}(X)$, we can easily compute (do it!) that
Tr(Cov(Pq X)) = λ1 + · · · + λq ,
and
Tr(Cov(P−q X)) = λp−q+1 + · · · + λp .
Notice that Pq , P−q are both orthogonal projections with trace q (or rank q). The
following result tells us that in terms of total variance (or percentage of variance),
the best (respectively worst) orthogonal projection with trace q is given by P = Pq
(respectively P = P−q ).
that is, the total variance of P X is less than or equal to the total variance of Pq X, but greater than or equal to the total variance of P−q X.
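For instance, the two trace identities above are easy to verify numerically (a sketch with an arbitrary covariance matrix and q = 2):

Sigma <- matrix(c(3.0, 0.8, 0.4,
                  0.8, 2.0, 0.5,
                  0.4, 0.5, 1.0), 3, 3)
ev <- eigen(Sigma)
p <- nrow(Sigma); q <- 2
Eq  <- ev$vectors[, 1:q]                  # first q eigenvectors
Emq <- ev$vectors[, (p - q + 1):p]        # last q eigenvectors
Pq  <- Eq  %*% t(Eq)                      # projection onto the leading eigenspace
Pmq <- Emq %*% t(Emq)                     # projection onto the trailing eigenspace
sum(diag(Pq  %*% Sigma %*% t(Pq)))        # = lambda_1 + ... + lambda_q
sum(ev$values[1:q])
sum(diag(Pmq %*% Sigma %*% t(Pmq)))       # = lambda_{p-q+1} + ... + lambda_p
sum(ev$values[(p - q + 1):p])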
Proposition 4.2.7 (Characterization of PCA by Approximations).
Under the assumptions and notation of Proposition 4.2.5, and E X = 0, if
P is a p × p orthogonal projection matrix with Tr(P ) = q, then
and the sample covariance between $y_1, \ldots, y_n \in \mathbb{R}$ and $z_1, \ldots, z_n \in \mathbb{R}$, where $y_i, z_i$ are paired, is given by $\sum_{i=1}^n (y_i - \bar{y})(z_i - \bar{z})/(n - 1)$, where $\bar{y} = \sum_i y_i/n$ and similarly for $\bar{z}$.
So far we focused on the population case, or random vector case, where we assume we know the covariance matrix $\Sigma$. In practice, we get data $x_1, \ldots, x_n \overset{\text{iid}}{\sim} X \in \mathbb{R}^p$, and these are stacked in a data matrix $X = (x_1, \ldots, x_n)^T$. The first step for computing the sample covariance matrix is to center the columns of $X$. If $H = I_n - \mathbf{1}\mathbf{1}^T/n$, then $HX$ is a column-centered version of $X$ (see exercises). Define $S = X^T H X/(n - 1)$, the sample covariance of $x_1, \ldots, x_n$.
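A short sketch in R (with simulated data) checking that this formula agrees with the built-in cov():

set.seed(1)                              # simulated data, for illustration only
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
H <- diag(n) - matrix(1, n, n) / n       # centering matrix H = I_n - 1 1^T / n
S <- t(X) %*% H %*% X / (n - 1)          # sample covariance via the formula above
max(abs(S - cov(X)))                     # numerically zero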
to reconstruct the data from the PC scores. Letting $E$ be the matrix with columns equal to the orthonormal eigenvectors $\hat{e}_1, \ldots, \hat{e}_p$ of $S$, we notice that the matrix $Y = XE$ has the $k$-th PC scores as its $k$-th column. Furthermore, since $EE^T = I_p$, we have $X = YE^T$.
Taking row $i$ of this expression, we get
$$x_i = \sum_{k=1}^p (Y)_{ik}\, \hat{e}_k, \qquad (4.2.1)$$
that is, we can express each observation as a linear combination of the PC loadings,
with weights given by the PC scores!
Notice that since the sample covariance matrix S is only an estimator of the true
covariance matrix Σ, the eigenpairs (êi , λ̂i ) are only estimators of the true eigenpairs
(ei , λi ). We will discuss the sampling distribution of the eigenstructure of the sample
covariance in Section 4.2.8. See also the video which shows that S is an unbiased
estimator of Σ.
Importantly, there is a link between the PC scores and the spectral decomposition
of XXT , and a link between PC loadings and scores and the SVD of X, see exercises.
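A sketch of these links in R (with simulated data; the precise statements are left to the exercises): the reconstruction (4.2.1) holds exactly, and the right singular vectors of the centered data matrix HX coincide, up to sign, with the eigenvectors of S.

set.seed(2)                              # simulated data, for illustration only
n <- 50; p <- 4
X <- matrix(rnorm(n * p), n, p)
H <- diag(n) - matrix(1, n, n) / n
S <- t(X) %*% H %*% X / (n - 1)
E <- eigen(S)$vectors                    # estimated PC loadings as columns
Y <- X %*% E                             # PC scores, as in the text
max(abs(X - Y %*% t(E)))                 # reconstruction (4.2.1): numerically zero
sv <- svd(H %*% X)                       # SVD of the centered data matrix
max(abs(abs(sv$v) - abs(E)))             # same loadings, up to sign
max(abs(sv$d^2 / (n - 1) - eigen(S)$values))  # squared singular values give the eigenvalues of S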
which is a truncation of (4.2.1). This tells us that we can approximate the data
using q PCs, and the approximation error is (do the calculations!) given by
The crucial question is “how do I choose q?”. The larger the q, the better the approximation you get, but the lower the q, the more parsimonious the representation (and hence the easier it is to interpret the resulting PC scores and loadings, since there are fewer of them). We now give two commonly used rules for choosing the number of PCs:
The 90% rule-of-thumb. Recall that the variance of the $i$th principal component is $\hat\lambda_i$ and the proportion of explained variance is $\hat\lambda_i / \sum_{i=1}^p \hat\lambda_i$. A possible rule-of-thumb is to choose the smallest number $k$ of components that explain 90% of the variance, that is
$$\frac{\sum_{i=1}^k \hat\lambda_i}{\sum_{i=1}^p \hat\lambda_i} > 0.9.$$
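In R, this rule can be read off directly from the output of prcomp; a sketch using the mtcars variables that appear later in Example 4.2.12:

data(mtcars)
pca <- prcomp(mtcars[, 1:7])
lambda_hat <- pca$sdev^2                              # variances of the PCs
cumsum(lambda_hat) / sum(lambda_hat)                  # cumulative proportion of variance explained
which(cumsum(lambda_hat) / sum(lambda_hat) > 0.9)[1]  # smallest k explaining > 90%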
Figure 19: Two scree plots. Remember that the y-axis represents the variance, and not the standard deviation.
These rules are quite subjective. Interesting features of the data can be in the
first few PCs, which are kept (and hence analysed) by one of the two rules given
above, but it may also be that interesting features of the data are hidden in PCs
that are not kept by either rule.
year grades was
$$\operatorname{Cov}(X) = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}.$$
Its eigenvectors are $e_1 = (1/\sqrt{2}, 1/\sqrt{2})^T$ and $e_2 = (-1/\sqrt{2}, 1/\sqrt{2})^T$, with corresponding eigenvalues $\lambda_1 = 1.8$, $\lambda_2 = 0.2$. The principal components are therefore $e_1^T X = (X_1 + X_2)/\sqrt{2}$ and $e_2^T X = (X_2 - X_1)/\sqrt{2}$. That is, the first PC (score) is proportional to the average grade and the second PC (score) to the difference between grades. The proportion of variability explained by the first component is $1.8/(1.8 + 0.2) = 0.9$.
Now suppose that the first year grades are more variable than the second year
grades, so that
$$\operatorname{Cov}(X) = \begin{pmatrix} 4 & 1.6 \\ 1.6 & 1 \end{pmatrix}.$$
Notice that the correlation between $(X_1, X_2)$ is still $1.6/(\sqrt{4}\,\sqrt{1}) = 0.8$ as before.
The eigenvectors now are $e_1^T = (0.917, 0.397)$ and $e_2^T = (-0.397, 0.917)$ with corresponding eigenvalues $\lambda_1 = 4.69$, $\lambda_2 = 0.307$. Hence the principal components are
$$e_1^T X = 0.917 X_1 + 0.397 X_2, \qquad \text{and} \qquad e_2^T X = 0.917 X_2 - 0.397 X_1.$$
As before the first principal component is found by adding up X1 and X2 , but
now X1 is assigned a higher weight than X2 because X1 explains more variability
than X2 (since Var(X1 ) > Var(X2 )). Similarly, the second principal component is a
weighted difference between X2 and X1. The proportion of variability explained by the
first component is 4.69/(4.69 + 0.307) = 0.938, which is higher than before.
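These numbers are easy to reproduce in R (a quick sketch):

Sigma <- matrix(c(4, 1.6,
                  1.6, 1), 2, 2)
ev <- eigen(Sigma)
ev$values                        # 4.69 and 0.307
ev$vectors                       # (0.917, 0.397) and (-0.397, 0.917), up to sign
ev$values[1] / sum(ev$values)    # proportion explained by the first PC: 0.938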
## [3,] 0.5 0.5 1.0 0.5 0.5
## [4,] 0.1 0.1 0.5 1.0 0.9
## [5,] 0.1 0.1 0.5 0.9 1.0
## [1] 0.88
Here all entries in the first eigenvector have the same sign (first column in E),
and in fact have a similar value. The first PC is computed as Y1 = −(0.43X1 +
0.43X2 + 0.51X3 + 0.43X4 + 0.43X5 ), which roughly speaking is proportional to the
mean across X1 , . . . , X5 . Hence the first PC is approximately an average of the
variables in X.
The second PC is Y2 = 0.5(X1 + X2) − 0.5(X4 + X5), which is the difference between the means of the two blocks of variables.
Example 4.2.12 (cars dataset). Consider the cars dataset, which measures 7 continuous characteristics for 32 cars. Below we load the data and compute the PC
loadings.
require(grDevices)
op <- options(digits = 2) # print only 2 digits
data(mtcars)
nn <- rownames(mtcars)
cars_pca <- prcomp(mtcars[,1:7])
cars_pca$rot
options(op)
1. The results of the PCA can change substantially when the variance of an input variable Xi changes. We will discuss this issue in Section 4.2.7.
Larger Yk means a larger contribution of ek in X, so if e1 = (0.8, −0.6)T (say) where X = (“age”, “height”)T, then an observation with a larger Y1 (PC score 1) could be interpreted as having older age but smaller height. We call such an interpretation an approximate interpretation, since the previous statement is only correct if all other PC scores are kept fixed.
3. The magnitude of eki does not correspond to the value of the correlation between PC k and variable Xi; see Proposition 4.2.14 below. Therefore, when interpreting PC scores, be cautious in the terminology you use: you can use terms like “weighted average”, “contrast”, or “essentially”, as exemplified in the examples above.
4. In some sense PC scores are weird, because they are linear combinations of the original variables, which could be totally incomparable: for instance X1 could be the miles/gallon consumption of a car, and X2 could be the number of cylinders of the car. Then what is the unit of a linear combination of X1, X2? If one is in such a situation, then it is best to perform PCA on standardized variables, as explained in Section 4.2.7. However, if all variables are in the same unit, then taking a linear combination of them makes more sense, and you shouldn't standardize variables.
As mentioned in the remark, the value of the eki is not equal to the correlation between PC score k and variable Xi, but is related according to the following formula.
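A quick numerical check of the standard identity $\operatorname{Cor}(Y_k, X_i) = e_{ik}\sqrt{\lambda_k}/\sqrt{\sigma_{ii}}$ (presumably the formula referred to here), where $e_{ik}$ denotes the $i$-th entry of $e_k$; this is a sketch with an arbitrary covariance matrix:

Sigma <- matrix(c(2.0, 0.6, 0.2,
                  0.6, 1.0, 0.3,
                  0.2, 0.3, 0.5), 3, 3)   # an arbitrary covariance matrix
ev <- eigen(Sigma)
E <- ev$vectors; lambda <- ev$values
# direct computation: Cov(Y_k, X_i) = (Sigma e_k)_i and Var(Y_k) = lambda_k
cor_direct <- (t(E) %*% Sigma) / outer(sqrt(lambda), sqrt(diag(Sigma)))
# the identity: Cor(Y_k, X_i) = e_{ik} sqrt(lambda_k) / sqrt(sigma_ii)
cor_formula <- t(E %*% diag(sqrt(lambda)) / sqrt(diag(Sigma)))
max(abs(cor_direct - cor_formula))        # numerically zero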
Thus, Xi measured in meters will have a much greater effect on the principal components than when measured in kilometers! Furthermore, the interpretation of PC scores is a bit puzzling when variables are not measured in the same unit (or represent different types of quantities), as mentioned in Remark 4.2.13.
Whether this is an issue or not depends on the specific application. As a general rule, we want to avoid the results being sensitive to the scale of the input measurements, but there are some cases in which we may want to consider that variables with higher variance are more informative. For instance, this is the case when Xi is observed in two different groups, so that the total Var(Xi) is the sum of the between-groups variance and the within-groups variance. In this situation, Var(Xi) tends to be larger whenever there are differences between groups, and it may be desirable that these variables have a higher weight on the principal components. Another example is when all Xi s are measured in the same units, e.g. when measuring blood pressure at several points over time.
In any case, whenever we want to give the same weight (a priori) to all variables, an easy fix is to work with standardized variables $Z_i = X_i/\sqrt{\sigma_{ii}}$, which ensures that $\operatorname{Var}(Z_1) = \cdots = \operatorname{Var}(Z_p) = 1$. In matrix notation, $Z = (Z_1, \ldots, Z_p)^T = V^{-1/2} X$, where $V = \operatorname{diag}(\operatorname{Var}(X_1), \ldots, \operatorname{Var}(X_p))$ is the diagonal matrix with the variances of the $X_i$s on its diagonal. It is easy to see that working with standardized variables is
equivalent to obtaining principal components on the correlation matrix (rather than
the covariance matrix).
where $\operatorname{Cor}(X)$ is the correlation matrix of $X$. The total variability in $Z$ is $\operatorname{Tr}(\operatorname{Cov}(Z)) = \sum_{i=1}^p 1 = p$, hence the proportion of explained variability by the first $k$ components is
$$\frac{\sum_{i=1}^k \lambda_i}{p},$$
where $\lambda_i$ is the $i$-th largest eigenvalue of the correlation matrix of $X$. In the sample world, the data matrix with standardized variables is $Z = X V^{-1/2}$, where $V$ is the diagonal matrix with the sample variances on its diagonal, that is, the $i$-th diagonal entry of $V$ is $\left(X^T H X/(n - 1)\right)_{ii}$. We now give an example to illustrate how standardizing the variables changes the PC loadings (and hence scores).
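A short sketch (with simulated data) confirming the equivalence: the loadings obtained by prcomp on standardized variables coincide, up to sign, with the eigenvectors of the correlation matrix, and the variances of the PCs are its eigenvalues.

set.seed(3)                               # simulated data, for illustration only
n <- 200
X <- cbind(rnorm(n, sd = 1), rnorm(n, sd = 5), rnorm(n, sd = 0.2))
X[, 2] <- X[, 2] + 2 * X[, 1]             # introduce some correlation
pca_std <- prcomp(X, scale. = TRUE)       # PCA on standardized variables
ev_cor <- eigen(cor(X))                   # eigendecomposition of the correlation matrix
max(abs(abs(pca_std$rotation) - abs(ev_cor$vectors)))   # same loadings, up to sign
pca_std$sdev^2
ev_cor$values                             # same eigenvalues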
Example 4.2.15 (Link between standardized and non-standardized PCs). Suppose
we observe the random sample from X = (X1 , X2 )T shown in Figure 20 (top left).
The two variables are highly positively correlated, in fact the sample correlation coefficient is 0.88, the sample variance of $X_1$ is 0.86 and the sample variance of $X_2$ is 4.39. The covariance is $0.88 \cdot \sqrt{0.86} \cdot \sqrt{4.39} = 1.71$. The eigenvectors (PC loadings) $v_1$ and $v_2$ are shown in solid red and dashed red, respectively. On the top right
subfigure of Figure 20, we see the same data after having standardized the variables
to have variance 1. We now see that the PC loadings are different to those in the
top left subfigure. In the bottom subfigures of Figure 20, we see the PC scores
for original (non-standardized) variables (bottom left) and standardized variables
(bottom right). Notice that we cannot go from one to the other just by scaling the
axes. The code for generating the figure is given below.
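# plot_vector() draws the segment from -alpha*v to +alpha*v through the origin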
plot_vector <- function(v, alpha=1, ...){
lines(alpha*matrix(c(-v[1], v[1], -v[2], v[2]), 2,2), ...)
}
set.seed(1)
x = rnorm(200)
y = rnorm(200) + 2*x
X = cbind(x,y)
op <- par(mfcol=c(2,2))
par(mai=c(.4,.7,.3,.1))
plot(X, cex=.6, asp=1, xlim=c(-2,2), ylim=c(-5, 5), pch=1, axes=TRUE,
main='Non-standardized variables')
abline(h=0, v=0, lty=3)
Xpca <- prcomp(X)
plot_vector(Xpca$rot[,1], alpha=10, lwd=2, col=2)
plot_vector(Xpca$rot[,2], alpha=10, lty=2, lwd=2, col=2 )
par(mai=c(.7,.7,.1,.1))
plot(Xpca$x, cex=.6, asp=1, xlim=c(-2,2), ylim=c(-5, 5), pch=1, axes=TRUE)
abline(h=0, v=0, lty=3)
par(mai=c(.4,.7,.3,.1))
plot(scale(X), cex=.6, asp=1, xlim=c(-2,2), ylim=c(-5, 5), pch=1, axes=TRUE,
yaxt='s', main='Standardized variables')
abline(h=0, v=0, lty=3)
Xpca <- prcomp(X, scale=TRUE)
plot_vector(Xpca$rot[,1], alpha=10, lwd=2, col=2)
plot_vector(Xpca$rot[,2], alpha=10, lty=2, lwd=2, col=2 )
par(mai=c(.7,.7,.1,.1))
plot(Xpca$x, cex=.6, asp=1, xlim=c(-2,2), ylim=c(-5, 5), pch=1, axes=TRUE)
abline(h=0, v=0, lty=3)
par(op)
Figure 20: Dataset with non-standardized variables (top left) and standardized variables (top right). The thick solid red line represents PC loading 1, and the thick dashed red line represents PC loading 2. Notice that they are different in the two plots. The bottom figures represent the PC scores 1 vs. 2 for non-standardized variables (bottom left) and standardized variables (bottom right).
and eigenvalue of the sample covariance matrix. Choosing $\hat{e}_k$ such that $\hat{e}_k^T e_k \geq 0$, we have

1. $\sqrt{n}(\hat{\lambda}_k - \lambda_k) \xrightarrow{d} N(0, 2\lambda_k^2)$. Further, $\hat{\lambda}_1, \ldots, \hat{\lambda}_p$ are asymptotically independent.

2. $\sqrt{n}(\hat{e}_k - e_k) \xrightarrow{d} N_p\!\left(0, \ \sum_{j \neq k} \frac{\lambda_j \lambda_k}{(\lambda_j - \lambda_k)^2}\, e_j e_j^T\right)$.
set.seed(1)
sd_x <- 2
n <- 10
Nrep = 5e4
rep_eigenvalues0 <- array(NA, c(Nrep, 2))
rep_eigenvectors0 <- array(NA, c(Nrep, 2, 2))
for( i in 1:Nrep ){
x = rnorm(n, sd=sd_x)
y = rnorm(n)
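# 'rot' is a 2x2 rotation matrix (its columns play the role of the true eigenvectors);
# it is defined in an earlier chunk that is not shown here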
X00 = cbind(x,y) %*% t(rot)
X00_pca <- prcomp(X00)
rep_eigenvectors0[i,,] <- X00_pca$rotation
rep_eigenvalues0[i,] <- X00_pca$sdev^2
}
n <- 50
Nrep = 1e3
set.seed(1)
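# 'sd_seq', 'rot' and the result arrays 'rep_eigenvalues'/'rep_eigenvectors'
# are defined in earlier chunks that are not shown here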
for(sd_i in seq(along=sd_seq))
{
sd_x <- sd_seq[sd_i]
for( i in 1:Nrep ){
x = rnorm(n, sd=sd_x)
y = rnorm(n)
Figure 21: Sample distribution of the variance explained by the first PC (top) and sample distribution of the percentage of variance explained by the first PC (bottom). The distributions have been computed using N = 5 · 10^4 replicates of a sample of size 10. Overlaid is the Gaussian density estimate (notice in particular that, in the top figure, the Gaussian gives non-zero probability to negative values...). The true parameter is denoted by the red vertical dashed line. Notice that the distributions are asymmetric. This asymmetry goes away for larger sample sizes (this cannot be seen in this figure).
X0 = cbind(x,y) %*% t(rot)
X0_pca <- prcomp(X0)
rep_eigenvectors[sd_i, i,,] <- X0_pca$rotation
rep_eigenvalues[sd_i, i,] <- X0_pca$sdev^2
}
}
op <- par(mfrow=c(3,2))
for(sd_i in seq(along=sd_seq)){
par(mai=c(0.1,0.0,0.1,0))
#
plot(X0, asp=1, xlim=c(-5,5), ylim=c(-5,5), axes=TRUE, xaxt='n',
yaxt='n', type='n')
for(i in 1:min(Nrep,5e2)){
plot_vector(rep_eigenvectors[sd_i, i,,1], alpha=10, lwd=.1,
col=gray(0, alpha=.2))
}
if(sd_seq[sd_i] != 1) ## plot true eigenvector if well defined
plot_vector(rot[,1], alpha=10, lwd=2, col=1, lty=2)
#
h1 <- hist(rep_eigenvalues[sd_i, ,1], 'FD', plot=FALSE)
h2 <- hist(rep_eigenvalues[sd_i, ,2], plot=FALSE, 'FD')
ylim <- range(c(h1$density, h2$density))
xlim <- range(c(h1$breaks, h2$breaks))
par(mai=c(0.3,0.3,0.1,0))
plot(h1$breaks, c(h1$density, 0), col=1, xlim=xlim, ylim=ylim,
type='s', bty='n')
abline(v=sd_seq[sd_i]^2, col=1, lwd=2, lty=2)
lines(h2$breaks, c(h2$density,0), type='s', col=2)
abline(v=1, col=2, lwd=2, lty=2)
}
par(op)
Figure 22: Empirical distribution of eigenvectors and eigenvalues of the sample covariance. Each row corresponds to a covariance with a specific eigengap (the difference between the first and second eigenvalues), and the difference between the rows is only in the eigenvalues of the covariance. The true eigenvalues (unknown) are depicted by dashed vertical lines in the right-hand-side plot (black for the largest eigenvalue, red for the 2nd largest eigenvalue). Overlaid is a histogram of the distribution of the sample eigenvalues, in the corresponding color, obtained by simulating 10^3 replicates of a sample of size 50. On the left, the first PC loading is plotted as a thin line for half of the 10^3 replicates. Overlaid as a thick dashed black line is the true 1st eigenvector of the covariance. In the bottom row, the first two eigenvalues are equal, and hence the true 1st eigenvector is not uniquely defined.