Lecture_note5
5.1.1 Population Principal Components
Let the random vector $X' = [X_1, X_2, \ldots, X_p]$ have the covariance matrix $\Sigma$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Consider the linear combinations $Y_i = a_i'X$, $i = 1, 2, \ldots, p$. Then
$$\mathrm{Var}(Y_i) = a_i'\Sigma a_i, \qquad i = 1, 2, \ldots, p$$
$$\mathrm{Cov}(Y_i, Y_k) = a_i'\Sigma a_k, \qquad i, k = 1, 2, \ldots, p$$
Define:

First principal component = the linear combination $a_1'X$ that maximizes $\mathrm{Var}(a_1'X)$ subject to $a_1'a_1 = 1$.

Second principal component = the linear combination $a_2'X$ that maximizes $\mathrm{Var}(a_2'X)$ subject to $a_2'a_2 = 1$ and $\mathrm{Cov}(a_1'X, a_2'X) = 0$.

At the $i$th step,

$i$th principal component = the linear combination $a_i'X$ that maximizes $\mathrm{Var}(a_i'X)$ subject to $a_i'a_i = 1$ and $\mathrm{Cov}(a_i'X, a_k'X) = 0$ for $k < i$.
Result 5.1 Let $\Sigma$ be the covariance matrix associated with the random vector $X' = [X_1, X_2, \ldots, X_p]$. Let $\Sigma$ have the eigenvalue-eigenvector pairs $(\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_p, e_p)$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Then the $i$th principal component is given by
$$Y_i = e_i'X = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p, \qquad i = 1, 2, \ldots, p$$
With these choices,
$$\mathrm{Var}(Y_i) = e_i'\Sigma e_i = \lambda_i, \qquad i = 1, 2, \ldots, p$$
$$\mathrm{Cov}(Y_i, Y_k) = e_i'\Sigma e_k = 0, \qquad i \neq k$$
If some $\lambda_i$ are equal, the choices of the corresponding coefficient vectors $e_i$, and hence $Y_i$, are not unique.
Result 5.2 Let $X' = [X_1, X_2, \ldots, X_p]$ have covariance matrix $\Sigma$, with eigenvalue-eigenvector pairs $(\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_p, e_p)$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Let $Y_1 = e_1'X, Y_2 = e_2'X, \ldots, Y_p = e_p'X$ be the principal components. Then
$$\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \sum_{i=1}^{p}\mathrm{Var}(X_i) = \lambda_1 + \lambda_2 + \cdots + \lambda_p = \sum_{i=1}^{p}\mathrm{Var}(Y_i)$$
Result 5.3 If $Y_1 = e_1'X, Y_2 = e_2'X, \ldots, Y_p = e_p'X$ are the principal components obtained from the covariance matrix $\Sigma$, then
$$\rho_{Y_i, X_k} = \frac{e_{ik}\sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}}, \qquad i, k = 1, 2, \ldots, p$$
are the correlation coefficients between the components $Y_i$ and the variables $X_k$. Here $(\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_p, e_p)$ are the eigenvalue-eigenvector pairs for $\Sigma$.
Example 5.1 Suppose the random variables $X_1$, $X_2$ and $X_3$ have the covariance matrix
$$\Sigma = \begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{pmatrix}.$$
Calculate the population principal components.
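As a quick numerical check of Example 5.1 (a sketch, not part of the original notes, assuming NumPy is available), the following code eigendecomposes $\Sigma$, orders the eigenpairs, and reports $\mathrm{Var}(Y_i)$ together with the correlations $\rho_{Y_i, X_k}$ from Result 5.3.

```python
import numpy as np

# Eigendecompose Sigma from Example 5.1 and form the population PCs.
Sigma = np.array([[ 1.0, -2.0, 0.0],
                  [-2.0,  5.0, 0.0],
                  [ 0.0,  0.0, 2.0]])

lam, E = np.linalg.eigh(Sigma)          # eigh: ascending eigenvalues for symmetric matrices
order = np.argsort(lam)[::-1]           # sort so that lambda_1 >= ... >= lambda_p
lam, E = lam[order], E[:, order]        # columns of E are e_1, ..., e_p

for i in range(len(lam)):
    print(f"Y_{i+1} = {np.round(E[:, i], 3)} . X,  Var(Y_{i+1}) = {lam[i]:.3f}")

# Correlations rho_{Y_i, X_k} = e_{ik} sqrt(lambda_i) / sqrt(sigma_kk);
# rows index the variables X_k, columns index the components Y_i.
rho = E * np.sqrt(lam) / np.sqrt(np.diag(Sigma))[:, None]
print(np.round(rho, 3))
```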
Suppose $X$ is distributed as $N_p(\mu, \Sigma)$. We know that the density of $X$ is constant on the $\mu$-centered ellipsoids
$$(x - \mu)'\Sigma^{-1}(x - \mu) = c^2,$$
which have axes $\pm c\sqrt{\lambda_i}\,e_i$, $i = 1, 2, \ldots, p$, where the $(\lambda_i, e_i)$ are the eigenvalue-eigenvector pairs of $\Sigma$. Assuming $\mu = 0$, the equation above can be rewritten as
$$c^2 = x'\Sigma^{-1}x = \frac{1}{\lambda_1}(e_1'x)^2 + \frac{1}{\lambda_2}(e_2'x)^2 + \cdots + \frac{1}{\lambda_p}(e_p'x)^2 = \frac{1}{\lambda_1}y_1^2 + \frac{1}{\lambda_2}y_2^2 + \cdots + \frac{1}{\lambda_p}y_p^2,$$
where $e_1'x, e_2'x, \ldots, e_p'x$ are recognized as the principal components of $x$. The equation above defines an ellipsoid in a coordinate system with axes $y_1, y_2, \ldots, y_p$ lying in the directions $e_1, e_2, \ldots, e_p$, respectively.
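A small numerical verification of the identity above (a sketch, assuming NumPy; the matrix $\Sigma$ and the point $x$ are arbitrary illustrative choices, not taken from the notes):

```python
import numpy as np

# Check that x' Sigma^{-1} x equals sum_i y_i^2 / lambda_i with y_i = e_i' x.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)        # an arbitrary positive definite covariance matrix
x = rng.standard_normal(3)

lam, E = np.linalg.eigh(Sigma)
y = E.T @ x                            # y_i = e_i' x, the principal components of x
lhs = x @ np.linalg.inv(Sigma) @ x
rhs = np.sum(y**2 / lam)
print(np.isclose(lhs, rhs))            # True
```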
Principal Components Obtained from Standardized Variables
Principal components may also be obtained from the standardized variables
$$Z_i = \frac{X_i - \mu_i}{\sqrt{\sigma_{ii}}}, \qquad i = 1, 2, \ldots, p,$$
or, in matrix notation, $Z = (V^{1/2})^{-1}(X - \mu)$. Clearly $E(Z) = 0$ and $\mathrm{Cov}(Z) = (V^{1/2})^{-1}\Sigma(V^{1/2})^{-1} = \rho$.
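A minimal sketch of this standardization step on a data matrix (the array X below is hypothetical, generated only for illustration); it checks that the sample covariance matrix of the standardized observations equals the sample correlation matrix of the original variables.

```python
import numpy as np

# Form Z = (V^{1/2})^{-1}(X - mu_hat) column by column and compare Cov(Z) with R.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 4))   # hypothetical data matrix

mu_hat = X.mean(axis=0)
sd_hat = X.std(axis=0, ddof=1)
Z = (X - mu_hat) / sd_hat              # elementwise version of (V^{1/2})^{-1}(X - mu)

R = np.corrcoef(X, rowvar=False)
print(np.allclose(np.cov(Z, rowvar=False), R))   # True
```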
Example 5.2 Consider the $p \times p$ covariance matrix in which the variables have a common variance $\sigma^2$ and a common correlation $\rho$, together with the corresponding correlation matrix:
$$\Sigma = \begin{pmatrix} \sigma^2 & \rho\sigma^2 & \cdots & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \cdots & \rho\sigma^2 \\ \vdots & \vdots & \ddots & \vdots \\ \rho\sigma^2 & \rho\sigma^2 & \cdots & \sigma^2 \end{pmatrix}
\quad\text{or}\quad
\boldsymbol{\rho} = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}.$$
Summarizing Sample Variation by Principal Components
Suppose the data $x_1, x_2, \ldots, x_n$ represent $n$ independent drawings from some $p$-dimensional population with mean vector $\mu$ and covariance matrix $\Sigma$. These data yield the sample mean vector $\bar{x}$, the sample covariance matrix $S$, and the sample correlation matrix $R$.
If $S = \{s_{ik}\}$ is the $p \times p$ sample covariance matrix with eigenvalue-eigenvector pairs $(\hat{\lambda}_1, \hat{e}_1), (\hat{\lambda}_2, \hat{e}_2), \ldots, (\hat{\lambda}_p, \hat{e}_p)$, the $i$th sample principal component is given by
$$\hat{y}_i = \hat{e}_i'x = \hat{e}_{i1}x_1 + \hat{e}_{i2}x_2 + \cdots + \hat{e}_{ip}x_p, \qquad i = 1, 2, \ldots, p,$$
where $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_p \ge 0$ and $x$ is any observation on the variables $X_1, X_2, \ldots, X_p$. Also
$$\text{Sample variance}(\hat{y}_k) = \hat{\lambda}_k, \qquad k = 1, 2, \ldots, p$$
$$\text{Sample covariance}(\hat{y}_i, \hat{y}_k) = 0, \qquad i \neq k$$
$$\text{Total sample variance} = \sum_{i=1}^{p} s_{ii} = \hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_p$$
$$r_{\hat{y}_i, x_k} = \frac{\hat{e}_{ik}\sqrt{\hat{\lambda}_i}}{\sqrt{s_{kk}}}, \qquad i, k = 1, 2, \ldots, p.$$
Example 5.3 (Summarizing sample variability with two sample principal components) A census provided information, by tract, on five socioeconomic variables for the Madison, Wisconsin, area. The data from 61 tracts are listed in Table 8.5. These data produced the following summary statistics: the sample mean vector $\bar{x}$ and the sample covariance matrix
$$S = \begin{pmatrix} 3.397 & -1.102 & 4.306 & -2.078 & 0.027 \\ -1.102 & 9.673 & -1.513 & 10.953 & 1.203 \\ 4.306 & -1.513 & 55.626 & -28.937 & -0.044 \\ -2.078 & 10.953 & -28.937 & 89.067 & 0.957 \\ 0.027 & 1.203 & -0.044 & 0.957 & 0.319 \end{pmatrix}.$$
Can the sample variation be summarized by one or two principal components?
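One way to explore this question numerically (a sketch assuming NumPy; it is not the worked solution from the notes) is to eigendecompose $S$ and look at the cumulative proportions of total sample variance explained by the leading components.

```python
import numpy as np

# Eigendecompose the sample covariance matrix S of Example 5.3.
S = np.array([[ 3.397,  -1.102,   4.306,  -2.078,  0.027],
              [-1.102,   9.673,  -1.513,  10.953,  1.203],
              [ 4.306,  -1.513,  55.626, -28.937, -0.044],
              [-2.078,  10.953, -28.937,  89.067,  0.957],
              [ 0.027,   1.203,  -0.044,   0.957,  0.319]])

lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]                     # descending eigenpairs
total = np.trace(S)                                # s_11 + ... + s_pp = sum of eigenvalues
print(np.round(np.cumsum(lam) / total, 3))         # cumulative proportion of total variance
print(np.round(E[:, :2], 3))                       # coefficient vectors e_hat_1, e_hat_2
```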
The Number of Principal Components
There is always the question of how many components to retain. There is no definitive answer to this question. Things to consider include
• the relative sizes of the eigenvalues (the variances of the sample components).
Interpretation of the Sample Principal Components
The sample principal components have several interpretations.
• Suppose the underlying distribution of $X$ is nearly $N_p(\mu, \Sigma)$. Then the sample principal components $\hat{y}_i = \hat{e}_i'(x - \bar{x})$ are realizations of the population principal components $Y_i = e_i'(X - \mu)$, which have an $N_p(0, \Lambda)$ distribution. The diagonal matrix $\Lambda$ has entries $\lambda_1, \lambda_2, \ldots, \lambda_p$, and $(\lambda_i, e_i)$ are the eigenvalue-eigenvector pairs of $\Sigma$.
• Even when the normal assumption is suspect and the scatter plot may depart somewhat from an elliptical pattern, we can still extract eigenvalues from $S$ and obtain the sample principal components.
Standardizing the Sample Principal Components
If $z_1, z_2, \ldots, z_n$ are standardized observations with covariance matrix $R$, the $i$th sample principal component is
$$\hat{y}_i = \hat{e}_i'z = \hat{e}_{i1}z_1 + \hat{e}_{i2}z_2 + \cdots + \hat{e}_{ip}z_p, \qquad i = 1, 2, \ldots, p,$$
where $(\hat{\lambda}_i, \hat{e}_i)$ is the $i$th eigenvalue-eigenvector pair of $R$ with $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_p \ge 0$. In addition,
$$\text{Total (standardized) sample variance} = \mathrm{tr}(R) = p = \hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_p$$
and
$$r_{\hat{y}_i, z_k} = \hat{e}_{ik}\sqrt{\hat{\lambda}_i}, \qquad i, k = 1, 2, \ldots, p.$$
Example 5.5 (Sample principal components from standardized data) The weekly rates of return for five stocks (JP Morgan, Citibank, Wells Fargo, Royal Dutch Shell, and ExxonMobil) listed on the New York Stock Exchange were determined for the period January 2004 through December 2005. The weekly rate of return is defined as (current week closing price − previous week closing price)/(previous week closing price), adjusted for stock splits and dividends. The data are listed in Table 8.4. The observations in 103 successive weeks appear to be independently distributed, but the rates of return across stocks are correlated because, as one might expect, stocks tend to move together in response to general economic conditions. Standardize this data set and find the sample principal components of the standardized data.
Example 5.6 (Components from a correlation matrix with a special structure) Geneticists are often concerned with the inheritance of characteristics that can be measured several times during an animal's lifetime. Body weights (in grams) for $n = 150$ female mice were obtained immediately after the birth of their first four litters. The sample mean vector and sample correlation matrix were, respectively,
$$\bar{x}' = [39.88,\ 45.08,\ 48.11,\ 49.95]$$
and
$$R = \begin{pmatrix} 1.000 & .7501 & .6329 & .6363 \\ .7501 & 1.000 & .6925 & .7386 \\ .6329 & .6925 & 1.000 & .6625 \\ .6363 & .7386 & .6625 & 1.000 \end{pmatrix}.$$
Find the sample principal components from $R$.
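A short sketch (assuming NumPy; not the worked solution in the notes) that extracts the eigenpairs of this $R$. With nearly equal positive correlations, the first component is expected to be close to an equally weighted average, an overall "size" component, of the four weights.

```python
import numpy as np

# Eigendecompose the sample correlation matrix R of Example 5.6.
R = np.array([[1.000, .7501, .6329, .6363],
              [.7501, 1.000, .6925, .7386],
              [.6329, .6925, 1.000, .6625],
              [.6363, .7386, .6625, 1.000]])

lam, E = np.linalg.eigh(R)
lam, E = lam[::-1], E[:, ::-1]                          # descending eigenpairs
print(np.round(lam, 3), np.round(lam / lam.sum(), 3))   # eigenvalues and their proportions
print(np.round(E[:, 0], 3))                             # first eigenvector: roughly equal weights
```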
5.2 Factor Analysis and Inference for Structured Covariance Matrices
• The primary question in factor analysis is whether the data are consistent
with a prescribed structure.
5.2.1 The Orthogonal Factor Model
• The observable random vector X, with p components, has mean µ and
covariance matrix Σ.
• The factor model postulates that $X$ is linearly dependent upon a few unobservable random variables $F_1, F_2, \ldots, F_m$, called common factors, and $p$ additional sources of variation $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p$, called errors or, sometimes, specific factors.
• In particular, the factor analysis model is
$$X_1 - \mu_1 = \ell_{11}F_1 + \ell_{12}F_2 + \cdots + \ell_{1m}F_m + \varepsilon_1$$
$$X_2 - \mu_2 = \ell_{21}F_1 + \ell_{22}F_2 + \cdots + \ell_{2m}F_m + \varepsilon_2$$
$$\vdots$$
$$X_p - \mu_p = \ell_{p1}F_1 + \ell_{p2}F_2 + \cdots + \ell_{pm}F_m + \varepsilon_p$$
or, in matrix notation,
$$X - \mu = LF + \varepsilon.$$
The coefficient $\ell_{ij}$ is called the loading of the $i$th variable on the $j$th factor,
so the matrix L is the matrix of factor loadings.
• The unobservable random vectors F and ε satisfy the following conditions:
F and ε are independent
E(F) = 0, Cov(F) = I
E(ε) = 0, Cov(ε) = Ψ, where Ψ is a diagonal matrix.
• Covariance structure for the orthogonal factor model:
1. $\mathrm{Cov}(X) = LL' + \Psi$, or
$$\mathrm{Var}(X_i) = \ell_{i1}^2 + \cdots + \ell_{im}^2 + \psi_i = h_i^2 + \psi_i$$
$$\mathrm{Cov}(X_i, X_k) = \ell_{i1}\ell_{k1} + \cdots + \ell_{im}\ell_{km}$$
Here $h_i^2 = \ell_{i1}^2 + \cdots + \ell_{im}^2$ is called the $i$th communality.
2. $\mathrm{Cov}(X, F) = L$, or $\mathrm{Cov}(X_i, F_j) = \ell_{ij}$.
Example 5.7 Consider the covariance matrix
$$\Sigma = \begin{pmatrix} 19 & 30 & 2 & 12 \\ 30 & 57 & 5 & 23 \\ 2 & 5 & 38 & 47 \\ 12 & 23 & 47 & 68 \end{pmatrix}.$$
Verify the relation $\Sigma = LL' + \Psi$ for two factors ($m = 2$).
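The following sketch checks the relation numerically. The particular $L$ and $\Psi$ used below are one known two-factor solution and are not listed in the notes above; the code simply confirms that $LL' + \Psi$ reproduces $\Sigma$.

```python
import numpy as np

# Verify Sigma = L L' + Psi for Example 5.7 with one m = 2 factorization.
Sigma = np.array([[19, 30,  2, 12],
                  [30, 57,  5, 23],
                  [ 2,  5, 38, 47],
                  [12, 23, 47, 68]], dtype=float)

L = np.array([[ 4, 1],
              [ 7, 2],
              [-1, 6],
              [ 1, 8]], dtype=float)      # loading matrix (one valid choice)
Psi = np.diag([2.0, 4.0, 1.0, 3.0])        # specific variances

print(np.allclose(L @ L.T + Psi, Sigma))   # True: the relation holds exactly
print(np.diag(L @ L.T))                    # communalities h_i^2 = 17, 53, 37, 65
```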
Unfortunately for the factor analyst, most covariance matrices cannot be factored as $LL' + \Psi$ with the number of factors $m$ much less than $p$.
Example 5.8 Let $p = 3$ and $m = 1$, and suppose the random variables $X_1$, $X_2$ and $X_3$ have the positive definite covariance matrix
$$\Sigma = \begin{pmatrix} 1 & .9 & .7 \\ .9 & 1 & .4 \\ .7 & .4 & 1 \end{pmatrix}.$$
When $m > 1$, the loadings are determined only up to an orthogonal matrix $T$, since
$$X - \mu = LF + \varepsilon = LTT'F + \varepsilon = L^*F^* + \varepsilon,$$
with $L^* = LT$ and $F^* = T'F$.
The Principal Component Method
In the principal component solution with $m$ common factors, the estimated factor loadings are the scaled eigenvectors of $S$,
$$\tilde{L} = \left[\sqrt{\hat{\lambda}_1}\,\hat{e}_1 \;\Big|\; \sqrt{\hat{\lambda}_2}\,\hat{e}_2 \;\Big|\; \cdots \;\Big|\; \sqrt{\hat{\lambda}_m}\,\hat{e}_m\right].$$
The estimated specific variances are provided by the diagonal elements of the matrix $S - \tilde{L}\tilde{L}'$, so
$$\tilde{\Psi} = \begin{pmatrix} \tilde{\psi}_1 & 0 & \cdots & 0 \\ 0 & \tilde{\psi}_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \tilde{\psi}_p \end{pmatrix}
\quad\text{with}\quad \tilde{\psi}_i = s_{ii} - \sum_{j=1}^{m}\tilde{\ell}_{ij}^{\,2}.$$
Communalities are estimated as
$$\tilde{h}_i^2 = \tilde{\ell}_{i1}^{\,2} + \tilde{\ell}_{i2}^{\,2} + \cdots + \tilde{\ell}_{im}^{\,2}.$$
The principal component factor analysis of the sample correlation matrix is
obtained by starting with R in place of S.
• For the principal component solution, the estimated loadings for a given factor do not change as the number of factors is increased.
• Analytically, we have
$$\text{Sum of squared entries of } \left(S - (\tilde{L}\tilde{L}' + \tilde{\Psi})\right) \le \hat{\lambda}_{m+1}^2 + \cdots + \hat{\lambda}_p^2.$$
• Ideally, the contributions of the first few factors to the sample variances of the variables should be large. The
$$\text{Proportion of total sample variance due to the } j\text{th factor} =
\begin{cases}
\dfrac{\hat{\lambda}_j}{s_{11} + s_{22} + \cdots + s_{pp}} & \text{for a factor analysis of } S,\\[2ex]
\dfrac{\hat{\lambda}_j}{p} & \text{for a factor analysis of } R.
\end{cases}$$
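The following compact sketch (assuming NumPy, and assuming a sample covariance or correlation matrix is already available as an array) implements the principal component solution just described; it also checks the residual bound stated above.

```python
import numpy as np

def pc_factor_solution(S, m):
    """Principal component solution with m factors for a covariance/correlation matrix S."""
    lam, E = np.linalg.eigh(S)
    lam, E = lam[::-1], E[:, ::-1]                  # descending eigenpairs
    L = E[:, :m] * np.sqrt(lam[:m])                 # loadings: sqrt(lambda_j) * e_hat_j
    psi = np.diag(S) - np.sum(L**2, axis=1)         # specific variances psi_tilde_i
    h2 = np.sum(L**2, axis=1)                       # communalities h_tilde_i^2
    prop = lam[:m] / np.trace(S)                    # proportion of total variance per factor
    residual = S - (L @ L.T + np.diag(psi))
    # Sum of squared residual entries is bounded by lambda_{m+1}^2 + ... + lambda_p^2.
    assert np.sum(residual**2) <= np.sum(lam[m:]**2) + 1e-8
    return L, psi, h2, prop
```

For a correlation matrix, `prop` reduces to $\hat{\lambda}_j/p$, matching the second case above.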
Example 5.9 In a consumer-preference study, a random sample of customers were asked to rate several attributes of a new product. The responses, on a 7-point semantic differential scale, were tabulated and the attribute correlation matrix constructed. The correlation matrix is presented next:
Example 5.10 Stock-price data consisting of $n = 103$ weekly rates of return on $p = 5$ stocks were introduced in Example 5.5. Perform a factor analysis of these data.
The Maximum Likelihood Method
Result 5.5 Let $X_1, X_2, \ldots, X_n$ be a random sample from $N_p(\mu, \Sigma)$, where $\Sigma = LL' + \Psi$ is the covariance matrix for the $m$ common factor model. The maximum likelihood estimators $\hat{L}$, $\hat{\Psi}$ and $\hat{\mu} = \bar{x}$ maximize the likelihood function of $X_j - \mu = LF_j + \varepsilon_j$, $j = 1, 2, \ldots, n$,
$$L(\mu, \Sigma) = (2\pi)^{-np/2}\,|\Sigma|^{-n/2}\exp\!\left\{-\tfrac{1}{2}\operatorname{tr}\!\left[\Sigma^{-1}\!\left(\sum_{j=1}^{n}(x_j - \bar{x})(x_j - \bar{x})' + n(\bar{x} - \mu)(\bar{x} - \mu)'\right)\right]\right\},$$
subject to $\hat{L}'\hat{\Psi}^{-1}\hat{L}$ being diagonal. Consequently,
$$\text{Proportion of total sample variance due to the } j\text{th factor} = \frac{\hat{\ell}_{1j}^{\,2} + \hat{\ell}_{2j}^{\,2} + \cdots + \hat{\ell}_{pj}^{\,2}}{s_{11} + s_{22} + \cdots + s_{pp}}.$$
In practice, one often factor analyzes the standardized variables, that is, works with the sample correlation matrix $R$. Although the likelihood in Result 5.5 is appropriate for $S$, not $R$, surprisingly, this practice is equivalent to obtaining the maximum likelihood estimates $\hat{L}$ and $\hat{\Psi}$ based on the sample covariance matrix $S$ and then setting
$$\hat{L}_z = \hat{V}^{-1/2}\hat{L} \qquad\text{and}\qquad \hat{\Psi}_z = \hat{V}^{-1/2}\hat{\Psi}\hat{V}^{-1/2}.$$
Here $\hat{V}^{-1/2}$ is the diagonal matrix with the reciprocals of the sample standard deviations (computed with the divisor $n$) on the main diagonal, and $z$ denotes the standardized observations, each with sample mean 0 and sample standard deviation 1.
Example 5.11 Using the maximum likelihood method, perform a factor analysis of the stock-price data.
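One way to carry this out in code is scikit-learn's FactorAnalysis, which fits the Gaussian factor model by maximum likelihood; this is a sketch only, since its loadings agree with the solution described above only up to an orthogonal rotation, and the file name, array shape, and variable names below are assumptions (Table 8.4 is not reproduced in these notes).

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# "returns" is assumed to hold the 103 x 5 weekly rates of return from Table 8.4.
returns = np.loadtxt("stock_returns.txt")            # hypothetical file with the data
Z = (returns - returns.mean(axis=0)) / returns.std(axis=0, ddof=1)   # standardize, so R is analyzed

fa = FactorAnalysis(n_components=2)                  # m = 2 common factors, ML fit via EM
fa.fit(Z)

L_hat = fa.components_.T                             # p x m loading matrix (up to rotation)
psi_hat = fa.noise_variance_                         # estimated specific variances
print(np.round(L_hat, 3))
print("Proportion of variance per factor:",
      np.round(np.sum(L_hat**2, axis=0) / Z.shape[1], 3))
```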
Example 5.12 (Factor analysis of Olympic decathlon data) Linden originally conducted a factor analytic study of Olympic decathlon results for all 160 complete starts from the end of World War II until the mid-seventies. Following his approach, we examine the $n = 280$ complete starts from 1960 through 2004. The recorded values for each event were standardized, and the signs of the timed events were changed so that large scores are good for all events. We, too, analyze the correlation matrix, which is based on all 280 cases.
Factor Rotation
If $\hat{L}$ is the $p \times m$ matrix of estimated factor loadings obtained by any method (principal component, maximum likelihood, and so forth), then
$$\hat{L}^* = \hat{L}T, \qquad\text{where } TT' = T'T = I,$$
is also a valid matrix of loadings, since
$$\hat{L}\hat{L}' + \hat{\Psi} = \hat{L}TT'\hat{L}' + \hat{\Psi} = \hat{L}^*\hat{L}^{*\prime} + \hat{\Psi}.$$
Example 5.13 (A first look at factor rotation) Lawley and Maxwell present
the sample correlation matrix of examination scores in p = 6 subject areas for
n = 220 male students. The correlation matrix is
Varimax (or normal varimax) criterion
Define $\tilde{\ell}^*_{ij} = \hat{\ell}^*_{ij}/\hat{h}_i$ to be the rotated coefficients scaled by the square roots of the communalities. Then the (normal) varimax procedure selects the orthogonal transformation $T$ that makes
$$V = \frac{1}{p}\sum_{j=1}^{m}\left[\sum_{i=1}^{p}\tilde{\ell}^{*4}_{ij} - \frac{1}{p}\left(\sum_{i=1}^{p}\tilde{\ell}^{*2}_{ij}\right)^{2}\right]$$
as large as possible.
Scaling the rotated coefficients $\hat{\ell}^*_{ij}$ has the effect of giving variables with small communalities relatively more weight in the determination of simple structure. After the transformation $T$ is determined, the loadings $\tilde{\ell}^*_{ij}$ are multiplied by $\hat{h}_i$ so that the original communalities are preserved.
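Below is a sketch of one standard SVD-based implementation of the normal varimax rotation just described (assuming NumPy); it is not claimed to be the exact routine behind the examples that follow. The rows are scaled by the square roots of the communalities, the rotation $T$ is found iteratively, and the scale is then restored so the communalities are preserved.

```python
import numpy as np

def varimax(L_hat, max_iter=100, tol=1e-8):
    """Normal varimax rotation of a p x m loading matrix L_hat."""
    p, m = L_hat.shape
    h = np.sqrt(np.sum(L_hat**2, axis=1))            # square roots of the communalities
    L = L_hat / h[:, None]                           # scaled loadings l*_ij / h_i
    T = np.eye(m)
    obj_old = 0.0
    for _ in range(max_iter):
        LT = L @ T
        # Gradient-like term of the (normal) varimax criterion
        B = L.T @ (LT**3 - LT @ np.diag(np.mean(LT**2, axis=0)))
        U, s, Vt = np.linalg.svd(B)
        T = U @ Vt                                   # best orthogonal T for this step
        obj = s.sum()
        if obj < obj_old * (1.0 + tol):              # stop when the criterion stops improving
            break
        obj_old = obj
    return (L @ T) * h[:, None], T                   # rescale so communalities are preserved
```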
Example 5.14 (Rotated loadings for the consumer-preference data)
Example 5.15 (Rotated loadings for the stock-price data)
Example 5.15 (Rotated loadings for the Olympic decathlon data)
Factor Scores
• The estimated values of the common factors, called factor scores, may also be required. These quantities are often used for diagnostic purposes, as well as inputs to a subsequent analysis.
• Factor scores are not estimates of unknown parameters in the usual sense. Rather, they are estimates of values for the unobserved random factor vectors $F_j$, $j = 1, 2, \ldots, n$. That is, the factor scores are
$$\hat{f}_j = (\hat{L}_z'\hat{\Psi}_z^{-1}\hat{L}_z)^{-1}\hat{L}_z'\hat{\Psi}_z^{-1}z_j = \hat{\Delta}^{-1}\hat{L}_z'\hat{\Psi}_z^{-1}z_j, \qquad j = 1, 2, \ldots, n,$$
where $\hat{\Delta} = \hat{L}_z'\hat{\Psi}_z^{-1}\hat{L}_z$, $z_j = D^{-1/2}(x_j - \bar{x})$ and $\hat{\rho} = \hat{L}_z\hat{L}_z' + \hat{\Psi}_z$.
• If rotated loadings $\hat{L}^* = \hat{L}T$ are used in place of the original loadings, the subsequent factor scores, $\hat{f}_j^*$, are related to $\hat{f}_j$ by $\hat{f}_j^* = T'\hat{f}_j$, $j = 1, 2, \ldots, n$.
• If the factor loadings are estimated by the principal component method, it is customary to generate factor scores using an unweighted (ordinary) least squares procedure. Implicitly, this amounts to assuming that the $\psi_i$ are equal or nearly equal. The factor scores are then
$$\hat{f}_j = (\hat{L}'\hat{L})^{-1}\hat{L}'(x_j - \bar{x}) \qquad\text{or}\qquad \hat{f}_j = (\hat{L}_z'\hat{L}_z)^{-1}\hat{L}_z'z_j.$$
• For the regression method, the factor scores are
$$\hat{f}_j = \hat{L}'S^{-1}(x_j - \bar{x}), \qquad j = 1, 2, \ldots, n,$$
or, with standardized data,
$$\hat{f}_j = \hat{L}_z'R^{-1}z_j, \qquad j = 1, 2, \ldots, n,$$
where $z_j = D^{-1/2}(x_j - \bar{x})$ and $\hat{\rho} = \hat{L}_z\hat{L}_z' + \hat{\Psi}_z$.
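A sketch of the score formulas above for standardized data, assuming the loading matrix, specific variances, correlation matrix, and standardized observations are already available as NumPy arrays; the function and argument names are illustrative only.

```python
import numpy as np

def wls_scores(L, psi_diag, Z):
    """Weighted least squares scores: (L' Psi^{-1} L)^{-1} L' Psi^{-1} z_j for each row z_j of Z."""
    W = L.T / psi_diag                               # L' Psi^{-1}, with psi_diag the vector of psi_i
    return Z @ np.linalg.solve(W @ L, W).T           # n x m matrix of scores f_hat_j

def regression_scores(L, R, Z):
    """Regression scores: L' R^{-1} z_j for each row z_j of Z."""
    return Z @ np.linalg.solve(R, L)                 # n x m matrix of scores f_hat_j

# Usage (hypothetical): Z is an n x p matrix of standardized observations,
# L is the p x m loading matrix, psi_diag the p-vector of specific variances,
# and R the p x p sample correlation matrix.
```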
Example 5.16 (Computing factor scores) Compute factor scores by the least squares and regression methods using the stock-price data discussed in Example 5.11.
Perspectives and a Strategy for Factor Analysis
At the present time, factor analysis still maintains the flavor of an art, and no single strategy should yet be "chiseled into stone". We suggest and illustrate one reasonable option:
1. Perform a principal component factor analysis, followed by a varimax rotation.
2. Perform a maximum likelihood factor analysis, including a varimax rotation.
3. Compare the solutions obtained from the two factor analyses.
(a) Do the loadings group in the same manner?
(b) Plot the factor scores obtained from the principal component solution against the scores from the maximum likelihood analysis.
4. Repeat the first three steps for other numbers of common factors m. Do the extra factors necessarily contribute to the understanding and interpretation of the data?
5. For large data sets, split them in half and perform a factor analysis on each part. Compare the two results with each other and with the result obtained from the complete data set to check the stability of the solution.