A Tutorial On Principal Component Analysis
b, c at some
arbitrary angles with respect to the system. The angles between our measurements might not even be 90°! Now, we record with the cameras for 2 minutes.
The big question remains: how do we get from this
data set to a simple equation of x?
We know a-priori that if we were smart experi-
menters, we would have just measured the position
along the x-axis with one camera. But this is not
what happens in the real world. We often do not
know what measurements best reflect the dynamics
of our system in question. Furthermore, we some-
times record more dimensions than we actually need!
Also, we have to deal with that pesky, real-world
problem of noise. In the toy example this means
that we need to deal with air, imperfect cameras or
even friction in a less-than-ideal spring. Noise con-
taminates our data set only serving to obfuscate the
dynamics further. This toy example is the challenge
experimenters face everyday. We will refer to this
example as we delve further into abstract concepts.
Hopefully, by the end of this paper we will have a
good understanding of how to systematically extract
x using principal component analysis.
3 Framework: Change of Basis
The Goal: Principal component analysis computes
the most meaningful basis to re-express a noisy, gar-
bled data set. The hope is that this new basis will
filter out the noise and reveal hidden dynamics. In
the example of the spring, the explicit goal of PCA is
to determine: the dynamics are along the x-axis.
In other words, the goal of PCA is to determine that
x - the unit basis vector along the x-axis - is the im-
portant dimension. Determining this fact allows an
experimenter to discern which dynamics are impor-
tant and which are just redundant.
3.1 A Naive Basis
With a more precise definition of our goal, we need a more precise definition of our data as well. For
each time sample (or experimental trial), an exper-
imenter records a set of data consisting of multiple
measurements (e.g. voltage, position, etc.). The
number of measurement types is the dimension of
the data set. In the case of the spring, this data set
has 12,000 6-dimensional vectors, where each camera
contributes a 2-dimensional projection of the ball's position.
In general, each data sample is a vector in m-
dimensional space, where m is the number of mea-
surement types. Equivalently, every time sample is
a vector that lies in an m-dimensional vector space
spanned by an orthonormal basis. All measurement
vectors in this space are a linear combination of this
set of unit length basis vectors. A naive and simple
choice of a basis B is the identity matrix I.
B = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}
  = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = I
where each row is a basis vector b_i with m components.
To summarize, at one point in time, camera A records a corresponding position (x_A(t), y_A(t)). Each trial can be expressed as a six-dimensional column vector X.
X = \begin{bmatrix} x_A \\ y_A \\ x_B \\ y_B \\ x_C \\ y_C \end{bmatrix}
More generally, PCA asks for a change of basis. Let P be a matrix whose rows are a new set of basis vectors {p_1, ..., p_m}, and let X = [x_1 x_2 ... x_n] collect the measurement vectors as columns. The re-expressed data set Y is

Y = PX = \begin{bmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{bmatrix} \qquad (1)

so that each column of Y,

y_i = \begin{bmatrix} p_1 \cdot x_i \\ \vdots \\ p_m \cdot x_i \end{bmatrix}

is the original vector x_i expressed in terms of the new basis {p_1, ..., p_m}.
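As a concrete illustration, here is a minimal Matlab sketch of Equation 1, using an invented 6 × n data matrix and an invented basis P (both are hypothetical stand-ins, not the actual spring recordings):

% Hypothetical example of re-expressing a data set in a new basis (Y = P*X).
n = 5;                        % number of time samples (toy value)
X = randn(6,n);               % each column is one sample [xA; yA; xB; yB; xC; yC]
P = eye(6);                   % start from the naive basis
P(1:2,1:2) = [cos(pi/6) sin(pi/6); -sin(pi/6) cos(pi/6)];  % rotate camera A's axes
Y = P * X;                    % rows of P are the new basis vectors p_i
disp(Y(1,3) - P(1,:)*X(:,3))  % entry (i,j) of Y is the dot product p_i . x_j, so ~0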
A common measure of the noise level is the signal-to-noise ratio (SNR), the ratio of variances,

SNR = \frac{\sigma^2_{signal}}{\sigma^2_{noise}} \qquad (2)
A high SNR (≫ 1) indicates high precision data,
while a low SNR indicates noise contaminated data.
Pretend we plotted all data from camera A from
the spring in Figure 2. Any individual camera should
record motion in a straight line. Therefore, any
spread deviating from straight-line motion must be
noise. The variance due to the signal and noise are
indicated in the diagram graphically. The ratio of the
two, the SNR, thus measures how fat the oval is - a range of possibilities includes a thin line (SNR ≫ 1), a perfect circle (SNR = 1) or even worse. In summary,
we must assume that our measurement devices are
reasonably good. Quantitatively, this corresponds to
a high SNR.
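To make this concrete, the sketch below estimates an SNR for a hypothetical camera-A recording by comparing the variance along the best-fit direction with the variance perpendicular to it (the data, the mixing direction and the noise level are all invented):

% Estimate the SNR of a hypothetical 2 x n camera recording.
n = 1000;
d = randn(1,n);                              % motion along the spring axis (invented)
camA = [0.8; 0.6]*d + 0.05*randn(2,n);       % projection onto camera A plus noise
camA = camA - repmat(mean(camA,2),1,n);      % remove the mean
C = camA*camA'/(n-1);                        % 2 x 2 covariance of the recording
vals = sort(eig(C));                         % spread across vs. along the best-fit line
snr = vals(2)/vals(1);                       % sigma^2_signal / sigma^2_noise
disp(snr)                                    % large value => thin oval => good recording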
4.2 Redundancy
Redundancy is a more tricky issue. This issue is par-
ticularly evident in the example of the spring. In
this case multiple sensors record the same dynamic
Figure 3: A spectrum of possible redundancies in data from the two separate recordings r_1 and r_2 (e.g. x_A, y_B). The best-fit line r_2 = kr_1 is indicated by the dashed line.
information. Consider Figure 3 as a range of possible plots between two arbitrary measurement types r_1 and r_2. Panel (a) depicts two recordings with no redundancy. In other words, r_1 is entirely uncorrelated with r_2. This situation could occur by plotting two variables such as (x_A, humidity).
However, in panel (c) both recordings appear to be strongly related. This extremity might be achieved by several means.

- A plot of (x_A, x_A) where one x_A is measured in meters and the other in inches.
- A plot of (x_A, x_B) if cameras A and B are very nearby.
Clearly in panel (c) it would be more meaningful to just have recorded a single variable, the linear combination r_2 − kr_1, instead of two variables r_1 and r_2 separately. Recording solely the linear combination r_2 − kr_1 would both express the data more concisely and reduce the number of sensor recordings (2 → 1 variables). Indeed, this is the very idea behind dimensional reduction.
4.3 Covariance Matrix
The SNR is solely determined by calculating vari-
ances. A corresponding but simple way to quantify
the redundancy between individual recordings is to
calculate something like the variance. I say 'something like' because the variance is the spread due to one variable but we are concerned with the spread between variables.
Consider two sets of simultaneous measurements with zero mean [3].
A = \{a_1, a_2, \ldots, a_n\}, \qquad B = \{b_1, b_2, \ldots, b_n\}

The variance of A and B are individually defined as

\sigma_A^2 = \langle a_i a_i \rangle_i, \qquad \sigma_B^2 = \langle b_i b_i \rangle_i
where the expectation [4] is the average over n variables. The covariance between A and B is a straightforward generalization.
\text{covariance of } A \text{ and } B \equiv \sigma_{AB}^2 = \langle a_i b_i \rangle_i
Two important facts about the covariance.
- σ²_AB = 0 if and only if A and B are entirely uncorrelated.
- σ²_AB = σ²_A if A = B.
We can equivalently convert the sets of A and B into
corresponding row vectors.
a = [a_1 \; a_2 \; \ldots \; a_n]
b = [b_1 \; b_2 \; \ldots \; b_n]
so that we may now express the covariance as a dot
product matrix computation.
\sigma_{ab}^2 \equiv \frac{1}{n-1} \, a b^T \qquad (3)

where the beginning term is a constant for normalization [5].
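A quick numerical check of Equation 3 against Matlab's built-in covariance (the vectors here are invented zero-mean data):

% Verify that a*b'/(n-1) matches the built-in covariance for zero-mean rows.
n = 100;
a = randn(1,n);  a = a - mean(a);
b = randn(1,n);  b = b - mean(b);
sigma_ab = a*b'/(n-1);           % Equation 3
c = cov(a,b);                    % 2 x 2 covariance matrix
disp(sigma_ab - c(1,2))          % difference should be ~0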
[3] These data sets are in mean deviation form because the means have been subtracted off or are zero.
[4] ⟨·⟩_i denotes the average over values indexed by i.
[5] The simplest possible normalization is 1/n. However, this provides a biased estimation of variance, particularly for small n. It is beyond the scope of this paper to show that the proper normalization for an unbiased estimator is 1/(n−1).

Finally, we can generalize from two vectors to an arbitrary number. We can rename the row vectors x_1 ≡ a, x_2 ≡ b and consider additional indexed row vectors x_3, ..., x_m. Now we can define a new m × n matrix X.
X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} \qquad (4)
One interpretation of X is the following. Each row of X corresponds to all measurements of a particular type (x_i). Each column of X corresponds to a set of measurements from one particular trial (this is X from section 3.1). We now arrive at a definition for the covariance matrix S_X.
S_X \equiv \frac{1}{n-1} X X^T \qquad (5)
Consider how the matrix form XX^T, in that order, computes the desired value for the ij-th element of S_X. Specifically, the ij-th element of S_X is the dot product between the vector of the i-th measurement type with the vector of the j-th measurement type.
- The ij-th value of XX^T is equivalent to substituting x_i and x_j into Equation 3.
- S_X is a square symmetric m × m matrix.
- The diagonal terms of S_X are the variance of particular measurement types.
- The off-diagonal terms of S_X are the covariance between measurement types.
Computing S_X quantifies the correlations between all possible pairs of measurements. Between one pair of measurements, a large covariance corresponds to a situation like panel (c) in Figure 3, while zero covariance corresponds to entirely uncorrelated data as in panel (a).

S_X is special. The covariance matrix describes all relationships between pairs of measurements in our data set. Pretend we have the option of manipulating S_X. We will suggestively define our manipulated covariance matrix S_Y. What features do we want to optimize in S_Y?
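The following sketch builds S_X directly from Equation 5 for an invented zero-mean data matrix and checks the interpretation of its diagonal and off-diagonal terms:

% Construct the covariance matrix S_X = X*X'/(n-1) for hypothetical data.
m = 3;  n = 200;                         % 3 measurement types, 200 trials (toy values)
X = randn(m,n);
X = X - repmat(mean(X,2),1,n);           % put each row in mean deviation form
SX = X*X'/(n-1);                         % Equation 5
disp(diag(SX)')                          % diagonal terms: variance of each measurement type
disp(max(max(abs(SX - cov(X')))))        % matches Matlab's cov (rows <-> columns): ~0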
4.4 Diagonalize the Covariance Matrix
If our goal is to reduce redundancy, then we would
like each variable to co-vary as little as possible with
other variables. More precisely, to remove redun-
dancy we would like the covariances between separate
measurements to be zero. What would the optimized covariance matrix S_Y look like? Evidently, in an optimized matrix all off-diagonal terms in S_Y are zero. Therefore, removing redundancy diagonalizes S_Y.

There are many methods for diagonalizing S_Y. It is curious to note that PCA arguably selects the easiest method, perhaps accounting for its widespread application.
PCA assumes that all basis vectors {p_1, ..., p_m} are orthonormal, i.e. P is an orthonormal matrix.

6.1 Singular Value Decomposition

Let X be an arbitrary n × m matrix [7] and X^T X be a rank r, square, symmetric m × m matrix. Define the following quantities of interest.

- {v_1, v_2, ..., v_r} is the set of orthonormal m × 1 eigenvectors with associated eigenvalues {λ_1, λ_2, ..., λ_r} for the symmetric matrix X^T X:

  (X^T X) v_i = \lambda_i v_i

- σ_i ≡ √λ_i are positive real and termed the singular values.
- {u_1, u_2, ..., u_r} is the set of orthonormal n × 1 vectors defined by u_i ≡ (1/σ_i) X v_i.

We obtain the final definition by referring to Theorem 5 of Appendix A. The final definition includes two new and unexpected properties.

- u_i · u_j = δ_ij
- |X v_i| = σ_i

[7] Notice that in this section only we are reversing convention from m × n to n × m. The reason for this derivation will become clear in section 6.3.
These properties are both proven in Theorem 5. We now have all of the pieces to construct the decomposition. The 'value' version of singular value decomposition is just a restatement of the third definition.

X v_i = \sigma_i u_i \qquad (7)
This result says quite a bit. X multiplied by an eigenvector of X^T X is equal to a scalar times another vector. The set of eigenvectors {v_1, v_2, ..., v_r} and the set of vectors {u_1, u_2, ..., u_r} are both orthonormal sets of vectors. We can summarize this result for all vectors in one matrix multiplication by following the prescribed construction in Figure 4. We start by constructing a new diagonal matrix Σ,

\Sigma \equiv \begin{bmatrix} \sigma_1 & & & \\ & \ddots & & \mathbf{0} \\ & & \sigma_r & \\ & \mathbf{0} & & \mathbf{0} \end{bmatrix}

where σ_1 ≥ σ_2 ≥ ... ≥ σ_r are the rank-ordered set of singular values. Likewise we construct accompanying orthogonal matrices V and U.
V = [v_1 \; v_2 \; \ldots \; v_m]
U = [u_1 \; u_2 \; \ldots \; u_n]
where we have appended an additional (m − r) and (n − r) orthonormal vectors to 'fill up' the matrices for V and U respectively [8]. Figure 4 provides a graphical representation of how all of the pieces fit together to form the matrix version of SVD.
XV = U\Sigma \qquad (8)

where each column of V and U performs the 'value' version of the decomposition (Equation 7). Because V is orthogonal, we can multiply both sides by V^{-1} = V^T to arrive at the final form of the decomposition.

X = U\Sigma V^T \qquad (9)

[8] This is the same procedure used to fix the degeneracy in the previous section.
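A small numerical sketch of Equations 7-9 for an arbitrary matrix (the matrix is invented; only the built-in svd is used):

% Verify the matrix and value forms of SVD on a small arbitrary matrix.
X = randn(5,3);                          % arbitrary n x m matrix (toy sizes)
[U,S,V] = svd(X);                        % Matlab returns X = U*S*V'
disp(norm(X - U*S*V'))                   % Equation 9: should be ~0
i = 2;                                   % check the value form for one column
disp(norm(X*V(:,i) - S(i,i)*U(:,i)))     % Equation 7: X*v_i = sigma_i * u_i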
Although it was derived without motivation, this de-
composition is quite powerful. Equation 9 states that
any arbitrary matrix X can be converted to an or-
thogonal matrix, a diagonal matrix and another or-
thogonal matrix (or a rotation, a stretch and a second
rotation). Making sense of Equation 9 is the subject
of the next section.
6.2 Interpreting SVD
The final form of SVD (Equation 9) is a concise but thick statement to understand. Let us instead reinterpret Equation 7 as

Xa = kb

where a and b are column vectors and k is a scalar constant. The set {v_1, v_2, ..., v_m} is analogous to a and the set {u_1, u_2, ..., u_n} is analogous to b. What is unique though is that {v_1, v_2, ..., v_m} and {u_1, u_2, ..., u_n} are orthonormal sets of vectors which span an m or n dimensional space, respectively. In particular, loosely speaking these sets appear to span all possible 'inputs' (a) and 'outputs' (b). Can we formalize the view that {v_1, v_2, ..., v_n} and {u_1, u_2, ..., u_n} span all possible 'inputs' and 'outputs'?
We can manipulate Equation 9 to make this fuzzy
hypothesis more precise.
X = U\Sigma V^T
U^T X = \Sigma V^T
U^T X = Z

where we have defined Z ≡ ΣV^T. Note that the previous columns {u_1, u_2, ..., u_n} are now rows in U^T. Comparing this equation to Equation 1, {u_1, u_2, ..., u_n} perform the same role as {p_1, p_2, ..., p_m}. Hence, U^T is a change of basis from X to Z. Just as before, when we were transforming column vectors, we can again infer that we are transforming column vectors.
The 'value' form of SVD is expressed in equation 7,

X v_i = \sigma_i u_i

The mathematical intuition behind the construction of the matrix form is that we want to express all n value equations in just one equation. It is easiest to understand this process graphically. Drawing the matrices of equation 7 looks like the following.

We can construct three new matrices V, U and Σ. All singular values are first rank-ordered σ_1 ≥ σ_2 ≥ ... ≥ σ_r, and the corresponding vectors are indexed in the same rank order. Each pair of associated vectors v_i and u_i is stacked in the i-th column along their respective matrices. The corresponding singular value σ_i is placed along the diagonal (the ii-th position) of Σ. This generates the equation XV = UΣ, which looks like the following.

The matrices V and U are m × m and n × n matrices respectively and Σ is a diagonal matrix with a few non-zero values (represented by the checkerboard) along its diagonal. Solving this single matrix equation solves all n 'value' form equations.

Figure 4: How to construct the matrix form of SVD from the value form.
The fact that the orthonormal basis U^T (or P) transforms column vectors means that U^T is a basis that spans the columns of X. Bases that span the columns are termed the column space of X. The column space formalizes the notion of what are the possible 'outputs' of any matrix.
There is a funny symmetry to SVD such that we can define a similar quantity - the row space.

XV = U\Sigma
(XV)^T = (U\Sigma)^T
V^T X^T = \Sigma U^T
V^T X^T = Z

where we have defined Z ≡ ΣU^T. Again the rows of V^T (or the columns of V) are an orthonormal basis for transforming X^T into Z. Because of the transpose on X, it follows that V is an orthonormal basis spanning the row space of X. The row space likewise formalizes the notion of what are possible 'inputs' into an arbitrary matrix.
We are only scratching the surface for understand-
ing the full implications of SVD. For the purposes of
this tutorial though, we have enough information to
understand how PCA will fall within this framework.
6.3 SVD and PCA
With similar computations it is evident that the two methods are intimately related. Let us return to the original m × n data matrix X. We can define a new matrix Y as an n × m matrix [9],

Y \equiv \frac{1}{\sqrt{n-1}} X^T

where each column of Y has zero mean. The definition of Y becomes clear by analyzing Y^T Y.
Y^T Y = \left( \frac{1}{\sqrt{n-1}} X^T \right)^T \left( \frac{1}{\sqrt{n-1}} X^T \right)
      = \frac{1}{n-1} X^{TT} X^T
      = \frac{1}{n-1} X X^T
Y^T Y = S_X
By construction Y^T Y equals the covariance matrix of X. From section 5 we know that the principal components of X are the eigenvectors of S_X. If we calculate the SVD of Y, the columns of matrix V contain the eigenvectors of Y^T Y = S_X. Therefore, the columns of V are the principal components of X. This second algorithm is encapsulated in Matlab code included in Appendix B.
What does this mean? V spans the row space of Y ≡ (1/√(n−1)) X^T. Therefore, V must also span the column space of (1/√(n−1)) X. We can conclude that finding the principal components [10] amounts to finding an orthonormal basis that spans the column space of X.
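The equivalence is easy to confirm numerically. A sketch with invented data, mirroring the two routines in Appendix B:

% Check that the eigenvectors of S_X and the V from svd(Y) agree.
m = 4;  n = 300;                             % toy sizes
data = randn(m,n);
data = data - repmat(mean(data,2),1,n);      % zero-mean rows
SX = data*data'/(n-1);                       % covariance matrix of X
[E,D] = eig(SX);                             % principal components as eigenvectors
Y = data'/sqrt(n-1);                         % Y = (1/sqrt(n-1)) * X'
[u,S,V] = svd(Y);                            % columns of V are principal components
disp(sort(diag(D))' - sort(diag(S).^2)')     % eigenvalues equal squared singular values
disp(abs(V'*E))                              % ~permutation matrix: same directions up to sign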
7 Discussion and Conclusions
7.1 Quick Summary
Performing PCA is quite simple in practice.
[9] Y is of the appropriate n × m dimensions laid out in the derivation of section 6.1. This is the reason for the 'flipping' of dimensions in 6.1 and Figure 4.
[10] If the final goal is to find an orthonormal basis for the column space of X then we can calculate it directly without constructing Y. By symmetry the columns of U produced by the SVD of (1/√(n−1)) X must also be the principal components.
Figure 5: Data points (black dots) tracking a person on a ferris wheel. The extracted principal components are (p_1, p_2) and the phase is θ.
1. Organize a data set as an m × n matrix, where m is the number of measurement types and n is the number of trials.
2. Subtract off the mean for each measurement type or row x_i.
3. Calculate the SVD or the eigenvectors of the covariance.
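As a usage sketch, here are the three steps applied to an invented spring-like recording with the pca1 routine listed in Appendix B (the data and the mixing vector are hypothetical; only one underlying degree of freedom carries signal):

% Hypothetical 6 x n recording: one underlying degree of freedom plus noise.
n = 1000;
x = sin(2*pi*(1:n)/100);                 % invented oscillation along the spring axis
mix = randn(6,1);                        % how each camera coordinate sees it (invented)
data = mix*x + 0.1*randn(6,n);           % 6 measurement types, n trials
[signals,PC,V] = pca1(data);             % pca1 is defined in Appendix B
disp(V')                                 % one variance should dominate (k = 1)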
In several fields of literature, many authors refer to the individual measurement types x_i as the sources. The data projected into the principal components Y = PX are termed the signals, because the projected data presumably represent the true underlying probability distributions.
7.2 Dimensional Reduction
One benet of PCA is that we can examine the vari-
ances S
Y
associated with the principle components.
Often one nds that large variances associated with
11
the rst k < m principal components, and then a pre-
cipitous drop-o. One can conclude that most inter-
esting dynamics occur only in the rst k dimensions.
In the example of the spring, k = 1. Like-
wise, in Figure 3 panel (c), we recognize k = 1
along the principal component of r
2
= kr
1
. This
process of of throwing out the less important axes
can help reveal hidden, simplied dynamics in high
dimensional data. This process is aptly named
dimensional reduction.
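A minimal sketch of dimensional reduction, reusing the pca1 outputs from the sketch in section 7.1: keep only the first k rows of the projected data and map them back into the original coordinates.

% Keep only the first k principal components and reconstruct the mean-subtracted data.
k = 1;                                   % in the spring example one component suffices
reduced = signals(1:k,:);                % k x n projected data
approx = PC(:,1:k)*reduced;              % back in the original m-dimensional coordinates
resid = data - repmat(mean(data,2),1,size(data,2)) - approx;
disp(norm(resid,'fro'))                  % small when the first k variances dominate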
7.3 Limits and Extensions of PCA
Both the strength and weakness of PCA is that it is
a non-parametric analysis. One only needs to make
the assumptions outlined in section 4.5 and then cal-
culate the corresponding answer. There are no pa-
rameters to tweak and no coecients to adjust based
on user experience - the answer is unique
11
and inde-
pendent of the user.
This same strength can also be viewed as a weak-
ness. If one knows a-priori some features of the dy-
namics of a system, then it makes sense to incorpo-
rate these assumptions into a parametric algorithm -
or an algorithm with selected parameters.
Consider the recorded positions of a person on a ferris wheel over time in Figure 5. The probability distributions along the axes are approximately Gaussian and thus PCA finds (p_1, p_2), however this answer might not be optimal. The most concise form of dimensional reduction is to recognize that the phase (or angle along the ferris wheel) contains all dynamic information. Thus, the appropriate parametric algorithm is to first convert the data to the appropriately centered polar coordinates and then compute PCA.

This prior non-linear transformation is sometimes termed a kernel transformation and the entire parametric algorithm is termed kernel PCA. Other common kernel transformations include Fourier and Gaussian transformations.
[11] To be absolutely precise, the principal components are not uniquely defined. One can always flip the direction by multiplying by −1. In addition, eigenvectors beyond the rank of a matrix (i.e. σ_i = 0 for i > rank) can be selected almost at whim. However, these degrees of freedom do not affect the qualitative features of the solution nor a dimensional reduction.
Figure 6: Non-Gaussian distributed data causes PCA
to fail. In exponentially distributed data the axes
with the largest variance do not correspond to the
underlying basis.
This procedure is parametric because the user must incorporate prior knowledge of the dynamics in the selection of the kernel but it is also more optimal in the sense that the dynamics are more concisely described.
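A hedged sketch of the ferris-wheel idea: generate invented circular data, convert it to centered polar coordinates (the center is assumed known here, which a real analysis would have to estimate), and then run PCA on the transformed variables with pca1 from Appendix B.

% Hypothetical ferris-wheel data: points on a circle plus a little noise.
n = 500;
theta = 2*pi*rand(1,n);                            % phase along the wheel (the true dynamics)
center = [3; 5];  radius = 10;                     % invented geometry
pos = repmat(center,1,n) + radius*[cos(theta); sin(theta)] + 0.2*randn(2,n);
% Kernel step: convert to appropriately centered polar coordinates.
ang = atan2(pos(2,:)-center(2), pos(1,:)-center(1));   % angle along the wheel
rad = sqrt(sum((pos - repmat(center,1,n)).^2, 1));     % distance from the center
[signals,PC,V] = pca1([ang; rad]);                 % pca1 from Appendix B
disp(V')                                           % nearly all variance lies along the angle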
Sometimes though the assumptions themselves are
too stringent. One might envision situations where
the principal components need not be orthogonal.
Furthermore, the distributions along each dimension
(x_i) need not be Gaussian. For instance, Figure 6 contains a 2-D exponentially distributed data set. The largest variances do not correspond to the meaningful axes; thus PCA fails.

This less constrained set of problems is not trivial and only recently has been solved adequately via Independent Component Analysis (ICA). The formulation is equivalent.
Find a matrix P where Y = PX such that S_Y is diagonalized.
however it abandons all assumptions except linearity, and attempts to find axes that satisfy the most formal form of redundancy reduction - statistical independence. Mathematically ICA finds a basis such that the joint probability distribution can be factorized [12],

P(y_i, y_j) = P(y_i) P(y_j)

for all i and j, i ≠ j. The downside of ICA is that it is a form of nonlinear optimization, making the solution difficult to calculate in practice and potentially not unique. However ICA has been shown to be a very practical and powerful algorithm for solving a whole new class of problems.
Writing this paper has been an extremely in-
structional experience for me. I hope that
this paper helps to demystify the motivation
and results of PCA, and the underlying assump-
tions behind this important analysis technique.
8 Appendix A: Linear Algebra
This section proves a few unapparent theorems in
linear algebra, which are crucial to this paper.
1. The inverse of an orthogonal matrix is
its transpose.
The goal of this proof is to show that if A is an orthogonal matrix, then A^{-1} = A^T.
Let A be an m × n matrix,

A = [a_1 \; a_2 \; \ldots \; a_n]
where a_i is the i-th column vector. We now show that A^T A = I where I is the identity matrix.
Let us examine the ij-th element of the matrix A^T A. The ij-th element of A^T A is (A^T A)_{ij} = a_i^T a_j.
Remember that the columns of an orthonormal ma-
trix are orthonormal to each other. In other words,
the dot product of any two columns is zero. The only
exception is a dot product of one particular column
with itself, which equals one.
(A^T A)_{ij} = a_i^T a_j = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}
A^T A is the exact description of the identity matrix. The definition of A^{-1} is A^{-1} A = I. Therefore, because A^T A = I, it follows that A^{-1} = A^T.

[12] Equivalently, in the language of information theory the goal is to find a basis P such that the mutual information I(y_i, y_j) = 0 for i ≠ j.
2. If A is any matrix, the matrices A^T A and AA^T are both symmetric.
Let's examine the transpose of each in turn.

(AA^T)^T = A^{TT} A^T = AA^T
(A^T A)^T = A^T A^{TT} = A^T A

The equality of the quantity with its transpose completes this proof.
3. A matrix is symmetric if and only if it is
orthogonally diagonalizable.
Because this statement is bi-directional, it requires
a two-part if-and-only-if proof. One needs to prove
the forward and the backwards if-then cases.
Let us start with the forward case. If A is orthogonally diagonalizable, then A is a symmetric matrix. By hypothesis, orthogonally diagonalizable means that there exists some E such that A = EDE^T, where D is a diagonal matrix and E is some special matrix which diagonalizes A. Let us compute A^T.
A^T = (EDE^T)^T = E^{TT} D^T E^T = EDE^T = A
Evidently, if A is orthogonally diagonalizable, it
must also be symmetric.
The reverse case is more involved and less clean so
it will be left to the reader. In lieu of this, hopefully
the forward case is suggestive if not somewhat
convincing.
4. A symmetric matrix is diagonalized by a
matrix of its orthonormal eigenvectors.
Restated in math, let A be a square n × n symmetric matrix with associated eigenvectors {e_1, e_2, ..., e_n}. Let E = [e_1 e_2 ... e_n] where the i-th column of E is the eigenvector e_i. This theorem asserts that there exists a diagonal matrix D where A = EDE^T.
This theorem is an extension of the previous theorem 3. It provides a prescription for how to find the matrix E, the diagonalizer for a symmetric matrix. It says that the special diagonalizer is in fact a matrix of the original matrix's eigenvectors.
This proof is in two parts. In the first part, we see that any matrix can be orthogonally diagonalized if and only if that matrix's eigenvectors are all linearly independent. In the second part of the proof, we see that a symmetric matrix has the special property that all of its eigenvectors are not just linearly independent but also orthogonal, thus completing our proof.
In the first part of the proof, let A be just some matrix, not necessarily symmetric, and let it have independent eigenvectors (i.e. no degeneracy). Furthermore, let E = [e_1 e_2 ... e_n] be the matrix of eigenvectors placed in the columns. Let D be a diagonal matrix where the i-th eigenvalue is placed in the ii-th position.
We will now show that AE = ED. We can exam-
ine the columns of the right-hand and left-hand sides
of the equation.
Left hand side:  AE = [Ae_1 \; Ae_2 \; \ldots \; Ae_n]
Right hand side: ED = [\lambda_1 e_1 \; \lambda_2 e_2 \; \ldots \; \lambda_n e_n]
Evidently, if AE = ED then Ae_i = λ_i e_i for all i. This equation is the definition of the eigenvalue equation. Therefore, it must be that AE = ED. A little rearrangement provides A = EDE^{-1}, completing the first part of the proof.
For the second part of the proof, we show that a symmetric matrix always has orthogonal eigenvectors. For some symmetric matrix, let λ_1 and λ_2 be distinct eigenvalues for eigenvectors e_1 and e_2.
\lambda_1 e_1 \cdot e_2 = (\lambda_1 e_1)^T e_2
                        = (A e_1)^T e_2
                        = e_1^T A^T e_2
                        = e_1^T A e_2
                        = e_1^T (\lambda_2 e_2)
\lambda_1 e_1 \cdot e_2 = \lambda_2 e_1 \cdot e_2
By the last relation we can equate that (λ_1 − λ_2) e_1 · e_2 = 0. Since we have conjectured that the eigenvalues are in fact unique, it must be the case that e_1 · e_2 = 0. Therefore, the eigenvectors of a symmetric matrix are orthogonal.
Let us back up now to our original postulate that A is a symmetric matrix. By the second part of the proof, we know that the eigenvectors of A are all orthonormal (we choose the eigenvectors to be normalized). This means that E is an orthogonal matrix so by theorem 1, E^T = E^{-1} and we can rewrite the final result.

A = EDE^T

Thus, a symmetric matrix is diagonalized by a matrix of its eigenvectors.
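A quick numerical illustration of this theorem in Matlab (the symmetric matrix is invented):

% Numerically confirm A = E*D*E' for a random symmetric matrix.
B = randn(4);  A = B + B';          % build an arbitrary symmetric matrix
[E,D] = eig(A);                     % columns of E are orthonormal eigenvectors
disp(norm(A - E*D*E'))              % ~0: E diagonalizes A
disp(norm(E'*E - eye(4)))           % ~0: E is an orthogonal matrix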
5. For any arbitrary m × n matrix X, the symmetric matrix X^T X has a set of orthonormal eigenvectors {v_1, v_2, ..., v_n} and a set of associated eigenvalues {λ_1, λ_2, ..., λ_n}. The set of vectors {Xv_1, Xv_2, ..., Xv_n} then form an orthogonal basis, where each vector Xv_i is of length √λ_i.
All of these properties arise from the dot product
of any two vectors from this set.
(X v_i) \cdot (X v_j) = (X v_i)^T (X v_j)
                      = v_i^T X^T X v_j
                      = v_i^T (\lambda_j v_j)
                      = \lambda_j v_i \cdot v_j
(X v_i) \cdot (X v_j) = \lambda_j \delta_{ij}
The last relation arises because the set of eigenvectors of X^T X is orthogonal, resulting in the Kronecker delta. In simpler terms the last relation states:

(X v_i) \cdot (X v_j) = \begin{cases} \lambda_j & i = j \\ 0 & i \neq j \end{cases}

This equation states that any two vectors in the set are orthogonal.
The second property arises from the above equation by realizing that the length squared of each vector is defined as |X v_i|² = (X v_i) · (X v_i) = λ_i.
9 Appendix B: Code
This code is written for Matlab 6.5 (Release 13) from Mathworks [13]. The code is not computationally efficient but explanatory (terse comments begin with a %).

[13] http://www.mathworks.com

This first version follows Section 5 by examining the covariance of the data set.
function [signals,PC,V] = pca1(data)
% PCA1: Perform PCA using covariance.
% data - MxN matrix of input data
% (M dimensions, N trials)
% signals - MxN matrix of projected data
% PC - each column is a PC
% V - Mx1 matrix of variances
[M,N] = size(data);
% subtract off the mean for each dimension
mn = mean(data,2);
data = data - repmat(mn,1,N);
% calculate the covariance matrix
covariance = 1 / (N-1) * data * data';
% find the eigenvectors and eigenvalues
[PC, V] = eig(covariance);
% extract diagonal of matrix as vector
V = diag(V);
% sort the variances in decreasing order
[junk, rindices] = sort(-1*V);
V = V(rindices);
PC = PC(:,rindices);
% project the original data set
signals = PC' * data;
This second version follows section 6 computing
PCA through SVD.
function [signals,PC,V] = pca2(data)
% PCA2: Perform PCA using SVD.
% data - MxN matrix of input data
% (M dimensions, N trials)
% signals - MxN matrix of projected data
% PC - each column is a PC
% V - Mx1 matrix of variances
[M,N] = size(data);
% subtract off the mean for each dimension
mn = mean(data,2);
data = data - repmat(mn,1,N);
% construct the matrix Y
Y = data' / sqrt(N-1);
% SVD does it all
[u,S,PC] = svd(Y);
% calculate the variances
S = diag(S);
V = S .* S;
% project the original data
signals = PC' * data;
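As a quick check (with invented data), the two routines should agree up to the sign of each principal component:

% Compare the covariance route and the SVD route on the same hypothetical data.
data = randn(3,50);
[s1,P1,V1] = pca1(data);
[s2,P2,V2] = pca2(data);
disp((V1 - V2)')                  % identical variances (up to round-off)
disp(abs(P1'*P2))                 % ~identity: same components up to sign flips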
10 References
Bell, Anthony and Sejnowski, Terry. (1997) The
Independent Components of Natural Scenes are
Edge Filters. Vision Research 37(23), 3327-3338.
[A paper from my field of research that surveys and explores different forms of decorrelating data sets. The authors examine the features of PCA and compare it with new ideas in redundancy reduction, namely Independent Component Analysis.]
Bishop, Christopher. (1996) Neural Networks for
Pattern Recognition. Clarendon, Oxford, UK.
[A challenging but brilliant text on statistical pattern recognition (neural networks). Although the derivation of PCA is tough in section 8.6 (p. 310-319), it does have a great discussion on potential extensions to the method and it puts PCA in context of other methods of dimensional reduction. Also, I want to acknowledge this book for several ideas about the limitations of PCA.]
Lay, David. (2000). Linear Algebra and Its
Applications. Addison-Wesley, New York.
[This is a beautiful text. Chapter 7 in the second edition
(p. 441-486) has an exquisite, intuitive derivation and
discussion of SVD and PCA. Extremely easy to follow and
a must read.]
Mitra, Partha and Pesaran, Bijan. (1999) Anal-
ysis of Dynamic Brain Imaging Data. Biophysical
Journal. 76, 691-708.
[A comprehensive and spectacular paper from my field of research interest. It is dense but in two sections 'Eigenmode Analysis: SVD' and 'Space-frequency SVD' the authors discuss the benefits of performing a Fourier transform on the data before an SVD.]
Will, Todd (1999) Introduction to the Sin-
gular Value Decomposition Davidson College.
www.davidson.edu/academic/math/will/svd/index.html
[A math professor wrote up a great web tutorial on SVD
with tremendous intuitive explanations, graphics and
animations. Although it avoids PCA directly, it gives a
great intuitive feel for what SVD is doing mathematically.
Also, it is the inspiration for my spring example.]