Principal Component Analysis
Chris Ding
Department of Computer Science and Engineering
University of Texas at Arlington
PCA is the procedure of finding intrinsic
dimensions of the data
• Data analysis
• Feature reduction
• Data visualization
Represent high dimensional data in low-dim space
High-dimensional data
Examples: gene expression, face images, handwritten digits
Application of feature reduction
• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
Use PCA to approximate an image (a data matrix)
[Figure: a 112 x 92 face image (the original) compared with PCA reconstructions using k = 10, 20, 30, 40 components.]
Use PCA to approximate a set of images
[Figure: original images compared with PCA reconstructions using k = 1, 2, 4, 6 components.]
Display the characters in 2-dim space
[Figure: handwritten characters plotted by their two PCA coordinates,
$\tilde{x} = G^T x = (a_1^T x, \; a_2^T x)^T$.]
Intrinsic dimensions of the data
Samples of children: hours of study, hours on internet, vs. their age
[Figure: 3-D scatter plot; axes: hours on study/homework, hours on internet, children's age.]
Intrinsic dimensions of the data
Samples of children: hours of study, hours on internet, vs. their age
The data lie in a subspace (the intrinsic dimensions).
[Figure: the same 3-D scatter plot, with the samples lying close to a low-dimensional subspace; axes: hours on study/homework, hours on internet, children's age.]
PCA is the procedure of finding
intrinsic dimensions of the data
Find lines that best represent the data
PCA is a rotation of the space to the proper
directions (the principal directions)
Geometric picture of principal components (PCs)
[Figure: data cloud in X space with the first principal direction z1 drawn through it.]
• The 1st PC z1 is a minimum-distance fit to a line in X space.
• The 2nd PC z2 is a minimum-distance fit to a line in the plane perpendicular to the 1st PC.
PCs are a series of linear least squares fits to a sample,
each orthogonal to all the previous ones.
PCA represents data:
the closer the data lie to a linear subspace,
the more accurate the representation.
PCA Step 0: move the coordinate origin to the data center
This is equivalent to centering the data.
[Figure: the 3-D scatter plot with the origin shifted to the data mean; axes: hours on study/homework, hours on internet, children's age.]
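A minimal NumPy sketch of this centering step. The numbers below are made up purely for illustration (they are not data from the slides):

import numpy as np

# Hypothetical data: one column per child, rows = (age, study hours, internet hours)
X = np.array([[ 6.0,  8.0, 10.0, 12.0],
              [ 0.5,  1.0,  1.5,  2.5],
              [ 0.2,  0.8,  1.5,  3.0]])

x_bar = X.mean(axis=1, keepdims=True)   # the data center (mean of each variable)
X_centered = X - x_bar                  # Step 0: move the coordinate origin to the data center
print(X_centered.mean(axis=1))          # ~0 for every variable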
PCA Step 1: find a line that best represents the data
[Figure: the 3-D scatter plot with a candidate line and the projection error of each point onto it; axes: hours on study/homework, hours on internet, children's age.]
Which error to minimize? Minimize the sum of squared projection errors.
PCA Step 1: find the line that best represents the data
Fitting the data to a curve (a straight line, the simplest curve):
[Figure: the 3-D scatter plot with the best-fitting line through the data center; axes: hours on study/homework, hours on internet, children's age.]
Minimize the sum of squared projection errors.
This gives the 1st principal direction.
PCA directions are eigenvectors of the covariance matrix.
Repeating this process to find the 2nd, 3rd, … lines that best fit the remaining data,
the results are given by $u_2, u_3, \ldots, u_k$.
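As a small illustration of this statement (synthetic data and variable names of my own, not code from the lecture), the sketch below builds the covariance matrix and reads off the principal directions as its eigenvectors:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 500))            # hypothetical data: p=3 variables, n=500 samples
Xc = X - X.mean(axis=1, keepdims=True)   # center the data (Step 0)

S = (Xc @ Xc.T) / Xc.shape[1]            # covariance matrix
evals, evecs = np.linalg.eigh(S)         # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]
u1 = evecs[:, order[0]]                  # 1st principal direction = top eigenvector of S
print(u1, evals[order])                  # direction u1 and the variances along u1, u2, u3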
Intrinsic dimensions of the data
Samples of children: hours of study and hours on the internet vs. their age
[Figure: the same 3-D scatter plot; axes: hours on study/homework, hours on internet, children's age.]
PCA from maximum variance
PCA from maximum spread
PCA represents data:
the closer the data lie to a linear subspace,
the more accurate the representation.
[Figure: the data cloud with two directions marked, one of smaller variance and one of larger variance.]
Larger spread = larger variance.
What is Principal Component Analysis?
• Principal component analysis (PCA)
– Reduce the dimensionality of a data set by finding a new set of
variables, smaller than the original set of variables
– Retains most of the sample's information.
– Useful for the compression and classification of data.
• By information we mean the variation present in the sample,
given by the correlations between the original variables.
– The new variables, called principal components (PCs), are
uncorrelated, and are ordered by the fraction of the total information
each retains.
Principal Component as maximum variance
Let $x = (x_1, x_2, \ldots, x_p)^T$ be a vector random variable in p dimensions/variables.
Given n observations/samples of x: $x_1, x_2, \ldots, x_n \in \mathbb{R}^p$.

The first principal component:
define a scalar random variable as a linear combination of the dimensions,
$$z_1 = a_1^T x = \sum_{j=1}^{p} a_{j1} x_j, \qquad a_1 = (a_{11}, a_{21}, \ldots, a_{p1})^T,$$
such that $\mathrm{var}[z_1]$ is maximized.
Principal Component as maximum variance
Because
$$\mathrm{var}[z_1] = E\big((z_1 - \bar{z}_1)^2\big)
  = \frac{1}{n}\sum_{i=1}^{n}\big(a_1^T x_i - a_1^T \bar{x}\big)^2
  = a_1^T \Big[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T\Big] a_1
  = a_1^T S a_1,$$
where
$$S = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T$$
is the covariance matrix, and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean.
In the following, we assume the data is centered: $\bar{x} = 0$.
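A quick numerical check of the identity $\mathrm{var}[z_1] = a_1^T S a_1$, using synthetic data and an arbitrary unit vector $a_1$ (all names and values below are illustrative, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))            # hypothetical data: p=5 variables, n=200 samples
Xc = X - X.mean(axis=1, keepdims=True)   # subtract the mean of each variable

S = (Xc @ Xc.T) / Xc.shape[1]            # S = (1/n) sum_i (x_i - xbar)(x_i - xbar)^T

a1 = rng.normal(size=5)
a1 = a1 / np.linalg.norm(a1)             # any unit vector a1
z1 = a1 @ Xc                             # z1 = a1^T x for every sample

print(np.var(z1), a1 @ S @ a1)           # the two values agree (up to floating point)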
Principal Component as maximum variance
To find $a_1$ that maximizes $\mathrm{var}[z_1]$ subject to $a_1^T a_1 = 1$,
let $\lambda$ be a Lagrange multiplier:
$$L = a_1^T S a_1 - \lambda\,(a_1^T a_1 - 1)$$
$$\frac{\partial L}{\partial a_1} = S a_1 - \lambda a_1 = 0$$
("eigen" is German for "own" / "characteristic"; the operator here is a matrix.)
Therefore $a_1$ is an eigenvector of $S$,
corresponding to the largest eigenvalue $\lambda_1$.
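One step is implicit here and worth spelling out: substituting the eigenvector condition back into the objective shows that the variance equals the chosen eigenvalue,
$$\mathrm{var}[z_1] = a_1^T S a_1 = a_1^T (\lambda a_1) = \lambda\, a_1^T a_1 = \lambda,$$
so the variance is maximized by taking $\lambda = \lambda_1$, the largest eigenvalue of $S$.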
Algebraic derivation of PCs
To find the next coefficient vector $a_2$ maximizing $\mathrm{var}[z_2]$,
subject to $\mathrm{cov}[z_2, z_1] = 0$ (uncorrelated)
and to $a_2^T a_2 = 1$:
first note that
$$\mathrm{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1\, a_1^T a_2,$$
then let $\lambda$ and $\phi$ be Lagrange multipliers, and maximize
$$L = a_2^T S a_2 - \lambda\,(a_2^T a_2 - 1) - \phi\, a_2^T a_1.$$
Algebraic derivation of PCs
$$L = a_2^T S a_2 - \lambda\,(a_2^T a_2 - 1) - \phi\, a_2^T a_1$$
$$\frac{\partial L}{\partial a_2} = S a_2 - \lambda a_2 - \phi\, a_1 = 0, \qquad \phi = 0$$
(multiplying on the left by $a_1^T$ shows $\phi = 0$), so
$$S a_2 = \lambda a_2 \quad\text{and}\quad a_2^T S a_2 = \lambda_2.$$
Algebraic derivation of PCs
We find that $a_2$ is also an eigenvector of $S$,
whose eigenvalue $\lambda_2$ is the second largest.
In general
$$\mathrm{var}[z_k] = a_k^T S a_k = \lambda_k.$$
• The kth largest eigenvalue of S is the variance of the kth PC.
• The kth PC $z_k$ retains the kth greatest fraction of the variation in the sample.
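A small numerical illustration of these two bullets (synthetic, anisotropic data of my own choosing): the eigenvalues of S, sorted in descending order, are the PC variances, and normalizing them gives the fraction of variation each PC retains.

import numpy as np

rng = np.random.default_rng(2)
# Hypothetical anisotropic data: 4 variables with different scales, 300 samples
X = rng.normal(size=(4, 300)) * np.array([[3.0], [2.0], [1.0], [0.5]])
Xc = X - X.mean(axis=1, keepdims=True)

S = (Xc @ Xc.T) / Xc.shape[1]
evals = np.linalg.eigvalsh(S)[::-1]   # eigenvalues in descending order = variances of the PCs
print(evals / evals.sum())            # fraction of the total variation retained by each PC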
Projection to PCA subspace
• Main steps for computing PCA subspace
– Form the covariance matrix S.
– Compute its eigenvectors $\{a_i\}_{i=1}^{p}$.
– The PCA subspace is spanned by the first d eigenvectors $\{a_i\}_{i=1}^{d}$.
– The transformation G is given by
$$G = [a_1, a_2, \ldots, a_d],$$
$$\tilde{x} = G^T x = (a_1^T x,\; a_2^T x,\; \ldots,\; a_d^T x)^T,$$
$$x \in \mathbb{R}^p \;\longrightarrow\; \tilde{x} = G^T x \in \text{PCA subspace}.$$
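A hedged sketch of these steps on synthetic data (the variable names are mine, not the lecture's): form S, take its top-d eigenvectors as the columns of G, and project a sample with $G^T x$.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 100))                # hypothetical data: p=10 variables, n=100 samples
Xc = X - X.mean(axis=1, keepdims=True)

S = (Xc @ Xc.T) / Xc.shape[1]                 # covariance matrix
_, evecs = np.linalg.eigh(S)
d = 3
G = evecs[:, ::-1][:, :d]                     # G = [a1, ..., ad], the first d eigenvectors (p x d)

x = Xc[:, 0]                                  # one centered sample
x_tilde = G.T @ x                             # (a1^T x, a2^T x, ..., ad^T x)
print(x_tilde.shape)                          # (3,)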
Algebraic derivation of PCs
Assume $\bar{x} = 0$.
Form the matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n}$;
then
$$S = \frac{1}{n} X X^T.$$
Obtain the eigenvectors of S by computing the SVD of X:
$$X = U \Sigma V^T, \qquad X' = U^T X = \Sigma V^T.$$
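A brief numerical check of this relation on synthetic data (illustrative only): the left singular vectors of X are eigenvectors of $S = \frac{1}{n} X X^T$, with eigenvalues $\sigma_i^2 / n$.

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 50))              # hypothetical data matrix, p=6, n=50
X = X - X.mean(axis=1, keepdims=True)     # centered, so x_bar = 0

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)    # X = U * diag(sigma) * V^T
S = (X @ X.T) / X.shape[1]

# Columns of U are eigenvectors of S, with eigenvalues sigma_i^2 / n:
print(np.allclose(S @ U, U * (sigma**2 / X.shape[1])))  # True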
Homework:
After you
(1) compute the covariance matrix S, and
(2) compute the first k eigenvectors of S, (u_1, …, u_k),
show that you can obtain (v_1, …, v_k) by doing matrix-vector
multiplications. No need to compute eigenvectors of the
kernel (Gram) matrix.
Reduction and Reconstruction
Dimension reduction: $X \in \mathbb{R}^{p \times n} \;\longrightarrow\; Y = G^T X \in \mathbb{R}^{d \times n}$, with $G^T \in \mathbb{R}^{d \times p}$.
Reconstruction: $Y = G^T X \in \mathbb{R}^{d \times n} \;\longrightarrow\; \hat{X} = G\,(G^T X) \in \mathbb{R}^{p \times n}$, with $G \in \mathbb{R}^{p \times d}$.
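A minimal sketch of the two maps, assuming G holds the first d eigenvectors of S (synthetic data, illustrative names):

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 60))                  # hypothetical data matrix, p=8, n=60
X = X - X.mean(axis=1, keepdims=True)

S = (X @ X.T) / X.shape[1]
_, evecs = np.linalg.eigh(S)
d = 2
G = evecs[:, ::-1][:, :d]                     # p x d, columns = first d eigenvectors of S

Y = G.T @ X                                   # dimension reduction: d x n
X_hat = G @ Y                                 # reconstruction: p x n
print(Y.shape, X_hat.shape)                   # (2, 60) (8, 60)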
Optimality property of PCA
Main theoretical result:
The matrix G consisting of the first d eigenvectors of the
covariance matrix S solves the following min problem:
$$\min_{G \in \mathbb{R}^{p \times d}} \; \big\| X - G\,(G^T X) \big\|_F^2 \quad \text{subject to } G^T G = I_d$$
$\| X - \hat{X} \|_F^2$ is the reconstruction error.
PCA projection minimizes the reconstruction error among all
linear projections of size d.
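A hedged numerical illustration of this optimality claim (an experiment, not a proof): on synthetic data, the reconstruction error of the PCA projection is compared with that of a random orthonormal projection of the same size d.

import numpy as np

rng = np.random.default_rng(6)
# Hypothetical anisotropic data: 8 variables, 60 samples
X = rng.normal(size=(8, 60)) * np.linspace(3.0, 0.3, 8)[:, None]
X = X - X.mean(axis=1, keepdims=True)

def reconstruction_error(G, X):
    # Squared Frobenius norm ||X - G G^T X||_F^2 for an orthonormal G
    return np.linalg.norm(X - G @ (G.T @ X)) ** 2

S = (X @ X.T) / X.shape[1]
_, evecs = np.linalg.eigh(S)
d = 2
G_pca = evecs[:, ::-1][:, :d]                        # first d eigenvectors of S
G_rand, _ = np.linalg.qr(rng.normal(size=(8, d)))    # a random orthonormal p x d matrix

print(reconstruction_error(G_pca, X), reconstruction_error(G_rand, X))
# The PCA value is the smaller of the two.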
Applications of PCA
• Eigenfaces for recognition. Turk and Pentland. 1991.
• Principal Component Analysis for clustering gene
expression data. Yeung and Ruzzo. 2001.
• Probabilistic Disease Classification of Expression-
Dependent Proteomic Data from Mass Spectrometry of
Human Serum. Lilien. 2003.
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
Motivation
Linear projections will not detect the pattern.
Nonlinear PCA using Kernels
• Traditional PCA applies a linear transformation
– May not be effective for nonlinear data
• Solution: apply a nonlinear transformation to a potentially very high-dimensional space:
$$\Phi: x \mapsto \Phi(x)$$
• Computational efficiency: apply the kernel trick.
– Requires that PCA can be rewritten in terms of dot products.
More on kernels later: $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$.
Nonlinear PCA using Kernels
Rewrite PCA in terms of dot products.
Assume the data has been centered, i.e., $\sum_i x_i = 0$.
The covariance matrix S can be written as
$$S = \frac{1}{n} \sum_i x_i x_i^T.$$
Let v be an eigenvector of S corresponding to a nonzero eigenvalue $\lambda$:
$$S v = \frac{1}{n} \sum_i x_i x_i^T v = \lambda v
\;\;\Longrightarrow\;\;
v = \frac{1}{n\lambda} \sum_i (x_i^T v)\, x_i.$$
Eigenvectors of S lie in the space spanned by all data points.
Nonlinear PCA using Kernels
$$S v = \frac{1}{n} \sum_i x_i x_i^T v = \lambda v
\;\;\Longrightarrow\;\;
v = \frac{1}{n\lambda} \sum_i (x_i^T v)\, x_i$$
The covariance matrix can be written in matrix form:
$$S = \frac{1}{n} X X^T, \quad \text{where } X = [x_1, x_2, \ldots, x_n].$$
Write $v = \sum_i \alpha_i x_i = X \alpha$. Then
$$S v = \frac{1}{n} X X^T X \alpha = \lambda X \alpha$$
$$\Longrightarrow\;\; \frac{1}{n} (X^T X)(X^T X)\, \alpha = \lambda\, (X^T X)\, \alpha$$
$$\Longrightarrow\;\; \frac{1}{n} (X^T X)\, \alpha = \lambda\, \alpha$$
Any benefits? Everything is now expressed through $X^T X$, i.e., through the dot products $x_i^T x_j$ only.
Nonlinear PCA using Kernels
Next consider the feature space $\Phi: x \mapsto \Phi(x)$:
$$S^{\Phi} = \frac{1}{n} X_{\Phi} X_{\Phi}^T, \quad \text{where } X_{\Phi} = [\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_n)],$$
$$v = \sum_i \alpha_i \Phi(x_i) = X_{\Phi}\, \alpha,
\qquad
\frac{1}{n} X_{\Phi}^T X_{\Phi}\, \alpha = \lambda\, \alpha.$$
The (i, j)-th entry of $X_{\Phi}^T X_{\Phi}$ is $\Phi(x_i) \cdot \Phi(x_j)$.
Apply the kernel trick: $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$.
K is called the kernel matrix:
$$\frac{1}{n} K \alpha = \lambda\, \alpha.$$
Nonlinear PCA using Kernels
• Projection of a test point x onto v:
$$\Phi(x) \cdot v = \Phi(x) \cdot \sum_i \alpha_i \Phi(x_i)
= \sum_i \alpha_i\, \Phi(x) \cdot \Phi(x_i)
= \sum_i \alpha_i\, K(x, x_i)$$
The explicit mapping $\Phi$ is not required here.
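A self-contained sketch of kernel PCA along these lines. The RBF kernel, the function and variable names, and the data are my own illustrative choices; centering of the kernel matrix in feature space and the rescaling of alpha that makes v unit-norm are omitted for brevity.

import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - y_j||^2); columns of X and Y are samples
    d2 = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(7)
X = rng.normal(size=(2, 100))                 # hypothetical data: 2 variables, 100 samples

K = rbf_kernel(X, X)                          # kernel (Gram) matrix, n x n
# (Centering in feature space is skipped here for brevity.)

_, alpha = np.linalg.eigh(K / K.shape[0])     # solve (1/n) K alpha = lambda alpha
alpha = alpha[:, ::-1]                        # leading eigenvectors first
# (Rescaling alpha so that ||v|| = 1 is also skipped.)

x_test = rng.normal(size=(2, 1))              # a test point
k_vec = rbf_kernel(X, x_test)[:, 0]           # K(x, x_i) for all training samples
projection = alpha[:, 0] @ k_vec              # sum_i alpha_i K(x, x_i)
print(projection)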
Reference
• Principal Component Analysis. I.T. Jolliffe.
• Kernel Principal Component Analysis. Schölkopf, et al.
• Geometric Methods for Feature Extraction and
Dimensional Reduction. Burges.
Principal component analysis (PCA) = K-means clustering
Move the data points of each cluster to the cluster center (assuming each cluster is roughly spherical).
These K cluster centers span the PCA subspace, in p-dim space!
(This can be proven rigorously with mathematics.)
One early major advance using matrix analysis
(Zha, He, Ding, et al., NIPS 2000)
(Ding & He, ICML 2004)
PCA and k-means clustering
- Move every data point to its cluster center.
- The K cluster centers span a cluster subspace ((k-1)-dim, in p-dim space).
- Cluster subspace = PCA subspace (spanned by the 1st k-1 PCA directions).
One early major advance on PCA and K-means (Zha, He, Ding, et al., NIPS 2000)
(Ding & He, ICML 2004)
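A hedged numerical illustration of this claim (an experiment, not the proof given in the cited papers): on well-separated, roughly spherical synthetic clusters, the subspace spanned by the centered K-means cluster centers nearly coincides with the span of the first k-1 principal directions. The scikit-learn and SciPy calls are standard; the data are made up.

import numpy as np
from scipy.linalg import subspace_angles
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
k, p = 3, 5
centers = rng.normal(scale=10.0, size=(k, p))                    # well-separated cluster centers
X = np.vstack([c + rng.normal(size=(200, p)) for c in centers])  # roughly spherical clusters
X = X - X.mean(axis=0)                                           # center the data (rows = samples)

# Subspace spanned by the K cluster centers ((k-1)-dimensional after centering)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
C = km.cluster_centers_ - km.cluster_centers_.mean(axis=0)
Qc, _ = np.linalg.qr(C.T)
Qc = Qc[:, :k - 1]

# Subspace spanned by the first k-1 principal directions
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Qp = Vt[:k - 1].T

print(np.degrees(subspace_angles(Qc, Qp)))                       # principal angles, close to 0 degrees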
The solution of K-means is represented by the cluster indicator matrix H,
whose j-th column indicates the $n_j$ members of cluster j.
We actually use the scaled indicators Q, where each indicator column is divided by $\sqrt{n_j}$, so that
$$Q^T Q = I.$$