Principal Component Analysis (PCA)


Kavishka Abeywardana · 8 min read · Jun 17, 2024

Source: Towards Data Science

Suppose you are a scientist. You want to measure and understand the behavior of a
system. You set up an experiment and measure various quantities. In modern
experimental settings, you have the resources to collect large amounts of data and
features.

However, the data points can appear clouded, unclear, and redundant. You won’t be able to understand the real structure of the data. Let’s understand this using a simple toy example.

A toy example

We are studying an ideal spring (massless and frictionless). Because it is ideal, the
mass must oscillate indefinitely along the x-axis. The system is straightforward and
can be explained as a function of x.

However, as an experimenter, we don’t know how many axes and dimensions are
important to explain the system. Thus, we measure the ball’s position in 3-
dimensional space using three cameras. At 200 Hz, each camera captures an image
indicating the 2-dimensional position of the ball. Since we don’t have prior
knowledge about the system, we don’t know the optimal directions for the three
cameras.

Moreover, air, imperfect cameras, and less ideal springs can add noise to the
system. Thus, making assumptions directly from the data becomes even harder.

If we knew the system dynamics, we would directly measure the displacement along the x-axis using a single camera. However, now we have to extract the x-axis from the complex and arbitrarily collected data set.

Why do we need PCA?

During the data collection process, we might collect unnecessary features. This
increases the complexity of the data set and dilutes the insights. Our goal is to find
the most important features or feature combinations.

We could directly pick a few features using our intuition about the data set. In PCA, however, we choose the most important components (not features): we transform the existing feature set (suppose we have 1,000 features) and generate a small number of new features (say, 10).

These new features give us insights into the data set. Due to the richness of these components, we can use them in other machine-learning tasks (supervised learning). This reduces complexity and helps avoid overfitting. Thus, PCA is a good preprocessing step.

Capturing the variance


Suppose we have a two-dimensional data set.

We have to map this data set into a one-dimensional space. Our objective is to
capture the maximum amount of variance. What would be the solution?

The red line is the best one-dimensional axis. We orthogonally map each point onto
the red line. Observe that this new line is a mixture of the two features.

Now we have to formulate this process. We can use two approaches.

1. Maximize the variance.

2. Minimize the reconstruction error.

Reconstruction error is the squared distance between the projected point and the original point. This is given by the perpendicular distance from the point to the line. We want to minimize this difference.

Variance is the mean squared distance between each mapped point and the mean of the mapped data set. We want to maximize this variance.

We can easily prove that both of these approaches give the same result. They are like
two sides of the same coin.


||d|| is the deviation of the single projected point from the mean (assume that the points are mean-centered). ||X|| is the magnitude of the vector related to the point. ||e|| is the reconstruction error.

From the Pythagoras theorem, we can derive the following result.
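In the notation above:

```latex
\|X\|^2 = \|d\|^2 + \|e\|^2
```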

We can sum this all over the data set and take the mean.
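Averaged over all n points:

```latex
\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^2
  = \frac{1}{n}\sum_{i=1}^{n}\|d_i\|^2
  + \frac{1}{n}\sum_{i=1}^{n}\|e_i\|^2
```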

The square of the standard deviation is the variance. Thus, we can derive the
following relation.
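Rearranging, with the left-hand total fixed by the data:

```latex
\underbrace{\frac{1}{n}\sum_{i}\|d_i\|^2}_{\text{variance}}
  = \underbrace{\frac{1}{n}\sum_{i}\|X_i\|^2}_{\text{constant}}
  - \underbrace{\frac{1}{n}\sum_{i}\|e_i\|^2}_{\text{mean reconstruction error}}
```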


Thus, when we increase the variance, the reconstruction error drops.

Loss function
We can use both approaches to derive the loss function for the optimization problem. Assume that we use a projection matrix w to project the data points from the original data space into the selected subspace. X is the data matrix.

From the reconstruction error viewpoint, we can write the following loss function.
We use the L2 norm.
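A sketch of this loss, assuming w is a unit vector and the rows of X are the data points:

```latex
L(w) = \|X - X w w^{\top}\|^2
```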

From the variance-maximizing viewpoint, we can write a loss function in the following way. Since we are maximizing the variance, we add a negative sign to the loss value. This forces us to minimize the loss function.

Assuming the data points are mean-centered, we can use the squared distance from
the origin. We can calculate the squared distance using the dot product.
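Combining the two statements, the loss can be sketched as:

```latex
L(w) = -\frac{1}{n}\sum_{i=1}^{n}(w^{\top}x_i)^2 = -\frac{1}{n}\|Xw\|^2
```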


However, the model might try to maximize the variance simply by increasing the size of the w (projection) matrix. This is useless. Thus, we must introduce a constraint to limit the size of the projection matrix.
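A natural constraint (assumed here, and standard in PCA) is unit length:

```latex
\|w\|^2 = w^{\top}w = 1
```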

We will use the variance approach.

Optimization
We use Lagrange multipliers to optimize the problem. It’s simple. We plug our
constraint into the optimization problem. This gives us an unconstrained problem.
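With the unit-norm constraint, the unconstrained objective reads:

```latex
\mathcal{L}(w, \lambda) = -\frac{1}{n}\|Xw\|^2 + \lambda\,(w^{\top}w - 1)
```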

Now we can differentiate the Lagrangian and find the optimal points. Everything is quadratic. Thus, taking derivatives is simple.
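Setting the gradient with respect to w to zero, with C = XᵀX/n:

```latex
\frac{\partial \mathcal{L}}{\partial w} = -\frac{2}{n}X^{\top}X\,w + 2\lambda w = 0
\quad\Longrightarrow\quad
C\,w = \lambda w
```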

C is the covariance matrix. This looks very familiar. w is an eigenvector of the covariance matrix!

Thus, to maximize the variance, one should choose the eigenvector w with the maximum eigenvalue λ.

Subsequent eigenvectors represent subsequent principal components. We choose the eigenvectors in a greedy fashion.

We can easily calculate the eigenvalues and eigenvectors using numerical analysis
software.
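As a minimal sketch (hypothetical data; NumPy’s `eigh` handles the symmetric covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))         # hypothetical data: 500 samples, 3 features
X = X - X.mean(axis=0)                # mean-center the data

C = (X.T @ X) / X.shape[0]            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigh is for symmetric matrices

# eigh returns eigenvalues in ascending order; reverse to get the
# largest-variance components first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
scores = X @ eigvecs[:, :k]           # project onto the top-k principal components
```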

Spectral Theorem
C is a symmetric p x p matrix. We can prove that such a matrix has p independent
(orthogonal to each other) eigenvectors.

This means that changing to the eigenbasis simply rotates our original coordinate system such that every basis vector is an eigenvector.

We will stack the eigenvectors and create a new matrix V. XV rotates the original axes. Let’s calculate the covariance based on the new coordinate system.

Since V contains the eigenvectors, the off-diagonal terms become zero. Thus, we can
rewrite the covariance matrix as follows.
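With Λ the diagonal matrix of eigenvalues:

```latex
\frac{1}{n}(XV)^{\top}(XV) = V^{\top}CV = \Lambda
\quad\Longrightarrow\quad
C = V\Lambda V^{\top}
```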

This is the standard eigendecomposition.

Relationship to singular value decomposition (SVD)


Consider the SVD of X.
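That is:

```latex
X = U S V^{\top}
```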

U is the left singular matrix. V is the right singular matrix. S is a diagonal matrix. Both U and V are orthogonal matrices. Thus, physically, SVD performs a rotation, a stretching, and another rotation. The diagonal elements of S are singular values.

Now let’s calculate the covariance matrix.
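Substituting the SVD and using UᵀU = I:

```latex
C = \frac{1}{n}X^{\top}X
  = \frac{1}{n}V S^{\top} U^{\top} U S V^{\top}
  = V\left(\frac{S^{2}}{n}\right)V^{\top}
```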

We end up with the eigendecomposition of C. Thus, we can say that the eigenvalues of C are the squared singular values divided by n. The eigenvectors are the columns of the right singular matrix of X.

Total Variance

Trace is the sum of the diagonal values of a square matrix. The diagonal elements of the covariance matrix give us the variance of each data feature. Thus, the trace of C gives us the total variance.

Let’s calculate the trace of the covariance matrix.
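Using C = VΛVᵀ and the cyclic property of the trace:

```latex
\operatorname{tr}(C) = \operatorname{tr}(V\Lambda V^{\top})
                     = \operatorname{tr}(\Lambda V^{\top}V)
                     = \operatorname{tr}(\Lambda)
                     = \sum_{i}\lambda_i
```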

Thus, the total variance is the sum of the eigenvalues of the covariance matrix.

Using this result, we can calculate the percentage of variance each eigenvalue
captures.
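The fraction captured by the i-th component is its eigenvalue divided by the sum of all eigenvalues:

```latex
\frac{\lambda_i}{\sum_{j}\lambda_j}
```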

Principal component regression (PCR)


PCA followed by regression is principal component regression (PCR). It is similar to
ridge regression.

We can write the results of a linear regression model in the following way.

U is the left singular matrix of X.
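One standard way to write this (a reconstruction; ŷ denotes the fitted values):

```latex
\hat{y}_{\text{OLS}} = X\hat{\beta} = U U^{\top} y = \sum_{i} u_i\,(u_i^{\top} y)
```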

In ridge regression, we add a ridge penalty.
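Each direction is then shrunk by a factor that depends on its singular value (a sketch consistent with the description below):

```latex
\hat{y}_{\text{ridge}} = \sum_{i} u_i\,\frac{s_i^2}{s_i^2 + \lambda}\,(u_i^{\top} y)
```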

s denotes the singular values. Singular values that are small compared to λ are shrunk toward zero, while larger singular values remain almost unchanged. The diagonal matrix has decreasing diagonal values.

PCR does hard thresholding where ridge regression does soft shrinkage: we specifically choose the components with the largest singular values and ignore the rest.
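A minimal PCR sketch with scikit-learn (hypothetical data; the pipeline and component count are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))                 # hypothetical: 200 samples, 50 features
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=200)

# PCR: keep only the top-k principal components, then regress on them.
pcr = make_pipeline(PCA(n_components=10), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))                         # R^2 on the training data
```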

Probabilistic PCA (PPCA)


We consider a latent variable model. Latent variables are the internal
representations of a system. We do not observe them.

Suppose the latent variables follow a spherical Gaussian distribution with unit variance.
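In symbols:

```latex
z \sim \mathcal{N}(0,\, I)
```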

Now we can derive a conditional probability distribution for the observed data.
Conditioning shifts the latent variable distribution.
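With W the loading matrix, μ the data mean, and σ² the noise variance (standard PPCA notation, assumed here):

```latex
x \mid z \sim \mathcal{N}(Wz + \mu,\; \sigma^{2}I)
```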

Now we can write the mean and the covariance of the marginal distribution.
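Integrating out z:

```latex
x \sim \mathcal{N}(\mu,\; WW^{\top} + \sigma^{2}I)
```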

We want to find the best parameters to explain X under the maximum likelihood estimation (MLE) framework. We can use the expectation-maximization (EM) algorithm. The maximum likelihood solution is a PCA solution.

The MLE of the W matrix contains the leading eigenvectors. If we assume a two-dimensional latent variable model, we will get the two leading eigenvectors. Most of the variation in the data can be captured by these latent variables. The remaining variances define the reconstruction error.

References

This article is based on the lecture by Dr. Dmitry Kobak (Winter Term 2020/21, University of Tübingen). He gives a beautiful explanation of the fundamentals of PCA.

If you are a little rusty about singular value decomposition, watch the short lecture
by Prof. Gilbert Strang at MIT OpenCourseWare

For a beautiful geometric explanation of Lagrange multipliers, watch the Khan Academy lecture by Grant Sanderson.

The toy problem is derived from “A Tutorial on Principal Component Analysis: Derivation, Discussion, and Singular Value Decomposition” by Jon Shlens at Princeton University.
