Machine Learning Notes
Lecture 1
Machine Learning Life-Cycle
The machine learning life cycle refers to the series of steps involved in developing and
deploying a machine learning model.
1. Acquiring Data: Gather relevant data required to train and evaluate your model. This
may involve collecting data from various sources, such as databases, APIs, or external
datasets.
2. Data Preprocessing: Prepare and preprocess the acquired data to make it suitable for
training. This involves tasks like data cleaning, handling missing values, outlier
detection, feature selection, feature engineering, and data normalization.
3. Data Analysis/Model Selection: Choose an appropriate machine learning algorithm or
model architecture that is well-suited for your problem. Consider factors such as the
nature of the data, the type of problem (classification, regression, etc.), and the
available computational resources.
4. Training: Use the prepared data to train the selected model. The training process aims
to minimize the difference between the model's predictions and the actual values in the
training data.
5. Model Evaluation/Testing: Assess the performance of the trained model on held-out data.
6. Deployment: Once satisfied with the model's performance, deploy it in a production
environment where it can make predictions on new, unseen data.
Supervised Learning
Supervised learning is a ML approach where the model learns from labeled examples; each example is made up of input data (features) and the corresponding output (the label). This approach is commonly used for classification or regression (predicting a value).
The model’s goal is to learn how to map the features to the label so that it can make
accurate predictions on new, unseen data. The model learns by trying to minimize the
difference (the loss) between its predicted labels and the real labels in the training data.
Ex. A supervised learning model can be trained on a dataset of emails labeled as spam or
not spam. The model learns to categorize the emails based on the contents of the email.
Unsupervised Learning
In unsupervised learning the model tries to identify patterns or relationships within the
data on its own without being given labels. Unsupervised learning is useful for tasks like
data exploration, anomaly detection, and recommendation systems.
Reinforcement Learning
In reinforcement learning the model or agent learns through trial and error: when it takes an action that furthers its goal it is given a reward, and the agent tries to find the best strategy (policy) to maximize its rewards.
Lecture 2
Data Cleaning
Data collected from various sources can contain errors, missing values, outliers, and
inconsistencies. Data cleaning helps improve the quality and reliability of the dataset.
A dataset may contain missing values. These can be dealt with by replacing each missing value with an estimated one (imputation), where the estimate is typically the mean or the mode of that field. Alternatively, any rows containing missing values can be deleted (deletion).
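A minimal sketch of both approaches using pandas (the column name "age" and the toy values are hypothetical examples):

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan]})  # toy data with gaps

# Imputation: fill missing values with the column mean (use the mode for categorical fields)
df_imputed = df.fillna({"age": df["age"].mean()})

# Deletion: drop any row that contains a missing value
df_deleted = df.dropna()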
Data Normalization
Data normalization/feature scaling is a technique used to bring different features or
variables to a similar scale or range. This is done to ensure that all features are given
equal importance during the learning process.
Z-Score normalization is a technique that transforms the data to have zero mean and unit variance. It is used when we want to allow for extreme values within the dataset. The formula for Z-Score normalization is:

X_{normalized} = \frac{X - \mu}{\sigma}

Mean normalization rescales the data to a range centred on zero:

X_{normalized} = \frac{X - \mu}{\max(X) - \min(X)}
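A short numpy sketch of both techniques, assuming x is a 1-D feature array with made-up values:

import numpy as np

x = np.array([12.0, 15.0, 9.0, 30.0, 14.0])

z_score = (x - x.mean()) / x.std()                 # zero mean, unit variance
mean_norm = (x - x.mean()) / (x.max() - x.min())   # centred on zero, range-scaled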
Lecture 3
Feature Extraction
Feature extraction is the process of transforming raw data into a format that we can train a
ML model on. Feature extraction is used to select only the relevant features for the model we are training (dimensionality reduction). This allows the model to learn patterns faster and eliminates noise (irrelevant/redundant features) from the data.
Image features can be global, meaning they describe some aspect of the image as a whole
(useful for classification), or they can be local, meaning they describe a region/patch of the
image (useful for object recognition).
Ex. Convolving a 3×3 patch of pixel values with the Sobel_x kernel:

\begin{bmatrix} 152 & 76 & 125 \\ 78 & 85 & 89 \\ 214 & 68 & 200 \end{bmatrix} * Sobel_x \rightarrow \begin{bmatrix} -152 & 0 & 125 \\ -156 & 0 & 178 \\ -214 & 0 & 200 \end{bmatrix} \rightarrow G_x = -19

The element-wise products are summed to give the horizontal gradient G_x at the centre pixel.
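The same computation as a small numpy sketch:

import numpy as np

patch = np.array([[152, 76, 125],
                  [78,  85,  89],
                  [214, 68, 200]], dtype=float)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

gx = np.sum(patch * sobel_x)  # element-wise multiply, then sum -> -19.0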
HOG (Histogram of Oriented Gradients) descriptors are computed in several steps:
1-2. Gradient computation: the horizontal and vertical gradients are computed for every pixel (e.g., with Sobel filters). The gradient magnitudes and directions for each pixel in the image are stored in 2 matrices.
3. Gradient histograms: the image is divided into small overlapping cells (e.g., 8x8 pixels).
The gradient orientations/directions for all the pixels in the cell are then grouped into
bins that cover the entire range of orientations (e.g., 9 bins covering 0 to 180 degrees).
The gradient magnitude is used as the weight of each cast into a bin.
Ex. A pixel with gradient magnitude 122.44 casts a vote, weighted by that magnitude, into the bin matching its gradient direction (bin edges at 0, π/9, 2π/9, π/3, 4π/9, 5π/9, 2π/3, 7π/9, 8π/9).
The result is a histogram for all the pixels within the cell that represents the gradient
orientations within that region.
4. Cell normalization: the histogram for each cell (or group of cells) is normalized to make
the HOG descriptor more robust to changes in lighting and contrast.
5. Descriptor formation: the normalized histograms are concatenated to form one giant 1D
vector, this is the final HOG descriptor for the image and can be used as input in a ML
model.
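A hedged sketch of extracting a HOG descriptor with scikit-image; the parameter values mirror the examples above, and the image path is hypothetical:

from skimage import io, color
from skimage.feature import hog

image = color.rgb2gray(io.imread("example.jpg"))  # hypothetical input image

# 9 orientation bins over 0-180 degrees, 8x8-pixel cells, block normalization
descriptor = hog(image, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), block_norm="L2-Hys")
print(descriptor.shape)  # one long 1-D feature vector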
Local Feature Extraction
Local feature extraction is used for tasks like object recognition, image stitching
(combining multiple images to produce one large image), or structure from motion
(estimating how an object is moving in 3D space using images).
The Difference of Gaussians (DoG) is a method used to detect local features, especially
blob-like structures, in an image. DoG works by taking the difference between two
Gaussian-smoothed/blurred versions of the image.
DoG process:
1. Smoothing: the image is blurred with two Gaussian filters of different standard deviations (σ).
2. Subtraction: one blurred image is subtracted from the other to produce the DoG image.
3. Feature detection: local minima (dark areas) and maxima (light areas) are identified in the DoG image by comparing them to neighboring pixels. Extrema suppression might be applied to remove weak extrema (e.g., edges) that are not robust.
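A minimal sketch of the DoG computation using scipy; the σ values are arbitrary choices for illustration:

import numpy as np
from scipy.ndimage import gaussian_filter

image = np.random.rand(128, 128)  # stand-in for a grayscale image

blur_small = gaussian_filter(image, sigma=1.0)   # fine-scale smoothing
blur_large = gaussian_filter(image, sigma=1.6)   # coarser smoothing
dog = blur_small - blur_large                    # difference of Gaussians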
Once the features/key points have been detected in an image, there are several ways of
describing them in a way we can use in a ML model:
• Scale Invariant Feature Transform (SIFT): designed to capture the local texture and
shape around a key point as floating point variables in a way that is invariant to scale
and rotation.
SIFT works in a similar way to HOG in that it creates histograms and concatenates them
to produce a one dimensional descriptor.
• Binary descriptors (BRIEF, ORB): designed to represent features efficiently as binary
strings. This makes them easier to store and compute with than a floating point
descriptor.
Binary descriptors work by comparing pairs of pixels around a key point. If the first
pixel's intensity is greater than the second, a 1 is assigned; otherwise, a 0 is assigned.
These comparisons are concatenated to form the descriptor.
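A hedged sketch of detecting key points and computing binary descriptors with OpenCV's ORB (the image path is hypothetical):

import cv2

image = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input

orb = cv2.ORB_create(nfeatures=500)  # detector plus binary descriptor
keypoints, descriptors = orb.detectAndCompute(image, None)
# descriptors: one 32-byte (256-bit) binary string per key point
print(descriptors.shape)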
Lecture 4
Univariate Linear Regression
Univariate Linear Regression is a ML algorithm used to model the relationship between a single independent variable (predictor) and a dependent variable (target).
The relationship between the independent variable (x) and the dependent variable (y) is
represented as:
\hat{y} = h_\beta(x) = \beta_0 + \beta_1 x
*𝛽0 is the intercept term. It represents the predicted value of 𝑦 when 𝑥 is zero.
*𝛽1 is the slope of the line.
The objective of univariate linear regression is to find the parameter values ( 𝛽0 and 𝛽1 )
that minimize the cost function/loss (sum of squared errors), which measures how well the
model fits the data.
J(\beta_0, \beta_1) = \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
1. The model starts with some arbitrary values for the parameters, takes some input features X, and computes its prediction \hat{Y}.
2. The predicted value \hat{Y} and the actual value Y are put through the cost function to compute the loss.
3. The values for the parameters are adjusted until the loss converges.
Gradient Descent
Gradient descent is an algorithm for minimizing the cost function 𝐽(𝛽). Gradient descent
works by moving parameters in the direction of the negative gradient of the cost function.
The gradient of the cost function is calculated with respect to each parameter. The
parameters are then updated in the negative gradient direction.
\beta = \beta - \alpha \nabla J(\beta)

For each individual parameter:

\beta_j = \beta_j - \alpha \frac{\partial}{\partial \beta_j} J(\beta)
*𝛼 is the learning rate
The learning rate or step size determines how big of a step is taken each iteration. A step
size too low makes learning take too long, and a step size too large might overshoot the
minimum.
The ideal step size is the one that arrives at the minimum in the least number of steps.
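A minimal numpy sketch of univariate linear regression trained with gradient descent; the learning rate, iteration count, and toy data are illustrative choices:

import numpy as np

# Toy data: y is roughly 2x + 1 with noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 0.5, 100)

b0, b1 = 0.0, 0.0   # arbitrary starting parameters
alpha = 0.01        # learning rate

for _ in range(2000):
    y_hat = b0 + b1 * x
    error = y_hat - y
    # Gradients of J = (1/2n) * sum(error^2) with respect to b0 and b1
    b0 -= alpha * error.mean()
    b1 -= alpha * (error * x).mean()

print(b0, b1)  # should approach 1 and 2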
Lecture 5
Vectors and Matrices
A vector is a one dimensional array of numbers. Vectors are used to represent datapoints
or parameters.
Vectors are represented as column vectors with their height representing the vector’s
dimension.
\vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
Operations on vectors:
1. Addition/subtraction: \vec{x} \pm \vec{y} = \begin{bmatrix} x_1 \pm y_1 \\ x_2 \pm y_2 \\ \vdots \\ x_n \pm y_n \end{bmatrix}

2. Scaling: c\vec{x} = \begin{bmatrix} cx_1 \\ cx_2 \\ \vdots \\ cx_n \end{bmatrix}

3. Dot product: \vec{x} \cdot \vec{y} = x_1 y_1 + x_2 y_2 + \dots + x_n y_n
A matrix is a two dimensional array of numbers. Matrices are used to represent datasets
and transformations.
Operations on matrices:
Multiplication: each entry of the product AB is given by the dot product of the corresponding row in A and the corresponding column in B.
*If A is an m \times n matrix and B is an n \times p matrix, then their matrix product AB is an m \times p matrix

4. Transposition: \begin{bmatrix} 1 & 2 & 3 \\ 0 & 6 & 7 \end{bmatrix}^T = \begin{bmatrix} 1 & 0 \\ 2 & 6 \\ 3 & 7 \end{bmatrix}

5. Inverse: A \times A^{-1} = I
*only defined for m \times m /square matrices
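A quick numpy sketch of these operations:

import numpy as np

A = np.array([[1, 2, 3],
              [0, 6, 7]])
B = np.array([[1, 0],
              [2, 6],
              [3, 7]])

product = A @ B    # (2x3) @ (3x2) -> 2x2 matrix
transpose = A.T    # 3x2 matrix

square = np.array([[2.0, 1.0],
                   [1.0, 3.0]])
inverse = np.linalg.inv(square)   # only defined for square matrices
print(square @ inverse)           # approximately the identity matrix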
For multivariate linear regression with n features, the hypothesis becomes:

y = \beta_0 x_0 + \beta_1 x_1 + \dots + \beta_n x_n

The features and parameters are represented as vectors (with x_0 = 1 for the intercept term):

x^{(i)} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{bmatrix}

y = \sum_{j=0}^{n} \beta_j x_j = \beta^T x
The cost function is represented as:
J(\beta) = \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{2n} \sum_{i=1}^{n} \big(\beta^T x^{(i)} - y_i\big)^2 = \frac{1}{2n} \sum_{i=1}^{n} \Big( \sum_{j=0}^{n} \beta_j x_j^{(i)} - y_i \Big)^2
To speed up gradient descent the feature values should be normalized using one of the
normalization techniques.
The learning rate \alpha should be chosen carefully: small enough that gradient descent converges, but not so small that convergence becomes slow.
Correlation analysis between features should also be done to detect redundancies. This is
done by finding the correlation coefficient between the 2 features we want to investigate
r_{A,B} = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{(n-1)\, \sigma_A \sigma_B}
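A sketch of computing this with numpy (np.corrcoef gives the same result); the feature values are made up:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

n = len(a)
r = np.sum((a - a.mean()) * (b - b.mean())) / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))
print(r, np.corrcoef(a, b)[0, 1])  # both close to 1 for strongly correlated features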
Normal Equation
The normal equation is an analytical approach for finding the optimal parameters
(coefficients) in linear regression as opposed to the gradient descent method.
The goal is to find the optimal coefficients 𝛽 to minimize the cost function 𝐽(𝛽).
First the data is represented in matrix form:
Y = X\beta

\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & \dots & x_n^{(1)} \\ 1 & x_1^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & & \vdots \\ 1 & x_1^{(m)} & \dots & x_n^{(m)} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_n \end{bmatrix}

where X is the m \times (n+1) design matrix, with one row per training example and a leading 1 for the intercept.
The cost function is then minimized:
\frac{\partial}{\partial \beta_j} J(\beta) = 0 \quad \Rightarrow \quad \beta = (X^T X)^{-1} X^T Y
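A minimal numpy sketch of the normal equation on made-up data (np.linalg.solve is used in place of an explicit inverse, a standard numerical-stability choice):

import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 1, (25, 6))          # m = 25 examples, n = 6 features
X = np.column_stack([np.ones(25), X_raw])   # prepend the intercept column x0 = 1
true_beta = np.arange(7, dtype=float)
Y = X @ true_beta + rng.normal(0, 0.01, 25)

beta = np.linalg.solve(X.T @ X, X.T @ Y)    # equivalent to (X^T X)^-1 X^T Y
print(beta)  # close to true_beta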
Example.
Suppose you have 𝑚 = 25 training examples with 𝑛 = 6 features. The normal equation is
θ = (𝑋 𝑇 𝑋)−1 𝑋 𝑇 𝑌 . For the given values of m and n what are the dimensions of 𝜃, 𝑋, and 𝑌 in
this equation?
Answer: with the intercept column x_0 = 1 included, X is 25 × 7, Y is 25 × 1, and θ is 7 × 1.
Lecture 6
Logistic Regression
Logistic regression is a classification algorithm that models the probability that an input belongs to the positive class by applying the sigmoid function f(z) = \frac{1}{1+e^{-z}} to a linear combination of the features:

h_\theta(x) = f(\theta^T x)
Where ℎ𝜃 (𝑥) is the probability that 𝑦 = 1 given input 𝑥. The predicted probability is
converted into a binary outcome using a threshold. If ℎ𝜃 (𝑥) ≥ 0.5, the predicted class is 1;
otherwise, it's 0.
The decision boundary is the line that separates the two classes. It is determined by the
weights 𝜃 learned during the training process.
The cost function for logistic regression is the log loss (cross-entropy loss):
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log\big(h_\theta(x^{(i)})\big) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big]
The goal is to minimize the cost function by adjusting the parameters 𝜃 during training.
Gradient descent can be used to do this. The update rule for each parameter is given by:
\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}
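A minimal numpy sketch of logistic regression trained with gradient descent; the toy data, learning rate, and iteration count are arbitrary choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(0, 1, (200, 2))])  # intercept + 2 features
true_theta = np.array([-0.5, 2.0, -1.0])
y = (sigmoid(X @ true_theta) > rng.uniform(size=200)).astype(float)

theta = np.zeros(3)
alpha = 0.1
for _ in range(3000):
    h = sigmoid(X @ theta)
    gradient = X.T @ (h - y) / len(y)   # dJ/dtheta for the log loss
    theta -= alpha * gradient

predictions = (sigmoid(X @ theta) >= 0.5).astype(int)  # threshold at 0.5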
L1 regularization (Lasso) adds the absolute values of the coefficients as a penalty term to
the cost function. L1 regularization encourages sparsity in the model, meaning it tends to
drive some of the feature weights to exactly zero. This can be useful for feature selection.
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log\big(h_\theta(x^{(i)})\big) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \lambda \sum_{j=1}^{n} |\theta_j|
L2 regularization (Ridge) adds the squared values of the coefficients as a penalty term to
the cost function.
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log\big(h_\theta(x^{(i)})\big) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \lambda \sum_{j=1}^{n} \theta_j^2
L2 regularization tends to shrink the weights of the features towards zero but usually
doesn't make them exactly zero. It addresses multicollinearity, where features are highly
correlated.
The gradient of the L2 regularized cost function with respect to a coefficient \theta_j is:

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big) x_j^{(i)} + \frac{\lambda}{m} \theta_j

(For L1 regularization the penalty term contributes \frac{\lambda}{m} \operatorname{sign}(\theta_j) instead.)
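A hedged sketch using scikit-learn, which implements both penalties (C is the inverse of the regularization strength λ); the data here is made up:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 5))
y = (X[:, 0] - 2 * X[:, 1] > 0).astype(int)

# L1 (Lasso) tends to zero out weights; the liblinear solver supports the l1 penalty
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
# L2 (Ridge) shrinks weights toward zero without making them exactly zero
l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

print(l1_model.coef_)
print(l2_model.coef_)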
Lecture 7
KNN (k-Nearest-Neighbours)
KNN is a simple and intuitive classification algorithm that falls under the category of
instance-based learning or lazy learning. It makes predictions based on the majority class
of the k nearest neighbours in the feature space.
The training phase for a KNN model involves only storing the training dataset. The
prediction phase involves finding the k training examples with the closest feature values to
the new input (query point), it then assigns the majority class among the k neighbours to
the query point.
A distance metric is used to determine the layout of the example space i.e., which points
are considered nearest to the query point.
- Euclidean Distance (L2 Norm): measures the straight-line distance between two
points in Euclidean space.
Distance(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
- Manhattan Distance (L1 Norm): represents the sum of the absolute differences
between corresponding coordinates.
Distance(P, Q) = \sum_{i=1}^{n} |p_i - q_i|
The parameter 'k' represents the number of neighbours to consider. A small 'k' may lead to
noise sensitivity, while a large 'k' may include points from other classes, reducing the
algorithm's sensitivity to local patterns.
KNN classification can be weighted by distance, meaning that closer neighbours have a higher influence on the prediction.
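A minimal numpy sketch of KNN classification with Euclidean distance; the toy points and k = 3 are illustrative choices:

import numpy as np

# Toy training set: 2-D points with binary labels
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(query, k=3):
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))  # Euclidean (L2)
    nearest = np.argsort(distances)[:k]                        # indices of the k closest
    votes = np.bincount(y_train[nearest])                      # majority vote
    return votes.argmax()

print(knn_predict(np.array([1.1, 0.9])))  # -> 0
print(knn_predict(np.array([4.0, 4.0])))  # -> 1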
Cross-Validation
Cross-validation is a technique used to assess the performance of a model. It involves
splitting the dataset into multiple subsets, training the model on different subsets, and
evaluating its performance on the remaining data.
The most common form of cross-validation is k-Fold Cross-Validation, where the dataset is divided into k equally sized folds or subsets. The model is trained k times, each time using k-1 folds for training and the remaining fold for validation.
The performance metric is averaged over the k iterations to obtain the final performance estimate. This provides a more accurate estimate of the model's performance.
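A short sketch of k-fold cross-validation with scikit-learn; the 5 folds, the iris dataset, and the KNN model are arbitrary example choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Train on 4 folds, validate on the 5th, rotating through all 5 splits
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # averaged performance estimate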
Lecture 8
Performance Metrics
A confusion matrix is a table that summarizes the performance of a classification
algorithm. It shows the number of true positives, true negatives, false positives, and false
negatives.
Several performance metrics can be calculated using values from the confusion matrix:
- Accuracy
\frac{TP + TN}{TP + TN + FP + FN}
- Precision
\frac{TP}{TP + FP}
- Recall (Sensitivity)
\frac{TP}{TP + FN}
- F1 Score
2 \times \frac{Precision \times Recall}{Precision + Recall}
- Specificity (True Negative Rate)
\frac{TN}{TN + FP}
Ex. Consider the following confusion matrix (rows: actual class, columns: predicted class):

                 Predicted class
                 Apple  Orange  Pear
Actual  Apple      50      5     50
        Orange     10     50     20
        Pear        5      5      0

- True Positives: the diagonal entries, where the predicted class matches the actual class (Apple: 50, Orange: 50, Pear: 0).
- False Positives for the Orange class: samples predicted as Orange that are actually another class, i.e., the Orange column excluding the diagonal (5 + 5 = 10).
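A small numpy sketch that reads these values off the confusion matrix above:

import numpy as np

# Rows: actual class, columns: predicted class (Apple, Orange, Pear)
cm = np.array([[50,  5, 50],
               [10, 50, 20],
               [ 5,  5,  0]])

tp = np.diag(cm)             # true positives per class -> [50 50 0]
fp = cm.sum(axis=0) - tp     # column totals minus diagonal -> [15 10 70]
fn = cm.sum(axis=1) - tp     # row totals minus diagonal
precision = tp / (tp + fp)   # per-class precision
recall = tp / (tp + fn)      # per-class recall
print(fp[1])                 # false positives for Orange -> 10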
Model Diagnosis
Diagnostics are tests that are run to gain insight on what is/isn’t working with a learning
algorithm.
Bias is the error introduced by approximating a real-world problem with a model that is too simple. High bias can lead to underfitting, where the model is too simplistic to capture the underlying patterns in the data. If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
Variance is the amount by which the model's predictions would change if it were trained on
a different dataset. It measures the model's sensitivity to variations in the training data.
High variance can lead to overfitting, where the model performs well on the training data
but fails to generalize to new, unseen data. If a learning algorithm is suffering from high
variance, getting more training data is likely to help.
Ensemble Learning
Ensemble learning combines the predictions of multiple models to produce better results than any single model on its own. Due to the need for training and storing multiple models, and combining their outputs, ensemble models are computationally expensive and time consuming.
Voting
Voting is an ensemble technique in machine learning where multiple models are trained
independently, and their predictions are combined to make a final prediction.
There are different types of voting methods, each with its own way of aggregating the
individual model predictions:
- Hard Voting: the final prediction is determined by a simple majority vote. Each model
in the ensemble "votes" for a class, and the class with the most votes is chosen as
the final prediction.
Ex. If three models predict class A, and two models predict class B, the final
prediction using hard voting would be class A.
- Soft Voting: The final prediction is the class with the highest average
probability/confidence.
Ex. If three models predict class A with probabilities 0.8, 0.7, and 0.9 [average: 0.8], and two models predict class B with probabilities 0.4 and 0.6 [average: 0.5], the final prediction using soft voting would be class A.
- Average Voting: the average (arithmetic mean) of the predictions is used to make
the final prediction. Used for regression problems.
Ex. If three models predict 3.0, 3.5, and 4.0, the final prediction using average voting
would be (3.0 + 3.5 + 4.0) / 3 = 3.5.
- Weighted voting: assigns different weights to the predictions of each model. The
weights reflect the confidence or performance of each model.
Ex. If there are three models, and you assign weights of 0.5, 0.3, and 0.2 to their
predictions, the final prediction would be 0.5 × 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛1 + 0.3 × 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛2 +
0.2 × 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛3
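A hedged sketch of hard and soft voting using scikit-learn's VotingClassifier; the base models and dataset are arbitrary example choices:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
base_models = [("lr", LogisticRegression(max_iter=1000)),
               ("knn", KNeighborsClassifier()),
               ("tree", DecisionTreeClassifier())]

hard = VotingClassifier(estimators=base_models, voting="hard").fit(X, y)  # majority vote
soft = VotingClassifier(estimators=base_models, voting="soft").fit(X, y)  # average probabilities
# passing weights=[0.5, 0.3, 0.2] would give a weighted vote instead
print(hard.predict(X[:3]), soft.predict(X[:3]))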
Bagging/Bootstrap Aggregating
Bagging is an ensemble learning technique that trains multiple instances of the same
model on different subsets of the training data. Bagging aims to reduce the variance and
overfitting associated with a single model by combining predictions from multiple models.
The first step is creating multiple subsets of the training data by randomly sampling with replacement; this is called bootstrap sampling.
We randomly select 𝑛 samples with replacement from the original training dataset to
create a new training subset. This process is repeated 𝑘 times, resulting in 𝑘 diverse
subsets.
One example of Bagging is Random Forest, which builds an ensemble of decision trees,
where each tree is trained on a different bootstrap sample. Additionally, at each node, a
random subset of features is considered for splitting. This reduces the correlation between
outputs of each model.
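A short sketch of bagging with scikit-learn (the estimator keyword follows scikit-learn 1.2+; older versions call it base_estimator):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# k = 100 bootstrap samples, each training its own decision tree
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=100, bootstrap=True).fit(X, y)

# Random forest additionally considers a random feature subset at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt").fit(X, y)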
Boosting
Boosting is an ensemble learning technique that builds a sequence of models, where each
subsequent model focuses on correcting the errors of the previous ones. Boosting relies
on the use of weak learners, which are models that perform slightly better than random
chance. Weak learners are typically simple models.
Each data point in the training set is assigned a weight, and the weights are adjusted after
each model is trained. Misclassified points are given higher weights to make them more
influential in the subsequent model's training.
Boosting uses a weighted sum of the predictions from individual models to make the final
prediction. The weights are determined by the performance of each model on the training
data.
- AdaBoost (Adaptive Boosting): assigns weights to each training sample based on its
classification error. It focuses more on misclassified samples in subsequent
iterations.
Each new model is trained to correct the errors of the combined predictions of the previous models.
- Gradient Boosting: optimizes the model by minimizing a loss function using gradient descent. Each new tree is trained to predict the residuals (the differences between the actual and predicted values) of the ensemble.
The final prediction is the sum of the predictions from all trees in the ensemble.
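A brief sketch of both boosting variants with scikit-learn (dataset and hyperparameters chosen arbitrarily):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# AdaBoost: reweights misclassified samples after each weak learner
ada = AdaBoostClassifier(n_estimators=50).fit(X, y)

# Gradient boosting: each new tree fits the residual errors of the ensemble
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1).fit(X, y)

print(ada.score(X, y), gbm.score(X, y))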
Stacking
Stacking is an ensemble learning technique that involves training multiple diverse models
and combining their predictions using another model called a meta-model or blender. The
meta-model learns how to best combine the predictions of the individual models.
The effectiveness of stacking relies on the diversity of the base models. Models should
capture different aspects of the data and make different types of errors.
The training set is used to train the base models, and a separate validation set is often used
to generate predictions from the base models for training the meta-model.
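A hedged sketch of stacking with scikit-learn; the base models and meta-model are arbitrary choices, and StackingClassifier generates the validation-style base predictions internally via cross-validation:

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
base_models = [("knn", KNeighborsClassifier()),
               ("tree", DecisionTreeClassifier())]

# The meta-model (blender) learns to combine the base models' predictions
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5).fit(X, y)
print(stack.score(X, y))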
Model Characteristics
The performance of an ensemble model is influenced by various characteristics:
- Dependency: the degree to which the individual models in the ensemble are
correlated or dependent on each other.
Models can be sequential, meaning each model builds on the predictions of the
previous model, or parallel, where models are trained all at the same time.
- Fusion Method: the techniques used to combine the predictions of individual models
in the ensemble.
We can describe the ensemble algorithms we've covered using these characteristics:
- Voting: parallel; fusion by (weighted) majority vote or averaged probabilities
- Bagging: parallel; fusion by majority vote or averaging
- Boosting: sequential; fusion by a weighted sum of predictions
- Stacking: parallel base models; fusion by a trained meta-model
Lecture 9
K-Means Clustering
K-means is an unsupervised clustering algorithm that partitions the data into k clusters by repeatedly assigning each data point to the nearest cluster mean (centroid) and recomputing each centroid as the mean of its assigned points.
The goal is to minimize the within-cluster sum of squares, meaning that the sum of the squared distances between each data point and the mean of its assigned cluster is minimized.
The algorithm does not guarantee convergence to the global optimum; the result may depend on the initial clusters. It is common to run it multiple times with different starting conditions and choose the run with the lowest cost.
The choice of 𝑘 also affects the performance of the algorithm. When the algorithm is used
for exploratory analysis the value of 𝑘 can be optimized with methods such as elbow
method.
However, when using k-means for some other downstream process 𝑘 is set according to
the number of clusters needed by the process.
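A minimal numpy sketch of the k-means loop described above; k, the toy data, and the iteration count are illustrative choices (a production version would handle empty clusters and rerun with several initializations):

import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

k = 2
centroids = X[rng.choice(len(X), k, replace=False)]  # random initial clusters

for _ in range(10):
    # Assignment step: each point joins its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

cost = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
print(centroids, cost)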
Lecture 10
Dimensionality Reduction
Dimensionality reduction is a technique used to reduce the number of input variables
(features) in a dataset while preserving the essential information present in the data. The
goal is to simplify the data and remove redundant or irrelevant features.
The covariance values for a set of features is represented using a covariance matrix.
Ex. Covariance matrix with 3 features A, B, C:

\Sigma = \begin{bmatrix} \operatorname{var}(A) & \operatorname{cov}(A,B) & \operatorname{cov}(A,C) \\ \operatorname{cov}(B,A) & \operatorname{var}(B) & \operatorname{cov}(B,C) \\ \operatorname{cov}(C,A) & \operatorname{cov}(C,B) & \operatorname{var}(C) \end{bmatrix}

An eigenvector of a square matrix A is a nonzero vector X that satisfies:

A X = \lambda X

Here X is the eigenvector and \lambda is the eigenvalue.
Ex.
A = \begin{bmatrix} 0 & 5 & -10 \\ 0 & 22 & 16 \\ 0 & -9 & -2 \end{bmatrix}

If we compute the product AX for the following:

X = \begin{bmatrix} -5 \\ -4 \\ 3 \end{bmatrix}

AX = \begin{bmatrix} 0 & 5 & -10 \\ 0 & 22 & 16 \\ 0 & -9 & -2 \end{bmatrix} \begin{bmatrix} -5 \\ -4 \\ 3 \end{bmatrix} = \begin{bmatrix} -50 \\ -40 \\ 30 \end{bmatrix} = 10 \begin{bmatrix} -5 \\ -4 \\ 3 \end{bmatrix}
The product AX resulted in a vector equal to 10 times the vector X; in other words, AX = 10X, so X is an eigenvector of A with eigenvalue \lambda = 10.
The next step is to compute the eigenvalue decomposition of the covariance matrix.
Eigenvalue decomposition is a factorization of a square matrix 𝐴 into 3 matrices:
A = V \Lambda V^{-1}

Where V is the matrix of all the eigenvectors of A, \Lambda is a diagonal matrix whose diagonal elements are the eigenvalues of A, and V^{-1} is the inverse of V.
The eigenvalues are then sorted in descending order. We choose the top 𝑘 eigenvalues,
where 𝑘 is our desired degree of dimensionality. The eigenvectors corresponding to the
highest eigenvalues form a matrix 𝑊 and are the principal components.
Once the principal components have been selected, the original data is projected onto the new subspace using a matrix multiplication:

Z = XW

Where X is the (mean-centred) data matrix and Z is the reduced-dimensional representation of the data.
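A minimal numpy sketch of this PCA procedure; the data is made up and k = 2 is an illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 5))          # 100 samples, 5 features
X = X - X.mean(axis=0)                  # centre the data

cov = np.cov(X, rowvar=False)           # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh is suited to symmetric matrices

order = np.argsort(eigvals)[::-1]       # sort eigenvalues in descending order
k = 2
W = eigvecs[:, order[:k]]               # top-k eigenvectors = principal components

Z = X @ W                               # reduced-dimensional representation
print(Z.shape)                          # (100, 2)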