
Principal Component Analysis using Python

Tushar B. Kute,
http://tusharkute.com
Dimensionality Reduction

• Dimensionality reduction (or dimension reduction) is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.
• It can be divided into feature selection and feature extraction, as sketched after this list.
– Feature selection approaches try to find a subset of the original variables (also called features or attributes).
– Feature projection, or feature extraction, transforms the data from the high-dimensional space to a space of fewer dimensions.
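A minimal sketch contrasting the two families in scikit-learn. The Iris data and the choice of two retained features are illustrative assumptions, not from the slides.

# Feature selection vs. feature extraction on the Iris data
# (an illustrative choice; any numeric dataset works the same way).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Feature selection: keep a subset of the original columns.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction/projection: build new features as linear
# combinations of the original ones.
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # (150, 2) (150, 2)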
Large Dimensions

• A large number of features in a dataset is one of the factors that affect both the training time and the accuracy of machine learning models. There are several options for dealing with a huge number of features:
– Train the models on the original set of features, which may take days or weeks if the number of features is very high.
– Reduce the number of variables by merging correlated variables.
– Extract the most important features, those responsible for the maximum variance in the data. Different statistical techniques are used for this purpose, e.g. linear discriminant analysis, factor analysis, and principal component analysis.
Principal Component Analysis

• Principal component analysis, or PCA, is a statistical technique for converting high-dimensional data into low-dimensional data by selecting the most important features, the ones that capture maximum information about the dataset.
• Strictly, PCA constructs new features (the principal components) as linear combinations of the original features, ranked by the variance they explain in the data.
• The component that explains the highest variance is the first principal component; the component responsible for the second-highest variance is the second principal component, and so on.
• It is important to mention that the principal components have no correlation with each other; a small numeric check follows.
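A quick check of both properties, again using the Iris data as an assumed stand-in dataset: the explained variance ratios come out sorted from largest to smallest, and the correlation matrix of the component scores is numerically the identity.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA().fit(X)
scores = pca.transform(X)

# Variance explained per component, largest first.
print(pca.explained_variance_ratio_)

# Component scores are mutually uncorrelated: off-diagonal
# correlations are ~0 up to floating-point noise.
print(np.round(np.corrcoef(scores, rowvar=False), 6))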
Advantages of PCA

• The training time of the algorithms reduces significantly with a smaller number of features.
• It is not always possible to analyze data in high dimensions. For instance, if there are 100 features in a dataset, the total number of pairwise scatter plots required to visualize the data would be 100(100−1)/2 = 4950. It is not practical to analyze data this way.
Normalization of features

• It is imperative to mention that a feature set must be normalized before applying PCA. For instance, if a feature set has data expressed in units of kilograms, light years, or millions, the variance scale in the training set is huge. If PCA is applied to such a feature set, the resultant loadings for features with high variance will also be large, so the principal components will be biased towards features with high variance, leading to misleading results.
• Finally, the last point to remember before we start coding is that PCA is a statistical technique and can only be applied to numeric data. Therefore, categorical features need to be converted into numerical features before PCA can be applied. A sketch of both preprocessing steps follows.
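A minimal sketch of both steps; the column names (mass_kg, distance_ly, category) are hypothetical, chosen to echo the units mentioned above.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "mass_kg": [65.0, 80.0, 54.0],     # kilograms
    "distance_ly": [4.2, 8.6, 11.4],   # light years
    "category": ["a", "b", "a"],       # categorical feature
})

# Categorical -> numeric (one-hot encoding).
df = pd.get_dummies(df, columns=["category"])

# Standardize: zero mean, unit variance per feature.
X = StandardScaler().fit_transform(df)
print(X.mean(axis=0).round(6))  # ~0 for every column
print(X.std(axis=0).round(6))   # ~1 for every column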
Example:
Reading the dataset
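The slide's original code is not reproduced here; the following is a minimal sketch assuming the Iris dataset from the UCI repository (listed under Useful resources), loaded with pandas.

import pandas as pd

# Iris dataset from the UCI Machine Learning Repository.
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "iris/iris.data")
names = ["sepal-length", "sepal-width", "petal-length",
         "petal-width", "Class"]
dataset = pd.read_csv(url, names=names)

X = dataset.drop("Class", axis=1)  # features
y = dataset["Class"]               # labels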
Normalize
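Continuing the sketch: standardize the features to zero mean and unit variance before applying PCA.

from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)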
Apply PCA
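Fit PCA on the standardized features. Leaving n_components unset keeps all components, so their variances can be inspected in the next step.

from sklearn.decomposition import PCA

pca = PCA()               # keep all components for now
X_pca = pca.fit_transform(X_scaled)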
Calculate variance
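The fraction of the total variance captured by each component, continuing from the fitted model above.

explained_variance = pca.explained_variance_ratio_
print(explained_variance)  # one ratio per component, largest first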
Variance plot
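A sketch of the plot, showing per-component and cumulative explained variance so a cutoff can be chosen; matplotlib is an assumption here, since the slide image is not reproduced.

import numpy as np
import matplotlib.pyplot as plt

components = range(1, len(explained_variance) + 1)
plt.bar(components, explained_variance, label="individual")
plt.step(components, np.cumsum(explained_variance),
         where="mid", label="cumulative")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.legend()
plt.show()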
Variance Ratio

• The PCA class has an explained_variance_ratio_ attribute, which gives the fraction of the total variance explained by each of the principal components.
Principal Components = 1
Principal Components = 2
Principal Components = 3
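These three slides compare results with one, two, and three retained components. A sketch of that experiment follows, continuing the running example; the random forest classifier and the 80/20 split are assumptions, since the slides do not show which model was used.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0)

for n in (1, 2, 3):
    # Fit PCA on the training data only, then project both splits.
    pca_n = PCA(n_components=n)
    X_train_n = pca_n.fit_transform(X_train)
    X_test_n = pca_n.transform(X_test)

    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_train_n, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test_n))
    print(f"n_components={n}: accuracy={acc:.3f}")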
Useful resources

• https://stackabuse.com
• http://archive.ics.uci.edu/ml/index.php
• https://scikit-learn.org
• https://en.wikipedia.org
• www.towardsdatascience.com
• www.analyticsvidhya.com
• www.kaggle.com
• www.github.com
Thank you
This presentation was created using LibreOffice Impress 5.1.6.2 and can be used freely under the GNU General Public License.

/mITuSkillologies   @mitu_group   /company/mitu-skillologies   c/MITUSkillologies

Web Resources
http://mitu.co.in
http://tusharkute.com

contact@mitu.co.in
tushar@tusharkute.com
