
Unit 4

Unsupervised Learning

• Unsupervised learning is a machine learning approach in which unlabelled and unclassified data is analysed to discover hidden knowledge.
• The algorithms work on the data without any prior training, but they are constructed in such a way that they can identify patterns, groupings, sorting orders, and other interesting knowledge from the data set.
• In unsupervised learning the objective is to observe only the features X1, X2, …, Xn.
• We are not trying to predict any outcome variable; rather, our intention is to find the associations between the features, or their groupings, in order to understand the nature of the data.
• This analysis may reveal an interesting correlation between the features or a common behaviour within a subgroup of the data, which provides a better understanding of the data.
Types of Unsupervised Learning
Clustering and Association Analysis
• Clustering is a broad class of methods for discovering unknown subgroups in data and is the most important concept in unsupervised learning.
• Another technique is association analysis, which identifies association rules that explain how features occur together; related dimensionality-reduction methods identify a low-dimensional representation of the observations that explains most of the variance.
Applications of unsupervised learning

• Segmentation of target consumer populations by an advertisement consulting agency on the basis of a few dimensions such as demography, financial data, purchasing habits, etc., so that advertisers can reach their target consumers efficiently
• Anomaly or fraud detection in the banking sector by identifying the patterns of loan defaulters
• Image processing and image segmentation, such as face recognition, expression identification, etc.
• Grouping of important characteristics in genes to identify important influencers in new areas of genetics
• Utilization by data scientists to reduce the dimensionality of sample data to simplify modelling
• Document clustering and identifying potential labelling options
K Means Algorithm

• This is one of the oldest and most popularly used algorithms for clustering.
• The basic principles used by this algorithm also serve as the basis for other, more sophisticated and complex algorithms.
• The principle of the k-means algorithm is to assign each of the ‘n’ data points to one of the K clusters, where ‘K’ is a user-defined parameter specifying the number of clusters desired.
• The objective is to maximize the homogeneity within the clusters and also to maximize the differences between the clusters.
• The homogeneity and differences are measured in terms of the distance between the objects or points in the data set. (A minimal code sketch follows.)
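A minimal sketch of the k-means principle using scikit-learn. The data is synthetic and the choice K = 3 is an assumption made purely for illustration.

```python
# Minimal k-means sketch (illustrative data, K is user-defined).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three loose 2-D blobs stand in for an unlabelled data set.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

K = 3  # number of clusters chosen by the user
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])       # cluster assignment of each point
print(kmeans.cluster_centers_)   # final centroids
print(kmeans.inertia_)           # within-cluster sum of squared distances (SSE)
```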
K Means Clustering

• One of the most important success factors in arriving at a correct clustering is to start with the correct assumption about the number of clusters. Different numbers of starting clusters lead to completely different splits of the data. It always helps if we have some prior knowledge about the number of clusters and start the k-means algorithm with that prior knowledge.
• For example, if we are clustering the data of the students of a university, it is always better to start with the number of departments in that university. Sometimes, business needs or resource limitations drive the number of required clusters.
• For example, if a movie maker wants to cluster movies on the basis of a combination of two parameters, budget of the movie (high or low) and casting of the movie (star or non-star), then there are 4 possible combinations, and thus there can be four clusters to split the data.
K Means Clustering

• For a small data set, a rule of thumb that is sometimes followed is K ≈ √(n/2), i.e. K is set to the square root of n/2 for a data set of n examples. Unfortunately, this thumb rule does not work well for large data sets.
K Means Clustering – Elbow Method

• This method measures the homogeneity or heterogeneity within the clusters for various values of ‘K’ and helps in arriving at the optimal ‘K’.
• The homogeneity will increase (or heterogeneity will decrease) with increasing ‘K’, as the number of data points inside each cluster reduces with this increase.
• But these iterations take significant computation effort, and after a certain point the increase in homogeneity no longer justifies the effort required to achieve it, as is evident from the figure.
• This point is known as the elbow point, and the ‘K’ value at this point produces the optimal clustering performance. There are a large number of algorithms to calculate the homogeneity and heterogeneity of the clusters. (A code sketch of the elbow method follows.)
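A sketch of the elbow method: compute the within-cluster SSE (scikit-learn's `inertia_`) for a range of K values and look for the bend in the curve. The data and the range 1–10 are assumptions for illustration.

```python
# Elbow method sketch: plot SSE (inertia) against K and look for the bend.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(60, 2))
               for c in [(0, 0), (6, 0), (3, 5)]])

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("SSE (within-cluster sum of squares)")
plt.title("Elbow method")
plt.show()
```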
K Means Clustering – Elbow Method (figure: SSE/homogeneity plotted against ‘K’, showing the elbow point)
K Means Clustering

• In the k-means algorithm, the iterative step is to recalculate the centroids of the clusters after each iteration.
• The proximities of the data points to each other within a cluster are measured in order to minimize the distances. The distance of each data point from its nearest centroid can also be calculated and minimized to arrive at the refined centroid. The Euclidean distance between two data points x = (x1, …, xn) and y = (y1, …, yn) is measured as:
  dist(x, y) = √[(x1 − y1)² + (x2 − y2)² + … + (xn − yn)²]
K Means Clustering

• The measure of quality of clustering uses the SSE (sum of squared errors) technique. The formula used is:
  SSE = Σ_{i=1}^{K} Σ_{x ∈ Ci} [dist(ci, x)]²
• where dist() calculates the Euclidean distance between the centroid ci of cluster Ci and the data points x in that cluster.
• The summation of such squared distances over all the ‘K’ clusters gives the total sum of squared error.
• The lower the SSE for a clustering solution, the better the centroids represent the positions of their clusters. (A short sketch computing SSE directly from the formula follows.)
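A minimal sketch that evaluates the SSE formula directly, summing the squared Euclidean distances between each point and its cluster centroid; the synthetic data and K = 4 are assumptions. The result should match scikit-learn's `inertia_`.

```python
# Compute SSE from the formula: sum over clusters of squared Euclidean
# distances between each point x and its cluster centroid c_i.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
centroids, labels = km.cluster_centers_, km.labels_

sse = sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centroids))
print(sse, km.inertia_)  # the two values agree (up to floating-point error)
```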
Agglomerative Hierarchical Clustering

• The agglomerative hierarchical clustering method uses the bottom-up strategy.
• It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters.
• It terminates either when a certain clustering condition imposed by the user is achieved or when all the clusters merge into a single cluster.
• One of the core measures of proximity between clusters is the distance between them. There are four standard methods to measure the distance between clusters. Let Ci and Cj be two clusters with ni and nj objects respectively, let pi and pj represent points in clusters Ci and Cj respectively, and let mi denote the mean of cluster Ci:
  Minimum distance: Dmin(Ci, Cj) = min |pi − pj| over all pi ∈ Ci, pj ∈ Cj
  Maximum distance: Dmax(Ci, Cj) = max |pi − pj| over all pi ∈ Ci, pj ∈ Cj
  Mean distance: Dmean(Ci, Cj) = |mi − mj|
  Average distance: Davg(Ci, Cj) = (1 / (ni·nj)) Σ_{pi ∈ Ci} Σ_{pj ∈ Cj} |pi − pj|
• Often the distance measure is used to decide when to terminate the clustering algorithm.
• For example, in agglomerative clustering, the merging iterations may be stopped once the MIN distance between two neighbouring clusters becomes less than the user-defined threshold.
• When an algorithm uses the minimum distance Dmin to measure the distance between the clusters, it is referred to as a nearest-neighbour clustering algorithm, and if the decision to stop the algorithm is based on a user-defined limit on Dmin, then it is called a single linkage algorithm.
• When an algorithm uses the maximum distance Dmax to measure the distance between the clusters, it is referred to as a furthest-neighbour clustering algorithm.
• If the decision to stop the algorithm is based on a user-defined limit on Dmax, then it is called a complete linkage algorithm.
• As the minimum and maximum measures represent two extreme ways of measuring the distance between clusters, they are sensitive to outliers and noisy data.
• Instead, using the mean or average distance helps in avoiding such problems and provides more consistent results. (A code sketch comparing these linkage options follows.)
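A sketch comparing linkage criteria with scikit-learn's AgglomerativeClustering; the two-blob data, the cut into two clusters, and the distance threshold of 3.0 are assumptions for illustration.

```python
# Agglomerative clustering with different linkage (distance) criteria.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2))
               for c in [(0, 0), (4, 4)]])

# Nearest-neighbour (single), furthest-neighbour (complete) and average
# linkage, each cut into two clusters.
for linkage in ("single", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit(X)
    print(linkage, np.bincount(model.labels_))

# Alternatively, stop merging once the linkage distance exceeds a
# user-defined threshold instead of fixing the number of clusters.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=3.0,
                                linkage="single").fit(X)
print("clusters found:", model.n_clusters_)
```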
Density-Based Methods – DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

• In the case of other cluster shapes, such as S-shaped or unevenly shaped clusters, the partitioning and hierarchical clustering methods do not provide accurate results.
• The density-based clustering approach provides a solution to identify clusters of arbitrary shapes.
• The principle is based on identifying dense areas and sparse areas within the data set and then running the clustering algorithm.
• DBSCAN is one of the popular density-based algorithms; it creates clusters by using connected regions of high density.
• The DBSCAN algorithm requires only two parameters: epsilon and minPoints.
• Epsilon is the radius of the circle to be created around each data point to check its density.
• minPoints is the minimum number of data points required inside that circle for the data point to be classified as a Core point.
• The DBSCAN algorithm creates a circle of epsilon radius around every data point and classifies each point as a Core point, a Border point, or Noise.
• A data point is a Core point if the circle around it contains at least ‘minPoints’ points.
• If the circle contains fewer than minPoints points but the point lies within the epsilon radius of a Core point, it is classified as a Border point; a point that is neither a Core point nor a Border point is treated as Noise. (A short DBSCAN sketch follows.)
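A DBSCAN sketch using scikit-learn: `eps` plays the role of epsilon and `min_samples` the role of minPoints. The half-moon data set and the parameter values 0.2 and 5 are assumptions chosen for illustration.

```python
# DBSCAN sketch on an arbitrarily (non-spherically) shaped data set.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape where k-means tends to fail
# but density-based clustering works well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                      # cluster id per point, -1 means Noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_core = len(db.core_sample_indices_)    # points classified as Core points
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_core, n_noise)
```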
Linear Discriminant Analysis
• Linear discriminant analysis (LDA) is commonly used in feature extraction.
• The objective of LDA is similar to that of other dimensionality-reduction techniques in the sense that it intends to transform a data set into a lower-dimensional feature space.
• However, LDA focuses on class separability, i.e. it separates the features based on class separability so as to avoid over-fitting of the machine learning model.
• LDA calculates eigenvalues and eigenvectors from the within-class (intra-class) and between-class (inter-class) scatter matrices.
Linear Discriminant Analysis
Steps to be followed:
1. Calculate the mean vectors for the individual classes.
2. Calculate the intra-class and inter-class scatter matrices:
   S_W = Σ_i Σ_{x ∈ class i} (x − m_i)(x − m_i)ᵀ
   S_B = Σ_i N_i (m_i − m)(m_i − m)ᵀ
   where m_i is the sample mean for each class, m is the overall mean of the data set, and N_i is the sample size of each class.
3. Calculate the eigenvalues and eigenvectors of S_W⁻¹ S_B.
4. Identify the top ‘k’ eigenvectors having the top ‘k’ eigenvalues; these form the transformation to the lower-dimensional space. (A numpy sketch of these steps follows.)
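A numpy sketch of the listed steps on the Iris data set: class means, scatter matrices, then the eigenpairs of S_W⁻¹ S_B. The choice of k = 2 and the use of Iris are assumptions for illustration.

```python
# LDA steps: class means, within-class (S_W) and between-class (S_B)
# scatter matrices, then eigenpairs of inv(S_W) @ S_B.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
overall_mean = X.mean(axis=0)

d = X.shape[1]
S_W = np.zeros((d, d))   # intra-class scatter
S_B = np.zeros((d, d))   # inter-class scatter
for c in classes:
    Xc = X[y == c]
    mc = Xc.mean(axis=0)                       # step 1: class mean vector
    S_W += (Xc - mc).T @ (Xc - mc)             # step 2: within-class scatter
    diff = (mc - overall_mean).reshape(-1, 1)
    S_B += len(Xc) * (diff @ diff.T)           # step 2: between-class scatter

# step 3: eigenvalues/eigenvectors of inv(S_W) @ S_B
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)

# step 4: keep the top-k eigenvectors (k = 2 here) and project the data
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:2]].real
X_lda = X @ W
print(X_lda.shape)   # (150, 2)
```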
Principal Component Analysis
• In general, any machine learning algorithm performs better as the number of related attributes or features is reduced.
• In other words, a key to the success of machine learning lies in the fact that the features are few in number and have very little similarity with each other.
• This is the main guiding philosophy of the principal component analysis (PCA) technique of feature extraction.
• In PCA, a new set of features is extracted from the original features; these new features are quite dissimilar to each other.
• So an n-dimensional feature space gets transformed into an m-dimensional feature space, where the dimensions are orthogonal to each other, i.e. completely independent of each other.
Principal Component Analysis
• A vector is a quantity having both magnitude and direction, and hence it can determine the position of a point relative to another point in Euclidean space (i.e. a two-, three- or ‘n’-dimensional space).
• A vector space is a set of vectors. Vector spaces have the property that their vectors can be represented as linear combinations of a smaller set of vectors, called basis vectors.
• So, any vector ‘v’ in a vector space can be represented as
  v = a1·u1 + a2·u2 + … + an·un
  where a1, …, an represent ‘n’ scalars and u1, …, un represent the basis vectors. Basis vectors are orthogonal to each other. (A tiny numeric illustration follows.)
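A tiny numeric illustration of this representation: with an orthonormal basis, the scalars a_i are obtained by projecting v onto each basis vector. The specific vectors are made up for illustration.

```python
# A vector expressed as a linear combination of orthonormal basis vectors.
import numpy as np

u1 = np.array([1.0, 0.0])
u2 = np.array([0.0, 1.0])          # u1 and u2 are orthogonal unit vectors
v = np.array([3.0, -2.0])

a1, a2 = v @ u1, v @ u2            # scalar coefficients (projections of v)
print(np.allclose(v, a1 * u1 + a2 * u2))   # True: v = a1*u1 + a2*u2
```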
Principal Component Analysis

The objective of PCA is to make the transformation in such a way that
1. The new features are distinct, i.e. the covariance between the new features (the principal components) is 0.
2. The principal components are generated in order of the variability in the data that they capture. Hence, the first principal component should capture the maximum variability, the second principal component should capture the next highest variability, and so on.
3. The sum of the variances of the new features (the principal components) should be equal to the sum of the variances of the original features.
Principal Component Analysis
PCA works based on a process called eigenvalue decomposition of the covariance matrix of a data set.
Below are the steps to be followed:
1. First, calculate the covariance matrix of the data set.
2. Then, calculate the eigenvalues and eigenvectors of the covariance matrix.
3. The eigenvector having the highest eigenvalue represents the direction in which there is the highest variance. This helps in identifying the first principal component.
4. The eigenvector having the next highest eigenvalue represents the direction in which the data has the highest remaining variance and is also orthogonal to the first direction. This helps in identifying the second principal component.
5. In this way, identify the top ‘k’ eigenvectors having the top ‘k’ eigenvalues so as to get the ‘k’ principal components. (A small numpy sketch of these steps follows.)
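A small numpy sketch of these steps: compute the covariance matrix of the centered data, eigendecompose it, sort by eigenvalue, and keep the top-k eigenvectors. The random data and k = 2 are assumptions for illustration.

```python
# PCA from scratch via eigenvalue decomposition of the covariance matrix.
import numpy as np

def pca(X, k):
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)        # step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # step 2: eigenvalues/vectors
    order = np.argsort(eigvals)[::-1]             # steps 3-5: sort by variance
    components = eigvecs[:, order[:k]]            # top-k eigenvectors
    return X_centered @ components, eigvals[order[:k]]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
scores, variances = pca(X, k=2)
print(scores.shape, variances)
```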
Principal Component Analysis - Example

Our data set consists of 10 points in a 2-dimensional space:
(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)

Step 1: Center the data (subtract the mean)
First, we calculate the mean of each feature (dimension) and then subtract the mean from each data point to center the data.
mean = (1.81, 1.91)
Now, subtract the mean from each point, e.g. (2.5 − 1.81, 2.4 − 1.91) = (0.69, 0.49).

Step 2: Calculate the covariance matrix
The covariance matrix captures how the features vary together. For a 2-D data set, the covariance matrix is a 2×2 matrix:
Cov(X) = [ Var(x)     Cov(x, y)
           Cov(x, y)  Var(y)   ]
Var(x) ≈ 0.616555556
Var(y) ≈ 0.716555556
Cov(x, y) ≈ 0.615444444
Cov(X) = [ 0.616555556  0.615444444
           0.615444444  0.716555556 ]

Step 3: Calculate the eigenvalues and eigenvectors
λ1 = 1.28402771
λ2 = 0.04908339
The corresponding eigenvectors are:
Eigenvector for λ1: [0.6778734, 0.73517866]
Eigenvector for λ2: [−0.73517866, 0.6778734]

Step 4: Choose the principal component
The first principal component is the eigenvector with the largest eigenvalue: [0.6778734, 0.73517866]. (A numpy check of these numbers follows.)
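A short numpy check of the worked example; the printed values should reproduce the mean, covariance matrix, eigenvalues and eigenvectors shown above (eigenvalues come out in ascending order, and eigenvector signs may be flipped).

```python
# Reproducing the worked PCA example with numpy.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

mean = X.mean(axis=0)                    # [1.81, 1.91]
X_centered = X - mean
cov = np.cov(X_centered, rowvar=False)   # [[0.6166, 0.6154], [0.6154, 0.7166]]

eigvals, eigvecs = np.linalg.eigh(cov)
print(mean)
print(cov)
print(eigvals)   # approx [0.0491, 1.2840]
print(eigvecs)   # columns are the eigenvectors
```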
Factor Analysis
• A factor is a set of observed variables that have similar responses to an action.
• Since the variables in a given data set can be too many to deal with, factor analysis condenses these variables into fewer factors that are actionable and substantial to work upon.
• Factor analysis works by narrowing down the number of variables in a given data set, allowing deeper insights and better visibility of patterns for data research.
Confirmatory Factor Analysis
• Confirmatory Factor Analysis (CFA) lets one determine whether a relationship between factors, or a set of observed variables, and their underlying components exists.
• It helps one confirm whether there is a connection between two components of variables in a given data set.
• Usually, the purpose of CFA is to test whether certain data fit the requirements of a particular hypothesis.
Exploratory Factor Analysis
• Exploratory Factor Analysis (EFA) is used to determine or explore the underlying latent structure of a large set of variables. EFA, unlike CFA, tends to uncover the relationships, if any, between the measured variables of an entity (for example, height, weight, etc. in a human figure).
• While CFA works on confirming a relationship between a set of observed variables and their underlying structure, EFA works to uncover the relationships between the various variables within a given data set.
Exploratory Factor Analysis - Example

A school wants to understand the factors that influence student performance but doesn’t know what these factors are. The school collects survey responses from students on the following items:
1. Hours of study per week
2. Participation in extracurricular activities
3. Teacher support
4. Parent involvement
5. Access to learning resources
6. Class attendance
7. Sleep hours per night
• The school wants to explore how these observed variables cluster into
underlying latent factors without any predefined assumptions.
• The school performs EFA, which reveals that the variables group into
two latent factors:
• Factor 1: Academic Effort (includes "Hours of study," "Class attendance,"
and "Sleep hours per night").
• Factor 2: Environmental Support (includes "Teacher support," "Parent
involvement," and "Access to learning resources").
• EFA shows which variables load onto each factor, providing initial
insight into the structure of the data.
• EFA identifies Academic Effort and Environmental Support as potential factors from the survey data.
• CFA then tests whether these two factors and their associated variables fit a hypothesized model, confirming their validity. (An illustrative code sketch of EFA follows.)
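An illustrative EFA sketch using scikit-learn's FactorAnalysis. The survey responses are simulated, the variable names and loadings are made up to mirror the example above, and the choice of two factors with varimax rotation is an assumption.

```python
# EFA sketch: fit a two-factor model to simulated survey data and inspect loadings.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500
effort = rng.normal(size=n)        # latent "Academic Effort"
support = rng.normal(size=n)       # latent "Environmental Support"

# Observed variables are noisy mixtures of the two latent factors (simulated).
data = pd.DataFrame({
    "study_hours":     0.9 * effort  + rng.normal(scale=0.4, size=n),
    "attendance":      0.8 * effort  + rng.normal(scale=0.4, size=n),
    "sleep_hours":     0.6 * effort  + rng.normal(scale=0.6, size=n),
    "teacher_support": 0.9 * support + rng.normal(scale=0.4, size=n),
    "parent_involve":  0.8 * support + rng.normal(scale=0.4, size=n),
    "resources":       0.7 * support + rng.normal(scale=0.5, size=n),
})

fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0).fit(data)
loadings = pd.DataFrame(fa.components_.T, index=data.columns,
                        columns=["Factor 1", "Factor 2"])
# With varimax rotation, each variable should load mainly on one factor.
print(loadings.round(2))
```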
