
Unit 4

Unsupervised Learning

• Unsupervised learning is a machine learning approach in which unlabelled and unclassified data is analysed to discover hidden knowledge.
• The algorithms work on the data without any prior training, but they are constructed in such a way that they can identify patterns, groupings, sorting orders, and other interesting knowledge from the data set.
• In unsupervised learning the objective is to observe only the features X1, X2, …, Xn.
• We are not trying to predict any outcome variable; rather, our intention is to find the associations between the features, or their groupings, in order to understand the nature of the data.
• This analysis may reveal an interesting correlation between the features or a common behaviour within a subgroup of the data, which provides a better understanding of the data.
Types of Unsupervised Learning
Clustering and Association Analysis
• Clustering is a broad class of methods for discovering unknown subgroups in data and is the most important concept in unsupervised learning.
• Another technique is association analysis, which identifies association rules that explain how features occur together; related dimensionality-reduction methods identify a low-dimensional representation of the observations that explains most of the variance.
Applications of unsupervised learning

• Segmentation of target consumer populations by an advertisement consulting agency on the basis of a few dimensions such as demography, financial data, purchasing habits, etc., so that advertisers can reach their target consumers efficiently
• Anomaly or fraud detection in the banking sector by identifying the patterns of loan defaulters
• Image processing and image segmentation, such as face recognition, expression identification, etc.
• Grouping of important characteristics in genes to identify important influencers in new areas of genetics
• Utilization by data scientists to reduce the dimensionality of sample data to simplify modelling
• Document clustering and identifying potential labelling options
K Means Algorithm

• This is one of the oldest and most popularly used algorithms for clustering.
• The basic principles used by this algorithm also serve as the basis for other, more sophisticated and complex algorithms.
• The principle of the k-means algorithm is to assign each of the ‘n’ data points to one of the K clusters, where ‘K’ is a user-defined parameter specifying the number of clusters desired.
• The objective is to maximize the homogeneity within the clusters and also to maximize the differences between the clusters.
• The homogeneity and differences are measured in terms of the distance between the objects or points in the data set. (A minimal code sketch follows.)
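A minimal sketch of the k-means principle using scikit-learn. The data is synthetic and the choice K = 3 is an assumption made purely for illustration.

```python
# Minimal k-means sketch (illustrative data, K is user-defined).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three loose 2-D blobs stand in for an unlabelled data set.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

K = 3  # number of clusters chosen by the user
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])       # cluster assignment of each point
print(kmeans.cluster_centers_)   # final centroids
print(kmeans.inertia_)           # within-cluster sum of squared distances (SSE)
```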
K Means Clustering

• One of the most important success factors in arriving at a correct clustering is to start with the correct assumption about the number of clusters. Different numbers of starting clusters lead to completely different splits of the data. It always helps if we have some prior knowledge about the number of clusters and start the k-means algorithm with that prior knowledge.
• For example, if we are clustering the data of the students of a university, it is always better to start with the number of departments in that university. Sometimes, business needs or resource limitations drive the number of required clusters.
• For example, if a movie maker wants to cluster movies on the basis of a combination of two parameters, budget of the movie (high or low) and casting of the movie (star or non-star), then there are 4 possible combinations, and thus there can be four clusters to split the data.
K Means Clustering

• For a small data set, a rule of thumb that is sometimes followed is K ≈ √(n/2), i.e. K is set to the square root of n/2 for a data set of n examples. Unfortunately, this thumb rule does not work well for large data sets.
K Means Clustering – Elbow Method

• This method measures the homogeneity or heterogeneity within the clusters for various values of ‘K’ and helps in arriving at the optimal ‘K’.
• The homogeneity will increase (or heterogeneity will decrease) with increasing ‘K’, as the number of data points inside each cluster reduces with this increase.
• But these iterations take significant computation effort, and after a certain point the increase in homogeneity no longer justifies the effort required to achieve it, as is evident from the figure.
• This point is known as the elbow point, and the ‘K’ value at this point produces the optimal clustering performance. There are a large number of algorithms to calculate the homogeneity and heterogeneity of the clusters. (A code sketch of the elbow method follows.)
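A sketch of the elbow method: compute the within-cluster SSE (scikit-learn's `inertia_`) for a range of K values and look for the bend in the curve. The data and the range 1–10 are assumptions for illustration.

```python
# Elbow method sketch: plot SSE (inertia) against K and look for the bend.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(60, 2))
               for c in [(0, 0), (6, 0), (3, 5)]])

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("SSE (within-cluster sum of squares)")
plt.title("Elbow method")
plt.show()
```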
K Means Clustering – Elbow Method (figure: SSE/homogeneity plotted against ‘K’, showing the elbow point)
K Means Clustering

• In the k-means algorithm, the iterative step is to recalculate the centroids of the clusters after each iteration.
• The proximities of the data points to each other within a cluster are measured in order to minimize the distances. The distance of each data point from its nearest centroid can also be calculated and minimized to arrive at the refined centroid. The Euclidean distance between two data points x = (x1, …, xn) and y = (y1, …, yn) is measured as:
  dist(x, y) = √[(x1 − y1)² + (x2 − y2)² + … + (xn − yn)²]
K Means Clustering

• The measure of quality of clustering uses the SSE (sum of squared errors) technique. The formula used is:
  SSE = Σ_{i=1}^{K} Σ_{x ∈ Ci} [dist(ci, x)]²
• where dist() calculates the Euclidean distance between the centroid ci of cluster Ci and the data points x in that cluster.
• The summation of such squared distances over all the ‘K’ clusters gives the total sum of squared error.
• The lower the SSE for a clustering solution, the better the centroids represent the positions of their clusters. (A short sketch computing SSE directly from the formula follows.)
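A minimal sketch that evaluates the SSE formula directly, summing the squared Euclidean distances between each point and its cluster centroid; the synthetic data and K = 4 are assumptions. The result should match scikit-learn's `inertia_`.

```python
# Compute SSE from the formula: sum over clusters of squared Euclidean
# distances between each point x and its cluster centroid c_i.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
centroids, labels = km.cluster_centers_, km.labels_

sse = sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centroids))
print(sse, km.inertia_)  # the two values agree (up to floating-point error)
```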
Agglomerative Hierarchical Clustering

• The agglomerative hierarchical clustering method uses the bottom-up strategy.
• It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters.
• It terminates either when a certain clustering condition imposed by the user is achieved or when all the clusters merge into a single cluster.
• One of the core measures of proximity between clusters is the distance between them. There are four standard methods to measure the distance between clusters. Let Ci and Cj be two clusters with ni and nj objects respectively, let pi and pj represent points in clusters Ci and Cj respectively, and let mi denote the mean of cluster Ci:
  Minimum distance: Dmin(Ci, Cj) = min |pi − pj| over all pi ∈ Ci, pj ∈ Cj
  Maximum distance: Dmax(Ci, Cj) = max |pi − pj| over all pi ∈ Ci, pj ∈ Cj
  Mean distance: Dmean(Ci, Cj) = |mi − mj|
  Average distance: Davg(Ci, Cj) = (1 / (ni·nj)) Σ_{pi ∈ Ci} Σ_{pj ∈ Cj} |pi − pj|
• Often the distance measure is used to decide when to terminate the clustering algorithm.
• For example, in agglomerative clustering, the merging iterations may be stopped once the MIN distance between two neighbouring clusters becomes less than the user-defined threshold.
• When an algorithm uses the minimum distance Dmin to measure the distance between the clusters, it is referred to as a nearest-neighbour clustering algorithm, and if the decision to stop the algorithm is based on a user-defined limit on Dmin, then it is called a single linkage algorithm.
• When an algorithm uses the maximum distance Dmax to measure the distance between the clusters, it is referred to as a furthest-neighbour clustering algorithm.
• If the decision to stop the algorithm is based on a user-defined limit on Dmax, then it is called a complete linkage algorithm.
• As the minimum and maximum measures represent two extreme ways of measuring the distance between clusters, they are sensitive to outliers and noisy data.
• Instead, using the mean or average distance helps in avoiding such problems and provides more consistent results. (A code sketch comparing these linkage options follows.)
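A sketch comparing linkage criteria with scikit-learn's AgglomerativeClustering; the two-blob data, the cut into two clusters, and the distance threshold of 3.0 are assumptions for illustration.

```python
# Agglomerative clustering with different linkage (distance) criteria.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2))
               for c in [(0, 0), (4, 4)]])

# Nearest-neighbour (single), furthest-neighbour (complete) and average
# linkage, each cut into two clusters.
for linkage in ("single", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit(X)
    print(linkage, np.bincount(model.labels_))

# Alternatively, stop merging once the linkage distance exceeds a
# user-defined threshold instead of fixing the number of clusters.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=3.0,
                                linkage="single").fit(X)
print("clusters found:", model.n_clusters_)
```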
Density-Based Methods – DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

• In the case of other cluster shapes, such as S-shaped or unevenly shaped clusters, the partitioning and hierarchical clustering methods do not provide accurate results.
• The density-based clustering approach provides a solution to identify clusters of arbitrary shapes.
• The principle is based on identifying dense areas and sparse areas within the data set and then running the clustering algorithm.
• DBSCAN is one of the popular density-based algorithms; it creates clusters by using connected regions of high density.
• The DBSCAN algorithm requires only two parameters: epsilon and minPoints.
• Epsilon is the radius of the circle to be created around each data point to check its density.
• minPoints is the minimum number of data points required inside that circle for the data point to be classified as a Core point.
• The DBSCAN algorithm creates a circle of epsilon radius around every data point and classifies each point as a Core point, a Border point, or Noise.
• A data point is a Core point if the circle around it contains at least ‘minPoints’ points.
• If the circle contains fewer than minPoints points but the point lies within the epsilon radius of a Core point, it is classified as a Border point; a point that is neither a Core point nor a Border point is treated as Noise. (A short DBSCAN sketch follows.)
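A DBSCAN sketch using scikit-learn: `eps` plays the role of epsilon and `min_samples` the role of minPoints. The half-moon data set and the parameter values 0.2 and 5 are assumptions chosen for illustration.

```python
# DBSCAN sketch on an arbitrarily (non-spherically) shaped data set.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape where k-means tends to fail
# but density-based clustering works well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                      # cluster id per point, -1 means Noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_core = len(db.core_sample_indices_)    # points classified as Core points
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_core, n_noise)
```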
Linear Discriminant Analysis
• Linear discriminant analysis (LDA) is commonly used in feature extraction.
• The objective of LDA is similar to that of other dimensionality-reduction techniques in the sense that it intends to transform a data set into a lower-dimensional feature space.
• However, LDA focuses on class separability, i.e. it separates the features based on class separability so as to avoid over-fitting of the machine learning model.
• LDA calculates eigenvalues and eigenvectors from the within-class (intra-class) and between-class (inter-class) scatter matrices.
Linear Discriminant Analysis
Steps to be followed:
1. Calculate the mean vectors for the individual classes.
2. Calculate the intra-class and inter-class scatter matrices:
   S_W = Σ_i Σ_{x ∈ class i} (x − m_i)(x − m_i)ᵀ
   S_B = Σ_i N_i (m_i − m)(m_i − m)ᵀ
   where m_i is the sample mean for each class, m is the overall mean of the data set, and N_i is the sample size of each class.
3. Calculate the eigenvalues and eigenvectors of S_W⁻¹ S_B.
4. Identify the top ‘k’ eigenvectors having the top ‘k’ eigenvalues; these form the transformation to the lower-dimensional space. (A numpy sketch of these steps follows.)
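A numpy sketch of the listed steps on the Iris data set: class means, scatter matrices, then the eigenpairs of S_W⁻¹ S_B. The choice of k = 2 and the use of Iris are assumptions for illustration.

```python
# LDA steps: class means, within-class (S_W) and between-class (S_B)
# scatter matrices, then eigenpairs of inv(S_W) @ S_B.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
overall_mean = X.mean(axis=0)

d = X.shape[1]
S_W = np.zeros((d, d))   # intra-class scatter
S_B = np.zeros((d, d))   # inter-class scatter
for c in classes:
    Xc = X[y == c]
    mc = Xc.mean(axis=0)                       # step 1: class mean vector
    S_W += (Xc - mc).T @ (Xc - mc)             # step 2: within-class scatter
    diff = (mc - overall_mean).reshape(-1, 1)
    S_B += len(Xc) * (diff @ diff.T)           # step 2: between-class scatter

# step 3: eigenvalues/eigenvectors of inv(S_W) @ S_B
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)

# step 4: keep the top-k eigenvectors (k = 2 here) and project the data
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:2]].real
X_lda = X @ W
print(X_lda.shape)   # (150, 2)
```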
Principal Component Analysis
• In general, any machine learning algorithm performs better as the number of related attributes or features is reduced.
• In other words, a key to the success of machine learning lies in the fact that the features are few in number and have very little similarity with each other.
• This is the main guiding philosophy of the principal component analysis (PCA) technique of feature extraction.
• In PCA, a new set of features is extracted from the original features; these new features are quite dissimilar to each other.
• So an n-dimensional feature space gets transformed into an m-dimensional feature space, where the dimensions are orthogonal to each other, i.e. completely independent of each other.
Principal Component Analysis
• A vector is a quantity having both magnitude and direction, and hence it can determine the position of a point relative to another point in Euclidean space (i.e. a two-, three- or ‘n’-dimensional space).
• A vector space is a set of vectors. Vector spaces have the property that their vectors can be represented as linear combinations of a smaller set of vectors, called basis vectors.
• So, any vector ‘v’ in a vector space can be represented as
  v = a1·u1 + a2·u2 + … + an·un
  where a1, …, an represent ‘n’ scalars and u1, …, un represent the basis vectors. Basis vectors are orthogonal to each other. (A tiny numeric illustration follows.)
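A tiny numeric illustration of this representation: with an orthonormal basis, the scalars a_i are obtained by projecting v onto each basis vector. The specific vectors are made up for illustration.

```python
# A vector expressed as a linear combination of orthonormal basis vectors.
import numpy as np

u1 = np.array([1.0, 0.0])
u2 = np.array([0.0, 1.0])          # u1 and u2 are orthogonal unit vectors
v = np.array([3.0, -2.0])

a1, a2 = v @ u1, v @ u2            # scalar coefficients (projections of v)
print(np.allclose(v, a1 * u1 + a2 * u2))   # True: v = a1*u1 + a2*u2
```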
Principal Component Analysis

The objective of PCA is to make the transformation in such a way that
1. The new features are distinct, i.e. the covariance between the new features (the principal components) is 0.
2. The principal components are generated in order of the variability in the data that they capture. Hence, the first principal component should capture the maximum variability, the second principal component should capture the next highest variability, and so on.
3. The sum of the variances of the new features (the principal components) should be equal to the sum of the variances of the original features.
Principal Component Analysis
PCA works based on a process called eigenvalue decomposition of the covariance matrix of a data set.
Below are the steps to be followed:
1. First, calculate the covariance matrix of the data set.
2. Then, calculate the eigenvalues and eigenvectors of the covariance matrix.
3. The eigenvector having the highest eigenvalue represents the direction in which there is the highest variance. This helps in identifying the first principal component.
4. The eigenvector having the next highest eigenvalue represents the direction in which the data has the highest remaining variance and is also orthogonal to the first direction. This helps in identifying the second principal component.
5. In this way, identify the top ‘k’ eigenvectors having the top ‘k’ eigenvalues so as to get the ‘k’ principal components. (A small numpy sketch of these steps follows.)
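A small numpy sketch of these steps: compute the covariance matrix of the centered data, eigendecompose it, sort by eigenvalue, and keep the top-k eigenvectors. The random data and k = 2 are assumptions for illustration.

```python
# PCA from scratch via eigenvalue decomposition of the covariance matrix.
import numpy as np

def pca(X, k):
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)        # step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # step 2: eigenvalues/vectors
    order = np.argsort(eigvals)[::-1]             # steps 3-5: sort by variance
    components = eigvecs[:, order[:k]]            # top-k eigenvectors
    return X_centered @ components, eigvals[order[:k]]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
scores, variances = pca(X, k=2)
print(scores.shape, variances)
```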
Principal Component Analysis - Example

Our data set consists of 10 points in a 2-dimensional space:
(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)

Step 1: Center the data (subtract the mean)
First, we calculate the mean of each feature (dimension) and then subtract the mean from each data point to center the data.
mean = (1.81, 1.91)
Now, subtract the mean from each point, e.g. (2.5 − 1.81, 2.4 − 1.91) = (0.69, 0.49).

Step 2: Calculate the covariance matrix
The covariance matrix captures how the features vary together. For a 2-D data set, the covariance matrix is a 2×2 matrix:
Cov(X) = [ Var(x)     Cov(x, y)
           Cov(x, y)  Var(y)   ]
Var(x) ≈ 0.616555556
Var(y) ≈ 0.716555556
Cov(x, y) ≈ 0.615444444
Cov(X) = [ 0.616555556  0.615444444
           0.615444444  0.716555556 ]

Step 3: Calculate the eigenvalues and eigenvectors
λ1 = 1.28402771
λ2 = 0.04908339
The corresponding eigenvectors are:
Eigenvector for λ1: [0.6778734, 0.73517866]
Eigenvector for λ2: [−0.73517866, 0.6778734]

Step 4: Choose the principal component
The first principal component is the eigenvector with the largest eigenvalue: [0.6778734, 0.73517866]. (A numpy check of these numbers follows.)
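A short numpy check of the worked example; the printed values should reproduce the mean, covariance matrix, eigenvalues and eigenvectors shown above (eigenvalues come out in ascending order, and eigenvector signs may be flipped).

```python
# Reproducing the worked PCA example with numpy.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

mean = X.mean(axis=0)                    # [1.81, 1.91]
X_centered = X - mean
cov = np.cov(X_centered, rowvar=False)   # [[0.6166, 0.6154], [0.6154, 0.7166]]

eigvals, eigvecs = np.linalg.eigh(cov)
print(mean)
print(cov)
print(eigvals)   # approx [0.0491, 1.2840]
print(eigvecs)   # columns are the eigenvectors
```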
Factor Analysis
• A factor is a set of observed variables that have similar responses to an action.
• Since the variables in a given data set can be too many to deal with, factor analysis condenses these variables into fewer factors that are actionable and substantial to work upon.
• Factor analysis works by narrowing down the number of variables in a given data set, allowing deeper insights and better visibility of patterns for data research.
Confirmatory Factor Analysis
• Confirmatory Factor Analysis (CFA) lets one determine whether a relationship between factors, or a set of observed variables, and their underlying components exists.
• It helps one confirm whether there is a connection between two components of variables in a given data set.
• Usually, the purpose of CFA is to test whether certain data fit the requirements of a particular hypothesis.
Exploratory Factor Analysis
• Exploratory Factor Analysis (EFA) is used to determine or explore the underlying latent structure of a large set of variables. EFA, unlike CFA, tends to uncover the relationships, if any, between the measured variables of an entity (for example, height, weight, etc. in a human figure).
• While CFA works on confirming a relationship between a set of observed variables and their underlying structure, EFA works to uncover the relationships between the various variables within a given data set.
Exploratory Factor Analysis - Example

A school wants to understand the factors that influence student performance but doesn’t know what these factors are. The school collects survey responses from students on the following items:
1. Hours of study per week
2. Participation in extracurricular activities
3. Teacher support
4. Parent involvement
5. Access to learning resources
6. Class attendance
7. Sleep hours per night
• The school wants to explore how these observed variables cluster into
underlying latent factors without any predefined assumptions.
• The school performs EFA, which reveals that the variables group into
two latent factors:
• Factor 1: Academic Effort (includes "Hours of study," "Class attendance,"
and "Sleep hours per night").
• Factor 2: Environmental Support (includes "Teacher support," "Parent
involvement," and "Access to learning resources").
• EFA shows which variables load onto each factor, providing initial
insight into the structure of the data.
• EFA identifies Academic Effort and Environmental Support as potential factors from the survey data.
• CFA then tests whether these two factors and their associated variables fit a hypothesized model, confirming their validity. (An illustrative code sketch of EFA follows.)
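An illustrative EFA sketch using scikit-learn's FactorAnalysis. The survey responses are simulated, the variable names and loadings are made up to mirror the example above, and the choice of two factors with varimax rotation is an assumption.

```python
# EFA sketch: fit a two-factor model to simulated survey data and inspect loadings.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500
effort = rng.normal(size=n)        # latent "Academic Effort"
support = rng.normal(size=n)       # latent "Environmental Support"

# Observed variables are noisy mixtures of the two latent factors (simulated).
data = pd.DataFrame({
    "study_hours":     0.9 * effort  + rng.normal(scale=0.4, size=n),
    "attendance":      0.8 * effort  + rng.normal(scale=0.4, size=n),
    "sleep_hours":     0.6 * effort  + rng.normal(scale=0.6, size=n),
    "teacher_support": 0.9 * support + rng.normal(scale=0.4, size=n),
    "parent_involve":  0.8 * support + rng.normal(scale=0.4, size=n),
    "resources":       0.7 * support + rng.normal(scale=0.5, size=n),
})

fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0).fit(data)
loadings = pd.DataFrame(fa.components_.T, index=data.columns,
                        columns=["Factor 1", "Factor 2"])
# With varimax rotation, each variable should load mainly on one factor.
print(loadings.round(2))
```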
