Lecture Notes: Clustering
Unsupervised Learning
In the previous modules, you saw various supervised machine learning algorithms. Supervised learning is a type of
machine learning algorithm that uses a known dataset to perform predictions. This dataset (referred to as the training
dataset) includes both response values and input data. From this, the supervised learning algorithm seeks to build a
model that can predict the response values for a new dataset.
If you train a machine learning model using only a set of inputs, it is called unsupervised learning; such an algorithm is
able to find the structure or relationships between the different inputs. The most important unsupervised learning
technique is clustering, which creates different groups or clusters of the given set of inputs and is also able to put any
new input in the appropriate cluster. While carrying out clustering, the basic objective is to group the input points in
such a way as to maximise the inter-cluster variance and minimise the intra-cluster variance.
Fig 1: Objective of clustering is to maximise the inter-cluster distance and minimise the intra-cluster variance
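To make this objective concrete, here is a small sketch in Python (the points and labels are made up purely for illustration) that computes the within-cluster and between-cluster sums of squares for a given grouping; a good clustering keeps the former small and the latter large.

import numpy as np

# Made-up points and labels, purely for demonstration
points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # one tight group
                   [8.0, 8.0], [8.3, 7.7], [7.9, 8.4]])  # another tight group
labels = np.array([0, 0, 0, 1, 1, 1])

overall_mean = points.mean(axis=0)
within_ss, between_ss = 0.0, 0.0
for k in np.unique(labels):
    members = points[labels == k]
    centre = members.mean(axis=0)
    within_ss += ((members - centre) ** 2).sum()                        # intra-cluster variation
    between_ss += len(members) * ((centre - overall_mean) ** 2).sum()   # inter-cluster variation

print(within_ss, between_ss)  # a good clustering keeps the first small and the second large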
The two most important methods of clustering are the K-Means algorithm & the Hierarchical clustering algorithm.
K-Means Algorithm
The K-Means algorithm is the process of dividing the N data points into K groups or clusters. The steps of the
algorithm are:
1. Start by choosing K random points as the initial cluster centres.
2. Assign each data point to its nearest cluster centre. The most common way of measuring the distance
between the points is the Euclidean distance.
3. For each cluster, compute the new cluster centre which will be the mean of all cluster members.
4. Now re-assign all the data points to the different clusters by taking into account the new cluster centres.
5. Keep iterating through steps 3 & 4 until there are no further changes possible.
At this point, you arrive at the optimal clusters.
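The following is a minimal NumPy sketch of these steps, written only to illustrate the algorithm (it is not a production implementation and makes simple choices, e.g. keeping a centre fixed if its cluster becomes empty):

import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    # A bare-bones sketch of the steps listed above
    rng = np.random.default_rng(seed)
    # Step 1: choose K random points as the initial cluster centres
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centre (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Steps 3 & 4: recompute each centre as the mean of its current members
        new_centres = np.array([points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
                                for j in range(k)])
        # Step 5: stop once the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres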
Let’s apply the K-Means algorithm on a set of 10 points, which we want to divide into 2 clusters. Thus the
value of K here is 2.
We first choose 2 random points as the initial cluster centres, and then assign each of the data points to its
nearest cluster centre based on the Euclidean distance. This way all the points are divided among the K clusters.
Fig 4: Assigning each data point to their nearest cluster centre
Now we update the position of each of the cluster centres to reflect the mean of each cluster.
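In practice, you would run the whole procedure (assignment and centre updates repeated until convergence) with scikit-learn's KMeans. The 10 points below are placeholders, since the coordinates from the figures are not reproduced here:

import numpy as np
from sklearn.cluster import KMeans

# Placeholder coordinates for the 10 points of the example
X = np.array([[1, 2], [1.5, 1.8], [2, 2.2], [1.2, 2.6], [1.8, 1.5],
              [8, 8], [8.5, 7.6], [7.8, 8.4], [8.2, 8.1], [7.5, 7.9]])

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(model.labels_)           # which of the 2 clusters each point ended up in
print(model.cluster_centers_)  # the final cluster centres (means of each cluster)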
Hierarchical Clustering
Earlier you saw the difference between the classification and the clustering problem. Then you saw the K-Means
algorithm as a way to obtain the clusters. Hierarchical clustering is another algorithm to obtain such clusters.
Given a set of N items to be clustered, the steps in the hierarchical clustering are:
1. Calculate the NxN distance (similarity) matrix, which contains the distance of each data point from
every other data point.
2. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters,
each containing just one item.
3. Find the closest (most similar) pair of clusters and merge them into a
single cluster, so that now you have one less cluster.
4. Compute distances (similarities) between the new cluster and each of the old clusters.
5. Repeat steps 3 and 4 until all items are clustered into a single cluster of size N.
Thus, what we have at the end is the dendrogram, which shows us which data points group together in
which cluster at what distance.
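Assuming the usual scipy and matplotlib imports, the whole agglomerative procedure and its dendrogram can be obtained as sketched below; the 10 points are randomly generated stand-ins for the example data:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Randomly generated stand-ins for the 10 points used in the example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(5, 1, (5, 2))])

# linkage() carries out the agglomerative procedure described in steps 1-5 above;
# each row of `mergings` records one merge: the two clusters joined, the distance, and the new size
mergings = linkage(X, method="single", metric="euclidean")

dendrogram(mergings)  # draws the dendrogram of the merges
plt.show()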
Let’s consider an example. Let’s take 10 points and try to apply the hierarchical clustering algorithm to them.
Now initially, we treat each of these points as individual clusters. Thus we begin with 10 different clusters.
In the first iteration, the two closest points, 5 & 7, are merged into a single cluster. Thus you are now left with
only 9 clusters: 8 of them contain a single element each, while 1 contains 2 elements, i.e. 5 & 7. Now again we
calculate the distance of each cluster from every other cluster. But here the problem is: how do you measure the
distance between a cluster having 2 points and a cluster having a single point? It is here that the concept of
linkage becomes important. Linkage is the measure of dissimilarity or similarity between clusters that contain
multiple observations.
Here, we calculate the distance between points 5 & 8 and then between 7 & 8, and the minimum of these 2
distances is taken as the distance between the 2 clusters. Thus, in the next iteration, we obtain 8 clusters.
Fig 5: After iteration 2, we have 8 clusters
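As a quick illustration of that computation, with hypothetical coordinates for points 5, 7 and 8 (the actual coordinates appear only in the figures):

import numpy as np

# Hypothetical coordinates for points 5, 7 and 8
p5, p7, p8 = np.array([2.0, 3.0]), np.array([2.5, 3.5]), np.array([4.0, 3.0])

d_5_8 = np.linalg.norm(p5 - p8)
d_7_8 = np.linalg.norm(p7 - p8)

# Single linkage: the distance between cluster {5, 7} and cluster {8} is the smaller of the two
single_linkage_distance = min(d_5_8, d_7_8)
print(single_linkage_distance)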
The result of the hierarchical clustering algorithm is shown by a dendrogram, which starts with all the data
points as separate clusters and indicates at what level of dissimilarity any two clusters were joined.
The y-axis of the dendrogram is some measure of the dissimilarity or distance at which clusters join.
Fig 7: A sample dendrogram
In the dendrogram shown above, samples 4 and 5 are the most similar and join to form the first cluster,
followed by samples 1 and 10. The last two clusters to fuse together to form the final single cluster are 3-6
and 4-5-2-7-1-10-9-8.
Determining the number of groups in a cluster analysis is often the primary goal. Typically, one looks for
natural groupings defined by long stems. Here, by observation, you can identify that there are 3 major
groupings: 3-6, 4-5-2-7 and 1-10-9-8.
We also saw that hierarchical clustering can proceed in 2 ways - agglomerative and divisive. If we start with
n distinct clusters and iteratively merge them until we have only 1 cluster in the end, it is called
agglomerative clustering. On the other hand, if we begin with 1 big cluster and subsequently keep
partitioning it until we reach n clusters, each containing a single element, it is called divisive
clustering.
Cutting the dendrogram
Once we obtain the dendrogram, the clusters can be obtained by cutting the dendrogram at an
appropriate level. The number of vertical lines intersecting the cutting line represents the number of
clusters.
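For example, assuming the standard scipy imports, cutting the linkage result so that 3 clusters remain (as in the 3 groupings read off the dendrogram above) can be sketched as follows, with random data standing in for the example points:

import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Random stand-in data for the 10 points of the example
rng = np.random.default_rng(1)
X = rng.random((10, 2))

mergings = linkage(X, method="complete", metric="euclidean")

# Cut the dendrogram so that 3 clusters remain
cluster_labels = cut_tree(mergings, n_clusters=3).reshape(-1)
print(cluster_labels)  # one cluster id per original data point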
In our earlier example, we took the minimum of all the pairwise distances between the data points as the
representative of the distance between 2 clusters. This measure of distance is called single linkage. Apart
from using the minimum, we can use other methods to compute the distance between the clusters. Let’s
consider the common types of linkages:
Single Linkage
Here, the distance between 2 clusters is defined as the shortest distance between points in the two clusters.
Complete Linkage
Here, the distance between 2 clusters is defined as the maximum distance between any 2 points in the two
clusters.
Average Linkage
Here, the distance between 2 clusters is defined as the average of the distances between every point of one
cluster and every point of the other cluster.
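The small sketch below illustrates the three definitions on two made-up clusters by computing all pairwise distances and then taking their minimum, maximum and mean:

import numpy as np

# Two made-up clusters of points, purely for illustration
cluster_a = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5]])
cluster_b = np.array([[6.0, 6.0], [7.0, 5.5]])

# All pairwise Euclidean distances between a point in A and a point in B
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=2)

print(pairwise.min())   # single linkage: shortest pairwise distance
print(pairwise.max())   # complete linkage: largest pairwise distance
print(pairwise.mean())  # average linkage: mean of all pairwise distances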
You also looked at the difference between K-Means and Hierarchical clustering and saw how these methods are
used in the industry.
Here are some important commands that you should remember when clustering data (they assume the usual
imports: StandardScaler from sklearn.preprocessing, KMeans from sklearn.cluster, linkage, dendrogram and
cut_tree from scipy.cluster.hierarchy, and pandas as pd):
• Scaling/Standardising
standard_scaler = StandardScaler()
• K-Means Clustering
model_clus = KMeans(n_clusters = num_clusters, max_iter=_)
• Hierarchical Clustering
mergings = linkage(X, method = "single/complete/average", metric = 'euclidean')
dendrogram(mergings)
• Cutting the Dendrogram
clusterCut = pd.Series(cut_tree(mergings, n_clusters = num_clusters).reshape(-1,))
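Putting these commands together, a minimal end-to-end sketch could look like the following; the DataFrame df and the parameter values are hypothetical and should be replaced with your own data and choices:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, cut_tree

# `df` is a hypothetical DataFrame of numeric features; substitute your own data
df = pd.DataFrame({"feature_1": [1.0, 2.0, 8.0, 9.0, 1.5, 8.5],
                   "feature_2": [2.0, 1.0, 9.0, 8.0, 2.5, 9.5]})

# Scaling/Standardising
standard_scaler = StandardScaler()
X_scaled = standard_scaler.fit_transform(df)

# K-Means clustering (max_iter left at sklearn's default of 300)
model_clus = KMeans(n_clusters=2, max_iter=300, random_state=0).fit(X_scaled)
df["kmeans_label"] = model_clus.labels_

# Hierarchical clustering and cutting the dendrogram
mergings = linkage(X_scaled, method="complete", metric="euclidean")
df["hier_label"] = pd.Series(cut_tree(mergings, n_clusters=2).reshape(-1,))

print(df)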
© UpGrad Education Pvt. Ltd. All rights reserved.
Disclaimer: All content and material on the upGrad website is copyrighted material,
either belonging to upGrad or its bona fide contributors, and is purely for the
dissemination of education. You are permitted to access, print and download extracts
from this site purely for your own education only and on the following basis:
• You can download this document from the website for self-use only.
• Any copies of this document, in part or full, saved to disc or to any other storage
medium may only be used for subsequent, self-viewing purposes or to print
an individual extract or copy for non-commercial personal use only.
• Any further dissemination, distribution, reproduction, copying of the content of
the document herein or the uploading thereof on other websites or use of
the content for any other commercial/unauthorised purposes in any way
which could infringe the intellectual property rights of upGrad or its
contributors, is strictly prohibited.
• No graphics, images or photographs from any accompanying text in this
document will be used separately for unauthorised purposes.
• No material in this document will be modified, adapted or altered in any way.
• No part of this document or upGrad content may be reproduced or stored in
any other web site or included in any public or private electronic retrieval
system or service without upGrad’s prior written permission.
• Any rights not expressly granted in these terms are reserved.