Lecture Notes: Clustering
Unsupervised Learning
In the previous modules, you saw various supervised machine learning algorithms. Supervised learning is a type of
machine learning algorithm that uses a known dataset to perform predictions. This dataset (referred to as the training
dataset) includes both response values and input data. From this, the supervised learning algorithm seeks to build a
model that can predict the response values for a new dataset.
If you train a machine learning model using only a set of inputs, it is called unsupervised learning; such an algorithm is
able to find the structure or relationships between the different inputs. The most important unsupervised learning
technique is clustering, which creates different groups or clusters of the given set of inputs and is also able to put any
new input in the appropriate cluster. While carrying out clustering, the basic objective is to group the input points in
such a way as to maximise the inter-cluster variance and minimise the intra-cluster variance.
Fig 1: Objective of clustering is to maximise the inter-cluster distance and minimise the intra-cluster variance
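To make this objective concrete, here is a small sketch in Python (the points and labels are made up purely for illustration) that computes the within-cluster and between-cluster sums of squares for a given grouping; a good clustering keeps the former small and the latter large.

import numpy as np

# Made-up points and labels, purely for demonstration
points = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # one tight group
                   [8.0, 8.0], [8.3, 7.7], [7.9, 8.4]])  # another tight group
labels = np.array([0, 0, 0, 1, 1, 1])

overall_mean = points.mean(axis=0)
within_ss, between_ss = 0.0, 0.0
for k in np.unique(labels):
    members = points[labels == k]
    centre = members.mean(axis=0)
    within_ss += ((members - centre) ** 2).sum()                        # intra-cluster variation
    between_ss += len(members) * ((centre - overall_mean) ** 2).sum()   # inter-cluster variation

print(within_ss, between_ss)  # a good clustering keeps the first small and the second large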
The two most important methods of clustering are the K-Means algorithm & the Hierarchical clustering algorithm.
K-Means Algorithm
The K-Means algorithm is the process of dividing the N data points into K groups or clusters. The steps of the
algorithm are:
1. Start by choosing K random points as the initial cluster centres.
2. Assign each data point to its nearest cluster centre. The most common way of measuring the distance
between the points is the Euclidean distance.
3. For each cluster, compute the new cluster centre which will be the mean of all cluster members.
4. Now re-assign all the data points to the different clusters by taking into account the new cluster centres.
5. Keep iterating through steps 3 & 4 until there are no further changes possible.
At this point, you arrive at the optimal clusters.
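The following is a minimal NumPy sketch of these steps, written only to illustrate the algorithm (it is not a production implementation and makes simple choices, e.g. keeping a centre fixed if its cluster becomes empty):

import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    # A bare-bones sketch of the steps listed above
    rng = np.random.default_rng(seed)
    # Step 1: choose K random points as the initial cluster centres
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centre (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Steps 3 & 4: recompute each centre as the mean of its current members
        new_centres = np.array([points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
                                for j in range(k)])
        # Step 5: stop once the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres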
Let’s apply the K-Means algorithm on a set of 10 points, which we want to divide into 2 clusters. Thus the
value of K here is 2.
We first choose 2 random points as the initial cluster centres, and then assign each of the data points to its
nearest cluster centre based on the Euclidean distance. This way all the points are divided among the K clusters.
Fig 4: Assigning each data point to their nearest cluster centre
Now we update the position of each of the cluster centres to reflect the mean of each cluster.
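In practice, you would run the whole procedure (assignment and centre updates repeated until convergence) with scikit-learn's KMeans. The 10 points below are placeholders, since the coordinates from the figures are not reproduced here:

import numpy as np
from sklearn.cluster import KMeans

# Placeholder coordinates for the 10 points of the example
X = np.array([[1, 2], [1.5, 1.8], [2, 2.2], [1.2, 2.6], [1.8, 1.5],
              [8, 8], [8.5, 7.6], [7.8, 8.4], [8.2, 8.1], [7.5, 7.9]])

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(model.labels_)           # which of the 2 clusters each point ended up in
print(model.cluster_centers_)  # the final cluster centres (means of each cluster)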
Hierarchical Clustering
Earlier you saw the difference between the classification and the clustering problem. Then you saw the K-Means
algorithm as a way to obtain the clusters. Hierarchical clustering is another algorithm to obtain such clusters.
Given a set of N items to be clustered, the steps in the hierarchical clustering are:
1. Calculate the NxN distance (similarity) matrix, which contains the distance of each data point from
every other data point.
2. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters,
each containing just one item.
3. Find the closest (most similar) pair of clusters and merge them into a
single cluster, so that now you have one less cluster.
4. Compute distances (similarities) between the new cluster and each of the old clusters.
5. Repeat steps 3 and 4 until all items are clustered into a single cluster of size N.
Thus, what we have at the end is the dendrogram, which shows us which data points group together in
which cluster at what distance.
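Assuming the usual scipy and matplotlib imports, the whole agglomerative procedure and its dendrogram can be obtained as sketched below; the 10 points are randomly generated stand-ins for the example data:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Randomly generated stand-ins for the 10 points used in the example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(5, 1, (5, 2))])

# linkage() carries out the agglomerative procedure described in steps 1-5 above;
# each row of `mergings` records one merge: the two clusters joined, the distance, and the new size
mergings = linkage(X, method="single", metric="euclidean")

dendrogram(mergings)  # draws the dendrogram of the merges
plt.show()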
Let’s consider an example. Let’s take 10 points and try to apply the hierarchical clustering algorithm to them.
Now initially, we treat each of these points as individual clusters. Thus we begin with 10 different clusters.
In the first iteration, the two closest points, 5 & 7, are merged into a single cluster. Thus you are now left with
only 9 clusters: 8 of them contain a single element each, while 1 contains 2 elements, i.e. 5 & 7. Now again we
calculate the distance of each cluster from every other cluster. But here the problem is: how do you measure the
distance between a cluster having 2 points and a cluster having a single point? It is here that the concept of
linkage becomes important. Linkage is the measure of dissimilarity or similarity between clusters that contain
multiple observations.
Here, we calculate the distance between points 5 & 8 and then between 7 & 8, and the minimum of these 2
distances is taken as the distance between the 2 clusters. Thus, in the next iteration, we obtain 8 clusters.
Fig 5: After iteration 2, we have 8 clusters
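As a quick illustration of that computation, with hypothetical coordinates for points 5, 7 and 8 (the actual coordinates appear only in the figures):

import numpy as np

# Hypothetical coordinates for points 5, 7 and 8
p5, p7, p8 = np.array([2.0, 3.0]), np.array([2.5, 3.5]), np.array([4.0, 3.0])

d_5_8 = np.linalg.norm(p5 - p8)
d_7_8 = np.linalg.norm(p7 - p8)

# Single linkage: the distance between cluster {5, 7} and cluster {8} is the smaller of the two
single_linkage_distance = min(d_5_8, d_7_8)
print(single_linkage_distance)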
The result of the hierarchical clustering algorithm is shown by a dendrogram, which starts with all the data
points as separate clusters and indicates at what level of dissimilarity any two clusters were joined.
The y-axis of the dendrogram is some measure of the dissimilarity or distance at which clusters join.
Fig 7: A sample dendrogram
In the dendrogram shown above, samples 4 and 5 are the most similar and join to form the first cluster,
followed by samples 1 and 10. The last two clusters to fuse together to form the final single cluster are 3-6
and 4-5-2-7-1-10-9-8.
Determining the number of groups in a cluster analysis is often the primary goal. Typically, one looks for
natural groupings defined by long stems. Here, by observation, you can identify that there are 3 major
groupings: 3-6, 4-5-2-7 and 1-10-9-8.
We also saw that hierarchical clustering can proceed in 2 ways - agglomerative and divisive. If we start with
n distinct clusters and iteratively merge them until we have only 1 cluster in the end, it is called
agglomerative clustering. On the other hand, if we begin with 1 big cluster and subsequently keep
partitioning it until we reach n clusters, each containing a single element, it is called divisive
clustering.
Cutting the dendrogram
Once we obtain the dendrogram, the clusters can be obtained by cutting the dendrogram at an
appropriate level. The number of vertical lines intersecting the cutting line represents the number of
clusters.
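For example, assuming the standard scipy imports, cutting the linkage result so that 3 clusters remain (as in the 3 groupings read off the dendrogram above) can be sketched as follows, with random data standing in for the example points:

import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Random stand-in data for the 10 points of the example
rng = np.random.default_rng(1)
X = rng.random((10, 2))

mergings = linkage(X, method="complete", metric="euclidean")

# Cut the dendrogram so that 3 clusters remain
cluster_labels = cut_tree(mergings, n_clusters=3).reshape(-1)
print(cluster_labels)  # one cluster id per original data point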
In our earlier example, we took the minimum of all the pairwise distances between the data points as the
representative of the distance between 2 clusters. This measure of distance is called single linkage. Apart
from using the minimum, we can use other methods to compute the distance between the clusters. Let’s
consider the common types of linkages:
Single Linkage
Here, the distance between 2 clusters is defined as the shortest distance between points in the two clusters.
Complete Linkage
Here, the distance between 2 clusters is defined as the maximum distance between any 2 points in the two
clusters.
Average Linkage
Here, the distance between 2 clusters is defined as the average of the distances between every point of one
cluster and every point of the other cluster.
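The small sketch below illustrates the three definitions on two made-up clusters by computing all pairwise distances and then taking their minimum, maximum and mean:

import numpy as np

# Two made-up clusters of points, purely for illustration
cluster_a = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5]])
cluster_b = np.array([[6.0, 6.0], [7.0, 5.5]])

# All pairwise Euclidean distances between a point in A and a point in B
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=2)

print(pairwise.min())   # single linkage: shortest pairwise distance
print(pairwise.max())   # complete linkage: largest pairwise distance
print(pairwise.mean())  # average linkage: mean of all pairwise distances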
You also looked at the difference between K-Means and Hierarchical clustering and saw how these methods are
used in the industry.
Here are some important commands that you should remember when clustering data (they assume the usual
imports: StandardScaler from sklearn.preprocessing, KMeans from sklearn.cluster, linkage, dendrogram and
cut_tree from scipy.cluster.hierarchy, and pandas as pd):
• Scaling/Standardising
standard_scaler = StandardScaler()
• K-Means Clustering
model_clus = KMeans(n_clusters = num_clusters, max_iter=_)
• Hierarchical Clustering
mergings = linkage(X, method = "single/complete/average", metric = 'euclidean')
dendrogram(mergings)
• Cutting the Dendrogram
clusterCut = pd.Series(cut_tree(mergings, n_clusters = num_clusters).reshape(-1,))
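Putting these commands together, a minimal end-to-end sketch could look like the following; the DataFrame df and the parameter values are hypothetical and should be replaced with your own data and choices:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, cut_tree

# `df` is a hypothetical DataFrame of numeric features; substitute your own data
df = pd.DataFrame({"feature_1": [1.0, 2.0, 8.0, 9.0, 1.5, 8.5],
                   "feature_2": [2.0, 1.0, 9.0, 8.0, 2.5, 9.5]})

# Scaling/Standardising
standard_scaler = StandardScaler()
X_scaled = standard_scaler.fit_transform(df)

# K-Means clustering (max_iter left at sklearn's default of 300)
model_clus = KMeans(n_clusters=2, max_iter=300, random_state=0).fit(X_scaled)
df["kmeans_label"] = model_clus.labels_

# Hierarchical clustering and cutting the dendrogram
mergings = linkage(X_scaled, method="complete", metric="euclidean")
df["hier_label"] = pd.Series(cut_tree(mergings, n_clusters=2).reshape(-1,))

print(df)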
© UpGrad Education Pvt. Ltd. All rights reserved.
Disclaimer: All content and material on the upGrad website is copyrighted material,
either belonging to upGrad or its bona fide contributors, and is purely for the
dissemination of education. You are permitted to access, print and download extracts
from this site purely for your own education only and on the following basis:
• You can download this document from the website for self-use only.
• Any copies of this document, in part or full, saved to disc or to any other storage
medium may only be used for subsequent, self-viewing purposes or to print
an individual extract or copy for non-commercial personal use only.
• Any further dissemination, distribution, reproduction, copying of the content of
the document herein or the uploading thereof on other websites or use of
the content for any other commercial/unauthorised purposes in any way
which could infringe the intellectual property rights of upGrad or its
contributors, is strictly prohibited.
• No graphics, images or photographs from any accompanying text in this
document will be used separately for unauthorised purposes.
• No material in this document will be modified, adapted or altered in any way.
• No part of this document or upGrad content may be reproduced or stored in
any other web site or included in any public or private electronic retrieval
system or service without upGrad’s prior written permission.
• Any rights not expressly granted in these terms are reserved.