
An Introduction to Clustering and Different Methods of Clustering

ALGORITHM | CLUSTERING | DATA SCIENCE | INTERMEDIATE | MACHINE LEARNING | R | STRUCTURED DATA | UNSUPERVISED

Overview

Learn about clustering, one of the most popular unsupervised classification techniques
Dividing the data into clusters can be done on the basis of centroids, distributions, densities, etc.
Get to know K Means and hierarchical clustering and the difference between the two

Introduction

Have you come across a situation where the Chief Marketing Officer of a company tells you – “Help me
understand our customers better so that we can market our products to them in a better manner!”

I did, and the analyst in me was completely clueless about what to do! I was used to getting specific problems,
where there is an outcome to be predicted for a given set of conditions. But I had no clue what to do in this
case. If the person had asked me to calculate Life Time Value (LTV) or the propensity to cross-sell, I
wouldn’t have blinked. But this question looked very broad to me.

This is usually the first reaction when you come across an unsupervised learning problem for the first time!
You are not looking for specific insights about a phenomenon; what you are looking for are structures within
the data, without them being tied to a specific outcome.

The method of identifying similar groups of data in a dataset is called clustering. It is one of the most
popular techniques in data science. Entities in each group are comparatively more similar to entities of that
group than to those of the other groups. In this article, I will be taking you through the types of clustering,
different clustering algorithms, and a comparison between the two most commonly used clustering
methods.

Let’s get started.

Table of Contents

1. Overview
2. Types of Clustering
3. Types of Clustering Algorithms
4. K Means Clustering
5. Hierarchical Clustering
6. Difference between K Means and Hierarchical clustering
7. Applications of Clustering
8. Improving Supervised Learning algorithms with clustering

1. Overview

Clustering is the task of dividing the population or data points into a number of groups such that data
points in the same group are more similar to each other than to data points in other groups. In simple
words, the aim is to segregate groups with similar traits and assign them into clusters.

Let’s understand this with an example. Suppose you are the head of a rental store and wish to understand
the preferences of your customers to scale up your business. Is it possible for you to look at the details of
each customer and devise a unique business strategy for each one of them? Definitely not. But what you can
do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate
strategy for customers in each of these 10 groups. And this is what we call clustering.

Now that we understand what clustering is, let’s take a look at the types of clustering.

2. Types of Clustering

Broadly speaking, clustering can be divided into two subgroups:

Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or it does not. For
example, in the scenario above each customer is put into exactly one of the 10 groups.
Soft Clustering: In soft clustering, instead of putting each data point into exactly one cluster, a
probability or likelihood of that data point belonging to each cluster is assigned. For example, in the
scenario above each customer is assigned a probability of being in each of the 10 clusters of the retail store
(a short sketch of this distinction follows this list).
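
To make the hard vs. soft distinction concrete, here is a minimal sketch using a Gaussian mixture model. It assumes the third-party mclust package is installed, and the one-dimensional data are synthetic and purely illustrative:

#soft vs. hard assignments from a Gaussian mixture model (mclust package assumed installed)
library(mclust)
set.seed(101)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 5))   #synthetic 1-D data, for illustration only
fit <- Mclust(x, G = 2)        #fit a mixture with 2 components
head(fit$z)                    #soft clustering: probability of each point belonging to each cluster
head(fit$classification)       #hard clustering: each point assigned to its most probable cluster

Taking the most probable cluster for each row of the probability matrix is exactly the step that turns a soft clustering into a hard one.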

3. Types of Clustering Algorithms

Since the task of clustering is subjective, the means that can be used for achieving this goal are plenty.
Every methodology follows a different set of rules for defining the ‘similarity’ among data points. In fact,
there are more than 100 known clustering algorithms, but only a few are popularly used. Let’s
look at them in detail:

Connectivity models: As the name suggests, these models are based on the notion that data
points closer in data space exhibit more similarity to each other than data points lying farther away.
These models can follow two approaches. In the first approach, they start by classifying all data
points into separate clusters and then aggregate them as the distance decreases. In the second
approach, all data points are classified as a single cluster and then partitioned as the distance
increases. Also, the choice of distance function is subjective. These models are very easy to interpret
but lack the scalability to handle big datasets. Examples of these models are the hierarchical clustering
algorithm and its variants.
Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived
from the closeness of a data point to the centroid of the clusters. The K Means clustering algorithm is a
popular algorithm that falls into this category. In these models, the number of clusters required at the end
has to be specified beforehand, which makes it important to have prior knowledge of the dataset.
These models run iteratively to find a local optimum.

Distribution models: These clustering models are based on the notion of how probable it is that all
data points in a cluster belong to the same distribution (for example, a Gaussian/normal distribution). These
models often suffer from overfitting. A popular example of these models is the expectation-maximization
algorithm, which uses multivariate normal distributions.

Density models: These models search the data space for regions of varied density of data points. They
isolate the different density regions and assign the data points within the same region to the same
cluster. Popular examples of density models are DBSCAN and OPTICS (a short example follows below).
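
As a quick illustration of the density-based family, here is a minimal sketch. It assumes the third-party dbscan package is installed, and the eps and minPts values are illustrative rather than tuned:

#minimal DBSCAN sketch (dbscan package assumed installed); points labelled 0 are treated as noise
library(dbscan)
set.seed(101)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),   #dense blob around (0, 0)
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),   #dense blob around (3, 3)
           matrix(runif(20, min = -1, max = 4), ncol = 2))     #scattered noise points
db <- dbscan(x, eps = 0.5, minPts = 5)   #eps and minPts chosen for illustration only
table(db$cluster)                        #cluster 0 collects the noise points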

Now I will be taking you through two of the most popular clustering algorithms in detail – K Means
clustering and Hierarchical clustering. Let’s begin.

4. K Means Clustering

K Means is an iterative clustering algorithm that converges to a local optimum. The algorithm
works in the following steps:

1. Specify the desired number of clusters K: let us choose K = 2 for these 5 data points in 2-D space.

2. Randomly assign each data point to a cluster: let’s assign three points to cluster 1, shown using red
colour, and two points to cluster 2, shown using grey colour.

3. Compute cluster centroids: the centroid of the data points in the red cluster is shown using a red cross and
that of the grey cluster using a grey cross.

4. Re-assign each point to the closest cluster centroid: note that only the data point at the bottom is
assigned to the red cluster even though it is closer to the centroid of the grey cluster. Thus, we re-assign that
data point to the grey cluster.

5. Re-compute cluster centroids: now, re-compute the centroids for both clusters.

6. Repeat steps 4 and 5 until no improvements are possible: similarly, we repeat the 4th and 5th steps
until the algorithm converges to a local optimum, i.e. when there is no further switching of data points between the two
clusters for two successive iterations. This marks the termination of the algorithm if a stopping criterion is not explicitly
specified.

The original article embeds a live coding window where you can try out the K Means algorithm using the scikit-learn library.
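
Since that window is not reproduced here, below is a minimal stand-in sketch in R (the language used later in this article); the toy 2-D data and the value K = 2 are purely illustrative:

#minimal K Means sketch in base R on synthetic 2-D data (values are illustrative only)
set.seed(101)
pts <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 4), ncol = 2))
km <- kmeans(pts, centers = 2)   #K = 2, as in the walkthrough above
km$centers                       #final cluster centroids
table(km$cluster)                #number of points assigned to each cluster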

5. Hierarchical Clustering

Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. This
algorithm starts with all the data points assigned to a cluster of their own. Then the two nearest clusters are
merged into the same cluster. In the end, the algorithm terminates when there is only a single cluster left.

The results of hierarchical clustering can be shown using a dendrogram. The dendrogram can be interpreted
as follows: at the bottom, we start with 25 data points, each assigned to a separate cluster. The two closest
clusters are then merged repeatedly until we have just one cluster at the top. The height in the dendrogram
at which two clusters are merged represents the distance between those two clusters in the data space.

The number of clusters that best depicts the different groups can be chosen by observing the
dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a
horizontal line that can traverse the maximum distance vertically without intersecting a cluster.

In the above example, the best choice of the number of clusters will be 4, as the red horizontal line in the
dendrogram covers the maximum vertical distance AB.
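
To make this concrete, here is a minimal sketch using base R’s hclust(); the 25 synthetic 2-D points and the cut into 4 clusters simply mirror the example described above:

#minimal agglomerative hierarchical clustering sketch in base R (synthetic data, illustrative only)
set.seed(101)
pts <- matrix(rnorm(50), ncol = 2)     #25 data points in 2-D space
d <- dist(pts, method = "euclidean")   #pairwise distance matrix
hc <- hclust(d)                        #bottom-up (agglomerative) merging
plot(hc)                               #draws the dendrogram
groups <- cutree(hc, k = 4)            #cut the tree to obtain 4 clusters
table(groups)                          #cluster sizes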

Two important things that you should know about hierarchical clustering are:

This algorithm has been described above using the bottom-up (agglomerative) approach. It is also possible to follow
a top-down (divisive) approach, starting with all data points assigned to the same cluster and recursively
performing splits until each data point is assigned a separate cluster.
The decision to merge two clusters is taken on the basis of the closeness of these clusters. There are
multiple metrics for deciding the closeness of two clusters (a short R illustration follows the list):
Euclidean distance: ||a − b||_2 = √(Σ_i (a_i − b_i)^2)
Squared Euclidean distance: ||a − b||_2^2 = Σ_i (a_i − b_i)^2
Manhattan distance: ||a − b||_1 = Σ_i |a_i − b_i|
Maximum distance: ||a − b||_∞ = max_i |a_i − b_i|
Mahalanobis distance: √((a − b)^T S^{-1} (a − b)), where S is the covariance matrix
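
Several of these metrics are available directly through base R’s dist() function, and the Mahalanobis distance through mahalanobis(); the two points and the covariance matrix below are made up for illustration:

#distance metrics between two illustrative points a and b
a <- c(1, 2, 3)
b <- c(4, 0, 3)
dist(rbind(a, b), method = "euclidean")    #Euclidean distance
dist(rbind(a, b), method = "manhattan")    #Manhattan distance
dist(rbind(a, b), method = "maximum")      #maximum (Chebyshev) distance
S <- diag(3)                               #illustrative covariance matrix
sqrt(mahalanobis(a, center = b, cov = S))  #Mahalanobis distance (equals Euclidean when S is the identity)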

6. Difference between K Means and Hierarchical clustering

Hierarchical clustering can’t handle big data well, but K Means clustering can. This is because the time
complexity of K Means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n²)
(a rough timing sketch follows this list).
In K Means clustering, since we start with a random choice of clusters, the results produced by running
the algorithm multiple times might differ, while results are reproducible in hierarchical clustering.
K Means is found to work well when the shape of the clusters is hyperspherical (like a circle in 2-D or a
sphere in 3-D).
K Means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into.
In hierarchical clustering, by contrast, you can stop at whatever number of clusters you find appropriate by
interpreting the dendrogram.
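
As a rough, machine-dependent illustration of the scalability gap mentioned in the first point, the following compares the two algorithms on the same synthetic data (the sample size and timings are illustrative only):

#rough timing comparison on 3,000 synthetic 2-D points (results vary by machine)
set.seed(101)
big <- matrix(rnorm(6000), ncol = 2)
system.time(kmeans(big, centers = 5))   #K Means: typically a fraction of a second
system.time(hclust(dist(big)))          #hierarchical: needs the full n x n distance matrix, noticeably slower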

7. Applications of Clustering

Clustering has a large number of applications spread across various domains. Some of the most popular
applications of clustering are:

Recommendation engines
Market segmentation
Social network analysis
Search result grouping
Medical imaging
Image segmentation
Anomaly detection

8. Improving Supervised Learning Algorithms with Clustering

Clustering is an unsupervised machine learning approach, but can it also be used to improve the accuracy of
supervised machine learning algorithms, by clustering the data points into similar groups and using
these cluster labels as independent variables in the supervised algorithm? Let’s find out.

Let’s check out the impact of clustering on the accuracy of our model for a classification problem, using
3,000 observations with 100 predictors of stock data to predict whether the stock will go up or down,
using R. This dataset contains 100 independent variables from X1 to X100 representing the profile of a stock,
and one outcome variable Y with two levels: 1 for a rise in stock price and -1 for a drop in stock price.
The dataset is available here: Download

Let’s first try applying random forest without clustering.

#loading required libraries
library('randomForest')
library('Metrics')
#set random seed
set.seed(101)
#loading dataset
data <- read.csv("train.csv", stringsAsFactors = TRUE)
#checking dimensions of data
dim(data)
## [1] 3000  101
#specifying outcome variable as factor
data$Y <- as.factor(data$Y)
#dividing the dataset into train and test
train <- data[1:2000, ]
test <- data[2001:3000, ]
#applying randomForest
model_rf <- randomForest(Y ~ ., data = train)
preds <- predict(object = model_rf, test[, -101])
table(preds)
## preds
##  -1   1
## 453 547
#checking accuracy (Metrics::auc is used as the accuracy measure here)
auc(preds, test$Y)
## [1] 0.4522703

So, the accuracy (measured here as AUC) we get is about 0.45. Now let’s create five clusters based on the
values of the independent variables using K Means clustering and reapply random forest.

#combining test and train
all <- rbind(train, test)
#creating 5 clusters using K Means clustering
Cluster <- kmeans(all[, -101], 5)
#adding clusters as independent variable to the dataset
all$cluster <- as.factor(Cluster$cluster)
#dividing the dataset into train and test
train <- all[1:2000, ]
test <- all[2001:3000, ]
#applying randomForest
model_rf <- randomForest(Y ~ ., data = train)
preds2 <- predict(object = model_rf, test[, -101])
table(preds2)
## preds2
##  -1   1
## 548 452
auc(preds2, test$Y)
## [1] 0.5345908

Whoo! In the above example, even though the final accuracy is still poor, clustering has given our model a
significant boost, from an accuracy of 0.45 to slightly above 0.53.

This shows that clustering can indeed be helpful for supervised machine learning tasks.

End Notes

In this article, we have discussed the various ways of performing clustering. Clustering finds applications
for unsupervised learning in a large number of domains. You also saw how you can improve the accuracy of a
supervised machine learning algorithm using clustering.

Although clustering is easy to implement, you need to take care of some important aspects, like treating
outliers in your data and making sure each cluster has a sufficient population. These aspects of clustering
are dealt with in great detail in this article.

Did you enjoy reading this article? Do share your views in the comment section below.

Got expertise in Machine Learning / Big Data / Data Science? Showcase your
knowledge and help the Analytics Vidhya community by posting your blog.

Article Url - https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/

Saurav Kaushik
Saurav is a Data Science enthusiast, currently in the final year of his graduation at MAIT, New Delhi. He
loves to use machine learning and analytics to solve complex data problems.
