Clustering

1. Hierarchical clustering begins by treating each observation as a separate cluster and then combines the nearest pairs of clusters step-by-step until all observations are in one cluster.
2. K-means clustering places k points (cluster centroids) in the data space and assigns each observation to the nearest centroid, then recalculates the centroid positions and reassigns observations in an iterative process until convergence is reached.
3. There are various techniques for choosing the optimal number of clusters k in k-means clustering, including the elbow method, information criteria, and cross-validation.


1.
______ of a set of points is defined using a distance measure.
Similarity

2.
Each point is a cluster in itself. We then combine the two nearest clusters into
one. What type of clustering does this represent?
Agglomerative

3.
Unsupervised learning focuses on understanding the data and its underlying patterns.
True

4.
Members of the same cluster are far away / distant from each other.
False

5.
The ______ is a visual representation of how the data points are merged to form
clusters.
Dendrogram

6.
A centroid is a valid point in a non-Euclidean space.
False

7.
______ measures the goodness of a cluster.
Cohesion

8.
The ______ of two points is the average of the two points in Euclidean space.
Centroid

9.
Sampling is one technique to pick the initial k points in K-Means clustering.
True

10.
The K-Means algorithm assumes Euclidean space/distance.
True

11.
______ is when points don't move between clusters and centroids stabilize.
Convergence

12.
What is the overall complexity of the Agglomerative Hierarchical Clustering algorithm?
O(N²)

13.
The number of rounds for convergence in K-Means clustering can be large.
True

14.
The ______ is the data point that is closest to the other points in the cluster.
Clusteroid

15.
What is the R function to divide a dataset into k clusters?
kmeans()

16.
___________ is a way of finding the k value for K-Means clustering.
Cross-validation

Distance Measure
Distance measure is a very important aspect of clustering. Knowing how close or how
far apart each data point is with respect to the others helps in grouping them.

Jaccard Distance

The Jaccard index is used to compare elements of two sets to identify which of the
members are shared and not shared.
The Jaccard Distance is a measure of how different the two given sets are.

Jaccard Distance = 1-(Jaccard Index)
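
As a quick illustration (a minimal sketch, not part of the original material; the two
sets are made-up examples), the Jaccard index and distance can be computed in base R:

a <- c("apple", "banana", "cherry")
b <- c("banana", "cherry", "durian", "elderberry")

jaccard_index    <- length(intersect(a, b)) / length(union(a, b))   # shared / total members
jaccard_distance <- 1 - jaccard_index

jaccard_index      # 2 / 5 = 0.4
jaccard_distance   # 0.6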

Euclidean Distance

Euclidean distance is the shortest (straight-line) distance between two given points
in Euclidean space.
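
A minimal sketch (the two points are made-up examples): base R's dist() computes the
Euclidean distance by default.

p <- rbind(c(1, 2), c(4, 6))           # two example points
dist(p, method = "euclidean")          # sqrt((4-1)^2 + (6-2)^2) = 5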

Cosine Distance

The cosine similarity of two given vectors u and v is the cosine of the angle between
them; the cosine distance is commonly defined as 1 minus this similarity.
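
A minimal sketch (the vectors are made-up examples) of the cosine distance in base R:

u <- c(1, 2, 3)
v <- c(2, 4, 6)
cos_sim  <- sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))   # cosine of the angle
cos_dist <- 1 - cos_sim                                      # 0 here: u and v point the same way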

Manhattan Distance

Manhattan distance is calculated along strictly horizontal and vertical paths, i.e.
the sum of the absolute differences of the coordinates (as on a city-block grid).
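
A minimal sketch, reusing the same made-up example points as above:

p <- rbind(c(1, 2), c(4, 6))
dist(p, method = "manhattan")          # |4-1| + |6-2| = 7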

Hierarchical Clustering Explained

Begin by allotting each item to its own cluster. If you have N items, you start with
N clusters, each containing one item. The similarities (distances) between the
clusters are then simply the similarities (distances) between the items they contain.

Find the closest (most similar) pair of clusters and merge them into a single
cluster, reducing the number of clusters by one.

Calculate the similarities (distances) between each of the old clusters and the new
cluster.

Repeat steps 2 and 3 until all items are finally grouped into a single cluster of
size N.
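
A minimal sketch (using the built-in iris data, as in the code later in this material;
complete linkage is an assumed choice): R's hclust() carries out exactly these steps,
and its $merge and $height components record which clusters were combined at each
step and at what distance.

d  <- dist(iris[, 3:4])                # step 1: pairwise distances between items
hc <- hclust(d, method = "complete")   # steps 2-4: repeatedly merge the closest clusters

head(hc$merge)                         # which items/clusters were merged at each step
head(hc$height)                        # the distance at which each merge happened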

Dendrogram

A dendrogram is a branching diagram that represents the relationships of similarity
among a group of entities.
Each branch is called a clade.
The terminal end of each clade is called a leaf.
There is no limit to the number of leaves in a clade.
The arrangement of the clades tells us which leaves are most similar to each other.
The height of the branch points indicates how similar or different they are from
each other: the greater the height, the greater the difference between the points.
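
A minimal sketch (iris data assumed, and k = 3 is an assumed cut): plotting a
dendrogram and highlighting the clusters a cut would produce.

hc <- hclust(dist(iris[, 3:4]))
plot(hc)                               # the dendrogram: clades, leaves, and merge heights
rect.hclust(hc, k = 3, border = "red") # boxes around the 3 clusters a cut would give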

Disadvantages of Agglomerative Clustering


If data points are wrongly grouped at the inception, they cannot be reallocated.
If different similarity measures are used to calculate the similarity between
clusters, the results may differ altogether.
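
To illustrate the second point, here is a minimal sketch (iris data and k = 3 assumed;
complete and single linkage are chosen only for the comparison) showing how different
similarity measures between clusters can yield different groupings:

d <- dist(iris[, 3:4])
table(cutree(hclust(d, method = "complete"), k = 3),
      cutree(hclust(d, method = "single"),   k = 3))   # cross-tabulate the two groupings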

Tips for Hierarchical Clustering


There is no particular one-size-fits-all rule to determine how many clusters you
need; it depends on what you intend to do with them.
For a better solution, look at the basic characteristics of the clusters at
successive steps and make a decision when you have a solution
that can be interpreted.

Hierarchical Clustering - Standardization


Standardizing the variables is a good practice to follow while clustering data, so
that variables measured on larger scales do not dominate the distance calculations.
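
A minimal sketch (not prescribed by the original material; it reuses the same iris
columns as the later code): base R's scale() centres each variable and divides it by
its standard deviation before the distances are computed.

iris_std <- scale(iris[, 3:4])         # centre and scale each variable
clusters <- hclust(dist(iris_std))     # cluster on the standardized variables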

Summary on Hierarchical Clustering


In this module, you have learnt about hierarchical clustering in detail. You have also
learnt how to read a dendrogram and some tips to follow when fitting hierarchical
clustering to a dataset.

K-Means Algorithm Simplified


Place k points in the space represented by the objects that are being clustered.
These points represent the initial group centroids.
Assign each object to the group that has the closest centroid.
When all objects have been assigned, recalculate the positions of the k centroids.
Repeat Steps 2 and 3 until the centroids no longer move (see the sketch below).
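
A minimal sketch of this loop (simple_kmeans is a hypothetical toy implementation, not
the kmeans() function used later; Euclidean distance and a numeric matrix x are
assumed):

simple_kmeans <- function(x, k, max_iter = 100) {
  # step 1: pick k of the observations as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # step 2: assign each object to the group with the closest centroid
    cl <- apply(x, 1, function(p) which.min(colSums((t(centroids) - p)^2)))
    # step 3: recalculate each centroid as the mean of its assigned points
    # (empty clusters are not handled in this toy sketch)
    new_centroids <- t(sapply(seq_len(k), function(j)
      colMeans(x[cl == j, , drop = FALSE])))
    # stop once the centroids no longer move
    if (all(abs(new_centroids - centroids) < 1e-8)) break
    centroids <- new_centroids
  }
  list(cluster = cl, centers = centroids)
}

fit <- simple_kmeans(as.matrix(iris[, 3:4]), k = 3)
table(fit$cluster, iris$Species)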

Tips for K Means Clustering


For large datasets, random sampling can be used to determine the k value for
clustering.
Hierarchical clustering can also be used for the same purpose (see the sketch below).
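
A minimal sketch of the second tip (the sample size and the cut height are made-up
choices): run hierarchical clustering on a random sample and inspect the dendrogram
to suggest a value of k.

set.seed(42)
idx    <- sample(nrow(iris), 50)               # random sample of the data
hc     <- hclust(dist(iris[idx, 3:4]))
plot(hc)                                       # look for a natural number of branches
k_hint <- length(unique(cutree(hc, h = 2)))    # clusters obtained by cutting at height 2
k_hint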

Choosing Right K-value


Other ways to choose the right k value:

By rule of thumb
Elbow method (see the sketch after this list)
Information Criterion Approach
An Information Theoretic Approach
Choosing k using the Silhouette
Cross-validation
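
A minimal sketch of the elbow method (the range of k values and nstart are
assumptions): plot the total within-cluster sum of squares against k and look for
the bend ("elbow") where the curve flattens.

wss <- sapply(1:10, function(k)
  kmeans(iris[, 3:4], centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
# the k at the elbow of this curve is a reasonable choice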

Code Snippet
K-Means Clustering in R

Loading and exploring the dataset

library(datasets)
head(iris)
Visualizing the data

library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
Setting the seed and creating the cluster

set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
irisCluster
Comparing the clusters with the species
table(irisCluster$cluster, iris$Species)
Plotting the dataset to view the clusters

irisCluster$cluster <- as.factor(irisCluster$cluster)


ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()

Code Snippet
Hierarchical Clustering in R

Loading and exploring the dataset

library(datasets)
head(iris)
Visualizing the data

library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
Calculating the distance and plotting the dendrogram

clusters <- hclust(dist(iris[, 3:4]))


plot(clusters)
Cutting the desired number of clusters and comparing it with the data

clusterCut <- cutree(clusters, 3)

table(clusterCut, iris$Species)
Visualizing the clusters

ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point(alpha = 0.4, size = 3.5) + geom_point(col = clusterCut) +
  scale_color_manual(values = c('black', 'red', 'green'))

Model Validation Tips and Tricks


Clustering is an unsupervised learning technique. Here are a few tips to validate the
model.
