Clustering
2,
Each point is a cluster in itself. We then combine the two nearest clusters into
one. What type of clustering does this represent?
Agglomerative
3,
Unsupervised learning focuses on understanding the data and its underlying pattern.
True
4, Members of the same cluster are far away / distant from each other.
False
5,
The ______ is a visual representation of how the data points are merged to form
clusters.
Dendrogram
6,
A centroid is a valid point in a non-Euclidean space. False
8, The ______ of two points is the average of the two points in Euclidean space.
Centroid
10,
The K-means algorithm assumes Euclidean space/distance.
True
11, ______ is when points don't move between clusters and centroids stabilize.
Convergence
12,
What is the overall complexity of Agglomerative Hierarchical Clustering?
O(N²)
13,
The number of rounds for convergence in k-means clustering can be large. True
14,
______ is the data point that is closest to the other points in the cluster.
Clusteroid
15,
What is the R function to divide a dataset into k clusters?
kmeans()
16,
___________ is a way of finding the k value for k means clustering.
Cross-validation (CV)
17,
Distance Measure
Distance measure is a very important aspect of clustering. Knowing how close or how
far apart each data point is with respect to the others helps in grouping them.
Jaccard Distance
The Jaccard index is used to compare elements of two sets to identify which of the
members are shared and not shared.
The Jaccard Distance is a measure of how different the two given sets are.
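As a quick illustration (a hypothetical example, not part of the original notes), the
Jaccard distance of two small sets can be computed in R like this:
# Jaccard distance between two example sets (hypothetical data)
a <- c("red", "green", "blue")
b <- c("green", "blue", "yellow")
jaccard_index <- length(intersect(a, b)) / length(union(a, b))
jaccard_distance <- 1 - jaccard_index    # 1 minus the proportion of shared members
jaccard_distance                         # 0.5 for these two sets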
Euclidean Distance
Euclidean distance is the shortest distance between the two given points in
Euclidean space.
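For example (a hypothetical sketch, not from the original notes), the Euclidean
distance between two points can be computed directly or with R's built-in dist():
# Euclidean distance between two example points (hypothetical data)
p <- c(1, 2)
q <- c(4, 6)
sqrt(sum((p - q)^2))                     # 5
dist(rbind(p, q), method = "euclidean")  # same result using the built-in dist()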
Cosine distance
Cosine distance of two given vectors u and v is based on the cosine of the angle
between the given vectors.
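A small sketch (hypothetical data; it assumes the common convention of reporting
cosine distance as 1 minus the cosine similarity):
# Cosine distance between two example vectors (hypothetical data)
u <- c(1, 0, 1)
v <- c(0, 1, 1)
cos_sim  <- sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
cos_dist <- 1 - cos_sim                  # one common definition of cosine distance
cos_dist                                 # 0.5 for these two vectors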
1. Begin by allotting each item to a cluster. If you have N items, you now have N
clusters, where each of them contains one item. Let the similarities (distances)
between the clusters be the same as the similarities (distances) between the items
they contain.
2. Find the most similar (closest) pair of clusters and merge them into one
cluster, thereby reducing the number of clusters by one.
3. Calculate the similarities (distances) between each of the old clusters and the
new cluster.
4. Repeat step 2 and step 3 until all items are finally clustered into one cluster
of size N.
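These steps are what R's built-in hclust() carries out. A minimal sketch on a tiny
made-up dataset (the data and the average linkage are assumptions, chosen only for
illustration):
# Agglomerative hierarchical clustering on a toy dataset (hypothetical data)
x <- matrix(c(1, 1,
              1.5, 1,
              5, 5,
              5.5, 5.5), ncol = 2, byrow = TRUE)
d <- dist(x, method = "euclidean")   # pairwise distances between the N items
hc <- hclust(d, method = "average")  # repeatedly merge the two closest clusters
plot(hc)                             # dendrogram of the merges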
Dendrogram
If data points are wrongly grouped at the inception, they cannot be reallocated.
If different similarity measures are used to calculate the similarity between
clusters, they may produce entirely different clusterings.
By rule of thumb
Elbow method
Information Criterion Approach
An Information Theoretic Approach
Choosing k using the Silhouette
Cross-validation
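Of the approaches above, the elbow method is the easiest to sketch in R (a
hypothetical illustration using the same iris data as the later sections):
# Elbow method: total within-cluster sum of squares for k = 1..10
library(datasets)
wss <- sapply(1:10, function(k) kmeans(iris[, 3:4], centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")
# look for the "elbow" where adding clusters stops reducing wss sharply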
K-Means Clustering in R
library(datasets)
head(iris)
Visualizing the data
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
Setting the seed and creating the cluster
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
irisCluster
Comparing the clusters with the species
table(irisCluster$cluster, iris$Species)
Plotting the dataset to view the clusters
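The original snippet is not reproduced in these notes; one plausible version (an
assumption) colours the points by the k-means cluster assignment instead of by
species:
ggplot(iris, aes(Petal.Length, Petal.Width, color = factor(irisCluster$cluster))) + geom_point()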
Hierarchical Clustering in R
library(datasets)
head(iris)
Visualizing the data
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
Calculating the distance and plotting the dendrogram
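The code for this step is not shown in the notes; a minimal sketch (using columns
3:4 and average linkage, which are assumptions chosen to match the earlier k-means
example) would be:
distances <- dist(iris[, 3:4], method = "euclidean")   # pairwise Euclidean distances
clusters <- hclust(distances, method = "average")      # agglomerative clustering
plot(clusters)                                          # dendrogram
clusterCut <- cutree(clusters, 3)                       # cut the tree into 3 clusters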
table(clusterCut, iris$Species)
Visualizing the clusters
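Again, the original snippet is not shown; a plausible version (an assumption) mirrors
the earlier scatter plot but colours the points by the cut clusters:
ggplot(iris, aes(Petal.Length, Petal.Width, color = factor(clusterCut))) + geom_point()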