Introduction To Unsupervised Learning: Clustering
Clustering:
Clustering is a technique for finding similarity groups in data, called clusters. That is,
o it groups data instances that are similar to (near) each other into one cluster and data
instances that are very different (far away) from each other into different clusters.
Clustering is often called an unsupervised learning task because no class values denoting an a priori
grouping of the data instances are given, as they are in supervised learning.
For historical reasons, clustering is often considered synonymous with unsupervised learning.
o In fact, association rule mining is also unsupervised.
For example, a data set may have three natural groups of data points, i.e., three natural clusters.
Video:
1. Unsupervised Learning | Clustering and Association Algorithms: URL:
https://www.youtube.com/watch?v=UhVn2WrzMnI
K-means clustering
K-means is a partitional clustering algorithm: it partitions the given data into k clusters, where the number of clusters, k, is specified by the user.
K-means algorithm
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, repeat from step 2.
Here Cj denotes the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points
in Cj), and dist(x, mj) is the distance between data point x and centroid mj.
An example distance function is the Euclidean distance, dist(x, mj) = sqrt((x1 - mj1)^2 + (x2 - mj2)^2 + ... + (xr - mjr)^2), where r is the number of attributes.
K-means can also be implemented in a disk-based fashion, so that it can cluster large datasets that do not fit in main memory.
o This is not the best method for large data. There are other scale-up algorithms, e.g., BIRCH.
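As a rough illustration of steps 1-4 above, here is a minimal Python sketch of the algorithm (assuming NumPy is available; the function name kmeans and its parameters are illustrative, not from any particular library):

import numpy as np

def kmeans(X, k, init=None, max_iters=100, seed=0):
    # X is an (n, d) array of data points; k is the number of clusters.
    rng = np.random.default_rng(seed)
    # 1) Use the given initial centroids, or randomly choose k data points (seeds).
    if init is not None:
        centroids = np.asarray(init, dtype=float)
    else:
        centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2) Assign each data point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Re-compute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # 4) Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Tiny made-up usage example: four points forming two obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = kmeans(X, k=2, init=[[1.0, 1.0], [5.0, 5.0]])
print(labels)     # e.g. [0 0 1 1]
print(centroids)  # roughly (1.1, 0.9) and (5.1, 4.9)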
Strengths of k-means
Strengths:
o Simple: easy to understand and to implement.
o Efficient: its time complexity is roughly linear in the number of data points.
Note that it terminates at a local optimum if SSE (the sum of squared errors) is used as the objective function. The global
optimum is hard to find due to complexity.
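For reference, the objective that k-means minimizes can be written (in LaTeX notation, with the clusters Cj and centroids mj defined above) as:

SSE = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2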
Weaknesses of k-means
The algorithm is only applicable if the mean is defined.
o For categorical data, k-modes can be used instead: the centroid is represented by the most frequent value of each attribute.
The algorithm is sensitive to outliers.
o Outliers are data points that are very far away from other data points.
o Outliers could be errors in the data recording or some special data points with very
different values.
One method is to remove some data points in the clustering process that are much further away
from the centroids than other data points.
o To be safe, we may want to monitor these possible outliers over a few iterations and
then decide to remove them.
Another method is to perform random sampling. Since in sampling we only choose a small
subset of the data points, the chance of selecting an outlier is very small.
o Assign the rest of the data points to the clusters by distance or similarity comparison, or
classification
K-means summary
Despite its weaknesses, k-means is still the most popular clustering algorithm due to its simplicity and efficiency.
o Other clustering algorithms have their own lists of weaknesses.
No clear evidence that any other clustering algorithm performs better in general
o although they may be more suitable for some specific types of data or applications.
Comparing different clustering algorithms is a difficult task. No one knows the correct
clusters!
Example:
As a simple illustration of a k-means algorithm, consider the following data set consisting of the scores
of two variables on each of seven individuals:
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let
the A and B values of the two individuals furthest apart (using the Euclidean distance measure) define the
initial cluster means, giving:
          Individual   Mean Vector (centroid)
Group 1   1            (1.0, 1.0)
Group 2   4            (5.0, 7.0)
The remaining individuals are now examined in sequence and allocated to the cluster to which they are
closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a
new member is added. This leads to the following series of steps:
        Cluster 1                                Cluster 2
Step    Individual   Mean Vector (centroid)      Individual   Mean Vector (centroid)
1       1            (1.0, 1.0)                  4            (5.0, 7.0)
2       1, 2         (1.2, 1.5)                  4            (5.0, 7.0)
3       1, 2, 3      (1.8, 2.3)                  4            (5.0, 7.0)
4       1, 2, 3      (1.8, 2.3)                  4, 5         (4.2, 6.0)
5       1, 2, 3      (1.8, 2.3)                  4, 5, 6      (4.3, 5.7)
6       1, 2, 3      (1.8, 2.3)                  4, 5, 6, 7   (4.1, 5.4)
Now the initial partition has changed, and the two clusters at this stage have the following
characteristics:
            Individual    Mean Vector (centroid)
Cluster 1   1, 2, 3       (1.8, 2.3)
Cluster 2   4, 5, 6, 7    (4.1, 5.4)
But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare
each individual’s distance to its own cluster mean and to
that of the opposite cluster. And we find:
Individual   Distance to mean (centroid) of Cluster 1   Distance to mean (centroid) of Cluster 2
1            1.5                                        5.4
2            0.4                                        4.3
3            2.1                                        1.8
4            5.7                                        1.8
5            3.2                                        0.7
6            3.8                                        0.6
7            2.8                                        1.1
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own (Cluster 1). In
other words, each individual's distance to its own cluster mean should be smaller than its distance to
the other cluster's mean (which is not the case for individual 3). Thus, individual 3 is relocated to
Cluster 2, resulting in the new partition:
            Individual       Mean Vector (centroid)
Cluster 1   1, 2             (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7    (3.9, 5.1)
The iterative relocation would now continue from this new partition until no more relocations occur.
However, in this example each individual is now nearer its own cluster mean than that of the other
cluster and the iteration stops, choosing the latest partitioning as the final cluster solution.
Also, it is possible that the k-means algorithm won't find a final solution. In this case it would be a good
idea to stop the algorithm after a pre-chosen maximum number of iterations.
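The worked example above can be checked with a short Python sketch. The following is a minimal illustration using scikit-learn (assuming it is installed), with the initial centroids fixed to subjects 1 and 4 as in the text. Note that scikit-learn recomputes the centroids once per pass rather than after every single assignment, but it should reach the same final partition {1, 2} and {3, 4, 5, 6, 7}:

import numpy as np
from sklearn.cluster import KMeans

# Scores of variables A and B for the seven subjects.
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

# Start from the two individuals furthest apart (subjects 1 and 4).
init_centroids = np.array([[1.0, 1.0], [5.0, 7.0]])

km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)

print("Cluster label of each subject:", km.labels_)   # expected: subjects 1-2 in one cluster, 3-7 in the other
print("Final centroids:", km.cluster_centers_)        # expected: about (1.25, 1.5) and (3.9, 5.1)

The centroid (1.25, 1.5) corresponds to the (1.3, 1.5) shown above, which is rounded to one decimal place.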
Example 2:
Suppose we want to group the visitors to a website using just their age (one-dimensional space) as follows:
n = 19
15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65
Initial centroids (e.g., chosen at random):
k = 2
c1 = 16
c2 = 22
Iteration 1 (centroids used: c1 = 16, c2 = 22):
xi    Distance to c1   Distance to c2   Nearest cluster
15    1                7                1
15    1                7                1
16    0                6                1
19    3                3                2
19    3                3                2
20    4                2                2
20    4                2                2
21    5                1                2
22    6                0                2
28    12               6                2
35    19               13               2
40    24               18               2
41    25               19               2
42    26               20               2
43    27               21               2
44    28               22               2
60    44               38               2
61    45               39               2
65    49               43               2
(Age 19 is equally distant from both centroids; here it is assigned to cluster 2.)
New centroids: c1 = 15.33, c2 = 36.25
Iteration 2 (centroids used: c1 = 15.33, c2 = 36.25):
xi    Distance to c1   Distance to c2   Nearest cluster
15    0.33             21.25            1
15    0.33             21.25            1
16    0.67             20.25            1
19    3.67             17.25            1
19    3.67             17.25            1
20    4.67             16.25            1
20    4.67             16.25            1
21    5.67             15.25            1
22    6.67             14.25            1
28    12.67            8.25             2
35    19.67            1.25             2
40    24.67            3.75             2
41    25.67            4.75             2
42    26.67            5.75             2
43    27.67            6.75             2
44    28.67            7.75             2
60    44.67            23.75            2
61    45.67            24.75            2
65    49.67            28.75            2
New centroids: c1 = 18.56, c2 = 45.90
Iteration 3 (centroids used: c1 = 18.56, c2 = 45.90):
xi    Distance to c1   Distance to c2   Nearest cluster
15    3.56             30.90            1
15    3.56             30.90            1
16    2.56             29.90            1
19    0.44             26.90            1
19    0.44             26.90            1
20    1.44             25.90            1
20    1.44             25.90            1
21    2.44             24.90            1
22    3.44             23.90            1
28    9.44             17.90            1
35    16.44            10.90            2
40    21.44            5.90             2
41    22.44            4.90             2
42    23.44            3.90             2
43    24.44            2.90             2
44    25.44            1.90             2
60    41.44            14.10            2
61    42.44            15.10            2
65    46.44            19.10            2
New centroids: c1 = 19.50, c2 = 47.89
Iteration 4 (centroids used: c1 = 19.50, c2 = 47.89):
xi    Distance to c1   Distance to c2   Nearest cluster
15    4.50             32.89            1
15    4.50             32.89            1
16    3.50             31.89            1
19    0.50             28.89            1
19    0.50             28.89            1
20    0.50             27.89            1
20    0.50             27.89            1
21    1.50             26.89            1
22    2.50             25.89            1
28    8.50             19.89            1
35    15.50            12.89            2
40    20.50            7.89             2
41    21.50            6.89             2
42    22.50            5.89             2
43    23.50            4.89             2
44    24.50            3.89             2
60    40.50            12.11            2
61    41.50            13.11            2
65    45.50            17.11            2
New centroids: c1 = 19.50, c2 = 47.89 (unchanged)
The centroids (and cluster memberships) do not change between iterations 3 and 4, so the algorithm stops. By using clustering, two groups have been
identified: ages 15-28 and ages 35-65. The initial choice of centroids can affect the output clusters, so the algorithm is often run multiple times with
different starting conditions in order to get a fair view of what the clusters should be.
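The run above can be reproduced with a minimal Python sketch. It mirrors the tables, including the assumption that a point equally distant from both centroids (age 19 in iteration 1) is assigned to cluster 2:

ages = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]
c1, c2 = 16.0, 22.0  # initial centroids

for iteration in range(1, 11):
    # Assign each age to the nearest centroid; ties go to cluster 2, as in the tables.
    cluster1 = [x for x in ages if abs(x - c1) < abs(x - c2)]
    cluster2 = [x for x in ages if abs(x - c1) >= abs(x - c2)]
    # Re-compute the centroids as the means of the two groups.
    new_c1 = sum(cluster1) / len(cluster1)
    new_c2 = sum(cluster2) / len(cluster2)
    print(f"Iteration {iteration}: c1 = {new_c1:.2f}, c2 = {new_c2:.2f}")
    if (new_c1, new_c2) == (c1, c2):  # centroids unchanged => converged
        break
    c1, c2 = new_c1, new_c2

This prints the same centroid sequence as above: 15.33/36.25, 18.56/45.90, 19.50/47.89, and then 19.50/47.89 again, at which point the loop stops.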
Video Tutorials
1. K Means Clustering Algorithm: URL: https://www.youtube.com/watch?v=1XqG0kaJVHY&feature=emb_logo
2. K Means Clustering Algorithm: URL: https://www.youtube.com/watch?v=EItlUEPCIzM
Hierarchical Clustering
Hierarchical clustering is another unsupervised learning algorithm that is used to group together the
unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into the following
two categories.
Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point starts in its
own cluster, and pairs of clusters are repeatedly merged (bottom-up approach) until a single cluster remains.
Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data
points are treated as one big cluster and the process of clustering involves dividing (top-down approach)
the one big cluster into various smaller clusters.
Agglomerative clustering
The agglomerative (bottom-up) clustering method works as follows:
1) Assign each observation to its own cluster.
2) Compute the similarity (e.g., the distance) between each pair of clusters.
3) Join the two most similar (closest) clusters.
4) Repeat steps 2 and 3 until there is only a single cluster left.
Before any clustering is performed, it is required to determine the proximity matrix containing the distance between each pair of points, using
a distance function. Then, as clusters are merged, the matrix is updated to show the distance between each pair of clusters. The following methods differ in
how the distance between two clusters is measured:
o Single link
o Complete link
o Average link
o Centroids
o …
Centroid method:
In this method, the distance between two clusters is the distance between their centroids
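To make the difference between these cluster-distance definitions concrete, here is a small Python sketch (the two clusters and their points are made up purely for illustration, and NumPy/SciPy are assumed to be available):

import numpy as np
from scipy.spatial.distance import cdist

# Two small, made-up clusters of 2-D points.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 3.0]])

pairwise = cdist(A, B)  # all point-to-point distances between A and B

print("single link  :", pairwise.min())    # shortest point-to-point distance
print("complete link:", pairwise.max())    # longest point-to-point distance
print("average link :", pairwise.mean())   # average of all point-to-point distances
print("centroid     :", np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))  # distance between the two centroids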
The complexity
All the algorithms are at least O(n²), where n is the number of data points.
o Sampling can be used to scale up to large data sets.
An Example
Let’s now see a simple example: a hierarchical clustering of distances in kilometers between some Italian
cities. The method used is single-linkage.
      BA     FI     MI     NA     RM     TO
BA    0      662    877    255    412    996
FI    662    0      295    468    268    400
MI    877    295    0      754    564    138
NA    255    468    754    0      219    869
RM    412    268    564    219    0      669
TO    996    400    138    869    669    0
The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called
"MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.
Then we compute the distance from this new compound object to all other objects. In single link
clustering the rule is that the distance from the compound object to another object is equal to the
shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to
RM is chosen to be 564, which is the distance from MI to RM, and so on.
          BA     FI     MI/TO   NA     RM
BA        0      662    877     255    412
FI        662    0      295     468    268
MI/TO     877    295    0       754    564
NA        255    468    754     0      219
RM        412    268    564     219    0
min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m = 2
          BA     FI     MI/TO   NA/RM
BA        0      662    877     255
FI        662    0      295     268
MI/TO     877    295    0       564
NA/RM     255    268    564     0
min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m=3
            BA/NA/RM   FI     MI/TO
BA/NA/RM    0          268    564
FI          268        0      295
MI/TO       564        295    0
min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m=4
              BA/FI/NA/RM   MI/TO
BA/FI/NA/RM   0             295
MI/TO         295           0
min d(i,j) = d(BA/FI/NA/RM,MI/TO) = 295 => merge the last two clusters into a single cluster BA/FI/NA/RM/MI/TO
L(BA/FI/NA/RM/MI/TO) = 295
m = 5
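The whole single-link run on this distance matrix can be reproduced with a short Python sketch using SciPy (assuming it is available); the merge distances should come out as 138, 219, 255, 268 and 295, matching the steps above:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)

# linkage() expects a condensed distance vector, so convert the square matrix first.
Z = linkage(squareform(D), method="single")

def name(idx):
    # Leaves 0..5 are the original cities; indices 6 and above are clusters created by earlier merges.
    return cities[int(idx)] if idx < len(cities) else f"cluster {int(idx)}"

# Each row of Z describes one merge: (cluster i, cluster j, merge distance, size of the new cluster).
for i, j, dist, size in Z:
    print(f"merge {name(i)} and {name(j)} at distance {dist:.0f} (new cluster size {int(size)})")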