Cluster Analysis
Session Agenda
• Intro to Cluster Analysis
• Cluster Analysis – Objectives
• Steps in Cluster Analysis
• K-Means
Size, shape, color, calyx
Cluster Analysis
• Unsupervised
• Aims to partition a data set into clusters such that observations
within a cluster are similar to one another and as dissimilar as
possible from observations in other clusters.
• Inter-cluster (between-groups) distance is maximized and
intra-cluster (within-group) distance is minimized.
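The two quantities in the last bullet can be made concrete with a small sketch. The points, the helper names `centroid` and `mean_within_distance`, and the choice of "average distance to the centroid" as the within-cluster measure are all assumptions made for illustration; other definitions of within- and between-cluster distance are also in use.

```python
import math

def centroid(cluster):
    """Mean of each variable across the observations in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def mean_within_distance(cluster):
    """Average Euclidean distance from each member to its cluster centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) for p in cluster) / len(cluster)

# Hypothetical, well-separated clusters (illustrative data only).
cluster_1 = [(1, 1), (1, 3), (3, 1), (3, 3)]
cluster_2 = [(9, 9), (9, 11), (11, 9), (11, 11)]

within_1 = mean_within_distance(cluster_1)
within_2 = mean_within_distance(cluster_2)
between = math.dist(centroid(cluster_1), centroid(cluster_2))

print("within cluster 1:", round(within_1, 3))
print("within cluster 2:", round(within_2, 3))
print("between centroids:", round(between, 3))
```

A good partition shows small within-cluster values relative to the distance between centroids.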
Cluster analysis
• Formulate the problem
• Select a distance measure
• Select a clustering procedure
• Decide on the number of clusters
• Interpret and profile clusters
• Assess the validity of clustering
Chateau Winery (A): Unsupervised Learning
• Bill Booth – a marketing manager for a small wine-seller, Chateau
• Online, discounted promotional platform to increase sales and cash flow.
• Business decision: Mail promotions based on consumer preferences
Cluster analysis
• Formulate the problem:
  • Mail promotions based on consumer preferences.
  • Variables: count of Pinot Noir, count of Champagne
• Select a distance measure
  • Euclidean distance
• Select a clustering procedure
  • K-means
• Decide on the number of clusters
  • Assume K=2
• Interpret and profile clusters:
  • Meaning of the clusters: what is the nature of cluster 1?
• Assess the validity of clustering: is the result useful to the decision?
[Figure: scatter plot of Champagne count (y-axis, 0 to 40) against Pinot Noir count (x-axis, 0 to 14); labelled points: Ebeling, England, Soule, Rantoul]
Measuring similarity between observations
• Euclidean distance: Most common method to measure
dissimilarity between observations, when observations include
continuous variables.
• Let observations u = (u1, u2, . . . , uq) and v = (v1, v2, . . . , vq)
each comprise measurements of q variables.
• The Euclidean distance between observations u and v is
• $d_{u,v} = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_q - v_q)^2}$
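As a minimal sketch of the definition above, the distance can be computed directly from the q paired measurements. The function name `euclidean_distance` and the two customers are hypothetical and are not taken from the case data.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two observations measured on the same q variables."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

# Hypothetical customers: (Pinot Noir count, Champagne count).
customer_a = (3, 4)
customer_b = (0, 0)

print(euclidean_distance(customer_a, customer_b))  # 5.0
```

Because the variables enter as raw differences, a variable with a larger scale dominates the distance, so variables are often standardised before clustering.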
Steps in K-means
1. Initialisation: select the number of clusters, k.
2. Pick k observations (seeds) as the initial centroids of the k
clusters. Seeds are picked randomly.
3. Calculate the Euclidean distance of each object in the dataset
from each centroid.
4. Allocate each object to the nearest cluster based on the
computed distances.
5. Compute the new centroids as the mean of the objects in each cluster.
6. Check whether the stopping criterion has been met (cluster
assignments remain unchanged). If not, go to step 3.
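The six steps above can be sketched in plain Python. This is an illustrative version under stated assumptions, not a production implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the rule of keeping the old centroid when a cluster becomes empty, and the data are all hypothetical.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 2: pick k distinct observations at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3-4: assign each observation to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 6: stop once no assignment has changed.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 5: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical customers: (Pinot Noir count, Champagne count).
points = [(1, 2), (2, 1), (2, 3), (10, 30), (11, 32), (12, 31)]
labels, centroids = kmeans(points, k=2)  # Step 1: k is chosen up front
print(labels)
print(centroids)
```

Because the seeds are random, different values of `seed` can lead to different final clusters on less clearly separated data; running the procedure several times and keeping the best result is a common safeguard.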
Elbow Plot
https://uc-r.github.io/kmeans_clustering
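An elbow plot charts the total within-cluster sum of squares (WCSS) against the number of clusters k; the "elbow" is the value of k after which further clusters reduce WCSS only slightly. The sketch below computes the values that such a plot would show, without drawing it. The helper `kmeans_wcss`, the use of ten random restarts per k, and the data are assumptions made for illustration.

```python
import math
import random

def kmeans_wcss(points, k, seed, max_iter=100):
    """Run plain k-means once and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Sum of squared distances from each point to its own cluster centroid.
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical customers: (Pinot Noir count, Champagne count).
points = [(1, 2), (2, 1), (2, 3), (10, 30), (11, 32), (12, 31)]

# For each k, keep the best (lowest) WCSS over several random starts.
wcss_by_k = {
    k: min(kmeans_wcss(points, k, seed) for seed in range(10))
    for k in range(1, 5)
}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2))
```

On this toy data the WCSS falls sharply from k=1 to k=2 and only marginally afterwards, which is the pattern an elbow plot is meant to reveal.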
Silhouette Width
a(i): the average distance between observation i and all other
observations within the same cluster
b(i): the lowest average distance from i to all observations
in any other cluster, i.e. any cluster of which i is not a member
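The standard silhouette width combines the two quantities as s(i) = (b(i) - a(i)) / max(a(i), b(i)), which lies between -1 and 1; values near 1 indicate that i sits well inside its own cluster. The sketch below is a minimal illustration under assumptions: the function name `silhouette_widths` is hypothetical, singleton clusters are given a width of 0 by convention, at least two clusters are required, and the data are made up.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention for a cluster with a single member
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so only the divisor changes).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b))
    return widths

# Hypothetical customers: (Pinot Noir count, Champagne count).
points = [(1, 2), (2, 1), (2, 3), (10, 30), (11, 32), (12, 31)]
good = silhouette_widths(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_widths(points, [0, 0, 1, 1, 1, 1])  # third point misassigned

print([round(s, 2) for s in good])
print([round(s, 2) for s in bad])
```

Averaging s(i) over all observations gives a single score per choice of k, which can be compared across candidate values of k.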
Cluster analysis
• Formulate the problem:
  • Mail promotions based on consumer preferences.
  • Variables: count of Pinot Noir, count of Champagne
• Select a distance measure
  • Euclidean distance
• Select a clustering procedure
  • K-means
• Decide on the number of clusters
  • Assume K=2
• Interpret and profile clusters:
  • Meaning of the clusters: what is the nature of each cluster? Look at the centroid.
• Assess the validity of clustering: is the result useful to the decision?
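Profiling by centroid amounts to reporting, for each cluster, its size and the mean of every variable. The sketch below shows one way to do this; the function name `profile_clusters`, the cluster labels, and the data are hypothetical and do not come from the case.

```python
from collections import defaultdict

def profile_clusters(points, labels, variable_names):
    """Return the size and per-variable mean (centroid) of each cluster."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    profile = {}
    for lab, members in sorted(groups.items()):
        means = [sum(col) / len(members) for col in zip(*members)]
        profile[lab] = {"size": len(members), **dict(zip(variable_names, means))}
    return profile

# Hypothetical customers: (Pinot Noir count, Champagne count), with K=2 labels.
points = [(1, 2), (2, 1), (2, 3), (10, 30), (11, 32), (12, 31)]
labels = [0, 0, 0, 1, 1, 1]

profile = profile_clusters(points, labels, ["Pinot Noir", "Champagne"])
for lab, summary in profile.items():
    print(lab, summary)
```

Reading the centroids side by side suggests a description for each cluster (for example, light buyers versus heavy Champagne buyers), which can then be judged against the mailing decision.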