
Data Mining

Lecture 07
Dr. Waqas Haider Khan Bangyal
Data Mining Overview
Data Mining
Data warehouses and OLAP (On-Line Analytical Processing)
Clustering: hierarchical and partitional approaches
Classification: decision trees, ANNs, Bayesian classifiers, and genetic algorithms (GA)
Association Rules Mining
Advanced topics: outlier detection, web mining
What is a natural grouping among these objects?

Clustering is subjective.

(Slide figure: the same set of people grouped in different ways - as a family, as school employees, as females, and as males.)
What is Clustering?

Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
• Organizing data into classes such that there is
• high intra-class similarity
• low inter-class similarity
• Finding the class labels and the number of classes directly from the data (in contrast to classification).
• More informally, finding natural groupings among objects.
What is Clustering?

Clustering is the classification of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.
What is Clustering?

You can call this "unsupervised classification".
Clustering is alternatively called "grouping".
Intuitively, we want to assign the same label to data points that are "close" to each other.
Thus, clustering algorithms rely on a distance metric between data points.
It is sometimes said that for clustering, the distance metric is more important than the clustering algorithm.
Idea and Applications
Clustering is the process of grouping a set of physical or
abstract objects into classes of similar objects.
It is also called unsupervised learning.
It is a common and important task that finds many
applications.
Applications in Search engines:
Structuring search results
Suggesting related pages
Automatic directory construction/update
Finding near identical/duplicate pages
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined
classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Desirable Properties of a Clustering Algorithm

• Scalability (in terms of both time and space)
• Ability to deal with different data types
• Minimal requirements for domain knowledge to determine
input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• Incorporation of user-specified constraints
• Interpretability and usability
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to
their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Concepts in Clustering

Defining distance between points.

A good clustering is one where:

 (Intra-cluster distance) the sum of distances between objects in the same cluster is minimized,

 (Inter-cluster distance) while the distances between different clusters are maximized.

 Objective to minimize: F(Intra, Inter) (see the sketch below)
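To make this concrete, here is a minimal numeric sketch of intra- and inter-cluster distances for a toy clustering. The data values and the particular combination F = Intra / Inter are illustrative assumptions, since the lecture leaves F unspecified:

```python
import numpy as np

# Toy data: two clusters in the plane
clusters = [
    np.array([[1.0, 1.0], [2.0, 1.0]]),   # cluster 1
    np.array([[4.0, 3.0], [5.0, 4.0]]),   # cluster 2
]

# Intra-cluster distance: sum of distances between objects in the same cluster
intra = sum(
    np.linalg.norm(c[i] - c[j])
    for c in clusters
    for i in range(len(c))
    for j in range(i + 1, len(c))
)

# Inter-cluster distance: distance between the two cluster centroids
centroids = [c.mean(axis=0) for c in clusters]
inter = np.linalg.norm(centroids[0] - centroids[1])

# One possible objective F(Intra, Inter): minimize the ratio
F = intra / inter
print(intra, inter, F)
```

A good clustering makes `intra` small and `inter` large, so any F that decreases in Inter and increases in Intra (such as this ratio) rewards both properties at once.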


What Is Good Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Types of Clustering
1. Hierarchical algorithms: these find successive clusters using previously established clusters.
   1. Agglomerative ("bottom-up"): agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters.
   2. Divisive ("top-down"): divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
2. Partitional clustering: partitional algorithms determine all clusters at once (a small agglomerative example is sketched below). They include:
    K-means and derivatives
    Fuzzy c-means clustering
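As a concrete illustration of the agglomerative ("bottom-up") case, the following minimal sketch uses SciPy's hierarchical clustering on four toy points; the data values and the "average" linkage choice are illustrative assumptions, not part of the lecture:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four toy points in the plane
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)

# Agglomerative clustering: start from singletons, merge successively
Z = linkage(X, method="average")

# Cut the resulting merge tree into 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # [1 1 2 2]: the two lower-left points vs. the two upper-right ones
```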
Classical clustering methods

Partitioning methods
k-Means (and EM), k-Medoids
Hierarchical methods
agglomerative, divisive, BIRCH
Model-based clustering methods
K-MEANS CLUSTERING
Simply speaking, k-means clustering is an algorithm to classify or group objects into K groups based on their attributes/features.
K is a positive integer.
The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroids.
Common Distance Measures
The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.
Common measures include:
1. The Euclidean distance (also called 2-norm distance): d(a, b) = √( Σi (ai − bi)² )
2. The Manhattan distance (also called taxicab norm or 1-norm): d(a, b) = Σi |ai − bi|
3. The maximum norm: d(a, b) = maxi |ai − bi|
4. The Mahalanobis distance corrects the data for different scales and correlations in the variables.
5. Inner product space: the angle between two vectors can be used as a distance measure when clustering high-dimensional data.
6. Hamming distance (sometimes edit distance) measures the minimum number of substitutions required to change one member into another.
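A minimal NumPy sketch of several of these measures (the function names and example values are illustrative; the Hamming example assumes two equal-length strings):

```python
import numpy as np

def euclidean(a, b):
    # 2-norm: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # 1-norm / taxicab: sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def max_norm(a, b):
    # infinity-norm: largest absolute coordinate difference
    return np.max(np.abs(a - b))

def cosine_angle(a, b):
    # angle between two vectors, usable as a dissimilarity in high dimensions
    return np.arccos(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hamming(a, b):
    # number of positions at which two equal-length sequences differ
    return sum(x != y for x, y in zip(a, b))

a, b = np.array([1.0, 1.0]), np.array([4.0, 3.0])
print(euclidean(a, b))                # 3.605...
print(manhattan(a, b))                # 5.0
print(max_norm(a, b))                 # 3.0
print(hamming("karolin", "kathrin"))  # 3
```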
K-means
Works when we know k, the number of clusters we want to find.
Idea:
Randomly pick k points as the "centroids" of the k clusters.
Loop:
 For each point, put the point in the cluster whose centroid it is closest to.
 Recompute the cluster centroids.
 Repeat the loop until there is no change in clusters between two consecutive iterations.

This is an iterative improvement of the objective function: the sum of the squared distances from each point to the centroid of its cluster.
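A minimal sketch of this loop in Python/NumPy. The function name, the random seeding, and the empty-cluster guard are implementation choices, not prescribed by the lecture:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm: a minimal sketch of the loop described above."""
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no change in clusters between two consecutive iterations
        labels = new_labels
        # Update step: recompute each centroid as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

For the 4-medicine example worked later in this lecture, `kmeans(X, 2)` recovers the grouping {A, B} and {C, D}.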
How does the K-Means clustering algorithm work?
Step 1: Begin with a decision on the value of k = number of clusters.
Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N − k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
How does the K-Means clustering algorithm work?
Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroids of both the cluster gaining the new sample and the cluster losing the sample.
Step 4: Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
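Steps 1-4 describe a sequential (sample-by-sample) variant of k-means. Here is a hedged sketch of it; the incremental centroid-update formulas and the guard against emptying a cluster are my own implementation choices:

```python
import numpy as np

def kmeans_sequential(X, k, max_passes=100):
    X = np.asarray(X, dtype=float)
    labels = np.empty(len(X), dtype=int)
    # Step 2.1: take the first k samples as single-element clusters
    labels[:k] = np.arange(k)
    centroids = X[:k].copy()
    counts = np.ones(k)
    # Step 2.2: assign the remaining N-k samples to the nearest centroid,
    # recomputing the gaining centroid after each assignment
    for i in range(k, len(X)):
        j = np.linalg.norm(centroids - X[i], axis=1).argmin()
        labels[i] = j
        counts[j] += 1
        centroids[j] += (X[i] - centroids[j]) / counts[j]  # incremental mean
    # Steps 3-4: keep switching samples until a full pass changes nothing
    for _ in range(max_passes):
        changed = False
        for i in range(len(X)):
            j = np.linalg.norm(centroids - X[i], axis=1).argmin()
            old = labels[i]
            if j != old and counts[old] > 1:  # don't empty a cluster
                # update both the losing and the gaining centroid
                centroids[old] = (centroids[old] * counts[old] - X[i]) / (counts[old] - 1)
                counts[old] -= 1
                labels[i] = j
                counts[j] += 1
                centroids[j] += (X[i] - centroids[j]) / counts[j]
                changed = True
        if not changed:
            break
    return labels, centroids
```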
K-means Overview
An unsupervised clustering algorithm
"K" stands for the number of clusters; it is typically a user input to the algorithm, though some criteria can be used to estimate K automatically.
It is an approximation to an NP-hard combinatorial optimization problem.
The k-means algorithm is iterative in nature.
It converges; however, only a local minimum is obtained.
Works only for numerical data
Easy to implement
Weaknesses of K-Means Clustering
1. When the number of data points is small, the initial grouping determines the clusters significantly.
2. The number of clusters, K, must be determined beforehand. A disadvantage is that the algorithm does not yield the same result on each run, since the resulting clusters depend on the initial random assignments.
3. We never know the true clusters: with the same data, presenting the points in a different order may produce different clusters when the data set is small.
4. It is sensitive to initial conditions; different initial conditions may produce different clusterings, and the algorithm may be trapped in a local optimum.
Applications of K-Means Clustering
It is relatively efficient and fast: it runs in O(tkn), where n is the number of objects or points, k is the number of clusters, and t is the number of iterations.
k-means clustering can be applied to machine learning or data mining.
It is used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization).
It is also used for choosing color palettes on old-fashioned graphical display devices, and for image segmentation and image quantization.
CONCLUSION
The k-means algorithm is useful for undirected knowledge discovery and is relatively simple. K-means has found widespread use in many fields, ranging from unsupervised learning in neural networks to pattern recognition, classification analysis, artificial intelligence, image processing, machine vision, and many others.
Real-Life Numerical Example of K-Means Clustering
We have 4 medicines as our training data points, and each medicine has 2 attributes. Each attribute represents a coordinate of the object. We have to determine which medicines belong to cluster 1 and which medicines belong to the other cluster.

Object       Attribute 1 (X): weight index   Attribute 2 (Y): pH
Medicine A   1                               1
Medicine B   2                               1
Medicine C   4                               3
Medicine D   5                               4
Step 1: Initial value of centroids: suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and c2 = (2, 1).
Objects-centroids distances: we calculate the distance from each cluster centroid to each object.
Let us use the Manhattan distance, ρ(a, b) = |x2 – x1| + |y2 – y1|.
The distance matrix at iteration 0 is filled into the table below:

Object   Distance to Centroid-1 (1,1)   Distance to Centroid-2 (2,1)   Cluster
(1,1)    0                              1                              Cluster-1
(2,1)    1                              0                              Cluster-2
(4,3)    5                              4                              Cluster-2
(5,4)    7                              6                              Cluster-2

For example, for the point (1, 1):

Distance to mean1 = (1, 1):
ρ(point, mean1) = |x2 – x1| + |y2 – y1| = |1 – 1| + |1 – 1| = 0 + 0 = 0

Distance to mean2 = (2, 1):
ρ(point, mean2) = |x2 – x1| + |y2 – y1| = |2 – 1| + |1 – 1| = 1 + 0 = 1

So, which cluster should the point (1, 1) be placed in? The one where the point has the shortest distance to the mean – that is mean1 (cluster 1), since that distance is 0.
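The whole distance matrix at iteration 0 can be reproduced with a few lines of Python, as a small check of the table above:

```python
import numpy as np

points = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])
centroids = np.array([[1, 1], [2, 1]])  # c1 = medicine A, c2 = medicine B

for p in points:
    # Manhattan distance rho(a, b) = |x2 - x1| + |y2 - y1| to each centroid
    d = [abs(p[0] - c[0]) + abs(p[1] - c[1]) for c in centroids]
    print(p, d, "-> Cluster", int(np.argmin(d)) + 1)
```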
Step 2: Objects clustering: we assign each object based on the minimum distance. Medicine A is assigned to group 1, medicine B to group 2, medicine C to group 2, and medicine D to group 2. Each element of the group matrix is 1 if and only if the object is assigned to that group.
Iteration 2, objects-centroids distances: the next step is to compute the distance of all objects to the new centroids.
After the first assignment, the clusters are:
Cluster 1: (1, 1)
Cluster 2: (2, 1), (4, 3), (5, 4)
Next, we re-compute the new cluster centers (means) by taking the mean of all points in each cluster.
For Cluster 1, we only have one point, A (1, 1), which was the old mean, so the cluster center remains the same.
For Cluster 2, we have ((2+4+5)/3, (1+3+4)/3) ≈ (3.6, 2.6).

In Iteration 2, we basically repeat the process from Iteration 1, this time using the new means we computed:

Object   Distance to Centroid-1 (1,1)   Distance to Centroid-2 (3.6,2.6)   Cluster
(1,1)    0                              4.2                                Cluster-1
(2,1)    1                              3.2                                Cluster-1
(4,3)    5                              0.8                                Cluster-2
(5,4)    7                              2.8                                Cluster-2

That was Iteration 2 (epoch 2). Next, we go to Iteration 3 (epoch 3), Iteration 4, and so on until the means do not change anymore.
We get the final grouping as the result:

Object       Feature 1 (X): weight index   Feature 2 (Y): pH   Group (result)
Medicine A   1                             1                   1
Medicine B   2                             1                   1
Medicine C   4                             3                   2
Medicine D   5                             4                   2
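As a closing check, a few lines of Python reproduce this grouping end-to-end. This sketch uses the Euclidean distance of standard k-means rather than the Manhattan distance of the worked example; the final grouping comes out the same:

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
centroids = X[:2].copy()  # start from medicines A and B, as in Step 1

while True:
    # assign each medicine to the nearest centroid (Euclidean distance here)
    labels = np.linalg.norm(X[:, None] - centroids[None, :], axis=2).argmin(axis=1)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centroids, centroids):
        break  # means no longer change: converged
    centroids = new_centroids

print(labels + 1)  # [1 1 2 2]: A and B in group 1, C and D in group 2
print(centroids)   # final centroids (1.5, 1.0) and (4.5, 3.5)
```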
