
Data Mining & Data Warehousing

Unit-5
Clustering
I) Basic concepts:
i) What is clustering?
Clustering is the process of partitioning a set of data objects (or observations) into
subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters. The set of clusters resulting from a cluster analysis can
be referred to as a clustering.
Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, Web search, biology, and security. In business
intelligence, clustering can be used to organize a large number of customers into groups,
where customers within a group share strongly similar characteristics.

ii) Requirements of Clustering:


 Scalability − We need highly scalable clustering algorithms to deal with large databases.
 Ability to deal with different kinds of attributes − Algorithms should be capable of being
applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
 Discovery of clusters with arbitrary shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. It should not be limited to distance
measures that tend to find only spherical clusters of small size.
 High dimensionality − The clustering algorithm should be able to handle not only
low-dimensional data but also high-dimensional data.
 Ability to deal with noisy data − Databases contain noisy, missing, or erroneous
data. Some algorithms are sensitive to such data and may produce poor-quality
clusters.
 Interpretability − The clustering results should be interpretable, comprehensible,
and usable.
II) Clustering Structures:
The following aspects distinguish clustering methods and the kinds of cluster structures they produce:

a) The partitioning criteria: In some methods, all the objects are partitioned so that no
hierarchy exists among the clusters. That is, all the clusters are at the same level conceptually.
Such a method is useful, for example, for partitioning customers into groups so that each
group has its own manager. Alternatively, other methods partition data objects hierarchically,
where clusters can be formed at different semantic levels.
b) Separation of clusters: Some methods partition data objects into mutually exclusive
clusters. When clustering customers into groups so that each group is taken care of by one
manager, each customer may belong to only one group. In some other situations, the clusters
may not be exclusive, that is, a data object may belong to more than one cluster.
c) Similarity measure: Some methods determine the similarity between two objects by the
distance between them. Such a distance can be defined on Euclidean space, a road network, a
vector space, or any other space. In other methods, the similarity may be defined by
connectivity based on density or contiguity, and may not rely on the absolute distance
between two objects.
d) Clustering space: Many clustering methods search for clusters within the entire given
data space. These methods are useful for low-dimensional data sets. With high-dimensional
data, however, there can be many irrelevant attributes, which can make similarity
measurements unreliable.
III) Major clustering Approaches:
Clustering methods can be classified into the following categories −
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
1) Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs
‘k’ partitions of the data. Each partition represents a cluster and k ≤ n. That is, the method
classifies the data into k groups, which satisfy the following requirements −
 Each group contains at least one object.
 Each object must belong to exactly one group.

2) Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects.
There are two approaches here –

 Agglomerative Approach
 Divisive Approach

 Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each
object forming a separate group. The method keeps merging the objects or groups that are close to
one another, and continues doing so until all of the groups are merged into one or until the
termination condition holds.
 Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of
the objects in the same cluster. In each iteration, a cluster is split into smaller
clusters. This continues until each object is in its own cluster or the termination condition holds.

3) Density-based Method
This method is based on the notion of density. The basic idea is to continue growing
the given cluster as long as the density in the neighbourhood exceeds some threshold, i.e.,
for each data point within a given cluster, the neighbourhood of a given radius has to contain
at least a minimum number of points.

4) Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite
number of cells that form a grid structure.

IV) Partitioning Methods:


The simplest and most fundamental version of cluster analysis is partitioning,
which organizes the objects of a set into several exclusive groups or clusters. Formally, given
a data set, D, of n objects, and k, the number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a
cluster. In this section we will learn the most well-known and commonly used partitioning
methods—k-means and k-medoids.
i) k-Means Algorithm: A Centroid based Technique
The k-means algorithm defines the centroid of a cluster as the mean value of the
points within the cluster. It proceeds as follows. First, it randomly selects k of the objects in
D, each of which initially represents a cluster mean or center. For each of the remaining
objects, an object is assigned to the cluster to which it is the most similar, based on the
Euclidean distance between the object and the cluster mean. The k-means algorithm then
iteratively improves the within-cluster variation. For each cluster, it computes the new mean
using the objects assigned to the cluster in the previous iteration. All the objects are then
reassigned using the updated means as the new cluster centers. The iterations continue until
the assignment is stable. The Euclidean distance between objects X = (x1, x2, ..., xn) and
Y = (y1, y2, ..., yn) is dist(X, Y) = sqrt((x1 − y1)^2 + (x2 − y2)^2 + ... + (xn − yn)^2).

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is
represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters; D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based
on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for each
cluster;
(5) until no change;
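
The following is a minimal Python sketch of these steps (the names and the sample data are
illustrative only, and for simplicity it assumes no cluster ever becomes empty):

import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    labels = np.full(len(D), -1)
    for _ in range(max_iter):
        # (3) (re)assign each object to the cluster whose mean is nearest (Euclidean distance)
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # (5) until no change
            break
        labels = new_labels
        # (4) update the cluster means from the objects assigned to each cluster
        centers = np.array([D[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# example: six 2-D points grouped into k = 3 clusters
D = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0], [8.8, 1.2]])
labels, centers = k_means(D, k=3)
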
ii) k-Medoids: A Representative Object-Based Technique
The k-medoids algorithm is a partitional clustering algorithm that is a slight
modification of the k-means algorithm. Both attempt to minimize the squared error, but
the k-medoids algorithm is more robust to noise than the k-means algorithm. In k-means,
the cluster centers are the means of the clusters, whereas in k-medoids actual data points
are chosen to be the medoids. A medoid can be defined as the object of a cluster whose
average dissimilarity to all the objects in the cluster is minimal.
Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or
central objects.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;

(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost, S, of swapping representative object, o_j, with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of k representative objects;
(7) until no change;
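
A simplified Python sketch of this swapping idea is given below (it greedily tries every possible
swap and uses the total distance of objects to their nearest medoid as the cost; a rough
illustration rather than the full PAM algorithm):

import numpy as np

def total_cost(D, medoids):
    # cost = sum of distances from every object to its nearest medoid
    dists = np.linalg.norm(D[:, None, :] - D[medoids][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def k_medoids(D, k, seed=0):
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects in D as the initial representative objects (medoids)
    medoids = list(rng.choice(len(D), size=k, replace=False))
    improved = True
    while improved:                               # (7) until no change
        improved = False
        for j in range(k):
            for o_random in range(len(D)):        # (4) candidate nonrepresentative objects
                if o_random in medoids:
                    continue
                candidate = medoids.copy()
                candidate[j] = o_random
                # (5)-(6) swap if it lowers the total cost
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids, improved = candidate, True
    # (3) assign each remaining object to the cluster with the nearest medoid
    dists = np.linalg.norm(D[:, None, :] - D[medoids][None, :, :], axis=2)
    return dists.argmin(axis=1), medoids
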
V) Hierarchical Methods:
A hierarchical clustering method works by grouping data objects into a
hierarchy or “tree” of clusters. There are two types of hierarchical clustering: agglomerative
and divisive.
i) Agglomerative Hierarchical Clustering:
Agglomerative clustering is the most common type of hierarchical clustering
used to group objects into clusters based on their similarity. It is also known
as AGNES (Agglomerative Nesting). The algorithm starts by treating each object as a
singleton cluster. Next, pairs of clusters are successively merged until all clusters have been
merged into one big cluster containing all objects. The result is a tree-based representation of
the objects, called a dendrogram.

Algorithm
Agglomerative clustering works in a “bottom-up” manner. That is, each object is
initially considered as a single-element cluster (leaf). At each step of the algorithm, the two
clusters that are the most similar are combined into a new bigger cluster (nodes). This
procedure is iterated until all points are members of just one single big cluster (root) (see
figure below).
The inverse of agglomerative clustering is divisive clustering, which is also known as
DIANA (Divisive Analysis) and it works in a “top-down” manner. It begins with the root, in
which all objects are included in a single cluster. At each step of iteration, the most
heterogeneous cluster is divided into two. The process is iterated until all objects are in their
own cluster (see figure below).


Steps to agglomerative hierarchical clustering


We’ll follow the steps below to perform agglomerative hierarchical clustering
using R software:
1. Preparing the data
2. Computing (dis)similarity information between every pair of objects in the data set.
3. Using a linkage function to group objects into a hierarchical cluster tree, based on the
distance information generated at step 2. Objects/clusters that are in close proximity are
linked together using the linkage function (see the sketch after these steps).
4. Determining where to cut the hierarchical tree into clusters. This creates a partition of the
data.
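
The notes mention R software; as an equivalent illustration, the same four steps can be
sketched in Python with SciPy's hierarchical clustering routines (the data, linkage method,
and number of clusters below are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

# 1. prepare the data (a small 2-D example)
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 9.0]])

# 2. compute (dis)similarity information between every pair of objects
d = pdist(X, metric='euclidean')

# 3. use a linkage function to group objects into a hierarchical cluster tree
Z = linkage(d, method='average')

# 4. decide where to cut the tree, which creates a partition of the data
labels = fcluster(Z, t=2, criterion='maxclust')

# the tree itself can be drawn as a dendrogram, e.g. dendrogram(Z)
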
ii) Divisive Hierarchical Clustering:

This is a top-down clustering method and is less commonly used. It works in a similar
way to agglomerative clustering but in the opposite direction. This method starts with a single
cluster containing all objects, and then successively splits resulting clusters until only clusters
of individual objects remain.

iii) BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):

 It is a scalable clustering method.


 Designed for very large data sets
 Only one scan of data is necessary
 It is based on the notion of a CF (Clustering Feature) and a CF tree.
 CF tree is a height balanced tree that stores the clustering features for a hierarchical
clustering.
 A cluster of data points is represented by a triple of numbers (N, LS, SS), where

N = number of points in the subcluster

LS = linear sum of the points

SS = sum of the squares of the points
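
As a small illustration (the class and method names are made up for this sketch), the triple can
be computed directly from the points, two subclusters can be merged by simply adding their
triples, and statistics such as the centroid and radius can be derived from (N, LS, SS) alone:

import numpy as np

class CF:
    # clustering feature (N, LS, SS) of a subcluster of d-dimensional points
    def __init__(self, points):
        points = np.asarray(points, dtype=float)
        self.N = len(points)              # N  = number of points in the subcluster
        self.LS = points.sum(axis=0)      # LS = linear sum of the points
        self.SS = (points ** 2).sum()     # SS = sum of the squares of the points

    def merge(self, other):
        # CF triples are additive, so merging two subclusters is cheap
        merged = CF(np.empty((0, len(self.LS))))
        merged.N, merged.LS, merged.SS = self.N + other.N, self.LS + other.LS, self.SS + other.SS
        return merged

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # radius = sqrt of the average squared distance of the points from the centroid
        c = self.centroid()
        return np.sqrt(max(self.SS / self.N - c @ c, 0.0))
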

A CF Tree structure is given as below:

 Each non-leaf node has at most B entries.


 Each leaf node has at most L CF entries, each of which satisfies the threshold T, a
maximum diameter (or radius).
 P (page size in bytes) is the maximum size of a node.
 Compact: each leaf node is a sub cluster, not a data point.

Basic Algorithm:
 Phase 1: Load data into memory
Scan the DB and load data into memory by building a CF tree. If memory is exhausted,
rebuild the tree from the leaf nodes.
 Phase 2: Condense data
Resize the data set by building a smaller CF tree
Remove more outliers
Condensing is optional
 Phase 3: Global clustering
Use an existing clustering algorithm (e.g., k-means or hierarchical clustering) on the CF entries.


 Phase 4: Cluster refining


Refining is optional.
It fixes the problem with CF trees where data points with the same value may be assigned to
different leaf entries.
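
If scikit-learn is available, BIRCH can also be run directly; a minimal sketch (the parameter
values and data are illustrative only):

import numpy as np
from sklearn.cluster import Birch

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 5.1], [9.0, 1.0], [9.1, 0.8]])

# threshold plays the role of the leaf threshold T, branching_factor limits entries per node,
# and n_clusters drives the global clustering step (Phase 3)
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
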

iv) Distance Measures in Algorithmic Methods:

Whether using an agglomerative method or a divisive method, a core need is to
measure the distance between two clusters, where each cluster is generally a set of objects.
Widely used measures include the minimum distance (single linkage), maximum distance
(complete linkage), mean distance, and average distance between two clusters.


VI) Density-Based Methods:

The main strategy behind density-based clustering methods is to model clusters as dense
regions in the data space, separated by sparser regions; such methods can discover clusters of
nonspherical shape. In this section, you will learn the basic techniques of density-based
clustering by studying two representative methods, namely, DBSCAN and OPTICS.

i) DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

DBSCAN requires two parameters: epsilon (Eps) and minimum points (MinPts). It starts with an
arbitrary starting point that has not been visited. It then finds all the neighbour points within
distance Eps of the starting point.

 If the number of neighbours is greater than or equal to MinPts, a cluster is formed.
The starting point and its neighbours are added to this cluster and the starting point is
marked as visited. The algorithm then repeats the evaluation process for all the
neighbours recursively.
 If the number of neighbours is less than MinPts, the point is marked as noise.

 If a cluster is fully expanded (all points within reach are visited) then the algorithm
proceeds to iterate through the remaining unvisited points in the dataset.

DBSCAN Algorithm

1. Create a graph whose nodes are the points to be clustered.
2. For each core point c, create an edge from c to every point p in the Eps-neighbourhood of c.
3. Set N to the nodes of the graph.
4. If N does not contain any core points, terminate.
5. Pick a core point c in N.
6. Let X be the set of nodes that can be reached from c by going forward;
   a. create a cluster containing X ∪ {c};
   b. N = N \ (X ∪ {c}).
7. Continue with step 4.
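
A compact Python sketch of DBSCAN following the description above (Eps, MinPts, and the
sample points are illustrative):

import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)                      # -1 marks noise / not yet assigned
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.where(dists[i] <= eps)[0] for i in range(n)]
    core = [i for i in range(n) if len(neighbours[i]) >= min_pts]
    cluster_id = 0
    for c in core:
        if labels[c] != -1:                      # already placed in some cluster
            continue
        labels[c] = cluster_id                   # start a new cluster from this core point
        frontier = list(neighbours[c])
        while frontier:
            p = frontier.pop()
            if labels[p] == -1:
                labels[p] = cluster_id
                if len(neighbours[p]) >= min_pts:    # only core points expand the cluster
                    frontier.extend(neighbours[p])
        cluster_id += 1
    return labels

X = np.array([[1.0, 1.0], [1.2, 1.1], [1.1, 0.9], [5.0, 5.0], [5.1, 5.2], [5.2, 4.9], [9.0, 0.0]])
print(dbscan(X, eps=0.5, min_pts=2))             # the isolated last point stays labelled -1 (noise)
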

VII) Clustering-Based Outlier Detection:

The notion of outliers is highly related to that of clusters. Clustering-based approaches
detect outliers by examining the relationship between objects and clusters. Intuitively, an
outlier is an object that belongs to a small and remote cluster, or does not belong to any
cluster.

This leads to three general approaches to clustering-based outlier detection. Consider
an object.
 Does the object belong to any cluster? If not, then it is identified as an outlier.
 Is there a large distance between the object and the cluster to which it is closest? If
yes, it is an outlier.
 Is the object part of a small or sparse cluster? If yes, then all the objects in that cluster
are outliers.
i) Detecting outliers as objects that do not belong to any cluster: Using a density-based
clustering method, such as DBSCAN, we note that the black points belong to clusters. The
white point, a, does not belong to any cluster, and thus is declared an outlier.
The second approach to clustering-based outlier detection considers the distance
between an object and the cluster to which it is closest. If the distance is large, then the object
is likely an outlier with respect to the cluster. Thus, this approach detects individual outliers
with respect to clusters.

ii) Clustering-based outlier detection using distance to the closest cluster: Using the k-
means clustering method, we can partition the data points shown in the figure into three clusters,
as shown using different symbols. The center of each cluster is marked with a C.
For each object, o, we can assign an outlier score to the object according to the distance
between the object and the center that is closest to the object.
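
A hedged sketch of this idea in Python, using k-means centers from scikit-learn and a simple
ratio-based outlier score (the data, the number of clusters, and the cutoff are illustrative):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 4.8],
              [9.0, 1.0], [9.2, 0.8], [6.5, 6.5]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# distance from each object o to the center of the cluster it is closest to
d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# outlier score: distance relative to the average distance inside the same cluster;
# a large score means o is unusually far from its closest cluster
avg = np.array([d[km.labels_ == j].mean() for j in range(3)])
score = d / (avg[km.labels_] + 1e-12)
outliers = np.where(score > 3.0)[0]
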

iii) Intrusion detection by clustering-based outlier detection: The method consists of
three steps:

1. A training data set is used to find patterns of normal data.
2. Connections in the training data that contain base connections are treated as attack-free.
Such connections are clustered into groups.
3. The data points in the original data set are compared with the clusters mined in step 2. Any
point that is deemed an outlier with respect to the clusters is declared as a possible attack.
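
A hedged sketch of these three steps (the synthetic "connection" vectors, the number of
clusters, and the cutoff are made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

# 1.-2. cluster a training set of connection records that are treated as attack-free
train = np.random.default_rng(0).standard_normal((200, 2))
normal_clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit(train)

# 3. compare new connections with the mined clusters; a connection far from every
#    cluster center is an outlier and is flagged as a possible attack
new_conn = np.array([[0.2, -0.1], [8.0, 8.0]])
d = np.linalg.norm(new_conn[:, None, :] - normal_clusters.cluster_centers_[None, :, :], axis=2).min(axis=1)
threshold = 3.0
possible_attacks = new_conn[d > threshold]
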

** What is Outlier Analysis?


An outlier is a data object that deviates significantly from the rest of the objects, as if
it were generated by a different mechanism. We may refer to data objects that are not
outliers as “normal” or expected data. Similarly, we may refer to outliers as “abnormal” data.
Outliers are different from noisy data. Noise is a random error or variance in a
measured variable. In general, noise is not interesting in data analysis, including outlier
detection. For example, in credit card fraud detection, a customer’s purchase behaviour can
be modelled as a random variable. A customer may generate some “noise transactions” that
may seem like “random errors” or “variance,” such as by buying a bigger lunch one day, or
having one more cup of coffee than usual. Such transactions should not be treated as outliers;
otherwise, the credit card company would incur heavy costs from verifying that many
transactions.

Prepared by U.L.N.Kumar, Dept. of computer Applications
