Unit-5 DM
Unit-5
Clustering
I) Basic concepts:
i) What is clustering?
Clustering is the process of partitioning a set of data objects (or observations) into
subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters. The set of clusters resulting from a cluster analysis can
be referred to as a clustering.
Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, Web search, biology, and security. In business
intelligence, clustering can be used to organize a large number of customers into groups,
where customers within a group share similar characteristics.
ii) Requirements of clustering:
Scalability − We need highly scalable clustering algorithms to deal with huge
databases.
Ability to deal with different kinds of attributes − Algorithms should be capable of
being applied to any kind of data, such as interval-based (numerical), categorical,
and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. It should not be limited to
distance measures that tend to find only spherical clusters of small size.
High dimensionality − The clustering algorithm should not only be able to handle
low-dimensional data but also the high dimensional space.
Ability to deal with noisy data − Databases contain noisy, missing or erroneous
data. Some algorithms are sensitive to such data and may lead to poor quality
clusters.
Interpretability − The clustering results should be interpretable, comprehensible,
and usable.
II) Clustering Structures:
The following are the different types of cluster structures or cluster types:
a) The partitioning criteria: In some methods, all the objects are partitioned so that no
hierarchy exists among the clusters. That is, all the clusters are at the same level conceptually.
Such a method is useful, for example, for partitioning customers into groups so that each
group has its own manager. Alternatively, other methods partition data objects hierarchically,
where clusters can be formed at different semantic levels.
b) Separation of clusters: Some methods partition data objects into mutually exclusive
clusters. When clustering customers into groups so that each group is taken care of by one
manager, each customer may belong to only one group. In some other situations, the clusters
may not be exclusive; that is, a data object may belong to more than one cluster.
c) Similarity measure: Some methods determine the similarity between two objects by the
distance between them. Such a distance can be defined on Euclidean space, a road network, a
vector space, or any other space. In other methods, the similarity may be defined by
connectivity based on density or contiguity, and may not rely on the absolute distance
between two objects.
d) Clustering space: Many clustering methods search for clusters within the entire given
data space. These methods are useful for low-dimensionality data sets. With high dimensional
data, however, there can be many irrelevant attributes, which can make similarity
measurements unreliable.
III) Major Clustering Approaches:
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
1) Partitioning Method
Suppose we are given a database of n objects, and the partitioning method constructs
k partitions of the data. Each partition represents a cluster, and k ≤ n. That is, it
classifies the data into k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
2) Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects.
There are two approaches here –
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each
object forming a separate group. It keeps on merging the objects or groups that are close to
one another. It keeps on doing so until all of the groups are merged into one or until the
termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of
the objects in the same cluster. In each successive iteration, a cluster is split up into
smaller clusters. This continues until each object is in its own cluster or the
termination condition holds.
3) Density-based Method
This method is based on the notion of density. The basic idea is to continue growing
a given cluster as long as the density in the neighbourhood exceeds some threshold;
that is, for each data point within a given cluster, the neighbourhood of a given
radius has to contain at least a minimum number of points.
4) Grid-based Method
In this, the objects together form a grid. The object space is quantized into finite
number of cells that form a grid structure.
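The quantization step can be illustrated with a toy sketch (2-D points; the helper name `grid_cells` and the idea of simply counting objects per cell are our own simplification of what grid-based methods such as STING do):

```python
def grid_cells(points, cell_size):
    """Quantize 2-D points into grid cells and count the objects per cell.

    Each point is mapped to the cell containing it; dense cells can then
    be treated as the units of clustering instead of individual objects.
    """
    cells = {}
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key] = cells.get(key, 0) + 1
    return cells

# Two dense regions fall into two separate cells of a 1 x 1 grid.
print(grid_cells([(0.2, 0.3), (0.4, 0.1), (2.5, 2.7), (2.6, 2.9)], 1.0))
# {(0, 0): 2, (2, 2): 2}
```

Because the number of cells is fixed in advance, the cost of the clustering step depends on the grid size rather than on the number of data objects.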
IV) Partitioning Algorithms:
i) k-Means: A Centroid-Based Technique
The k-means algorithm first arbitrarily chooses k objects as the initial cluster
centers. Each remaining object is assigned to the cluster whose center is nearest to
it. The mean of each cluster is then recomputed, and the objects are
reassigned using the updated means as the new cluster centers. The iterations continue until
the assignment is stable. The Euclidean distance between objects X = (x1, x2, ..., xn)
and Y = (y1, y2, ..., yn) is
d(X, Y) = sqrt((x1 − y1)^2 + (x2 − y2)^2 + ... + (xn − yn)^2)
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is
represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based
on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for each
cluster;
(5) until no change;
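The pseudocode above can be sketched in pure Python as follows (the function name and the optional `init` parameter for fixing the initial centers are our own additions):

```python
import random

def kmeans(data, k, max_iter=100, init=None):
    """Minimal k-means sketch following the pseudocode above.

    data: list of equal-length numeric tuples; returns (centers, labels).
    """
    centers = init or random.sample(data, k)      # (1) initial cluster centers
    labels = []
    for _ in range(max_iter):
        # (3) (re)assign each object to the nearest center (squared Euclidean)
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(p, centers[j])))
                  for p in data]
        # (4) update each center to the mean of the objects assigned to it
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(data, labels) if l == j]
            new_centers.append(tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
                               if members else centers[j])
        if new_centers == centers:                # (5) until no change
            break
        centers = new_centers
    return centers, labels

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, 2, init=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With the two seeds placed one in each group, the algorithm converges in two iterations; with random seeds the result can depend on the initial choice, which is a well-known weakness of k-means.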
ii) k-Medoids: A Representative Object-Based Technique
The k-medoids algorithm is a partitioning clustering algorithm that is a slight
modification of the k-means algorithm. Both attempt to minimize the squared error, but
the k-medoids algorithm is more robust to noise than the k-means algorithm. In the
k-means algorithm, the mean of the objects in a cluster serves as its center, whereas
in k-medoids actual data points are chosen as the medoids. A medoid can be defined as
that object of a cluster whose average dissimilarity to all the objects in the cluster
is minimal.
Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or
central objects.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a non-representative object, o_random;
(5) compute the total cost, S, of swapping a representative object, o_j, with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of k representative objects;
(7) until no change;
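The medoid idea can be sketched compactly in Python. Note this toy version simply tries every k-subset of objects as the medoid set and keeps the cheapest, which is exponential; real PAM instead improves an initial medoid set by cost-reducing swaps, as in the pseudocode above. The names `pam` and `total_cost` are our own:

```python
from itertools import combinations

def total_cost(data, medoids):
    """Sum of Manhattan distances from each object to its nearest medoid."""
    return sum(min(sum(abs(a - b) for a, b in zip(p, m)) for m in medoids)
               for p in data)

def pam(data, k):
    """Exhaustive k-medoids sketch: the medoids are actual data objects
    minimizing the total dissimilarity. Only feasible for tiny data sets."""
    return min(combinations(data, k), key=lambda m: total_cost(data, m))

data = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(pam(data, 2))  # one medoid drawn from each of the two pairs
```

Because the medoids are real objects rather than means, a single distant outlier shifts the result far less than it shifts a k-means centroid.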
Hierarchical Clustering Algorithms
Agglomerative clustering works in a “bottom-up” manner. That is, each object is
initially considered as a single-element cluster (leaf). At each step of the algorithm, the two
clusters that are the most similar are combined into a new bigger cluster (nodes). This
procedure is iterated until all points are member of just one single big cluster (root) (see
figure below).
The inverse of agglomerative clustering is divisive clustering, also known as
DIANA (Divisive Analysis), which works in a “top-down” manner and is less commonly
used. It begins with the root, in which all objects are included in a single cluster.
At each step of the iteration, the most heterogeneous cluster is divided into two.
The process is repeated until each object is in its own cluster (see figure below).
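The bottom-up merging loop described above can be sketched as follows (a toy single-linkage version, where the distance between two clusters is the distance between their closest members; the function name is our own, and real implementations are far more efficient than this O(n^3) loop):

```python
def agglomerative(points, k):
    """Single-linkage agglomerative clustering sketch.

    Start with each point as its own cluster (leaf) and repeatedly merge
    the two closest clusters until only k clusters remain.
    """
    clusters = [[p] for p in points]

    def dist(c1, c2):  # single linkage: closest pair of members
        return min(sum((a - b) ** 2 for a, b in zip(p, q))
                   for p in c1 for q in c2)

    while len(clusters) > k:
        i, j = min(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters

print(agglomerative([(0, 0), (0, 1), (10, 10), (10, 11)], 2))
```

Running the loop all the way to k = 1 produces the root of the dendrogram; stopping earlier corresponds to cutting the tree at a chosen level.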
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): Basic Algorithm
Phase 1: Load data into memory
Scan the database and load the data into memory by building a CF (Clustering Feature)
tree. If memory is exhausted, rebuild the tree from the leaf nodes.
Phase 2: Condense data
Resize the data set by building a smaller CF tree
Remove more outliers
Condensing is optional
Phase 3: Global clustering
Use an existing clustering algorithm (e.g., k-means or hierarchical clustering) on the CF entries
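The CF entry that the tree stores can be sketched for the one-dimensional case (class name ours; real BIRCH keeps a vector-valued linear sum LS and square sum SS per entry). The key property is that two CF entries merge by simple addition, which is why BIRCH can summarize the data in a single scan:

```python
class CF:
    """Clustering Feature: the (N, LS, SS) summary of a set of values,
    the building block of a BIRCH CF tree (1-D for brevity)."""

    def __init__(self, point=None):
        self.n, self.ls, self.ss = 0, 0.0, 0.0
        if point is not None:
            self.add(point)

    def add(self, x):
        """Absorb one value into the summary."""
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        """Merging two CF entries is component-wise addition."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

cf = CF()
for x in [1.0, 2.0, 3.0]:
    cf.add(x)
print(cf.centroid())  # 2.0
```

Statistics such as the centroid, radius, and diameter of a subcluster can all be derived from (N, LS, SS) without revisiting the raw points.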
DBSCAN Algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) grows a cluster
from a core point, i.e., a point whose neighbourhood of radius Eps contains at least
MinPts points. If a cluster is fully expanded (all points within reach are visited),
the algorithm proceeds to iterate through the remaining unvisited points in the
data set.
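A minimal sketch of the expansion procedure (pure Python and quadratic in the number of points; real implementations use spatial indexes for the neighbourhood queries, and the function name is our own):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns labels[i] = cluster id, or -1 for noise."""

    def neighbors(i):  # all points within distance eps of points[i]
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # tentatively noise
            continue
        labels[i] = cluster             # i is a core point: start a cluster
        queue = [j for j in seeds if j != i]
        while queue:                    # expand until all reachable points visited
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors(j)) >= min_pts:   # j is also core: keep expanding
                queue += neighbors(j)
        cluster += 1
    return labels

print(dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], 1.5, 3))
# [0, 0, 0, 0, -1]
```

The dense square of four points forms one cluster, while the isolated point has too few neighbours to be a core point and is labeled noise.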
ii) Clustering-based outlier detection using distance to the closest cluster: Using the k-
means clustering method, we can partition the data points shown in the figure into three
clusters, shown using different symbols, with the center of each cluster marked with a C.
For each object, o, we can assign an outlier score to the object according to the distance
between the object and the center that is closest to the object.
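That scoring rule is easy to express in code (a sketch; `outlier_scores` is our own name, and the centers would come from a k-means run as described above):

```python
def outlier_scores(points, centers):
    """Score each object by its Euclidean distance to the closest cluster
    center; a larger score suggests a likelier outlier."""
    return [min(sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5
                for c in centers)
            for p in points]

# A point sitting on a center scores 0; a distant point scores high.
print(outlier_scores([(0, 0), (5, 5)], [(0, 0), (0, 1)]))
```

Objects whose score greatly exceeds the typical distance within their cluster can then be flagged for inspection.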