Density & Grid-Based Clustering

The document discusses various methods of cluster analysis, including partitioning, hierarchical, density-based, and grid-based methods. It highlights density-based clustering techniques like DBSCAN and OPTICS, which can identify clusters of arbitrary shapes and handle noise. Additionally, it covers evaluation methods for clustering quality and determining the number of clusters using empirical and statistical approaches.


Data Mining & Warehousing

Unit 4 – Cluster Analysis – Part 1B

Dr. VIDHYA
ASSISTANT PROFESSOR & HEAD
Department of Computer Technology
Sri Ramakrishna College of Arts and Science
Coimbatore - 641 006
Tamil Nadu, India
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary

Density-Based Clustering Methods
■ Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
■ They have difficulty finding clusters of arbitrary shape, such as "S"-shaped and oval clusters.
■ Strategy of density-based clustering methods:
  ■ To find clusters of arbitrary shape, model clusters as dense regions in the data space, separated by sparse regions.
  ■ Discover clusters of non-spherical shape.
Density-Based Clustering Methods
■ Clustering based on density (a local cluster criterion), such as density-connected points
■ Major features:
  ■ Discover clusters of arbitrary shape
  ■ Handle noise
  ■ One scan
  ■ Need density parameters as a termination condition
■ Several interesting studies:
  ■ DBSCAN: Ester, et al. (KDD'96)
  ■ OPTICS: Ankerst, et al. (SIGMOD'99)
  ■ DENCLUE: Hinneburg & Keim (KDD'98)
  ■ CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
Density-Based Clustering: Basic Concepts
■ Two parameters:
  ■ Eps: maximum radius of the neighbourhood
  ■ MinPts: minimum number of points in an Eps-neighbourhood of that point
■ NEps(q) = {p belongs to D | dist(p, q) ≤ Eps}
■ Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  ■ p belongs to NEps(q)
  ■ core point condition: |NEps(q)| ≥ MinPts
(Figure: point p in the Eps-neighbourhood of core point q, with MinPts = 5 and Eps = 1 cm)
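The two definitions above translate almost directly into code. The following is a minimal sketch (the point set, Eps, and MinPts values are made up for illustration; the convention that q counts in its own neighbourhood follows the set definition above):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def eps_neighborhood(q, D, eps):
    """N_Eps(q) = {p in D | dist(p, q) <= eps}; includes q itself."""
    return [p for p in D if dist(p, q) <= eps]

def is_core(q, D, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(q, D, eps)) >= min_pts

def directly_density_reachable(p, q, D, eps, min_pts):
    """p is directly density-reachable from q iff p is in N_Eps(q) and q is core."""
    return p in eps_neighborhood(q, D, eps) and is_core(q, D, eps, min_pts)

# Toy data: a tight clump around the origin plus one distant point.
D = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (5.0, 5.0)]
print(is_core((0.0, 0.0), D, eps=0.5, min_pts=5))                     # True
print(directly_density_reachable((5.0, 5.0), (0.0, 0.0), D, 0.5, 5))  # False
```

Note the asymmetry the slide implies: density-reachability requires q (the "from" point) to be a core point, while p may be a border point.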
Density-Reachable and Density-Connected
■ Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
■ Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
(Figures: a chain q → p1 → p illustrating density-reachability; points p and q both density-reachable from o illustrating density-connectedness)
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
■ Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
■ Discovers clusters of arbitrary shape in spatial databases with noise
(Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5)
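This notion of cluster can be sketched as a short, self-contained DBSCAN. This is an illustrative implementation, not the optimized original algorithm from the KDD'96 paper, and the toy data and parameter values are made up:

```python
from math import dist

NOISE = -1

def dbscan(D, eps, min_pts):
    """Label each point with a cluster id, or NOISE (-1).
    Each cluster is grown as a maximal set of density-connected points."""
    labels = {}
    cluster_id = 0
    for p in D:
        if p in labels:
            continue
        nbrs = [x for x in D if dist(x, p) <= eps]
        if len(nbrs) < min_pts:          # not a core point (may be relabeled
            labels[p] = NOISE            # later if it borders a cluster)
            continue
        labels[p] = cluster_id           # start a new cluster from core point p
        seeds = [x for x in nbrs if x != p]
        while seeds:
            q = seeds.pop()
            if labels.get(q) == NOISE:   # border point: absorb into cluster
                labels[q] = cluster_id
            if q in labels:
                continue
            labels[q] = cluster_id
            q_nbrs = [x for x in D if dist(x, q) <= eps]
            if len(q_nbrs) >= min_pts:   # q is also core: expand through it
                seeds.extend(q_nbrs)
        cluster_id += 1
    return labels

# Two clumps plus one isolated point.
D = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0),
     (4.0, 4.0), (4.1, 4.1), (4.0, 4.2),
     (9.0, 0.0)]
labels = dbscan(D, eps=0.5, min_pts=3)
print(labels[(0.0, 0.0)], labels[(4.0, 4.0)], labels[(9.0, 0.0)])  # 0 1 -1
```

The isolated point ends up labeled NOISE, which is exactly the "handle noise" feature listed earlier.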
DBSCAN: Sensitive to Parameters

DBSCAN online demo:
http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
OPTICS: A Cluster-Ordering Method (1999)
■ OPTICS: Ordering Points To Identify the Clustering Structure
  ■ Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
■ Produces a special order of the database w.r.t. its density-based clustering structure
■ This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
■ Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
■ Can be represented graphically or using visualization techniques
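The two quantities OPTICS orders points by, core-distance and reachability-distance, can be sketched as follows. This is a simplified illustration of the definitions only, not the full ordering algorithm, and the data are made up:

```python
from math import dist

def core_distance(p, D, eps, min_pts):
    """Distance to the MinPts-th nearest neighbour of p (p itself counts),
    or None if p is not a core point w.r.t. eps."""
    dists = sorted(dist(p, x) for x in D if dist(p, x) <= eps)
    return dists[min_pts - 1] if len(dists) >= min_pts else None

def reachability_distance(o, p, D, eps, min_pts):
    """max(core-distance(p), dist(p, o)); undefined if p is not core."""
    cd = core_distance(p, D, eps, min_pts)
    return None if cd is None else max(cd, dist(p, o))

D = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (0.3, 0.1), (5.0, 5.0)]
print(core_distance((0.0, 0.0), D, eps=1.0, min_pts=3))   # 0.2
print(reachability_distance((0.3, 0.1), (0.0, 0.0), D, 1.0, 3))
```

Sorting points by reachability-distance along the OPTICS ordering is what produces the "valleys" in a reachability plot; each valley corresponds to a cluster at some density level.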
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary

Grid-Based Clustering Method
■ Uses a multi-resolution grid data structure
■ Several interesting methods:
  ■ STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (VLDB'97)
  ■ CLIQUE: Agrawal, et al. (SIGMOD'98)
    ■ Both grid-based and subspace clustering
  ■ WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
    ■ A multi-resolution clustering approach using the wavelet method
STING: A Statistical Information Grid
Approach
■ Wang, Yang and Muntz (VLDB’97)
■ The spatial area is divided into rectangular cells
■ There are several levels of cells corresponding to
different levels of resolution

The STING Clustering Method
■ Each cell at a high level is partitioned into a number of smaller cells at the next lower level
■ Statistical info of each cell is calculated and stored beforehand and is used to answer queries
■ Parameters of higher-level cells can be easily calculated from parameters of lower-level cells
  ■ count, mean, standard deviation (s), min, max
  ■ type of distribution (normal, uniform, etc.)
■ Use a top-down approach to answer spatial data queries
■ Start from a pre-selected layer, typically with a small number of cells
■ For each cell in the current level, compute the confidence interval
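The bottom-up parameter computation can be sketched for count, mean, min, and max. The cell values below are made up for illustration; standard deviation would combine similarly from per-cell sums of squares:

```python
def merge_cells(cells):
    """Compute a parent cell's statistics from its child cells.
    Each cell is a dict with keys: count, mean, min, max."""
    n = sum(c["count"] for c in cells)
    mean = sum(c["count"] * c["mean"] for c in cells) / n   # count-weighted mean
    return {
        "count": n,
        "mean": mean,
        "min": min(c["min"] for c in cells),
        "max": max(c["max"] for c in cells),
    }

children = [
    {"count": 10, "mean": 2.0, "min": 1.0, "max": 3.0},
    {"count": 30, "mean": 4.0, "min": 0.5, "max": 9.0},
]
print(merge_cells(children))
# {'count': 40, 'mean': 3.5, 'min': 0.5, 'max': 9.0}
```

Because the parent's parameters come only from the children's stored summaries, never from the raw points, answering a query at a coarse level costs O(number of cells), not O(number of points).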
STING Algorithm and Its Analysis
■ Remove the irrelevant cells from further consideration
■ When finished examining the current layer, proceed to the next lower level
■ Repeat this process until the bottom layer is reached
■ Advantages:
  ■ Query-independent, easy to parallelize, incremental update
  ■ O(K), where K is the number of grid cells at the lowest level
■ Disadvantages:
  ■ All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
CLIQUE (Clustering In QUEst)
■ Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
■ Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
■ CLIQUE can be considered as both density-based and grid-based
  ■ It partitions each dimension into the same number of equal-length intervals
  ■ It partitions an m-dimensional data space into non-overlapping rectangular units
  ■ A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
  ■ A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
■ Partition the data space and find the number of points that lie inside each cell of the partition
■ Identify the subspaces that contain clusters using the Apriori principle
■ Identify clusters
  ■ Determine dense units in all subspaces of interest
  ■ Determine connected dense units in all subspaces of interest
■ Generate a minimal description for the clusters
  ■ Determine maximal regions that cover a cluster of connected dense units for each cluster
  ■ Determine the minimal cover for each cluster
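The first step, partitioning a dimension and finding dense units, can be sketched in a few lines. The data, interval count (often written ξ in CLIQUE), and density threshold (τ) below are made up for illustration:

```python
def dense_units_1d(values, lo, hi, xi, tau):
    """Partition [lo, hi) into xi equal-length intervals and return the
    indices of units whose fraction of the points exceeds tau."""
    width = (hi - lo) / xi
    counts = [0] * xi
    for v in values:
        i = min(int((v - lo) / width), xi - 1)  # clamp the hi endpoint into the last unit
        counts[i] += 1
    n = len(values)
    return [i for i, c in enumerate(counts) if c / n > tau]

# 1-D toy data: most of the mass near 0, a little near 9.
values = [0.1, 0.2, 0.3, 0.4, 0.5, 9.0]
print(dense_units_1d(values, lo=0.0, hi=10.0, xi=10, tau=0.2))  # [0]
```

The Apriori principle then prunes the search over subspaces: a unit in a k-dimensional subspace can be dense only if all of its (k−1)-dimensional projections are dense, so higher-dimensional candidates are built only from surviving lower-dimensional dense units.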
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

■ Cluster Analysis: Basic Concepts


■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary
Determine the Number of Clusters
■ Empirical method
  ■ # of clusters: k ≈ √(n/2) for a dataset of n points, e.g., n = 200 gives k = 10
■ Elbow method
  ■ Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
■ Cross-validation method
  ■ Divide a given data set into m parts
  ■ Use m − 1 parts to obtain a clustering model
  ■ Use the remaining part to test the quality of the clustering
    ■ E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
  ■ For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the # of clusters that best fits the data
Measuring Clustering Quality
■ Three kinds of measures: external, internal, and relative
■ External: supervised, employs criteria not inherent to the dataset
  ■ Compare a clustering against prior or expert-specified knowledge (i.e., the ground truth) using a clustering quality measure
■ Internal: unsupervised, criteria derived from the data itself
  ■ Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are, e.g., the silhouette coefficient
■ Relative: directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm
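The silhouette coefficient mentioned above can be sketched for a single point. This is a minimal pure-Python version with made-up clusters; a is the mean distance to the point's own cluster (compactness) and b is the smallest mean distance to another cluster (separation):

```python
from math import dist

def silhouette(p, own, others):
    """s(p) = (b - a) / max(a, b), where a is the mean distance from p to the
    other points of its own cluster and b is the smallest mean distance from
    p to the points of any other cluster. Values near 1 are good."""
    rest = [x for x in own if x != p]
    a = sum(dist(p, x) for x in rest) / len(rest)
    b = min(sum(dist(p, x) for x in c) / len(c) for c in others)
    return (b - a) / max(a, b)

c1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
c2 = [(10.0, 10.0), (10.0, 11.0)]
s = silhouette((0.0, 0.0), c1, [c2])
print(round(s, 3))  # close to 1: compact cluster, far from the other one
```

Averaging s(p) over all points gives a single internal quality score for the whole clustering, with no ground truth required.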
Some Commonly Used External Measures
■ Matching-based measures
  ■ Purity, maximum matching, F-measure
■ Entropy-based measures
  ■ Conditional entropy, normalized mutual information (NMI), variation of information
■ Pair-wise measures
  ■ Four possibilities: true positive (TP), FN, FP, TN
  ■ Jaccard coefficient, Rand statistic, Fowlkes-Mallows measure
■ Correlation measures
  ■ Discretized Hubert statistic, normalized discretized Hubert statistic
(Figure: ground-truth partitions T1, T2 vs. clusters C1, C2)
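Purity, the simplest matching-based measure above, can be sketched as follows. The point ids and labelings are made up; each cluster contributes the size of its largest overlap with any ground-truth class:

```python
def purity(clusters, truth):
    """purity = (1/n) * sum over clusters of the size of the cluster's
    largest overlap with any ground-truth class.
    clusters and truth are lists of sets of point ids."""
    n = sum(len(c) for c in clusters)
    return sum(max(len(c & t) for t in truth) for c in clusters) / n

clusters = [{1, 2, 3}, {4, 5, 6}]
truth = [{1, 2, 4}, {3, 5, 6}]
print(purity(clusters, truth))  # (2 + 2) / 6 ≈ 0.667
```

Note that purity alone rewards many tiny clusters (one point per cluster gives purity 1), which is why it is usually reported alongside measures like NMI or the F-measure.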
