Density & Grid based clustering
Density-Based Clustering
Methods
■ Partitioning and hierarchical methods are
designed to find spherical-shaped clusters.
■ They have difficulty finding clusters of
arbitrary shape such as the “S” shape and
oval clusters.
■ Strategy of density-based clustering
methods:
■ To find clusters of arbitrary shape, model
Density-Based Clustering
Methods
■ Clustering based on density (local cluster criterion),
such as density-connected points
■ Major features:
■ Discover clusters of arbitrary shape
■ Handle noise
■ One scan
condition
■ Several interesting studies:
■ DBSCAN: Ester, et al. (KDD’96)
based)
Density-Based Clustering: Basic
Concepts
■ Two parameters:
■ Eps: Maximum radius of the neighbourhood
■ MinPts: Minimum number of points in an Eps-
neighbourhood of that point
■ NEps(q): {p belongs to D | dist(p,q) ≤ Eps}
■ Directly density-reachable: A point p is directly
density-reachable from a point q w.r.t. Eps,
MinPts if
■ p belongs to NEps(q)
■ core point condition: |NEps(q)| ≥ MinPts
(Figure: points p and q, illustrated with MinPts = 5 and Eps = 1 cm)
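The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the Euclidean distance, and the sample points are assumptions made for the example.

```python
import math

def eps_neighbourhood(D, q, eps):
    """N_Eps(q): every point p in D with dist(p, q) <= eps (q itself is included)."""
    return [p for p in D if math.dist(p, q) <= eps]

def is_core(D, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighbourhood(D, q, eps)) >= min_pts

def directly_density_reachable(D, p, q, eps, min_pts):
    """p is directly density-reachable from q if p is in N_Eps(q) and q is a core point."""
    return math.dist(p, q) <= eps and is_core(D, q, eps, min_pts)

# Hypothetical sample data: a small dense clump plus one point on its edge.
D = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5), (1.4, 0.0)]
eps, min_pts = 1.0, 5

print(is_core(D, (0.0, 0.0), eps, min_pts))
print(directly_density_reachable(D, (1.4, 0.0), (0.5, 0.0), eps, min_pts))
print(directly_density_reachable(D, (0.5, 0.0), (1.4, 0.0), eps, min_pts))
```

Note that the relation is not symmetric: the edge point lies in the neighbourhood of a core point, but it is not a core point itself, so nothing is directly density-reachable from it.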
Density-Reachable and Density-
Connected
■ Density-reachable:
■ A point p is density-reachable from a point q
w.r.t. Eps, MinPts if there is a chain of points
p1, …, pn, p1 = q, pn = p such that pi+1 is
directly density-reachable from pi
■ Density-connected:
■ A point p is density-connected to a point q
w.r.t. Eps, MinPts if there is a point o such that
both p and q are density-reachable from o
w.r.t. Eps and MinPts
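Both relations can be checked with a breadth-first search that only extends the chain through core points. The sketch below is illustrative; the function names and the sample data are assumptions, and a point is treated as trivially reachable from itself (a chain of length one).

```python
import math
from collections import deque

def neighbours(D, q, eps):
    """All points of D within distance eps of q (q included)."""
    return [p for p in D if math.dist(p, q) <= eps]

def density_reachable(D, p, q, eps, min_pts):
    """True if a chain q = p1, ..., pn = p exists where each p(i+1)
    is directly density-reachable from p(i)."""
    seen = {q}
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        if cur == p:
            return True
        nbrs = neighbours(D, cur, eps)
        if len(nbrs) < min_pts:
            continue  # cur is not a core point, so the chain cannot continue from it
        for x in nbrs:
            if x not in seen:
                seen.add(x)
                queue.append(x)
    return False

def density_connected(D, p, q, eps, min_pts):
    """True if some point o in D density-reaches both p and q."""
    return any(
        density_reachable(D, p, o, eps, min_pts)
        and density_reachable(D, q, o, eps, min_pts)
        for o in D
    )

# Hypothetical data: five points on a line; the two end points are not core.
D = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
eps, min_pts = 1.0, 3

print(density_reachable(D, (4, 0), (1, 0), eps, min_pts))
print(density_reachable(D, (1, 0), (4, 0), eps, min_pts))
print(density_connected(D, (0, 0), (4, 0), eps, min_pts))
```

Density-reachability is asymmetric (an end point can be reached from the interior but not the other way round), while density-connectivity is symmetric, which is why clusters are defined in terms of the latter.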
DBSCAN: Density-Based Spatial
Clustering of Applications with
Noise
■ Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
■ Discovers clusters of arbitrary shape in spatial
databases with noise
(Figure: core, border and outlier points, with Eps = 1cm and MinPts = 5)
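A compact sketch of the idea follows: grow a cluster from each unvisited core point by repeatedly absorbing the Eps-neighbourhoods of the core points it reaches. It is a simplified illustration that uses a brute-force neighbourhood scan; the function name `dbscan`, the `NOISE` label, and the sample data are assumptions, not the original authors' code.

```python
import math

NOISE = -1

def dbscan(D, eps, min_pts):
    """Return one label per point of D: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(D)

    def region(i):
        # Indices of all points within eps of D[i] (i itself included).
        return [j for j in range(len(D)) if math.dist(D[i], D[j]) <= eps]

    cluster = -1
    for i in range(len(D)):
        if labels[i] is not None:
            continue
        nbrs = region(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # provisional: may turn out to be a border point later
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(nbrs)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = region(j)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)  # j is a core point: keep growing through it
    return labels

# Hypothetical data: two dense squares and one isolated point between them.
D = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = dbscan(D, eps=1.5, min_pts=3)
print(labels)
```

The brute-force `region` scan makes this quadratic in the number of points; practical implementations typically rely on a spatial index for the neighbourhood queries.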
DBSCAN: Sensitive to Parameters
DBSCAN online demo:
http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
OPTICS: A Cluster-Ordering Method
(1999)
(SIGMOD’99)
■ Produces a special order of the database wrt its
visualization techniques
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary
Grid-Based Clustering Method
clustering
■ WaveCluster by Sheikholeslami,
The STING Clustering Method
■ Each cell at a high level is partitioned into a
number of smaller cells in the next lower level
■ Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
■ Parameters of higher level cells can be easily
calculated from parameters of lower level cells
■ count, mean, s (standard deviation), min, max
incremental update
■ O(K), where K is the number of grid cells at the
lowest level
■ Disadvantages:
■ All the cluster boundaries are either horizontal
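To illustrate how the parameters of a higher level cell follow from those of its child cells without revisiting the raw data, here is a small sketch. The `CellStats` class and the helper names are hypothetical, and s is assumed to be the population standard deviation.

```python
import math
from dataclasses import dataclass

@dataclass
class CellStats:
    count: int
    mean: float
    s: float    # standard deviation (population form, an assumption of this sketch)
    min: float
    max: float

def from_values(values):
    """Statistics of a lowest-level cell, computed directly from its values."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return CellStats(n, mean, s, min(values), max(values))

def merge(children):
    """Statistics of a parent cell from its child cells only.
    Assumes at least one child is non-empty."""
    cells = [c for c in children if c.count > 0]
    n = sum(c.count for c in cells)
    mean = sum(c.count * c.mean for c in cells) / n
    # E[x^2] of the parent, recovered from each child's variance and mean.
    ex2 = sum(c.count * (c.s ** 2 + c.mean ** 2) for c in cells) / n
    s = math.sqrt(max(ex2 - mean ** 2, 0.0))
    return CellStats(n, mean, s, min(c.min for c in cells), max(c.max for c in cells))

# Hypothetical attribute values falling into four child cells of one parent.
child_values = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
parent = merge([from_values(v) for v in child_values])
direct = from_values([v for cell in child_values for v in cell])
print(parent)
```

The merged parent matches the statistics computed directly from all values, which is what lets each level of the hierarchy be built from the level below it.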
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
k = 10
■ Elbow method
■ Use the turning point in the curve of sum of within cluster
■ E.g., For each point in the test set, find the closest
measure w.r.t. different k’s, and find # of clusters that fits the
Measuring Clustering Quality
■ 3 kinds of measures: External, internal and relative
■ External: supervised, employ criteria not inherent to the
dataset
■ Compare a clustering against prior or expert-specified
knowledge (i.e., the ground truth) using a
clustering quality measure
■ Internal: unsupervised, criteria derived from data itself
■ Evaluate the goodness of a clustering by considering
how well the clusters are separated, and how
compact the clusters are, e.g., Silhouette coefficient
■ Relative: directly compare different clusterings, usually
those obtained via different parameter settings for the
same algorithm
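As an example of an internal measure, the silhouette coefficient can be sketched as follows. For each point, a is the mean distance to the other points of its own cluster and b is the smallest mean distance to any other cluster; the point's score is (b - a) / max(a, b). The function name and the sample data are assumptions, and the sketch assumes at least two clusters and no exact duplicate points.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # Compactness: mean distance to the other members of p's cluster.
        a = sum(math.dist(p, x) for x in own) / (len(own) - 1)
        # Separation: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, x) for x in other) / len(other)
            for k, other in clusters.items()
            if k != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two well-separated pairs, labelled sensibly and badly.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])
bad = silhouette(points, [0, 1, 0, 1])
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters; negative values indicate points that sit closer to another cluster than to their own. Comparing the two labelings in this way is also an instance of the relative use of a measure.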
Some Commonly Used External Measures
■ Matching-based measures
■ Purity, maximum matching, F-measure
■ Entropy-Based Measures
■ Conditional entropy, normalized mutual
information (NMI), variation of information
(Figure: ground truth partitioning T1, T2 compared with clusters C1, C2)
■ Pair-wise measures
■ Four possibilities: True positive (TP), FN,
FP, TN
■ Jaccard coefficient, Rand statistic,
Fowlkes-Mallows measure
■ Correlation measures
■ Discretized Hubert statistic, normalized
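Two of the families above, matching-based and pair-wise measures, can be illustrated with a short sketch. It uses the standard textbook definitions; the function names and the toy labelings are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations

def purity(clusters, truth):
    """Fraction of points that belong to the majority ground-truth class of their cluster."""
    by_cluster = {}
    for c, t in zip(clusters, truth):
        by_cluster.setdefault(c, Counter())[t] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(clusters)

def pair_counts(clusters, truth):
    """Count point pairs as TP, FN, FP, TN.
    'Positive' means the pair shares a cluster; 'true' means this agrees with the ground truth."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_c = clusters[i] == clusters[j]
        same_t = truth[i] == truth[j]
        if same_c and same_t:
            tp += 1
        elif same_t:
            fn += 1
        elif same_c:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def jaccard(tp, fn, fp, tn):
    return tp / (tp + fn + fp)

def rand(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

def fowlkes_mallows(tp, fn, fp, tn):
    # Geometric mean of pair-wise precision and recall.
    return tp / math.sqrt((tp + fp) * (tp + fn))

# Hypothetical labelings of six points.
truth = [1, 1, 1, 2, 2, 2]
clusters = [1, 1, 2, 2, 2, 2]
counts = pair_counts(clusters, truth)
print(purity(clusters, truth), counts)
print(jaccard(*counts), rand(*counts), fowlkes_mallows(*counts))
```

The Jaccard coefficient ignores true negatives, whereas the Rand statistic counts them, so the two can diverge considerably when most pairs lie in different clusters.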