Non-convex Clusters
Clusters – Density-based 26/34
Neighborhood and Reachability
• ε-neighborhood of p ∈ D is defined as N_ε(p) = {x ∈ D | d(p, x) ≤ ε}
• p is directly density-reachable from q ∈ D w.r.t. some ε and δ if
  • p ∈ N_ε(q)
  • |N_ε(q)| ≥ δ, i.e. q is a core point
• p is density-reachable from q w.r.t. some ε and δ if
  • ∃ p_1, ..., p_n ∈ D such that p_1 = q, p_n = p, and
  • p_{i+1} is directly density-reachable from p_i for 1 ≤ i ≤ n − 1
• p is density-connected to q w.r.t. some ε and δ if
  • ∃ o ∈ D such that both p and q are density-reachable from o
• C ⊆ D (C ≠ ∅) is a cluster w.r.t. some ε and δ if
  • ∀ p, q ∈ D: if p ∈ C and q is density-reachable from p then q ∈ C
  • ∀ p, q ∈ C: p is density-connected to q
• noise = {p ∈ D | p ∉ C_1 ∪ · · · ∪ C_k} where
  • C_1, ..., C_k ⊆ D are clusters
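The definitions above can be made concrete with a small sketch. It assumes points are tuples of floats and that d is the Euclidean distance; the function names (neighborhood, is_core, directly_reachable, reachable) and the toy dataset are illustrative choices, not part of the slides.

```python
import math

def neighborhood(D, p, eps):
    """eps-neighborhood of p: all points of D within distance eps (p itself included)."""
    return [x for x in D if math.dist(p, x) <= eps]

def is_core(D, q, eps, delta):
    """q is a core point if its eps-neighborhood holds at least delta points."""
    return len(neighborhood(D, q, eps)) >= delta

def directly_reachable(D, p, q, eps, delta):
    """p is directly density-reachable from q: p lies in N_eps(q) and q is a core point."""
    return math.dist(p, q) <= eps and is_core(D, q, eps, delta)

def reachable(D, p, q, eps, delta):
    """p is density-reachable from q: a chain of direct reachability leads from q to p."""
    visited = {q}
    frontier = [q]
    while frontier:
        cur = frontier.pop()
        if not is_core(D, cur, eps, delta):
            continue  # only core points can extend a chain
        for x in neighborhood(D, cur, eps):
            if x == p:
                return True
            if x not in visited:
                visited.add(x)
                frontier.append(x)
    return False

# Toy data (assumed for illustration): four points on a line and one far away.
D = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 10.0)]
eps, delta = 1.0, 3

# Direct reachability is not symmetric: (1, 0) is a core point, (0, 0) is not.
print(directly_reachable(D, (0.0, 0.0), (1.0, 0.0), eps, delta))
print(directly_reachable(D, (1.0, 0.0), (0.0, 0.0), eps, delta))
# (3, 0) is reached from (1, 0) through the core point (2, 0).
print(reachable(D, (3.0, 0.0), (1.0, 0.0), eps, delta))
```

The asymmetry in the example is the reason density-connectedness is defined through a common point o: two border points of the same cluster need not be density-reachable from each other.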
DBSCAN
procedure DBSCAN(D, ε, δ)
    for all x ∈ D do
        p(x) ← −1            ▷ mark points as unclustered
    i ← 1                    ▷ the noise cluster has id 0
    for all p ∈ D do
        if p(p) = −1 then
            if ExpandCluster(D, p, i, ε, δ) then
                i ← i + 1
function ExpandCluster(D, p, i, ε, δ)
    if |N_ε(p)| < δ then
        p(p) ← 0             ▷ mark p as noise
        return false
    else
        for all x ∈ N_ε(p) do
            p(x) ← i         ▷ assign all x to cluster i
        S ← N_ε(p) \ {p}
        while S ≠ ∅ do
            s ← S_1          ▷ get the first point from S
            if |N_ε(s)| ≥ δ then
                for all x ∈ N_ε(s) do
                    if p(x) ≤ 0 then
                        if p(x) = −1 then
                            S ← S ∪ {x}
                        p(x) ← i
            S ← S \ {s}
        return true
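The two procedures can be transcribed almost line by line into Python. The following is a minimal sketch, assuming points are tuples, d is the Euclidean distance, and neighborhoods are found by a brute-force scan; the names region_query, expand_cluster and dbscan as well as the sample data are illustrative, not taken from the slides.

```python
import math
from collections import deque

UNCLUSTERED, NOISE = -1, 0  # same label convention as the pseudocode

def region_query(D, i, eps):
    """Indices of all points within distance eps of D[i] (i itself included)."""
    return [j for j in range(len(D)) if math.dist(D[i], D[j]) <= eps]

def expand_cluster(D, labels, p, cid, eps, delta):
    """Try to grow cluster cid from point index p; return False if p is not a core point."""
    seeds = region_query(D, p, eps)
    if len(seeds) < delta:
        labels[p] = NOISE  # may later be relabelled as a border point of a cluster
        return False
    for x in seeds:
        labels[x] = cid
    queue = deque(x for x in seeds if x != p)  # plays the role of S
    while queue:
        s = queue.popleft()
        neighbors = region_query(D, s, eps)
        if len(neighbors) >= delta:  # s is a core point, so its neighbors join the cluster
            for x in neighbors:
                if labels[x] <= 0:
                    if labels[x] == UNCLUSTERED:
                        queue.append(x)  # not examined yet, expand from it later
                    labels[x] = cid
    return True

def dbscan(D, eps, delta):
    """Return one label per point: 0 for noise, 1, 2, ... for clusters."""
    labels = [UNCLUSTERED] * len(D)
    cid = 1
    for p in range(len(D)):
        if labels[p] == UNCLUSTERED and expand_cluster(D, labels, p, cid, eps, delta):
            cid += 1
    return labels

# Two small squares and one isolated point (assumed toy data).
D = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9), (4, 20)]
labels = dbscan(D, eps=1.5, delta=3)
print(labels)  # [1, 1, 1, 1, 2, 2, 2, 2, 0]
```

The brute-force region_query makes this sketch quadratic in the number of points; a spatial index would be the usual replacement for larger datasets.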
How to guess ε and δ?
k-distance
• k-dist: D → ℝ
• k-dist(x) is the distance of x to its k-th nearest neighbor
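As a rough illustration, k-dist can be computed by brute force as below. The sketch assumes Euclidean distance and excludes the point itself from its own neighbors; the function names and the data are illustrative. A heuristic described in the DBSCAN paper (Ester et al., 1996) is to sort the k-dist values in descending order, plot them, and read ε off the point where the curve flattens; how k relates to δ is left here as an assumption to be checked against the chosen neighborhood definition.

```python
import math

def k_dist(D, i, k):
    """Distance from D[i] to its k-th nearest neighbor (D[i] itself is excluded)."""
    dists = sorted(math.dist(D[i], D[j]) for j in range(len(D)) if j != i)
    return dists[k - 1]

def sorted_k_dists(D, k):
    """k-dist of every point, sorted in descending order (the 'sorted k-dist graph')."""
    return sorted((k_dist(D, i, k) for i in range(len(D))), reverse=True)

# Two small squares and one isolated point (assumed toy data).
D = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9), (4, 20)]
curve = sorted_k_dists(D, k=3)
for value in curve:
    print(round(value, 3))
```

In this toy example the isolated point produces one large value at the start of the curve and all remaining values are equal, so any ε between the two levels separates the noise point from the dense regions.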
DBSCAN – “good to know”
Pros
• Clusters of an arbitrary shape
• Robust to outliers
Cons
• Computationally expensive: O(n²) distance computations unless a spatial index speeds up the neighborhood queries
• Hard to set the parameters
Final remarks
• domain knowledge might help in choosing the right similarity
measure
• be aware of the range of values of the attributes
• e.g. the similarity between x = (3.2, 178) and y = (3.1, 170) is affected
more by the second coordinate
• there are various other approaches to similarity computation
• Janos Podani (2000). Introduction to the Exploration of Multivariate
Biological Data. Chapter 3: Distance, similarity, correlation...
Backhuys Publishers, Leiden, The Netherlands, ISBN 90-5782-067-6.
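The remark about attribute ranges can be checked numerically. The sketch below assumes Euclidean distance as the (dis)similarity and uses the two points from the example; the attribute ranges used for min-max rescaling are hypothetical values chosen only for illustration.

```python
import math

x = (3.2, 178.0)
y = (3.1, 170.0)

def squared_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each coordinate."""
    sq = [(u - v) ** 2 for u, v in zip(a, b)]
    total = sum(sq)
    return [s / total for s in sq]

raw_shares = squared_shares(x, y)
print(math.dist(x, y))  # dominated by the difference 178 - 170
print(raw_shares)       # the second coordinate carries almost all of the distance

# Hypothetical attribute ranges (min, max); not given in the slides.
ranges = [(2.0, 4.0), (150.0, 200.0)]

def rescale(v):
    """Min-max rescaling of each coordinate to [0, 1] using the assumed ranges."""
    return tuple((a - lo) / (hi - lo) for a, (lo, hi) in zip(v, ranges))

scaled_shares = squared_shares(rescale(x), rescale(y))
print(scaled_shares)    # the first coordinate now has a visible influence
```

Whether rescaling is appropriate depends on the domain: it makes the attributes comparable but also discards any meaning carried by their original units.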
Thanks for your attention
References
• Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis (2001). On
Clustering Validation Techniques. Journal of Intelligent Information
Systems 17, 2-3.
• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (2005).
Introduction to Data Mining, (First Edition). Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA.
• Chris Ding and Xiaofeng He (2004). K-means clustering via principal
component analysis. In Proceedings of the twenty-first international
conference on Machine learning (ICML ’04). ACM, New York, NY, USA.
• Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu (1996). A
density-based algorithm for discovering clusters in large spatial databases
with noise. Proceedings of the 2nd International Conference on
Knowledge Discovery and Data Mining, AAAI Press.
Homework
• Download a clustering dataset from the UCI Machine Learning
Repository
• Cluster the dataset using
• Agglomerative clustering
• k-means method
• DBSCAN method
• Justify the choice of the values for the hyper-parameters
• similarity, linkage, k, δ, ε, ...
Questions?