
Clusters - Density-Based

The document discusses density-based clustering algorithms. It defines density-based clusters as sets of density-connected points that are maximal with respect to density-reachability. A point p is density-reachable from another point q if there is a chain of points connecting them where each subsequent point is directly density-reachable from the previous. Direct density-reachability requires the points to be neighbors and the neighbor point to have sufficient density. DBSCAN is presented as a density-based clustering algorithm that groups together densely connected points and marks outliers as noise. Parameters epsilon and delta control neighborhood size and density.

Uploaded by

Fareed Naouri

Non-convex Clusters

Clusters – Density-based 26/34


Neighborhood and Reachability
• ε-neighborhood of p ∈ D is defined as N_ε(p) = {x ∈ D | d(p, x) ≤ ε}
• p is directly density-reachable from q ∈ D w.r.t. some ε and δ if
  • p ∈ N_ε(q)
  • |N_ε(q)| ≥ δ, i.e. q is a core point

• p is density-reachable from q w.r.t. some ε and δ if
  • ∃ p_1, …, p_n ∈ D such that p_1 = q, p_n = p, and
  • p_{i+1} is directly density-reachable from p_i for 1 ≤ i ≤ n − 1

• p is density-connected to q w.r.t. some ε and δ if
  • ∃ o ∈ D such that both p and q are density-reachable from o

• C ⊆ D (C ≠ ∅) is a cluster w.r.t. some ε and δ if
  • ∀ p, q ∈ D: if p ∈ C and q is density-reachable from p then q ∈ C
  • ∀ p, q ∈ C: p is density-connected to q

• noise = {p ∈ D | p ∉ C_1 ∪ ⋯ ∪ C_k} where
  • C_1, …, C_k ⊆ D are clusters
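
The definitions above can be tried out on a toy dataset. The following is a minimal sketch, assuming points are tuples of floats and d is the Euclidean distance; the function names and the sample data are illustrative and do not come from the slides. A point is treated as density-reachable from itself (a chain of length one).

```python
from collections import deque
from math import dist  # Euclidean distance, Python 3.8+


def neighborhood(D, p, eps):
    """N_eps(p): all points of D within distance eps of p (p itself included)."""
    return [x for x in D if dist(p, x) <= eps]


def is_core(D, q, eps, delta):
    """q is a core point if its eps-neighborhood holds at least delta points."""
    return len(neighborhood(D, q, eps)) >= delta


def directly_reachable(D, p, q, eps, delta):
    """True if p is directly density-reachable from q."""
    return dist(p, q) <= eps and is_core(D, q, eps, delta)


def reachable_from(D, q, eps, delta):
    """Set of points density-reachable from q, found by breadth-first search."""
    seen = {q}
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        if not is_core(D, cur, eps, delta):
            continue  # only core points can extend a chain
        for x in neighborhood(D, cur, eps):
            if x not in seen:
                seen.add(x)
                queue.append(x)
    return seen


# Illustrative data: a dense square, one border point, one far-away point.
D = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
eps, delta = 1.0, 3

print(directly_reachable(D, (2, 1), (1, 1), eps, delta))  # border from core
print(directly_reachable(D, (1, 1), (2, 1), eps, delta))  # core from border
print(sorted(reachable_from(D, (0, 0), eps, delta)))
```

Note that direct density-reachability is not symmetric: the border point (2, 1) is reachable from the core point (1, 1), but not the other way round, which is why the cluster definition relies on density-connectedness.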



DBSCAN

procedure DBSCAN(D, ε, δ)
    for all x ∈ D do
        p(x) ← −1                              ▷ p(x) is the cluster id of x; −1 marks x as unclustered
    i ← 1                                      ▷ the noise cluster has id 0
    for all p ∈ D do
        if p(p) = −1 then
            if ExpandCluster(D, p, i, ε, δ) then
                i ← i + 1



DBSCAN
function ExpandCluster(D, p, i, ε, δ)
    if |N_ε(p)| < δ then
        p(p) ← 0                               ▷ mark p as noise
        return false
    else
        for all x ∈ N_ε(p) do
            p(x) ← i                           ▷ assign all x to cluster i
        S ← N_ε(p) \ {p}
        while S ≠ ∅ do
            s ← S_1                            ▷ get the first point from S
            if |N_ε(s)| ≥ δ then
                for all x ∈ N_ε(s) do
                    if p(x) ≤ 0 then
                        if p(x) = −1 then
                            S ← S ∪ {x}
                        p(x) ← i
            S ← S \ {s}
        return true
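
The two procedures above translate almost line by line into Python. This is a minimal sketch, assuming points are distinct tuples (they are used as dictionary keys), d is the Euclidean distance, and neighborhoods are found by a brute-force scan; the names `dbscan`, `expand_cluster` and `label` are illustrative, with `label` playing the role of p(x) in the pseudocode.

```python
from math import dist  # Euclidean distance, Python 3.8+

UNCLUSTERED, NOISE = -1, 0


def neighborhood(D, p, eps):
    """N_eps(p): all points of D within distance eps of p (p itself included)."""
    return [x for x in D if dist(p, x) <= eps]


def expand_cluster(D, label, p, i, eps, delta):
    """Try to grow cluster i from p; return False if p is not a core point."""
    seeds = neighborhood(D, p, eps)
    if len(seeds) < delta:
        label[p] = NOISE  # may later be relabeled as a border point
        return False
    for x in seeds:
        label[x] = i
    S = [x for x in seeds if x != p]
    while S:
        s = S.pop(0)  # take the first point from S and remove it
        result = neighborhood(D, s, eps)
        if len(result) >= delta:  # s is a core point: extend the cluster
            for x in result:
                if label[x] <= 0:  # unclustered or noise
                    if label[x] == UNCLUSTERED:
                        S.append(x)  # still needs to be expanded
                    label[x] = i
    return True


def dbscan(D, eps, delta):
    """Return a dict mapping each point to its cluster id (0 = noise)."""
    label = {x: UNCLUSTERED for x in D}
    i = 1
    for p in D:
        if label[p] == UNCLUSTERED:
            if expand_cluster(D, label, p, i, eps, delta):
                i += 1
    return label


# Illustrative data: two dense groups and one isolated point.
D = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8), (5, 5), (5, 6), (6, 5)]
labels = dbscan(D, eps=1.0, delta=3)
for point, cluster_id in labels.items():
    print(point, cluster_id)
```

A point first marked as noise is not queued again when a cluster reaches it later; it only receives the cluster id, which is how border points visited early end up in the right cluster.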
How to guess ε and δ?
k-distance
• k-dist: D → ℝ
• k-dist(x) is the distance from x to its k-th nearest neighbor
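
As a sketch of how k-dist might be used, the snippet below computes it by brute force and sorts the values in descending order. Reading ε off the "knee" of that sorted curve is a common heuristic and is stated here as an assumption, since the slide text only defines k-dist. The function names and the data are illustrative, and x is assumed to be an element of D.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_dist(D, x, k):
    """Distance from x (an element of D) to its k-th nearest neighbor."""
    distances = sorted(dist(x, y) for y in D)
    # distances[0] is the distance from x to itself (0.0), so index k
    # is the k-th nearest *other* point.
    return distances[k]


def sorted_k_dist(D, k):
    """k-dist values of all points, largest first."""
    return sorted((k_dist(D, x, k) for x in D), reverse=True)


# Illustrative data: a dense group, one border point, one isolated point.
D = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
for value in sorted_k_dist(D, k=2):
    print(round(value, 3))
```

Points in dense regions have small, similar k-dist values, while isolated points stand out at the start of the sorted list; a threshold just below that jump is a candidate for ε. Because N_ε(p) as defined in these slides contains p itself, a choice of k corresponds roughly to δ = k + 1.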



DBSCAN – “good to know”

Pros
• Clusters of an arbitrary shape
• Robust to outliers

Cons
• Computationally expensive: without a spatial index, the neighborhood queries take O(n²) time in total
• Hard to set the parameters ε and δ



Final remarks

• domain knowledge might help in choosing the right similarity measure
• be aware of the range of values of the attributes
  • e.g. the similarity between x = (3.2, 178) and y = (3.1, 170) is affected more by the second coordinate

• there are various other approaches to similarity computation
  • Janos Podani (2000). Introduction to the Exploration of Multivariate Biological Data. Chapter 3: Distance, similarity, correlation... Backhuys Publishers, Leiden, The Netherlands, ISBN 90-5782-067-6.
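
To make the remark about attribute ranges concrete, the sketch below measures how much each coordinate of x = (3.2, 178) and y = (3.1, 170) contributes to the squared Euclidean distance, before and after min-max scaling. The attribute ranges used for scaling are hypothetical values chosen for illustration only; they are not given in the slides.

```python
x = (3.2, 178.0)
y = (3.1, 170.0)


def squared_diffs(a, b):
    """Per-coordinate contributions to the squared Euclidean distance."""
    return [(ai - bi) ** 2 for ai, bi in zip(a, b)]


def min_max(v, ranges):
    """Rescale each coordinate of v to [0, 1] given (low, high) ranges."""
    return tuple((vi - lo) / (hi - lo) for vi, (lo, hi) in zip(v, ranges))


raw = squared_diffs(x, y)  # roughly [0.01, 64.0]
raw_share = raw[1] / sum(raw)  # share of the second coordinate

# Hypothetical attribute ranges (an assumption for this example).
ranges = [(1.0, 5.0), (150.0, 200.0)]
scaled = squared_diffs(min_max(x, ranges), min_max(y, ranges))
scaled_share = scaled[1] / sum(scaled)

print(f"second coordinate share, raw:    {raw_share:.4f}")
print(f"second coordinate share, scaled: {scaled_share:.4f}")
```

On the raw values the second coordinate accounts for nearly the whole distance; after rescaling, the first coordinate regains some influence. How much depends entirely on the ranges chosen.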



Thanks for your attention



Homework

• Download a clustering dataset from the UCI Machine Learning Repository

• Cluster the dataset using
  • Agglomerative clustering
  • k-means method
  • DBSCAN method

• Justify the choice of the values for the hyper-parameters
  • similarity, linkage, k, δ, ε, …



Questions?
