Density & Grid-Based Clustering

The document discusses various methods of cluster analysis, including partitioning, hierarchical, density-based, and grid-based methods. It highlights density-based clustering techniques like DBSCAN and OPTICS, which can identify clusters of arbitrary shapes and handle noise. Additionally, it covers evaluation methods for clustering quality and determining the number of clusters using empirical and statistical approaches.


Data Mining & Warehousing

Unit 4 – Cluster Analysis – Part 1B

Dr. VIDHYA
ASSISTANT PROFESSOR & HEAD
Department of Computer Technology
Sri Ramakrishna College of Arts and Science
Coimbatore - 641 006
Tamil Nadu, India
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary

Density-Based Clustering Methods
■ Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
■ They have difficulty finding clusters of arbitrary shape, such as "S"-shaped and oval clusters.
■ Strategy of density-based clustering methods:
  ■ To find clusters of arbitrary shape, model clusters as dense regions in the data space, separated by sparse regions.
  ■ Discover clusters of non-spherical shape.
Density-Based Clustering Methods
■ Clustering based on density (a local cluster criterion), such as density-connected points
■ Major features:
  ■ Discover clusters of arbitrary shape
  ■ Handle noise
  ■ One scan
  ■ Need density parameters as a termination condition
■ Several interesting studies:
  ■ DBSCAN: Ester, et al. (KDD'96)
  ■ OPTICS: Ankerst, et al. (SIGMOD'99)
  ■ DENCLUE: Hinneburg & Keim (KDD'98)
  ■ CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
Density-Based Clustering: Basic Concepts
■ Two parameters:
  ■ Eps: maximum radius of the neighbourhood
  ■ MinPts: minimum number of points in an Eps-neighbourhood of that point
■ NEps(q) = {p belongs to D | dist(p, q) ≤ Eps}
■ Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  ■ p belongs to NEps(q)
  ■ core point condition: |NEps(q)| ≥ MinPts
(Figure: point p in the Eps-neighbourhood of core point q, with MinPts = 5 and Eps = 1 cm)
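The two definitions above translate almost directly into code. The following is a minimal sketch (the point set, Eps, and MinPts values are made up for illustration; the convention that q counts in its own neighbourhood follows the set definition above):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def eps_neighborhood(q, D, eps):
    """N_Eps(q) = {p in D | dist(p, q) <= eps}; includes q itself."""
    return [p for p in D if dist(p, q) <= eps]

def is_core(q, D, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(q, D, eps)) >= min_pts

def directly_density_reachable(p, q, D, eps, min_pts):
    """p is directly density-reachable from q iff p is in N_Eps(q) and q is core."""
    return p in eps_neighborhood(q, D, eps) and is_core(q, D, eps, min_pts)

# Toy data: a tight clump around the origin plus one distant point.
D = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (5.0, 5.0)]
print(is_core((0.0, 0.0), D, eps=0.5, min_pts=5))                     # True
print(directly_density_reachable((5.0, 5.0), (0.0, 0.0), D, 0.5, 5))  # False
```

Note the asymmetry the slide implies: density-reachability requires q (the "from" point) to be a core point, while p may be a border point.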
Density-Reachable and Density-Connected
■ Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
■ Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
(Figures: a chain q → p1 → p illustrating density-reachability; points p and q both density-reachable from o illustrating density-connectedness)
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
■ Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
■ Discovers clusters of arbitrary shape in spatial databases with noise
(Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5)
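This notion of cluster can be sketched as a short, self-contained DBSCAN. This is an illustrative implementation, not the optimized original algorithm from the KDD'96 paper, and the toy data and parameter values are made up:

```python
from math import dist

NOISE = -1

def dbscan(D, eps, min_pts):
    """Label each point with a cluster id, or NOISE (-1).
    Each cluster is grown as a maximal set of density-connected points."""
    labels = {}
    cluster_id = 0
    for p in D:
        if p in labels:
            continue
        nbrs = [x for x in D if dist(x, p) <= eps]
        if len(nbrs) < min_pts:          # not a core point (may be relabeled
            labels[p] = NOISE            # later if it borders a cluster)
            continue
        labels[p] = cluster_id           # start a new cluster from core point p
        seeds = [x for x in nbrs if x != p]
        while seeds:
            q = seeds.pop()
            if labels.get(q) == NOISE:   # border point: absorb into cluster
                labels[q] = cluster_id
            if q in labels:
                continue
            labels[q] = cluster_id
            q_nbrs = [x for x in D if dist(x, q) <= eps]
            if len(q_nbrs) >= min_pts:   # q is also core: expand through it
                seeds.extend(q_nbrs)
        cluster_id += 1
    return labels

# Two clumps plus one isolated point.
D = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0),
     (4.0, 4.0), (4.1, 4.1), (4.0, 4.2),
     (9.0, 0.0)]
labels = dbscan(D, eps=0.5, min_pts=3)
print(labels[(0.0, 0.0)], labels[(4.0, 4.0)], labels[(9.0, 0.0)])  # 0 1 -1
```

The isolated point ends up labeled NOISE, which is exactly the "handle noise" feature listed earlier.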
DBSCAN: Sensitive to Parameters

DBSCAN online demo:
http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
OPTICS: A Cluster-Ordering Method (1999)
■ OPTICS: Ordering Points To Identify the Clustering Structure
  ■ Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
■ Produces a special order of the database w.r.t. its density-based clustering structure
■ This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
■ Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
■ Can be represented graphically or using visualization techniques
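The two quantities OPTICS orders points by, core-distance and reachability-distance, can be sketched as follows. This is a simplified illustration of the definitions only, not the full ordering algorithm, and the data are made up:

```python
from math import dist

def core_distance(p, D, eps, min_pts):
    """Distance to the MinPts-th nearest neighbour of p (p itself counts),
    or None if p is not a core point w.r.t. eps."""
    dists = sorted(dist(p, x) for x in D if dist(p, x) <= eps)
    return dists[min_pts - 1] if len(dists) >= min_pts else None

def reachability_distance(o, p, D, eps, min_pts):
    """max(core-distance(p), dist(p, o)); undefined if p is not core."""
    cd = core_distance(p, D, eps, min_pts)
    return None if cd is None else max(cd, dist(p, o))

D = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (0.3, 0.1), (5.0, 5.0)]
print(core_distance((0.0, 0.0), D, eps=1.0, min_pts=3))   # 0.2
print(reachability_distance((0.3, 0.1), (0.0, 0.0), D, 1.0, 3))
```

Sorting points by reachability-distance along the OPTICS ordering is what produces the "valleys" in a reachability plot; each valley corresponds to a cluster at some density level.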
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary

Grid-Based Clustering Method
■ Uses a multi-resolution grid data structure
■ Several interesting methods:
  ■ STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (VLDB'97)
  ■ CLIQUE: Agrawal, et al. (SIGMOD'98)
    ■ Both grid-based and subspace clustering
  ■ WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98)
    ■ A multi-resolution clustering approach using the wavelet method
STING: A Statistical Information Grid
Approach
■ Wang, Yang and Muntz (VLDB’97)
■ The spatial area is divided into rectangular cells
■ There are several levels of cells corresponding to
different levels of resolution

The STING Clustering Method
■ Each cell at a high level is partitioned into a number of smaller cells at the next lower level
■ Statistical info of each cell is calculated and stored beforehand and is used to answer queries
■ Parameters of higher-level cells can be easily calculated from parameters of lower-level cells
  ■ count, mean, standard deviation (s), min, max
  ■ type of distribution (normal, uniform, etc.)
■ Use a top-down approach to answer spatial data queries
■ Start from a pre-selected layer, typically with a small number of cells
■ For each cell in the current level, compute the confidence interval
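The bottom-up parameter computation can be sketched for count, mean, min, and max. The cell values below are made up for illustration; standard deviation would combine similarly from per-cell sums of squares:

```python
def merge_cells(cells):
    """Compute a parent cell's statistics from its child cells.
    Each cell is a dict with keys: count, mean, min, max."""
    n = sum(c["count"] for c in cells)
    mean = sum(c["count"] * c["mean"] for c in cells) / n   # count-weighted mean
    return {
        "count": n,
        "mean": mean,
        "min": min(c["min"] for c in cells),
        "max": max(c["max"] for c in cells),
    }

children = [
    {"count": 10, "mean": 2.0, "min": 1.0, "max": 3.0},
    {"count": 30, "mean": 4.0, "min": 0.5, "max": 9.0},
]
print(merge_cells(children))
# {'count': 40, 'mean': 3.5, 'min': 0.5, 'max': 9.0}
```

Because the parent's parameters come only from the children's stored summaries, never from the raw points, answering a query at a coarse level costs O(number of cells), not O(number of points).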
STING Algorithm and Its Analysis
■ Remove the irrelevant cells from further consideration
■ When finished examining the current layer, proceed to the next lower level
■ Repeat this process until the bottom layer is reached
■ Advantages:
  ■ Query-independent, easy to parallelize, incremental update
  ■ O(K), where K is the number of grid cells at the lowest level
■ Disadvantages:
  ■ All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
CLIQUE (Clustering In QUEst)
■ Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
■ Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
■ CLIQUE can be considered as both density-based and grid-based
  ■ It partitions each dimension into the same number of equal-length intervals
  ■ It partitions an m-dimensional data space into non-overlapping rectangular units
  ■ A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
  ■ A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
■ Partition the data space and find the number of points that lie inside each cell of the partition
■ Identify the subspaces that contain clusters using the Apriori principle
■ Identify clusters
  ■ Determine dense units in all subspaces of interest
  ■ Determine connected dense units in all subspaces of interest
■ Generate a minimal description for the clusters
  ■ Determine maximal regions that cover a cluster of connected dense units for each cluster
  ■ Determine the minimal cover for each cluster
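The first step, partitioning a dimension and finding dense units, can be sketched in a few lines. The data, interval count (often written ξ in CLIQUE), and density threshold (τ) below are made up for illustration:

```python
def dense_units_1d(values, lo, hi, xi, tau):
    """Partition [lo, hi) into xi equal-length intervals and return the
    indices of units whose fraction of the points exceeds tau."""
    width = (hi - lo) / xi
    counts = [0] * xi
    for v in values:
        i = min(int((v - lo) / width), xi - 1)  # clamp the hi endpoint into the last unit
        counts[i] += 1
    n = len(values)
    return [i for i, c in enumerate(counts) if c / n > tau]

# 1-D toy data: most of the mass near 0, a little near 9.
values = [0.1, 0.2, 0.3, 0.4, 0.5, 9.0]
print(dense_units_1d(values, lo=0.0, hi=10.0, xi=10, tau=0.2))  # [0]
```

The Apriori principle then prunes the search over subspaces: a unit in a k-dimensional subspace can be dense only if all of its (k−1)-dimensional projections are dense, so higher-dimensional candidates are built only from surviving lower-dimensional dense units.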
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

■ Cluster Analysis: Basic Concepts


■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary
Determine the Number of Clusters
■ Empirical method
  ■ # of clusters: k ≈ √(n/2) for a dataset of n points, e.g., n = 200 gives k = 10
■ Elbow method
  ■ Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
■ Cross-validation method
  ■ Divide a given data set into m parts
  ■ Use m − 1 parts to obtain a clustering model
  ■ Use the remaining part to test the quality of the clustering
    ■ E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
  ■ For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the # of clusters that best fits the data
Measuring Clustering Quality
■ Three kinds of measures: external, internal, and relative
■ External: supervised, employs criteria not inherent to the dataset
  ■ Compare a clustering against prior or expert-specified knowledge (i.e., the ground truth) using a clustering quality measure
■ Internal: unsupervised, criteria derived from the data itself
  ■ Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are, e.g., the silhouette coefficient
■ Relative: directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm
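The silhouette coefficient mentioned above can be sketched for a single point. This is a minimal pure-Python version with made-up clusters; a is the mean distance to the point's own cluster (compactness) and b is the smallest mean distance to another cluster (separation):

```python
from math import dist

def silhouette(p, own, others):
    """s(p) = (b - a) / max(a, b), where a is the mean distance from p to the
    other points of its own cluster and b is the smallest mean distance from
    p to the points of any other cluster. Values near 1 are good."""
    rest = [x for x in own if x != p]
    a = sum(dist(p, x) for x in rest) / len(rest)
    b = min(sum(dist(p, x) for x in c) / len(c) for c in others)
    return (b - a) / max(a, b)

c1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
c2 = [(10.0, 10.0), (10.0, 11.0)]
s = silhouette((0.0, 0.0), c1, [c2])
print(round(s, 3))  # close to 1: compact cluster, far from the other one
```

Averaging s(p) over all points gives a single internal quality score for the whole clustering, with no ground truth required.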
Some Commonly Used External Measures
■ Matching-based measures
  ■ Purity, maximum matching, F-measure
■ Entropy-based measures
  ■ Conditional entropy, normalized mutual information (NMI), variation of information
■ Pair-wise measures
  ■ Four possibilities: true positive (TP), FN, FP, TN
  ■ Jaccard coefficient, Rand statistic, Fowlkes-Mallows measure
■ Correlation measures
  ■ Discretized Hubert statistic, normalized discretized Hubert statistic
(Figure: ground-truth partitions T1, T2 vs. clusters C1, C2)
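Purity, the simplest matching-based measure above, can be sketched as follows. The point ids and labelings are made up; each cluster contributes the size of its largest overlap with any ground-truth class:

```python
def purity(clusters, truth):
    """purity = (1/n) * sum over clusters of the size of the cluster's
    largest overlap with any ground-truth class.
    clusters and truth are lists of sets of point ids."""
    n = sum(len(c) for c in clusters)
    return sum(max(len(c & t) for t in truth) for c in clusters) / n

clusters = [{1, 2, 3}, {4, 5, 6}]
truth = [{1, 2, 4}, {3, 5, 6}]
print(purity(clusters, truth))  # (2 + 2) / 6 ≈ 0.667
```

Note that purity alone rewards many tiny clusters (one point per cluster gives purity 1), which is why it is usually reported alongside measures like NMI or the F-measure.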
