Unit 4 Descriptive Modeling
INTRODUCTION:
Cluster analysis, also known as clustering, is a method of data
mining that groups similar data points together. The goal of cluster
analysis is to divide a dataset into groups (or clusters) such that the
data points within each group are more similar to each other than to
data points in other groups. This process is often used for
exploratory data analysis and can help identify patterns or
relationships within the data that may not be immediately obvious.
There are many different algorithms used for cluster analysis, such
as k-means, hierarchical clustering, and density-based clustering.
The choice of algorithm will depend on the specific requirements of
the analysis and the nature of the data being analyzed.
Cluster analysis is the process of finding groups of similar objects in
order to form clusters. It is an unsupervised machine learning
technique that acts on unlabelled data: similar data points are
grouped together, and all the objects in a group belong to the same
cluster.
The given data is divided into groups by combining similar objects;
each such group is a cluster, i.e., a collection of similar data
grouped together.
For example, consider a vehicle dataset that contains information
about different vehicles such as cars, buses, and bicycles. Because
this is unsupervised learning, there are no class labels like Car or
Bike; all the data is combined and unstructured.
Our task is to convert the unlabelled data into labelled data, and
this can be done using clusters.
The main idea of cluster analysis is to arrange all the data points
into clusters, for example a cars cluster that contains all the cars
and a bikes cluster that contains all the bikes.
Simply put, it is the partitioning of unlabelled data into groups of
similar objects.
Properties of Clustering:
1. Clustering Scalability: Today's applications involve vast amounts
of data, so a clustering algorithm must be able to handle huge
databases. If the algorithm does not scale, it cannot produce
appropriate results on large data sets and may give wrong results.
2. High Dimensionality: The algorithm should be able to handle
high-dimensional data as well as small, low-dimensional data sets.
3. Algorithm Usability with Multiple Data Kinds: A clustering
algorithm should be capable of dealing with different kinds of data,
such as discrete, categorical, interval-based, and binary data.
4. Dealing with Unstructured Data: Some databases contain missing
values and noisy or erroneous data. Algorithms that are sensitive to
such data may produce poor-quality clusters, so a clustering
algorithm should be able to handle unstructured data and give it
structure by organising it into groups of similar data objects. This
makes it easier for the data expert to process the data and discover
new patterns.
5. Interpretability: The clustering outcomes should be
interpretable, comprehensible, and usable. The interpretability
reflects how easily the data is understood.
Clustering Methods:
The clustering methods can be classified into the following
categories:
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method (K-Means):
Input:
K, the number of clusters, and a dataset D
Output:
A dataset of K clusters
Method:
1. Randomly select K objects from the dataset (D) as initial cluster
centres (C).
2. (Re)assign each object to the cluster centre it is most similar
to, based on the mean values.
3. Update the cluster means, i.e., recalculate the mean of each
cluster with the updated assignments.
4. Repeat Steps 2 and 3 until no change occurs.
Figure – K-Means Clustering (flowchart)
Example: Suppose we want to group the visitors to a website using
just their age as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45,
61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45,
61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between Iterations 3 and 4, so we stop. The
K-means algorithm therefore yields two clusters: ages 16-29 and ages
36-66.
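The method and worked example above can be sketched in plain Python. This is a minimal 1-D K-means (Lloyd's algorithm), assuming absolute distance and the same initial centroids, 16 and 22, as in the example:

```python
def kmeans_1d(data, centroids, max_iter=100):
    """1-D K-means: returns (final centroids, clusters) once the means stop changing."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Step 2: (re)assign each object to the nearest cluster centre
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Step 3: recalculate the mean of each cluster
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop when no change occurs
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
centroids, clusters = kmeans_1d(ages, [16.0, 22.0])
print(clusters[0])  # [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
print(clusters[1])  # [36, 41, 42, 43, 44, 45, 61, 62, 66]
```

Running this converges to means of 20.5 and about 48.89, matching Iterations 3 and 4 above.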
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of
Applications with Noise.
It is a popular unsupervised learning method that separates
high-density clusters from low-density regions: it divides the data
points into groups such that points in the same group have similar
properties. It was proposed by Martin Ester, Hans-Peter Kriegel,
Jörg Sander, and Xiaowei Xu in 1996.
DBSCAN is designed for use with databases that can accelerate region
queries. However, it cannot properly cluster data sets whose clusters
have large differences in density.
Characteristics
It can detect arbitrarily shaped clusters, i.e., clusters of any
shape in a data set.
It is based on intuitive notions of clusters and noise.
It is very robust in detecting outliers in a data set.
It requires only two parameters (eps, the neighbourhood radius, and
MinPts, the minimum number of points needed to form a dense region),
and it is largely insensitive to the order in which points occur in
the data set.
Advantages
Specification of number of clusters of data in the data set is not
required.
It can find clusters of any shape, even a cluster that is completely
surrounded by a different cluster.
It can easily find outliers in a data set.
It is noise tolerant, i.e., not very sensitive to noise.
It is the second most used clustering method after K-means.
Disadvantages
The quality of the result depends on the distance measure used
in the regionQuery function.
Border points may go in any cluster depending on the processing
order so it is not completely deterministic.
It can be expensive when the cost of computing nearest neighbours is
high.
It can be slow for high-dimensional data.
It adapts poorly to variations in local density.
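Since the text describes DBSCAN rather than listing its steps, here is a compact, illustrative pure-Python sketch on 1-D data. The data set and the eps and min_pts values are assumptions chosen for demonstration:

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbours(i):
        # Region query: all points within distance eps of point i
        return [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:          # not a core point: tentatively noise
            labels[i] = -1
            continue
        cluster += 1                     # start a new cluster from this core point
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:          # border point previously marked as noise
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:   # j is also a core point: expand the cluster
                seeds.extend(j_nbrs)
    return labels

points = [1.0, 1.5, 2.0, 2.2, 8.0, 8.3, 8.6, 25.0]
print(dbscan(points, eps=1.0, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The two dense groups become clusters 0 and 1 without the number of clusters being specified, and the isolated point 25.0 is labelled as noise.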
Grid-Based Method: In the grid-based method, the object space is quantized into a
finite number of cells that form a grid structure. The major advantage of this
method is its fast processing time, because the processing depends only on the
number of cells in each dimension of the quantized space, not on the number of
data objects.
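To illustrate the quantization step, the hypothetical helper below (not from the text) maps 2-D points into grid cells; any subsequent clustering work then depends on the number of occupied cells rather than on the number of objects:

```python
from collections import Counter

def grid_cells(points, cell_size):
    """Map each (x, y) point to the integer grid cell it falls in; count points per cell."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

points = [(0.2, 0.3), (0.4, 0.1), (5.1, 5.2), (5.3, 5.4), (9.9, 0.2)]
print(grid_cells(points, cell_size=1.0))  # Counter({(0, 0): 2, (5, 5): 2, (9, 0): 1})
```

Five points collapse into three occupied cells; a grid-based algorithm would then merge adjacent dense cells into clusters.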
Model-Based Method: In the model-based method, a model is hypothesized for each
cluster, and the algorithm finds the data that best fits that model. The clusters
for a given model are located by clustering the density function, which reflects
the spatial distribution of the data points. This approach also provides a way to
automatically determine the number of clusters based on standard statistics,
taking outliers and noise into account, and therefore yields robust clustering
methods.
Constraint-Based Method: Constraint-based clustering is performed by
incorporating application- or user-oriented constraints. A constraint refers to
the user's expectations or to properties of the desired clustering results.
Constraints provide an interactive way of communicating with the clustering
process and can be specified by the user or by the application's requirements.
Biclustering Methods:
Biclustering means clustering the data based on two factors: in some
applications we can cluster both objects and attributes at the same
time. The resulting clusters are called biclusters. Biclustering has
four requirements:
Only a small set of objects participates in each cluster.
A cluster involves only a small number of attributes.
A data object can take part in multiple clusters, or in no cluster
at all.
An attribute may be involved in multiple clusters, or in no cluster
at all.
Objects and attributes are not treated in the same way: objects are
clustered according to their attribute values, and biclustering
analysis treats objects and attributes differently.
Types of Outliers
Outliers are divided into three different types
Global Outliers
Global outliers are also called point outliers and are considered the simplest
form of outlier. When a data point deviates from all the rest of the data points
in a given data set, it is known as a global outlier. In most cases, outlier
detection procedures are targeted at finding global outliers. In the
accompanying figure, the green data point is the global outlier.
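The text does not fix a detection rule; one common, simple rule for global (point) outliers is the z-score test, which flags values that lie far from the mean in units of standard deviation. The readings and the threshold of 3 are illustrative assumptions:

```python
import statistics

def global_outliers(values, threshold=3.0):
    """Flag values whose z-score (distance from the mean in std devs) exceeds threshold."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10,
            12, 9, 10, 11, 10, 9, 12, 11, 10, 95]
print(global_outliers(readings))  # [95]
```

The value 95 deviates from every other reading in the data set, so it is flagged as a global outlier.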
Collective Outliers
When a group of data points collectively deviates from the rest of a data set,
the group is called a collective outlier. The individual data objects may not be
outliers on their own, but considered as a whole they behave as one. Identifying
collective outliers requires background knowledge about the relationships among
the data objects. For example, in an intrusion detection system, a single
denial-of-service (DoS) packet sent from one system to another is normal
behavior; but when many computers send such packets simultaneously, it is
considered abnormal behavior, and as a whole these events form a collective
outlier. In the accompanying figure, the green data points as a whole represent
the collective outlier.
Contextual Outliers
As the name suggests, a contextual outlier is an outlier within a particular
context; contextual outliers are also known as conditional outliers. These
outliers occur when a data object deviates from the other data points because of
a specific condition in a given data set, for example a brief burst of
background noise in speech recognition. Data objects have two kinds of
attributes: contextual attributes, which define the context (such as time and
location), and behavioral attributes, which hold the measured values. Contextual
outlier analysis enables users to examine outliers in different contexts and
conditions, which is useful in various applications. For example, a temperature
reading of 45 degrees Celsius may behave as an outlier in the rainy season but
as a normal data point in the summer season. In the given diagram, a green dot
representing a low temperature value in June is a contextual outlier, since the
same value in December would not be an outlier.
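The temperature example can be sketched as follows: each value is judged only against readings that share its context (here, the month). The readings and the 2-sigma threshold are illustrative assumptions:

```python
import statistics
from collections import defaultdict

def contextual_outliers(readings, threshold=2.0):
    """readings: list of (context, value) pairs; flag values unusual within their own context."""
    by_context = defaultdict(list)
    for ctx, value in readings:
        by_context[ctx].append(value)
    flagged = []
    for ctx, value in readings:
        values = by_context[ctx]
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        # A value is a contextual outlier only relative to its own context's distribution
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((ctx, value))
    return flagged

readings = [("June", 18), ("June", 17), ("June", 19),
            ("June", 18), ("June", 17), ("June", 45),
            ("December", 44), ("December", 45), ("December", 46),
            ("December", 45), ("December", 44), ("December", 45)]
print(contextual_outliers(readings))  # [('June', 45)]
```

A reading of 45 is flagged in June but not in December, where it matches the rest of that context.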
Outliers Analysis
Outliers are often discarded when data mining is applied, but outlier analysis
is still used in many applications, such as fraud detection and medicine. This
is because rare events can carry much more significant information than events
that occur regularly.
Other applications where outlier detection plays a vital role are given below.
Any unusual response that occurs due to a medical treatment can be analyzed
through outlier analysis in data mining.