
Unit 4 Descriptive Modeling

Data Mining – Cluster Analysis




INTRODUCTION:
Cluster analysis, also known as clustering, is a method of data
mining that groups similar data points together. The goal of cluster
analysis is to divide a dataset into groups (or clusters) such that the
data points within each group are more similar to each other than to
data points in other groups. This process is often used for
exploratory data analysis and can help identify patterns or
relationships within the data that may not be immediately obvious.
There are many different algorithms used for cluster analysis, such
as k-means, hierarchical clustering, and density-based clustering.
The choice of algorithm will depend on the specific requirements of
the analysis and the nature of the data being analyzed.
Cluster analysis is the process of finding similar groups of objects in order to form clusters. It is an unsupervised machine-learning technique that acts on unlabelled data: similar objects are combined into a group, and such a group, a collection of similar data points grouped together, is called a cluster.
For example, consider a dataset of vehicles containing information about different vehicles such as cars, buses, and bicycles. Because this is unsupervised learning, there are no class labels such as Car or Bike attached to the vehicles; all the data is mixed together and is not in a structured form. Our task is to convert this unlabelled data into labelled data, and this can be done using clusters: cluster analysis arranges the data points into clusters such as a cars cluster containing all the cars, a bikes cluster containing all the bikes, and so on. Simply put, clustering is the partitioning of unlabelled data into groups of similar objects.

Properties of Clustering :
1. Clustering Scalability: Nowadays there are vast amounts of data, so clustering algorithms must be able to deal with huge databases. If an algorithm is not scalable, it may produce incomplete or misleading results on large datasets.
2. High Dimensionality: The algorithm should be able to handle high-dimensional data as well as data of small size.
3. Algorithm Usability with multiple data kinds: Different kinds
of data can be used with algorithms of clustering. It should be
capable of dealing with different types of data like discrete,
categorical and interval-based data, binary data etc.
4. Dealing with unstructured data: Some databases contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data, it may produce poor-quality clusters. The algorithm should therefore be able to handle unstructured data and give it some structure by organizing it into groups of similar data objects, which makes the data expert's job of processing the data and discovering new patterns easier.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable; interpretability reflects how easily the results can be understood.
Clustering Methods:
The clustering methods can be classified into the following
categories:
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method

Partitioning Method (K-Mean) in Data Mining


Partitioning Method: This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It is up to the data analyst to specify the number of clusters to be generated. Given a database D containing N objects, the partitioning method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications). In this section, we will see the working of the K-Means algorithm in detail.
K-Means (a centroid-based technique): The K-Means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the similarity among data objects inside a cluster (intra-cluster similarity) is high, while the similarity between data objects from different clusters (inter-cluster similarity) is low. The similarity of a cluster is determined with respect to the mean value of the cluster; K-Means is a squared-error-based algorithm. At the start, K objects are chosen at random from the dataset, each representing a cluster mean (center). Each of the remaining data objects is assigned to the nearest cluster based on its distance from the cluster mean. The new mean of each cluster is then calculated from the data objects assigned to it.
Algorithm:
K mean:
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects

Output:
A dataset of K clusters
Method:
1. Randomly select K objects from the dataset (D) as the initial cluster
centers (C).
2. (Re)assign each object to the cluster whose mean it is most similar to,
i.e., the nearest center.
3. Update the cluster means, i.e., recalculate the mean of each cluster
from its currently assigned objects.
4. Repeat Steps 2 and 3 until no assignment changes.

Figure – K-Means clustering

Flowchart:
Figure – K-Means clustering flowchart
Example: Suppose we want to group the visitors to a website using
just their age as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45,
61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45,
61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between Iteration 3 and Iteration 4, so we stop. Therefore the K-Means algorithm gives us the two clusters (16-29) and (36-66).
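The worked example above can be reproduced in a few lines of Python. The sketch below is a minimal, assumed implementation of one-dimensional K-Means using only NumPy; the function name kmeans_1d and its stopping test are illustrative choices, not part of any standard library.

import numpy as np

def kmeans_1d(data, centroids, max_iter=100):
    # Minimal 1-D K-Means sketch (assumed implementation).
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest cluster center.
        labels = np.argmin(np.abs(data[:, None] - centroids[None, :]), axis=1)
        # Step 3: recompute each center as the mean of its assigned points.
        new_centroids = np.array([data[labels == k].mean()
                                  for k in range(len(centroids))])
        # Step 4: stop when the centers no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
        36, 41, 42, 43, 44, 45, 61, 62, 66]
centroids, labels = kmeans_1d(ages, [16, 22])
print(centroids)  # -> [20.5, 48.89...], matching Iterations 3 and 4 above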
Hierarchical Clustering in Data Mining



A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then it repeatedly executes the following two steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most comparable clusters.
These steps are repeated until all the clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called a dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).
What is Hierarchical Clustering?
Hierarchical clustering is a method of cluster analysis in data mining
that creates a hierarchical representation of the clusters in a
dataset. The method starts by treating each data point as a
separate cluster and then iteratively combines the closest clusters
until a stopping criterion is reached. The result of hierarchical
clustering is a tree-like structure, called a dendrogram, which
illustrates the hierarchical relationships among the clusters.
Hierarchical clustering has several advantages over other clustering methods:
 The ability to handle non-convex clusters and clusters of different
sizes and densities.
 The ability to handle missing data and noisy data.
 The ability to reveal the hierarchical structure of the data, which
can be useful for understanding the relationships among the
clusters.
Drawbacks of Hierarchical Clustering
 The need for a criterion to stop the clustering process and
determine the final number of clusters.
 The computational cost and memory requirements of the method
can be high, especially for large datasets.
 The results can be sensitive to the initial conditions, linkage
criterion, and distance metric used.
In summary, Hierarchical clustering is a method of data mining
that groups similar data points into clusters by creating a
hierarchical structure of the clusters.
 This method can handle different types of data and reveal the
relationships among the clusters. However, it can have high
computational cost and results can be sensitive to some
conditions.
Types of Hierarchical Clustering
Basically, there are two types of hierarchical Clustering:
1. Agglomerative Clustering
2. Divisive clustering
1. Agglomerative Clustering
Initially, consider every data point as an individual cluster and, at every step, merge the nearest pairs of clusters (it is a bottom-up method). At first every data point is treated as an individual entity or cluster; at every iteration, clusters merge with other clusters until one cluster remains.
The algorithm for Agglomerative Hierarchical Clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for the merged clusters.
5. Repeat Steps 3 and 4 until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a
dendrogram.
Note: This is just a demonstration of how the actual algorithm works; no calculation has been performed below, and all the proximities among the clusters are assumed.
Let’s say we have six data points A, B, C, D, E, and F.

Figure – Agglomerative Hierarchical clustering

 Step-1: Consider each alphabet as a single cluster and calculate


the distance of one cluster from all the other clusters.
 Step-2: In the second step, comparable clusters are merged together to form a single cluster. Let's say cluster (B) and cluster (C) are very similar to each other, so we merge them in this step; similarly for clusters (D) and (E). We are left with the clusters [(A), (BC), (DE), (F)].
 Step-3: We recalculate the proximity according to the algorithm
and merge the two nearest clusters([(DE), (F)]) together to form
new clusters as [(A), (BC), (DEF)]
 Step-4: Repeating the same process; The clusters DEF and BC
are comparable and merged together to form a new cluster.
We’re now left with clusters [(A), (BCDEF)].
 Step-5: At last, the two remaining clusters are merged together
to form a single cluster [(ABCDEF)].
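To make the demonstration concrete, the SciPy sketch below runs agglomerative clustering on six assumed 2-D points labelled A-F and draws the resulting dendrogram; the coordinates and the single-linkage choice are illustrative assumptions, chosen so the merges follow the steps above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Six made-up points; B/C and D/E are placed close together so that they
# merge early, and A is placed far away so it merges last.
points = np.array([[-3.0, -3.0],  # A
                   [2.0, 1.0],    # B
                   [2.2, 1.1],    # C
                   [5.0, 4.0],    # D
                   [5.1, 4.2],    # E
                   [5.8, 4.8]])   # F

# 'single' linkage merges the two closest clusters at every step.
Z = linkage(points, method='single')
dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E', 'F'])
plt.title('Agglomerative hierarchical clustering dendrogram')
plt.show()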

2. Divisive Hierarchical clustering


We can say that divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, we start with all of the data points in a single cluster, and in every iteration we split off the data points that are not comparable with the rest of their cluster. In the end, we are left with N clusters, one per data point.

Figure – Divisive Hierarchical clustering

DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of
Applications with Noise.
It is a popular unsupervised learning method used for model
construction and machine learning algorithms. It is a clustering
method utilized for separating high-density clusters from low-density
clusters. It divides the data points into many groups so that points
lying in the same group will have the same properties. It was
proposed by Martin Ester, Hans-Peter Kriegel, Jorg Sander, and
Xiaowei Xu in 1996.
DBSCAN is designed for use with databases that can accelerate region queries. It cannot, however, properly cluster data sets with large differences in density.
Characteristics
 It identifies clusters of any shape in a data set, which means it can detect arbitrarily shaped clusters.
 It is based on intuitive notions of clusters and noise.
 It is very robust in the detection of outliers in a data set.
 It requires only two parameters (a neighborhood radius eps and a minimum number of points minPts), and it is largely insensitive to the order in which points occur in the data set.
Advantages
 Specification of number of clusters of data in the data set is not
required.
 It can find clusters of any shape, even when a cluster is completely surrounded by a different cluster.
 It can easily find outliers in a data set.
 It is not very sensitive to noise, i.e., it is noise-tolerant.
 It is the second most used clustering method after K-Means.
Disadvantages
 The quality of the result depends on the distance measure used
in the regionQuery function.
 Border points may go in any cluster depending on the processing
order so it is not completely deterministic.
 It can be expensive when cost of computation of nearest neighbor
is high.
 It can be slow in execution for higher-dimensional data.
 It adapts poorly to large variations in local density.
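As a quick illustration, the scikit-learn sketch below (with assumed toy data) shows the two parameters in action: eps is the neighborhood radius, min_samples is the minimum number of points required to form a dense region, and the label -1 marks points treated as noise.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],   # dense group 1
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],   # dense group 2
              [9.0, 0.0]])                          # isolated point

# eps and min_samples are DBSCAN's two parameters mentioned above.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # -> [0 0 0 1 1 1 -1]; -1 marks the isolated point as noise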
Grid-Based Method: In the grid-based method, a grid is formed over the objects, i.e., the object space is quantized into a finite number of cells that form a grid structure. A major advantage of the grid-based method is its fast processing time, which depends only on the number of cells in each dimension of the quantized space, not on the number of data objects.
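A rough sketch of the quantization idea (assumed data and bin count): NumPy's histogram2d divides a 2-D object space into a grid of cells and counts the objects per cell, the kind of structure grid-based methods operate on.

import numpy as np

rng = np.random.default_rng(1)
points = rng.uniform(0, 10, size=(500, 2))   # assumed 2-D object space

# Quantize the space into a 5 x 5 grid of cells and count objects per cell.
counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=5)
print(counts.astype(int))  # dense cells are candidate cluster regions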
Model-Based Method: In the model-based method, a model is hypothesized for each of the clusters in order to find the best fit of the data to the given model. Clusters are located by clustering the density function, which reflects the spatial distribution of the data points. This approach also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account, and therefore yields robust clustering methods.
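A hedged sketch of the model-based idea using a Gaussian mixture (assumed data): each cluster is hypothesized as a Gaussian component, and a standard statistic such as BIC can help compare models with different numbers of clusters.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(6, 1, size=(100, 2))])  # two planted groups

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.means_)  # fitted component means, one per hypothesized cluster
print(gm.bic(X))  # statistic usable to compare candidate cluster counts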
Constraint-Based Method: The constraint-based clustering method is performed by
the incorporation of application or user-oriented constraints. A constraint refers to
the user expectation or the properties of the desired clustering results. Constraints provide an interactive way of communicating with the clustering process, and they can be specified by the user or by the application requirements.

Applications of Cluster Analysis:


 It is widely used in image processing, data analysis, and pattern
recognition.
 It helps marketers find distinct groups in their customer base, and they can characterize those groups by their purchasing patterns.
 It can be used in the field of biology, by deriving animal and plant
taxonomies and identifying genes with the same capabilities.
 It also helps in information discovery by classifying documents on
the web.

Advantages of Cluster Analysis:
1. It can help identify patterns and relationships within a dataset that may not be immediately obvious.
2. It can be used for exploratory data analysis and can help with feature selection.
3. It can be used to reduce the dimensionality of the data.
4. It can be used for anomaly detection and outlier identification.
5. It can be used for market segmentation and customer profiling.

Disadvantages of Cluster Analysis:
1. It can be sensitive to the choice of initial conditions and the number of clusters.
2. It can be sensitive to the presence of noise or outliers in the data.
3. It can be difficult to interpret the results of the analysis if the clusters are not well-defined.
4. It can be computationally expensive for large datasets.
5. The results of the analysis can be affected by the choice of clustering algorithm used.
6. It is important to note that the success of cluster analysis depends on the data, the goals of the analysis, and the ability of the analyst to interpret the results.
Clustering High-Dimensional Data in Data
Mining



Clustering is basically a type of unsupervised learning method, i.e., a method in which we draw references from datasets consisting of input data without labelled responses. Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another than to the data points in other groups.

Challenges of Clustering High-Dimensional Data:

Clustering high-dimensional data returns groups of objects as clusters. To perform cluster analysis of high-dimensional data we must group similar objects together, but the high-dimensional data space is huge and contains complex data types and attributes. A major challenge is finding the set of attributes that is present in each cluster, since a cluster is defined and characterized by the attributes present in it. When clustering high-dimensional data we therefore need to search both for the clusters and for the subspaces in which those clusters exist.
High-dimensional data is often reduced to low-dimensional data to make clustering and the search for clusters simpler. Some applications need more specialized models of clusters, especially for high-dimensional data: clusters in high-dimensional data are often significantly small, and conventional distance measures can be ineffective. Instead, to find the hidden clusters in high-dimensional data we need to apply sophisticated techniques that can model correlations among the objects in subspaces.
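As a concrete illustration of the dimensionality-reduction route just described, the hedged scikit-learn sketch below (entirely made-up data) projects a 50-attribute dataset down to 2 dimensions with PCA before clustering with K-Means.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # 200 objects with 50 attributes
X[:100] += 3.0                   # plant two well-separated groups

# Reduce 50 dimensions to 2, then cluster in the low-dimensional space.
X_low = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_low)
print(np.bincount(labels))       # roughly [100 100] if the groups separate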
Subspace Clustering Methods:
There are 3 Subspace Clustering Methods:
 Subspace search methods
 Correlation-based clustering methods
 Biclustering methods
Subspace clustering approaches search for clusters existing in subspaces of the given high-dimensional data space, where a subspace is defined using a subset of attributes in the full space.

1. Subspace Search Methods: A subspace search method searches subspaces for clusters, where a cluster is a group of similar objects in a subspace. The similarity between objects is measured using distance or density features. The CLIQUE algorithm is one example of a subspace clustering method. Subspace search methods explore a series of subspaces using one of two approaches. The bottom-up approach starts searching from the low-dimensional subspaces; if the hidden clusters are not found there, it searches in higher-dimensional subspaces. The top-down approach starts from the high-dimensional subspaces and then searches in subsets of the low-dimensional subspaces. Top-down approaches are effective only if the subspace of a cluster can be determined by the local neighborhood of the subspace clusters.
2. Correlation-Based Clustering Methods: Correlation-based approaches discover hidden clusters by developing advanced correlation models. Correlation-based models are preferred when it is not possible to cluster the objects using subspace search methods. Correlation-based clustering includes advanced mining techniques for correlation cluster analysis. Biclustering methods are correlation-based clustering methods in which both the objects and the attributes are clustered.

3. Biclustering Methods:
Biclustering means clustering the data along two dimensions at once: in some applications we can cluster both objects and attributes at the same time, and the resulting clusters are called biclusters. Biclustering rests on four requirements:
 Only a small set of objects participates in each cluster.
 A cluster involves only a small number of attributes.
 A data object can take part in multiple clusters, or may not belong to any cluster at all.
 An attribute may be involved in multiple clusters, or in none.
Objects and attributes are not treated in the same way: objects are clustered according to their attribute values, and objects and attributes are treated as different entities in biclustering analysis.
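For a rough feel of biclustering, the sketch below uses scikit-learn's SpectralCoclustering on an assumed toy matrix (rows = objects, columns = attributes) with two planted blocks; this is one biclustering technique among many, not the only approach.

import numpy as np
from sklearn.cluster import SpectralCoclustering

# Two block patterns planted: objects 0-1 score high on attributes 0-1,
# objects 2-3 score high on attributes 2-3.
data = np.array([[9, 9, 1, 1],
                 [8, 9, 2, 1],
                 [1, 2, 9, 8],
                 [1, 1, 9, 9]])

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(data)
print(model.row_labels_)     # cluster label for each object (row)
print(model.column_labels_)  # cluster label for each attribute (column)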

What is an Outlier in Data Mining?

Whenever we talk about data analysis, the term outlier often comes to mind. As the name suggests, "outliers" refer to data points that lie outside of what is expected. What matters most about outliers is what you do with them. Whenever you set out to analyze a data set, you carry assumptions about how the data was generated; if you find data points that are likely to contain some form of error, they are definitely outliers, and depending on the context you will want to deal with those errors. The data mining process involves the analysis and prediction of the information the data holds. Grubbs introduced the first definition of outliers in 1969.
Difference between outliers and noise
Noise is any unwanted error, or unexplained variance, in a previously measured variable. Before finding the outliers present in a data set, it is recommended first to remove the noise.

Types of Outliers
Outliers are divided into three different types

1. Global or point outliers


2. Collective outliers
3. Contextual or conditional outliers

Global Outliers
Global outliers are also called point outliers, and they are the simplest form of outlier. When a data point deviates from all the rest of the data points in a given data set, it is known as a global outlier. In most cases, outlier detection procedures are targeted at determining global outliers. (In the original figure, the green data point is the global outlier.)
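A simple, commonly used way to flag a global outlier is a z-score threshold; the NumPy sketch below (assumed data and threshold) flags any value more than two standard deviations from the mean.

import numpy as np

values = np.array([10, 11, 10, 12, 11, 10, 11, 95])  # 95 deviates globally
z = (values - values.mean()) / values.std()
print(values[np.abs(z) > 2])  # -> [95]; |z| > 2 (or 3) is a common rule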
Collective Outliers
When a group of data points in a data set deviates from the rest of the data, it is called a collective outlier. The individual data objects may not be outliers on their own, but considered together as a whole they behave as outliers. To identify collective outliers, you need background information about the relationship between the behaviors of the different data objects. For example, in an intrusion detection system, a denial-of-service (DoS) packet sent from one system to another is taken as normal behavior in isolation; however, if this happens across various computers simultaneously, it is considered abnormal behavior, and as a whole the packets are called a collective outlier. (In the original figure, the green data points as a whole represent the collective outlier.)
Contextual Outliers
As the name suggests, a "contextual" outlier is an outlier only within a given context; contextual outliers are also known as conditional outliers. These outliers occur when a data object deviates from the other data points because of a specific condition in the data set. For example, in speech recognition, a burst of background noise can be a contextual outlier. Recall that data objects have two kinds of attributes: contextual attributes and behavioral attributes. Contextual outlier analysis enables users to examine outliers in different contexts and conditions, which can be useful in various applications. For example, a temperature reading of 45 degrees Celsius may behave as an outlier in the rainy season, yet behave like a normal data point in the context of the summer season. In the original diagram, a green dot representing a low temperature value in June is a contextual outlier, since the same value in December would not be an outlier.
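A tiny sketch of the idea (invented readings and per-month ranges): the same temperature value is flagged or not depending on the month context.

# (month, temperature) readings; the typical ranges per context are assumed.
readings = [("June", 10.0), ("June", 24.0), ("December", 10.0)]
typical = {"June": (18.0, 30.0), "December": (0.0, 12.0)}

for month, temp in readings:
    lo, hi = typical[month]
    status = "contextual outlier" if not (lo <= temp <= hi) else "normal"
    print(month, temp, status)
# June 10.0 -> contextual outlier; December 10.0 -> normal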

Outliers Analysis
Outliers are discarded in many places when data mining is applied, but they are still used in many applications such as fraud detection and medicine. This is usually because events that occur rarely can carry much more significant information than events that occur regularly.

The process in which the behavior of the outliers in a dataset is identified is called outlier analysis. It is also known as "outlier mining", and it is regarded as a significant task of data mining.

Other applications where outlier detection plays a vital role are given below:

o Analyzing any unusual response that occurs due to medical treatment.
o Fraud detection in the telecom industry.
o In market analysis, outlier analysis enables marketers to identify unusual customer behaviors.
o In the medical analysis field.
o Fraud detection in banking and finance, such as credit cards, the insurance sector, etc.
