Clustering
The word cluster derives from the Old English word ‘clyster’, meaning a bunch. A cluster is a
group of similar things or people positioned or occurring close together. Because points in a
cluster typically share similar characteristics, machine learning can be used to identify these traits
and separate the clusters. This idea forms the basis of many machine learning applications that
solve data problems across industries.
Clustering is the task of dividing a population or set of data points into groups such that
data points in the same group are more similar to each other than to data points in other
groups. It is, in essence, a collection of objects grouped on the basis of the
similarity and dissimilarity between them.
For example, data points that lie close together in a scatter plot can be grouped into a single
cluster; in a plot containing three visibly separated groups of points, we can identify three
distinct clusters.
Why Clustering?
Clustering is important because it uncovers the intrinsic grouping present in unlabelled
data. There are no universal criteria for a good clustering; it depends on the user and on the
criteria that satisfy their need. For instance, we could be interested in finding
representatives for homogeneous groups (data reduction), in finding “natural clusters” and
describing their unknown properties (“natural” data types), in finding useful and suitable
groupings (“useful” data classes), or in finding unusual data objects (outlier detection). Each
clustering algorithm must make some assumptions about what constitutes similarity between
points, and different assumptions produce different, equally valid clusterings.
Examples of clustering
Identifying misinformation
Analyzing documents
Clustering Methods:
Connectivity-based Clustering (Hierarchical Clustering)
Hierarchical clustering, also known as connectivity-based clustering, is based on
the principle that every object is connected to its neighbours according to their
proximity (degree of relationship). The clusters form an
extensive hierarchical structure separated by the maximum distance required to
connect the cluster parts. The hierarchy is represented as a dendrogram, where the
X-axis lists the individual objects and the Y-axis shows the distance at which
clusters merge.
Agglomerative clustering is the opposite of divisive clustering: all “N” data points
start out as single members of “N” separate clusters.
We iteratively merge these “N” clusters into a smaller number of clusters,
say “k” clusters, and assign the data points to these clusters
accordingly. This is a bottom-up approach that uses a termination logic
for combining the clusters. This logic can be a number-based criterion (stop
merging once k clusters remain), a distance criterion (clusters should not be too far
apart to be merged), or a variance criterion (the increase in variance of the cluster
being merged should not exceed a threshold, as in Ward’s method).
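As a rough illustration of the bottom-up idea with a number-based termination criterion, here is a minimal single-linkage agglomerative sketch in plain NumPy (the function name and toy points are invented for this example, not taken from any library):

```python
import numpy as np

def agglomerative(X, k):
    """Bottom-up single-linkage clustering: start with each point as its
    own cluster, then repeatedly merge the two closest clusters until
    only k clusters remain (a number-based termination criterion)."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of points
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two tight pairs and one isolated point -> expect 3 clusters
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
groups = agglomerative(pts, k=3)
```

The nested pairwise search makes this O(n^3) per merge, which is fine for a sketch but why real implementations maintain a distance matrix instead.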
Distribution-Based Clustering
The clustering techniques covered so far are based on either
proximity (similarity/distance) or composition (density). There is a family of
clustering algorithms that takes a completely different quantity into consideration
– probability.
Distribution-based clustering groups data points based on their likelihood
of belonging to the same probability distribution (Gaussian, binomial, etc.)
in the data.
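A minimal sketch of this idea, assuming a two-component one-dimensional Gaussian mixture fitted with expectation-maximization (the function name and data are invented for illustration):

```python
import numpy as np

def gmm_em_1d(x, n_iters=50):
    """Minimal EM for a two-component 1-D Gaussian mixture: the E-step
    computes each point's membership probability under each Gaussian,
    the M-step re-estimates weights, means, and variances from those
    soft memberships."""
    # crude initialisation: put the two means at the data extremes
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iters):
        # E-step: responsibility of each component for each point
        dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances from soft assignments
        n_k = resp.sum(axis=0)
        w = n_k / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    return w, mu, var

# Two well-separated Gaussians at 0 and 8; EM should recover both means
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(8, 1, 300)])
w, mu, var = gmm_em_1d(x)
```

Unlike K-means, the output is a set of distributions and soft responsibilities, so each point carries a probability of belonging to each component rather than a single hard label.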
Fuzzy Clustering
Fuzzy clustering generalizes partition-based clustering by allowing a
data object to belong to more than one cluster. The process uses a weighted
centroid based on the spatial probabilities. The steps include initialization,
iteration, and termination, and the resulting clusters are best analyzed as
probability distributions rather than a hard assignment of labels. The algorithm
works by assigning a membership value, computed from the distance between the
cluster center and the data point, to each data point for every cluster center.
The closer a data point is to a cluster center, the higher its probability of
belonging to that cluster. At the end of each iteration, the membership values
and cluster centers are updated. Fuzzy clustering handles situations where data
points lie between cluster centers or are otherwise ambiguous, by working with
probabilities rather than hard distance-based assignments.
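The iteration described above can be sketched as a simple fuzzy c-means loop; the function name, initial centroids, and toy data are assumptions made for this illustration:

```python
import numpy as np

def fuzzy_cmeans(X, init_centroids, m=2.0, n_iters=100):
    """Fuzzy c-means sketch: every point holds a membership value in
    [0, 1] for every cluster, and each centroid is the
    membership-weighted mean of all points (fuzzifier m > 1)."""
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(n_iters):
        # membership update: inversely related to distance from each centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)  # rows sum to 1
        # centroid update: weighted average with weights U^m
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, centroids

# Two tight blobs; seed one centroid in each to keep the sketch deterministic
pts = np.vstack([np.random.default_rng(1).normal(0, 0.3, (30, 2)),
                 np.random.default_rng(2).normal(5, 0.3, (30, 2))])
U, centroids = fuzzy_cmeans(pts, init_centroids=pts[[0, -1]])
```

Note that each row of U sums to 1: a point halfway between two centroids would receive memberships near 0.5 for both, which is exactly the ambiguous case hard clustering cannot express.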
Partitioning Methods: These methods partition the objects into k groups, each partition
forming one cluster. They optimize an objective criterion, typically a similarity function
in which distance is the major parameter. Examples include K-means and CLARANS (Clustering
Large Applications based upon Randomized Search).
Grid-based Methods: In these methods, the data space is divided into a finite number of
cells that form a grid-like structure. Clustering operations performed on these grids are fast
and largely independent of the number of data objects. Examples include STING (Statistical
Information Grid), WaveCluster, and CLIQUE (CLustering In Quest).
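To show why grid methods scale with the number of cells rather than points, here is a toy sketch (not any of the named algorithms) that snaps points into cells, keeps the dense cells, and merges neighbouring dense cells with a flood fill:

```python
import numpy as np

def grid_clusters(X, cell=1.0, min_pts=3):
    """Grid-based sketch: bin 2-D points into cells, keep cells holding
    at least min_pts points, and join touching dense cells into
    clusters. The merge step only visits cells, not individual points."""
    cells = {}
    for p in X:
        key = tuple((p // cell).astype(int))
        cells.setdefault(key, []).append(p)
    dense = {k for k, pts in cells.items() if len(pts) >= min_pts}
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        # flood fill over the 8 neighbouring dense cells
        stack, group = [start], []
        while stack:
            c = stack.pop()
            if c in seen:
                continue
            seen.add(c)
            group.append(c)
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (c[0] + dx, c[1] + dy)
                    if nb in dense and nb not in seen:
                        stack.append(nb)
        clusters.append(group)
    return clusters

# Two dense cells far apart plus one isolated (non-dense) point
X = np.array([[0.1, 0.1], [0.2, 0.2], [0.3, 0.1], [0.2, 0.4],
              [10.1, 10.1], [10.2, 10.3], [10.4, 10.2],
              [5.0, 5.0]])
clusters = grid_clusters(X)
```

The isolated point at (5.0, 5.0) falls in a cell below the density threshold and is discarded, which is how grid methods also filter noise.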
Clustering Algorithms:
K-means clustering algorithm – It is one of the simplest unsupervised learning algorithms
for solving the clustering problem. K-means partitions n observations into k clusters, where
each observation belongs to the cluster with the nearest mean, which serves as the cluster's
prototype.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to its closest centroid, forming the
predefined K clusters.
Step-4: Recompute the centroid of each cluster as the mean of its assigned points.
Step-5: Repeat steps 3 and 4, reassigning each data point to the new
closest centroid, until the assignments no longer change.
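The steps above can be sketched in plain NumPy; this is a minimal Lloyd-style implementation for illustration (it ignores edge cases such as empty clusters), with invented names and toy data:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: assign points to the nearest centroid, then move
    each centroid to the mean of its assigned points, until stable."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs -> K-means should recover them as 2 clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

Because the result depends on the random initial centroids, production code typically reruns the algorithm several times and keeps the lowest-variance solution.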
K-Means clustering is used in a variety of examples or business cases in real life, like:
Academic performance
Diagnostic systems
Search engines
In Search Engines: Search engines also rely on the clustering technique. Search results are returned based on
the objects closest to the search query, achieved by grouping similar data objects into one group kept far from
the dissimilar objects. The accuracy of a query's results depends on the quality of the clustering algorithm
used.
Customer Segmentation: Clustering is used in market research to segment customers based on their choices and
preferences.
In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database.
This can be very useful for determining the purpose for which a particular piece of land is most
suitable.