Assignment 2
Submitted by:
Afshan Rehman
Submitted To:
Sir Aoun
Department:
BS Computer Science (5th) Semester
Subject:
Machine Learning
Topic:
Unsupervised ML Methods
Roll No:
02
Picture a toddler. The child knows what the family cat looks like (provided the family has one) but has no idea that there are many other cats in the world, all of which look different. Yet if the kid sees another cat, he or she will still recognize it as a cat through a set of features such as two ears, four legs, a tail, fur, whiskers, and so on.
In machine learning, this kind of prediction is called unsupervised learning. But when parents tell the child
that the new animal is a cat – drumroll – that’s considered supervised learning.
Clustering
Clustering can be considered the most important unsupervised learning problem. It deals with finding structure in a collection of unlabeled data. A loose definition of clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
1. K-Means Clustering
K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering
problem. The procedure follows a simple and easy way to classify a given data set through a certain number
of clusters, assume k clusters, fixed a priori. The main idea is to define k centers, one for each cluster. These
centroids should be placed in a smart way, as different locations will create different results. So, the best
choice is to place them as far away from each other as possible.
The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to recalculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop, we may notice that the k centroids change their location step by step until no more changes occur. In other words, the centroids do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function is
J = Σ_{j=1..k} Σ_{i=1..n} ‖x_i^(j) − c_j‖²,
where ‖x_i^(j) − c_j‖² is the squared distance between a data point x_i^(j) and the center c_j of the cluster it is assigned to.
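To make the procedure concrete, here is a minimal k-means sketch in Python with NumPy. The toy data, the choice of k = 2, and the random seed are assumptions made purely for illustration; it is not a production implementation (for example, it does not handle empty clusters).

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean (barycenter) of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids no longer move, so the loop terminates
        centroids = new_centroids
    return labels, centroids

# Two small, well-separated blobs as toy data (values chosen arbitrarily).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = k_means(X, k=2)
print(labels)     # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
print(centroids)
```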
2. Fuzzy K-Means Clustering
In the fuzzy k-means approach, a given data point does not belong exclusively to one well-defined cluster; it can instead sit somewhere in between. In this case, the membership function follows a smoother curve, indicating that every data point may belong to several clusters with different degrees of membership.
Fuzzy k-means uses a weighted centroid based on these membership probabilities. The processes of initialization, iteration, and termination are the same as the ones used in k-means.
Initialization: Randomly initialize the k cluster means μ_k and compute the probability that each data point x_i is a member of a given cluster k, P(point x_i has label k | x_i, k).
Iteration: Recalculate the centroid of each cluster as the weighted centroid, given the membership probabilities of all data points x_i:
c_k = Σ_i P(k | x_i)^m · x_i / Σ_i P(k | x_i)^m, where m > 1 is the fuzziness exponent.
Termination: Iterate until convergence or until a user-specified number of iterations has been reached. The iteration may get trapped in a local maximum or minimum.
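A minimal sketch of these three steps in Python with NumPy is shown below. The fuzziness exponent m = 2, the random initialization of the membership matrix, the small epsilon added to the distances, and the toy data are assumptions made for the example.

```python
import numpy as np

def fuzzy_k_means(X, k, m=2.0, n_iter=100, seed=0, tol=1e-5):
    """Fuzzy (soft) k-means: every point has a degree of membership in every cluster."""
    rng = np.random.default_rng(seed)
    # Membership matrix W: W[i, j] = membership of point i in cluster j, rows sum to 1.
    W = rng.random((len(X), k))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Weighted centroids: each centroid is a membership-weighted mean of all points.
        Wm = W ** m
        centroids = (Wm.T @ X) / Wm.sum(axis=0)[:, None]
        # Recompute memberships from distances (standard fuzzy c-means update).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        W_new = 1.0 / (d ** (2 / (m - 1)))
        W_new /= W_new.sum(axis=1, keepdims=True)
        if np.abs(W_new - W).max() < tol:
            break  # memberships have stopped changing
        W = W_new
    return W, centroids

# Toy data: two blobs plus one point roughly in between (values are arbitrary).
X = np.array([[1.0, 1.0], [1.1, 0.9], [8.0, 8.0], [7.9, 8.2], [4.5, 4.5]])
W, centroids = fuzzy_k_means(X, k=2)
print(W.round(2))  # the middle point gets roughly equal membership in both clusters
```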
3. Hierarchical Clustering
Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest, or most similar, pair of clusters and merge them into a single cluster, so that you
now have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps two and three until all items are clustered into a single cluster of size N.
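The merge loop described above can be run directly with SciPy's hierarchical-clustering routines. The sketch below assumes SciPy is installed; the toy items and the choice of single linkage are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 1-D items (values are arbitrary, for illustration only).
items = np.array([[1.0], [1.2], [5.0], [5.1], [9.0]])

# linkage() performs the repeated merging of the closest clusters; 'single'
# measures cluster distance by the closest pair of members.
Z = linkage(items, method="single")

# Cut the resulting tree into, say, 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2 3] (label order may differ)
```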
There’s another way to deal with clustering problems: a model-based approach, which consists of using certain models for the clusters and attempting to optimize the fit between the data and the model.
In practice, each cluster can be mathematically represented by a parametric distribution, like a Gaussian. The
entire data set is modeled by a mixture of these distributions. A mixture model with high likelihood tends to
have the following traits:
Component distributions have high “peaks,” where data in one cluster are tight;
The mixture model covers the data well. This means dominant patterns in the data are captured by
component distributions.
4. Mixture of Gaussians
The most widely used clustering method of this kind is based on learning a mixture of Gaussians:
A mixture model is a mixture of k component distributions that collectively make up a mixture distribution f(x):
f(x) = Σ_{k=1..K} α_k f_k(x).
The α_k represents the contribution of the kth component in constructing f(x). In practice, parametric distributions (e.g., Gaussians) are often used, since a lot of work has been done to understand their behavior. If you substitute each f_k(x) with a Gaussian, you get what is known as a Gaussian mixture model (GMM).
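As an illustration, the sketch below fits a two-component GMM with scikit-learn's GaussianMixture; the synthetic data, the number of components, and the random seed are assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two toy Gaussian blobs (parameters chosen only for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(6.0, 1.5, size=(100, 2))])

# Fit a mixture of k = 2 Gaussian components with the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.weights_)              # the mixing coefficients alpha_k
print(gmm.means_)                # one mean vector per component
print(gmm.predict_proba(X[:3]))  # soft cluster memberships for the first points
```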
Association rules:
An association rule is a rule-based unsupervised learning method aimed at discovering relationships and associations between different variables in large-scale datasets. The rules show how often a certain data item occurs in a dataset and how strong or weak the connections between different objects are.
For example, a coffee shop sees that there are 100 customers on Saturday evening with 50 out of 100 of
them buying cappuccino. Out of 50 customers who buy cappuccino, 25 also purchase a muffin. The
association rule here is: if customers buy a cappuccino, they will also buy a muffin, with a support value of 25/100 = 25% and a confidence value of 25/50 = 50%. The support value indicates the popularity of a certain itemset in the whole dataset. The confidence value indicates the likelihood of item Y being purchased when item X is purchased.
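The two numbers above can be reproduced with a few lines of Python; the counts below are taken directly from the coffee-shop example.

```python
# Counts taken from the coffee-shop example above.
total_customers = 100
cappuccino = 50             # customers who bought a cappuccino
cappuccino_and_muffin = 25  # customers who bought both

# Support: how popular the itemset {cappuccino, muffin} is in the whole dataset.
support = cappuccino_and_muffin / total_customers   # 0.25 -> 25%

# Confidence: how likely a muffin is bought when a cappuccino is bought.
confidence = cappuccino_and_muffin / cappuccino     # 0.50 -> 50%

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```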
This technique is widely used to analyze customer purchasing habits, allowing companies to understand
relationships between different products and build more effective business strategies.
Recommender systems. The association rules method is widely used to analyze buyer baskets and detect cross-category purchase correlations. A great example is Amazon’s “Frequently bought together” recommendations. The company aims to create more effective up-selling and cross-selling strategies and to provide product suggestions based on how frequently particular items are found together in one shopping cart.
The apriori algorithm utilizes frequent item sets to create association rules. Frequent item sets are the item sets whose support exceeds a chosen threshold. The algorithm generates the item sets and finds associations by performing multiple scans of the full dataset. Say you have four transactions, each listing the items a customer bought together.
As we can see from the transactions, the frequent item sets are {apple}, {grapes}, and {banana}, according to the calculated support value of each. Item sets can contain multiple items. For instance, the support value for {apple, banana} is two out of four, or 50%.
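Below is a simplified sketch of the support-counting pass in Python. The four transactions are hypothetical (the original table is not reproduced here); they were merely chosen so that {apple, banana} appears in two out of four transactions, matching the 50% support quoted above, and the full apriori candidate-pruning step is omitted.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions (contents assumed for illustration only).
transactions = [
    {"apple", "banana", "grapes"},
    {"apple", "banana"},
    {"apple", "grapes"},
    {"grapes", "banana"},
]
min_support = 0.5  # user-defined minimum support threshold

def support_counts(transactions, size):
    """Count how many transactions contain each candidate itemset of a given size."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return counts

n = len(transactions)
for size in (1, 2):
    frequent = {items: count / n
                for items, count in support_counts(transactions, size).items()
                if count / n >= min_support}
    print(f"frequent itemsets of size {size}: {frequent}")
```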
Just like apriori, the frequent pattern growth (FP-growth) algorithm also generates frequent item sets and mines association rules, but it doesn't go through the complete dataset several times. The users themselves define the minimum support for a particular item set.
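As one possible way to try this, the sketch below uses the third-party mlxtend library (assumed to be installed, e.g. via pip install mlxtend) with the same hypothetical transactions as before and a user-defined minimum support of 50%.

```python
import pandas as pd
# mlxtend is a third-party library; assumed available in this sketch.
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# The same hypothetical transactions as in the apriori sketch above.
transactions = [
    ["apple", "banana", "grapes"],
    ["apple", "banana"],
    ["apple", "grapes"],
    ["grapes", "banana"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent item sets with a user-defined minimum support of 50%.
print(fpgrowth(df, min_support=0.5, use_colnames=True))
```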
Dimensionality reduction:
Dimensionality reduction is another type of unsupervised learning that groups a set of methods for reducing the number of features, or dimensions, in a dataset. Let us explain.
When preparing your dataset for machine learning, it may be quite tempting to include as much data as
possible. Don’t get us wrong, this approach often works well, as in most cases more data means more accurate results.
That said, imagine that data resides in the N-dimensional space with each feature representing a separate
dimension. A lot of data means there may be hundreds of dimensions. Think of Excel spreadsheets with
columns serving as features and rows as data points. Sometimes the number of dimensions gets too high, degrading the performance of ML algorithms and hindering data visualization. So it makes sense to reduce the number of features – or dimensions – and include only relevant data. That’s what
dimensionality reduction is. With it, the number of data inputs becomes manageable while the integrity of
the dataset isn't lost.
The dimensionality reduction technique can be applied during the data preparation stage for supervised machine learning. With it, it is possible to get rid of redundant and junk data, leaving only the items that are most relevant to a project.
Say you work in a hotel and need to predict customer demand for different types of hotel rooms. There’s a large dataset with customer demographics and information on how many times each customer booked a particular hotel room last year, with features such as the customer’s country, age, date of birth, whether room service breakfast is included, and the room types booked.
The thing is, some of this information may be useless for your prediction, while some data has quite a lot of
overlap and there's no need to consider it individually. Take a closer look and you'll see that all customers
come from the US, meaning that this feature has zero variance and can be removed. Since room service
breakfast is offered with all room types, the feature also won't make much impact on your prediction.
Features like "age" and "date of birth" can be merged as they are basically duplicates. So, in this way, you
perform dimensionality reduction and make your dataset smaller and more useful.
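A minimal pandas sketch of those reductions is given below; the column names and the handful of rows are hypothetical stand-ins for the hotel dataset described above.

```python
import pandas as pd

# A tiny hypothetical slice of the hotel dataset (column names and values assumed).
df = pd.DataFrame({
    "country": ["US", "US", "US"],                  # zero variance: everyone is from the US
    "room_service_breakfast": [True, True, True],   # offered with all room types
    "age": [34, 52, 41],
    "date_of_birth": ["1989-01-10", "1971-06-02", "1982-11-23"],  # duplicates "age"
    "standard_room_bookings": [3, 0, 5],
})

# 1. Remove features with zero variance (only one distinct value).
df = df.drop(columns=[c for c in df.columns if df[c].nunique() == 1])

# 2. Merge duplicated information: keep "age", drop "date_of_birth".
df = df.drop(columns=["date_of_birth"])

print(df.columns.tolist())  # ['age', 'standard_room_bookings']
```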
Principal component analysis (PCA) is an algorithm applied for dimensionality reduction purposes. It is used to reduce the number of features within large datasets, simplifying the data while preserving most of the information it carries. Dataset compression happens through a process called feature extraction, which means that the features in the original set are combined into a new, smaller set. These new features are known as principal components.
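A short sketch with scikit-learn's PCA is shown below; the synthetic data, the number of components, and the random seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 samples with 10 correlated features (values are arbitrary).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = base @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(100, 10))

# Extract 3 principal components: new features combined from the originals.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (100, 3)
print(pca.explained_variance_ratio_.sum())  # share of the original variance retained
```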
Of course, there are other algorithms to apply in your unsupervised learning projects. The ones above are
just the most common, which is why they are covered more thoroughly.