Assignment 2
Submitted by:
Afshan Rehman
Submitted To:
Sir Aoun
Department:
BS Computer Science (5th) Semester
Subject:
Machine Learning
Topic:
Unsupervised ML Methods
Roll No:
02
Picture a toddler. The child knows what the family cat looks like (provided the family has one) but has no idea that there are many other cats in the world, all of which look different. Yet if the kid sees another cat, he or she will still recognize it as a cat through a set of features such as two ears, four legs, a tail, fur, whiskers, and so on.
In machine learning, this kind of prediction is called unsupervised learning. But when parents tell the child
that the new animal is a cat – drumroll – that’s considered supervised learning.
Clustering
Clustering can be considered the most important unsupervised learning problem. It deals with finding structure in a collection of unlabeled data. A loose definition of clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
1. K-Means Clustering
K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering
problem. The procedure follows a simple and easy way to classify a given data set through a certain number
of clusters, assume k clusters, fixed a priori. The main idea is to define k centers, one for each cluster. These
centroids should be placed in a smart way, as different locations will create different results. So, the best
choice is to place them as far away from each other as possible.
The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to recalculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop, we may notice that the k centroids change their location step by step until no more changes occur. In other words, the centroids do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function is
J = Σ_{j=1..k} Σ_{i=1..n} ‖x_i^(j) − c_j‖²,
where ‖x_i^(j) − c_j‖² is the squared distance between a data point x_i^(j) and the center c_j of the cluster it is assigned to.
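To make the procedure concrete, here is a minimal k-means sketch in Python with NumPy. The toy data, the choice of k = 2, and the random seed are assumptions made purely for illustration; it is not a production implementation (for example, it does not handle empty clusters).

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean (barycenter) of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids no longer move, so the loop terminates
        centroids = new_centroids
    return labels, centroids

# Two small, well-separated blobs as toy data (values chosen arbitrarily).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = k_means(X, k=2)
print(labels)     # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
print(centroids)
```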
2. Fuzzy K-Means Clustering
In the fuzzy k-means approach, a given data point does not belong exclusively to one well-defined cluster; it can instead sit somewhere in between. In this case, the membership function follows a smoother curve, indicating that every data point may belong to several clusters with different degrees of membership.
Fuzzy k-means uses a weighted centroid based on these membership probabilities. The processes of initialization, iteration, and termination are the same as the ones used in k-means.
Initialization: Randomly initialize the k cluster means μ_k and compute the probability that each data point x_i is a member of a given cluster k, P(point x_i has label k | x_i, k).
Iteration: Recalculate the centroid of each cluster as the weighted centroid, given the membership probabilities of all data points x_i:
c_k = Σ_i P(k | x_i)^m · x_i / Σ_i P(k | x_i)^m, where m > 1 is the fuzziness exponent.
Termination: Iterate until convergence or until a user-specified number of iterations has been reached. The iteration may get trapped in a local maximum or minimum.
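A minimal sketch of these three steps in Python with NumPy is shown below. The fuzziness exponent m = 2, the random initialization of the membership matrix, the small epsilon added to the distances, and the toy data are assumptions made for the example.

```python
import numpy as np

def fuzzy_k_means(X, k, m=2.0, n_iter=100, seed=0, tol=1e-5):
    """Fuzzy (soft) k-means: every point has a degree of membership in every cluster."""
    rng = np.random.default_rng(seed)
    # Membership matrix W: W[i, j] = membership of point i in cluster j, rows sum to 1.
    W = rng.random((len(X), k))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Weighted centroids: each centroid is a membership-weighted mean of all points.
        Wm = W ** m
        centroids = (Wm.T @ X) / Wm.sum(axis=0)[:, None]
        # Recompute memberships from distances (standard fuzzy c-means update).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        W_new = 1.0 / (d ** (2 / (m - 1)))
        W_new /= W_new.sum(axis=1, keepdims=True)
        if np.abs(W_new - W).max() < tol:
            break  # memberships have stopped changing
        W = W_new
    return W, centroids

# Toy data: two blobs plus one point roughly in between (values are arbitrary).
X = np.array([[1.0, 1.0], [1.1, 0.9], [8.0, 8.0], [7.9, 8.2], [4.5, 4.5]])
W, centroids = fuzzy_k_means(X, k=2)
print(W.round(2))  # the middle point gets roughly equal membership in both clusters
```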
3. Hierarchical Clustering
Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest, or most similar, pair of clusters and merge them into a single cluster, so that you
now have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps two and three until all items are clustered into a single cluster of size N.
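The merge loop described above can be run directly with SciPy's hierarchical-clustering routines. The sketch below assumes SciPy is installed; the toy items and the choice of single linkage are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 1-D items (values are arbitrary, for illustration only).
items = np.array([[1.0], [1.2], [5.0], [5.1], [9.0]])

# linkage() performs the repeated merging of the closest clusters; 'single'
# measures cluster distance by the closest pair of members.
Z = linkage(items, method="single")

# Cut the resulting tree into, say, 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2 3] (label order may differ)
```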
There’s another way to deal with clustering problems: a model-based approach, which consists of using certain models for the clusters and attempting to optimize the fit between the data and the model.
In practice, each cluster can be mathematically represented by a parametric distribution, like a Gaussian. The
entire data set is modeled by a mixture of these distributions. A mixture model with high likelihood tends to
have the following traits:
Component distributions have high “peaks,” where data in one cluster are tight;
The mixture model covers the data well. This means dominant patterns in the data are captured by
component distributions.
4. Mixture of Gaussians
The most widely used clustering method of this kind is based on learning a mixture of Gaussians:
A mixture model is a mixture of k component distributions that collectively make up a mixture distribution f(x):
f(x) = Σ_{k=1..K} α_k f_k(x).
The α_k represents the contribution of the kth component in constructing f(x). In practice, parametric distributions (e.g., Gaussians) are often used, since a lot of work has been done to understand their behavior. If you substitute each f_k(x) with a Gaussian, you get what is known as a Gaussian mixture model (GMM).
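As an illustration, the sketch below fits a two-component GMM with scikit-learn's GaussianMixture; the synthetic data, the number of components, and the random seed are assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two toy Gaussian blobs (parameters chosen only for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(6.0, 1.5, size=(100, 2))])

# Fit a mixture of k = 2 Gaussian components with the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.weights_)              # the mixing coefficients alpha_k
print(gmm.means_)                # one mean vector per component
print(gmm.predict_proba(X[:3]))  # soft cluster memberships for the first points
```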
Association rules:
An association rule is a rule-based unsupervised learning method aimed at discovering relationships and associations between different variables in large-scale datasets. The rules show how often a certain data item occurs in a dataset and how strong or weak the connections between different objects are.
For example, a coffee shop sees that there are 100 customers on Saturday evening with 50 out of 100 of
them buying cappuccino. Out of 50 customers who buy cappuccino, 25 also purchase a muffin. The
association rule here is: if customers buy a cappuccino, they will also buy a muffin, with a support value of 25/100 = 25% and a confidence value of 25/50 = 50%. The support value indicates the popularity of a certain itemset in the whole dataset. The confidence value indicates the likelihood of item Y being purchased when item X is purchased.
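The two numbers above can be reproduced with a few lines of Python; the counts below are taken directly from the coffee-shop example.

```python
# Counts taken from the coffee-shop example above.
total_customers = 100
cappuccino = 50             # customers who bought a cappuccino
cappuccino_and_muffin = 25  # customers who bought both

# Support: how popular the itemset {cappuccino, muffin} is in the whole dataset.
support = cappuccino_and_muffin / total_customers   # 0.25 -> 25%

# Confidence: how likely a muffin is bought when a cappuccino is bought.
confidence = cappuccino_and_muffin / cappuccino     # 0.50 -> 50%

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```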
This technique is widely used to analyze customer purchasing habits, allowing companies to understand
relationships between different products and build more effective business strategies.
Recommender systems. The association rules method is widely used to analyze buyer baskets and detect cross-category purchase correlations. A great example is Amazon’s “Frequently bought together” recommendations. The company aims to create more effective up-selling and cross-selling strategies and to provide product suggestions based on how frequently particular items are found together in one shopping cart.
The apriori algorithm utilizes frequent item sets to create association rules. Frequent item sets are the item sets whose support exceeds a chosen threshold. The algorithm generates the item sets and finds associations by performing multiple scans of the full dataset. Say you have four transactions, each listing the items a customer bought together.
As we can see from the transactions, the frequent item sets are {apple}, {grapes}, and {banana}, according to the calculated support value of each. Item sets can contain multiple items. For instance, the support value for {apple, banana} is two out of four, or 50%.
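Below is a simplified sketch of the support-counting pass in Python. The four transactions are hypothetical (the original table is not reproduced here); they were merely chosen so that {apple, banana} appears in two out of four transactions, matching the 50% support quoted above, and the full apriori candidate-pruning step is omitted.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions (contents assumed for illustration only).
transactions = [
    {"apple", "banana", "grapes"},
    {"apple", "banana"},
    {"apple", "grapes"},
    {"grapes", "banana"},
]
min_support = 0.5  # user-defined minimum support threshold

def support_counts(transactions, size):
    """Count how many transactions contain each candidate itemset of a given size."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return counts

n = len(transactions)
for size in (1, 2):
    frequent = {items: count / n
                for items, count in support_counts(transactions, size).items()
                if count / n >= min_support}
    print(f"frequent itemsets of size {size}: {frequent}")
```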
Just like apriori, the frequent pattern growth (FP-growth) algorithm also generates frequent item sets and mines association rules, but it doesn't go through the complete dataset several times. The users themselves define the minimum support for a particular item set.
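As one possible way to try this, the sketch below uses the third-party mlxtend library (assumed to be installed, e.g. via pip install mlxtend) with the same hypothetical transactions as before and a user-defined minimum support of 50%.

```python
import pandas as pd
# mlxtend is a third-party library; assumed available in this sketch.
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# The same hypothetical transactions as in the apriori sketch above.
transactions = [
    ["apple", "banana", "grapes"],
    ["apple", "banana"],
    ["apple", "grapes"],
    ["grapes", "banana"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent item sets with a user-defined minimum support of 50%.
print(fpgrowth(df, min_support=0.5, use_colnames=True))
```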
Dimensionality reduction:
Dimensionality reduction is another type of unsupervised learning that groups a set of methods for reducing the number of features, or dimensions, in a dataset. Let us explain.
When preparing your dataset for machine learning, it may be quite tempting to include as much data as
possible. Don’t get us wrong, this approach often works well, as in most cases more data means more accurate results.
That said, imagine that data resides in the N-dimensional space with each feature representing a separate
dimension. A lot of data means there may be hundreds of dimensions. Think of Excel spreadsheets with
columns serving as features and rows as data points. Sometimes the number of dimensions gets too high, degrading the performance of ML algorithms and hindering data visualization. So it makes sense to reduce the number of features – or dimensions – and include only relevant data. That’s what
dimensionality reduction is. With it, the number of data inputs becomes manageable while the integrity of
the dataset isn't lost.
The dimensionality reduction technique can be applied during the data preparation stage for supervised machine learning. With it, it is possible to get rid of redundant and junk data, leaving only the items that are most relevant to a project.
Say you work in a hotel and need to predict customer demand for different types of hotel rooms. There’s a large dataset with customer demographics and information on how many times each customer booked a particular hotel room last year, with features such as the customer’s country, age, date of birth, whether room service breakfast is included, and the room types booked.
The thing is, some of this information may be useless for your prediction, while some data has quite a lot of
overlap and there's no need to consider it individually. Take a closer look and you'll see that all customers
come from the US, meaning that this feature has zero variance and can be removed. Since room service
breakfast is offered with all room types, the feature also won't make much impact on your prediction.
Features like "age" and "date of birth" can be merged as they are basically duplicates. So, in this way, you
perform dimensionality reduction and make your dataset smaller and more useful.
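A minimal pandas sketch of those reductions is given below; the column names and the handful of rows are hypothetical stand-ins for the hotel dataset described above.

```python
import pandas as pd

# A tiny hypothetical slice of the hotel dataset (column names and values assumed).
df = pd.DataFrame({
    "country": ["US", "US", "US"],                  # zero variance: everyone is from the US
    "room_service_breakfast": [True, True, True],   # offered with all room types
    "age": [34, 52, 41],
    "date_of_birth": ["1989-01-10", "1971-06-02", "1982-11-23"],  # duplicates "age"
    "standard_room_bookings": [3, 0, 5],
})

# 1. Remove features with zero variance (only one distinct value).
df = df.drop(columns=[c for c in df.columns if df[c].nunique() == 1])

# 2. Merge duplicated information: keep "age", drop "date_of_birth".
df = df.drop(columns=["date_of_birth"])

print(df.columns.tolist())  # ['age', 'standard_room_bookings']
```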
Principal component analysis (PCA) is an algorithm applied for dimensionality reduction purposes. It is used to reduce the number of features within large datasets, simplifying the data while preserving most of the information it carries. Dataset compression happens through a process called feature extraction, which means that the features in the original set are combined into a new, smaller set. These new features are known as principal components.
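A short sketch with scikit-learn's PCA is shown below; the synthetic data, the number of components, and the random seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 samples with 10 correlated features (values are arbitrary).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = base @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(100, 10))

# Extract 3 principal components: new features combined from the originals.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (100, 3)
print(pca.explained_variance_ratio_.sum())  # share of the original variance retained
```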
Of course, there are other algorithms to apply in your unsupervised learning projects. The ones above are
just the most common, which is why they are covered more thoroughly.