ECE3047 - Machine Learning Fundamentals
Prepared By
Dr. Rohith G
Assistant Professor (Senior)
School of Electronics Engineering (SENSE), VIT-Chennai
Under the Guidance and Materials mentored by
Dr. Sathiya Narayanan S
Assistant Professor (Senior)
School of Electronics Engineering (SENSE), VIT-Chennai
• Module 1: Introduction (to Machine Learning)
• Module 2: Data Preprocessing
• Module 3: Regression
• Module 4: Classification
• Module 5: Clustering
• Module 6: Optimization
• Module 7: Reinforcement Learning
Topics in Module 5 – Clustering
• Introduction
• Mixture Densities
• Types of Clustering – Partitioning, Hierarchical
• Supervised Learning after Clustering
• Choosing the Number of Clusters
• Applications
Introduction to Unsupervised Learning
• Training procedures that use labeled samples are referred to as supervised.
• Unsupervised procedures use unlabeled data.
• Seven basic reasons why we are interested in unsupervised methods:
1) Collecting and labeling data is very costly and nontrivial (often this is a research problem in
itself).
2) Heuristic methods (application-specific) exist that allow us to improve a classifier trained using
supervised techniques by introducing large amounts of unlabeled data. This is often faster than
labeling data.
3) We would like to exploit “found” data such as that available on the Internet. Often this data is not
truth-marked or is only partially transcribed.
4) Reversal of the training process: train on unlabeled data and then use supervision to label the
groupings.
5) Models often need to be adapted over time.
6) Use unsupervised methods to find features that will be useful for categorization.
7) Perform rapid exploratory analysis to gain insight into a new problem.
What is Clustering?
Segregate groups with similar traits and assign them into clusters.
• Clustering is an unsupervised learning method for identifying similar groups of data in a dataset.
• It is a way of grouping the data points into different clusters, each consisting of similar data points.
• Objects with possible similarities remain in a group that has little or no similarity with another group.
How is Clustering performed?
• The goal is to study the intrinsic (and commonly hidden) structure of the data. Two approaches are used: clustering and dimensionality reduction.
• Segmenting datasets by some shared attributes.
• Detecting anomalies that do not fit into any group.
• Simplifying datasets by aggregating variables with similar attributes.
Types of Clustering
Density-based Clustering
• Density-based clustering connects areas of high example density into clusters.
• This allows for arbitrary-shaped distributions as long as dense areas can be connected.
• These algorithms have difficulty with data of varying densities and high dimensions.
• Further, by design, these algorithms do not assign outliers to clusters.
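As a quick illustration of these points, the sketch below uses scikit-learn's DBSCAN on a handful of made-up 2-D points (the data and the eps/min_samples values are assumptions chosen only for the example); the isolated point is labelled -1, i.e. it is not assigned to any cluster.

```python
# Density-based clustering with DBSCAN: dense regions become clusters,
# sparse points are labelled -1 (outliers are not assigned to any cluster).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],   # dense blob 1
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],   # dense blob 2
              [4.0, 20.0]])                          # isolated point

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # expected: two clusters (0 and 1) and one outlier (-1)
```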
Types of Clustering
Centroid-based Clustering
• Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to
hierarchical clustering defined below.
• k-means is the most widely-used centroid-based clustering algorithm.
• Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.
• This course focuses on k-means because it is an efficient, effective, and simple clustering
algorithm.
Types of Clustering
Distribution based Clustering
• This clustering approach assumes the data is composed of distributions, such as Gaussian distributions.
• A distribution-based algorithm might, for example, cluster the data into three Gaussian distributions. As the distance from a distribution's center increases, the probability that a point belongs to that distribution decreases. When you do not know the type of distribution in your data, you should use a different algorithm. A brief sketch with a Gaussian mixture model follows.
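A minimal sketch of distribution-based clustering, assuming scikit-learn's GaussianMixture and three synthetic Gaussian blobs (the data and the choice of three components are illustrative assumptions):

```python
# Distribution-based clustering with a Gaussian mixture model; the data
# and the choice of three components are assumptions for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))      # three Gaussian blobs
               for loc in ([0, 0], [5, 0], [0, 5])])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)                 # hard assignments
probs = gmm.predict_proba(X)            # soft membership probabilities
print(labels[:5])
print(probs[0].round(3))                # probability of belonging to each component
```

Unlike centroid-based methods, the fitted model also returns soft membership probabilities, which reflect the decreasing probability of membership away from each component's center.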
Types of Clustering
Hierarchical Clustering
• Hierarchical clustering creates a tree of clusters.
• Hierarchical clustering, not surprisingly, is well suited to hierarchical data, such as taxonomies.
• Another advantage is that any number of clusters can be chosen by cutting the tree at the right level.
Clustering Analysis
• The objective of clustering is to find different groups within the elements of the data.
• Clustering algorithms find structure in the data such that elements of the same cluster (or group) are more similar to each other than to those from different clusters.
• The machine learning model is then able to infer that there are two different classes without knowing anything else about the data.
Clustering Analysis
Common Clustering Algorithms
• K-Means
• Hierarchical Clustering
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Gaussian Mixture Models
Mixture Densities
• Assume:
  • The samples come from a known number c of classes.
  • The prior probabilities P(ωj), for j = 1, …, c, are known.
  • The forms of the class-conditional probability densities p(x | ωj, θj) are known.
  • The values of the c parameter vectors θ1, …, θc are unknown.
  • The category labels are unknown.
• The probability density function for the samples is given by:
$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j)$$
where θ = (θ1, …, θc)ᵗ. The prior probabilities P(ωj) are called the mixing parameters and, without loss of generality, sum to one.
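As a small numerical illustration of the mixture density above, the sketch below evaluates a two-component (c = 2) univariate Gaussian mixture; the means, variances, and mixing parameters are made up for the example:

```python
# A minimal numerical sketch of a mixture density with c = 2 Gaussian
# components; the means, scales, and mixing parameters are made up.
import numpy as np
from scipy.stats import norm

priors = [0.4, 0.6]                  # mixing parameters P(w_j), sum to one
components = [norm(loc=0.0, scale=1.0), norm(loc=3.0, scale=0.5)]

def mixture_pdf(x):
    # p(x | theta) = sum_j p(x | w_j, theta_j) * P(w_j)
    return sum(P * comp.pdf(x) for P, comp in zip(priors, components))

print(mixture_pdf(0.0), mixture_pdf(3.0))
```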
• A density, p(x|θ), is said to be identifiable if θ ≠ θ’ implies there
exists an x such that p(x|θ) ≠ p(x|θ’). (A density is unidentifiable if we
cannot recover a unique θ from an infinite amount of data.)
• Identifiability of θ is a property of the model and not the procedure
used to estimate the model.
Maximum Likelihood Estimates
• Given a set D = {x1, …, xn} of n unlabeled samples drawn independently from the mixture
density, the likelihood of the observed samples is:
$$p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$
• The maximum likelihood estimate is the value of θ that maximizes p(D|θ).
• If we differentiate the log-likelihood:
$$\nabla_{\boldsymbol{\theta}_i} \log p(D \mid \boldsymbol{\theta}) = \sum_{k=1}^{n} \frac{1}{p(\mathbf{x}_k \mid \boldsymbol{\theta})}\, \nabla_{\boldsymbol{\theta}_i} \left[ \sum_{j=1}^{c} p(\mathbf{x}_k \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j) \right]$$
• Assume θi and θj are functionally independent if i ≠ j.
• Substitute the posterior:
$$P(\omega_i \mid \mathbf{x}_k, \boldsymbol{\theta}) = \frac{p(\mathbf{x}_k \mid \omega_i, \boldsymbol{\theta}_i)\, P(\omega_i)}{p(\mathbf{x}_k \mid \boldsymbol{\theta})}$$
• The gradient can be written as:
$$\nabla_{\boldsymbol{\theta}_i} \log p(D \mid \boldsymbol{\theta}) = \sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \boldsymbol{\theta}_i)$$
• The gradient must vanish at the value of θi that maximizes the log likelihood. Therefore, the
ML solution must satisfy:
$$\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i) = 0 \qquad \text{for } i = 1, \ldots, c$$
Generalization of the ML Estimate
• We can generalize these results to include the prior probability, P(ωi), among
the unknown quantities.
• The search for the maximum value of p(D|θ) extends over θ and P(ωi), subject
to the constraints:
$$P(\omega_i) \ge 0, \quad i = 1, \ldots, c, \qquad \sum_{i=1}^{c} P(\omega_i) = 1.$$
• It can be shown that the ML estimates must satisfy:
$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})$$
$$\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i) = 0$$
$$\hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}}) = \frac{p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}_k \mid \omega_j, \hat{\boldsymbol{\theta}}_j)\, \hat{P}(\omega_j)}$$
• The first equation states that the estimate of the prior is computed by averaging the posteriors over the entire data set. The second equation restates the ML principle that the optimal value of θ makes the gradient of the log-likelihood vanish. The third equation is simply Bayes' rule applied to the mixture components.
• So the good news here is that doing the obvious maximizes the posterior.
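For Gaussian components, these conditions can be iterated as fixed-point updates, which is essentially what the EM algorithm does in practice. A compact sketch for one-dimensional data follows; the synthetic data, the initial values, and the fixed number of iterations are illustrative assumptions:

```python
# EM-style iteration of the ML equations above for a 1-D Gaussian mixture.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 300)])  # made-up data

c = 2
P = np.full(c, 1.0 / c)                 # mixing parameters P(w_i)
mu = np.array([-1.0, 1.0])              # initial component means
sigma = np.ones(c)                      # initial component std deviations

for _ in range(50):
    # posteriors P(w_i | x_k, theta): Bayes rule over the mixture components
    like = np.stack([P[i] * norm.pdf(x, mu[i], sigma[i]) for i in range(c)])
    post = like / like.sum(axis=0, keepdims=True)
    # ML updates: priors are averages of the posteriors over the data set,
    # means/variances are posterior-weighted sample statistics
    P = post.mean(axis=1)
    mu = (post * x).sum(axis=1) / post.sum(axis=1)
    sigma = np.sqrt((post * (x - mu[:, None]) ** 2).sum(axis=1) / post.sum(axis=1))

print(P, mu, sigma)
```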
Partitional Clustering
• Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements:
  • Each group must contain at least one object.
  • Each object must belong to exactly one group.
• Given k, the number of partitions to construct, a partitioning method creates an initial partition.
• It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.
• The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or very different.
Hierarchical clustering (Agglomerative and Divisive clustering)
• Hierarchical clustering works by grouping data into a tree of clusters. It begins by treating every data point as a separate cluster and then repeatedly executes the following steps:
  • Identify the two clusters that are closest together, and
  • Merge (agglomerative) or split (divisive) the two most comparable clusters. These steps continue until all the clusters have been merged or split.
• The aim of hierarchical clustering is to produce a hierarchical series of nested clusters. A dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).
Hierarchical clustering (Agglomerative and Divisive clustering)
Agglomerative Clustering:
• Also known as bottom-up approach or
hierarchical agglomerative clustering
(HAC).
• Produces a structure that is more informative than the unstructured set of clusters returned by flat clustering.
• This clustering algorithm does not require us to prespecify the number of clusters.
• Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
Algorithm
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar (closest) to each other.
4. Recalculate the proximity matrix for the merged clusters.
5. Repeat Steps 3 and 4 until only a single cluster remains.
Hierarchical clustering (Agglomerative and Divisive clustering)
Agglomerative Clustering:
Algorithm:
given a dataset (d1, d2, d3, ..., dN) of size N
# compute the distance matrix
for i = 1 to N:
    # the distance matrix is symmetric about the primary
    # diagonal, so we compute only its lower part
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)
each data point is a singleton cluster
repeat
    merge the two clusters having the minimum distance
    update the distance matrix
until only a single cluster remains
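A runnable counterpart of the pseudocode above, using SciPy's hierarchical-clustering routines; the sample points and the cut into three flat clusters are assumptions for illustration:

```python
# Agglomerative (single-linkage) hierarchical clustering with SciPy;
# the sample points and the cut into 3 clusters are made up for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.2],
                   [5.1, 4.9], [9.0, 9.1]])

dist = pdist(points, metric='euclidean')          # condensed pairwise distance matrix
Z = linkage(dist, method='single')                # merge tree (one merge per row)
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 flat clusters
print(labels)                                     # e.g. [1 1 2 2 3]
```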
Hierarchical clustering (Agglomerative and Divisive clustering)
Divisive Clustering:
• Also known as a top-down approach.
• This algorithm also does not require us to prespecify the number of clusters.
• Top-down clustering requires a
method for splitting a cluster that
contains the whole data and proceeds
by splitting clusters recursively until
individual data have been split into
singleton clusters.
Hierarchical clustering (Agglomerative and Divisive clustering)
Divisive Clustering:
Algorithm:
given a dataset (d1, d2, d3, ..., dN) of size N
at the top we have all the data in one cluster
the cluster is split using a flat clustering method, e.g. K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster
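A rough sketch of the top-down procedure above, using repeated 2-means bisection with scikit-learn; splitting the largest cluster first and stopping at a fixed number of clusters are assumptions, not the only possible choices:

```python
# Divisive (top-down) clustering via repeated 2-means bisection.
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, n_clusters=3, random_state=0):
    clusters = [np.arange(len(X))]          # start: one cluster holding every point
    while len(clusters) < n_clusters:
        # choose the cluster to split (here: simply the largest one)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        km = KMeans(n_clusters=2, n_init=10, random_state=random_state)
        labels = km.fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.2, 4.8], [9, 9], [9.1, 8.9]])
for c in divisive_clustering(X, n_clusters=3):
    print(X[c])
```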
Hierarchical clustering (Single Linkage and Complete Linkage)
• The process of hierarchical clustering involves either merging sub-clusters (data points in the first iteration) into larger clusters in a bottom-up manner, or dividing a larger cluster into smaller sub-clusters in a top-down manner.
• In both types of hierarchical clustering, the distance between two sub-clusters needs to be computed.
• The different linkage types describe the different ways of measuring the distance between two sub-clusters of data points; the standard forms are given below.
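For reference, the common linkage criteria between two clusters $C_i$ and $C_j$, with $d(x, y)$ the pointwise distance, can be written as follows (these formulas are supplied here for completeness; the slides list only the linkage names):

$$d_{\mathrm{single}}(C_i, C_j) = \min_{x \in C_i,\; y \in C_j} d(x, y)$$
$$d_{\mathrm{complete}}(C_i, C_j) = \max_{x \in C_i,\; y \in C_j} d(x, y)$$
$$d_{\mathrm{average}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$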
Hierarchical clustering (Linkage types)
Hierarchical clustering (Single Linkage – Solved Example)
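As a small illustrative stand-in for a single-linkage worked example, the sketch below traces which clusters merge at each step for a made-up one-dimensional dataset (the points are assumptions chosen only to show the single-linkage, i.e. minimum-distance, merge rule):

```python
# Trace the merge order of single-linkage clustering on made-up 1-D data.
import numpy as np

points = {'A': 1.0, 'B': 2.0, 'C': 6.0, 'D': 7.0, 'E': 15.0}
clusters = [{k} for k in points]            # start: every point is its own cluster

def single_link(c1, c2):
    # single linkage: minimum pairwise distance between the two clusters
    return min(abs(points[a] - points[b]) for a in c1 for b in c2)

while len(clusters) > 1:
    # find the pair of clusters with the smallest single-linkage distance
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
    d = single_link(clusters[i], clusters[j])
    print(f"merge {clusters[i]} and {clusters[j]} at distance {d}")
    clusters[i] = clusters[i] | clusters[j]
    del clusters[j]
```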
K-Means Clustering Analysis
• The K-Means algorithm aims to find and group into classes the data points that have high similarity between them.
• In terms of the algorithm, this similarity is understood as the opposite of the distance between data points.
• The closer the data points are, the more similar they are and the more likely they are to belong to the same cluster.
K-Means Clustering Analysis
• Squared Euclidean Distance: the distance most commonly used in K-Means is the squared Euclidean distance. Between two points x and y in m-dimensional space it is
$$d^2(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{m} (x_j - y_j)^2$$
where j indexes the jth dimension (or feature column) of the sample points x and y.
• Cluster Inertia: cluster inertia is the name given to the Sum of Squared Errors (SSE) within the clustering context,
$$SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \left\lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(j)} \right\rVert_2^2$$
where μ(j) is the centroid of cluster j, and w(i,j) is 1 if the sample x(i) is in cluster j and 0 otherwise.
• K-Means can be understood as an algorithm that will try to minimize the cluster inertia factor.
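A minimal sketch of this quantity on made-up data; scikit-learn's KMeans exposes the same value as inertia_, so the hand computation can be checked against it:

```python
# Cluster inertia / WCSS as the quantity K-Means minimizes.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.0, 0.6]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# recompute the inertia by hand: squared distance of every sample to its
# assigned centroid, summed over all samples
inertia = sum(np.sum((x - km.cluster_centers_[label]) ** 2)
              for x, label in zip(X, km.labels_))
print(inertia, km.inertia_)   # the two values agree (up to floating-point error)
```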
K-Means Clustering Analysis
• First, we need to choose k, the number of clusters we want to find.
• The algorithm then randomly selects the centroid of each cluster.
• Each data point is assigned to the closest centroid (using the Euclidean distance).
• The cluster inertia is computed.
• The new centroids are calculated as the mean of the points assigned to each centroid in the previous step; in other words, by computing the minimum quadratic error of the data points with respect to the center of each cluster, the center is moved towards those points.
• Go back to step 3. A from-scratch sketch of this loop is shown below.
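A from-scratch sketch of the loop described above (Lloyd's algorithm). It reuses the eight points of the worked example that follows, but with random initial centroids rather than the fixed initial centers used there, so the intermediate steps need not match:

```python
# A minimal from-scratch K-Means (Lloyd's algorithm) sketch.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged: assignments are stable
            break
        centroids = new_centroids
    inertia = d[np.arange(len(X)), labels].sum()    # cluster inertia (WCSS)
    return labels, centroids, inertia

# the eight points A1..A8 of the worked example below
X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
labels, centroids, inertia = kmeans(X, k=3)
print(labels)
print(centroids)
print(inertia)
```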
K-Means Clustering Analysis
Step-by-step illustration (K = 2):
1. Choose K = 2, i.e. the number of clusters to identify: we want to group these data into two different clusters.
2. Choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points; here the two selected k points are not part of the dataset.
3. Assign each data point of the scatter plot to its closest k-point or centroid by calculating the distances between the points; to do this, we draw a median line between the two centroids.
4. Points on the left side of the line are near the K1 (blue) centroid, and points to the right of the line are close to the yellow centroid.
5. To find the closest cluster, repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity of the points assigned to each centroid.
6. Reassign each data point to the new centroid: one yellow point is now on the left side of the line and two blue points are to the right of the line, so these three points are assigned to new centroids.
7. Since reassignment has taken place, we again find new centroids or k-points.
8. Compute the center of gravity of the updated clusters to find the closest cluster.
9. With the new centroids, again draw the median line and reassign the data points.
10. There are no dissimilar data points on either side of the line, which means the model has converged.
11. Remove the assumed centroids; the two final clusters are the result.
K-Means Flow
K-Means Solved Example
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9).
Initial cluster centers: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is the Manhattan distance:
ρ(a, b) = |x2 − x1| + |y2 − y1|
Iteration-01: compute the distance of each point from each of the three cluster centers and assign each point to its nearest center. The resulting clusters are
C1 = {1}, C2 = {3, 4, 5, 6, 8}, C3 = {2, 7}
Re-compute the cluster centers:
For Cluster-01: there is only one point, A1(2, 10), in Cluster-01, so its center remains the same.
For Cluster-02: center = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
For Cluster-03: center = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
K-Means Solved Example (continued)
Iteration-02: compute the distance of each point from each of the new cluster centers, using the same distance function, and reassign each point to its nearest center. The resulting clusters are
C1 = {1, 8}, C2 = {3, 4, 5, 6}, C3 = {2, 7}
Re-compute the cluster centers; each new cluster center is the mean of all the points contained in that cluster:
For Cluster-01: center = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)
For Cluster-02: center = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)
For Cluster-03: center = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
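The first assignment and update steps of this worked example can be reproduced with a few lines of NumPy; the code below is only a cross-check of the numbers above, using the same points, initial centers, and Manhattan distance:

```python
# Cross-check of the worked example: Manhattan-distance K-Means, iteration 1.
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])
centers = np.array([[2, 10], [5, 8], [1, 2]])      # initial centers A1, A4, A7

# Manhattan (L1) distance of every point to every center
dist = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
assign = dist.argmin(axis=1)                        # nearest center per point
print(assign + 1)    # -> [1 3 2 2 2 2 3 2], i.e. C1={1}, C2={3,4,5,6,8}, C3={2,7}

# update step: new center = mean of the points assigned to each cluster
new_centers = np.array([points[assign == j].mean(axis=0) for j in range(3)])
print(new_centers)   # -> [[2, 10], [6, 6], [1.5, 3.5]]
```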
K-Means Solved Example #2
Divide the given sample data into two clusters using the Euclidean distance.
Iteration-01: distances are measured from the cluster centers C1 = (185, 72) and C2 = (170, 56).

Point          Distance from C1 (185, 72)   Distance from C2 (170, 56)   Assigned cluster
A1 (185, 72)   0                            21.93                        C1
A2 (170, 56)   21.93                        0                            C2
A3 (168, 60)   20.80                        4.47                         C2
A4 (179, 68)   7.21                         15                           C1
A5 (182, 72)   3                            20                           C1
A6 (188, 77)   5.83                         27.65                        C1
A7 (180, 71)   5.09                         18.02                        C1
A8 (180, 70)   5.38                         17.20                        C1
A9 (183, 84)   12.16                        30.87                        C1
A10 (180, 88)  16.76                        33.52                        C1
A11 (180, 67)  7.07                         14.86                        C1
A12 (177, 76)  8.94                         21.18                        C1

C1 = {1, 4, 5, 6, 7, 8, 9, 10, 11, 12}, C2 = {2, 3}
Re-compute the cluster centers:
For Cluster-01: the center is kept at A1(185, 72).
For Cluster-02: center = ((170 + 168)/2, (56 + 60)/2) = (169, 58)
K-Means Solved Example #2 (continued)
Iteration-02: distances are measured from the cluster centers C1 = (185, 72) and C2 = (169, 58).

Point          Distance from C1 (185, 72)   Distance from C2 (169, 58)   Assigned cluster
A1 (185, 72)   0                            21.26                        C1
A2 (170, 56)   21.93                        2.23                         C2
A3 (168, 60)   20.80                        2.23                         C2
A4 (179, 68)   7.21                         14.14                        C1
A5 (182, 72)   3                            19.10                        C1
A6 (188, 77)   5.83                         26.87                        C1
A7 (180, 71)   5.09                         17.02                        C1
A8 (180, 70)   5.38                         16.27                        C1
A9 (183, 84)   12.16                        29.52                        C1
A10 (180, 88)  16.76                        31.95                        C1
A11 (180, 67)  7.07                         14.21                        C1
A12 (177, 76)  8.94                         19.69                        C1

C1 = {1, 4, 5, 6, 7, 8, 9, 10, 11, 12}, C2 = {2, 3}
The cluster memberships are unchanged from Iteration-01, so the algorithm stops here and these are the final clusters.
K-Means Solved Example #3
Use the K-Means algorithm to create two clusters.
Assume A(2, 2) and C(1, 1) are the centers of the two clusters.

Iteration-01:

Point         Distance from C1 (2, 2)   Distance from C2 (1, 1)   Assigned cluster
A (2, 2)      0                         1.41                      C1
B (3, 2)      1                         2.23                      C1
C (1, 1)      1.41                      0                         C2
D (3, 1)      1.41                      2                         C1
E (1.5, 0.5)  1.58                      0.707                     C2

C1 = {A, B, D}, C2 = {C, E}
Re-compute the cluster centers:
For Cluster-01: the center is kept at A(2, 2).
For Cluster-02: center = ((1 + 1.5)/2, (1 + 0.5)/2) = (1.25, 0.75)

Iteration-02:

Point         Distance from C1 (2, 2)   Distance from C2 (1.25, 0.75)   Assigned cluster
A (2, 2)      0                         1.45                            C1
B (3, 2)      1                         2.15                            C1
C (1, 1)      1.41                      0.35                            C2
D (3, 1)      1.41                      1.77                            C1
E (1.5, 0.5)  1.58                      0.35                            C2

C1 = {A, B, D}, C2 = {C, E}
The cluster memberships are unchanged from Iteration-01, so the algorithm stops here.
Choosing Optimal K value in KMeans
• The performance of the K-means clustering algorithm depends on the quality (compactness) of the clusters that it forms, and hence on the chosen number of clusters.
• Elbow Method: the Within-Cluster Sum of Squares (WCSS) defines the total variation within the clusters. For three clusters,
$$\text{WCSS} = \sum_{P_i \in \text{Cluster1}} d(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster2}} d(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster3}} d(P_i, C_3)^2$$
• In the formula for WCSS, the term $\sum_{P_i \in \text{Cluster1}} d(P_i, C_1)^2$ is the sum of the squared distances between each data point and its centroid within Cluster 1, and the same holds for the other two terms.
• To measure the distance between data points and the centroid, we can use any method such as the Euclidean distance or the Manhattan distance.
Choosing Optimal K value in KMeans
• To find the optimal number of clusters, the elbow method follows the steps below:
  • Execute K-means clustering on the given dataset for different values of K (e.g. ranging from 1 to 10).
  • For each value of K, calculate the WCSS value.
  • Plot a curve of the calculated WCSS values against the number of clusters K.
  • The sharp point of bend, where the plot looks like an arm, is considered the best value of K (see the sketch below).
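A sketch of the elbow method using scikit-learn and Matplotlib; the synthetic three-blob dataset and the K range of 1 to 10 are illustrative assumptions:

```python
# Elbow method: run K-Means for K = 1..10 and plot the WCSS (sklearn's inertia_).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.6, size=(60, 2)) for c in ([0, 0], [6, 0], [3, 5])])

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)            # within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.show()                              # the "elbow" appears at K = 3 for these data
```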
Choosing Optimal K value in KMeans
Silhouette Analysis in K-means Clustering
The silhouette coefficient of a data point o is defined as
$$s(o) = \frac{b(o) - a(o)}{\max\{a(o),\; b(o)\}}$$
where
• s(o) is the silhouette coefficient of the data point o,
• a(o) is the average distance between o and all the other data points in the cluster to which o belongs, and
• b(o) is the minimum, over the clusters to which o does not belong, of the average distance from o to the points of that cluster.
The value of the silhouette coefficient lies in [−1, 1]. A score of 1 is the best, meaning that the data point o is very compact within the cluster to which it belongs and far away from the other clusters. The worst value is −1. Values near 0 denote overlapping clusters. A sketch of computing the score for several values of K follows.
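A sketch of silhouette analysis with scikit-learn's silhouette_score, computing the mean silhouette coefficient for several values of K on a made-up dataset and preferring the K with the highest score:

```python
# Silhouette analysis: pick the K that maximizes the mean silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.6, size=(60, 2)) for c in ([0, 0], [6, 0], [3, 5])])

for k in range(2, 7):                   # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```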
K-Means Applications