ECE3047 - Machine Learning Fundamentals
Prepared By
Dr. Rohith G
Assistant Professor (Senior)
School of Electronics Engineering (SENSE), VIT-Chennai
Under the Guidance and Materials mentored by
Dr. Sathiya Narayanan S
Assistant Professor (Senior)
School of Electronics Engineering (SENSE), VIT-Chennai
• Module 1: Introduction (to Machine Learning)
• Module 2: Data Preprocessing
• Module 3: Regression
• Module 4: Classification
• Module 5: Clustering
• Module 6: Optimization
• Module 7: Reinforcement Learning
Topics in Module 5 – Clustering
• Introduction
• Mixture Densities
• Types of Clustering – Partitioning, Hierarchical
• Supervised Learning after Clustering
• Choosing the Number of Clusters
• Applications
Introduction to Unsupervised Learning
• Training procedures that use labeled samples are referred to as supervised.
• Unsupervised procedures use unlabeled data.
• Seven basic reasons why we are interested in unsupervised methods:
1) Collecting and labeling data is very costly and nontrivial (often this is a research problem in
itself).
2) Heuristic methods (application-specific) exist that allow us to improve a classifier trained using
supervised techniques by introducing large amounts of unlabeled data. This is often faster than
labeling data.
3) We would like to exploit “found” data such as that available on the Internet. Often this data is not
truth-marked or is only partially transcribed.
4) Reversal of the training process: train on unlabeled data and then use supervision to label the
groupings.
5) Models often need to be adapted over time.
6) Use unsupervised methods to find features that will be useful for categorization.
7) Perform rapid exploratory analysis to gain insight into a new problem.
What is Clustering?
Segregate groups with similar traits and assign them into clusters.
• Clustering is an unsupervised learning method for identifying similar groups of data in a dataset.
• It is a way of grouping the data points into different clusters, each consisting of similar data points.
• Objects with possible similarities remain in a group that has little or no similarity with another group.
How is Clustering performed?
• The goal is to study the intrinsic (and commonly hidden) structure of the data. Two approaches are used: clustering and dimensionality reduction.
• Segmenting datasets by some shared attributes.
• Detecting anomalies that do not fit into any group.
• Simplifying datasets by aggregating variables with similar attributes.
Types of Clustering
Density-based Clustering
• Density-based clustering connects areas of high example density into clusters.
• This allows for arbitrary-shaped distributions as long as dense areas can be connected.
• These algorithms have difficulty with data of varying densities and high dimensions.
• Further, by design, these algorithms do not assign outliers to clusters.
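As a quick illustration of these points, the sketch below uses scikit-learn's DBSCAN on a handful of made-up 2-D points (the data and the eps/min_samples values are assumptions chosen only for the example); the isolated point is labelled -1, i.e. it is not assigned to any cluster.

```python
# Density-based clustering with DBSCAN: dense regions become clusters,
# sparse points are labelled -1 (outliers are not assigned to any cluster).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],   # dense blob 1
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],   # dense blob 2
              [4.0, 20.0]])                          # isolated point

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # expected: two clusters (0 and 1) and one outlier (-1)
```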
Types of Clustering
Centroid-based Clustering
• Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to
hierarchical clustering defined below.
• k-means is the most widely-used centroid-based clustering algorithm.
• Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.
• This course focuses on k-means because it is an efficient, effective, and simple clustering
algorithm.
Types of Clustering
Distribution based Clustering
• This clustering approach assumes the data is composed of distributions, such as Gaussian distributions.
• A distribution-based algorithm might, for example, cluster the data into three Gaussian distributions. As the distance from a distribution's center increases, the probability that a point belongs to that distribution decreases. When you do not know the type of distribution in your data, you should use a different algorithm. A brief sketch with a Gaussian mixture model follows.
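A minimal sketch of distribution-based clustering, assuming scikit-learn's GaussianMixture and three synthetic Gaussian blobs (the data and the choice of three components are illustrative assumptions):

```python
# Distribution-based clustering with a Gaussian mixture model; the data
# and the choice of three components are assumptions for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))      # three Gaussian blobs
               for loc in ([0, 0], [5, 0], [0, 5])])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)                 # hard assignments
probs = gmm.predict_proba(X)            # soft membership probabilities
print(labels[:5])
print(probs[0].round(3))                # probability of belonging to each component
```

Unlike centroid-based methods, the fitted model also returns soft membership probabilities, which reflect the decreasing probability of membership away from each component's center.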
Types of Clustering
Hierarchical Clustering
• Hierarchical clustering creates a tree of clusters.
• Hierarchical clustering, not surprisingly, is well suited to hierarchical data, such as taxonomies.
• Another advantage is that any number of clusters can be chosen by cutting the tree at the right level.
Clustering Analysis
• The objective of clustering is to find different groups within the elements of the data.
• Clustering algorithms find structure in the data such that elements of the same cluster (or group) are more similar to each other than to those from different clusters.
• The machine learning model is then able to infer that there are two different classes without knowing anything else about the data.
Clustering Analysis
Common Clustering Algorithms
• K-Means
• Hierarchical Clustering
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Gaussian Mixture Models
Mixture Densities
• Assume:
  • The samples come from a known number c of classes.
  • The prior probabilities P(ωj), for j = 1, …, c, are known.
  • The forms of the class-conditional probability densities p(x | ωj, θj) are known.
  • The values of the c parameter vectors θ1, …, θc are unknown.
  • The category labels are unknown.
• The probability density function for the samples is given by:
$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j)$$
where θ = (θ1, …, θc)ᵗ. The prior probabilities P(ωj) are called the mixing parameters and, without loss of generality, sum to one.
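As a small numerical illustration of the mixture density above, the sketch below evaluates a two-component (c = 2) univariate Gaussian mixture; the means, variances, and mixing parameters are made up for the example:

```python
# A minimal numerical sketch of a mixture density with c = 2 Gaussian
# components; the means, scales, and mixing parameters are made up.
import numpy as np
from scipy.stats import norm

priors = [0.4, 0.6]                  # mixing parameters P(w_j), sum to one
components = [norm(loc=0.0, scale=1.0), norm(loc=3.0, scale=0.5)]

def mixture_pdf(x):
    # p(x | theta) = sum_j p(x | w_j, theta_j) * P(w_j)
    return sum(P * comp.pdf(x) for P, comp in zip(priors, components))

print(mixture_pdf(0.0), mixture_pdf(3.0))
```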
• A density, p(x|θ), is said to be identifiable if θ ≠ θ’ implies there
exists an x such that p(x|θ) ≠ p(x|θ’). (A density is unidentifiable if we
cannot recover a unique θ from an infinite amount of data.)
• Identifiability of θ is a property of the model and not the procedure
used to estimate the model.
Maximum Likelihood Estimates
• Given a set D = {x1, …, xn} of n unlabeled samples drawn independently from the mixture
density, the likelihood of the observed samples is:
$$p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$
• The maximum likelihood estimate is the value of θ that maximizes p(D|θ).
• If we differentiate the log-likelihood:
$$\nabla_{\boldsymbol{\theta}_i} \log p(D \mid \boldsymbol{\theta}) = \sum_{k=1}^{n} \frac{1}{p(\mathbf{x}_k \mid \boldsymbol{\theta})}\, \nabla_{\boldsymbol{\theta}_i} \left[ \sum_{j=1}^{c} p(\mathbf{x}_k \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j) \right]$$
• Assume θi and θj are functionally independent if i ≠ j.
• Substitute the posterior:
$$P(\omega_i \mid \mathbf{x}_k, \boldsymbol{\theta}) = \frac{p(\mathbf{x}_k \mid \omega_i, \boldsymbol{\theta}_i)\, P(\omega_i)}{p(\mathbf{x}_k \mid \boldsymbol{\theta})}$$
• The gradient can be written as:
$$\nabla_{\boldsymbol{\theta}_i} \log p(D \mid \boldsymbol{\theta}) = \sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \boldsymbol{\theta}_i)$$
• The gradient must vanish at the value of θi that maximizes the log likelihood. Therefore, the
ML solution must satisfy:
$$\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i) = 0 \qquad \text{for } i = 1, \ldots, c$$
Generalization of the ML Estimate
• We can generalize these results to include the prior probability, P(ωi), among
the unknown quantities.
• The search for the maximum value of p(D|θ) extends over θ and P(ωi), subject
to the constraints:
$$P(\omega_i) \ge 0, \quad i = 1, \ldots, c, \qquad \sum_{i=1}^{c} P(\omega_i) = 1.$$
• It can be shown that the ML estimates must satisfy:
$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})$$
$$\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i) = 0$$
$$\hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}}) = \frac{p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}_k \mid \omega_j, \hat{\boldsymbol{\theta}}_j)\, \hat{P}(\omega_j)}$$
• The first equation states that the estimate of the prior is computed by averaging the posteriors over the entire data set. The second equation restates the ML principle that the optimal value of θ makes the gradient of the log-likelihood vanish. The third equation is simply Bayes' rule applied to the mixture components.
• So the good news here is that doing the obvious maximizes the posterior.
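For Gaussian components, these conditions can be iterated as fixed-point updates, which is essentially what the EM algorithm does in practice. A compact sketch for one-dimensional data follows; the synthetic data, the initial values, and the fixed number of iterations are illustrative assumptions:

```python
# EM-style iteration of the ML equations above for a 1-D Gaussian mixture.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 300)])  # made-up data

c = 2
P = np.full(c, 1.0 / c)                 # mixing parameters P(w_i)
mu = np.array([-1.0, 1.0])              # initial component means
sigma = np.ones(c)                      # initial component std deviations

for _ in range(50):
    # posteriors P(w_i | x_k, theta): Bayes rule over the mixture components
    like = np.stack([P[i] * norm.pdf(x, mu[i], sigma[i]) for i in range(c)])
    post = like / like.sum(axis=0, keepdims=True)
    # ML updates: priors are averages of the posteriors over the data set,
    # means/variances are posterior-weighted sample statistics
    P = post.mean(axis=1)
    mu = (post * x).sum(axis=1) / post.sum(axis=1)
    sigma = np.sqrt((post * (x - mu[:, None]) ** 2).sum(axis=1) / post.sum(axis=1))

print(P, mu, sigma)
```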
Partitional Clustering
• Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements:
  • Each group must contain at least one object.
  • Each object must belong to exactly one group.
• Given k, the number of partitions to construct, a partitioning method creates an initial partition.
• It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.
• The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or very different.
Hierarchical clustering (Agglomerative and Divisive clustering)
• Hierarchical clustering works by grouping data into a tree of clusters. It begins by treating every data point as a separate cluster and then repeatedly executes the following steps:
  • Identify the two clusters that are closest together, and
  • Merge (agglomerative) or split (divisive) the two most comparable clusters. These steps continue until all the clusters have been merged or split.
• The aim of hierarchical clustering is to produce a hierarchical series of nested clusters. A dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).
Hierarchical clustering (Agglomerative and Divisive clustering)
Agglomerative Clustering:
• Also known as bottom-up approach or
hierarchical agglomerative clustering
(HAC).
• Produces a structure that is more informative than the unstructured set of clusters returned by flat clustering.
• This clustering algorithm does not require us to prespecify the number of clusters.
• Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
Algorithm
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
3. Merge the clusters that are most similar (closest) to each other.
4. Recalculate the proximity matrix for the merged clusters.
5. Repeat Steps 3 and 4 until only a single cluster remains.
Hierarchical clustering (Agglomerative and Divisive clustering)
Agglomerative Clustering:
Algorithm:
given a dataset (d1, d2, d3, ..., dN) of size N
# compute the distance matrix
for i = 1 to N:
    # the distance matrix is symmetric about the primary
    # diagonal, so we compute only its lower part
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)
each data point is a singleton cluster
repeat
    merge the two clusters having the minimum distance
    update the distance matrix
until only a single cluster remains
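A runnable counterpart of the pseudocode above, using SciPy's hierarchical-clustering routines; the sample points and the cut into three flat clusters are assumptions for illustration:

```python
# Agglomerative (single-linkage) hierarchical clustering with SciPy;
# the sample points and the cut into 3 clusters are made up for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.2],
                   [5.1, 4.9], [9.0, 9.1]])

dist = pdist(points, metric='euclidean')          # condensed pairwise distance matrix
Z = linkage(dist, method='single')                # merge tree (one merge per row)
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 flat clusters
print(labels)                                     # e.g. [1 1 2 2 3]
```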
Hierarchical clustering (Agglomerative and Divisive clustering)
Divisive Clustering:
• Also known as a top-down approach.
• This algorithm also does not require us to prespecify the number of clusters.
• Top-down clustering requires a
method for splitting a cluster that
contains the whole data and proceeds
by splitting clusters recursively until
individual data have been split into
singleton clusters.
Hierarchical clustering (Agglomerative and Divisive clustering)
Divisive Clustering:
Algorithm:
given a dataset (d1, d2, d3, ..., dN) of size N
at the top we have all the data in one cluster
the cluster is split using a flat clustering method, e.g. K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster
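A rough sketch of the top-down procedure above, using repeated 2-means bisection with scikit-learn; splitting the largest cluster first and stopping at a fixed number of clusters are assumptions, not the only possible choices:

```python
# Divisive (top-down) clustering via repeated 2-means bisection.
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, n_clusters=3, random_state=0):
    clusters = [np.arange(len(X))]          # start: one cluster holding every point
    while len(clusters) < n_clusters:
        # choose the cluster to split (here: simply the largest one)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        km = KMeans(n_clusters=2, n_init=10, random_state=random_state)
        labels = km.fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.2, 4.8], [9, 9], [9.1, 8.9]])
for c in divisive_clustering(X, n_clusters=3):
    print(X[c])
```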
Hierarchical clustering (Single Linkage and Complete Linkage)
• The process of hierarchical clustering involves either merging sub-clusters (data points in the first iteration) into larger clusters in a bottom-up manner, or dividing a larger cluster into smaller sub-clusters in a top-down manner.
• In both types of hierarchical clustering, the distance between two sub-clusters needs to be computed.
• The different linkage types describe the different ways of measuring the distance between two sub-clusters of data points; the standard forms are given below.
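For reference, the common linkage criteria between two clusters $C_i$ and $C_j$, with $d(x, y)$ the pointwise distance, can be written as follows (these formulas are supplied here for completeness; the slides list only the linkage names):

$$d_{\mathrm{single}}(C_i, C_j) = \min_{x \in C_i,\; y \in C_j} d(x, y)$$
$$d_{\mathrm{complete}}(C_i, C_j) = \max_{x \in C_i,\; y \in C_j} d(x, y)$$
$$d_{\mathrm{average}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$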
Hierarchical clustering (Linkage types)
Hierarchical clustering (Single Linkage – Solved Example)
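As a small illustrative stand-in for a single-linkage worked example, the sketch below traces which clusters merge at each step for a made-up one-dimensional dataset (the points are assumptions chosen only to show the single-linkage, i.e. minimum-distance, merge rule):

```python
# Trace the merge order of single-linkage clustering on made-up 1-D data.
import numpy as np

points = {'A': 1.0, 'B': 2.0, 'C': 6.0, 'D': 7.0, 'E': 15.0}
clusters = [{k} for k in points]            # start: every point is its own cluster

def single_link(c1, c2):
    # single linkage: minimum pairwise distance between the two clusters
    return min(abs(points[a] - points[b]) for a in c1 for b in c2)

while len(clusters) > 1:
    # find the pair of clusters with the smallest single-linkage distance
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
    d = single_link(clusters[i], clusters[j])
    print(f"merge {clusters[i]} and {clusters[j]} at distance {d}")
    clusters[i] = clusters[i] | clusters[j]
    del clusters[j]
```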
K-Means Clustering Analysis
• The K-Means algorithm aims to find and group into classes the data points that have high similarity between them.
• In terms of the algorithm, this similarity is understood as the opposite of the distance between data points.
• The closer the data points are, the more similar they are and the more likely they are to belong to the same cluster.
K-Means Clustering Analysis
• Squared Euclidean Distance: the distance most commonly used in K-Means is the squared Euclidean distance. Between two points x and y in m-dimensional space it is
$$d^2(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{m} (x_j - y_j)^2$$
where j indexes the jth dimension (or feature column) of the sample points x and y.
• Cluster Inertia: cluster inertia is the name given to the Sum of Squared Errors (SSE) within the clustering context,
$$SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \left\lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}^{(j)} \right\rVert_2^2$$
where μ(j) is the centroid of cluster j, and w(i,j) is 1 if the sample x(i) is in cluster j and 0 otherwise.
• K-Means can be understood as an algorithm that will try to minimize the cluster inertia factor.
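A minimal sketch of this quantity on made-up data; scikit-learn's KMeans exposes the same value as inertia_, so the hand computation can be checked against it:

```python
# Cluster inertia / WCSS as the quantity K-Means minimizes.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.0, 0.6]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# recompute the inertia by hand: squared distance of every sample to its
# assigned centroid, summed over all samples
inertia = sum(np.sum((x - km.cluster_centers_[label]) ** 2)
              for x, label in zip(X, km.labels_))
print(inertia, km.inertia_)   # the two values agree (up to floating-point error)
```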
K-Means Clustering Analysis
• First, we need to choose k, the number of clusters we want to find.
• The algorithm then randomly selects the centroid of each cluster.
• Each data point is assigned to the closest centroid (using the Euclidean distance).
• The cluster inertia is computed.
• The new centroids are calculated as the mean of the points assigned to each centroid in the previous step; in other words, by computing the minimum quadratic error of the data points with respect to the center of each cluster, the center is moved towards those points.
• Go back to step 3. A from-scratch sketch of this loop is shown below.
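A from-scratch sketch of the loop described above (Lloyd's algorithm). It reuses the eight points of the worked example that follows, but with random initial centroids rather than the fixed initial centers used there, so the intermediate steps need not match:

```python
# A minimal from-scratch K-Means (Lloyd's algorithm) sketch.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged: assignments are stable
            break
        centroids = new_centroids
    inertia = d[np.arange(len(X)), labels].sum()    # cluster inertia (WCSS)
    return labels, centroids, inertia

# the eight points A1..A8 of the worked example below
X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
labels, centroids, inertia = kmeans(X, k=3)
print(labels)
print(centroids)
print(inertia)
```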
K-Means Clustering Analysis
Step-by-step illustration (K = 2):
1. Choose K = 2, i.e. the number of clusters to identify: we want to group these data into two different clusters.
2. Choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points; here the two selected k points are not part of the dataset.
3. Assign each data point of the scatter plot to its closest k-point or centroid by calculating the distances between the points; to do this, we draw a median line between the two centroids.
4. Points on the left side of the line are near the K1 (blue) centroid, and points to the right of the line are close to the yellow centroid.
5. To find the closest cluster, repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity of the points assigned to each centroid.
6. Reassign each data point to the new centroid: one yellow point is now on the left side of the line and two blue points are to the right of the line, so these three points are assigned to new centroids.
7. Since reassignment has taken place, we again find new centroids or k-points.
8. Compute the center of gravity of the updated clusters to find the closest cluster.
9. With the new centroids, again draw the median line and reassign the data points.
10. There are no dissimilar data points on either side of the line, which means the model has converged.
11. Remove the assumed centroids; the two final clusters are the result.
K-Means Flow
K-Means Solved Example
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9).
Initial cluster centers: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is the Manhattan distance:
ρ(a, b) = |x2 − x1| + |y2 − y1|
Iteration-01: compute the distance of each point from each of the three cluster centers and assign each point to its nearest center. The resulting clusters are
C1 = {1}, C2 = {3, 4, 5, 6, 8}, C3 = {2, 7}
Re-compute the cluster centers:
For Cluster-01: there is only one point, A1(2, 10), in Cluster-01, so its center remains the same.
For Cluster-02: center = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
For Cluster-03: center = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
K-Means Solved Example (continued)
Iteration-02: compute the distance of each point from each of the new cluster centers, using the same distance function, and reassign each point to its nearest center. The resulting clusters are
C1 = {1, 8}, C2 = {3, 4, 5, 6}, C3 = {2, 7}
Re-compute the cluster centers; each new cluster center is the mean of all the points contained in that cluster:
For Cluster-01: center = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)
For Cluster-02: center = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)
For Cluster-03: center = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
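The first assignment and update steps of this worked example can be reproduced with a few lines of NumPy; the code below is only a cross-check of the numbers above, using the same points, initial centers, and Manhattan distance:

```python
# Cross-check of the worked example: Manhattan-distance K-Means, iteration 1.
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])
centers = np.array([[2, 10], [5, 8], [1, 2]])      # initial centers A1, A4, A7

# Manhattan (L1) distance of every point to every center
dist = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
assign = dist.argmin(axis=1)                        # nearest center per point
print(assign + 1)    # -> [1 3 2 2 2 2 3 2], i.e. C1={1}, C2={3,4,5,6,8}, C3={2,7}

# update step: new center = mean of the points assigned to each cluster
new_centers = np.array([points[assign == j].mean(axis=0) for j in range(3)])
print(new_centers)   # -> [[2, 10], [6, 6], [1.5, 3.5]]
```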
K-Means Solved Example #2
Divide the given sample data into two clusters using the Euclidean distance.
Iteration-01: distances are measured from the cluster centers C1 = (185, 72) and C2 = (170, 56).

Point          Distance from C1 (185, 72)   Distance from C2 (170, 56)   Assigned cluster
A1 (185, 72)   0                            21.93                        C1
A2 (170, 56)   21.93                        0                            C2
A3 (168, 60)   20.80                        4.47                         C2
A4 (179, 68)   7.21                         15                           C1
A5 (182, 72)   3                            20                           C1
A6 (188, 77)   5.83                         27.65                        C1
A7 (180, 71)   5.09                         18.02                        C1
A8 (180, 70)   5.38                         17.20                        C1
A9 (183, 84)   12.16                        30.87                        C1
A10 (180, 88)  16.76                        33.52                        C1
A11 (180, 67)  7.07                         14.86                        C1
A12 (177, 76)  8.94                         21.18                        C1

C1 = {1, 4, 5, 6, 7, 8, 9, 10, 11, 12}, C2 = {2, 3}
Re-compute the cluster centers:
For Cluster-01: the center is kept at A1(185, 72).
For Cluster-02: center = ((170 + 168)/2, (56 + 60)/2) = (169, 58)
K-Means Solved Example #2 (continued)
Iteration-02: distances are measured from the cluster centers C1 = (185, 72) and C2 = (169, 58).

Point          Distance from C1 (185, 72)   Distance from C2 (169, 58)   Assigned cluster
A1 (185, 72)   0                            21.26                        C1
A2 (170, 56)   21.93                        2.23                         C2
A3 (168, 60)   20.80                        2.23                         C2
A4 (179, 68)   7.21                         14.14                        C1
A5 (182, 72)   3                            19.10                        C1
A6 (188, 77)   5.83                         26.87                        C1
A7 (180, 71)   5.09                         17.02                        C1
A8 (180, 70)   5.38                         16.27                        C1
A9 (183, 84)   12.16                        29.52                        C1
A10 (180, 88)  16.76                        31.95                        C1
A11 (180, 67)  7.07                         14.21                        C1
A12 (177, 76)  8.94                         19.69                        C1

C1 = {1, 4, 5, 6, 7, 8, 9, 10, 11, 12}, C2 = {2, 3}
The cluster memberships are unchanged from Iteration-01, so the algorithm stops here and these are the final clusters.
K-Means Solved Example #3
Use the K-Means algorithm to create two clusters.
Assume A(2, 2) and C(1, 1) are the centers of the two clusters.

Iteration-01:

Point         Distance from C1 (2, 2)   Distance from C2 (1, 1)   Assigned cluster
A (2, 2)      0                         1.41                      C1
B (3, 2)      1                         2.23                      C1
C (1, 1)      1.41                      0                         C2
D (3, 1)      1.41                      2                         C1
E (1.5, 0.5)  1.58                      0.707                     C2

C1 = {A, B, D}, C2 = {C, E}
Re-compute the cluster centers:
For Cluster-01: the center is kept at A(2, 2).
For Cluster-02: center = ((1 + 1.5)/2, (1 + 0.5)/2) = (1.25, 0.75)

Iteration-02:

Point         Distance from C1 (2, 2)   Distance from C2 (1.25, 0.75)   Assigned cluster
A (2, 2)      0                         1.45                            C1
B (3, 2)      1                         2.15                            C1
C (1, 1)      1.41                      0.35                            C2
D (3, 1)      1.41                      1.77                            C1
E (1.5, 0.5)  1.58                      0.35                            C2

C1 = {A, B, D}, C2 = {C, E}
The cluster memberships are unchanged from Iteration-01, so the algorithm stops here.
Choosing Optimal K value in KMeans
• The performance of the K-means clustering algorithm depends on the quality (compactness) of the clusters that it forms, and hence on the chosen number of clusters.
• Elbow Method: the Within-Cluster Sum of Squares (WCSS) defines the total variation within the clusters. For three clusters,
$$\text{WCSS} = \sum_{P_i \in \text{Cluster1}} d(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster2}} d(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster3}} d(P_i, C_3)^2$$
• In the formula for WCSS, the term $\sum_{P_i \in \text{Cluster1}} d(P_i, C_1)^2$ is the sum of the squared distances between each data point and its centroid within Cluster 1, and the same holds for the other two terms.
• To measure the distance between data points and the centroid, we can use any method such as the Euclidean distance or the Manhattan distance.
Choosing Optimal K value in KMeans
• To find the optimal number of clusters, the elbow method follows the steps below:
  • Execute K-means clustering on the given dataset for different values of K (e.g. ranging from 1 to 10).
  • For each value of K, calculate the WCSS value.
  • Plot a curve of the calculated WCSS values against the number of clusters K.
  • The sharp point of bend, where the plot looks like an arm, is considered the best value of K (see the sketch below).
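A sketch of the elbow method using scikit-learn and Matplotlib; the synthetic three-blob dataset and the K range of 1 to 10 are illustrative assumptions:

```python
# Elbow method: run K-Means for K = 1..10 and plot the WCSS (sklearn's inertia_).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.6, size=(60, 2)) for c in ([0, 0], [6, 0], [3, 5])])

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)            # within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.show()                              # the "elbow" appears at K = 3 for these data
```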
Choosing Optimal K value in KMeans
Silhouette Analysis in K-means Clustering
The silhouette coefficient of a data point o is defined as
$$s(o) = \frac{b(o) - a(o)}{\max\{a(o),\; b(o)\}}$$
where
• s(o) is the silhouette coefficient of the data point o,
• a(o) is the average distance between o and all the other data points in the cluster to which o belongs, and
• b(o) is the minimum, over the clusters to which o does not belong, of the average distance from o to the points of that cluster.
The value of the silhouette coefficient lies in [−1, 1]. A score of 1 is the best, meaning that the data point o is very compact within the cluster to which it belongs and far away from the other clusters. The worst value is −1. Values near 0 denote overlapping clusters. A sketch of computing the score for several values of K follows.
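A sketch of silhouette analysis with scikit-learn's silhouette_score, computing the mean silhouette coefficient for several values of K on a made-up dataset and preferring the K with the highest score:

```python
# Silhouette analysis: pick the K that maximizes the mean silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.6, size=(60, 2)) for c in ([0, 0], [6, 0], [3, 5])])

for k in range(2, 7):                   # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```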
K-Means Applications