Machine Learning

Unit - IV

By
Mrs. P Jhansi Lakshmi
Assistant Professor
Department of CSE, VFSTR
Syllabus
UNIT – IV
CLUSTERING: Mixture densities; K-means Clustering; Supervised
learning after clustering; Spectral clustering; Hierarchical clustering.
NONPARAMETRIC METHODS: Nonparametric density estimation;
Histogram estimator; Kernel estimator; k-nearest neighbor estimator;
Generalization to multivariate data; Nonparametric classification;



CLUSTERING: Mixture densities
Mixture densities
Mixture density: a combination of density laws associated with several groups:

p(x) = \sum_{i=1}^{k} p(x|G_i)\, P(G_i)

where G_i are the mixture components, also called groups or clusters, p(x|G_i) are the component densities, and P(G_i) are the mixture proportions.
• The number of components, k, is a hyperparameter and should be specified
beforehand. Given a sample and k, learning corresponds to estimating the component
densities and proportions.
• When we assume that the component densities obey a parametric model, we need
only estimate their parameters.
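
To make the definition concrete, here is a minimal NumPy sketch (not part of the original slides) that evaluates a two-component Gaussian mixture density p(x) = \sum_i p(x|G_i) P(G_i); the component means, standard deviations, and proportions are illustrative assumptions.

import numpy as np

def gaussian(x, mu, sigma):
    # univariate Gaussian component density p(x|G_i)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# k = 2 components with mixture proportions P(G_i)
P = np.array([0.3, 0.7])
mus, sigmas = np.array([-2.0, 1.0]), np.array([0.5, 1.5])

def mixture_density(x):
    # p(x) = sum_i p(x|G_i) P(G_i)
    return sum(P[i] * gaussian(x, mus[i], sigmas[i]) for i in range(len(P)))

print(mixture_density(0.0))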



Mixture densities
• When the mixture is made up of multivariate Gaussian components, we have a Gaussian mixture:
• Component densities: p(x|G_i) \sim N(\mu_i, \Sigma_i), and
• Parametrization: \Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}

are the parameters that should be estimated from the iid (independent and identically distributed) unlabeled sample X = \{x^t\}_{t=1}^{N}.



Mixture densities
Mixture density: p(x) = \sum_{i=1}^{k} p(x|G_i)\, P(G_i)
• P(G_i): the proportion of group G_i in the mixture
• p(x|G_i): the component density, i.e., the probability that x belongs to (is generated by) group G_i



Mixture densities
• Parametric classification is a bona fide mixture model where the groups, G_i, correspond to classes, the component densities p(x|G_i) correspond to class densities p(x|C_i), and the P(G_i) correspond to class priors, P(C_i):

p(x) = \sum_{i=1}^{K} p(x|C_i)\, P(C_i)



Mixture densities
• In the supervised case, we know how many groups there are and learning the parameters
is trivial because we are given the labels, namely, which instance belongs to which class
(component).
• We know the sample X = \{x^t, r^t\}_t, where r_i^t = 1 if x^t \in C_i and 0 otherwise, so the parameters can be calculated using maximum likelihood.
• When each class is Gaussian distributed, we have a Gaussian mixture, and the parameters are estimated as:

\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}, \qquad
m_i = \frac{\sum_t r_i^t x^t}{\sum_t r_i^t}, \qquad
S_i = \frac{\sum_t r_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t r_i^t}
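
As an illustration of these maximum-likelihood estimates, a minimal NumPy sketch (not from the slides) follows; the function name fit_gaussian_classes and the one-hot label layout are assumptions made for this example.

import numpy as np

def fit_gaussian_classes(X, r):
    """ML estimates for labeled Gaussian classes.
    X : (N, d) data matrix; r : (N, K) one-hot labels (r[t, i] = 1 if x^t in C_i).
    Returns priors P^(C_i), means m_i, and covariances S_i."""
    N, d = X.shape
    K = r.shape[1]
    priors = r.sum(axis=0) / N                       # P^(C_i) = sum_t r_i^t / N
    means = (r.T @ X) / r.sum(axis=0)[:, None]       # m_i
    covs = np.empty((K, d, d))
    for i in range(K):
        diff = X - means[i]                          # (x^t - m_i)
        w = r[:, i][:, None]
        covs[i] = (w * diff).T @ diff / r[:, i].sum()   # S_i
    return priors, means, covs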



Classes vs. Clusters
Supervised: X = \{x^t, r^t\}_t
• Classes C_i, i = 1, ..., K
• p(x) = \sum_{i=1}^{K} p(x|C_i)\, P(C_i), where p(x|C_i) \sim N(\mu_i, \Sigma_i)
• \Phi = \{P(C_i), \mu_i, \Sigma_i\}_{i=1}^{K}, estimated by

\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}, \qquad
m_i = \frac{\sum_t r_i^t x^t}{\sum_t r_i^t}, \qquad
S_i = \frac{\sum_t r_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t r_i^t}

Unsupervised: X = \{x^t\}_t
• Clusters G_i, i = 1, ..., k
• p(x) = \sum_{i=1}^{k} p(x|G_i)\, P(G_i), where p(x|G_i) \sim N(\mu_i, \Sigma_i)
• \Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}, but the labels r_i^t are unknown.
Mixture densities
• In the unsupervised case, we are given only the sample X = \{x^t\}_t and not the labels r^t; that is, we do not know which x^t comes from which component.
• So we should estimate both:
• First, we should estimate the labels, i.e., the component that each instance belongs to; and
• second, once we estimate the labels, we should estimate the parameters of the
components given the set of instances belonging to them.



Supervised learning after clustering
Supervised learning after clustering
• Dimensionality reduction methods find correlations between features and
group features
• Clustering methods find similarities between instances and group instances
• Allows knowledge extraction through
number of clusters,
prior probabilities,
cluster parameters, i.e., center, range of features.

Example: CRM, customer segmentation



Supervised learning after clustering
• Clustering is also used as a preprocessing stage.
• After clustering, the estimated group labels h_j (soft) or b_j (hard) may be seen as the dimensions of a new k-dimensional space, where we can then learn our discriminant or regressor.
• Local representation (only one b_j is 1, all others are 0; only a few h_j are nonzero) vs. distributed representation (after PCA, all z_j are nonzero). A sketch of this preprocessing idea follows below.
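
A minimal scikit-learn sketch of clustering as preprocessing (illustrative, not from the slides): fit a Gaussian mixture, use the soft posteriors h_j as a new k-dimensional representation, and train a discriminant on it. The toy data, the choice of k = 3 components, and the logistic-regression discriminant are assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
H = gmm.predict_proba(X)                 # soft labels h_j: the new k-dimensional space

clf = LogisticRegression().fit(H, y)     # learn the discriminant in the new space
print(clf.score(H, y))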



Mixture of Mixtures
• In classification, the input comes from a mixture of classes (supervised).
• If each class is also a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:

p(x|C_i) = \sum_{j=1}^{k_i} p(x|G_{ij})\, P(G_{ij})

where k_i is the number of components making up p(x|C_i) and G_{ij} is component j of class i.
• Different classes may need different numbers of components. Learning the parameters of the components is done separately for each class.



Spectral Clustering
Why Spectral Clustering?
Spectral clustering has some unique advantages:
• It makes no assumptions about the shapes of the clusters and can handle intertwined spirals, etc.
• Methods like EM or k-means require an iterative process to find a local minimum and are very sensitive to initialization, so we need multiple restarts to get high-quality clusters.
Process of spectral clustering (a minimal sketch follows this list):
• Construct a similarity graph (e.g., a k-nearest-neighbor graph) over all the data points.
• Embed the data points in a low-dimensional space (spectral embedding), in which the clusters are more obvious, using the eigenvectors of the graph Laplacian.
• Apply a classical clustering algorithm (e.g., k-means) to partition the embedding.
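
A minimal NumPy/scikit-learn sketch of this pipeline (an illustration under assumed parameters n_neighbors, sigma, and n_clusters, not the exact algorithm of the slides): k-NN similarity graph, unnormalized graph Laplacian, eigenvector embedding, then k-means.

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters=2, n_neighbors=10, sigma=1.0):
    # 1. Similarity graph: Gaussian kernel on k-NN distances, symmetrized
    D = kneighbors_graph(X, n_neighbors, mode="distance").toarray()
    B = np.where(D > 0, np.exp(-D**2 / (2 * sigma**2)), 0.0)
    B = np.maximum(B, B.T)                       # make the graph undirected

    # 2. Graph Laplacian L = D - B and its smallest eigenvectors
    deg = np.diag(B.sum(axis=1))
    L = deg - B
    eigvals, eigvecs = np.linalg.eigh(L)
    Z = eigvecs[:, :n_clusters]                  # spectral embedding

    # 3. k-means on the new coordinates
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)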



Spectral Clustering
• Instead of clustering in the original space, a possibility is to first map the
data to a new space with reduced dimensionality such that similarities are
made more apparent and then cluster in there.
• Any feature selection or extraction method can be used for this purpose,
and one such method is the Laplacian eigenmaps.
• After such a mapping, points that are similar are placed nearby, and this is
expected to enhance the performance of clustering.
This is the idea behind spectral clustering.



Spectral Clustering
There are two steps:
1. In the original space, we define a local neighborhood (by either fixing the
number of neighbors or a distance threshold), and then for instances that
are in the same neighborhood, we define a similarity measure
• for example, using the Gaussian kernel—that is inversely proportional to the
distance between them.
• Remember that instances not in the same local neighborhood are assigned a
similarity of 0 and hence can be placed anywhere with respect to each other. Given
this Laplacian, instances are positioned in the new space using feature embedding.

2. Then run k-means clustering with the new data coordinates in this new
space.
Spectral Clustering
• Let B be the matrix of pairwise similarities and D the diagonal degree matrix with d_i = \sum_j B_{ij} on the diagonal.
• The graph Laplacian is defined as L = D − B.
• This is the unnormalized Laplacian. There are two ways to normalize it:
• one is closely related to a random walk, L_{rw} = D^{-1} L = I − D^{-1} B, and the other constructs a symmetric matrix, L_{sym} = D^{-1/2} L D^{-1/2} = I − D^{-1/2} B D^{-1/2}.
• They may lead to better performance in clustering.

Spectral Clustering
• It is always a good idea to do dimensionality reduction before clustering
using Euclidean distance.
• Using Laplacian eigenmaps makes more sense than multidimensional
scaling proper or principal components analysis because those two check
for the preservation of pairwise similarities between all pairs of instances
• whereas here with Laplacian eigenmaps, we care about preserving the
similarity between neighboring instances only.
• This has the effect that instances that are nearby in the original space,
probably within the same cluster, will be placed very close in the new
space.



Hierarchical Clustering
Hierarchical Clustering
• There are methods for clustering that use only similarities of instances,
without any other requirement on the data;
• The aim is to find groups such that instances in a group are more similar to
each other than instances in different groups.
• This is the approach taken by hierarchical clustering.
• Hierarchical Clustering or Hierarchical Cluster Analysis or HCA is a
method of clustering which seeks to build a hierarchy of clusters in a given
dataset



Hierarchical Clustering
• Cluster based on similarities/distances.
• Distance measure between instances x^r and x^s:

Minkowski (L_p) distance (Euclidean for p = 2):
d_m(x^r, x^s) = \left[ \sum_{j=1}^{d} |x_j^r - x_j^s|^p \right]^{1/p}

City-block distance:
d_{cb}(x^r, x^s) = \sum_{j=1}^{d} |x_j^r - x_j^s|



Agglomerative Clustering
• Start with N groups, each containing one instance, and merge the two closest groups at each iteration.
• Distance between two groups G_i and G_j (see the sketch after this list):
• Single-link: d(G_i, G_j) = \min_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)
• Complete-link: d(G_i, G_j) = \max_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)
• Average-link: the average of d(x^r, x^s) over all pairs; centroid distance: d(m_i, m_j)
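
A minimal SciPy sketch of agglomerative clustering with these linkage criteria (illustrative, not from the slides; the toy data and the cut into 3 clusters are assumptions).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
# three well-separated toy groups in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0, 3, 6)])

Z = linkage(X, method="single")        # also: "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the merge tree into 3 clusters
# dendrogram(Z) would draw the merge tree shown on the next slide.
print(labels)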



Example: Single-Link Clustering

Dendrogram



K-means Clustering
K-means Clustering
• K-means is a simplified approach to clustering.
• It is similar to a Gaussian mixture, the difference being that k-means makes 'hard' choices: each instance is assigned entirely to the cluster with the nearest mean.
• That is, we start with some initial values for the cluster means and update them iteratively throughout training.
• This is in contrast with the softer, probabilistic assignments made in Gaussian mixture models.



k-Means Clustering
• Find k reference vectors (prototypes/codebook vectors/codewords)
which best represent data

• Reference vectors, mj, j =1,...,k



Encoding/Decoding
• Given x, the encoder sends the index i of the closest code word, and the decoder generates the code word m_i corresponding to the received index.
• The error incurred is \|x - m_i\|^2.



k-Means Clustering
• Use the nearest (most similar) reference vector:

\|x^t - m_i\| = \min_j \|x^t - m_j\|

• Reconstruction error:

E(\{m_i\}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \|x^t - m_i\|^2

b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}
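
A minimal NumPy sketch of k-means built from exactly these two alternating steps (illustrative; the random initialization and convergence test are assumptions).

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)]       # initial means
    for _ in range(n_iter):
        # hard assignment b_i^t: index of the nearest reference vector
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        b = d.argmin(axis=1)
        # update: each mean becomes the average of its assigned instances
        new_m = np.array([X[b == i].mean(axis=0) if np.any(b == i) else m[i]
                          for i in range(k)])
        if np.allclose(new_m, m):       # stop when the means no longer move
            break
        m = new_m
    return m, b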



k-means Clustering



Nonparametric Methods
Introduction
• We discussed the parametric and semiparametric approaches where we assumed that
the data is drawn from one or a mixture of probability distributions of known form.
• Now, we discuss the nonparametric approach that is used when no such assumption can
be made about the input density and the data speaks for itself.
• We consider nonparametric approaches for density estimation, classification, outlier detection, and regression, and see how their time and space complexity can be kept in check.



• In nonparametric estimation, we assume that similar inputs have similar
outputs.
• Nonparametric methods do not assume any a priori parametric form for the
underlying densities.
• A nonparametric model is not fixed but its complexity depends on the size
of the training set.
• Nonparametric methods are also called Instance-based or Memory-based
learning algorithms.



Nonparametric Density Estimation
• In density estimation, we assume that the sample X = \{x^t\}_{t=1}^{N} is drawn independently from some unknown probability density p(·).
• \hat{p}(·) is our estimator of p(·).
• We start with the univariate case, where the x^t are scalars, and later generalize to the multivariate case.



Nonparametric Density Estimation
• The nonparametric estimator for the cumulative distribution function, F(x), at point x is the proportion of sample points that are less than or equal to x:

\hat{F}(x) = \frac{\#\{x^t \le x\}}{N}

where \#\{x^t \le x\} denotes the number of training instances whose x^t is less than or equal to x.



Nonparametric Density Estimation
• Similarly, the nonparametric estimate for the density function, which is the derivative of the cumulative distribution, can be calculated as

\hat{p}(x) = \frac{1}{h}\left[\frac{\#\{x^t \le x + h\} - \#\{x^t \le x\}}{N}\right]

• h is the length of the interval, and instances that fall in this interval are assumed to be "close enough."



Histogram Estimator
• The oldest and most popular method is the histogram, where the input space is divided into equal-sized intervals called bins of width h.
• Given an origin x_0 and a bin width h, the bins are the intervals [x_0 + mh, x_0 + (m+1)h) for positive and negative integers m, and the estimate is given as

\hat{p}(x) = \frac{\#\{x^t \text{ in the same bin as } x\}}{Nh}
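
A minimal NumPy sketch of the histogram estimator (illustrative; the origin x0 = 0 and bin width h = 0.5 are assumptions).

import numpy as np

def histogram_estimator(x, data, x0=0.0, h=0.5):
    """Estimate p(x) as (# of x^t in the same bin as x) / (N * h)."""
    N = len(data)
    m = np.floor((x - x0) / h)                      # index of the bin containing x
    in_bin = (np.floor((data - x0) / h) == m)       # which x^t fall in that bin
    return in_bin.sum() / (N * h)

data = np.random.default_rng(0).normal(size=1000)
print(histogram_estimator(0.2, data))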



Histogram Estimator
• In constructing the histogram, we have to choose both an origin and a bin
width.
• The choice of origin affects the estimate near boundaries of bins, but it is
mainly the bin width that has an effect on the estimate:
• With small bins, the estimate is spiky, and with larger bins, the estimate is smoother
(see figure 8.1).
• The estimate is 0 if no instance falls in a bin and there are discontinuities at bin
boundaries.

• One advantage of the histogram is that once the bin estimates are calculated
and stored, we do not need to retain the training set.



Histograms for various bin lengths. ‘×’denote data points.



Naive estimator
• The naive estimator frees us from setting an origin. It is defined as

\hat{p}(x) = \frac{\#\{x - h/2 < x^t \le x + h/2\}}{Nh}

and is equal to the histogram estimate where x is always at the center of a bin of size h (see figure).
The estimator can also be written as

\hat{p}(x) = \frac{1}{Nh} \sum_{t=1}^{N} w\!\left(\frac{x - x^t}{h}\right)

with the weight function defined as

w(u) = \begin{cases} 1 & \text{if } |u| < 1/2 \\ 0 & \text{otherwise} \end{cases}



Naive estimator



Kernel Estimator
• To get a smooth estimate, we use a smooth weight function called a kernel function. The most popular is the Gaussian kernel:

K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left[-\frac{u^2}{2}\right]

• The kernel estimator, also called Parzen windows, is defined as

\hat{p}(x) = \frac{1}{Nh} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)
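
A minimal NumPy sketch of the Gaussian kernel estimator (illustrative; the window width h = 0.3 is an assumption).

import numpy as np

def kernel_estimator(x, data, h=0.3):
    u = (x - data) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)    # Gaussian kernel K(u)
    return K.sum() / (len(data) * h)                # (1 / Nh) * sum_t K((x - x^t)/h)

data = np.random.default_rng(0).normal(size=1000)
print(kernel_estimator(0.0, data))   # close to the N(0,1) density at 0 (about 0.40)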



Kernel Estimator



Kernel Estimator
• The kernel function K(·) determines the shape of the influences and the
window width h determines the width.
• Just like the naive estimate is the sum of “boxes,” the kernel estimate is the
sum of “bumps.”
• All the x^t have an effect on the estimate at x, and this effect decreases smoothly as |x - x^t| increases.
• To simplify calculation, K(·) can be taken to be 0 if |x - x^t| > 3h.
• There exist other kernels easier to compute that can be used, as long as K(u)
is maximum for u = 0 and decreasing symmetrically as |u| increases.



Kernel Estimator
• When h is small, each training instance has a large effect in a small region
and no effect on distant points.
• When h is larger, there is more overlap of the kernels and we get a smoother
estimate.



k-Nearest Neighbor Estimator
• The nearest neighbor class of estimators adapts the amount of smoothing to
the local density of data.
• The degree of smoothing is controlled by k, the number of neighbors taken
into account, which is much smaller than N, the sample size.
• Let us define a distance between a and b, for example, |a − b|, and for each x, let d_1(x) \le d_2(x) \le \cdots \le d_N(x) be the distances, arranged in ascending order, from x to the points in the sample: d_1(x) is the distance to the nearest sample, d_2(x) is the distance to the next nearest, and so on.



k-Nearest Neighbor Estimator
• If x^t are the data points, then we define d_1(x) = \min_t |x - x^t|, and if i is the index of the closest sample, namely i = \arg\min_t |x - x^t|, then d_2(x) = \min_{j \ne i} |x - x^j|, and so forth.
• The k-nearest neighbor (k-nn) density estimate is

\hat{p}(x) = \frac{k}{2 N d_k(x)}

where d_k(x) is the distance to the kth closest instance to x.
• This is like a naive estimator with h = 2 d_k(x).
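
A minimal NumPy sketch of the k-nn density estimate (illustrative; k = 10 is an assumption).

import numpy as np

def knn_density(x, data, k=10):
    d = np.sort(np.abs(data - x))            # distances in ascending order: d_1(x) <= ... <= d_N(x)
    return k / (2 * len(data) * d[k - 1])    # p_hat(x) = k / (2 N d_k(x))

data = np.random.default_rng(0).normal(size=1000)
print(knn_density(0.0, data))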



k-Nearest Neighbor Estimator



k-Nearest Neighbor Estimator
• To get a smoother estimate, we can use a kernel function whose effect decreases with increasing distance:

\hat{p}(x) = \frac{1}{N d_k(x)} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{d_k(x)}\right)

• This is like a kernel estimator with adaptive smoothing parameter h = d_k(x).
• K(·) is typically taken to be the Gaussian kernel.



Generalization to Multivariate Data
• Given a sample of d-dimensional observations X = \{x^t\}_{t=1}^{N}, the multivariate kernel density estimator is

\hat{p}(x) = \frac{1}{N h^d} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right)

with the requirement that \int K(x)\, dx = 1.



Generalization to Multivariate Data
Multivariate Gaussian kernel:
• spheric: K(u) = \left(\frac{1}{\sqrt{2\pi}}\right)^{d} \exp\!\left[-\frac{\|u\|^2}{2}\right]
• ellipsoidal: K(u) = \frac{1}{(2\pi)^{d/2}\, |S|^{1/2}} \exp\!\left[-\frac{1}{2} u^T S^{-1} u\right]
where S is the sample covariance matrix. This corresponds to using the Mahalanobis distance instead of the Euclidean distance.
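
A minimal NumPy sketch of a multivariate kernel density estimate with the ellipsoidal (Mahalanobis) Gaussian kernel (illustrative; the window width h is an assumption).

import numpy as np

def multivariate_kde(x, data, h=0.5):
    N, d = data.shape
    S = np.cov(data, rowvar=False)                    # sample covariance matrix S
    S_inv, det = np.linalg.inv(S), np.linalg.det(S)
    u = (x - data) / h                                # (N, d) scaled differences
    maha = np.einsum("ij,jk,ik->i", u, S_inv, u)      # u^T S^{-1} u for each u
    K = np.exp(-0.5 * maha) / ((2 * np.pi) ** (d / 2) * np.sqrt(det))
    return K.sum() / (N * h**d)

data = np.random.default_rng(0).normal(size=(500, 2))
print(multivariate_kde(np.zeros(2), data))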



Nonparametric Classification
• When used for classification, we use the nonparametric approach to
estimate the class-conditional densities p(x|Ci).
• The kernel estimator of the class-conditional density p(x|C_i) is given as

\hat{p}(x|C_i) = \frac{1}{N_i h^d} \sum_{t=1}^{N} K\!\left(\frac{x - x^t}{h}\right) r_i^t

where r_i^t is 1 if x^t \in C_i and 0 otherwise, and N_i = \sum_t r_i^t.



Nonparametric Classification
• The MLE of the prior density is \hat{P}(C_i) = N_i / N.
• Then, the discriminant can be written as

g_i(x) = \hat{p}(x|C_i)\, \hat{P}(C_i) = \frac{1}{N h^d} \sum_{t=1}^{N} r_i^t\, K\!\left(\frac{x - x^t}{h}\right)

and x is assigned to the class for which the discriminant takes its maximum.
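
A minimal NumPy sketch of kernel-based nonparametric classification (illustrative; the spherical Gaussian kernel and h = 0.5 are assumptions): evaluate g_i(x) for each class and pick the maximum.

import numpy as np

def kernel_classify(x, X, y, h=0.5):
    """X: (N, d) training data, y: (N,) integer class labels."""
    N, d = X.shape
    u = (x - X) / h
    K = np.exp(-0.5 * (u**2).sum(axis=1)) / ((2 * np.pi) ** (d / 2))   # spherical Gaussian kernel
    classes = np.unique(y)
    scores = [K[y == c].sum() / (N * h**d) for c in classes]           # g_i(x) for each class
    return classes[int(np.argmax(scores))]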



Nonparametric Classification
• For the special case of the k-NN estimator, we have

\hat{p}(x|C_i) = \frac{k_i}{N_i\, V^k(x)}

where k_i is the number of neighbors out of the k nearest that belong to C_i, and V^k(x) is the volume of the d-dimensional hypersphere centered at x with radius r = \|x - x_{(k)}\|, where x_{(k)} is the k-th nearest observation to x.
Density estimator: \hat{p}(x) = \frac{k}{N\, V^k(x)}
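
Since the resulting posterior estimate is proportional to k_i for a fixed k, the k-nn classifier reduces to a majority vote among the k nearest neighbors. A minimal NumPy sketch (illustrative; k = 5 is an assumption).

import numpy as np

def knn_classify(x, X, y, k=5):
    """X: (N, d) training data, y: (N,) integer class labels."""
    d = np.linalg.norm(X - x, axis=1)          # distances to all training points
    nearest = y[np.argsort(d)[:k]]             # labels of the k nearest neighbors
    values, counts = np.unique(nearest, return_counts=True)
    return values[counts.argmax()]             # class with the most neighbors (largest k_i)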

