
Machine Learning

Assignment No. 4
Q.1. Write a short note on the following hierarchical clustering methods:
i) Agglomerative ii) Dendrogram

i) Agglomerative Clustering

Agglomerative clustering is a bottom-up approach to hierarchical clustering. It starts by considering each data point as a separate cluster and then merges clusters iteratively based on a similarity or distance metric.

Steps:

1. Initialization: Each data point forms its own cluster.
2. Merge: Find the two closest clusters based on the chosen distance metric and merge them into a single cluster.
3. Repeat: Continue merging clusters until a single cluster remains.
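As a minimal illustration, the sketch below runs agglomerative clustering with scikit-learn; the small 2-D dataset and the choice of two clusters with Ward linkage are assumptions made only for the example.

```python
# A minimal sketch of agglomerative clustering with scikit-learn (data invented for illustration).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Merge clusters bottom-up until only n_clusters remain;
# linkage="ward" merges the pair that least increases total within-cluster variance.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
```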

ii) Dendrogram

A dendrogram is a tree-like diagram that visually represents the hierarchical structure of clusters created by a hierarchical clustering algorithm. It shows how clusters are merged at different levels of similarity.

Interpretation:

 Horizontal axis: Represents the data points or clusters.
 Vertical axis: Represents the similarity or distance between clusters.
 Branches: Connect clusters that were merged at a specific similarity level.
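A dendrogram for the same small dataset can be drawn with SciPy; this is a minimal sketch, and the data points and Ward linkage are again assumptions for illustration.

```python
# A minimal sketch: build the merge hierarchy with SciPy and plot it as a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

Z = linkage(X, method="ward")    # agglomerative merge history
dendrogram(Z)
plt.xlabel("Data points")        # horizontal axis: the individual points/clusters
plt.ylabel("Merge distance")     # vertical axis: distance at which clusters merge
plt.show()
```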

Q.2 Consider the following data to predict whether a student passes or fails using the K-Nearest Neighbors (KNN) algorithm for the values Physics = 6 marks, Chemistry = 8 marks, with number of neighbors K = 3.

Physics (Marks)   Chemistry (Marks)   Result
4                 3                   Fail
6                 7                   Pass
7                 8                   Pass
5                 5                   Fail
8                 8                   Pass

Predicting Student Pass/Fail Using K-Nearest Neighbors

Understanding the Problem

We're given a dataset of student marks in Physics and Chemistry, along with their
corresponding results (Pass or Fail). The task is to predict whether a new student with Physics
marks = 6 and Chemistry marks = 8 will pass or fail using the K-Nearest Neighbors (KNN)
algorithm with K = 3.

KNN Algorithm Steps

1. Calculate Distances: Calculate the Euclidean distance between the new data point (6,
8) and each existing data point in the dataset.
2. Find Nearest Neighbors: Identify the K (in this case, 3) nearest neighbors to the new
data point based on the calculated distances.
3. Determine Class: Assign the class (Pass or Fail) to the new data point based on the
majority class among its K nearest neighbors.

Calculations

Distances:

 Distance to (4, 3): √((6-4)² + (8-3)²) = √29 ≈ 5.39
 Distance to (6, 7): √((6-6)² + (8-7)²) = √1 = 1
 Distance to (7, 8): √((6-7)² + (8-8)²) = √1 = 1
 Distance to (5, 5): √((6-5)² + (8-5)²) = √10 ≈ 3.16
 Distance to (8, 8): √((6-8)² + (8-8)²) = √4 = 2

Nearest Neighbors:

The three nearest neighbors are (6, 7), (7, 8), and (8, 8), at distances 1, 1, and 2 respectively.

Class Prediction:

Since all three nearest neighbors have a "Pass" result, the predicted result for the new student with (6, 8) marks is Pass.

Therefore, based on the K-Nearest Neighbors algorithm with K = 3, the predicted result for a student with Physics marks = 6 and Chemistry marks = 8 is Pass.
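The same prediction can be reproduced programmatically; a minimal sketch with scikit-learn's KNN classifier, using the dataset from the question:

```python
# A minimal sketch reproducing the worked example with scikit-learn's KNN classifier.
from sklearn.neighbors import KNeighborsClassifier

X = [[4, 3], [6, 7], [7, 8], [5, 5], [8, 8]]   # (Physics, Chemistry)
y = ["Fail", "Pass", "Pass", "Fail", "Pass"]

# K = 3 neighbours with Euclidean distance (the default metric).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[6, 8]]))  # -> ['Pass']
```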

Q.3. What is K-means clustering and graph-based clustering? Explain with examples.

K-Means Clustering

K-means clustering is a popular partitioning method that divides a dataset into K clusters. It
aims to minimize the sum of squared distances between data points and their assigned cluster
centroids.
Algorithm:

1. Initialize centroids: Randomly select K data points as initial centroids.
2. Assign clusters: Assign each data point to the nearest centroid based on Euclidean distance.
3. Update centroids: Recalculate the centroids as the mean of the data points assigned to each cluster.
4. Repeat: Iterate steps 2 and 3 until convergence (no change in cluster assignments).

Example:

Consider a dataset of two-dimensional points. We want to cluster them into two groups.

 Initialization: Randomly select two points as centroids (A and B).
 Assignment: Assign each point to the nearest centroid.
 Update: Recalculate the centroids based on the assigned points.
 Repeat: Continue until cluster assignments stabilize.
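A minimal sketch of this example with scikit-learn; the two-dimensional points and K = 2 are assumptions chosen only for illustration.

```python
# A minimal sketch of K-means on two obvious groups of 2-D points (data invented for illustration).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment of each point
print(kmeans.cluster_centers_)  # final centroids (the mean of each cluster)
```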

Graph-Based Clustering

Graph-based clustering algorithms treat data points as nodes in a graph, where edges represent
connections between points. The clustering is based on analyzing the structure of the graph.

Types of Graph-Based Clustering:

 Hierarchical Clustering: Builds a hierarchy of clusters by merging or splitting existing clusters.
 Density-Based Clustering: Identifies clusters based on dense regions of data points.
 Spectral Clustering: Uses the eigenvalues and eigenvectors of the graph Laplacian to partition the graph.

Example:

Consider a social network where nodes represent people and edges represent friendships.

 Density-Based Clustering: Identify groups of friends who are densely connected.
 Spectral Clustering: Use the spectral properties of the graph to find communities within the network.
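A minimal sketch of spectral clustering on a small friendship graph; the adjacency matrix below (two tightly knit groups joined by a single edge) is invented for illustration.

```python
# A minimal sketch of graph-based (spectral) clustering on a toy friendship graph.
import numpy as np
from sklearn.cluster import SpectralClustering

# Nodes 0-2 form one densely connected group, nodes 3-5 another,
# with a single bridging edge between nodes 2 and 3.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])

# affinity="precomputed" treats A as the similarity graph and partitions it
# using eigenvectors of the graph Laplacian.
sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = sc.fit_predict(A)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
```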

Q.4. Define the following terms:
i) Elbow method ii) Extrinsic and Intrinsic methods

i) Elbow Method

The Elbow Method is a technique used to determine the optimal number of clusters in K-
means clustering. It involves plotting the within-cluster sum of squares (WCSS) against the
number of clusters (K). As K increases, the WCSS decreases, but after a certain point, the
improvement in WCSS becomes minimal. This point, where the curve starts to bend or
"elbow," indicates the optimal number of clusters, balancing model complexity and fit.
ii) Extrinsic and Intrinsic Methods

These terms are used in the context of evaluating clustering algorithms or models:

 Extrinsic Method: This evaluation method relies on external, known ground truth
(labels) to assess the clustering performance. The clustering result is compared to a
predefined set of labels to measure accuracy using metrics such as Adjusted Rand
Index (ARI) or F1 score.

Example: Using labeled data to assess how well a clustering algorithm groups similar
items.

 Intrinsic Method: This method evaluates the clustering quality without any external
reference or ground truth. The assessment is based on internal properties of the
dataset, like how compact and well-separated the clusters are. Common intrinsic
measures include Silhouette Score and Davies-Bouldin Index.
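A minimal sketch contrasting the two kinds of evaluation with scikit-learn; the generated blob data and the choice of metrics (ARI as extrinsic, silhouette as intrinsic) are illustrative assumptions.

```python
# A minimal sketch: an extrinsic metric (ARI, needs ground-truth labels)
# versus an intrinsic metric (silhouette score, uses only the data and the clustering).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("ARI (extrinsic, compares to y_true):", adjusted_rand_score(y_true, labels))
print("Silhouette (intrinsic, no labels needed):", silhouette_score(X, labels))
```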

Q.5. With reference to Clustering explain the issue of “Optimization of Clusters”.

Optimization of Clusters

In clustering, finding the optimal number of clusters is essential for accurate results. Several
methods can be used:

 Elbow Method: Plots the sum of squared errors (SSE) against the number of clusters.
The "elbow" point indicates the optimal number.
 Silhouette Coefficient: Measures how similar a data point is to its own cluster
compared to others. Higher values suggest better-defined clusters.
 Gap Statistic: Compares the within-cluster variance to the expected variance under a
null hypothesis. The optimal number is where the gap is maximized.
 Domain Knowledge: Consider the context and business requirements to determine the
appropriate number of clusters.

Factors like data distribution, noise, and feature scaling can affect the optimization process.
By carefully considering these factors and using appropriate methods, you can find the best
number of clusters for your specific data.
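As an illustration of the first two methods, the sketch below computes the WCSS (for the elbow plot) and the silhouette coefficient over a range of K values; the generated blob dataset and the K range are assumptions for the example.

```python
# A minimal sketch: WCSS (elbow method) and silhouette coefficient over a range of K.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K={k}  WCSS={km.inertia_:.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")
# Pick K where the WCSS curve bends (the "elbow") and/or the silhouette score peaks.
```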

Q.6. Compare Hierarchical clustering and K-means clustering.


Aspect | Hierarchical Clustering | K-Means Clustering
1. Approach | Builds a dendrogram by successively merging/splitting clusters. | Partitions data into K clusters by minimizing squared distances.
2. Number of Clusters | No need to pre-specify; chosen by cutting the dendrogram. | Must specify the number of clusters (K) beforehand.
3. Algorithm Type | Hierarchical (agglomerative or divisive). | Partitional.
4. Cluster Shape | Can capture clusters of arbitrary shapes. | Assumes clusters are spherical and equal-sized.
5. Flexibility | Flexible with shapes but less scalable for large datasets. | Suitable for large datasets but struggles with irregular shapes.

Q.7. Explain how a cluster is formed in a density-based clustering algorithm.

Density-Based Clustering

Density-based clustering algorithms identify clusters based on dense regions of data points.
They work by finding areas where data points are closely packed together.

Steps:

1. Density Estimation: Determine the density of data points in the dataset using a density
function (e.g., DBSCAN uses a radius and minimum number of points).
2. Core Points: Identify data points that have at least the minimum number of points
within their radius. These are called core points.
3. Border Points: Data points that are not core points but are within the radius of a core
point are considered border points.
4. Noise Points: Data points that are neither core nor border points are considered noise.
5. Cluster Formation: Clusters are formed by connecting core points and border points
that are reachable from each other. A cluster is a set of densely connected data points.

Key Concepts:

 Density: The concentration of data points in a region.
 Core point: A data point with a high density of neighbors.
 Border point: A data point that is on the edge of a cluster.
 Noise point: A data point that is not part of any cluster.
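A minimal sketch of DBSCAN with scikit-learn; the toy data, eps (the neighbourhood radius), and min_samples are assumptions chosen so the result is easy to read.

```python
# A minimal sketch of density-based clustering with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],   # dense region -> one cluster
              [8, 8], [8, 9], [9, 8], [9, 9],   # dense region -> another cluster
              [5, 20]])                         # isolated point -> noise

db = DBSCAN(eps=1.5, min_samples=3).fit(X)
print(db.labels_)               # noise points are labelled -1, e.g. [0 0 0 0 1 1 1 1 -1]
print(db.core_sample_indices_)  # indices of the core points
```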

Q.8. How would you choose the number of clusters when designing a K-Medoid clustering algorithm?

Choosing the Number of Clusters in K-Medoid Clustering

Determining the optimal number of clusters in K-Medoid clustering is crucial. Here are some
common methods:

 Elbow Method: Plot the sum of squared errors (SSE) against the number of clusters.
The "elbow" point indicates the optimal number.
 Silhouette Coefficient: Measures how similar a data point is to its own cluster
compared to others. Higher values suggest better-defined clusters.
 Gap Statistic: Compares the within-cluster variance to the expected variance under a
null hypothesis. The optimal number is where the gap is maximized.
 Domain Knowledge: Consider the context and business requirements to determine the
appropriate number of clusters.
By carefully evaluating these methods and considering the specific characteristics of your data,
you can choose the most suitable number of clusters for your K-Medoid clustering algorithm.
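A minimal sketch of one of these approaches (silhouette-based selection) for K-Medoid clustering; it assumes the optional scikit-learn-extra package is installed for the KMedoids estimator, and the blob data and K range are illustrative.

```python
# A minimal sketch: choose K for K-Medoids by maximizing the silhouette coefficient.
# Assumes the optional scikit-learn-extra package provides sklearn_extra.cluster.KMedoids.
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMedoids(n_clusters=k, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Chosen K:", best_k)  # the K with the highest average silhouette
```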

Q.9. Write a short note on outlier analysis with respect to clustering.

Outlier Analysis in Clustering

Outliers are data points that significantly deviate from the majority of the data. Identifying and
handling outliers is crucial in clustering to avoid distorted results.

Methods:

 Statistical methods: Z-score, IQR, Mahalanobis distance.
 Clustering-based methods: Detect outliers based on their distance from cluster centroids or density.
 Isolation Forest: Detects outliers by isolating them in random subspaces.

Handling outliers:

 Removal: Remove outliers if they are deemed irrelevant or noisy.
 Capping: Replace outliers with extreme values within a reasonable range.
 Robust clustering algorithms: Use algorithms that are less sensitive to outliers (e.g., DBSCAN).

By effectively handling outliers, you can improve the accuracy and interpretability of your
clustering results.
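A minimal sketch of two of the detection methods listed above (a z-score rule and Isolation Forest); the one-dimensional data and the z-score threshold of 2 are assumptions for illustration.

```python
# A minimal sketch of outlier detection with a z-score rule and Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[1.0], [1.2], [0.9], [1.1], [1.0], [10.0]])  # 10.0 is an obvious outlier

# Statistical method: flag points more than 2 standard deviations from the mean.
z = (X - X.mean()) / X.std()
print("z-score outliers (indices):", np.where(np.abs(z) > 2)[0])

# Isolation Forest: outliers are isolated in fewer random splits and receive label -1.
iso = IsolationForest(random_state=0).fit(X)
print("IsolationForest labels:", iso.predict(X))  # -1 = outlier, 1 = inlier
```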

Q.10. Differentiate between K-means and Spectral clustering.


Aspect | K-Means Clustering | Spectral Clustering
1. Approach | Partitional clustering; minimizes the sum of squared distances between data points and cluster centroids. | Graph-based clustering; operates on an affinity matrix representing pairwise similarities between points.
2. Cluster Shape | Assumes clusters are spherical and equally sized (works well for circular clusters). | Can handle non-convex, arbitrarily shaped clusters.
3. Number of Clusters | Requires the number of clusters (K) to be specified beforehand. | Also requires specifying the number of clusters, but uses eigenvalues of the affinity matrix to inform the clustering.
4. Data Type | Works well with large, linearly separable datasets. | Better suited for datasets with complex or manifold structures, where clusters are not linearly separable.
5. Computational Complexity | Computationally efficient, especially for large datasets. | More computationally expensive due to the eigenvalue decomposition of the affinity matrix (often O(n³) complexity).
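The shape difference in the table can be seen on a small non-convex dataset; a minimal sketch, where the two half-moon clusters and the affinity choice are assumptions for illustration.

```python
# A minimal sketch: K-means vs. spectral clustering on two interleaved half-moons.
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sp_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               random_state=0).fit_predict(X)

print("K-means ARI: ", adjusted_rand_score(y, km_labels))   # typically well below 1
print("Spectral ARI:", adjusted_rand_score(y, sp_labels))   # typically close to 1
```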
