ML Assign4
Assignment No. 4
Q.1. Write a short note on the following hierarchical clustering methods:
i) Agglomerative ii) Dendrogram
i) Agglomerative Clustering
Agglomerative clustering is a bottom-up hierarchical method: every data point starts in its own cluster, and the closest pair of clusters is merged repeatedly until a single cluster (or a desired number of clusters) remains.
Steps:
1. Treat each data point as its own cluster.
2. Compute the distance between every pair of clusters using a linkage criterion (e.g., single, complete, or average linkage).
3. Merge the two closest clusters.
4. Repeat steps 2-3 until all points belong to one cluster.
ii) Dendrogram
A dendrogram is a tree diagram that records the sequence of merges performed by hierarchical clustering; the height at which two branches join reflects the distance between the clusters being merged.
Interpretation:
Cutting the dendrogram horizontally at a chosen height yields a flat clustering: the number of branches the cut crosses equals the number of clusters. Large vertical gaps between successive merges suggest natural cluster boundaries.
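A minimal sketch of both ideas using SciPy's hierarchical-clustering utilities; the toy points are hypothetical, chosen only to make the merges visible:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical 2-D points: two tight pairs plus one distant point
X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 5.0], [9.0, 9.0]])

# Agglomerative clustering with average linkage; Z records each merge
Z = linkage(X, method="average")

# The dendrogram shows the merge order; branch height = merge distance
dendrogram(Z)
plt.xlabel("data point index")
plt.ylabel("merge distance")
plt.show()
```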
Q.2 Consider the following data to predict whether a student will pass or fail using the
K-Nearest Neighbor (KNN) algorithm for the values Physics = 6 marks, Chemistry = 8 marks,
with number of neighbors K = 3.
We're given a dataset of student marks in Physics and Chemistry, along with their
corresponding results (Pass or Fail). The task is to predict whether a new student with Physics
marks = 6 and Chemistry marks = 8 will pass or fail using the K-Nearest Neighbors (KNN)
algorithm with K = 3.
1. Calculate Distances: Calculate the Euclidean distance between the new data point (6,
8) and each existing data point in the dataset.
2. Find Nearest Neighbors: Identify the K (in this case, 3) nearest neighbors to the new
data point based on the calculated distances.
3. Determine Class: Assign the class (Pass or Fail) to the new data point based on the
majority class among its K nearest neighbors.
Calculations
Distances:
Using the Euclidean distance d = √((x₂ − x₁)² + (y₂ − y₁)²), compute the distance from the
new point (6, 8) to each point in the dataset. For the closest points: (6, 7) gives d = 1,
(7, 8) gives d = 1, and (5, 5) gives d = √10 ≈ 3.16.
Nearest Neighbors:
The three nearest neighbors are (6, 7), (7, 8), and (5, 5).
Class Prediction:
Since two of the three nearest neighbors have a "Pass" result, the predicted result for the new
student with (6, 8) marks is Pass.
Therefore, based on the K-Nearest Neighbors algorithm with K = 3, the predicted result for
a student with Physics marks = 6 and Chemistry marks = 8 is Pass.
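A minimal Python sketch of this procedure. Only the three neighbors named above come from the worked answer; the remaining training rows are hypothetical fillers, since the original marks table is not reproduced here:

```python
import math
from collections import Counter

# Training data: (physics, chemistry) -> result. Only the three neighbors
# named above come from the worked answer; the rest are hypothetical.
data = [
    ((6, 7), "Pass"),
    ((7, 8), "Pass"),
    ((5, 5), "Fail"),
    ((4, 3), "Fail"),   # hypothetical filler
    ((9, 2), "Fail"),   # hypothetical filler
]

def knn_predict(query, data, k=3):
    # Rank training points by Euclidean distance to the query point
    ranked = sorted(data, key=lambda row: math.dist(query, row[0]))
    # Majority vote among the k nearest neighbors
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((6, 8), data, k=3))  # -> Pass
```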
Q.3. What is K-means clustering and graph-based clustering? Explain with examples.
K-Means Clustering
K-means clustering is a popular partitioning method that divides a dataset into K clusters. It
aims to minimize the sum of squared distances between data points and their assigned cluster
centroids.
Algorithm:
1. Choose the number of clusters K and initialize K centroids (e.g., K randomly chosen points).
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2-3 until the assignments no longer change.
Example:
Consider a dataset of two-dimensional points that we want to partition into two groups
(K = 2); the algorithm alternates assignment and centroid-update steps until the centroids
stabilize, as in the sketch below.
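A minimal sketch using scikit-learn's KMeans; the six points are hypothetical, arranged as two loose groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points arranged as two loose groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 11.0], [8.5, 9.5]])

# Fit K-means with K = 2; n_init restarts guard against bad initializations
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # the two learned centroids
```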
Graph-Based Clustering
Graph-based clustering algorithms treat data points as nodes in a graph, where edges represent
connections between points. The clustering is based on analyzing the structure of the graph.
Example:
Consider a social network where nodes represent people and edges represent friendships;
community-detection algorithms group densely connected people into clusters, as sketched below.
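One concrete graph-based approach is modularity-based community detection; a minimal sketch assuming the networkx library, with a made-up friendship graph:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical friendship graph: nodes are people, edges are friendships
G = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"),   # one friend circle
              ("D", "E"), ("E", "F"), ("D", "F"),   # another circle
              ("C", "D")])                          # a single bridge edge

# Modularity-based community detection groups densely connected nodes
communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])  # e.g., [['A', 'B', 'C'], ['D', 'E', 'F']]
```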
i) Elbow Method
The Elbow Method is a technique used to determine the optimal number of clusters in K-
means clustering. It involves plotting the within-cluster sum of squares (WCSS) against the
number of clusters (K). As K increases, the WCSS decreases, but after a certain point, the
improvement in WCSS becomes minimal. This point, where the curve starts to bend or
"elbow," indicates the optimal number of clusters, balancing model complexity and fit.
ii) Extrinsic and Intrinsic Methods
These terms are used in the context of evaluating clustering algorithms or models:
Extrinsic Method: This evaluation method relies on external, known ground truth
(labels) to assess the clustering performance. The clustering result is compared to a
predefined set of labels to measure accuracy using metrics such as Adjusted Rand
Index (ARI) or F1 score.
Example: Using labeled data to assess how well a clustering algorithm groups similar
items.
Intrinsic Method: This method evaluates clustering quality without any external
reference or ground truth. The assessment is based on internal properties of the
dataset, such as how compact and well-separated the clusters are. Common intrinsic
measures include the Silhouette Score and the Davies-Bouldin Index (both evaluation
styles are illustrated in the sketch below).
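Both evaluation styles in one minimal sketch, using synthetic labeled data from scikit-learn so that ground truth is available for the extrinsic metric:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data with known ground-truth labels (for the extrinsic case)
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Extrinsic: compare predicted clusters against the true labels
print("ARI:", adjusted_rand_score(y_true, labels))

# Intrinsic: judge cluster compactness/separation from the data alone
print("Silhouette:", silhouette_score(X, labels))
```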
Optimization of Clusters
In clustering, finding the optimal number of clusters is essential for accurate results. Several
methods can be used:
Elbow Method: Plots the sum of squared errors (SSE) against the number of clusters.
The "elbow" point indicates the optimal number.
Silhouette Coefficient: Measures how similar a data point is to its own cluster
compared to others. Higher values suggest better-defined clusters.
Gap Statistic: Compares the within-cluster variance to the expected variance under a
null hypothesis. The optimal number is where the gap is maximized.
Domain Knowledge: Consider the context and business requirements to determine the
appropriate number of clusters.
Factors such as data distribution, noise, and feature scaling can affect the optimization
process. By weighing these factors and combining the methods above, you can find the best
number of clusters for your specific data; a simplified gap-statistic sketch follows.
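A simplified sketch of the gap statistic, using K-means inertia as the within-cluster dispersion and a uniform bounding-box null distribution (a common simplification of the full formulation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
rng = np.random.default_rng(0)

def wcss(data, k):
    # Within-cluster sum of squares for a K-means fit with k clusters
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_

def gap(data, k, n_refs=10):
    # Compare log(WCSS) on the data to its expectation under a uniform
    # (null) reference distribution over the data's bounding box
    lo, hi = data.min(axis=0), data.max(axis=0)
    ref_logs = [np.log(wcss(rng.uniform(lo, hi, size=data.shape), k))
                for _ in range(n_refs)]
    return np.mean(ref_logs) - np.log(wcss(data, k))

for k in range(1, 7):
    print(k, round(gap(X, k), 3))  # pick the K where the gap is largest
```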
Q.7. Explain how a cluster is formed in a density-based clustering algorithm.
Density-Based Clustering
Density-based clustering algorithms identify clusters based on dense regions of data points.
They work by finding areas where data points are closely packed together.
Steps:
1. Density Estimation: Determine the density of data points in the dataset using a density
function (e.g., DBSCAN uses a radius and minimum number of points).
2. Core Points: Identify data points that have at least the minimum number of points
within their radius. These are called core points.
3. Border Points: Data points that are not core points but are within the radius of a core
point are considered border points.
4. Noise Points: Data points that are neither core nor border points are considered noise.
5. Cluster Formation: Clusters are formed by connecting core points and border points
that are reachable from each other. A cluster is a set of densely connected data points.
Key Concepts:
ε (epsilon): the radius that defines each point's neighborhood.
MinPts: the minimum number of points required within that radius for a point to qualify as a core point.
Density-reachability: a point is density-reachable from a core point if it can be reached through a chain of core points with overlapping neighborhoods; a cluster is a maximal set of mutually density-connected points.
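A minimal sketch with scikit-learn's DBSCAN; the points are hypothetical, forming two dense groups and one isolated noise point:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical points: two dense groups plus one isolated point
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
              [9.0, 0.0]])

# eps is the neighborhood radius; min_samples is the core-point threshold
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

# Label -1 marks noise points; other labels are cluster indices
print(db.labels_)  # e.g., [0 0 0 1 1 1 -1]
```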
Q.8. How would you choose the number of clusters when designing a K-Medoid
clustering algorithm?
Determining the optimal number of clusters in K-Medoid clustering is crucial. Here are some
common methods:
Elbow Method: Plot the sum of squared errors (SSE) against the number of clusters.
The "elbow" point indicates the optimal number.
Silhouette Coefficient: Measures how similar a data point is to its own cluster
compared to others. Higher values suggest better-defined clusters.
Gap Statistic: Compares the within-cluster variance to the expected variance under a
null hypothesis. The optimal number is where the gap is maximized.
Domain Knowledge: Consider the context and business requirements to determine the
appropriate number of clusters.
By carefully evaluating these methods and considering the specific characteristics of your data,
you can choose the most suitable number of clusters for your K-Medoid clustering algorithm.
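A minimal sketch of silhouette-based selection for K-Medoids. Note this assumes the third-party scikit-learn-extra package, which provides a KMedoids estimator (core scikit-learn does not include one):

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids  # scikit-learn-extra (assumed installed)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Score each candidate K by average silhouette; prefer the highest score
for k in range(2, 7):
    labels = KMedoids(n_clusters=k, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```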
Q.9. Write a short note on outlier analysis with respect to clustering.
Outliers are data points that significantly deviate from the majority of the data. Identifying and
handling outliers is crucial in clustering to avoid distorted results.
Methods:
Statistical methods: flag points that lie far from the bulk of the data (e.g., beyond three standard deviations from the mean or outside the interquartile range).
Distance-based methods: flag points that are far from their nearest neighbors or from every cluster centroid.
Density-based methods: algorithms such as DBSCAN label points in sparse regions as noise, so outliers are identified as a by-product of clustering.
Handling outliers:
Remove them before clustering, cap or transform extreme values, or use algorithms that are robust to outliers (e.g., K-Medoids or DBSCAN instead of K-Means).
By effectively handling outliers, you can improve the accuracy and interpretability of your
clustering results.
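A minimal sketch of one simple distance-based approach: cluster the data, then flag points unusually far from their centroid. The data and the 1.5-standard-deviation threshold are illustrative choices, not a fixed rule:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: one tight cluster plus an extreme point
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [1.0, 1.1], [10.0, 10.0]])

km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)

# Distance of each point to its assigned centroid
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points whose centroid distance is unusually large (heuristic cutoff)
threshold = dists.mean() + 1.5 * dists.std()
print(X[dists > threshold])  # -> [[10. 10.]]
```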