
Machine Learning

Assignment No. 4
Q.1. Write a short note on the following hierarchical clustering methods:
i) Agglomerative ii) Dendrogram

i) Agglomerative Clustering

Agglomerative clustering is a bottom-up approach to hierarchical clustering. It starts by considering each data point as a separate cluster and then merges clusters iteratively based on a similarity or distance metric.

Steps:

1. Initialization: Each data point forms its own cluster.
2. Merge: Find the two closest clusters based on the chosen distance metric and merge them into a single cluster.
3. Repeat: Continue merging clusters until a single cluster remains.
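As a minimal illustration, the sketch below runs agglomerative clustering with scikit-learn; the small 2-D dataset and the choice of two clusters with Ward linkage are assumptions made only for the example.

```python
# A minimal sketch of agglomerative clustering with scikit-learn (data invented for illustration).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Merge clusters bottom-up until only n_clusters remain;
# linkage="ward" merges the pair that least increases total within-cluster variance.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
```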

ii) Dendrogram

A dendrogram is a tree-like diagram that visually represents the hierarchical structure of clusters created by a hierarchical clustering algorithm. It shows how clusters are merged at different levels of similarity.

Interpretation:

 Horizontal axis: Represents the data points or clusters.
 Vertical axis: Represents the similarity or distance between clusters.
 Branches: Connect clusters that were merged at a specific similarity level.
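A dendrogram for the same small dataset can be drawn with SciPy; this is a minimal sketch, and the data points and Ward linkage are again assumptions for illustration.

```python
# A minimal sketch: build the merge hierarchy with SciPy and plot it as a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

Z = linkage(X, method="ward")    # agglomerative merge history
dendrogram(Z)
plt.xlabel("Data points")        # horizontal axis: the individual points/clusters
plt.ylabel("Merge distance")     # vertical axis: distance at which clusters merge
plt.show()
```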

Q.2 Consider the following data to predict whether a student passes or fails using the K-Nearest Neighbors (KNN) algorithm for the values Physics = 6 marks, Chemistry = 8 marks, with number of neighbors K = 3.

Physics (Marks)   Chemistry (Marks)   Result
4                 3                   Fail
6                 7                   Pass
7                 8                   Pass
5                 5                   Fail
8                 8                   Pass

Predicting Student Pass/Fail Using K-Nearest Neighbors

Understanding the Problem

We're given a dataset of student marks in Physics and Chemistry, along with their
corresponding results (Pass or Fail). The task is to predict whether a new student with Physics
marks = 6 and Chemistry marks = 8 will pass or fail using the K-Nearest Neighbors (KNN)
algorithm with K = 3.

KNN Algorithm Steps

1. Calculate Distances: Calculate the Euclidean distance between the new data point (6,
8) and each existing data point in the dataset.
2. Find Nearest Neighbors: Identify the K (in this case, 3) nearest neighbors to the new
data point based on the calculated distances.
3. Determine Class: Assign the class (Pass or Fail) to the new data point based on the
majority class among its K nearest neighbors.

Calculations

Distances:

 Distance to (4, 3): √((6-4)² + (8-3)²) = √29 ≈ 5.39
 Distance to (6, 7): √((6-6)² + (8-7)²) = √1 = 1
 Distance to (7, 8): √((6-7)² + (8-8)²) = √1 = 1
 Distance to (5, 5): √((6-5)² + (8-5)²) = √10 ≈ 3.16
 Distance to (8, 8): √((6-8)² + (8-8)²) = √4 = 2

Nearest Neighbors:

The three nearest neighbors are (6, 7), (7, 8), and (8, 8), at distances 1, 1, and 2 respectively.

Class Prediction:

Since all three nearest neighbors have a "Pass" result, the predicted result for the new student with (6, 8) marks is Pass.

Therefore, based on the K-Nearest Neighbors algorithm with K = 3, the predicted result for a student with Physics marks = 6 and Chemistry marks = 8 is Pass.
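The same prediction can be reproduced programmatically; a minimal sketch with scikit-learn's KNN classifier, using the dataset from the question:

```python
# A minimal sketch reproducing the worked example with scikit-learn's KNN classifier.
from sklearn.neighbors import KNeighborsClassifier

X = [[4, 3], [6, 7], [7, 8], [5, 5], [8, 8]]   # (Physics, Chemistry)
y = ["Fail", "Pass", "Pass", "Fail", "Pass"]

# K = 3 neighbours with Euclidean distance (the default metric).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[6, 8]]))  # -> ['Pass']
```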

Q.3. What is K-means clustering and graph-based clustering? Explain with examples.

K-Means Clustering

K-means clustering is a popular partitioning method that divides a dataset into K clusters. It
aims to minimize the sum of squared distances between data points and their assigned cluster
centroids.
Algorithm:

1. Initialize centroids: Randomly select K data points as initial centroids.
2. Assign clusters: Assign each data point to the nearest centroid based on Euclidean distance.
3. Update centroids: Recalculate the centroids as the mean of the data points assigned to each cluster.
4. Repeat: Iterate steps 2 and 3 until convergence (no change in cluster assignments).

Example:

Consider a dataset of two-dimensional points. We want to cluster them into two groups.

 Initialization: Randomly select two points as centroids (A and B).
 Assignment: Assign each point to the nearest centroid.
 Update: Recalculate the centroids based on the assigned points.
 Repeat: Continue until cluster assignments stabilize.
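A minimal sketch of this example with scikit-learn; the two-dimensional points and K = 2 are assumptions chosen only for illustration.

```python
# A minimal sketch of K-means on two obvious groups of 2-D points (data invented for illustration).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment of each point
print(kmeans.cluster_centers_)  # final centroids (the mean of each cluster)
```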

Graph-Based Clustering

Graph-based clustering algorithms treat data points as nodes in a graph, where edges represent
connections between points. The clustering is based on analyzing the structure of the graph.

Types of Graph-Based Clustering:

 Hierarchical Clustering: Builds a hierarchy of clusters by merging or splitting existing clusters.
 Density-Based Clustering: Identifies clusters based on dense regions of data points.
 Spectral Clustering: Uses the eigenvalues and eigenvectors of the graph Laplacian to partition the graph.

Example:

Consider a social network where nodes represent people and edges represent friendships.

 Density-Based Clustering: Identify groups of friends who are densely connected.
 Spectral Clustering: Use the spectral properties of the graph to find communities within the network.
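A minimal sketch of spectral clustering on a small friendship graph; the adjacency matrix below (two tightly knit groups joined by a single edge) is invented for illustration.

```python
# A minimal sketch of graph-based (spectral) clustering on a toy friendship graph.
import numpy as np
from sklearn.cluster import SpectralClustering

# Nodes 0-2 form one densely connected group, nodes 3-5 another,
# with a single bridging edge between nodes 2 and 3.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])

# affinity="precomputed" treats A as the similarity graph and partitions it
# using eigenvectors of the graph Laplacian.
sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = sc.fit_predict(A)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
```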

Q.4. Define the following terms:
i) Elbow method ii) Extrinsic and Intrinsic methods

i) Elbow Method

The Elbow Method is a technique used to determine the optimal number of clusters in K-
means clustering. It involves plotting the within-cluster sum of squares (WCSS) against the
number of clusters (K). As K increases, the WCSS decreases, but after a certain point, the
improvement in WCSS becomes minimal. This point, where the curve starts to bend or
"elbow," indicates the optimal number of clusters, balancing model complexity and fit.
ii) Extrinsic and Intrinsic Methods

These terms are used in the context of evaluating clustering algorithms or models:

 Extrinsic Method: This evaluation method relies on external, known ground truth
(labels) to assess the clustering performance. The clustering result is compared to a
predefined set of labels to measure accuracy using metrics such as Adjusted Rand
Index (ARI) or F1 score.

Example: Using labeled data to assess how well a clustering algorithm groups similar
items.

 Intrinsic Method: This method evaluates the clustering quality without any external
reference or ground truth. The assessment is based on internal properties of the
dataset, like how compact and well-separated the clusters are. Common intrinsic
measures include Silhouette Score and Davies-Bouldin Index.
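A minimal sketch contrasting the two kinds of evaluation with scikit-learn; the generated blob data and the choice of metrics (ARI as extrinsic, silhouette as intrinsic) are illustrative assumptions.

```python
# A minimal sketch: an extrinsic metric (ARI, needs ground-truth labels)
# versus an intrinsic metric (silhouette score, uses only the data and the clustering).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("ARI (extrinsic, compares to y_true):", adjusted_rand_score(y_true, labels))
print("Silhouette (intrinsic, no labels needed):", silhouette_score(X, labels))
```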

Q.5. With reference to Clustering explain the issue of “Optimization of Clusters”.

Optimization of Clusters

In clustering, finding the optimal number of clusters is essential for accurate results. Several
methods can be used:

 Elbow Method: Plots the sum of squared errors (SSE) against the number of clusters.
The "elbow" point indicates the optimal number.
 Silhouette Coefficient: Measures how similar a data point is to its own cluster
compared to others. Higher values suggest better-defined clusters.
 Gap Statistic: Compares the within-cluster variance to the expected variance under a
null hypothesis. The optimal number is where the gap is maximized.
 Domain Knowledge: Consider the context and business requirements to determine the
appropriate number of clusters.

Factors like data distribution, noise, and feature scaling can affect the optimization process.
By carefully considering these factors and using appropriate methods, you can find the best
number of clusters for your specific data.
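As an illustration of the first two methods, the sketch below computes the WCSS (for the elbow plot) and the silhouette coefficient over a range of K values; the generated blob dataset and the K range are assumptions for the example.

```python
# A minimal sketch: WCSS (elbow method) and silhouette coefficient over a range of K.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K={k}  WCSS={km.inertia_:.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")
# Pick K where the WCSS curve bends (the "elbow") and/or the silhouette score peaks.
```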

Q.6. Compare Hierarchical clustering and K-means clustering.


Aspect | Hierarchical Clustering | K-Means Clustering
1. Approach | Builds a dendrogram by successively merging/splitting clusters. | Partitions data into K clusters by minimizing squared distances.
2. Number of Clusters | No need to pre-specify; chosen by cutting the dendrogram. | Must specify the number of clusters (K) beforehand.
3. Algorithm Type | Hierarchical (agglomerative or divisive). | Partitional.
4. Cluster Shape | Can capture clusters of arbitrary shapes. | Assumes clusters are spherical and equal-sized.
5. Flexibility | Flexible with shapes but less scalable for large datasets. | Suitable for large datasets but struggles with irregular shapes.

Q.7. Explain how a cluster is formed in a density-based clustering algorithm.

Density-Based Clustering

Density-based clustering algorithms identify clusters based on dense regions of data points.
They work by finding areas where data points are closely packed together.

Steps:

1. Density Estimation: Determine the density of data points in the dataset using a density
function (e.g., DBSCAN uses a radius and minimum number of points).
2. Core Points: Identify data points that have at least the minimum number of points
within their radius. These are called core points.
3. Border Points: Data points that are not core points but are within the radius of a core
point are considered border points.
4. Noise Points: Data points that are neither core nor border points are considered noise.
5. Cluster Formation: Clusters are formed by connecting core points and border points
that are reachable from each other. A cluster is a set of densely connected data points.

Key Concepts:

 Density: The concentration of data points in a region.
 Core point: A data point with a high density of neighbors.
 Border point: A data point that is on the edge of a cluster.
 Noise point: A data point that is not part of any cluster.
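A minimal sketch of DBSCAN with scikit-learn; the toy data, eps (the neighbourhood radius), and min_samples are assumptions chosen so the result is easy to read.

```python
# A minimal sketch of density-based clustering with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],   # dense region -> one cluster
              [8, 8], [8, 9], [9, 8], [9, 9],   # dense region -> another cluster
              [5, 20]])                         # isolated point -> noise

db = DBSCAN(eps=1.5, min_samples=3).fit(X)
print(db.labels_)               # noise points are labelled -1, e.g. [0 0 0 0 1 1 1 1 -1]
print(db.core_sample_indices_)  # indices of the core points
```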

Q.8. How would you choose the number of clusters when designing a K-Medoid clustering algorithm?

Choosing the Number of Clusters in K-Medoid Clustering

Determining the optimal number of clusters in K-Medoid clustering is crucial. Here are some
common methods:

 Elbow Method: Plot the sum of squared errors (SSE) against the number of clusters.
The "elbow" point indicates the optimal number.
 Silhouette Coefficient: Measures how similar a data point is to its own cluster
compared to others. Higher values suggest better-defined clusters.
 Gap Statistic: Compares the within-cluster variance to the expected variance under a
null hypothesis. The optimal number is where the gap is maximized.
 Domain Knowledge: Consider the context and business requirements to determine the
appropriate number of clusters.
By carefully evaluating these methods and considering the specific characteristics of your data,
you can choose the most suitable number of clusters for your K-Medoid clustering algorithm.
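A minimal sketch of one of these approaches (silhouette-based selection) for K-Medoid clustering; it assumes the optional scikit-learn-extra package is installed for the KMedoids estimator, and the blob data and K range are illustrative.

```python
# A minimal sketch: choose K for K-Medoids by maximizing the silhouette coefficient.
# Assumes the optional scikit-learn-extra package provides sklearn_extra.cluster.KMedoids.
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMedoids(n_clusters=k, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Chosen K:", best_k)  # the K with the highest average silhouette
```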

Q.9. Write a short note on outlier analysis with respect to clustering.

Outlier Analysis in Clustering

Outliers are data points that significantly deviate from the majority of the data. Identifying and
handling outliers is crucial in clustering to avoid distorted results.

Methods:

 Statistical methods: Z-score, IQR, Mahalanobis distance.
 Clustering-based methods: Detect outliers based on their distance from cluster centroids or density.
 Isolation Forest: Detects outliers by isolating them in random subspaces.

Handling outliers:

 Removal: Remove outliers if they are deemed irrelevant or noisy.
 Capping: Replace outliers with extreme values within a reasonable range.
 Robust clustering algorithms: Use algorithms that are less sensitive to outliers (e.g., DBSCAN).

By effectively handling outliers, you can improve the accuracy and interpretability of your
clustering results.
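A minimal sketch of two of the detection methods listed above (a z-score rule and Isolation Forest); the one-dimensional data and the z-score threshold of 2 are assumptions for illustration.

```python
# A minimal sketch of outlier detection with a z-score rule and Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[1.0], [1.2], [0.9], [1.1], [1.0], [10.0]])  # 10.0 is an obvious outlier

# Statistical method: flag points more than 2 standard deviations from the mean.
z = (X - X.mean()) / X.std()
print("z-score outliers (indices):", np.where(np.abs(z) > 2)[0])

# Isolation Forest: outliers are isolated in fewer random splits and receive label -1.
iso = IsolationForest(random_state=0).fit(X)
print("IsolationForest labels:", iso.predict(X))  # -1 = outlier, 1 = inlier
```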

Q.10. Differentiate between K-means and Spectral clustering.


Aspect | K-Means Clustering | Spectral Clustering
1. Approach | Partitional clustering; minimizes the sum of squared distances between data points and cluster centroids. | Graph-based clustering; operates on an affinity matrix representing pairwise similarities between points.
2. Cluster Shape | Assumes clusters are spherical and equally sized (works well for circular clusters). | Can handle non-convex, arbitrarily shaped clusters.
3. Number of Clusters | Requires the number of clusters (K) to be specified beforehand. | Also requires specifying the number of clusters, but uses eigenvalues of the affinity matrix to inform the clustering.
4. Data Type | Works well with large, linearly separable datasets. | Better suited for datasets with complex or manifold structures, where clusters are not linearly separable.
5. Computational Complexity | Computationally efficient, especially for large datasets. | More computationally expensive due to the eigenvalue decomposition of the affinity matrix (often O(n³) complexity).
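The shape difference in the table can be seen on a small non-convex dataset; a minimal sketch, where the two half-moon clusters and the affinity choice are assumptions for illustration.

```python
# A minimal sketch: K-means vs. spectral clustering on two interleaved half-moons.
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sp_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               random_state=0).fit_predict(X)

print("K-means ARI: ", adjusted_rand_score(y, km_labels))   # typically well below 1
print("Spectral ARI:", adjusted_rand_score(y, sp_labels))   # typically close to 1
```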
