
How to handle imbalanced datasets?

By Subha · 6 min read · Jan 25, 2024

https://medium.com/@pingsubhak/how-to-handle-imbalanced-datasets-4d2b6e23c717
Imbalanced data is a common problem in machine learning, where one class has significantly more observations than the other. This can lead to biased models and poor performance on the minority class.

What is an imbalanced dataset?


Let us assume there is a bank that issues credit cards to its customers. The bank suspects that some fraudulent transactions are taking place and decides to audit them. Out of 10,000 transactions, the bank found 80 that were fraudulent, which is just 0.8%. Here the class 'No fraud' is the majority class whereas 'Fraud' is the minority class.

Imbalanced data can occur in various domains and industries. Some common scenarios where imbalanced datasets may be encountered are:

1. Fraud detection: The number of fraudulent transactions is often much smaller than the number of legitimate transactions.


2. Cancer diagnosis: A small number of patients are diagnosed with cancer compared to the number of people who get tested.

3. Anomaly detection: Anomaly detection tasks, where the goal is to


identify unusual patterns or outliers, often involve imbalanced datasets
since anomalies are expected to be rare.

4. Internet applications (e-commerce): A million people might visit the website, but only 100 end up placing an order.

There are quite a few ways to handle imbalanced data in machine learning classification problems. In this article, we are going to get into the details of the following six techniques that are commonly used to handle imbalanced data in classification.

1. Random under-sampling

2. Random over-sampling

3. Synthetic over-sampling: SMOTE

4. Choose an appropriate evaluation metric

5. Choose the algorithm wisely

6. Cost-sensitive learning

1. Resampling — Under-sampling


Under-sampling

Under-sampling is the process of randomly deleting observations from the majority class so that its count matches the minority class. This technique is very simple and reduces model complexity, runtime and storage requirements, but it comes with its own disadvantages. First, we are actually throwing away data, and the discarded observations may contain useful information. Second, the reduced training set might not be a correct representation of the population, so the model might not generalize well on the test dataset.
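As a rough sketch (not from the original article), random under-sampling needs nothing beyond the standard library. The helper name `random_undersample` and the toy dataset below are mine, chosen to mirror the bank example with 80 fraud cases in 10,000 transactions:

```python
import random
from collections import defaultdict

def random_undersample(rows, label_of, seed=0):
    """Randomly drop majority-class rows so every class ends up with as
    many rows as the rarest class. `label_of` extracts a row's label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n_min))  # keep n_min per class
    rng.shuffle(balanced)
    return balanced

# Toy dataset: 9,920 'no_fraud' rows and 80 'fraud' rows, as in the example.
data = [("x", "no_fraud")] * 9920 + [("x", "fraud")] * 80
balanced = random_undersample(data, label_of=lambda r: r[1])
```

After resampling, the training set shrinks to 160 rows (80 per class), which illustrates the information-loss trade-off described above.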

2. Resampling — Over-sampling

Over-sampling is very similar to under-sampling, except that instead of reducing the samples of the majority class we increase the samples of the minority class to match the majority class. This technique is more suitable when there is little training data and we cannot afford under-sampling.

In this technique, we increase the instances of the minority class by randomly replicating the samples already present. Because there is no information loss, this technique often outperforms under-sampling in practice.

Over-sampling

The only catch is that replicating the minority class creates many duplicate data points, which can increase the chances of overfitting.
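A minimal sketch of random over-sampling, mirroring the under-sampling helper above (the function name and toy data are mine, not from the article):

```python
import random
from collections import defaultdict

def random_oversample(rows, label_of, seed=0):
    """Randomly replicate minority-class rows (sampling with replacement)
    until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    n_max = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)                        # keep all originals
        extra = n_max - len(members)
        balanced.extend(rng.choices(members, k=extra))  # add duplicates
    rng.shuffle(balanced)
    return balanced

data = [("x", "no_fraud")] * 9920 + [("x", "fraud")] * 80
balanced = random_oversample(data, label_of=lambda r: r[1])
```

Here the 80 fraud rows are replicated until both classes have 9,920 rows; the duplicates are exactly what can lead to overfitting.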


3. Synthetic Minority Over-sampling — SMOTE

Synthetic Minority Over-sampling Technique, abbreviated as SMOTE, is a clever way to over-sample the minority class while avoiding the overfitting that random over-sampling suffers from. In SMOTE, a subset of the minority class is taken and new synthetic data points are generated from it. These synthetic data points are then added to the original training dataset as additional examples of the minority class.

The SMOTE algorithm works as follows:

1. Select a sample, let's call it O (for Origin), from the minority class at random

2. Find the k nearest neighbors of O that belong to the same class

3. Connect O to each of these neighbors with a straight line

4. For each connection, select a scaling factor 'z' in the range [0, 1] at random

5. Place a new point on the line, (z*100)% of the way from O to the neighbor. These are our synthetic samples.

6. Repeat this process until you have the desired number of synthetic samples

SMOTE overcomes the overfitting problem of random over-sampling because no data points are replicated, and unlike under-sampling, no data is lost. It has a few limitations of its own, however: while creating synthetic examples of the minority class, it can increase the class overlap and introduce additional noise into the training dataset.
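The steps above can be sketched in plain Python. This is a simplified illustration, not the reference SMOTE implementation (for real work, a library such as imbalanced-learn is the usual choice); the helper name and the toy minority points are mine:

```python
import math
import random

def smote(minority, k=3, n_new=100, seed=0):
    """Simplified SMOTE sketch: generate n_new synthetic points from
    minority-class samples, each given as a tuple of floats."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        o = rng.choice(minority)                # step 1: pick an origin O
        # step 2: k nearest neighbours of O within the minority class
        neighbours = sorted((p for p in minority if p is not o),
                            key=lambda p: math.dist(o, p))[:k]
        nb = rng.choice(neighbours)             # step 3: pick one connection
        z = rng.random()                        # step 4: scaling factor in [0, 1]
        # step 5: new point z of the way along the line from O to nb
        synthetic.append(tuple(a + z * (b - a) for a, b in zip(o, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, k=2, n_new=5)
```

Because every synthetic point lies on a line segment between two real minority samples, it always falls inside the region the minority class already occupies.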

4. Choose an appropriate evaluation metric:

The accuracy of a classifier is the number of correctly classified observations divided by the total number of observations. This metric is good enough for well-balanced classes but performs poorly on imbalanced datasets, because even a 'dumb' model that classifies every test data point as the majority class will still score highly. For example, if my test set has 300 data points, 270 negative and 30 positive, and my dumb model classifies every test data point as negative, my accuracy is going to be 90%.

Hence accuracy is not a good measure for imbalanced datasets. In such cases we need other performance metrics, such as the F1 score, which are more appropriate for imbalanced data.

Precision is the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value and is a measure of a classifier's exactness; low precision indicates a high number of false positives. Recall is the number of true positives divided by the number of positive values in the test data. Recall is also called Sensitivity or the True Positive Rate and is a measure of a classifier's completeness; low recall indicates a high number of false negatives.


The F1 score is the harmonic mean of precision and recall, so it keeps the balance between them: it is high only when both precision and recall are high.
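These definitions are easy to verify on the 'dumb' model example above. The helper below is a hand-rolled sketch (the function name is mine); its formulas follow the standard definitions of precision, recall and F1:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class, computed from
    parallel lists of true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# The 'dumb' model from the text: 270 negatives, 30 positives, predict all 0.
y_true = [0] * 270 + [1] * 30
y_pred = [0] * 300
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

The dumb model's accuracy is 90%, yet its precision, recall and F1 for the positive class are all 0, which is exactly why these metrics expose what accuracy hides.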

5. Choosing the algorithm wisely:

While in every machine learning problem it is a good rule of thumb to try a variety of algorithms, it is especially beneficial with imbalanced datasets: there is no one size fits all.


Generally, decision-tree-based algorithms perform well on imbalanced datasets, and in modern machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform single decision trees. Tree-based algorithms work by learning a hierarchy of if/else questions, which can force both classes to be addressed.

6. Cost-sensitive learning:



Cost-sensitive learning is an approach in machine learning where the algorithm is trained with a modified cost function that assigns different costs to different types of errors. The goal is to make the model more sensitive to the minority class and reduce the impact of misclassifying its instances.

In a typical binary classification scenario there are four possible outcomes: true positive, true negative, false positive and false negative. Cost-sensitive learning assigns different costs to these outcomes, and the costs can be adjusted based on the specific requirements of the problem. The modified cost function is then used during the training phase to guide the model towards minimizing the total cost.

The two common strategies used in cost-sensitive learning:

1. Adjusting misclassification costs: Assign higher costs to misclassifying instances from the minority class (e.g., false negatives). This encourages the model to be more cautious and attentive to the minority class.

2. Class weights: Many machine learning algorithms allow you to assign weights to different classes. By giving a higher weight to the minority class, the algorithm places more emphasis on correctly classifying instances from that class.
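Both strategies can be sketched in a few lines. The 10:1 cost ratio below is an illustrative assumption, not a recommendation, and the weight formula follows the common 'balanced' convention (n_samples / (n_classes * class_count), as used by libraries such as scikit-learn):

```python
def total_cost(y_true, y_pred, cost_fn=10.0, cost_fp=1.0):
    """Total misclassification cost with asymmetric penalties: here a
    missed fraud (false negative) is assumed to cost 10x a false alarm."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return cost_fn * fn + cost_fp * fp

def balanced_class_weights(labels):
    """'Balanced' class weights, n_samples / (n_classes * class_count),
    so rarer classes get proportionally larger weights."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return {y: n / (len(counts) * c) for y, c in counts.items()}

labels = [0] * 9920 + [1] * 80   # the 0.8% fraud example from the text
weights = balanced_class_weights(labels)
# weights[1] is far larger than weights[0], so errors on the minority
# class dominate a weighted loss during training.
```

With this weighting, each fraud example counts roughly 124 times as much as a non-fraud example, which is precisely the 'more emphasis on the minority class' described above.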

Conclusion:

In this blog we have looked at six different ways of handling imbalanced datasets. There is no one size that fits all when working with imbalanced data: each approach has its own advantages and disadvantages, and depending on the problem we are solving we can choose the appropriate one.
